September 26th - Day 1


Registration & Coffee

Event Inauguration
Seif Haridi
Sharing Knowledge without Sharing Data: Secure Multiparty Computation on Big Data

Azer Bestavros: Boston University, USA

A common misconception that dominates our society is the belief that data sharing is a prerequisite to knowledge sharing. This is often translated into a false choice between data utility and data privacy, resulting in handicapping our ability to fully leverage data or else sacrificing privacy and confidentiality. In this talk, I will argue for an alternative tack that allows knowledge extraction from one or more data sets, which remain otherwise private, using a cryptographic technique called secure multi-party computation (MPC). I will begin by introducing MPC and by sharing my experience in deploying this technology for a study of the gender pay gap in Boston based on actual, confidential payroll data from hundreds of employers that cover tens of thousands of employees. Next, I will provide an overview of projects we have pursued to enable the use of MPC in various settings and under different assumptions, including incorporation of optimized MPC implementations of common algorithms into software libraries and the integration of MPC into existing big-data workflow management in a cloud setting.
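
At the core of such MPC aggregates is additive secret sharing: each party splits its private value into random shares that reveal nothing individually but reconstruct the total when combined. A minimal Python sketch of this building block (illustrative only, not the deployed system):

```python
import random

PRIME = 2**61 - 1  # field modulus (illustrative choice)

def share(value, n_parties):
    """Split a value into n additive shares that sum to value mod PRIME."""
    shares = [random.randrange(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares

def mpc_sum(all_shares):
    """Each party sums the shares it holds locally; combining the partial
    sums reveals only the total, never any individual input."""
    partials = [sum(col) % PRIME for col in zip(*all_shares)]
    return sum(partials) % PRIME

salaries = [52000, 61000, 48000]          # private inputs, one per employer
shared = [share(s, 3) for s in salaries]  # each employer distributes shares
total = mpc_sum(shared)
```

No single share leaks its input: each is uniformly random, and only the sum of all shares carries information.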

Dataset Marketplace Support for Deep Learning

Alex Ormenisan: Logical Clocks & KTH, Sweden

Abstract: Large datasets, ranging from tens of gigabytes to terabytes, are at the core of deep learning. Transferring these datasets from one platform to another can be a challenging task. There is also a need to reproduce or re-run experiments. As such, it is important to track the link between computation and datasets, as well as dataset evolution.
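
One common way to pin down the computation-to-dataset link is to give every dataset version a content-derived identifier that experiments can record. A minimal sketch (the function name and id format are illustrative assumptions, not the talk's system):

```python
import hashlib

def dataset_version(records):
    """Content hash of a dataset: the same records always yield the same
    version id, so an experiment can record exactly which data version
    it ran against and later reproduce the run."""
    h = hashlib.sha256()
    for record in sorted(records):          # order-independent
        h.update(record.encode("utf-8"))
    return h.hexdigest()[:12]

v1 = dataset_version(["img_001,cat", "img_002,dog"])
v2 = dataset_version(["img_001,cat", "img_002,dog", "img_003,cat"])
```

Any change to the records yields a new version id, so dataset evolution becomes an explicit chain of ids.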



Hopsworks: The Cloud-Agnostic Platform for Scale-Out Data Science

Mahmoud Ismail, Logical Clocks & KTH, Sweden

Abstract: Software-defined, container-based, immutable software platforms are currently the dominant approach for both development and production machine learning pipelines. However, this approach adds software development complexity and externalizes responsibility for stateful services, such as security and data management, to the parent cloud platform, leading to cloud vendor lock-in. In this presentation, we introduce Hopsworks, a cloud-agnostic platform for scale-out data science, based on Hops Hadoop. Hopsworks builds a comprehensive security model based on certificates by extending the metadata for Hops' HDFS-compatible distributed filesystem, HopsFS, to introduce a new project abstraction that provides dynamic role-based access control to data with no runtime overhead. Projects enable sandboxed access to sensitive data in a shared platform. Projects also have persistent state in the cluster, with their own conda environment replicated at all hosts in the cluster, reducing application startup time. Hopsworks also provides support for the scheduling of distributed TensorFlow/Keras/PyTorch applications on GPUs. Hopsworks is open-source and we present extensive production statistics to demonstrate its properties.
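
The project abstraction can be pictured as a membership check scoped to a dataset subtree. A toy sketch of the idea (illustrative only, not Hopsworks' actual code or API):

```python
class Project:
    """Sandbox over a dataset subtree with role-based membership
    (a sketch of the project abstraction, not Hopsworks itself)."""
    def __init__(self, name, members):
        self.name = name
        self.members = members                 # user -> role, e.g. "owner"
        self.root = f"/projects/{name}"

    def can_access(self, user, path):
        # dynamic role-based check: access requires membership in the
        # project that owns the path
        return user in self.members and path.startswith(self.root + "/")

p = Project("genomics", {"alice": "owner", "bob": "member"})
```

Because the check is a metadata lookup at path resolution time, a design like this adds no per-read runtime overhead on the data itself.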

Automerge: Replicated data structures for peer-to-peer collaboration

Martin Kleppmann: University of Cambridge, UK

Abstract: This talk introduces Automerge, a JavaScript library for data synchronisation between mobile devices such as laptop computers and smartphones. It allows users to read and modify data even while their device is offline, and it automatically merges changes made concurrently on different devices. Unlike most existing data synchronisation systems, Automerge does not require data to be sent via a centralised server, but rather allows local and peer-to-peer networks to be used. We show how this project spans the gamut from the theory of Conflict-free Replicated Data Types (CRDTs) and formal verification, all the way to implementing collaborative applications that use these data structures.
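
A flavour of how such merges can be automatic and order-independent: the sketch below merges two documents using per-key last-writer-wins registers, one of the simplest CRDTs. Automerge's actual algorithm is far richer (it preserves full change history and merges inside lists and nested objects); this only illustrates the convergence property:

```python
def merge_lww(doc_a, doc_b):
    """Merge two key-value documents whose entries are
    (lamport_timestamp, actor_id, value). The higher (timestamp, actor)
    pair wins, making the merge commutative, associative, and
    deterministic regardless of delivery order."""
    merged = dict(doc_a)
    for key, entry in doc_b.items():
        if key not in merged or entry[:2] > merged[key][:2]:
            merged[key] = entry
    return merged

# concurrent offline edits on two devices
laptop = {"title": (3, "laptop", "Trip plan"), "date": (1, "laptop", "May 3")}
phone  = {"title": (2, "phone", "Trip"),      "date": (4, "phone", "May 5")}
merged = merge_lww(laptop, phone)
```

Both devices compute the same merged document no matter which direction the sync happens in, which is exactly what removes the need for a central server.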


Coffee Break

Stad: Stateful Diffusion for Linear Time Community Detection

Amira Soliman: RISE SICS, Sweden

Abstract: In this talk, we present Stad, a stateful diffusion approach for community detection. Stad boosts diffusion with a conductance-based function that acts as a tuning parameter to control the diffusion speed. In contrast to existing diffusion mechanisms, which operate at a global, fixed speed, Stad introduces stateful diffusion that treats every community individually.
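
The role of a diffusion-speed parameter can be seen in a toy random-walk diffusion. Here the speed is a single fixed global value, which is precisely what Stad replaces with a per-community, conductance-based function (illustrative sketch only, not Stad's algorithm):

```python
def diffuse(adj, seed, speed, steps=5):
    """Spread a unit of score from a seed node over an undirected graph.
    `speed` controls how fast mass leaves a node per step; total mass
    is conserved."""
    scores = {n: 0.0 for n in adj}
    scores[seed] = 1.0
    for _ in range(steps):
        nxt = {}
        for n in adj:
            inflow = sum(scores[m] / len(adj[m]) for m in adj[n])
            nxt[n] = (1 - speed) * scores[n] + speed * inflow
        scores = nxt
    return scores

# two triangle communities joined by a single bridge edge (1-3)
adj = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1],
       3: [1, 4, 5], 4: [3, 5], 5: [3, 4]}
scores = diffuse(adj, seed=0, speed=0.5)
```

After a few steps, nodes in the seed's community score well above nodes across the bridge, which is the signal community detection exploits.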

Towards Ontology-Based Event Processing

Riccardo Tommasini: Politecnico di Milano, Italy

Abstract: Data variety and velocity call for real-time decision-making systems that can interpret complex domain models. In this talk, we present the vision of Ontology-Based Event Processing (OBEP), which bridges the gap between ontology-based reasoning and event processing. I will propose a language and an architecture for performing event processing over abstract ontology concepts. OBEP enables practical temporal reasoning, while its high-level ontological definitions reduce the need for knowledge of the underlying data structures in complex domains.
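
The core idea, matching event streams against abstract ontology concepts rather than concrete event types, can be sketched with a toy subclass hierarchy (the concept names and matching rule are illustrative assumptions, not OBEP's language):

```python
# toy ontology: each type maps to its direct superclass
SUBCLASS = {"TrafficJam": "Congestion", "Congestion": "Event",
            "Accident": "Event"}

def is_a(event_type, concept):
    """Ontology reasoning step: walk up the subclass chain."""
    while event_type is not None:
        if event_type == concept:
            return True
        event_type = SUBCLASS.get(event_type)
    return False

def match_sequence(events, pattern):
    """A pattern matches when its concepts occur in order in the stream,
    with the ontology deciding whether a concrete event instantiates
    each abstract concept."""
    i = 0
    for e in events:
        if is_a(e, pattern[i]):
            i += 1
            if i == len(pattern):
                return True
    return False
```

A query written against `Congestion` matches a concrete `TrafficJam` event without the author knowing the low-level event vocabulary.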


Coffee Break

Deterministic Model for Distributed Speculative Stream Processing

Artem Trofimov: JetBrains Research, Saint Petersburg State University, Russia

Abstract: Exactly-once semantics without high latency overhead is still hard to achieve in state-of-the-art stream processing systems. We introduce a model that provides exactly-once semantics through a lightweight optimistic approach for obtaining determinism and idempotence, and we show its feasibility with a prototype.
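
Idempotence is one half of that recipe: if the sink discards replayed records, speculative reprocessing after a failure cannot duplicate output. A minimal sketch of an idempotent sink (a standard building block, not the authors' actual design):

```python
class IdempotentSink:
    """Deduplicating sink: delivering the same record id twice has the
    same effect as delivering it once, so optimistic replay is safe."""
    def __init__(self):
        self.seen = set()
        self.output = []

    def deliver(self, record_id, value):
        if record_id in self.seen:
            return False          # replayed speculative record, dropped
        self.seen.add(record_id)
        self.output.append(value)
        return True

sink = IdempotentSink()
sink.deliver("rec-1", 42)
sink.deliver("rec-1", 42)   # speculative replay of the same record
```

Determinism supplies the other half: replayed computation must produce the same record ids, so the dedup check actually catches duplicates.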

Proof of Trust Blockchains - Can trust save energy?

Leila Bahri: KTH EECS, Sweden

Abstract: Blockchains are attracting the attention of many technical, financial, and industrial parties, as a promising infrastructure for achieving secure peer-to-peer (P2P) transactional systems. At the heart of blockchains is proof-of-work (PoW), a trustless leader election mechanism based on demonstration of computational power. PoW provides secure consensus in trustless P2P environments, but comes at the expense of wasting huge amounts of expensive energy. We question this energy expenditure in use cases where some form of trust exists between the peers and design a Proof-of-Trust (PoT) protocol that evaluates peers' trust in the network and uses it as a waiver for the energy that would otherwise have to be spent.
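
One way to picture a trust-based waiver is to relax the hash target in proportion to a peer's trust score, so trusted peers need fewer hash attempts. The formula below is purely illustrative, not the paper's valuation function:

```python
import hashlib

def target_with_trust(base_target, trust):
    """Trust in [0, 1] waives part of the work: a fully trusted peer
    faces a (here, up to 10x) easier target than an untrusted one.
    Illustrative scaling, not the PoT protocol's actual rule."""
    assert 0.0 <= trust <= 1.0
    return int(base_target * (1 + 9 * trust))

def attempt(peer_id, nonce, target):
    """One PoW-style trial: hash must fall below the (trust-adjusted) target."""
    digest = hashlib.sha256(f"{peer_id}:{nonce}".encode()).digest()
    return int.from_bytes(digest, "big") < target
```

Expected energy spent is inversely proportional to the target, so a larger target for trusted peers translates directly into fewer hashes and less energy.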

September 27th - Day 2


Coffee & Morning Discussions

Scaling Deep Learning on Multi-GPU Servers

Peter Pietzuch: Imperial College London, UK

Abstract: With the widespread availability of GPU servers, scalability in terms of the number of GPUs when training deep learning models becomes a paramount concern. For many deep learning models, there is a scalability challenge: to keep multiple GPUs fully utilised, the batch size must be sufficiently large, but a large batch size slows down model convergence due to the less frequent model updates. In this talk, I describe CrossBow, a new single-server multi-GPU deep learning system that avoids the above trade-off. CrossBow trains multiple model replicas concurrently on each GPU, thereby avoiding under-utilisation of GPUs even when the preferred batch size is small. For this, CrossBow (i) decides on an appropriate number of model replicas per GPU and (ii) employs an efficient and scalable synchronisation scheme within and across GPUs.
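
The synchronisation idea can be sketched as model averaging: each replica takes a small corrective step toward the central average model, which keeps replicas consistent while leaving the average itself unchanged. A toy version (CrossBow's actual synchronous model averaging scheme is more elaborate):

```python
def sma_step(replicas, correction=0.1):
    """One synchronous model averaging step over a list of model
    replicas (each a flat list of weights). Every replica moves toward
    the central average; the average is preserved exactly."""
    avg = [sum(ws) / len(replicas) for ws in zip(*replicas)]
    corrected = [[w - correction * (w - a) for w, a in zip(rep, avg)]
                 for rep in replicas]
    return corrected, avg

replicas = [[1.0, 2.0], [3.0, 6.0]]   # two model replicas, two weights each
corrected, avg = sma_step(replicas)
```

Because each replica trains on its own small batch and only the corrective step is synchronised, GPUs stay utilised without inflating the effective batch size.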

From Weld to Arc

Lars Kroll: KTH EECS, Sweden (Intro by Paris Carbone)

Abstract: Weld is an intermediate representation (IR) for unifying batch processing systems in different languages and frameworks, thus avoiding materialisation costs for intermediate results. In this talk we will show that with minor alterations a Weld-like IR can also be used for a Stream Processing use case, allowing unification and optimisation across programming languages and models.
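
The kind of cross-operator optimisation such an IR enables can be illustrated by fusing adjacent map stages so the intermediate collection is never materialised. A toy pipeline interpreter (illustrative of the idea, not Weld or the Arc IR itself):

```python
def fuse(ops):
    """Compose adjacent map stages into one, avoiding materialisation
    of the intermediate result between them."""
    fused = []
    for kind, fn in ops:
        if kind == "map" and fused and fused[-1][0] == "map":
            g = fused[-1][1]
            # bind g and fn as defaults to avoid late-binding bugs
            fused[-1] = ("map", lambda x, g=g, f=fn: f(g(x)))
        else:
            fused.append((kind, fn))
    return fused

def run(ops, data):
    """Execute a fused pipeline of map/filter stages."""
    for kind, fn in fuse(ops):
        if kind == "map":
            data = [fn(x) for x in data]
        elif kind == "filter":
            data = [x for x in data if fn(x)]
    return data

pipeline = [("map", lambda x: x + 1),
            ("filter", lambda x: x % 2 == 0),
            ("map", lambda x: x * 10)]
result = run(pipeline, [1, 2, 3, 4])
```

Because fusion happens on the IR, the same optimisation applies regardless of which front-end language authored the pipeline, which is the unification point of the talk.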


Coffee Break

Optimizing Across Relational and Linear Algebra in Parallel Analytics Pipelines

Asterios Katsifodimos: TU Delft, Netherlands

Abstract: My talk will be split in two parts. In the first part I will briefly introduce a deeply embedded language in Scala, which enables authoring scalable programs using two abstract data types, namely DataBag and Matrix, enabling joint optimizations over both relational and linear algebra. In the second part I will discuss a concrete optimization which can be applied in the context of analysis programs comprising both linear and relational algebra operations. More specifically, I will present BlockJoin, a distributed join algorithm presented in this year's VLDB, which emits block-partitioned results to subsequent linear algebra operations such as matrix multiplications. BlockJoin applies database techniques known from columnar processing, such as index-joins and late materialization, in the context of parallel dataflow engines, in order to minimize very expensive shuffling costs.
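
The late-materialisation idea behind BlockJoin can be sketched in two phases: join only the (small) key columns first, then fetch full payloads for matching rows alone. A single-machine illustration (not the distributed, block-partitioned algorithm itself):

```python
def late_materialisation_join(left, right, key):
    """Phase 1: index-join on the key columns only (cheap to shuffle).
    Phase 2: materialise full payloads just for the matching rows."""
    index = {row[key]: i for i, row in enumerate(left)}
    matches = [(index[row[key]], j) for j, row in enumerate(right)
               if row[key] in index]
    return [{**left[i], **right[j]} for i, j in matches]

users = [{"id": 1, "name": "ann"}, {"id": 2, "name": "bo"}]
clicks = [{"id": 2, "page": "/home"}, {"id": 3, "page": "/x"}]
joined = late_materialisation_join(users, clicks, "id")
```

In a distributed setting, shipping only key columns in phase 1 is what lets BlockJoin avoid shuffling the wide payloads that matrix-shaped results would otherwise drag through the network.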

Fast, accurate, automatic scaling decisions for distributed streaming dataflows

Vasiliki Kalavri (Vasia): ETH Zurich, Switzerland

Streaming computations are by nature long-running and their workloads can change in unpredictable ways. Thus, maintaining performance may require dynamic scaling of allocated resources. Modern large-scale stream processors often rely on coarse-grained metrics and thus tend to show incorrect provisioning, oscillations, and long convergence times. In this talk, I will present DS2, an automatic scaling controller for streaming systems which uses lightweight instrumentation to estimate the true processing and output rates of individual dataflow operators. We have applied DS2 on Apache Flink and Timely Dataflow and demonstrate its accuracy and fast convergence. When compared to Dhalion, the state-of-the-art technique in Heron, DS2 converges to the optimal, backpressure-free configuration in a single step instead of six.
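
DS2's key estimate can be sketched as follows: divide records processed by *useful* time only (excluding time spent waiting or backpressured) to get the true per-instance rate, then size parallelism against the target rate. This is a simplification of the per-operator model, with illustrative names:

```python
import math

def optimal_parallelism(records_processed, useful_time, target_rate):
    """True per-instance rate = records processed per unit of useful
    time; the backpressure-free parallelism is the target rate divided
    by that true rate, rounded up."""
    true_rate = records_processed / useful_time
    return math.ceil(target_rate / true_rate)
```

For example, an operator instance that processed 1,000 records in 0.5 s of useful time has a true rate of 2,000 records/s, so sustaining 5,000 records/s needs 3 instances; computing this directly is what lets DS2 converge in one step rather than iterating.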



DRS: Resource Auto-Scaling for Real-Time Stream Analytics

Tom Zhengjia Fu, University of Illinois at Urbana-Champaign, USA

Abstract: We propose DRS, a dynamic resource scaling framework for cloud-based stream data analytics systems. DRS overcomes three fundamental challenges: (i) how to model the relationship between the provisioned resources and the application performance, (ii) where to best place resources, and (iii) how to measure the system load with minimal overhead. In particular, DRS includes an accurate performance model based on the theory of Jackson open queueing networks and is capable of handling arbitrary operator topologies, possibly with loops, splits and joins. Extensive experiments with real data show that DRS is capable of detecting sub-optimal resource allocation and making quick and effective resource adjustment.
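
The queueing-theoretic core can be illustrated for the simplest case: by Jackson's theorem, each node in an open network can be analysed as an independent queue, so expected latencies add up along a tandem pipeline. The sketch below uses single-server M/M/1 nodes; DRS's model additionally handles multi-server nodes, splits, and loops:

```python
def mm1_sojourn(arrival_rate, service_rate):
    """Expected time in an M/M/1 queue (waiting + service): 1/(mu - lambda)."""
    assert arrival_rate < service_rate, "operator is overloaded"
    return 1.0 / (service_rate - arrival_rate)

def pipeline_latency(arrival_rate, service_rates):
    """Expected end-to-end latency of a tandem pipeline of operators,
    treating each as an independent M/M/1 node per Jackson's theorem."""
    return sum(mm1_sojourn(arrival_rate, mu) for mu in service_rates)
```

For instance, a two-operator pipeline with arrival rate 2/s and service rates 4/s and 3/s has expected latency 1/(4-2) + 1/(3-2) = 1.5 s; a controller can compare such predictions across candidate resource allocations.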

Parameter Server on Flink, an approach for model-parallel machine learning

Dániel Berecz, MTA SZTAKI, Hungary

Abstract: In this talk we show a Parameter Server implementation in Flink for model-parallel machine learning. To scale efficiently, some machine learning algorithms not only require the input to be processed in parallel, but to train and store the model in a distributed manner. The Parameter Server provides an abstraction layer for the distributed model, so the implementation of such algorithms is much easier. We present how our Parameter Server can be used for model-parallel training in Flink through recommendation with matrix factorization. Our implementation is built entirely on top of the Streaming API, so Flink can also be used to preprocess the data and even to serve predictions in a single job.
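
The abstraction can be pictured as a shared key-value store with pull and push operations: workers fetch only the model slice they need and send back gradients. A toy single-process sketch of the general pattern (not the talk's Flink implementation):

```python
class ParameterServer:
    """Minimal parameter server: holds the distributed model's
    parameters and applies pushed gradient updates."""
    def __init__(self, initial):
        self.params = dict(initial)

    def pull(self, keys):
        # a worker fetches only the slice of the model it needs
        return {k: self.params[k] for k in keys}

    def push(self, grads, lr=0.1):
        # a worker sends gradients; the server applies the SGD update
        for k, g in grads.items():
            self.params[k] -= lr * g

ps = ParameterServer({"w0": 1.0, "w1": -2.0})
ps.push({"w0": 5.0})          # gradient from one worker
```

In matrix factorization, for example, each worker pulls only the item factors referenced by its partition of the ratings, which is what makes model-parallel training feasible for models too large for one node.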


Coffee Break

Edge Compute at Hyperscale

Ali Shoker: INESC TEC, Portugal

Abstract: Edge computing is an extension of cloud computing that brings some of the computation and storage closer to the user. In addition to reducing the load on cloud datacenters, this brings several benefits, such as reduced response times, resilience to network failures, and privacy. The challenge, however, is how to build edge applications that stay as close as possible to classical cloud applications, i.e., that go beyond solo edge-device computing or sticking to immutable data in a peer-to-peer manner. In this talk, we show how Conflict-free Replicated Data Types (CRDTs) can help build edge applications using replicated (mutable) data in a loosely coupled fashion. We then touch upon a novel hyper-scalable CRDT design whose metadata overhead is logarithmic in the number of edge nodes.
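
The loosely coupled style that CRDTs enable can be shown with a grow-only counter: replicas update mutable state locally and converge under merge, whatever the exchange order or frequency. This is a classic textbook CRDT, not the talk's logarithmic-metadata design:

```python
class GCounter:
    """Grow-only counter CRDT: each replica tracks its own count, and
    merge takes a pointwise maximum, which is commutative, associative,
    and idempotent, so replicas converge without coordination."""
    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = {}

    def increment(self, n=1):
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + n

    def merge(self, other):
        for node, c in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), c)

    def value(self):
        return sum(self.counts.values())

a, b = GCounter("edge-a"), GCounter("edge-b")
a.increment(3)
b.increment(2)
a.merge(b)
b.merge(a)
```

Note that the per-replica metadata here grows linearly with the number of nodes, which is exactly the overhead the hyper-scalable design mentioned above reduces to logarithmic.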

The CASTOR Software Research Center

Benoit Baudry, KTH EECS & CASTOR, Sweden

Abstract: In this talk I introduce the CASTOR research center. The center, created as a collaboration between KTH, SAAB, and Ericsson, focuses on research in the area of software technology to build reliable, secure, distributed, and embedded software systems. I will illustrate the type of technology that we develop, focusing on the latest results in the area of software testing and automatic repair.

Hopsworks Demo

Jim Dowling, Logical Clocks, Sweden