Dissertation Talk: Ray: A Distributed Execution Engine for the Machine Learning Ecosystem

Seminar: Dissertation Talk: CS | May 13 | 2-3 p.m. | 380 Soda Hall

 Philipp Moritz, UC Berkeley

 Electrical Engineering and Computer Sciences (EECS)

In recent years, growing data volumes and increasingly sophisticated computational methods have greatly increased the demand for computational power. Machine learning and artificial intelligence applications, for example, are notorious for their computational requirements. At the same time, Moore's law is ending and processor speeds are stalling. As a result, distributed computing and the cloud have become ubiquitous. While the cloud makes distributed hardware infrastructure easily accessible, writing distributed algorithms and applications remains surprisingly hard. This is due to the inherent complexity of concurrent algorithms, the engineering challenges of communicating among many machines, and new requirements, such as fault tolerance, that arise at scale.

In this thesis, we study the requirements of a general-purpose distributed computation model and present a solution that is easy to use yet powerful and can recover transparently from faults. At its core, the model unifies stateless tasks and stateful actors. It not only supports many machine learning workloads, such as data processing, training, and serving, but is also a good fit for cross-cutting machine learning applications like reinforcement learning, and it supports broader distributed applications such as streaming and graph processing. We implement this computational model in an open-source system called Ray, which matches or exceeds the performance of specialized systems in many application domains while remaining horizontally scalable and offering strong fault-tolerance properties.
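To make the unified model concrete, the sketch below shows the task/actor style of API that Ray exposes in Python (a minimal example; the names "square" and "Counter" are illustrative, while ray.init, @ray.remote, .remote(), and ray.get are part of Ray's public API):

    import ray

    ray.init()

    # Stateless task: a plain function turned into a remote task.
    @ray.remote
    def square(x):
        return x * x

    # Stateful actor: a class instance that lives on a worker
    # and keeps its state across method calls.
    @ray.remote
    class Counter:
        def __init__(self):
            self.value = 0

        def increment(self):
            self.value += 1
            return self.value

    # Task invocations return futures immediately and run in parallel.
    futures = [square.remote(i) for i in range(4)]
    print(ray.get(futures))  # [0, 1, 4, 9]

    # Actor method calls also return futures, but execute serially
    # against the actor's state.
    counter = Counter.remote()
    refs = [counter.increment.remote() for _ in range(3)]
    print(ray.get(refs))  # [1, 2, 3]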

Contact: pcm@berkeley.edu, 510-423-2119