Dissertation Talk: The Design and Implementation of Low-Latency Prediction Serving Systems
Seminar: Dissertation Talk: CS | May 10 | 9-10 a.m. | 606 Soda Hall
Machine learning is being deployed in a growing number of applications that demand real-time, accurate, and cost-efficient predictions under heavy query load. These applications employ a variety of machine learning frameworks and models, often composing several models within the same application. However, most machine learning frameworks and systems are optimized for model training rather than deployment.
In this talk, I will discuss three prediction serving systems designed to meet the needs of modern interactive machine learning applications. The key idea in this work is a decoupled, layered design that interposes a serving system on top of existing training frameworks to provide low-latency, scalable prediction serving. Velox introduced this decoupled architecture to enable fast online learning and model personalization in response to feedback. Clipper generalized the architecture to be framework-agnostic and introduced a set of optimizations that reduce and bound prediction latency and improve prediction throughput, accuracy, and robustness without modifying the underlying machine learning frameworks. InferLine provisions and executes ML prediction pipelines subject to end-to-end latency constraints by proactively optimizing and reactively controlling per-model configurations in a fine-grained fashion.
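As a rough illustration of the decoupled, layered idea, the minimal Python sketch below places a small serving layer between the application and the models. It is not the API of Velox, Clipper, or InferLine: the `ServingLayer` class, the `PredictFn` interface, and the choice of a prediction cache as the example optimization are all assumptions made for illustration. The point it shows is that an optimization can live entirely in the layer above the models, so the code that produced each model is left unmodified.

```python
from typing import Callable, Dict, Hashable, List, Sequence, Tuple

# Hypothetical common interface: each model, whatever framework trained it,
# is exposed to the serving layer as a function from a batch of inputs to
# a list of predictions of the same length.
PredictFn = Callable[[Sequence[Hashable]], List[float]]


class ServingLayer:
    """Illustrative layer interposed between applications and models."""

    def __init__(self) -> None:
        self._models: Dict[str, PredictFn] = {}
        self._cache: Dict[Tuple[str, Hashable], float] = {}
        self.backend_calls = 0  # number of batches actually sent to a model

    def register(self, name: str, predict_fn: PredictFn) -> None:
        # The layer only needs the predict function, not the framework.
        self._models[name] = predict_fn

    def predict(self, name: str, inputs: Sequence[Hashable]) -> List[float]:
        predict_fn = self._models[name]
        # Deduplicate the inputs and keep only those not already cached.
        misses = [x for x in dict.fromkeys(inputs)
                  if (name, x) not in self._cache]
        if misses:
            # Send all misses to the model as a single batch.
            self.backend_calls += 1
            for x, y in zip(misses, predict_fn(misses)):
                self._cache[(name, x)] = y
        return [self._cache[(name, x)] for x in inputs]


# Stand-in for a model trained in any framework (an assumption for the demo).
def doubling_model(batch: Sequence[Hashable]) -> List[float]:
    return [2.0 * x for x in batch]


layer = ServingLayer()
layer.register("doubler", doubling_model)
first = layer.predict("doubler", [1, 2, 1])   # one batched model call
second = layer.predict("doubler", [1, 2])     # answered from the cache
```

In this sketch, swapping `doubling_model` for a model from a different framework requires no change to `ServingLayer`, which is the sense in which the layer is decoupled from the models it serves.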