Dissertation Talk: Designing Performant Systems for Human-Powered Data Analysis

Presentation | May 15 | 3:30-4:30 p.m. | 465H Soda Hall

 Daniel Haas, UC Berkeley

 Electrical Engineering and Computer Sciences (EECS)

Despite dramatic recent progress in automated techniques for computer vision and natural language understanding, human effort, often in the form of crowd workers recruited on marketplaces such as Amazon’s Mechanical Turk, remains a necessary part of data analysis workflows. However, embedding manual steps in automated workflows comes with a performance cost, since humans seldom process data at the speed of computers. To iterate rapidly between hypotheses and evidence, data analysts need tools that can provide human processing at close to machine latencies.

In this talk, I will discuss the design, theory, and implementation of performant crowd-powered systems. I will begin by discussing the sources of latency in crowd systems and describing CLAMShell, a system that accurately labels terabyte-scale datasets in one to two minutes, along with its evaluation on over a thousand workers processing nearly a quarter million tasks. Next, I will explore the theory of identifying fast individuals in an unknown population of workers, which can be modeled as an instance of the infinite-armed bandit problem. The analysis yields novel near-optimal algorithms with applications to broader statistical theory. Finally, I will consider the design of multi-tenant crowd systems that run many heterogeneous applications at once. I will describe Cioppino, a system designed to improve throughput and reduce cost in this setting while taking worker preferences into account. Together, these components enable human computation systems that are cost-efficient, scalable, and fast enough to integrate into existing data analysis workflows without compromising performance.

 dhaas@cs.berkeley.edu, 408-476-3072