The question driving my work is: How should one deploy statistical data-analysis tools to enhance data-driven systems? Even partial answers to this question may have a large impact on science, government, and industry, each of which is increasingly turning to statistical techniques to extract value from its data.
To understand this question, my group has built or contributed to a diverse set of data-processing systems: GeoDeepDive, a system that reads and answers questions about the geology literature and is used by geologists to gain insight into the Earth's carbon cycle; a muon filter used in the IceCube neutrino telescope to process over 250 million events each day in the hunt for the origins of the universe; and a host of enterprise analytics applications with Oracle and EMC/Greenplum. Even across this diverse set, we have found common
abstractions that enable one to build and maintain such systems in a more cost-effective way. In this talk, I will describe some of these
abstractions along with the theoretical and algorithmic questions that they raise. Finally, I will describe my vision of how and why
classical data management will continue to play an important role in the age of statistical data analysis.
Papers, software, virtual machines with our software preinstalled, links to the applications discussed in this talk, and a list of our collaborators are available at http://www.cs.wisc.edu/hazy. Our YouTube channel (http://www.youtube.com/HazyResearch) contains videos about our projects.