Dissertation Talk: Alluxio: A Virtual Distributed File System
Seminar: Dissertation Talk: CS | May 4 | 12-1 p.m. | 380 Soda Hall
The world is entering the era of the data revolution. Along with the latest advances in the Internet, Artificial Intelligence (AI), mobile devices, autonomous driving, and the Internet of Things (IoT), the amount of data we generate, collect, store, manage, and analyze is growing exponentially. Storing and processing this data presents tremendous challenges and opportunities.
Over the past decade, we have seen significant innovation in the data stack. For example, in the computation layer, the ecosystem started from the MapReduce framework and grew to include many general and specialized systems, such as Apache Spark for general data processing; Apache Storm and Apache Samza for stream processing; Apache Mahout for machine learning; TensorFlow and Caffe for deep learning; and Presto and Apache Drill for SQL workloads. There are more than fifty popular frameworks for various workloads, and the number is growing. Similarly, the storage layer of the ecosystem grew from the Hadoop Distributed File System (HDFS) to a variety of choices, such as file systems, object stores, blob stores, key-value systems, and NoSQL databases, each realizing different tradeoffs in cost, speed, and semantics.
This increasing complexity in the stack creates challenges on several fronts. For system developers, integrating a new compute or storage component as a building block of the existing ecosystem requires more work. For big data application developers, understanding and managing the correct way to access different data stores becomes more complex. For end users, accessing data from various, often remote, data stores frequently results in performance penalties and semantic mismatches. For system administrators, adding, removing, or upgrading a compute framework or data store, or migrating data from one store to another, can be arduous if the physical storage is deeply coupled with all applications.
To address these challenges, this dissertation proposes an architecture that introduces a Virtual Distributed File System (VDFS) as a new layer between the compute layer and the storage layer. Adding a VDFS to the stack brings many benefits. Specifically, a VDFS enables simple data access for different compute frameworks, efficient in-memory data sharing and management across applications, high I/O performance and efficient use of network bandwidth, and a flexible choice of compute and storage. Moreover, because it is the layer through which data is accessed and where data metrics and usage patterns are collected, it gives users insight into their data and can be used to optimize data access based on workloads.
We achieve these goals through an implementation of VDFS called Alluxio (formerly Tachyon). Alluxio presents a set of disparate data stores as a single file system, greatly reducing the complexity of the storage APIs, locations, and semantics exposed to applications. Alluxio is designed with a memory-centric architecture, so applications can take advantage of memory-speed I/O simply by using Alluxio. Alluxio has been deployed in production at hundreds of leading companies, serving critical workloads. Its open source community has attracted more than 800 contributors worldwide.
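The idea of presenting several data stores as one file system, with memory in front of them, can be illustrated with a minimal sketch. This is not Alluxio's API: the names `VirtualFS`, `DictStore`, `UnderStore`, and `mount` are hypothetical, the backing stores are plain dictionaries standing in for systems such as HDFS or an object store, and the cache is an unbounded in-memory map with no eviction.

```python
from typing import Dict, Protocol, Tuple


class UnderStore(Protocol):
    """Hypothetical minimal interface a backing store must offer."""

    def read(self, path: str) -> bytes: ...
    def write(self, path: str, data: bytes) -> None: ...


class DictStore:
    """Stand-in for a real backend; keeps blobs in a dictionary."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.blobs: Dict[str, bytes] = {}
        self.reads = 0  # counts how often the backend is actually hit

    def read(self, path: str) -> bytes:
        self.reads += 1
        return self.blobs[path]

    def write(self, path: str, data: bytes) -> None:
        self.blobs[path] = data


class VirtualFS:
    """One namespace over many stores, with a read-through memory cache."""

    def __init__(self) -> None:
        self._mounts: Dict[str, UnderStore] = {}
        self._cache: Dict[str, bytes] = {}

    def mount(self, prefix: str, store: UnderStore) -> None:
        self._mounts[prefix.rstrip("/")] = store

    def _resolve(self, path: str) -> Tuple[UnderStore, str]:
        # Pick the longest mount prefix that contains the path.
        matches = [p for p in self._mounts
                   if path == p or path.startswith(p + "/")]
        if not matches:
            raise FileNotFoundError(path)
        prefix = max(matches, key=len)
        return self._mounts[prefix], path[len(prefix):] or "/"

    def write(self, path: str, data: bytes) -> None:
        store, relative = self._resolve(path)
        store.write(relative, data)
        self._cache[path] = data

    def read(self, path: str) -> bytes:
        if path in self._cache:          # served from memory
            return self._cache[path]
        store, relative = self._resolve(path)
        data = store.read(relative)      # fetched once from the backend
        self._cache[path] = data
        return data
```

In this sketch an application only ever sees paths such as `/archive/logs/a.txt`; which store holds the bytes, and whether they come from memory or from the backend, is decided inside the layer.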
In this dissertation, we also investigate lineage as an important technique for improving write performance in the VDFS, and we propose DFS-Perf, a scalable framework for evaluating distributed file system performance, to help researchers and developers better design and implement systems in the Alluxio ecosystem.
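As a rough illustration of how lineage can take durable writes off the critical path, the sketch below keeps derived data only in memory and records how it was produced, so a lost copy can be recomputed from its inputs. It is a toy model under stated assumptions, not the dissertation's implementation: `LineageStore` and its methods are hypothetical names, transforms are assumed to be deterministic, and source data is assumed to be stored durably already.

```python
from typing import Callable, Dict, List, Tuple

# A transform turns the contents of its input files into one output.
Transform = Callable[[List[bytes]], bytes]


class LineageStore:
    """Toy store: derived data lives in memory, backed by lineage records."""

    def __init__(self) -> None:
        self._durable: Dict[str, bytes] = {}   # assumed-durable source data
        self._memory: Dict[str, bytes] = {}    # fast but volatile copies
        self._lineage: Dict[str, Tuple[Transform, List[str]]] = {}
        self.recomputations = 0

    def put_source(self, path: str, data: bytes) -> None:
        self._durable[path] = data

    def write_derived(self, path: str, transform: Transform,
                      inputs: List[str]) -> None:
        # The write completes at memory speed; only the recipe is recorded.
        data = transform([self.read(p) for p in inputs])
        self._memory[path] = data
        self._lineage[path] = (transform, list(inputs))

    def evict(self, path: str) -> None:
        """Simulate losing the in-memory copy (for example, a node failure)."""
        self._memory.pop(path, None)

    def read(self, path: str) -> bytes:
        if path in self._memory:
            return self._memory[path]
        if path in self._durable:
            return self._durable[path]
        if path not in self._lineage:
            raise FileNotFoundError(path)
        # Lost derived data: recompute it, recursively, from its inputs.
        transform, inputs = self._lineage[path]
        self.recomputations += 1
        data = transform([self.read(p) for p in inputs])
        self._memory[path] = data
        return data
```

The tradeoff this models is that a write costs no synchronous replication, while recovery costs recomputation time that grows with the length of the lineage chain.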