BLISS Seminar: DNA Sequencing: From Information Limits to Assembly Software

Seminar | February 21 | 2-3 p.m. | 400 Cory Hall

 Ilan Shomorony, UC Berkeley

 Electrical Engineering and Computer Sciences (EECS)

Emerging long-read sequencing technologies promise to enable near-perfect reconstruction of whole genomes. Assembly of long reads is usually accomplished using a read-overlap graph, in which the true sequence corresponds to a Hamiltonian path. As such, the assembly problem becomes NP-hard under most formulations, and most of the known algorithmic approaches are heuristic in nature.

In this talk, we show that by focusing on the informational limits of this problem, rather than the computational ones, we can design assembly algorithms with correctness guarantees. We begin with a basic feasibility question: when does the set of reads contain enough information to allow unambiguous reconstruction of the genome? We show that in most instances of the problem where the reads contain enough information for assembly, the read-overlap graph can be sparsified, allowing the problem to be solved in linear time. To study the information-infeasible instances, we formulate the partial assembly problem from a rate-distortion perspective. We introduce a notion of assembly graph distortion, and propose an algorithm that seeks to minimize this quantity. Finally, we describe how these ideas formed the theoretical foundation of our long-read assembly software HINGE. HINGE is shown to outperform existing tools through extensive validation on long-read datasets, and is currently being employed by genomics research groups and companies.