The DNA Assembly Problem: Designing algorithms based on Information Limits

Feb 10

Wednesday, February 10, 2016

3:30 pm - 4:30 pm
Gross Hall 330


Ilan Shomorony

Emerging long-read sequencing technologies promise to enable near-perfect reconstruction of whole genomes. Assembly of long reads is usually accomplished using a read-overlap graph, in which the true sequence corresponds to a Hamiltonian path. As such, the assembly problem becomes NP-hard under most formulations, and most of the known algorithmic approaches are heuristic in nature. In this talk, however, we will see that the NP-hardness of the computational problem can be overcome if we focus on instances of the assembly problem that are feasible from an information-theoretic standpoint. Somewhat surprisingly, we arrive at this conclusion by first setting the computational complexity issue aside, and instead seeking algorithms that target information limits. In particular, we begin with a basic feasibility question: when does the set of reads contain enough information to allow unambiguous reconstruction of the true sequence? We show that in most instances of the problem where the reads do contain enough information for assembly, the read-overlap graph can be sparsified, giving rise to an Eulerian graph where the problem can be solved in linear time. We also study the cases where the reads do not contain enough information for assembly, but from a rate-distortion perspective. We propose a new notion of distortion for assembly graphs and describe a practical assembly algorithm whose achieved distortion can be bounded in terms of the repeat complexity of the target genome.