Theoretical Breakthrough in Genome Assembly

Mon, 02.05.2016

One of the oldest problems in bioinformatics is reconstructing the genome of a species from short fragments, such as those produced by high-throughput sequencing. Due to various technical limitations, it is currently impossible to fully reconstruct an entire genome. Instead, state-of-the-art genome assemblers produce long genomic fragments that are "guaranteed" to occur in the genome that generated the data. A major question, dating back more than 20 years, is how to characterize all the information that can be safely assembled in this way.

Research conducted by Alexandru Tomescu (Genome-Scale Algorithmics group, HIIT sub-programme Algorithmic Data Analysis) and Paul Medvedev (Penn State University, USA) has solved this problem by obtaining a mathematical characterization of all these long fragments. As a consequence, the result also provides the first tight upper bound on what can be safely assembled from the input data. We expect these theoretical results to gradually make their way into practical genome assemblers.

This research was published in the paper "Safe and complete contig assembly via omnitigs" (see the Open Access version) and presented at RECOMB 2016, the 20th Annual International Conference on Research in Computational Molecular Biology, one of the most prestigious and selective conferences in bioinformatics.

At the same conference, Ahmed Sobih, Alexandru Tomescu, and Veli Mäkinen from the Genome-Scale Algorithmics group presented a second paper, on finding the bacterial composition of a high-throughput sequencing sample taken from an environment such as the human gut. Their method, "MetaFlow: Metagenomic profiling based on whole-genome coverage analysis with min-cost flows", was shown to be more precise and sensitive than currently popular tools.

Contact person: Alexandru Tomescu

Last updated on 2 May 2016 by Alexandru Tomescu - Page created on 2 May 2016 by Alexandru Tomescu