20.1.2020 @ 14:15 - 16:00
Speaker: Brendan Mumey (2.15pm-3pm)
Title: Inexact Flows for RNA Transcript Assembly and Finding Pangenomic Haplotype Blocks
Abstract: I will talk about a couple of recent problems we’ve been working on. The first is RNA Transcript Assembly Using Inexact Flows: RNA-Seq technology allows for high-throughput, low-cost measurement of gene expression. An important step in this process is the assembly of short mRNA transcript reads into full transcripts. The problem can be viewed as a flow decomposition problem in which the objective is to minimize the number of path flows needed to represent a given flow. In this work, we relax the edge flow constraints to allow for some uncertainty in their measurement. We formulate this as the Inexact Flow Decomposition problem and propose an algorithmic strategy and several heuristics to solve it. The second problem is Extending Maximal Perfect Haplotype Blocks to the Realm of Pangenomics: Recent work provides the first method to measure the relative fitness of genomic variants within a population that scales to large numbers of genomes. A key component of the computation involves finding conserved haplotype blocks, which can be done in linear time. Here, we extend the notion of conserved haplotype blocks to pangenomes, which can store more complex variation than a single reference genome. We define a maximal perfect pangenome haplotype block and give a linear-time, suffix tree-based approach to find all such blocks from a set of pangenome haplotypes. We demonstrate the method by applying it to a pangenome built from yeast strains.
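For readers unfamiliar with flow decomposition, the following is a minimal illustrative sketch and not the method presented in the talk. It assumes the flow lives on a DAG with one source and one sink, uses a simple greedy heuristic (repeatedly follow the heaviest remaining edge), and models measurement uncertainty as a hypothetical per-edge interval [lo, hi]. The function names and the interval representation are assumptions made for this example.

```python
from collections import defaultdict


def greedy_flow_decomposition(edges, source, sink):
    """Greedily split an exact flow on a DAG into weighted source-to-sink paths.

    edges maps (u, v) to a non-negative flow value that satisfies flow
    conservation. This is a heuristic: it does not guarantee the minimum
    number of paths.
    """
    residual = dict(edges)
    paths = []
    while True:
        # Adjacency over edges that still carry flow.
        adj = defaultdict(list)
        for (u, v), flow in residual.items():
            if flow > 0:
                adj[u].append(v)
        if not adj[source]:
            break
        # Walk from source to sink along the heaviest remaining out-edge.
        path = [source]
        node = source
        while node != sink:
            node = max(adj[node], key=lambda v: residual[(node, v)])
            path.append(node)
        # The path carries as much flow as its lightest edge allows.
        weight = min(residual[(a, b)] for a, b in zip(path, path[1:]))
        for a, b in zip(path, path[1:]):
            residual[(a, b)] -= weight
        paths.append((path, weight))
    return paths


def satisfies_intervals(paths, intervals):
    """Check an inexact decomposition: the total path weight on every edge
    must fall inside that edge's [lo, hi] interval (assumed representation)."""
    load = defaultdict(int)
    for path, weight in paths:
        for a, b in zip(path, path[1:]):
            load[(a, b)] += weight
    if any(edge not in intervals for edge in load):
        return False
    return all(lo <= load[edge] <= hi for edge, (lo, hi) in intervals.items())


# Toy example: two parallel routes from s to t.
flow = {("s", "a"): 5, ("a", "t"): 5, ("s", "b"): 3, ("b", "t"): 3}
decomposition = greedy_flow_decomposition(flow, "s", "t")

# Hypothetical uncertainty of +/-1 around each measured edge flow.
intervals = {edge: (value - 1, value + 1) for edge, value in flow.items()}
ok = satisfies_intervals(decomposition, intervals)
```

In the exact setting every edge interval collapses to a single value; widening the intervals enlarges the set of acceptable decompositions, so a solution with fewer paths may become feasible.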
Bio: Brendan Mumey is a professor of computer science at Montana State University. He is a visiting Fulbright scholar at the University of Helsinki from January to June 2020. His research interests are in computational biology and applied algorithms.
Snacks and drinks (3pm-3.30pm)
Speaker: Jarno Alanko (3.30pm-4pm)
Title: Themisto: Practical pseudoalignment for metapangenomics in small space
Abstract: Metagenomics and pangenomics are two important trends in sequence analysis. Metagenomics studies the gene pools of all microbial life found in some environment, while pangenomics is the idea of modeling the genome of a species in a way that encompasses all known variation between different individuals of the species. With the rapid growth of reference sequence databases, it is now becoming feasible to run metagenomic analysis using pangenomic references. This approach is hindered by the fact that the reference data can easily be so large that it does not fit into the RAM of a moderately sized server machine. In this work we present Themisto, a new parallel space-efficient software tool for pseudoaligning reads against a large number of pangenomic references, using a succinct colored de Bruijn graph. The tool exploits similarity between the reference genomes to pack the index into a size that can be significantly smaller than the original reference data. We show that the tool is useful in practice as a part of the mSWEEP metagenomics pipeline.
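As a rough illustration of what pseudoalignment against a colored k-mer index means, here is a toy sketch. It is not Themisto's implementation or API: it uses a plain Python dictionary instead of a succinct colored de Bruijn graph, and it assumes one common definition of pseudoalignment, namely intersecting the color sets (reference identifiers) of the read's k-mers while skipping k-mers absent from the index. The function names are hypothetical.

```python
def build_index(references, k):
    """Map every k-mer to the set of reference names (colors) containing it."""
    index = {}
    for color, seq in references.items():
        for i in range(len(seq) - k + 1):
            index.setdefault(seq[i:i + k], set()).add(color)
    return index


def pseudoalign(read, index, k):
    """Return the references compatible with all indexed k-mers of the read.

    K-mers missing from the index are skipped (an assumption of this sketch).
    If no k-mer of the read is indexed, the result is the empty set.
    """
    result = None
    for i in range(len(read) - k + 1):
        colors = index.get(read[i:i + k])
        if colors is None:
            continue
        result = set(colors) if result is None else result & colors
    return result if result is not None else set()


# Toy example with two short reference sequences.
references = {"A": "ACGTACGT", "B": "ACGTTTGA"}
index = build_index(references, k=4)
hits_specific = pseudoalign("ACGTA", index, k=4)  # k-mer CGTA occurs only in A
hits_shared = pseudoalign("ACGT", index, k=4)     # k-mer ACGT occurs in both
hits_none = pseudoalign("GGGG", index, k=4)       # no indexed k-mer
```

A dictionary like this stores every k-mer explicitly and grows with the reference data; the space savings described in the abstract come from replacing it with a succinct structure that exploits similarity between the references.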
Bio: Jarno N. Alanko is a final-year PhD student at the Department of Computer Science at the University of Helsinki. His research focuses on space-efficient data structures for bioinformatics.