A combinatorial and integrated method to analyse High Throughput Sequencing reads

Lecturer : 
Eric Rivals
Event type: 
HIIT seminar
Event time: 
2011-01-14 10:15 to 11:00
Exactum B222

Talk announcement:
HIIT Seminar Kumpula, Friday Jan 14, 10:15 a.m., Exactum B222

The first HIIT Seminar Kumpula talk for the Spring term 2011 takes place on
Jan 14. Please note the lecture hall for this presentation (B222).
Forthcoming presentations will take place in C222 as usual.

--Matti Järvisalo


Eric Rivals
CNRS & Université Montpellier 2

A combinatorial and integrated method to analyse High Throughput Sequencing
reads

Next-generation sequencing technologies are presently being used to answer
key biological questions at the scale of the entire genome and with
unprecedented depth. Whether determining genetic or genomic variations,
cataloguing transcripts and assessing their expression levels, finding
recurrent mutations in cancer, identifying DNA-protein interactions or
chromatin modifications, surveying the species diversity in an
environmental sample, all these tasks are now tackled with High Throughput
Sequencing (HTS). For genomics and transcriptomics data sets, the current
paradigm for analysing large read sets consists of
1. mapping the reads contiguously to a reference genome, allowing as many
differences as are expected to be necessary to accommodate sequence errors
and small polymorphisms;
2. using uniquely mapped reads to determine covered genomic regions, either
for computing a local coverage to predict SNPs and filter out sequence
errors (cf. program ERANGE), or for approximately delimiting expressed
exons (with RNA-seq; cf. programs TopHat and GMORSE);
3. re-aligning unmapped reads, i.e. those not mapped contiguously at step
one, to reveal exon boundaries or larger indels.
As shown by the results of approaches following this paradigm, a number of
pitfalls and drawbacks must be dealt with: mapping errors induce false
predictions at later steps, indels larger than 4 bp are not handled, SNPs
cannot be distinguished from sequence errors at the mapping stage, exon
boundaries are located imprecisely, etc.
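
As an illustration only, steps 1 and 2 of the paradigm above can be sketched
as follows. This is a naive scan, not the algorithm used by ERANGE, TopHat or
GMORSE; the function names and the toy genome are assumptions made for the
example, only substitutions are allowed, and the reverse strand is ignored.

```python
def hamming_hits(read, genome, max_mismatches):
    """Return every start position where `read` aligns contiguously to
    `genome` with at most `max_mismatches` substitutions (no indels)."""
    hits = []
    for start in range(len(genome) - len(read) + 1):
        mismatches = 0
        for a, b in zip(read, genome[start:start + len(read)]):
            if a != b:
                mismatches += 1
                if mismatches > max_mismatches:
                    break  # too many differences at this position
        else:
            # Inner loop finished without exceeding the mismatch budget.
            hits.append(start)
    return hits


def coverage_from_unique_reads(reads, genome, max_mismatches=1):
    """Step 1: map each read. Step 2: accumulate per-base coverage from
    uniquely mapped reads only. Reads with no hit are returned so that a
    later step could re-align them."""
    coverage = [0] * len(genome)
    unmapped = []
    for read in reads:
        hits = hamming_hits(read, genome, max_mismatches)
        if len(hits) == 1:
            for pos in range(hits[0], hits[0] + len(read)):
                coverage[pos] += 1
        elif not hits:
            unmapped.append(read)
        # Reads with several hits are ambiguous and are simply skipped here.
    return coverage, unmapped


# Hypothetical toy data: one exact read, one read with a single
# substitution, and one read that maps nowhere.
toy_genome = "ACGTTGCAAGGCTA"
toy_reads = ["GTTGC", "GTAGC", "TTTTT"]
toy_coverage, toy_unmapped = coverage_from_unique_reads(toy_reads, toy_genome)
```

Note that at this stage the substitution in the second read is merely
tolerated: nothing in the mapping step tells whether it is a SNP or a
sequence error, which is one of the drawbacks listed above.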

We, on the other hand, have developed an exact mapper for short reads,
called MPSCAN (Rivals et al. 09), and analysed its performance in detecting
uniquely mapped regions as a function of tag length (Philippe et al. 09). We
showed that one can estimate, depending on the genome length, a substring
length k such that a k-mer will on average point to a single genomic
location. Building on this work, we have conceived a new approach to analyse
today's longer reads (> 50 bp). For every k-mer along a read, we record its
matching genomic positions and its number of occurrences in the reads, and
then analyse these profiles jointly to determine whether the read can be
mapped contiguously, or to detect one of several causes of alignment
disruption: large indels, introns, rearrangements. In this talk, we will
present this procedure and the underlying data structures, and show that it
distinguishes SNPs from sequence errors and combines sensitivity and
specificity in the prediction of exon boundaries, indels, and
rearrangements.
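
To make the idea of per-read k-mer profiles concrete, here is a minimal
sketch. It is not the procedure or the data structures presented in the
talk: the plain dictionary index, the function names, the `min_support`
threshold and the classification rules are all assumptions chosen for
illustration, and the reverse strand is ignored.

```python
from collections import defaultdict


def build_kmer_index(genome, k):
    """Map every k-mer of the genome to the list of its start positions."""
    index = defaultdict(list)
    for pos in range(len(genome) - k + 1):
        index[genome[pos:pos + k]].append(pos)
    return index


def count_kmers_in_reads(reads, k):
    """Count how many times each k-mer occurs across the whole read set."""
    counts = defaultdict(int)
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def read_profile(read, index, counts, k):
    """For each k-mer offset in the read, record (genomic positions,
    number of occurrences in the reads)."""
    profile = []
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        profile.append((index.get(kmer, []), counts.get(kmer, 0)))
    return profile


def classify_read(profile, min_support=2):
    """Toy joint analysis of both profiles (hypothetical rules).
    A 'diagonal' is genomic position minus read offset; it stays constant
    while the read follows the genome contiguously."""
    diagonals = [locs[0] - offset
                 for offset, (locs, _) in enumerate(profile)
                 if len(locs) == 1]
    runs = []  # successive distinct diagonals, in read order
    for d in diagonals:
        if not runs or runs[-1] != d:
            runs.append(d)
    # Support (in the reads) of the k-mers that are absent from the genome.
    missing_support = [support for locs, support in profile if not locs]
    if not runs:
        return ("unmapped", None)
    if len(runs) == 1:
        if not missing_support:
            return ("contiguous", runs[0])
        # Same diagonal on both sides of the disruption: a substitution.
        # Well-supported k-mers suggest a SNP, rare ones a sequence error.
        if min(missing_support) >= min_support:
            return ("SNP candidate", runs[0])
        return ("sequencing error candidate", runs[0])
    if len(runs) == 2:
        shift = runs[1] - runs[0]
        if shift > 0:
            return ("deletion or intron candidate", shift)
        return ("insertion candidate", -shift)
    return ("rearrangement candidate", None)
```

With a toy genome in which every 4-mer is unique, a read that skips six
genomic bases yields two diagonals six positions apart, while a read with a
single substitution keeps one diagonal and loses k consecutive k-mers; the
read-support counts of those lost k-mers are what separates a SNP from a
sequence error in this sketch.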

Work in collaboration with M. Salson (U. Rouen), N. Philippe and T. Commes
(U. Montpellier 2).

Related publications:
  * Using reads to annotate the genome: influence of length, background
distribution, and sequence errors on prediction capacity
  N. Philippe*, A. Boureux*, L. Bréhèlin, J. Tarhio, T. Commes, E. Rivals
  Nucleic Acids Research (NAR) doi:10.1093/nar/gkp492; 2009.
  * MPSCAN: fast localisation of multiple reads in genomes
  E. Rivals, L. Salmela, P. Kiiskinen, P. Kalsi, J. Tarhio
  Proc. 9th Workshop on Algorithms in Bioinformatics
  Lecture Notes in Bioinformatics (LNBI), Springer-Verlag, Vol. 5724,
  pp. 246-260, 2009.

Last updated on 5 Jan 2011 by Matti Järvisalo - Page created on 5 Jan 2011 by Matti Järvisalo