SeqPig: Simple and scalable scripting for large sequencing data sets in Hadoop

Lecturer:
Prof. Keijo Heljanko, HIIT / Aalto University
Event type: 
HIIT seminar
Event time: 
2014-01-20 13:15 to 14:00
Aalto University, Computer Science Building, lecture hall T2

Hadoop MapReduce-based approaches have become increasingly popular because they scale well when processing large sequencing datasets. However, as these methods typically require in-depth expertise in Hadoop and Java, they are still out of reach for many bioinformaticians. To address this problem, we have created SeqPig, a library and a collection of tools for manipulating, analyzing and querying sequencing datasets in a scalable and simple manner. SeqPig scripts use the Hadoop-based distributed scripting engine Apache Pig, which automatically parallelizes and distributes data processing tasks. We demonstrate SeqPig’s scalability over many computing nodes and illustrate its use with example scripts.

This is joint work with André Schumacher, Luca Pireddu, Matti Niemenmaa, Aleksi Kallio, Eija Korpelainen, and Gianluigi Zanetti.

Last updated on 13 Jan 2014 by Antti Ukkonen - Page created on 13 Jan 2014 by Antti Ukkonen