T3: Production Pipelines for Virus Sequencing Data
Niko Beerenwinkel (ETH Zurich, Switzerland)
Tobias Marschall (Saarland University and Max Planck Institute for Informatics, Germany)
Virus infections are a major burden to human health. The biology of viruses, treatment and vaccination have hence been subject to intense research worldwide. With the advent of next-generation sequencing (NGS) technologies over the past decade, a powerful tool has emerged that is currently transforming virus research. This tutorial provides the bioinformatics expertise needed to turn the promises that NGS technologies hold for fundamental research and clinical diagnostics into reality.
In the course of an infection, viruses can mutate and recombine. This phenomenon is particularly pronounced for RNA viruses such as HIV. This means that a single patient is in fact infected by a virus population with considerable genetic diversity. Being able to characterize this virus population, known as a viral quasispecies, is of utmost importance for treatment stratification as well as for studying virus evolution. Detecting all single-nucleotide variants and their allele frequencies is one way of doing this. However, the most complete characterization possible consists in reconstructing all viral haplotypes and their relative abundances from a patient sample.
Data analysis pipelines to achieve this goal often contain many steps, including read quality control, preprocessing such as adapter trimming, read mapping, local re-alignment, variant calling, allele frequency estimation, and haplotype reconstruction. For each of these steps, many different approaches have been proposed in the bioinformatics literature and come with software implementations of different qualities, ranging from proof-of-principle implementations to mature and maintained packages. Choosing the right combination of tools and assembling them into a robust pipeline suitable for production use is a difficult and time-consuming task.
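As a rough illustration of the allele frequency estimation step named above, the following minimal sketch counts non-reference bases per position in a toy set of reads. It assumes the reads are already aligned to the reference and padded to equal length, with `-` marking uncovered positions. The function name `allele_frequencies` and the `min_freq` parameter are hypothetical, and the sketch ignores sequencing errors, which production variant callers model explicitly.

```python
from collections import Counter


def allele_frequencies(aligned_reads, reference, min_freq=0.0):
    """Return {position: {base: frequency}} for non-reference bases.

    Assumes each read is a string aligned to `reference` position by
    position; any character outside A, C, G, T counts as "not covered".
    """
    result = {}
    for pos, ref_base in enumerate(reference):
        # Count the bases observed at this position across all reads.
        counts = Counter(
            read[pos]
            for read in aligned_reads
            if pos < len(read) and read[pos] in "ACGT"
        )
        depth = sum(counts.values())
        if depth == 0:
            continue  # no read covers this position
        # Keep only bases that differ from the reference.
        variants = {
            base: n / depth
            for base, n in counts.items()
            if base != ref_base and n / depth >= min_freq
        }
        if variants:
            result[pos] = variants
    return result


reference = "ACGTAC"
reads = ["ACGTAC", "ACGTAC", "ACTTAC", "A-GTAC"]
freqs = allele_frequencies(reads, reference)
print(freqs)
```

In this toy input, one of the four reads covering position 2 (0-based) carries a T instead of the reference G, so the sketch reports a frequency of 0.25 for that variant.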
To alleviate this task, we have developed V-pipe, a versatile and robust data analysis pipeline for virus sequencing data. V-pipe’s main goal is to provide extensively tested best practice workflows to turn raw sequencing data of RNA and DNA viruses into finished haplotypes. At the same time, V-pipe is modular and extensible so as to allow experienced users to customize the workflow. In this way, V-pipe also serves as a unified benchmarking platform aiding developers in testing their own tools.
Expected Goals
This tutorial aims to put participants in a position to analyze virus sequencing data in production settings, including genomics research and clinical diagnostics. In particular, participants will be able
- to understand the computational concepts behind the most important processing steps and selected software tools implementing them,
- to perform quality control of input sequencing data,
- to set up a robust workflow to map reads, call variants, and reconstruct haplotypes,
- to use V-pipe as a benchmarking platform, and
- to extend and customize V-pipe.
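To make the quality-control goal in the list above concrete, here is a minimal sketch of one common check: discarding reads whose mean base quality is low. It assumes FASTQ quality strings in the standard Phred+33 encoding. The helper names `mean_phred` and `filter_reads` and the threshold of 20 are hypothetical choices for illustration, not V-pipe's actual settings.

```python
def mean_phred(quality_string, offset=33):
    """Mean Phred score of a FASTQ quality string (Phred+33 by default)."""
    return sum(ord(c) - offset for c in quality_string) / len(quality_string)


def filter_reads(records, min_mean_quality=20):
    """Keep (name, sequence, quality) records with sufficient mean quality.

    Records with an empty quality string are dropped.
    """
    return [
        record
        for record in records
        if record[2] and mean_phred(record[2]) >= min_mean_quality
    ]


records = [
    ("read1", "ACGT", "IIII"),  # 'I' encodes Phred 40
    ("read2", "ACGT", "!!!!"),  # '!' encodes Phred 0
]
kept = filter_reads(records)
print(kept)
```

Only `read1` passes the filter here. Dedicated quality-control tools additionally inspect per-position quality profiles, adapter content, and read length distributions.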
The tutorial will be understandable to anyone with a Bachelor's degree in bioinformatics or equivalent, including biomedical scientists with additional (self-)training in bioinformatics.
Our main target audience is practitioners who analyze virus sequencing data in either basic research or clinical settings. This includes, in particular, staff of bioinformatics core units. Since V-pipe is an open platform with considerable merit for benchmarking, this tutorial is also of interest to developers who want to test their own tools.
For the hands-on sessions, we assume familiarity with a Unix command-line environment and basic file formats like FASTQ, BAM, and VCF. To keep the hands-on sessions running smoothly, we will provide virtual machine images and corresponding instructions ahead of the tutorial. All participants should bring a laptop and make sure they can run the virtual machine on it.
Tuesday, September 12, 2017
Venue: Kollegienhaus, University of Basel
|9:00 – 10:30||Session I - Methodological background I (talks; Niko Beerenwinkel and Tobias Marschall)|
|10:30 – 11:00||Coffee break|
|11:00 – 12:30||Session II - Methodological background II (talks; Niko Beerenwinkel and Tobias Marschall)|
|12:30 – 13:30||Lunch break|
|13:30 – 15:00||Session III - Basic workflows (hands-on; Niko Beerenwinkel, Tobias Marschall, and Susana Posada Cespedes)|
|15:00 – 15:30||Coffee break|
|15:30 – 17:00||Session IV - Advanced workflows (hands-on; Niko Beerenwinkel, Tobias Marschall, and Susana Posada Cespedes)|
Tobias Marschall is an assistant professor at the Center for Bioinformatics at Saarland University and affiliated with the Max Planck Institute for Informatics as a senior researcher. He is heading the Algorithms for Computational Genomics group. His research targets algorithmic and statistical challenges arising from present-day genomics technologies. A particular focus is on population-scale sequencing efforts, on structural variation discovery, on algorithms and data structures for pan-genomes and on haplotyping in various contexts (from humans to viruses).
Niko Beerenwinkel is an associate professor of computational biology at ETH Zurich and a group leader of the Swiss Institute of Bioinformatics. His research is at the interface of mathematics, statistics, and computer science with biology and medicine. His interests range from mathematical foundations of biostatistical models to evolutionary modeling and clinical applications in oncology and infectious diseases. He has developed several computational methods for the analysis of deep sequencing data to infer the genetic composition of virus populations.