T3: Production Pipelines for Virus Sequencing Data


Niko Beerenwinkel (ETH Zurich, Switzerland)

Tobias Marschall (Saarland University and Max Planck Institute for Informatics, Germany)

Tutorial Summary:

Virus infections are a major burden to human health. The biology of viruses, treatment and vaccination have hence been subject to intense research worldwide. With the advent of next-generation sequencing (NGS) technologies over the past decade, a powerful tool has emerged that is currently transforming virus research. This tutorial provides the bioinformatics expertise needed to turn the promises that NGS technologies hold for fundamental research and clinical diagnostics into reality.

In the course of an infection, viruses can mutate and recombine. This phenomenon is particularly pronounced for RNA viruses such as HIV. This means that a single patient is in fact infected by a virus population with considerable genetic diversity. Being able to characterize this virus population, known as a viral quasispecies, is of utmost importance for treatment stratification as well as for studying virus evolution. Detecting all single-nucleotide variants and their allele frequencies is one way of doing this. However, the most complete characterization possible consists in reconstructing all viral haplotypes and their relative abundances from a patient sample.
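
The difference between the two levels of characterization can be illustrated with a small sketch. It assumes an idealized setting that real data does not offer: error-free reads that each span the entire region of interest and are already aligned without gaps. The function names and the toy reads are hypothetical and are not part of any tool mentioned in this tutorial. Under these assumptions, two virus populations can share identical per-position allele frequencies while consisting of entirely different haplotypes.

```python
from collections import Counter


def allele_frequencies(reads):
    """Per-position base frequencies for equal-length, gap-free aligned reads."""
    freqs = []
    for pos in range(len(reads[0])):
        counts = Counter(read[pos] for read in reads)
        total = sum(counts.values())
        freqs.append({base: n / total for base, n in counts.items()})
    return freqs


def haplotype_abundances(reads):
    """Relative abundance of each distinct full-length sequence."""
    counts = Counter(reads)
    return {hap: n / len(reads) for hap, n in counts.items()}


# Toy populations (assumption: every read covers the whole region, no errors).
population_a = ["ACGT", "ACGT", "ATGA", "ATGA"]
population_b = ["ACGA", "ACGA", "ATGT", "ATGT"]

# Both populations show the same 50/50 variants at positions 1 and 3 ...
same_snvs = allele_frequencies(population_a) == allele_frequencies(population_b)

# ... but the variants are linked differently, so the haplotypes differ.
same_haplotypes = (
    haplotype_abundances(population_a) == haplotype_abundances(population_b)
)

print("identical allele frequencies:", same_snvs)
print("identical haplotypes:", same_haplotypes)
```

In practice, reads are much shorter than the genome and contain sequencing errors, so haplotypes cannot simply be counted in this way and have to be inferred by dedicated reconstruction methods.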

Data analysis pipelines to achieve this goal often contain many steps, including read quality control, preprocessing such as adapter trimming, read mapping, local re-alignment, variant calling, allele frequency estimation, and haplotype reconstruction. For each of these steps, many different approaches have been proposed in the bioinformatics literature, and the accompanying software implementations vary in quality, ranging from proof-of-principle implementations to mature and maintained packages. Choosing the right combination of tools and assembling them into a robust pipeline suitable for production use is a difficult and time-consuming task.

To ease this task, we have developed V-pipe, a versatile and robust data analysis pipeline for virus sequencing data. V-pipe’s main goal is to provide extensively tested best-practice workflows to turn raw sequencing data of RNA and DNA viruses into finished haplotypes. At the same time, V-pipe is modular and extensible so as to allow experienced users to customize the workflow. In this way, V-pipe also serves as a unified benchmarking platform aiding developers in testing their own tools.

Expected Goals
This tutorial aims to put participants in a position to analyze virus sequencing data in production settings, including genomics research and clinical diagnostics. In particular, participants will be enabled
  • to understand the computational concepts behind the most important processing steps and selected software tools implementing them,
  • to perform quality control of input sequencing data,
  • to set up a robust workflow to map reads, call variants, and reconstruct haplotypes,
  • to use V-pipe as a benchmarking platform, and
  • to extend and customize V-pipe.

The tutorial will be understandable to anyone with a Bachelor's degree in bioinformatics or equivalent, including biomedical scientists with additional (self-)training in bioinformatics.

Intended Audience
Our main target audience consists of practitioners who analyze virus sequencing data in either basic research or clinical settings. This particularly includes staff of bioinformatics core units. Since V-pipe is an open platform with considerable merit for benchmarking, this tutorial is also of interest to developers who want to test their own tools.

For the hands-on sessions, we assume familiarity with a Unix command-line environment and basic file formats like FASTQ, BAM, and VCF. To allow for a smooth organization, we will provide virtual machine images and corresponding instructions ahead of the tutorial. All participants should bring a laptop and make sure they can run the virtual machine on it.

Tutorial Agenda:

Tuesday, September 12, 2017
Venue: Kollegienhaus, University of Basel

9:00 – 10:30

Session I - Methodological background I (talks; Niko Beerenwinkel and Tobias Marschall)

  • Introduction
  • Data quality control and cleaning, sequencing error characteristics of relevant platforms
  • Read alignment artifacts and common pitfalls
10:30 – 11:00 Coffee break
11:00 – 12:30

Session II - Methodological background II (talks; Niko Beerenwinkel and Tobias Marschall)

  • Principles of variant calling and allele frequency estimation
  • Methods for haplotype reconstruction
  • Design and usage of the V-pipe framework
12:30 – 13:30 Lunch break
13:30 – 15:00

Session III - Basic workflows (hands-on; Niko Beerenwinkel, Tobias Marschall, and Susana Posada Cespedes)

  • “Hello V-pipe” - create your first workflow
  • Analysis of example data from HIV sequencing
  • Result visualization and interpretation
15:00 – 15:30 Coffee break
15:30 – 17:00

Session IV - Advanced workflows (hands-on; Niko Beerenwinkel, Tobias Marschall, and Susana Posada Cespedes)

  • Parameter tuning and customization
  • V-pipe as a benchmarking platform
  • Extending V-pipe with your own tools

Tutorial Speakers:

Tobias Marschall is an assistant professor at the Center for Bioinformatics at Saarland University and affiliated with the Max Planck Institute for Informatics as a senior researcher. He is heading the Algorithms for Computational Genomics group. His research targets algorithmic and statistical challenges arising from present-day genomics technologies. A particular focus is on population-scale sequencing efforts, on structural variation discovery, on algorithms and data structures for pan-genomes and on haplotyping in various contexts (from humans to viruses).

Niko Beerenwinkel is an associate professor of computational biology at ETH Zurich and a group leader of the Swiss Institute of Bioinformatics. His research is at the interface of mathematics, statistics, and computer science with biology and medicine. His interests range from mathematical foundations of biostatistical models to evolutionary modeling and clinical applications in oncology and infectious diseases. He has developed several computational methods for the analysis of deep sequencing data to infer the genetic composition of virus populations.