WS4: Beyond Position weight matrices – towards next generation tools for predicting protein-DNA interactions


Bart Deplancke (EPFL, Lausanne), Philipp Bucher (EPFL, Lausanne)

Workshop Summary:

Motivation. Evolution proceeds to a large extent via nucleotide changes in gene regulatory regions. Indeed, it has been shown that the majority of variants in the human genome that have been associated with phenotypic variation affect cis-regulatory elements rather than trans-acting effector proteins. It is therefore broadly assumed that these variants alter the sequence-specific interactions between transcription factors (TFs) and transcription factor binding sites (TFBSs), which in turn results in changes in gene regulation.
Thus, to understand the molecular mechanisms underlying phenotypic variation, we need quantitative models to predict TF-TFBS interactions under varying or genetically distinct conditions. Progress towards such models has long been hampered by the lack of experimental protocols that generate enough data to accurately estimate the parameters of a TF binding site model. This situation has radically changed with the advent of microarray- and next-generation sequencing (NGS)-based high-throughput technologies to study TF-TFBS interactions in vitro and in vivo.
Within a few years, this discipline has changed from a data-sparse field to a data-inflated field. Moreover, gene regulatory bioinformatics tools, which once led a Cinderella-like existence compared to protein-centric bioinformatics tools, are now among the best known and most widely used bioinformatics tools. The current impact of TFBS prediction resources is perhaps best reflected by the sweeping proliferation of sequence logos in high-impact biological papers.
Nevertheless, it is very important to recognize that the tools we are currently using stem largely from the data-sparse era. The most salient example is the widely used position weight matrix (PWM) for predicting TFBSs. This simple model, which assumes independence between the nucleotides within a binding site, probably offers the best trade-off between accuracy and number of parameters to fit in a data-sparse situation, but it is not capable of leveraging the high-throughput technologies and data sets available today. PWMs available from libraries such as TRANSFAC and JASPAR are also modest in scope, in that they serve to predict the relative affinity between a protein monomer and its corresponding target site. However, a large fraction of TFs bind to DNA as homodimers, heterodimers, or multimers, and the same polypeptide can often form complexes with a variety of interaction partners. Thus, next generation TF binding prediction tools will need to take into account that in vivo protein-DNA interactions take place in a complex biochemical environment where different protein monomers compete for the same or different protein interaction partners, and multimeric TFs with overlapping sequence specificity compete for cognate DNA sites.
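To make the independence assumption concrete, the following is a minimal sketch of PWM scoring in Python. The aligned sites, the pseudocount, the uniform background frequency, and the function names (build_pwm, score, best_hit) are all assumptions chosen for illustration; they are not taken from TRANSFAC, JASPAR, or any tool discussed at the workshop.

```python
import math

# Hypothetical aligned binding sites (illustrative only, not real data).
SITES = ["TGACTCA", "TGAGTCA", "TGACTCA", "TGACTAA", "TGAGTCA"]
BASES = "ACGT"


def build_pwm(sites, pseudocount=0.5, background=0.25):
    """Build a log-odds PWM: one dict of base -> log2(frequency / background) per column."""
    total = len(sites) + pseudocount * len(BASES)
    pwm = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in BASES
        })
    return pwm


def score(pwm, seq):
    """Sum per-position log-odds; each position contributes independently of the others."""
    return sum(column[base] for column, base in zip(pwm, seq))


def best_hit(pwm, sequence):
    """Slide the PWM along a sequence and return (score, offset) of the best window."""
    width = len(pwm)
    windows = [
        (score(pwm, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    ]
    return max(windows)


if __name__ == "__main__":
    pwm = build_pwm(SITES)
    print(score(pwm, "TGACTCA"), score(pwm, "TGAGTCA"))
    print(best_hit(pwm, "CCCCTGACTCAGGG"))
```

Because the score is a plain sum over columns, changing one base changes the score by the same amount regardless of the neighbouring bases. Dependencies between positions, or cooperative binding of dimers, cannot be represented in this form.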

Purpose of the workshop. The main goal is to bring together researchers from different fields who are studying protein-DNA interactions, including experimental genomics-oriented researchers as well as developers and users of bioinformatics tools for analyzing gene regulatory regions. As such, this workshop will represent an exciting opportunity to initiate discussions that may continue during the BC2 conference, aiming to reach a common view and better understanding of (i) the major outstanding challenges in the field, (ii) the pros and cons of different experimental technologies and computational approaches to study protein-DNA interactions, and (iii) the most urgently required bioinformatics resources that still need to be developed.

Specific issues to be discussed:

  • Next generation TF-DNA binding models: Which extensions over simple PWMs offer the best trade-off between accuracy and practicality?
  • Next-generation technologies such as PBMs (protein-binding microarrays) or high-throughput (HT)-SELEX: Which type of approach is more cost-effective or extensible to complex settings?
  • Inference of TF binding site model parameters. What kind of algorithms should be used? How best to deal with technology-related artifacts and biases?
  • Beyond predicting binary protein-DNA interactions: How to model protein-DNA interactions in complex in vitro or in vivo environments containing many protein species? How to integrate protein-protein interaction parameters into a TF binding site model?
  • Prediction of multimeric protein-DNA complexes.
  • Interpretation of ChIP-seq data: To what extent does a TF binding site model explain DNA-protein interactions revealed by a ChIP-seq experiment? Appropriate statistics are needed to answer this and related questions quantitatively.
  • Current TF binding site resources: PWMs of current TF binding site resources are often blindly trusted by end users. Users need to be aware that these models come with an experimental error. How can we identify and eliminate inaccurate PWMs? Could we provide accuracy measures for next generation TF binding site resources?

Workshop speakers:

Riccardo Dainese (EPFL, Lausanne)

Ron Shamir (Tel Aviv University)

Edgar Wingender (University of Göttingen)

Jacques van Helden (AMU, Marseille)

Thomas Sakoparnig (Biozentrum, Basel)


Workshop Agenda

14:00 to 14:05 : Bart Deplancke & Philipp Bucher


14:05 to 14:35 : Jacques van Helden (Aix-Marseille University, France)

What's wrong with Position-Specific Scoring Matrices?

14:35 to 15:05 : Edgar Wingender (University of Göttingen, Germany)

Pitfalls in the automatic generation of binding motifs - the Fos case

15:05 to 15:35 : Ron Shamir (Tel Aviv University, Israel)

Inference of binding site models from HT-SELEX data.

15:35 to 16:00 : Coffee break

16:00 to 16:20 : Riccardo Dainese (EPFL, Switzerland)

MITOMI-seq: a microfluidic protein-DNA interaction screening platform for de novo motif discovery.

16:20 to 16:40 : Thomas Sakoparnig (University of Basel, Switzerland)

ChIP-CRUNCH: Completely automated processing and motif analysis for ChIP-seq data.

16:40 to 17:00 : Anthony Mathelier (University of British Columbia, Canada)

Combining DNA sequence and structure to better predict TFBS.

17:00 to 17:45 : Round table discussion