Technical University of Denmark & University of Copenhagen, Denmark
Creating disease trajectories from big biomedical data
Electronic patient records remain a relatively unexplored, but potentially rich data source for discovering correlations between diseases, drugs and genetic information in individual patients. Such data make it possible to compute fine-grained disease co-occurrence statistics, and to link the comorbidities to the treatment history of the patients. A fundamental issue is to resolve whether specific adverse drug reactions (ADRs) stem from variation in the individual genome of a patient, from drug/environment cocktail effects, or both. Here it is essential to perform temporal analysis of the records to identify ADRs directly from the free-text narratives describing patient disease trajectories over time. ADR profiles of approved drugs can then be constructed using drug-ADR networks, or alternatively patients can be stratified by their ADR profiles and compared. Given the availability of longitudinal data covering long periods of time, we can extend the temporal analysis to become more life-course oriented. We describe how an unbiased, national registry covering 6.2 million people from Denmark can be used to construct disease trajectories that describe the relative risk of diseases following one another over time. We show how one can “condense” millions of trajectories into a smaller set reflecting the most frequent and most populated ones. This set of trajectories then represents a temporal diseaseome, as opposed to a static one computed from non-directional comorbidities only.
Using electronic patient records to discover disease correlations and stratify patient cohorts. Roque FS et al. PLoS Comput Biol, 7(8), e1002141, 2011.
Mining electronic health records: towards better research applications and clinical care. Jensen PB, Jensen LJ, Brunak S. Nature Reviews Genetics, 13, 395-405, 2012.
A nondegenerate code of deleterious variants in Mendelian loci contributes to complex disease risk. Blair DR, Lyttle CS, Mortensen JM, Bearden CF, Jensen AB, Khiabanian H, Melamed R, Rabadan R, Bernstam EV, Brunak S, Jensen LJ, Nicolae D, Shah NH, Grossman RL, Cox NJ, White KP, Rzhetsky A. Cell, 155, 70-80, 2013.
Dose-specific adverse drug reaction identification in electronic patient records. Eriksson R, Werge T, Jensen LJ, Brunak S. Drug Safety, 37, 237-47, 2014.
Temporal disease trajectories condensed from population-wide registry data covering 6.2 million patients. Jensen AB, Moseley PL, Oprea TI, Ellesøe SG, Eriksson R, Schmock H, Jensen PB, Jensen LJ, Brunak S. Nature Communications, 5, 4022, 2014.
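The trajectory construction described above rests on directed disease-pair statistics. As a minimal, self-contained sketch of the idea (not the published method, which estimates relative risk against sampled, matched comparison groups), one can count how often diagnosis D1 strictly precedes diagnosis D2 across patients and compare that against an independence expectation. All patient records, codes and times below are invented for illustration:

```python
from collections import defaultdict
from itertools import permutations

# Hypothetical toy records: (patient_id, diagnosis_code, time);
# codes and times are invented, not taken from the Danish registry.
records = [
    (1, "I10", 0), (1, "E11", 2), (1, "N18", 5),
    (2, "I10", 1), (2, "E11", 4),
    (3, "E11", 0), (3, "N18", 3),
    (4, "I10", 2),
]

# First diagnosis time per patient per disease
first = defaultdict(dict)
for pid, code, t in records:
    if code not in first[pid] or t < first[pid][code]:
        first[pid][code] = t

n_patients = len(first)
prevalence = defaultdict(int)   # patients ever diagnosed with D
directed = defaultdict(int)     # patients with D1 strictly before D2
for pid, diags in first.items():
    for d in diags:
        prevalence[d] += 1
    for d1, d2 in permutations(diags, 2):
        if diags[d1] < diags[d2]:
            directed[(d1, d2)] += 1

def relative_risk(d1, d2):
    """Observed D1 -> D2 ordering vs. expectation under independence."""
    observed = directed[(d1, d2)]
    expected = prevalence[d1] * prevalence[d2] / n_patients
    return observed / expected if expected else float("nan")

print(relative_risk("I10", "E11"))
```

Pairs with high directed relative risk can then be chained into longer trajectories and condensed by frequency, which is the step the abstract refers to.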
Howard Hughes Medical Institute, UCLA-DOE Institute, Departments of Biological Chemistry and Chemistry & Biochemistry, UCLA, Los Angeles, California, USA
The Amyloid State of Proteins
Amyloid diseases, including Alzheimer’s, Parkinson’s, and the prion conditions, are each associated with a particular protein in fibrillar form. At the morphological level, these fibers appear similar and are termed “amyloid.” From X-ray and electron diffraction, we found that the adhesive segments of amyloid fibers are short protein sequences which form pairs of interdigitated, in-register beta sheets. These amyloid fibrils were long suspected to be the disease agents, but evidence suggests that in at least some of the neurodegenerative diseases, smaller, often transient and polymorphic oligomers are the toxic entities. In attempts to determine structures for such oligomers, we have discovered segments of amyloid-forming proteins that form toxic, antiparallel beta, out-of-register structures. In one case, the oligomer is a cylindrical barrel, formed from six antiparallel, out-of-register protein strands, which we term a cylindrin. In another case, the oligomer is an open, continuous cylindrin-like structure that we term a corkscrew. Cylindrins offer models for the hitherto elusive structures of amyloid oligomers, and are distinct in structure from amyloid fibrils. From the known structure of the spine of an amyloid fibril, we find it is possible to design an inhibitor of fibril formation. Computational methods are used to identify amyloid-forming segments and to design inhibitors.
Joint work by David Eisenberg, Michael Sawaya, Rebecca Nelson, Alice Soragni, Jose Rodriguez, Lin Jiang, Smriti Sangwon, Lisa Johnson, Arthur Laganowsky, Cong Liu, Angela Soriaga, Meytal Landau, Duilio Cascio, Stuart Sievers, Lorena Saelices-Gomez, Elizabeth Guenther, Michael Hughes.
Centre for Genomic Regulation, Universitat Pompeu Fabra, Barcelona, Spain
The human transcriptome across tissues and individuals
The pilot phase of the Genotype-Tissue Expression (GTEx) project has produced RNA-Seq data from 1,641 samples originating from up to 43 tissues from 175 post-mortem donors, and constitutes a unique resource to investigate the human transcriptome across tissues and individuals. Clustering of samples based on gene expression recapitulates tissue types, separating solid from non-solid tissues, while clustering based on splicing separates neural from non-neural tissues, suggesting that post-transcriptional regulation plays a comparatively important role in the definition of neural tissues. About 47% of the variation in gene expression can be explained by variation across tissues, while only 4% can be explained by variation across individuals. We find that the relative contribution of individual variation is similar for lncRNAs and for protein-coding genes. However, we find that genes that vary with ethnicity are enriched in lncRNAs, whereas genes that vary with age are mostly protein coding. Among genes that vary with gender, we find novel candidates both to participate in and to escape X-inactivation. In addition, by merging information from GWAS we are able to identify specific candidate genes that may explain differences in susceptibility to cardiovascular diseases between males and females and between different ethnic groups. We find that genes whose expression decreases with age are involved in neurodegenerative diseases such as Parkinson’s and Alzheimer’s, and we identify novel candidates that could be involved in these diseases. In contrast to gene expression, splicing varies similarly among tissues and individuals, and exhibits a larger proportion of residual unexplained variance. This may reflect that stochastic, non-functional fluctuations of the relative abundances of splice isoforms are more common than random fluctuations of gene expression.
By comparing the variation of the abundance of individual isoforms across all GTEx samples, we find that a large fraction of this variation between tissues (84%) can be simply explained by variation in bulk gene expression, with splicing variation contributing comparatively little. This strongly suggests that regulation at the primary transcription level is the main driver of tissue specificity. Although blood is the most transcriptionally distinct of the surveyed tissues, RNA levels monitored in blood may retain clinically relevant information that can be used to help assess medical or biological conditions.
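The variance figures quoted above (about 47% across tissues versus 4% across individuals) come from decomposing expression variation by its grouping factors. A minimal one-gene sketch of such a decomposition, using between-group versus total sums of squares on invented log-expression values (the GTEx analysis itself applies more careful modeling across all genes simultaneously):

```python
from collections import defaultdict

# Hypothetical toy data for one gene: (tissue, donor, log_expression).
# All values are invented for illustration.
samples = [
    ("brain", "A", 8.0), ("brain", "B", 8.2), ("brain", "C", 7.9),
    ("liver", "A", 3.1), ("liver", "B", 3.0), ("liver", "C", 3.3),
    ("blood", "A", 5.0), ("blood", "B", 5.2), ("blood", "C", 4.9),
]

def explained_fraction(samples, key_index):
    """Between-group sum of squares divided by total sum of squares."""
    values = [x for *_, x in samples]
    grand = sum(values) / len(values)
    total = sum((x - grand) ** 2 for x in values)
    groups = defaultdict(list)
    for s in samples:
        groups[s[key_index]].append(s[2])
    between = sum(
        len(v) * ((sum(v) / len(v)) - grand) ** 2 for v in groups.values()
    )
    return between / total

print(f"tissue: {explained_fraction(samples, 0):.2f}")
print(f"donor:  {explained_fraction(samples, 1):.2f}")
```

In this toy example, as in the GTEx pilot, grouping by tissue explains nearly all of the variance while grouping by donor explains very little.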
Lawrence Berkeley National Laboratory, CA, USA
On the utility of ontologies for disease diagnosis
Understanding the causality of disease is fundamental to human health research, both for informing diagnosis and for designing potential treatment strategies. In particular, we need to understand how an individual’s genomic constitution, under the influence of various environmental factors, collectively results in different phenotypic outcomes. For the diagnosis of more common diseases, statistical correlation techniques applied to whole-exome or whole-genome sequencing data are relatively effective. However, for rare and novel undiagnosed diseases it is necessary to utilize a diversity of other data sources to aid in determining their etiology. Convenient access to clinically relevant variant-phenotype associations using shared semantics and syntax would enable more effective diagnosis in these cases. This talk will present our experiences and progress in developing such an approach to address this challenge, including a standard conveyance of phenotype and environment information, its computational analysis, and its utility for the diagnosis of rare and newly reported diseases.
Katherine S. Pollard
Gladstone Institutes & UC San Francisco, USA
Massive data integration enables discovery of gene regulatory enhancers and their targets
Despite a wide variety of molecular and clinical data suggesting that disease can be caused by mutations in gene regulatory enhancers and that phenotypic differences between closely related species are frequently driven by changes in developmental gene expression, only a few mutations that alter enhancer function have been identified thus far. Human Accelerated Regions (HARs), the fastest evolving regions of the human genome, are mostly uncharacterized non-coding sequences. Because HAR sequences are highly conserved across mammals, the human-specific mutations in them are excellent candidates for regulatory mutations with functional effects. To test this hypothesis, we developed computational tools to predict gene regulatory enhancers and their targets and to test if mutations in regulatory sequences significantly alter their transcription factor binding site content. We learned that the pattern of proteins bound to chromatin between active enhancers and promoters is a novel and highly predictive signature for enhancer-promoter interactions. Then, we and others used transient transgenic reporter gene assays in mouse embryos to validate over 60 HARs that function as developmental enhancers, several of which show differences in expression patterns between human and chimpanzee sequences. Because this approach to characterizing regulatory mutations is costly and slow, we are now using massively parallel reporter assays (MPRAs) coupled with stem cell techniques to functionally screen all HARs for enhancer activity in developing human and chimp neuronal cells (early initiation, neural progenitor, glial progenitor) and to characterize the effects of nucleotide variants on their enhancer activity. This strategy could be adapted to pinpoint causative regulatory variants in the many disease-associated genomic loci that lack an obvious coding mutation.
Tel Aviv University, Israel
Supple algorithms and data integration for understanding diseases
Disease classification using expression and other "omics" data has been a partial success to date. On one hand, powerful biomarkers were identified. On the other hand, the robustness and reproducibility of the results have been limited. Here we show that integrated analysis of very large-scale data – across studies, experimental platforms and diseases – provides improved results. We collected and manually annotated more than 14,000 gene expression profiles, covering nearly 50 diseases. By developing classifiers that take into account the diversity of non-disease samples when analyzing each disease, we obtained high-accuracy results for half of all the diseases, and identified disease-specific biomarkers that classify better and are more robust. By integrating the results with mutation, protein interaction and drug target data, we create multi-faceted disease summaries, which serve as a starting point for understanding the disease, planning drug repositioning and multi-drug treatments. A parallel challenge is improving the algorithmics for gene expression analyses. We will describe a new algorithm for finding coherent and flexible modules in 3-way data, e.g., measurements of gene expression for a group of patients over a sequence of time points. Our method can identify both core modules that appear in multiple patients and patient-specific augmentations of these core modules that contain additional genes. Our algorithm uses a hierarchical Bayesian data model and Gibbs sampling. We demonstrate its utility and advantage in analysis of gene expression time series following septic shock response, and in analyzing brain fMRI time series of subjects at rest.
Joint work with D. Amar, T. Hait, A. Maron-Katz, D. Yekutieli (Tel Aviv University), S. Izraeli (Sheba Hospital), and T. Hendler (Sourasky Medical Center)
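The core-plus-augmentation module structure described above can be illustrated with a deliberately simplified, non-Bayesian sketch: within each patient's time series, call a gene "in the module" if it correlates with a seed gene; genes correlated in every patient form the core module, and the remainder form patient-specific augmentations. The actual algorithm instead uses a hierarchical Bayesian data model with Gibbs sampling; all gene names and expression values below are invented:

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

# Hypothetical expression time series (one value per time point).
data = {
    "p1": {"seed": [1, 2, 3, 4], "g1": [2, 4, 6, 8],
           "g2": [1, 2, 3, 5], "g3": [5, 1, 4, 2]},
    "p2": {"seed": [2, 4, 6, 8], "g1": [1, 2, 3, 4],
           "g2": [4, 1, 3, 2], "g3": [1, 5, 2, 4]},
}

THRESH = 0.9
# Per patient: genes whose series correlates strongly with the seed gene.
correlated = {
    pid: {g for g in genes
          if g != "seed" and pearson(genes[g], genes["seed"]) > THRESH}
    for pid, genes in data.items()
}

# Core module: genes correlated with the seed in every patient.
core = set.intersection(*correlated.values())
# Patient-specific augmentations: extra correlated genes beyond the core.
augment = {pid: s - core for pid, s in correlated.items()}
```

Here `g1` tracks the seed in both patients and lands in the core, while `g2` tracks it only in patient `p1` and becomes that patient's augmentation.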
University of Zurich and SIB Swiss Institute of Bioinformatics, Switzerland
Protein abundance evolution and its systems-level constraints
Absolute protein quantification data at proteome-wide scale is increasingly becoming available, enabling insights into fundamental cellular biology and serving to constrain experiments and theoretical models. While proteome-wide quantification still has some way to go before it is fully routine, many ground-breaking datasets now exist, based on biophysical and MS techniques. However, the data are quite heterogeneous, and can be difficult to assess in a systematic way. Here, we describe how protein quantification data can be re-processed, standardized and integrated, and we introduce a public web-resource dedicated to this purpose (“PaxDb”; Protein Abundances Across Organisms). Using PaxDb, we explore how protein abundance levels change during evolution, and what might act to constrain such changes. We also explore high-level stoichiometries between cellular processes, and the extent to which environmental and/or energy constraints may drive protein sequence evolution.
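Standardizing heterogeneous quantification datasets typically means converting each one to a common relative scale; PaxDb reports protein abundances in parts per million (ppm) of the whole proteome. A minimal sketch of that kind of normalization, assuming raw spectral counts corrected for protein length (the counts and lengths below are invented, and real pipelines apply additional dataset-specific corrections):

```python
# Hypothetical raw spectral counts and protein lengths.
counts = {"protA": 120, "protB": 30, "protC": 50}
lengths = {"protA": 400, "protB": 200, "protC": 500}

# Length-normalized rates, rescaled so the proteome sums to one million.
rates = {p: counts[p] / lengths[p] for p in counts}
total = sum(rates.values())
ppm = {p: 1e6 * r / total for p, r in rates.items()}

assert abs(sum(ppm.values()) - 1e6) < 1e-3
```

Expressing every dataset on the same ppm scale is what makes abundances comparable across experiments and organisms.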