Abstract
Genetic variants underlying complex traits, including disease susceptibility, are enriched within the transcriptional regulatory elements, promoters and enhancers. There is emerging evidence that regulatory elements associated with particular traits or diseases share patterns of transcriptional regulation. Accordingly, shared transcriptional regulation (coexpression) may help prioritise loci associated with a given trait, and help to identify the biological processes underlying it. Using cap analysis of gene expression (CAGE) profiles of promoter- and enhancer-derived RNAs across 1824 human samples, we have quantified coexpression of RNAs originating from trait-associated regulatory regions using a novel analytical method (network density analysis; NDA). For most traits studied, sequence variants in regulatory regions were linked to tightly coexpressed networks that are likely to share important functional characteristics. These networks implicate particular cell types and tissues in disease pathogenesis; for example, variants associated with ulcerative colitis are linked to expression in gut tissue, whereas Crohn’s disease variants are restricted to immune cells. We show that this coexpression signal provides additional independent information for fine mapping likely causative variants. This approach identifies additional genetic variants associated with specific traits, including an association between the regulation of the OCT1 cation transporter and genetic variants underlying circulating cholesterol levels. This approach enables a deeper biological understanding of the causal basis of complex traits.
ONE SENTENCE SUMMARY We discover that variants associated with a specific disease share expression profiles across tissues and cell types, enabling fine mapping and identification of new disease-associated variants, and illuminating key cell types involved in disease pathogenesis.
Introduction
Genome-wide association studies (GWAS) have considerable untapped potential to reveal new mechanisms of disease1. Variants associated with disease are strongly over-represented in regulatory, rather than protein-coding, sequence; this enrichment is particularly strong in promoters and enhancers2–4. There is emerging evidence that gene products associated with a specific disease participate in the same pathway or process5, and therefore share transcriptional control6.
We have recently shown that cell-type specific patterns of activity at multiple alternative promoters7 and enhancers3 can be identified using cap-analysis of gene expression (CAGE) to detect capped RNA transcripts, including mRNAs, lncRNAs and eRNAs3,5. In the FANTOM5 project, we used CAGE to locate transcription start sites at single-base resolution and quantified the activity of 267,225 regulatory regions in 1824 human samples (primary cells, tissues, and cells following various perturbations)8.
Unlike analysis of chromatin modifications or accessibility, the CAGE sequencing used in FANTOM5 combines extremely high resolution in three relevant dimensions: maximal spatial resolution on the genome, quantification of activity (transcript expression) over a wide dynamic range, and high biological resolution – quantifying activity in a much wider range of cell types and conditions than any previous study of regulatory variation2,4. Since a majority of human protein-coding genes have multiple promoters5 with distinct transcriptional regulation, CAGE also provides a more detailed survey of transcriptional regulation than microarray or RNAseq resources. Heritability of traits studied by GWAS is substantially enriched in these FANTOM5 promoters9.
Genes that are coexpressed are more likely to share common biology10,11. Similarly, regulatory regions that share activity patterns are more likely to contribute to the same biological pathways5. Transcriptional activity of regulatory elements (both promoters and enhancers3) is associated with variable levels of expression arising at these elements in different cell types and tissues5.
In order to determine whether coexpression can provide additional information to prioritise genome-wide associations that would otherwise fall below genome-wide significance, we developed network density analysis (NDA). The NDA method combines genetic signals (disease association in a GWAS) with functional signals (correlation in expression across numerous cell types and tissues, Figure 1), by mapping genetic signals onto a pairwise coexpression network of regulatory regions, and then quantifying the density of genetic signals within the network. Every regulatory region that contains a GWAS SNP is assigned a score quantifying its proximity in the network to every other regulatory region containing a GWAS SNP for that trait. We then identified specific cell types and tissues in which there is preferential activity of regulatory elements associated with selected disease-related phenotypes, thereby providing appropriate cell culture models for critical disease processes.
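The idea of scoring each trait-associated regulatory region by its proximity to the other trait-associated regions can be illustrated with a minimal sketch. This is not the NDA scoring function itself, which is specified in Online Methods; it assumes, purely for illustration, that proximity is the mean positive Pearson correlation between expression profiles. The names `pearson`, `density_scores` and the toy profiles are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a flat profile carries no coexpression information
    return sxy / sqrt(sxx * syy)

def density_scores(profiles, hits):
    """Score each hit region by its mean positive correlation with the other hits.

    ASSUMPTION: an illustrative stand-in for a network-proximity score,
    not the scoring used by NDA.
    """
    scores = {}
    for region in hits:
        others = [h for h in hits if h != region]
        if not others:
            scores[region] = 0.0
            continue
        total = sum(max(0.0, pearson(profiles[region], profiles[h])) for h in others)
        scores[region] = total / len(others)
    return scores

# Toy expression profiles across four samples (hypothetical values).
profiles = {
    "p1": [1.0, 2.0, 3.0, 4.0],
    "p2": [2.0, 4.0, 6.0, 8.0],  # same pattern as p1
    "p3": [4.0, 3.0, 2.0, 1.0],  # opposite pattern
}
scores = density_scores(profiles, ["p1", "p2", "p3"])
```

In this toy example the two regions with matching activity patterns receive a higher score than the region with the opposite pattern, which is the behaviour a density-style score is intended to capture.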
Results
Discovery and prioritisation of GWAS hits in regulatory sequence
We defined regulatory regions as the transcription start site (TSS) −300bp and +100bp for promoters5, and the region between bidirectional TSS for enhancers3 (see Online Methods). For each of 7 GWAS studies for which high-resolution complete datasets were publicly available, we identified a set of regulatory regions containing variants with GWAS p-values below a permissive threshold (5×10⁻⁶; Table 1). We devised NDA to examine the similarity in activity patterns among the set of regulatory regions detected in each GWAS (that is, the similarity in expression profile of transcripts arising from these regulatory regions).
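The promoter definition and the selection of regions containing GWAS variants can be sketched as follows. This is a minimal illustration covering promoters only, and it assumes that the −300bp/+100bp window is oriented by strand. The threshold is left as a parameter, and the toy call uses the permissive value of 5×10⁻⁶ quoted later in this section. The function names and all coordinates are hypothetical.

```python
def promoter_window(tss, strand, upstream=300, downstream=100):
    """Return (start, end) of the promoter region around a TSS.

    ASSUMPTION: upstream/downstream are taken relative to the direction
    of transcription, so the window is mirrored on the minus strand.
    """
    if strand == "+":
        return tss - upstream, tss + downstream
    return tss - downstream, tss + upstream

def regions_with_hits(promoters, snps, p_threshold):
    """Map each promoter to the lowest p-value of any SNP inside its window.

    promoters: dict name -> (chrom, tss, strand)
    snps:      list of (chrom, position, p_value)
    Only SNPs with p_value below p_threshold are considered.
    """
    hits = {}
    for name, (chrom, tss, strand) in promoters.items():
        start, end = promoter_window(tss, strand)
        p_values = [
            p for c, pos, p in snps
            if c == chrom and start <= pos <= end and p < p_threshold
        ]
        if p_values:
            hits[name] = min(p_values)
    return hits

# Toy data (hypothetical coordinates and p-values).
promoters = {"promA": ("chr1", 1000, "+"), "promB": ("chr1", 5000, "-")}
snps = [
    ("chr1", 950, 1e-7),    # inside promA window
    ("chr1", 1200, 1e-9),   # outside promA window
    ("chr1", 5250, 3e-6),   # inside promB window (minus strand)
    ("chr1", 4850, 1e-10),  # outside promB window
]
hits = regions_with_hits(promoters, snps, p_threshold=5e-6)
```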
NDA detected significant coexpression (see below) among the sets of transcripts arising from regulatory regions containing variants associated with each of the following diseases and traits: ulcerative colitis, Crohn’s disease, height, HDL cholesterol, LDL cholesterol, total cholesterol and triglyceride levels (Table 1). One lower-resolution study, of blood pressure, was also analysed: in this smaller study, no coexpression signal was detected among transcripts arising near variants associated with either systolic or diastolic blood pressure (Table 1).
Significant coexpression was only detected within loci containing variants with low p-values (Fig 2a). Similar expression profiles are often seen arising from regulatory regions that are close to each other on the same chromosome, which may also span linkage disequilibrium blocks. The effect of this on the coexpression signal was mitigated by grouping nearby (within 100,000bp) regulatory regions into a single unit, unless they have notably different expression patterns (Fig 2c; Online Methods). SNPs in nearby regulatory regions are also more likely to be in linkage disequilibrium, and these regulatory regions themselves are more likely to share cis or short range trans-regulatory signals in common. We checked for significant linkage disequilibrium between regulatory regions assigned to independent groups (Supplementary files 1, 4-12). At a threshold of r² > 0.8, there is no linkage disequilibrium between significantly coexpressed groups; three examples of weaker linkage relationships were detected with 0.08 ≤ r² ≤ 0.6 (Supplementary file 1).
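The grouping of nearby regulatory regions can be sketched as a single pass along each chromosome. The criterion for "notably different expression patterns" is given in Online Methods; the sketch below assumes, for illustration only, a Pearson correlation cut-off (`min_r`, hypothetical) and compares each region only with the previous member of the current group.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)

def group_regions(regions, profiles, max_gap=100_000, min_r=0.5):
    """Merge neighbouring regions into units unless their profiles differ.

    regions: list of (name, chrom, position)
    ASSUMPTION: min_r is an illustrative cut-off, not the published criterion.
    """
    groups = []
    for name, chrom, pos in sorted(regions, key=lambda r: (r[1], r[2])):
        if groups:
            prev_name, prev_chrom, prev_pos = groups[-1][-1]
            near = chrom == prev_chrom and pos - prev_pos <= max_gap
            similar = pearson(profiles[name], profiles[prev_name]) >= min_r
            if near and similar:
                groups[-1].append((name, chrom, pos))
                continue
        groups.append([(name, chrom, pos)])  # start a new unit
    return [[n for n, _, _ in g] for g in groups]

# Toy data (hypothetical): a and b are close and similar, c is close but
# anticorrelated, d is similar but more than 100,000bp away.
regions = [("a", "chr1", 10_000), ("b", "chr1", 60_000),
           ("c", "chr1", 90_000), ("d", "chr1", 500_000)]
profiles = {"a": [1, 2, 3, 4], "b": [2, 3, 4, 6],
            "c": [4, 3, 2, 1], "d": [1, 2, 3, 4]}
groups = group_regions(regions, profiles)
```

Here only `a` and `b` are merged; `c` is kept separate because its profile differs, and `d` because of distance.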
Regulatory regions around individual TSS with higher coexpression scores contain variants with stronger GWAS p-values (Fig 2b), indicating that this independent signal provides additional information that may be used for fine-mapping causative loci (Fig 2c).
In order to enable the detection of new regulatory regions with strong coexpression relationships, we chose a permissive p-value threshold for trait association of 5×10⁻⁶ (see Online Methods). GWAS data for Crohn’s disease12 were used for initial optimisation of the NDA approach; among GWAS datasets for phenotypes that were not used in algorithm development (i.e. all apart from Crohn’s disease), 0-24% of regulatory regions containing a GWAS SNP showed significant coexpression with other regulatory elements associated with the same phenotype (FDR < 0.05, compared with 100 permuted subsets of equal size; see Online Methods).
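The comparison against permuted subsets of equal size can be illustrated with a simple empirical test. This sketch does not reproduce the FDR calculation described in Online Methods: it assumes a hypothetical summary statistic (mean pairwise Pearson correlation within a set) and reports an empirical p-value against 100 random subsets drawn from all regions. All names and data are hypothetical.

```python
import random
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)

def mean_pairwise_r(profiles, names):
    """Mean correlation over all pairs of regions in `names`."""
    pairs = list(combinations(names, 2))
    return sum(pearson(profiles[a], profiles[b]) for a, b in pairs) / len(pairs)

def permuted_null(profiles, subset_size, n_perm=100, seed=0):
    """Statistic for n_perm random subsets of equal size drawn from all regions."""
    rng = random.Random(seed)
    all_names = sorted(profiles)
    return [mean_pairwise_r(profiles, rng.sample(all_names, subset_size))
            for _ in range(n_perm)]

def empirical_p(observed, null):
    """Share of null statistics at least as large as observed (+1 correction)."""
    return (1 + sum(s >= observed for s in null)) / (1 + len(null))

# Toy data: 30 background regions with random profiles plus three
# perfectly coexpressed "hit" regions (all values hypothetical).
rng = random.Random(1)
profiles = {f"bg{i}": [rng.random() for _ in range(8)] for i in range(30)}
base = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
profiles["h1"] = base
profiles["h2"] = [2 * v for v in base]
profiles["h3"] = [v + 10 for v in base]

hits = ["h1", "h2", "h3"]
observed = mean_pairwise_r(profiles, hits)
null = permuted_null(profiles, subset_size=len(hits), n_perm=100)
p_value = empirical_p(observed, null)
```

With 100 permutations the smallest attainable empirical p-value is 1/101, so a finer-grained significance estimate would need more permutations or a different null model.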
For a given disease, regulatory regions containing GWAS variants are coexpressed if they share similar activity patterns (i.e. similar expression patterns among transcripts arising from these regulatory regions) with other regulatory regions implicated in that disease. Figure 3 shows significant coexpression superimposed on a two-dimensional representation of the entire network of pairwise correlations. Since activity (transcript expression) was measured in numerous samples, the true proximity of regulatory regions to one another cannot be accurately represented in two dimensions; a perfect representation would require as many dimensions as there are unique samples. However, the NDA method is designed to quantify proximity in network space, so that significantly coexpressed elements are detected, even if they are not directly adjacent on a two-dimensional representation of the network (Figure 3). Strong coexpression was seen between loci that were widely separated on the genome (Figure 4).
The coexpression signal essentially combines the signal for association in a GWAS with the location and activity pattern of regulatory regions on the genome. We deliberately chose a permissive GWAS p-value threshold in order to enable the detection of new signals that did not achieve genome-wide significance in the original studies. For example, we found that coexpressed transcripts for both LDL and total cholesterol (TC) arise from promoters for well-studied genes such as APOB13 and ABCG514, but also from regulatory regions not previously associated with cholesterol levels. A promoter for SLC22A1, which encodes an organic cation transporter, OCT115, is strongly coexpressed among elements associated with both conditions (Supplementary File 1). OCT1 transcription is regulated by cholesterol16 and the transporter regulates hepatic steatosis through its role in thiamine transport17. This action of OCT1 is inhibited by metformin17, an oral hypoglycaemic agent whose cholesterol-lowering effect18 is not well understood19. Full results of coexpression analyses are in Supplementary File 1, and online at www.coexpression.net.
Cell-type and tissue specificity
The significantly-coexpressed networks detected here could be regarded as revealing the signature expression profile, at least within the FANTOM5 dataset, for a given disease or trait. We next explored whether these signature expression patterns reveal cell types or biological processes that may contribute to the trait or disease susceptibility.
We therefore ranked cell types and tissues by transcriptional activity for each of the significantly-coexpressed loci for each trait, and combined the rankings using a robust rank aggregation20 (Online Methods). By first detecting the characteristic expression signature associated with a given phenotype using only high-resolution GWAS data, and then detecting the cell type and tissue activity profiles that underlie this signature, we improve on the statistical power of previous methods that have attempted to detect cell-type specific signatures of disease4,6,21. Strong signals reported previously are highly significant in our analysis; for example, genetic loci associated with cholesterol are transcriptionally active in hepatocytes and liver tissue6 (Supplementary File 8).
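Combining per-locus rankings of samples can be sketched with an order-statistic score in the style of robust rank aggregation. The settings actually used are described in Online Methods; this is a simplified illustration that assumes complete rankings without ties. For each sample, the normalised ranks across loci are compared with what would be expected if ranks were uniformly distributed, and the minimum order-statistic p-value is Bonferroni-corrected by the number of loci. Sample names and activity values are hypothetical.

```python
from math import comb

def normalised_ranks(activity):
    """Rank samples by activity (highest first) and scale ranks to (0, 1]."""
    ordered = sorted(activity, key=activity.get, reverse=True)
    return {s: (i + 1) / len(ordered) for i, s in enumerate(ordered)}

def order_stat_p(x, k, n):
    """P(k-th smallest of n independent uniform(0,1) draws <= x)."""
    return sum(comb(n, j) * x**j * (1 - x) ** (n - j) for j in range(k, n + 1))

def rho_score(ranks):
    """Minimum order-statistic p-value over sorted ranks, multiplied by n (capped at 1)."""
    r = sorted(ranks)
    n = len(r)
    best = min(order_stat_p(x, k, n) for k, x in enumerate(r, start=1))
    return min(1.0, n * best)

def aggregate(per_region_activity):
    """Return samples sorted from most to least consistently highly ranked."""
    samples = list(next(iter(per_region_activity.values())))
    tables = [normalised_ranks(a) for a in per_region_activity.values()]
    scores = {s: rho_score([t[s] for t in tables]) for s in samples}
    return sorted(scores.items(), key=lambda kv: kv[1])

# Toy activity of three coexpressed loci in four samples (hypothetical).
per_region_activity = {
    "regA": {"hepatocyte": 90, "liver": 70, "monocyte": 5, "neuron": 1},
    "regB": {"hepatocyte": 80, "liver": 60, "monocyte": 10, "neuron": 2},
    "regC": {"hepatocyte": 50, "liver": 20, "monocyte": 30, "neuron": 3},
}
ranking = aggregate(per_region_activity)
```

In this toy example the sample that is ranked first for every locus receives the lowest score; samples that are ranked inconsistently or low are pushed towards a score of 1.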
This analysis reveals robust cell-type associations that have important implications for understanding disease pathogenesis. For example, cell-type associations with Crohn’s disease were restricted to immune cells, particularly monocytes exposed to inflammatory stimuli (Supplementary File 4). In contrast, cell type associations with ulcerative colitis were statistically significant in rectum, colon and intestine samples, and in a distinct group of immune cells: macrophages exposed to bacterial lipopolysaccharide (Supplementary File 5). This is consistent with the view that ulcerative colitis, in which disease processes are primarily restricted to the colon and rectum, is a consequence of dysregulation of processes that are intrinsic to the large bowel, including epithelial barrier function22, whereas Crohn’s disease is a multisystem autoimmune disorder with more diverse extra-intestinal manifestations23, consistent with a primary immune aetiology.
Discussion
The development of high-throughput genotyping methods has led to an explosion of associations between genetic markers and human diseases24. The results presented here are a step towards overcoming the next challenge for this field: making sense of these associations to advance the practice of medicine. There has been increasing recognition of the potential to utilise prior knowledge to improve detection and interpretation of genome-wide signals25. The results of our analysis demonstrate that there is biological information in the coexpression of genetic variants associated with a particular disease that can provide the basis for prioritising variants that would not otherwise meet standard thresholds for genome-wide statistical significance.
We report relationships between numerous regulatory regions that are not associated with named genes – a restriction that has previously limited the transition from genetic discovery to biological understanding26-30. The analysis reveals the impact of specific enhancers and promoters that may be remote from the genes they regulate, or may contribute to tissue-specific regulation of a gene that may otherwise appear to be more widely-expressed.
Even for those disease-associated variants that can be reliably assigned to a named gene, previous attempts to draw functional inferences have, by necessity, relied on published data26, annotated biological pathways31, or gene sets30,32. Although many important insights have been gained from these approaches, they share a fundamental limitation: reliance on existing knowledge. This restricts the ability to exploit the potential of genomics to deliver insights into new, previously unseen, mechanisms of disease33.
The data used for development and testing of the coexpression approach were from large meta-analyses that incorporate genotyping (or imputation) of genetic variants at extremely high resolution, increasing the probability that variants will be found within regulatory regions. In future, the availability of whole-genome sequencing can reasonably be expected to produce many additional high-quality datasets for coexpression analysis. In principle, the NDA approach can be generalised to any network in which it is desirable to quantify the proximity of a subset of nodes.
The scale, depth and breadth of the FANTOM5 expression atlas, together with the NDA approach, enable detection of subtle coexpression signals for regulatory regions that have previously been undetectable. As additional genetic studies become available at greater genotyping resolution, we anticipate that this method will detect new genetic associations with disease, reveal coexpressed modules underlying pathogenesis, and identify critical cell types implicated in mechanisms of disease.
DATA ACCESS
The FANTOM5 atlas is accessible from http://fantom.gsc.riken.jp/data/
An online service running the coexpression method is available at https://coexpression.roslin.ed.ac.uk
username: fantom5
password: review
Authors’ contributions
JKB conceived the study, designed and led the analyses and wrote the manuscript. AB and AG contributed to computational optimisation and description of methods. AB and SC generated network and circos images, respectively. CH, JB, JBB, TF, and AT advised on statistical and network analysis methods. ML managed the data collection, including annotation, expression profiling, metadata association and archiving. CW, JS, NH and TF contributed biological expertise. CW, RA, JS, AS, MR, VB, and PH advised on methodology. ARRF, MI, CD, NK, TL, JK, HS, HK, YH, and PC organised the FANTOM5 project including sample collection, data production, mapping and tag clustering. JKB, ARRF, DAH, TF, GJF, PC and YH provided resources. DAH and ARRF advised on methodology, and contributed to the manuscript. All authors contributed to and approved the final version of the manuscript.
DISCLOSURE DECLARATION
All authors report that they have no conflicts of interest to declare in respect of this manuscript.
ACKNOWLEDGEMENTS
We would like to express our gratitude for the diligence and professionalism of the entire FANTOM5 consortium and to the members of the IIBDGC group, GIANT consortium, and Global Lipids consortium for freely sharing their data. We are particularly grateful to the tens of thousands of patients and healthy volunteers who donated DNA and other material to these studies.
JKB gratefully acknowledges funding support from a Wellcome Trust Intermediate Clinical Fellowship (103258/Z/13/Z) and a Wellcome-Beit Prize (103258/Z/13/A), BBSRC Institute Strategic Programme Grant to the Roslin Institute (BBS/E/D/20241864), the UK Intensive Care Foundation, and the Edinburgh Clinical Academic Track (ECAT) scheme. Funds were provided to the Roslin Institute through a BBSRC Strategic Programme Grant (JKB, SC, CSH, GJF, TCF, DAH; BBS/E/D/20211551, BBS/E/D/20231760). We acknowledge the financial support provided by the MRC-HGU Core Fund (CSH, AT). FANTOM5 was made possible by a Research Grant for RIKEN Omics Science Center from MEXT to YH and a Grant of the Innovative Cell Biology by Innovative Technology (Cell Innovation Program) from the MEXT, Japan to YH. RIKEN Centre for Life Science Technologies, Division of Genomic Technologies members (RIKEN CLST (DGT)) are supported by institutional funds from the MEXT, Japan. ARRF is supported by a Senior Cancer Research Fellowship from the Cancer Research Trust and funds raised by the Ride to Conquer Cancer. JCB is supported by Wellcome Trust grant WT098051. GJF acknowledges the support of an NHMRC Career Development Fellowship (GNT1045237), NHMRC Project Grants (GNT1042449, GNT1045991, GNT1067983 and GNT1068789), and the EU FP7 under grant agreement No. 259743 underpinning the MODHEP consortium. MR was supported by grants from the Deutsche Forschungsgemeinschaft, the German Cancer Aid and the Rudolf Bartling Foundation. RA was supported by funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 638273). US and VBB are supported by the KAUST Base Research Fund to VBB and KAUST CBRC Base Fund. RMP is supported by grants from the US National Institutes of Health (R01-AR057108, R01-AR056768, U01-GM092691 and R01-AR059648) and holds a Career Award for Medical Scientists from the Burroughs Wellcome Fund.
RA and AS were supported by funds from FP7/2007-2013/ERC grant agreement 204135, the Novo Nordisk foundation, and the Lundbeck Foundation and the Danish Cancer Society. CAW is supported by a Queensland Government Smart Futures Fellowship, and samples were collected under Australian National Health and Medical Research Council project grants 455947 and 597452, under agreement from the Australian Red Cross 11-02QLD-10 and the University of QLD ethics committee.