Abstract
Genome-wide association studies in autoimmune and inflammatory diseases (AID) have uncovered hundreds of loci mediating risk1,2. These associations are preferentially located in non-coding DNA regions3,4, and in particular in tissue-specific DNase I hypersensitivity sites (DHS)5,6. Whilst these analyses clearly demonstrate the overall enrichment of disease risk alleles in gene regulatory regions, they are not designed to identify individual regulatory regions mediating risk or the genes under their control, and thus cannot uncover the specific molecular events driving disease risk. To do so we have departed from standard practice by identifying regulatory regions which replicate across samples, and connecting them to the genes they control through robust re-analysis of public data. We find substantial evidence of regulatory potential in 132/301 (44%) risk loci across nine autoimmune and inflammatory diseases, and are able to prioritize a single gene in 104/132 (79%) of these. Thus, we are able to generate testable mechanistic hypotheses of the molecular changes that drive disease risk.
The autoimmune and inflammatory diseases (AID) are a group of more than 80 common, complex diseases driven by systemic or tissue-specific immunological attack. This pathology is driven by loss of tolerance to self-antigens or chronic inflammatory episodes leading to long-term organ and tissue damage. Risk variants identified by genome-wide association studies (GWAS) are preferentially located in non-coding regions with tissue-specific chromatin accessibility7,3,8,9 and in transcriptional enhancer regions active after T cell stimulation4. Formal analyses partitioning the heritability of disease risk across different genomic regions support this enrichment6, with excess heritability localizing to tissue-specific DNase I hypersensitive sites (DHS)5. Cumulatively, these results suggest that AID pathology is mediated by changes to gene regulation in specific cell populations, but the underlying analyses are not designed to identify individual regulatory regions mediating risk or the genes under their control. Several fine-mapping efforts have jointly considered genetic association and epigenetic modification data as a way to identify causal variants10,11,12. However, these efforts use epigenetic mark information to assess whether associated variants are likely to be causal, rather than to identify the regulatory sequences that mediate risk and the genes they affect.
We have therefore developed a systematic approach to identify regulatory regions mediating disease risk, and thus generate testable mechanistic hypotheses of the molecular changes that drive disease risk (Supplementary Figure 1). For each association, we first calculate posterior probabilities of association from GWAS data and thence the set of markers forming the 99% credible interval (CI)13,32,14. We then overlap CI SNPs with DHS in the region to identify which regulatory regions may harbor risk, and from these SNPs calculate the fraction of posterior probability attributable to each DHS. We chose DHS as they are general markers of chromatin accessibility and typically only 150-250 base pairs long, compared to histone modifications, which can span tens to hundreds of kilobase pairs. Next, we identify genes controlled by each DHS by correlating chromatin accessibility state to expression levels of nearby genes. We use the atlas of tissues available from the Roadmap Epigenomics Project (REP)15,16, where both DHS and gene expression have been measured in the same samples. Finally, we combine the posterior probability of disease association of each DHS and the correlation between that DHS and the expression levels of nearby genes to calculate a per-gene posterior probability of disease association. This allows us to estimate the probability that a gene mediates disease risk, and to rank genes in a locus by these values.
DHS peaks, like all epigenetic marks, are called in each sample separately17. We therefore clustered DHS peaks to identify those corresponding to the same underlying regulatory site, so we could correlate accessibility state of the same site to gene expression data (Supplementary Figure 2). In 56 REP tissues with at least two replicate DHS sequencing runs, we called 22,060,505 narrow-sense 150bp peaks at a false discovery rate (FDR) < 1%, which fell into 1,994,675 DHS clusters of 250-400bp each, covering 14.8% of the autosomal genome. Of these, 1,079,138 (54.1%) covering 8.5% of the genome passed nominal significance in a statistical replication test (χ² test, p < 0.05). This subset explains essentially all the heritability attributable to all peaks in multiple sclerosis and inflammatory bowel disease GWAS (Supplementary Figure 3), indicating they represent the majority of regulatory regions relevant to AID risk. Of these 56 REP tissues, 22 also have gene expression measurements, from which we calculated the correlation between DHS accessibility state and transcript levels. We therefore restricted our present analysis to 1,079,138 DHS clusters and 13,771 genes across these 22 REP tissues, though we note our framework can be used with any regulatory feature and expression dataset, and is publicly available.
With this framework, we dissected 301 genome-wide significant associations to one of nine AID, using publicly available summary association statistics from samples genotyped on the Immunochip, a targeted genotyping array18,19 (available at immunobase.org; Table 1). We first collated all genome-wide significant associations reported for each disease, then restricted our analysis to the loci genotyped at high density on the Immunochip13,32. We excluded the Major Histocompatibility Complex (MHC) locus, where complex LD patterns make credible interval mapping challenging20. For each association, we calculated posterior probabilities of association for all markers and defined credible interval SNP sets13,14. We find that a median of 4 (standard deviation, sd = 7.8) DHS clusters overlap CI SNPs, out of a median of 822 (sd = 205.2) DHS clusters in each 2Mb window around an association (Figure 1A), indicating that this data integration step alone vastly reduces the number of potentially disease-relevant regulatory regions.
To assess how likely each association is to be mediated by variants in a regulatory region, we compute its regulatory potential ρ, defined as the proportion of the posterior probability of association localizing to DHS clusters. Consistent with previous observations3,21,4 we find that risk often localizes to DHS clusters: over 25% of the posterior is located on DHS clusters in 132/301 (44%) of loci, and over 50% of the posterior in 53/301 loci (18%) (Figure 1C). We reasoned that if DHS clusters harboring CI SNPs actually mediate risk, their accessibility state should be perturbed by the variants they harbor, and they should be accessible in disease-relevant cell populations. We find that CI SNPs on DHS clusters are more likely to induce allele-specific accessibility22 (Fisher exact test p = 7e−6, Figure 1E), and that these DHS clusters are more likely to be accessible in immune cell subpopulations (Figure 1F). These results show our approach identifies disease-relevant regulatory events, and support the view that common genetic variants influence disease risk by altering the accessibility of gene regulatory regions.
Having validated that our analysis was identifying genuine regulatory risk effects, we next turned to identifying specific disease-mediating DHS clusters and the genes they control (Supplementary Table 1 and Supplementary Table 2). We focused on the 132 loci where regulatory potential ρ > 0.25, as these associations may be mediated by genetic perturbation of a regulatory region. We found that an average of two DHS clusters (sd = 4.0) account for > 90% of the total association posterior attributable to all DHS clusters in these loci, indicating we can resolve most loci to a small number of candidate regulators (Supplementary Figure 4). By correlating the accessibility state (open or closed) of each DHS cluster to the expression of nearby genes across 22 REP tissues, we were able to prioritize a median of 3/14 genes per locus (Figure 1B and Supplementary Table 2), and could attribute a gene-wise proportion of posterior probability γ > 0.25 to a single gene in 104/132 (79%) loci. Surprisingly, the top-scoring genes are not the closest to the most associated variant in 92/104 (88%) of these cases, suggesting that risk-relevant regulatory regions exert influence over genes at considerable distances (Table 2). The DHS clusters with high ρ values are more likely to be marked as active enhancers of transcription, which can bind distant promoters through long-range DNA looping events23,24,25, further supporting this conclusion (Supplementary Figure 5).
In several cases, we found evidence supporting a previous hypothesis for a causal gene in a locus. For example, we were able to resolve an association to multiple sclerosis (MS) risk on chromosome 1 to two DHS clusters, both of which implicate the CD58 gene (γ = 0.49, Figure 2). CD58 encodes lymphocyte function-associated antigen 3 (LFA3), a co-stimulatory molecule expressed by antigen presenting cells, mediating their interaction with circulating T cells by binding lymphocyte function-associated antigen 2 (LFA2)26. The latter is encoded by the CD2 gene immediately proximal to CD58, but does not show strong evidence of control by risk-mediating DHS clusters (γ = 0.12). The protective MS effect in this region is associated with an increase in CD58 expression, leading to an up-regulation of the transcription factor FoxP3 via CD2. This results in enhanced functioning of CD4+CD25high regulatory T cells, thought to be defective in MS patients26. Similarly, we are able to prioritize ETS2 for an inflammatory bowel disease (IBD) association on chromosome 21 (γ = 0.92, Supplementary Figure 6), CD40 for another MS association on chromosome 20 (γ = 0.31, Supplementary Figure 7), and IRF8 for a rheumatoid arthritis (RA) association on chromosome 16 (γ = 0.43, Supplementary Figure 8).
Many Immunochip loci harbor associations to multiple diseases, suggesting that a portion of risk is shared27,28. Consistent with this observation, we found that 24 Immunochip loci had ρ > 0.25 for more than one disease, representing 59 of the 301 initially considered associations. Of these, 17/24 loci showed regulatory potential in two AID, with four, two, and a single locus showing regulatory potential in three, four and five AID, respectively. Due to the correlation imposed by linkage disequilibrium, it remains challenging to conclude that associations to different traits in the same locus represent a true shared effect, where the same underlying causal variant drives risk for multiple diseases29. We therefore sought to establish if associations to different diseases in these 24 loci identify the same DHS clusters and prioritize the same genes, indicating a shared effect. We found striking examples of shared and distinct effects across these 24 loci. For example, five diseases show association to a region of chromosome 6, with the most significant SNPs residing in the coding region of BACH2. We are able to prioritize the associations for autoimmune thyroid disease (AITD), MS and type 1 diabetes (T1D) to a single DHS cluster each, and independently prioritize MDN1 as the most likely target gene for these effects (γAITD = 0.81, γMS = 0.42, and γT1D = 0.73, Figure 3). Our model only attributes a small proportion of the overall posterior probability of association to BACH2 for these diseases (γAITD = 0.14, γMS = 0.09, and γT1D = 0.13). In contrast, we find that the associations for IBD and celiac disease (CEL) each identify different DHS clusters and prioritize MAP3K7 (γIBD = 0.59) and GABBR2 (γCEL = 0.13), respectively, despite the credible intervals for these diseases essentially overlapping those for AITD, MS and T1D (Figure 3).
We note that the most associated SNPs for MS, AITD, and T1D are the same (rs72928038), and the R2 values between this SNP and the most associated SNPs of IBD (rs1847472) and CEL (rs7753008) are 0.34 and 0.25, respectively. Similarly, we identify the same DHS cluster and prioritize CTLA4 (γAITD = 0.47, γRA = 0.38, and γT1D = 0.52) and ICOS (γAITD = 0.41, γRA = 0.33, and γT1D = 0.46) for AITD, RA and T1D associations on chromosome 2 (Supplementary Figure 9). We are thus able to begin resolving associations across multiple diseases into shared and distinct effects in the same locus.
To more generally assess how our approach resolves shared associations, we assessed the overlap between shared signals in the 24 loci. We compared the overlap between 51 pairs of associations in terms of most associated markers, credible interval sets, DHS clusters harboring CI variants, and genes identified (Table 3). We found that, whilst the overlap between lead variants was low, we could more often identify the same DHS clusters and prioritize the same genes (Fisher exact test between proportion of lead SNPs and prioritized genes p = 0.014). We found the rate of prioritized gene overlap is correlated to linkage disequilibrium between lead variants (Supplementary Figure 10), suggesting that though GWAS may not identify the same variant representing a shared association, shared effects can clearly be identified by considering the likely functional effects in a locus. These observations hold true when we only consider the 17 loci harboring two disease associations (Supplementary Table 3 and Supplementary Figure 10), indicating our conclusions are not based on biases in a minority of loci harboring many associations. Thus, our approach can uncover biological pleiotropy30 across diseases even when the identity of the causal variant remains unknown, beyond the comparison of credible interval sets.
We have described an approach to detect gene regulatory regions driving disease risk and, through them, the genes likely to mediate pathogenesis, through robust re-analysis of public data. We find evidence of regulatory potential in a substantial proportion (44%) of loci across nine AID, and in 104/132 (79%) of these we resolve the association to a single gene controlled by regulatory regions active in immune cells. In the majority of loci we examine, we do not prioritize the gene closest to the maximally associated marker. This suggests that risk-mediating regulatory elements act at considerable distances, either by influencing the overall transcriptional landscape of the region or by acting on individual genes at a distance through DNA looping events mediated by DNA-protein interactions31. These competing explanations make different predictions: the former implies many genes will be controlled by the risk-mediating regulator, whereas the latter predicts a limited number of targets. As we are able to prioritize a single gene in the majority of cases, our results strongly suggest that risk is mediated by changes to specific gene regulatory programs affecting particular genes, which must be involved in pathogenesis.
More broadly, the observation that most common, complex disease risk aggregates in gene regulatory regions3,4,5 has made the translation of genetic association results into molecular and cellular mechanisms challenging. Fine-mapping is limited in resolution by linkage disequilibrium, making association data alone insufficient to identify a causal variant driving risk in a locus. For example, in a recent Immunochip study of multiple sclerosis32, we were able to reduce 14/66 (21%) Immunochip regions to 90% credible interval sets of fewer than 15 variants, and 5/66 to fewer than 5 variants, though increases in sample size will raise the resolution of these approaches14. Unlike coding variants, inferring function of non-coding polymorphisms remains challenging, though efforts to integrate functional genomics and population genetics data into composite functional scores33,34 or integrating genetic and epigenetic data11 are gaining some traction on this problem. Our own work complements these efforts by focusing on identifying individual regulators and the genes they control to generate testable hypotheses of the molecular basis of disease mechanism.
Methods
DNase I Hypersensitivity data peak-calling, clustering, and quality control
We obtained processed DNase I hypersensitivity (BED format) sequencing reads for 350 Roadmap Epigenomics Project (REP) samples15,16 corresponding to 73 cell types from http://www.genboree.org/EdaccData/Current-Release/experiment-sample/Chromatin_Accessibility/. For each sample, we called 150bp DNase I hypersensitive sites (DHS) passing a 1% FDR threshold17. We found 56 tissues with at least two replicates, which our statistical replication design requires, and limited our analysis to these. Where more than two replicates were available, we chose the two replicates with the smallest Jaccard distance between their DHS peak positions on the genome.
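The replicate-selection step can be illustrated with a minimal sketch. It assumes the Jaccard distance is computed over the base pairs covered by each replicate's peaks on a single chromosome; the interval representation, the function names and the toy coordinates are hypothetical and are not taken from the original pipeline.

```python
from itertools import combinations


def covered_bases(peaks):
    """Set of base positions covered by half-open (start, end) peaks."""
    bases = set()
    for start, end in peaks:
        bases.update(range(start, end))
    return bases


def jaccard_distance(peaks_a, peaks_b):
    """1 - |A and B| / |A or B| over covered base pairs (assumed definition)."""
    a, b = covered_bases(peaks_a), covered_bases(peaks_b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def closest_pair(replicates):
    """Names of the two replicates with the smallest Jaccard distance."""
    return min(
        combinations(sorted(replicates), 2),
        key=lambda pair: jaccard_distance(replicates[pair[0]], replicates[pair[1]]),
    )


# Toy data: three replicates of one tissue (hypothetical coordinates).
replicates = {
    "rep1": [(100, 250), (1000, 1150)],
    "rep2": [(120, 270), (1000, 1150)],
    "rep3": [(5000, 5150)],
}
best = closest_pair(replicates)
distance = jaccard_distance(replicates["rep1"], replicates["rep2"])
```

A set-of-bases representation is only practical for toy inputs; a genome-scale version would work on sorted intervals instead.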
To identify corresponding DHS across samples, we calculated the overlap between neighboring peaks across the 112 replicate samples as a function of Oi,j, the number of base pairs shared by DHS i and j, and of li and lj, the lengths of DHS i and j respectively. We then grouped DHS with a graph-based approach, the Markov Clustering Algorithm35 (MCL), using the default parameters, and defined the coordinates of a DHS cluster as the extreme positions covered by DHS peaks included in that cluster. Finally, we define each cluster as accessible in a sample if we observe at least one DHS peak within its boundaries in that sample.
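The two definitions that close this step, cluster coordinates and per-sample accessibility, can be sketched as follows. The clustering itself (MCL) is not reproduced here; the sketch assumes cluster membership is already known, and the sample names and coordinates are hypothetical.

```python
def cluster_bounds(member_peaks):
    """Cluster coordinates: the extreme positions covered by its member peaks.

    member_peaks is a list of (sample, start, end) tuples.
    """
    starts = [start for _, start, _ in member_peaks]
    ends = [end for _, _, end in member_peaks]
    return min(starts), max(ends)


def accessibility_vector(bounds, peaks_by_sample):
    """True for each sample with at least one peak inside the cluster bounds."""
    lo, hi = bounds
    return {
        sample: any(lo <= start and end <= hi for start, end in peaks)
        for sample, peaks in peaks_by_sample.items()
    }


# Toy cluster built from three peaks in three samples (hypothetical data).
members = [("A_rep1", 100, 250), ("A_rep2", 120, 270), ("B_rep1", 90, 240)]
peaks_by_sample = {
    "A_rep1": [(100, 250)],
    "A_rep2": [(120, 270)],
    "B_rep1": [(90, 240)],
    "B_rep2": [(5000, 5150)],
}
bounds = cluster_bounds(members)
state = accessibility_vector(bounds, peaks_by_sample)
```

The resulting open/closed vector per cluster is the quantity later correlated with gene expression.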
Both peak calling and MCL clustering are naive to sample labels, so we can test for evidence that DHS clusters replicate in this analysis. We expect that DHS clusters representing true regulatory regions should be consistently accessible or inaccessible in replicate samples. We can thus calculate a replication statistic for DHS cluster d as: $S_d = N \left[ \frac{(a - p^2)^2}{p^2} + \frac{(b - 2pq)^2}{2pq} + \frac{(c - q^2)^2}{q^2} \right]$, where n1 is the number of cell types where DHS cluster d is active in both replicates; n2 is the number of cell types where the cluster is active in only one of the two replicates; and n3 is the number of cell types where the cluster is inactive in both replicates. For N = 56 tissues in our data a = n1/N, b = n2/N and c = n3/N. Further, if r is the number of samples where the DHS cluster is active, then p = r/(2 × N), and q = 1 − p. Note that we distinguish between the number of cell types (N = 56) and the number of samples considered (2 × N = 112). We expect Sd to follow a χ² distribution, and selected DHS clusters passing a nominal significance threshold of pd ≤ 0.05. To assess if DHS clusters capture the majority of disease-relevant signal, we compared the proportion of disease heritability (h2g) explained by all detected DHS peaks in a tissue to that explained by the active DHS clusters we annotated6. For this we used genome-wide association summary statistics for MS36 and IBD37.
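As an illustration of the replication test, the sketch below computes one statistic consistent with the definitions of a, b, c, p and q given above: a Pearson goodness-of-fit comparison of the observed proportions against p², 2pq and q², the proportions expected if the two replicates were independent. Both this form and the choice of one degree of freedom for the P value are assumptions made for the example, not details confirmed by the text.

```python
import math


def replication_statistic(n1, n2, n3):
    """Pearson-style statistic and P value for one DHS cluster.

    n1: cell types active in both replicates
    n2: cell types active in exactly one replicate
    n3: cell types inactive in both replicates
    Assumes a chi-square reference distribution with 1 degree of freedom.
    """
    n_types = n1 + n2 + n3
    a, b, c = n1 / n_types, n2 / n_types, n3 / n_types
    r = 2 * n1 + n2                 # samples in which the cluster is active
    p = r / (2 * n_types)
    q = 1 - p
    expected = (p * p, 2 * p * q, q * q)
    if min(expected) == 0:          # always or never active: undefined
        return math.nan, math.nan
    s = n_types * sum((o - e) ** 2 / e for o, e in zip((a, b, c), expected))
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(s / 2))
    return s, p_value


# A cluster whose replicates agree in 54 of 56 cell types (toy counts).
s_concordant, p_concordant = replication_statistic(20, 2, 34)
# A cluster whose replicates look independent of each other.
s_random, p_random = replication_statistic(14, 28, 14)
```

With these toy counts the concordant cluster departs strongly from the independence expectation, while the second cluster matches it exactly.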
Expression profile processing and analysis across REP tissues
There are 88 Roadmap Epigenomics Project (REP) samples corresponding to 27 cell types profiled on the Affymetrix HuEx-1_0-st-v2 exon array, which we downloaded as raw CEL files on 9/25/2013 from http://www.genboree.org/EdaccData/Current-Release/experiment-sample/Expression_Array/. We processed these data using standard methods available from the BioConductor project38. Briefly, we filtered cross-hybridizing probesets, corrected background intensities with RMA and quantile normalized the remaining probeset intensities across samples. We then collapsed probesets to transcript-level intensities, and mapped transcripts to genes using the current Gencode annotations for human genes (version 12), removing any transcripts without a single exact match to a gene annotation. We then identified the 22 tissues with matched DHS data, averaged measurements over all replicates of each tissue, and quantile normalized the resulting dataset, comprising 13,822 transcripts mapping to 13,771 unique gene IDs.
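Quantile normalization, used twice in this procedure, can be sketched in a few lines. This is a generic illustration and not the BioConductor implementation used in the study; in particular, ties are broken by position here, which is a simplification, and the tissue names and values are hypothetical.

```python
def quantile_normalize(columns):
    """Force every sample (column) onto the same empirical distribution.

    columns maps a sample name to a list of intensities; all lists must
    have the same length and the same feature order.
    """
    n = len(next(iter(columns.values())))
    sorted_cols = [sorted(values) for values in columns.values()]
    # Reference distribution: mean of the i-th smallest value across samples.
    reference = [sum(col[i] for col in sorted_cols) / len(sorted_cols) for i in range(n)]
    normalized = {}
    for name, values in columns.items():
        order = sorted(range(n), key=lambda i: values[i])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]
        normalized[name] = out
    return normalized


# Two tissues, three transcripts (toy intensities).
expression = {"tissue1": [5.0, 2.0, 3.0], "tissue2": [4.0, 1.0, 6.0]}
normalized = quantile_normalize(expression)
```

After normalization each tissue carries the same set of values, and only the ranking of transcripts within a tissue differs.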
Credible interval mapping for Immunochip loci
We obtained publicly available summary association statistics from case/control cohorts profiled on the Immunochip (Immunobase, http://www.immunobase.org; accessed May 2015) for autoimmune thyroid disease (AITD)39, celiac disease (CEL)40, inflammatory bowel disease (IBD)41, juvenile idiopathic arthritis (JIA)42, multiple sclerosis (MS)32, primary biliary cirrhosis (PBC)43, psoriasis (PSO)44, rheumatoid arthritis (RA)45, and type 1 diabetes (T1D)46 (Table 1). For each of the nine diseases, we compiled a list of genome-wide significant associations from the largest published GWAS39,40,32,47,46,42,44,37,45. We then pruned this list of lead SNPs to include only those that overlap densely genotyped regions of Immunochip data and were present in the 1000 Genomes European ancestry cohorts48. We excluded the Major Histocompatibility Complex (MHC) region on chromosome 6, where fine-mapping has been previously reported20. As summary statistics for conditional associations are not available, we limited our analyses to primary reported signals in each disease.
We identified credible interval SNPs explaining 99% of the posterior probability of association for the remaining lead SNPs13,14. For each lead SNP, we identified SNPs within 2Mb in linkage disequilibrium r2 > 0.1 in the non-Finnish European 1000 Genomes reference panels48. For each set S of these SNPs, we calculated posterior probabilities of association as $PP_i = \frac{\exp(\chi^2_i / 2)}{\sum_{j \in S} \exp(\chi^2_j / 2)}$, where $\chi^2_i$ is the Immunochip association chi-square test statistic of SNP i. We then selected the smallest number of SNPs required to explain 99% of the posterior probability.
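The credible interval construction can be sketched as below. The sketch assumes the posterior of each SNP is proportional to exp(χ²/2), a common approximation when only summary chi-square statistics are available; the SNP identifiers and statistics are hypothetical.

```python
import math


def posterior_probabilities(chi2):
    """Posterior per SNP, assumed proportional to exp(chi2 / 2) within the set."""
    top = max(chi2.values())  # subtract the maximum for numerical stability
    weights = {snp: math.exp((x - top) / 2) for snp, x in chi2.items()}
    total = sum(weights.values())
    return {snp: w / total for snp, w in weights.items()}


def credible_set(pp, coverage=0.99):
    """Smallest set of SNPs whose posteriors sum to at least `coverage`."""
    chosen, cumulative = [], 0.0
    for snp, p in sorted(pp.items(), key=lambda kv: kv[1], reverse=True):
        chosen.append(snp)
        cumulative += p
        if cumulative >= coverage:
            break
    return chosen


# Toy locus with four SNPs in LD with the lead SNP (hypothetical values).
chi2 = {"snpA": 40.0, "snpB": 38.0, "snpC": 30.0, "snpD": 10.0}
pp = posterior_probabilities(chi2)
ci_snps = credible_set(pp)
```

In this toy locus the two strongest SNPs already carry more than 99% of the posterior, so the credible set contains only those two.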
Calculating regulatory potential of disease loci
We first overlapped credible interval (CI) SNPs with our DHS clusters, then computed the posterior probability of association attributable to each DHS cluster d as $\rho_d = \sum_{s} PP_s \, O_d(s)$, where PPs is the posterior probability of association for SNP s. Od(s) is equal to one if SNP s is located on DHS cluster d or the 100 bp flanking region on each side of DHS cluster d, and it is zero otherwise. For SNPs overlapping two or more DHS clusters or their 100 bp flanking regions, we divided the posterior probability PPs between those DHS clusters equally. We then calculated the regulatory potential of each disease risk locus over all DHS clusters in the locus as $\rho = \sum_{d \in D} \rho_d$, where D is the set of all DHS clusters in the region.
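A minimal sketch of this calculation follows, assuming inclusive cluster coordinates on a single chromosome. The cluster names, SNP positions and posteriors are hypothetical.

```python
def dhs_posteriors(snp_pp, snp_pos, clusters, flank=100):
    """Posterior attributable to each DHS cluster (including flanks).

    A SNP that falls in several clusters splits its posterior equally.
    """
    rho_d = {name: 0.0 for name in clusters}
    for snp, pp in snp_pp.items():
        pos = snp_pos[snp]
        hits = [
            name
            for name, (start, end) in clusters.items()
            if start - flank <= pos <= end + flank
        ]
        for name in hits:
            rho_d[name] += pp / len(hits)
    return rho_d


# Toy locus: two nearby clusters and three credible interval SNPs.
clusters = {"dhs1": (1000, 1300), "dhs2": (1450, 1750)}
snp_pos = {"s1": 1100, "s2": 1380, "s3": 3000}
snp_pp = {"s1": 0.5, "s2": 0.2, "s3": 0.29}

rho_d = dhs_posteriors(snp_pp, snp_pos, clusters)
rho = sum(rho_d.values())  # regulatory potential of the locus
```

Here s2 lies in the flanks of both clusters and contributes 0.1 to each, while s3 lies outside every cluster and contributes nothing, giving a regulatory potential of 0.7.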
Calculating posterior probabilities of association for each gene in a risk locus
We identified all genes within 1Mb of the lead SNP for each locus, and for all DHS clusters with ρd > 0, computed the correlation between transcript levels and DHS accessibility across the 22 REP tissues with a two-sided Wilcoxon rank sum test. To account for the correlation structure between expression levels, we estimated the expected null empirically. We first decorrelated the matrix of gene expression levels to WPCA using PCA whitening, then used the Cholesky decomposition of the covariance matrix (L) to obtain the expected null as GNull = L'WPCA (Supplementary Figures 11 and 12). For any given DHS cluster d, we computed the Wilcoxon rank sum test statistics between d and all genes of GNull, which formed our null distribution of test statistics. From this null, we computed the empirical P-value of the correlation between DHS cluster d and gene g, Pd,g, by counting the null statistics at least as extreme as the Wilcoxon rank sum test statistic observed between DHS cluster d and gene g, in a formulation that accounts for the two-sided test. We used a permutation-based approach to assess the significance of the correlation between DHS clusters and gene expression using a random set of 2000 genes from across the genome. We correlated each random gene to each DHS cluster, and compared test genes against this expected distribution of correlation coefficients to obtain an empirical P value (Supplementary Figure 12).
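The empirical P value can be illustrated with a small sketch. It assumes that "two-sided" means comparing absolute deviations of the rank sum from its expectation under no association, and it adds one to the numerator and denominator so that the estimate is never zero; both choices are assumptions of this example. The null genes here are arbitrary toy vectors rather than the decorrelated null matrix described above.

```python
def rank_sum(values, is_open):
    """Wilcoxon rank sum of the tissues where the DHS cluster is open.

    Tied values receive their average rank.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return sum(r for r, o in zip(ranks, is_open) if o)


def empirical_p(observed, null_stats, center):
    """Share of null statistics at least as far from `center` as observed."""
    extreme = sum(1 for w in null_stats if abs(w - center) >= abs(observed - center))
    return (extreme + 1) / (len(null_stats) + 1)


# One DHS cluster across six tissues, one test gene, four null genes (toy data).
is_open = [True, True, True, False, False, False]
gene = [9.0, 8.0, 7.0, 1.0, 2.0, 3.0]
null_genes = [
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    [3.0, 1.0, 2.0, 6.0, 5.0, 4.0],
    [2.0, 5.0, 1.0, 6.0, 3.0, 4.0],
    [6.0, 1.0, 4.0, 2.0, 5.0, 3.0],
]

n_open, n_total = sum(is_open), len(is_open)
center = n_open * (n_total + 1) / 2  # expected rank sum with no association
w_obs = rank_sum(gene, is_open)
w_null = [rank_sum(g, is_open) for g in null_genes]
p_emp = empirical_p(w_obs, w_null, center)
```

A realistic null would contain thousands of statistics, so the smallest attainable P value would be correspondingly smaller than in this toy case.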
We next calculated the proportion of posterior probability of association transmitted from DHS cluster d to gene g, βd,g, from the chi-squared test statistics corresponding to the empirical correlation P values for DHS cluster d and each gene gi. From this we computed the total posterior transmitted from DHS cluster d to gene g as $\rho_d \beta_{d,g}$.
For each gene, we then sum over all DHS clusters D to obtain the overall posterior probability of association: $\gamma_g = \sum_{d \in D} \rho_d \beta_{d,g}$.
In practice, if Pd,g > 0.25 we set βd,g to zero to control noise from small values (Supplementary Figure 13).
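The final aggregation step can be sketched as follows. The sketch assumes that the proportion βd,g is each gene's share of the summed chi-squared statistics for a DHS cluster, and that the P value cut-off is applied after normalization; both are assumptions of this example rather than details confirmed by the text. Gene names, statistics and P values are hypothetical.

```python
def gene_posteriors(rho_d, chi2, pvals, p_cutoff=0.25):
    """Per-gene posterior: sum over DHS clusters of rho_d * beta_{d,g}."""
    gamma = {}
    for d, rho in rho_d.items():
        total = sum(chi2[d].values())
        for g, x in chi2[d].items():
            # Assumed form: beta is the gene's share of the chi-squared total.
            beta = x / total if total > 0 else 0.0
            if pvals[d][g] > p_cutoff:
                beta = 0.0  # weak correlations transmit no posterior
            gamma[g] = gamma.get(g, 0.0) + rho * beta
    return gamma


# Toy locus: two DHS clusters and two nearby genes.
rho_d = {"dhs1": 0.6, "dhs2": 0.1}
chi2 = {
    "dhs1": {"GENE_A": 8.0, "GENE_B": 2.0},
    "dhs2": {"GENE_A": 1.0, "GENE_B": 3.0},
}
pvals = {
    "dhs1": {"GENE_A": 0.005, "GENE_B": 0.16},
    "dhs2": {"GENE_A": 0.32, "GENE_B": 0.08},
}

gamma = gene_posteriors(rho_d, chi2, pvals)
ranked = sorted(gamma, key=gamma.get, reverse=True)
```

Ranking genes by the resulting values gives the per-locus prioritization; in this toy locus GENE_A receives 0.48 and GENE_B 0.195.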
Enrichment of allele-specific accessibility, tissue specificity and functional class for DHS clusters
From Maurano et al.22 we obtained a list of 64,597/362,284 SNPs across the genome associated with allele-specific DHS accessibility in heterozygous individuals at 5% FDR. For each disease, we calculated whether credible interval SNPs overlapping DHS are likelier to show allelic imbalance than expected by chance using Fisher’s exact test. We also calculated this for all CI SNPs from all diseases as a joint set. We found this enrichment to be consistent across minor allele frequency bins (Supplementary Figure 14).
We used Fisher’s exact test to determine if DHS clusters harboring credible interval SNPs are preferentially active in each tissue. For each tissue, we compare the proportion of active DHS clusters to the genome-wide expectation of active DHS clusters in that tissue. We used the same process to determine enrichment for functional categories defined by ChromHMM, and identified genomic functions of DHS clusters through overlapping them with annotated ChromHMM regions (Supplementary Figure 5).
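The enrichment tests in this section can be illustrated with a small self-contained implementation of Fisher's exact test on a 2 × 2 table. The one-sided (enrichment) alternative and the layout of the table are assumptions of this example, and the counts are hypothetical.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact P value for the table [[a, b], [c, d]].

    Rows: DHS clusters harboring CI SNPs / all other DHS clusters.
    Columns: active in the tissue / inactive in the tissue.
    Returns the probability of observing at least `a` in the top-left cell
    under the hypergeometric null with fixed margins.
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    tail = sum(comb(row1, k) * comb(n - row1, col1 - k) for k in range(a, upper + 1))
    return tail / comb(n, col1)


# Toy table: 3 of 4 risk-harboring clusters are active, versus 1 of 4 others.
p_enrichment = fisher_exact_greater(3, 1, 1, 3)
```

The same function applies to the allelic imbalance comparison by replacing the columns with imbalanced and balanced SNP counts.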
Acknowledgements
C.C. and P.S. were partly supported by a shared research agreement with Biogen, Inc., which had no role in designing or interpreting this study. We are grateful to the International Multiple Sclerosis Genetics Consortium and the International Inflammatory Bowel Disease Genetics Consortium, and specifically to Mark Daly and Stephan Ripke, for access to GWAS summary statistics from their respective meta-analyses.