Abstract
The vast majority of genetic variants associated with complex human traits map to non-coding regions, but little is understood about how they modulate gene regulation in health and disease. Here, we analyzed Assay for Transposase-Accessible Chromatin (ATAC-seq) profiles from activated primary CD4+ T cells of 105 healthy donors to identify ATAC-QTLs: genetic variants that affect chromatin accessibility. We found that ATAC-QTLs are widespread, disrupt binding sites for transcription factors known to be important for CD4+ T cell differentiation and activation, overlap and mediate expression QTLs from the same cells, and are enriched for SNPs associated with autoimmune diseases. We also identified numerous pairs of ATAC-peaks with highly correlated chromatin accessibility. When we characterized 3D chromosome organization in primary CD4+ T cells by in situ Hi-C, we found that correlated peaks tend to reside in the same chromatin contact domains, span super-enhancers, and are more impacted by ATAC-QTLs than single peaks. Thus, variability in chromatin accessibility in primary CD4+ T cells is heritable, is determined by genetic variation in a manner affected by the 3D organization of the genome, and mediates genetic effects on gene expression. Our results provide insights into how genetic variants modulate chromatin state and gene expression in primary immune cells that play a key role in many human diseases.
The vast majority of disease-associated loci identified through genome-wide association studies (GWAS) (1-3) are located in non-coding regions of the genome, often distant from the nearest gene (4), suggesting that abnormalities in transcriptional regulation are a key driver of human disease. Quantitative trait loci (QTL) studies that map genetic variants associated with molecular traits provide a framework for interpreting the regulatory and functional role of disease-associated loci. For example, thousands of non-coding variants associated with gene expression (expression QTLs – eQTLs), a significant number of which overlap GWAS loci, have been identified in diverse cell types and tissues (5), including in resting (6-8) and in stimulated (9, 10) immune cells. However, because of linkage disequilibrium and the complex regulation of gene expression, it remains difficult to pinpoint the causal genetic variant and to determine the mechanistic basis for most eQTLs.
Genetic analysis of chromatin organization (11-15) provides a powerful complementary approach for identifying genetic variants that affect transcriptional regulation in cis-regulatory regions (16). In lymphoblastoid cell lines, many genetic variants have been associated with variability in DNase I hypersensitivity (measured by DNase-seq) (17) or histone tail modifications (measured by ChIP-seq) (18-20). However, both DNase-seq and ChIP-seq are laborious and require large numbers of cells, thus limiting genetic studies using these techniques to cell lines. The recent development of Assay for Transposase-Accessible Chromatin (ATAC-seq), a simple yet efficient two-step protocol (21), has opened the way to profiling of chromatin accessibility with small numbers of disease-relevant primary cells isolated from a large human cohort.
Here, we studied the genetic determinants of chromatin accessibility by performing ATAC-seq on primary CD4+ T cells isolated from 105 healthy donors of European descent in the ImmVar Consortium (10) (Fig. 1A). To this end, we developed an optimized ATAC-seq protocol (SOM) that achieved high technical and biological reproducibility (fig. S2), highly complex libraries (on average 84% usable nuclear reads, as opposed to 40% prior to optimization) (fig. S1) and low mitochondrial DNA (mtDNA) contamination (on average < 3%, as opposed to 53% prior to optimization).
We first assayed CD4+ T cells (T helper, Th) in two different states: either unstimulated, or activated in vitro using anti-CD3 and anti-CD28 antibodies for 48 hours (Fig. 1A) to assess global changes in chromatin accessibility after in vitro activation. Comparing pooled reads across six samples (five donors, one of which is replicated) in each state (stimulated or unstimulated), we observed a global increase in chromatin accessibility after activation, detecting 36,486 chromatin accessibility peaks (ATAC-peaks) in unstimulated cells and 52,154 ATAC-peaks in activated cells. Of the 63,763 ATAC-peaks identified in at least one of the two states, 27,446 were shared across states, 28,017 were more accessible in activated cells (FDR, q < 0.05), but only 8,298 ATAC-peaks were more accessible in unstimulated cells (FDR, q < 0.05) (Fig. 1B and table S1). Activation-specific ATAC-peaks were enriched at enhancers associated with conventional T helper cells (Tconv, a class that includes Th1, and Th17 cells) (16) and depleted at enhancers associated with regulatory T cells (Treg) and naïve Th cells. They were also enriched in stimulated Th cells and Th0 cells, consistent with our stimulation protocol (Fig. 1C).
Similarly, activation-specific ATAC-peaks were enriched for motifs binding transcription factors (TFs) that are important for Th cell activation, including BATF, IRF and AP1 (22-24) (Fig. 1D). In particular, 1,594 of the 9,724 activation-specific ATAC-peaks located in noncoding regions previously unannotated by H3K27Ac overlapped BATF motifs (16%) (fig. S3). TF footprints derived from normalized ATAC-seq reads that span known TF binding sites from ChIP-seq data (25) showed increased accessibility of BATF, ISRE and BATF/IRF motifs in stimulated cells, consistent with the known role of these factors in Th cell development and activation (22-24) (Fig. 1E). In contrast, CTCF and ETS binding motifs were enriched at shared ATAC-peaks showing little change in accessibility following activation (Fig. 1D and 1E). This is consistent with the known role of ETS family transcription factors as pioneer factors that establish the chromatin landscape for T cells (26-28). Finally, single nucleotide polymorphisms (SNPs) associated with several autoimmune diseases (most notably inflammatory bowel disease (IBD)) were more enriched in activation-specific and shared ATAC-peaks than in baseline-specific peaks (Fig. 1F). These results demonstrate the power of ATAC-seq to identify previously unannotated cis-regulatory elements and generate high-resolution TF footprints, providing insights into the regulation of Th cell function and its role in underlying disease mechanisms.
Next, we explored higher-order relationships in our ATAC-seq data using the 105 profiles from activated T cells. We found 1,762 pairs of ATAC-peaks with highly correlated accessibility (linear regression, FDR < 0.05, table S2, fig. S4), corresponding to 851 (1.6%) distinct ATAC-peaks of the 52,154 total ATAC-peaks we observed in activated T cells. On average, correlated peaks were located 313 kb apart, with the closest pair located only 668 bp apart. The large average distance between correlated peaks suggests that the correlation is unlikely to be the result of local biases in sequencing depth (fig. S5).
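To make the notion of correlated accessibility concrete, the following minimal Python sketch computes the Pearson correlation between the accessibility of two peaks across donors. The function name `pearson_r` and the example values are hypothetical; the analysis reported here used linear regression with FDR control (SOM) and may have included normalization and covariates that are not shown.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sums of cross-products and squared deviations from the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical normalized accessibility of two peaks in five donors
# (illustrative values only, not data from this study).
peak_a = [1.0, 2.0, 3.0, 4.0, 5.0]
peak_b = [2.1, 3.9, 6.2, 8.0, 9.8]
r = pearson_r(peak_a, peak_b)
```

For a simple regression of one peak on the other, the squared Pearson correlation equals the fraction of variance explained, so the two views of a peak pair are equivalent in the absence of covariates.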
Next, we characterized the genetic basis of chromatin accessibility in activated T cells by comparing the variability in ATAC-peaks across the 105 individuals with the imputed genotypes of ∼10 million SNPs in those individuals (SOM). We found 1,790 ATAC-peaks associated with at least one significant local (+/- 20 kb) SNP (RASQUAL (29), P < 3.02×10⁻⁴, permutation FDR < 0.1) (Fig. 2A, table S3). We term each such associated SNP an ATAC quantitative trait locus (ATAC-QTL) and the corresponding peak an ATAC-QTL-peak. Of the 1,790 ATAC-QTL-peaks, 599 were significantly heritable (GCTA FDR < 0.1) with an average heritability h² = 60%. In 580 (97%) of those cases, 36% of the heritability was predicted by the best lead SNP (Fig. 2B, SOM, table S9), suggesting that the heritable variability of ATAC-QTL-peaks is largely determined by a single SNP. There were also 6,154 ATAC-QTL-peaks (RASQUAL, P < 3.02×10⁻⁴, permutation FDR < 0.1) with distal associations to SNPs located between +/- 20 kb and +/- 500 kb away, but only 2,634 ATAC-QTL-peaks (linear regression, P < 6.46×10⁻⁵, permutation FDR < 0.05) with distal associations to SNPs located over 500 kb away. This is consistent with previous observations of limited distal associations to chromatin accessibility traits estimated using DNase I hypersensitivity (17, 18).
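The three distance classes used above (local, distal within 500 kb, and distal beyond 500 kb) can be illustrated with a short Python sketch. This is an assumption-laden illustration rather than the pipeline used here: the function names are hypothetical, and distance is measured from the nearest peak boundary, whereas the actual windows may have been defined differently (SOM).

```python
def distance_to_peak(peak_start, peak_end, snp_pos):
    """Distance in bp from a SNP to the nearest peak boundary (0 if inside the peak)."""
    if snp_pos < peak_start:
        return peak_start - snp_pos
    if snp_pos > peak_end:
        return snp_pos - peak_end
    return 0

def classify_snp(peak_start, peak_end, snp_pos,
                 local_flank=20_000, distal_flank=500_000):
    """Assign a same-chromosome SNP to one of the three distance classes."""
    dist = distance_to_peak(peak_start, peak_end, snp_pos)
    if dist <= local_flank:
        return "local"
    if dist <= distal_flank:
        return "distal_within_500kb"
    return "distal_beyond_500kb"

# Hypothetical peak at 1,000,000-1,000,500 bp and three SNPs on the same chromosome.
labels = [classify_snp(1_000_000, 1_000_500, pos)
          for pos in (1_000_200, 1_030_000, 1_600_000)]
```

Each SNP-peak pair in a class would then be tested for association; the association test itself (RASQUAL or linear regression) is not reproduced in this sketch.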
Several lines of evidence support a model where local ATAC-QTLs disrupt cis-regulatory functions in Th cell enhancers. First, of the 1,790 local ATAC-QTL-peaks, 33% (589) of the lead associated SNPs were located within 2 kb of the ATAC-peak and 18% (327) were located within the ATAC-peak width proper (Fig. 2C), suggesting that the direct disruption of cis-regulatory elements may be an important determinant of the observed variation in accessibility in those cases. Second, local ATAC-QTL-peaks were more enriched near transcription start sites (TSS) than transcription termination sites (TTS) of the closest gene, supporting a transcriptional role in cis (Fig. 2D). Third, 77% of local ATAC-QTL-peaks were in intronic or intergenic regions (fig. S6A); of these, 70% lie in regions that were previously identified as enhancers for different Th cell subtypes (fig. S6B). Fourth, ATAC-QTL-peaks were enriched for motifs bound by TFs involved in T cell development and activation (e.g., BATF, AP1 and IRF, Fig. 2E) as compared to all ATAC-peaks detected in activated cells as background. In fact, 57% of ATAC-QTL-peaks contained either a BATF or an ETS1 motif (a 1.2-fold enrichment compared to all ATAC-peaks in activated cells, P < 1.79×10⁻¹⁹, hypergeometric test; with all peaks as background) and 11% contained both (a 1.3-fold enrichment, P < 4.02×10⁻⁵, hypergeometric test). Furthermore, ATAC-QTL-peaks overlapping BATF, ETS1 and CTCF binding sites showed differential accessibility between genotypes at single nucleotide resolution, with the core motif exhibiting the most striking difference in accessibility (Fig. 2F and fig. S7). This effect extended 1 kb outside the core binding motif, demonstrating that genetic variants may not only cause differential transcription factor binding, but also give rise to long-range effects on chromatin accessibility (Fig. 2F and fig. S8).
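The motif enrichments above are reported as a fold enrichment with a hypergeometric P value. A minimal Python sketch of that calculation follows, assuming a one-sided (upper-tail) test against all peaks as background; the function name and the toy counts are hypothetical and are not the counts from this study.

```python
from math import comb

def hypergeom_enrichment(n_background, n_background_hits, n_foreground, n_foreground_hits):
    """
    Fold enrichment and upper-tail hypergeometric P value.

    n_background       total peaks (e.g., all ATAC-peaks in activated cells)
    n_background_hits  background peaks containing the motif
    n_foreground       peaks in the tested set (e.g., ATAC-QTL-peaks)
    n_foreground_hits  tested peaks containing the motif
    """
    N, K, n, k = n_background, n_background_hits, n_foreground, n_foreground_hits
    if not (0 < K <= N and 0 < n <= N and 0 <= k <= min(K, n)):
        raise ValueError("inconsistent counts")
    # Ratio of motif frequency in the tested set to that in the background.
    fold = (k / n) / (K / N)
    # P(X >= k) when drawing n peaks without replacement from N, K of which have the motif.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    p_value = tail / comb(N, n)
    return fold, p_value

# Toy example (hypothetical counts): 10 peaks, 4 with the motif, 3 tested, 2 of them with the motif.
fold, p_value = hypergeom_enrichment(10, 4, 3, 2)
```

Exact integer arithmetic keeps the tail sum free of rounding error until the final division, which matters when P values are as small as those reported here.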
Similarly, the effect size of ATAC-QTLs is correlated with SNP motif disruption scores obtained by deltaSVM (30), an unbiased analysis to discover de novo cis-regulatory elements in ATAC-peaks (Fig. 2G, SOM). Most (61%) of the ATAC-QTL lead SNPs strongly disrupted one of 11 predicted TF binding sites (TFBSs) (Fig. 2H). These 11 predicted TFBSs, which arose automatically from this analysis, included many known to act in T cell activation, such as BATF, IRF, NFκB and ATF.
Local ATAC-QTLs were also enriched for GWAS SNPs from autoimmune diseases (Fig. 2I), providing a functional context for interpreting these disease associations. For example, rs17293632, an ATAC-QTL SNP, has also been associated with Crohn’s disease and IBD in GWAS. This SNP is located in the first intron of SMAD3, a TF gene involved in the TGF-β signaling pathway that regulates T cell activation and metabolism (33). This SNP disrupts a consensus BATF binding site at a conserved position (deltaSVM = -12.72), and results in decreased chromatin accessibility in individuals who carry the alternate allele (Fig. 2J). Rs17293632 was also identified as a causal variant by a statistical fine-mapping algorithm that leveraged patterns of linkage disequilibrium but is agnostic to the functional genomic data in this region (16). This suggests that rs17293632 may increase susceptibility to Crohn’s disease and IBD by disrupting BATF binding at the SMAD3 locus.
As noted above, we found substantial correlation in accessibility between 1,762 pairs of ATAC-peaks. We next tested the hypothesis that correlated ATAC-peaks could be simultaneously modulated by a single SNP, thereby allowing a single variant to influence multiple cis-regulatory regions. Specifically, we found that both local and moderately distal ATAC-QTLs (< 1 Mb) are more strongly associated with correlated peaks than with single peaks (defined as any ATAC-peak that is not part of a correlated peak pair; Fig. 3A, fig. S8). Furthermore, correlated peaks that were associated with ATAC-QTLs were more strongly correlated with each other than correlated peaks that were not associated with an ATAC-QTL (Fig. 3B and table S2). This suggests that genetic variants could impart stronger co-regulation of cis-regulatory regions than epigenetic mechanisms. For example, rs10815868, an ATAC-QTL associated with a pair of correlated peaks, resides in the 18th intron of PTPRD, a tumor suppressor gene, where it disrupts a consensus BATF binding site at a conserved position. This ATAC-QTL is associated with decreased chromatin accessibility in a 4 kb region that contains four highly correlated ATAC-peaks (Fig. 3C, yellow box). Notably, an adjacent peak located 5 kb upstream of this region was not affected by this variant (Fig. 3C, grey box), suggesting that the genetic control of multiple ATAC-peaks was limited to a defined regulatory region.
To test whether the relationship between correlated peaks could be influenced by the 3D conformation of the Th cell chromatin, we performed loop-resolution in situ Hi-C (34) in primary CD4+ T cells activated for 48 hours. We obtained Hi-C maps at 50 kb resolution and identified 4,614 contact domains (intervals of chromatin in which all loci exhibit an enhanced frequency of contact with one another) and 4,419 chromatin loops (pairs of loci that are in closer physical proximity to one another than to neighboring sequences on the chromosome, such as the interaction between a promoter and its distal enhancer) (fig. S9 and table S4). ATAC-peaks overlapping BATF and ETS1 motifs were enriched within Hi-C contact domains (Fig. 3D), whereas those overlapping CTCF motifs were enriched at the contact domain boundaries (Fig. 3D and fig. S10A). This is consistent with previous reports of CTCF enrichment at loop anchors and at contact domain boundaries (34-37).
There was a clear relationship between the 3D structure of chromatin, the correlated peaks, and the associated ATAC-QTLs. Specifically, we found that correlated peaks were more enriched when both peaks were in the same contact domain than when (i) both peaks lay outside of a contact domain, (ii) only one of the two peaks was in a contact domain, or (iii) the two peaks spanned a contact domain (Fig. 3E, fig. S10B,C and table S4; considering all domains, SOM). Moreover, correlated ATAC-QTL-peaks exhibited stronger genetic associations if both peaks resided in the same contact domain than if they did not (Fig. 3F). This suggests that 3D chromatin structure may explain the ability of a single genetic variant to influence multiple correlated ATAC-peaks.
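The comparison above depends on assigning each peak pair to a category according to its position relative to contact domains. The Python sketch below shows one way to do this under simplifying assumptions that are ours and not taken from the SOM: peaks are represented by a single coordinate, domains are non-overlapping half-open intervals on one chromosome, and a pair whose peaks fall in two different domains is treated as a separate class. All names are hypothetical.

```python
def domain_index(pos, domains):
    """Return the index of the contact domain containing pos, or None if outside all domains."""
    for i, (start, end) in enumerate(domains):
        if start <= pos < end:
            return i
    return None

def classify_pair(pos_a, pos_b, domains):
    """Categorize a peak pair by where its two peaks fall relative to contact domains."""
    dom_a = domain_index(pos_a, domains)
    dom_b = domain_index(pos_b, domains)
    if dom_a is None and dom_b is None:
        return "both_outside"
    if dom_a is None or dom_b is None:
        return "one_inside"
    if dom_a == dom_b:
        return "same_domain"
    return "different_domains"

# Hypothetical domains (bp coordinates) and peak pairs, for illustration only.
domains = [(0, 100_000), (200_000, 300_000)]
categories = [
    classify_pair(10_000, 50_000, domains),    # both peaks in the first domain
    classify_pair(10_000, 250_000, domains),   # peaks in two different domains
    classify_pair(10_000, 150_000, domains),   # second peak lies between domains
    classify_pair(150_000, 350_000, domains),  # neither peak in a domain
]
```

Counting correlated and uncorrelated pairs in each category would then give the enrichment per category. Nested or overlapping domains, which Hi-C analyses can produce, would need a different lookup than the linear scan used here.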
Furthermore, correlated peaks, whether or not they were associated with an ATAC-QTL, were enriched in super-enhancer regions from CD4+ Th cells (38) (Fig. 3G, table S2 and SOM). Super-enhancers, also called stretch enhancers, are defined as large clusters of contiguous enhancers (38, 39). Correlated peaks with an associated ATAC-QTL were more enriched for super-enhancers if they resided in the same contact domain (1.6-fold) (Fig. 3G). For example, rs2732588 is an ATAC-QTL associated with three correlated peaks. Individuals who are homozygous alternate at the variant [G→A] tend to exhibit decreased chromatin accessibility in a large 100 kb region containing multiple correlated ATAC-peaks (Fig. 3H). The affected region partially overlaps a previously identified super-enhancer in CD4+ T cells (38). This super-enhancer overlaps with both coding and intronic regions of KANSL1, a chromatin regulator that is part of the nonspecific lethal (NSL) complex controlling expression of constitutively expressed genes (40, 41) (Fig. 3H). The region containing the super-enhancer and these correlated peaks was mostly contained inside a Hi-C contact domain, although it also extended immediately outside of the Hi-C contact domain boundary (Fig. 3H). Together, these data support a model in which the effects of a genetic variant on the accessibility of nearby loci are influenced by 3D proximity and are stronger when both the variant and the affected peaks lie in the same contact domain.
To further understand the consequences of ATAC-QTLs on gene regulation, we measured RNA-seq profiles from activated CD4+ T cells from 96 donors (93 with matching ATAC-seq data), and mapped expression QTLs (eQTLs), genetic variants that affect gene expression (Fig. 4). We identified 816 genes with at least one significant local eQTL (+/- 500 kb centered around the gene) (RASQUAL, P < 4.12×10⁻⁵, permutation FDR < 0.05) (Fig. 4C, fig. S11, table S5), termed eQTL-genes. The 816 eQTL-genes were enriched in Hi-C contact domains compared to all genes (437 genes intersecting contact domains, 6.6-fold enrichment, P < 3.44×10⁻²⁴⁷, hypergeometric test), suggesting that eQTL-genes are in regions of 3D chromatin interaction in T cells.
We next assessed the mediation of genetic effects on gene expression through chromatin accessibility. Seventy-one genetic variants, corresponding to 103 unique genes, were simultaneously the lead SNP for an eQTL and an ATAC-QTL with correlated effect sizes (Pearson rho = 0.52, Fig. 4A, table S6). Because in-sample overlap of eQTLs and ATAC-QTLs has limited power to detect shared genetic variants, we used a feature selection prediction framework (43, 44) to detect associations between imputed ATAC-peaks and gene expression. For 1,790 significant ATAC-QTL-peaks, we identified 104 genes significantly associated with predicted local (+/- 500 kb) ATAC-peak accessibility in 96 individuals (FDR < 0.05, 10 permutations, Fig. 4B,C, table S7), of which 67 were eQTL-genes. These results demonstrate that the inherent hierarchy by which genetic variants modulate chromatin state and gene expression enables the use of imputed chromatin accessibility to simultaneously identify genetically controlled genes and annotate the cis-regulatory elements mediating the genetic effect.
For example, rs174556, an ATAC-QTL associated with a pair of correlated peaks, resides in a 25 kb region between two Hi-C contact domains, where the alternative allele disrupts a CTCF binding site (Fig. 4D). Rs174556 is linked (D’ = 1, R² = 0.79) with rs102275, a variant previously associated with Crohn’s disease (45). The associated correlated peaks span the promoters of FADS1 and FADS2, two fatty acid desaturase (FADS) genes. Rs174556 is also identified as an eQTL for both FADS1 and FADS2 in our T cell dataset (Fig. 4C). FADS1 and FADS2 regulate inflammation and promote cancer development by regulating the metabolism of arachidonic acid and its derivative prostaglandin E2, and FADS2 knockout mice develop dermal and intestinal ulcerations (46-49). Given the well-known role of CTCF in maintaining the integrity of chromatin domain boundaries and insulation of transcriptional activities, abolishing CTCF binding may abolish the insulation, opening chromatin and causing increased expression of both target genes. These results suggest that variability in chromatin accessibility may underlie variability in gene expression and may increase disease risk.
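The linkage statistics quoted for rs174556 and rs102275 (D’ and R²) follow the standard definitions for two biallelic sites. The Python sketch below computes both from haplotype frequencies; the function name and the example frequencies are hypothetical and are not the frequencies of these two variants.

```python
def ld_stats(p_AB, p_Ab, p_aB, p_ab):
    """Return (D', r^2) for two biallelic sites from their four haplotype frequencies."""
    if abs(p_AB + p_Ab + p_aB + p_ab - 1.0) > 1e-9:
        raise ValueError("haplotype frequencies must sum to 1")
    p_A = p_AB + p_Ab  # frequency of allele A at the first site
    p_B = p_AB + p_aB  # frequency of allele B at the second site
    d = p_AB - p_A * p_B  # raw disequilibrium coefficient D
    # Largest magnitude D can reach given the allele frequencies.
    if d >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    r2 = d * d / denom if denom > 0 else 0.0
    return d_prime, r2

# Hypothetical haplotype frequencies in which one haplotype (aB) is never observed.
d_prime, r2 = ld_stats(0.4, 0.1, 0.0, 0.5)
```

In this example D’ is 1 while r² is about 0.67. A pair of variants can therefore show complete linkage by D’, because at least one of the four haplotypes is absent, and still have r² below 1 when their allele frequencies differ.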
In conclusion, we integrated genetic variation and ATAC-seq data from primary activated CD4+ T cells from 105 healthy donors to identify and characterize genetic variants that contribute to variability in chromatin accessibility. We found widespread genetic control of chromatin accessibility, some of which affected multiple ATAC-peaks in a manner consistent with the 3D organization of the genome. Integrating genotyping, ATAC-seq and RNA-seq data provided causal anchors for predicting and explaining the variability in molecular traits in a manner consistent with known modes of transcriptional regulation. We did not find significant distal effects, consistent with reports that measured chromatin state by DNase-seq (17), but unlike studies that measured chromatin state using ChIP-seq (18-20). Predicting variability in gene expression between individuals based on chromatin state is significantly impacted by technical and biological variability in the assays, but is helped by leveraging genetic variation as causal anchors. It is possible that our ability to detect weaker interactions and predict gene expression could significantly improve with increased sample sizes and deeper sequencing.
Our findings, derived from large-scale mapping of epigenetic quantitative traits in primary human cells implicated in many diseases, provide a molecular framework for the interpretation of disease-causing variants, focused on modeling how genetic variants could alter local chromatin structure to modulate gene expression. Future studies that use other disease-relevant primary cells and tissues will help pinpoint causal disease variants and clarify the regulatory mechanisms underlying common disease.
Acknowledgements
We thank the ImmVar participants. We would like to thank Jason Buenrostro for critical reading of the manuscript and advice on ATAC-seq analysis, Jenna Pfiffner and Charles Fulco for initial experimental help with ATAC-seq, Alicia Schep for the ATAC-seq nucleosome-free caller, Natasha Asinovski and Ho-keun Kwon for help setting up primary T cell cultures, and members of the Regev laboratory for discussions. M.B. and K.L.H. are supported by NIH HG007348 to M.B., H.Y.C. is supported by NIH grant P50-HG007735, and C.S.C. is supported by the NIH through a Ruth L. Kirschstein National Research Service Award (F32-DK096822). This work was supported by the Klarman Cell Observatory at the Broad Institute. A.R. is a Howard Hughes Medical Institute Investigator. Raw data are deposited in the Gene Expression Omnibus with accession no. GSE86888.