Abstract
Neutrophils play fundamental roles in the innate inflammatory response, shape adaptive immunity1, and have been identified as a potentially causal cell type underpinning genetic associations with immune system traits and diseases2,3. The majority of these variants are non-coding and the underlying mechanisms are not fully understood. Here, we profiled the binding of one of the principal myeloid transcriptional regulators, PU.1, in primary neutrophils across nearly a hundred volunteers, and elucidated the coordinated genetic effects of PU.1 binding variation, local chromatin state, promoter-enhancer interactions and gene expression. We show that PU.1 binding and the associated chain of molecular changes underlie genetically-driven differences in cell count and autoimmune disease susceptibility. Our results advance interpretation for genetic loci associated with neutrophil biology and immune disease.
Results
Non-coding DNA sequence variation affects chromatin state and gene expression within human populations, and accounts for the majority of complex genetic traits and disease associations4–7. The commonly accepted model of genetic control of transcriptional activity postulates that genetic variation modifies the DNA recognition sequences of specific transcription factors (TFs), thus altering their ability to bind to DNA at a specific locus8–15. PU.1 (encoded by Spi1) is a key TF regulating myeloid development16–18, and its deficiency has profound effects on neutrophil maturation and function19,20. To study genetically determined variation in PU.1 recruitment to DNA, we used chromatin immunoprecipitation sequencing (ChIP-seq) to profile PU.1 genome wide binding in human primary neutrophils (CD16+ CD66b+) isolated from 93 donors of the BLUEPRINT project. The same donors were previously characterised at genome-wide DNA sequence and multi-level regulatory annotation7 (Figure 1a). We identified 36,530 TF-binding peaks across the 93 individuals (Online Methods, Supplementary Figure 1, Supplementary Table 1) and used normalised read counts at peak regions to determine transcription factor quantitative trait loci (tfQTLs; Online Methods). We detected 1,868 independent (linkage disequilibrium [LD] r2≥0.8) PU.1 binding QTLs at a False Discovery Rate [FDR] <0.05 (Supplementary Table 2). Lead PU.1 tfQTL SNPs showed a bimodal distribution of distances to their respective differential binding peaks (Figure 1b, Supplementary Table 3), with just over half of them (55%, 1,036/1,868) mapping proximally from the peak edge (<2.5kb; median distance 264bp), and the remaining SNPs (45%; 995/1,868) localising more distally (2.5Kb-1Mb, median distance 23Kb)21,22. As shown for other cell types22, tfQTL effect sizes were stronger for proximal compared to distal variants (t-test p=2.2×10−16, Figure 1c). 
We further validated a subset of the detected tfQTLs using allele-specific association analysis23 (Online Methods), which confirmed a significant allelic imbalance for the majority of the tested peaks (98.8% and 95.5% for peaks associated with proximal and distal variants respectively; Figure 1d).
Binding of pioneer TFs to DNA alters local nucleosome positioning, thus allowing recruitment of activating co-factors24. However, DNA recognition sequence alone is not sufficient to establish occupancy, and secondary collaborating factors are required to maintain affinity25. C/EBPβ is upregulated throughout neutrophil terminal differentiation18 and has been shown to co-occupy myeloid enhancers at thousands of PU.1 bound sites26,27. The constitutively expressed CTCF is known to play a role in gene regulation by anchoring chromatin interactions28, but is not known to functionally associate with PU.1. We assayed these two additional TFs in neutrophils from a subset of overlapping individuals (n=22 donors with QC-pass assays for C/EBPβ, n=30 for CTCF), and identified 18,862 C/EBPβ and 22,197 CTCF filtered peaks from the combined datasets. We performed QTL analysis as before, and prioritised 427 C/EBPβ and 769 CTCF putative tfQTLs reaching a nominal p-value threshold (p≤1×10−5; Supplementary Table 2). We found that C/EBPβ tfQTLs effect sizes decreased with increasing distance from PU.1 tfQTLs (Figure 2a-b), reflecting cooperative binding of PU.1 and C/EBPβ at myeloid enhancers26,29. Interestingly, CTCF tfQTLs displaying a shared genetic effect with PU.1 predominantly involved CTCF-bound regions located distally to the PU.1 tfQTL lead SNP, suggesting that PU.1 QTL genetic effects may be in part mediated by the 3D chromosomal architecture (Figure 2c).
Transcription factor occupancy has been shown to act predominantly through cis regulatory SNPs, where coordination of cis-acting variants has been shown to decay with increasing physical distance of SNPs from bound regions30. To assess the potential sharing of our tfQTLs across cell types, we additionally generated PU.1 binding maps in primary monocytes (CD14+CD16−) isolated from ten BLUEPRINT donors7, five of which overlap with the neutrophil PU.1 dataset. Of the neutrophil PU.1 peaks implicating a tfQTL, 93% were also observed in monocytes (Supplementary Figure 2a). The low number of donors tested did not allow us to carry out a tfQTL analysis in monocytes. To assess coordination of genetic effects at PU.1 binding sites across cell types, we therefore assessed the strength of binding at monocyte peaks for individuals stratified by PU.1 tfQTL lead variant genotype. We found the monocytes displayed consistent direction and strength of binding at proximal SNPs (linear regression p=3×10−9) compared to neutrophils (p=2×10−13), compatible with shared genetic effects between the two cell types. However, the same was not true for distal SNPs (neutrophils p=4×10−7, monocytes p=0.793; Figure 2d), which may be driven by more complex and cell-type specific long-distance chromatin contacts.
To explore coordination of genetic influences on PU.1 binding and local chromatin state, we initially took advantage of the previously published histone associated QTL (hQTL) data for the enhancer-associated histone marks H3K4me1 and H3K27ac in neutrophils7. In total, 808 H3K4me1 and 946 H3K27ac lead hQTL SNPs overlapped (r2≥0.8) PU.1 tfQTLs. We next generated binding profiles of the active promoter-associated histone mark H3K4me3 and Polycomb-associated repressive mark H3K27me3 in neutrophils (n=110 and n=109 donors, respectively) identifying 621 and 367 shared tfQTL/hQTLs, respectively (Supplementary Table 2). Using the pi1 statistic31, we found evidence of sharing between PU.1 tfQTLs with hQTLs in both neutrophils and monocytes (pi1H3K27ac=0.73-0.76, and pi1H3K4me1=0.76-0.80). Sharing between neutrophil PU.1 tfQTLs and hQTLs detected in CD4 naïve T cells was lower (pi1H3K27ac=0.36-0.72, and pi1H3K4me1=0.30-0.79; Supplementary Figure 2c), compatible with PU.1 not being expressed in the latter (Supplementary Figure 2c)32. Further, H3K27ac marked regions7 co-occupied by PU.1 and C/EBPβ displayed greater hQTL effect sizes compared to peaks bound by PU.1 alone (t-test p=1.34×10−6; Figure 2e), suggesting stronger genetic effects for enhancers at co-occupied sites in neutrophils29. Consistent with this, cell type-specific binding of PU.1 and C/EBPβ correlated with cell type-specific chromatin activity (Supplementary Figure 3a-b). H3K27ac and H3K4me1 hQTLs intersecting proximal neutrophil-specific PU.1 tfQTLs had significantly lower effect sizes in monocytes compared to cell-shared sites (Figure 2f), consistent with a neutrophil-specific role of PU.1 in activating chromatin state in these regions.
We next assessed the distance between the PU.1 and histone mark peaks for each shared tfQTL-hQTL genetic association. As previously observed21, there was a pronounced bimodal distribution of distances between PU.1 binding peaks and the locations of H3K27ac and H3K4me3 marks (Figure 3a), with around half of PU.1 peaks localizing to less than 1kb away from the respective H3K27ac and H3K4me3 peaks, and others mapping 10-100kb from them. Given that H3K4me3 is associated with active promoters, this observation highlights the potential long-range regulatory effects of PU.1 binding to distal DNA elements on promoter activity, which are commonly mediated by three-dimensional DNA looping interactions33. To investigate the role of PU.1 long distance regulation, we generated Promoter Capture Hi-C (PCHi-C) profiles in neutrophils and monocytes isolated from three donors each and integrated these data with previously published PCHi-C data for these cell types in three more individuals34 (Supplementary Figure 4a-b). We detected ~190,000 Promoter Interacting Regions (PIRs) in total across neutrophils and monocytes (CHiCAGO score > 5)35, ~82,000 of which were detectable in each of the cell types (Supplementary Figure 4c). PIRs enriched in PU.1 binding and enhancer-associated H3K4me1/H3K27ac marks were correlated with the level of expression of the genes they contacted (Figure 3b), as previously shown in the context of other cell types36–38. In contrast, CTCF binding at PIRs did not correlate with target gene expression (Figure 3b), as expected given the constitutive nature of many CTCF-mediated chromosomal interactions. Notably, the PIRs of genes showing differential expression between neutrophils and monocytes were enriched (100 permutations, p≤0.01) for the binding of PU.1 and C/EBPβ in the highest-expressing cell type (Figure 3c). Consistently, we also found cell type-specific binding of these TFs to be enriched within cell type-specific PIRs (permutation p≤0.01) (Figure 3d).
Jointly, these results reinforce the role of PU.1 and C/EBPβ in establishing tissue-specific transcriptional patterns.
We next investigated the effect of PU.1 binding variation at PIRs on the expression of target genes. PU.1 tfQTLs were intersected with PCHi-C and expression QTL (eQTL) data from the Chen et al. study7. Only PU.1 tfQTL/eQTL (p<1×10−5) pairs located distally to TSSs (>25Kb) were considered, in order to exclude eQTLs implicating promoter-based variants and ensure a high resolution of PCHi-C signal detection. PU.1 tfQTL SNPs mapping to PIRs showed significantly larger effects on the expression of the genes they contacted compared with distance-matched SNPs that did not map to a PIR (t-test p<2×10−16; Figure 3e), in agreement with physical interactions playing a role in mediating the distal regulatory effects of PU.1 binding.
To explore the extent to which genetic variation affecting PU.1 binding may directly affect promoter-enhancer interactions, we employed an allele-specific strategy (Methods) to identify heterozygous sites within PIRs that exhibited allelic imbalance at PCHi-C contacts (Supplementary Figure 4d-e). We found that ~14,000 heterozygous SNPs within PIRs that displayed evidence of allelic bias in both neutrophils and monocytes were enriched for PU.1 and CTCF binding (Figure 4a-c, Supplementary Figure 4f). Notably, the same was true for the hQTLs for the Polycomb-associated inhibitory mark H3K27me3, consistent with a role of Polycomb repressive complexes in shaping regulatory chromatin architecture39. An example of a SNP showing allelic imbalance affecting promoter-enhancer connectivity was rs519989, which was also associated with PU.1 binding, histone modifications and expression of the gene LRRC8C (Figure 4d-f; Supplementary Table 4). LRRC8C encodes a volume-regulated anion channel subunit40 upregulated during terminal differentiation of neutrophils41. This and other loci thus demonstrate coordinated genetic influences on PU.1 binding, chromatin activity and the formation of promoter interactions in the regulation of neutrophil gene expression.
Finally, to explore the influence of the identified PU.1 tfQTLs and their potential downstream effects on haematological traits and diseases, we accessed summary statistics from public GWAS studies of cell-matched full blood count traits2 and autoimmune diseases42–47. PU.1 binding regions were enriched48 for GWAS SNPs associated with myeloid cell traits (e.g. neutrophil counts) and with autoimmune diseases (Figure 5a). We formally tested the overlap of PU.1 tfQTLs and GWAS SNPs using colocalisation analysis49,50, revealing 43 proximal and 74 distal tfQTLs that shared a genetic signal (posterior probability [PP]>0.9) with at least one GWAS locus (Table 1, Supplementary Table 5). We next used CATO (Contextual Analysis of TF Occupancy)51 to identify PU.1 collaborating factors that may be involved in mediating these traits at shared PU.1 tfQTL / GWAS loci. Colocalising SNPs were shown to predominantly affect binding recognition motifs for several PU.1 binding partners, including C/EBP, AP-1, ETS, CTF/NF-1, ATF/CREB and RUNX (Supplementary Figure 5)52. These results highlight the likely role of PU.1 and its partners in mediating the functional effects of GWAS variants in neutrophils.
To determine the putative target genes underpinning PU.1-mediated disease associations, we integrated PCHi-C and eQTL data7 in neutrophils. Overall, 27 high confidence target genes at QTL loci colocalised with GWAS summary statistics (Supplementary Table 6). Interestingly, 35% of the shared tfQTL / GWAS SNPs could be attributed to a proximal tfQTL at a PU.1 binding site. This finding suggests that many PU.1 tfQTLs are themselves under distal genetic control, potentially mediated through enhancer-enhancer interactions53–55. One such example is the rs791357_C variant associated with decreased neutrophil and monocyte cell counts. PCHi-C data show that this region is highly connected to the CPEB4 gene in both neutrophils and monocytes (Figure 5b). CPEB4 is a cytoplasmic polyadenylation element binding protein, which binds to a recognition sequence in the poly(A) tail of mRNAs and can activate or inhibit translation56. CPEB4 is involved in controlling terminal differentiation in erythroid cells57 and the proliferation of some cancers58. The SNP rs791357 is a proximal tfQTL for PU.1 (p=9.05×10−21) and C/EBPβ (p=1.963×10−9), an hQTL for H3K4me3 (p=1.98×10−17) and H3K27ac (p=1.41×10−33) and an eQTL for CPEB4 (p=1.16×10−30) in neutrophils. A similar sharing of the PU.1 and C/EBPβ binding site was observed in monocytes, together with an hQTL (H3K27ac p=8.14×10−26) and an eQTL for CPEB4 (p=2.55×10−18). Additional example loci showed tfQTLs shared across multiple traits, together with PCHi-C interactions between enhancers and colocalised genes (Supplementary Figure 6a-b).
In conclusion, our analysis suggests that genetically-determined variation in PU.1 binding in neutrophils modulates gene expression, acting via changes in the local chromatin state and, at least in some cases, in the patterns of promoter-enhancer interactions. We show that these effects underpin the genetic associations for a number of important human blood cell traits and diseases, confirming the role of PU.1 in neutrophil biology and implicating this cell type as potentially causal for a number of autoimmune traits.
Author Contributions
Conceived and designed the study, S.W., B.M.J., M.S., and N.S. Performed experiments, S.W., and B.M.J. Generated experimental resources, F.B., S.F., and B.F. Performed formal analysis, S.W., L.V., A.L.M., K.K., L.C., Y.Y., S.E., V.I., H.E., M.T., D.R., A.D., and M.S. Investigation, S.W., L.V., K.W., and A.L.M. Data Curation, L.V., Y.Y., H.P., D.R., A.D., and L.C. Supervision and study coordination, P.F., L.C., K.D., T.P., P.F., M.F., M.S., and N.S. Project Administration, D.M., L.C., K.D., P.F., M.F., M.S., and N.S. Performed primary manuscript writing S.W., A.L.M., M.S., and N.S.
Competing Interests statement
The authors declare no competing interests.
Online Methods
Sample collection and cell isolation
Peripheral adult blood collection
ChIP-seq data generated in this study used donor samples which were collected as part of the previously described study7. Blood was obtained from donors who were members of the NIHR Cambridge BioResource (http://www.cambridgebioresource.org.uk/) with informed consent (REC 12/EE/0040) at the NHS Blood and Transplant, Cambridge. Donors were on average 55 years old (range 20-75 years old), with 46% of donors being male. A unit of whole blood (475 ml) was collected in 3.2% Sodium Citrate. An aliquot of this sample was collected in EDTA for genomic DNA purification. A full blood count (FBC) for all donors was obtained from an EDTA blood sample, collected in parallel with the whole blood unit, using a Sysmex Haematological analyser. The level of C-reactive protein (CRP), an inflammatory marker, was also measured in the sera of all individuals. All donors used for the collection had FBC and CRP parameters within the normal healthy range. Blood was processed within 4 hours of collection.
Isolation of cell subsets
Samples were as described in7. To obtain pure samples of ‘classical’ monocytes (CD14+ CD16-) and neutrophils (CD66b+ CD16+) we implemented a multi-step purification strategy. Whole blood was diluted 1:1 in a buffer of Dulbecco’s Phosphate Buffered Saline (PBS, Sigma) containing 13mM sodium citrate tribasic dihydrate (Sigma) and 0.2% human serum albumin (HSA, PAA) and separated using an isotonic Percoll gradient of 1.078 g/ml (Fisher Scientific). Peripheral blood mononuclear cells (PBMCs) were collected and washed twice with buffer, diluted to 25 million cells/ml and separated into two layers, a monocyte rich layer and a lymphocyte rich layer, using a Percoll gradient of 1.066g/ml. Cells from each layer were washed in PBS (13mM sodium citrate and 0.2% HSA) and subsets purified using an antibody/magnetic bead strategy. To purify monocytes, CD16+ cells were depleted from the monocyte rich layer using CD16 microbeads (Miltenyi) according to the manufacturer’s instructions. Cells were washed in PBS (13mM sodium citrate and 0.2% HSA) and CD14+ cells were positively selected using CD14 microbeads (Miltenyi). To purify neutrophils, the dense layer of cells from the 1.078 g/ml Percoll separation was lysed twice using an ammonium chloride buffer to remove erythrocytes. The resulting cells (including neutrophils and eosinophils) were washed and neutrophils positively selected using CD16 microbeads (Miltenyi) according to the manufacturer’s instructions. The purity of each cell preparation was assessed by multicolour FACS using conjugated antibodies for CD14 (MφP9, BD Biosciences) and CD16 (B73.1 / leu11c, BD Biosciences) for monocytes, CD16 (VEP13, MACS, Miltenyi) and CD66b (BIRMA 17C, IBGRL-NHS) for neutrophils. Purity was on average 95% for monocytes and 98% for neutrophils.
ChIP-sequencing
Purified cells were fixed with 1% formaldehyde (Sigma) at a concentration of approximately 10 million cells/ml. Fixed cell preparations were washed and stored re-suspended in PBS at 4°C prior to lysis and sonication. Sonication protocols were performed in a Diagenode PicoRuptor for 8 cycles of 30 seconds on, 30 seconds off in a 4°C water cooler. Samples were checked for sonication efficiency using the criterion of a 150-500bp fragment size, by Agilent DNA bioanalyzer. ChIP-seq was carried out as previously described59; all liquid handling steps were performed on an Agilent Bravo NGS. Protein A Dynabeads (Invitrogen) were coupled with 2.5μg of antibody. Sonicated lysate (3-5 million cells) was then added to the bead/antibody mix and incubated at 4°C overnight. ChIP-DNA bound beads were washed for ten repetitions in cold RIPA solution. DNA was eluted from the beads at 65°C for five hours to reverse the cross-linking. 2μl RNase was added to the ChIP-DNA and incubated at 37°C for 30 minutes, followed by 2μl of Proteinase K treatment at 55°C for 1 hour. Ampure beads (Beckman Coulter, A63881) were added to the DNA at a 1:1.8 ratio, followed by two cold 70% ethanol washes. ChIP-DNA was eluted in 50μl elution buffer. Illumina sequencing libraries were prepared on a Beckman Fx liquid handling system. End-repair, A-tailing and paired-end adapter ligation were performed using NEBnext reagents from New England Biolabs (E6000S), with purification using a 1:1 ratio of AMPure XP to sample between each reaction. Amplification of ChIP-DNA was performed using Kapa HiFi master mix (Kapa Biosystems KK2602), with 18 cycles of PCR followed by a 0.7:1 Ampure XP clean-up. Antibodies for H3K4me3 (C15410003), H3K27me3 (C15410195) and CTCF (C15410210) were obtained from Diagenode, Liege, Belgium. Antibodies for PU.1 (sc-352x, sc-22805x) and C/EBPβ (sc-150x) were obtained from Santa Cruz Biotechnology.
Data processing and peak calling
ChIP libraries were sequenced on Illumina HiSeq 2000 and HiSeq 2500 instruments with 50bp single-end reads. Sequenced reads were aligned to the reference genome using BWA (bwa aln -q 15). Duplicate reads were marked using Picard MarkDuplicates (v1.103). Reads with mapping quality less than 15 were removed (SAMtools v0.1.18). The fragment size L for each aligned bam was estimated using PhantomPeakQualTools vr18, which uses cross correlation of binned read counts between forward and reverse strands. To identify highly enriched genomic regions, we used MACS260 (v2.0.10.20131216, standard options) for peak calling with the estimated fragment size from PhantomPeakQualTools (--shiftsize=half fragment size), with the narrow setting for PU.1, C/EBPβ, CTCF and H3K4me3 and the broad flag set for H3K27me3. For background control, ChIP input was created by merging randomly selected samples, using reads from 4 pools of 12 individuals for neutrophil input and 2 pools of 6 individuals for monocytes. ChIP inputs were as follows:
Significant peaks were selected to be at 1% FDR or less.
Data Quality
We removed ChIP samples that had a relative strand correlation (RSC) < 0.8 and normalised strand correlation (NSC) < 1.0561. We defined high confidence data as those from ChIP with RSC > 0.8 and NSC > 1.05. Otherwise, we used genome browser tracks to visually confirm a good ChIP before including it in the final data set. Supplementary Figure 1 and Supplementary Table 1 show quality control metrics and corresponding principal components, showing no batch effects after PEER correction using K=10 factors.
Normalised read count in the reference peak set
Consensus peak sets were constructed using the dba.peakset function within the DiffBind R package62,63 (http://bioconductor.org/packages/release/bioc/vignettes/DiffBind/inst/doc/DiffBind.pdf). For PU.1, H3K4me3 and H3K27me3 we set the minimum number of samples for a peak to be included in the consensus to 3; for C/EBPβ, CTCF and monocyte samples the minimum was set to 2. Sex chromosomes were not included in the QTL analysis. The reference peak set was filtered further for read counts as described below. Next, we generated a quantification signal of ChIP-seq for each donor. Here we only considered read counts under the peaks, as the regions outside peaks are more likely to be noise or background signal than true enrichment. For each donor, we generated a vector of log2 reads per million (log2RPM) per peak in the reference peak set by counting the number of overlapping reads under the peaks (BEDOPS bedmap --count) and normalised the counts by the total number of reads in the library. We further filtered the reference peak set to only consider peaks with log2RPM > 0 in at least 50% of the donors in a given cell type, corrected for ten PEER factors and applied quantile normalisation across donors. For QTL calling with H3K27me3, two sets of summary statistics are provided on two separate signal matrices. In the first set, H3K4me3 peak annotations were used in conjunction with H3K27me3 signal to enrich for poised promoter QTLs. In the second set, broad called H3K27me3 peaks were divided into 2500bp windows.
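The read-count filter described above can be illustrated with a minimal sketch. It assumes a per-peak, per-donor matrix of raw read counts and a list of library sizes; the function names are hypothetical, and the treatment of zero-count peaks (mapped to minus infinity so that they fail the log2RPM > 0 test) is an assumption rather than something stated in the text. PEER correction and quantile normalisation are not reproduced here.

```python
import math

def log2_rpm(count, library_size):
    """log2 reads per million for one peak in one donor."""
    if count == 0:
        # Assumption: a zero-count peak is treated as failing the filter.
        return float("-inf")
    return math.log2(count * 1e6 / library_size)

def filter_peaks(count_matrix, library_sizes, min_fraction=0.5):
    """Return indices of peaks with log2RPM > 0 in at least
    min_fraction of donors.

    count_matrix[p][d] is the raw read count of peak p in donor d;
    library_sizes[d] is the total number of reads for donor d.
    """
    kept = []
    for p, row in enumerate(count_matrix):
        values = [log2_rpm(c, n) for c, n in zip(row, library_sizes)]
        n_pass = sum(v > 0 for v in values)
        if n_pass >= min_fraction * len(values):
            kept.append(p)
    return kept
```

Since log2RPM > 0 is equivalent to more than one read per million, a peak survives this filter only if at least half of the donors carry more than one read per million mapped reads under it.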
Identification of PU.1 and C/EBPβ differential binding sites
We used DiffBind version 1.12.0 with default EdgeR (3.8.3) option to identify peaks which were differentially bound between neutrophils and monocytes. We used the six best quality samples and their peak sets for this analysis:
We selected peaks present in at least three individuals and that had a minimum three-fold difference in binding signal as cut off. Heatmap visualisation of differentially bound regions was generated using Deeptools 264.
Transcription factor enrichments
For determining enrichment of ChIP-seq regions of interest within PIRs we used regioneR (1.0.3)65, which performs a statistical evaluation of two sets of genomic regions by permutation testing. The null distribution was determined by randomising genomic regions over 50 permutations. In Figure 3b-d, the Neu PU.1 and Mono PU.1 regions are those identified from the DiffBind differential binding analysis (Supplementary Figure 3a). In Figure 3d, cell type-biased PIRs were constructed using data from Javierre et al. We took a subset of PIRs from B cells, CD8 T cells, CD4 T cells, Neutrophils, Monocytes, Megakaryocytes and Erythrocytes. These were split into three classifications: (i) PIRs that were found in neutrophil and one other cell type, (ii) PIRs that were found in monocyte and one other cell type, and (iii) as an outgroup, megakaryocyte PIRs that were not shared with neutrophil and monocyte.
Differentially expressed genes and gene expression counts
Gene expression counts and list of differentially expressed genes were available from Ecker et al.3
QTL mapping
Cis-acting QTL mapping was done using the LIMIX package66, available from github (https://github.com/PMBio/limix). We considered genetic variants mapping to within 1 Mb (on each side) of each tested feature (peak), and tested their association using linear regression. Models were fit on quantile-normalized PEER residuals, also including a random effect term accounting for polygenic signal and sample relatedness (as in the variance component models above we used the realized relatedness matrix to capture sample relatedness). From the linear regression we obtained the effect size and p-value for each tested association. To correct for multiple hypothesis testing, we performed a two-step procedure67: first, we corrected for multiple testing across variants for each molecular outcome using Bonferroni correction and, second, we adjusted the obtained p-values for multiple testing across phenotypes within each layer using the Q-value procedure31. We considered QTLs at a significance threshold of 5% FDR.
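The two-step correction can be sketched as follows. This is an illustration only, not the LIMIX implementation: the second step uses the Benjamini-Hochberg adjustment as a stand-in for the Q-value procedure cited in the text, and the function names and input layout (a mapping from feature to the p-values of its tested variants) are assumptions.

```python
def bonferroni_min_p(pvalues):
    """Step 1: Bonferroni-adjust the smallest p-value across the
    variants tested for one feature (peak)."""
    return min(1.0, min(pvalues) * len(pvalues))

def bh_adjust(pvalues):
    """Step 2 (stand-in): Benjamini-Hochberg adjusted p-values.
    The study cites the Q-value procedure; BH is substituted here
    to keep the sketch short."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def call_qtls(pvalues_by_feature, fdr=0.05):
    """Return {feature: adjusted p-value} for features passing the
    FDR threshold after both correction steps."""
    features = list(pvalues_by_feature)
    step1 = [bonferroni_min_p(pvalues_by_feature[f]) for f in features]
    step2 = bh_adjust(step1)
    return {f: q for f, q in zip(features, step2) if q < fdr}
```

Correcting within each feature first accounts for the number of variants tested in its cis window, so that features with many nearby variants are not favoured in the across-feature step.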
Promoter Capture HiC (PCHi-C)
Cells were isolated as described34. One donor was used for preparing each PCHi-C library. In total, twelve PCHi-C libraries were prepared, six using monocytes and six using neutrophils. Approximately 8×10^7 cells per library were resuspended in 30.625 ml of DMEM supplemented with 10% FBS, and 4.375 ml of formaldehyde was added (16% stock solution; 2% final concentration). The fixation reaction continued for 10 min at room temperature with mixing and was then quenched by the addition of 5 ml of 1 M glycine (125 mM final concentration). Cells were incubated at room temperature for 5 min and then on ice for 15 min. Cells were pelleted by centrifugation at 400g for 10 min at 4°C, and the supernatant was discarded. The pellet was washed briefly in cold PBS, and samples were centrifuged again to pellet the cells. The supernatant was removed, and the cell pellets were flash frozen in liquid nitrogen and stored at −80 °C. Biotinylated 120-mer RNA baits were designed to the ends of HindIII restriction fragments overlapping Ensembl-annotated promoters of protein-coding, noncoding, antisense, snRNA, miRNA and snoRNA transcripts37. A target sequence was accepted if its GC content ranged between 25% and 65%, the sequence contained no more than two consecutive Ns and was within 330 bp of the HindIII restriction fragment terminus. A total of 22,076 HindIII fragments were captured, containing a total of 31,253 annotated promoters for 18,202 protein coding and 10,929 non-protein genes according to Ensembl v75 (http://grch37.ensembl.org). Hi-C library generation was carried out with in-nucleus ligation as described previously68. Chromatin was then de-crosslinked and purified by phenol:chloroform extraction. DNA concentration was measured using Quant-iT PicoGreen (Life Technologies), and 40 μg of DNA was sheared to an average size of 400 bp, using the manufacturer’s instructions (Covaris).
The sheared DNA was end-repaired, adenine-tailed and double size-selected using AMPure XP beads to isolate DNA ranging from 250 to 550 bp. Ligation fragments marked by biotin were immobilized using MyOne Streptavidin C1 DynaBeads (Invitrogen) and ligated to paired-end adaptors (Illumina). The immobilized Hi-C libraries were amplified using PE PCR 1.0 and PE PCR 2.0 primers (Illumina) with 7 PCR amplification cycles. Capture Hi-C of promoters was carried out with SureSelect target enrichment, using the custom designed biotinylated RNA bait library and custom paired-end blockers according to the manufacturer’s instructions (Agilent Technologies). After library enrichment, a post capture PCR amplification step was carried out using PE PCR 1.0 and PE PCR 2.0 primers with 4 PCR amplification cycles. For more details, see36. PCHi-C libraries were sequenced on the Illumina HiSeq2500 platform, using 3 sequencing lanes per PCHi-C library.
HiCUP and CHiCAGO
Sequencing reads were processed and mapped with HiCUP, and PCHi-C interactions were called using CHiCAGO with default parameters35,69.
Datasets
Data generated in this study was deposited to the European Genome-phenome Archive under the following accession IDs: transcription factor data: EGAD00001004571; H3K4me3: EGAD00001002711; H3K27me3: EGAD00001002712; PCHiC: EGAS00001001911.
Genotyping check of ChIP-Seq and PCHi-C bams
Identity matching for each sample and for each analysis was performed by extracting genotypes from RNA-seq and ChIP-seq and comparing them to SNPs from the WGS data. The first stage of verifying the sample identity concordance between the RNA-seq/ChIP-seq and WGS data involved pre-processing the BAM files for one autosomal chromosome (chr1) to remove PCR duplicates and reads with mapping quality score <10. The variants were then called from the resulting BAM file using mpileup from the SAMtools package70. The variants with QUAL <20, DP <5 and GQ <5 were filtered out. Then, we compared genotypes of the filtered variants with genotypes generated from WGS and imputation. The genotypes generated were considered to be from the same sample if the concordance rate was greater than 90%.
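The identity check can be sketched as below, under stated assumptions: genotypes are represented as dictionaries keyed by site, the function names are hypothetical, and the variant filter is read as removing a call if any one of QUAL < 20, DP < 5 or GQ < 5 holds (the text does not say whether the three conditions are combined with "and" or "or").

```python
def passes_variant_filter(qual, depth, gq):
    """Keep a called variant only if QUAL >= 20, DP >= 5 and GQ >= 5.
    Assumption: failing any single threshold removes the variant."""
    return qual >= 20 and depth >= 5 and gq >= 5

def concordance_rate(called, reference):
    """Fraction of shared sites at which the genotype called from
    sequencing reads matches the WGS genotype.

    Both arguments map a site key, e.g. ("chr1", 12345), to a
    genotype string such as "0/1"."""
    shared = [site for site in called if site in reference]
    if not shared:
        return 0.0
    matches = sum(called[s] == reference[s] for s in shared)
    return matches / len(shared)

def same_sample(called, reference, threshold=0.9):
    """Samples are treated as matching when concordance exceeds 90%."""
    return concordance_rate(called, reference) > threshold
```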
Allele specific analysis of transcription factor binding
For allele specific analysis, we used the phased WGS VCF that was also utilised for QTL mapping but here we removed indels and only considered biallelic single nucleotide variants. We then mapped deduplicated ChIP-seq reads on each allele of each SNV using GATK ASEReadCounter with default parameters, base quality ≥2 and mapping quality ≥15. We then filtered for heterozygous SNVs only with ≥10 read counts per site and nonzero counts in both alleles. We required 2 donors meeting these read count criteria at each site. To carry out association analysis, we used Rasqual23 with total read counts per sample as offset parameter. Note that Rasqual uses a model that corrects for reference mapping bias and genotyping errors. To correct for non-genetic confounders, we applied PCA with and without permutation on normalised read counts in log2RPM across all sites and picked the first N components whose explained variances were greater than those from permutation as covariates for Rasqual. Finally, we only considered SNVs found within peaks to determine the direct allele specific effect on TF binding of PU.1 and CTCF in neutrophils.
Allele specific analysis of PCHi-C
The genotypes of PCHi-C donors were obtained from Cambridge Bioresource phase 4 (Illumina core exome chip). We phased the genotype using BEAGLE2 (v2.0.5)71 and imputed using Positional Burrows-Wheeler Transform and Haplotype Reference Consortium (release 1.1) as reference panel, via the Sanger imputation service. We then filtered sites for ≥5% minor allele frequency, HWE p-value ≥ 1×10−6, ≤5% sample missingness and INFO score > 0.8. We removed indels and only considered biallelic single nucleotide variants. We used WASP72 to remove PCHi-C reads that are likely to be biased towards the reference allele. We then mapped deduplicated ChIP-seq reads on each allele of each SNV using GATK ASEReadCounter with default parameters, base quality ≥2 and mapping quality ≥15. We then filtered for heterozygous SNVs only with ≥ 10 read counts per site and nonzero counts in both alleles. Finally, we only considered heterozygous sites with allele bias of ≤40% or ≥60%, after removing extreme bias of <1% or >100%.
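The final read-count and allelic-bias filters can be illustrated with a short sketch. It assumes that allele bias is measured as the reference-allele fraction of reads at a heterozygous site, which is not stated explicitly in the text; the function name and the tuple layout are hypothetical, and the additional removal of extreme bias is not reproduced.

```python
def allelic_bias_candidates(sites, min_reads=10, low=0.40, high=0.60):
    """Select heterozygous SNVs showing allelic bias.

    sites: iterable of (site_id, ref_count, alt_count).
    Returns a list of (site_id, ref_fraction) for sites with at least
    min_reads reads in total, nonzero counts on both alleles, and a
    reference fraction of <= low or >= high.
    Assumption: bias is expressed as ref / (ref + alt).
    """
    kept = []
    for site_id, ref, alt in sites:
        total = ref + alt
        if total < min_reads or ref == 0 or alt == 0:
            continue
        ref_fraction = ref / total
        if ref_fraction <= low or ref_fraction >= high:
            kept.append((site_id, ref_fraction))
    return kept
```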
Enrichment analysis of tfQTLs and hQTLs in PIRs
Each of these heterozygous SNVs was annotated according to whether it was located in a PIR and whether it was a significant tfQTL (PU.1 and CTCF; p<1×10−5) or a significant hQTL (H3K27me3, H3K4me3, H3K27ac; p<1×10−5). Fisher’s exact tests were carried out separately for each sample and for each cell type to test for enrichment of tfQTLs and hQTLs in PIRs. The mean and standard deviation were then calculated across all samples for each cell type. In a second approach, all samples were combined across both cell types. SNVs were removed if they were not observed in at least two samples, or in one sample and in the two cell types, or if the allelic ratio (REF reads/ALT reads) was not consistent across the samples or cell types. Enrichment was tested for SNVs where at least N samples fell into a PIR and at least N samples carried a significant tfQTL or hQTL, for increasing numbers of samples N (N=1,2,3,4).
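For illustration, the per-sample enrichment test corresponds to a Fisher's exact test on a 2×2 table of SNVs (QTL or not, in a PIR or not). The sketch below computes a one-sided (enrichment) p-value from the hypergeometric distribution; the choice of a one-sided test and the function name are assumptions, as the text does not specify the alternative.

```python
from math import comb


def fisher_enrichment(a, b, c, d):
    """One-sided Fisher's exact test for enrichment.

    a: QTL SNVs in PIRs        b: QTL SNVs outside PIRs
    c: non-QTL SNVs in PIRs    d: non-QTL SNVs outside PIRs
    Returns (sample odds ratio, P[X >= a] under the hypergeometric null).
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    denom = comb(n, col1)
    upper = min(row1, col1)
    # Sum the probabilities of tables at least as extreme as the observed one.
    p = sum(comb(row1, k) * comb(n - row1, col1 - k) for k in range(a, upper + 1)) / denom
    odds = (a * d) / (b * c) if b * c else float("inf")
    return odds, p
```

In practice an established implementation such as `scipy.stats.fisher_exact` with `alternative="greater"` would normally be preferred; the explicit sum is shown only to make the calculation transparent.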
Enrichment of genome wide association SNPs within ChIP-seq marked regions
To test for significant enrichment of trait-associated SNPs within regions of interest, we applied GWAS analysis of regulatory or functional information enrichment with LD (GARFIELD)48. H3K27ac- and H3K4me1-occupied regions in neutrophils were obtained from a previous study7. Neutrophil annotations for PU.1, C/EBPβ, H3K4me3 and H3K27me3 were generated as described above, with the exception that H3K27me3 regions were not chunked into 2.5Kb bins. Monocyte annotations are described in Supplementary Figure 3a for PU.1 and C/EBPβ.
Colocalisation between diseases and molecular trait
To overlap our QTL results with the GWAS catalogue, we calculated LD from our WGS data using plink v1.973. We considered a QTL variant to overlap a GWAS signal when it either mapped directly to the GWAS variant or was in LD with it (r2≥0.8). For the six selected autoimmune diseases, we took forward the overlapping disease variants with P-value ≤5×10−8 in the selected studies: celiac disease [CD]42, inflammatory bowel disease [IBD]43, including Crohn’s disease [CD] and ulcerative colitis [UC], multiple sclerosis [MS]44, Type 1 diabetes [T1D]45, and rheumatoid arthritis [RA]46. The associations of IBD, CD and UC in the European cohorts were used for this study. We also used Type 2 diabetes47 as a negative control. We used a Bayesian colocalization method49,50 to assess whether the observed overlap between disease and molecular trait may be due to a shared genetic effect. The method calculates the posterior probability (PP), versus the null model of no association, for four alternative models: a model where a region or locus contains a single variant associated with either the molecular trait or the disease (models 1, 2); a model where a single causal variant affects both traits (model 3); or a model where two distinct associations exist (model 4). The method derives the PP of each variant in the locus being the causal one under the different models; the PP of a given locus is then the sum of the PPs of all variants within it, with all variants given equal prior probability of being causal. The prior for each model is chosen as the one that maximizes the log-likelihood function50. We acknowledge the limitations of the model: it assumes one causal variant in the locus, and when two causal variants are in high LD it has limited power to distinguish model 4 from model 3.
We also note that colocalization does not imply a causal relationship between molecular trait and disease; it is also compatible with the same variant having independent (‘pleiotropic’) effects on molecular traits and disease. We applied the colocalization test to each of the 1,003 disease-molecular trait pairs in which the lead SNPs of both traits were in high LD (r2≥0.8). To avoid overlapping 2Mb-wide genetic loci due to features in close proximity (e.g., splicing junctions, genes, histone peaks, CpGs in islands), we tested colocalization per locus, which means that the prior model parameters were estimated from one locus instead of multiple loci, and hence the priors may be overestimated.
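The structure of the four-model comparison can be sketched as follows. This is a simplified illustration rather than the implementation used: per-variant Bayes factors are taken as given, every variant (and, for model 4, every ordered pair of distinct variants) receives an equal prior, and the model priors are supplied by the caller instead of being estimated by maximum likelihood. The function name and argument layout are hypothetical.

```python
def model_posteriors(bf_molecular, bf_disease, priors):
    """Posterior probability of models 0-4 for one locus.

    bf_molecular, bf_disease: per-variant Bayes factors in the same variant order.
    priors: dict mapping model index (0 = null, 1-4 as in the text) to its prior.
    """
    k = len(bf_molecular)
    pi = 1.0 / k  # equal prior for every variant being causal
    sum_m, sum_d = sum(bf_molecular), sum(bf_disease)
    shared = sum(m * d for m, d in zip(bf_molecular, bf_disease))

    rbf = {
        0: 1.0,          # null model: no association
        1: pi * sum_m,   # molecular trait only
        2: pi * sum_d,   # disease only
        3: pi * shared,  # one causal variant shared by both traits
        # two distinct causal variants: all ordered pairs with i != j
        4: (sum_m * sum_d - shared) / (k * (k - 1)) if k > 1 else 0.0,
    }
    weights = {m: priors[m] * rbf[m] for m in rbf}
    total = sum(weights.values())
    return {m: w / total for m, w in weights.items()}
```

When the same variant carries strong evidence for both traits, the product term dominates and model 3 receives most of the posterior mass; when strong evidence sits on different variants, model 4 is favoured instead.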
Footnotes
* Joint senior authors