Abstract
Blood cell-free DNA (cfDNA) is derived from fragmented chromatin in dying cells. As such, it remains associated with histones that may retain the covalent modifications present in the cell of origin. Until now this rich epigenetic information carried by cell-free nucleosomes has not been explored at the genome level. Here, we perform ChIP-seq of cell free nucleosomes (cfChIP-seq) directly from human blood plasma to sequence DNA fragments from nucleosomes carrying specific chromatin marks. We assay a cohort of healthy subjects and patients and use cfChIP-seq to generate rich sequencing libraries from low volumes of blood. We find that cfChIP-seq of chromatin marks associated with active transcription recapitulates ChIP-seq profiles of the same marks in the tissue of origin, and reflects gene activity in these cells of origin. We demonstrate that cfChIP-seq detects changes in expression programs in patients with heart and liver injury or cancer. cfChIP-seq opens a new window into normal and pathologic tissue dynamics with far-reaching implications for biology and medicine.
One Sentence Summary Chromatin immunoprecipitation and sequencing of histone modifications on blood-circulating nucleosomes (cfChIP-seq) provides detailed information about gene expression programs in different human organs.
Main Text
Cell death throughout the human body results in the release of short nucleosome-size DNA fragments (cfDNA) into the blood circulation (Mandel and Metais 1948). The plasma of healthy individuals contains the equivalent of ∼1000 genomes per ml, with a marked increase in the amount of cfDNA in many pathologies (e.g., cancer) (Lu and Liang 2016) and physiological conditions (e.g., exercise) (Haller et al. 2017). cfDNA fragments are short-lived, with an estimated half-life of less than one hour (De Vlaminck et al. 2014; Jiang and Lo 2016; Schwarzenbach, Hoon, and Pantel 2011; Lo et al. 1999), making them ideal biomarkers for noninvasive monitoring of active physiological and pathological processes. Indeed, genetic variation in cfDNA is used to detect fetal chromosomal aberrations in maternal plasma, graft rejection, and mutations, and to monitor tumor dynamics (Schwarzenbach, Hoon, and Pantel 2011; Sun et al. 2015; Lu and Liang 2016; Wan et al. 2017). Importantly, cfDNA contains information beyond genetic variation. For example, the precise genomic location of specific cfDNA sequences reflects nucleosome positions in the source tissue and may thus suggest the cfDNA’s cellular origins (Snyder et al. 2016); the underrepresentation of specific promoter sequences in cfDNA may reflect nucleosome-free regions associated with genes expressed in the source tissue (Ulz et al. 2016); and cfDNA methylation patterns can be used to determine its tissue of origin (Guo et al. 2017; Zemmour et al. 2018; Lehmann-Werman et al. 2016, 2018; Kang et al. 2017; Li et al. 2018; Xu et al. 2017; Shen et al. 2018; Moss et al. 2018). Indeed, cfDNA methylation analysis demonstrated that most of the cfDNA pool of healthy individuals originates from leukocytes, specifically neutrophils and monocytes (Moss et al. 2018; Sun et al. 2015).
Genomic DNA, the source of cfDNA, is packaged into nucleosomes, complexes of 150bp DNA and histone proteins, which are heavily post-translationally modified. Upon cell death, the genome is fragmented into nucleosome-sized DNA fragments, most of which remain associated with the histone proteins and are released into the blood as cell-free nucleosomes (cf-nucleosomes) (Holdenrieder et al. 2001, 2005; Rumore and Steinman 1990). There is evidence that these nucleosomes carry histone modifications (Gezer et al. 2013; Bauden et al. 2015; Deligezer et al. 2011).
In cells, a plethora of histone modifications mark specific regions of the genome, such as enhancers and promoters, and their levels correlate with gene activity (Gates, Foulds, and O’Malley 2017; Guenther et al. 2007; Barski et al. 2007; Roadmap Epigenomics Consortium et al. 2015). For example, tri-methylation of lysine 4 on histone H3 (H3K4me3) marks active and paused promoters (Barth and Imhof 2010; Guenther et al. 2007). Additional histone modifications mark accessible enhancers (H3K4me1/2) or elongation by RNA Pol II at gene bodies (H3K36me3).
We reasoned that if active chromatin marks are retained on circulating cf-nucleosomes, capturing and sequencing marked nucleosomes may inform on transcriptional activity within cells contributing to the cf-nucleosome pool (Figure 1A). Here, we develop a method to perform chromatin immunoprecipitation of modified nucleosomes directly from plasma followed by sequencing (cfChIP-seq). We show that cfChIP-seq can specifically capture nucleosomes with different active chromatin marks, and that these recapitulate the original genomic distribution of the modifications, and detect changes in gene expression programs in the cells of origin.
Results
ChIP-seq of cf-nucleosomes from plasma
We devised a simple protocol for cf-nucleosome ChIP-seq (cfChIP-seq) from 2ml of plasma from healthy subjects and <0.5ml from patients with increased levels of cfDNA (Figure 1B, Methods). Briefly, to overcome the extremely low concentration of cf-nucleosomes and the high concentration of native antibodies in plasma, we incorporated two modifications to standard ChIP-seq. First, we covalently immobilized the ChIP antibodies on paramagnetic beads (Figure 1B), which can be incubated directly in plasma without interference from native antibodies. Second, we minimized material loss by using on-bead ligation (Lara-Astiaso et al. 2014; Gutin et al. 2018; Singh et al. 2014; Rhee and Pugh 2011), where barcoded sequencing DNA adaptors are ligated directly to chromatin fragments prior to the isolation of DNA. The resulting protocol allows us to simply and efficiently enrich and sequence targeted chromatin fragments from low volumes of plasma.
We performed cfChIP-seq on multiple plasma samples from healthy individuals with antibodies targeting marks of accessible/active promoters (H3K4me3), enhancers (H3K4me2 or H3K4me1), and gene bodies of actively transcribed genes (H3K36me3), with reproducible yields (Figure 1C, Table S1). Several lines of evidence demonstrate the specificity of cfChIP-seq. (a) The cfChIP-seq signal is consistent with reference ChIP-seq against the same modification in tissues (Roadmap Epigenomics Consortium et al. 2015). This is seen in the close agreement of peaks in the genome browser (Figures 1C, S1A), in the average pattern around promoters and enhancers (Figures 1D, S1B), and in quantitative comparison of the signal across multiple genomic locations, such as all promoters (R > 0.8; Figures 1E, S1C). Essentially all promoters that are ubiquitously marked by H3K4me3 in reference ChIP-seq (housekeeping promoters) are significantly enriched for this mark in cfChIP-seq (9,795/10,505 promoters, 93%, p < 10^-1000). Focusing on the remaining marked promoters in cfChIP-seq, there is a significant overlap (1,324/2,311 promoters, 57%, p < 10^-288) with promoters from monocytes and neutrophils, which are the major contributors to the cfDNA pool (Moss et al. 2018; Sun et al. 2015) (Figure 1F). (b) Performing cfChIP-seq with a mock antibody resulted in dramatically lower yield, with no observable signal of the kind seen for histone modifications (Table S1). (c) We estimated the rate of non-specific events in each sample (Methods), and used this background noise model to evaluate the expected amount of signal originating from non-specific sources in each assay (Table S1). These results show that for H3K4me3 the level of non-specific reads is comparable to or better than reference ChIP (Figure S1D), while for other antibodies such as H3K36me3 the performance is lower but still informative (below).
A potential concern is contamination by chromatin released from in-tube lysis of white blood cells; however, this is highly unlikely for several reasons. (a) Fragment size distributions of cfChIP-seq correspond to DNA wrapped around mono- and di-nucleosomes (Figure 1F), consistent with apoptotic or necrotic cell death, but not with cell lysis, which results in much larger fragments (Mizuta et al. 2013). (b) We identified hundreds of promoters carrying H3K4me3 that are absent in ChIP-seq from white blood cells (leukocytes, peripheral blood mononuclear cells; Figure 1F); these include promoters of genes that are expressed specifically in megakaryocytes, which reside in the bone marrow (Figures 1H and S1E). (c) We are able to detect disease-related chromatin from remote tissues in patients (below).
Together, these results strongly suggest that cf-nucleosomes preserve the endogenous patterns of active histone methylation marks in the cells of origin and can be assayed with cfChIP-seq.
cfChIP-seq of H3K4me3 correlates with gene expression
Having established that cfChIP-seq captures active histone modifications, we next asked whether these reflect gene expression patterns in cells. We decided to focus first on H3K4me3 for several reasons: (a) H3K4me3 ChIP signal is concentrated as narrow peaks at promoters of active and poised genes. This allowed us to differentiate signal from noise. (b) H3K4me3 antibodies are well established in ChIP-seq assays in terms of specificity and sensitivity. (c) The localization of the mark at the promoter simplified connecting ChIP-seq signal and the expression of the associated genes. (d) The level of H3K4me3 at promoters from tissue samples correlates with the level of transcription (Karlić et al. 2010; Weiner et al. 2015; Liu et al. 2005; Pokholok et al. 2005) and is predictive of gene expression levels (Karlić et al. 2010; Weiner et al. 2015; Liu et al. 2005) (Figure S2A). Consistently, we find that H3K4me3 cfChIP-seq signal of healthy subjects is in agreement with gene expression levels in leukocytes (Figure 2A).
We next estimated the specific H3K4me3 signal at each promoter as the total number of cfChIP-seq reads mapped to it above the estimated level of background (Figure 2B, Methods, and Supplemental Note). Comparison of the specific signal at individual promoters shows good agreement with expression levels of cells contributing to cfDNA (R = 0.560, Figure 2C), similar to a comparison of ChIP-seq data and matched expression data from tissues (Roadmap Epigenomics Consortium et al. 2015) (0.600 < R < 0.675, Figure S2B), but not with gene expression levels of an irrelevant tissue (Figure S2C).
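The background subtraction described above can be sketched as follows. This is a minimal illustration, assuming a uniform per-base background rate; the actual noise model is defined in the Methods and Supplemental Note and may differ. The function name `specific_signal`, the gene names, and all numbers are hypothetical.

```python
def specific_signal(read_count, window_bp, background_rate_per_bp):
    """Reads above the expected background at one promoter, floored at zero.

    Assumption: background reads are spread uniformly along the genome, so
    the expected background count in a window is rate * window length.
    """
    expected_background = background_rate_per_bp * window_bp
    return max(0.0, read_count - expected_background)


# Hypothetical promoters: name -> (observed reads, window length in bp)
promoters = {"GENE_A": (120, 2000), "GENE_B": (3, 2000), "GENE_C": (1, 2000)}
rate = 0.001  # hypothetical background reads per bp

signals = {
    name: specific_signal(count, length, rate)
    for name, (count, length) in promoters.items()
}
```

Flooring at zero keeps promoters whose read count falls below the expected background from receiving a negative signal.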
Together, these results strongly suggest that H3K4me3 cfChIP-seq signal is informative of gene-specific expression levels in tissue of origin.
cfChIP-seq detects expression changes
Can cfChIP-seq profiles capture changes in gene expression that reflect the underlying physiology? To better understand the variation of cfChIP-seq signal among subjects and in different physiological conditions, we performed H3K4me3 cfChIP-seq on samples from a diverse cohort of subjects (clinical details summarized in Table S2). These include 15 healthy subjects (ages 23-52); four patients admitted to the emergency room with acute myocardial infarction (AMI); nine patients with gastrointestinal (GI) tract adenocarcinoma; and two patients who underwent a partial hepatectomy (PHx). Some of these subjects were sampled multiple times at different intervals (e.g., before and after a medical procedure). In some of these subjects there are expected changes in the cfDNA content: cancer patients are known to have large amounts of tumor cfDNA (Swarup and Rajeswari 2007; Leon et al. 1977); AMI patients after percutaneous coronary intervention (PCI) have increased amounts of cfDNA from cardiomyocytes (Zemmour et al. 2018). In other cases, such as hepatectomy, we expect damage to the tissue following the procedure.
To get a bird’s eye view of the differences in cfChIP-seq signal among samples, we performed hierarchical clustering of 17,750 promoters of RefSeq genes that have a signal in at least one sample (Methods). This clustering includes cfChIP-seq samples from our cohort and representative reference ChIP-seq (Roadmap Epigenomics Consortium et al. 2015) processed with the same pipeline (Methods). The clustering shows several trends (Figure 3A).
A large group of 9,376 genes shows relatively small differences among samples. The genes in this cluster tend to be highly expressed housekeeping genes with CpG islands at their promoters (Figure S3).
The remaining 8,374 genes display a rich tapestry of patterns. To investigate these patterns, we trimmed the cluster hierarchy at 35 clusters and summarized their average profile across samples (Figure 3B). To provide an orthogonal view, we also summarized the average expression levels of genes in each cluster in two expression databases (Methods): the GTEx compendium (GTEx Consortium 2015), which contains samples from multiple tissues, and the BLUEPRINT Epigenome Project (Stunnenberg, International Human Epigenome Consortium, and Hirst 2016), which focuses on specific hematopoietic cell types sorted from blood, cord blood, and bone marrow samples. Combining these patterns with enrichment analysis (Kuleshov et al. 2016) allowed us to assign putative names to clusters (Figure 3 and Table S3).
Several clusters show moderate to high signal in cfChIP-seq from healthy subjects (e.g., clusters 16-26). Some of these have high signal in cells that contribute significantly to the cfDNA pool in healthy individuals, such as neutrophils and monocytes (clusters 20 and 26) (Moss et al. 2018; Sun et al. 2015). Cluster 23 has strong cfChIP-seq signal and very low ChIP signal in all tissues. This cluster is enriched for genes such as GP6 and GP9 that are expressed in megakaryocytes and whose protein products function in platelets, suggesting a large contribution of bone marrow-residing megakaryocytes, or their immediate progenitors, to the cfDNA pool. We also identified clusters with strong patient-specific cfChIP-seq signals. These clusters are enriched for genes expressed in the GI tract (clusters 1 and 4), with corresponding strong cfChIP-seq signal in GI cancers; genes expressed in heart (clusters 3 and 5), with corresponding strong cfChIP-seq signal in AMI patients; and genes expressed in liver (clusters 9 and 30), with corresponding strong cfChIP-seq signal in PHx patients and in some AMI patients who also experienced liver injury (see below). In contrast, genes specific to T-lymphocytes, which have a long half-life and hence are not major contributors to the cfDNA pool, are found in cluster 8, which indeed shows low cfChIP-seq signal.
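To make the "trim the hierarchy at k clusters, then average each cluster" step concrete, here is a toy sketch in plain Python. It assumes Euclidean distance and average linkage, which the text does not specify, and it uses a naive algorithm that only suits small inputs. All data and function names are hypothetical.

```python
from itertools import combinations


def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def average_linkage(c1, c2, profiles):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(euclidean(profiles[i], profiles[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def cluster_into_k(profiles, k):
    """Agglomerative clustering, stopped once k clusters remain.

    Stopping the merges at k clusters gives the same partition as cutting
    the full average-linkage dendrogram at k clusters.
    """
    clusters = [[i] for i in range(len(profiles))]
    while len(clusters) > k:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: average_linkage(
                clusters[pair[0]], clusters[pair[1]], profiles
            ),
        )
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
    return [sorted(c) for c in clusters]


def mean_profile(cluster, profiles):
    """Average signal of the genes in one cluster, per sample."""
    n_samples = len(profiles[0])
    return [
        sum(profiles[g][s] for g in cluster) / len(cluster)
        for s in range(n_samples)
    ]


# Hypothetical signal for 5 genes across 3 samples (healthy, AMI, cancer)
profiles = [
    [9.0, 9.0, 9.0],
    [8.0, 9.0, 8.0],
    [0.0, 7.0, 0.0],
    [0.0, 8.0, 1.0],
    [0.0, 0.0, 9.0],
]
clusters = sorted(cluster_into_k(profiles, 3))
means = [mean_profile(c, profiles) for c in clusters]
```

For thousands of promoters a library routine would be the practical choice, for example `scipy.cluster.hierarchy.linkage` followed by `fcluster` with `criterion="maxclust"`.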
Multiple sources for differences in cfChIP-seq signal
What drives the observed differences between cfDNA profiles (Figure 3)? cfChIP-seq measures the cf-nucleosome pool, which results from the combined contribution of multiple cells. Thus, we need to consider two scenarios for generating the observed differences (Figure 4A). (Scenario 1) Differences in the proportion of cells that contributed to the cf-nucleosome pool. Such changes lead to coordinated changes in the cfChIP-seq signal of genes that are constitutive in the particular cell type (e.g., ALB, complement, and cytochrome-C genes in hepatocytes). (Scenario 2) Changes in the expression of a pathway (e.g., glycolysis) or a broader program (e.g., proliferation) in a subset of cells, which will also result in coordinated changes in the signal of multiple genes. More generally, the changes we observe would be a superposition of both types of changes.
The observations above suggest that some of the differences between samples are due to differences in contributions from cell types. For example, cluster 30 is highly enriched for liver-expressed genes. However, other differences potentially reflect changes in transcriptional programs. For example, type I interferon response in cluster 31, and histone 1 gene cluster in cluster 18 represent programs that can be activated in various cell types.
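The two scenarios can be illustrated with a simple linear mixture, in which the pooled signal at each gene is the sum over cell types of (fraction of the pool) × (signal in that cell type). This is a sketch under that linearity assumption only; the cell-type profiles, proportions, and the gene label `GLYCOLYSIS_GENE` are hypothetical.

```python
def pooled_signal(proportions, profiles):
    """Per-gene signal of a pool: sum over cell types of fraction * profile."""
    genes = next(iter(profiles.values())).keys()
    return {
        gene: sum(proportions[ct] * profiles[ct][gene] for ct in proportions)
        for gene in genes
    }


# Hypothetical per-cell-type promoter signal (arbitrary units)
profiles = {
    "neutrophil": {"ALB": 0.0, "GLYCOLYSIS_GENE": 1.0},
    "hepatocyte": {"ALB": 10.0, "GLYCOLYSIS_GENE": 1.0},
}

baseline = pooled_signal({"neutrophil": 0.98, "hepatocyte": 0.02}, profiles)

# Scenario 1: more hepatocyte-derived nucleosomes, unchanged programs
scenario1 = pooled_signal({"neutrophil": 0.80, "hepatocyte": 0.20}, profiles)

# Scenario 2: same proportions, but a pathway is induced in one cell type
induced = {ct: dict(p) for ct, p in profiles.items()}
induced["neutrophil"]["GLYCOLYSIS_GENE"] = 3.0
scenario2 = pooled_signal({"neutrophil": 0.98, "hepatocyte": 0.02}, induced)
```

In this toy case Scenario 1 raises only the cell-type-constitutive gene (ALB), while Scenario 2 raises only the pathway gene. Real samples would mix both effects.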
cfChIP-seq detects cell-type expression programs
To better understand the contribution of cell types to the patterns of cfChIP-seq signal, we set out to define cell-type/tissue-specific signatures. Briefly, using the reference ChIP-seq compendium (Roadmap Epigenomics Consortium et al. 2015), we searched for genomic locations (e.g., promoters) that have high signal in the cell type in question and low, or non-existent, signal in all other cell types (Methods, Table S4). We then estimated the cumulative signal of each cell type-specific signature in each cfChIP-seq sample (Figure S4A, Methods). In healthy subjects most of the signal is from neutrophils and monocytes, with lower but significant signal from liver, in agreement with cfDNA methylation analysis (Moss et al. 2018) (Figure 4B). Testing significance against the null hypothesis of non-specific (background) signal shows that the liver-specific signature, although much weaker than those of monocytes and neutrophils, is significantly higher than background. This is consistent with the estimates of 1-2% liver contribution to the cfDNA pool (Moss et al. 2018). In contrast, the heart-specific and brain-specific signatures are not significantly above background (Figure 4B), consistent with cfDNA methylation analysis.
In patients we see contributions from additional cell types and tissues. These changes are consistent with predictions. For example, we detected a heart-specific signature (Figure S4B) in samples from AMI patients undergoing percutaneous coronary intervention (PCI) (Figure 4C). We see good agreement between the strength of the cfChIP-seq heart signature, the levels of troponin measured in the blood, and the estimate of heart cfDNA based on heart-specific differentially methylated CpGs (Zemmour et al. 2018) (Figure 4C). When examining the changes in heart signature from admittance to the emergency room to post-PCI checkup (Figure 4D), we see an increase in heart signature immediately following the procedure, as previously reported by assaying cfDNA methylation (Zemmour et al. 2018).
Another example involves a patient undergoing partial hepatectomy (PHx). We expected that during and following the surgical procedure we would see an increase in liver cell death. Indeed, we observe dramatic changes during the operation, which persist for a few days and slowly decay to original levels (Figure 4F). These changes are strikingly consistent with measurements of the classic marker for liver damage, the enzyme ALT. A noticeable difference is that the cfChIP-seq liver signature dropped back to normal levels about 2 days earlier than ALT, likely reflecting the shorter half-life of cfDNA (<2 hours) compared to ALT (∼47 hours) in the circulation (Giannini, Testa, and Savarino 2005).
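The signature test described above can be sketched as a cumulative read count compared with its expected background under a Poisson null. The Poisson model, the uniform background rate, and all names and numbers below are assumptions made for illustration; the actual procedure is given in the Methods.

```python
import math


def poisson_upper_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam), summed term by term in log space."""
    if k <= 0:
        return 1.0
    total, i = 0.0, k
    while True:
        term = math.exp(-lam + i * math.log(lam) - math.lgamma(i + 1))
        total += term
        # Past the mode the terms shrink monotonically, so stop once negligible
        if i > lam and term <= total * 1e-16:
            break
        i += 1
    return min(total, 1.0)


def signature_score(reads, signature, background_rate_per_bp, window_bp):
    """Cumulative signal above background for one signature, with a p-value."""
    observed = sum(reads.get(region, 0) for region in signature)
    expected = background_rate_per_bp * window_bp * len(signature)
    return observed - expected, poisson_upper_tail(observed, expected)


# Hypothetical read counts at signature windows in one sample
reads = {"liver_1": 4, "liver_2": 6, "heart_1": 0, "heart_2": 1}
signatures = {
    "liver": ["liver_1", "liver_2"],
    "heart": ["heart_1", "heart_2"],
}

scores = {
    name: signature_score(reads, regions, 0.0005, 1000)
    for name, regions in signatures.items()
}
```

With these toy numbers the liver signature is weak in absolute terms yet clearly above background, while the heart signature is indistinguishable from it. This mirrors the qualitative pattern reported for healthy subjects.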
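The effect of the two half-lives can be made concrete with first-order decay. This is a rough sketch that assumes release into the blood stops abruptly and clearance is purely exponential, which is a simplification: ongoing tissue damage would lengthen both curves. The half-life constants are the values quoted in the text, and the 25% target is an arbitrary illustration.

```python
import math


def fraction_remaining(hours, half_life_hours):
    """Fraction of a marker left after `hours` of first-order decay."""
    return 0.5 ** (hours / half_life_hours)


def hours_to_fraction(target_fraction, half_life_hours):
    """Time needed to decay to `target_fraction` of the starting level."""
    return half_life_hours * math.log2(1.0 / target_fraction)


CFDNA_HALF_LIFE_H = 2.0  # upper bound quoted in the text (<2 hours)
ALT_HALF_LIFE_H = 47.0   # approximate value quoted in the text

# Time to fall to 25% of the peak once release has stopped
cfdna_hours = hours_to_fraction(0.25, CFDNA_HALF_LIFE_H)
alt_hours = hours_to_fraction(0.25, ALT_HALF_LIFE_H)
```

Under these assumptions the cfDNA-based signal clears within hours while ALT takes days, so a cfDNA-derived signature should return to baseline well before ALT does.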
For an unbiased view of contributions of different cell types to cfDNA, we evaluated our panel of cell-type specific signatures across cfChIP-seq samples (Figure 4G and Table S5). This analysis shows that in all samples we can detect signatures of a range of cell types from the blood (e.g., monocytes and neutrophils) and from organs (e.g., liver). Of note, there is a decrease in blood cell-type signatures in samples with increased cfDNA load, consistent with a smaller proportion of cfDNA from these cells. These samples include the cancer patients C001 and C002 (cfDNA: 46.7, 84, 122ng/ml for C001.1, C002.1, and C002.2, respectively, compared to 4.5 and 11.75 ng/ml for healthy subjects H012.1 and H013.1, respectively) and AMI patient M004 (cfDNA: 21.13ng/ml and 35% of cfDNA originating from heart based on analysis of DNA methylation markers).
This unbiased approach reveals a more complex picture in AMI patients. In addition to the heart signature discussed above, in some AMI patients, both before and shortly after PCI, we observe a significant increase in the liver cell signature. This signature includes clear signal at liver-specific genes, such as albumin and complement genes (Figure S4C). This increase is presumably a result of the well-known phenomenon of liver injury in AMI patients secondary to low organ perfusion and liver hypoxia (Ebert 2006). To confirm our cfChIP-seq observations, we analysed the cfDNA methylation status of liver-specific DNA methylation regions indicative of liver cell death (Lehmann-Werman et al. 2018). Indeed, we observe excellent agreement between liver cfChIP-seq signature levels and liver cfDNA estimates (R^2 = 0.96, Figure 4H).
In samples from GI tract cancer patients we observe signal from tissues that are not observed in healthy subjects. Most evidently we observe signal originating from GI tissue and GI smooth muscle, which is in agreement with the primary sites of the tumors. A weaker but significant GI signature was evident even when the primary tumor was removed by surgery (patients C003-C007).
Together, these results demonstrate that differences in the tissues of origin contribute to the differences in cfChIP-seq signal among subjects. In particular, in patients the differences correspond to the tissue where ongoing pathological processes take place, such as heart, liver, and gastrointestinal tissue.
cfChIP-seq signal reflects activity patterns of gene programs
Since cfChIP-seq signal correlates with the gene expression programs in the cell of origin, we proceeded to inquire whether cfChIP-seq can reveal more dynamic transcriptional programs beyond the information on tissue of origin. To test this hypothesis, we evaluated the H3K4me3 cfChIP-seq signal in gene signatures representing different cellular processes, protein complexes and transcriptional responses based on gene expression studies (Liberzon et al. 2015; Drew et al. 2017; Giurgiu et al. 2019; Kamburov et al. 2013) (Methods, Table S6).
This analysis uncovered multiple signatures that differ from the expected signal; that is, the amount of H3K4me3 cfChIP-seq signal captured for a signature in a subject is significantly different from that observed in an averaged reference composed of a large cohort of healthy subjects (33 samples) (Methods, Table S7). For example, a strong signature of Heme Metabolism is observed in M002, who suffered from hypoxia (Figure 5A). The blood counts of M002 indeed show high RDW and low RBC and HGB, indicating a higher production rate of red blood cells. During red blood cell production, erythroblasts lose their nucleus to become erythrocytes, presumably releasing their nucleosomes into the bloodstream (Lam et al. 2017; Moss et al. 2018). In C001 and C002, where the majority of the cfDNA originated from the tumors, we observe a sharp decrease in this signature, consistent with an overall reduction in the contribution of non-tumor cells (Figure 4G). Thus, this signature is indicative of a specific hematopoietic cellular differentiation process.
Other signatures, such as glycolysis, unfolded protein response (UPR), and ribosomal protein genes, reflect processes that can take place in multiple cell types but are known to be upregulated in cancers (Hanahan and Weinberg 2011; Cubillos-Ruiz, Bettigole, and Glimcher 2017; Bhat et al. 2015). Indeed, we see upregulation of these pathways in cancer patients (Figure 5A). In C001 and C002 we see a large increase in the glycolysis signature, in agreement with the metabolic reprogramming, known as the Warburg effect, that is considered a hallmark of advanced cancers (Hanahan and Weinberg 2011). Interestingly, we observe increased UPR and ribosomal protein signatures in C002 but not in C001, suggesting that these tumors are molecularly different. In other cancer samples, the amount of tumor-derived cfDNA is probably too low to pass a significance test in our current analysis. This could be improved in the future by correcting for the fraction of tumor-derived cfDNA in the total cfDNA.
Another example is the Interferon-alpha response, which is normally induced by the presence of pathogens such as viruses and bacteria. We observe a dramatic increase in the interferon signature in M004, who appears to have suffered more severe heart damage than other AMI patients in terms of troponin levels and cfChIP-seq heart markers (Figure 4C). Induction of the interferon response was recently shown to promote a fatal response to AMI (King et al. 2017).
Looking at protein complex signatures, we find a dramatic downregulation in the signature of the SWI/SNF (BAF) tumor suppressor and chromatin remodeling complex in C001 and C002. Genes encoding this complex are collectively mutated in ∼20% of human cancers, including GI-tract cancers (Kadoch et al. 2013). This decrease in the SWI/SNF signature likely reflects tumor-specific transcriptional programs and not merely tissue of origin (i.e., GI tract), since the signature is lower in liver tissue compared with GI tract tissues (Figure S4D), yet the significant reduction of the SWI/SNF signature is observed only in cfChIP-seq from cancer patients and not in patients with liver injury.
Together, these observations demonstrate that cfChIP-seq reflects detailed changes in gene expression programs beyond cell type-specific programs.
cfChIP-seq allows dissection of patient-specific molecular phenotypes
A hallmark of cancer cells is genetic alterations that lead to dysregulated gene expression programs (Hanahan and Weinberg 2011). Identification of such cancer-specific transcriptional programs can assist diagnosis and treatment choice (Bradner, Hnisz, and Young 2017). We tested each sample for genes whose signal was elevated compared to “reference” healthy samples. As a control, unrelated healthy samples outside the reference set were highly correlated with the healthy reference, with few genes (usually fewer than 100) showing significantly elevated signal (Figure 5B). In contrast, samples from patients revealed hundreds to thousands of genes with significantly elevated signal (Figure 5B, Table S8). Examining these genes for enrichment in annotated gene lists (Kuleshov et al. 2016) recapitulated some of the results discussed above (Figure 4F, Table S5). For example, genes with abnormally high H3K4me3 signal in C001 were enriched for gene sets of GI tract and brain, consistent with the pathology of this patient (Table S5).
To test for cancer-specific signatures in the H3K4me3 cfChIP-seq signal, we analysed expression profiles from The Cancer Genome Atlas and GTEx projects (GTEx Consortium 2015; Cancer Genome Atlas Research Network et al. 2013). For each tumor type we identified a set of genes whose expression is significantly higher in the tumor compared to normal tissues (Methods, Table S9). We then tested for significant overlaps between the set of genes with higher H3K4me3 signal in a cfChIP-seq sample and the set of genes over-expressed in a tumor type (Methods). For example, the set of genes with high signal in C002.1 has significant overlap (q < 10^-60) with GI-tract adenocarcinoma genes, but only negligible overlap with non-GI cancers such as diffuse large B-cell lymphoma (DLBC) (Figure 5C). The analysis of all samples against all tumor types (Figure 5C and Table S10) shows that only samples from cancer patients have significant enrichment of tumor-related gene expression, while samples from healthy subjects and AMI patients do not. Importantly, the enrichment for cancers of the GI tract is in line with the diagnosed pathology.
Focusing on specific genes that are known to be upregulated in GI tract cancers (Nissan et al. 2012; Rodia et al. 2016; Wu et al. 2018), we observe a clear increase of the H3K4me3 cfChIP-seq signal in the patients compared to the healthy reference (Figure 5D). Among these genes we find the carcinoma markers CEACAM5 and CEACAM6. The protein products of these genes are used in an antibody-based assay for clinical cancer diagnosis (Duffy 2001). A second colorectal cancer marker, the long non-coding RNA CCAT1 (colorectal cancer associated transcript 1) (Ozawa et al. 2017), shows strong signal in one of the cancer patients (C002) but not in healthy subjects. Another example is the non-coding antisense RNA EGFR-AS1, which mediates cancer addiction to EGFR and, when highly expressed, can render tumors insensitive to anti-EGFR antibodies (Tan et al. 2017). While cfChIP-seq signal for EGFR is detected in all cancers, EGFR-AS1 is detected only in C002. This finding, which would not be detected by cfDNA mutation analysis, raises the exciting possibility that cfChIP-seq can be informative for treatment choice beyond genomic mutations.
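A gene-set overlap test of the kind described above is commonly done with a one-sided hypergeometric test. The sketch below assumes that choice; the exact test and the multiple-testing correction behind the reported q-values are given in the Methods and are not reproduced here. The function names and the toy numbers are hypothetical.

```python
import math


def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def hypergeom_upper_tail(overlap, size_a, size_b, universe):
    """P(overlap >= observed) for two gene sets of the given sizes drawn
    independently from `universe` genes (one-sided hypergeometric test)."""
    log_denominator = log_comb(universe, size_b)
    p = 0.0
    for k in range(overlap, min(size_a, size_b) + 1):
        if size_b - k > universe - size_a:
            continue  # impossible configuration, contributes zero
        log_term = (
            log_comb(size_a, k)
            + log_comb(universe - size_a, size_b - k)
            - log_denominator
        )
        p += math.exp(log_term)
    return min(p, 1.0)


# Toy example: 10 genes in total, two sets of 5 genes sharing all 5
p_full_overlap = hypergeom_upper_tail(5, 5, 5, 10)
# Sharing at least 0 genes is certain
p_any = hypergeom_upper_tail(0, 5, 5, 10)
```

Terms are computed in log space because the binomial coefficients overflow for realistic gene counts. Extremely small p-values can still underflow to zero in double precision, in which case one would report the log p-value instead.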
Analysis of enhancer and gene body marks
Our analysis of the active promoter mark H3K4me3 provided rich information regarding transcriptional programs in the tissue of origin. Can we gain information from additional chromatin marks that are associated with enhancer and gene activity? Mono- and di-methylation of H3 lysine 4 (H3K4me1 and H3K4me2, respectively) are found in two types of genomic regions: 1) promoter-flanking regions at the boundaries of regions marked with H3K4me3; 2) poised/active enhancers, where H3K4me3 is barely detected (Visel et al. 2009). cfChIP-seq of these marks recapitulates the expected distribution (Figures 1C, 1D). Around active promoters, H3K4me2 and H3K4me1 flank the main H3K4me3 peak and correlate with H3K4me3 (Figure S5A). Additionally, in a non-genic region adjacent to the IFNB1 locus, we clearly see H3K4me1 and H3K4me2 signal with little or no H3K4me3, coinciding with regions annotated as enhancers by ChromHMM (Roadmap Epigenomics Consortium et al. 2015) (Figure 6A). As expected, not all the blood enhancers in this region have cfChIP signal, since not all blood cells contribute to the cfDNA pool (e.g., B cells).
We chose to focus on H3K4me2 cfChIP-seq for enhancer analysis since this antibody had low background and high reproducibility (Figure S5B). Comparing cfChIP-seq of the same blood sample with the H3K4me2 and H3K4me3 antibodies shows expected differences (Figures 1C, 1D, 6A). While we see H3K4me3 cfChIP-seq signal almost exclusively at promoters, a large fraction of the reads from H3K4me2 cfChIP-seq are mapped to putative enhancer regions based on ChromHMM (Figure S5C).
We next examined enhancers’ tissue-specificity. We assigned each putative enhancer region to a cell type or combination of cell types. To ensure that we are not biased by TSS H3K4me2 signal or noise, we focused on enhancers of size larger than 600bp that are at least 5Kb from the nearest TSS and do not overlap a gene body. This smaller set of regions (48,525/2,345,831 regions) can be safely assumed to be enhancers. Using the predictions from Roadmap Epigenomics compendium, we assigned for each cell-type a set of distal enhancers that are annotated only in that cell-type. Examining the number of reads in these groups (Figure S5D) in healthy samples recapitulated the observations for H3K4me3 with high coverage for neutrophils and monocytes, less in liver and T-cells, and essentially none in heart and brain. Comparing H3K4me2 signal in two samples from a colorectal cancer patient (C002.1 and C002.2) against healthy samples, we again recapitulate the observations made above with H3K4me3 signal - the cancer samples have lower neutrophils and monocytes signal, and higher signal in colon-specific enhancers, which are barely present in healthy samples (Figure 6B).
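The three filtering criteria for distal enhancers (larger than 600bp, at least 5Kb from the nearest TSS, no overlap with a gene body) can be written as a small predicate. This is a sketch that assumes all coordinates lie on a single chromosome and use half-open intervals; the coordinates and helper names are hypothetical.

```python
from bisect import bisect_left


def distance_to_nearest_tss(start, end, sorted_tss):
    """Distance in bp from region [start, end) to the closest TSS."""
    i = bisect_left(sorted_tss, start)  # first TSS at or after `start`
    if i < len(sorted_tss) and sorted_tss[i] <= end:
        return 0  # a TSS lies inside the region
    left = start - sorted_tss[i - 1] if i > 0 else float("inf")
    right = sorted_tss[i] - end if i < len(sorted_tss) else float("inf")
    return min(left, right)


def is_distal_enhancer(region, sorted_tss, gene_bodies,
                       min_size=600, min_tss_distance=5000):
    """Apply the size, TSS-distance, and gene-body filters from the text."""
    start, end = region
    if end - start <= min_size:
        return False
    if distance_to_nearest_tss(start, end, sorted_tss) < min_tss_distance:
        return False
    return not any(g_start < end and start < g_end
                   for g_start, g_end in gene_bodies)


# Hypothetical annotation on one chromosome
tss_positions = [10_000, 50_000]
gene_bodies = [(10_000, 20_000), (50_000, 60_000)]
candidates = [
    (30_000, 31_000),  # passes all three filters
    (30_000, 30_400),  # too small
    (6_000, 7_000),    # within 5Kb of a TSS
    (15_000, 16_000),  # overlaps a gene body
]
kept = [r for r in candidates
        if is_distal_enhancer(r, tss_positions, gene_bodies)]
```

The linear scan over gene bodies is fine for a toy example. For millions of candidate regions an interval index would be the usual choice.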
We next examined the coordination between promoter marking with H3K4me3 and enhancer marking with H3K4me2. As expected, we find examples of coordinated promoter:enhancer activation (Figure 6C, CDX1). The intestine-specific transcription factor CDX1 is expressed weakly in colon but not in leukocytes and its expression is dramatically increased in GI cancers. Indeed, we observe strong promoter H3K4me3 signal only in cancer patients but not in healthy subjects (Figure S5E). The strong H3K4me3 signal in C002 promoter is accompanied by a pronounced H3K4me2 signal over large GI-specific enhancer regions in the vicinity of the gene only in C002 (Figure 6C), suggesting that CDX1 is activated through these enhancers in colon tissue. We also find examples for enhancer swapping (Figure 6D). The transcription factor TCF3 has strong promoter H3K4me3 signal in every subject tested including C002 (Figure S5E), and is expressed in many tissues including leukocytes. However, the H3K4me2 signal in the vicinity of the gene is strikingly different, with only cancer samples showing clear H3K4me2 signal in regions that correspond to putative colon enhancers (Figure 6D, regions E2 and E4). Interestingly, we only observe cancer-specific increase in H3K4me2 when the signal correspond to putative fetal colon enhancers (compare E1 with E2, E4, and E5), consistent with de-represion of fetal oncogenes. Altogether, these results suggest that CDX1 is activated through different enhancers in different tissues and these differences can be captured by cfChIP-seq and add important information on tissue of origin beyond the promoter signal.
Tri-methylation of H3 lysine 36 (H3K36me3) is found at the body of transcribed genes. Unlike H3K4me3, which marks transcription start sites at both poised and active genes, H3K36me3 requires active transcription elongation to be deposited, and is hence more indicative of gene activity (Guenther et al. 2007). Despite the high background of H3K36me3 in cfChIP (Figure S1D), we do observe the typical enrichment at gene bodies (Figures 1D and S6A) and the signal correlates with leukocyte H3K36me3 and RNA-seq (Figures S6B and S6C). Comparing the H3K36me3 signal from a healthy subject to that of a colorectal adenocarcinoma patient, we see 3580 genes whose H3K36me3 signal is significantly increased by at least 4-fold in the cancer sample (Figure S6D).
Genes with H3K36me3 mark in this cancer sample can be assigned to three main classes. Class I includes 3404 genes that are marked by both marks in healthy and cancer samples (e.g., DHX9, Figure 6E). The other two classes involve H3K36me3 marks observed only in the cancer sample. 1290 Class II genes are marked with H3K4me3 in both healthy and cancer samples (e.g., SAP18 and SKA1, Figure 6E) and provide new information beyond H3K4me3. In contrast, 163 Class III genes are not marked with either signal in healthy samples (e.g., VWA2, Figure 6E).
Contrasting the set of highly expressed genes in colorectal adenocarcinoma (COAD) from analysis of the TCGA expression profiles (Methods), with these three classes, we observe that each of them captures different parts of these sets (Figure 6F). Specifically, there are 24 COAD genes that were not detected by H3K4me3 and are detected by H3K36me3. Moreover, there are 48 COAD genes (10 in Class II and 38 in Class III) for which the change in H3K4me3 signal is further corroborated by H3K36me3 signal.
Altogether, these results demonstrate how cfChIP-seq can probe the state of various genomic functionalities including promoters, enhancers, and gene bodies. Each is highly informative on transcriptional activity in cells of origin, but in a complementary manner, suggesting increased information that can be obtained by combining data from different histone marks.
Discussion
With advances in sequencing technologies there is a growing interest in using cfDNA as a non-invasive assay for monitoring human physiology. Here we introduce cfChIP-seq, a new method to infer the transcriptional programs of dying cells from plasma cell-free nucleosomes. We have established the capability for genome-wide mapping of plasma cf-nucleosomes carrying histone marks associated with active transcription. We demonstrated that the cfChIP-seq signal is informative about the cf-nucleosomes’ tissue of origin, and used this information to detect various pathophysiological states including heart and liver injury, and cancer. Notably, our results show that cfChIP-seq detects activation of genes that are normally not expressed in healthy subjects, allowing us to identify abnormal transcriptional processes. The assay requires a modest amount of plasma (2ml or less) and low sequencing depth, is inexpensive and fully automatable, and provides a sensitive and robust signal. Moreover, the assay leaves most of the original sample intact, allowing reuse for multiple assays, such as genomic sequencing, methylation analysis, or cfChIP-seq with additional antibodies, which is important in situations where blood volume is a limiting factor.
Most current cfDNA-based methods rely on detecting genomic alterations in cfDNA to quantify the contribution of cfDNA from cells with altered genomic sequence, such as a fetus, a transplant, or mutated genes in tumors (Schwarzenbach, Hoon, and Pantel 2011; Sun et al. 2015; Lu and Liang 2016; Wan et al. 2017). Thus, these methods are biased towards a set of pre-selected genes, and are blind to events that involve turnover and death of cells whose genome is identical to the host genome. More recent approaches leverage epigenetic information in cell-free DNA. Extremely deep sequencing of total cfDNA identifies nucleosome and transcription factor positions (Snyder et al. 2016) and occupancy (Ulz et al. 2016), which reflect tissue of origin and gene expression. However, these approaches rely on detecting changes in coverage over target regions, with the signal of the source tissue superimposed on the background of normal cells (e.g., detection of an event causing nucleosome depletion in 10% of the cells requires 90% occupancy to be distinguished from 100% occupancy). Thus, such methods avoid sampling noise by using extremely deep sequencing coverage (hundreds of millions of reads per sample). Even with such sequencing depth, there is a prohibitively harsh detection limit for events in rare subsets of cells (Snyder et al. 2016). A promising alternative is assaying DNA CpG methylation along the sequence to identify cell of origin (Guo et al. 2017; Zemmour et al. 2018; Lehmann-Werman et al. 2016, 2018; Kang et al. 2017; Li et al. 2018; Xu et al. 2017; Shen et al. 2018). DNA methylation serves as a stable epigenetic memory and is largely unchanged upon dynamic cellular responses. As such, it is highly informative regarding cell lineage, but much less so about transient changes in expression. Moreover, unbiased analysis of DNA methylation requires high sequencing depth since most CpGs are methylated.
cfChIP-seq has the potential to circumvent some of these limitations. Targeted enrichment of active marks results in a dramatically reduced representation of the genome, such that fewer sequencing reads (∼two orders of magnitude fewer) are required to obtain an informative signal. Since we target marks associated with active transcription, we are assaying a positive signal, where few reads are indicative of the presence of a particular cell type or expression program. This is in contrast to methods such as occupancy or DNA methylation that either measure negative signal (lack of nucleosome occupancy) or both negative and positive signals (e.g., %methylated).
Intensive research during the last two decades established the connection between specific histone marks and chromatin-templated processes including transcription, replication, and damage repair. Leveraging this rich and complex information for circulating cfDNA analysis has the potential to unravel physiological processes in remote organs, such as cell proliferation, hypoxia, inflammation, metabolic changes, and cancerous transformation, with minimal invasiveness. All of these processes involve activation of large transcriptional programs, which leave a unique imprint on chromatin.
Assaying modified cf-nucleosomes, either alone or in combination with existing biomarkers, has multiple potential medical applications, such as disease detection (e.g., detecting unknown tumors), improved diagnosis (e.g., replacing tissue biopsy with liquid biopsy), and non-invasive monitoring of disease progression and treatment efficacy. Moreover, the use of a minimally invasive, easy-to-collect assay opens up a wide range of opportunities for studying basic questions in human physiology that have not been accessible until now.
Materials and Methods
Patients
All clinical studies were approved by the relevant local ethics committees. The study was approved by the Ethics Committees of the Hebrew University-Hadassah Medical Center of Jerusalem. Informed consent was obtained from all subjects or from their legal guardians before blood sampling.
Sample collection
Blood samples were collected in VACUETTE® K3 EDTA tubes and transferred immediately to ice, and 1X protease inhibitor cocktail (Roche) and 10mM EDTA were added. The blood was centrifuged (10 minutes, 1500 × g, 4°C), the supernatant was transferred to fresh 14ml tubes and centrifuged again (10 minutes, 3000 × g, 4°C), and the resulting supernatant was used as plasma for ChIP experiments. The plasma was used fresh or flash frozen and stored at −80°C for long-term storage.
cfChIP-seq
Bead preparation
50μg of antibody were conjugated to 5mg of epoxy M270 Dynabeads (Invitrogen) according to manufacturer instructions. The antibody-beads complexes were kept at 4°C in PBS, 0.02% azide solution.
Immunoprecipitation, NGS library preparation, and sequencing
0.2mg of conjugated beads (∼2μg of antibody) were used per cfChIP-seq sample. The antibody-beads complexes were added directly into the plasma (1-2 ml of plasma) and allowed to bind to cf-nucleosomes by rotating overnight at 4°C. The beads were magnetized and washed 8 times with blood wash buffer (BWB: 50mM Tris-HCl, 150mM NaCl, 1% Triton X-100, 0.1% Sodium DeoxyCholate, 2mM EDTA, 1X protease inhibitors cocktail), and three times with 10mM Tris pH 7.4. All washes were done with 150μl buffer on ice by shifting the beads from side to side on a magnet. Note: Do not use vacuum to remove supernatant during washes in buffers that do not contain detergents.
On-beads chromatin barcoding and library amplification were done as previously described (Lara-Astiaso et al. 2014; Gutin et al. 2018) except for the DNA elution and cleanup step, where the beads were incubated for 1 hour at 55°C in 50μl of chromatin elution buffer (10mM Tris pH 8.0, 5mM EDTA, 300mM NaCl, 0.6% SDS) supplemented with 50 units of proteinase K (Epicenter), and the DNA was purified by 0.9 X SPRI cleanup (AMPure XP, Agencourt). The purified DNA was eluted in 25 μl EB (10mM Tris pH 8.0) and 23 μl of the eluted DNA were used for PCR amplification with Kapa hotstart polymerase (16 cycles). The amplified DNA was purified by 0.8 X SPRI cleanup and eluted in 12 μl EB. The eluted DNA concentration was measured by Qubit and the fragment size was analysed by TapeStation visualization. Note: If too many adapter dimers are still visible by TapeStation after library amplification, we recommend pooling samples and performing an additional 0.8 X SPRI DNA cleanup, or separating the pooled samples on a 4% agarose gel (E-Gel® EX Agarose Gels, 4%, Invitrogen) and gel-purifying fragments larger than adapter dimers (>150bp). DNA libraries were paired-end sequenced on an Illumina NextSeq 500.
Sequence analysis
Reads were aligned to the human genome (hg19) using bowtie2 with ‘no-mixed’ and ‘no-discordant’ flags. We discarded reads with low alignment scores and duplicate fragments. See Table S1 for read number, alignment statistics, and numbers of unique fragments for each sample.
Roadmap Epigenome atlas
We downloaded aligned read data from the Roadmap Epigenome Consortium database (Table S11). For our analysis we discarded pre-natal, ESC, and cell-line samples, resulting in 64 tissues and cell types (Table S12). The aligned read files were then processed with the same pipeline as cfChIP-seq samples.
Tumor-type Gene Signatures
We downloaded RNA-seq data from the TCGA and GTEx projects as analysed by the Xena project (Vivian et al. 2017) (Table S11). We defined the set of genes that are over-expressed in a tumor type to satisfy three requirements: 1) Significantly higher expression in tumor samples compared to the corresponding tissue samples (t-test, q < 0.001 after FDR correction); 2) Significantly higher expression compared to all healthy samples (t-test, q < 0.001 after FDR correction); and 3) Median expression in the tumor is higher than median expression in each of the healthy samples.
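The three requirements can be combined as in the following minimal sketch. It assumes that per-gene p-values from the two t-tests (tested in the direction of higher tumor expression) are already available, and it uses the Benjamini-Hochberg procedure as the FDR correction; the text does not name the specific procedure, so this choice, along with the function names and the input layout, is an illustrative assumption.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted q-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that q-values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


def overexpressed_genes(stats, q_cutoff=0.001):
    """Select genes that satisfy all three requirements.

    stats maps gene -> dict with (hypothetical) keys:
      p_vs_tissue     p-value, tumor vs. corresponding normal tissue
      p_vs_healthy    p-value, tumor vs. all healthy samples
      tumor_median    median expression in the tumor type
      healthy_medians list of median expression values in healthy tissues
    """
    genes = list(stats)
    q_tissue = benjamini_hochberg([stats[g]["p_vs_tissue"] for g in genes])
    q_healthy = benjamini_hochberg([stats[g]["p_vs_healthy"] for g in genes])
    selected = []
    for gene, qt, qh in zip(genes, q_tissue, q_healthy):
        s = stats[gene]
        above_all = all(s["tumor_median"] > m for m in s["healthy_medians"])
        if qt < q_cutoff and qh < q_cutoff and above_all:
            selected.append(gene)
    return selected
```

Applying the correction separately to each of the two comparisons is one reasonable reading of the text; a joint correction would be equally easy to substitute.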
Expected healthy expression level
To best emulate expression profiles, we performed an in silico mix of the four cell types that contribute the most to cfDNA (Moss et al. 2018): neutrophils, 32%; monocytes, 32%; erythrocyte progenitors, 20%; and NK cells, 5%. The gene expression for these cell types was downloaded from the BLUEPRINT consortium website (Table S11).
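An in silico mix of this kind amounts to a weighted average of per-cell-type expression profiles. The sketch below uses the proportions stated above; because they sum to 89%, it renormalizes the weights to sum to 1, which is an assumption rather than a documented step. The function name and data layout are likewise illustrative.

```python
# Mixing proportions as stated in the text (they sum to 0.89).
MIX = {
    "neutrophils": 0.32,
    "monocytes": 0.32,
    "erythrocyte_progenitors": 0.20,
    "nk_cells": 0.05,
}


def expected_expression(profiles, weights=MIX):
    """Weighted average of expression profiles.

    profiles maps cell type -> {gene: expression}. Only genes present in
    every mixed cell type are reported. Weights are renormalized to sum
    to 1 (an assumption made for this sketch).
    """
    total = sum(weights.values())
    shared = set.intersection(*(set(profiles[c]) for c in weights))
    return {
        gene: sum(weights[c] * profiles[c][gene] for c in weights) / total
        for gene in sorted(shared)
    }
```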
TSS/Enhancer location catalogue
We downloaded the Roadmap Epigenome Consortium ChromHMM annotation of all consolidated tissues (Table S11). Using these annotations we constructed a catalogue of potential functional sites (enhancers, TSSs, and genes). We extended the catalogue to include 3kb regions centered on the TSSs of annotated transcripts in the UCSC gene database and ENSEMBL transcript database (Table S11). We used the combined catalogue to define regions along the genome. We used a different version of the catalogue for the analysis of each antibody, to match the mark. For H3K4me3 analysis we used only TSSs, for H3K36me3 analysis we used only gene bodies, and for H3K4me2 we used annotations of both TSSs and enhancers. In each version of the catalogue, the remaining mappable genome regions were assigned to background, and tiled in 5kb windows. See Supplemental Note for more detailed procedures.
We quantified the number of reads covering each region in the catalogue in each of our samples and atlas samples. We estimated a locally adaptive model of non-specific reads along the genome for each of the samples, and extracted counts that represent specific ChIP signal in the catalogue for each sample (Supplemental Note). These were then normalized (Supplemental Note) and scaled to 1M reads in the reference healthy samples.
Tissue Signatures
To define tissue specific signatures of a specific modification, we examined binned representation of the atlas according to our catalogue. For each tissue we defined a signature of unique windows with signal in one of the samples of the target tissue and without coverage in all others (Supplemental Note).
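The signature definition can be illustrated with simple set operations. This is a sketch only: the threshold for calling a window "with signal" (`min_signal`) and the data layout are assumptions, and the actual criteria are given in the Supplemental Note.

```python
def tissue_signature(atlas, tissue_of, target, min_signal=1.0):
    """Windows with signal in the target tissue and no coverage elsewhere.

    atlas     maps sample -> {window: normalized signal}
    tissue_of maps sample -> tissue label
    A window qualifies if at least one target-tissue sample reaches
    min_signal (hypothetical threshold) and every sample from any other
    tissue has zero coverage in that window.
    """
    in_target = set()
    in_others = set()
    for sample, windows in atlas.items():
        if tissue_of[sample] == target:
            in_target |= {w for w, v in windows.items() if v >= min_signal}
        else:
            in_others |= {w for w, v in windows.items() if v > 0}
    return sorted(in_target - in_others)
```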
Gene level analysis
For each gene we defined the set of windows that match the gene (TSS in H3K4me3/2 and gene body in H3K36me3). The signal for a gene is the aggregate signal-background over windows associated with it (Supplemental Note).
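As a minimal sketch of this aggregation, assuming per-window read counts and per-window background estimates are already available (the names and data layout below are illustrative, and the exact estimator is described in the Supplemental Note):

```python
def gene_signal(window_counts, window_background, gene_windows):
    """Sum of (observed count - estimated background) over a gene's windows.

    window_counts     maps window -> observed reads
    window_background maps window -> expected non-specific reads
    gene_windows      maps gene -> list of windows assigned to the gene
    """
    return {
        gene: sum(window_counts[w] - window_background[w] for w in windows)
        for gene, windows in gene_windows.items()
    }
```

Noise can make the difference negative for weakly covered genes; whether such values are floored at zero is left open here.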
Statistical analysis
The first test is whether a signature is present in a sample (Figure 4). Formally, we examined whether we can reject the null hypothesis that the number of reads in signature windows follows a Poisson distribution determined by the background rate (Supplemental Note). We compute the p-value of the observed number of reads in signature windows as the probability of observing this number or higher under the null hypothesis. Rejection of the null hypothesis for a specific signature is an indication that some of the windows in the signature carry the modification in question in a subpopulation of cells contributing to the cf-nucleosome pool.
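The upper-tail Poisson probability used in this test can be computed as sketched below. The function name is hypothetical, and the expected count under the null (`lam`) is assumed to have been derived from the sample's background rate as described in the Supplemental Note.

```python
import math


def poisson_upper_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam), with k a non-negative integer."""
    if k <= 0:
        return 1.0
    if lam <= 0:
        return 0.0

    def pmf(i):
        # Computed in log space to avoid overflow in lam**i and i!.
        return math.exp(i * math.log(lam) - lam - math.lgamma(i + 1))

    if k <= lam:
        # Most of the mass lies at or above k: subtract the lower tail.
        return 1.0 - sum(pmf(i) for i in range(k))
    # Above the mean the terms shrink monotonically, so sum the upper
    # tail directly; this keeps precision for very small p-values.
    total, i, term = 0.0, k, pmf(k)
    while term > total * 1e-16:
        total += term
        i += 1
        term *= lam / i
    return total
```

For example, `poisson_upper_tail(observed_reads, expected_background_reads)` would give the p-value for one signature in one sample.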
The second test is whether a gene presents a high signal with respect to its level in healthy baseline subjects (Figure 5B). We use average signal from 5 healthy samples to define the average number of reads in each window. We then estimate two sample-specific parameters: 1) background rate (discussed above) and 2) a scaling factor that rescales average expectations to the sequencing depth of the specific sample (Supplemental Note). Together, these define the expected coverage of each gene-associated group of windows under the null-hypothesis that the subject is from the healthy population. We compute p-value of the actual number of observed reads in the gene windows as the probability of having this number or higher according to the null hypothesis.
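One way to write this null model explicitly is shown below. The notation is introduced here for illustration and is an assumption: $W_g$ is the set of windows associated with gene $g$, $\bar{h}_w$ the average healthy signal in window $w$, $\beta_w$ the sample-specific background in that window, $\alpha$ the sample-specific scaling factor, and $n_g$ the observed number of reads in the gene's windows. The additive combination of background and scaled signal is likewise an assumption; the exact model is given in the Supplemental Note.

```latex
\lambda_g = \sum_{w \in W_g} \left( \beta_w + \alpha \, \bar{h}_w \right),
\qquad
p_g = \Pr\left[ X \ge n_g \right]
    = 1 - \sum_{k=0}^{n_g - 1} \frac{e^{-\lambda_g} \lambda_g^{k}}{k!},
\qquad X \sim \mathrm{Poisson}(\lambda_g)
```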
Pathways and complexes
We downloaded a large collection of gene expression signatures representing different cellular processes, protein complexes, and transcriptional responses (Liberzon et al. 2015; Drew et al. 2017; Giurgiu et al. 2019; Kamburov et al. 2013) (Table S11). Each such pathway (or complex) is represented by a list of genes participating in it. Not all of the pathways represented genes with coherent co-expression, and so we filtered the compendium to include only pathways that behave coherently across the 64 Roadmap Epigenomics reference samples (Table S6; Supplemental Note).
The score assigned to a pathway in a sample is the sum of the normalized signal for the genes in the pathway. We evaluated each pathway on 33 healthy samples to estimate the distribution of scores in the healthy reference. We then evaluated scores in other samples using the Z-score relative to this distribution. The significance of scores was determined by a two-tailed test, assuming Z-scores follow a normal distribution.
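The scoring scheme can be sketched as follows. The function names are illustrative, and the use of the sample standard deviation (rather than the population one) for the healthy reference is an assumption.

```python
from statistics import NormalDist, mean, stdev


def pathway_score(gene_signal, pathway_genes):
    """Sum of normalized signal over the genes in a pathway."""
    return sum(gene_signal.get(g, 0.0) for g in pathway_genes)


def pathway_zscore(sample_score, healthy_scores):
    """Z-score against the healthy reference and its two-tailed p-value."""
    mu = mean(healthy_scores)
    sd = stdev(healthy_scores)  # sample standard deviation (assumption)
    z = (sample_score - mu) / sd
    p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p
```

In use, `healthy_scores` would hold the scores of one pathway across the healthy reference samples, and `sample_score` the score of the same pathway in the sample under evaluation.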
Acknowledgements
We thank A. Appleboim, J. Mosses, O.J. Rando, A. Regev, and members of the Friedman lab for comments on this manuscript. We thank L. Friedman for help with illustrations and graphics. This work was supported by: European Research Council’s AdG Grants “ChromatinSys” (NF) and “RxmiRcanceR” (EG); Israel Science Foundation’s I-CORE program grant 1796/12 (TK and NF) and grants 2612/18 (NF); NIH Grants RM1HG006193 (NF); and Israel Innovation Authority’s Kamin grant 63381 (NF). EG is also supported by a number of ISF grants, Israeli MOS grant and an NIH grant. A patent application for cfChIP-seq has been submitted by Yissum, the Hebrew University of Jerusalem.