SUMMARY
Embryogenesis requires epigenetic information that allows each cell to respond appropriately to developmental cues. Histone modifications are core components of a cell’s epigenome, giving rise to chromatin states that modulate genome function. Here, we systematically profile histone modifications in a diverse panel of mouse tissues at 8 developmental stages from 10.5 days post conception until birth, performing a total of 1,128 ChIP-seq assays across 72 distinct tissue-stages. We combine these histone modification profiles into a unified set of chromatin state annotations, and track their activity across developmental time and space. Through integrative analysis we identify dynamic enhancers, reveal key transcriptional regulators, and characterize the role of chromatin-based repression in developmental gene regulation. We also leverage these data to link enhancers to putative target genes, revealing connections between coding and non-coding sequence variation in disease etiology. Our study provides a compendium of resources for biomedical researchers, and achieves the most comprehensive view of embryonic chromatin states to date.
MAIN TEXT
Embryonic development relies on a complex interplay between genetic and epigenetic factors. While genetic information encoded in DNA sequence provides the instructions for an embryo to develop, epigenetic information is required for each cell type in an embryo to obtain its specialized function from a single set of instructions. More specifically, the functional output of a given DNA sequence depends on its chromatin state, which is defined in large part by posttranslational modifications of histone proteins1,2. These histone modifications take many forms, and they can influence gene expression by directly modulating DNA-histone interactions, or serving as binding sites for cognate reader proteins that function downstream to enhance or repress transcription3. Importantly, histone modifications allow for the transmission of gene activity states through cell division4, which is crucial for the maintenance of cellular identity. Given the importance of histone modifications in gene regulation, it is not surprising that mutations in genes that encode histones or histone modifying enzymes can lead to a wide variety of diseases including cancer, neurodevelopmental disorders, and congenital malformations5-7.
Histone modifications have also proven to be valuable markers for genome annotation8,9. Different types of functional sequences have distinct patterns of histone modifications, which can be profiled using Chromatin Immunoprecipitation sequencing (ChIP-seq). For example, chromatin around active promoters is characterized by H3K4me3 and H3K27ac, while chromatin around active enhancers is often marked by H3K4me1 and H3K27ac10-12. While the profiles of individual histone modifications can be informative, the most comprehensive functional annotations are achieved when multiple histone modifications are integrated into a unified set of chromatin states based on combinatorial patterns13-15. Each region in the genome can then be classified according to its chromatin state, providing insight into its potential function.
Profiling of histone modifications across developmental time and tissues
Despite the importance of chromatin states in determining the functional output of the genome, a comprehensive survey of chromatin states during mammalian embryonic development has been lacking. The mouse is an ideal system to study the developmental epigenome because it is the most widely used animal model in biomedical research, and offers experimental access to early stages of embryonic development. We harvested mouse tissues at closely spaced intervals from 10.5 days post conception (E10.5) until birth, which is roughly equivalent to gestational week 3 of human development through the end of the first trimester of pregnancy. At each stage, we dissected a diverse panel of tissues from multiple litters of C57BL/6N embryos and performed 2 replicates of ChIP-seq for each of 8 histone modifications (H3K4me1, H3K4me2, H3Kme3, H3K27ac, H3K27me3, H3K9ac, H3K9me3, H3K36me3) (Figure 1a-b; Extended data figure 1). These include all six “core” histone modifications required by the International Human Epigenome Consortium (IHEC) to generate a reference epigenome16,17. At the earliest stage sampled (E10.5) it was not feasible to harvest enough tissue for standard ChIP-seq, so we used a modified micro-ChIP-seq procedure designed to work on much smaller numbers of cells and restricted our scope to only these 6 histone modifications. The complete data series includes more than 66 billion sequencing reads from 564 ChIP-seq experiments, each consisting of two biological replicates derived from different embryo pools (N=1,128 replicates total). All datasets were generated according to ENCODE standards and processed with a standardized pipeline that only outputs peaks passing strict reproducibility criteria18 (see Methods; Extended data figure 2a). In total, we identified more than 45 million peaks covering roughly one-third of the mouse genome (Extended data figure 2b-d).
We observed several high-level features of the data series that warrant comment here. First, as expected, the landscape of histone modifications is highly variable between tissues, particularly for marks of activity such H3K27ac and H3K9ac (Figures 1c, 2a; Extended data figure 3d). We observe that tissues with closer developmental origins tend to have similar histone modification profiles. For example, tissues of the Central Nervous System (CNS) can be easily distinguished from other tissues by hierarchical or k-means clustering (Figures 2a-b), as well as by principle component analysis (Extended data figure 2e). In addition to differences between tissues, histone modification profiles change progressively within each tissue during development (Figures 1d, 2c-d, Extended data figure 3). These developmental dynamics likely reflect at least two underlying biological processes: 1) changes in the epigenetic landscape of individual cells within a tissue as they undergo differentiation, and 2) shifts in the relative abundance of different cell types that compose a tissue. Although we cannot separate the contributions of these two factors, many of the epigenetic changes we observe reflect known hallmarks of cellular differentiation. For example, we observe a dramatic shift in activity from Myh7 to Myh6 in heart at P0, which is known to occur in cardiomyocytes just prior to birth19 (Figure 1d). In the developing forebrain, we observe that markers of mature neurons such as NeuroD2 and Gad1 acquire active histone modifications during development, and concomitantly lose repressive modifications20,21 (Figure 1d; Extended data figure 3a). We observe the opposite trend at genes encoding key cell cycle factors such as Ccnb1 and Cdk2, where active histone modifications decrease during forebrain development as neurons mature and exit the cell cycle (Extended data figure 3b-c). As a general trend, we observe that regions of histone modification restricted in developmental time or space are more likely to be distal to Transcriptional Start Sites (TSS; Extended data figure 3d). This is consistent with the model that TSS-distal regulatory elements play a key role in determining cellular identity22,23.
An integrated 15-state model characterizes the developmental chromatin landscape
To leverage the information captured by combinatorial patterns of histone modifications (i.e. chromatin states) we applied ChromHMM13 to the data series, excluding E10.5 because this stage did not have the full complement of histone modification profiles. We first determined the optimal number of chromatin states to describe the data (Figure 3a, Extended data figure 4a-c, and Methods). Multiple approaches converged on a 15-state model, which shows near-perfect consistency between biological replicates, and is in general agreement with previously published models developed from partially overlapping sets of histone modifications13,16,24 (Extended data figure 4d-g). Although our chromatin state designations were made without considering gene locations, the distribution of each state relative to genes matches expectations25 (Figure 3b). We grouped the resulting chromatin states into 4 broad functional classes: promoter states, enhancer states, transcriptional states, and heterochromatic states. It is important to note that these chromatin state annotations are specific to this study, and are distinct from the ENCODE Consortium’s candidate regulatory element registry and Encyclopedia (ENCODE Consortium, in preparation). In total, we find that ~33.4% of the genome shows a chromatin signature characteristic of one of these four functional classes (i.e. excluding state 15 “no signal”, and state 11 “permissive”) in at least one tissue or developmental stage. This does not necessarily mean that 33.4% of the genome sequence is functional, but rather that 33.4% of the genome sequence is packaged in chromatin that shows a reproducible signature in at least one of the tissues or stages profiled here.These chromatin signatures reflect transcriptional and/or regulatory activity, but the sequences they mark may not be functional by strict definition. For example, the transcriptional elongation state (state #10) covers the introns of many highly transcribed genes, but much of the sequence in these introns may not contribute to the fitness of the organism.
The chromatin state maps provide powerful tools for genome annotation, as a broad range of inferred functional information can be visualized across multiple tissues and stages in a single genome browser view (Figure 3c-e; Extended data figure 5a-b). For example, we see that active chromatin states robustly and specifically mark master regulators of tissue development including Gata4 and Nkx2-5 in heart26, Nkx2-1 in lung27, Cdx2 in intestine28, Barx1 in stomach29, and Wt1 in kidney30. While examining chromatin state maps at these genes that code for tissue-restricted transcription factors (TFs), we noted that they are frequently marked by a heterochromatic state characteristic of Polycomb-mediated repression (state 13, Hc - Polycomb associated) outside of their tissues of activity. Polycomb-mediated repression is known to be essential for embryogenesis31-33, but the genomic distribution of repressive chromatin during mammalian development has not been thoroughly examined. To further investigate the role of Polycomb-mediated repression in development, we used GREAT34 to examine Gene Ontology (GO) terms associated with regions of Polycomb repressive chromatin. We found that Polycomb repressive chromatin is highly enriched near TF genes in every tissue and stage surveyed here (Figure 4a). Moreover, TF genes that are underlie human Mendelian diseases35 are particularly likely to be marked by repressive chromatin (Figure 4b), suggesting that these genes are tightly regulated by a combination of repression and activation. We also examined a small but well-characterized set of genes (Sox9, Pax3, Shh, Ihh, and Wnt6) that are known to cause human congenital phenotypes when ectopically expressed during development36-38. Notably, every one of these genes is widely marked by repressive chromatin in tissues where they are not active (Figure 4b; Extended data figure 5c). These data support an essential and pervasive role for Polycomb-mediated repression in silencing key developmental regulators outside of their normal expression domains, which may function in turn to control broader programs of gene expression39. We note that the heterochromatic state characterized by H3K9me3 (state #14, Hc - H3K9me3) and likely representative of facultative heterochromatin shows a very different distribution, as previously described40-44 (Extended data figure 6).
The breadth of data collected here allowed us to globally characterize both the spatial and temporal dynamics of chromatin states. On average, ~1.2% of the genome differs in chromatin state between tissues at the same stage (mean 1.2%, 31.3Mb; range 1.0%-4.0%, 26.8Mb-109.1Mb), but that percentage jumps to 17.2% if we discard the portion of the genome that shows no chromatin signal across all tissue-stages (i.e. constitutive state 15). Enhancer states show the highest degree of variability between tissues, reinforcing the role of enhancers in defining tissue and cell identity (Extended data figure 7a). Indeed, hierarchical clustering of our samples based on strong enhancer predictions alone (i.e. state 5) leads to clear separation by tissue, and captures relationships between tissues with similar developmental origin (Extended data figure 7b-d). Interestingly, this analysis revealed a strong relationship between limb and facial tissue, which was also observed in clustering results based on single histone modifications (Figure 2a-b). These findings are suggestive of a deep similarity in gene regulatory programs between these two tissues, and provide further support for the hypothesis of a common developmental origin for facial structures and limbs45,46. Within a given tissue, we find that on average ~1.3% of the genome differs in chromatin state between adjacent developmental stages (mean 1.3%, 36.6Mb; range 0.03%-3.01%, 9.4MB-82.1Mb). While the fraction of the genome that differs in state between adjacent stages is roughly the same as the fraction that differs between tissues at the same stage, differences between stages are more subtle, as they are less likely to affect states characteristic of strong enhancers, promoters, or heterochromatin (States 5, 1, and 13, respectively; Extended data figure 7e). Nonetheless, these temporal changes capture important developmental processes like the maturation of Alveolar Type I cells (marked by Aqp527) just prior to the onset of pulmonary gas exchange at birth (Figure 3e), and the transition of predominant liver function from hematopoiesis in the early embryo to metabolism in the later stages of development47 (Extended data figure 7f). A global summary of temporal changes in chromatin state reveals the tendencies of certain states to transition to others, including shifts from poised to active enhancers, bivalent to active promoters, and permissive to repressed chromatin (Figure 4c, Extended data figure 7g).
To gain further insight into the dynamic processes occurring in each tissue, we focused specifically on the subset of enhancers that show dynamic activity across stages (referred to as “dynamic enhancers” below). To identify dynamic enhancers, we began by compiling lists of putative enhancers in each tissue (i.e. chromatin state 5). Next, we calculated a stage-specific activity score for each enhancer based on its level of H3K27ac enrichment, and used these activity scores to identify enhancers that change activity from stage-to-stage48 (see methods). Within each tissue, clustering of these dynamic enhancers revealed several predominant temporal patterns of activity (Extended data figure 8). Considering forebrain as an example, we found four predominant patterns of temporal activity, which reflect underlying developmental processes (Figure 4d-g). Enhancers that peak in activity early in development are associated with GO terms related to general CNS and brain development (Figure 4d, cluster A), while enhancers that peak in middle stages are associated more specifically with neurogenesis and gliogenesis (Figure 4d, clusters B-C), and enhancers peaking latest in the developmental series are associated with synaptic function (Figure 4d, cluster D). By comparing the TF binding motifs enriched in each cluster to the expression levels of TF genes that target these motifs, we identified specific TFs that likely regulate the biological processes revealed by GO analysis. Consistent with published reports, these data suggest key roles for Sox2 in early brain development49,50, NeuroD6 in neurogenesis during mid-late gestration51, and Mef2c in synapse formation52. Notably, the human orthologs of Mef2c and Sox2 are Mendelian disease genes linked to congenital defects of the CNS (MIM # 613443 and #206900, respectively). Similar catalogs for each tissue are available in Extended data figure 8 and Table S1, including dynamic enhancer clusters, corresponding GO terms, and enriched TFs.
Linking Enhancers to Target Genes and Phenotypes
Predicting the target genes of enhancers remains a major challenge in the field, in part because enhancers can be located hundreds of kilobases away from the genes they regulate22,23. To address this challenge, we developed a map of predicted enhancer target genes based on the correlation between gene expression (as measured by RNA-seq) and H3K27ac enrichment at TSS-distal enhancers within the same Topologically-Associating Domain (TAD) (Figure 5a; Extended data figure 9a-b; Table S2). TADs have been shown to constrain the activity of enhancers to the genes in the same domain36,37,53. We assigned enhancers to target genes within the same TAD based on Spearman’s rank Correlation Coefficient (SCC) between H3K27ac enrichment and mRNA expression across all 66 tissue-stages (Extended data figure 9a) (p-value <= 0.05 and SCC >= 0.25; see methods). We took advantage of biological replicates for both the chromatin and RNA-seq data to derive independent target gene maps for each replicate. These maps include 31,965 and 32,735 enhancer-gene assignments, respectively, with an overlap 21,142 (Extended data figure 9c-e). Overall, these target gene maps are highly reproducible (SCC = 0.71 between replicates, evaluated as the number of enhancers per gene). Interestingly, we found that genes with 5 or more assigned enhancers are associated with biological processes related to development and differentiation, whereas genes with only a single assigned enhancer tend to be involved in cellular and metabolic processes more consistent with “housekeeping” functions (Extended data figure 9f). This result further highlights the importance of enhancers in directing the expression of developmental genes.
To further evaluate the accuracy of our enhancer target gene maps, we made use of published data from chromatin conformation capture technologies54-56, which assay physical interactions between promoters and enhancers. We found that our correlation-based map predicts experimentally determined enhancer-gene interactions with higher accuracy than assigning an enhancer to the nearest adjacent gene (Figure 5b; Extended data figure 9g). We also sought to determine whether our map of enhancer target genes in mouse could be useful for predicting human enhancer-gene interactions. To this end, we identified human orthologs of the enhancers and genes in our mouse map, and used data from the GTEx project to assess whether the enhancer-gene relationships inferred in mouse are likely to be conserved in human. The GTEx project has identified a large collection of human expression quantitative trait loci (eQTLs)57, which are known to be enriched in enhancers and other regulatory elements58. Importantly, each eQTL is connected to a target gene by genetic association. Thus, we hypothesized that if the enhancer target genes we predict in mouse are applicable to human, we should see an enrichment for eQTLs that link the human orthologs of these enhancers to the same target genes by genetic association. Indeed, across a variety of human tissues we see a significant enrichment of eQTLs that support our enhancer target gene predictions (Figure 5c).
With this set of enhancer target gene predictions in hand, we again focused on the subset of genes known to cause Mendelian disease when mutated (“OMIM genes”)35. Mendelian diseases are predominantly caused by highly penetrant mutations that often effect coding sequence59. In contrast, SNPs associated with lower-penetrance non-Mendelian phenotypes by Genome-Wide Association Studies (GWAS) are highly enriched in putative regulatory sequences60. This suggests that dysregulation of gene expression is a common mechanism contributing to non-Mendelian phenotypes, which include diseases with enormous human health burden like coronary heart disease61 and Type 2 diabetes62. We hypothesized that if enhancers which regulate Mendelian disease genes are disrupted by common sequence variants (i.e. SNPs), it might lead to less penetrant non-Mendelian phenotypes. To explore this hypothesis, we examined the frequency of GWAS SNPs (i.e. SNPs associated with a human phenotype by GWAS) in enhancers linked to OMIM genes by our correlative map, and compared this with the frequency of GWAS SNPs in enhancers linked to non-OMIM genes. Indeed, we find a highly significant enrichment for GWAS SNPs in enhancers linked to OMIM genes (p=4.3e-8, chisquared test). We also compared enhancers linked to OMIM genes with enhancers in the same TADs that are not linked to OMIM genes. Again, we see a significant enrichment for GWAS SNPs in the enhancers linked to OMIM genes (p=7.29e-4, chi-squared test; Extended data figure 9h; Supplementary Table S4). These results points to a systematic relationship between highly penetrant mutations that lead to Mendelian disease, and lower penetrance enhancer variants that may impact the regulation of those same genes and lead to more common phenotypes. For example, in Figure 5d we highlight Bcl11a. Coding mutations in the human ortholog (BCL11A) cause Dias-Logan syndrome (MIM# 617101) – also known as “Intellectual developmental disorder with hereditary persistence of fetal hemoglobin.” The Bcl11a locus contains two large clusters of putative enhancers that are active in the CNS and embryonic liver (a site of erythropoiesis), respectively. Indeed, the human ortholog of the liver enhancer cluster has been extensively characterized in human blood lineages63,64, and a human sequence orthologous to the CNS enhancer cluster has been validated in vivo65. Interestingly, the human liver enhancer ortholog contains nine total GWAS associations, eight of which are related to hemoglobin levels (the 9th association is to schizophrenia). In contrast, the CNS enhancer orthologs contains three GWAS associations, all with phenotypes related to neurodevelopment or cognitive function (Attention deficit hyperactivity disorder, Educational attainment). These observations suggest that while Bcl11a coding mutations lead to a pleiotropic syndrome with manifestations in both the CNS and hematopoietic systems, non-coding regulatory variants at the same locus may lead to non-syndromic phenotypes that predominantly effect one system or the other, depending on which enhancer is affected.
Relationship between H3K27ac enrichment and enhancer validation rate
In several of the analyses described above, we have used chromatin state to study putative enhancers. Chromatin states (and even H3K27ac alone) have proven to be effective tools for identifying enhancers14,48,66,67, but the accuracy of these methods remains under investigation. The putative enhancers identified in this study are supported by multiple lines of evidence including the GO and motif enrichments discussed above (Figure 4f) as well as tissue-specific enrichments for in vivo validated enhancers from the VISTA database65 (Extended data figure 10a). Anecdotally, we have noted that putative enhancers with stronger H3K27ac enrichment are more likely to validate in transgenic mouse reporter assays, which is an important consideration because the level of H3K27ac enrichment at putative enhancers can vary by orders of magnitude within a single tissue or cell type. To more systematically examine this relationship between H3K27ac enrichment level and validation rate, we used transgenic mouse assays68 in E12.5 embryos to test 150 enhancers predicted from E12.5 forebrain, heart, or limb tissue that fell into one of three H3K27ac enrichment rank categories: high (selected from ranks 1-85), medium (ranks 1500-1550), or low (ranks 3000-3050) (Figure 6, Extended data figure 10b, Supplementary Table S5). We found that ~60% of the high-ranking enhancers displayed reproducible reporter gene expression in the expected tissue. However, <30% of predicted enhancers from the two lower-rank groups validated (Figure 5a, p-values < 0.01, Fisher’s exact test). These results are consistent with a similar validation of enhancers predicted from e11.5 tissues (ENCODE Consortium, in preparation). In addition, high-ranking enhancers that validated in the expected tissue were also more likely to have enhancer activity in additional tissues (Figure 6b, p-values < 0.05, Mann-Whitney U Test). Importantly, no significant differences were observed in reproducibility rates (i.e. the percentage of embryos with reproducible activity) between rank classes (Extended data figure 10c). Overall, these results demonstrate that loci with stronger H3K27ac enrichment are more likely to direct reporter expression in the expected tissue, and tend to direct reporter expression in a wider variety of tissues.
The results of these transgenic reporter assays provide an intriguing perspective on one of the most important questions in the field: what portion of sequences marked by an enhancer chromatin signature truly function as enhancers in vivo? Surveys of chromatin state and chromatin accessibility in a single tissue or cell type often predict enhancers numbering in the tens or even hundreds of thousands. In the past this has led us and others to estimate the total number of enhancers in the genome at several hundred thousand or more66,69. In contrast, the results of our transgenic assays suggest that outside of the top enhancers in a given tissue (as ranked by H3K27ac enrichment), the validation rate in this assay drops to roughly one-third. In light of these results, we think it is important to approach estimates of enhancer abundance with a healthy degree of caution, but we do not think these estimates should be abandoned entirely as they likely provide a reasonable upper bound estimate for the number enhancers in the genome. Although our data clearly suggests that the true positive rate of chromatin-based enhancer predictions decreases as function of H3K27ac level, it is difficult to determine exactly what this true positive rate is. Definitive proof of an enhancer’s function (or lack thereof) requires genetic manipulation of that enhancer in its native context, followed by extensive measurements of the resulting impact on gene expression70-72. Even such rigorous genetic experiments can be difficult to interpret, because an enhancer’s function might be limited to a very specific set of conditions that are difficult to identify and/or reproduce in an experimental setting. In addition, one enhancer may compensate for the loss of another, achieving a redundancy that could serve to make regulatory programs more robust to perturbations. Ultimately, we think our results highlight the importance of continued investigation into the molecular basis of enhancer function, and encourage the continual assessment of global enhancer estimates as new data points like those reported here become available.
In summary, we present a survey of the chromatin landscape in the developing mouse embryo that is unprecedented in is breadth. Our results describe a multi-tiered compendium of functional annotations for the developmental mouse genome, yielding valuable insight into key developmental processes and regulatory factors. The resources generated here include chromatin state maps for each tissue and stage, catalogs of enhancers that show dynamic activity during development, a genome-wide map of predicted enhancer target genes, and a collection of transgenic reporter assays that demonstrates a strong relationship between H3K27ac signal and likelihood of validation. We further reveal an important role for Polycomb-mediated repression in regulating TF genes during development, and a systematic connection between Mendelian diseases and complex common diseases likely caused by perturbation of the same genes. Due to the uniquely critical role of the mouse in biomedical research, we believe that the tools and insights developed here will prove to be a valuable resource to the biomedical research community.
AUTHOR CONTRIBUTIONS
Study was conceived and overseen by B.R., D.U.G., L.A.P., A.V., D.E.D., Y.S., F.Y., J.E., W.W, and J.M.C. Tissue collections performed by V.A., J.A.A., I.P-F., C.S.P., and M.K. ChIP-seq experiments performed by A.Y.L. and S.C. Data curation and processing by J.S.S., J.M.D., and D.U.G. Data analysis performed by D.U.G., I.B., Y.F-Y., Y.Zhang, A.W., B.D., B.Z., Y. Zhao, M.W., Y.Q., Y.H., and S.P. Transgenic assays performed by V.A., J.A.A., I.P-F., C.S.P., M.K., T.H.G., Q.T.P., A.N.H., B.J.M., and E.A.L. Manuscript written by D.U.G., I.B., D.E.D., and Y.X. with input from all authors.
METHODS
Tissue Collection
All animal work was reviewed and approved by the Lawrence Berkeley National Laboratory Animal Welfare and Research Committee. Tissue collection for all developmental stages was performed using C57BL/6N strain Mus musculus animals. For E14.5 and P0, breeding animals were purchased from both Charles River Laboratories (C57BL/6NCrl strain) and Taconic Biosciences (C57BL/6NTac strain). For all remaining developmental stages, breeding animals were purchased exclusively from Charles River Laboratories (C57BL/6NCrl strain). Wild-type male and female mice were mated using a standard timed breeding strategy. Embryos and P0 pups were collected for dissection using approved institutional protocols. Embryos were excluded if they were not the expected developmental stage. To avoid sample degradation, only one embryonic litter or P0 pup was processed at a time, and tissue was kept ice cold during dissection. Collection tubes for each tissue type were placed in a dry ice ethanol bath so that tissue samples could be flash frozen immediately upon dissection. Tissue from multiple embryos was pooled together in the same collection tube, and at least two separate collection tubes were collected for each tissue-stage for biological replication. Tissue was stored in a −80°C freezer or on dry ice until further processing. A step-by-step protocol for tissue collection, including detailed information about how embryonic stage was determined, can be found on the ENCODE Project website at https://www.encodeproject.org/documents/631aa21c-8e48-467e-8cac-d40c875b3913/@@download/attachment/StandardTissueExcisionProtocol_02132017.pdf.
ChIP-seq data generation
ChIP-seq experiments for all marks and tissues from E11.5 – P0 were performed as previously described66. The ChIP-seq protocol was modified slightly for all E10.5 experiments due to the low amount of input (“micro” ChIP-seq). Detailed protocols for both standard and micro ChIP-seq including antibodies used and antibodies validations performed are available at www.encodedcc.org associated with each experiment described here. They can also be found at the links below. Standard ChIP-seq, Tissue fixation & sonication: https://www.encodeproject.org/documents/3125496b-c833-4414-bf5f-84dd633eb30d/@@download/attachment/Ren_Tissue_Fixation_and_Sonication_v060614.pdf, Immunoprecipitation: https://www.encodeproject.org/documents/89795b31-e65a-42ca-9d7b-d75196f6f4b3/@@download/attachment/Ren%20Lab%20ENCODE%20Chromatin%20Immunoprecipitation%20Protocol_V2.pdf, Library preparation: https://www.encodeproject.org/documents/4f73fbc3-956e-47ae-aa2d-41a7df552c81/@@download/attachment/Ren_ChIP_Library_Preparation_v060614.pdf. Micro-ChIP-seq (E10.5), Tissue fixation & sonication: https://www.encodeproject.org/documents/1fcaab50-6ca0-4778-88cb-5f6b85170d21/@@download/attachment/Ren%20Lab%20ENCODE%20Tissue%20Fixation%20and%20Sonication%20Protocol%20MicroChIP.pdf, Immunoprecipitation & Library preparation: https://www.encodeproject.org/documents/18580e80-0907-4258-a412-46bcc37bd040/@@download/attachment/Ren%20Lab%20ENCODE%20Chromatin%20Immunoprecipitation%20Protocol%20MicroChIP.pdf.
ChIP-seq data processing
Histone ChIP-seq data were analyzed using a software pipeline implemented by the ENCODE Data Coordinating Center (DCC) for the ENCODE Consortium. A schematic representation of the pipeline is shown in Extended figure 10. Each step of the pipeline corresponds to a script written in the Python programming language that assembles the input files, runs external programs (such as the MACS2 peak caller), and calculates quality-control metrics. The methodology is similar to that previously described for ENCODE73 with the following modifications: The mapping step used bwa version 0.7.10 and samtools version 1.0, and MACS2 version 2.1.0 was used for signal track generation and peak calling. In order to ensure an adequate sampling of noise for subsequent replicate comparisons, peaks were initially called at a relaxed p-value threshold of 1 x 10-2. Such relaxed peak sets were generated for each biological replicate, for the replicates pooled, and for pooled pseudoreplicates of each true replicate (each pseudoreplicate consists of half the reads sampled without replacement). Peaks from the pooled replicate set were retained in the replicated peak set if they overlapped by at least half their length (in bases) peaks from both biological replicates. Additionally, peaks that overlapped both pooled pseudoreplicates were added to the replicated peak set. In this way very strong biological replicates could “rescue” peaks that were only marginal in a second replicate. The pipeline is available to be run on the DNAnexus (https://www.dnanexus.com/) web platform, backed by cloud computing from Amazon Web Services (AWS), and is the same pipeline used for the analysis of all ENCODE histone ChIP-seq experiments. The platform provides both an API for programmatic execution of the pipeline and a web-based interface for interactive execution of the same workflows. ENCODE DCC uses this approach to ensure that primary data from different labs within the Consortium are processed uniformly, and thus minimizing factors that could confound subsequent comparisons74. The ENCODE DCC analyzed the experiments in parallel and accessioned the results to the ENCODE Portal73 (https://www.encodeproject.org/).
Code availability
The ENCODE histone ChIP-seq pipeline is among the collection of ENCODE Uniform Processing Pipelines that can be found here: https://platform.dnanexus.com/projects/featured. The code is open-source, and available here: https://github.com/ENCODE-DCC/chip-seq-pipeline.
Analysis of individual histone ChIP-seq data. K means clustering
To facilitate clustering across all tissues and stages, one master H3K27ac peak list was generated by merging peaks across tissuestages. Each peak was then scored for H3K27ac ChIP-seq fold enrichment over input in each tissuestage using bigWigAverageOverBed, and these values were quantile normalized across tissue-stages to eliminate potential confounding effects of systematic biases in the distribution of signal between tissues and/or stages. These normalized values are represented by the blue-black-yellow color scale in Figure 2a. One additional data processing step was performed prior to clustering: across each row, the values were converted to a unit vector in R (x / sqrt(sum(x^2)), to prevent enrichment level within tissue-stages from dominating the clusters, rather than enrichment level between tissue-stages. Note that these unit vector values were only used for clustering; the values plotted are the normalized H3K27ac enrichment values. K means clustering was performed in R with k = 8 and default parameters. Rows were ordered within each cluster based on mean normalized enrichment. Hierarchical clustering of tissue-stages. Normalized ChIP-seq fold enrichment values were calculated for each mark as described for H3K27ac in the preceding subsection (not including unit vector conversion). Hierarchical clustering was performed in R with default parameters. The resulting dendrograms are plotted in Figure 2b. Calculation of distance between stages. For each mark, we used Pearson correlation to compare the similarity between all datasets from the same tissue. We then categorized these correlations values based on how many stages separate the datasets being compared: 0 for true biological replicates from the same stage up to 7 for comparisons of E11.5 datasets to P0 datasets. These correlations are plotted in Figure 2c. Principle Component Analysis (PCA). The whole genome was split into 1kb tiling bins. Average fold enrichment signals were calculated for each bin using the bigwigoverbed.sh script from the UCSC website. Bins that overlapped a merged peak by a minimal 20% (reciprocal) was denoted as peak-bin. The average fold enrichment signals from each peak-bin were quantile normalized within the tissue (not across the tissue). The signal strength for each peak was calculated as the sum of the signals of all bins overlapping that peak. Principal component analysis was performed on the peak signals for each histone mark with the R function “prcomp”. PC1, PC2 and PC3 values were plotted for each sample. Metagene profiles. To illustrate the characteristic enrichment patterns at active and silent genes, we used very conservative definitions to define these gene sets here. Active genes are defined as RPKM > 10 in every tissue-stage evaluated here (see Table S6 for list of public RNA-seq datasets). Silent genes were defined as RPKM < 2 in all tissue-stages. Metagene profiles were plotted with deeptools plotProfile75, using data from E15.5 Heart. Correlation between replicates. To facilitate comparisons of peak strength between replicates, we began with a single list of peaks called for each tissue-stage using data pooled from both replicates. Then, we scored these peaks for ChIP-seq fold enrichment over input for each replicate separately using bigWigAvgOverBed, and performed quantile normalization to eliminate confounding effect of systematic biases in the distribution of signal between replicates. We then used Pearson Correlation to compare the replicated of each experiment. “Narrow” marks (H3K4me3, H3K4me2, H3K27ac, H3K9ac) for tighter peaks of enrichment and tend to correlate more strongly than “broad” marks (H3K27me3, H3K4me1, H3K9me3, H3K36me3).
ChromHMM. Generating the model
Chromatin datasets (bam files) were downloaded from the ENCODE DCC on October 15, 2016. De-duplicated bam files for each sample, along with their respective input controls, were binarized using the binarizeBam function of ChromHMM, with default parameters. Models considering 2 to 24 states were learned separately on the two replicates using the LearnModel function, with default parameters. For the rest of the analyses, we leveraged the availability of two, distinct time series; namely, we applied the same strategy separately and compared the results a posteriori. The conclusions obtained were invariably consistent, suggesting that the inferences on a single time series (at least in terms of global genomic patterns) are highly reproducible. Identifying the optimal number of chromatin states. We devised two different strategies to identify the minimal number of states that capture the combinations of histone modifications present in the data, both of which converged on a 15 state model. First, the ChromHMM CompareModels function was run separately on the two series. This function compares the emission parameters of a selected model to a set of models (in terms of Pearson’s correlation), and outputs the maximum correlation of each state in the selected model with its best matching state in each other model. We used this function to compare the “full” model (namely the one that consider 24 states) to the states in the simpler models. We then calculated the median correlation of all the 24 states against the simpler models, plotted these numbers against the number of states in the model and look at what number of states both series reached a plateau. As a complementary strategy, the emission probabilities from all the 23 models (considering 2 to 24 states) from both replicates were clustered together. The rationale behind this strategy is that very similar states across models will tend to cluster together, so there must be an optimal number of clusters corresponding to the optimal number of states in the model. To this end, we applied k-means clustering with k between 2 and 24, and evaluated the goodness of the separation for each k as the ratio between the “Between sum of squares” (referred to as Between SS) and the “Total sum of squares” (Total SS). Very cohesive, well-separated clusters tend to approach a ratio of 1. Given a value of k, the ratio was averaged over one hundred realizations of the clustering. The ratio observed for k = 24 was used as maximum, and the optimal number of states was then defined by the smallest value of k showing a ratio equal or higher than 95% of the maximum. Genome segmentation and chromatin state tracking across genomic positions. The segmentation was run separately for each sample, using the MakeSegmentation function of ChromHMM (default parameters) and the model derived from the first replicate. For the final set of replicated state calls we required that a region was assigned to the same state in both biological replicates (within a given tissue & stage). Regions that were not assigned to the same state in both replicates were re classified as “No Reproducible Signal” (distinct from state 15 – no signal in both replicates). The unionbedg functionality of BEDTools76 v2.17.0 was exploited to keep track of the chromatin state of genomic intervals across a defined set of samples. Chromatin state trajectories along developmental time. Given a replicate for a defined tissue, all those genomic intervals classified in a specified state (e.g. #5, strong enhancers) in one or more time points were tracked using the approach described above. Considering each pair of adjacent developmental time points, the genomic coverage of each transition between each pair of states was then calculated. The resulting numbers were then normalized on the coverage of the largest transition in the time series under investigation (e.g. liver, replicate #2) and shown as a directed graph. Clustering on enhancer states. After tracking the changes in chromatin state of each genomic base pair in the genome across multiple stages and tissues, the resulting matrix was binarized according to each segment being classified in a specified state (1) or any other state (0). The binary distances between all the pairs of samples considered in each specific analysis were then calculated. These were used either for comparisons or hierarchical clustering (Ward’s method). GO analysis. Functional enrichments through GREAT34 were obtained through the greatBatchQuery.py script. The resulting lists were first filtered for the relevant ontologies. After that, only the terms showing a binomial FDR <= 0.05 and a regional enrichment equal or higher than 2-fold were considered. WebGestalt77 was run with default parameters, specifically restricting the enrichment analysis to the GO Biological Process Ontology. VISTA validated elements were downloaded from https://enhancer.lbl.gov on June 17, 2016. Mm9 and hg19 coordinates were converted to mm10 using liftOver (setting -minMatch to 0.95 and 0.1, respectively). VISTA positive elements with any of the following annotations: "forebrain", "midbrain", "hindbrain", "neuraltube", "limb", "facialmesenchyme" or "heart" were considered for the following analysis. Liver was not considered in this enrichment analysis since there are currently <10 validated elements in VISTA showing reproducible staining in the liver. coverageBed from BEDTools v2.17.0 was used to calculate the coverage of the regions in each state in the E11.5 predictions with each tissue-specific group of VISTA elements. The fraction of bases covered was then normalized to the expected overlap, based on the overall genome-wide coverage of each state. The enrichment for repetitive elements was calculated using the OverlapEnrichment function of ChromHMM. Classification of TSS as active, repressed, OMIM, and/or TF. A 2kb window was defined around the TSS coordinates of all protein coding transcripts in GENCODE78 vM12. These 2kb windows (one per TSS) we overlapped with replicated chromHMM calls to determine their chromatin state in each tissue and stage. A TSS was classified as “active” in a given tissue-stage if this 2kb window overlapped the active promoter state (#1), and did not overlap any repressive states (#3, #13, #14). A TSS was classified as “repressed” in a given tissue-stage if this 2kb window overlapped a state characteristic of polycomb-mediated repression state (#3, #13), and did not overlap any active states (#1, #3, #4, #5, #6, #7, #10, #12). TSS that did not meet the criteria for either active or repressed in a given tissue-stage were left unclassified. Mouse orthologs of OMIM genes were taken directly from the “genemap.txt” available from www.omim.org, and filtered for genes with a “Disorder” type value of 3 – “the molecular basis of the disorder is known”. The full list of mouse TFs was downloaded from TFClass79. Ensembl Gene IDs were used to match GENCODE TSS to corresponding OMIM and TF genes. Characterization of dynamic enhancer elements. The temporal dynamic analysis was performed for each tissue separately. First, 1kb genomic bins that overlapped with regions defined as ChromHMM strong enhancer states in at least one stage were identified. Then we selected dynamic elements (bins) from these strong enhancer bins using the bioconductor LIMMA package80. LIMMA is a package developed for calling differentially expressed genes for microarray but was also adapted for sequencing data with the LOOM functionality. LIMMA package was used to call differential binding between each adjacent stage comparisons (e.g. E11.5 vs E12.5, E12.5 vs E13.5, etc). P-values were calculated with the eBayes function within LIMMA with trend parameter disabled, and were adjusted using Benjamini-Hochberg method. A bin was called overall dynamic if its adjusted p-value is less than 0.05 in any adjacent stage comparisons; otherwise it is called a non-dynamic bin. Non-dynamic bins were not included in the following analysis to reduce noise. We performed K-means clustering on dynamic bins across stages. The rows (bins) are normalized by dividing a common value so that the squares of the values sum up to 1. The optimal K was determined using the “elbow method” to cut off at the K value where percentage of withinness values transition from increasing quickly to increasing steadily with larger K. The resulting heatmap of the K-means clusters were shown in Figure 4d and Extended data figure 8. For each of the identified clusters, we performed enrichment testing of “GO Biological Processes” using GREAT. Over-represented motifs for each dynamic cluster was identified with the following method: First, all vertebrate Motif position weight matrices (PWMs) were downloaded from JASPAR TF database and used to scan the peak-bins for motif occurrences with FIMO, MEME suite81. For each motif, we computed the odds ratio and the significance of enrichment in each cluster, comparing to non-dynamic bin pool using Fisher’s exact test. The non-dynamic bin pool were sampled with replacement to match the distribution of average signal strength from the dynamic bins. Following that, significant TF PWMs were grouped in subfamilies using the structural information from TFClass79 because they share similar if not identical binding motifs. The top significantly over-represented TFs and their associated subfamilies were reported.
A TAD-constrained map of enhancer-promoter associations
The reproducible strong enhancer calls (state #5) were merged using the mergeBed utility from BEDTools v2.17.0. After that, those regions or sub-regions overlapping the intervals +/-2.5 kb from the TSS of genes in Gencode were excluded from the merged regions using subtractBed from BEDTools v2.17.0. Regions smaller than 2 kb were enlarged to 2 kb from their central coordinate (to allow for more robust signal estimation). This resulted in 66,556 putative enhancers. H3K27ac signals at these regions were then quantified using uniquely aligned, deduplicated reads. These measurements were carried out by the coverageBed utility from BEDTools v2.17.0, then normalized to RPKM according to the sequencing depth of each sample, and log2-transformed (zeros were replaced by the smallest detectable value larger than zero). The mRNA expression of protein-coding genes was tracked across the 66 samples. Small and non-coding RNAs were excluded from any subsequent step by considering only those genes with a Gencode biotype supporting protein-coding functionality. FPKM were log2-transformed (zeros were replaced by the smallest detectable value larger than zero). For each TAD defined in the genome of mouse ES-cells53, the putative enhancers and genes were retrieved. All the enhancer-gene pairs within the TAD were then evaluated in terms of Spearman’s rank correlation coefficient (SCC) between the H3K27ac pattern of enrichment and the mRNA expression across the samples. Each gene was assigned to the putative enhancer showing the highest value of SCC. In order to attach p-values to these correlations, a null distribution was estimated empirically, by calculating the SCC of the enhancer with all the genes on the chromosome. Two different strategies were employed to estimate a p-value: 1) a z-score was calculated by subtracting the mean and dividing by the standard deviation of the null. The corresponding p-value was then calculated using the pnorm function in R; 2) an empirical p-value was defined as the number of times an equal or better than the observed SCC was found in the null. Only those putative enhancers showing a p-value <= 0.05 (for both strategies) and a SCC >= 0.25 were retained. Two maps were independently derived from the two biological replicates. These respectively encompassed 31,965 and 32,735 associations, with an overlap 21,142. Only these overlapping associations were used for further evaluation and analyses. Validation of the enhancer-gene map using published chromatin conformation data. Capture-C interaction data from the developing limb and brain54 were retrieved from the GEO (GSE84792). ChIA-PET interactions at sites bound by the cohesion subunit Smc1a in the developing limb55 were retrieved from the Supp. Table 2 of the original publication. Enhancer-gene contacts in fetal liver cells as inferred from Capture HiC56 were downloaded from ArrayExpress (E-MTAB-2414). In all cases, mm9 coordinates were mapped to mm10 using liftOver. For each published dataset, only those regions in the enhancer-gene map overlapping any experimentally validated interaction were retained. The fraction of interactions showing experimental support was then calculated for both the gene assigned by correlation and the nearest RefSeq gene. Mapping of murine enhancer-gene map to human. The putative enhancer regions were mapped to the human genome (hg19) using liftOver, with a strategy similar to previous reports67. Each region was required to both uniquely map to hg19, and to uniquely map back to the original region in mm10, with the requirement that >=50% of the bases in each region were mapped back to mouse after being mapped to human. For each enhancer-gene pair, the orthologous human gene was inferred using BioMart82. The orthologous pairs were also required to share the same TAD in human (TADs derived from human ES-cells53). Validation of the enhancer-gene map using published eQTL-gene associations. Single-tissue eQTL-gene associations generated by the GTEx consortium83 were downloaded from the GTEx portal (http://gtexportal.org, release v6p). Only those tissues with more than 750k annotated eQTLs were considered. A control set of enhancer-gene associations matching the size and the TSS-distance distributions of the real enhancer-gene map was generated. Briefly, for each enhancer-gene pair, the distance between the TSS of the gene and the central coordinate of the enhancer was calculated; after that, a region of the same size of the enhancer centered at the same distance to the TSS of the gene but on the opposite side of the enhancer was picked as a control set.
Transgenic reporter assays
Names for functionally validated enhancers used throughout this work (mm numbers) are the unique identifiers from the VISTA Enhancer Browser (www.enhancer.lbl.gov)65. Enhancers were selected for testing as follows: The H3K27ac peak calls for three tissues (E12.5 heart, forebrain, and limb) were taken from the TSS-distal H3K27ac peaks called using the uniform processing pipeline (mm10-minimal) by the ENCODE Data Coordinating Center (narrow peaks from combined replicates). Peaks for each tissue were ranked by enrichment score (most to least significant). We then selected predicted enhancers from three different bins within each tissue’s ranked list for testing (bins were approximately ranks 1-85, 1500-1550, and 3000-3050). Loci that were already included in the VISTA Enhancer Browser or that appeared to overlap unannotated promoters were excluded from testing. In total, 150 predicted enhancers were tested, including 60 top ranked candidates (20 per tissue), 45 middle ranked (15 per tissue), and 45 lower ranked candidates (15 per tissue). Transgenic mouse assays were performed in FVB/NCrl strain Mus musculus animals (Charles River) as previously described68,84. Briefly, predicted enhancers were PCR amplified and cloned into a plasmid upstream of a minimal Hsp68 promoter and a lacZ reporter gene. Transgenic embryos were generated by pronuclear injection of the resulting plasmids into fertilized mouse eggs. Embryos were implanted into surrogate mothers, collected at E12.5, and stained for β-galactosidase activity. Elements were scored as positive enhancers if at least three embryos had identical β-galactosidase staining in the same tissue. Elements were scored as negative if no reproducible staining was observed and at least five embryos harboring a transgene insertion were obtained. Genomic coordinates and results for each element are provided in Supplementary Table 5, through the ENCODE project data portal (www.encodeproject.org), and at the VISTA Enhancer Browser website (www.enhancer.lbl.gov).
Mapping to repeat element families
Since ENCODE analysis pipeline were focusing on the uniquely mapped reads, a lot of information provided by the reads that are not uniquely mapped will be missed. This is especially undesirable for repetitive element analysis as there is a higher chance for read coming from a repetitive region of DNA will map un-uniquely to the reference genome. Hence, we use a new pipeline with two rounds of mapping steps to re-process all the fastq files. The pipeline utilizes all the reads for the purpose of gaining enough sensitivity to uncover the enrichment of any repetitive element subfamilies that were previously undiscovered because of low coverage. In the first round of mapping, sample reads were aligned to the reference genome mm10 using Bowtie1 with: bowtie hg19 - p 16 -t -m 1 -S –chunkmbs 512 –max multimap.fastq input.fastq output.sam85. –max is used to separate reads mapping to multiple locations of the genome from uniquely mapped reads. In the second round of mapping, a customized assemblies file are constructed by concatenating genomic instances of each repetitive element subfamily, their 15 bp flanking genomic sequences and a 200bp spacer sequence in FASTA format86. The annotation file for repetitive elements used in this step was downloaded from Repeatmasker.org. A python script used with parameters are as following: python RepEnrich.py /data/mm10_repeatmasker.txt /data/sample_A sample_A /data/mm10_setup_folder sampleA_multimap.fastq sampleA_unique.bam –cpus 1687. The number of reads mapping to repetitive element subfamilies, repetitive element families, or repetitive element classes are determined using information from both uniquely mapped reads that overlap with repetitive element and un-uniquely mapped reads. Since some of the repetitive element subfamilies are very similar to each other, a fractional counts method was used to classify the reads that map to multiple repetitive element subfamilies. It sums reads mapping uniquely to a repetitive element subfamily once and counts reads mapping to multiple subfamilies using a fraction 1/Ns, where Ns = number of repetitive element subfamilies the read aligns with. A table of counts that estimate enrichment signal for the repeats classes across different tissues is built as the final output for plotting the figures.
Data processing in R
Most of the described data processing steps (plotting, statistical tests, calculating correlations and hierarchical clustering) were performed in the statistical computing environment R v.3.3.1 (www.r-project.org).
DATA ACCESS
Raw and processed ChIP-seq data reported here can be accessed via the ENCODE Data Collection and Coordiantion (DCC) website www.encodedcc.org. More specifically, a full list of the ChIP-seq experiments can be found here: https://www.encodeproject.org/search/?type=Experiment&assay_slims=DNA+binding&assay_title=ChIP-seq&award.rfa=ENCODE3&lab.title=Bing+Ren%2C+UCSD&limit=all.
ACKNOWLEDGMENTS
This study was funded by the National Human Genome Research Institute as part of the Encyclopedia of DNA Elements (ENCODE) project (U54HG006997). D.U.G. was supported by the NIH Institutional Research and Academic Career Development Awards (IRACDA) program, and a postdoctoral fellowship from the A.P. Giannini Foundation. D.E.D, A.V., and L.A.P. were also supported by NIDCR FaceBase grant U01DE024427, NHLBI grant R24HL123879, and NHGRI grants R01HG003988, and UM1HG009421, and research conducted at the E.O. Lawrence Berkeley National Laboratory was performed under Department of Energy Contract DE-AC02-05CH11231, University of California. Y.H. is supported by the H.A. and Mary K. Chapman Charitable Trust. J.R.E. is an Investigator of the Howard Hughes Medical Institute.