Abstract
Regulatory networks control the spatiotemporal gene expression patterns that give rise to and define the individual cell types of multicellular organisms. In Eumetazoa, distal regulatory elements called enhancers play a key role in determining the structure of such networks, particularly the wiring diagram of “who regulates whom.” Mutations that affect enhancer activity can therefore rewire regulatory networks, potentially causing changes in gene expression that may be adaptive. Here, we use single-cell transcriptomic and chromatin accessibility data from mouse to show that enhancers play an additional role in the evolution of regulatory networks: They facilitate network growth by creating transcriptionally active regions of open chromatin that are conducive to de novo gene evolution. Specifically, our comparative transcriptomic analysis with three other mammalian species shows that young, mouse-specific transcribed open reading frames are preferentially located near enhancers, whereas older open reading frames are not. Interactions with enhancers are then gained incrementally over macro-evolutionary timescales, helping integrate new genes into existing regulatory networks. Taken together, our results highlight a dual role of enhancers in expanding and rewiring gene regulatory networks.
Introduction
Enhancers are a defining characteristic of eumetazoan gene regulatory networks. They recruit transcription factors and cofactors that “loop out” DNA to bind core promoters and increase the expression of target genes [1, 2], thus mediating interactions between genes. Such interactions are highly dynamic throughout development, facilitating the differential deployment of distinct regulatory sub-networks in different cells, which helps define cell-type specific spatiotemporal gene expression patterns [3, 4].
Enhancer activity is not only dynamic throughout development, but also throughout evolutionary time [5]. The reason is that mutations in enhancer sequences can create or ablate interactions with regulatory proteins, thus enabling modifications in gene use without affecting gene product [6, 7]. Such changes alter a regulatory network’s wiring diagram of “who regulates whom,” which can cause changes in gene expression patterns that embody or lead to evolutionary adaptations or innovations [8]. Examples include the archetypical pentadactyl limb anatomy of extant tetrapods [9], ocular regression in subterranean rodents [10, 11], limb loss in snakes [11, 12], convergent pigmentation patterns in East African cichlids [13], the mammalian neocortex [14], and cell type diversity in eumetazoans [15].
Regulatory networks not only evolve via rewiring, but also via the addition of new genes [16]. Gene duplication, retrotransposition, gene fusion, the domestication of genomic parasites, and horizontal gene transfer are all means by which new genes can arise from pre-existing genes [17], and thus expand gene regulatory networks. In addition, it is becoming increasingly appreciated that new genes can arise de novo from non-coding regions of the genome [18–22]. For protein-coding genes, the essential prerequisites of this process are the formation of an open reading frame (ORF), together with the transcription and translation of that ORF. Because much of the genome is transcribed [23, 24] and many lineage-specific transcripts containing ORFs are potentially translated [25–30], the de novo evolution of new protein-coding genes is also a likely contributor to the growth of gene regulatory networks.
An important question concerning de novo genes is how they integrate into existing regulatory networks, and what role enhancers may play in this process. It has been hypothesized that enhancer acquisition allows new genes to expand their breadth of expression, providing opportunities to acquire new functions in different cellular contexts [31]. Enhancers may therefore help new genes integrate into existing regulatory networks via edge formation and rewiring. Less appreciated is the role enhancers may play in the origin of de novo genes [32], and thus in the growth of gene regulatory networks. The physical proximity between active enhancers and their target genes [33] – facilitated by DNA looping – creates a transcriptionally permissive environment that is engaged with RNA polymerase II, which may lead to the transcription of regions near the enhancer, or to the transcription of the enhancer itself, producing so-called enhancer RNA [1, 34]. If the resulting transcript is stable, harbors an open reading frame, and engages with ribosomes, then it fulfills the basic prerequisites of de novo gene birth. Thus, enhancers may play a dual role in the evolution of de novo genes, and consequently in the evolution of gene regulatory networks. By creating a transcriptionally permissive environment that is engaged with the transcriptional machinery, enhancers may facilitate the origin of de novo genes; by physically interacting with gene promoters, enhancers may facilitate the integration of de novo genes into existing regulatory networks.
Here, we take an integrative approach to study this potential dual role of enhancers. We leverage single-cell transcriptomic and functional genomics data from mouse that describe gene expression levels, chromatin accessibility, and chemical modifications to histones, as well as phylostratigraphic estimates of the ages of transcribed ORFs. We find that the distance between ORFs and enhancers in nucleotide sequence increases with ORF age, indicating that young ORFs preferentially emerge near enhancers. We also find that the number of enhancer interactions per ORF increases with ORF age, even across macro-evolutionary timescales. In sum, our findings support a dual role for enhancers in the origin of de novo genes and in their functional integration into gene regulatory networks.
Results
The maturity and age of transcribed open reading frames
To set the stage for our study, we first characterized the maturity and age of a set of mouse transcripts bearing ORFs [29]. Specifically, we characterized the transcript maturity of 46,501 murine ORFs by assessing whether i) the ORF resides in a region of open chromatin, which implies it is accessible to the transcriptional machinery; ii) the transcript has detectable 5’ capping, which confers stability [35, 36], permits its export from the nucleus to the cytoplasm [37] and promotes translation [36]; and iii) the transcript associates with ribosomes, indicating the potential for translation [25, 29, 30]. Fig. 1A shows a schematic of our classification of transcript maturity.
We found that over a third (16,735) of the 46,501 ORFs had the highest level of transcript maturity, which we refer to as maturity level 3 (Fig. 1B). The remaining ORFs were distributed among different combinations of the three maturity indicators. We refer to ORFs found in regions of open chromatin as having a maturity level 1 (5,640 ORFs) and those that are also 5’ capped as having a maturity level 2 (4,927 ORFs).
The ORFs we assessed had their phylogenetic age estimated by Schmitz et al. [29], based on their presence in the transcriptomes of other mammalian species, including rat, human, and opossum (Fig. 2A). If a homolog of a mouse ORF is found in another species, then it is assumed to have emerged before the common ancestor of that species and mouse. For example, if an ORF is shared with opossum, it is assumed to have originated before the branching of marsupials and placental mammals ~160 million years ago; if it is not shared with any of the other three species, it is assumed to have emerged only after the split between mouse and rat ~20 million years ago. As expected, when assessing the distribution of ORFs with each of the maturity indicators across the different age categories, we found that the older an ORF is, the more likely it is to correspond to higher levels of maturity. This is clear from the observation that the percentage of ORFs corresponding to the oldest age class (i.e., opossum) increases with the maturity level, while the percentage corresponding to the youngest age class (i.e., mouse) decreases (Fig. 2B). Furthermore, whereas most mouse-specific ORFs have a maturity level of 1, that fraction gradually decreases as ORFs grow older, while the fraction of ORFs of maturity level 3 increases with age from its minimum in mouse-specific ORFs to its maximum in opossum-shared ORFs (Fig. 2C).
Due to the resolution of the phylogeny shown in Fig. 2A, there is variation in the ages of the ORFs even within a given lineage. We therefore reasoned that such variation might be reflected by variation in transcript maturity. To determine if this was the case, we considered the expression of mouse-specific ORFs from ten different taxa from the mouse branch after the mouse-rat split (Fig. 2D) [23]. Making use of transcriptomic data from those ten taxa, we determined when in the recent phylogenetic history leading to our focal species (Mus musculus domesticus) the genomic regions harboring mouse-specific ORFs started being transcribed. As anticipated, we found that whereas the fraction of non-mouse-specific ORFs with detectable transcription is relatively constant across the different lineages, fewer mouse-specific ORFs are expressed in the lineages that are more distantly related to M. m. domesticus (Fig. 2E). We also observed that more mature ORFs are more likely to be transcribed at more basal branches of the mouse phylogeny than are less mature ORFs, indicating that transcript maturity is indicative of when in the mouse phylogeny the genomic region harboring the ORF started being transcribed (Fig. 2F).
In sum, these results show that an ORF’s transcript maturity increases with its age, complementing previous reports that focused on the correlation between age and translation potential [29]. With these estimates of transcript maturity and age at hand, we next studied the role enhancers play in the birth of de novo genes and in their integration into regulatory networks.
Many young and transcriptionally immature ORFs are proximal to enhancers
H3K27ac and H3K4me1 are histone modifications that are commonly used to identify enhancers, specifically when they are not found overlapping H3K4me3 modifications, which are indicative of promoters [38]. We therefore merged chromatin immunoprecipitation followed by DNA sequencing (ChIP-seq) data for H3K27ac, H3K4me1, and H3K4me3 obtained from 23 mouse tissues and cell types [39], and considered enhancers to be those genomic regions where H3K27ac and/or H3K4me1 peaks do not overlap H3K4me3 peaks in any tissue [40, 41] (Materials and Methods). Assessing the 27,347 ORFs with an assigned maturity level, we found that i) mouse-specific ORFs are significantly closer to enhancer marks than ORFs shared with rat, human, or opossum (Spearman’s correlation coefficient ρ = 0.27, p < 0.01), with a median distance to their closest enhancer mark of 1,589bp for mouse-specific ORFs compared to more than 2,500bp for the remaining age classes (Fig. 3A); ii) over 30% of mouse-specific ORFs are in regions of open chromatin containing enhancer marks, while this percentage decreases as ORFs grow older to less than 5% for those shared with opossum (Fig. 3B); iii) significantly more enhancers are found within 50kb upstream and 50kb downstream of mouse-specific ORFs than in any other age class (Fig. S1, Wilcoxon’s rank sum test p < 0.05); iv) the mouse-specific age class has the highest percentage of ORFs showing evidence of bidirectional transcription – a hallmark of enhancer activity [42] (Fig. 3C); and v) ORFs of lower transcript maturity, which tend to be younger, are nearer to enhancers than ORFs of higher transcript maturity, which tend to be older (Fig. S2). These results suggest that the birth of many new genes is facilitated by their close proximity to enhancers.
Because many (58%) of the mouse-specific ORFs are found in genomic regions that overlap or are very close to genomic regions that harbor annotated genes, we expect that at their birth, such ORFs will inherit the regulatory properties of their host gene, which is older. To specifically assess the regulatory background of ORFs that emerged from or near enhancers and thus did not coopt the regulatory features of the promoters of older genes, we separated ORFs stemming from genomic regions annotated as intergenic (which are the ORFs most likely to have emerged de novo [29]) from those that we considered genic, which are those ORFs overlapping other genes or that are near the promoters of other genes (Materials and Methods). We found that intergenic ORFs are considerably more likely to be found closer to enhancers than genic ORFs (Fig. 3D; Fig. S3). For example, ~65% of mouse-specific intergenic ORFs were within 1kb of an enhancer, as compared to ~25% for mouse-specific genic ORFs and ~10% for non-mouse-specific ORFs. This implies that ORFs emerging within intergenic regions of the genome lose their proximity to enhancers as they age, perhaps via the transformation of enhancers to promoters [43]. This possibility is supported by the observation that the chromatin modification indicative of promoters, H3K4me3, shows trends opposite to the ones described above for enhancers. That is, older ORFs are closer to a larger number of H3K4me3 marks than younger ORFs (Fig. S4).
These observations support the hypothesis that enhancers facilitate the de novo evolution of genes from non-coding DNA, and thus contribute to the expansion of gene regulatory networks. However, our analyses so far have considered enhancer marks that were merged across a diversity of cell types and tissues. To provide more direct evidence that enhancers facilitate de novo gene birth, we separately considered three tissues (liver, brain, and testis) for which we had both transcriptomic and histone modification data. We found that 24% (100 ORFs), 36% (931 ORFs), and 26% (244 ORFs) of intergenic mouse-specific ORFs with evidence for transcription in liver, brain, and testis, respectively, are within 1kb of an enhancer (Fig. S5). These percentages are considerably lower for genic ORFs (< 8%) and for ORFs shared with rat, human, and opossum (< 2%). Enhancers therefore provide fertile ground for the de novo birth of new genes from intergenic regions of the genome.
Enhancer interactions are gradually acquired over macro-evolutionary timescales
We next asked how enhancers integrate new genes into existing regulatory networks. The CCCTC-binding factor (CTCF) is an architectural DNA-binding protein that mediates physical interactions between promoters and enhancers [44]. Using ChIP-seq data for CTCF in 15 cell and tissue types, we found that CTCF-bound regions of the genome overlap a larger fraction of older ORFs than younger ORFs (~75% of opossum-shared ORFs compared to ~45% of mouse-specific ORFs; Fig. 4A), that there is a negative correlation between the age of an ORF and its distance to the closest CTCF-bound region (Spearman’s correlation coefficient ρ = −0.27, p < 0.01), and that among young mouse-specific ORFs the distance to the closest CTCF peak is significantly higher for intergenic than genic ORFs (p < 0.01; Fig. S6). These results suggest that while young ORFs are proximal to enhancers, they are not specifically targeted by them. Such enhancer interactions are likely acquired gradually over time, as CTCF motifs, and other sequence changes conducive to enhancer-promoter interactions, evolve in the proximity of ORFs.
To study how ORFs acquire interactions with enhancers, we considered an enhancer-promoter interaction map derived from single-cell chromatin accessibility data in 13 murine tissues [45] (Materials and Methods). We first corroborated the negative correlation between an ORF’s number of enhancer interactions and its distance to the closest CTCF-bound region (Spearman’s correlation coefficient ρ = −0.35, p < 0.01). We then uncovered a positive correlation between the age of an ORF and its number of enhancer interactions (Spearman’s correlation coefficient ρ = 0.17, p < 0.01; Fig. 4B). This number increased from a median of 5 enhancer interactions for mouse-specific ORFs to a median of 13 for ORFs that are shared with opossum, indicating that enhancer-promoter interactions are gradually acquired over time. However, when restricting our analysis to ORFs of the highest transcript maturity class, this positive correlation was lost (Spearman’s correlation coefficient ρ = 0.001, p = 0.9).
We reasoned that this loss could be because mouse-specific ORFs of genic origin are enriched for transcripts of the highest maturity class (38% as compared to 1.4% for intergenic ORFs). We therefore partitioned the mouse-specific ORFs according to whether they were intergenic or genic, and compared the number of enhancer interactions in these classes to the number of enhancer interactions for non-mouse-specific ORFs. We found that intergenic ORFs had fewer enhancer interactions than genic ORFs, which were similar to non-mouse-specific ORFs in their number of enhancer interactions (Fig. 4C). This suggests that mouse-specific ORFs of genic origin, which are enriched for mature transcripts, tend to coopt the regulatory interactions of their host gene, or of nearby genes. To account for this confounding effect, we considered ORFs that do not share their segment of open chromatin with any other ORF and are therefore unlikely to be coopting the enhancer interactions of other genes (Materials and Methods). We call these ‘single ORFs’. We use this distinction, rather than intergenic vs. genic, because only 0.06% of ORFs that emerged before the rat/mouse split are annotated as intergenic, whereas 48% can be considered single ORFs. After making this distinction, we recovered the positive correlation between an ORF’s number of enhancer interactions and its age (Spearman’s correlation coefficient ρ = 0.24, p < 0.01); even for ORFs of the highest transcript maturity class, we found that mouse-specific ORFs were involved in fewer interactions than opossum-shared ORFs (Wilcoxon’s tailed test, p < 0.01; Fig. 4D). Therefore, intergenic mouse-specific ORFs with the highest level of transcript maturity, which tend to be older than those with lower levels of transcript maturity (Fig. 2), have fewer interactions than ORFs in the oldest age class, providing further evidence of the gradual acquisition of enhancer interactions over time.
To further explore the pace at which new enhancer interactions are gained over evolutionary time, we shifted our focus to opossum-shared ORFs, most of which (95%) correspond to annotated genes. We separated these into 15 new age classes dating back to the origin of cellular life [46] in order to understand how enhancer interactions are acquired over macroevolutionary timescales (Fig. 5A). With the sole exception of the oldest genes shared with bacteria and archaea, which have significantly fewer interactions than ORFs that emerged before the common ancestor of all eukaryotes, no other age class shows significantly fewer interactions than a younger age class (Fig. 5B; in Fig. S7, note that only a single element below the main diagonal is significant). Disregarding ORFs from the oldest age class, we found a significant correlation between the age of genes and their number of enhancer interactions (Spearman’s correlation coefficient ρ = 0.15, p < 0.01).
In sum, young ORFs have relatively few interactions with enhancers, despite being proximal to them in nucleotide sequence. As ORFs age, they gradually acquire enhancer interactions (Fig. 4), a process that continues over macroevolutionary timescales (Fig. 5B).
Enhancer acquisition influences expression breadth and variance
We next explored the functional consequences of enhancer acquisition. To do so, we first studied the expression breadth of opossum-shared annotated genes using the phylogeny shown in Fig. 5A and single-cell transcriptomic data from 68 cell types of ten murine tissues [47], for which we also had single-cell chromatin accessibility data (Materials and Methods). We found that expression breadth increases with gene age (Spearman’s correlation coefficient ρ = 0.30, p < 0.01; Fig. S8A), corroborating previous analyses performed using transcriptomic data from whole tissues [48]. We additionally found that a gene’s expression breadth increases with its number of enhancer interactions (Spearman’s correlation coefficient ρ = 0.37, p < 0.01; Fig. 5C), suggesting that enhancer acquisition has functional consequences.
We next measured the coefficient of variation for the expression of each gene, a measure that is useful for identifying stably vs. variably expressed genes from single-cell RNA sequencing data [49]. It is calculated as the standard deviation of a gene’s expression across cell types, divided by the mean expression across cell types (Materials and Methods). Genes with a lower coefficient of variation tend to be more tightly regulated than those with a higher coefficient of variation [49]. We found a significant negative correlation between the coefficient of variation and gene age (Spearman’s correlation coefficient ρ = −0.32, p < 0.01; Fig. S8B), as well as between the coefficient of variation and a gene’s number of enhancer interactions (Spearman’s correlation coefficient ρ = −0.32, p < 0.01; Fig. 5D). Specifically, the coefficient of variation decreases as genes acquire more enhancer interactions, stabilizing around one when genes acquire at least 20 enhancer interactions. These results show that enhancer acquisition affects gene expression breadth and variance, further supporting the role of enhancers in the integration of genes into regulatory networks.
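The coefficient of variation described above can be sketched in a few lines of Python. This is only an illustration of the stated definition (standard deviation across cell types divided by the mean across cell types); the function name is hypothetical, and the use of the population rather than the sample standard deviation is an assumption, since the text does not say which was used.

```python
import statistics
from typing import Optional, Sequence


def coefficient_of_variation(expression: Sequence[float]) -> Optional[float]:
    """Standard deviation of expression across cell types divided by the mean.

    `expression` holds one value per cell type for a single gene.
    Assumption: the population standard deviation is used.
    Returns None when the mean is zero, because the ratio is undefined.
    """
    mean = statistics.fmean(expression)
    if mean == 0:
        return None
    return statistics.pstdev(expression) / mean


# A gene expressed at the same level everywhere has a coefficient of zero.
uniform_cv = coefficient_of_variation([5.0, 5.0, 5.0])
# A gene expressed in only one of two cell types varies as much as its mean.
variable_cv = coefficient_of_variation([0.0, 2.0])
```

Under this definition, lower values correspond to more uniform expression across cell types.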
Discussion
We report a dual role of enhancers in the evolution of gene regulatory networks: They engage with the transcriptional machinery to create an environment of open chromatin that is conducive to the de novo birth of new genes, and they help integrate these new genes into existing regulatory networks by interacting with gene promoters, thus facilitating the evolution of controlled and robust gene expression in space and time.
Our study provides empirical support for the hypothesis that enhancers may facilitate de novo gene evolution, which to our knowledge was first proposed upon the discovery of enhancer RNA [34] and later expanded upon in a perspective piece by Wu and Sharp [32]. Our findings complement contemporaneous work [50] on the regulatory architecture of the nematode Pristionchus pacificus, which showed that young genes – those private to P. pacificus – are in closer proximity to enhancers than genes with one-to-one orthologs in other nematode species. The observation that enhancers facilitate de novo gene birth in both nematodes and mammals suggests that this mode of de novo gene evolution dates back to at least the common ancestor of Bilateria, and possibly even earlier, since cnidarians and ctenophores also employ distal regulatory elements [15, 51, 52].
The facilitating role of enhancers in de novo gene birth is conceptually similar to the facilitating role of the permissive chromatin state of meiotic spermatocytes and post-meiotic round spermatids that underlies the “out-of-testis hypothesis,” which proposes the testis as a primary tissue for the origination of new genes [17]. Both scenarios envision regions of open chromatin that are exposed to the transcriptional machinery, and thus produce a transcriptionally active environment that is conducive to the evolution of new genes. The two scenarios differ, however, in at least two ways. First, genes that emerge from or near enhancers may rapidly acquire their own promoters, due to the similar architectural and functional features of enhancers and promoters, a similarity that facilitates the rapid turnover of the former to the latter [43]. Second, enhancers are often deployed in multiple cell types or developmental stages [53], exposing enhancer-proximal de novo genes to distinct cellular contexts where they may confer a selective advantage.
The hypothesis that enhancers help de novo genes integrate into existing regulatory networks was previously proposed in the context of the out-of-testis hypothesis, as a means to expand a new gene’s breadth of expression [31]. Using single-cell chromatin accessibility and transcriptomic data, our study provides the first empirical support for the hypothesis that de novo genes gradually acquire enhancer interactions over time, and that this acquisition increases expression breadth. These findings complement related studies of gene integration into cellular networks, such as networks of protein-protein interactions [54, 55]. Our observation that genes continue to acquire enhancer interactions over macro-evolutionary timescales mirrors similar increases in other aspects of gene regulation, such as in the number of proximal transcription factor binding sites, alternative transcript isoforms, and miRNA targets [56].
Regulatory networks drive the spatiotemporal gene expression patterns that give rise to and define the numerous and distinct cellular identities characteristic of Metazoan life. Enhancers play an integral role in this process, mediating cell-type-specific gene-gene interactions, thus facilitating the combinatorial deployment of different genes in different contexts. Genetic changes that affect such interactions are responsible for myriad evolutionary adaptations and innovations [6–8, 57]. Our results suggest that the power of enhancers in creating such evolutionary novelties lies not only in their ability to rewire gene regulatory networks, but also in their ability to expand them, by providing fertile ground for de novo gene birth.
Materials and methods
ORF age and transcript maturity
Schmitz et al. [29] identified a set of 58,864 ORFs from the transcriptomes of three murine tissues: liver, testis, and brain. Blasting against the transcriptomes of four other mammalian species (rat, human, kangaroo rat, and opossum), they estimated the age of each ORF by phylostratigraphic methods [29, 58]. Because of the small number of ORFs shared with the kangaroo rat (49 ORFs), we merged these ORFs together with those from the rat age class. We used the genomic coordinates of the first exon of each ORF in the mm10 mouse genome reference to study the regulatory properties of ORFs of different ages, for example to study their distance to the nearest enhancer.
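The age assignment rule can be illustrated with a minimal sketch: an ORF is placed in the age class of the most distantly related species in which a homolog was detected, and kangaroo rat hits are folded into the rat class. The function and variable names are hypothetical, and the sketch takes the set of species with a detected homolog as given; it does not reproduce the BLAST searches or thresholds of Schmitz et al. [29].

```python
from typing import Iterable

# Age classes ordered by increasing divergence from mouse.
AGE_ORDER = ["rat", "human", "opossum"]
# Kangaroo rat hits are merged into the rat age class, as described above.
MERGED_CLASSES = {"kangaroo rat": "rat"}


def age_class(species_with_homolog: Iterable[str]) -> str:
    """Return the age class of an ORF given the species sharing a homolog.

    An ORF without a homolog in any other species is 'mouse' (mouse-specific).
    Otherwise it takes the class of the most distantly related species.
    """
    hits = {MERGED_CLASSES.get(s, s) for s in species_with_homolog}
    age = "mouse"
    for species in AGE_ORDER:
        if species in hits:
            age = species
    return age
```

For example, an ORF with homologs in rat and human would be placed in the human age class, because it is assumed to predate the common ancestor of mouse and human.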
We considered three indicators of ORF transcript maturity:
Open chromatin: We used single-cell ATAC-seq data from 13 different mouse tissues (bone marrow, cerebellum, large intestine, heart, small intestine, kidney, liver, lung, cortex, spleen, testes, thymus, and whole brain). The ATAC-seq method detects regions of open chromatin through the insertion of transposons in random accessible regions of the genome that can later be sequenced [59]. We obtained the data from the Mouse ATAC atlas [45], which comprised 436,206 peaks of open chromatin. We used liftOver from the Genome Browser at UCSC [60] to convert the genome coordinates from mm9 to mm10. A total of 29 peaks could not be converted. Using the “intersect” function of bedtools with default parameters [61], we found which ORFs have their first exons in regions of open chromatin and are therefore accessible to the transcriptional machinery in at least one of the tissues.
5’ capping: We used cap analysis of gene expression (CAGE) data from the FANTOM5 consortium from 1,016 mouse samples including cell lines, primary cells and tissues [62, 63]. This method is based on the capture of 5’ capped ends of mRNA, which allows the mapping of regions of transcription initiation genome-wide [64]. Using the “closest” function from bedtools with default parameters [61], we measured the distance between an ORF’s first exon and its closest CAGE peak. We considered a transcript to be 5’ capped if the start site of its first exon was located within 200 bases of a CAGE peak (Fig. S9).
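The 200-base criterion can be sketched as follows. This is not the bedtools-based procedure itself, only an illustration of the decision rule under stated assumptions: peaks are half-open BED-style intervals on a single chromosome, the start site of the first exon is supplied as a single coordinate, and "within 200 bases" is read as a distance of at most 200. All names are hypothetical.

```python
from typing import Iterable, Tuple

Interval = Tuple[int, int]  # half-open [start, end), BED-style

MAX_CAGE_DISTANCE = 200  # threshold from the text


def distance_to_peak(position: int, peak: Interval) -> int:
    """Distance in bases from a position to a peak; 0 if inside the peak."""
    start, end = peak
    if position < start:
        return start - position
    if position >= end:
        return position - (end - 1)
    return 0


def is_capped(exon_start_site: int, cage_peaks: Iterable[Interval]) -> bool:
    """True if the closest CAGE peak lies within MAX_CAGE_DISTANCE bases."""
    distances = [distance_to_peak(exon_start_site, p) for p in cage_peaks]
    return bool(distances) and min(distances) <= MAX_CAGE_DISTANCE
```

A real analysis would also need to handle chromosomes and strand, which this sketch leaves out.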
Ribosome association: We used ribosome profiling (ribo-seq) data from 9 different mouse tissues and cell types (embryonic stem cells, neutrophils, fibroblasts, liver, brain, testis, epidermis, kidney, and adipose tissue). This method is based on the sequencing of mRNA fragments that are protected from RNase digestion by ribosomes [65]. We obtained the coordinates of mRNA segments detected by ribo-seq from GWIPS-viz [66], a database that includes such data from different studies. Following Schmitz et al. [29], we considered an ORF as being potentially translated if at least one read from the ribo-seq datasets could be assigned to the ORF in question.
Using these indicators, we defined three levels of transcript maturity: maturity level 1 for ORFs whose first exon overlaps open chromatin, maturity level 2 for ORFs that are also 5’ capped, and maturity level 3 for ORFs that also associate with ribosomes. Because the ribo-seq data may be limited by the detectability of the transcript [29], we only considered ORFs that were also found in the mRNA-seq dataset available at GWIPS-viz; this filter led us to only consider a subset of the ORFs reported by Schmitz et al. [29]. Specifically, we assigned transcript maturity levels to 46,501 ORFs (~79% of the 58,864 ORFs).
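The nested definition of maturity levels can be written as a small decision function. The sketch below is illustrative and the function name is hypothetical. How combinations outside the nested hierarchy were handled (for example, open chromatin with ribosome association but no detectable cap) is not stated here, so leaving them unassigned is an assumption.

```python
from typing import Optional


def maturity_level(open_chromatin: bool,
                   capped: bool,
                   ribosome_associated: bool) -> Optional[int]:
    """Return transcript maturity level 1, 2 or 3, or None if unassigned.

    Level 1: first exon in open chromatin only.
    Level 2: open chromatin and 5' capped.
    Level 3: open chromatin, 5' capped and ribosome-associated.
    Assumption: any other combination of indicators is left unassigned.
    """
    if open_chromatin and capped and ribosome_associated:
        return 3
    if open_chromatin and capped and not ribosome_associated:
        return 2
    if open_chromatin and not capped and not ribosome_associated:
        return 1
    return None
```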
To determine if transcript maturity correlates with gene age even within the mouse lineage, we considered the transcriptomes of brain, liver and testis from 10 different mouse taxa (3 populations of Mus musculus domesticus, 2 populations of M. m. musculus, and 1 from M. m. castaneus, M. spicilegus, M. spretus, M. mattheyi and Apodemus uralensis). The data consisted of read counts from the transcriptomes of each taxon mapped to 200 bp windows of the mm10 mouse reference genome [23]. We considered an ORF to be expressed in any of the ten taxa if at least 10 reads (the upper threshold to be considered “lowly expressed” [23]) could be detected in the 200 bp windows overlapping at least 60% of the length of the first exon of the ORF.
Enhancer association
We obtained ChIP-seq data for H3K27ac, H3K4me1, and H3K4me3 modifications from 23 different tissues and cell types from the ENCODE project (bone marrow, cerebellum, cortex, heart, kidney, liver, lung, olfactory bulb, placenta, spleen, small intestine, testis, thymus, embryonic whole brain, embryonic liver, embryonic limb, brown adipose tissue, macrophages, MEL, MEF, mESC, CH12 cell line, and E14 embryonic mouse) [39]. We used liftOver to convert the genomic coordinates of the peaks from mm9 to mm10. We used the “merge” function of bedtools with default parameters to collate the peaks for all tissues and cell types, considering any overlapping H3K27ac and H3K4me1 peak as part of the same enhancer. We used the “intersect” function of bedtools with default parameters to separate H3K27ac and H3K4me1 peaks that overlapped any length of H3K4me3 peaks from those that did not. This resulted in 172,930 H3K27ac and 277,187 H3K4me1 peaks that did not overlap H3K4me3 peaks. We considered genomic regions with H3K4me3 peaks to be promoters, and those exclusively with H3K27ac and/or H3K4me1 peaks to be enhancers [41]. We measured the distance in base pairs between the first exon of an ORF and an enhancer or promoter using the “closest” function of bedtools with default parameters. To assess the number of enhancers surrounding an ORF, we considered the 50,000 base pairs upstream and downstream of the first exon of each ORF, and determined the number of H3K27ac and H3K4me1 peaks within that window.
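The logic of the enhancer definition and of the 50,000 base pair window can be sketched in plain Python. This is a simplified stand-in for the bedtools steps, not a reimplementation of them: intervals are half-open and restricted to one chromosome, all function names are hypothetical, and a peak is counted as lying in the window if it overlaps the window at all, which is an assumption about how "within that window" was applied.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[int, int]  # half-open [start, end), BED-style, one chromosome


def merge_intervals(intervals: Iterable[Interval]) -> List[Interval]:
    """Merge overlapping or book-ended intervals into single intervals."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two half-open intervals share at least one base."""
    return a[0] < b[1] and b[0] < a[1]


def call_enhancers(mark_peaks: Iterable[Interval],
                   h3k4me3_peaks: Iterable[Interval]) -> List[Interval]:
    """Keep merged H3K27ac/H3K4me1 peaks that overlap no H3K4me3 peak."""
    promoters = list(h3k4me3_peaks)
    return [p for p in merge_intervals(mark_peaks)
            if not any(overlaps(p, q) for q in promoters)]


def count_nearby(first_exon: Interval,
                 enhancers: Iterable[Interval],
                 flank: int = 50_000) -> int:
    """Count enhancers overlapping the exon extended by `flank` on each side."""
    window = (max(0, first_exon[0] - flank), first_exon[1] + flank)
    return sum(overlaps(window, e) for e in enhancers)
```

In this sketch, peaks pooled across tissues are merged first and then filtered against H3K4me3, mirroring the order of the steps described above.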
We also studied the association of ORFs that are expressed in different tissues to chromatin modifications in those same tissues. To do so, we used the transcriptomic data for brain, testis and liver from the samples of Mus musculus domesticus as described in the previous section to classify ORFs as expressed or not expressed in each tissue. We determined the fraction of ORFs expressed in each tissue that were up to 1kb away from an H3K4me1, H3K27ac, or H3K4me3 ChIP-seq peak identified from liver, testis, embryonic whole brain, and cortex samples.
We also considered bidirectional CAGE peaks, which are indicative of enhancers [42, 67]. We assigned bidirectional CAGE peaks to ORFs using the same criteria we used to assign H3K27ac and H3K4me1 peaks to ORFs, as described above.
ORF origin
Schmitz et al. [29] annotated each ORF as belonging to one of 8 different categories: “intergenic,” “close to promoter same strand,” “close to promoter opposite strand,” “overlapping same strand,” “overlapping opposite strand,” “overlapping coding sequence same strand,” “overlapping coding sequence opposite strand,” and “overlapping annotated gene in frame.” We considered all categories except “intergenic” to be “genic” in order to separate ORFs that are born within or near existing genes from those that are not. This classification is more challenging for non-mouse-specific ORFs due to the better annotation of older genes [29], which makes them more likely to correspond to the “overlapping annotated gene in frame” category even if they are of intergenic origin. We therefore further classified ORFs according to whether they shared their segment of open chromatin with another ORF. Specifically, we classified an ORF as “shared” if its first exon was in the same segment of open chromatin as the first exon of any other ORF, and as “single” otherwise.
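The two classifications can be expressed compactly. The sketch below is illustrative only: the identifiers are hypothetical, and it assumes that each ORF's first exon has already been mapped to an open-chromatin segment (or to `None` if it falls outside all segments, in which case the ORF is treated as “single”).

```python
from collections import Counter


def origin_class(category):
    """Collapse the eight annotation categories into 'intergenic' vs 'genic'."""
    return "intergenic" if category == "intergenic" else "genic"


def sharing_class(orf_to_segment):
    """Label each ORF 'shared' if another ORF's first exon is in its segment.

    orf_to_segment: {orf_id: segment_id or None}  (hypothetical layout)
    """
    counts = Counter(seg for seg in orf_to_segment.values() if seg is not None)
    return {
        orf: "shared" if seg is not None and counts[seg] > 1 else "single"
        for orf, seg in orf_to_segment.items()
    }
```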
Enhancer interactions
As with H3K27ac, H3K4me1, and H3K4me3 histone modifications, we evaluated the distance of each ORF to CTCF ChIP-seq peaks obtained from 15 different cell and tissue types (bone marrow, cerebellum, cortex, heart, kidney, developing limb during stage E14.5, liver, fibroblasts, mESC, olfactory bulb, small intestine, spleen, testis, thymus and the whole brain) [39]. We used liftOver to convert the data from mm9 to mm10.
Cusanovich et al. [45] used single-cell ATAC-seq data to predict physical interactions between regions of open chromatin [68], thus creating an atlas of enhancer interactions in single murine cells. We downloaded these data from the Mouse ATAC Atlas [45], which includes the cell clusters where the interactions occur, as well as the co-accessibility scores of pairs of regions of open chromatin, a measure of interaction strength. We disregarded cell clusters classified as “unknown” or “collisions,” as well as interactions with a co-accessibility score lower than 0.25, following Pliner et al. [68]. We also filtered out interactions with regions of open chromatin that harbored annotated promoters, in order to focus solely on interactions with enhancers. An interaction was assigned to an ORF if the ORF’s first exon was included in the interaction.
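The filtering steps can be sketched as follows. This is one possible reading of the procedure, not the code used in the study: the record layout is hypothetical, the ORF's first exon is assumed to have been mapped to a region identifier beforehand, and the promoter filter is applied to the partner region of each interaction.

```python
EXCLUDED_CLUSTERS = {"unknown", "collisions"}
MIN_COACCESSIBILITY = 0.25


def enhancer_interactions(orf_region, interactions, promoter_regions):
    """Return filtered (partner, cluster, score) interactions for one ORF.

    interactions: iterable of (region_a, region_b, cluster, score) records
    promoter_regions: set of region ids that harbor an annotated promoter
    """
    kept = []
    for region_a, region_b, cluster, score in interactions:
        if cluster.lower() in EXCLUDED_CLUSTERS:
            continue  # unassigned or doublet clusters
        if score < MIN_COACCESSIBILITY:
            continue  # weak co-accessibility
        if orf_region == region_a:
            partner = region_b
        elif orf_region == region_b:
            partner = region_a
        else:
            continue  # interaction does not involve this ORF
        if partner in promoter_regions:
            continue  # keep enhancer partners only
        kept.append((partner, cluster, score))
    return kept
```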
Age of annotated genes
To study how genes acquire enhancer interactions over macro-evolutionary timescales, we considered the subset of ORFs that belong to the opossum age class in Schmitz et al. [29] and that are annotated as genes in the latest version of Ensembl (release 95) [69]. We matched these genes to age estimates reported by Neme & Tautz [46], based on a phylostratigraphic analysis of 20 lineages spanning 4 billion years from the last universal common ancestor to the common ancestor of mouse and rat. We further filtered the dataset to only include ORFs that emerged in the first 15 of the 20 phylostrata, in order to focus on ORFs that are considered to have emerged before the split between the common ancestor of placental mammals and marsupials by both Schmitz et al. [29] and Neme & Tautz [46]. This left us with ~16,000 ORFs corresponding to annotated genes that emerged prior to the origin of placental mammals.
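A minimal sketch of this selection step is given below. The dictionary layout and identifiers are hypothetical, and genes without a phylostratum assignment are assumed to be excluded.

```python
MAX_PHYLOSTRATUM = 15  # first 15 of the 20 phylostrata


def pre_placental_genes(age_class, phylostratum, annotated):
    """Select annotated genes in the opossum age class and phylostrata 1-15.

    age_class:    {gene_id: age class label}   (hypothetical layout)
    phylostratum: {gene_id: integer stratum, 1 = oldest}
    annotated:    set of gene ids annotated in Ensembl
    """
    return sorted(
        gene
        for gene in annotated
        if age_class.get(gene) == "opossum"
        and gene in phylostratum
        and phylostratum[gene] <= MAX_PHYLOSTRATUM
    )
```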
Breadth of expression
To study the transcription of annotated genes, we used the expression data reported by the Tabula Muris Consortium [47] for the single-cell RNA sequencing performed with FACS-based cell capture in plates, for 20 different mouse tissues. The data include the log-normalization of 1 + counts per million for each of the annotated genes in each of the sequenced cells. We considered ten tissues that were also used for the construction of the Mouse ATAC Atlas [45]. We measured the expression breadth of each ORF corresponding to an annotated gene as the number of cell types in which expression could be detected in at least 5% of the cells assigned to a cell type. Additionally, we calculated the coefficient of variation of the expression of each gene as the standard deviation divided by the mean of the log-normalization of 1 + counts per million across cell types.
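Both measures can be illustrated with a short Python sketch. Several details are assumptions rather than statements of the original procedure: expression is treated as detected in a cell when its log-normalized value is greater than zero, the coefficient of variation is computed over per-cell-type mean expression, and the population standard deviation is used.

```python
from statistics import mean, pstdev


def expression_breadth(cells_by_type, min_fraction=0.05):
    """Number of cell types in which a gene is detected in >= 5% of cells.

    cells_by_type: {cell_type: [log-normalized value per cell]} for one gene
    """
    breadth = 0
    for values in cells_by_type.values():
        if not values:
            continue
        detected = sum(1 for v in values if v > 0)
        if detected / len(values) >= min_fraction:
            breadth += 1
    return breadth


def coefficient_of_variation(cells_by_type):
    """Standard deviation over mean of per-cell-type mean expression."""
    type_means = [mean(values) for values in cells_by_type.values() if values]
    overall = mean(type_means)
    if overall == 0:
        return float("nan")  # undefined for genes with no expression
    return pstdev(type_means) / overall
```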