Abstract
Long noncoding RNAs (lncRNAs) localize in the cell nucleus and influence gene expression through a variety of molecular mechanisms. RNA sequencing of two biochemical fractions of nuclei reveals a unique class of lncRNAs, termed chromatin-enriched nuclear RNAs (cheRNAs) that are tightly bound to chromatin and putatively function to cis-activate gene expression. Until now, a rigorous analytic pipeline for nuclear RNA-seq has been lacking. In this study, we survey four computational strategies for nuclear RNA-seq data analysis and show that a new pipeline, Tuxedo, outperforms other approaches. Tuxedo not only assembles a more complete transcriptome, but also identifies cheRNA with higher accuracy. We have used Tuxedo to analyze gold-standard K562 cell datasets and further characterize the genomic features of intergenic cheRNA (icheRNA) and their similarity to those of enhancer RNA (eRNA). Moreover, we quantify the transcriptional correlation of icheRNA and adjacent genes, and suggest that icheRNA may be the cis-acting transcriptional regulator that is more positively associated with neighboring gene expression than eRNA predicted by state-of-art method or CAGE signal. We also explore two novel genomic associations, suggesting cheRNA may have diverse functions. A possible new role of H3K9me3 modification coincident with icheRNA may be associated with active enhancer derived from ancient mobile elements, while a potential cis-repressive function of antisense cheRNA (as-cheRNA) is likely to be involved in transiently modulating cell type-specific cis-regulation.
Author Summary Chromatin-enriched nuclear RNA (cheRNA) is a class of gene regulatory non-coding RNAs. CheRNA provides a powerful way to profile the nuclear transcriptional landscape, especially to profile the noncoding transcriptome. The computational framework presented here provides a reliable approach to identifying cheRNA, and for studying cell-type specific gene regulation. We found that intergenic cheRNA, including intergenic cheRNA with high levels of H3K9me3 (a mark associated with closed/repressed chromatin), may act as a transcriptional activator. In contrast, antisense cheRNA, which originates from the complementary strand of the protein-coding gene, may interact with diverse chromatin modulators to repress local transcription. With our new pipeline, one future challenge will be refining the functional mechanisms of these noncoding RNA classes through exploring their regulatory roles, which are involved in diverse molecular and cellular processes in human and other organisms.
1. Introduction
Long noncoding RNA (lncRNA) is enriched not only in the cell nucleus, but also within the chromatin-associated fraction [1]. Many nuclear lncRNAs affect coding gene expression and chromatin organization, and are important in diverse biological processes [2, 3]. Among these nuclear lncRNAs, chromatin-associated lncRNA (cheRNA) processes gene-regulatory roles [4–6]. In our recent studies, cheRNA promotes essential gene-enhancer contacts and is associated with transcript factor-dependent lncRNA [5, 7]. However, a robust analytic pipeline for cheRNA as a group of functional nuclear RNAs has not been performed.
Bioinformatic efforts on nuclear RNA-seq are required because the distinctions between mRNA-seq and nuclear RNA-seq are substantial. The differences on library construction have significant consequences for the interpretation and analysis of the sequencing data [8]. For instance, sequencing of polyadenylated (polyA+) RNA may miss transcripts that are not usually polyadenylated, which includes many lncRNAs. Total-RNA sequencing detects a higher proportion of lncRNA, but is more expensive and less efficient in quantifying coding-gene expression [9].
In this study, we compare one published and three new analytic pipelines for nuclear RNA-seq data analysis. A newly developed pipeline, Tuxedo, outperforms the other pipelines with respect to transcriptome completeness, accuracy of cheRNA identification, and enrichment of enhancer-hallmarks at cheRNA gene regions. Using the Tuxedo pipeline, we identify cheRNAs genome-wide. Based on the GENCODE annotation (v25) [10], we verify that cheRNA provides a new way to annotate putative enhancer RNA (eRNA).
We explore the functions of two subsets of cheRNA with potential regulatory functions but new genomic markers: intergenic cheRNA transcribed from genes in condensed chromatin (marked by H3K9me3), and cheRNA transcripts antisense to coding genes. H3K9me3 is primarily recognized for transcript repression. Surprisingly, our data indicate that cheRNA-marked eRNAs are located within regions of high H3K9me3. Given H3K9me3 is associated with mobile elements and their transcription [11, 12], our data agrees with the theory that enhancers are co-opted from ancient mobile elements [13]. CheRNA may have subgroups, notably is the one antisense to coding genes distinctly showing negative correlations in transcription level with their corresponding sense mRNA. Our approach affords a straightforward approach to identifying and predicting novel regulatory lncRNA for future mechanistic evaluation.
2. Results
2.1. Nuclear RNA-seq requires rigorous computational strategies
Nuclear RNA-seq library construction differs from other RNA-seq protocols (Fig 1a). For example, the numbers of detected transcripts differ when nuclear or total RNA is sequenced, with 30.0% (7.0 k out of 23.3 k) of the transcripts detected only by total RNA sequencing, and 15.9% only by nuclear (Fig 1b). This difference is unlikely to be simply due to sequencing depth -- the median depth was 49M for four pooled total RNA samples and 33M for 22 nuclear RNA samples; the latter includes the 9 CPE and 9 SNE samples reanalyzed in this study (Table S1). Markers of transcriptional regulation including RNA polymerase II (Pol II) sites, transcription factor binding sites, cis-regulatory RNA structures, histone deacetylase, and histone enhancer hallmarks are common in the DNA corresponding to the 3700 RNAs detectable only by nuclear RNA-seq (Fig S1, Table S2). This observation agrees with previous suggestions that nuclear-retained lncRNA may interact with chromatin regulatory proteins and recruit them to cis-regulatory elements in order to regulate gene expression [1, 3]. Therefore, nuclear RNA-seq requires rigorous and effective pipelines different from the conventional pipelines used for total RNA-seq datasets.
After the isolation of RNA and generation of sequencing libraries, a typical RNA-seq analytic workflow involves sequencing hundreds of millions of reads, alignment of reads against a reference genome or transcriptome, and downstream statistical analysis of expression. In the original method developed by Werner et al. [4, 5], which we refer to as Werner, cheRNA was identified in four steps (S Method). This Werner pipeline has three important biases: 1) Werner overestimates the proportion of de novo transcripts originating from CPE because it applies reference-guided de novo assembly (which can discover novel transcripts) to CPE but not to SNE fractions. 2) Werner removes transcripts shorter than 1000 bases from the analysis. LncRNA transcripts are typically shorter than (median length 592 bases) protein-coding transcripts (median 2.4k bases), and 33% of GENCODE-annotated noncoding RNA is shorter than 1000 bases long [14]. Removing transcripts shorter than 1,000 bases from the CPE assembly leads to significant under-detection of lncRNA. 3) In the differential expression analysis, Cuffdiff was used in Werner. However, Cuffdiff cannot do a two-group test on RNAs that have high expression levels. For example, noncoding RNA XIST, which is a canonical cheRNA, was categorized as “HiDATA” and excluded from differential expression analysis by Cuffdiff. In addition, it has been shown that discarding genes that are not expressed at a biologically meaningful level in any condition (prefiltering) can increase the power for detecting differential gene expression [15], but Werner doesn’t include a prefiltering step in the differential expression analysis.
2.2. Tuxedo outperforms three existing and alternative analytic pipelines
Tuxedo builds a complete transcriptome for active transcripts
We developed three new pipelines (referred to here as Tuxedo, Concatenating, and Taco) to analyze these datasets; each with four major analytic steps (Fig 1c). Detailed discussion about the four nuclear RNA-seq analytic pipelines in theory is given in the S Method. The Tuxedo pipeline makes three key computational improvements: 1) Tuxedo assembles the complete transcriptome in an unbiased way, covering both highly-expressed transcripts and lncRNA shorter than 1,000 bases. 2) Tuxedo employs an empirical threshold to distinguish between low but informative lncRNA transcription and noise. And 3) for the first time, we report a correlation between the expression of icheRNA and adjacent genes, facilitating the prioritization of potentially cis-acting cheRNA for further experimental evaluation. The strategies used in the Tuxedo pipeline are not restricted to cheRNA identification, and could be beneficial to nuclear RNA-seq data analyses testing broader biological hypotheses, such as to the relationship between enhancer marked (e.g., Fig S1) and differentially expressed nuclear RNAs.
Given that lowly expressed transcripts are likely to be experimental noise [16], we adopted Tuxedo to identify lncRNAs that may expressed lowly. Unlike methods applied to coding gene profiles, in which one can define an expression cutoff for active promoters, we made an empirical decision to define predicted transcripts with CPM (counts per million) ≥ 1 as ‘expressed’ for downstream analysis. This filter resulted in an approximately log-normal distribution of expression levels and about 14 k measured transcripts per sample (Fig S2a), ensuring the appropriateness of model-based differential expressional analyses such as Limma [17].
To evaluate the transcriptomes assembled by the four pipelines, we used the assembly result of K562 cell data as a reference. We first compared the completeness of the transcriptomes. Transcriptomes assembled by the Tuxedo and Concatenating assemblies are very concordant. 99.8% of transcripts are the same. 84.4% of transcripts assembled by both Tuxedo and Concatenating are also assembled by Werner. This number decreases to 27.0% for Taco (Fig 2a). To determine the reasons for assembly inconsistency, we compared the assembly results for annotated transcripts (Fig 2b) and unannotated transcripts (Fig 2c) separately. The annotated transcripts assembled by Tuxedo, Concatenating and Werner pipelines are almost identical, and correspond to the set of annotated transcripts in GENCODE (v25). Taco only assembled 26.2% of annotated transcripts. This is because the Taco pipeline uses TACO instead of Cufflinks as the assembly tool. TACO only includes transcripts that have significant expression, while Cufflinks keeps all annotated transcripts when building the transcriptome. By looking at the expression level of transcripts, we confirmed that the 40.7 k annotated transcripts that are omitted by Taco are transcripts with low expression levels in K562. Among the unannotated transcripts assembled by Tuxedo and Concatenating, 96.8 % are the same. Moreover, Tuxedo and Concatenating assembled more transcripts than Werner and Taco. The length distribution of transcripts assembled by Tuxedo and Concatenating, but not by Werner, shows that the majority of such transcripts are shorter than 1000 bases (Fig 2d), which is caused by removing transcripts shorter than 1000 bases from CPE samples in Werner. Taco assembled the smallest number of unannotated transcripts. The unannotated transcripts omitted by Taco also have low expression. Even though these transcripts were lowly expressed in samples, it is still necessary to keep them in the assembled transcriptome to accurately estimate gene expression levels.
We next investigated the length of assembled transcripts (Fig 2e, Fig S2b). Approximately half of the assembled transcripts have lengths between 200-1000 bases and show similar log-normal distributions for Tuxedo and Concatenating. In contrast, transcripts assembled by Werner are generally longer (71% of the assembled transcripts are longer than 1000 bases), which is another indication of the incompleteness caused by removing short transcripts. Transcripts assembled by Taco are much shorter (83% of assembled transcripts are shorter than 1000 bases). The TACO assembler employs an algorithm based on change-point detection via binary segmentation to predict transcript structure [18]. This algorithm is more robust in the assembly of annotated and conserved transcript such as mRNA. However, when it is applied to assembly of noncoding RNA, the TACO assembler overestimates the degree of alternative splicing and results in a large number of truncated transcripts. This is incorrect since only a small fraction of lncRNA undergo splicing [19]. In summary, Tuxedo and Concatenating construct relatively complete and correctly structured transcriptomes for analysis.
Additionally, we compared the transcriptional activity of the expressed transcripts assembled by the four pipelines using two independent measurements: Pol II ChIP-seq and global run-on sequencing (GRO-seq) (Table S1). Expressed transcripts are defined as those having CPM ≥ 1. The expressed transcripts assembled by Tuxedo show the highest proportion of overlap with peaks representing both ongoing transcription by Pol II and peaks representing nascent transcription by GRO-seq (Fig 2f), demonstrating that expressed transcripts assembled by Tuxedo are more concordant with active transcription signal represented by other methods. In summary, we introduce the computational pipeline, Tuxedo, for analyzing nuclear RNA-seq data containing both high low expression lncRNA.
Tuxedo identifies cheRNAs precisely while recapturing three known genomic features of active enhancers
To evaluate the performance of the four pipelines in cheRNA identification, we used the set of transcripts identified by all methods as a proxy gold standard, and found Tuxedo and Concatenating outperformed Werner and Taco in the identification of both cheRNA and sneRNA (Fig 3a). To further check the accuracy we examined sixteen loci of known cheRNA, sneRNA, or chromatin-independent RNA (transcripts not significantly differentially expressed between CPE and SNE samples) that were previously validated in specific cell types [4, 5]. Tuxedo and Concatenating successfully confirmed the chromatin enrichment in all canonical cheRNAs and outperform Werner and Taco with overall positive predicted value (ppv) of 0.88 (Fig 3b, Table S3). This analysis, although possibly susceptible to threshold effects, makes up the shortage of lack of a true gold standard in the ROC-analysis. Both analyses suggest that Tuxedo and Concatenating are better than Werner and Taco.
Because intergenic cheRNA (icheRNA), which are defined as cheRNA without no coordinate overlap with known coding genes, is similar to eRNA [4, 5], we examined the occupancy of enhancer marks (ChIP-seq signals of EP300, H3K27ac, H3K4me1) and a repressive mark (H3K27me3) around the TSS of the 2.0 k to 6.7 k icheRNA identified by each pipeline (Fig 3c). In this analysis, we used ChromHMM [20]predicted eRNA (Fig 3c, yellow) and non-enhancer RNA (Fig 3c, black) as positive and negative controls. ChromHMM-predicted eRNA is defined as intergenic RNA that overlaps (at least 1 base, same strand) with any ChromHMM predicted “strong enhancer” region and ChromHMM-predicted non-enhancer RNA is defined as transcribed RNA that has no overlap with any predicted “strong enhancer” or “weak enhancer” region. We found that the levels of enhancer marks (EP300, H3K27ac, H3K4me1) are significantly higher around TSS of icheRNA than at the TSS of ChromHMM predicted non-enhancer RNA, while the level of repressive marks is significantly lower. Moreover, among the 4 pipelines, the icheRNA identified by Tuxedo pipeline have the highest levels of H3K27ac and H3K4me1 enhancer marks around their TSS, and relatively lower levels of repressive marks. We also noticed that the levels of enhancer marks around TSS of ChromHMM-predicted eRNA are much higher than those around TSS of icheRNA. Considering that ChromHMM predicts enhancer regions based on histone modification patterns, the enhancers predicted by ChromHMM may be biased toward having high occupancy of these canonical enhancer marks. Additionally, all three new pipelines slightly improved the cell-type specificity compared to Werner, as evaluated by the proportion of tissue-specific icheRNA identified by each pipeline (represented by R1 score in Fig 3d).
Overall, we conclude that Tuxedo and Concatenating outperform the other two pipelines in identifying expected cheRNA, and that the Tuxedo predicted icheRNA transcripts are more highly enriched in enhancer hallmarks compared to other methods. In this sense, Tuxedo outperforms Concatenating and other pipelines in enriching enhancer hallmarks in the same cell type.
2.3. Intergenic cheRNAs (icheRNA) uniquely present eRNA features
icheRNA represents a subset of noncoding RNAs de novo
Werner et al proposed that icheRNA is a distinct subclass of unannotated eRNA. To further examine this hypothesis, we categorized the 14k expressed nuclear RNAs detected by Tuxedo into three groups (intergenic RNA, coding-antisense RNA (labeled as “antisense RNA” in Fig 4a) and those that overlap mRNAs in the sense orientation (labeled as “mRNA” in Fig 4a). Fig S3 shows the workflow used to categorize the three RNA groups). A large fraction (66%) of the 5,680 identified cheRNAs are transcribed from noncoding regions (Fig 4a, pink bar). In contrast, approximate 90% of the identified 5,672 sneRNAs were mRNAs (Fig 4a, blue bar). Additionally, icheRNA exhibits lower coding potential (cumulative CPC2 score [21]) than coding genes, intergenic sneRNA (isneRNA), and intergenic RNA transcribed from ChromHMM- or FANTOM-[22] predicted enhancer regions in the same cell lines (Fig 4b). The coding potential of icheRNA is therefore more similar to that of ChromHMM predicted eRNA, while that of isneRNA is more similar to that of mRNA.
81% (2.7 k) of the identified 3.3 k icheRNAs are previously unannotated transcripts, in contrast to only 6% (27) of the 459 isneRNAs, (Fig 4c). Additionally, over half (69% of 445) of the antisense cheRNAs are unannotated, in contrast to only 2% of 163 antisense sneRNAs. This biased annotation of noncoding RNA suggests that previously detected noncoding RNAs primarily correspond to chromatin-depleted noncoding RNA (noncoding sneRNA), and that identifying chromatin enriched RNAs from nuclear extracts can give a more balanced picture of the overall noncoding RNA population.
icheRNA positively correlate with adjacent genes in expression
RT-PCR experiments have shown that several eRNAs are intergenic chromatin enriched RNAs (icheRNA) [7]. Werner et al. also showed that protein-coding genes proximal to icheRNA have higher expression levels than those near to other expressed lncRNA, suggesting that icheRNA could predict cis-gene transcription [4, 5]. However, it is not clear from previous work whether higher icheRNA expression is correlated with expression of proximal protein-coding genes. To quantitatively confirm the cis-regulatory potential of icheRNA, we calculated the Pearson correlation coefficient between the expression of icheRNA and neighboring protein-coding genes, and compared it to the correlation coefficient between the expression levels of icheRNA and randomly selected protein-coding genes. isneRNA and neighboring protein-coding gene, ChromHMM predicted eRNA and neighboring protein-coding gene, and FANTOM predicted eRNA and neighboring protein-coding gene.
We find that icheRNA are more positively correlated with neighboring genes than with randomly selected genes (Fig 4d, red line shows relative density > 1 when correlation coefficient > 0.5). The same calculation for FAMTOM- or ChromHMM-predicted eRNAs, which are believed to have cis-regulatory enhancer effects, and adjacent genes in the same cell types, shows similar but weaker positive correlations. In contrast, pairs of intergenic sneRNAs (isneRNA) and neighboring genes (blue line) showed negative correlation (Fig 4d, blue line shows relative density > 1 when correlation coefficient < −0.5). Specifically, with a significance cutoff of correlation coefficient=0.8, 23% of the identified icheRNA transcripts, in contrast to only 11% of the isneRNA are positively correlated with proximal genes. This observation, for the first time, gives quantitative evidence for a potential cis-regulatory effect of icheRNA on adjacent genes. It also suggests that identification of icheRNA can be used as another approach to predict eRNA, comparable to approaches using ChromHMM and FANTOM database.
Transcriptional correlation analysis also displayed high relative density at correlation coefficient < −0.5 for pairs of icheRNA and neighboring coding genes (Fig 4d), indicating that not all icheRNAs are positively correlated with proximal protein-coding gene expression. Indeed, XIST is a canonical icheRNA that has a well-known repressive regulatory role, and it might be one of the icheRNAs that are negatively correlated with proximal protein-coding genes.
Polyadenylated RNA is relatively depleted in icheRNA
Most eRNAs have been reported to be unspliced and non-polyadenylated [23–26]. To test if icheRNA are similar in this regard, we compared the relative expression (measured as Reads Per Kilobase of transcript per Million mapped reads, RPKM) of intergenic cheRNAs in nuclear Poly(A)+ RNA-seq library and nuclear total RNA-seq libraries using published datasets for K562 cells (Table S1). We observe (Fig 4e) lower relative abundance of icheRNA in the nuclear total-RNA-seq library than in the nuclear Poly(A)+ RNA-seq library, indicating the majority of icheRNA lack polyadenylation. A similar but weaker preference for the total-RNA-seq library was also observed for antisense cheRNAs (Fig S4). In contrast, all protein-coding mRNAs have equivalent expression levels in two libraries, which is consistent with the role of polyadenylation in producing mRNA in the eukaryotic cell nucleus [27]. Chromatin depleted non-coding RNAs (isneRNA and antisense sneRNAs) also have similar expression levels in the two RNA-seq libraries as those of mRNAs (Fig S4). The patterns of polyadenylation indicate that icheRNA and isneRNA are differentially polyadenylated. With respect to polyadenylation, icheRNA is more similar to eRNA than is sneRNA, since the majority of icheRNA are not polyadenylated.
icheRNA and isneRNA confer different chromatin characteristics
Histone 3 lysine 4 monomethylation (H3K4me1) and histone 3 lysine 27 acetylation (H3K27ac) have been identified as key histone modification features that mark enhancers. H3K4me1 is present at both poised and active enhancers [28], while H3K27ac uniquely marks active enhancers [29]. Werner et al. previously observed peaks of H3K27ac near the transcriptional start sites (TSS) of icheRNA, however, unlike prototypical eRNA, these regions did not show abundant H3K4me1 modification [4]. To further investigate whether icheRNA have a distinct chromatin signature, we profiled the relative reads per million (RPM) of RNA polymerase II (POLII), H3K27ac, H3K4me3, H3K4me1, and H3K27me3 marks on the flanking 1 kb sequences around TSS of icheRNA, isneRNA, mRNA and unexpressed mRNA (RNAs annotated in GENCODE(v25) but not transcribed in K562) (Fig 4f). IcheRNA shows low levels of marks associated with active transcription (POLII and H3K4me3), similar to the levels of unexpressed mRNA, and lower than those of isneRNA and mRNA (Fig 4f1, red and purple box). In contrast to unexpressed mRNA, icheRNA TSS flanking regions show low levels of repressive (H3K27me3) and poised enhancer (H3K4me1) marks (Fig 4f2, red and purple box), but are enriched in active enhancer (H3K27ac and EP300) marks (Fig 4f3, red and purple box). Note that in addition to being enriched at enhancer regions, H3K27ac and EP300 are also pervasively found near TSS of actively transcribed regions. icheRNA TSS thus have a chromatin profile that is distinctly different from those of mRNA, isneRNA, and unexpressed mRNA, suggesting that significantly different modes of regulation may be controlling icheRNA expression.
In summary, icheRNA and isneRNA differ in many respects. In addition to the enrichment of specific epigenetic marks near the TSS, icheRNA has lower coding probability, lacks polyadenylation, and its expression is more positively correlated with that of neighboring coding genes. Overall icheRNA is more similar to eRNA, while isneRNA is more similar to mRNA. The similarity of icheRNA to eRNA, as defined by ChromHMM and FANTOM predictions, suggests that icheRNA identification may provide a useful independent approach to predicting eRNA.
2.4. Cis-regulatory potential of both icheRNA coincident with histone mark H3K9me3 and the cheRNA antisense to a coding gene
icheRNA with H3K9me3 signal across transcript bodies is more prone to present active cis-regulation
Histone 3 lysine 9 trimethylation (H3K9me3) is associated with constitutive heterochromatin, and has been shown to mark transcriptionally repressed regions that are mutually exclusive with H3K27me3 marked repressive regions [30–33]. We find that the levels of H3K9me3 near (100 base away from) actively transcribed icheRNA and mRNA TSS (Fig 5a1, red line and green line) are much higher than near transcriptionally silenced regions (DNA regions near to unexpressed mRNA) (Fig 5a1, purple line). In addition, H3K9me3 profiles at actively transcribed regions are quite different from those at transcriptionally silent regions: H3K9me3 signal drops at the TSS of transcribed RNA (icheRNA, isneRNA, and mRNA) (Fig 5a1, red line, blue line and green line) but is flat around TSS of unexpressed mRNA (Fig 5a1, purple line). It has been suggested, for coding transcripts, that H3K9me3 at the promoter is repressive, whereas H3K9me3 across the mRNA transcript body is activatory [31]. The pattern we observe is similar, and when combined with the previous observation that high levels of H3K9me3 modification are present in some active genes [34], it suggests that high H3K9me3 levels do not necessarily indicate transcriptional repression; H3K9me3 modification at the TSS region is more strongly associated with transcriptional silencing, in contrast, H3K9me3 at gene body regions can be actively transcribed.
We also note that H3K9me3 levels within the DNA region of transcribed icheRNA is substantially higher than near other transcribed RNA (e.g., mRNA and isneRNA) (Fig 5a2). To further investigate the effect of H3K9me3 on icheRNA transcription, we separated DNA regions transcribing icheRNA into high H3K9me3 (at least one peak of H3K9me3 mark near the transcribed icheRNA) and low H3K9me3 (no H3K9me3 mark near the transcribed icheRNA) groups. These groups are labeled as “icheRNA with H3K9me3” and “icheRNA without H3K9me3”, respectively in Fig 5b. DNA regions in the “icheRNA with H3K9me3” have significantly higher levels of H3K9me3 modification than those in the “icheRNA without H3K9me3” group (Fig 5b1). Furthermore, chromatin signatures associated with active transcription (POL2, H3K4me3) (Fig 5b2), as well as transcription levels in both CPE and SNE samples (Fig 5c), are strikingly elevated in the “icheRNA with H3K9me3” group compared to the “icheRNA without H3K9me3” group, indicating that icheRNA are more actively transcribed within regions with high H3K9me3 modification. It also reinforces the evidence indicating that regions with abundant H3K9me3 modification can be actively transcribed.
Our previous analysis showed that icheRNA possesses features similar to eRNA, however, the TSS of icheRNA show only moderately higher levels of enhancer marks compared to unexpressed mRNA, and lower levels than TSS of isneRNA. We measured the levels of enhancer marks (H3K27ac, EP300 and H3K4me1) around TSS of “icheRNA with H3K9me3” (Fig 5d, red box). We found that levels of active enhancer marks (H3K27ac and EP300) around TSS of “icheRNA with H3K9me3” are significantly higher than at the TSS of “icheRNA without H3K9me3” (Fig 5d, yellow box) and TSS of isneRNA (Fig 5d, purple box), indicating that icheRNA surrounded by H3K9me3 marks shows high levels of active enhancer marks near the TSS, but all icheRNA do not. Moreover, we measured the H3K9me3 levels across canonical icheRNA transcribed regions and found that three previously identified icheRNA (HIDALGO, ILYICH, BONIFACIO) with validated positive activator functions show relatively higher H3K9me3 levels than the only icheRNA with a known repressive role (XIST) (Fig 5e). These examples reinforce the hypothesis that DNA regions transcribing icheRNA, even with high levels of H3K9me3 modification, may act as enhancers.
Antisense cheRNA (as-cheRNA) co-occurs with local mRNA silencing (Fig 6)
Antisense RNA (asRNA) complementary to protein-coding transcript(s) has been shown to interfere with transcription of mRNA on the opposite strand [35]. Consistent with this, asRNA accumulates preferentially in the nucleus associating with chromatin, we observe that almost (59%) of the identified 756 asRNAs in K562 cell nucleus are chromatin enriched and only 22% are chromatin depleted (Fig 4a), indicating a significant enrichment of asRNA in the chromatin pellet (Fig 6a). Moreover, we notice that about one third of the chromatin enriched asRNAs (antisense cheRNA, as-cheRNA) are unannotated while almost all chromatin depleted asRNAs (antisense sneRNA, as-sneRNA) are annotated (Fig 6b), suggesting that many as-cheRNA are completely novel.
Regulatory mechanisms involving asRNA range from simple transcriptional interference through competing for RNA Pol II [36] to regulation of epigenomic modifications [37, 38]. A current hypothesis suggests that asRNA is acts in gene regulation at the chromatin level by recruiting epigenetic regulators, e.g., polycomb repressive complex 2 (PRC2), to its corresponding sense mRNA to induce histone methylation and gene repression [39, 40]. Inspired by this hypothesis, we investigated a similar potential function for both as-cheRNA and as-sneRNA. Functional RNA molecules often exhibit secondary structures that are better conserved than their sequences [41], we first interrogated the sequence based predicted secondary structure of as-cheRNA and as-sneRNA in comparison to known RNA families in Rfam. Rfam collects multiple-sequence alignment-based families of RNA secondary structural motifs [41]. The motif sizes are generally less than 400 bases long Fig S5b), much shorter than the asRNA in the assembled transcriptome. We annotated each asRNA as belonging to a Rfam family if it fully covered a Rfam family motif. We then calculated, for each Rfam family, a) the fraction of as-cheRNA/as-sneRNA annotated to this Rfam family (the fraction in observation); b) the fraction of all assembled RNA annotated to this Rfam family (the fraction in background); and c) the ratio of the fraction in observation over the fraction background (RR-score). An RR-score larger than 1 indicates that as-cheRNA/as-sneRNA is overrepresented in the selected Rfam family. Among fourteen major RNA structural groups in the Rfam database (v13, hg38) (Fig S5a), three structural groups (Histone 3, lncRNA, and antisense) are significantly overrepresented in as-cheRNA (Fig 6c, two or more folds). In particular, the overrepresentation of the antisense structure group among as-cheRNA suggests that the function of as-cheRNA, rather than that of as-sneRNA, is likely to be structure-based.
We then measured the transcription level in CPE and SNE of mRNA that antisense overlaps with as-cheRNA and as-sneRNA. We find that the transcription of both as-cheRNA-colocalized mRNA and as-sneRNA-colocalized mRNA are relatively low compared to that of random mRNA (Fig 6d), suggesting a negative correlation between the transcription of sense and antisense RNA. Even though both as-cheRNA-colocalized mRNA and as-sneRNA-colocalized mRNA are shown to be repressed at similar levels, the chromatin features and histone patterns around the TSS of the two mRNA groups are significantly different. The TSS of as-cheRNA-colocalized mRNA (Fig 6e-f, red box) are significantly less open (measured by Encode ATAC-seq signal), have fewer active transcription marks (POL2, H3K4me3), but have more repressive marks (H3K27me3) and show higher PRC2 complex binding (SUZ12, EZH2) compared with random mRNA (Fig 6e-f, black box). This pattern was not observed in as-sneRNA-colocalized mRNAs (Fig 6e-f, blue box). Altogether, this suggests that as-cheRNA and as-sneRNA may cis-repress gene transcription through different mechanisms. As-cheRNA may be cis-regulatory elements that repress transcription of colocalized mRNAs on the opposite strand via recruiting the PRC2 complex to specific genomic loci.
3. Discussion
Detail analysis of nuclear RNA-seq sheds new light on cis-regulatory elements from lncRNA that are shorter than 1,000 bases or lowly transcribed previously embedded in noise. Operationally, cheRNA is defined by its statistically significant enrichment in chromatin after biochemical fractionation of nuclei. With our improved computational strategy, we have examined the molecular characteristics of cheRNAs in greater detail than has heretofore been possible. We find that, first, cheRNAs are more likely to be transcribed from noncoding regions, while sneRNAs are mostly transcribed from protein-coding regions. Second, icheRNA has a lower transcription level and is largely unannotated, in contrast to isneRNA which is more highly transcribed and better annotated. Traditional transcriptome profiling of non-coding RNA, using techniques such as total RNA-seq, yields the broadest survey of transcripts, but has limited ability to detect low expression transcripts such as those of icheRNA. Thus, previous analyses of noncoding RNA primarily focused on noncoding RNA with relatively high transcription levels (e.g., isneRNA and as-sneRNA). In contrast, sequencing and identifying chromatin enriched RNAs in a nuclear extract more sensitively identifies low expression noncoding RNAs that previously have been ignored by conventional sequencing and analysis methods. Third, we have shown that icheRNA, in contrast to isneRNA, is mostly non-coding, non-polyadenylated, and positively correlated with the expression of neighboring coding genes (Fig 5a-5e). Notwithstanding the similarity of these features to those of eRNA, icheRNA has several unique molecular characteristics that distinguish it. For example, icheRNA is generally longer than eRNA (median length of icheRNA is ~4,400 bases; eRNA is ~350 bases, [42]) and icheRNA shows only modest coincidence with enhancer marks (H3K27ac, H3K4me1 and EP300) that are used to canonically define eRNA (Fig 4f). Moreover, some canonical icheRNA (e.g., XIST) are known to be repressive regulators rather than activators as is eRNA. Combining all this evidence, we conclude that icheRNA and eRNAs are two distinct non-coding RNA groups that overlap. Despite there are differences, icheRNA still shows more apparent similarity to eRNAs than other RNA groups.
IcheRNA transcripts mapped to regions with high levels of H3K9me3-marks in the transcript body are more actively transcribed, and more highly associated with elevated levels of enhancer marks, than icheRNA transcribed from regions with low levels of H3K9me3 marks. This observation indicates that H3K9me3 may cover potential enhancer regions. The association between H3K9me3 and enhancers was also previously suggested by [43], who described the widespread presence of H3K9me3 at enhancer flanking regions. They also showed anecdotal examples in which regulating H3K9me3 levels at the enhancers of Mdc and Il12b affected Mdc and Il12b transcription in dendritic cells and macrophages, suggesting that H3K9me3 plays an important role in regulating enhancer activity [43]. If it can be verified that the regulatory role of H3K9me3 is a common feature of enhancers, icheRNA coincident with H3K9me3 marks may prove a very effective predictor for chromatin-based eRNA, and may be a powerful approach to predicting novel enhancer regions.
Both icheRNA and transcribed mRNA show a severe drop of H3K9me3 signal at the TSS; however, the degree to which H3K9me3 signal is decreased at the TSS is much greater for mRNA than icheRNA (Fig 5a1). As H3K9me3 is associated with condensed chromatin regions, we hypothesized that the body of icheRNAs are located within condensed chromatin regions. Agreeing with this hypothesis is the high signal of H3K9me3 along icheRNA transcript body (Fig 5a2). In metazoan cell nuclei, hundreds of large chromatin domains, termed Lamina-Associated Domains (LADs), are enriched in histone modification of H3K9me2 and H3K9me3, modifications that are typical of heterochromatin [44]. A study on a 1 Mb LAD encompassing the human HBB loci showed that knockdown of H3K9me3 by depletion of the two H3K9me3 methyltransferases Suv39H1 and Suv39H2 caused detachment of the LADs and nuclear lamina [45]. Considering H3K9me3 is a chromatin mark associated with closed/repressed chromatin, lncRNA transcription from H3K9me3-enriched LADs is expected to be repressed. However, the unexpected association between icheRNA and high levels of H3K9me3 chromatin marks suggests that icheRNAs may be embedded in, and actively transcribed from, condensed LADs. Indeed, we find that 48% of icheRNAs are transcribed from LADs (greater than chance expectation, empirical p < 2.2e-16), in contrast, only 12% of other RNAs are transcribed from LADs. Moreover, agreeing with the previous hypothesis by Werner et al. that transposable element may provide an evolutionary origin to chromatin enriched RNAs [5], we noticed that 82% of icheRNAs and 96% of icheRNAs with H3K9me3 chromatin marks in K562 overlap with class 1 transposable elements (TEs). Together, these observations suggest that icheRNA may represent a group of RNAs transcribed from condensed chromatin domains derived from mobile elements [13], and that the transcription of these domains is regulated in a cell-specific way.
Antisense RNA (asRNA) is another class of noncoding RNA that has been shown to have cis-regulatory functions. Consistent with previous knowledge, our analysis confirms that asRNA is more abundant in the nuclear chromatin enriched pellet than in soluble nuclear pellet. Similar to isneRNA and icheRNA, almost all as-sneRNAs are annotated, while a large fraction of as-cheRNAs lack annotation. This further suggests that sequencing RNAs abundant in the nuclear chromatin pellet can identify many novel noncoding RNAs. Despite the fact that both as-cheRNA and as-sneRNA show negative correlations in transcription level with their corresponding sense mRNA, the chromatin pattern around the TSS of as-cheRNA-colocalized mRNA and as-sneRNA-colocalized mRNA are quite different. Regions around the TSS of as-cheRNA-colocalized mRNA are less open and lack active transcription marks, but have high a level of H3K27me3 and PRC2 binding, suggesting that as-cheRNA may regulate sense mRNA transcription in cis-acting as a guide RNA for regulatory complexes that modify the target chromatin. Even though this investigation of as-cheRNA is still preliminary, it provides some testable hypotheses for asRNA function.
Chromatin-enriched nuclear RNA provides a powerful way to profile the nuclear transcriptional landscape, especially to profile the noncoding transcriptome. The computational pipeline presented here provides researchers with a reliable approach to identifying cheRNA, and for studying cell-type specific gene regulators. Although the cheRNA is unlikely to be monolithic in function, icheRNA, including icheRNA with high levels of H3K9me3 marks, may act as a transcriptional cis-activator similar to eRNA. In contrast, as-cheRNA may interact with diverse chromatin modulators to cis-repress transcription. With the Tuxedo pipeline, the future challenge will be refining the functional mechanisms of these noncoding RNA classes through exploring their regulatory roles, which are involved in diverse molecular and cellular processes in human and other organisms.
3. Methods
Tables S1-2 list all publically available datasets analyzed in this study.
We compared three pipelines with the original cheRNA-identification pipeline [4]. Each pipeline includes four analytic steps: sequence mapping, transcript assembly, transcriptome construction, and signature identification (Fig 1c). Computational strategies in the latter three steps varied in four different pipelines (S Method). Source file for the Tuxedo pipeline is provided in the Supplementary Materials.
Relative density of correlation between intergenic RNAs and neighbor coding genes (Fig 4d)
We calculated the pairwise Pearson correlation coefficient (PCC) between the intergenic RNA and protein-coding gene. We tested five types of intergenic RNA-gene groups: the icheRNA with random protein-coding gene pairs; the icheRNA with neighbor protein-coding gene pairs (icheRNA:neighborCoding); ChromHMM-predicted eRNAs with neighbor protein-coding genes pairs (ChromHMM-neighborCoding), FANTOM-predicted eRNAs with neighbor protein-coding genes pairs (FANTOM-neighborCoding) and the isneRNA with neighbor protein-coding gene pairs (isneRNA:neighborCoding).
The PCC of each intergenic RNA-gene pair was calculated based on their expression levels across all CPE and SNE samples of three cell types (K563, HEK293, H1-hESC). To pair an intergenic RNA with its neighbor protein-coding gene out of its nearest upstream and downstream genes on the same strand, the one with the highest absolute PCC value is selected. A significant cutoff of PCC values was set at −0.8 or 0.8, respectively for the negative or positive correlation. Kernel density is estimated for each intergenic RNA-gene pair group. Relative density for each intergenic RNA and neighboring protein-coding gene pairs group (e.g. icheRNA:neighborCoding) is calculated in the way of dividing the kernel density estimates of indicated intergenic RNA and neighboring protein-coding gene pairs group (e.g. icheRNA:neighborCoding) by the kernel density of icheRNA and randomly selected coding gene pairs group.
RNA structural analysis (Fig 6c)
RNA structural analysis based on the Rfam annotations (v13, hg38) was cinducted for each Rfam-family, by dividing its frequency within a ncRNA set (icheRNA or isneRNA) versus the frequency in all assembled RNAs. Each Rfam family is represented by a multiple-sequence alignment, a consensus secondary structure, and a covariance model [41]; and we grouped one or more annotating families into a super-family according to their function as well proportions in the above noncoding transcriptome (Fig S5a). The homologous ncRNA sequences in each super-family were generally less than 400 bp (2.6 on the log10-scale, Fig S5b), much shorter than the ncRNAs in the assembled transcriptome. Therefore, only when a ncRNA transcript fully covers an annotating family sequence, we annotated this ncRNA transcript with a Rfam super-family.
To assess the probabilities of a Rfam super-family (i) among a set of ncRNAs of interest (t), we calculated the ratio of ratios (RR) using Formula 1. where T = {t} is the collection of all noncoding transcripts in the transcriptome, and |.| is the number of transcripts meeting a condition.
This RR score was calculated for each Rfam-family for its frequency within a ncRNA set (t) versus its global frequency. Therefore, an RR-score above 2 indicates a ncRNA set (t)-selective RNA structural family.
Author Contributions
XY initiated this work and supervised this work together with MG. XS contributed in collecting datasets, designing pipelines, performing evaluation and analysis work, and interpreting results, with input from XY, MG, IM and AR. ZW helped with Rfam data analysis (Fig 6c), transcription factor data collection (Table S2), and transcription factor occupancy analysis (Figs 1, S1). CP participated in implicating pipelines at early stage of the project. XS wrote the manuscript, with comments from MG. XY designed pipelines, contributed to the discussion, wrote the abstract, part of introduction, and Rfam method of the manuscript, with comments from MG, IM. XS, XY revised the manuscript, with comments from MG, IM. All authors approved the final manuscript.
Declaration of Interests
The authors declare no competing interests.
Acknowledgments
This work was supported by NIH R21LM012619 and through resources (BEAGLE) provided by the University of Chicago under grant 1S10OD018495-01. We would also like to acknowledge ENCODE for contributing sequencing datasets. We specifically acknowledge the assistance of Lorenzo Pesce and Purdue University ITaP Research Computing (RCAC) team for supporting high performance computing services. We thank Jeffrey Steimle and Kohta Ikegami for discussion on H3K9me3 patterns.
Footnotes
Xiangying Sun sun325{at}purdue.edu, Zhezhen Wang zhezhen{at}uchicago.edu, Carlos Perez-Cervantes cperez5{at}uchicago.edu, Alex Ruthenburg aruthenburg{at}uchicago.edu, Ivan Moskowitz imoskowitz{at}peds.bsd.uchicago.edu, Michael Gribskov mgribsko{at}purdue.edu, Xinan H Yang xyang2{at}uchicago.edu