Abstract
Heterochromatin is associated with transcriptional repression. In contrast, several genes in the pericentromeric regions of Drosophila melanogaster depend on this heterochromatic environment for their expression. Heterochromatic genes encode proteins involved in various developmental processes. Several studies have shown that a variety of epigenetic modifications are associated with these genes. Here we present a comprehensive analysis of the epigenetic landscape of heterochromatic genes across all the developmental stages of Drosophila using the available histone modification and expression data from modENCODE. We find that heterochromatic genes exhibit combinations of active and inactive histone marks that correspond to their level of expression during development. Accordingly, we classified these genes into three groups based on the combinations of histone modifications present. We also looked for potential regulatory DNA sequence elements in the genomic neighborhood of these genes. Our results show that Nuclear Matrix Associated Regions (MARs) are prominently present in the intergenic regions of heterochromatic genes during embryonic stages, suggesting a plausible role in pericentromeric genome organization. We also find that the intergenic sequences in the heterochromatic regions have binding sites for transcription factors known to modulate epigenetic status. Taken together, our meta-analysis of the various genomic datasets suggests that the epigenomic and genomic landscapes of the heterochromatic genes are distinct from those of euchromatic genes. These features could contribute to the unusual regulatory status of the heterochromatic genes, in contrast to the surrounding heterochromatin, which is repressive in nature.
Introduction
Eukaryotic genomes can be broadly classified into euchromatin and heterochromatin. While euchromatin is gene-rich and transcriptionally active, heterochromatin is repeat-rich, gene-poor and refractory to transcription. Constitutive heterochromatin, present at the centromeres and telomeres, is marked by the repressive histone modifications H3K9me3 and H4K20me3 and by Su(var) proteins like HP1a. Constitutive heterochromatin accounts for nearly 30% of the genome in Drosophila melanogaster [1] but has remained relatively less studied, as it is believed to have fewer functional features. However, there has been a paradigm shift in recent years, as active transcription of protein-coding genes from constitutive heterochromatin has been reported in different species such as fly, mouse, human and plants [2–6].
Heterochromatic genes encode proteins with diverse cellular functions, such as ribosomal proteins, kinases and transporter proteins [7]. Drosophila is an excellent model whose robust genetic techniques made studies on genes located in deep heterochromatin possible even before the advent of genomic technologies. Heterochromatic genes were initially discovered in Drosophila melanogaster while searching for complementation groups in an ethyl methanesulfonate mutagenesis screen, and the well-known heterochromatic genes light and rolled were mapped on either side of the centromere on chromosome 2 [8]. Subsequently, many other heterochromatic genes were characterized on the autosomes [9–17]. Several heterochromatic genes were also identified on the sex chromosomes [18]. The 4th chromosome, although small, has peculiar features with respect to heterochromatin. While some parts of the 4th chromosome are facultative heterochromatin, the majority of it (~75%) consists of constitutive heterochromatic patches interspersed with active genes between those domains [19]. With improvements in genome-sequencing technologies and the Drosophila Heterochromatin Genome Project (DHGP) [20], the number of known centromeric heterochromatic genes in Drosophila melanogaster now runs into the hundreds [1]. When the structure of D. melanogaster heterochromatic genes was compared with that of their orthologs in other Drosophila species, where they are located in euchromatin, the promoter sequences were found to be unchanged. However, in Drosophila melanogaster these genes have larger introns due to integration of transposable elements [21]. This suggests that, as a measure to counteract genomic invasions by transposable elements, these genes are now in the heterochromatin but have evolved newer regulatory mechanisms to retain their functional status. Such regulatory mechanisms utilize the specific features of the genomic and epigenomic landscape of heterochromatin.
Interestingly, heterochromatic genes not only reside in repressive domains but also require the heterochromatic environment for transcription. Chromosomal rearrangements that translocated these genes into euchromatin reduced their expression levels [22–23]. These genes showed ‘reciprocal’ heterochromatic Position Effect Variegation (PEV), more prominently in the background of Su(var) mutants, as in the case of rolled and light [24]. It was also reported that combinations of Su(var) mutations, namely Su(var)205 (HP1a), Su(var)208 and Su(var)210, could reduce eye pigmentation by lowering light expression even in the absence of chromosomal rearrangements [25,26]. Expression of both RpL15 (an essential gene) and Dbp80 (a non-essential gene) was compromised in a genetic background deficient for HP1a [14]. These experiments confirmed that genes located in heterochromatin require the heterochromatic environment for their expression. However, the mechanisms of transcriptional regulation of heterochromatic genes are still unknown.
In this study, we have explored the DNA sequence and epigenetic features of the heterochromatic genes of Drosophila melanogaster to understand how heterochromatin not only permits but is also essential for the transcription of genes located within it. By customizing our epigenomic analysis tool, C-State [27], we have created an interactive Drosophila melanogaster heterochromatic gene database, Drosophila Het C-State. It uses published whole-genome datasets to provide a platform for integrating the histone mark trends at heterochromatic genes with their expression profiles at different developmental stages. We find that heterochromatic genes can be broadly classified into three groups based on the combinations of histone marks associated with them in their highest and lowest expression stages. We also determined the characteristics of nuclear matrix associated regions (MARs) present in the pericentromeric heterochromatin. Furthermore, we investigated the intergenic sequences for enrichment of transcription factor binding sites. These results provide novel insights into the genomic and epigenomic features of Drosophila heterochromatic genes and set the stage for further experimental investigations.
2 Materials and methods
2.1 Retrieving datasets for epigenomic meta-analysis
Sixty well-characterised pericentromeric heterochromatic genes (with annotated gene structure and known function and expression profile) were chosen. Gene expression data for these genes, normalized for both sequencing depth and transcript size (RPKM values, Supplementary Table 1: Stagewise expression file), were retrieved from the FlyBase Temporal RNA Expression Data [28] using DGET [29]. The use of RPKM values allowed us to compare the expression levels of genes directly across developmental stages. For each gene, a gene-epicard (available at the Table Summary of the tool webpage) was created that contains information such as location, size, expression values in each stage, expression pattern and functions. To compare the dynamics of the epigenetic landscape over each gene, the histone modification ChIP-seq data for all developmental stages (generated by the Kevin White lab, University of Chicago, as part of the modENCODE project) were downloaded from modENCODE [30]. We used ChIP-chip data where ChIP-seq data were unavailable for a given histone modification or developmental stage. To facilitate comparison across studies, we used the interpreted data files that provide enrichment of the histone modifications as genome coordinates. These were scaled to the same parameters to evaluate the enrichment or absence of peaks.
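For readers unfamiliar with the unit, the following minimal Python sketch shows how an RPKM value is conventionally computed from raw counts. The expression values used in this study were retrieved precomputed, so the function name `rpkm` and the example numbers are purely illustrative assumptions.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads Per Kilobase of transcript per Million mapped reads."""
    if transcript_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("transcript length and library size must be positive")
    # Dividing by length in kilobases and library size in millions
    # is equivalent to multiplying the raw count by 1e9.
    return read_count * 1e9 / (transcript_length_bp * total_mapped_reads)


# Hypothetical gene: 500 reads on a 2 kb transcript in a 10 million read library.
example = rpkm(read_count=500, transcript_length_bp=2000, total_mapped_reads=10_000_000)
print(example)  # 25.0
```

Because both transcript length and library size are divided out, values for one gene can be compared across stages that were sequenced to different depths.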
2.2 Analysis of matrix associated regions (MARs)
Raw data from our previously published embryo 0–16hrs MAR sequencing (NCBI SRA Accession number SRX443533) were used. Previously unmapped dm3 MARs were mapped to the heterochromatic regions of the dm6 build (which has better annotation of pericentromeric arm heterochromatin) [31] using the FlyBase Coordinate Converter [32], to allow for inclusion of most MARs falling in heterochromatin. Supplementary Table 2 shows the coordinates of the pericentromeric regions used in the study. The MARs (Supplementary Table 3: Het MARs) were compared to other genomic features such as introns, exons and intergenic regions.
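The comparison of MARs with genomic features amounts to an interval-overlap test. The sketch below is one possible way to label a MAR in Python; the function names, the input layout and the precedence rule (exon over intron over intergenic) are assumptions made for illustration and are not taken from the original analysis.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if half-open intervals [a_start, a_end) and [b_start, b_end) overlap."""
    return a_start < b_end and b_start < a_end


def annotate_mar(mar, genes):
    """Label a MAR as 'exon', 'intron' or 'intergenic'.

    mar   : (start, end) on one chromosome
    genes : list of dicts with 'start', 'end' and 'exons' (list of (start, end)),
            all on the same chromosome as the MAR
    Assumed precedence: exon > intron > intergenic.
    """
    start, end = mar
    inside_gene = False
    for gene in genes:
        if not overlaps(start, end, gene["start"], gene["end"]):
            continue
        inside_gene = True
        if any(overlaps(start, end, ex_s, ex_e) for ex_s, ex_e in gene["exons"]):
            return "exon"
    # Overlaps a gene but none of its exons: treat as intronic.
    return "intron" if inside_gene else "intergenic"


# Hypothetical gene with two exons.
genes = [{"start": 1000, "end": 5000, "exons": [(1000, 1500), (4000, 5000)]}]
print(annotate_mar((2000, 2300), genes))  # intron
```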
2.3 Motif analysis of intergenic sequences
The intergenic sequences from the Chr2L, 2R, 3L and 3R heterochromatic regions were extracted using an in-house Perl script and submitted for motif analysis using MEME [33]. The motif search was run both with default parameters (which resulted in a 50 bp motif) and as a modified search for a small motif of 15 bp. The motif thus obtained was contributed by 53 of the 105 sequences submitted; 10 sequences had multiple occurrences of the motif. We queried the motifs against a database of known motifs using TOMTOM.
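The extraction step itself was done with an in-house Perl script that is not reproduced here. As a hedged illustration of the underlying idea, the Python sketch below derives intergenic intervals as the gaps left between gene annotations inside a region; the function name and the half-open coordinate convention are assumptions.

```python
def intergenic_intervals(region_start, region_end, genes):
    """Return the gaps between genes inside [region_start, region_end).

    genes: list of (start, end) half-open intervals; they may overlap.
    """
    gaps = []
    cursor = region_start  # right edge of the gene territory seen so far
    for g_start, g_end in sorted(genes):
        # Clip each gene to the region of interest.
        g_start = max(g_start, region_start)
        g_end = min(g_end, region_end)
        if g_start >= g_end:
            continue  # gene lies entirely outside the region
        if g_start > cursor:
            gaps.append((cursor, g_start))
        cursor = max(cursor, g_end)
    if cursor < region_end:
        gaps.append((cursor, region_end))
    return gaps


# Hypothetical region with two overlapping genes and one isolated gene.
print(intergenic_intervals(0, 1000, [(100, 200), (150, 300), (500, 600)]))
# [(0, 100), (300, 500), (600, 1000)]
```

The resulting coordinates could then be used to pull sequences from a genome FASTA file before submitting them to a motif finder.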
2.4 Creation of Drosophila Het C-State
We have previously developed a visualization tool called C-State [27] to analyze multiple chromatin and expression datasets at once. This was customized so that it automatically loads the histone mark and expression data of Drosophila heterochromatic genes across the developmental stages (Drosophila Het C-State). C-State fetches the relevant genomic information based on the input identifiers from its genome repository folder. Once the genomic coordinates of all genes were retrieved, histone mark peaks mapping to those coordinates were extracted from the genome-wide data of each developmental stage and transformed into coordinates relative to the TSS of the gene. In the C-State plots, every gene was corrected for genomic orientation, so that genes on the positive and negative strands of the genome can be compared directly. Similarly, the expression level of each gene was retrieved from the genome-wide expression data for each developmental stage.
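The TSS-relative transformation with strand correction described above can be sketched as follows. This is not the C-State source code; the function name and the convention that negative values lie upstream of the TSS are assumptions for illustration.

```python
def to_tss_relative(peak_start, peak_end, gene_start, gene_end, strand):
    """Convert a peak to coordinates relative to the gene's TSS.

    Negative values are upstream of the TSS and positive values downstream,
    regardless of the strand the gene is annotated on.
    """
    if strand == "+":
        tss = gene_start
        return peak_start - tss, peak_end - tss
    if strand == "-":
        tss = gene_end
        # Mirror the axis so the result still runs from upstream to downstream.
        return tss - peak_end, tss - peak_start
    raise ValueError("strand must be '+' or '-'")


# A peak straddling the TSS gives the same result on either strand.
print(to_tss_relative(900, 1100, 1000, 5000, "+"))   # (-100, 100)
print(to_tss_relative(4900, 5100, 1000, 5000, "-"))  # (-100, 100)
```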
Average feature profiles are calculated using a constant-binning approach. First, each gene is divided into an equal number of bins irrespective of its size and orientation, such that the TSS of all genes is represented by the same bin. Starting from the most upstream bin, the number of peaks mapping to each bin across all genes is calculated. Multiple peaks mapping to the same bin are counted only once, to avoid biases created by large bin sizes. Once the number of peaks for each bin is derived, the counts are plotted as line charts using a common Y-scale, so that the histone peak trends can be compared directly. The “Files” interface was replaced with buttons that enable toggling between the display of data across all developmental stages. However, the C-State json file is provided as Drosophila HetCState file.json to make it easy for programmers to include new heterochromatic genes and ChIP datasets. All the remaining options of C-State, such as the filters, tables, plots and customization options, are functional and can be used as needed. The gene-epicard can be viewed at the Table summary tab.
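A minimal Python sketch of the constant-binning idea is given below. It is a simplified reading of the description above, not the C-State implementation: it bins only the gene body (no flanking region), expects integer TSS-relative peak coordinates, and interprets "counted only once" as once per gene per bin. The function name and input layout are assumptions.

```python
def binned_peak_profile(genes, n_bins=10):
    """Count, for each bin, how many genes have at least one peak in it.

    genes: list of (gene_length, peaks), where peaks is a list of
           (start, end) integer intervals relative to the TSS (TSS = 0).
    """
    counts = [0] * n_bins
    for gene_length, peaks in genes:
        hit_bins = set()  # a set ensures each bin is counted once per gene
        for start, end in peaks:
            # Clip the peak to the gene body.
            start = max(start, 0)
            end = min(end, gene_length)
            if start >= end:
                continue
            # Integer arithmetic maps a position to its bin for any gene size.
            first = start * n_bins // gene_length
            last = (end - 1) * n_bins // gene_length
            hit_bins.update(range(first, last + 1))
        for b in hit_bins:
            counts[b] += 1
    return counts


# Two hypothetical genes of different length share the same 10-bin axis.
genes = [
    (1000, [(0, 50), (60, 90), (450, 550)]),
    (2000, [(0, 100)]),
]
print(binned_peak_profile(genes))  # [2, 0, 0, 0, 1, 1, 0, 0, 0, 0]
```

Scaling every gene to the same number of bins is what allows short and long genes to contribute to one average profile.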
3 Results
3.1 Epigenetic profile of heterochromatic genes across developmental stages is dynamic
Pericentromeric heterochromatin is reported to contain several hundred genes [20]. We chose 60 representative heterochromatic genes that fall within the autosomal chr2 and chr3 pericentromeres as per the Drosophila Heterochromatin Genome Project and for which gene structure, function and expression data were available. Some of these genes have been studied extensively in the field of heterochromatic gene regulation, e.g. light, rolled, RpL(s), Nipped etc. To investigate whether these genes are subject to dynamic epigenomic changes, we examined the profiles of six histone modifications across 12 well-established developmental stages (E0-4, E4-8, E8-12, E12-16, E16-20, E20-24, L1, L2, L3, pupae, adult female and adult male) [34] using published datasets from modENCODE. The histone modifications included two “inactive” marks, H3K9me3 and H3K27me3, and four “active” marks, H3K4me1, H3K4me3, H3K9ac and H3K27ac, with corresponding developmental stage-specific ChIP-seq/ChIP-chip (Table 1) and RNA-seq datasets (refer to Drosophila Het C-State for the trend of histone marks present on individual heterochromatic genes across developmental stages). Chr4 and X heterochromatic genes were excluded from the analysis as they were reported to have other specific marks such as H4K16ac [35], for which datasets across all developmental stages were unavailable.
We analyzed the enrichment of the six histone marks on the heterochromatic genes at their different levels of expression. Fig 1 shows the average trend of their occurrence across the 12 developmental stages. The heterochromatin marker H3K9me3 is present throughout the gene body, with lower enrichment in the larval (L3) and adult (male and female) stages. H3K4me3 was consistently present at the TSS of most genes except in L1. H3K4me1 is present mostly in 12-16 hour embryos, and the rest of the stages show minimal enrichment at the TSS. H3K9 and H3K27 acetylation peak at the TSS, with gradual tapering on the gene body from the larval stage onwards, consistent with the fact that most genes have low expression in larval and adult stages (e.g. CG17691, CG17683, uex, vtd, CG40006) [Supplementary Table 1: Stagewise expression data]. We found that H3K27me3 was absent on most of the genes. Constitutively expressed heterochromatic genes such as RpL5 have less H3K9me3, restricted mostly to the introns. H3K27me3 is present in the lowest/no expression stages of certain genes. The active mark H3K4me3 is present at the exons or throughout the gene body, while H3K9/27ac is also present at the gene body and the UTRs. For heterochromatic genes with a tissue- or stage-specific expression pattern (e.g. Nipped-A), active marks are present at the coding regions and repressive marks at the introns during the highest expression stage, while the entire gene is marked with H3K9me3 during lower/no expression stages. This trend in the occurrence of different histone marks correlates with the expression pattern of these genes across development and was used to classify the genes in the following section.
3.2 Classification of heterochromatic genes based on the dynamics of repressive histone marks
We wanted to understand the dynamics of histone modifications and correlate them with expression patterns that might not be evident from the cumulative trend across all stages. Thus, for each gene we divided the expression datasets into two stages: highest and lowest/no expression. Fig 2A shows the average trend in the distribution of the histone marks on the heterochromatic genes at their highest and lowest stages of expression. The typical feature that sets heterochromatic genes apart from euchromatic genes is the presence of a combination of active and inactive histone modifications [Supplementary Figure 1]. We therefore grouped the genes based on the presence or absence of the inactive H3K9me3 mark in the stages of highest and lowest expression, which yielded three classes, as shown in Table 2. These results show that genes of each group depend on the inactive histone marks for their expression to different extents (Fig 2B). Group I-A genes have both inactive and active histone marks in stages of highest expression and only inactive marks in the lowest stages, while Group I-B genes have both kinds of histone marks in both highest and lowest expression stages. Genes in both these sub-classes are mostly (92%) constitutively expressed (Table 2) and encode proteins involved in metabolic and developmental pathways. Notable examples are ribosomal proteins (RpL5/15/38), kinases (Stlk) and proteins involved in signaling (Gprk1), cell division (Nipped-B, uex) and transcriptional regulation (Atf6, CG10395, Maf1). Group I also constitutes the largest class, with 58% of the genes included in our study. Group II genes, comprising 10% of the genes studied, showed enrichment of only active marks in the highest and inactive marks in the lowest stages of expression. Group III genes had only inactive histone marks in both stages. Interestingly, almost 50% of the genes in this class had tissue-specific expression, with roles in metabolic processes and transferase activity implicated in wing and muscle development.
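The grouping rules described in this section can be written down as a small decision function. The Python sketch below encodes the rules as they are stated in the prose of this section (Table 2 is not reproduced here); the function name, the boolean inputs and the "unclassified" fallback are assumptions, as is reading "inactive marks in the lowest stages" for Group II as inactive marks only.

```python
def classify_gene(high_active, high_inactive, low_active, low_inactive):
    """Assign a gene to Group I-A, I-B, II or III.

    Each argument is True if that kind of mark (active, or inactive H3K9me3)
    is present in the highest or lowest expression stage of the gene.
    """
    high = (high_active, high_inactive)
    low = (low_active, low_inactive)
    if high == (True, True) and low == (False, True):
        return "I-A"  # both kinds at highest, only inactive at lowest
    if high == (True, True) and low == (True, True):
        return "I-B"  # both kinds in both stages
    if high == (True, False) and low == (False, True):
        return "II"   # only active at highest, inactive at lowest (assumed: only inactive)
    if high == (False, True) and low == (False, True):
        return "III"  # only inactive in both stages
    return "unclassified"  # combination not covered by the described groups


# Hypothetical gene carrying both kinds of marks in both stages.
print(classify_gene(True, True, True, True))  # I-B
```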
3.3 Heterochromatic genes have preferential enrichment of Matrix Associated Regions in the pericentromeric intergenic regions
Regulatory DNA elements, along with histone modifications, modulate the transcriptional status of genes. We hypothesized that MARs might regulate genome organization in the pericentromere by looping out active genes from the neighboring heterochromatin into separate chromatin domains. We have previously published the genome-wide sequencing of MARs in the Drosophila embryo (0–16hrs) and their features [36]. However, MARs mapping to the pericentromeric regions were not included in that study. Hence, we set out to determine the characteristics of those Het-MARs. We considered all the MARs that fall within the pericentromeric heterochromatin, including those that map to the centromere, as per the latest genome build (dm6) annotation [31]. In total, 350 MARs were mapped to pericentromeric regions of the autosomes and the X chromosome. We compared the properties of these MARs with those reported previously in our study for the euchromatic regions, as shown in Table 3. We found that the overall average size of Het MARs is 354 bp, almost half the size of MARs in the euchromatic region. The distribution of the MARs with respect to the various genomic features (Fig 3A) shows that most pericentromeric heterochromatin MARs are intergenic. We also looked at the inter-Het MAR distance, the average being 45 Kb and the largest being close to a few Mb. The MARs/genes value indicates that MARs could loop out a few similarly expressed genes into a separate domain. Fig 3B shows the genomic map of the distribution of MARs on the chromosome arms along with the histone marks, indicating that in certain cases MARs are present at the borders of distinct chromatin domains.
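Summary statistics of this kind (average MAR size and inter-MAR distance) can be derived from a list of MAR coordinates. The Python sketch below is illustrative only: the function name is hypothetical, and the inter-MAR distance is assumed to be the gap between the end of one MAR and the start of the next MAR on the same chromosome, which may differ from the definition used in the original analysis.

```python
from collections import defaultdict
from statistics import mean


def mar_summary(mars):
    """Return (mean MAR size, mean inter-MAR gap) for a list of MARs.

    mars: non-empty list of (chrom, start, end) tuples.
    The mean gap is None when no chromosome carries two or more MARs.
    """
    sizes = [end - start for _, start, end in mars]
    by_chrom = defaultdict(list)
    for chrom, start, end in mars:
        by_chrom[chrom].append((start, end))
    gaps = []
    for intervals in by_chrom.values():
        intervals.sort()
        # Distance from the end of each MAR to the start of the next one.
        for (_, prev_end), (next_start, _) in zip(intervals, intervals[1:]):
            gaps.append(next_start - prev_end)
    return mean(sizes), (mean(gaps) if gaps else None)


# Hypothetical coordinates, not taken from Supplementary Table 3.
mars = [("2L", 100, 400), ("2L", 1000, 1500), ("3R", 50, 250)]
mean_size, mean_gap = mar_summary(mars)
print(round(mean_size, 1), mean_gap)  # 333.3 600
```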
3.4 Heterochromatic intergenic sequences have DNA motifs of transcription factors
To investigate whether any specific motifs are enriched in the pericentromeric heterochromatin, intergenic sequences of the Chr 2 and 3 pericentromeric genes were extracted and used for motif analysis with MEME using the default parameters. We found a 15 bp motif to be highly enriched in the intergenic sequences, with a highly significant E-value of 1.9e-423 and an occurrence in 70% of the sequences studied (Fig 4). The motif matched known motifs of the Zn finger DNA-binding proteins CG3065, Sp1 and CG12029. Of these hits, particularly interesting is the motif for Sp1, a well-known transcription factor that modulates the epigenetic environment by interacting with several chromatin remodelers.
Discussion
It is counterintuitive that genes in a repressed chromatin environment are expressed by adapting to and utilizing heterochromatic factors rather than being averse to them. To understand this paradigm, we used Drosophila Het C-State to compare trends of histone modifications on heterochromatic genes across development. The tool also serves as a repository for information specifically related to Drosophila heterochromatic genes, to which new coordinate-based genomic datasets or heterochromatic genes can be added to serve the community of researchers working on these genes.
Previous studies had shown that in normal embryos H3K9me2 is enriched throughout the gene body, while in embryos with a chromosomal rearrangement that created a new eu-het junction the H3K9me2 distribution is altered [37]. More recently, the epigenetic landscape of heterochromatic genes was investigated in S2 cells, BG3 cells and embryos using ChIP-chip data for several histone marks. Collectively, these studies showed that active heterochromatic genes have unique combinations of marks: enrichment of active marks (H3K9/14ac and H3K4me2) with a dip in H3K9me3 at the TSS [37], and the presence of both active (H3K36me3) and inactive (H3K9me3) marks on the gene body, which is not seen in euchromatic genes [35]. These studies used tiling array-based ChIP data and focused on a single cell type or developmental stage. Therefore, how the histone marks change with changing expression patterns across development remained unexplored. Our results show that the distribution of H3K4me1/3, H3K9/27ac and H3K9me3 on heterochromatic genes differs from that on euchromatic genes, in that these active marks and the inactive mark H3K9me3 occur together [Supplementary Fig 1]. This shows that there could be different combinations of active marks (H3K4me3 and H3K9/27ac) along with H3K9me3, which was not reported earlier (Fig 1). We believe that the dip in H3K9me3 together with the presence of H3K9/27ac at the TSS is required to allow access to the transcriptional machinery, following which different combinations of active marks are deposited to mark the gene for expression. H3K4me1 promotes chromatin accessibility and is present (on average) at the TSS only in the highest expression stage (Fig 2A). As the distinctive feature of heterochromatic genes, H3K9me3 is present in both the high and low expression stages. To probe further how the heterochromatic mark H3K9me3 regulates both activation and repression of heterochromatic genes, we examined the genes group by group.
We grouped the genes based on the presence of the heterochromatic mark during active expression (Table 2). Group I includes most of the constitutive genes, which have active marks such as H3K9/27ac at the TSS and H3K4me3 on the gene body along with H3K9me3. In stages of lower expression, the active marks decrease but H3K9me3 persists. Inactive marks at the 5’ end of the gene are probably read differently from those on the gene body, which were reported to occur in combination with other active marks [35,38]. The second category (Group II) of genes had only active marks in both highest and lowest expression stages. These genes resemble euchromatic genes embedded in heterochromatin. We also found the well-known heterochromatic gene rolled in this class. These genes probably need other heterochromatic factors such as HP1 and Su(var)3-9 for expression even though H3K9me3 is not present on the gene body, and thus the pattern of histone marks on them resembles that on euchromatic genes. Greil and colleagues reported the binding of HP1a and Su(var)3-9 on rolled and light, but upon knockdown of these two proteins in Kc167 cells they did not see a change in the expression level of these two genes [39]. Hence, how HP1a and Su(var)3-9 control heterochromatic gene expression needs further experimental dissection. Group III genes showed enrichment of only inactive marks in stages of active expression. To explain why only the inactive mark H3K9me3 is present on transcriptionally active Group III genes, we consider two possibilities. First, in mammals H3K9me3 has also been reported to promote transcriptional elongation in concert with HP1c [40]. In fly heterochromatic genes, H3K36me3 (an elongation mark) has been shown to be present along with H3K9me3 on the gene body [35]. However, how it regulates expression is not understood. Second, the major heterochromatic factor HP1, which binds to H3K9me3, has been shown to be present on active pericentric genes [41]. HP1 is known to have a gene activation effect [42] through its differential post-translational modifications [43] and interaction partners [44], including MES-4, an H3K36 methyltransferase. Hence, the role of HP1a in the regulation of Group III gene expression needs to be explored.
In summary, these three classes of genes are regulated to different extents by the heterochromatic factors or have combinations of other active histone modifications. This results in a complex crosstalk of multiple pathways in the context of heterochromatic gene expression. It also highlights the need for meta-analysis of available genomic datasets to bring out the special features of heterochromatic genes and to test hypotheses experimentally.
Among the several models proposed for the regulation of heterochromatic genes, the ‘Integration model’ postulates that long-range interactions mediated by heterochromatic proteins and sequences can explain the context-dependent regulation of heterochromatic genes [45]. It is known that co-expression clusters are encompassed within topologically associated domains (TADs), which can be 40–70kb in Drosophila [46]. Although, based on the epigenetic landscape, centromeric heterochromatic regions are believed to be folded into a compact inactive TAD, the detailed characterization of long-range interactions in pericentromere-associated domains, as reported for mice [47], is lacking for Drosophila. MARs tether the genome to the nuclear matrix, thus determining the folding of the genome into chromatin domains. MARs have been shown to promote long-range enhancer-promoter interactions even in a repressive environment [48]. However, their role in the context of heterochromatic gene regulation is unexplored. Against this backdrop, we looked into the distribution of MARs within the pericentromeric heterochromatin. Our analysis shows preferential enrichment of MARs in intergenic regions, with the inter-Het MAR distance in the range of the average Drosophila TAD size. MARs have been shown to modulate chromatin accessibility [49] and in many cases associate with actively transcribing genes [50]. More recently, MARs were associated with active transcription [50,51], and inter-MAR looping was reported to contribute to transcriptionally active DNA looping [52]. Thus, MARs could play a role in mediating long-range interactions that keep heterochromatic genes active. We propose that MARs could define the borders in pericentromeric heterochromatin that confine long-range interactions between similarly expressed heterochromatic genes within separate chromatin domains. However, this proposal needs further experimental validation.
Transcription factor binding sites (TFBS) and transcription factor dosage have also been shown to remodel the chromatin landscape and impact the transcriptional status of genes [53]. The presence of a motif recognized by a transcription factor like Sp1 points to the significance of non-coding intergenic sequences in shaping the chromatin environment. Sp1, through its DNA-binding domain, interacts with the acetyltransferase domain of the co-activator p300 to increase gene expression [27]. In addition, Sp1 binding is involved in de-acetylation and causes repression of several genes [28]. The presence of Sp1 binding motifs in the intergenic regions may therefore modulate the chromatin environment. However, experimental validation of this hypothesis is needed to ascertain how the combination of non-coding sequences and trans factors regulates the epigenetic landscape of centromeric heterochromatin.
In conclusion, this is the first report of the dynamic epigenetic landscape of D. melanogaster heterochromatic genes during development. We present Drosophila Het C-State as a platform for comprehensive bioinformatic analyses using publicly available genomic datasets, and we bring out the peculiarities of heterochromatic genes that might be involved in their regulation. Heterochromatic genes are also known in mammals, and many of them are expressed during early development [4,54] and in disease conditions such as cancer [3,55]. We believe that, despite the inherent challenges of studying heterochromatic sequences due to their high repeat content and compaction, more experimental evidence is required to explain these observations. Such studies will help us understand the dynamics of heterochromatic gene regulation and shed more light on the dark matter of the genome.
Conflicts of interests
The authors declare no conflicts of interest.
Supplementary Figure 1: Distinction between the euchromatic and heterochromatic epigenomic landscapes. In the euchromatic region, only active marks (H3K27ac, H3K9ac and H3K4me1/3) are present; in some places even the facultative repressive mark H3K27me3 is present. The dotted line demarcates the heterochromatic region on its right side, where H3K9me3 is present along with the active marks, mostly at the TSS, a unique feature of the epigenomic landscape of heterochromatic genes. Notably, the gene density is lower in heterochromatin than in euchromatin.
Acknowledgements
We thank Rashmi Upadhyay Pathak and A Srinivasan for their help with the MAR dataset. We thank Surabhi Srivastav for critically reading the manuscript. PS thanks the University Grants Commission (UGC), India for the doctoral fellowship. DTS and RKM acknowledge the financial support of the Council of Scientific and Industrial Research (CSIR), India.
References
- [1].
- [2].
- [3].
- [4].
- [5].
- [6].
- [7].
- [8].
- [9].
- [10].
- [11].
- [12].
- [13].
- [14].
- [15].
- [16].
- [17].
- [18].
- [19].
- [20].
- [21].
- [22].
- [23].
- [24].
- [25].
- [26].
- [27].
- [28].
- [29].
- [30].
- [31].
- [32].
- [33].
- [34].
- [35].
- [36].
- [37].
- [38].
- [39].
- [40].
- [41].
- [42].
- [43].
- [44].
- [45].
- [46].
- [47].
- [48].
- [49].
- [50].
- [51].
- [52].
- [53].
- [54].
- [55].