Abstract
Histone modifications play important roles in gene regulation, heredity, imprinting, and many human diseases including diabetes, obesity, and cancer. The histone code is complex and consists of more than 100 marks. Therefore, biologists need computational tools to characterize general signatures representing the distributions of tens of chromatin marks around thousands of regions. To this end, we developed a software tool called HebbPlot, which utilizes a Hebb neural network in learning a general chromatin signature from regions with a common function. Hebb networks can learn the associations between tens of marks and thousands of regions. This is the first application of Hebb networks in the epigenetics field. HebbPlot presents a signature as a digitized image, in which a bright pixel indicates the presence of a mark around a part of the genetic element, and a black pixel indicates the absence of the mark. A row of pixels represents one mark. Similar rows are clustered in the image. We validated HebbPlot on synthetic data and on 111 epigenomes provided by the Roadmap Epigenomics Project. HebbPlot was able to retrieve distinct chromatin signatures for promoters, enhancers, and genes active in each of the 111 cell types. Our analysis reveals that active promoters have a directional signature; marks such as H3K79(me1/me2), H3K4(me1,me2,me3), and H3K9ac stretch toward coding regions. The plots of inactive promoters show that H3K27me3 is consistently present around them. Further, the signatures of enhancers that are fully included in repetitive regions are almost identical to those located outside repeats, indicating that transposons have an enhancer-like function in the human genome. Furthermore, the chromatin signature of active elements consists of the presence of H3K79me1 and the absence of H3K9me3 and H3K27me3. In sum, HebbPlot is a general tool that can be applied to a wide array of studies, facilitating the deciphering of the histone code.
Author summary
Chromatin marks have gained much attention because of their important roles in gene regulation, cell differentiation, Lamarckian inheritance, and imprinting. A chromatin signature of a genetic element, such as a gene or an enhancer, consists of multiple marks and may differ from tissue to tissue. Currently, tens of histone modifications are known. Several marks of more than 100 human cell types have been determined. Many epigenomes of other normal and pathological cell types will be available soon.
Extracting a chromatin signature representing the distributions of tens of marks around thousands of regions is a challenging task. Hebb networks are a special type of artificial neural network known for their ability to learn associations. We developed a software tool called HebbPlot. The tool uses a Hebb network to learn how a mark is distributed around a set of regions that have the same function, e.g. promoters active in the same tissue. HebbPlot produces a pattern representing mark distributions around all of the regions. Mark patterns are clustered based on their similarity to one another. Then a digitized image representing the learned pattern is generated. HebbPlot will help biologists with characterizing and visualizing chromatin signatures in numerous studies.
Introduction
Understanding the effects of histone modifications will provide answers to important questions in biology and will help with finding cures to several diseases including cancer. Carey highlights several functions of epigenetic factors including cytosine methylation and histone modifications [1]. It was reported that methylation of CpG islands inhibits transcription [2], whereas the complex histone code has a wide range of regulatory functions [3,4]. Additionally, epigenetic marks may affect body weight and metabolism [5]. Interestingly, chromatin marks may explain how some acquired traits, such as obesity and exposure to some toxins, are passed from one generation to the next (Lamarckian inheritance) [6–9]. Further, epigenetics may explain how two identical twins have different disease susceptibilities [10]. Epigenetic factors play a role in imprinting, in which a chromosome, or a part of it, carries a maternal or a paternal mark(s) [11,12]. Defects in the imprinting process may lead to several disorders [13–18], and may increase the “birth defects” rate of assisted reproduction [19]. Furthermore, chromatin marks play a role in cell differentiation by selectively activating and deactivating certain genes [20,21]. Some chromatin marks take part in deactivating one of the X chromosomes [22]. It has been observed in multiple types of cancer that some tumor suppressor genes were deactivated by hypermethylating their promoters [23–25], removing activating chromatin marks [26,27], or adding repressive chromatin marks [28]. Anti-cancer drugs that target the epigenome [1] have been designed. Two compounds are used in these drugs. One compound inhibits DNA methylation [29,30], whereas the other compound inhibits histone deacetylation [31] (histone acetylation is an activating mark).
Pioneering computational and statistical methods for deciphering the histone code have been developed. Some tools are designed for profiling and visualizing the distribution of a chromatin mark(s) around multiple regions [32,33]. Additionally, a tool for clustering and visualizing genomic regions based on their chromatin marks has been developed [34]. Several systems are available for characterizing histone codes/states in an epigenome [35–43]. Further, an alphabet system for histone codes was proposed [44]. Other tools can recognize and classify the chromatin signature associated with a specific genetic element [?, 45–54]. Furthermore, methods that compare the chromatin signature of healthy and sick individuals have been proposed [55].
Scientists have identified about 100 histone marks [37]. Additionally, there is a near infinite number of future studies, in which scientists need to characterize the pattern of chromatin marks around a set of regions in the genome. Therefore, there is a definite need for an automated framework that enables scientists to (i) automatically characterize the chromatin signature of a set of sequences that have a common function, e.g. exons, promoters, or enhancers; and (ii) visualize the identified signature in a simple intuitive form. To meet this need, we designed and developed a software tool called HebbPlot. This tool allows average users, without extensive computational knowledge, to characterize and visualize the chromatin signature associated with a genetic element automatically.
HebbPlot includes the following four innovative approaches in an area that has become the frontier of medicine and biology:
HebbPlot can learn the chromatin signature of a set of regions automatically. Sequences that have the same function in a specific cell type, e.g. exons, promoters, or enhancers, are expected to have similar marks. The learned signature represents these marks around all of the regions. HebbPlot differs from the other tools in its ability to learn one signature representing the distributions of all available chromatin marks around thousands of regions.
This is the first application of Hebb neural networks in the epigenetics field. These networks are capable of learning associations; therefore, they are well suited for learning the associations among tens of marks and genetic elements.
The framework enables average users to train artificial neural networks automatically. Users are not burdened with the training process. Self-trained systems for analyzing protein structures and sequence data have been proposed [56–58]. HebbPlot is the analogous system for analyzing chromatin marks.
HebbPlot is the first system that integrates the tasks of learning and visualizing a chromatin signature. Once the signature is learned, the marks are clustered and displayed as a digitized image. This image shows one pattern representing thousands of regions. To illustrate, the distributions of the marks appear around one region; however, they are learned from all input regions.
We have applied our tool to learning and visualizing the chromatin signatures of several active and inactive genetic elements in the 111 consolidated epigenomes provided by the Roadmap Epigenomics Project. These case studies demonstrate the applicability of HebbPlot to many interesting problems in molecular biology, facilitating the deciphering of the histone code.
Materials and methods
Methods
In this section, we describe the computational principles of our software tool, HebbPlot. The core of the tool is an unsupervised neural network known as a Hebb network.
Region representation
To represent a group of histone marks overlapping a region, these marks are arranged according to their genomic locations on top of each other and the region. Then equally-spaced vertical lines are superimposed on the stack of the marks and the region. The numerical representation of this group of marks is a matrix. A row of the matrix represents a mark. A column of the matrix represents a vertical line. If the ith mark intersects the jth vertical line, the entry (i, j) of the matrix is 1; otherwise, it is −1. The first vertical line is at the beginning of the region; the last vertical line is at the end of the region. The rest of the lines are spread out evenly. Fig 1 shows the graphical and the numerical representations of a region and the overlapping marks. Finally, the two-dimensional matrix is converted to a one-dimensional vector called the epigenetic vector. The number of vertical lines is determined experimentally. We used 41 and 101 lines in our experiments. This number should be adjusted according to the average size of a region.
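The representation described above can be sketched in a few lines of Python. This is a minimal illustration, not the HebbPlot implementation: the function name `epigenetic_vector`, the use of closed intervals, and the row-by-row flattening order are assumptions made for the example.

```python
def epigenetic_vector(region, marks, n_lines=41):
    """Build the mark-by-line matrix and its flattened epigenetic vector.

    region  : (start, end) of the genomic region.
    marks   : one list of (start, end) intervals per chromatin mark.
    n_lines : number of evenly spaced vertical lines (41 or 101 in the text).
    Intervals are treated as closed; this is an assumption of the sketch.
    """
    start, end = region
    spacing = (end - start) / (n_lines - 1)
    # First line at the region start, last line at the region end.
    positions = [start + j * spacing for j in range(n_lines)]

    matrix = []
    for intervals in marks:
        # Entry is 1 if any interval of this mark crosses the line, else -1.
        row = [1 if any(s <= x <= e for s, e in intervals) else -1
               for x in positions]
        matrix.append(row)

    # Flatten row by row (assumed order) into a one-dimensional vector.
    vector = [value for row in matrix for value in row]
    return matrix, vector


matrix, vector = epigenetic_vector((0, 100), [[(20, 60)], [(90, 200)]], n_lines=5)
```

With five lines placed at 0, 25, 50, 75, and 100, the first mark is present on the second and third lines only, and the second mark on the last line only.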
Data preprocessing
Preprocessing input data is a standard procedure in machine learning. During this procedure, the noise in the input data is reduced. Each epigenetic vector is compared to two other vectors selected randomly from the same set. The value of an entry in the vector is kept if it is the same in the three vectors; otherwise, it is set to zero. For example, consider the vector [1 1 −1]. Suppose that the vectors [1 −1 −1] and [1 −1 −1] were selected randomly. The preprocessed vector would be [1 0 −1] because the first and the third elements are the same in the three vectors, but the second element is not.
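A minimal Python sketch of this noise-reduction step follows. The text does not say whether the two randomly selected vectors must differ from each other or from the vector being cleaned; the sketch assumes two distinct partners, neither of which is the vector itself. The names `consensus` and `preprocess` are illustrative.

```python
import random


def consensus(vector, other_a, other_b):
    """Keep an entry only if it is identical in all three vectors, else 0."""
    return [x if x == y == z else 0
            for x, y, z in zip(vector, other_a, other_b)]


def preprocess(vectors, seed=0):
    """Compare each vector to two randomly chosen partners from the same set.

    Assumption: the partners are distinct and exclude the vector itself,
    so at least three vectors are required.
    """
    rng = random.Random(seed)
    n = len(vectors)
    cleaned = []
    for i, vector in enumerate(vectors):
        # Sample two distinct indices from the n - 1 positions other than i.
        j, k = rng.sample(range(n - 1), 2)
        j += j >= i  # shift indices at or past i to skip the vector itself
        k += k >= i
        cleaned.append(consensus(vector, vectors[j], vectors[k]))
    return cleaned


# The worked example from the text.
print(consensus([1, 1, -1], [1, -1, -1], [1, -1, -1]))  # [1, 0, -1]
```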
Hebb recall network
Associative learning, also known as Hebbian learning, is inspired by biology. “When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A’s efficiency, as one of the cells firing B, is increased” [59]. In behavioral psychology, Ivan Pavlov conducted a famous experiment, which demonstrated learning by association. In this experiment, a dog was trained to associate the sound of a bell with food; this dog salivated when it heard the bell whether or not food was present. The presence of food is referred to as the unconditioned stimulus, p0, and the sound of the bell is referred to as the conditioned stimulus, p. Associating these two stimuli together is the goal. After training, the response to either the conditioned stimulus or the unconditioned one is the same as the response to both stimuli combined [60].
In the context of epigenetics, a Hebb network can be viewed as the dog in Pavlov’s experiment. The unconditioned stimulus, p0, is a one-dimensional vector representing the distributions of histone marks over a sequence, e.g. one tissue-specific enhancer. This vector is referred to as the epigenetic vector; it is obtained as outlined above. The conditioned stimulus is always the ones vector, which has ones in all entries. We would like to train the network to give a response, analogous to the salivation of the dog, when it is given the ones vector, whether or not the epigenetic vector is provided. The response of the network is a prototype/signature representing the distributions of histone marks over the entire set of genomic locations, e.g. all enhancers of a specific tissue.
Eq 1 and Eq 2 define how the response of a Hebb network is calculated. The training of the network is given by Eq 3 [60].
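The three equations are not displayed in this text. The forms below are a reconstruction under stated assumptions: they follow the verbal descriptions in the surrounding paragraphs (a saturating transformation bounded by −1 and 1, a response that adds the unconditioned stimulus to the weighted conditioned stimulus, and a learning rule that moves the prototype toward the response at a rate α). The superscript and subscript placement is likewise assumed.

```latex
\begin{align}
f(x) &=
  \begin{cases}
    1  & \text{if } x > 1 \\
    x  & \text{if } -1 \le x \le 1 \\
    -1 & \text{if } x < -1
  \end{cases} \tag{1} \\[4pt]
a &= f\left(p^{0} + w \odot p\right) \tag{2} \\[4pt]
w_{i} &= w_{i-1} + \alpha \left(a_{i} - w_{i-1}\right) \odot p_{i} \tag{3}
\end{align}
```

Under these forms, an all-ones p reduces Eq 2 to f(p0 + w), and an all-zeros p0 gives a response equal to the prototype w, since every entry of w already lies between −1 and 1.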
Eq 1 defines a transformation function. This function ensures that the response of the network is on the same scale as the unconditioned stimulus, i.e. each element of the response is between −1 and 1. If x is a vector, the function is applied component-wise.
Eq 2 describes how a Hebb network responds to the two stimuli. The response of the network is transformed using Eq 1. In Eq 2, p0 is the unconditioned stimulus, e.g. presence of food or an epigenetic vector; w is the weights vector, which is the prototype/signature learned so far; and p is the conditioned stimulus, e.g. sound of a bell or the ones vector. The operator ⊙ represents the component-wise multiplication of two vectors. In the current adaptation, if the network is presented with an epigenetic vector and the ones vector, the response is the transformed sum of the prototype learned so far and the epigenetic vector. In the absence of the epigenetic vector, i.e. all-zeros p0, the response of the network is the prototype, demonstrating the ability of the network to learn associations.
Eq 3 defines Hebb’s unsupervised learning rule. Here, wi and wi−1 are the prototype vectors learned in iterations i and i − 1. The ith pair of unconditioned and conditioned stimuli is p0i and pi. Learning occurs, i.e. the prototype changes, only when the ith conditioned stimulus, pi, has non-zero components. This is the case in our adaptation because pi is always the ones vector. Due to a small α, which represents both the learning rate and the decay rate, the prototype vector changes slightly in each iteration when learning occurs; it moves closer to the response of the network to the ith pair of stimuli.
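The following Python sketch illustrates the response and the learning rule as they are described in words above. Because the equations themselves are not shown in this text, the exact forms used here are assumptions: a saturating linear transformation clipped to [−1, 1], a response f(p0 + w ⊙ p), and an update w ← w + α (a − w) ⊙ p. The function names, the zero initialization of the prototype, and the value α = 0.1 are illustrative choices, not HebbPlot's settings.

```python
def saturate(x):
    """Assumed form of Eq 1: clip every component to the range [-1, 1]."""
    return [max(-1.0, min(1.0, v)) for v in x]


def respond(p0, w, p):
    """Assumed form of Eq 2: transformed sum of p0 and the weighted p."""
    return saturate([u + wi * pi for u, wi, pi in zip(p0, w, p)])


def train(vectors, alpha=0.1):
    """Assumed form of Eq 3, with the conditioned stimulus fixed to all ones."""
    n = len(vectors[0])
    w = [0.0] * n          # prototype learned so far (assumed to start at 0)
    ones = [1.0] * n       # conditioned stimulus
    for p0 in vectors:
        a = respond(p0, w, ones)
        # Move the prototype a small step toward the response.
        w = [wi + alpha * (ai - wi) * pi for wi, ai, pi in zip(w, a, ones)]
    return w


# A mark that is consistently present/absent converges toward 1/-1.
prototype = train([[1.0, -1.0]] * 100)
# Recall: with an all-zeros p0 the response is the prototype itself.
recalled = respond([0.0, 0.0], prototype, [1.0, 1.0])
```

In this sketch, entries that alternate between 1 and −1 across the input vectors keep the corresponding prototype entry near zero, which is consistent with the gray shades that inconsistent marks receive in a Hebb plot.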
Comparing two signatures
Two signatures can be compared numerically. The dot product of two vectors indicates how close they are to each other in space. When these vectors are normalized, i.e. each element is divided by the vector norm, the dot product is between −1 and 1. The dotsim function (Eq 4) normalizes the vectors and calculates their dot product. Here, x and y are vectors; ‖x‖ and ‖y‖ are the norms of these vectors; the · symbol is the dot product operator.
It is easy to interpret the meaning of the dot product of two normalized vectors. If the two vectors are very similar to each other, the value of the dotsim function approaches 1. If the values at the same index of the two vectors are opposite of each other, i.e. 1 and −1, the value of dotsim approaches −1. The dotsim function can be applied to the whole epigenetic vector or to the part representing a specific chromatin mark. When comparing the chromatin signatures of two sets of regions, a mark with a dotsim value approaching 1 is common in the two signatures. A mark with a dotsim value approaching −1 has opposite distributions, distinguishing the signatures. Marks with dotsim values approaching zero do not have consistent distribution(s) in one or both sets; these marks should not be considered while comparing the two signatures.
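As a concrete illustration, dotsim as described (normalize both vectors, then take their dot product) can be written as follows. Returning 0 for an all-zeros vector is an assumption of this sketch, since the text does not cover that case; the helper `mark_dotsim` and its row-by-row layout of the epigenetic vector are also assumed.

```python
import math


def dotsim(x, y):
    """Dot product of the two vectors after dividing each by its norm."""
    norm_x = math.sqrt(sum(v * v for v in x))
    norm_y = math.sqrt(sum(v * v for v in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # assumption: no consistent distribution to compare
    return sum(a * b for a, b in zip(x, y)) / (norm_x * norm_y)


def mark_dotsim(signature_a, signature_b, mark_index, n_lines):
    """Compare only the part of two signatures that belongs to one mark.

    Assumes each signature stores marks row by row, n_lines entries each.
    """
    lo, hi = mark_index * n_lines, (mark_index + 1) * n_lines
    return dotsim(signature_a[lo:hi], signature_b[lo:hi])


print(dotsim([1, -1], [-1, 1]))  # close to -1: opposite distributions
print(dotsim([1, 0], [0, 1]))    # 0.0: no shared pattern
```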
Visualizing a chromatin signature
Row vectors representing different marks are clustered according to their similarity to each other. We used hierarchical clustering in grouping marks with similar distributions. Hierarchical clustering is an iterative bottom-up approach, in which the closest two items/groups are merged at each iteration. The algorithm requires a pair-wise distance function and a group-wise distance function. For the pair-wise distance function, we utilized the city block function to determine the distance between two vectors representing marks. For the group-wise distance function, we applied the weighted pair group method with arithmetic mean [61]. To determine the group-wise distance between a cluster A and another cluster consisting of two sub-clusters B and C, add the distance between A and B to the distance between A and C; then divide the sum by 2. We utilized the implementation of hierarchical clustering provided in the Statistics and Machine Learning Toolbox of MATLAB (R2017a) by MathWorks.
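To make the two distance functions concrete, here is a small pure-Python sketch of bottom-up clustering with city block distances and the weighted pair group update. It is not the MATLAB implementation used in the study; the function names are illustrative, and the row order it returns (concatenating the leaves of merged clusters) is one simple convention that may differ from the leaf order of a dendrogram routine.

```python
def cityblock(u, v):
    """Pair-wise distance: sum of absolute component differences."""
    return sum(abs(a - b) for a, b in zip(u, v))


def wpgma_order(rows):
    """Cluster rows bottom-up and return row indices grouped by similarity."""
    n = len(rows)
    clusters = {i: [i] for i in range(n)}  # cluster id -> member rows
    dist = {(i, j): cityblock(rows[i], rows[j])
            for i in range(n) for j in range(i + 1, n)}

    def d(a, b):
        return dist[(a, b) if a < b else (b, a)]

    next_id = n
    while len(clusters) > 1:
        ids = sorted(clusters)
        # Find the closest pair of active clusters.
        a, b = min(((p, q) for k, p in enumerate(ids) for q in ids[k + 1:]),
                   key=lambda pair: d(*pair))
        # Weighted pair group update: distance from any other cluster c to
        # the merged cluster is the plain average of d(c, a) and d(c, b).
        for c in ids:
            if c != a and c != b:
                dist[(c, next_id)] = (d(c, a) + d(c, b)) / 2
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        next_id += 1
    return next(iter(clusters.values()))


marks = [[1, 1, 1, 1], [-1, -1, -1, -1], [1, 1, 1, -1], [-1, -1, -1, 1]]
print(wpgma_order(marks))  # [0, 2, 1, 3]: similar rows end up adjacent
```

For larger inputs, SciPy's `scipy.cluster.hierarchy.linkage` with `method='weighted'` and `metric='cityblock'` computes the same kind of linkage.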
A digitized image represents the chromatin signature of a genetic element. A one-unit-by-one-unit square in the image represents an entry in the matrix representing the signature. A row of these squares represents one mark. The color of a square is a shade of gray if the entry value is less than 1 and greater than −1; the closer the value to 1 (−1), the closer its color to white (black).
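The mapping from signature values to shades of gray can be illustrated with a short sketch. The linear mapping to 8-bit gray levels and the plain-text PGM output are assumptions chosen to keep the example self-contained; the text does not specify how HebbPlot renders its images.

```python
def gray_level(value):
    """Map a value in [-1, 1] linearly to 0 (black) .. 255 (white)."""
    v = max(-1.0, min(1.0, value))
    return int((v + 1.0) / 2.0 * 255 + 0.5)


def to_pgm(matrix):
    """Render a signature matrix (one row per mark) as plain-text PGM."""
    height, width = len(matrix), len(matrix[0])
    lines = ["P2", f"{width} {height}", "255"]
    for row in matrix:
        lines.append(" ".join(str(gray_level(v)) for v in row))
    return "\n".join(lines) + "\n"


print(to_pgm([[1, -1], [0, 0]]))
```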
Up to this point, we illustrated the computational principles of our software tool, HebbPlot. Next, we provide the details of the data used in validating the tool.
Data
We used HebbPlot in extracting and visualizing chromatin signatures characterizing multiple genetic elements of the 111 consolidated epigenomes of the Roadmap Epigenomics Project [62]. Specifically, we applied HebbPlot to:
Active promoters.
Active promoters on the positive strand.
Active promoters on the negative strand.
Inactive promoters.
Active enhancers.
Active repetitive enhancers.
Active non-repetitive enhancers.
Inactive enhancers.
Coding regions of active genes.
Coding regions of inactive genes.
Random genomic locations.
We obtained the genomic locations of the putative promoters specific to each of the 111 consolidated epigenomes from the Roadmap Epigenomics Project (http://egg2.wustl.edu/roadmap/data/byDataType/dnase/BED_files_prom/). These promoters were predicted using DNase I hypersensitive sites and chromatin states characterizing active promoters. To obtain the inactive promoters, we performed the following two steps: (i) all tissue-specific promoters are collected and merged if overlapping and (ii) all promoters are compared to the tissue-specific promoters; for each tissue, promoters that do not overlap with the tissue-specific promoters are considered inactive in this tissue. To compare the chromatin signatures of promoters on the positive and the negative strands, we separated the promoters according to the strand. If a putative promoter overlaps a transcription start site on the positive strand only, it is considered positive and vice versa. Each group was sorted and overlapping regions, if any, were merged.
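The two-step procedure for deriving inactive promoters can be sketched as follows. The sketch assumes closed (start, end) intervals on a single chromosome and uses hypothetical tissue names and coordinates; a real pipeline would also key regions by chromosome.

```python
def merge_intervals(intervals):
    """Sort intervals and merge those that overlap."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(m) for m in merged]


def overlaps(a, b):
    return a[0] <= b[1] and b[0] <= a[1]


def inactive_regions(all_regions, tissue_regions):
    """Regions that do not overlap any region specific to the given tissue."""
    return [r for r in all_regions
            if not any(overlaps(r, t) for t in tissue_regions)]


# Hypothetical tissue-specific promoters on one chromosome.
per_tissue = {"tissueA": [(0, 10), (50, 60)], "tissueB": [(5, 20), (100, 110)]}

# Step (i): collect all tissue-specific promoters and merge overlaps.
all_promoters = merge_intervals(
    [r for regions in per_tissue.values() for r in regions])

# Step (ii): promoters with no overlap in a tissue are inactive there.
inactive_in_a = inactive_regions(all_promoters, per_tissue["tissueA"])
```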
The putative enhancers were obtained from the Roadmap Epigenomics Project (http://egg2.wustl.edu/roadmap/data/byDataType/dnase/BED_files_enh/). The inactive enhancers were obtained using the same procedure applied in obtaining the inactive promoters. Later in this paper, we compare the chromatin signature of putative enhancers overlapping with repeats to that of the non-overlapping ones. The hg19 human assembly repeats (http://www.repeatmasker.org/species/hg.html), including transposons and simple tandem repeats, were used for determining repetitive enhancers. In order for an enhancer to be considered repetitive, it must be entirely included in a repetitive region. In another experiment, we considered an enhancer to be repetitive if at least half of its sequence overlaps a repetitive region.
The coding regions were obtained from the University of California Santa Cruz Genome Browser (http://genome.ucsc.edu). The Ensembl genes for the hg19 human genome assembly were used in this study. Active genes in a tissue are defined as those whose transcription start sites overlap with the tissue-specific putative promoters. Otherwise, they are considered inactive. After that, coding regions of the active (or the inactive) genes in a tissue are collected and merged if overlapping.
Regarding the random genomic locations, we sampled uniformly 500 regions from each chromosome of the human genome. Each region is 1000 base pairs (bp) long. For each of the 111 consolidated epigenomes, chromatin marks overlapping with the random locations were obtained.
If a set of regions, e.g. tissue-specific enhancers, contained more than 10,000 regions, we uniformly sampled 500 regions from each chromosome.
In this section, we discussed the computational method and the data used in the validation experiments. In the next section, we validate HebbPlot on synthetic and real data.
Results and Discussion
HebbPlot
We invented a new software tool called HebbPlot. HebbPlot has the following two specific aims: (i) learning automatically the chromatin signature of a group of genomic locations that have a common function, and (ii) representing this signature as a digitized image that is easily interpreted. The core of HebbPlot is a Hebb neural network. Hebb networks are known for their ability to learn associations, making them well suited for learning the chromatin signatures of genetic elements. To the best of our knowledge, this is the first application of Hebb networks in the field of epigenetics. The training process of the neural network is fully automated, enabling biologists without extensive computational knowledge to take advantage of advanced machine learning algorithms such as Hebb networks. The tool is general and can be applied to any set of genomic locations. HebbPlot is freely available to the academic community. It can be found at Software S1.
Results on synthetic data
Consider a step-pyramidal shape (Fig 2). One thousand noisy instances of this shape were generated by randomly shifting a step of the pyramid to the right or to the left by at most 200 units. A step may be deleted with a probability of 0.2. Each shape is represented by a matrix, in which an entry has a value of 1 (white) or −1 (black). To obtain this matrix, a group of evenly-spaced vertical lines is superimposed on the shape. If a line intersects a step of the pyramid, the corresponding entry in the matrix is 1. Otherwise, it is −1. More details about representing a shape are given under the Materials and Methods Section.
As a baseline, the original shape was retrieved from the noisy instances by a simple majority voting scheme. In this scheme, an entry of the prototype matrix is assigned 1 if the majority of the values stored in the same entry of the 1000 matrices are 1; otherwise, it is assigned −1. The prototype due to this method is similar to the original shape; however, its boundaries are inaccurate. In contrast, the prototype retrieved by the Hebb network looks very similar to the original shape. The boundaries of its steps are accurate; however, they are fuzzy. Similar results were obtained when this experiment was repeated multiple times using higher and lower mutation rates (shift amount: 0-300, step-deletion probability: 0-0.3), demonstrating the ability of Hebb networks to retrieve the original shape successfully.
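The synthetic experiment and the majority-voting baseline can be sketched as follows. The step coordinates, image width, and number of lines are hypothetical, since the dimensions of the original shape are not given here; only the shift range (at most 200 units) and the deletion probability (0.2) come from the text. A deleted step is represented as `None`.

```python
import random


def noisy_instance(steps, rng, max_shift=200, p_delete=0.2):
    """Shift each step left or right at random, or delete it."""
    noisy = []
    for start, end in steps:
        if rng.random() < p_delete:
            noisy.append(None)  # deleted step
            continue
        shift = rng.randint(-max_shift, max_shift)
        noisy.append((start + shift, end + shift))
    return noisy


def rasterize(steps, width, n_lines):
    """1 where an evenly spaced vertical line crosses a step, else -1."""
    spacing = (width - 1) / (n_lines - 1)
    xs = [j * spacing for j in range(n_lines)]
    return [[1 if step is not None and step[0] <= x <= step[1] else -1
             for x in xs] for step in steps]


def majority_vote(matrices):
    """Entry is 1 if more than half of the matrices have 1 there, else -1."""
    n = len(matrices)
    rows, cols = len(matrices[0]), len(matrices[0][0])
    return [[1 if sum(m[i][j] == 1 for m in matrices) > n / 2 else -1
             for j in range(cols)] for i in range(rows)]


# Hypothetical three-step pyramid on a 2000-unit-wide canvas.
pyramid = [(800, 1200), (600, 1400), (400, 1600)]
rng = random.Random(0)
instances = [rasterize(noisy_instance(pyramid, rng), 2000, 101)
             for _ in range(1000)]
baseline = majority_vote(instances)
```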
Results on real data
Next, we studied multiple enhancers potentially active in the H1 cell line (embryonic stem cell) obtained from the Roadmap Epigenomics Project. These enhancers were predicted using DNase I Hypersensitive sites and chromatin states associated with enhancers. This data set contains 11,369 putative H1-specific enhancers and 27 chromatin marks. Each enhancer region was expanded by 10% on each end to study how chromatin marks differ from/resemble the surrounding regions. To begin, 41 uniform samples/points were obtained from each region. Then for each point, it was determined whether or not it falls in a mark region overlapping the putative enhancer.
Next, we plotted the results as shown in Fig. 3. No clear signature appears in these plots. After that, we used the majority-voting scheme described earlier and HebbPlot in generating the signature of the H1-specific enhancers. The figure generated by HebbPlot shows more information than the majority plot.
The Hebb plot shows four distinct zones representing the absent marks, and the present ones with different confidence levels. For example, the top zone shows marks that are absent from the H1-specific enhancers. These marks include H2A.Z, H4K8ac, H3K9me3, H3K4me3, and H3K36me3. The bottom zone shows the marks that are present around these enhancers with the highest confidence level. These marks include H3K4(me1,me2), H3K79(me1,me2), and many acetylation marks. In contrast, the plot due to the majority-voting scheme shows only two zones representing the absent and the present marks without confidence information.
Further, because the enhancer regions were expanded on each end by 10%, a present mark is expected to be brighter around the center of an enhancer than its peripheries. The Hebb plot shows such information, whereas the brightness of the present marks is uniform around almost all marks shown in the majority plot. These results show that a Hebb plot is more accurate and shows more information than a plot generated by the majority-voting scheme.
The distinct chromatin signatures of different active elements
Twenty-eight chromatin marks of the IMR-90 (fetal lung fibroblast cell line) epigenome are available through the Roadmap Epigenomics Project. The project provides access to predicted enhancers and promoters specific to IMR-90. We sampled 11,268 enhancers, 13,226 promoters, and 11,390 coding regions of active genes in IMR-90. About 500 regions were uniformly sampled from each chromosome. In addition, we selected 10,000 locations sampled uniformly from all chromosomes of the human genome. Then we trained four Hebb networks to learn the chromatin signature of each genetic element.
Fig 4 shows the four Hebb plots. The promoter signature is characterized by a bright box that is clearly different from the surrounding regions. The center, where the transcription start sites are located, of the upper part of the box is less bright than its peripheries. With regard to the chromatin signature of the enhancers, it is characterized by multiple zones. Each zone has consistent brightness. The brightest zone at the bottom of the Hebb plot is the widest. Similarly, the coding regions signature is multi-zonal; however, the brightest zone is the narrowest and the middle gray zone is the widest. Chromatin marks should not be distributed in a consistent manner around regions that do not have a common function. As expected, the Hebb plot representing the random genomic locations displays a black box, indicating that no chromatin mark is distributed consistently around these regions.
After that, we repeated the same experiment on each of the 111 epigenomes of the Roadmap Epigenomics Project. The Hebb plots of the promoters, the enhancers, and the coding regions of active genes are available through Data set S1, Data set S2, and Data set S3. The four distinct signatures are consistent across all tissue types.
These plots demonstrate that HebbPlot is able to learn the chromatin signature from a group of regions with the same function. In addition, the chromatin signatures of the promoters, the enhancers, and the coding regions are clearly distinct.
The directional signature of active promoters
Because promoters are upstream from their genes, some marks may indicate the direction of the transcription. To determine whether or not marks have direction, tissue-specific putative promoters were separated according to the positive and the negative strands into two groups. Then the promoter region was expanded to include two equal-size regions upstream and downstream from the promoter region. Thus, the expanded region has these three equal-size parts: (i) the region upstream from the promoter, (ii) the promoter region itself, and (iii) the region downstream from the promoter. We trained two Hebb networks to learn the chromatin signatures of tissue-specific promoters on the positive and the negative strands. Fig. 5 shows the Hebb plots of the positive and the negative promoters active in H1 and male skeletal muscle. The two plots of the promoters on the positive and the negative strands are mirror images of each other, indicating that multiple marks are distributed in a directional manner; some marks tend to stretch more downstream (bright) than upstream (dark).
Next, we generated Hebb plots for the positive (Data set S4) and the negative (Data set S5) promoters of all tissues available through the Roadmap Epigenomics Project. This phenomenon was very consistent in all tissues.
Recall that two vectors pointing in opposite directions have a dotsim value of −1. The closer the value is to −1, the closer the angle between the two vectors is to 180°. To determine directional marks, the learned prototype of a mark over the upstream part of the expanded promoter region was compared to the prototype of the same mark over the downstream part. If the dotsim value between the two prototypes is −0.5 or lower, this mark is considered directional.
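The directional test can be sketched in Python as below. The sketch assumes that a mark's prototype row is already oriented from upstream to downstream, that the upstream and downstream parts are the first and last thirds of the row (using integer division when the number of lines is not a multiple of three), and that the two parts are compared position by position. These choices and the function names are illustrative.

```python
import math


def dotsim(x, y):
    """Dot product of the two vectors after normalization."""
    norm_x = math.sqrt(sum(v * v for v in x))
    norm_y = math.sqrt(sum(v * v for v in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # assumption: an all-zeros part has no direction
    return sum(a * b for a, b in zip(x, y)) / (norm_x * norm_y)


def is_directional(mark_prototype, threshold=-0.5):
    """True if the upstream and downstream thirds are roughly opposite."""
    third = len(mark_prototype) // 3
    upstream = mark_prototype[:third]
    downstream = mark_prototype[-third:]
    return dotsim(upstream, downstream) <= threshold


# Hypothetical prototypes: dark upstream and bright downstream vs. uniform.
stretching = [-0.8] * 4 + [0.5] * 4 + [0.9] * 4
uniform = [0.9] * 12
print(is_directional(stretching), is_directional(uniform))  # True False
```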
We list the number of times a chromatin mark was determined for a tissue and the number of times it showed directional preference in Table 1. The Roadmap Epigenomics Project did not determine all marks for the 111 tissues. We found that H3K79(me1/me2), H3K4(me1,me2,me3), and H3K9ac are extended toward the coding regions in 50% or more of the tissues, in which they are known. These results show that active promoters have a directional chromatin signature.
Promoters were separated by strand into positive and negative groups. Then the region of a promoter was expanded by 100% on each end. Mark vectors over the upstream and the downstream thirds of the expanded regions were compared. A mark is considered directional if these two vectors are opposite to one another (a dotsim value of −0.5 or lower). Not all marks were determined for all tissues. The number of tissues for which a mark was determined is listed under the column titled “Known.” The number of tissues in which a mark has directional preference around the promoter regions is listed under the column titled “Directional.” The ratio of these two numbers is listed under the column labeled “Ratio.”
The chromatin signatures of repetitive and non-repetitive enhancers
It has been reported that transposon subfamilies have an enhancer-like function in the human genome [63]. Further, transposons are known to act as enhancers in plant genomes [64–68]. Given the availability of the putative enhancers of more than a hundred cell types, we asked two questions.
First, what is the percentage of enhancers that are located within repeat sequences, e.g. transposons? To answer this question, we calculated the percentage of the tissue-specific enhancers that are included entirely in repetitive regions. Interestingly, up to 25% of the tissue-specific enhancers are repetitive. The highest percentage of 25% was observed in the primary T helper cells PMA-I stimulated, and the lowest percentage of 12% was observed in the female fetal brain. If the overlap percentage between enhancers and repeats is lowered to 50% instead of 100%, the percentages of the repetitive tissue-specific enhancers range between 22% and 37% (see Table S1). These results indicate that a large portion of enhancers are repetitive.
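The percentage calculation can be illustrated with a short sketch. It assumes half-open (start, end) intervals on one chromosome and a repeat annotation that has already been merged into non-overlapping intervals, so that full coverage is equivalent to being entirely included in a repetitive region. The coordinates are hypothetical.

```python
def covered_fraction(region, repeats):
    """Fraction of a region covered by merged, non-overlapping repeats."""
    start, end = region
    covered = sum(max(0, min(end, r_end) - max(start, r_start))
                  for r_start, r_end in repeats)
    return covered / (end - start)


def repetitive_percentage(enhancers, repeats, min_fraction=1.0):
    """Percentage of enhancers whose repeat coverage reaches min_fraction."""
    hits = sum(covered_fraction(e, repeats) >= min_fraction for e in enhancers)
    return 100.0 * hits / len(enhancers)


repeats = [(0, 100), (200, 300)]
enhancers = [(10, 50), (90, 110), (150, 160), (250, 350)]
print(repetitive_percentage(enhancers, repeats))        # 25.0 (fully included)
print(repetitive_percentage(enhancers, repeats, 0.5))   # 75.0 (at least half)
```

Lowering `min_fraction` from 1.0 to 0.5 mirrors the second experiment, in which an enhancer counts as repetitive when at least half of its sequence overlaps repeats.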
Second, how similar/different are the chromatin signatures of the repetitive enhancers and the non-repetitive ones? To answer this question, we obtained two chromatin signatures by training a Hebb network on the repetitive enhancers (Data set S6) and another network on the non-repetitive enhancers (Data set S7) active in each tissue. Then, we compared the two chromatin signatures using the dotsim function. The two signatures are almost identical (mean = 0.98, standard deviation = 0.03, maximum = 0.99, minimum = 0.83); recall that the dotsim value obtained by comparing a signature to itself is 1 (see Table S2). As an example, Fig 6 shows the two Hebb plots of the repetitive and the non-repetitive enhancers active in IMR-90. The two Hebb plots are almost identical. These results indicate that the chromatin signature of the repetitive tissue-specific enhancers is nearly identical to the signature of the non-repetitive enhancers, further supporting the enhancer-like function of transposons in the human genome.
The signature of active elements
Next, we asked if there is a common code among active genetic elements. Specifically, what is the combination of marks absent or present around active promoters, active enhancers, and coding regions of active genes? To answer this question, we applied our software tool, HebbPlot, to three active elements in the 111 consolidated epigenomes. A mark is included in our analysis if it is known in at least 5 of the 111 epigenomes. We compared the distributions of the same mark around two active genetic elements using the dotsim function (see the Materials and Methods Section). Two distributions of a mark are considered similar if they have a dotsim value of 0.5 or higher in at least 50% of the tissues in which this mark is known.
Table 2 shows the similar marks between (i) active promoters and active enhancers; (ii) active promoters and coding regions of active genes; and (iii) active enhancers and coding regions of active genes. These comparisons show that H3K79me1 is present with similar distributions around the three elements. Further, H3K9me3 and H3K27me3 are absent from these elements. H3K79me1 was previously reported to be associated with gene expression [47], whereas the two absent marks are known to be repressive marks [69]. These results imply that the chromatin signature of active elements consists of the presence of H3K79me1 and the absence of H3K9me3 and H3K27me3. These three marks represent a basic signature, which may be expanded by studying other active elements and additional chromatin marks when they become available.
The distributions of known marks in each of the 111 tissues were compared between (i) active promoters and active enhancers; (ii) active promoters and coding regions of active genes; and (iii) active enhancers and coding regions of active genes. The distributions of a mark over two genetic elements are considered similar if they have a dotsim value of 0.5 or higher. Recall that the dotsim values range between −1 and 1. The number of tissues for which a mark was determined is listed under the column titled “Known.” The number of tissues in which a mark has similar distributions around two genetic elements is listed under the column titled “Similar.” The ratio of these two numbers is listed under the column labeled “Ratio.”
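The counting rule behind the “Known,” “Similar,” and “Ratio” columns can be sketched as follows. The function name, its parameters, and the per-tissue values are hypothetical; only the thresholds (a mark known in at least 5 epigenomes, a dotsim cutoff of 0.5, and agreement in at least 50% of the tissues) come from the text. The `opposite` flag applies the mirrored cutoff of −0.5 or lower.

```python
from typing import List, Optional, Tuple


def summarize_mark(values: List[Optional[float]],
                   threshold: float = 0.5,
                   opposite: bool = False,
                   min_known: int = 5,
                   min_ratio: float = 0.5
                   ) -> Optional[Tuple[int, int, float, bool]]:
    """Summarize one mark across tissues.

    `values` holds one dotsim value per tissue, or None where the mark was
    not determined. Returns (known, count, ratio, passes), or None when the
    mark is known in fewer than `min_known` tissues.
    """
    known_values = [v for v in values if v is not None]
    known = len(known_values)
    if known < min_known:
        return None
    if opposite:
        count = sum(v <= -threshold for v in known_values)
    else:
        count = sum(v >= threshold for v in known_values)
    ratio = count / known
    return known, count, ratio, ratio >= min_ratio


# Hypothetical per-tissue dotsim values for a single mark.
values = [0.9, 0.7, None, 0.2, 0.6, -0.8, None]
similar_summary = summarize_mark(values)
opposite_summary = summarize_mark(values, opposite=True)
```

In this toy example the mark is known in five tissues and similar in three of them, so it would count as similarly distributed but not as oppositely distributed.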
Differences among the signatures of active elements
The figures generated by HebbPlot show that the signatures of active promoters, active enhancers, and coding regions of active genes are distinct. Additionally, the figures of the promoters and the enhancers appear more similar to one another than to the figure representing coding regions. In this analysis, we wanted to quantify the similarity/difference among these three elements by determining marks that are distributed differently.
We applied HebbPlot to the 111 epigenomes. Then we compared the distributions of the same mark around two genetic elements. The distributions of a mark around two genetic elements are considered opposite if they have a dotsim value of −0.5 or lower in at least 50% of the tissues in which this mark is known.
Table 3 shows marks with different distributions between (i) active promoters and active enhancers; (ii) active promoters and coding regions of active genes; and (iii) active enhancers and coding regions of active genes. These comparisons reveal that the signatures of active enhancers and active promoters are very similar; only one mark, H3K4me3, has different distributions around them. In contrast, the signature of active promoters differs in 8 marks from that of coding regions of active genes; these marks are H3K4(me1,me2,me3), H3K(9,18,27)ac, and H2A.Z. The signature of active enhancers differs in 14 marks from that of coding regions of active genes. These marks are H3K4(me1,me2), H3K36me3, H3K79me2, and 10 acetylation marks including H3K(9,18,27)ac. Interestingly, H3K14ac has opposite distributions around active enhancers and coding regions of active genes in all six tissues in which it is known.
The distributions of known marks in each of the 111 tissues were compared between (i) active promoters and active enhancers; (ii) active promoters and coding regions of active genes; and (iii) active enhancers and coding regions of active genes. The distributions of a mark around two genetic elements are considered opposite if they have a dotsim value of −0.5 or lower. Recall that the dotsim values range between −1 and 1. Not all marks were determined for all tissues. The number of tissues for which a mark was determined is listed under the column titled “Known.” The number of tissues in which a mark has opposite distributions over two genetic elements is listed under the column titled “Opposite.” The ratio of these two numbers is listed under the column labeled “Ratio.”
Clearly, the distributions of these marks can be used for distinguishing the signatures of the three active elements from each other. These results show that active enhancers and active promoters have similar signatures which markedly differ from the signature of coding regions of active genes.
Signature of inactive elements
We conducted the following experiment in search of a chromatin signature for inactive elements. Specifically, we aimed at studying the chromatin signatures of inactive promoters, inactive enhancers, and inactive genes. To determine promoters that are inactive in a specific tissue, we merged all putative promoters of all tissues. A promoter is considered inactive in a tissue if it does not overlap with any of the promoters active in this tissue. Inactive enhancers were determined in the same way. A gene whose transcription start site does not overlap with any of the putative tissue-specific promoters is considered inactive in this tissue. Next, we sampled about 500 elements from each chromosome of the human genome, totaling 11,000–13,000 elements. Then three Hebb networks were trained on the inactive promoters, the inactive enhancers, and the inactive genes of each tissue. After that, Hebb plots were generated from the signatures learned by these networks (Data set S8, Data set S9, and Data set S10). Upon examining the Hebb plots generated for the 111 tissues, we found the following:
Promoters and enhancers that are inactive in stem cells have chromatin signatures consisting of many marks. The intensities of these marks are weaker (less bright) than their counterparts in the signatures of promoters and enhancers active in stem cells (Fig 7 and Fig 8).
Out of the 111 tissues, the inactive promoters of 84 tissues were marked by H3K27me3, which is a repressive mark [69]. H3K27me3 shows a moderate signal around inactive promoters of the stem cells and the differentiated cells alike.
No mark of the available ones was present consistently around inactive enhancers in the differentiated cells (Fig 8).
No mark of the available ones was present consistently around coding regions of genes that are inactive in the stem and the differentiated cells (Fig 9).
There are more than 100 chromatin marks [37]. Therefore, it is possible that other marks may repress promoters, enhancers, or genes. However, the currently available data indicate that only H3K27me3 is consistently present around inactive promoters.
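The selection of inactive elements described at the start of this section (regions that overlap no region active in the tissue, followed by sampling about 500 elements per chromosome) can be sketched as below. This is an illustrative sketch under stated assumptions: regions are half-open `(chromosome, start, end)` tuples, the merged set is treated as a plain union of the per-tissue lists, and the function names and the fixed random seed are hypothetical.

```python
import random
from typing import Dict, List, Tuple

Region = Tuple[str, int, int]  # assumed (chromosome, start, end), half-open


def overlaps(a: Region, b: Region) -> bool:
    """True if two regions share at least one base on the same chromosome."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def inactive_regions(all_regions: List[Region],
                     active: List[Region]) -> List[Region]:
    """Regions of the merged set that overlap no region active in the tissue.

    Quadratic scan kept for clarity; an interval index would be needed for
    genome-scale inputs.
    """
    return [r for r in all_regions
            if not any(overlaps(r, a) for a in active)]


def sample_per_chromosome(regions: List[Region],
                          per_chrom: int = 500,
                          seed: int = 0) -> List[Region]:
    """Draw up to `per_chrom` regions from each chromosome without replacement."""
    rng = random.Random(seed)
    by_chrom: Dict[str, List[Region]] = {}
    for region in regions:
        by_chrom.setdefault(region[0], []).append(region)
    sample: List[Region] = []
    for chrom in sorted(by_chrom):
        pool = by_chrom[chrom]
        sample.extend(rng.sample(pool, min(per_chrom, len(pool))))
    return sample


# Hypothetical toy data, for illustration only.
merged = [("chr1", 0, 100), ("chr1", 200, 300), ("chr2", 0, 100)]
active_in_tissue = [("chr1", 50, 150)]
inactive = inactive_regions(merged, active_in_tissue)
subset = sample_per_chromosome(inactive, per_chrom=1)
```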
Online resource
We generated Hebb plots for multiple genetic elements, which are active and inactive in the 111 consolidated epigenomes provided by the Roadmap Epigenomics Project. Specifically, Hebb plots were generated for the following elements:
Active promoters.
Active promoters on the positive strand.
Active promoters on the negative strand.
Inactive promoters.
Active enhancers.
Active repetitive enhancers.
Active non-repetitive enhancers.
Inactive enhancers.
Coding regions of active genes.
Coding regions of inactive genes.
These Hebb plots are available in Data sets S1–S10. All of these regions were expanded by 10% on each end, except the active promoters on the positive and the negative strands, which were expanded by 100% on each end. The HebbPlot program is provided in Software S1.
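As a small worked example of the expansion step, the sketch below pads a region by a fraction of its own length on each end. The function name, the rounding of the padding to whole bases, and the clamping at position 0 are assumptions made for illustration; the upper bound is not clamped to the chromosome length here.

```python
from typing import Tuple


def expand_region(start: int, end: int,
                  fraction: float = 0.10) -> Tuple[int, int]:
    """Pad a half-open region by `fraction` of its length on each end."""
    pad = round((end - start) * fraction)
    return max(0, start - pad), end + pad


# A 1,000-bp region expanded by 10% and by 100% on each end.
ten_percent = expand_region(1000, 2000)
hundred_percent = expand_region(1000, 2000, fraction=1.0)
```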
Conclusion
Identifying a complex chromatin signature consisting of tens of marks distributed around thousands of regions is a challenging task. In this article, we described the first application of Hebb networks to learning the chromatin signature of a genetic element, e.g. promoters active in a specific tissue. These networks are known for their ability to learn associations. Therefore, they are well suited for learning the association between chromatin marks and thousands of sequences. We have developed a software tool called HebbPlot. The core of this tool is a Hebb network. Additionally, HebbPlot generates a digitized image representing the learned signature. The brightness level of a pixel indicates the confidence with which a mark is present or absent. For example, a white pixel indicates the presence of a mark around a part of the genetic element, and a black pixel indicates the absence of the mark. A row of pixels represents one mark. Similar rows are clustered and displayed together.
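The idea of learning a signature with a Hebbian rule and rendering it as pixels can be illustrated with the generic sketch below. It is not HebbPlot's implementation: the encoding of each (mark, region part) pair as +1 for present and −1 for absent, the fixed target of +1, the averaging step, and the linear grayscale mapping are all assumptions, and the clustering of similar rows is omitted.

```python
from typing import List


def hebb_signature(examples: List[List[float]]) -> List[float]:
    """Plain Hebbian learning with a fixed target of +1.

    Each example is a vector of +1/-1 values. The rule w += x * target is
    applied once per example, and the weights are averaged so that they
    stay in [-1, 1].
    """
    size = len(examples[0])
    weights = [0.0] * size
    target = 1.0
    for x in examples:
        if len(x) != size:
            raise ValueError("all examples must have the same length")
        for i, xi in enumerate(x):
            weights[i] += xi * target
    return [w / len(examples) for w in weights]


def to_gray(weight: float) -> int:
    """Map a weight in [-1, 1] to a pixel from 0 (black) to 255 (white)."""
    clipped = max(-1.0, min(1.0, weight))
    return round((clipped + 1.0) / 2.0 * 255)


# Four hypothetical regions, each described by four (mark, part) entries.
regions = [
    [1, -1, 1, -1],
    [1, -1, -1, 1],
    [1, -1, 1, 1],
    [1, -1, -1, -1],
]
signature = hebb_signature(regions)
pixels = [to_gray(w) for w in signature]
```

In this toy example the first entry is present in every region and maps to a white pixel, the second is absent in every region and maps to a black pixel, and the two inconsistent entries map to mid-gray, reflecting low confidence.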
The Roadmap Epigenomics Project determined tens of chromatin marks for 111 cell types. We used HebbPlot in deriving the chromatin signatures of multiple genetic elements including: (1) active promoters, (2) active promoters on the positive strand, (3) active promoters on the negative strand, (4) inactive promoters, (5) active enhancers, (6) active enhancers within repetitive regions, (7) active enhancers outside repetitive regions, (8) inactive enhancers, (9) active genes, and (10) inactive genes. By analyzing these plots, we drew the following conclusions:
Active promoters, active enhancers, and active genes have distinct chromatin signatures.
The promoter signature is directional; multiple marks around the promoters are stretched toward coding regions.
Enhancers within and outside repeats have almost identical chromatin signatures, supporting the enhancer-like functionality of transposons in the human genome.
H3K79me1 is distributed similarly around the three active elements. Additionally, H3K9me3 and H3K27me3 are absent from the three genetic elements. These three marks represent a basic signature of elements active in almost all of the 111 cell types.
The signatures of active promoters and active enhancers are more similar to one another than to the signature of coding regions of active genes.
H3K27me3, which is a repressive mark, is consistently present around inactive promoters.
The software and the signature plots of all elements of the 111 epigenomes have been made available.
In sum, HebbPlot is a general software tool that can learn and represent visually the chromatin signature of thousands of regions having the same function. HebbPlot can be applied to the currently available epigenomes and the ones that will be available in the near future.
Supporting information
Software S1 The source code of our software tool, HebbPlot.
Data set S1 Hebb plots of potential promoters of the 111 tissues.
Data set S2 Hebb plots of potential enhancers of the 111 tissues.
Data set S3 Hebb plots of coding regions of active genes of the 111 tissues.
Data set S4 Hebb plots of potential promoters, on the positive strand, of the 111 tissues.
Data set S5 Hebb plots of potential promoters, on the negative strand, of the 111 tissues.
Data set S6 Hebb plots of repetitive enhancers of the 111 tissues.
Data set S7 Hebb plots of non-repetitive enhancers of the 111 tissues.
Data set S8 Hebb plots of inactive promoters of the 111 tissues.
Data set S9 Hebb plots of inactive enhancers of the 111 tissues.
Data set S10 Hebb plots of coding regions of inactive genes of the 111 tissues.
Table S1 Percentages of repetitive enhancers in the 111 tissues. The percentages of the tissue-specific enhancers overlapping simple and interspersed repeats are listed in this file. Enhancers that do not overlap repeats are listed under the column “0%.” Under the column “50%,” we list the percentages of enhancers with at least 50% of their nucleotides overlapping repeats. The percentages of enhancers fully included within repetitive regions are listed under the column titled “100%.” (XLS)
Table S2 Comparisons between the signatures of repetitive and non-repetitive enhancers of the 111 tissues. The two signatures were compared using the dotsim function. (XLS)
Acknowledgments
This research was supported by internal funds provided by the College of Engineering and Natural Sciences and the Faculty Research Grant Program at the University of Tulsa.
Footnotes
* hani-girgis{at}utulsa.edu