Abstract
Motivation High throughput sequencing methods produce massive amounts of data. The most common first step in interpretation of these data is to map the data to genomic intervals and then overlap with genome annotations. A major interest in computational genomics is spatial genome-wide correlation among genomic features (e.g. between transcription and histone modification). The key hypothesis here is that features that are similarly distributed along a genome may be functionally related.
Results Here, we propose a method that rapidly estimates genomewide correlation of genomic annotations; these annotations can be derived from high throughput experiments, databases, or other means. The method goes far beyond the simple overlap and proximity tests that are commonly used, by enabling correlation of continuous data, so that the loss of data that occurs upon reduction to intervals is unnecessary. To include analysis of nonoverlapping but spatially related features, we use kernel correlation. Implementation of this method allows for correlation analysis of two or three profiles across the human genome in a few minutes on a personal computer. Another novel and extraordinarily powerful feature of our approach is the local correlation track output that enables overlap with other correlations (correlation of correlations). We applied our method to the datasets from the Human Epigenome Atlas and FANTOM CAGE. We observed the changes of the correlation between epigenomic features across developmental trajectories of several tissue types, and found unexpected strong spatial correlation of CAGE clusters with splicing donor sites and with poly(A) sites.
Availability The StereoGene C++ source code, program documentation, Galaxy integration scripts and examples are available at the project homepage at http://stereogene.bioinf.fbb.msu.ru/
Contact favorov{at}sensi.org
Supplementary information Supplementary data are available online.
Introduction
Modern high throughput genomic methods generate large amounts of data, which can come from experimental designs that compare tissue-specific or developmental stage-specific phenomena for human [7] and model organisms [4]. Single-cell approaches are also rapidly advancing [3]. Such datasets are integrated into several different archive databases [8, 37, 44] and manually curated databases [25].
An important challenge of genome-wide data analysis is to reveal and assess the interactions between biological processes, e.g. chromatin profiles and gene expression. A rapidly emerging approach to this challenge is to represent data as functions on genomic positions and to estimate correlations between these functions.
Numerous recent biological publications employ the correlation-based approach. Several research papers [41, 43] focus on relationships between transcription factor binding and chromatin state. These studies also include information on DNA accessibility [1], higher-order chromosomal organization [19], and association of chromatin modifications and alternative splicing [18, 23]. The research field has broadened its focus on analysis of individual and cell/tissue specific variation of epigenomic features and their relationship with diverse traits [31]. An interesting “Comparative epigenomics” paradigm [42] has emerged from an observation that combinations of epigenetic marks are more conserved than the individual marks themselves. This cooperation requires spatial relationships that are difficult to statistically ascertain.
Several bioinformatic methods that estimate the association between genome-wide numerical features have been recently proposed, and powerful aggregation and visualization tools were developed for manual analysis of colocalization of multiple features [12, 36, 37, 40].
Computational assessment of correlations on continuous genomewide data recruits various mathematical and statistical methods. For consistency with existing bioinformatic methods for positional correlation analysis, we use the terms profile or track for position-defined genomic features. For the colocalization analysis, genomic features are formalized as one of three types: profiles that are represented as a set of intervals on the genome (genes, repeats, CpG islands, etc.); point profiles (binding sites, TSS, splice sites); and continuous profiles, such as coverage data (expression, ChIP etc) resulting from high throughput sequencing experiments.
Many computational approaches have been developed to assess genomic features. An entropy-based approach has been developed for identification of differentially methylated regions [45]. A Bayesian mixture model is used for consistency analysis of different sources of data (ChIP-chip and ChIP-seq) [35]. A Hidden Markov Model is used for prediction of generalized chromatin states [10]. A probabilistic approach for the chromatin code landscape is introduced in [48]. A compendium of epigenomic maps is used in [9] to generate genome-wide predictions of epigenomic signal tracks, and a detailed review of machine learning for genome features is given in [21].
Correlations may be direct overlaps, but many of the most interesting relationships are more difficult to discern, as they require general proximity but not overlap. For example, gene expression (RNA-seq coverage) correlates with transcription factor binding or chromatin state in nearby promoter regions or distant enhancer regions. The distant spatial correlations of interval profiles are addressed in [6, 11].
The interval and point-wise genome-wide correlations are addressed in [6, 10, 11, 16]. A common approach to investigating genomic features is to represent these features as intervals, computed from the original continuous coverage data using a threshold or more sophisticated algorithms [46]. With these methods, the resulting track depends on the algorithm used, and portions of the original data are lost.
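The dependence of an interval track on the thresholding step can be illustrated with a minimal Python sketch. The function name `coverage_to_intervals`, the toy coverage values and both thresholds are assumptions made for illustration only; they are not taken from any of the cited tools.

```python
def coverage_to_intervals(coverage, threshold):
    """Return half-open [start, end) runs where coverage >= threshold."""
    intervals = []
    start = None
    for pos, value in enumerate(coverage):
        if value >= threshold and start is None:
            start = pos                      # a new interval opens here
        elif value < threshold and start is not None:
            intervals.append((start, pos))   # the open interval closes
            start = None
    if start is not None:                    # interval running to the end
        intervals.append((start, len(coverage)))
    return intervals


# Hypothetical per-base coverage: one broad weak peak and one sharp strong peak.
coverage = [0, 1, 2, 2, 3, 2, 1, 0, 0, 6, 9, 7, 1, 0]

low = coverage_to_intervals(coverage, threshold=2)
high = coverage_to_intervals(coverage, threshold=5)
print(low)   # both peaks survive
print(high)  # only the strong peak survives
```

With the higher threshold the weak peak disappears from the track entirely, which is the kind of information loss that working on the continuous profile avoids.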
KLTepigenome [24] works with continuous profiles directly, using the Karhunen-Loeve transform. This enables evaluation of both experimental variability and true biological signal (the biological signal tends to be in the higher components). While elegant, this method is slow, which precludes the exploratory analyses that are so important when analyzing these data.
Here, we propose a fast universal method to assess correlation of genomic profiles. The data can be discrete features (e.g. intervals) or continuous profiles (e.g. coverage data representing the level of histone methylation, protein binding, or expression). The method is based on calculation of the convolution integral with some kernel (kernel correlation, KC), with speedup using Fast Fourier Transform (FFT). The kernel allows calculation of correlation of the profiles that are smoothed over a genomic neighborhood.
The KC measure provides us with an estimate of spatial correlation (overlap, colocalization, or relative distance) of two features. To estimate the statistical significance of the correlation, we split the genome into a set of non-overlapping windows (100kb-1Mb). The foreground signal is computed as the distribution of correlation values for each of the windows. To get the background signal, we shuffle windows and recalculate the correlations. Statistical analysis is based on comparison of foreground and background distributions.
Our implementation is very quick: calculation for a pair of profiles over the human genome takes approximately 1-3 minutes on a standard PC.
The StereoGene source code and some examples are available at the project homepage at http://stereogene.bioinf.fbb.msu.ru/.
Materials and Methods
Kernel correlation
We consider each genomic feature as a numeric function (profile) of the genomic position x. The standard Pearson correlation of two profiles f = f(x) and g = g(x) is defined as:
$$r(f,g) = \frac{1}{|G|\,\sigma_f\,\sigma_g}\int_G \left(f(x)-\bar f\right)\left(g(x)-\bar g\right)dx$$
where $\bar f$ is the mean value of f, $\sigma_f$ is the standard deviation of f, and the integration is performed over the genome G. The Pearson correlation relates profile values at exactly the same genomic positions. In biological systems, the relationships of values at proximal but nonoverlapping (in genomic coordinates) positions are also important. These correlations may be mediated by chromatin looping or other interactions. To account for them, we use the following generalization of the covariation integral:
$$Q_\rho(f,g) = \int_G\!\int_G \left(f(x)-\bar f\right)\rho(x-y)\left(g(y)-\bar g\right)dx\,dy$$
where ρ(x − y) is a kernel function that reflects the expected interaction of features at adjoining positions. In the case ρ(x − y) = δ(x − y) we recover the standard covariation integral. Here, we use the Gaussian kernel, but other non-negative kernel functions can be used.
The two-dimensional integral $Q_\rho(f,g)$ can be rapidly calculated using a Fourier transform: expanding f, g and ρ in the harmonic basis functions $\phi_k(x) = \exp(2\pi i k x / L)$, where $\bar\phi_k$ denotes the complex conjugate of $\phi_k$, reduces the double integral to a single sum over products of Fourier coefficients. This takes into account that the zero coefficient of the Fourier transform of a function f is the average of the function, $\bar f$. The kernel correlation KC is then defined as the normalized kernel covariation, and the value KC(f,g) satisfies the inequality −1 ≤ KC(f,g) ≤ 1. The Fourier transform can be calculated by the discrete Fast Fourier Transform (FFT) algorithm [22], which has computational complexity O(|G| · log|G|), where |G| is the genome length. The complexity of the correlation coefficient calculation consists of the complexity of the Fourier transform and the complexity of summation, O(|G|). Hence the calculation time can be evaluated as O(|G| · log|G|).
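The reduction of the double integral to a single sum over frequencies can be sketched on a small discrete, circular profile. This is an illustration under stated assumptions, not the StereoGene implementation: the normalization KC = Q(f,g) / sqrt(Q(f,f) · Q(g,g)) is one choice that keeps the value within ±1 for a kernel with non-negative Fourier coefficients, and the function names, toy profiles and kernel width are hypothetical.

```python
import cmath
import math


def fft(a):
    """Recursive radix-2 FFT (unnormalized); len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return [complex(a[0])]
    even, odd = fft(a[0::2]), fft(a[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k], out[k + n // 2] = even[k] + t, even[k] - t
    return out


def gaussian_kernel(n, sigma):
    """Circular Gaussian kernel; distance is taken the shorter way round."""
    return [math.exp(-min(d, n - d) ** 2 / (2.0 * sigma ** 2)) for d in range(n)]


def kernel_cov(f, g, rho):
    """Q(f, g) = sum_x sum_y (f_x - mean f) rho(x - y) (g_y - mean g), via FFT."""
    n = len(f)
    mean_f, mean_g = sum(f) / n, sum(g) / n
    F = fft([v - mean_f for v in f])
    G = fft([v - mean_g for v in g])
    R = fft(rho)
    # Convolution theorem plus Parseval: one sum over frequencies.
    return sum((Fk * (Rk * Gk).conjugate()).real for Fk, Rk, Gk in zip(F, R, G)) / n


def kernel_cov_direct(f, g, rho):
    """The same quantity as an explicit O(n^2) double sum, for checking."""
    n = len(f)
    mean_f, mean_g = sum(f) / n, sum(g) / n
    return sum((f[x] - mean_f) * rho[(x - y) % n] * (g[y] - mean_g)
               for x in range(n) for y in range(n))


def kernel_correlation(f, g, rho):
    """Assumed normalization: Q(f, g) / sqrt(Q(f, f) * Q(g, g))."""
    return kernel_cov(f, g, rho) / math.sqrt(kernel_cov(f, f, rho) * kernel_cov(g, g, rho))


f = [0, 1, 3, 1, 0, 0, 0, 0, 0, 0, 2, 5, 2, 0, 0, 0]
g = f[-2:] + f[:-2]            # the same signal shifted by two positions
point = [1.0] + [0.0] * 15     # delta kernel: KC reduces to Pearson correlation
wide = gaussian_kernel(16, sigma=2.0)

kc_point = kernel_correlation(f, g, point)
kc_wide = kernel_correlation(f, g, wide)
print(kc_point, kc_wide)
```

In this toy case the two profiles barely overlap, so the point (Pearson) correlation is negative, while the Gaussian kernel credits the nearby peaks and yields a higher value.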
Cross-correlation (Distance correlation)
For two given profiles, f(x) and g(x) the cross-correlation function can be calculated:
The cross-correlation function reflects a distance dependence of the profiles. This function can also be calculated using Fourier transform:
where $FT^{-1}$ denotes the inverse Fourier transform, which can also be calculated using the FFT algorithm.
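A minimal Python sketch of the Fourier route to the cross-correlation function follows. It assumes one common convention, c(d) = Σ_x a(x) b(x + d) on mean-centred, circular profiles a and b, normalized so that a perfect match gives 1; the sign convention, the normalization and all names are assumptions of this illustration rather than details confirmed by the text.

```python
import cmath
import math


def fft(a, inverse=False):
    """Recursive radix-2 FFT (unnormalized); len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return [complex(a[0])]
    even, odd = fft(a[0::2], inverse), fft(a[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k], out[k + n // 2] = even[k] + t, even[k] - t
    return out


def cross_correlation(f, g):
    """Normalized circular cross-correlation for every shift d = 0 .. n-1."""
    n = len(f)
    mean_f, mean_g = sum(f) / n, sum(g) / n
    a = [v - mean_f for v in f]
    b = [v - mean_g for v in g]
    A, B = fft(a), fft(b)
    # Correlation theorem: FT(c) = conj(FT(a)) * FT(b).
    spectrum = [Ak.conjugate() * Bk for Ak, Bk in zip(A, B)]
    raw = [v.real / n for v in fft(spectrum, inverse=True)]
    norm = math.sqrt(sum(v * v for v in a) * sum(v * v for v in b))
    return [v / norm for v in raw]


f = [0, 1, 3, 1, 0, 0, 0, 0, 0, 0, 2, 5, 2, 0, 0, 0]
g = f[-3:] + f[:-3]            # g is f shifted by three positions
c = cross_correlation(f, g)
best_shift = max(range(len(c)), key=c.__getitem__)
print(best_shift, c[best_shift])
```

The position of the maximum recovers the shift between the two toy profiles, which is the distance dependence that the cross-correlation plot is meant to expose.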
Local correlation profile generation
Along with the integration of the correlation measure along the genome, StereoGene can generate a new profile that describes the kerneled local correlation of two profiles.
The integrals in this equation can be represented via Fourier transform, and the correlation profile is expressed as
This profile is useful for investigating relationships that are non-uniform along the genome, revealing more or less correlated segments. In particular, it can be used for a gene set enrichment analysis or correlated with a third genomic profile, and thus it can be involved in a 3-way correlation analysis that is analogous to liquid association [20]. This is a powerful and unique approach to dissecting complex relationships among genomewide datasets. Note that the value of LC is not bounded by ±1 and can take any value.
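One way to picture a local correlation track is sketched below in Python. The symmetric form LC(x) = ½ [a(x)·(ρ∗b)(x) + b(x)·(ρ∗a)(x)] on mean-centred profiles a and b is an assumption of this sketch, chosen because it sums to the kernel covariation over the whole region; the actual StereoGene formula and its normalization may differ, and all names and data are hypothetical.

```python
import cmath
import math


def fft(a, inverse=False):
    """Recursive radix-2 FFT (unnormalized); len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return [complex(a[0])]
    even, odd = fft(a[0::2], inverse), fft(a[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k], out[k + n // 2] = even[k] + t, even[k] - t
    return out


def smooth(values, rho):
    """Circular convolution of values with the kernel rho, via FFT."""
    n = len(values)
    product = [v * r for v, r in zip(fft(values), fft(rho))]
    return [v.real / n for v in fft(product, inverse=True)]


def local_correlation(f, g, rho):
    """Assumed symmetric local form; unnormalized, so not bounded by +/-1."""
    n = len(f)
    mean_f, mean_g = sum(f) / n, sum(g) / n
    a = [v - mean_f for v in f]
    b = [v - mean_g for v in g]
    sa, sb = smooth(a, rho), smooth(b, rho)
    return [0.5 * (a[x] * sb[x] + b[x] * sa[x]) for x in range(n)]


n = 16
rho = [math.exp(-min(d, n - d) ** 2 / (2.0 * 1.5 ** 2)) for d in range(n)]
f = [0, 1, 3, 1, 0, 0, 0, 0, 0, 0, 2, 5, 2, 0, 0, 0]   # two peaks
g = [0, 0, 2, 4, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]   # one peak, near the first

lc = local_correlation(f, g, rho)
print([round(v, 2) for v in lc])
```

In this toy example the track is highest where the peaks of both profiles are close together and negative at the second peak of f, where g carries no signal.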
Partial correlation
Nonrandom correlation of the two profiles may occur due to their correlation with a third profile (confounder) that systematically biases both signals (e.g. level of mappability). To computationally exclude such an influence, StereoGene can correlate the projections of the two profiles onto the subspace orthogonal to the confounder profile a.
A typical example of a confounder could be a common input track for two ChIP-seqs from the same cell type.
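The projection idea can be sketched in Python with plain (non-kernel) inner products. This is a simplified illustration, assuming that each profile is mean-centred and that its component along the centred confounder is subtracted before correlating; the function names and toy vectors are hypothetical.

```python
import math


def centre(v):
    """Subtract the mean from every value."""
    m = sum(v) / len(v)
    return [x - m for x in v]


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def pearson(u, v):
    uc, vc = centre(u), centre(v)
    return dot(uc, vc) / math.sqrt(dot(uc, uc) * dot(vc, vc))


def remove_confounder(f, a):
    """Project the centred profile f onto the subspace orthogonal to centred a."""
    fc, ac = centre(f), centre(a)
    coef = dot(fc, ac) / dot(ac, ac)
    return [x - coef * y for x, y in zip(fc, ac)]


def partial_correlation(f, g, a):
    return pearson(remove_confounder(f, a), remove_confounder(g, a))


# Hypothetical data: both profiles are dominated by the same confounder a.
a = [1, 2, 3, 4, 5, 6, 7, 8]
u = [1, -1, 1, -1, 1, -1, 1, -1]
v = [1, 1, -1, -1, 1, 1, -1, -1]
f = [3 * ai + ui for ai, ui in zip(a, u)]
g = [2 * ai + vi for ai, vi in zip(a, v)]

raw = pearson(f, g)
partial = partial_correlation(f, g, a)
print(raw, partial)
```

Here the raw correlation is close to 1 only because both toy profiles follow the confounder; after the projection the remaining correlation is close to zero.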
Statistical significance
The KC value provides useful information about the relative genomewide correlation of features, but it does not carry any information on statistical significance. To obtain the latter, KC is calculated in a set of adjacent large windows that cover the genome. Then, a shuffling procedure is used that randomly matches windows of one profile to another, and KC calculation is repeated in all the window pairs. Thus, two distributions, a foreground distribution of the real KC values and a background of permuted values, are obtained. The statistical significance is provided by a Mann-Whitney test of these two sets of values.
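The window-shuffling procedure can be sketched as follows. For brevity this Python illustration uses the plain Pearson correlation inside each window instead of KC, and a Mann-Whitney statistic with a normal approximation and no tie correction; the window width, the simulated profiles, the seed and all function names are assumptions of the sketch.

```python
import math
import random


def pearson(u, v):
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((x - mu) * (y - mv) for x, y in zip(u, v))
    var_u = sum((x - mu) ** 2 for x in u)
    var_v = sum((y - mv) ** 2 for y in v)
    return cov / math.sqrt(var_u * var_v)


def split(profile, width):
    """Cut a profile into adjacent, non-overlapping windows of equal width."""
    return [profile[i:i + width] for i in range(0, len(profile) - width + 1, width)]


def mann_whitney_z(xs, ys):
    """Z-score and two-sided p-value of the U statistic (normal approximation)."""
    u = sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in xs for y in ys)
    n1, n2 = len(xs), len(ys)
    mean_u = n1 * n2 / 2.0
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mean_u) / sd_u
    return z, math.erfc(abs(z) / math.sqrt(2.0))


def window_test(f, g, width, rng):
    wf, wg = split(f, width), split(g, width)
    foreground = [pearson(a, b) for a, b in zip(wf, wg)]    # matched windows
    shuffled = wg[:]
    rng.shuffle(shuffled)                                   # random re-matching
    background = [pearson(a, b) for a, b in zip(wf, shuffled)]
    return foreground, background, mann_whitney_z(foreground, background)


rng = random.Random(0)
f = [rng.gauss(0, 1) for _ in range(20 * 50)]
g = [x + 0.1 * rng.gauss(0, 1) for x in f]                  # g closely follows f

foreground, background, (z, p) = window_test(f, g, width=50, rng=rng)
print(round(z, 2), p)
```

A positive Z-score indicates that the matched windows are more correlated than the randomly re-matched ones, which corresponds to the right shift of the foreground distribution.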
Program implementation
As input, StereoGene accepts two or more input files in one of the standard Genome Browser formats: BED, WIG, BedGraph, and BroadPeak. In the first step, StereoGene converts input profiles to an internal binary format and saves the binary tracks for future runs. If a project refers to the saved profile and the parameters have not changed, StereoGene reuses the saved tracks. StereoGene also requires chromosome length information provided in any standard UCSC form.
Output depends on parameters and will provide the following files: *.bkg — array of correlations for shuffled windows; *.fg — correlations in coherent windows; *.dist — distance distribution (correlation function) for background, foreground, and chromosomes; *.wig — a wig file for local kernel correlations; *.chrom — statistics by chromosomes; ‘statistics’ — a file that stores statistics for all runs and provides a summary, including total correlation, Z-score for Mann-Whitney statistics, and p-value.
For a quick and intuitive depiction of results, StereoGene optionally generates an R script that graphs the output in two plots (Fig. 1). The first plot displays the foreground and background (permuted) distributions of the kernel correlation over genomic windows. A right shift of the foreground distribution relative to the background distribution represents positive correlation, and vice versa. The plot also displays more complicated features, such as multimodality, which show that the correlation is not uniform over the genome, or that multiple classes of features with different correlation profiles exist. The second plot, the (not kernel) cross-correlation function over possible feature-to-feature shifts, represents local relationships between the features. The program can perform batch analysis using track lists. The local correlation track is accompanied by a data table with the observed and the expected local correlation distributions and the FDR q-values calculated for each local correlation value (Fig. 2). More detailed information about the StereoGene software is presented in the supplementary file ProgDescr.pdf and in the program documentation.
StereoGene is implemented in C++. The time required for the binary file preparation depends on file size. On a standard computer, the preparation takes from a few seconds to 1-2 minutes. The calculation of correlation with shuffling requires roughly one minute. A complete description of the keys and parameters of StereoGene and of the output file formats is presented on the StereoGene homepage.
Data source
Data from the Roadmap Epigenomics Project [2] were obtained via the Human Epigenome Atlas (http://www.genboree.org/). Data for FANTOM4 CAGE clusters [30] were obtained from the UCSC website (RIKEN CAGE Loc tracks; GEO accession IDs were GSM849326 for nucleus and GSM849356 for cytosol in the H1 human embryonic stem cell line, RRID: CVCL_9771). The datasets with the tracks are listed in Supplementary file 1.
Results
Human Epigenome Atlas Pairwise Correlation Anthology
As a straightforward test of our method, we prepared an anthology of pairwise correlations of the profiles from the Human Epigenome Atlas [2]. We built a pipeline that analyzes colocalization of all pairs of different profiles from the same tissue (or cell line) and all pairs of the same profile from different tissues. The results are displayed at http://StereoGene.bioinf.fbb.msu.ru/epiatlas.html. Interestingly, the majority of the comparisons of Epigenomics Roadmap profiles show a significant positive correlation, while negative correlations appear rarely. In many cases the cross-correlation function has a narrow peak centered at zero. For example, Fig. 1 shows the distributions of the KC and the cross-correlation function for tracks H3K27me3 vs H3K36me3 in fetal brain cells. Though further evidence is needed, such behavior could reflect very precise nucleosome positioning; indeed, recent results from reChIP [15] suggest that the same nucleosome can carry different modifications.
To prepare an overview of this complex and multifaceted dataset, we split the Human Epigenome Atlas data into two collections: mature tissues and fetal tissues (refer to Supplementary file 1 for the track URLs). We focused on correlations of the most frequently studied epigenetic marks (i.e., H3K4me1, H3K4me3, H3K9me3, H3K27me3, and H3K36me3), as well as RNA-seq. Fig. 3A shows distributions of genome-wide correlations for the pairs of profiles. The largest difference in feature-to-feature correlation between the collections was observed for the H3K9me3 and H3K27me3 pair: they were significantly more correlated in adult tissues than in fetal ones. A comparison of the correlation between H3K9me3 and H3K27me3 in the same tissue for fetal and adult samples gave a p-value of 3.2 · 10^−5 (Wilcoxon test). This result is consistent with the prior observation that at early stages, different genomic regions are separately regulated by H3K9me3 and H3K27me3, but during tissue maturation, these heterochromatin marks become more synchronized [5]. One possible explanation is that H3K27me3 initiates chromatin compaction by recruitment of H3K9me3. The colocalization of H3K27me3 vs H3K36me3 relates to monoallelic gene expression [26]. Figure 3 shows a significant increase in the correlation of these marks in adult tissues in comparison with fetal tissues. The observation is consistent with recent studies [27]. Other pairs of epigenomic marks behaved similarly, but with a more moderate effect (Fig. 3B).
In some cases, a bimodal shape is observed in the distribution of correlations in a feature-to-feature comparison; this may indicate that subsets of a feature fall into multiple classes, each with different correlation properties. The correlation of H3K4me3 and H3K27me3 in adult lung tissue provides a good example of such bimodal behavior (Fig. 4A). These marks are widely assumed to have opposite effects: H3K4me3 is associated with active genes, while H3K27me3 is associated with closed chromatin. At the same time, the trimethylation of H3K4 and H3K27 presumably delineates bivalent domains in which developmental genes are poised for expression as the cell differentiates, and in general they are to be repressed in adult tissue [33]. To provide an example of StereoGene application, we analyzed genes associated with regions that carry high correlation of H3K4me3 vs H3K27me3 in the adult lung. To do this, we took the local correlation track (*.wig StereoGene output file) and selected 3000 of the highest peaks using MACS-1.4.2 [47]. Then we selected the genes whose TSS were located within ±5 kb of these peaks. The resulting list of genes was mined for biological enrichment using DAVID 6.7 software [14]. The most interesting terms found under the FDR < 5% threshold were alternatively spliced mRNAs, cell motion regulation and apoptosis (Supplementary file 2).
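The gene selection step described above (TSS within ±5 kb of a local correlation peak) can be sketched in Python as follows. The gene names, coordinates and peak positions are hypothetical placeholders, peaks are reduced to single positions for simplicity, and the function name `genes_near_peaks` is an assumption of this illustration rather than part of any of the tools mentioned.

```python
from bisect import bisect_left


def genes_near_peaks(tss_by_gene, peaks_by_chrom, flank=5000):
    """Return genes whose TSS lies within +/- flank bp of any peak position."""
    sorted_peaks = {chrom: sorted(pos) for chrom, pos in peaks_by_chrom.items()}
    selected = []
    for gene, (chrom, tss) in tss_by_gene.items():
        peaks = sorted_peaks.get(chrom, [])
        i = bisect_left(peaks, tss - flank)      # first peak >= tss - flank
        if i < len(peaks) and peaks[i] <= tss + flank:
            selected.append(gene)
    return selected


# Hypothetical annotation: gene -> (chromosome, TSS position).
tss_by_gene = {
    "GENE_A": ("chr1", 10000),
    "GENE_B": ("chr1", 50000),
    "GENE_C": ("chr2", 7000),
}
# Hypothetical peak summits taken from a local correlation track.
peaks_by_chrom = {"chr1": [14000, 90000], "chr2": [1000]}

selected = genes_near_peaks(tss_by_gene, peaks_by_chrom)
print(selected)
```

Sorting the peaks once and using a binary search keeps the lookup fast when thousands of peaks are compared with a whole-genome annotation.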
Chromosome-specific correlation of promoter and polycomb marks
We compared the relationship between two well-investigated histone marks, the promoter-related H3K4me3 and the heterochromatin polycomb-related H3K27me3, in the adult lung, chromosome by chromosome (see Fig. 4B). The genome-wide correlation distribution for these marks (i.e. with all the chromosomes pooled) is bimodal, with a rather high peak at positive correlations. At the same time, the correlation distribution on chromosome 19 has a significantly different shape and is shifted toward significantly lower values. This result could be explained by the fact that chromosome 19 has very high gene density and contains many housekeeping genes. The correlation distribution on chromosome X also differs slightly from the genome-wide distribution and has a peak at very high correlations.
Example of partial correlations
H3K4me3 is an ‘active promoter’ mark and is expected to be positively correlated with RNA-seq. Indeed, Fig. 5A shows a weak but statistically significant positive correlation. Interestingly, using the projection mode to remove H3K27me3 binding from the correlation of H3K4me3 with the RNA-seq profile (Fig. 5B) produces a much stronger, and bimodal, correlation. This suggests that the relationship of H3K4me3 to gene expression is modulated by H3K27me3 in some way. This observation is consistent with “poised promoters,” in which the activating and repressive histone marks are both bound; these promoters are a subset of all genes and this unusual behavior runs contrary to what is seen in the majority of promoters. Here, we have uncovered multiple promoter states in addition to the multiple modes of interaction between H3K4me3 and H3K27me3.
Chromatin marks vs gene features
We separated genes into three fractions by expression level for one cell type (brain cingulate gyrus): genes with high expression (top 25% of mRNA-seq level), genes with low expression (bottom 25% of mRNA-seq level) and moderately expressed genes (all other genes). We then plotted the cross-correlation function of histone marks vs gene features: gene start/end and intron start/end (Fig. 6). Generally, we can see:
A specific distribution of H3K4meX and H3K9ac near the TSS: two high peaks to the left and to the right of the TSS and a gap at the TSS position. This behavior is in agreement with other research [9].
Some specificity of the histone modifications near intron boundaries. The sharp break in the H3K4meX and H3K9ac levels at the ends of introns may be related to splicing definition. The H3K36me3 mark is usually related to PolII activity, and the observed cross-correlation function can be explained by possible regulation of the definition of intron and polyadenylation sites.
The H3K27me3 mark is usually related to bivalent promoters. For active genes there is a rather strong peak to the right of the TSS, while for weakly expressed and silent genes this mark has a wide peak.
Cohesin and histone modifications
We calculated the positional correlations of the cohesin protein Rad21 with CTCF and different histone modifications in H1 stem cells (RRID: CVCL_9771) and in the K562 (RRID: CVCL_5145) cell line (Table 1). We observed very strong positional correlation of CTCF binding with the cohesin protein Rad21. Promoter and enhancer regions (H3K4meX) were co-localized with cohesin, while actively transcribed regions and repressed regions were not related to cohesin. These observations are consistent with [38].
CAGE vs gene annotation
We analyzed the positional relationship of CAGE (FANTOM4 [30]) data, a genome-wide map of capped mRNA, for the nucleus and for cytosol of H1-hESC cells and the RefSeq [28] gene annotations.
The correlation functions are presented in Fig. 7. CAGE clusters are highly correlated with transcription start sites (Fig. 7A), as expected. In addition, we observed two unexpected phenomena: strong positional correlation of CAGE clusters with intron start sites (panel B) and strong positional correlation of CAGE clusters with transcription termination sites (panel C). Both observations held only when the CAGE clusters and genes were on the same strand, further supporting a meaningful biological relationship. More detailed analysis showed very precise localization of CAGE clusters at donor sites and at polyadenylation sites (Fig. 7D). To check the statistical significance of this observation, we selected equivalent random positions 500 bp downstream of the donor splice sites or polyadenylation sites as a control set. The resulting contingency tables are given in Table 2. Fisher's exact test for these contingency tables gave p-values less than 2.2 · 10^−16 in both cases.
The association of CAGE clusters with intron starts may be explained by the activity of debranching enzymes [32]. After lariat debranching, the freed 5′ end of the intron may become available for capping, and this cap would be detected by CAGE. Taft et al. [39] observed short (18-30 nucleotides) RNAs associated with donor splice sites. The authors suggested a model in which RNA polymerase produces such transcripts at donor splice sites during mRNA transcription. The transcriptional stop site correlation is less evident, though it suggests that occasional capping of the free 5′ end after cleavage by the polyadenylation complex is possible.
Discussion
We present a new method with unprecedented speed for estimation of genomewide positional correlations. As seen on public datasets, the approach yields biologically plausible results. The correlation distribution graphs depict multiple varieties of genomewide relationships. Local correlation tracks can be used for traditional gene enrichment analysis or to describe the relationship between genomic features. Currently, we use a permutation procedure to estimate statistical significance; however, KC distributions for different comparisons can also be compared with each other. Thus, for coverage tracks that are supplied with an input track for control, we can use the values of KC between a feature and the other feature’s input, or even between two inputs, as a background instead of the permutation results.
StereoGene is also available as a Galaxy plugin, and we provide two examples, one using 2-way correlation and the other using partial correlation, to illustrate usage. In both cases, the user can save the correlation track and use these data for more complex queries.
We compare StereoGene with commonly used tools in Table 3. Notably, very few programs can compute on continuous data (bedGraph, wig, etc.); most require the establishment of often arbitrary thresholds in order to create intervals for analysis. KLTepigenome [24] is able to work with continuous profiles, but is limited to sparse data and is quite slow even when compared to StereoGene doing the same computation on the full profile. We tested the difference between binarized profiles with different thresholds and continuous profiles. Use of a high threshold leads to overestimated correlations (Fig. 8). StereoGene has additional, unique functions such as partial correlation analysis and the ability to compute over a linear combination of different profiles.
We applied StereoGene to continuous, interval, and pointwise genomic data, including experimental results and annotation tracks. In all cases, StereoGene produced reliable and sometimes nonobvious, yet intuitive, results that stimulate further investigation. StereoGene is thus a powerful and promising method for identifying genome-level biological patterns. The potential for guided 3-way (liquid) correlation is particularly novel and enables elucidation of the phenomena underlying complex relationships.
Funding
This work was supported by Russian Scientific Foundation (grant 14-24-00155) and by National Institutes of Health (grant P30 CA006973). A.M. and A.F. were supported by Russian Foundation for Basic Research (grants 14-04-01872 and 14-04-00576). S.J.W. and T.N. were supported by Allegheny Health Network-Johns Hopkins Cancer Research Fund and JHU IDIES/Moore Foundation.
Acknowledgments
We are grateful to Roman Kudrin, Ekaterina Khrameeva and Alexandra Golytsyna for testing the program. Thanks to Renat Arufilov, Artur Zalevsky and to Dmitriy Vinogradov for technical solutions and for support. Thanks to Aleksey Stupnikov for his ideas for the future. Thanks to Leslie Cope for his advice. Thanks to Patricia Palmer for her help with the text of the manuscript.