Abstract
Transcriptional regulatory networks (TRNs) can be developed by computational approaches that infer regulator-target gene interactions from transcriptional assays. Successful algorithms that generate predictive, accurate TRNs enable the identification of regulator-target relationships in conditions where experimentally determining regulatory interactions is a challenge. Improving the ability of TRNs to successfully predict known regulator-target relationships in model species will enhance confidence in applying these approaches to determine regulator-target interactions in non-model species where experimental validation is challenging. Many transcriptional profiling experiments are performed across multiple time points; therefore we sought to improve regulator-target predictions by adjusting how time is incorporated into the network. We created ExRANGES, which incorporates Expression in a Rate-Normalized GEne Specific manner that adjusts how expression data is provided to the network algorithm. We tested this on two different network construction approaches and found that ExRANGES prioritizes targets differently than traditional expression and improves the ability of these networks to accurately predict known regulator targets. ExRANGES improved the ability to correctly identify targets of transcription factors in large data sets in four different model systems: mouse, human, Arabidopsis, and yeast. Finally, we examined the performance of ExRANGES on a small data set from field-grown Oryza sativa and found that it also improved the ability to identify known targets even with a limited data set.
Author Summary
In model organisms, the ability to identify direct targets of transcription factors (TFs) via high throughput experimental assays has advanced our understanding of transcriptional regulatory networks and how organisms regulate gene expression. However, for non-model organisms, it remains a challenge to identify TF–target relationships through experimental approaches such as ChIP-Seq, thus limiting the ability to understand regulatory control. Computational approaches to identify regulator-target relationships in silico from easily attainable transcriptional data offer a solution. Most algorithms for identifying gene regulatory networks from time series data weigh the relationship between regulators and putative targets at all time points equally. However, many regulators may control a single target in response to different inputs. Our approach, ExRANGES, focuses on time points where there is a significant change in expression to identify the association between regulators and targets. ExRANGES essentially weights the expression value of each time point by the slope change after that time point, thereby emphasizing the relationship between regulators and targets at the time points when the transcript levels are changing. We show that this change to the way expression data is incorporated into gene regulatory network algorithms improves the identification of regulator-target interactions and we hope this will improve in silico identification of regulatory relationships in many species.
Introduction
Transcriptional regulatory networks provide a framework for understanding how signals are propagated throughout the transcriptome of an organism. These regulatory networks are biological computational modules that carry out decision-making processes and, in many cases, determine the ultimate response of an organism to a stimulus [1]. Understanding the regulatory networks that drive responses of an organism to the environment provides access points to modulate these responses through breeding or genetic modifications. The first step in constructing such networks is to identify the primary relationships between transcription factor (TF) regulators and the target genes they control.
Experimental approaches such as ChIP-Seq can identify direct targets of transcriptional regulators. However, ChIP-Seq must be optimized for each specific TF, and specific antibodies must be used that recognize either the native TF or a tagged version of the protein. This can present a technical challenge, particularly for TFs where the tag interferes with function, for species that are not easily transformable, or for tissues that are limited in availability [2]. Since global transcript levels are comparatively easy to measure in most species and tissues, several approaches have been developed to identify connections between regulators and their targets by examining the changes in transcription levels across many samples [3–6]. The assumption of these approaches is that there is a correspondence between the expression of the regulator gene and its targets that can be discerned from RNA levels. Therefore, given sufficient variation in expression, the targets of a given factor can be predicted based on associated changes in expression. Initial approaches focused on the correlation between regulators and targets such that activators are positively correlated and repressors are negatively correlated with their target expression levels. These approaches have been successful in identifying some relationships [7]. More recent methods improved the ability to identify connections between regulators and targets even in sparse and noisy data sets [4–6,8–10]. The DREAM5 challenge compared many methods for their ability to identify transcriptional regulatory networks from gene expression datasets [11]. One of the top performing methods was GENIE3 [8]. This method identifies targets for selected regulators by taking advantage of the regression capabilities of the random forest machine learning algorithm [12,13]. Other successfully implemented approaches include SVM [3], CLR [6], CSI [14,15], ARACNE [5], Inferelator [4], and DELDBN [9].
Common to these methods is the use of the transcript abundance levels to evaluate the relationship between a regulator and its putative targets. However, correlation between expression levels alone may not utilize all information available in time series data. Many approaches have been developed that take advantage of the additional information available from time series data (reviewed in [16,17]).
Here we present an approach that expands upon these existing algorithms by using the rate of change between consecutive time points to emphasize the relationships between regulator and targets at times when expression is significantly changing. We predict that: 1) Focusing on the rate of change will utilize different characteristics in the data and identify different regulatory relationships than using the expression values. 2) Combining expression level and the rate of change will result in improved identification of true regulatory relationships.
We first evaluated the effects of incorporating the rate of change, and developed RANGES (RAte Normalized in a GEne Specific manner) to evaluate the significance of the rate changes at each consecutive time point. This approach has a similar recall rate to using expression values alone, but identifies a distinct set of true-positive targets. We then combined the expression and slope change in ExRANGES (Expression by RANGES) to emphasize the connections between regulators and targets at time points before a significant change in gene expression. ExRANGES improves the ability to identify experimentally validated TF targets in microarray and RNA-Seq data sets across multiple experimental designs, and in several different species. We demonstrate that this approach improves the identification of experimentally validated TF targets for GENIE3 [8] and INFERELATOR [4], but anticipate that it will offer a similar benefit when combined with other network inference algorithms.
Results
RANGES Identifies Significant Changes in Rate of Expression
We hypothesized that, for experiments measuring RNA levels across multiple time points, incorporating the rate of change between consecutive time points would identify regulator-target relationships missed by comparing expression values alone. If a gene is changing in expression at only a few time points across a data series, these time points may be more important samples for considering the relationship between potential regulators of that gene than time points where the target is expressed at a stable level. Therefore, we developed an approach that evaluates the rate of change of target genes across all consecutive time points and weights the change between each consecutive time point based on the background variance observed across the dataset for each gene. We predict that this approach focuses the comparison between regulatory factors and their targets on the time points where the effects of active regulation can be observed based on changes in RNA levels and will therefore identify regulatory relationships not detected by comparing expression values alone.
The first step in incorporating the rate of change into the identification of regulatory networks is to distinguish significant rate changes from normal variation between time points caused by sampling or measurement error. Our method determines the significance of the change in expression between two consecutive time points on a per gene basis, enabling us to assess the significance of the change at each time step for a given gene. For each gene, we quantified the significance of the change in expression at a given time point by estimating a p-value for the change in expression between the consecutive time points under evaluation against the background of all possible time steps. The background was constructed from the change in expression at all consecutive time steps in all samples across all experiments from a given data set (Fig 1A). For example, the mammalian circadian data set available from CIRCADB [18] consists of time series experiments from 12 different tissues, sampled every 2 h for 48 h (288 samples). Therefore, the change in expression levels between time t and time t + 1 can be determined for each consecutive time point. Since this data is cyclical, the interval between the last time point and the first time point is also included. We defined the background as the change in expression for each consecutive time interval across the entire time series. For this data set, the background consists of 288 slopes (12 tissues x 24 time points) for each gene. At each time step t, the slope between t and t + 1 was compared to this background and a p-value was estimated. This was done for each gene, and the resulting p-value was transformed to its negative log10 value with the sign of the slope preserved (R script provided). We call this value RANGES (for RAte Normalized in a GEne Specific manner). The RANGES value was used in lieu of the expression value in the generation of a regulatory network using GENIE3 [8].
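The per-gene calculation described above can be illustrated with a short sketch. This is not the R script referred to in the text. It assumes an empirical two-sided p-value (the fraction of background slopes at least as large in magnitude as the observed slope), because the text does not state how the p-value is estimated. The function name `ranges_values` and the `cyclic` argument are hypothetical.

```python
import numpy as np

def ranges_values(expr, cyclic=True):
    """Signed -log10 p-values for the slopes of ONE gene.

    expr: array of shape (n_series, n_timepoints), e.g. 12 tissues x 24 points.
    cyclic: if True, the interval from the last to the first time point is
            included, as described for circadian data.
    ASSUMPTION: the p-value is empirical (rank of |slope| in the background).
    """
    expr = np.asarray(expr, dtype=float)
    if cyclic:
        slopes = np.roll(expr, -1, axis=1) - expr   # slope from t to t + 1, wrapping
    else:
        slopes = np.diff(expr, axis=1)              # one fewer column than expr
    flat = slopes.ravel()
    background = np.abs(flat)                       # all slopes of this gene
    # Each slope counts itself, so the smallest possible p-value is 1 / n.
    counts = (background[None, :] >= np.abs(flat)[:, None]).sum(axis=1)
    p = counts / background.size
    signed = np.sign(flat) * -np.log10(p)           # keep the direction of change
    return signed.reshape(slopes.shape)
```

With 12 tissues and 24 time points per tissue, `cyclic=True` gives the 288 background slopes per gene mentioned in the text. A parametric p-value could be substituted without changing the rest of the sketch.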
We considered 1690 TFs as the regulators [19]. To determine the potential for the rate of change to identify targets of each TF, we compared RANGES to the standard approach of using expression values (hereinafter called EXPRESSION). For the EXPRESSION approach, the input into the regression analysis included the expression values across the 288 samples for each of the 1690 TFs as regulators and the expression values of 35,556 genes as potential targets across the same samples. For the RANGES approach, the –log10 of the p-value for the significance of each change in time across the 24 time steps was used as the input for both the 35,556 targets and 1690 regulators. For both approaches, all TFs were also included in the target list to identify regulatory connections between TFs.
To evaluate the ability of each approach to correctly identify targets of the TFs, we compared the resulting targets of each TF identified by either the RANGES or EXPRESSION approach with the targets identified by ChIP-Seq for five TFs involved in circadian regulation where three replicates of each ChIP-Seq experiment were performed: PER1, CLOCK, NPAS2, NR1D2, and ARNTL [20,21]. Targets identified by each approach that were considered significant targets by these published ChIP-Seq experiments were scored as true positive targets of that TF.
RANGES and EXPRESSION Values Identify Different Sets of True Positive Targets
We compared the targets identified by using RANGES to those identified using EXPRESSION. For PER1, both approaches identified more true targets than would be expected by chance (ROC curve, Fig S1). EXPRESSION showed a larger area under the ROC curve, indicating higher accuracy in identifying true positive targets of PER1. However, there was little overlap in the top true positive targets identified by each approach (Fig 1B). Many genes that were scored strongly by RANGES as PER1 targets, including many true positive targets of PER1, had low scores when evaluated using the EXPRESSION approach. Likewise, several of the top scoring true positive targets by EXPRESSION had low RANGES scores. This difference in the targets identified by each approach, including true positives, was also observed for the other four TFs we evaluated (Fig S2). These results indicate that information contained in the relationship between the rate of change of the TF and target identifies TF-target relationships missed by analyzing expression levels alone.
Rate Change Identified Samples with Lower Variation Between Tissues
To understand why some targets are identified by EXPRESSION only and others by RANGES only we compared the expression of the top predicted PER1 targets for each method (Fig 2A). We observed that the top hits identified by EXPRESSION showed more variation between each tissue than in those identified by RANGES. We therefore examined the variance between each tissue by calculating the variance of the mean expression for each of the 12 tissue samples for the top 1000 targets for all five of the TFs with ChIP-Seq data available (Fig 2B) [20]. As observed for the top PER1 targets, the targets identified by EXPRESSION generally showed more variation between tissues than the targets identified by RANGES. We also examined the within tissue variation to evaluate how well each approach identified targets that show a range of expression throughout the day within each time series (Fig 2C). The targets identified by RANGES showed more variation in the time series within each tissue suggesting that this approach might be more sensitive to changes that are dependent on the rate of expression as we would expect for this rate-based approach. To evaluate if the increased variance within each tissue observed for top TF targets identified by the RANGES approach is limited to circadian associated TFs, we compared the between tissue and within tissue standard deviation for the top 1000 targets identified by EXPRESSION or RANGES for all 1690 TF regulators (Figs 2D and E). As we observed for the circadian TFs, the targets identified by EXPRESSION showed more variation between tissue types (Fig 2D). The RANGES approach was able to identify targets with increased variation within each tissue time series compared to the EXPRESSION approach (Fig 2E).
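The between-tissue and within-tissue comparison above can be sketched for a single gene as follows. The between-tissue measure follows the text (variance of the per-tissue mean expression). The within-tissue measure is an assumption, taken here as the mean of the per-tissue variances across time, since the text does not define it exactly. The function name and the use of the sample variance (`ddof=1`) are illustrative choices.

```python
import numpy as np

def tissue_variation(expr):
    """Between- and within-tissue variation for ONE gene.

    expr: array of shape (n_tissues, n_timepoints).
    Returns (between, within):
      between = variance of the per-tissue mean expression
      within  = mean of the per-tissue variances over time (ASSUMPTION)
    """
    expr = np.asarray(expr, dtype=float)
    tissue_means = expr.mean(axis=1)              # one mean per tissue
    between = tissue_means.var(ddof=1)            # spread of tissue means
    within = expr.var(axis=1, ddof=1).mean()      # average spread inside a tissue
    return between, within
```

A gene that differs in baseline level between tissues but is flat over time scores high on `between` and near zero on `within`, which is the pattern reported for top EXPRESSION targets. A gene that cycles identically in every tissue shows the opposite pattern.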
We also compared the mean intensity level of the top 1000 predicted targets of the RANGES and EXPRESSION approaches. We observed that the top 1000 targets of PER1 identified by EXPRESSION had higher intensity levels compared to the distribution of expression of all transcripts on the microarray (Fig S3A). In contrast, the top 1000 predicted targets of PER1 identified by RANGES resembled the background distribution of intensity for all the transcripts on the array (Fig S3B). Likewise, the hybridization intensity of the top 1000 targets identified by EXPRESSION for all 1690 TFs considered as regulators was shifted higher compared to the background distribution (Fig S3C), whereas the top 1000 targets of all 1690 TFs identified by RANGES reflected the background distribution of hybridization intensity (Fig S3D). While hybridization intensity cannot directly be translated into expression levels, these observations suggest that there are features of the targets identified by RANGES that are distinct from those identified by EXPRESSION. We hypothesized that combining these two approaches would improve the overall ability to detect true positive targets of each regulator.
ExRANGES Combines Rate Change with Expression Levels
Since many of the true positive targets of the TFs we evaluated were identified by RANGES but not by EXPRESSION and vice versa, we hypothesized that combining these two would improve the overall ability to predict true positive targets. To combine these approaches, we multiplied the expression at time point t by the RANGES value for the change in expression from time point t to t + 1 for each target (ExRANGES) (Fig 3A). This adjusts each time point by the rate of change in the following time interval. Therefore, the value of the time point preceding a significant change in expression is higher than the value of a time point when the following expression remains unchanged. We anticipate that this will enhance the signal between the regulator and target for the time points where regulation is occurring, thus improving the ability to correctly identify targets of each TF. For the regulators, only the expression value of the TF was provided. For all targets, this ExRANGES value was provided to GENIE3. All TFs were also considered as potential targets and the ExRANGES value was used in the target matrix for all TFs.
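The construction of the two input matrices can be sketched as below. The sketch assumes that RANGES values are already available with the same shape as the expression matrix, as in the cyclic case where every time point has a following interval. For non-cyclic series the last time point of each series has no following slope and would need to be dropped. The function name `build_exranges_inputs` and its arguments are hypothetical.

```python
import numpy as np

def build_exranges_inputs(expr, ranges, tf_rows):
    """Regulator and target matrices as described for ExRANGES.

    expr:    (n_genes, n_samples) expression values
    ranges:  (n_genes, n_samples) signed -log10 p-values for the slope that
             FOLLOWS each sample (ASSUMPTION: same shape as expr)
    tf_rows: row indices of the genes treated as regulators
    """
    expr = np.asarray(expr, dtype=float)
    ranges = np.asarray(ranges, dtype=float)
    if expr.shape != ranges.shape:
        raise ValueError("expr and ranges must have the same shape")
    regulators = expr[np.asarray(tf_rows)]   # regulators: plain expression
    targets = expr * ranges                  # targets (TFs included): ExRANGES
    return regulators, targets
```

The two matrices would then be handed to a network inference method such as GENIE3, with each row of `targets` regressed on the rows of `regulators`.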
Using the identified ChIP-Seq targets as true positives from Koike et al. [20], we calculated the area under the ROC curve to compare the identification of true targets attained by EXPRESSION to the combination of expression and p-values using ExRANGES. We observed that for all five TFs there was an improvement in the ability to identify ChIP-Seq targets (Fig 3B).
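For readers who want to reproduce this kind of comparison, the area under the ROC curve for one TF can be computed from the per-gene scores produced by the network method and a true/false label for each gene. The sketch below uses the rank-sum formulation with average ranks for ties. It is a generic calculation and not the authors' evaluation code, and the function name `auroc` is illustrative.

```python
import numpy as np

def auroc(scores, is_true_target):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) identity."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(is_true_target, dtype=bool)
    n_pos = int(labels.sum())
    n_neg = labels.size - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one true and one false target")
    order = np.argsort(scores, kind="mergesort")
    sorted_scores = scores[order]
    ranks = np.empty(scores.size)
    i = 0
    while i < scores.size:
        j = i
        # Extend j to the end of the block of tied scores.
        while j + 1 < scores.size and sorted_scores[j + 1] == sorted_scores[i]:
            j += 1
        ranks[order[i:j + 1]] = (i + j) / 2 + 1   # average 1-based rank
        i = j + 1
    return (ranks[labels].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```

A value of 0.5 corresponds to random ordering of true and false targets, and 1.0 means every ChIP-Seq target is ranked above every non-target.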
A modification of GENIE3 uses a time delay to identify transcriptional changes in the regulator that precede the effects on the target by a defined time step. Incorporating a delay between regulator expression and target expression has previously been shown to improve the ability to identify regulatory networks [22]. We compared our approach to this modified implementation of GENIE3 that includes the time delay step. As previously reported, we observed that the time step delay improved target identification for some transcription factors compared to EXPRESSION alone, although in this data set, target identification for CLOCK, PER1, and NR1D2 did not improve. However, for all five TFs, ExRANGES outperformed both the EXPRESSION and time-delay approaches in identifying the true positive targets of each TF, although for CLOCK this improvement was very small (Fig 3B).
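The idea of a time delay can be illustrated by pairing regulator values at time t with target values at time t + lag within a single time series. This is a generic sketch of the concept and not the implementation of the cited GENIE3 modification. The function name `lagged_pairs` and the `lag` parameter are hypothetical.

```python
import numpy as np

def lagged_pairs(regulators, targets, lag=1):
    """Align regulator values at time t with target values at time t + lag.

    regulators: (n_regulators, n_timepoints) from ONE time series
    targets:    (n_targets, n_timepoints) from the same series
    Returns two matrices with n_timepoints - lag columns each.
    """
    regulators = np.asarray(regulators, dtype=float)
    targets = np.asarray(targets, dtype=float)
    n = regulators.shape[1]
    if targets.shape[1] != n:
        raise ValueError("regulators and targets must share time points")
    if not 1 <= lag < n:
        raise ValueError("lag must be between 1 and n_timepoints - 1")
    return regulators[:, :n - lag], targets[:, lag:]
```

When several time series are combined, the shift would be applied within each series separately so that samples from different tissues or individuals are never paired.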
The ExRANGES Approach Improves Target Identification for TFs That Are Not Components of the Circadian Clock
To evaluate the performance of ExRANGES on TFs that are not core components of the circadian clock, we compared the ability to identify targets of additional TFs validated by ChIP-Seq. To test ExRANGES performance across tissue types, we selected seven TFs in our regulator list that have available ChIP-Seq data from at least two experimental replicates performed in epithelial cells, a tissue not included in the circadian time series samples. The seven TFs that we tested are: ESR1, STAT5A, STAT5B, POL2A, FOXA1, TFAP2A, and CHD4 [23]. We observed improvement of the area under the ROC curve for five of the seven TFs (ESR1, POL2A, FOXA1, TFAP2A, and CHD4) by combining expression and rate change information using ExRANGES (Fig 3C). As we observed above for CLOCK, STAT5A and STAT5B performed equally well, but did not show significant improvement. STAT5A and STAT5B are known to be activated post-transcriptionally, perhaps indicating why evaluating the change in expression of these TFs did not lead to improved identification of targets [24–29]. This suggests that for TFs that show little variation in expression throughout the day in each time series, the addition of the RANGES component may not offer much improvement (Fig S4).
ExRANGES Improves Identification of TF Targets in Unevenly Spaced Time Series Data
Although circadian and diel time series experiments are a rich resource providing substantial variance for identifying regulatory relationships, most available experimental data is not collected with this design. Often sample collection cannot be controlled precisely to attain evenly spaced time points. For example, in human studies, the subject may not be available for consistent sampling. To evaluate the ability of ExRANGES to identify true targets of TFs across unevenly spaced time points and heterogeneous genotypes, we analyzed expression studies of viral infections in various individuals [30,31] using both ExRANGES and EXPRESSION approaches. This data set consists of a series of blood samples from human patients taken over a seven to nine day period, depending on the specific study. Sampling was not evenly spaced between time points. Seven studies that each sampled multiple individuals before and after respiratory infection are included. In total, 2372 samples were used, providing a background of 2231 consecutive time steps. Overall, the variance between samples was lower for this study than for the circadian study examined above (Fig 4A). The significance of a change in expression for each gene at each time step was compared to a background distribution of change in expression across all patients and time steps (2231 total slope changes). For the 83 TFs on the HGU133 Plus 2.0 microarray (Affymetrix, Santa Clara, CA) with ChIP-Seq data from blood tissue [32], we observed an overall improvement in the detection of ChIP-Seq identified targets (Fig 4B). The improvement varies by TF (Fig 4C).
ExRANGES Improves Functional Cohesion of Identified Targets
ChIP-Seq is one method to identify true targets of a TF. Another approach is to look at functional enrichment of predicted targets for a given regulator. The true targets of a TF are likely to be involved in the same functional pathways and therefore true targets would be enriched for the same functional categories as measured by enrichment of GO terms. Comparison of functional enrichment of TF targets identified by each approach enables the evaluation of how each approach performs on identifying targets for TFs without available ChIP-Seq data. We compared the functional enrichment of the top 1000 targets of each TF predicted by either approach using Homo sapiens GO slim annotation categories. We evaluated the 930 TFs on the HGU133 microarray [19]. Of these, the targets identified by ExRANGES for the majority of the TFs (590) showed improved functional enrichment compared to the targets identified by EXPRESSION (Fig 5A and B). Likewise, when focusing on the 83 TFs with available ChIP-Seq data from blood, the majority of TF targets predicted by ExRANGES were more functionally cohesive compared to EXPRESSION targets as evaluated by GO slim (Fig 5C). We observed that the improvement ranking of ExRANGES over EXPRESSION varies between the two validation approaches. For example, targets of the TF JUND identified by ExRANGES show no improvement over EXPRESSION when validated by ChIP-Seq identified targets, yet showed improved functional cohesion (Supplemental Table ST1).
ExRANGES Improves TF Target Identification from RNA-Seq Data and Validated by Experimental Methods Other Than ChIP-Seq
The previous evaluations of ExRANGES were performed on expression data obtained from microarray-based studies and true positives were based on ChIP-Seq identified targets of each TF. To evaluate the performance of ExRANGES compared to EXPRESSION for RNA-Seq data we applied each approach to an RNA-Seq data set performed in Saccharomyces cerevisiae. This data set consisted of samples collected from six different genotypes every fifteen minutes for six hours after transfer to media lacking phosphate. The slope background was calculated from 144 time steps. To evaluate the performance of ExRANGES compared to EXPRESSION approaches we calculated the area under the ROC curve for the identified targets for each of the 52 TFs using the TF targets identified by protein binding microarray analysis as true positives [33]. For most TFs, the AUC was improved by the use of ExRANGES compared to EXPRESSION (Fig 6A).
We next evaluated the performance of EXPRESSION and ExRANGES on a set of data from Arabidopsis consisting of 144 samples collected every four hours for two days in 12 different growth conditions. Even though fewer ChIP-Seq data sets are available to validate the predicted targets in Arabidopsis, we were able to evaluate the performance of the algorithms for five TFs with available ChIP-Seq or ChIP-Chip identified targets performed in at least two replicates [34–38]. We observed that for all five TFs ExRANGES showed improved identification of the ChIP-based true positive TF targets (Fig 6B). To evaluate a larger range of targets, we compared the targets predicted by EXPRESSION or ExRANGES to the targets of 307 TFs identified by DAP-Seq [39]. We observed that ExRANGES also showed an improved ability to identify targets as validated by DAP-Seq compared to EXPRESSION (Fig 6C).
Application of ExRANGES to Smaller Data Sets with Limited Validation Resources
Time series data offer several advantages; however, the expense is also significantly increased. We have shown that using ExRANGES in conjunction with GENIE3 improves performance on large data sets as validated by ChIP-Seq (288 samples in mouse, 2372 in human, and 144 in Arabidopsis) (Fig 7). We also compared the use of the ExRANGES approach to EXPRESSION alone with the INFERELATOR algorithm. ExRANGES showed an improved AUROC in all three data sets, and the largest increase was observed in the Arabidopsis data set, which has the lowest sample number (Fig S4). Since our interest is to develop a tool that can assist with the identification of regulatory networks in non-model species, we wanted to determine if ExRANGES could also improve identification of TF targets in more sparsely sampled data sets where there is only limited validation data available.
To determine the effectiveness of the ExRANGES approach for experiments with limited time steps, we evaluated the targets identified by ExRANGES and EXPRESSION for a single time series consisting of 28 samples from seven unevenly sampled time points of field grown rice data. ChIP-Seq has only been performed for one transcription factor in rice, OsMADS1 [40]. Therefore, we compared the ability of ExRANGES and EXPRESSION to identify the OsMADS1 targets identified by Khanday et al. ExRANGES showed an improved ability to identify the 3112 OsMADS1 targets identified by ChIP-Seq compared to EXPRESSION (Fig 8).
Discussion
Computational approaches that can identify candidate targets of regulators can advance research. Many approaches have been developed to identify regulator targets, but most of these use expression values. We have demonstrated that combining the expression levels and rate of change improves the ability to predict true targets of TFs across a range of species and experimental designs. This approach improves the identification of targets as determined by ChIP-Seq and protein binding microarray across many different collections of time series data, including experiments with and without replicates, time series that have unevenly sampled time points, and even time series with a limited number of samples. ExRANGES provides improvement in TF target identification over EXPRESSION values alone for time series performed with both microarray and RNA-Seq measurements of expression.
Expression analyses performed in time series, such as experiments evaluating the transcriptional changes throughout a circadian cycle, provide rich resources for identifying relationships between individual transcripts. Since in many species the majority of transcripts show variation in expression levels throughout the day [18,41,42], circadian and diel data sets provide a snapshot of the potential ranges in expression that a regulator can attain. The associated changes in target expression levels can be analyzed to identify potential regulatory relationships that may be enhanced in response to other perturbations such as stress. Here, we show that data sets that combine circadian time series in multiple tissues can be a powerful resource for identifying regulatory relationships between TFs and their targets, not just for circadian regulators, but also for regulators that are not components of the circadian clock. Targets identified using EXPRESSION as the features were those that showed large variance between tissues, while RANGES identified targets that showed larger variance within each time series. ExRANGES takes advantage of both sources of variation and improves the identification of TF targets for most regulators tested, including for TF-target relationships in tissues not included in the transcriptional analysis. Additionally, ExRANGES simplifies incorporation of replicate samples.
As implemented, ExRANGES improves the ability to identify regulator targets; however, there are many aspects that could be further optimized. For example, we tested ExRANGES with the network inference algorithm GENIE3 and demonstrated that it improved the performance of this algorithm. The ExRANGES method can be applied to most other machine learning applications such as Bayesian networks, mutual information networks, or even supervised machine learning tools. In addition, we showed that ExRANGES outperformed a one-step time delay. Conceptually, our method increases the weight of the time point before a major change in expression level. ExRANGES could be further modified to adjust where that weight is placed, a step or more in advance, depending on the time series data. Such incorporation of a time delay optimization into the ExRANGES approach could lead to further improvement for identification of some TF targets, although it would increase the computational cost.
Here, we compared ExRANGES based features to EXPRESSION based features by validating against TF targets identified by ChIP-Seq, ChIP-Chip, DAP-Seq, and protein binding microarray. While these experimental approaches identify potential TF targets in a genome-wide manner, they are not perfect as gold standards for validation of transcriptional regulatory networks. If there are systematic errors in target identification by ChIP-Seq, ExRANGES may perform better than indicated here. Although ChIP-Seq may not be an ideal gold standard, it does provide a benchmark for comparing computational approaches to identifying TF targets. Unfortunately, high quality ChIP-Seq data is not available in most organisms for more than a handful of TFs. For example, validation of this approach in rice was limited to one recently published ChIP-Seq dataset. This lack of experimentally identified targets is a severe hindrance to advancing research in these species. New experimental approaches such as DAP-Seq may provide alternatives for TF target identification in species recalcitrant to ChIP-Seq analysis [39]. Additionally, the authors of that study improved their recall of ChIP-Seq identified targets by selecting targets that were also supported by DNase-Seq sensitivity assays [43,44]. Likewise, distinguishing between direct and indirect targets predicted computationally could be enhanced by incorporation of DNase-Seq or motif occurrence information for the targets. Incorporation of such a priori information on regions of open chromatin and occurrence of cis-regulatory elements leads to improved network reconstruction [10,45]. Use of ExRANGES could lead to improvement for these integrated approaches. Although approaches such as DAP-Seq are more global in analyses than individual ChIP-Seq assays, these genome-wide approaches still require a significant investment from the community in the development of an expressed TF library collection.
For non-model systems, computational identification of TF targets can provide an economical first pass that can be followed up by experimental analysis of predicted targets, accepting the fact that there will be false positives in the validation pipeline. In this strategy, a small improvement in the ability to identify true targets of a given TF can translate into a reduced number of candidates to test and fewer experiments that must be performed. We hope that the modest improvements to regulatory network algorithms provided by the ExRANGES approach can facilitate research in species where identification of TF targets is experimentally challenging. Additionally, we hope that our finding that the way gene expression values are incorporated into a network has a significant effect on the ability to identify regulatory relationships will stimulate evaluation of new approaches that use alternative methods to incorporate time signals into regulatory network analysis.
In summary, we demonstrate that consideration of how expression data is incorporated can contribute to the success of transcriptional regulatory network reconstruction. ExRANGES is a first step at evaluating different approaches for how features are supplied to regulatory network inference algorithms. We anticipate that further optimization and other novel methods for integrating expression information will lead to improvements in network reconstruction that ultimately will accelerate biological discovery.
Materials and Methods
Sources for Expression Data Sets
Circadian Data Set
Normalized expression data from murine sources was downloaded from CircaDB [18]. Microarray-based expression levels from 288 samples were used in this study. The data available was from twelve different tissues that were sampled every 2 h for 48 h.
Viral Data Set
The expression data used for the viral experimental analysis was downloaded from GEO GSE73072. The data is composed of seven studies of individuals sampled before and after respiratory infection. Expression data is from blood samples of approximately twenty individuals taken over a seven to nine day period, depending on the individual study. Sampling was not evenly spaced between time points. In total, data from 2372 microarrays were used. The expression datasets used for the analyses described in this manuscript were contributed by Drs. Ephraim Tsalik and Geoffrey Ginsburg from Duke University and the Durham VA Medical Center. They were obtained as part of The Respiratory Viral DREAM Challenge through Synapse ID syn5647810 [31].
S. cerevisiae RNA-Seq Data
RNA-Seq based expression data from S. cerevisiae was downloaded from GEO GSE61668 [46]. This data set evaluates phosphate starvation in six genotypes of S. cerevisiae. Transcript expression was measured by RNA-Seq every 15 min for six hours after transfer to reduced phosphate media (150 samples total).
Arabidopsis Circadian Data
Normalized microarray expression data for Arabidopsis was obtained from www.mocklerlab.org/diurnal [47]. This data set consisted of Arabidopsis plants of various ages grown in 12 different environmental conditions sampled every 4 h for 48 h for a total of 144 samples.
Oryza sativa Diel Data
Rice variety IR64 was grown in the field at the International Rice Research Institute (, Philippines). When the plants reached 50% flowering, panicle tissue was harvested at dawn, dawn + 3.5 h, dawn + 7 h, dawn + 10.5 h, dusk, dawn + 14 h, dawn + 17.5 h, and dawn + 21 h. Four replicates were harvested for each of these eight time points for a total of 32 samples. The third rachis of the panicle was ground in liquid nitrogen with a metal pestle. The tissue was then lyophilized at -60 °C overnight. Total RNA was isolated using the RNeasy Plant Mini Kit (Qiagen, Germany) with the recommended RLT lysis buffer. The RNA extraction protocol was modified to include an additional incubation with DNaseI. mRNA was isolated from 2 µg of total RNA using magnetic oligo(dT) (NEB, Ipswich, MA). Directional RNA-Seq libraries were prepared from isolated mRNA. Libraries were quantified using a 2100 Bioanalyzer (Agilent, Santa Clara, CA). RNA-Seq was performed on a HiSeq 2500 (Illumina, San Diego, CA). Reads were trimmed using seqtk (https://github.com/lh3/seqtk). Samples were aligned with Tophat2 to the IRGSP-1.0 genome [48,49]. Counts per gene were identified by HTSeq Count [50]. Data is available through GSE92302.
Selection of Regulators
Transcription factors used as regulators for the murine circadian data and human viral data were obtained from http://www.bioguo.org/AnimalTFDB/index.php [19]. The Arabidopsis transcription factor list was obtained from http://planttfdb.cbi.pku.edu.cn/ [51]. S. cerevisiae transcription factors were obtained from [33].
Sources for Validation Resources
The direct targets for the five circadian TFs from murine data were obtained from the supplementary information provided in [20]. Targets for additional TFs and the validation of the viral data from human expression data were obtained from the cistrome project (http://cistrome.org/Cistrome/Cistrome_Project.html) [23]. Eighty-three TFs were selected as regulators; these were labeled as evaluated in blood tissue and were present on the HGU133 microarray. ChIP-Seq targets were determined by BETA (http://cistrome.org/BETA/). If multiple ChIP-Seq data sets were provided, they were combined. The Arabidopsis ChIP-Seq validations were obtained from multiple sources [34–38]. The validation for the yeast analysis was obtained from a TF-DNA binding array from Zhu et al. [33].
Slope and p-value Calculation for RANGES and ExRANGES
The R package ExRANGES has been prepared and is available at http://github.com/DohertyLab/ExRANGES. Briefly, the package performs the following modifications to expression data. The slope was calculated as $(x_{t_{n+1}} - x_{t_n}) / (t_{n+1} - t_n)$, where $x_{t_n}$ is the expression of a gene at time point $t_n$. The sample function from the R base package was used to draw 10,000 samples with replacement from the slopes calculated for each gene. The sampling population is dependent on the time series length (i.e. the circadian data has 48 data points to sample from). P-values of the actual slope compared to the distribution of the background slopes were calculated using ecdf from the R stats package [52]. To preserve direction, a duplicate version of the p-value matrix is created. The tails are switched in this duplicated matrix by subtracting the matrix from 1. The original matrix and the switched matrix are both transformed by -log10. The matrices are then combined by taking the higher value of the two matrices; if the value from the switched version is taken, its sign is changed (ifelse(matrix.up<matrix.down, -(matrix.down), matrix.up)). ExRANGES values were determined for each gene by multiplying the expression at $t_n$ by the weighted rate calculated from $t_n$ to $t_{n+1}$. See the R package provided at http://github.com/DohertyLab/ExRANGES and linked in OMICS tools at https://omictools.com/expression-in-a-rate-normalized-gene-specific-tool.
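The steps above can be illustrated for a single gene with the following minimal Python sketch. It is not the ExRANGES R package and is not guaranteed to match its output. The function name exranges, its arguments, the use of a per-gene background, and the lower bound placed on p-values to avoid taking the logarithm of zero are all assumptions made for illustration.

```python
import bisect
import math
import random


def exranges(expr, times, n_boot=10000, seed=0):
    """Illustrative sketch for one gene; returns (slopes, signed, values).

    expr  : expression values ordered by time
    times : sampling times (need not be evenly spaced)
    """
    rng = random.Random(seed)

    # Slope between each pair of consecutive time points.
    slopes = [
        (expr[i + 1] - expr[i]) / (times[i + 1] - times[i])
        for i in range(len(expr) - 1)
    ]

    # Background distribution: resample this gene's slopes with replacement.
    background = sorted(rng.choices(slopes, k=n_boot))

    # Assumption of this sketch: bound p-values below so -log10 stays finite.
    floor = 1.0 / n_boot

    signed = []
    for s in slopes:
        # Empirical CDF: fraction of background slopes <= s (lower tail).
        p_down = bisect.bisect_right(background, s) / n_boot
        # Switched tail: 1 - p.
        p_up = 1.0 - p_down
        down = -math.log10(max(p_down, floor))
        up = -math.log10(max(p_up, floor))
        # Keep the larger score; mark decreases with a negative sign.
        signed.append(-down if up < down else up)

    # Weight the expression at t_n by the signed rate score from t_n to t_n+1.
    values = [expr[i] * signed[i] for i in range(len(slopes))]
    return slopes, signed, values


# Hypothetical gene sampled every 2 h: a sharp rise, then a sharp fall.
example_expr = [1.0, 1.2, 1.1, 5.0, 5.3, 5.1, 1.0, 1.1]
example_times = [0, 2, 4, 6, 8, 10, 12, 14]
_, example_signed, example_values = exranges(example_expr, example_times)
print(example_signed)
print(example_values)
```

In this sketch a series of n time points yields n - 1 values, because the last time point has no following interval. How the package treats the final time point and extreme p-values is not specified in the text above.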
Network Inference using GENIE3
To predict regulatory interactions between transcription factors and target genes, GENIE3 was used. The GENIE3 script was taken from http://www.montefiore.ulg.ac.be/~huynh-thu/software.html on June 14, 2016 [8]. GENIE3 was modified to be usable with parLapply from the R parallel package [52]. We used 2000 trees for random forest for all data sets except the viral data set. For the viral data set, we limited it to 100 trees due to its size. The importance measure from random forest was used to calculate the area under the ROC curve.
Network Inference using INFERELATOR
TF-target interactions were calculated from both EXPRESSION and ExRANGES for the Circadian, Viral, Arabidopsis, and rice datasets. TF and target labels are identical to those used as GENIE3 input. Time information, in the form of the time step between each sample, was added as a parameter to satisfy the time course conditions; default values were used for all other parameters. Only confidence scores of TF-target interactions greater than 0 were evaluated against ChIP-Seq standards. The confidence scores were used as the prediction score to evaluate against the targets identified for each TF from experimental ChIP-Seq data.
ROC Calculation
ROC values were determined by the ROCR package in R [53]. The importance measures were used as the prediction score, and the targets from the respective experimental validation (ChIP-Seq, protein binding array, or DAP-Seq) were used as the labels against which performance was evaluated. The area under the ROC curve is presented to summarize the accuracy.
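As an illustration of this evaluation, the following minimal Python sketch computes the area under the ROC curve for one TF from prediction scores and binary target labels. It uses the rank-based (Mann-Whitney) formulation with average ranks for ties rather than the ROCR package used in this study. The function name roc_auc, the gene names, and the scores are hypothetical.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) statistic.

    scores : prediction score per candidate target (higher = more confident)
    labels : 1/True if the gene is an experimentally supported target, else 0
    """
    n_pos = sum(1 for y in labels if y)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative label")

    # Assign 1-based ranks by ascending score, averaging ranks across ties.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    # U counts the (positive, negative) pairs in which the positive ranks higher.
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y)
    u = rank_sum_pos - n_pos * (n_pos + 1) / 2.0
    return u / (n_pos * n_neg)


# Hypothetical importance scores for one TF and a hypothetical validated set.
importance = {"geneA": 0.40, "geneB": 0.35, "geneC": 0.10, "geneD": 0.05}
validated = {"geneA", "geneC"}
genes = sorted(importance)
auc = roc_auc([importance[g] for g in genes], [g in validated for g in genes])
print(auc)  # 0.75: three of the four (target, non-target) pairs are ordered correctly
```

An AUC of 0.5 corresponds to random ordering of targets and non-targets, and 1.0 to every validated target scoring above every non-target.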
Acknowledgments
This work was supported by funding from USDA NIFA 2014-04051. We would like to thank Katie Greenham and Erin Slabaugh for critical suggestions on the manuscript preparation. Additionally, we thank Steve Briggs for sharing the time, expertise, and helpful discussions of his research group.