Abstract
RNA-seq is commonly used to identify genetic modules that respond to a perturbation. Although transcriptomes have been mainly used for target gene discovery, their quantitative nature makes them attractive structures with which to study genetic interactions. To understand whether whole-organism RNA-seq is suitable for genetic pathway reconstruction, we sequenced the transcriptomes of four single mutants and two double mutants of the hypoxia pathway in C. elegans. By comparing the expression levels of double mutants with their corresponding single mutants, we were able to determine, on a genome-wide level, that EGL-9 acts along VHL-1-dependent and independent branches to inhibit HIF-1. We were also able to observe transcriptome-wide suppression of the egl-9(lf) phenotype in an egl-9(lf) hif-1(lf) double mutant. As a by-product of our analysis, we identified a core hypoxic response consisting of 355 genes, and 45 genes that have hif-1-independent, vhl-1-dependent expression. Finally, we identified 31 genes that exhibit non-canonical epistasis: for these genes, vhl-1(lf) mutants show opposing effects to egl-9(lf) mutants, but the egl-9(lf);vhl-1(lf) double mutant exhibits the egl-9(lf) phenotype. We suggest that this non-canonical epistasis reflects unexplored aspects of the hypoxia pathway. We discuss the implications and advantages of using transcriptomic phenotypes to perform pathway analysis.
Introduction
Genetic analysis of molecular pathways has traditionally been performed through epistasis analysis. Generalized epistasis indicates that two genes interact functionally; such interaction can involve the direct interaction of their products or the interaction of any consequence of their function (small molecules, physiological or behavioral effects)1. If two genes interact, and the mutants of these genes have a quantifiable phenotype, the double mutant of the interacting genes will have a phenotype that is not the sum of the phenotypes of the single mutants that make up its genotype. Epistasis analysis remains a cornerstone of genetics today2.
Recently, biological studies have shifted in focus from studying single genes to studying all genes in parallel. In particular, RNA-seq3 enables biologists to identify genes that change expression in response to a perturbation. Gene expression profiling using RNA-seq has become much more sensitive thanks to deeper and more frequent sequencing due to lower sequencing costs4, better and faster abundance quantification 5,6,7, and improved differential expression analysis methods8,9. RNA-seq has been successfully used to identify genetic modules involved in a variety of processes, including T-cell regulation10,11, the Caenorhabditis elegans (C. elegans) linker cell migration12, and planarian stem cell maintenance13,14. For the most part, the role of transcriptional profiling has been restricted to target gene identification.
Although transcriptional profiling has been primarily used for descriptive purposes, transcriptomic phenotypes have previously been used to make genetic inferences. Microarray analyses in S. cerevisiae and D. discoideum were used to show that transcriptomes can be interpreted to infer genetic relationships in simple eukaryotes15,16. eQTL studies in many organisms, from yeast to humans, have established the usefulness of transcriptomic phenotypes for population genetics studies17,18,19,20. In cell culture, single-cell RNA-seq has seen significant progress towards using transcriptomes as phenotypes with which to test genetic interactions21,22. More recently, we have identified a new developmental state of C. elegans using whole-organism transcriptome profiling23. To investigate the ability of whole-organism transcriptomes to serve as quantitative phenotypes for epistasis analysis in metazoans, we sequenced the transcriptomes of four well-characterized loss-of-function mutants in the C. elegans hypoxia pathway24,25,26,27.
Metazoans depend on the presence of oxygen in sufficient concentrations to support aerobic metabolism. Genetic pathways evolved to rapidly respond to any acute or chronic changes in oxygen levels at the cellular or organismal level. Biochemical and genetic approaches identified the Hypoxia Inducible Factors (HIFs) as an important group of oxygen-responsive genes that are involved in a broad range of human pathologies28.
Hypoxia Inducible Factors are highly conserved in metazoans29. A common mechanism for hypoxia-response induction is heterodimerization between a HIFα and a HIFβ subunit; the heterodimer then initiates transcription of target genes30. The number and complexity of HIFs varies throughout metazoans, with humans having three HIFα subunits and two HIFβ subunits, whereas in the roundworm C. elegans there is a single HIFα gene, hif-1 27 and a single HIFβ gene, ahr-1 31. HIF target genes have been implicated in a wide variety of cellular and extracellular processes including glycolysis, extracellular matrix modification, autophagy and immunity32,33,34,35,28.
Levels of HIFα proteins tend to be tightly regulated. Under conditions of normoxia, HIF-1α exists in the cytoplasm and partakes in a futile cycle of continuous protein production and rapid degradation36. HIF-1α is hydroxylated by three proline hydroxylases in humans (PHD1, PHD2 and PHD3) but is only hydroxylated by one proline hydroxylase (EGL-9) in C. elegans37. HIF-1 hydroxylation increases its binding affinity to Von Hippel Lindau Tumor Suppressor 1 (VHL-1), which allows ubiquitination of HIF-1 leading to its subsequent degradation. In C. elegans, EGL-9 activity is inhibited by binding of CYSL-1, and CYSL-1 activity is in turn inhibited at the protein level by RHY-1, possibly by post-translational modifications to CYSL-138 (see Fig. 1).
Here, we show that transcriptomes contain robust signals that can be used to infer relationships between genes in complex metazoans by reconstructing the hypoxia pathway in C. elegans using RNA-seq. Furthermore, we show that the phenomenon of phenotypic epistasis, a hallmark of genetic interaction, holds at the molecular systems level. We also demonstrate that transcriptomes contain sufficient information, under certain circumstances, to order genes in a pathway using only single mutants. Finally, we were able to identify genes that appear to be downstream of egl-9 and vhl-1, but do not appear to be targets of hif-1. Using a single set of genome-wide measurements, we were able to observe and quantitatively assess a significant fraction of the known transcriptional effects of hif-1 in C. elegans. A complete version of the analysis, with ample documentation, is available at https://wormlabcaltech.github.io/mprsq.
Results
The hypoxia pathway controls thousands of genes in C. elegans
We selected four single mutants within the hypoxia pathway for expression profiling: egl-9(lf) (sa307), rhy-1(lf) (ok1402), vhl-1(lf) (ok161), hif-1(lf) (ia4). We also sequenced the transcriptomes of two double mutants, egl-9(lf);vhl-1(lf) (sa307, ok161) and egl-9(lf) hif-1(lf) (sa307, ia4) as well as wild-type N2 as a control sample. Each genotype was sequenced in triplicate at a depth of 15 million reads. We performed whole-organism RNA-seq of these mutants at a moderate sequencing depth (~ 7 million mapped reads for each individual replicate) under normoxic conditions. For single samples, we identified around 22,000 different isoforms per sample, which allowed us to measure differential expression of 18,344 isoforms across all replicates and genotypes (this constitutes ~70% of the protein coding isoforms in C. elegans). We also included in our analysis a fog-2(lf) (q71) mutant which we have previously studied23, because fog-2 is not reported to interact with the hypoxia pathway. We analyzed our data using a general linear model on logarithm-transformed counts. Changes in gene expression are reflected in the regression coefficient, β which is specific to each isoform within a genotype. Statistical significance is achieved when the q-values for each β (p-values adjusted for multiple testing) are less than 0.1. Genes that are significantly altered between wild-type and a given mutant have β values that are statistically significantly different from 0. These coefficients are not equal to the average log-fold change per gene, although they are loosely related to this quantity. Larger magnitudes of β correspond to larger perturbations. These coefficients can be used to study the RNA-seq data in question.
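The significance rule above (q-value below 0.1) can be illustrated with a minimal sketch. The text does not state which multiple-testing correction was used, so the Benjamini-Hochberg procedure here is an assumption, and the function names `bh_qvalues` and `significant` are hypothetical.

```python
def bh_qvalues(pvalues):
    """Benjamini-Hochberg adjusted p-values (q-values), in input order.

    Assumption: the correction method is not named in the text; BH is a
    common choice and is used here only for illustration.
    """
    m = len(pvalues)
    # Indices sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonic q-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


def significant(pvalues, threshold=0.1):
    """Flag isoforms whose q-value falls below the threshold (0.1 in the text)."""
    return [q < threshold for q in bh_qvalues(pvalues)]
```

Applied to the p-values attached to each β, `significant` would return one flag per isoform; only flagged isoforms would count as differentially expressed.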
In spite of the moderate sequencing depth, transcriptome profiling of the hypoxia pathway revealed that this pathway controls thousands of genes in C. elegans. The egl-9(lf) transcriptome showed differential expression of 1,806 genes. Similarly, 2,103 genes were differentially expressed in rhy-1(lf) mutants. The vhl-1(lf) transcriptome showed considerably fewer differentially expressed genes (689), possibly because it is a weaker controller of hif-1(lf) than egl-9(lf)26. The egl-9(lf);vhl-1(lf) double mutant transcriptome showed 2,376 differentially expressed genes. The hif-1(lf) mutant also showed a transcriptomic phenotype involving 546 genes. The egl-9(lf) hif-1(lf) double mutant showed a similar number of genes with altered expression (404 genes, see Table 1).
Principal Component Analysis visualizes epistatic relationships between genotypes
Principal Component Analysis (PCA) is a well-known technique in bioinformatics that is used to identify relationships between high-dimensional data points39. We performed PCA on our data to examine whether each genotype clustered in a biologically relevant manner. PCA identifies the vector that can explain most of the variation in the data; this is called the first PCA dimension. Using PCA, one can identify the first n dimensions that can explain more than 95% of the variation in the data. Sample clustering in these n dimensions often indicates biological relationships between the data, although interpreting PCA dimensions can be difficult.
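As an illustration of what "the vector that can explain most of the variation" means, the following sketch finds the first principal component by power iteration on the sample covariance matrix and reports the fraction of variance it explains. It is a toy written with the standard library only, not the analysis pipeline; the function name is hypothetical, and building a full covariance matrix this way would be impractical for thousands of isoforms.

```python
import math


def first_principal_component(rows, iterations=500):
    """Return (unit vector, fraction of variance explained) for the first PC.

    rows: list of samples, each a list of feature values (e.g. beta values).
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration converges to the leading eigenvector of cov, provided
    # the starting vector is not orthogonal to it (assumed here).
    v = [float(j + 1) for j in range(d)]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient of the unit vector gives the leading eigenvalue.
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d))
                     for i in range(d))
    total_variance = sum(cov[i][i] for i in range(d))
    return v, eigenvalue / total_variance
```

Repeating the procedure on the residual variance would give further dimensions, and the cumulative fractions indicate how many dimensions are needed to pass a 95% threshold.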
After applying PCA, we expected hif-1(lf) to cluster near egl-9(lf) hif-1(lf), because hif-1(lf) exhibits no phenotypic defects under normoxic conditions, in contrast to egl-9(lf), which exhibits an egg-laying (Egl) phenotype in the same environment. In egl-9(lf) hif-1(lf) mutants the Egl phenotype of egl-9(lf) mutants is suppressed and instead the grossly wild-type phenotype of hif-1(lf) is observed. On the other hand, we expected egl-9(lf), rhy-1(lf), vhl-1(lf) and egl-9(lf);vhl-1(lf) to form a separate cluster since each of these genotypes is Egl and has a constitutive hypoxic response. Finally, we included as a negative control a fog-2(lf) mutant we have analyzed previously 23. This data was obtained at a different time from the other genotypes, so we included a batch-normalization term in our equations to account for this. Since fog-2 has not been described to interact with the hypoxia pathway, we expected that it should appear far away from either cluster.
The first dimension of the PCA analysis was able to discriminate between mutants that have constitutive high levels of HIF-1 and mutants that have no HIF-1, whereas the second dimension was able to discriminate between mutants within the hypoxia pathway and outside the hypoxia pathway (see Fig. 2). Therefore expression profiling measures enough signal to cluster genes in a meaningful manner in complex metazoans.
Reconstruction of the hypoxia pathway from first genetic principles
Having shown that the signal in the mutants we selected was sufficient to cluster mutants using the values of the regression coefficients β, we set out to reconstruct the hypoxia pathway from genetic first principles. In general, to reconstruct a pathway, we must first assess whether two genes act on the same phenotype. If they do not act on the same phenotype (the set of commonly differentially regulated genes between two mutants is empty), these mutants are independent. If they are not independent, then two mutants have a shared transcriptomic phenotype (STP)—a set of genes or isoforms that are differentially expressed in both mutants, without taking into account what direction they change in. In this case, we must measure whether these genes act additively or epistatically on the measured phenotype; if there is epistasis we must measure whether it is positive or negative, in order to assess whether the epistatic relationship is a genetic suppression or a synthetic interaction.
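The definition of a shared transcriptomic phenotype amounts to a set intersection. A minimal sketch follows, assuming each mutant is represented by a dictionary from isoform identifier to its significant β (a hypothetical data layout), and assuming that the shared fraction is measured against the union of both transcriptomes.

```python
def shared_transcriptomic_phenotype(de_a, de_b):
    """Isoforms differentially expressed in both mutants, ignoring direction.

    de_a, de_b: dicts mapping isoform id -> regression coefficient (beta),
    restricted to isoforms that passed the significance threshold.
    """
    return set(de_a) & set(de_b)


def shared_fraction(de_a, de_b):
    """Size of the STP relative to all isoforms altered in either mutant."""
    union = set(de_a) | set(de_b)
    if not union:
        return 0.0
    return len(shared_transcriptomic_phenotype(de_a, de_b)) / len(union)
```

An empty intersection corresponds to the independence case described above; a non-empty one is the starting point for asking whether the two genes act additively or epistatically.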
Genes in the hypoxia mutant act on the same transcriptional phenotype
We observed that all the hypoxia mutants had significant shared transcriptomic phenotypes (fraction of the transcriptomes that was shared between mutants ranged from a minimum of 6.8% shared between hif-1(lf) and egl-9(lf);vhl-1(lf) to a maximum of 31% shared genes between egl-9(lf) and egl-9(lf);vhl-1(lf)). For comparison, we also analyzed a previously published fog-2(lf) transcriptome23. The fog-2 gene is involved in masculinization of the C. elegans germline, which enables sperm formation, and is not known to be involved in the hypoxia pathway. The hypoxia pathway mutants and the fog-2(lf) mutant also showed shared transcriptomic phenotypes (3.6%–12% genes), but correlations between expression level changes were considerably weaker (see below), suggesting that there is minor cross-talk between these pathways.
We wanted to know whether it was informative to look at quantitative agreement within STPs. For each mutant pair, we rank-transformed the regression coefficients β of each isoform within the STP, and calculated lines of best fit using Bayesian regression with a Student-T distribution to mitigate noise from outliers and plotted the results in a rank plot (see Fig 3). For transcriptomes associated with the hypoxia pathway, we found that these correlations tended to have values higher than 0.9 with a tight distribution around the line of best fit. The correlations for mutants from the hypoxia pathway with the fog-2(lf) mutant were considerably weaker, with magnitudes between 0.6–0.85 and greater variance around the line of best fit. Although hif-1 is known to be genetically repressed by egl-9, rhy-1 and vhl-124,25, all the correlations between mutants of these genes and hif-1(lf) were positive.
After we calculated the pairwise correlation within each STP, we weighted the result of each regression by the number of isoforms within the STP divided by the total number of differentially expressed isoforms present in the two mutant transcriptomes that contributed to that specific STP, $N_{\mathrm{overlap}}/N_{g_1 \cup g_2}$. The weighted regressions recapitulated a module network (see Fig. 4). We identified a strong positive interaction between egl-9(lf) and rhy-1(lf). The magnitude of this weighted correlation derives from the size of the transcriptomes for these mutants (1,806 and 2,103 differentially expressed genes respectively) and from the extensive overlap between them, which makes the weighting factor considerably larger than for other pairs. The weak correlation between hif-1(lf) and egl-9(lf) results from the small size of the hif-1(lf) transcriptome and the small overlap between the transcriptomes.
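The weighting step can be sketched as follows. The analysis in the text uses a robust Bayesian regression with a Student-T distribution on rank-transformed coefficients; this sketch swaps in an ordinary Pearson correlation on the ranks (that is, a Spearman correlation) as a simple stand-in, so it illustrates the weighting only. All function names and the dictionary layout are hypothetical.

```python
def ranks(values):
    """Rank-transform values (1 = smallest); ties are not treated specially."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        out[i] = rank
    return out


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def weighted_rank_correlation(de_a, de_b):
    """Rank correlation within the STP, scaled by N_overlap / N_union.

    de_a, de_b: dicts mapping isoform id -> significant beta.
    Stand-in for the robust Bayesian regression described in the text.
    """
    shared = sorted(set(de_a) & set(de_b))
    union = set(de_a) | set(de_b)
    rho = pearson(ranks([de_a[g] for g in shared]),
                  ranks([de_b[g] for g in shared]))
    return rho * len(shared) / len(union)
```

Under this weighting, two mutants with a perfect rank correlation but little overlap still receive a small score, which is the behavior described for hif-1(lf) and egl-9(lf).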
The fine-grained nature of transcriptional phenotypes means that these weighted correlations between transcriptomes of single mutants are predictive of genetic interaction.
A quality check of the transcriptomic data reveals excellent agreement with the literature
One way to establish whether genes are acting additively or epistatically to each other is to perform qPCR of a reporter gene in the single and double mutants. This approach was used to successfully map the relationships within the hypoxia pathway (see, for example26,25). A commonly used hypoxia reporter gene is nhr-57, which is known to exhibit a several-fold increase in mRNA expression when HIF-1 accumulates25,34,40. Likewise, increased HIF-1 function is known to cause increased expression of rhy-1 and egl-9 41.
We can selectively look at the expression of a few genes at a time. Therefore, we queried the changes in expression of rhy-1, egl-9, and nhr-57. We included the nuclear laminin gene lam-3 as a representative negative control not believed to be responsive to alterations in the hypoxia pathway. nhr-57 was upregulated in egl-9(lf), rhy-1(lf) and vhl-1(lf), but remains unchanged in hif-1(lf). egl-9(lf);vhl-1(lf) had an expression level similar to egl-9(lf); whereas the egl-9(lf) hif-1(lf) mutant showed wild-type levels of the reporter expression, as reported previously25 (see Fig. 5).
We observed changes in rhy-1 expression consistent with previous literature25 when HIF-1 accumulates. We also observed increases in egl-9 expression in egl-9(lf). egl-9 is known as a hypoxia-responsive gene41. Although changes in egl-9 expression were not statistically significantly different from the wild-type in rhy-1(lf) and vhl-1(lf) mutants, the mRNA levels of egl-9 still trended towards increased expression in these genotypes. As with nhr-57, egl-9 and rhy-1 expression were wild-type in egl-9(lf) hif-1(lf), and the egl-9(lf);vhl-1(lf) mutant showed expression phenotypes identical to egl-9(lf). This dataset also showed that knockout of hif-1 resulted in a modest increase in the levels of rhy-1. This suggests that hif-1, in addition to being a positive regulator of rhy-1, also inhibits it, which constitutes a novel observation. Using a single reporter we would have been able to reconstruct an important fraction of the genetic relationships between the genes in the hypoxia pathway, but we would likely have failed to observe other genetic interactions, such as the evidence for hif-1 negatively regulating rhy-1 transcript levels.
Transcriptome-wide epistasis
Ideally, any measurement of transcriptome-wide epistasis should conform to certain expectations. First, it should make use of the regression coefficients of as many genes as possible. Second, it should be summarizable in a single, well-defined number. Third, it should have an intuitive behavior, such that special values of the statistic should each have an unambiguous interpretation.
One way of displaying transcriptome-wide epistasis is to plot transcriptome data onto an epistasis plot (see Fig 6). In an epistasis plot, the X-axis represents the expected expression of a double mutant a−b− if a and b interact additively. In other words, each individual isoform’s x-coordinate is the sum of the regression coefficients from the single mutants a− and b−. The Y-axis represents the deviations from the additive (null) model, and can be calculated as the difference between the observed regression coefficient and the predicted regression coefficient. Only genes that are differentially expressed in all three genotypes are plotted. Assuming that the two genes interact via a simple phenotype (for example, if both genes affect a transcription factor that generates the entire transcriptome), these plots will generate specific patterns that can be described through linear regressions. The slope of these lines, $s_{a,b}$, is the transcriptome-wide epistasis coefficient.
Epistasis plots can be understood intuitively for simple cases of genetic interactions. If two genes act additively on the same set of differentially expressed isoforms, then all the plotted points will fall along the line y = 0. If two genes interact in an unbranched pathway, then a−, b− and a−b− should have identical phenotypes, provided all the genotypes are homozygous for genetic null alleles1. It follows that the data points should fall along a line with slope equal to −1/2, because each isoform with coefficient β in all three genotypes is plotted at x = 2β and y = β − 2β = −β. On the other hand, in the limit of complete inhibition of a by b, the plots should show a line of best fit with slope equal to −1 (ref. 1). Genes that interact synthetically (i.e., through an OR-gate) will fall along lines with slopes > 0. When there is epistasis of one gene over another, the points will fall along a line of best fit with slope $s_{ab=b}$ or $s_{ab=a}$. This slope must be determined from the single-mutant data. From this information, we can use the single mutant data to predict the distribution of slopes that results for each case stated above, as well as for each epistatic combination (a−b− = a− or a−b− = b−). The transcriptome-wide epistasis coefficient ($s_{a,b}$) emerges as a powerful way to quantify epistasis because it integrates information from many different genes or isoforms into a single number (see Fig. 6).
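The special slopes above can be checked with a short worked derivation. The per-isoform symbols $\beta_a$, $\beta_b$ and $\beta_{ab}$ (coefficients in the two single mutants and the double mutant) are notation introduced here for illustration, and the suppression case assumes that complete suppression returns the double mutant to wild-type levels ($\beta_{ab} = 0$).

```latex
\begin{align*}
  x &= \beta_a + \beta_b, \qquad y = \beta_{ab} - (\beta_a + \beta_b) \\[4pt]
  \text{Additive:}\quad
    & \beta_{ab} = \beta_a + \beta_b
      \;\Rightarrow\; y = 0 \\
  \text{Unbranched pathway:}\quad
    & \beta_a = \beta_b = \beta_{ab} = \beta
      \;\Rightarrow\; x = 2\beta,\; y = -\beta
      \;\Rightarrow\; y = -\tfrac{1}{2}\,x \\
  \text{Complete suppression (assumed } \beta_{ab} = 0\text{):}\quad
    & y = -(\beta_a + \beta_b) = -x \\
  \text{Epistasis, } a^-b^- = a^-\text{:}\quad
    & \beta_{ab} = \beta_a
      \;\Rightarrow\; y = -\beta_b,\; x = \beta_a + \beta_b
\end{align*}
```

In the last case the slope depends on how $\beta_a$ and $\beta_b$ covary across isoforms, which is why it has to be estimated from the single-mutant data rather than stated as a fixed number.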
In our experiment, we studied two double mutants, egl-9(lf) hif-1(lf) and egl-9(lf);vhl-1(lf). We wanted to understand how well an epistasis analysis based on transcriptome-wide coefficients agreed with the epistasis results reported in the literature, which were based on qPCR of single genes. Therefore, we performed orthogonal distance regression on the two gene combinations we studied (egl-9 and vhl-1; and egl-9 and hif-1) to determine the epistasis coefficient for each gene pair. We also generated models for the special cases mentioned above (additivity, a−b− = a−, strong suppression, etc…) using the single mutant data. For every simulation, as well as for the observed data, we used bootstraps to generate probability distributions of the epistasis coefficients.
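A minimal sketch of the slope estimate and its bootstrap distribution follows. It uses the closed-form total least squares (orthogonal distance) solution for a line through the origin, which is an assumption made here to match a model of the form y = α · x; the function names and the fixed random seed are hypothetical choices for illustration.

```python
import math
import random


def epistasis_points(beta_a, beta_b, beta_ab):
    """x = additive expectation; y = deviation of the double mutant from it."""
    xs = [a + b for a, b in zip(beta_a, beta_b)]
    ys = [ab - x for ab, x in zip(beta_ab, xs)]
    return xs, ys


def odr_slope(xs, ys):
    """Slope of the orthogonal-distance line through the origin."""
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    if sxy == 0:
        # No covariance: the best line is horizontal or vertical.
        return 0.0 if sxx >= syy else math.inf
    return (syy - sxx + math.sqrt((syy - sxx) ** 2 + 4 * sxy ** 2)) / (2 * sxy)


def bootstrap_slopes(xs, ys, n_boot=1000, seed=0):
    """Resample isoforms with replacement and refit the slope each time."""
    rng = random.Random(seed)
    n = len(xs)
    slopes = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        slopes.append(odr_slope([xs[i] for i in idx], [ys[i] for i in idx]))
    return slopes
```

Predictions for a special case can be produced the same way by replacing the observed double-mutant coefficients with the values that case implies, for example the coefficients of a single mutant for the a−b− = a− model, and bootstrapping the resulting points.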
We compared the predictions for the transcriptome-wide epistasis coefficient, $s_{\text{egl-9},\text{vhl-1}}$, under different assumptions with the observed slope (−0.42). We observed that the observed slope matched the slope simulated for the case where egl-9 is epistatic over vhl-1 (egl-9(lf) = egl-9(lf);vhl-1(lf), see Fig. 6) and did not overlap with any other prediction. Next, we predicted the distribution of $s_{\text{egl-9},\text{hif-1}}$ for different pathways and contrasted it with the observed slope. In this case, we saw that the uncertainty in the observed coefficient overlapped significantly with the strong suppression model, where EGL-9 strongly suppresses HIF-1, and also with the model where hif-1(lf) = egl-9(lf) hif-1(lf). In this case, both models are reasonable: HIF-1 is strongly suppressed by EGL-9, and we know from previous literature that the epistatic relationship, hif-1(lf) = egl-9(lf) hif-1(lf), is true for these mutants. In fact, as the repression of HIF-1 by EGL-9 becomes stronger, the epistatic model should converge on the limit of strong repression (see Epistasis).
Another way to test which model best explains the epistatic relationship between egl-9 and vhl-1 is to use Bayesian model selection to calculate an odds ratio between two models to explain the observed data. Models can be placed into two categories: parameter-free and fit. Parameter-free models are ‘simpler’ because their parameter space is smaller (0 parameters) than that of the fit models (n parameters). By Occam’s razor, simpler models should be preferred to more complicated models. However, simple models suffer from the drawback that systematic deviations from them cannot be explained or accommodated, whereas more complicated models can alter the fit values to maximize their explanatory power. In this sense, more complicated models should be preferred when the data shows systematic deviations from the simple model. Odds-ratio selection gives us a way to quantify the trade-off between simplicity and explanatory power.
We reasoned that comparing a fit model (y = α · x, where α is the slope of best fit) against a parameter-free model (y = γ · x, where γ is a single number) constituted a conservative approach towards selecting which theoretical model (if any) best explained the data. In particular, this approach will tend to strongly favor the line of best fit over a simpler model for all but very small, non-systematic deviations. We decided that we would reject the theoretical models only if the line of best fit was $10^{3}$ times more likely than the theoretical models (odds ratio, OR > $10^{3}$). Comparing the odds ratio between the line of best fit and the different pathway models for egl-9 and vhl-1 showed similar results to the simulation. Only the theoretical model egl-9(lf) = egl-9(lf);vhl-1(lf) could not be rejected (OR = 0.46), whereas all other models were significantly less likely than the line of best fit (OR > $10^{44}$). Therefore, egl-9 is epistatic to vhl-1. Moreover, since $s_{\text{egl-9},\text{vhl-1}}$ lies strictly between −0.5 and 0, we conclude that egl-9 acts on its transcriptomic phenotype in vhl-1-dependent and independent manners. A branched pathway that can lead to epistasis coefficients in this range is a pathway where egl-9 interacts with its transcriptomic phenotype via branches that have the same valence (both positive or both negative)26. When we performed a similar analysis to establish the epistatic relationship between egl-9 and hif-1, we observed that the best alternative to a free-fit model was a model where hif-1 is epistatic over egl-9 (OR = 2551), but the free-fit model was still preferred. All other models were strongly rejected (OR > $10^{25}$).
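The trade-off between fit and simplicity can be made concrete with a toy calculation. This sketch is not the procedure used in the analysis: it assumes Gaussian noise of known standard deviation `sigma` on y only, a flat prior on α over an interval of width `prior_width` that comfortably contains the best-fit slope, and a line through the origin. Under those assumptions the marginal likelihood of the free-fit model has a closed form, and the log odds split into a misfit term and an Occam penalty. All names are hypothetical.

```python
import math


def log_odds_free_vs_fixed(xs, ys, gamma, sigma, prior_width):
    """Natural-log odds of y = alpha * x (alpha free) over y = gamma * x.

    Assumptions (illustrative only): Gaussian noise with known sigma on y,
    flat prior on alpha of the given width that covers the likelihood peak.
    """
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    alpha_hat = sxy / sxx  # least-squares slope through the origin
    # How much worse the fixed slope fits than the best-fit slope.
    misfit = sxx * (gamma - alpha_hat) ** 2 / (2 * sigma ** 2)
    # Penalty for the free parameter: posterior width relative to prior width.
    occam = 0.5 * math.log(2 * math.pi * sigma ** 2 / sxx) - math.log(prior_width)
    return misfit + occam
```

When the fixed slope equals the best-fit slope the misfit term vanishes and the Occam penalty makes the log odds negative, so the parameter-free model is preferred; a systematic deviation makes the misfit term grow with the number and spread of the points until the free-fit model wins.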
Epistasis can be predicted
Given our success in measuring epistasis coefficients, we wanted to know whether we could predict the epistasis coefficient between egl-9 and vhl-1 in the absence of the egl-9(lf) genotype. Since RHY-1 indirectly activates EGL-9, the rhy-1(lf) transcriptome should contain more or less equivalent information to the egl-9(lf) transcriptome. Therefore, we generated predictions of the epistasis coefficient between egl-9 and vhl-1 by substituting in the rhy-1(lf) data. We predicted $s_{\text{rhy-1},\text{vhl-1}}$ = −0.45. Similarly, we used the egl-9(lf);vhl-1(lf) double mutant to measure the epistasis coefficient while replacing the egl-9(lf) dataset with the rhy-1(lf) dataset. We found that the epistasis coefficient using this substitution was −0.40. This coefficient was different from −0.50 (OR > $10^{62}$), reflecting the same qualitative conclusion that the hypoxia pathway is branched. In conclusion, we were able to obtain a quantitatively close prediction of the epistasis coefficient for two mutants using the transcriptome of a related, upstream mutant. This shows that, in the absence of a single mutant, an upstream locus can under some circumstances be used to estimate epistasis between two genes.
Transcriptomic decorrelation can be used to infer functional distance
So far, we have shown that RNA-seq can accurately measure genetic interactions. However, genetic interactions are far removed from biochemical interactions: Genetic interactions do not require two gene products to interact physically, nor even to be physically close to each other. RNA-seq cannot measure physical interactions between genes, but we wondered whether expression profiling contains sufficient information to order genes along a pathway.
Single genes are often regulated by multiple independent sources. The connection between two nodes can in theory be characterized by the strength of the edges connecting them (the thickness of the edge); the sources that regulate both nodes (the fraction of inputs common to both nodes); and the genes that are regulated by both nodes (the fraction of outputs that are common to both nodes). In other words, we expected that expression profiles associated with a pathway would respond quantitatively to quantitative changes in activity of the pathway. Targeting a pathway at multiple points would lead to expression profile divergence as we compare nodes that are separated by more degrees of freedom, reflecting the flux in information between them.
We investigated the possibility that transcriptomic signals do in fact contain relevant information about the degrees of separation by weighting the robust Bayesian regression between each pair of genotypes by the size of the shared transcriptomic phenotype of each pair divided by the total number of isoforms differentially expressed in either mutant ($N_{\mathrm{Intersection}}/N_{\mathrm{Union}}$). We plotted the weighted correlation of each gene pair, ordered by increasing functional distance (see Fig. 7). In every case, we see that the weighted correlation decreases monotonically due mainly, but not exclusively, to a smaller STP. We believe that this result is not due to random noise or insufficiently deep sequencing. Instead, we propose a framework in which every gene is regulated by multiple different molecular species, which induces progressive decorrelation. This decorrelation in turn has two consequences. First, decorrelation within a pathway implies that two nodes may be almost independent of each other if the functional distance between them is large. Second, it may be possible to use decorrelation dynamics to infer gene order in a branching pathway, as we have done with the hypoxia pathway.
The circuit topology of the hypoxia pathway explains patterns in the data
We noticed that while some of the rank plots contained a clear positive correlation (see Fig. 3), other rank plots showed a discernible cross-pattern (see Fig. 8). In particular, this cross-pattern emerged between vhl-1(lf) and rhy-1(lf) or between vhl-1(lf) and egl-9(lf), even though genetically vhl-1, rhy-1 and egl-9 are all inhibitors of hif-1. Such cross-patterns could be indicative of feedback loops or other complex interaction patterns.
If the above is correct, then it should be possible to identify egl-9-independent, rhy-1-dependent target genes in a logically consistent way. One erroneous way to identify these targets is via subtractive logic. Using subtractive logic, we would identify genes that are differentially expressed in rhy-1(lf) mutants but not in egl-9(lf) mutants. Such a gene set would consist of almost 700 genes. One major drawback of subtractive logic is that it cannot be applied when feedback loops exist between the genes in question. Another problem is that the identified genes are statistically indistinguishable from false-positive and false-negative hits because they have no distinguishing property beyond the condition that they should be differentially expressed in one mutant but not the other. In fact, this is exactly the behavior expected of false-positive or false-negative hits: presence in one, but not multiple, mutants. We need to consider the relationship between two genes before we can begin to identify targets whose expression is dependent on one gene and independent of the other.
rhy-1 and egl-9 share a well-defined relationship. RHY-1 inhibits CYSL-1, which in turn inhibits EGL-938. Therefore, loss of RHY-1 leads to inactivation of EGL-9, which leads to increase in the cellular levels of HIF-1. HIF-1 in turn causes the mRNA levels of rhy-1 and egl-9 to increase, as they are involved in the hif-1-dependent hypoxia response. However, since rhy-1 has been mutated, the observed transcriptome is RHY-1 ‘null’; EGL-9 ‘null’; HIF-1 ‘on’. The situation is similar for egl-9(lf), except that RHY-1 is not inactive, and therefore the observed transcriptome is the result of RHY-1 ‘up’; EGL-9 ‘null’; and HIF-1 ‘on’.
From this pattern, we conclude that the egl-9(lf) and rhy-1(lf) transcriptomes should exhibit a cross-pattern when plotted against each other: The positive arm of the cross is the result of the EGL-9 ‘null’; HIF-1 ‘on’ dynamics; and the negative arm reflects the different direction of RHY-1 activity between transcriptomes. No negative arm is visible (with the exception of two outliers, which are annotated as pseudogenes in WormBase). Therefore, in this dataset we do not find genes that have egl-9 independent, rhy-1-dependent expression patterns.
We also identified a main hypoxia response induced by disinhibiting hif-1 (355 genes) by identifying genes that were commonly up-regulated amongst egl-9(lf), rhy-1(lf) and vhl-1(lf) mutants. Although the hypoxic response is likely to involve between three and seven times more genes (assuming the rhy-1(lf) transcriptome reflects the maximal hypoxic response), this is a conservative estimate that minimizes false positive results, since these changes were identified in four different genotypes with three replicates each. This response included five transcription factors (W02D7.6, nhr-57, ztf-18, nhr-135 and dmd-9). The full list of genes associated with the hypoxia response can be found in the Supplementary Table 1.
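The selection of a core response can be sketched as an intersection of up-regulated gene sets. The dictionary layout (gene identifier to significant β per genotype) and the function name are hypothetical; the sketch shows only the set logic, not the thresholds used to call differential expression.

```python
def commonly_upregulated(*genotypes):
    """Genes with a positive, significant beta in every genotype supplied.

    genotypes: one dict per genotype mapping gene id -> beta, restricted to
    genes that passed the significance threshold in that genotype.
    """
    up_sets = [{gene for gene, beta in de.items() if beta > 0}
               for de in genotypes]
    return set.intersection(*up_sets)
```

Requiring agreement in sign across several genotypes is what makes this estimate conservative: a gene called by chance in one genotype is unlikely to be called in the same direction in all of them.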
hif-1-independent effects of egl-9 have been reported previously40, which led us to question whether we could identify similar effects in our dataset. We have observed that hif-1(lf) displays a modest increase in the transcription of rhy-1, from which we speculated that EGL-9 would have increased activity in the hif-1(lf) mutant compared to the wild-type. Therefore, we searched for genes that were regulated in an opposite manner between hif-1(lf) and egl-9(lf) hif-1(lf), and that were regulated in the same direction between all egl-9(lf) genotypes. We did not find any genes that met these conditions.
We also searched for genes with hif-1-independent, vhl-1-dependent gene expression and found 45 genes, which can be found in the Supplementary Table 2. Finally, we searched for candidates directly regulated by hif-1. Initially, we searched for genes that were significantly altered in hif-1(lf) genotypes in one direction, but altered in the opposite direction in mutants that activate the HIF-1 response. Only two genes (R08E5.3 and nit-1) met these conditions. This could reflect the fact that HIF-1 exists at very low levels in C. elegans, so loss-of-function mutations in hif-1 might only have mild effects on its transcriptional targets. We reasoned that genes that are overexpressed in mutants that induce the HIF-1 response would be enriched for genes that are direct candidates. We found 195 genes which have consistently increased expression in mutants with a constitutive hypoxic response. These genes can be found in the Supplementary Table 3.
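The first search described above (genes altered in one direction in hif-1(lf) and in the opposite direction in genotypes that activate the HIF-1 response) amounts to a sign filter on the regression coefficients. The following Python sketch shows one way such a filter could be written. The coefficients and gene labels are hypothetical, and `None` is our own convention here for "not significantly altered".

```python
# Hypothetical regression coefficients (beta); None marks "not significant".
betas = {
    "hif-1(lf)": {"g1": -0.8, "g2": 0.5, "g3": 0.4, "g4": None},
    "egl-9(lf)": {"g1": 1.2, "g2": 0.9, "g3": -0.7, "g4": 1.0},
    "vhl-1(lf)": {"g1": 0.6, "g2": 0.7, "g3": -0.3, "g4": 0.8},
}
# Genotypes assumed to activate the HIF-1 response in this toy example.
activators = ["egl-9(lf)", "vhl-1(lf)"]


def sign(value):
    """Return -1, 0 or 1 according to the sign of value."""
    return (value > 0) - (value < 0)


def opposite_to_hif1(gene):
    """True if the gene changes in hif-1(lf) and changes in the opposite
    direction in every activating genotype."""
    b_hif = betas["hif-1(lf)"].get(gene)
    if b_hif is None or b_hif == 0:
        return False
    for genotype in activators:
        b = betas[genotype].get(gene)
        if b is None or sign(b) != -sign(b_hif):
            return False
    return True


candidates = [gene for gene in betas["hif-1(lf)"] if opposite_to_hif1(gene)]
print(candidates)
```

Requiring significance and a consistent sign in every genotype is what makes this filter conservative, which is consistent with only a handful of genes passing it.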
Enrichment analysis of the hypoxia response
To validate that our transcriptomes were correct, and to understand how functionalities may vary between them, we subjected each decoupled response to enrichment analysis using the WormBase Enrichment Suite 42,43.
We used gene ontology enrichment analysis (GEA) on the main hypoxia response program. This showed that the terms ‘oxoacid metabolic process’ (q < 10⁻⁴, 3.0 fold-change, 24 genes), ‘iron ion binding’ (q < 10⁻², 3.8 fold-change, 10 genes), and ‘immune system process’ (q < 10⁻³, 2.9 fold-change, 20 genes) were significantly enriched. GEA also showed enrichment of the term ‘mitochondrion’ (q < 10⁻³, 2.5 fold-change, 29 genes) (see Fig. 9). Indeed, hif-1(lf) has been implicated in all of these biological and molecular functions44,45,46,47. As a benchmark on the quality of our data, we selected a set of 22 genes known to be responsive to HIF-1 levels from the literature and asked whether these genes were present in our hypoxia response list. We found 8/22 known genes, which constitutes a statistically significant result (p < 10⁻¹⁰). The small number of reporters found in this list probably reflects the conservative nature of our estimates. We studied the hif-1-independent, vhl-1-dependent gene set using enrichment analysis but no terms were significantly enriched.
Identification of non-classical epistatic interactions
hif-1(lf) has traditionally been viewed as existing in a genetic OFF state under normoxic conditions. However, our dataset indicates that 546 genes show altered expression when hif-1 function is removed in normoxic conditions. Moreover, we observed positive correlations between hif-1(lf) β coefficients and egl-9(lf), vhl-1(lf) and rhy-1(lf) β coefficients in spite of the negative regulatory relationships between these genes and hif-1. Such positive correlations could indicate a different relationship between these genes than has previously been reported, so we attempted to substantiate them through epistasis analyses.
To perform epistasis analyses, we first identified genes that exhibited violations of the canonical genetic model of the hypoxia pathway. To this end, we searched for genes that exhibited different behaviors between egl-9(lf) and vhl-1(lf), or between rhy-1(lf) and vhl-1(lf) (we assume that all results from the rhy-1(lf) transcriptome reflect a complete loss of egl-9 activity). We found 31 genes that satisfied this condition (see Fig. 10, Supplemental Table 4). Additionally, many of these genes exhibited a new kind of epistasis. Namely, egl-9 was epistatic over vhl-1. Identification of a set of genes that have a consistent set of relationships between themselves suggests that we have identified a new aspect of the hypoxia pathway.
To illustrate this, we focused on three genes, nlp-31, ftn-1 and ftn-2, which had epistasis patterns that we felt reflected the population well. ftn-1 and ftn-2 are both described in the literature as genes that are responsive to mutations in the hypoxia pathway. Moreover, these genes have been previously described to have aberrant behaviors45,46, specifically the opposite effects of egl-9(lf) and vhl-1(lf). These studies showed that loss of vhl-1 decreases expression of ftn-1 and ftn-2 using both RNAi and alleles, which allays concerns of strain-specific interference. Moreover, Ackerman and Gems (2012) showed that vhl-1 is epistatic to hif-1 for the ftn-1 expression phenotype, and that loss of HIF-1 is associated with increased expression of ftn-1 and ftn-2. We observed that hif-1 was epistatic to egl-9, and that egl-9 and hif-1 both promoted ftn-1 and ftn-2 expression.
Epistasis analysis of ftn-1 and ftn-2 expression reveals that egl-9 is epistatic to hif-1; that vhl-1 has opposite effects to egl-9, and that vhl-1 is epistatic to egl-9. Analysis of nlp-31 reveals similar relationships. nlp-31 expression is decreased in hif-1(lf), and increased in egl-9(lf). However, egl-9 is epistatic to hif-1. Like ftn-1 and ftn-2, vhl-1 has the opposite effect to egl-9, yet is epistatic to egl-9. We propose in the Discussion a model for how HIF-1 might regulate these targets.
HIF-1 in the cellular context
We identified the transcriptional changes associated with bioenergetic pathways in C. elegans by extracting from WormBase all genes associated with the tricarboxylic acid (TCA) cycle, the electron transport chain (ETC) and with the C. elegans GO term energy reserve. Previous research has described the effects of mitochondrial dysfunction in eliciting the hypoxia response48, but transcriptional feedback from HIF-1 into bioenergetic pathways has not been studied as extensively in C. elegans as in vertebrates (see, for example,32,28). We also searched for the changes in ribosomal components and the proteasome, as well as for terms relating to immune response (see Fig. 11).
Bioenergetic pathways
Our data show that most of the enzymes involved in the TCA cycle and in the ETC are down-regulated when HIF-1 is induced, in agreement with the previous literature28. However, the fumarase gene fum-1 and the mitochondrial complex II stood out as notable exceptions to the trend, as they were up-regulated in every single genotype that causes deployment of the hypoxia response. FUM-1 catalyzes the reaction of fumarate into malate, and complex II catalyzes the reaction of succinate into fumarate. Complex II has previously been identified as a source of reserve respiratory capacity in neonatal rat cardiomyocytes49. We found two energy reserve genes that were down-regulated by HIF-1: aagr-1 and aagr-2, which are predicted to function in glycogen catabolism50. Three distinct genes involved in energy reserve were up-regulated. These genes were ogt-1, which encodes O-linked GlcNAc transferase; T04A8.7, encoding an ortholog of human glucosidase, acid beta (GBA); and T22F3.3, encoding an ortholog of human glycogen phosphorylase isozyme in the muscle (PYGM).
Protein synthesis and degradation
HIF-1 is also known to inhibit protein synthesis and translation in varied ways51. Most reported effects of HIF-1 on the translation machinery are posttranslational, and no reports to date show transcriptional control of the ribosomal machinery in C. elegans by HIF-1. We used the WormBase Enrichment Suite Gene Ontology dictionary43 to extract 143 protein-coding genes annotated as ‘structural constituents of the ribosome’ and we queried whether they were differentially expressed in our mutants. egl-9(lf), vhl-1(lf), rhy-1(lf) and egl-9(lf);vhl-1(lf) showed differential expression of 91 distinct ribosomal constituents (not all constituents were detected in all genotypes). For every one of these genotypes, these genes were always down-regulated. In contrast, hif-1(lf) showed up-regulation of a single ribosomal constituent.
Next, we asked whether HIF-1 has any transcriptional effects on the proteasomal constituents; no such effects of HIF-1 on the proteasome have been reported in C. elegans. Out of 40 WormBase-annotated proteasomal constituents, we found 31 constituents that were differentially expressed in at least one of the four genotypes that induce a hypoxic response. Every gene we found was down-regulated in at least two out of the four genotypes we studied.
Discussion
The C. elegans hypoxia pathway can be reconstructed entirely from RNA-seq data
In this paper, we have shown that whole-organism transcriptomic phenotypes can be used to reconstruct genetic pathways and to discern previously overlooked or uncharacterized genetic interactions. We successfully reconstructed the hypoxia pathway, and inferred order of action (rhy-1 activates egl-9, egl-9 and vhl-1 inhibit hif-1), and we were able to infer from transcriptome-wide epistasis measurements that egl-9 exerts vhl-1-dependent and independent inhibition on hif-1.
HIF-1 and the cellular environment
In addition to reconstructing the pathway, our dataset allowed us to observe a wide variety of physiologic changes that occur as a result of the HIF-1-dependent hypoxia response. In particular, we observed down-regulation of most components of the TCA cycle and the mitochondrial electron transport chain with the exceptions of fum-1 and the mitochondrial complex II. The mitochondrial complex II catalyzes the reaction of succinate into fumarate. In mouse embryonic fibroblasts, fumarate has been shown to antagonize HIF-1 prolyl hydroxylase domain (PHD) enzymes, which are orthologs of EGL-952. If the inhibitory role of fumarate on PHD enzymes is conserved in C. elegans, upregulation of complex II by HIF-1 during hypoxia may increase intracellular levels of fumarate, which in turn could lead to artificially high levels of HIF-1 even after normoxia resumes. The increase in fumarate produced by the complex could be compensated by increasing expression of fum-1. Increased fumarate degradation allows C. elegans to maintain plasticity in the hypoxia pathway, keeping the pathway sensitive to oxygen levels.
Interpretation of the non-classical epistasis in the hypoxia pathway
The observation of almost 30 genes that exhibit a specific pattern of non-classical epistasis suggests the existence of previously undescribed aspects of the hypoxia pathway. Some of these non-classical epistases had been observed previously45,46,44, but no satisfactory mechanism has been proposed to explain this biology. Refs. 45 and 46 suggest that HIF-1 integrates information on iron concentration in the cell to bind to the ftn-1 promoter, but could not definitively establish a mechanism. It is unclear why deletion of hif-1 induces ftn-1 expression, deletion of egl-9 also causes induction of ftn-1 expression, but deletion of vhl-1 removes this inhibition. Moreover, ref. 44 previously reported that certain genes important for the C. elegans immune response against pathogens show similar expression patterns. Their interpretation was that swan-1, which encodes a binding partner to EGL-953, is important for modulating HIF-1 activity in some manner. The lack of a conclusive double mutant analysis in this work means the role of SWAN-1 in modulation of HIF-1 activity remains to be demonstrated. Nevertheless, mechanisms that call for additional transcriptional modulators become less likely given the number of genes with different biological functions that exhibit the same pattern.
One way to resolve this problem without invoking additional genes is to consider HIF-1 as a protein with both activating and inhibiting states. In fact, HIF-1 already exists in two states in C. elegans: unmodified HIF-1 and HIF-1-hydroxyl (HIF-1-OH). Under this model, HIF-1-hydroxyl antagonizes the effects of HIF-1 for certain genes like ftn-1 or nlp-31. Loss of vhl-1 stabilizes HIF-1-hydroxyl. A subset of genes that are sensitive to HIF-1-hydroxyl will be inhibited as a result of the increase in the amount of this species, in spite of loss of vhl-1 function also increasing the level of non-hydroxylated HIF-1. On the other hand, egl-9(lf) selectively removes all HIF-1-hydroxyl, stimulating accumulation of HIF-1 and promoting gene activity. Whether deletion of hif-1 is overall activating or inhibiting will depend on the relative activity of each protein state under normoxia (see Fig. 12).
Multiple lines of circumstantial evidence suggest that HIF-1-hydroxyl plays a role in the functionality of the hypoxia pathway. First, HIF-1-hydroxyl is challenging to study genetically because no mimetic mutations are available with which to study the pure hydroxylated HIF-1 species. Second, although mutations in the von Hippel-Lindau gene stabilize the hydroxyl species, they also increase the quantity of non-hydroxylated HIF-1 by mass action. Finally, since HIF-1 is detected at low levels in cells under normoxic conditions54, total HIF-1 protein (unmodified HIF-1 plus HIF-1-hydroxyl) is often tacitly assumed to be vanishingly rare and therefore biologically inactive.
Our data show hundreds of genes that change expression in response to loss of hif-1 under normoxic conditions. This establishes that there is sufficient total HIF-1 protein to be biologically active. Our analyses also revealed that hif-1(lf) shares positive correlations with egl-9(lf), rhy-1(lf) and vhl-1(lf), and that each of these genotypes also shows a secondary negative rank-ordered expression correlation with each other. These cross-patterns between all loss of function of inhibitors of HIF-1 and hif-1(lf) can be most easily explained if HIF-1-hydroxyl is biologically active.
A homeostatic argument can be made in favor of the activity of HIF-1-hydroxyl. At any point in time, the cell must measure the levels of multiple metabolites at once. The hif-1-dependent hypoxia response integrates information from O2, α-ketoglutarate (2-oxoglutarate) and iron concentrations in the cell. One way to integrate this information is by encoding it only in the effective hydroxylation rate of HIF-1 by EGL-9. Then the dynamics in this system will evolve exclusively as a result of the total amount of HIF-1 in the cell. Such a system can be sensitive to fluctuations in the absolute concentration of HIF-155. Since the absolute levels of HIF-1 are low in normoxic conditions, small fluctuations in protein copy number can represent a large fold-change in HIF-1 levels. These fluctuations would not be problematic for genes that must be turned on only under conditions of severe hypoxia; presumably, these genes would be associated with low-affinity sites for HIF-1, so that they are only activated when HIF-1 levels are far above random fluctuations.
For yet other sets of genes that must change expression in response to the hypoxia pathway, it may not make as much sense to integrate metabolite information exclusively via EGL-9-dependent hydroxylation of HIF-1. In particular, genes that may function to increase survival in mild hypoxia may benefit from regulatory mechanisms that can sense minor changes in environmental conditions and which therefore benefit from robustness to transient changes in protein copy number. Likewise, genes that are involved in iron or α-ketoglutarate metabolism (such as ftn-1) may benefit from being able to sense, accurately, small and consistent deviations from basal concentrations of these metabolites. For these genes, the information may be better encoded by using HIF-1 and HIF-1-hydroxyl as an activator/repressor pair. Such circuits are known to possess distinct advantages for controlling output in a manner that is robust to transient fluctuations in the levels of their components56,57.
Our RNA-seq data suggest that one of these atypical targets of HIF-1 may be RHY-1. Although rhy-1 does not exhibit non-classical epistasis, hif-1(lf) and egl-9(lf) hif-1(lf) both had increased expression levels of rhy-1. We speculate that if rhy-1 is controlled by both HIF-1 and HIF-1-hydroxyl, then this might imply that HIF-1 regulates the expression of its pathway (and therefore itself) in a manner that is robust to total HIF-1 levels.
Insights into genetic interactions from vectorial phenotypes
Here, we have described a set of straightforward methods that can in theory be applied to any vectorial phenotype. Genome-wide methods afford a lot of information, but genome-wide interpretation of the results is often extremely challenging. Each method has its own advantages and disadvantages. We briefly discuss these methods, their uses and their drawbacks.
Principal component analysis is computationally tractable and clusters can often be visually detected with ease. However, PCA can be misleading, especially when the dimensions represented do not explain a very large fraction of the variance present in the data. In addition, principal dimensions are the product of a linear combination of vectors, and therefore must be interpreted with extreme care. In this case, the first principal dimension separated genotypes that increase HIF-1 protein levels from those that decrease it, but this dimension is a mix of vectors of change in gene expression. Although PCA showed that there is information hidden in these genotypes, it was not enough by itself to provide biological insight.
Whereas PCA operates on all genotypes simultaneously, correlation analysis is a pairwise procedure that measures how predictable the gene expression changes are in a mutant given the vector of expression changes in another. Like PCA, correlation analysis is easy and fast to perform. Unlike PCA, the product of a correlation analysis is a single number with a straightforward interpretation. However, correlation analysis is particularly sensitive to outliers. Although a common strategy is to rank-transform expression data to mitigate outliers, rank-transformations do not remove the cross-patterns that appear when feedback loops or other complex interactions are present between two genes. Such cross-patterns can still lead to vanishing correlations if both patterns are equally strong. Therefore, correlation analyses must take into account the possible existence of systematic outliers. Moreover, correlation values must be measured for both interactions in cross-patterned rank plots. Weighted correlations could be informative for ordering genes along pathways. A drawback of correlation analysis is that the number of pairwise comparisons that must be made increases combinatorially, though strategies could be used to decrease the total number of effective comparisons.
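The statement that cross-patterns can produce vanishing correlations even after rank transformation can be checked with a small, self-contained Python sketch. The data below are synthetic and purely illustrative: one arm of the cross has y = x and the other has y = −x, with both arms equally populated. The Spearman correlation is computed from first principles (Pearson correlation of average ranks) rather than with the libraries used in our analysis.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Synthetic cross-pattern: a positive arm (y = x) and a negative arm (y = -x).
xs = [-3.0, -2.0, -1.0, 1.0, 2.0, 3.0]
rho_positive_arm = spearman(xs, xs)
rho_negative_arm = spearman(xs, [-x for x in xs])
rho_combined = spearman(xs + xs, xs + [-x for x in xs])

print(rho_positive_arm, rho_negative_arm, rho_combined)
```

Each arm on its own is perfectly rank-correlated, yet the pooled correlation cancels. This is why the correlation of each arm needs to be measured separately in cross-patterned rank plots.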
Epistasis plots are a novel way to visualize epistasis in vectorial phenotypes. Here, we have shown how an epistasis plot can be used to identify interactions between two single mutants and a double mutant. In reality, epistasis plots can be generated for any set of measurements involving a set of N mutants and an N-mutant genotype. Epistasis plots can accumulate an arbitrary number of points within them, possess a rich structure that can be visualized and have straightforward interpretations for special slope values.
Another way to analyze epistasis is via general linear models (GLMs) that include interaction terms between two or more genes. In this way, GLMs can quantify the epistatic effect of an interaction on single genes. We and others22,23 have previously used GLMs to identify gene sets that are epistatically regulated by two or more inputs. While powerful, GLMs suffer from the multiple comparison problem. Correcting for false positives using well-known multiple comparison corrections such as FDR58 tends to increase false negative rates. Moreover, since GLMs attempt to estimate effect magnitudes for individual gene or isoform expression levels, they effectively treat each gene as an independent quantity, which prevents better estimation of the magnitude and direction of the epistasis between two genes.
Epistasis plots do not suffer from the multiple comparison problem because the number of tests performed is orders of magnitude smaller than the number of tests performed by GLMs. Ideally, in an epistasis plot we need only perform three tests (rejection of additive, unbranched and suppressive null models), compared with the tens of thousands of tests that are performed in GLMs. Moreover, the magnitude of epistasis between two genes can be estimated using hundreds of genes, which greatly improves the statistical resolution of the epistatic coefficient. This increased resolution is important because the size and magnitude of the epistasis has specific consequences for the type of pathway that is expected.
Any quantitative use of genome-wide datasets requires a good experimental setup. Here, we have demonstrated that whole-organism RNA-seq can be used to dissect molecular pathways in exquisite detail when paired with experimental designs that are motivated by classical genetics. Much more research will be necessary to understand whether epistasis has different consequences in the microscopic realm of transcriptional phenotypes than in the macroscopic world that geneticists have explored previously. Our hope is that these tools, coupled with the classic genetics experimental designs, will reveal hitherto unknown aspects of genetics theory.
Methods
Nematode strains and culture
Strains used were N2 wild-type Bristol, CB5602 vhl-1(ok161), CB6088 egl-9(sa307) hif-1(ia4), CB6116 egl-9(sa307);vhl-1(ok161), JT307 egl-9(sa307), ZG31 hif-1(ia4) and RB1297 rhy-1(ok1402). All lines were grown on standard nematode growth media (NGM) plates seeded with OP50 E. coli at 20°C (Brenner 1974).
RNA Isolation
Unsynchronized lines were grown on NGM plates at 20°C and eggs were harvested by sodium hypochlorite treatment. Eggs were plated on six to nine 6 cm NGM plates with ample OP50 E. coli to avoid starvation and grown at 20°C. Worms were staged and harvested based on the time after plating, vulva morphology and the absence of eggs. Approximately 30–50 non-gravid young adults were picked and placed in 100 μL of TE pH 8.0 at 4°C in 0.2 mL PCR tubes. After settling and a brief spin in a microcentrifuge, approximately 80 μL of TE (Ambion AM 9849) was removed from the top of the sample and individual replicates were snap frozen in liquid N2. These replicate samples were then digested with Proteinase K (Roche Lot No. 03115 838001 Recombinant Proteinase K PCR Grade) for 15 min at 60°C in the presence of 1% SDS and 1.25 μL RNA Secure (Ambion AM 7005). RNA samples were then taken up in 5 volumes of Trizol (Tri Reagent Zymo Research) and processed and treated with DNase I using Zymo MicroPrep RNA Kit (Zymo Research Quick-RNA MicroPrep R1050). RNA was eluted in RNase-free water, divided into aliquots and stored at −80°C. One aliquot of each replicate was analyzed using a NanoDrop (Thermo Fisher) for impurities, Qubit for concentration and then analyzed on an Agilent 2100 BioAnalyzer (Agilent Technologies). Replicates were selected that had RNA integrity numbers (RIN) equal to or greater than 9.0 and showed no evidence of bacterial ribosomal bands, except for the ZG31 mutant where one of three replicates had a RIN of 8.3.
Library Preparation and Sequencing
10ng of quality checked total RNA from each sample was reverse-transcribed into cDNA using the Clontech SMARTer Ultra Low Input RNA for Sequencing v3 kit (catalog #634848) in the SMARTSeq2 protocol 59. RNA was denatured at 70°C for 3 minutes in the presence of dNTPs, oligo dT primer and spiked-in quantitation standards (NIST/ERCC from Ambion, catalog #4456740). After chilling to 4°C, the first-strand reaction was assembled using the LNA TSO primer described in 59, and run at 42°C for 90 minutes, followed by denaturation at 70°C for 10 minutes. The entire first strand reaction was then used as template for 13 cycles of PCR using the Clontech v3 kit. Reactions were cleaned up with 1.8X volume of Ampure XP SPRI beads (catalog #A63880) according to the manufacturer’s protocol. After quantification using the Qubit High Sensitivity DNA assay, a 3ng aliquot of the amplified cDNA was run on the Agilent HS DNA chip to confirm the length distribution of the amplified fragments. The median value for the average cDNA lengths from all length distributions was 1076bp. Tagmentation of the full length cDNA for sequencing was performed using the Illumina/Nextera DNA library prep kit (catalog #FC-121–1030). Following Qubit quantitation and Agilent BioAnalyzer profiling, the tagmented libraries were sequenced. Libraries were sequenced on Illumina HiSeq2500 in single read mode with the read length of 50nt to an average depth of 15 million reads per sample following manufacturer’s instructions. Base calls were performed with RTA 1.13.48.0 followed by conversion to FASTQ with bcl2fastq 1.8.4. Spearman correlation of the transcripts per million (TPM) for each genotype showed that every pairwise correlation within genotype was > 0.9.
Read Alignment and Differential Expression Analysis
We used Kallisto to perform read pseudo-alignment and performed differential analysis using Sleuth. We fit a general linear model for a transcript t in sample i:

$$y_{t,i} = \beta_{t,0} + \beta_{t,\mathrm{genotype}} \cdot X_{t,i} + \beta_{t,\mathrm{batch}} \cdot Y_{t,i} + \epsilon_{t,i}$$

where $y_{t,i}$ are the logarithm-transformed counts; $\beta_{t,0}$ is the intercept; $\beta_{t,\mathrm{genotype}}$ and $\beta_{t,\mathrm{batch}}$ are parameters of the model, which can be interpreted as biased estimators of the log-fold change; $X_{t,i}$ and $Y_{t,i}$ are indicator variables describing the conditions of the sample; and $\epsilon_{t,i}$ is the noise associated with a particular measurement.
Genetic Analysis, Overview
Genetic analysis of the processed data was performed in Python 3.5. Our scripts made extensive use of the Pandas, Matplotlib, Scipy, Seaborn, Sklearn, Networkx, Bokeh, PyMC3, and TEA libraries60,61,62,63,64,65,66,42,67. Our analysis is available in a Jupyter Notebook68. All code and required data (except the raw reads) are available at https://github.com/WormLabCaltech/mprsq along with version-control information. Our Jupyter Notebook and interactive graphs for this project can be found at https://wormlabcaltech.github.io/mprsq/. Raw reads were deposited in the Short Read Archive under the study accession number SRP100886.
Weighted Correlations
Pairwise correlations between transcriptomes were calculated by first identifying the set of differentially expressed genes (DEGs) common to both transcriptomes under analysis. DEGs were then rank-ordered according to their regression coefficient, β. Bayesian robust regressions were performed using a Student-T distribution. Bayesian analysis was performed using the PyMC3 library64 (pm.glm.families.StudenT in Python). If the correlation had an average value > 1, the correlation coefficient was set to 1.
Weights were calculated as the proportion of genes that were < 1.5 standard deviations away from the primary regression out of the entire set of shared DEGs for each transcriptome.
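As a rough sketch of the weighting step, the following Python function computes the fraction of shared DEGs that lie within 1.5 standard deviations of a given regression line. Several details here are assumptions made for illustration: the spread is taken to be the root-mean-square residual about the line, the slope and intercept of the primary regression are supplied by hand rather than fitted with PyMC3, and the rank coordinates are invented.

```python
import math


def inlier_weight(x, y, slope, intercept, n_sd=1.5):
    """Fraction of points whose residual from the line y = slope*x + intercept
    is smaller than n_sd times the root-mean-square residual (an assumed
    stand-in for the standard deviation about the regression)."""
    residuals = [yi - (slope * xi + intercept) for xi, yi in zip(x, y)]
    rms = math.sqrt(sum(r ** 2 for r in residuals) / len(residuals))
    if rms == 0:
        return 1.0  # every point lies exactly on the line
    inliers = sum(1 for r in residuals if abs(r) < n_sd * rms)
    return inliers / len(residuals)


# Hypothetical rank coordinates: eight genes follow the primary trend and
# two genes fall far from it.
x_rank = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y_rank = [3, 4, 5, 6, 7, 8, 9, 10, 1, 2]

# Hypothetical primary regression (in practice, the output of a robust fit).
weight = inlier_weight(x_rank, y_rank, slope=1.0, intercept=2.0)
print(weight)
```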
Epistasis Analysis
For a double mutant X−Y−, we used the single mutants X− and Y− to find the expected value of the coefficient for a double mutant under an additive model for each isoform i. Specifically,

$$\beta_{\mathrm{Add},i} = \beta_{X,i} + \beta_{Y,i} \tag{1}$$
Next, we find the difference, Δi, between the observed double mutant expression coefficient, βXY,obs,i, and the predicted expression coefficient under an additive model for each isoform i:

$$\Delta_i = \beta_{XY,\mathrm{obs},i} - \beta_{\mathrm{Add},i} \tag{2}$$
To calculate the transcriptome-wide epistasis coefficient, we plotted (βAdd,i, Δ i) and found the line of best fit using orthogonal distance regression using the scipy.odr package in Python. We performed non-parametric bootstrap sampling of the ordered tuples with replacement using 5,000 iterations to generate a probability distribution of slopes of best fit.
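The procedure above can be sketched in plain Python. This is an illustrative stand-in, not our analysis code: instead of scipy.odr it uses the closed-form, unweighted orthogonal (total least squares) slope, it ignores measurement errors, and the coefficients are hypothetical. The toy data represent an unbranched pathway in which both single mutants and the double mutant share the same coefficients, so that every point satisfies Δ = −β_Add/2.

```python
import math
import random


def tls_slope(x, y):
    """Unweighted orthogonal-regression slope (closed form).
    Returns None when the slope is undefined (vertical or degenerate)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    if sxy == 0:
        return 0.0 if sxx > 0 and sxx >= syy else None
    return (syy - sxx + math.sqrt((syy - sxx) ** 2 + 4 * sxy ** 2)) / (2 * sxy)


def bootstrap_slopes(x, y, n_boot=1000, seed=0):
    """Resample (x, y) pairs with replacement and refit the slope each time."""
    rng = random.Random(seed)
    n = len(x)
    slopes = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        if len(set(idx)) < 2:
            continue  # a single repeated point cannot define a line
        s = tls_slope([x[i] for i in idx], [y[i] for i in idx])
        if s is not None:
            slopes.append(s)
    return slopes


# Hypothetical coefficients: X-, Y- and X-Y- all share the same phenotype.
beta_x = [1.0, -2.0, 0.5, 3.0, -1.5, 2.5]
beta_y = list(beta_x)
beta_xy = list(beta_x)

beta_add = [bx + by for bx, by in zip(beta_x, beta_y)]      # additive prediction
delta = [bxy - ba for bxy, ba in zip(beta_xy, beta_add)]    # deviation from additivity

slope = tls_slope(beta_add, delta)
slopes = bootstrap_slopes(beta_add, delta)
print(slope, min(slopes), max(slopes))
```

Because the toy data are noise-free, every bootstrap replicate recovers essentially the same slope. With real coefficients, the spread of the bootstrap distribution gives the uncertainty on the epistasis coefficient.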
There are as many models as epistatic relationships. For quantitative phenotypes, epistatic relationships (except synthetic interactions) can be generally expressed as:

$$P_{XY} = \sum_{g \in G} \lambda_g P_g \tag{3}$$

where Pi is the quantitative phenotype belonging to the genotype i; G is the set of single mutants {X, Y} that make up the double mutant, XY; and λg is the contribution of the phenotype Pg to PXY. Additive interactions between genes are the result of setting λg = 1. All other relationships correspond to setting λX = 0, λY = 1 or λX = 1, λY = 0.
A given epistatic interaction can be simulated by predicting the double mutant phenotype under that interaction and re-calculating the y-coordinates. The recalculated y-coordinates can then be used to predict the possible epistasis coefficients for the cases where X is epistatic over Y, and Y is epistatic over X.
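The simulation step described above can be sketched as follows. Given single-mutant coefficients, the function predicts the double-mutant coefficient for a chosen pair (λX, λY) and returns the recalculated y-coordinates, that is, the deviation from the additive expectation. The coefficients are hypothetical and the function name is ours.

```python
def predicted_delta(beta_x, beta_y, lam_x, lam_y):
    """Recalculated y-coordinates if the double mutant phenotype were
    lam_x * P_X + lam_y * P_Y, relative to the additive prediction."""
    return [
        (lam_x * bx + lam_y * by) - (bx + by)
        for bx, by in zip(beta_x, beta_y)
    ]


# Hypothetical single-mutant coefficients for three isoforms.
beta_x = [1.0, -2.0, 0.5]
beta_y = [0.5, -1.0, 2.0]

additive = predicted_delta(beta_x, beta_y, 1, 1)     # no deviation expected
x_epistatic = predicted_delta(beta_x, beta_y, 1, 0)  # double mutant looks like X-
y_epistatic = predicted_delta(beta_x, beta_y, 0, 1)  # double mutant looks like Y-

print(additive, x_epistatic, y_epistatic)
```

Under the additive model every recalculated y-coordinate is zero. When X is epistatic over Y the deviation reduces to −βY, and when Y is epistatic over X it reduces to −βX. Fitting a line to each simulated set of points then gives the epistasis coefficient expected under that model.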
To select between theoretical models, we implemented an approximate Bayesian Odds Ratio. We defined a free-fit model, M1, that found the line of best fit for the data:

$$P(\alpha \mid D, M_1) \propto P(\alpha \mid M_1) \prod_i \exp\left[-\frac{(y_i - \alpha x_i)^2}{2\sigma_i^2}\right] \tag{4}$$

where α is the slope of the model to be determined, P(α | M1) is the prior on the slope, xi, yi were the x- and y-coordinates of each point respectively, and σi was the standard error associated with the y-value. We minimized the negative logarithm of equation 4 to obtain the most likely slope given the data, D (scipy.optimize.minimize in Python). Finally, we approximated the odds ratio as:

$$OR = \frac{P(D \mid \alpha^*, M_1)\, P(\alpha^* \mid M_1)\, \sqrt{2\pi}\, \sigma_{\alpha^*}}{P(D \mid M_i)} \tag{5}$$

where α* is the slope found after minimization, σα* is the standard deviation of the parameter at the point α* and P(D | Mi) is the probability of the data given the parameter-free model, Mi.
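A minimal sketch of this kind of model comparison is given below. It should be read as an illustration under stated assumptions rather than as our implementation: the errors are taken to be Gaussian, the line passes through the origin, the prior on the slope is assumed flat over an interval of width `prior_width` (a hypothetical parameter), and the most likely slope is obtained in closed form instead of with scipy.optimize.minimize. The evidence for the free-fit model is approximated by the likelihood at α* times the prior times √(2π)·σα* (a Laplace approximation), and is compared with the likelihood of a parameter-free model with a fixed slope.

```python
import math


def log_likelihood(alpha, x, y, sigma):
    """Gaussian log-likelihood of the points for a line y = alpha * x."""
    return sum(
        -0.5 * ((yi - alpha * xi) / si) ** 2
        - math.log(si * math.sqrt(2 * math.pi))
        for xi, yi, si in zip(x, y, sigma)
    )


def log_odds_ratio(x, y, sigma, alpha_null, prior_width=4.0):
    """Approximate log odds of the free-fit model against a fixed-slope model.
    prior_width is an assumed flat-prior range for the slope."""
    sxx = sum((xi / si) ** 2 for xi, si in zip(x, sigma))
    sxy = sum(xi * yi / si ** 2 for xi, yi, si in zip(x, y, sigma))
    alpha_star = sxy / sxx               # most likely slope (flat prior)
    sigma_alpha = 1.0 / math.sqrt(sxx)   # width of the posterior at alpha_star
    log_prior = -math.log(prior_width)
    log_evidence_free = (
        log_likelihood(alpha_star, x, y, sigma)
        + log_prior
        + 0.5 * math.log(2 * math.pi)
        + math.log(sigma_alpha)
    )
    log_evidence_null = log_likelihood(alpha_null, x, y, sigma)
    return log_evidence_free - log_evidence_null


# Hypothetical noise-free points lying on a line of slope -0.5.
x = [2.0, -4.0, 1.0, 6.0, -3.0, 5.0]
y = [-0.5 * xi for xi in x]
sigma = [1.0] * len(x)

log_or_vs_additive = log_odds_ratio(x, y, sigma, alpha_null=0.0)
log_or_vs_true_slope = log_odds_ratio(x, y, sigma, alpha_null=-0.5)
print(log_or_vs_additive, log_or_vs_true_slope)
```

In this toy example the free-fit model is strongly favored over a slope of zero, whereas it is penalized relative to the fixed-slope model that already matches the data, because the free parameter has to pay for its prior volume.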
Acknowledgments
This work was supported by HHMI, with whom PWS is an investigator, and by the Millard and Muriel Jacobs Genetics and Genomics Laboratory at California Institute of Technology. All strains were provided by the CGC, which is funded by NIH Office of Research Infrastructure Programs (P40 OD010440). This article was written with support of the Howard Hughes Medical Institute. This article would not have been possible without help from Dr. Igor Antoshechkin, who performed all sequencing. We thank Hillel Schwartz for all of his careful advice. We would like to thank Jonathan Liu, Han Wang, and Porfirio Quintero for helpful discussion.
Footnotes
1 Specifically, this follows from assuming that b− is wild-type under the conditions assayed; and a−b− = b− = wild-type