Summary
We develop an analytic statistical framework for examining a variety of gene-set enrichment analysis tests. Within this framework, we describe why statistical power for both self-contained and competitive gene set tests is a function of the correlation structure of co-expressed genes, and why this characteristic is undesirable for gene-set analyses. We additionally describe why past gene-set tests have suffered from inflated type 1 error, and how permutation-based methods have sought to address the issue with some success in the case of self-contained tests and with less success in the case of competitive tests. While the context of this investigation is microarray analysis, with particular focus on leading tests CAMERA, ROAST, SAFE, and GAGE, the observations are also relevant to recently proposed RNAseq gene-set tests, including MAST.
The variable statistical power we describe as a function of gene correlation structure has not been studied. While type 1 error inflation has been well-studied and described previously for both self-contained and competitive tests, it has less often been done in an analytical framework and so it is useful to make assumptions explicit and examine parametric distributions.
We propose three alternative tests, one of which replicates the properties of permutation-based self-contained tests but obviates the need for even recently proposed, rotation-based approximations to burdensome permutations in favor of closed-form densities. The two other tests we propose have the unique property that their statistical power is not a function of co-expression correlation in the gene-set and therefore may be the preferred methodology. We provide simulation support for these proposed methods, compare their results to leading gene-set tests, and apply them to an already-published study of smoking exposure on pregnant women. We call the suite of three proposed tests “JAGST” (Just Another Gene-Set Test) and make the methods accessible via an R package of the same name.
1. Introduction
Gene set enrichment analysis (GSEA) or gene-set testing is a class of methods whose goal is to assess the joint enrichment of a biologically interpretable set of genes in a microarray or RNAseq experiment (Subramanian and others, 2005; Goeman and others, 2004; Kim and Volsky, 2005; Irizarry and others, 2009; Luo and others, 2009). A variety of methods have been proposed nearly since the advent of microarrays, all with unique advantages, and there have been more recent modifications of these methods as RNAseq becomes a more common platform (Finak and others, 2015). There have been two primary categories of gene-set tests, “competitive” and “self-contained” (Wu and Smyth, 2012; Goeman and Bühlmann, 2007; Zhou and others, 2013; Rahmatallah and others, 2012). The former tries to answer whether genes in a set are more differentially expressed (DE) than some background level of DE on the array. The latter tries to answer whether the gene-set is more DE than one would expect under the null of no association between transcript abundance and condition. The null distributions for these two kinds of hypothesis tests have often been calculated via permutation, either at the gene level for the competitive test, or at the sample level for the self-contained test. Manual permutation is used in an attempt to attain the nominal type 1 error rate (Naeem and others, 2012; Barry and others, 2005), or sometimes quicker methods are used that yield permutation-like results (Wu and others, 2010; Zhou and others, 2013). Rejecting the null of a competitive gene-set test is often a higher bar or more difficult burden of proof, and these tests have been more commonly used in gene-set testing. A rejection of a competitive gene-set test can be more biologically meaningful than a rejection of a self-contained test.
The use of some GSEA methods has been far more common than others. The seminal paper of Subramanian and others (2005) proposed a means of testing a set of genes that leverages the Kolmogorov-Smirnov hypothesis test for similar distributions. It is still often used today and even implemented via a web portal. In recent years, GSEA tests have been proposed that claim greater power and control of type 1 error rates (Rahmatallah and others, 2012; Mesirov and others, 2016; Wu and Smyth, 2012; Ritchie and others, 2015). Error rates have been an issue in some GSEA tests due to correlated test statistics comprising the gene-set test (Goeman and Bühlmann, 2007; Wu and Smyth, 2012; Gatti and others, 2010; Ritchie and others, 2015; Mesirov and others, 2016).
In this paper, we describe within an analytic statistical framework why past gene-set tests have suffered from inflated type 1 error, and how permutation-based methods have sought to address the issue in different ways. Goeman and Bühlmann (2007) made a similar and rigorous description of gene-set tests, though in the context primarily of 2×2 tables, and prior to the development of some of the tests we analyze. More importantly, we additionally describe in the same framework why past and more recently proposed gene-set tests suffer from having variable power as a function of gene set correlation structure.
We show that the location of causal transcripts within the co-expression correlation structure, in addition to their effect sizes, is critical in determining statistical power of the test, and that the issue is relevant to both competitive and self-contained gene-set tests. Briefly, correlated blocks of genes manifest themselves in disproportionate degree in the tails of distributions under both the null and alternative hypotheses. Since we generally perform statistical tests in the tails of distributions, hypothesis tests are disproportionately influenced by this correlation. When causal transcripts are found in these correlated blocks, power for their detection is likewise overly represented relative to causal transcripts found in less correlated blocks. This observation has been made in the context of region-based SNP testing, though not in the context of gene-set testing (Swanson and others, 2013). Despite the observation being made in the context of SNP-set testing, it is pertinent to mention since tests continue to be proposed that suffer from variable power under different correlation structures (Bakshi and others, 2016).
We leverage our statistical analysis of gene-set tests by proposing three tests, the suite of which we call “JAGST” (Just Another Gene Set Test). The first test is self-contained and replicates results from a leading self-contained test ROAST, but does so using a closed form mixture density rather than the rotation and iteration-based method of ROAST. Though ROAST is an elegant, general, and computationally cheap algorithm for various set-based hypothesis tests, two of its more important special cases (mixed and directed set-based tests) fall within our statistical framework so that relevant null tail probabilities can be written down and calculated.
The other two tests of JAGST are power-invariant self-contained and competitive tests. The proposed JAGST power-invariant self-contained test simply uses correlation information of DE test statistics to implement a standard chi-square test. The proposed competitive test is more complex and computationally intensive, using a sophisticated and recently proposed high-dimensional penalized regression method for obtaining z-statistics from a variable selection procedure (Zou and others, 2007; Lockhart and others, 2014; Taylor and Tibshirani, 2017; Tibshirani and others, 2016). We use these z-statistics to create an appropriate mixture distribution of non-central chi-squares, which is the null distribution for our competitive test.
We perform simulations with leading and commonly used competitive and self-contained tests, including ROAST, CAMERA, SAFE, and GAGE (Wu and Smyth, 2012; Wu and others, 2010; Barry and others, 2005; Luo and others, 2009) to demonstrate the degree to which statistical power is affected by correlation structure of the gene-set and how our proposed tests do not have this same property. ROAST and SAFE are self-contained tests, while CAMERA and GAGE are competitive tests. We perform a data analysis on an already published study of smoking exposure in pregnant women (Votavova and others, 2011). While expression studies are the context of our analysis, our observations are also relevant to RNAseq. The popular MAST method for RNAseq is based on CAMERA, and many of the same statistical principles will apply in the RNAseq context (Finak and others, 2015). Additionally, expression analyses continue to be a valuable and well-used platform for diagnosis and prediction (Tachibana, 2015; Zhao and others, 2014; Byron and others, 2016; Zhang and others, 2015; Xu and others, 2016).
2. Methods
Consider matrix Y (n × m) of normalized transcript measures and matrix X (n × p) of conditions, where Y has m × m correlation matrix Σ, whose (i, j)th element is ρij. Assume for now that p = 1 for simplicity. It is common in GSEA to first perform a standard microarray analysis, and then test gene-set enrichment on the summary statistics (such as transcript-level test statistics).
If we fit a regression model of the ith column of Y (Yi) on X, i ∈ 1…m, we extract some test statistic, call it Ti (oftentimes a t-statistic for linear regression if the outcome is continuous), associated with this model, and likewise Tj for the model of Yj on X. The test statistic from a linear regression model is
$$T_i = \frac{\hat\beta_i}{\widehat{\mathrm{se}}(\hat\beta_i)} = \frac{X^\top Y_i}{\hat\sigma_i \sqrt{X^\top X}}$$
if we have centered X and Y, where $\hat\sigma_i$ is the residual standard deviation of the model; Ti follows a t-distribution on n-2 df under the null hypothesis.
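Likewise, define
$$T_j = \frac{X^\top Y_j}{\hat\sigma_j \sqrt{X^\top X}}.$$
Since Cor(·, ·) is invariant to scaling its arguments and assuming centered, normalized Yi’s, Yj’s, and X’s, we have, conditional on X,
$$\mathrm{Cor}(T_i, T_j) = \mathrm{Cor}(X^\top Y_i, X^\top Y_j) = \frac{E[X^\top Y_i Y_j^\top X]}{X^\top X} = \frac{X^\top (\rho_{ij} I_n) X}{X^\top X} = \rho_{ij},$$
where the second to last equality holds because we calculate under the null of no association with X. So we see that Cor(Ti, Tj) is the (i,j) entry of Σ, the correlation matrix of Y. It follows that the vector T composed of entries Tk, k = 1,…, m also has Cor(T) = Σ.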
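The claim that the test statistics inherit the correlation of the transcripts under the null can be checked numerically. The following is a minimal sketch and not part of the original analysis: the function names, sample size, number of replicates, and seed are illustrative assumptions, and the t-statistic is computed from the sample correlation r as r√(n−2)/√(1−r²), the usual simple-regression t-statistic.

```python
import math
import random


def pearson(a, b):
    """Sample Pearson correlation of two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    s_ab = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    s_aa = sum((x - mean_a) ** 2 for x in a)
    s_bb = sum((y - mean_b) ** 2 for y in b)
    return s_ab / math.sqrt(s_aa * s_bb)


def t_stat(x, y):
    """t-statistic for the slope in a simple linear regression of y on x."""
    r = pearson(x, y)
    return r * math.sqrt(len(x) - 2) / math.sqrt(1 - r * r)


def simulate_null_correlation(rho, n=30, reps=2000, seed=1):
    """Estimate Cor(T_i, T_j) when Y_i and Y_j have correlation rho
    and neither is associated with X (the null)."""
    rng = random.Random(seed)
    t_i, t_j = [], []
    for _ in range(reps):
        x = [rng.gauss(0, 1) for _ in range(n)]
        y_i, y_j = [], []
        for _ in range(n):
            z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
            y_i.append(z1)
            # y_j is built to have correlation rho with y_i
            y_j.append(rho * z1 + math.sqrt(1 - rho * rho) * z2)
        t_i.append(t_stat(x, y_i))
        t_j.append(t_stat(x, y_j))
    return pearson(t_i, t_j)
```

Calling `simulate_null_correlation(0.6)` should return a value close to 0.6, up to Monte Carlo error.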
Now consider the regression model under the alternative hypothesis of an association between a condition and transcript abundance. In particular, suppose there exists a causal association between X and Yj such that the expectation of the test statistic Tj corresponding to regression model Yj ~ X is E[Tj] = μj. For example, power under this association for a two-sided α-level test would be calculated with
$$F_{t_{n-2}}(q - \mu_j) + 1 - F_{t_{n-2}}(-q - \mu_j),$$
where Ftn-2(·) is the cdf of a t-distribution on n-2 df and q is the α/2 quantile of that distribution.
Assume there exists no causal association between X and Yi, and consider test statistic Ti. Because Cor(Ti,Tj) = ρij, E(Ti) = ρij · μj; we have power to detect an association between X and Yi by virtue of Yi’s correlation with Yj, though less so because |ρij| ⩽ 1. We will return to this point later in the discussion of statistical power.
Suppose that we have sufficient sample size such that the t-statistics resulting from an expression analysis can be approximated by the standard normal distribution. Since the test statistic for many gene-set tests relies on or is highly correlated with a sum or sum of squares of probe-level summary statistics composing the gene-set (e.g., Wu and others (2010); Kim and Volsky (2005); Wu and Smyth (2012); Ritchie and others (2015); Luo and others (2009); Barry and others (2005)), consider the following property about the multivariate normal distribution:
$$\sum_{i \in \Gamma} T_i = J_G^\top T \sim N\left(J_G^\top \mu,\; J_G^\top \Sigma J_G\right) \qquad (1)$$
for T a length m vector that follows MVN(μ, Σ), Γ a size g set of indices for the gene-set, and JG an indicator vector whose 1 entries correspond to the indices of Γ. So the mean of Σi∈Γ Ti is the sum of the component means corresponding to the 1 entries in JG, and the variance is the sum of the entries of Σ that remain after multiplication by JG. Sometimes this test statistic in the gene-set testing context is called the directed or directional test since its value is sensitive to the sign of Ti.
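Additionally, if we square the Ti’s, we have
$$\sum_{i \in \Gamma} T_i^2 \sim \sum_{i=1}^{g} \lambda_i \, \chi^2_1(\delta_i), \qquad \delta_i = \frac{(v_i^\top \mu_J)^2}{\lambda_i}, \qquad (2)$$
where μJ = JG · μ (the entries of μ indexed by Γ), and λi and vi are the eigenvalues and eigenvectors of ΣG, the submatrix of Σ corresponding to the gene-set (Imhof, 1961; Press, 1966; Harville, 1971). Sometimes this test statistic in the gene-set testing context is called undirected, non-directional, or mixed as its power is invariant to the sign of Ti.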
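As a small illustration of the directional statistic, the sketch below computes the mean and variance of the sum of test statistics over a gene-set and a two-sided normal-tail p-value under the null. It assumes the normal approximation described above; the function names and the example inputs are hypothetical, and Σ is passed as a plain list of lists.

```python
from statistics import NormalDist


def directional_moments(mu, sigma, gene_set):
    """Mean and variance of the sum of T_i over the indices in gene_set,
    for T ~ MVN(mu, sigma)."""
    mean = sum(mu[i] for i in gene_set)
    # Sum of the entries of sigma whose row and column are both in the set
    var = sum(sigma[i][j] for i in gene_set for j in gene_set)
    return mean, var


def directional_pvalue(t, sigma, gene_set):
    """Two-sided p-value for the summed statistic under the null mu = 0."""
    _, var = directional_moments([0.0] * len(t), sigma, gene_set)
    total = sum(t[i] for i in gene_set)
    z = abs(total) / var ** 0.5
    return 2 * (1 - NormalDist().cdf(z))
```

For example, with `sigma = [[1, 0.5, 0], [0.5, 1, 0], [0, 0, 1]]` and the gene-set `[0, 1]`, the variance of the sum is 3 rather than the 2 one would use if the two statistics were treated as independent.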
3. Relationship to permutation-based tests
3.1 Considering preservation of Type 1 error using permutations
It is sometimes thought that permutation is a sure, if computationally burdensome, way of preserving type 1 error when the null distribution of a test statistic is unknown. We show why this is not the case, at least with competitive gene-set tests. The result has been noted frequently elsewhere (Wu and Smyth, 2012; Ritchie and others, 2015; Zhou and others, 2013), but we try to describe it in a statistical framework so that evidence of the phenomenon is not primarily simulation-based. First, however, we consider permutation-based self-contained tests and the parametric distributions approximated by them.
3.1.1 Self-contained tests type 1 error
In light of the results in Section 2, we can consider the null distributions generated when samples (“self-contained” gene-set tests) or genes (“competitive” gene-set tests) are permuted to generate a null distribution for the test statistics discussed. In permuting samples, any possible association between gene and outcome is broken so that the expectation of T, the test statistic vector described above, is a length g vector of zeros, rather than μ. Since the permutations do not permute the genes within the gene set under consideration, the set's own correlation structure is preserved. Thus, one can conclude that a permutation-generated null distribution is a sample from
$$T_{null,self} \sim MVN(0, \Sigma_G).$$
With the joint distribution of Tnull,self in mind, we can consider the sum of squares of its elements. Since the expectation of Tnull,self is zero and using notation conventions from Section 2, δi = 0 because it involves an inner product of vi with a 0 vector. Differences in the null distribution for $\sum_{i \in \Gamma} T_{null,self,i}^2$ are therefore driven by {λi} (the set of eigenvalues of ΣG), and the null distribution can be very different depending on the distribution of the {λi}’s. If there is high within-set correlation, greater variation in λi will lead to a heavier-tailed null, whereas little within-set correlation will correspond to relatively similar λi’s and a thinner tail.
No within gene-set correlation implies independent Ti’s, so that
$$\sum_{i \in \Gamma} T_{null,self,i}^2 \sim \chi^2_g.$$
It is because independent component test statistics have been assumed at times in the past that a $\chi^2_g$ null distribution is used when in fact the heavier-tailed $\sum_i \lambda_i \chi^2_1$ was the true null distribution (Irizarry and others, 2009). It is in this distinct way that even self-contained tests have occasionally had inflated type 1 error, though this has been recognized and corrected in different articles (Gatti and others, 2010; Wu and others, 2010; Zhou and others, 2013; Mesirov and others, 2016).
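The inflation described here can be illustrated with a short simulation. The sketch below is only an illustration under stated assumptions: the gene-set has an exchangeable correlation structure (every pair of genes has the same correlation), the set size is even so that the chi-square tail has a simple closed form, and all function names and default values are hypothetical.

```python
import math
import random


def chi2_sf_even(x, df):
    """Upper tail P(chi2_df > x) for even df, via the closed-form sum
    exp(-x/2) * sum_{j < df/2} (x/2)^j / j!."""
    if df % 2 != 0:
        raise ValueError("closed form used here requires even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for j in range(1, df // 2):
        term *= half / j
        total += term
    return math.exp(-half) * total


def rejection_rate(rho, g=10, alpha=0.05, reps=4000, seed=1):
    """Fraction of null data sets rejected when the sum of squares of g
    equicorrelated N(0, 1) statistics is referred to a chi2_g null."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        shared = rng.gauss(0, 1)  # common factor inducing correlation rho
        t = [math.sqrt(rho) * shared + math.sqrt(1 - rho) * rng.gauss(0, 1)
             for _ in range(g)]
        stat = sum(v * v for v in t)
        if chi2_sf_even(stat, g) < alpha:
            rejections += 1
    return rejections / reps
```

With `rho=0.0` the rejection rate should sit near the nominal 0.05, while with `rho=0.8` it should be well above it, because the true null is the heavier-tailed mixture rather than the chi-square on g degrees of freedom.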
3.1.2 Competitive tests type 1 error
When permuting across gene sets for a competitive gene set test, probe-wise associations with the outcome are maintained, but the set over which we aggregate changes as a function of the genes randomly chosen to compose our permutation-generated set. Thus, for permutation k of the null generation, the distribution from which we sample is
$$T_{null,comp,k} \sim MVN(\mu_{G_k}, \Sigma_{G_k}),$$
where Gk is a set of genes of size g randomly chosen, μGk is the expectation vector of test statistics corresponding to those particular genes, and ΣGk is a submatrix of Σ corresponding to the set Gk. Since Gk changes with each iteration, it is evident that {Tnull,comp,1, …, Tnull,comp,q}, where q is the size of the permutation-generated null, are not identically distributed, but a mixture of distributions.
Crucially, and likely the reason type 1 error has been difficult to maintain in competitive gene-set tests, the off-diagonal entries of ΣGk (the correlation matrix of permutation k) will often be systematically smaller than the off-diagonal entries of ΣG because genes within biologically meaningful sets will often be co-expressed to a greater degree than randomly chosen groups of genes. We may therefore expect that the permutation-generated null distribution will not be as thick tailed as the test statistic whose null it is trying to approximate.
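A small simulation can make this mechanism concrete. The sketch below is illustrative only: it assumes a single co-expressed gene-set with exchangeable correlation embedded among otherwise independent genes, no gene associated with the outcome, and a sum-of-squares statistic whose null is generated by drawing random gene subsets of the same size. The function name and all default values are hypothetical.

```python
import math
import random


def competitive_type1_rate(rho=0.8, g=10, m=200, n_null=100,
                           alpha=0.05, reps=600, seed=1):
    """Type 1 error of a gene-permutation competitive test when the tested
    set (genes 0..g-1) is equicorrelated and no gene has any effect."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        shared = rng.gauss(0, 1)
        # Tested set: correlated statistics; remaining genes: independent
        t = [math.sqrt(rho) * shared + math.sqrt(1 - rho) * rng.gauss(0, 1)
             for _ in range(g)]
        t += [rng.gauss(0, 1) for _ in range(m - g)]
        squares = [v * v for v in t]
        observed = sum(squares[:g])
        exceed = 0
        for _ in range(n_null):
            idx = rng.sample(range(m), g)  # random gene set of size g
            if sum(squares[i] for i in idx) >= observed:
                exceed += 1
        p_value = (exceed + 1) / (n_null + 1)
        if p_value <= alpha:
            rejections += 1
    return rejections / reps
```

Comparing `competitive_type1_rate(rho=0.8)` with `competitive_type1_rate(rho=0.0)` shows the effect: the randomly drawn sets are mostly uncorrelated, so their null is thinner tailed than the distribution of the statistic for the co-expressed set.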
3.2. Statistical power when using permutations
Because expression analyses often rely on univariate regression test statistics, confounding structures are not detected and power to detect some non-causal transcripts can be above α because of their association with the causal gene. The underlying causal structure of transcript on phenotype and confounding by co-expressed genes and their effects on statistical power has been less discussed on a methodological level. When variable power has been discussed, it has been done so from the perspective of underlying biological or platform phenomena, such as in Oshlack and Wakefield (2009) with respect to transcript length bias. Here we describe why statistical power for both self-contained and competitive gene-set tests can vary due to the methodological set-up of the tests themselves.
3.2.1 Illustrative example
We first consider an example to explain variable power more concretely. For simplicity, assume a gene-set consists of 3 genes, one of which has a causal association with the outcome, and the other two do not. Call the test statistic associated with the causal gene T1 with expectation μ1, and those of the non-causal genes T2 and T3 (both with expectations 0). First suppose that T2 and T3 have correlation ρ, but both are independent of T1 (the causal gene). The distribution of their sum is
$$T_1 + T_2 + T_3 \sim N(\mu_1,\; 3 + 2\rho),$$
while the non-directional test statistic is
$$T_1^2 + T_2^2 + T_3^2 \sim \sum_{i=1}^{3} \lambda_i \, \chi^2_1(\delta_i),$$
where the non-centrality parameters δi are computed as in Section 2 from the mean vector (μ1 0 0), and λi and vi are the eigenvalues and eigenvectors of Σ.
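Consider now the scenario where we “shift” the correlation structure so that T1 and T2 have correlation ρ, and T3 is independent of them both, though the same causal associations remain. So E(T1) = μ1, and in this scenario E(T2) = μ1ρ because of T2’s correlation with T1. The expectation of T3 is still 0. This time, the distribution of their sum is
$$T_1 + T_2 + T_3 \sim N(\mu_1(1 + \rho),\; 3 + 2\rho),$$
and the non-directional test statistic is
$$T_1^2 + T_2^2 + T_3^2 \sim \sum_{i=1}^{3} \lambda_i \, \chi^2_1(\delta_i^{*}),$$
where the δi* are now computed from the mean vector (μ1 μ1ρ 0).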
In the second scenario the non-centrality parameters will tend to be bigger than those in the first scenario. We know this because (μ1 μ1ρ 0) is greater than or equal to (μ1 0 0) element-wise. Additionally, there will exist at least one δi strictly greater in the second scenario than the first scenario because for at least one i, vi has a non-zero entry in the second element since eigenvectors span the space defined by the columns of Σ. Because χ2 distributions are stochastically strictly increasing in their non-centrality parameters, we will always be better powered to detect the gene-set in the second scenario than in the first.
From this example, we conclude that we are more likely to reject the null for a gene-set enrichment test when causal transcripts are co-expressed with non-causal ones.
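The three-gene example can be worked through numerically. The following sketch assumes unit-variance, normally distributed test statistics as in Section 2 and uses hypothetical function names; it reports the mean and variance of the summed statistic, the mean of the sum of squares (the trace of Σ plus the sum of squared means), and the power of a two-sided directional test in each scenario.

```python
from statistics import NormalDist


def directional_power(mean, var, alpha=0.05):
    """Power of a two-sided alpha-level test of a N(mean, var) sum
    against the null N(0, var)."""
    nd = NormalDist()
    z = nd.inv_cdf(1 - alpha / 2)
    shift = mean / var ** 0.5
    return (1 - nd.cdf(z - shift)) + nd.cdf(-z - shift)


def three_gene_scenarios(mu1, rho):
    """Moments of the sum and sum of squares in the two scenarios.

    Scenario 1: the causal gene is independent of two correlated genes.
    Scenario 2: the causal gene is correlated with one non-causal gene.
    """
    var = 3 + 2 * rho  # same in both scenarios: one correlated pair
    first = {"sum_mean": mu1, "sum_var": var,
             "sumsq_mean": 3 + mu1 ** 2}
    second = {"sum_mean": mu1 * (1 + rho), "sum_var": var,
              "sumsq_mean": 3 + mu1 ** 2 * (1 + rho ** 2)}
    return first, second
```

For instance, with `mu1 = 2.0` and `rho = 0.8`, passing each scenario's `sum_mean` and `sum_var` to `directional_power` gives a visibly larger power in the second scenario, even though only one gene is causal in both.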
4. Proposed hypothesis tests: JAGST
4.1 Proposed hypothesis test 1: an analytic approach to self-contained test ROAST
First we introduce a test that is nearly numerically equivalent to important special cases of a leading self-contained gene-set test ROAST, but whose implementation relies on closed form formulae rather than the rotation-based methodology of ROAST, itself an elegant way of approximating the results of permutation-based procedures.
In Section 2 we gave formulae in equations (1) and (2) for the sum of test statistics and sum of squares of test statistics, respectively. These two distributions correspond to two important special cases of ROAST: its directional and mixed gene-set tests. We therefore do not need to rely on the rotation of residual methodology, which though statistically elegant has p-value granularity dependent on the number of specified iterations. We can instead calculate p-values using tail probabilities of the normal distribution in the case of the directional test and of a mixture of scaled χ2 distributions in the case of the mixed test.
We see in the supplementary material the correlation of 0.99 (0.96) between the ROAST directional test (mixed test) and analytic calculation of tail probabilities using these densities, both under the null hypothesis. That correlation falls slightly under the alternative hypothesis for reasons explained in the figures, but remains above 0.88. If the variable power issue is not a concern for the analyst, for example for reasons given in the Discussion section, and p-values from the self-contained test ROAST methodology are preferred, one need only use the formulae given. Since inversion of the χ2 mixture distribution can be difficult, sampling from the mixture distribution is also adequate and likely still computationally cheaper than ROAST.
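Sampling from the mixture is straightforward once the eigenvalues of the gene-set correlation matrix are available. The sketch below is a minimal illustration rather than the JAGST package implementation: it assumes the eigenvalues have already been computed elsewhere and are passed in as a list, and the function name, number of draws, and seed are hypothetical.

```python
import random


def mixture_pvalue(observed, eigenvalues, draws=20000, seed=1):
    """Monte Carlo upper-tail p-value for an observed sum of squares
    under the null mixture sum_i lambda_i * chi2_1 (central)."""
    rng = random.Random(seed)
    exceed = 0
    for _ in range(draws):
        # Each chi2_1 draw is the square of a standard normal draw
        draw = sum(lam * rng.gauss(0, 1) ** 2 for lam in eigenvalues)
        if draw >= observed:
            exceed += 1
    # Add-one correction keeps the estimate strictly positive
    return (exceed + 1) / (draws + 1)
```

With all eigenvalues equal to 1 the mixture reduces to an ordinary chi-square, while unequal eigenvalues such as `[3.4, 0.2, 0.2, 0.2]` (four genes with pairwise correlation 0.8) give a larger tail probability for the same observed statistic.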
4.2 Proposed hypothesis test 2: power-invariant competitive gene-set test
We propose the JAGST competitive test whose statistical power is invariant to the correlation structure in which causal effects are found. We do so by making two key changes to currently used gene-set testing methods. First, the test statistic for the competitive test is calculated using
$$T_G^\top \Sigma_G^{-1} T_G,$$
where TG is the sub-vector of T for the genes in the set, rather than a straightforward sum or sum of squares of test statistics. Using the inverse covariance matrix of the test statistics is central to achieving the desired correlation structure power invariance. Secondly, the null distribution is calculated by taking random subsets of genes of size g and calculating the same test statistic. Unlike other competitive tests that calculate their null distribution with random subsets of genes, using the inverse correlation matrix makes each realization of our null distribution a function of only the underlying effects of each subset, thereby controlling for correlation structure.
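To see why weighting by the inverse correlation matrix removes the dependence on where the causal gene sits, one can revisit the three-gene example of Section 3.2.1. The sketch below uses hypothetical helper names and a small hand-written linear solver so that it needs nothing beyond the standard library. It evaluates the quadratic form of the mean vector with the inverse correlation matrix, which is the non-centrality of the proposed statistic, and compares it with the plain sum of squared means.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(row) + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def quadratic_form(v, sigma):
    """Return v' sigma^{-1} v without forming the inverse explicitly."""
    w = solve(sigma, v)
    return sum(vi * wi for vi, wi in zip(v, w))


def sum_of_squares(v):
    """Plain sum of squared entries, for comparison."""
    return sum(vi * vi for vi in v)
```

With μ1 = 2 and ρ = 0.8, the mean vectors `[2, 0, 0]` and `[2, 1.6, 0]` paired with their respective correlation matrices give the same quadratic form (4), whereas the plain sums of squares differ (4 versus 6.56).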
While this is the essential idea of the JAGST competitive test, in practice we generate the proposed null distribution in a different and computationally easier way. Especially for large gene-sets, some of which approach a few hundred genes, calculating the null distribution in this brute force way is too cumbersome if not numerically prohibitive depending on whether the correlation matrices are approximately singular.
We therefore propose instead to:
(i) perform variable selection on the random gene subsets of size g,
(ii) choose the model with the smallest AIC (or another model fitness criterion that could be shown to control type 1 error and be power-invariant), then
(iii) use a novel and sophisticated method to calculate the z-statistics of the selected variables (Taylor and Tibshirani, 2017; Lockhart and others, 2014; Tibshirani and others, 2016), in order to
(iv) take the sum of squares of these test statistics, calling it δPI (“PI” for “power-invariant”), and
(v) sample from $\chi^2_g(\delta_{PI})$, which is the null distribution of the above test statistic as demonstrated in simulation (Figure 4).
We iterate until desired granularity in the null distribution is achieved.
The L1-penalized regression hypothesis testing procedure of Taylor and Tibshirani (2017) proves central to our algorithm. It provides a means of obtaining z-statistics on selected variables in a penalized regression framework, which until recently was not possible. These z-statistics allow us to determine the distribution under the alternative that our DE test statistics arise from.
While variable selection is not without computational cost, since the procedure stops once a certain AIC is achieved, we will generally avoid the large O(g^3) cost of matrix inversion and numeric instability for large gene sets by a significant margin. This is especially true if it is assumed that gene sets are in general composed of many transcripts working together in pathways, yet yield relatively sparse models when regularization is applied.
Iteration over different sets of size g will yield a mixture of non-central χ2 distributions. Formulae and approximations have been proposed for mixtures of χ2 distributions (Press, 1966; Liu and others, 2009), which could be used if the same null distribution were relevant to many different tests, or sampling from distributions were burdensome. For our purposes, we eschew these formulae to prefer sampling from the mixture distribution in R (R Core Team, 2017).
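Sampling from such a mixture needs only standard normal draws. The sketch below is a hedged illustration, not the JAGST implementation: it assumes the non-centrality parameters (one per random gene subset) have already been obtained from the selection and inference steps and are passed in as a list, that each component has g degrees of freedom, and that components are weighted equally. All names and defaults are hypothetical.

```python
import math
import random


def noncentral_chi2_draw(rng, df, ncp):
    """One draw from a non-central chi-square with df degrees of freedom
    and non-centrality ncp, built from df standard normal draws."""
    first = (rng.gauss(0, 1) + math.sqrt(ncp)) ** 2  # carries the shift
    rest = sum(rng.gauss(0, 1) ** 2 for _ in range(df - 1))
    return first + rest


def mixture_null_pvalue(observed, df, ncps, draws=20000, seed=1):
    """Monte Carlo upper-tail p-value of an observed statistic against an
    equally weighted mixture of non-central chi-squares."""
    rng = random.Random(seed)
    exceed = 0
    for _ in range(draws):
        ncp = rng.choice(ncps)  # pick a mixture component at random
        if noncentral_chi2_draw(rng, df, ncp) >= observed:
            exceed += 1
    return (exceed + 1) / (draws + 1)
```

When every non-centrality parameter is zero the mixture collapses to a central chi-square; larger non-centrality parameters push the null to the right, so the same observed statistic receives a larger p-value, which is what makes the competitive test a higher bar than the self-contained one.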
While other model fitness criteria could be used, we propose AIC at least in part because it chooses less parsimonious models than, for example, BIC. Since models are less parsimonious, the procedure we propose is less likely to be anti-conservative since δPI, the sum of squares of non-centrality parameters, will tend to be larger. Indeed, simulation suggests that type 1 error is controlled, without being overly conservative. If other simulation scenarios suggested otherwise, different model fitness criteria could be used.
4.3 Proposed hypothesis test 3: power-invariant self-contained gene-set test
For the JAGST self-contained test, we again use the test statistic
$$T_G^\top \Sigma_G^{-1} T_G.$$
In this case, the null hypothesis assumes no association between transcript and outcome so the null distribution for this test statistic is simply $\chi^2_g$. The quantiles for this null distribution will often be a much lower threshold for statistical significance as compared to the proposed competitive gene set test.
5. Results
We provide simulation and data analysis results for the different JAGST tests and compare them with CAMERA (with and without ranks), GAGE (with and without ranks), ROAST, and SAFE. We abbreviate our simulation analysis of type 1 error since it has been described well and studied in detail elsewhere (Barry and others, 2005, 2008; Zhou and others, 2013; Wu and Smyth, 2012) and focus on power as a function of correlation structure under different generating models. We perform simulation and data analysis in the R language (R Core Team, 2017).
5.1 Simulation
We generated 280 samples, each with 40 transcripts or features, the 40 composed of a correlated region of size 20 with correlation 0.8 and an uncorrelated region of size 20. The correlated and uncorrelated blocks also compose our two hypothetical gene sets. We then generated two different binary outcomes consistent with a logistic regression model, where the probability of event in one case was a function of two transcripts in the correlated block and in the other case a function of two uncorrelated transcripts in the uncorrelated block. The effect sizes of the two transcripts were equal and were also the same in either scenario (i.e., whether the causal transcripts were in the correlated or uncorrelated blocks).
We varied the effect size of the two transcripts from a log odds ratio of zero to 3 in the correlated region and performed eight different gene set tests on the 20 features composing the correlated block: CAMERA (with and without ranks), GAGE (with and without ranks), ROAST, SAFE, and the two JAGST power-invariant tests we propose, one self-contained and the other competitive.
We then varied the effect sizes of the two transcripts in the uncorrelated block over the same values and performed the same eight gene tests, this time on the 20 features composing the uncorrelated block. The two power curves in each of Figures 1-6 correspond to the respective test applied to the 20 correlated or uncorrelated features and the associated outcome.
In the cases of CAMERA, ROAST, SAFE, and GAGE, at all effect sizes greater than zero, there was more power to detect the correlated block gene set, even though effect sizes were the same as compared to the uncorrelated block gene set (Figures 1-6). Indeed, the power curves diverged and had different slopes on the −log p-value scale so that there was a still bigger power differential at increasing effect sizes. The divergence is particularly striking with GAGE, whose simulation was repeated to confirm the result. In the cases of Figures 1 and 2, we see a ceiling on the −log p-value, essentially stopping the divergence of power curves. This occurs because the methods use empirical p-values, and we set the number of iterations to 1000. The divergence would continue if the number of iterations were increased. There is a similar ceiling on the rank-based tests found in Figures 4 and 6.
Our proposed gene set tests on the other hand yielded power curves that were much closer at all effect sizes of the two underlying causal transcripts (Figures 7 and 8). While there are small power differences with at least the proposed competitive JAGST test, the power curves were parallel, indicating that the power difference in curves does not increase for larger effect sizes.
Lastly we performed gene-set tests under the null hypothesis and weak alternative to compare results from ROAST’s directional tests with tail probabilities justified in Section 2. The alternative hypothesis posited two causal transcripts among a gene-set size of 20 correlated transcripts. Rather than inverting the distribution, we estimated this tail probability using the appropriate mixture distribution of central, scaled χ2’s. Figures 9 and 10 show the high degree of correspondence between the two ways of calculating the test, with a correlation of 0.99. Figures found in the Supplementary Material show the correspondence between the mixed, non-directional tests of ROAST and the analytic solution described in Section 2. Again there is a high degree of correspondence between the two tests under the null and alternative hypotheses, both with correlations above 0.9.
5.2. Data analysis
We analyze data from an already published study of maternal and fetal transcription variation due to smoking exposure (Votavova and others, 2011). The study analyzed maternal peripheral, placenta, and neonatal cord blood on 20 pregnant women with smoking exposure and 50 without significant smoking exposure. Women were assayed using Illumina expression beadchip v3 and yielded 24,526 transcripts. We accessed the data via its Gene Expression Omnibus (GEO) accession number GDS3929 (Edgar and others, 2002; Barrett and others, 2013).
In keeping with the analysis protocol of Votavova and others (2011), we filtered transcripts of each cell type to approximately the third at detectable expression level. In doing so, the most differentially expressed genes according to our analyses were consistent with those of Votavova and others (2011). For our gene-set analyses, we only used this approximate third most differentially expressed in each category.
We analyzed 7 different gene-sets, motivated by ones commonly used in gene-set enrichment studies, oncogenic pathways, and ones more specific to the context of Votavova and others (2011), e.g., placenta development (Nevins and Potti, 2007; Zhang and others, 2010; Bild and others, 2006). We used 8 different gene-set test methodologies: CAMERA (with and without ranks), ROAST, SAFE, and GAGE (with and without ranks). ROAST and SAFE are self-contained tests, while CAMERA and GAGE are competitive tests. The last two used are our proposed power-invariant competitive and self-contained tests. P-values for the 8 tests by gene-set and data source (maternal peripheral blood, neonatal cord blood, or placenta) are included in Tables 1, 2, and 3, respectively.
6. Discussion
Much has been learned from gene-set testing since the early 2000s when such methods were developed. The idea that biological processes are complex systems whose elements do not act in isolation is powerful and has rightly been leveraged in many gene-set testing methodologies. Most of these methods are uniquely suited to certain situations, implicitly or explicitly making different assumptions of the underlying data and biological process.
Additionally, most proposed gene-set tests make much intuitive sense, for example summing test statistics from differential expression analyses and considering correlation structure in those test statistics to varying degrees and in different ways. For example, CAMERA accounts for that correlation structure via a variance inflation factor (VIF), while ROAST does so via generating a null distribution with rotation of residuals. As methodology has become more sophisticated over time, many tests have been proposed as a response to deficiencies identified in existing methods. For example, recognition that, without accounting for correlated test statistics, even some self-contained gene-set tests had grossly inflated type 1 error rates led to many critical methodological developments (Zhou and others, 2013; Wu and others, 2010; Mesirov and others, 2016). And to the extent that permutation has been proposed as a means of controlling type 1 error in some cases (Barry and others, 2005), other tests have been proposed that avoid the computational demands of permutation (Wu and others, 2010; Zhou and others, 2013).
We described many gene-set tests within an explicit statistical framework in this paper and showed why some could be subject to type 1 error inflation. While that observation has been made previously, it has been described most clearly in the context of 2 × 2 tables (Goeman and Bühlmann, 2007); we describe why it can also arise with a linear combination of differential expression statistics.
More importantly, in this paper we observed and explained why all gene-set tests analyzed have variable power as a function of correlation structure. We show that if gene drivers of phenotype are found in uncorrelated regions of a gene-set, there is much less power to detect the gene-set than if the drivers are in a correlated region of the gene-set. The observation is particularly relevant because co-expression of genes on expression arrays is ubiquitous and variable; indeed, the very idea of gene-set testing is that groups of genes working in concert are expressed and can be tested together.
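One way to see how such a dependence can arise is the toy calculation below. It is a sketch under assumptions that are ours, not the simulation design of this paper: the phenotype is taken to be a linear function of expression with a single driver gene, the gene-set statistic is a plain sum of marginal gene-phenotype covariances, and the block size, correlation, and effect size are arbitrary. Under this model the marginal covariance of gene i with the phenotype is the i-th entry of Σβ, so genes correlated with a driver inherit part of its signal.

```python
import math


def block_correlation(n_genes, block_size, rho):
    """Correlation matrix in which the first `block_size` genes share
    pairwise correlation `rho` and all remaining genes are uncorrelated."""
    sigma = [[1.0 if i == j else 0.0 for j in range(n_genes)]
             for i in range(n_genes)]
    for i in range(block_size):
        for j in range(block_size):
            if i != j:
                sigma[i][j] = rho
    return sigma


def marginal_signal(sigma, beta):
    """Covariance of each gene with y = sum_j beta_j * x_j + noise,
    i.e. the matrix-vector product Sigma * beta."""
    return [sum(s * b for s, b in zip(row, beta)) for row in sigma]


def summed_signal_to_noise(sigma, beta):
    """Mean shift of the summed marginal statistic relative to its null
    standard deviation sqrt(1' Sigma 1), up to factors (sample size,
    phenotype variance) that are common to both scenarios compared here."""
    null_sd = math.sqrt(sum(sum(row) for row in sigma))
    return sum(marginal_signal(sigma, beta)) / null_sd


# Illustrative settings (assumptions): 20 genes, a correlated block of 10.
n_genes, block_size, rho, effect = 20, 10, 0.5, 1.0
sigma = block_correlation(n_genes, block_size, rho)

# Scenario A: the single driver sits inside the correlated block.
beta_in = [0.0] * n_genes
beta_in[0] = effect

# Scenario B: the single driver sits among the uncorrelated genes.
beta_out = [0.0] * n_genes
beta_out[15] = effect

snr_in = summed_signal_to_noise(sigma, beta_in)
snr_out = summed_signal_to_noise(sigma, beta_out)
print(snr_in, snr_out)
```

The gene set, its correlation matrix, and the driver's effect size are identical in the two scenarios; only the driver's location differs. The summed statistic nevertheless carries a larger standardized shift when the driver lies in the correlated block, because its correlated neighbors also show marginal association.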
The gene set tests applied in our data analysis of smoking exposure in pregnant women did not yield any significant results after adjustment for multiple testing. Our proposed self-contained and competitive tests gave less significant results than the other tests for nearly every gene set. The proposed JAGST competitive gene set test is less significant than the JAGST self-contained test, which is consistent with the idea that competitive tests impose a heavier burden of proof than self-contained tests. While we were able to replicate the DE results of Votavova and others (2011), we were not able to confirm their findings of significant gene-sets. Since Votavova and others (2011) relied on the DAVID database (Huang and others, 2008, 2009) for enrichment analyses, we would not necessarily expect our results to coincide with those of the previous study.
The self-contained tests used in our data analysis, ROAST and SAFE, were not generally more significant than the competitive tests, CAMERA and GAGE. Since the tests depend on correlation structure and, in the case of the ranked tests, are non-parametric, we would not necessarily expect an obvious distinction between the two kinds of tests even though their null hypotheses are different.
While we have highlighted power variability as a function of correlation structure in gene-set tests, the relevance of the JAGST methodology will depend on the application. When the correlation structure, and the location of the transcripts driving phenotype relative to that structure, are similar to those in our simulation scenarios, there could be substantial under-detection of certain gene sets depending on the genetic architecture of the pathway. On the other hand, there are also scenarios, for example when there is only a single correlated block in the gene set, in which power variation is likely less of a concern and any of the tests analyzed in this paper is a suitable method. In principle, however, it seems best to use a test that is statistically sound without such implicit assumptions about correlation structure, in which case the JAGST method is preferred.
It is unclear exactly why there is a small divergence in power curves for the JAGST competitive test between the simulation scenarios compared (see Figure 7). Since our method relies on test statistics from an L1-penalized regression model, the relatively small variation is likely due to the very different correlation structures in which the transcripts with non-zero effect sizes are found in the two scenarios. These correlation structures may affect AIC, the model fit criterion used to choose the tuning parameter, which in turn affects the magnitude of the test statistics. Importantly, though, the small difference we see in our test is much smaller than that observed in other tests, and the two power curves are parallel on the -log p-value scale. In contrast, the divergence in power curves increases with effect size for the other tests examined. For the JAGST self-contained test, we observe nearly identical power curves, indicating nearly perfect power invariance.
7. Software and Data
An R package for implementing JAGST is available on GitHub under “sojourningNorth/JAGST” and can be installed in the standard way using directions found at the repository. Additionally, code used for simulation and analysis found in the manuscript is available at “sojourningNorth/JAGST-analysis”. Data used for analysis is available through NCBI’s GEO database and has accession number GDS3929 (Edgar and others, 2002; Barrett and others, 2013).
8. Supplementary Material
Additional figures showing the correspondence between the directional and mixed tests of ROAST and our calculation of these tests are in the online Supplementary Material, which can be found at http://biostatistics.oxfordjournals.org.
Acknowledgements
The author wishes to thank Prof. Arnoldo Frigessi for helpful comments in the preparation of the manuscript.