Abstract
Modern scientific studies from many diverse areas of research abound with multiple hypothesis testing concerns. The false discovery rate is one of the most commonly used error rates for measuring and controlling rates of false discoveries when performing multiple tests. Adaptive false discovery rates rely on an estimate of the proportion of null hypotheses among all the hypotheses being tested. This proportion is typically estimated once for each collection of hypotheses. Here we propose a regression framework to estimate the proportion of null hypotheses conditional on observed covariates. We provide both finite sample and asymptotic conditions under which this covariate-adjusted estimate is conservative, leading to appropriately conservative false discovery rate estimates. Our case study concerns a genome-wide association meta-analysis which considers associations with body mass index. In our framework, we are able to use the sample sizes for the individual genomic loci and the minor allele frequencies as covariates. We further evaluate our approach via a number of simulation scenarios.
1 Introduction
Multiple testing is a ubiquitous issue in modern scientific studies. Microarrays (Brown, 1995), next-generation sequencing (Shendure and Ji, 2008), and high-throughput metabolomics (Lindon et al., 2011) make it possible to simultaneously test the relationship between hundreds or thousands of biomarkers and an exposure or outcome of interest. These problems have a common structure consisting of a collection of variables, or features, for which measurements are obtained on multiple samples, with a hypothesis test being performed for each feature.
When performing thousands of hypothesis tests, the most widely used framework for controlling for multiple testing is the false discovery rate. For a fixed unknown parameter µ, and testing a single null hypothesis H0: µ = µ0 versus some alternative hypothesis, for example, H1: µ = µ1, the null hypothesis may either truly hold or not for each feature. Additionally, the test may lead to H0 either being rejected or not being rejected. Thus, when performing m hypothesis tests for m different unknown parameters, Table 1 shows the total number of outcomes of each type, using the notation from Benjamini and Hochberg (1995). We note that U, T, V, and S, and as a result, also R = V + S, are random variables, while m0, the number of null hypotheses, is fixed and unknown.
The false discovery rate (FDR), introduced in Benjamini and Hochberg (1995), is the expected fraction of false discoveries among all discoveries. The false discovery rate depends on the overall fraction of null hypotheses, namely m0/m. This proportion can also be interpreted as the a priori probability that a null hypothesis is true, π0.
When estimating the FDR, incorporating an estimate of π0 can result in a more powerful procedure compared to the original Benjamini and Hochberg (1995) procedure; moreover, as m increases, the estimate of π0 improves, which means that the power of the multiple-testing approach does not necessarily decrease when more hypotheses are considered (Storey, 2002).
Most modern adaptive false discovery rate procedures rely on an estimate of π0 using the data of all tests being performed. But additional information, in the form of meta-data, may be available to aid the decision about whether to reject the null hypothesis for a particular feature. We focus on an example from a genome-wide association study (GWAS) meta-analysis, in which millions of genetic loci are tested for associations with an outcome of interest - in our case body mass index (BMI). Different loci may not all be genotyped in the same individuals, leading to loci-specific sample sizes. Additionally, each locus will have a different population-level frequency. Thus, the sample sizes and the frequencies may be considered as covariates of interest. Other examples exist in set-level inference, including gene-set analysis, where each set has a different fraction of false discoveries. Adjusting for covariates independent of the data conditional on the truth of the null hypothesis has also been shown to improve power in RNA-seq, eQTL, and proteomics studies (Ignatiadis et al., 2016).
In this paper, we build on the work of Benjamini and Hochberg (1995), Efron et al. (2001), and Storey (2002) and the more recent work of Scott et al. (2015), which frames the concept of FDR regression and extends the concepts of FDR and π0 to incorporate covariates, represented by additional meta-data. Our focus will be on estimating the covariate-specific π0. We will also show how this can be seen as an extension of our work (Boca et al., 2013) on set-level inference, where an approach which focused on estimating the fraction of non-null variables in a set was developed, introducing the idea of “atoms,” non-overlapping sets based on the original annotations, and the concept of the “atomic FDR.” We provide a more direct approach to estimating the covariate-specific π0 and a number of theoretical frequentist properties for our estimator. We also compare our estimates to those of Scott et al. (2015).
The remainder of the paper is organized as follows. In Section 2 we present the BMI GWAS meta-analysis case study. In Section 3, we review the definitions of FDR and π0 and extend π0 to consider conditioning on a specific covariate. In Section 4, we discuss estimation and inference procedures for the covariate-specific π0 in the FDR regression framework. In Section 5, we consider special cases within the FDR regression framework, including how the no covariates case and the case where the features are partitioned return us to the “standard” estimation procedures. In Section 6, we explore some theoretical properties of the estimator, including showing that, under certain conditions, it is a conservative estimator of the covariate-level π0, its variance has an upper bound which can be calculated from the given data, and it is an asymptotically conservative estimator of the covariate-level π0. In Section 7 and Section 8, we consider simulations and an analysis of GWAS data. Finally, Section 9 provides our statement of reproducibility and Section 10 provides the discussion.
2 Case study: adjusting for sample size and allele frequency in GWAS meta-analysis
As we have described, there are a variety of situations where meta-data could be valuable for improving estimation of the prior probability a hypothesis is true or false. Here we consider an example from the meta-analysis of data from GWAS for BMI (Locke et al., 2015).
In a GWAS, data are collected for a large number of genomic loci called single nucleotide polymorphisms (SNPs) (Hirschhorn and Daly, 2005). Each person has a copy of the DNA at each SNP inherited from their mother and from their father. At each locus there is usually one of two types of DNA, called alleles, that can be inherited, denoted A and a. In general, A refers to the variant that is more common in the population being studied and a to the variant that is less common. Each person has a genotype for that SNP of the form AA, Aa, or aa. The number of copies of a, commonly called the minor allele, is assumed to follow a binomial distribution.
In a GWAS, each individual has the alleles for hundreds of thousands of SNPs measured along with some outcomes of interest like BMI. Then each SNP is tested for association with the outcome in a regression model and p-values are calculated for the association. GWAS studies have grown to sample sizes of tens of thousands of individuals. But the largest studies consist of meta-analyses combining multiple studies (Neale et al., 2010; Hirschhorn and Daly, 2005). In these studies, the sample size may not be the same for each SNP, for example if different individuals are measured with different technologies which measure different SNPs. As a result, the sample size could be considered as a meta-data covariate.
A second covariate of interest could be the frequency of the minor allele a in the population. The power to detect associations increases with increasing minor allele frequency. This is related to the idea that logistic regression is more powerful for outcomes that occur with a frequency close to 0.5.
Here we consider data from the Genetic Investigation of ANthropometric Traits (GIANT) consortium, specifically the genome-wide association study for BMI (Locke et al., 2015). The GIANT consortium performed a meta-analysis of 329,224 individuals measuring 2,555,510 SNPs and tested each for association with BMI. Here we will consider using a regression model to estimate a prior probability for association for each SNP conditional on the SNP-specific sample size and allele frequency.
3 Covariate-specific π0
We will now review the main concepts behind the FDR and the a priori probability that a null hypothesis is true, and consider the extension to the covariate-specific FDR, and the covariate-specific a priori probability. A natural mathematical definition of the FDR would be:
$$\mathrm{FDR} = E\left[\frac{V}{R}\right].$$
However, R is a random variable that can be equal to 0, so the definition that is generally used is:
$$\mathrm{FDR} = E\left[\frac{V}{R} \,\middle|\, R > 0\right] \Pr(R > 0),$$
namely the expected fraction of false discoveries among all discoveries multiplied by the probability of making at least one rejection.
We index the m null hypotheses being considered by 1 ≤ i ≤ m: H01, H02, …, H0m. For each i, the corresponding null hypothesis H0i can be considered as being about a binary parameter θi, such that:
$$\theta_i = \begin{cases} 0, & H_{0i} \text{ is true},\\ 1, & H_{0i} \text{ is false}. \end{cases}$$
Thus, assuming that θi are identically distributed, the a priori probability that a feature is null is:
$$\pi_0 = \Pr(\theta_i = 0).$$
We now extend the definition of π0 to consider conditioning on a covariate Xi, where Xi is a column vector of length c, possibly with c = 1:
$$\pi_0(x_i) = \Pr(\theta_i = 0 \mid X_i = x_i).$$
4 Estimation and inference for covariate-specific π0 in the FDR regression framework
We will now discuss the estimation and inference procedures for π0(xi) in an FDR regression framework. We assume that a hypothesis test is performed for each i, summarized by a p-value Pi. At a given threshold 0 < λ < 1, we consider the random variables Yi:
$$Y_i = 1(P_i > \lambda).$$
Thus, Yi is a dichotomous random variable that is 1 when the null hypothesis H0i is not rejected at an α-level of λ and 0 when it is rejected. Consequently, $\sum_{i=1}^{m} Y_i = m - R$ for a fixed, given λ. The null p-values will come from a Uniform(0,1) distribution, while the p-values for the features from the alternative will come from some other distribution, whose cumulative distribution function we denote by G.
The major assumption we make moving forward is that conditional on the null, the p-values do not depend on the covariates. In Theorem 2, we prove the major result we will use to derive the estimator for π0(xi).
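Theorem 2. Suppose that m hypothesis tests are performed and that conditional on the null, the p-values do not depend on the covariates. Then:
$$E[Y_i \mid X_i = x_i] = (1-\lambda)\,\pi_0(x_i) + \{1 - G(\lambda)\}\{1 - \pi_0(x_i)\},$$
where $G(\lambda) = \Pr(P_i \le \lambda \mid \theta_i = 1)$.
Proof.
$$E[Y_i \mid X_i = x_i] = \Pr(P_i > \lambda \mid X_i = x_i) = \Pr(P_i > \lambda \mid \theta_i = 0, X_i = x_i)\Pr(\theta_i = 0 \mid X_i = x_i) + \Pr(P_i > \lambda \mid \theta_i = 1, X_i = x_i)\Pr(\theta_i = 1 \mid X_i = x_i).$$
Then, using the assumption that conditional on the null, the p-values do not depend on the covariates:
$$E[Y_i \mid X_i = x_i] = (1-\lambda)\,\pi_0(x_i) + \{1 - G(\lambda)\}\{1 - \pi_0(x_i)\}.$$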
In Corollary 3, we show the corresponding result for the no-covariate case. This result is easy to prove directly, but we consider it as a corollary to Theorem 2 to show that there are no identifiability problems with the extension to covariates.
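Corollary 3. Suppose that m hypothesis tests are performed and that conditional on the null, the p-values do not depend on the covariates. Then:
$$E[Y_i] = (1-\lambda)\,\pi_0 + \{1 - G(\lambda)\}(1 - \pi_0).$$
Proof. Applying the law of iterated expectations:
$$E[Y_i] = E\{E[Y_i \mid X_i]\} = (1-\lambda)\,E[\pi_0(X_i)] + \{1 - G(\lambda)\}\{1 - E[\pi_0(X_i)]\}.$$
We complete the proof by using:
$$E[\pi_0(X_i)] = \int \Pr(\theta_i = 0 \mid X_i = x_i)\, dF_{X_i}(x_i) = \Pr(\theta_i = 0) = \pi_0,$$
where the integral is understood with respect to an underlying measure ν, typically either the Lebesgue measure over a subset of $\mathbb{R}^c$ or the counting measure over a subset of $\mathbb{Q}^c$, and FXi is the cumulative distribution function for Xi. Here we are implicitly assuming some distribution for Xi as well. Everywhere else we are conditioning on X.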
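We first review the procedure which applies Corollary 3 to lead to the estimator of π0 for the no-covariate case, which is also used by Storey (2002), then develop a procedure based on Theorem 2 to obtain an estimator of π0(xi). Both of them are based on assuming reasonably powered tests and a large enough λ, so that:
$$G(\lambda) \approx 1.$$
Corollary 3 then leads to:
$$E[Y_i] \approx (1-\lambda)\,\pi_0,$$
resulting in:
$$\pi_0 \approx \frac{E[Y_i]}{1-\lambda}.$$
Using a method-of-moments approach, we consider the estimator:
$$\hat{\pi}_0 = \frac{\sum_{i=1}^{m} Y_i}{m(1-\lambda)}, \tag{5}$$
which is used by Storey (2002). Applying the same steps with Theorem 2, we get:
$$\pi_0(x_i) \approx \frac{E[Y_i \mid X_i = x_i]}{1-\lambda}.$$
We can use a regression framework to estimate E[Yi|Xi = xi], then estimate π0(xi) by:
$$\hat{\pi}_0(x_i) = \frac{\hat{E}[Y_i \mid X_i = x_i]}{1-\lambda}.$$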
We now denote by Y the random vector of length m with the ith element Yi and by X the matrix of dimension m × (c + 1), which has the ith row consisting of $(1, X_i^T)$. Moving forward, we will denote by x the observed values of the random matrix X.
We consider estimators of the form:
$$\hat{\pi}_0(x_i) = \frac{\hat{E}[Y_i \mid X_i = x_i]}{1-\lambda} = \frac{s_i^T Y}{1-\lambda}, \tag{7}$$
where $S = Z(Z^TZ)^{-}Z^T$ for some m × p matrix Z with p < m and rank(Z) = d ≤ p, and $s_i^T$ is the ith row of S; in particular, we can have Z = X for linear regression or have Z also include polynomial or spline terms. If d = p, then $Z^TZ$ is invertible; if d < p, one can use any pseudoinverse of $Z^TZ$, since the projection matrix is unique.
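As a rough illustration of the regression estimator at a single threshold, the following Python sketch regresses the indicators Yi = 1(Pi > λ) on a design matrix Z by least squares and divides the fitted values by 1 − λ. The function name `pi0_regression`, the simulated p-values, and the linear form chosen for π0(x) are assumptions made for this example and are not taken from the paper's code; smoothing over thresholds is omitted.

```python
import numpy as np

def pi0_regression(pvals, Z, lam=0.8):
    """Estimate pi0(x_i) at one threshold lam via least squares on Y_i = 1(P_i > lam)."""
    pvals = np.asarray(pvals, dtype=float)
    Z = np.asarray(Z, dtype=float)
    y = (pvals > lam).astype(float)              # Y_i = 1(P_i > lambda)
    # The least-squares fitted values Z @ beta equal S @ y, where S is the
    # projection onto the column space of Z (also when Z is rank deficient).
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return (Z @ beta) / (1.0 - lam)

# Hypothetical data: pi0 decreases linearly in a scalar covariate x.
rng = np.random.default_rng(0)
m = 2000
x = np.linspace(0.0, 1.0, m)
pi0_true = 0.9 - 0.4 * x
is_null = rng.random(m) < pi0_true
pvals = np.where(is_null, rng.random(m), rng.beta(1.0, 20.0, size=m))

Z = np.column_stack([np.ones(m), x])             # intercept plus the covariate
est = pi0_regression(pvals, Z, lam=0.8)
est_clipped = np.clip(est, 0.0, 1.0)             # optional thresholding to [0, 1]
```

Using `np.linalg.lstsq` avoids forming the m × m matrix S explicitly, which matters when m is in the millions, as in the GWAS case study.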
Note that thus far we have considered the estimate of π0(xi) at a single threshold λ, so that $\hat{\pi}_0(x_i)$ is in fact $\hat{\pi}_0^{\lambda}(x_i)$. We can consider smoothing over a series of thresholds to obtain the final estimate, as done by Storey and Tibshirani (2003). In particular, in the remainder of this manuscript, we used cubic smoothing splines with 3 degrees of freedom over the series of thresholds 0.05, 0.10, 0.15, …, 0.95, following the example of the qvalue package, with the estimate being the smoothed value at λ = 0.95. The estimates may also be thresholded so that they are always between 0 and 1.
If we assume that the p-values are independent, we can also use bootstrap samples of them to obtain a confidence interval for π0(xi). The details for the entire estimation and inference procedure are in Algorithm 1.
4.1 Algorithm 1: Estimation and inference for π0(xi)
a) Obtain the p-values P1, P2, …, Pm, for the m hypothesis tests.
b) For a given threshold λ, obtain Yi = 1(Pi > λ) for 1 ≤ i ≤ m.
c) Choose a design matrix Z, estimate E[Yi|Xi = xi] by $\hat{E}[Y_i \mid X_i = x_i] = s_i^T Y$, where $S = Z(Z^TZ)^{-}Z^T$ and $s_i^T$ is the ith row of S, and estimate π0(xi) by $\hat{\pi}_0^{\lambda}(x_i) = s_i^T Y/(1-\lambda)$.
d) Smooth over a series of thresholds λ ∈ (0, 1) to obtain $\hat{\pi}_0(x_i)$, by taking the smoothed value at the largest threshold considered.
e) Take B bootstrap samples of P1, P2, …, Pm and calculate the bootstrap estimates $\hat{\pi}_0^{b}(x_i)$ for 1 ≤ b ≤ B using the procedure described above.
f) Form a 1 − α upper confidence interval for π0(xi) by taking the 1 − α quantile of the $\hat{\pi}_0^{b}(x_i)$ as the upper confidence bound, the lower confidence bound being 0.
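The steps of Algorithm 1 can be sketched in Python as follows, at a single threshold λ (step d, the smoothing over thresholds, is left out). The sketch makes an assumption the text does not spell out: each bootstrap sample resamples features, that is (Pi, xi) pairs, with replacement, refits the regression, and evaluates the fit at the original covariate values. The function names and the simulated data are hypothetical.

```python
import numpy as np

def fit_pi0(pvals, Z_fit, Z_eval, lam):
    """Least-squares fit of Y = 1(P > lam) on Z_fit, evaluated at Z_eval, clipped to [0, 1]."""
    y = (pvals > lam).astype(float)
    beta, *_ = np.linalg.lstsq(Z_fit, y, rcond=None)
    return np.clip(Z_eval @ beta / (1.0 - lam), 0.0, 1.0)

def pi0_with_upper_bound(pvals, Z, lam=0.8, B=200, alpha=0.05, seed=0):
    """Point estimates and bootstrap (1 - alpha) upper confidence bounds for pi0(x_i)."""
    pvals = np.asarray(pvals, dtype=float)
    Z = np.asarray(Z, dtype=float)
    rng = np.random.default_rng(seed)
    m = pvals.shape[0]
    point = fit_pi0(pvals, Z, Z, lam)                 # steps b) and c)
    boot = np.empty((B, m))
    for b in range(B):                                # step e)
        idx = rng.integers(0, m, size=m)              # resample features with replacement
        boot[b] = fit_pi0(pvals[idx], Z[idx], Z, lam)
    upper = np.quantile(boot, 1.0 - alpha, axis=0)    # step f); the lower bound is 0
    return point, upper

# Hypothetical example data.
rng = np.random.default_rng(1)
m = 1000
x = np.linspace(0.0, 1.0, m)
is_null = rng.random(m) < 0.9 - 0.3 * x
pvals = np.where(is_null, rng.random(m), rng.beta(1.0, 2.0, size=m))
Z = np.column_stack([np.ones(m), x])
point, upper = pi0_with_upper_bound(pvals, Z)
```

Fixing the seed makes the bootstrap bounds reproducible; B = 200 is an arbitrary choice for the illustration.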
5 Special cases for covariate-specific π0
5.1 No covariates
If we do not consider any covariates, the usual estimator from Eq. (5) can be deduced from applying Algorithm 1 by fitting a linear regression with just an intercept.
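A small numerical check of this special case, written as a Python sketch with made-up p-values: with an intercept-only design matrix, every entry of the projection matrix is 1/m, so each fitted value is the sample mean of the Yi and the regression estimate coincides with the no-covariate estimator.

```python
import numpy as np

rng = np.random.default_rng(1)
m = 500
lam = 0.8
pvals = rng.random(m)                       # hypothetical p-values
y = (pvals > lam).astype(float)             # Y_i = 1(P_i > lambda)

# No-covariate estimator: sum of Y_i divided by m * (1 - lambda).
storey = y.sum() / (m * (1.0 - lam))

# Intercept-only regression: Z is a single column of ones.
Z = np.ones((m, 1))
S = Z @ np.linalg.pinv(Z.T @ Z) @ Z.T       # projection matrix; every entry is 1/m
regression = S @ y / (1.0 - lam)            # one (identical) estimate per feature
```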
5.2 Partitioning the features
Now assume that the set of features is partitioned into S sets, namely that a collection of sets $\mathcal{S}$ = {As: 1 ≤ s ≤ S} is considered such that all sets are non-empty, pairwise disjoint, and have the set of all the features as their union. Note that the index s does not need to indicate any kind of ordering of the sets. For example, such partitioning could be induced by considering all possible atoms resulting from gene-set annotations, or could consist of brain regions of interest in a functional imaging analysis, when considering only the genes or voxels that are annotated (Boca et al., 2013). We can consider this in the covariate framework we developed by taking xi to be a vector of length S − 1, which consists of 0s at all positions with the exception of a value of 1 at the index corresponding to the single set As ∈ $\mathcal{S}$ such that i ∈ As, for 1 ≤ s ≤ S − 1. The set AS represents the “baseline set,” so that xi is a vector of length S − 1 consisting of just 0s if i ∈ AS. In notation commonly used in linear algebra:
$$x_i = \begin{cases} e_s, & i \in A_s,\ 1 \le s \le S-1,\\ 0_{S-1}, & i \in A_S, \end{cases}$$
where $e_s$ is the sth standard basis vector of length S − 1 and $0_{S-1}$ is the zero vector of length S − 1.
Taking into account the partition, a natural way of estimating π0(xi) is to just apply the estimator from Eq. (5) to each of the S sets:
$$\hat{\pi}_0(x_i) = \frac{\sum_{j \in A_s} Y_j}{|A_s|(1-\lambda)} \quad \text{for } i \in A_s.$$
A related idea has been proposed for partitioning hypotheses into sets to improve power (Efron, 2008). These results can also be obtained by estimating π0(xi) via Algorithm 1 by fitting a linear regression with an intercept and the covariates xi.
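The equivalence between the per-set estimates and the regression on set indicators can be checked numerically. The Python sketch below uses three hypothetical sets of sizes 100, 200 and 300, with the last set as the baseline; the p-values are simulated purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
lam = 0.8
sizes = [100, 200, 300]                          # hypothetical set sizes
n_sets = len(sizes)
labels = np.repeat(np.arange(n_sets), sizes)     # set membership of each feature
m = labels.size
pvals = rng.random(m)
y = (pvals > lam).astype(float)

# Per-set estimates: the no-covariate estimator applied within each set.
per_set = np.array([
    y[labels == s].sum() / ((labels == s).sum() * (1.0 - lam))
    for s in range(n_sets)
])
by_partition = per_set[labels]

# Regression with an intercept and indicators for all sets except the last (baseline).
indicators = (labels[:, None] == np.arange(n_sets - 1)[None, :]).astype(float)
Z = np.column_stack([np.ones(m), indicators])
beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
by_regression = Z @ beta / (1.0 - lam)
```

The two vectors agree because a saturated regression on set indicators returns the within-set means of the Yi as fitted values.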
6 Theoretical results
We now proceed to explore some theoretical properties of the estimator $\hat{\pi}_0(x_i)$. In what follows, 1 is the m × 1 vector consisting of just 1s. We will also use the notation:
$$\pi_0(x) = (\pi_0(x_1), \pi_0(x_2), \ldots, \pi_0(x_m))^T.$$
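Lemma 4 below gives the bias of $\hat{\pi}_0(x_i)$ as a sum of two terms. Note that the first term is non-negative: since λ < 1, G(λ) ≤ 1 and π0(xi) ≤ 1, we have $\frac{1-G(\lambda)}{1-\lambda}\{1-\pi_0(x_i)\} \ge 0$. The second term could, however, be negative, and depends on the level of non-linearity present in π0(xi) and misspecification of the model as encapsulated in the design matrix Z.
Lemma 4. The bias of $\hat{\pi}_0(x_i)$ is:
$$E[\hat{\pi}_0(x_i) \mid X = x] - \pi_0(x_i) = \frac{1-G(\lambda)}{1-\lambda}\{1-\pi_0(x_i)\} + \frac{G(\lambda)-\lambda}{1-\lambda}\{s_i^T\pi_0(x) - \pi_0(x_i)\}.$$
Proof. By Eq. (7):
$$E[\hat{\pi}_0(x_i) \mid X = x] - \pi_0(x_i) = \frac{s_i^T E[Y \mid X = x]}{1-\lambda} - \pi_0(x_i).$$
Using the result of Theorem 2:
$$E[Y \mid X = x] = \{1-G(\lambda)\}\mathbf{1} + \{G(\lambda)-\lambda\}\pi_0(x).$$
Given that $S = Z(Z^TZ)^{-}Z^T$ and that the first column of Z is 1, $S\mathbf{1} = \mathbf{1}$, so that $s_i^T\mathbf{1} = 1$. This is a known result used in linear regression. It can be obtained using the fact that $S = Z_1(Z_1^TZ_1)^{-1}Z_1^T$, where Z1 is a matrix consisting of d linearly independent columns of Z, including the first column, then applying the formula for the inverse of a block matrix. Thus:
$$E[\hat{\pi}_0(x_i) \mid X = x] - \pi_0(x_i) = \frac{1-G(\lambda)}{1-\lambda} + \frac{G(\lambda)-\lambda}{1-\lambda}\,s_i^T\pi_0(x) - \pi_0(x_i) = \frac{1-G(\lambda)}{1-\lambda}\{1-\pi_0(x_i)\} + \frac{G(\lambda)-\lambda}{1-\lambda}\{s_i^T\pi_0(x) - \pi_0(x_i)\}.$$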
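Theorem 5 shows that, if the model is correctly specified, i.e. π0(x) = Zβ for some vector β of length c + 1, then $\hat{\pi}_0(x_i)$ is a conservative estimate of π0(xi).
Theorem 5. If π0(x) = Zβ for some vector β of length c + 1, then $\hat{\pi}_0(x_i)$ is a conservative estimate of π0(xi), i.e.:
$$E[\hat{\pi}_0(x_i) \mid X = x] \ge \pi_0(x_i).$$
Proof. In this case, using the fact that S is a projection matrix onto the space spanned by the columns of Z and therefore SZ = Z:
$$S\pi_0(x) = SZ\beta = Z\beta = \pi_0(x),$$
so $s_i^T\pi_0(x) = \pi_0(x_i)$ and, by Lemma 4:
$$E[\hat{\pi}_0(x_i) \mid X = x] - \pi_0(x_i) = \frac{1-G(\lambda)}{1-\lambda}\{1-\pi_0(x_i)\} \ge 0.$$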
Corollary 6. If the same π0 is shared by all the features, i.e. it does not change based on any covariates, then $\hat{\pi}_0$ is a conservative estimate of π0. This result is also described elsewhere, for example in Storey (2002). We note here that it can also be obtained as a direct consequence of Theorem 5. Theorem 5 also applies to the case where the covariates concern the partitioning of the features, as in Section 5.2.
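To make the role of model specification concrete, the Python sketch below evaluates the expected value of the estimator exactly, with no simulation, in a toy setting. It relies on the identity E[Yi | Xi = xi] = (1 − λ)π0(xi) + {1 − G(λ)}{1 − π0(xi)} and splits the bias into a non-negative term and a term driven by misspecification. The alternative p-value distribution is taken to be Beta(1, 2), for which G(λ) = 1 − (1 − λ)²; the two π0 functions and all names are assumptions chosen for illustration.

```python
import numpy as np

lam = 0.8
G_lam = 1.0 - (1.0 - lam) ** 2              # CDF of Beta(1, 2) evaluated at lam

m = 200
x = np.linspace(0.0, 1.0, m)
Z = np.column_stack([np.ones(m), x])        # linear model: intercept plus x
S = Z @ np.linalg.pinv(Z.T @ Z) @ Z.T       # projection ("hat") matrix

def bias_terms(pi0):
    """Split the bias into a non-negative term and a misspecification term."""
    first = (1.0 - G_lam) / (1.0 - lam) * (1.0 - pi0)
    second = (G_lam - lam) / (1.0 - lam) * (S @ pi0 - pi0)
    return first, second

def bias_direct(pi0):
    """Bias computed directly from E[Y | X = x], as a cross-check."""
    expected_y = (1.0 - lam) * pi0 + (1.0 - G_lam) * (1.0 - pi0)
    return S @ expected_y / (1.0 - lam) - pi0

pi0_linear = 0.9 - 0.3 * x                  # correctly specified: linear in x
pi0_curved = 0.9 - 0.8 * (x - 0.5) ** 2     # misspecified: quadratic in x

first_lin, second_lin = bias_terms(pi0_linear)
first_cur, second_cur = bias_terms(pi0_curved)
```

In the linear case the second term vanishes and the bias is non-negative everywhere; in the quadratic case the second term is negative near the middle of the covariate range, where the linear fit underestimates π0(x).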
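Lemma 7 gives a bound on $\mathrm{Var}[\hat{\pi}_0(x_i) \mid X = x]$ in terms of S and λ. We note that this bound can always be calculated from the given data.
Lemma 7. Assuming that Yi are independent conditional on X and that all the features are independent:
$$\mathrm{Var}[\hat{\pi}_0(x_i) \mid X = x] \le \frac{S_{ii}}{4(1-\lambda)^2},$$
where Sii are the diagonal elements of S.
Proof. By Eq. (7):
$$\mathrm{Var}[\hat{\pi}_0(x_i) \mid X = x] = \frac{\mathrm{Var}[s_i^T Y \mid X = x]}{(1-\lambda)^2}.$$
By independence of Yi conditional on X and independence of the features:
$$\mathrm{Var}[\hat{\pi}_0(x_i) \mid X = x] = \frac{1}{(1-\lambda)^2}\sum_{j=1}^{m} S_{ij}^2\,\mathrm{Var}[Y_j \mid X_j = x_j].$$
Since Yj|Xj = xj is a Bernoulli random variable, its variance is P[Yj = 1|Xj = xj]{1 − P[Yj = 1|Xj = xj]}, which has 1/4 as its maximal value, attained at P[Yj = 1|Xj = xj] = 1/2. This leads to:
$$\mathrm{Var}[\hat{\pi}_0(x_i) \mid X = x] \le \frac{1}{4(1-\lambda)^2}\sum_{j=1}^{m} S_{ij}^2 = \frac{S_{ii}}{4(1-\lambda)^2},$$
the last equality being a direct consequence of S being a symmetric idempotent matrix.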
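The bound can be computed directly from the design matrix. The following Python sketch, for a hypothetical linear design with equally spaced covariate values, checks that the row sums of squares of S equal its diagonal and compares the bound Sii/{4(1 − λ)²} with the exact variance under assumed Bernoulli success probabilities (derived here from a made-up π0 and a Beta(1, 2) alternative).

```python
import numpy as np

lam = 0.8
m = 200
x = np.linspace(0.0, 1.0, m)
Z = np.column_stack([np.ones(m), x])
S = Z @ np.linalg.pinv(Z.T @ Z) @ Z.T
leverage = np.diag(S)                                 # the S_ii

# For a symmetric idempotent matrix, sum_j S_ij^2 equals S_ii.
row_sumsq = (S ** 2).sum(axis=1)

# Upper bound on the variance, computable from the design alone.
bound = leverage / (4.0 * (1.0 - lam) ** 2)

# Exact variance under assumed success probabilities p_j = P[Y_j = 1 | X_j = x_j].
pi0 = 0.9 - 0.3 * x                                   # hypothetical pi0(x)
G_lam = 1.0 - (1.0 - lam) ** 2                        # Beta(1, 2) alternative (assumption)
p = (1.0 - lam) * pi0 + (1.0 - G_lam) * (1.0 - pi0)
exact_var = (S ** 2) @ (p * (1.0 - p)) / (1.0 - lam) ** 2
```

Because p(1 − p) is well below 1/4 when λ is large, the bound can be loose; it is nonetheless available without knowing π0 or G.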
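Theorem 8 shows that, if Sii → 0 as m → ∞ holds alongside the assumptions of Lemma 7, then $\hat{\pi}_0(x_i)$ is a consistent estimator of $E[\hat{\pi}_0(x_i) \mid X = x]$.
Theorem 8. If Yi are independent conditional on X, all the features are independent, and Sii → 0 as m → ∞, then:
$$\hat{\pi}_0(x_i) - E[\hat{\pi}_0(x_i) \mid X = x] \xrightarrow{P} 0.$$
Proof. By Chebyshev’s inequality, for all ε > 0:
$$\Pr\left(\left|\hat{\pi}_0(x_i) - E[\hat{\pi}_0(x_i) \mid X = x]\right| \ge \varepsilon \,\middle|\, X = x\right) \le \frac{\mathrm{Var}[\hat{\pi}_0(x_i) \mid X = x]}{\varepsilon^2}.$$
Then, by using the stated assumptions and Lemma 7, we get that
$$\Pr\left(\left|\hat{\pi}_0(x_i) - E[\hat{\pi}_0(x_i) \mid X = x]\right| \ge \varepsilon \,\middle|\, X = x\right) \le \frac{S_{ii}}{4(1-\lambda)^2\varepsilon^2} \to 0 \quad \text{as } m \to \infty.$$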
Is it likely or even possible that Sii → 0 as m → ∞? In general this will be the case, unless there are some xi which have very high leverage on the regression line by being far from the overall mean of the xi vectors. The reason for this is that S being idempotent implies that Tr(S) = rank(S), and given that $S = Z(Z^TZ)^{-}Z^T$, Tr(S) = rank(Z) = d, which means that the mean value of Sii is d/m. The diagonal elements of S are also the leverages for the individual data points, with a “rule of thumb” of Sii > 2d/m often being used to identify high leverage points (Hoaglin and Welsch, 1978). It can also be shown that 1/m ≤ Sii ≤ 1. We first note that 0 ≤ Sii ≤ 1, by once again using the fact that S is symmetric and idempotent:
$$S_{ii} = \sum_{j=1}^{m} S_{ij}^2 = S_{ii}^2 + \sum_{j \ne i} S_{ij}^2 \ge S_{ii}^2.$$
We get the improved lower bound by using the fact that $S\mathbf{1} = \mathbf{1}$, i.e. $\sum_{j=1}^{m} S_{ij} = 1$, and Cauchy’s inequality:
$$1 = \Big(\sum_{j=1}^{m} S_{ij}\Big)^2 \le m\sum_{j=1}^{m} S_{ij}^2 = m\,S_{ii}.$$
Further using the fact that the mean value of Sii is d/m and the inequalities between the arithmetic mean and the minimum and maximum values, we obtain:
$$\frac{1}{m} \le \min_i S_{ii} \le \frac{d}{m} \le \max_i S_{ii} \le 1.$$
Hoaglin and Welsch (1978) discuss the case where Sii = 1, which occurs when the model is fully saturated, predicting the outcome exactly.
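The behaviour of the leverages as m grows can be inspected numerically. The Python sketch below computes the diagonal of S through a QR decomposition, which avoids forming the m × m matrix, for a hypothetical design with an intercept and equally spaced scalar covariate values. The helper `leverages` assumes Z has full column rank.

```python
import numpy as np

def leverages(Z):
    """Diagonal of the projection matrix onto the columns of a full-column-rank Z."""
    Q, _ = np.linalg.qr(Z)                   # reduced QR: Q is m x p, orthonormal columns
    return (Q ** 2).sum(axis=1)              # S = Q Q^T, so S_ii = sum_k Q_ik^2

results = {}
for m in (10, 100, 1000, 10000):
    x = np.linspace(0.0, 1.0, m)             # equally spaced covariate values
    Z = np.column_stack([np.ones(m), x])
    h = leverages(Z)
    d = int(np.linalg.matrix_rank(Z))
    results[m] = {
        "d": d,
        "mean": h.mean(),                    # equals d / m
        "min": h.min(),                      # at least 1 / m
        "max": h.max(),                      # at most 1
        "n_high": int((h > 2 * d / m).sum()),  # points flagged by the 2d/m rule of thumb
    }

for m, r in results.items():
    print(m, r)
```

For this design the largest leverage shrinks as m increases, which is the condition needed for the consistency result.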
Thus, by Theorems 5 and 8, under reasonable conditions, $\hat{\pi}_0(x_i)$ is a conservative and an asymptotically conservative estimator of π0(xi).
We note that our approach to estimating π0(xi) does not place any restrictions on its range. Thus, in practice, the values will also be thresholded to be between 0 and 1. In the following theorem, we show that implementing this thresholding decreases the mean squared error of the estimator. The approach is similar to that taken in Theorem 2 in the work of Storey (2002).
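The argument for the thresholded estimator can be sketched in a few lines. What follows is our reading of the claim, under the assumption that the thresholded estimator is $\hat{\pi}_0^{c}(x_i) = \min\{1, \max\{0, \hat{\pi}_0(x_i)\}\}$ (the superscript c is notation introduced here for illustration) and that the mean squared error is taken conditional on X = x; it is not a reproduction of the original theorem or its proof.

```latex
Since $0 \le \pi_0(x_i) \le 1$, consider the three possible cases:
\begin{align*}
0 \le \hat{\pi}_0(x_i) \le 1 &: \quad
  |\hat{\pi}_0^{c}(x_i) - \pi_0(x_i)| = |\hat{\pi}_0(x_i) - \pi_0(x_i)|, \\
\hat{\pi}_0(x_i) < 0 &: \quad
  |\hat{\pi}_0^{c}(x_i) - \pi_0(x_i)| = \pi_0(x_i)
  < \pi_0(x_i) - \hat{\pi}_0(x_i) = |\hat{\pi}_0(x_i) - \pi_0(x_i)|, \\
\hat{\pi}_0(x_i) > 1 &: \quad
  |\hat{\pi}_0^{c}(x_i) - \pi_0(x_i)| = 1 - \pi_0(x_i)
  < \hat{\pi}_0(x_i) - \pi_0(x_i) = |\hat{\pi}_0(x_i) - \pi_0(x_i)|.
\end{align*}
Hence $\{\hat{\pi}_0^{c}(x_i) - \pi_0(x_i)\}^2 \le \{\hat{\pi}_0(x_i) - \pi_0(x_i)\}^2$
for every realization, and taking expectations gives
\[
E\left[\{\hat{\pi}_0^{c}(x_i) - \pi_0(x_i)\}^2 \mid X = x\right]
\le
E\left[\{\hat{\pi}_0(x_i) - \pi_0(x_i)\}^2 \mid X = x\right].
\]
```

The inequality is strict whenever the unthresholded estimate falls outside [0, 1] with positive probability.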
7 Simulations
We first describe simulations which give a better idea of the usefulness of Lemma 4 and Theorem 5. We implemented a variety of scenarios, with different values of π0(xi) and Z, representing different levels of linearity and model misspecification. In each case, there are m = 1,000 features and 10,000 simulation runs were considered. For the scenarios where xi is a scalar, its values were taken to be evenly spaced, while for the scenarios where it is a vector, the values of the first component were taken to be evenly spaced, while the second component was a step function, with the first m/2 values being equal to 1 and the remaining m/2 values being equal to 0. We then randomly generated whether each feature was from the null or alternative distributions, so that the null hypothesis was true for the features for which a success was drawn from the Bernoulli distribution with probability π0(xi).
For the null features, p-values were randomly sampled from a U(0,1) distribution, while for the alternative features, they were sampled from a Beta(a, b) distribution, with a = 1, b = 2. Sampling the true positive p-values from a Beta distribution is justified in light of recent statistical research (Allison et al., 2002; Pounds and Morris, 2003; Allison et al., 2006; Leek and Storey, 2011). Plots of π0(xi) and the estimated mean of $\hat{\pi}_0(x_i)$ versus xi are in Figures 1 and 2 for different fitting approaches, for both our method (with λ = 0.8 and λ = 0.9 and with the smoothed value for our approach) and for the Empirical Bayes (EB) method of Scott et al. (2015). We note that Scott et al. (2015) use z-values instead of p-values; therefore we transform each p-value p to a z-value by using the formula $\Phi^{-1}(1 - p/2)$. Figure 1 does not threshold the results for our method, whereas Figure 2 thresholds them so that they are always between 0 and 1. Our method also shows improved performance compared to the method of Scott et al. (2015) in terms of the estimated mean being close to the true mean. In particular, the EB approach is more often anti-conservative; additionally, we were only able to use the estimates of 88%-90% of the simulation runs for the EB approach, the remaining runs resulting in errors. Note that, as expected, the closer we get to having a correctly specified model with a linear estimator, the better the estimation is. If the estimates are not thresholded, then for a model close to the true model, the theoretical results can be used as a good approximation. However, this can result in estimates below 0 or above 1. For higher values of π0(xi) which may result in estimates above 1, as in panel e) of these two figures, thresholding at 1 may lead to slightly anti-conservative results and increased variability.
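One run of this data-generating process can be sketched in Python as follows. The particular π0(x) function is a placeholder (the functions used in the paper's scenarios are not reproduced here), and the variable names are hypothetical. The z-value transformation is computed as −Φ⁻¹(p/2), which equals Φ⁻¹(1 − p/2) by symmetry and is numerically more stable for very small p.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(3)
m = 1000
x = np.linspace(0.0, 1.0, m)                 # evenly spaced scalar covariate
pi0 = 0.9 - 0.2 * x                          # placeholder pi0(x_i), not from the paper

# A Bernoulli(pi0(x_i)) success means the null hypothesis is true for feature i.
is_null = rng.random(m) < pi0

# Null p-values are U(0, 1); alternative p-values are Beta(1, 2).
pvals = np.where(is_null, rng.random(m), rng.beta(1.0, 2.0, size=m))

# Two-sided z-values for methods that take z-values as input:
# Phi^{-1}(1 - p/2) = -Phi^{-1}(p/2).
std_normal = NormalDist()
z = np.array([-std_normal.inv_cdf(p / 2.0) for p in pvals])
```

The resulting z-values are non-negative, so any sign information has to come from elsewhere if a method requires it.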
Next, we use the same set of simulations as in Figures 1 and 2 to estimate the variance of $\hat{\pi}_0(x_i)$ for λ = 0.8 and compare it to the bound from Lemma 7. Plots of $\mathrm{Var}[\hat{\pi}_0(x_i)]$ and its upper bound versus the index i are presented in Figure S1.
We also used the same scenarios, but varied the number of features in order to see whether Theorem 8, which says that $\hat{\pi}_0(x_i)$ is a consistent estimator of $E[\hat{\pi}_0(x_i) \mid X = x]$, holds. The number of features was taken to be either m = 10, 100, 1,000 or 10,000 and the components of xi were set as before. For each value of m considered, we calculated $\max_i S_{ii}$. The results, shown in Table S1, indeed justify the assumptions of Theorem 8. In general, $Z^TZ$ can be written as a matrix of the sample means of products of pairs of the p variables (i.e. $\frac{1}{m}\sum_{k=1}^{m} Z_{ki}Z_{kj}$ for the ith and jth variables) multiplied by m; therefore all the terms in $S = Z(Z^TZ)^{-1}Z^T$ include combinations of the individual variables Zij and the sample means of combinations, with the number of terms depending on p, which is fixed, multiplied by 1/m, if $Z^TZ$ is invertible. Thus, as long as all the means are bounded as m → ∞, as they would be in the case of equally spaced values, then Sii → 0 as m → ∞, fulfilling the conditions for Theorem 8.
Note that we have thus far assumed independent hypothesis tests. However, this assumption rarely holds in practice. We thus further consider the scenario where the 1,000 features are in 10 blocks of 100 features each. We then sample the latent variables which encode whether a particular feature is drawn from the null or the alternative using a thresholded multivariate normal distribution with a block-diagonal correlation structure, with within-block correlations equal to 0.9, thresholding them at 0. The p-values are then drawn as before, from U(0,1) for the null, and from a Beta distribution for the alternative. The scenarios analogous to Figures 1 and 2 are presented in Figures S2 and S3, respectively. Note that the results are nearly indistinguishable from the independent case.
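A possible way to generate such dependent null indicators is sketched below in Python. The text does not say how the marginal probabilities π0(xi) are matched after thresholding at 0; the sketch assumes that the mean of each latent normal variable is shifted by Φ⁻¹(π0(xi)), so that the marginal probability of being null equals π0(xi). That shift, the π0(x) function, and the function name are all assumptions made for illustration.

```python
import numpy as np
from statistics import NormalDist

def sample_block_latent(rng, n_blocks=10, block_size=100, rho=0.9, n_rep=1):
    """Standard normal latent variables with correlation rho within blocks, 0 across blocks."""
    shared = rng.standard_normal((n_rep, n_blocks, 1))            # one factor per block
    own = rng.standard_normal((n_rep, n_blocks, block_size))      # feature-specific noise
    latent = np.sqrt(rho) * shared + np.sqrt(1.0 - rho) * own     # unit variance, corr rho
    return latent.reshape(n_rep, n_blocks * block_size)

rng = np.random.default_rng(4)
m = 1000
x = np.linspace(0.0, 1.0, m)
pi0 = 0.9 - 0.2 * x                                   # placeholder pi0(x_i)

# Assumed mean shift so that P(latent_i + shift_i > 0) = pi0(x_i).
std_normal = NormalDist()
shift = np.array([std_normal.inv_cdf(q) for q in pi0])

latent = sample_block_latent(rng)[0]
is_null = latent + shift > 0                          # threshold at 0
pvals = np.where(is_null, rng.random(m), rng.beta(1.0, 2.0, size=m))
```

The one-factor construction gives an equicorrelated block structure without building the full 1,000 × 1,000 covariance matrix.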
8 Data analysis
Here we considered data from the GWAS for BMI (Locke et al., 2015). From a total of 2,555,510 SNPs, we removed the SNPs which did not have minor allele frequencies (MAFs) listed for the HapMap CEU population, leading to 2,500,573 SNPs. For each of these SNPs, we considered the p-values from the test of association with BMI and the meta-data covariates consisting of the number of individuals (N) considered for each SNP and the minor allele frequencies (MAFs) in the HapMap CEU population, since it is well-known that both sample size and MAF have an impact on p-values, with larger sample sizes and MAFs leading to more significant results.
The model we considered uses natural cubic splines with 5 degrees of freedom to model N and 3 discrete categories for the MAFs. Figure 3 shows the dependence of p-values on sample sizes within this dataset. Figure 4 shows the estimates of π0(xi) (thresholded at 0 and 1) plotted against the sample size N, stratified by the CEU MAFs for a random subset of 50,000 SNPs. We note that the results are similar for λ = 0.8, λ = 0.9, and for the final smoothed estimate. The EB method of Scott et al. (2015) shows similar qualitative trends; however, the estimated values are closer together as well as closer to 1.
Our results are consistent with intuition: larger sample sizes and larger MAFs lead to a smaller fraction of SNPs estimated to be null. Applying this estimator to the false discovery rate calculation will mean increased power to detect associations for SNPs with large sample sizes and large MAFs, with potentially reduced power for SNPs with the opposite characteristics.
9 Reproducibility
All analyses and simulations in this paper are fully reproducible and the code is available on GitHub at: https://github.com/SiminaB/Fdr-regression
10 Discussion
Here we have introduced a regression framework for the proportion of true null hypotheses in a multiple testing framework. We have provided conditions for conservative and consistent estimation of this proportion conditional on covariates. Using simulations, we have shown that, while the regression estimates may be incorrect under model misspecification, the upper bounds on the variance of the estimator hold even for inaccurate models.
Applying our estimator to GWAS data from the GIANT consortium demonstrated that, as expected, the estimate of the fraction of null hypotheses decreases with both sample size and minor allele frequency. It is a well known and problematic phenomenon that p-values for all features decrease as the sample size increases. This is because the null is rarely precisely true for any given feature. One interesting consequence of our estimates is that we can calibrate what fraction of p-values appear to be drawn from the non-null distribution as a function of sample size, potentially allowing us to quantify the effect of the “large sample size means small p-values” problem directly.
A range of other applications for our methodology are also possible by modifying our regression framework, including estimating false discovery rates for gene sets (Boca et al., 2013), estimating science-wise false discovery rates (Jager and Leek, 2013), or improving power in high-throughput biological studies (Ignatiadis et al., 2016).