Abstract
Analysis of ‘big data’ frequently involves statistical comparison of millions of competing hypotheses to discover hidden processes underlying observed patterns of data, for example in the search for genetic determinants of disease in genome-wide association studies (GWAS). Controlling the family-wise error rate (FWER) is considered the strongest protection against false positives, but makes it difficult to reach the multiple testing-corrected significance threshold. Here I introduce the harmonic mean p-value (HMP) which controls the FWER while greatly improving statistical power by combining dependent tests using generalized central limit theorem. I show that the HMP easily combines information to detect statistically significant signals among groups of individually nonsignificant hypotheses in examples of a human GWAS for neuroticism and a joint human-pathogen GWAS for hepatitis C viral load. The HMP simultaneously tests all combinations of hypotheses, allowing the smallest groups of hypotheses that retain significance to be sought. The power of the HMP to detect significant hypothesis groups is greater than the power of the Benjamini-Hochberg procedure to detect significant hypotheses, even though the latter only controls the weaker false discovery rate (FDR). The HMP has broad implications for the analysis of large datasets because it enhances the potential for scientific discovery.
Analysis of ‘big data’ has the potential to transform society, not least through improving our understanding of the ways in which genetics influences human traits such as health and disease risk.1 However, large datasets present unique challenges. One such challenge now faces geneticists designing future GWAS. To date, participants have typically been typed at around 600,000 genetic variants spread across the 3.2 billion base-pair genome. With the rapidly decreasing costs of DNA sequencing, direct whole genome sequencing (WGS) may soon become routine, raising the possibility of detecting associations at ever more variants.2,3 However, this presents a paradox because increasing the number of tests of association requires more stringent p-value correction for multiple testing, reducing the probability of detecting any individual association. The idea that analysing more data may lead to fewer discoveries is counter-intuitive, and suggests a flaw of logic.
The problem of testing very many hypotheses while keeping the appropriate false positive rate under control is a longstanding issue in large-scale applications of statistics. The family-wise error rate (FWER) is defined as the probability of falsely rejecting a null in favour of an alternative hypothesis in one or more of all tests performed. Controlling the FWER when some subset of the alternative hypotheses tested might be true is considered the strongest form of protection against false positives.
However, the simple and widely-used Bonferroni method for controlling the FWER tends to be conservative, especially when the individual tests are positively correlated, as often occurs when alternative hypotheses are compared against the same data. In practice, the conservative nature of Bonferroni correction exacerbates the stringent criterion of controlling the FWER, jeopardizing sensitivity to detect true signals.
Alternatives to controlling the FWER have been proposed based on arguments for less stringency. Controlling the false discovery rate (FDR) guarantees that among the significant tests, the proportion in which the null hypothesis is incorrectly rejected in favour of the alternative is limited.4 The widely-used Benjamini-Hochberg procedure4 for controlling the FDR shares with the Bonferroni method a robustness to positive correlation between individual tests,5 but does not share the consequent problem of becoming overly conservative. These advantages have increased the popularity of FDR control, but necessitate the acceptance of a less rigorous standard of control than the FWER, which in practice can produce large numbers of false positives.
Bayesian statistics experiences the same fundamental problem because the posterior odds of any individual hypothesis test are inevitably decreased by increasing the number of alternative hypotheses. However, model averaging using Bayes factors allows alternative hypotheses to be combined, so that comparing a group of alternatives against a common null may rule out the null hypothesis collectively. In the case of GWAS, even if no individual variant shows sufficiently strong evidence of association in a region, the model-averaged signal across that region may still achieve sufficiently strong posterior odds.6,7 Combining tests in this way makes an asset of more data by creating the potential for more fine-grained discovery when the signal is sufficiently strong without the liability of requiring that all hypotheses are evaluated individually at the higher level of statistical stringency.
However, there is no general method for combining evidence across hypotheses by model averaging in classical statistics. While some Bayesian arguments advocate simply abandoning classical statistics,8 others show that p-values from likelihood-based inference are mathematically closely related to Bayesian quantities.9,10 Pragmatically, the difficulty of specifying prior information, a tendency for computationally slower methods, and inertia, mean that application of Bayesian methods by practitioners still lags behind classical approaches in many settings, including GWAS. Here I show that competing hypothesis tests can be combined quickly and easily through the harmonic mean p-value, improving statistical power and the prospects for discovery using classical statistics, and prompting a reevaluation of the issue of controlling false positive rates in analyses of big data.
Results
The harmonic mean p-value
For observed data X consider L mutually exclusive alternative hypotheses Mi, i = 1 … L, all with the same nested null hypothesis M0. Suppose each alternative has been tested against the null to produce a p-value, pi. The main result of this paper is that the weighted harmonic mean p-value of any subset R of the p-values,
$$\mathring{p}_R = \frac{\sum_{i \in R} w_i}{\sum_{i \in R} w_i / p_i}, \tag{1}$$
(i) combines the evidence in favour of the group of alternative hypotheses R against the common null, (ii) is an approximately well-calibrated p-value for small values, and (iii) the following test controls the strong-sense family-wise error rate (FWER) at level approximately α for α ≤ 0.05, no matter how many subsets are tested:
$$\text{if } \mathring{p}_R \le \alpha\, w_R, \text{ reject } M_0 \text{ in favour of model group } R, \tag{2}$$
where alternative hypothesis Mi has weight wi, the weights sum to one over all L alternatives, and wR = Σi∈R wi.
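The weighted harmonic mean and the direct test can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation in the harmonicmeanp package: the function names are hypothetical, and the rejection rule assumes the direct test takes the form p̊R ≤ α wR with weights summing to one over all L tests.

```python
from typing import Optional, Sequence


def harmonic_mean_p(p: Sequence[float], w: Optional[Sequence[float]] = None) -> float:
    """Weighted harmonic mean of p-values: sum(w) / sum(w / p)."""
    if w is None:
        w = [1.0 / len(p)] * len(p)  # equal weights within the group
    if len(w) != len(p):
        raise ValueError("p and w must have the same length")
    if any(not 0.0 < x <= 1.0 for x in p):
        raise ValueError("p-values must lie in (0, 1]")
    return sum(w) / sum(wi / pi for wi, pi in zip(w, p))


def hmp_reject(p_group: Sequence[float], w_group: Sequence[float],
               alpha: float = 0.05) -> bool:
    """Direct test (assumed form): reject the null for group R if HMP <= alpha * w_R.

    w_group holds the weights of the group members only; the weights of all
    L tests are assumed to sum to one.
    """
    w_r = sum(w_group)
    return harmonic_mean_p(p_group, w_group) <= alpha * w_r


# Hypothetical example: L = 1000 equally weighted tests, four of which have p = 1e-4.
L = 1000
group_p = [1e-4] * 4
group_w = [1.0 / L] * 4
group_significant = hmp_reject(group_p, group_w)
single_significant = [hmp_reject([p], [1.0 / L]) for p in group_p]
```

In this made-up example none of the four tests passes its own threshold of α/L, which coincides with Bonferroni correction for a group of size one, yet the group of four is rejected jointly.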
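Generalized central limit theorem (e.g. ref11) can be used to obtain a p-value that becomes exact for large groups of hypotheses because 1/p̊R tends towards a Landau distribution,12 which has probability density function
$$f_{\mathrm{Landau}}(x \mid \mu, \sigma) = \frac{1}{\pi\sigma} \int_0^\infty \exp\left(-t\,\frac{x - \mu}{\sigma} - \frac{2}{\pi}\, t \log t\right) \sin(2t)\, dt.$$
This allows tables of significance thresholds to be computed for interpretation of the HMP (Table 1), and computation of a better-calibrated p-value using the HMP as a test statistic:
$$p_{\mathring{p}_R} = \int_{1/\mathring{p}_R}^{\infty} f_{\mathrm{Landau}}\left(x \mid \log L + 0.874,\ \tfrac{\pi}{2}\right) dx. \tag{3}$$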
Table 1 shows that direct interpretation of the HMP p̊ tends to be anti-conservative but very closely approximates pp̊ for small values and small groups of alternative hypotheses.
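The better-calibrated p-value can be approximated numerically. The sketch below is not the package implementation: it assumes equal weights, a Landau location of log L + 0.874 and scale π/2, and it integrates the standard Landau density directly on a log-spaced grid. The function names, grid settings and the restriction to moderately small HMPs are choices made for this illustration.

```python
import math

import numpy as np


def landau_tail(lam: float) -> float:
    """Upper tail P(X > lam) of the standard Landau density
    phi(x) = (1/pi) * int_0^inf exp(-u*log(u) - x*u) * sin(pi*u) du."""
    if lam < -2.0:
        raise ValueError("this sketch is only numerically stable for lam >= -2")
    # Integrating phi over x > lam replaces sin(pi*u) by sin(pi*u)/u = pi*sinc(u).
    u = np.geomspace(1e-20, 50.0, 400_001)  # log-spaced grid resolves u ~ 1/lam
    y = np.sinc(u) * np.exp(-u * np.log(u) - lam * u)
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(u)))  # trapezoid rule


def p_hmp(hmp: float, n_tests: int) -> float:
    """Asymptotic p-value for an equally weighted HMP of n_tests p-values."""
    mu = math.log(n_tests) + 0.874  # assumed Landau location
    sigma = math.pi / 2             # assumed Landau scale
    # A Landau(mu, sigma) variable written with sin(2t) maps to the standard
    # form via lam = (pi/2) * (x - mu) / sigma + log(pi/2), with x = 1/HMP.
    lam = (math.pi / 2) * (1.0 / hmp - mu) / sigma + math.log(math.pi / 2)
    return landau_tail(lam)


# Hypothetical inputs: the same HMP is less significant when more tests are combined.
p_small_group = p_hmp(0.01, 10)
p_large_group = p_hmp(0.01, 100)
```

Under these assumptions the computed p-value exceeds the HMP itself, which is the anti-conservative behaviour of direct interpretation, and the gap narrows as the HMP becomes small.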
Use of the HMP has several helpful properties that arise from generalized central limit theorem (see Supplementary Methods). It is
Robust to positive dependency between the individual p-values.
Insensitive to the exact number of tests.
Robust to the distribution of weights w.
Most influenced by the smallest p-values.
The HMP also outperforms Bonferroni and Simes13 correction. This last property means that whenever the Benjamini-Hochberg procedure,4 which controls only the FDR, finds significant hypotheses, the HMP will find significant hypotheses or groups of hypotheses. The HMP complements Fisher’s method for combining independent p-values,14 because the HMP is more appropriate when (i) rejecting the null implies that only one alternative hypothesis may be true, and not all of them, and (ii) the p-values might be positively correlated, and cannot be assumed independent.
In the next section the theory giving rise to the HMP is explained. Readers most interested in application of the HMP can skip to the following sections.
Model-averaged mean maximum likelihood
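A classical analogue of the Bayes factor is the maximized likelihood ratio, which measures the evidence for the alternative hypothesis against the null:
$$R_i = \frac{\sup_{\theta \in \Theta_{M_i}} p(X \mid \theta)}{\sup_{\theta \in \Theta_{M_0}} p(X \mid \theta)}.$$
In a likelihood ratio test (LRT), the p-value is calculated as the probability of obtaining an Ri as or more extreme if the null hypothesis were true:
$$p_i = \Pr(R_i \ge r_i \mid M_0),$$
where ri is the observed value of Ri.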
For nested hypotheses (ΘM0 ⊂ ΘMi), Wilks’ theorem15 approximates the null distribution of Ri as LogGamma(α = v/2, β = 1) when there are v degrees of freedom.
The idea motivating this paper was to develop a classical analogue to the model-averaged Bayes factor by deriving the null distribution for the mean maximized likelihood ratio,
$$\bar{R} = \sum_{i=1}^{L} w_i R_i,$$
where the weights could take into account prior information and the power of each test. Formally this means the model is treated as a random effect. Choice of weights is considered further in the Supplementary Methods.
The distribution of R̄ cannot be approximated by central limit theorem because the LogGamma distribution is heavy tailed, with undefined variance. Instead generalized central limit theorem can be used,11 which states that for equal weights (wi = 1/L) and independent and identically distributed Ris,
$$\frac{\bar{R} - a_L}{b_L} \xrightarrow{d} R_\lambda,$$
where λ = 1 is the heavy-tail index of the LogGamma(v/2,1) distribution, aL and bL are constants and Rλ is a Stable distribution with tail index λ. When v = 2, the specific form of the Stable distribution is the Landau. The assumptions of equal weights, independence and identical degrees of freedom can be relaxed. Full details of the Stable distribution approximation are in the Supplementary Methods.
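Notably, when v = 2 and the assumptions of Wilks’ theorem are met, the p-value equals the inverse maximized likelihood ratio:
$$p_i = R_i^{-1},$$
so the mean maximized likelihood ratio equals the inverse HMP:
$$\bar{R} = \sum_{i=1}^{L} w_i R_i = \sum_{i=1}^{L} \frac{w_i}{p_i} = \frac{1}{\mathring{p}}.$$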
Under these conditions, interpreting R̄ and the HMP are exactly equivalent. This equivalence motivates use of the HMP more generally because
The HMP will capture similar information to R̄ regardless of the degrees of freedom.
The Landau distribution gives an excellent approximation for R̄ with v = 2, and so for 1/p̊.
Combining pis rather than Ris automatically accounts for differences in degrees of freedom.
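The equivalence for v = 2 can be checked numerically. In this sketch the likelihood ratios are made-up values and the function name is hypothetical. It relies only on the fact that the chi-squared survival function with two degrees of freedom is exp(−x/2), evaluated at the deviance x = 2 log R.

```python
import math


def lrt_p_value_2df(ratio: float) -> float:
    """LRT p-value under Wilks' theorem with 2 degrees of freedom.

    The deviance 2*log(R) is chi-squared(2) under the null, whose survival
    function is exp(-x/2), so the p-value reduces to 1/R.
    """
    deviance = 2.0 * math.log(ratio)
    return math.exp(-deviance / 2.0)


# Hypothetical maximized likelihood ratios (each >= 1) with equal weights.
ratios = [3.0, 50.0, 1200.0]
weights = [1.0 / len(ratios)] * len(ratios)

p_values = [lrt_p_value_2df(r) for r in ratios]
mean_ratio = sum(w * r for w, r in zip(weights, ratios))
hmp = 1.0 / sum(w / p for w, p in zip(weights, p_values))
```

Here `mean_ratio` and `1 / hmp` agree up to floating-point error, which is the sense in which interpreting R̄ and the HMP are equivalent when v = 2.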
Further, the HMP is approximately well calibrated because the LogGamma cumulative distribution function is regularly varying, meaning that the model-averaged p-value (Equation 3) is approximated by the HMP itself when the HMP is small (e.g. ref16):
$$p_{\mathring{p}_R} \approx \mathring{p}_R.$$
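Directly interpreting the HMP using Equation 2 constitutes a multilevel test in the sense that any significant subset of hypotheses implies the HMP of the superset will also be significant because, for any subset R of a group S,
$$\frac{w_S}{\mathring{p}_S} = \sum_{i \in S} \frac{w_i}{p_i} \ge \sum_{i \in R} \frac{w_i}{p_i} = \frac{w_R}{\mathring{p}_R}, \quad \text{so} \quad \frac{\mathring{p}_S}{w_S} \le \frac{\mathring{p}_R}{w_R}.$$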
This means that (i) the HMP is a closed testing procedure11 that controls the strong-sense FWER, (ii) the HMP is more powerful than Bonferroni and Simes correction because the HMP is always smaller than the p-values for those tests, and therefore (iii) the HMP will produce significant results whenever the Simes-based Benjamini-Hochberg (BH) procedure does, even though BH only controls the less-stringent FDR. However, these results are only exact when the false positive rate α is arbitrarily small. In practice, the exact significance threshold varies by the number of hypotheses combined (Table 1).
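The ordering between the three combined p-values can be illustrated with a small sketch. It uses equal weights, made-up p-values and hypothetical function names. The HMP is interpreted directly here, which, as noted above, is only approximately calibrated.

```python
from typing import Sequence


def bonferroni_p(p: Sequence[float]) -> float:
    """Bonferroni combined p-value: L times the smallest p-value, capped at 1."""
    return min(1.0, len(p) * min(p))


def simes_p(p: Sequence[float]) -> float:
    """Simes combined p-value: min over k of L * p_(k) / k for sorted p-values."""
    n = len(p)
    return min(n * pk / k for k, pk in enumerate(sorted(p), start=1))


def hmp_equal_weights(p: Sequence[float]) -> float:
    """Harmonic mean p-value with equal weights."""
    return len(p) / sum(1.0 / x for x in p)


# Hypothetical p-values: four comparably small values and two large ones.
p_values = [0.02, 0.03, 0.025, 0.04, 0.5, 0.9]
combined = {
    "bonferroni": bonferroni_p(p_values),
    "simes": simes_p(p_values),
    "hmp": hmp_equal_weights(p_values),
}
```

For any input the HMP cannot exceed the Simes p-value, because the sum of inverse p-values is at least k/p(k) for every rank k, and the Simes p-value cannot exceed the Bonferroni p-value.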
So Equation 2 is formally a shortcut procedure that mathematically guarantees either superior power over Bonferroni and Simes or strong-sense control of the FWER depending on whether the nominal level α or the threshold α|R|, which is adjusted for the number of hypotheses combined (Table 1), is employed, respectively. Use of the latter threshold is exact up to the order of the Stable distribution approximation, and equivalent to applying a weighted Bonferroni correction to Equation 3. I recommend the use of this more exact test, available in the R package harmonicmeanp, and upon which all subsequent analyses in the main text are based. Analyses based on direct interpretation of the HMP are also presented in the Supplement, and reveal the practical differences between the approaches to be small for α = 0.05.
HMP enables adaptive multiple testing correction by combining p-values
That the Bonferroni method for controlling the FWER can be overly stringent, especially when the tests are non-independent, has long been recognized. In Bonferroni correction, a p-value is deemed significant if p ≤ α/L, which becomes more stringent as the number of tests L increases. Since human GWAS began routinely testing millions of variants by statistically imputing untyped variants, a new convention was adopted in which a p-value is deemed significant if p ≤ 5 × 10⁻⁸, a rule that implies the effective number of tests is no more than L = 10⁶. Several lines of argument were used to justify this ad hoc threshold,19,20 most applicable only to human GWAS.
In contrast, the HMP affords strong control of the FWER while avoiding both ad hoc rules and the undue stringency of Bonferroni correction, an advantage that increases when tests are non-independent. To show how the HMP can recover significant associations among groups of tests that are individually non-significant, I reanalysed a GWAS of neuroticism,18 defined as a tendency towards intense or frequent negative emotions and thoughts.21 Genotypes were imputed for L = 6 524 432 variants across 170 911 individuals. I used the HMP to perform model-averaged tests of association between neuroticism and variants within contiguous regions of 10, 100 and 1000 kilobases (kb), 10 megabases (Mb), entire chromosomes and the whole genome, assuming equal weights across variants.
Figure 1 shows the p-value from Equation 3 for each region R adjusted by a factor 1/wR to enable direct comparison to the significance threshold α = 0.05. Similar results were obtained from direct interpretation of the HMP (Figure S1). Model averaging tends to make significant and near-significant adjusted p-values more significant. For example, for every variant significant after Bonferroni correction, the model-averaged p-value for the corresponding chromosome was found to be at least as significant.
Model-averaging increases significance more when combining a group of comparably significant p-values, e.g. the top hits in chromosome 9. The least improvement is seen when one p-value is much more significant than the others, e.g. the top hit in chromosome 3. This behaviour is predicted by the tendency of harmonic means to be dominated by the smallest values. In the extreme case that one p-value dominates the significance of all others, the HMP test becomes equivalent to Bonferroni correction. This implies that Bonferroni correction might not be improved upon for ‘needle-in-a-haystack’ problems. Conversely, dependency among tests actually improves the sensitivity of the HMP because one significant test may be accompanied by other correlated tests that collectively reduce the harmonic mean p-value.
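The contrast between a lone signal and a cluster of comparable signals can be reproduced with synthetic p-values. The numbers below are invented for illustration, the HMP is interpreted directly with equal weights over the whole set (so wR = 1), and the function names are hypothetical.

```python
from typing import Sequence


def hmp_equal_weights(p: Sequence[float]) -> float:
    """Harmonic mean p-value of the whole set with equal weights (w_R = 1)."""
    return len(p) / sum(1.0 / x for x in p)


def bonferroni_adjusted(p: Sequence[float]) -> float:
    """Bonferroni-adjusted minimum p-value."""
    return min(1.0, len(p) * min(p))


# 'Needle in a haystack': one small p-value among 1000 tests.
lone = [1e-6] + [0.5] * 999
# Ten comparably small p-values among 1000 tests.
cluster = [1e-6] * 10 + [0.5] * 990

lone_gain = bonferroni_adjusted(lone) / hmp_equal_weights(lone)
cluster_gain = bonferroni_adjusted(cluster) / hmp_equal_weights(cluster)
```

With a single dominant p-value the HMP is essentially the Bonferroni-adjusted value (`lone_gain` is close to 1), whereas ten comparable p-values make the HMP roughly ten times smaller (`cluster_gain` is close to 10).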
In some cases, the HMP found significant regions where none of the individual variants were significant. For example, no variants on chromosome 12 were significant by Bonferroni correction nor by the ad hoc genome-wide significance threshold of 5 × 10⁻⁸. However, the HMP found significant 10 Mb regions spanning several peaks of non-significant individual p-values. One of those, variant rs7973260, which showed an individual p-value for association with neuroticism of 2.4 × 10⁻⁷, had been reported as also associated with depressive symptoms (p = 1.8 × 10⁻⁹). Such cross-association or ‘quasi-replication’, in which a variant is near-significant for the trait-of-interest and significant for a related trait, can be regarded as providing additional support for the variant’s involvement in the trait-of-interest.18
In chromosome 3, individual variants were found to be significant by the ad hoc threshold of 5 × 10⁻⁸, but neither Bonferroni correction nor the HMP agreed those variants or regions were significant at a FWER of α = 0.05. Indeed the HMP found chromosome 3 non-significant as a whole. Variant rs35688236, which had the smallest p-value on chromosome 3 of 2.4 × 10⁻⁸, had not been validated when tested in a quasi-replication exercise that involved testing variants associated with neuroticism for association with subjective wellbeing or depressive symptoms.18
These observations illustrate that the HMP adaptively combines information among groups of similarly significant tests where possible, while leaving lone significant tests subject to Bonferroni-like stringency, providing a general approach to combining p-values that does not require specific knowledge of the dependency structure between tests.
HMP allows large-scale testing for higher-order interactions without punitive thresholds
Scientific discovery is currently hindered by avoidance of large-scale exploratory hypothesis testing for fear of attracting multiple testing correction thresholds that render signals found by more limited testing no longer significant. A good example is the approach to testing for pairwise or higher-order interactions between variants in GWAS. Testing all pairwise interactions as well as the L individual variants invites a Bonferroni threshold (L + 1)/2 times more stringent than the threshold for testing variants individually, and strictly speaking this must be applied to every test, even though it is highly conservative because of the dependency between tests. The alternative of controlling the FDR risks a high probability of falsely detecting artefacts among any genuine associations discovered. Therefore interactions are not usually tested for.
To show how model-averaging using the HMP greatly alleviates this problem, I reanalysed human and pathogen genetic variants from a GWAS of pre-treatment viral load in hepatitis C virus (HCV)-infected patients.22 Jointly analysing the influence of human and pathogen variation on infection is an area of great interest, but requires a Bonferroni threshold of α/(LH LP) when there are LH and LP variants in the human and pathogen genomes respectively, compared to α/(LH + LP) if testing the human and pathogen variants separately. In this example, LH = 399 420 and LP = 827.
In the original study, a known association with viral load was replicated at human chromosome 19 variant rs12979860 in IFNL4 (p = 5.9 × 10⁻¹⁰), below the Bonferroni threshold of 1.3 × 10⁻⁷. The most significant pairwise interaction I found, assuming equal weights, involved the adjacent variant, rs8099917, with p = 2.2 × 10⁻¹⁰. However, this did not meet the more stringent Bonferroni threshold of 1.5 × 10⁻¹⁰ (Figure 2A). If the original study’s authors had performed and reported all 330 million tests, they could have been compelled to declare the marginal association in IFNL4 non-significant, despite what intuitively appears to be a clear signal.
Model averaging using the HMP reduces this disincentive to perform additional related tests. Figure 2B shows that despite no significant pairwise tests involving rs8099917, model averaging recovered a combined p-value of 3.7 × 10⁻⁸, below the multiple testing threshold of 1.3 × 10⁻⁷. Additionally, two viral variants produced statistically significant model-averaged p-values of 5.5 × 10⁻⁵ and 4.8 × 10⁻⁵ at polyprotein positions 10 and 2061 in the capsid and NS5a zinc finger domain (GenBank AQW44528), below the multiple testing threshold of 6.0 × 10⁻⁵.
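The thresholds quoted in this example follow from simple arithmetic, sketched below. The sketch assumes that the pairwise human-pathogen analysis comprises LH × LP tests and that the marginal thresholds are α divided by the number of human or viral variants. The variable names are illustrative.

```python
ALPHA = 0.05
L_H = 399_420  # human variants
L_P = 827      # pathogen variants

n_pairwise_tests = L_H * L_P            # every human variant paired with every viral variant
human_threshold = ALPHA / L_H           # marginal tests of human variants
viral_threshold = ALPHA / L_P           # marginal tests of viral variants
pairwise_threshold = ALPHA / n_pairwise_tests

# p-values quoted in the text
p_top_interaction = 2.2e-10  # best single pairwise test involving rs8099917
p_model_averaged = 3.7e-8    # HMP over all pairwise tests involving rs8099917

interaction_significant = p_top_interaction <= pairwise_threshold
model_averaged_significant = p_model_averaged <= human_threshold
```

Under these assumptions the best single interaction narrowly misses the pairwise threshold, while the model-averaged p-value for rs8099917 clears the threshold for testing human variants individually.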
These results show how model-averaging using the HMP can enhance scientific discovery by (i) encouraging tests for higher order interactions when they otherwise would not be attempted and (ii) recovering lost signals of marginal associations after performing an ‘excessive’ number of tests.
Untangling the signals driving significant model-averaged p-values
When more than one alternative hypothesis is found to be significant, either individually or as part of a group, it is desirable to quantify the relative strength of evidence in favour of the competing alternatives. This is particularly true when disentangling the contributions of a group of individually nonsignificant alternatives that are significant only in combination.
Sellke, Bayarri and Berger9 proposed a conversion from p-values into Bayes factors which, when combined with prior information and test power through the model weights, produces posterior model probabilities and credible sets of alternative hypotheses. The Supplementary Methods detail how the Bayes factors are approximately proportional to the weighted inverse p-value. This linearity mirrors the HMP itself, whose inverse is an arithmetic mean of the inverse p-values.
After conditioning on rejection of the null hypothesis by normalizing the approximate model probabilities to sum to 100%, the probability that the association involved human variant rs8099917 was 54.4%. This signal was driven primarily by the three viral variants with the highest probability of interacting with rs8099917 in their effect on pre-treatment viral load: position 10 in the capsid (10.9%), position 669 in the E2 envelope (8.7%) and position 2061 in the NS5a zinc finger domain (11.4%) (Figure 3). Even though the model-averaged p-value for the envelope variant was not itself significant, this revealed a plausible interaction between it and the most significant human variant rs8099917.
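The normalization step can be sketched as follows. The p-values are hypothetical, and the sketch takes the approximation described above at face value by treating each weighted inverse p-value as proportional to the model's Bayes factor, so it should not be read as the exact calculation in the Supplementary Methods.

```python
from typing import List, Optional, Sequence


def approx_model_probabilities(p: Sequence[float],
                               w: Optional[Sequence[float]] = None) -> List[float]:
    """Approximate posterior probabilities of the alternatives, conditional on
    rejecting the null, taken as proportional to w_i / p_i."""
    if w is None:
        w = [1.0 / len(p)] * len(p)  # equal weights
    scores = [wi / pi for wi, pi in zip(w, p)]
    total = sum(scores)
    return [s / total for s in scores]


# Hypothetical p-values for three competing alternatives.
probs = approx_model_probabilities([1e-6, 1e-5, 1e-5])
```

Because the scores are linear in the inverse p-values, the probability of a group of alternatives is the sum of the probabilities of its members, which is how a probability can be attached to a variant by summing over all the interactions that involve it.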
Discussion
The HMP provides a way to calculate model-averaged p-values, providing a powerful and general method for combining tests while controlling the strong-sense FWER. It provides an alternative to both the overly conservative Bonferroni control of the FWER, and the lower stringency of FDR control. The HMP allows the incorporation of prior information through model weights, and is robust to positive dependency between the p-values. The HMP is approximately well-calibrated for small values, while a null distribution, derived from generalized central limit theorem, is easily computed. When the HMP is not significant, neither is any subset of the constituent tests.
The HMP is more appropriate for combining p-values than Fisher’s method when the alternative hypotheses are mutually exclusive, as in model comparison. When the alternative hypotheses all have the same nested null hypothesis, the HMP is interpreted in terms of a model-averaged likelihood ratio test. However, the HMP can be used more generally to combine tests that are not necessarily mutually exclusive, but which may have positive dependency. It can be used alone or in combination, for example with Fisher’s method to combine model-averaged p-values between groups of independent data.
The theory underlying the HMP provides a fundamentally different way to think about controlling the FWER through multiple testing correction. The Bonferroni threshold becomes more stringent in inverse proportion to the number of tests, whereas the HMP is the reciprocal of the mean inverse p-value. To maintain significance with Bonferroni correction, the minimum p-value must decrease in inverse proportion to the number of tests. This strongly penalizes exploratory analyses. In contrast, when the false positive rate α is small, to maintain significance with the HMP requires only that the mean inverse p-value remains constant as the number of tests increases. This does not penalize exploratory analyses so long as the ‘quality’ of the additional hypotheses tested, measured by the inverse p-value, does not decline.
Through example applications to GWAS, I have shown that the HMP combines tests adaptively, producing Bonferroni-like adjusted p-values for ‘needle-in-a-haystack’ problems when one test dominates, but capitalizing on numerous strongly significant tests to produce smaller adjusted p-values when warranted. I have shown how model averaging using the HMP encourages exploratory analysis and can recover signals of significance among groups of individually non-significant tests, properties that have the potential to enhance the scientific discovery process.
Software Availability
An R package implementing the harmonic mean p-value and MAMML tests is available from https://cran.r-project.org/package=harmonicmeanp.
Acknowledgements
DJW is a Sir Henry Dale Fellow, jointly funded by the Wellcome Trust and the Royal Society (Grant 101237/Z/13/Z) and a member of the STOP-HCV consortium, which is funded by an award from the Medical Research Council (MR/K01532X/1). I thank the Social Science Genetic Association Consortium, the STOP-HCV Consortium and HCV Research UK Biobank for sharing data, Azim Ansari, Vincent Pedergnana and Chris Spencer for sharing expertise and Simon Myers for helpful comments.