Abstract
Siddharth Krishna Kumar1 and co-authors claim to have shown that “GCTA applied to current SNP data cannot produce reliable or stable estimates of heritability.” Given the numerous recent studies on the genetic architecture of complex traits that are based on this methodology, these claims have important implications for the field. Through an investigation of the stability of the likelihood function under phenotype perturbation and an analysis of its dependence on the spectral properties of the genetic relatedness matrix, our study characterizes the properties of an important approach to the analysis of GWAS data and identifies crucial errors in the authors’ analyses that invalidate their main conclusions.
Heritability estimation using genome-wide SNP data is a fundamental research topic with profound implications for studies of the genetic architecture of complex traits. The development of a novel methodology2,3 in this direction has spurred studies, on a broad spectrum of complex traits, that have reinforced the view that a substantial portion of missing heritability can be accounted for by hitherto undiscovered common variants4,5 and has led to substantial research that has demonstrated that certain functional categories of SNPs contribute disproportionately to the heritability of complex diseases6-8. However, in a recent report1, Krishna Kumar and co-authors claim to have proved that the method “may not reliably improve our understanding of the genomic basis of phenotypic variability” even when the assumptions of the method are satisfied exactly and that the heritability estimates produced are highly sensitive to the choice of sample used and to measurement errors in the phenotype. We investigated these claims by characterizing the likelihood function and identified crucial analytic errors that seriously undermine the validity of the authors’ conclusions.
The GREML model
We consider the following model (Figure 1A) of the phenotype y (which has been simplified, as in Krishna Kumar et al., to exclude any fixed effects):

$$ y = Zu + \varepsilon \qquad [\dagger] $$

where u is a P×1 vector of random (genetic) effects, Z is an N×P (standardized genotype) matrix and ε is the N×1 (non-genetic) residual. Here

$$ u \sim N(0, \sigma^2 I_P), \qquad \varepsilon \sim N(0, \alpha^2 I_N), $$

with u and ε independent. Thus, the distribution of y assumes the following form:

$$ y \sim N(0,\; \sigma^2 ZZ^T + \alpha^2 I_N). $$

Note that the phenotypic covariance, var(y), is the sum of a genetic covariance and a residual covariance. The Genetic Relatedness Matrix (GRM), which quantifies the genetic similarity between pairs of individuals using the genotype data Z, can be written as follows:

$$ A = \frac{1}{P} ZZ^T. $$
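As an illustration of the model just described, the following minimal sketch simulates a phenotype from random SNP effects and forms the GRM as ZZ^T/P. The function name `simulate_greml`, the uniform allele-frequency range, the sample sizes and the choice Pσ² = h2, α² = 1 − h2 are assumptions made for illustration only; they are not taken from the paper.

```python
import numpy as np

def simulate_greml(n=200, p=1000, h2=0.5, seed=0):
    """Simulate y = Z u + eps with a standardized genotype matrix Z (illustrative)."""
    rng = np.random.default_rng(seed)
    sigma2 = h2 / p        # assumed per-SNP effect variance, so that P * sigma2 = h2
    alpha2 = 1.0 - h2      # assumed residual variance
    # Hypothetical reference allele frequencies, kept away from 0 and 1.
    freq = rng.uniform(0.1, 0.9, size=p)
    # Genotypes coded as 0/1/2 copies of the reference allele.
    g = rng.binomial(2, freq, size=(n, p)).astype(float)
    # Standardize each SNP with its allele frequency (mean 0, variance 1).
    z = (g - 2.0 * freq) / np.sqrt(2.0 * freq * (1.0 - freq))
    u = rng.normal(0.0, np.sqrt(sigma2), size=p)      # random genetic effects
    eps = rng.normal(0.0, np.sqrt(alpha2), size=n)    # non-genetic residual
    y = z @ u + eps
    grm = z @ z.T / p                                 # GRM built from Z
    return z, y, grm

z, y, grm = simulate_greml()
```

Under these assumptions the diagonal of the GRM averages close to 1 and the matrix is symmetric by construction.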
Singularity index and induced quadratic form
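We refer to the function $S(Z) := \log\det(\alpha^2 I + \sigma^2 ZZ^T)$ as the singularity index (because it provides a formal test for the invertibility of the phenotypic covariance matrix $\alpha^2 I + \sigma^2 ZZ^T$) and refer to the function $Q(Z, y_1) := y_1^T(\alpha^2 I + \sigma^2 ZZ^T)^{-1} y_1$ as the induced quadratic form. Note that the log-likelihood of the observed phenotype data $y_1$ is given by

$$ \log L = -\frac{N}{2}\log(2\pi) - \frac{1}{2}\log\det(\alpha^2 I + \sigma^2 ZZ^T) - \frac{1}{2}\, y_1^T(\alpha^2 I + \sigma^2 ZZ^T)^{-1} y_1 \qquad [\dagger\dagger] $$

Using Restricted Maximum Likelihood (REML), GCTA estimates the variances σ2 and α2 given the observation y1, thereby providing an estimate of the SNP-based heritability:

$$ \hat{h}^2_{SNP} = \frac{P\hat{\sigma}^2}{P\hat{\sigma}^2 + \hat{\alpha}^2}. $$

Equivalently, the log-likelihood function, now viewed as a function of Z and y1, can be written as a sum involving the singularity index and the induced quadratic form:

$$ \log L(Z, y_1) = -\frac{N}{2}\log(2\pi) - \frac{1}{2} S(Z) - \frac{1}{2} Q(Z, y_1). $$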
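The decomposition of the log-likelihood into the singularity index and the induced quadratic form can be sketched numerically as follows. The sketch assumes that Q(Z, y1) is the quadratic form of y1 in the inverse phenotypic covariance and that the two terms enter the log-likelihood with a factor of −1/2. The function names and the demonstration values are hypothetical.

```python
import numpy as np

def phenotypic_covariance(z, sigma2, alpha2):
    """V = alpha^2 I + sigma^2 Z Z^T."""
    return alpha2 * np.eye(z.shape[0]) + sigma2 * (z @ z.T)

def singularity_index(z, sigma2, alpha2):
    """S(Z) = log det(V), computed stably with slogdet."""
    _sign, logdet = np.linalg.slogdet(phenotypic_covariance(z, sigma2, alpha2))
    return logdet

def induced_quadratic_form(z, y, sigma2, alpha2):
    """Q(Z, y) = y^T V^{-1} y, using a linear solve instead of an explicit inverse."""
    v = phenotypic_covariance(z, sigma2, alpha2)
    return float(y @ np.linalg.solve(v, y))

def log_likelihood(z, y, sigma2, alpha2):
    """Gaussian log-likelihood written as a sum involving S(Z) and Q(Z, y)."""
    n = z.shape[0]
    return -0.5 * (n * np.log(2.0 * np.pi)
                   + singularity_index(z, sigma2, alpha2)
                   + induced_quadratic_form(z, y, sigma2, alpha2))

# Small illustrative example with arbitrary (assumed) parameter values.
rng = np.random.default_rng(0)
z_demo = rng.standard_normal((20, 50))
y_demo = rng.standard_normal(20)
ll_demo = log_likelihood(z_demo, y_demo, sigma2=0.01, alpha2=0.5)
```

A linear solve is used in place of an explicit matrix inverse because it is generally the more numerically stable way to evaluate a quadratic form of this kind.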
Perturbation of the standardized genotype matrix Z and the GRM A
Because the Z in the GREML model is a standardized genotype matrix (wherein each entry is a function of the number of copies of the reference allele and the reference allele frequency at a SNP), there are implicit constraints on what constitutes a valid perturbed genotype matrix Z + perturb(Z) (i.e., constraints that determine whether Z + perturb(Z) is a realizable or an ill-defined standardized genotype matrix). A perturbation perturb(Z) may generate a matrix Z + perturb(Z) that departs substantially from a standardized genotype matrix, yielding an ill-defined revised model. To illustrate this, if the original (e.g., independent, real and random) entries in Z have mean 0 and variance 1, a perturbation with elements on the primary diagonal due to the introduction of the phenotype noise ϑ ~ N(0, τ2) would preserve the mean of these elements but alter their variance, possibly quite substantially. In short, not every element of Matrices(N, P) represents a standardized genotype matrix, and not every perturbation is a reasonable one. For the same reason, a perturbation of the GRM (by an error matrix E, as in the authors’ equation [A17] of the Appendix) does not necessarily generate a valid (revised) GRM. (For example, the resulting perturbed GRM must be symmetric, which implies that the perturbation matrix E must be symmetric as well.) Furthermore, modeling the difference between the true Z and the sample Z through an error matrix F via an additive model (Zsample = Ztrue + F) makes strong assumptions, including that the two matrices, Ztrue and Zsample, are of the same dimension (in particular, that they contain the same number of variants). It is therefore more sound to evaluate the discordance between the true GRM (GRMtrue) and the estimated GRM (GRMsample). The impact of this discordance (arising, for example, from the imperfect tagging of causal variants2,9) on the REML estimate of heritability is indeed a valid subject of research3.
Interestingly, this issue is related to the classic Horn’s conjecture in matrix theory (which was finally settled10) on the spectrum of the sum of two Hermitian matrices and on how the eigenvalues of two Hermitian matrices constrain the eigenvalues of their sum.
A critique of the authors’ claims
The authors evaluated the sensitivity of the likelihood function, and the resulting GREML estimate, to the GWAS data (specifically, phenotype measurement noise and population stratification). We report here crucial errors in the authors’ analyses, on which the main conclusions of the study are based. Furthermore, we highlight a methodological gap, which we address using an approach that may be of interest to future studies in population genetics and GWAS of complex traits.
(We should note that random matrix theory for the Wishart product matrix ZZT (or the GRM) generally assumes a Z with independent Gaussian entries, and any application in genetics must demonstrate that the relevant theoretical results apply (robustly) to a non-Gaussian matrix (e.g., one consisting of standardized genotype data). The authors appear to claim, clearly incorrectly and rather confusingly, a Wishart distribution for both Z and its symmetrization ZZT (e.g., see pages E62 and E68 of the authors’ paper1). In what follows, we will assume that Z is a standardized genotype matrix (and thus non-Gaussian), and Gaussian-based results that require extension to the non-Gaussian case will be explicitly stated.)
1. Sensitivity of third term of log-likelihood to phenotype noise
The authors sought to show the instability of the induced quadratic form Q(Z, y1), and thus of the log-likelihood, by showing its sensitivity to the phenotype measurement (i.e., to a perturbation of y1). In their analysis, this conclusion follows from the instability of the spectral properties of Z even under a “small perturbation.” The authors used the following “equivalence” of perturbations (see equation [A10] of their Appendix A) - namely, the perturbation to the phenotype measurement and the induced perturbation of the matrix Z:
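Applying the Sherman-Morrison-Woodbury identity to the quadratic form in the third term of the log-likelihood (equation [††]), one obtains

$$ y_1^T(\alpha^2 I_N + \sigma^2 ZZ^T)^{-1} y_1 = \frac{1}{\alpha^2}\, y_1^T y_1 - \frac{\sigma^2}{\alpha^4}\, y_1^T Z \underbrace{\left(I_P + \frac{\sigma^2}{\alpha^2} Z^T Z\right)^{-1}} Z^T y_1. $$

Thus, the sensitivity, assuming phenotype perturbation (equation [e1]), depends not only on the underbraced factor (i.e., the spectral properties of Z), but also on the remaining terms (including $y_1^T y_1$ and $Z^T y_1$, which depend directly on the phenotype). (The authors highlighted the former and, curiously, disregarded the latter.) Ignoring these remaining terms may yield invalid inferences concerning Q(Z,y1). Importantly, Q(Z,y1) is an ℝ-valued continuous (i.e., well-behaved and stable) function at every (Z0,y0) ∈ Matrices(N, P) × ℝ^N, i.e., for every η > 0 there exists δ > 0 such that

$$ d\big((Z, y), (Z_0, y_0)\big) < \delta \;\Longrightarrow\; |Q(Z, y) - Q(Z_0, y_0)| < \eta, $$

where d: (Matrices(N, P) × ℝ^N) × (Matrices(N, P) × ℝ^N) → ℝ is the distance function defined by:

$$ d\big((Z, y), (Z_0, y_0)\big) := \sqrt{\|Z - Z_0\|^2 + \|y - y_0\|^2} \qquad [\S] $$

Here, for M ∈ Matrices(N,P), $\|M\| = \sqrt{\sum_{i,j} m_{ij}^2}$ (the Frobenius norm). The metric in equation [§] endows the space Matrices(N, P) × ℝ^N with the topology of a Euclidean space (homeomorphic to ℝ^{NP+N}), on which Q(Z,y), consisting of sums, products and quotients of continuous functions (the matrix inverse being well defined because α2 > 0), is continuous. Similarly, the proper subset {(Z, y) | Z a standardized genotype matrix and y a phenotype vector} ⊆ Matrices(N, P) × ℝ^N, which is embeddable into ℝ^{NP+N} via the canonical inclusion, gets an induced subspace topology on which Q(Z,y) is continuous.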
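The Sherman-Morrison-Woodbury rewriting of the quadratic form used in this section can be checked numerically. The sketch below assumes Q(Z, y) = y^T(α²I + σ²ZZ^T)^{-1}y and compares a direct evaluation with the expanded form, in which the phenotype enters both through y^T y and through Z^T y. The matrix sizes, the Gaussian stand-in for Z and the parameter values are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 30, 12
sigma2, alpha2 = 0.3, 0.7
z = rng.standard_normal((n, p))   # Gaussian stand-in for a genotype matrix
y = rng.standard_normal(n)

# Direct evaluation: y^T (alpha^2 I_N + sigma^2 Z Z^T)^{-1} y
v = alpha2 * np.eye(n) + sigma2 * (z @ z.T)
q_direct = y @ np.linalg.solve(v, y)

# Expanded form:
# (1/alpha^2) y^T y - (sigma^2/alpha^4) y^T Z (I_P + (sigma^2/alpha^2) Z^T Z)^{-1} Z^T y
inner = np.eye(p) + (sigma2 / alpha2) * (z.T @ z)
zty = z.T @ y
q_smw = (y @ y) / alpha2 - (sigma2 / alpha2**2) * (zty @ np.linalg.solve(inner, zty))
```

The two evaluations agree to floating-point precision, which makes explicit that a perturbation of y affects every factor in the expanded form and not only the inverse that carries the spectral information of Z.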
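Given a fixed matrix Z, we ask how a perturbation in y1 changes Q(Z,y1). The rate of change in Q with respect to (the vector) y1 is given by:

$$ \frac{\partial Q}{\partial y_1} = \left[(\alpha^2 I + \sigma^2 ZZ^T)^{-1} + \left((\alpha^2 I + \sigma^2 ZZ^T)^{-1}\right)^T\right] y_1. $$

This simplifies (by the symmetry of the phenotypic covariance matrix) to the following expression:

$$ \frac{\partial Q}{\partial y_1} = 2\,(\alpha^2 I + \sigma^2 ZZ^T)^{-1} y_1, $$

which allows us to quantify the l2-norm of ∂Q/∂y1 as a function of (the perturbed) y1. Because S(Z) does not depend on y1, this also gives, up to a constant factor, the rate of change of the entire log-likelihood with respect to the phenotype vector. (Furthermore, the expression for ∂Q/∂y1 shows that Q ∈ C1, i.e., it is actually continuously differentiable as a function of y1.)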
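The rate of change of Q with respect to the phenotype can be illustrated with a finite-difference check. The sketch assumes Q(Z, y) = y^T V^{-1} y with V = α²I + σ²ZZ^T, so that the gradient is 2V^{-1}y. It also evaluates the elementary bound ‖∂Q/∂y‖ ≤ 2‖y‖/α², which holds because every eigenvalue of V is at least α². Function names and numerical values are hypothetical.

```python
import numpy as np

def induced_quadratic_form(z, y, sigma2, alpha2):
    v = alpha2 * np.eye(z.shape[0]) + sigma2 * (z @ z.T)
    return float(y @ np.linalg.solve(v, y))

def gradient_wrt_phenotype(z, y, sigma2, alpha2):
    """Analytic gradient of Q with respect to y: 2 V^{-1} y."""
    v = alpha2 * np.eye(z.shape[0]) + sigma2 * (z @ z.T)
    return 2.0 * np.linalg.solve(v, y)

rng = np.random.default_rng(2)
n, p = 25, 40
sigma2, alpha2 = 0.02, 0.6
z = rng.standard_normal((n, p))
y = rng.standard_normal(n)

analytic = gradient_wrt_phenotype(z, y, sigma2, alpha2)

# Central finite differences, one phenotype coordinate at a time.
step = 1e-5
numeric = np.empty(n)
for i in range(n):
    e_i = np.zeros(n)
    e_i[i] = step
    q_plus = induced_quadratic_form(z, y + e_i, sigma2, alpha2)
    q_minus = induced_quadratic_form(z, y - e_i, sigma2, alpha2)
    numeric[i] = (q_plus - q_minus) / (2.0 * step)

gradient_norm = np.linalg.norm(analytic)
norm_bound = 2.0 * np.linalg.norm(y) / alpha2   # since eigenvalues of V are >= alpha2
```

The gradient is linear in y and its norm is bounded by a multiple of ‖y‖, so a small phenotype perturbation changes Q by a correspondingly small amount.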
Consistent with the continuity of the function Q in Z and y and the (stable and “linear”) rate of change in Q with respect to y1, simulations we performed confirm the stability of the GREML estimate (Figure 1B). We note that, in fact, both terms (Q(Z,y1) and S(Z)) of the log-likelihood are continuous functions at every (Z0, y0) ∈ Matrices(N, P)xℝN.
The authors’ figure 5, which was intended to show the variation in the GREML estimates from random sampling from repeated measures of a phenotype, is not unexpected and, furthermore, does not empirically support the flawed theoretical argument about the instability of the log-likelihood.
2. Stability of second term of log-likelihood in stratified population
The authors also sought to show the instability of the singularity index S(Z) in a stratified population. Using the singular value decomposition (SVD) of Z and applying the matrix determinant lemma, one obtains the following decomposition:

$$ S(Z) = \log\det(\alpha^2 I_N + \sigma^2 ZZ^T) = N\log\alpha^2 + \log\det\left(I_P + \frac{\sigma^2}{\alpha^2} Z^T Z\right) \qquad [e2] $$

The last term of equation [e2] can be written in terms of the singular values wi of Z as $\log\prod_i \left(1 + \sigma^2 w_i^2/\alpha^2\right)$. From this, the authors concluded (incorrectly, as we will see) that in a stratified population (for which, it is claimed, thousands of the wi are close to 0), this expression for the last term of [e2] (and thus the entire expression itself) is sensitive to small changes in the values of the wi. However, one cannot show the instability of the singularity index without also considering the rest of the terms in equation [e2]. Indeed, equation [e2] can be rewritten as follows:

$$ S(Z) = N\log\alpha^2 + \sum_i \log\left(1 + \frac{\sigma^2 w_i^2}{\alpha^2}\right) \qquad [e3] $$

For singular values wi of Z that are close to 0, $\log\left(1 + \sigma^2 w_i^2/\alpha^2\right) \approx \sigma^2 w_i^2/\alpha^2$ (based on the Taylor series expansion). Thus, the sampling variability (from the expression for S(Z); see equation [e3]) for near-zero singular values does not arise from terms that are logarithmic in wi (as the authors claim), but from terms that are quadratic in wi, namely $\sigma^2 w_i^2/\alpha^2$. Such near-zero singular values should add little to the singularity index, and closely-packed singular values (i.e., for which wi ≈ w, for some constant w) should affect S(Z) nearly identically; thus the claim that near-zero singular values lead to unreliable estimates of the variance explained by all SNPs (= Pσ2) remains unfounded. In contrast, very large eigenvalues (such as those reflecting non-random population structure) affect the stability of the index, with the rate of change of the index, ρi, with respect to wi given by the following expression:

$$ \rho_i := \frac{\partial S}{\partial w_i} = \frac{2\sigma^2 w_i}{\alpha^2 + \sigma^2 w_i^2}, $$

which, as wi → ∞, is approximately $2/w_i$. Thus, the marginal effect of increasing singular value on the index decays at infinity in a manner inversely proportional to the magnitude of the singular value. The rate-vector ρ = (ρi) is informative about the behavior of the index at extreme singular values. Clearly, $\lim_{w_i \to 0} \rho_i = 0$, which implies that the rate of change becomes almost negligible for singular values near 0.
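The behavior of the singularity index as a function of the singular values can be sketched numerically. Starting from the definition S(Z) = log det(α²I + σ²ZZ^T), the sketch assumes the equivalent form N log α² + Σ log(1 + σ²w_i²/α²) and the corresponding derivative 2σ²w_i/(α² + σ²w_i²). The function names and parameter values are illustrative assumptions.

```python
import numpy as np

def singularity_index_from_singular_values(w, n, sigma2, alpha2):
    """S(Z) expressed through the singular values w of Z (zero values contribute 0)."""
    return n * np.log(alpha2) + np.sum(np.log1p(sigma2 * w**2 / alpha2))

def rate_of_change(w, sigma2, alpha2):
    """Derivative of S with respect to a single singular value w."""
    return 2.0 * sigma2 * w / (alpha2 + sigma2 * w**2)

rng = np.random.default_rng(4)
n, p = 30, 80
sigma2, alpha2 = 0.01, 0.5
z = rng.standard_normal((n, p))   # Gaussian stand-in for a genotype matrix

w = np.linalg.svd(z, compute_uv=False)
s_from_w = singularity_index_from_singular_values(w, n, sigma2, alpha2)
_sign, s_direct = np.linalg.slogdet(alpha2 * np.eye(n) + sigma2 * (z @ z.T))

rate_near_zero = rate_of_change(1e-8, sigma2, alpha2)   # close to 0
rate_large = rate_of_change(1e6, sigma2, alpha2)        # close to 2 / w
```

In this form a near-zero singular value contributes almost nothing to S(Z) and has an almost vanishing derivative, while for a very large singular value the derivative decays like 2/w.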
As we have already noted, the singularity index is also a continuous function at each (Z0,y0) ∈ Matrices(N, P)xℝN and, by projection to the first coordinate, a continuous function of the matrix Z. Related to this, the classical Weyl’s inequality11 implies that, given $GRM_{sample} = GRM_{true} + E$ with ‖E‖Frobenius < ɛ, then

$$ |s_i(GRM_{sample}) - s_i(GRM_{true})| \le \|E\|_2 \le \|E\|_{Frobenius} < ɛ \quad \text{for each } i, $$

where s_i denotes the i-th largest singular value; i.e., small perturbations in GRM yield only small perturbations in singular values. Thus, under an additive model in which the true GRM differs from the sample GRM by a perturbation E whose Frobenius norm is small, the difference in the corresponding singular values between the GRMs will be correspondingly small.
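Weyl’s inequality can be illustrated with a small numerical experiment. The sketch below builds a GRM from a Gaussian stand-in for Z, adds a small symmetric perturbation E, and compares the largest shift in the ordered eigenvalues with the spectral and Frobenius norms of E. The matrix sizes and the perturbation scale are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 50, 200
z = rng.standard_normal((n, p))
grm_true = z @ z.T / p

# Small symmetric perturbation, so that the perturbed GRM stays symmetric.
e = rng.standard_normal((n, n))
e = 1e-3 * (e + e.T) / 2.0
grm_sample = grm_true + e

# eigvalsh returns eigenvalues in ascending order, so the spectra are aligned.
eig_true = np.linalg.eigvalsh(grm_true)
eig_sample = np.linalg.eigvalsh(grm_sample)

max_shift = np.max(np.abs(eig_sample - eig_true))
spectral_norm = np.linalg.norm(e, 2)
frobenius_norm = np.linalg.norm(e, "fro")
```

The largest eigenvalue shift is bounded by the spectral norm of E, which is in turn bounded by its Frobenius norm.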
3. Methodological gap
What is notably missing from the authors’ analysis, given its use of the eigenvalues of the GRM (from the SVD) to evaluate the stability of the GREML approach, is a quantification of the degree to which the eigenvalues reflect non-random population structure versus random expectation. A large eigenvalue may well be “within null expectation,” and there is thus a need to quantify its significance. (Note this is different from the empirical distribution of the GRM eigenvalues as presented in the authors’ figure 1, which aimed to show, despite the small sample sizes considered, concordance of the data with the asymptotic behavior of eigenvalues from the Marchenko-Pastur theory.) Consideration of the null is also missing from the authors’ appropriation of the notion of an “ill-conditioned” matrix Z, which is defined in terms of the condition number $\kappa(Z) = w_{\max}/w_{\min}$ (the ratio of the largest to the smallest singular value of Z), as an approach for investigating the effect on GREML estimates. In addition to these key methodological gaps, it is important to note that κ is a property of the matrix Z rather than of the GREML method. Indeed, a very large κ would also affect effect size estimation in simple linear regression (e.g., equation [†]) that jointly fits multiple SNPs as fixed effects; a very large κ would imply that even a small change in y could have a destabilizing impact on the estimated SNP effect sizes and that matrix inversion would be unstable with finite-precision numbers.
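The point that κ is a property of the matrix Z can be illustrated directly. The sketch below assumes κ(Z) is the ratio of the largest to the smallest singular value and compares a well-conditioned Gaussian matrix with a copy in which one column nearly duplicates another (a hypothetical stand-in for two SNPs in almost perfect linkage disequilibrium). No heritability model is involved at any point.

```python
import numpy as np

def condition_number(z):
    """kappa(Z): largest singular value divided by the smallest."""
    w = np.linalg.svd(z, compute_uv=False)   # returned in descending order
    return w[0] / w[-1]

rng = np.random.default_rng(5)
z_well = rng.standard_normal((100, 5))

# Make the last column an almost exact copy of the previous one.
z_ill = z_well.copy()
z_ill[:, 4] = z_ill[:, 3] + 1e-6 * rng.standard_normal(100)

kappa_well = condition_number(z_well)
kappa_ill = condition_number(z_ill)
```

The second matrix has a condition number several orders of magnitude larger than the first, and this would destabilize an ordinary least-squares fit on its columns just as it would any other method that inverts a matrix built from Z.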
The distribution of the largest eigenvalue of the Wishart matrix of a matrix Z with independent Gaussian entries is known 12. For large values of N and P, if λ denotes the largest eigenvalue, then $(\lambda - \mu(N, P))/\sigma(N, P)$ assumes the Tracy-Widom distribution 13; here both the centering constant μ(N, P) and the scaling constant σ(N, P) depend only on N and P. If the following assumptions are met for the symmetrization ZZT = [sij] in the GREML model (where now Z is the standardized genotype matrix with non-Gaussian entries):
the (independent real random) entries have mean 0 and variance 1;
all moments of these random variables are finite; and
$E(s_{ij}^{2m}) \le (Cm)^m$ for some constant C and every positive integer m (i.e., the distributions of the entries decay at least as fast as a Gaussian distribution),
then Soshnikov’s extension theorem 14 implies that the ratio $(\lambda - \mu)/\sigma$, for some centering and scaling constants μ and σ that depend only on N and P, converges in distribution to the Tracy-Widom distribution, just as in the Wishart case. The ratio thus provides a way to assess the significance of the largest eigenvalue of a GRM and to quantify the presence of non-random population structure in the genotype data 15. (For example, using the Framingham dataset presented in the authors’ figure 3, one concludes that the dataset shows extreme population stratification, p < 2.2 × 10^−16.) Exact expressions for the density and the moments of the distribution of the smallest eigenvalue (in terms of polynomials, exponentials and hypergeometric functions) for a matrix with independent Gaussian entries have been derived, and, interestingly, the form of this distribution depends on whether P – N is odd or even 16. Additionally, the work of Edelman provides a closed form for the distribution of the condition number κ. Indeed, for Z with independent standard-Gaussian entries and large N 16, the limiting distribution of κ(Z) can be written explicitly, providing an asymptotic null distribution for κ(Z). The claims made by the authors concerning the stability of the GREML estimates, such as through their use of the skew in singular values (e.g., the “Largest Singular Value” of Figure 3 and the discussion thereof in the text), are, as currently presented, statistically problematic without consideration of what is expected under the null.
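The standardized largest-eigenvalue statistic can be sketched as follows. The text does not give the centering and scaling constants, so the sketch assumes Johnstone’s constants for the Gaussian (Wishart) case, with ZZ^T treated as an N-dimensional Wishart matrix with P degrees of freedom. The function name, the simulated matrices and the planted two-group structure are hypothetical. The sketch stops at the statistic itself; turning it into a p-value would require the Tracy-Widom distribution function, which is not computed here.

```python
import numpy as np

def largest_eigenvalue_statistic(z):
    """(lambda_max - mu) / sigma for Z Z^T, with assumed Gaussian-case constants."""
    n, p = z.shape
    lam = np.linalg.eigvalsh(z @ z.T)[-1]    # largest eigenvalue of Z Z^T
    a, b = np.sqrt(p - 1.0), np.sqrt(n)
    mu = (a + b) ** 2                        # assumed centering constant
    sigma = (a + b) * (1.0 / a + 1.0 / b) ** (1.0 / 3.0)   # assumed scaling constant
    return (lam - mu) / sigma

rng = np.random.default_rng(6)
n, p = 200, 500

# Null case: independent standard-Gaussian entries, no structure.
z_null = rng.standard_normal((n, p))

# Structured case: two groups of individuals shifted in opposite directions.
labels = np.where(np.arange(n) < n // 2, 1.0, -1.0)
shift = rng.standard_normal(p)
z_struct = z_null + 0.5 * np.outer(labels, shift)

t_null = largest_eigenvalue_statistic(z_null)
t_struct = largest_eigenvalue_statistic(z_struct)
```

Under the null the statistic is of order one, whereas the planted group structure pushes it far into the upper tail. This is the sense in which a large eigenvalue needs to be judged against its null expectation.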
Conclusions
We investigated the properties of the log-likelihood to evaluate the dependence of the GREML estimate on phenotype perturbation and on the spectral properties of the standardized genotype matrix. We showed the continuity of the singularity index and the induced quadratic form as functions of the standardized genotype matrix and the phenotype vector, supporting the stability of the log-likelihood under perturbation. Furthermore, we derived an explicit expression for the rate of change in the log-likelihood with respect to the phenotype vector. We examined the sensitivity to changes in the singular values, showing that the authors’ claims regarding the impact of sampling variability for near-zero singular values on the GREML estimate were based on an analytic error (and indeed assumed an incorrect view of the structure of genetic relatedness under population stratification). (It should be noted that the observation that population structure, which may be reflected in the largest eigenvalues of the GRM, may confound heritability estimation, and must thus be adjusted for, has been repeatedly discussed and investigated 17,18.) Finally, we investigated a methodological gap in the authors’ study and highlighted an approach to address it, which may be of broad interest to methods development in population genetics and genome-wide association analysis.
Author contributions
ERG designed the study, performed the research and wrote the paper. DSP performed the research. Both authors reviewed the final manuscript.