Abstract
Estimating the polygenicity (proportion of causally associated single nucleotide polymorphisms (SNPs)) and discoverability (effect size variance) of causal SNPs for human traits is currently of considerable interest. SNP-heritability is proportional to the product of these quantities. We present a basic model, using detailed linkage disequilibrium structure from an extensive reference panel, to estimate these quantities from genome-wide association studies (GWAS) summary statistics. We apply the model to diverse phenotypes and validate the implementation with simulations. We find model polygenicities ranging from ≃ 2 × 10⁻⁵ to ≃ 4 × 10⁻³, with discoverabilities similarly ranging over two orders of magnitude. A power analysis allows us to estimate the proportions of phenotypic variance explained additively by causal SNPs reaching genome-wide significance at current sample sizes, and map out sample sizes required to explain larger portions of additive SNP heritability. The model also allows for estimating residual inflation (or deflation from over-correcting of z-scores), and assessing compatibility of replication and discovery GWAS summary statistics.
INTRODUCTION
The genetic components of complex human traits and diseases arise from hundreds to likely many thousands of single nucleotide polymorphisms (SNPs) [1], most of which have weak effects. As sample sizes increase, more of the associated SNPs are identifiable (they reach genome-wide significance), though power for discovery varies widely across phenotypes. Of particular interest are estimating the proportion of common SNPs from a reference panel (polygenicity) involved in any particular phenotype; their effective strength of association (discoverability, or causal effect size variance); the proportion of variation in susceptibility, or phenotypic variation, captured additively by all common causal SNPs (approximately, the narrow sense heritability), and the fraction of that captured by genome-wide significant SNPs – all of which are active areas of research [2, 3, 4, 5, 6, 7, 8, 9]. The effects of population structure [10], combined with high polygenicity and linkage disequilibrium (LD), leading to spurious degrees of SNP association, or inflation, considerably complicate matters, and are also areas of much focus [11, 12, 13]. Despite these challenges, there have been recent significant advances in the development of mathematical models of polygenic architecture based on GWAS [14, 15]. One of the advantages of these models is that they can be used for power estimation in human phenotypes, enabling prediction of the capabilities of future GWAS.
Here, in a unified approach explicitly taking into account LD, we present a model relying on genome-wide association studies (GWAS) summary statistics (z-scores for SNP associations with a phenotype [16]) to estimate three quantities: polygenicity (π1, the proportion of causal variants in the underlying reference panel of approximately 11 million SNPs from a sample size of 503); discoverability (σβ², the causal effect size variance); and elevation of z-scores due to any residual inflation arising from variance distortion (σ0², which for example can be induced by cryptic relatedness), which remains a concern in large-scale studies [10]. We estimate π1, σβ², and σ0² by postulating a z-score probability distribution function (pdf) that explicitly depends on them, and fitting it to the actual distribution of GWAS z-scores.
Estimates of polygenicity and discoverability allow one to estimate compound quantities, like narrow-sense heritability captured by the SNPs [17]; to predict the power of larger-scale GWAS to discover genome-wide significant loci; and to understand why some phenotypes have higher power for SNP discovery and proportion of heritability explained than other phenotypes.
In previous work [18] we presented a related model that treated the overall effects of LD on z-scores in an approximate way. Here we take the details of LD explicitly into consideration, resulting in a conceptually more basic model to predict the distribution of z-scores. We apply the model to multiple phenotype datasets, in each case estimating the three model parameters and auxiliary quantities, including the overall inflation factor λ (traditionally referred to as genomic control [19]) and narrow sense heritability, h². We also perform extensive simulations on genotypes with realistic LD structure in order to validate the interpretation of the model parameters.
METHODS
Overview
Our basic model is a simple postulate for the distribution of causal effects (denoted β below) [20]. Our model assumes that only a fraction of all SNPs are in some sense causally related to any given phenotype. We work with a reference panel of approximately 11 million SNPs with 503 samples, and assume that all common causal SNPs (minor allele frequency (MAF) > 0.002) are contained in it. Any given GWAS will have z-scores for a subset (the “tag” SNPs) of these reference SNPs. When a z-score partially involves a latent causal component (i.e., not pure noise), we assume that it arises through LD with neighboring causal SNPs, or that it itself is causal.
We construct a pdf for z-scores that directly follows from the underlying distribution of effects. For any given tag SNP’s z-score, it is dependent on the other SNPs the focal SNP is in LD with, taking into account their LD with the focal SNP and their heterozygosity (i.e., it depends not just on the focal tag SNP’s total LD and heterozygosity, but also on the distribution of neighboring reference SNPs in LD with it and their heterozygosities). We present two ways of constructing the model pdf for z-scores, using multinomial expansion (in Supplementary Information), and using convolution. The former is perhaps more intuitive, but the latter is more numerically tractable, yielding an exact solution, and is used here to obtain all reported results. The problem then is finding the three model parameters that give a maximum likelihood best fit for the model’s prediction of the distribution of z-scores to the actual distribution of z-scores. Because we are fitting three parameters typically using ≳10⁶ data points, it is appropriate to incorporate some data reduction to facilitate the computations. To that end, we bin the data (z-scores) into a 10 × 10 grid of heterozygosity-by-total LD (having tested different grid sizes to ensure convergence of results). Also, when building the LD and heterozygosity structures of reference SNPs, we fine-grained the LD range (0 ≤ r² ≤ 1), again ensuring that bins were small enough that results were well converged. To fit the model to the data we bin the z-scores (within each heterozygosity/total LD window) and calculate the multinomial probability for having the actual distribution of z-scores (numbers of z-scores in the z-score bins) given the model pdf for the distribution of z-scores, and adjust the model parameters using a multidimensional unconstrained nonlinear minimization (Nelder-Mead), so as to maximize the likelihood of the data, given the parameters.
A visual summary of the predicted and actual distribution of z-scores is obtained by making quantile-quantile plots showing, for a wide range of significance thresholds going well beyond genome-wide significance, the proportion (x-axis) of tag SNPs exceeding any given threshold (y-axis) in the range. It is important also to assess the quantile-quantile sub-plots for SNPs in the heterozygosity-by-total LD grid elements (see Supplementary Material).
With the pdf in hand, various quantities can be calculated: the number of causal SNPs; the expected genetic effect (denoted δ below, where δ² is the noncentrality parameter of a Chi-squared distribution) at the current sample size for a tag SNP given the SNP’s z-score and its full LD and heterozygosity structure; the estimated SNP heritability, h²SNP (excluding contributions from rare reference SNPs, i.e., with MAF<0.2%); and the sample size required to explain any percentage of that with genome-wide significant SNPs. The model can easily be extended using a more complex distribution for the underlying β’s, with multiple-component mixtures for small and large effects, and incorporating selection pressure through both heterozygosity dependence on effect sizes and linkage disequilibrium dependence on the prior probability of a SNP’s being causal – issues we will address in future work.
The Model: Probability Distribution for Z-Scores
To establish notation, we consider a bi-allelic genetic variant, i, and let βi denote the effect size of allele substitution of that variant on a given quantitative trait. We assume a simple additive generative model (simple linear regression, ignoring covariates) relating genotype to phenotype [18, 21]. That is, assume a linear vector equation (no summation over repeated indices)

$$ y = g_i \beta_i + e_i \tag{1} $$

for phenotype vector y over N samples (mean-centered and normalized to unit variance), mean-centered genotype vector gi for the ith of n SNPs (vector over samples of the additively coded number of reference alleles for the ith variant), true fixed effect βi (regression coefficient) for the SNP, and residual vector ei containing the effects of all the other causal SNPs, the independent random environmental component, and random error. Variants with non-zero fixed effect βi are said to be “causal”. For SNP i, the estimated simple linear regression coefficient is

$$ \hat{\beta}_i = \frac{g_i^T y}{g_i^T g_i} = \frac{\mathrm{cov}(g_i, y)}{\mathrm{var}(g_i)} \tag{2} $$

where T denotes transpose and giᵀgi/N = var(gi) = Hi is the SNP’s heterozygosity (frequency of the heterozygous genotype): Hi = 2pi(1 − pi) where pi is the frequency of either of the SNP’s alleles.
Consistent with the work of others [11, 15], we assume the causal SNPs are distributed randomly throughout the genome (an assumption that can be relaxed when explicitly considering different SNP categories, but that in the main is consistent with the additive variation explained by a given part of the genome being proportional to the length of DNA [22]). In a Bayesian approach, we assume that the parameter β for a SNP has a distribution (in that specific sense, this is similar to a random effects model), representing subjective information on β, not a distribution across tangible populations [23]. Specifically, we posit a normal distribution for β with variance given by a constant, σβ²:

$$ \beta \sim \mathcal{N}(0, \sigma_\beta^2) \tag{3} $$
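This is also how the β are distributed across the set of causal SNPs. Therefore, taking into account all SNPs (the remaining ones are all null by definition), this is equivalent to the two-component Gaussian mixture model we originally proposed [20]

$$ \beta \sim \pi_1 \mathcal{N}(0, \sigma_\beta^2) + (1 - \pi_1)\, \mathcal{N}(0, 0) \tag{4} $$

where N(0, 0) is the Dirac delta function, so that considering all SNPs, the net variance is var(β) = π1σβ². If there is no LD (and assuming no source of spurious inflation), the association z-score for a SNP with heterozygosity H can be decomposed into a fixed effect δ and a residual random environment and error term, ϵ ∼ N(0, 1), which is assumed to be independent of δ [18]:

$$ z = \delta + \epsilon \tag{5} $$

with

$$ \delta \equiv \sqrt{N H}\, \beta \tag{6} $$

so that

$$ \mathrm{var}(z) = \mathrm{var}(\delta) + \mathrm{var}(\epsilon) = \sigma^2 + 1 \tag{7} $$

where

$$ \sigma^2 \equiv \mathrm{var}(\delta) = \sigma_\beta^2 N H \tag{8} $$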
By construction, under null, i.e., when there is no genetic effect, δ = 0, so that var(ϵ) = 1.
If there is no source of variance distortion in the sample, but there is a source of bias in the summary statistics (e.g., the sample is composed of two or more subpopulations with different allele frequencies for a subset of markers – pure population stratification in the sample [24]), the marginal distribution of an individual’s genotype at any of those markers will be inflated. The squared z-score for such a marker will then follow a noncentral chi-square distribution; the noncentrality parameter will contain the causal genetic effect, if any, but biased up or down (confounding or loss of power, depending on the relative sign of the genetic effect and the bias term). The effect of bias shifts, arising for example due to stratification, is nontrivial, and currently not explicitly in our model; it is usually accounted for using standard methods [25].
Variance distortion in the distribution of z-scores can arise from cryptic relatedness in the sample (drawn from a population mixture with at least one subpopulation with identical-by-descent marker alleles, but no population stratification) [19]. If zu denotes the uninflated z-scores, then the inflated z-scores are

$$ z = \sigma_0 z_u \tag{9} $$

where σ0 ≥ 1 characterizes the inflation. Thus, from Eq. 7, in the presence of inflation in the form of variance distortion

$$ \mathrm{var}(z) = \sigma_0^2\, \mathrm{var}(z_u) = \sigma_0^2 (\sigma^2 + 1) \equiv \tilde{\sigma}^2 + \sigma_0^2 \tag{10} $$

where σ̃² ≡ σ0²σ², so that σ̃β² = σ0²σβ² and ϵ̃ ∼ N(0, σ0²). In the presence of variance distortion one is dealing with inflated random variables β̃ ∼ N(0, σ̃β²), but we will drop the tilde on the β’s in what follows.
Since variance distortion leads to scaled z-scores [19], we can allow for this effect in some of the extremely large data sets and assess the ability of the model to detect this inflation by artificially inflating the z-scores (Eq. 9), and checking that the inflated σ0² is estimated correctly while the other parameter estimates remain unchanged.
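Implicit in Eq. 8 is approximating the denominator, 1 − q², of the χ² statistic noncentrality parameter to be 1, where q² is the proportion of phenotypic variance explained by the causal variant, i.e., q² = Hβ². So a more correct δ is

$$ \delta = \frac{\sqrt{N}\, q}{\sqrt{1 - q^2}} \tag{11} $$

Taylor expanding in q and then taking the variance gives

$$ \mathrm{var}(\delta) = \sigma_\beta^2 N H \left[ 1 + 3 \sigma_\beta^2 H + O\!\left(\sigma_\beta^4 H^2\right) \right] \tag{12} $$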
The additional terms will be vanishingly small and so do not contribute in a distributional sense; (quasi-) Mendelian or outlier genetic effects represent an extreme scenario where the model is not expected to be accurate, but SNPs for such traits are by definition easily detectable. So Eq. 8 remains valid for the polygenicity of complex traits.
Now consider the effects of LD on z-scores. The simple linear regression coefficient estimate for tag SNP i, β̂i, and hence the GWAS z-score, implicitly incorporates contributions due to LD with neighboring causal SNPs. (A tag SNP is a SNP with a z-score, imputed or otherwise; generally these will compose a smaller set than that available in reference panels like 1000 Genomes used here for calculating the LD structure of tag SNPs.) In Eq. 1,

$$ e_i = \sum_{j \neq i} g_j \beta_j + \varepsilon $$

where gj is the genotype vector for SNP j, βj is its true regression coefficient, and ε is the independent true environmental and error residual vector (over the N samples). Thus, explicitly including all causal true β’s, Eq. 2 becomes

$$ \hat{\beta}_i = \sum_j \frac{g_i^T g_j}{N H_i}\, \beta_j + \frac{g_i^T \varepsilon}{N H_i} \equiv \beta'_i + \varepsilon'_i \tag{13} $$

(the sum over j now includes SNP i itself). This is the simple linear regression expansion of the estimated regression coefficient for SNP i in terms of the independent latent (true) causal effects and the latent environmental (plus error) component; β′i is the effective simple linear regression expression for the true genetic effect of SNP i, with contributions from neighboring causal SNPs mediated by LD. Note that giᵀgj/N is simply cov(gi,gj), the covariance between genotypes for SNPs i and j. Since correlation is covariance normalized by the variances, β′i in Eq. 13 can be written as

$$ \beta'_i = \sum_j \sqrt{\frac{H_j}{H_i}}\, r_{ij}\, \beta_j \tag{14} $$

where rij is the correlation between genotypes at SNP j and tag SNP i. Then, from Eq. 5, the z-score for the tag SNP’s association with the phenotype is given by:

$$ z_i = \delta_i + \epsilon_i = \sqrt{N H_i}\, \beta'_i + \epsilon_i = \sqrt{N} \sum_j \sqrt{H_j}\, r_{ij}\, \beta_j + \epsilon_i \tag{15} $$
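The LD-weighted sum behind a tag SNP's expected effect can be illustrated with a small numerical sketch. It assumes the z-score takes the form δi = √N Σj √Hj rij βj, which follows from writing cov(gi, gj) = rij√(HiHj); the function names, the three-SNP block, and all numerical values are hypothetical and chosen only for illustration.

```python
import math

def effective_beta(i, H, r, beta):
    """Effective coefficient for tag SNP i: sum_j sqrt(H_j / H_i) * r_ij * beta_j."""
    return sum(math.sqrt(H[j] / H[i]) * r[i][j] * beta[j] for j in range(len(beta)))

def expected_effect(i, N, H, r, beta):
    """Genetic part of the z-score: sqrt(N) * sum_j sqrt(H_j) * r_ij * beta_j."""
    return math.sqrt(N) * sum(
        math.sqrt(H[j]) * r[i][j] * beta[j] for j in range(len(beta))
    )

# Hypothetical three-SNP block: only SNP 1 is causal; tag SNP 0 has r = 0.8 with it.
N = 100_000
H = [0.42, 0.32, 0.18]            # heterozygosities
r = [[1.0, 0.8, 0.1],
     [0.8, 1.0, 0.2],
     [0.1, 0.2, 1.0]]             # genotype correlations r_ij
beta = [0.0, 0.01, 0.0]           # true effects

delta0 = expected_effect(0, N, H, r, beta)
print(f"expected effect at the non-causal tag SNP: {delta0:.4f}")
```

Although SNP 0 has β = 0, it acquires a nonzero expected effect purely through its correlation with the causal SNP, and the two equivalent forms (via the effective coefficient, or via the direct sum) agree.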
We noted that in the absence of LD, the distribution of the residual in Eq. 5 is assumed to be univariate normal. But in the presence of LD (Eq. 15) there are induced correlations, so the appropriate extension would be multivariate normal for ϵi. A limitation of the present work is that we do not consider this complexity. This may account for the relatively minor misfit in the simulation results for cases of high polygenicity – see below.
Thus, for example, if the SNP itself is not causal but is in LD with k causal SNPs that all have heterozygosity H, and where its LD with each of these is the same, given by some value r² (0 < r² ≤ 1), then σ̃² in Eq. 10 will be given by (with the tilde on σβ² dropped, as noted above)

$$ \tilde{\sigma}^2 = k\, r^2 \sigma_\beta^2 N H \tag{16} $$
For this idealized case, the marginal distribution, or pdf, of z-scores for a set of such associated SNPs is

$$ \mathrm{pdf}(z;\, N, \mathcal{H}, \sigma_\beta, \sigma_0) = \phi(z;\, 0,\, \tilde{\sigma}^2 + \sigma_0^2) \tag{17} $$

where ϕ(.;μ,σ²) is the normal distribution with mean μ and variance σ², and ℋ is shorthand for the LD and heterozygosity structure of such SNPs (in this case, denoting exactly k causals with LD given by r² and heterozygosity given by H). If a proportion α of all tag SNPs are similarly associated with the phenotype while the remaining proportion are all null (not causal and not in LD with causal SNPs), then the marginal distribution for all SNP z-scores is the Gaussian mixture

$$ \mathrm{pdf}(z) = \alpha\, \phi(z;\, 0,\, \tilde{\sigma}^2 + \sigma_0^2) + (1 - \alpha)\, \phi(z;\, 0,\, \sigma_0^2) \tag{18} $$

dropping the parameters for convenience.
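A minimal sketch of this idealized two-component mixture follows. It assumes that the associated component has variance k r² N H σβ² + σ0² and the null component has variance σ0², with σβ² taken to be the already-inflated value; the function names and the parameter values are hypothetical.

```python
import math

def normal_pdf(z, var):
    """Zero-mean normal density with variance `var`."""
    return math.exp(-0.5 * z * z / var) / math.sqrt(2.0 * math.pi * var)

def idealized_mixture_pdf(z, alpha, k, r2, N, H, sigma_beta2, sigma0_2=1.0):
    """Mixture of an 'associated' component (k causal SNPs, each in LD r2 with
    the tag SNP and with heterozygosity H) and a null component."""
    sigma2 = k * r2 * N * H * sigma_beta2          # LD-mediated genetic variance
    associated = normal_pdf(z, sigma2 + sigma0_2)
    null = normal_pdf(z, sigma0_2)
    return alpha * associated + (1.0 - alpha) * null

# Hypothetical parameter values, for illustration only.
params = dict(alpha=0.1, k=2, r2=0.5, N=100_000, H=0.3, sigma_beta2=1e-4)

# Riemann-sum check that the mixture integrates to one.
step = 0.01
total = step * sum(idealized_mixture_pdf(-40.0 + i * step, **params) for i in range(8001))
print(f"integral of pdf over [-40, 40]: {total:.6f}")
```

The associated component fattens the tails relative to the null: at z = 5 the mixture density exceeds the pure-null density by several orders of magnitude, which is what produces the departure from the null line in a QQ plot.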
For real genotypes, however, the LD and heterozygosity structure is far more complicated, and of course the causal SNPs are generally numerous and unknown. Thus, more generally, for each tag SNP ℋ will be a two-dimensional histogram over LD (r²) and heterozygosity (H), each grid element giving the number of SNPs falling within the edges of that (r², H) bin. Alternatively, for each tag SNP it can be built as two one-dimensional histograms, one giving the LD structure (counts of neighboring SNPs in each LD r² bin), and the other giving, for each r² bin, the mean heterozygosity for those neighboring SNPs, which should be accurate for sufficiently fine binning. We use the latter in what follows. We present two consistent ways of expressing the a posteriori pdf for z-scores, based on multinomial expansion and on convolution, that provide complementary views. The multinomial approach (see Supplementary Material) perhaps gives a more intuitive feel for the problem, but the convolution approach is considerably more tractable numerically and is used here to obtain all reported results.
Model PDF: Convolution
From Eq. 15, there exists an efficient procedure that allows for accurate calculation of a z-score’s a posteriori pdf (given the SNP’s heterozygosity and LD structure, and the phenotype’s model parameters). Any GWAS z-score is a sum of unobserved random variables (LD-mediated contributions from neighboring causal SNPs, and the additive environmental component), and the pdf for such a composite random variable is given by the convolution of the pdfs for the component random variables. Since convolution is associative, and the Fourier transform of the convolution of two functions is just the product of the individual Fourier transforms of the two functions, one can obtain the a posteriori pdf for z-scores as the inverse Fourier transform of the product of the Fourier transforms of the individual random variable components.
From Eq. 15 z is a sum of correlation- and heterozygosity-weighted random variables {βj} and the random variable ϵ, where {βj} denotes the set of true causal parameters for each of the SNPs in LD with the tag SNP whose z-score is under consideration. The Fourier transform F(k) of a Gaussian f(x) = c × exp(−ax²) is F(k) = c √(π/a) × exp(−π²k²/a). From Eq. 4, for each SNP j in LD with the tag SNP (1 ≤ j ≤ b, where b is the tag SNP’s block size),

$$ \sqrt{N H_j}\, r_j\, \beta_j \sim \pi_1\, \mathcal{N}\!\left(0,\, N H_j r_j^2 \sigma_\beta^2\right) + (1 - \pi_1)\, \mathcal{N}(0, 0) \tag{19} $$
The Fourier transform (with variable k – see below) of the first term on the right hand side is

$$ F(k) = \pi_1 \exp\!\left(-2 \pi^2 k^2 N H_j r_j^2 \sigma_\beta^2\right) \tag{20} $$

while that of the second term is simply (1 − π1). Additionally, the environmental term is ϵ ∼ N(0, σ0²) (ignoring LD-induced correlation, as noted earlier), and its Fourier transform is exp(−2π²σ0²k²). For each tag SNP, one could construct the a posteriori pdf based on these Fourier transforms. However, it is more practical to use a coarse-grained representation of the data. Thus, in order to fit the model to a data set, we bin the tag SNPs whose z-scores comprise the data set into a two-dimensional heterozygosity/total LD grid (whose elements we denote “H-L” bins), and fit the model with respect to this coarse gridding instead of with respect to every individual tag SNP z-score; in the section “Parameter Estimation” below we describe using a 10 × 10 grid. Additionally, for each H-L bin the LD r² and heterozygosity histogram structure for each tag SNP is built, using wmax equally-spaced r² bins for r²min ≤ r² ≤ 1; wmax = 20 is large enough to allow for converged results; r²min = 0.05 is generally small enough to capture true causal associations in weak LD while large enough to exclude spurious contributions to the pdf arising from estimates of r² that are non-zero due to noise. This points up a minor limitation of the model stemming from the small reference sample size (NR = 503 for 1000 Genomes) from which ℋ is built. Larger NR would allow for more precision in handling very low LD (r² < 0.05), but this is an issue only for situations with extremely large σβ² (high heritability with low polygenicity) that we do not encounter for the 16 phenotypes we analyze here. In any case, this can be calibrated for using simulations.
For any H-L bin with mean heterozygosity H and mean total LD L there will be an average LD and heterozygosity structure with a mean breakdown for the tag SNPs having nw SNPs (not all of which necessarily are tag SNPs) with LD r² in the wth r² bin whose average heterozygosity is Hw. Thus, one can re-express z-scores for an H-L bin as

$$ z = \sqrt{N} \sum_{w=1}^{w_{\max}} \sqrt{H_w}\, r_w \sum_{j=1}^{n_w} \beta_j + \epsilon $$

where βj and ϵ are unobserved random variables.
In the spirit of the discrete Fourier transform (DFT), discretize the set of possible z-scores into the ordered set of n (equal to a power of 2) values z1, …, zn with equal spacing between neighbors given by ∆z (zn = −z1 − ∆z, and zn/2+1 = 0). Taking z1 = −38 allows for the minimum p-values of 5.8 × 10⁻³¹⁶ (near the numerical limit); with n = 2¹⁰, ∆z = 0.0742. Given ∆z, the Nyquist critical frequency is fc = 1/(2∆z), so we consider the Fourier transform function for the z-score pdf at n discrete values k1, …, kn, with equal spacing between neighbors given by ∆k, where k1 = − fc (kn = − k1 − ∆k, and kn/2+1 = 0; the DFT pair ∆z and ∆k are related by ∆z∆k = 1/n). Define Aw ≡ −2π²NHwrw²σβ² (see Eq. 20). Then the product (over r² bins) of Fourier transforms for the genetic contribution to z-scores, denoted Gj ≡ G(kj), is

$$ G_j = \prod_{w=1}^{w_{\max}} \left[ \pi_1 \exp\!\left(A_w k_j^2\right) + (1 - \pi_1) \right]^{n_w} $$

and the Fourier transform of the environmental contribution, denoted Ej ≡ E(kj), is

$$ E_j = \exp\!\left(-2 \pi^2 \sigma_0^2 k_j^2\right) $$
Let Fz = (G1E1, …, GnEn) denote the vector of products of Fourier transform values, and let ℱ⁻¹ denote the inverse Fourier transform operator. Then, the vector of pdf values for z-score bins (indexed by i) in the H-L bin with mean LD and heterozygosity structure ℋ, pdfz = (f1, …, fn) where fi ≡ pdf(zi|ℋ), is

$$ \mathbf{pdf}_z = \mathcal{F}^{-1}\!\left[\mathbf{F}_z\right] \tag{25} $$
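The convolution procedure can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: it assumes the genetic transform is G(k) = Πw [π1 exp(−2π²k²NHwrw²σβ²) + (1 − π1)]^nw and the environmental transform is E(k) = exp(−2π²σ0²k²), it uses a coarser grid (n = 2⁸) than the n = 2¹⁰ quoted in the text, and it replaces the fast Fourier transform with a direct O(n²) cosine sum so that it runs with the standard library alone. All function and argument names are hypothetical.

```python
import math

def z_and_k_grids(n=256, z1=-38.0):
    """Paired grids: dz * dk = 1/n, Nyquist frequency fc = 1/(2 dz)."""
    dz = -2.0 * z1 / n
    fc = 1.0 / (2.0 * dz)
    dk = 1.0 / (n * dz)
    zs = [z1 + j * dz for j in range(n)]
    ks = [-fc + m * dk for m in range(n)]
    return zs, ks, dz, dk

def model_pdf(N, pi1, sigma_beta2, sigma0_2, ld_hist, n=256, z1=-38.0):
    """pdf of z on the grid; ld_hist is a list of (n_w, r2_w, H_w) tuples giving,
    per r^2 bin, the SNP count, the bin's r^2, and the mean heterozygosity."""
    zs, ks, dz, dk = z_and_k_grids(n, z1)
    Fz = []
    for k in ks:
        G = 1.0
        for n_w, r2_w, H_w in ld_hist:
            causal = pi1 * math.exp(-2.0 * math.pi ** 2 * k * k * N * H_w * r2_w * sigma_beta2)
            G *= (causal + (1.0 - pi1)) ** n_w
        E = math.exp(-2.0 * math.pi ** 2 * sigma0_2 * k * k)
        Fz.append(G * E)
    # The transform is real and even, so the inverse reduces to a cosine sum.
    pdf = [dk * sum(F * math.cos(2.0 * math.pi * k * z) for F, k in zip(Fz, ks)) for z in zs]
    return zs, pdf, dz

# Sanity case: one SNP in perfect LD, always causal -> a single Gaussian.
zs, pdf, dz = model_pdf(N=100_000, pi1=1.0, sigma_beta2=1e-5, sigma0_2=1.0,
                        ld_hist=[(1, 1.0, 0.5)])
print(f"pdf at z = 0: {pdf[len(zs) // 2]:.6f}")
```

In the sanity case the result should match a normal density with variance σ0² + NHσβ² = 1.5, which gives a quick check that the grid and transform conventions are mutually consistent. A production version would use an FFT routine in place of the cosine sum.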
Data Preparation
For real phenotypes, we calculated SNP minor allele frequency (MAF) and LD between SNPs using the 1000 Genomes phase 3 data set for 503 subjects/samples of European ancestry [26, 27, 28]. For simulations, we used HAPGEN2 [29, 30, 31] to generate genotypes; we calculated SNP MAF and LD structure from 1000 simulated samples. We elected to use the same intersecting set of SNPs for real data and simulation. For HAPGEN2, we eliminated SNPs with MAF<0.002; for 1000 Genomes, we eliminated SNPs for which the call rate (percentage of samples with useful data) was less than 90%. This left nsnp=11,015,833 SNPs.
Sequentially moving through each chromosome in contiguous blocks of 5,000 SNPs, for each SNP in the block we calculated its Pearson r² correlation coefficients with all SNPs in the central block itself and with all SNPs in the pair of flanking blocks of size up to 50,000 each. For each SNP we calculated its total LD (TLD), given by the sum of LD r²’s thresholded such that if r² < r²min (with r²min = 0.05) we set that r² to zero. For each SNP we also built a histogram giving the numbers of SNPs in wmax equally-spaced r²-windows covering the range r²min ≤ r² ≤ 1. These steps were carried out independently for 1000 Genomes phase 3 and for HAPGEN2 (for the latter, we used 1000 simulated samples).
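The per-SNP bookkeeping described here (thresholded total LD plus an r² histogram) can be sketched as follows. The threshold value of 0.05 and the bin count of 20 are taken as assumptions suggested by the text's mention of r² < 0.05 and wmax = 20; the function name and the example r² values are hypothetical.

```python
def total_ld_and_histogram(r2_values, r2_min=0.05, w_max=20):
    """Return (TLD, counts) for one SNP, given its r^2 with neighboring SNPs.

    r^2 values below r2_min are treated as zero (excluded from both outputs);
    the rest are counted in w_max equally spaced bins covering [r2_min, 1].
    """
    width = (1.0 - r2_min) / w_max
    counts = [0] * w_max
    tld = 0.0
    for r2 in r2_values:
        if r2 < r2_min:
            continue                      # thresholded to zero
        tld += r2
        w = min(int((r2 - r2_min) / width), w_max - 1)   # r2 = 1 goes in the last bin
        counts[w] += 1
    return tld, counts

# Hypothetical SNP: itself (r2 = 1), two real neighbors, two noise-level values.
tld, counts = total_ld_and_histogram([1.0, 0.9, 0.5, 0.04, 0.01])
print(tld, counts)
```

The SNP's own r² = 1 is included here, so an isolated SNP has TLD equal to 1; whether the original pipeline counts the self term is not stated in this section and is an assumption of the sketch.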
Employing a similar procedure, we also built binary (logical) LD matrices identifying all pairs of SNPs for which LD r2 > 0.8, a liberal threshold for SNPs being “synonymous”.
In applying the model to summary statistics, we calculated histograms of TLD and LD block size (using 100 bins in both cases) and ignored SNPs whose TLD or block size was so large that their frequency was less than a hundredth of the respective histogram peak; typically this amounted to restricting to SNPs for which TLD ≤ 600 and LD block size ≤ 1,500. We also ignored summary statistics of SNPs for which MAF ≤ 0.01.
We analyzed summary statistics for sixteen phenotypes (in what follows, where sample sizes varied by SNP, we quote the median value): (1) major depressive disorder (Ncases = 59,851, Ncontrols = 113,154) [32]; (2) bipolar disorder (Ncases = 20,352, Ncontrols = 31,358) [33]; (3) schizophrenia (Ncases = 35,476, Ncontrols = 46,839) [34]; (4) coronary artery disease (Ncases = 60,801, Ncontrols= 123,504) [35]; (5) ulcerative colitis (Ncases = 12,366, Ncontrols = 34,915) and (6) Crohn’s disease (Ncases = 12,194, Ncontrols = 34,915) [36]; (7) late onset Alzheimer’s disease (LOAD; Ncases = 17,008, Ncontrols = 37,154) [37] (in the Supplementary Material we present results for a more recent GWAS with Ncases = 71,880 and Ncontrols= 383,378 [38]); (8) amyotrophic lateral sclerosis (ALS) (Ncases = 12,577, Ncontrols = 23,475) [39]; (9) number of years of formal education (N = 293,723) [40]; (10) intelligence (N = 262,529) [41, 42]; (11) body mass index (N = 233,554) [43]; (12) height (N = 251,747) [44]; (13) putamen volume (normalized by intracranial volume, N = 11,598) [45]; (14) low- (N = 89,873) and (15) high-density lipoprotein (N = 94,295) [46]; and (16) total cholesterol (N = 94,579) [46]. Most participants were of European ancestry.
For height, we focused on the 2014 GWAS [44], not the more recent 2018 GWAS [47], although we also report below model results for the latter. There are issues pertaining to population structure in the various height GWAS [48, 49], and the 2018 GWAS is a combination of GIANT and UKB GWAS, so some caution is warranted in interpreting results for these data.
For the ALS GWAS data, there is very little signal outside chromosome 9: the data QQ plot essentially tracks the null distribution straight line. The QQ plot for chromosome 9, however, shows a significant departure from the null distribution. Of 471,607 SNPs on chromosome 9 a subset of 273,715 have z-scores, of which 107 are genome-wide significant, compared with 114 across the full genome. Therefore, we restrict ALS analysis to chromosome 9.
For schizophrenia, for example, there were 6,610,991 SNPs with finite z-scores out of the 11,015,833 SNPs from the 1000 Genomes reference panel that underlie the model; the genomic control factor for these SNPs was λGC = 1.466. Of these, 314,857 were filtered out due to low MAF or very large LD block size. The genomic control factor for the remaining SNPs was λGC = 1.468; for the pruned subsets, with ≃ 1.49 × 10⁶ SNPs each, it was λ = 1.30. (Note that genomic control values for pruned data are always lower than for unpruned data.)
A limitation in the current work is that we have not taken account of imputation inaccuracy, where lower MAF SNPs are, through lower LD, less certain. Thus, the effects from lower MAF causal variants will be noisier than for higher MAF variants.
Simulations
We generated genotypes for 10⁵ unrelated simulated samples using HAPGEN2 [31]. For narrow-sense heritability h² equal to 0.1, 0.4, and 0.7, we considered polygenicity π1 equal to 10⁻⁵, 10⁻⁴, 10⁻³, and 10⁻². For each of these 12 combinations, we randomly selected ncausal = π1 × nsnp “causal” SNPs and assigned them β-values drawn from the standard normal distribution (i.e., independent of H), with all other SNPs having β = 0. We repeated this ten times, giving ten independent instantiations of random vectors of β’s. Defining YG = Gβ, where G is the genotype matrix and β here is the vector of true coefficients over all SNPs, the total phenotype vector is constructed as Y = YG + ε, where the residual random vector ε for each instantiation is drawn from a normal distribution such that h² = var(YG)/var(Y). For each of the instantiations this implicitly defines the “true” value σβ².
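The phenotype construction can be sketched at toy scale. The sketch below is not the HAPGEN2 pipeline: it assumes independent SNPs (no LD) with genotypes drawn under Hardy-Weinberg proportions, and uses far fewer samples and SNPs than the study; all names and values are hypothetical. The residual standard deviation is chosen so that var(YG)/var(Y) is approximately h².

```python
import random

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def simulate_phenotype(genotypes, betas, h2, rng):
    """Y = Y_G + eps, with eps ~ N(0, var(Y_G) * (1 - h2) / h2)."""
    y_g = [sum(g * b for g, b in zip(row, betas)) for row in genotypes]
    sd_e = (variance(y_g) * (1.0 - h2) / h2) ** 0.5
    y = [v + rng.gauss(0.0, sd_e) for v in y_g]
    return y, y_g

rng = random.Random(1)
n_samples, n_snps, pi1, h2 = 2000, 100, 0.1, 0.4

# Toy genotypes: allele counts 0/1/2 for independent SNPs (an assumption).
freqs = [rng.uniform(0.05, 0.5) for _ in range(n_snps)]
genotypes = [[(rng.random() < p) + (rng.random() < p) for p in freqs]
             for _ in range(n_samples)]

# Randomly chosen causal SNPs get standard-normal effects; the rest are null.
n_causal = round(pi1 * n_snps)
causal = set(rng.sample(range(n_snps), n_causal))
betas = [rng.gauss(0.0, 1.0) if j in causal else 0.0 for j in range(n_snps)]

y, y_g = simulate_phenotype(genotypes, betas, h2, rng)
h2_realized = variance(y_g) / variance(y)
print(f"target h2 = {h2}, realized h2 = {h2_realized:.3f}")
```

The realized ratio differs slightly from the target because the sampled residual has finite-sample variance and a small chance correlation with YG; both shrink as the number of samples grows.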
The sample simple linear regression slope, β̂, and the Pearson correlation coefficient, r̂, are assumed to be t-distributed. These quantities have the same t-value: t = r̂√(N − 2)/√(1 − r̂²) = β̂/se(β̂), with corresponding p-value from Student’s t cumulative distribution function (cdf) with N − 2 degrees of freedom: p = 2×tcdf(−|t|, N − 2) (see Supplementary Material). Since we are not here dealing with covariates, we calculated p from correlation, which is slightly faster than from estimating the regression coefficient. The t-value can be transformed to a z-value, giving the z-score for this p: z = −Φ⁻¹(p/2) × sign(r̂), where Φ is the normal cdf (z and t have the same p-value).
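The two deterministic pieces of this transformation can be sketched with the standard library: the t-value from a sample correlation, and the signed z-score from a two-sided p-value. The sketch assumes the usual relations t = r√(N − 2)/√(1 − r²) and z = −Φ⁻¹(p/2) × sign(r); the intermediate step (Student's t cdf) is omitted because the standard library does not provide it. Function names are hypothetical.

```python
import math
from statistics import NormalDist

def t_from_r(r, N):
    """t-value for a sample Pearson correlation r from N samples."""
    return r * math.sqrt(N - 2) / math.sqrt(1.0 - r * r)

def z_from_p(p, sign):
    """Signed z-score whose two-sided normal p-value equals p."""
    magnitude = -NormalDist().inv_cdf(p / 2.0)
    return math.copysign(magnitude, sign)

print(f"t for r = 0.01, N = 10002: {t_from_r(0.01, 10002):.5f}")
print(f"z at genome-wide significance (p = 5e-8): {z_from_p(5e-8, +1):.3f}")
```

For the sample sizes typical of GWAS, the t distribution with N − 2 degrees of freedom is very close to the standard normal, so z and t are nearly identical; the transformation matters mainly for small N.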
Parameter Estimation
We randomly pruned SNPs using the threshold r2 > 0.8 to identify “synonymous” SNPs, performing ten such iterations. That is, for each of ten iterations, we randomly selected a SNP (not necessarily the one with largest z-score) to represent each subset of synonymous SNPs. For schizophrenia, for example, pruning resulted in approximately 1.3 million SNPs in each iteration.
The postulated pdf for a SNP’s z-score depends on the SNP’s LD and heterozygosity structure (histogram), ℋ. Given the data – the set of z-scores for available SNPs, as well as their LD and heterozygosity structure – and the ℋ-dependent pdf for z-scores, the objective is to find the model parameters that best predict the distribution of z-scores. We bin the SNPs with respect to a grid of heterozygosity and total LD; for any given H-L bin there will be a range of z-scores whose distribution the model is intended to predict. We find that a 10 × 10-grid of equally spaced bins is adequate for converged results. (Using equally-spaced bins might seem inefficient because of the resulting very uneven distribution of z-scores among grid elements – for example, orders of magnitude more SNPs in grid elements with low total LD compared with high total LD. However, the objective is to model the effects of H and L: using variable grid element sizes so as to maximize balance of SNP counts among grid elements means that the true H- and L-mediated effects of the SNPs in a narrow range of H and L get subsumed with the effects of many more SNPs in a much wider range of H and L – a misspecification of the pdf leading to some inaccuracy.) In lieu of or in addition to total LD (L) binning, one can bin SNPs with respect to their total LD block size (total number of SNPs in LD, ranging from 1 to ~1,500).
To find the model parameters that best fit the data, for a given H-L bin we binned the selected SNPs’ z-scores into equally-spaced bins of width dz=0.0742 (between zmin= − 38 and zmax=38, allowing for p-values near the numerical limit of 10⁻³¹⁶), and from Eq. 25 calculated the probability for z-scores to be in each of those z-score bins (the prior probability for “success” in each z-score bin). Then, knowing the actual numbers of z-scores (numbers of “successes”) in each z-score bin, we calculated the multinomial probability, pm, for this outcome. The optimal model parameter values will be those that maximize the accrual of this probability over all H-L bins. We constructed a cost function by calculating, for a given H-L bin, −ln(pm) and averaging over prunings, and then accumulating this over all H-L bins. Model parameters minimizing the cost were obtained from Nelder-Mead multidimensional unconstrained nonlinear minimization of the cost function, using the Matlab function fminsearch().
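The per-bin cost, −ln(pm), is the negative log of a multinomial probability and can be computed stably with log-gamma functions. The sketch below shows this for a single H-L bin; the function name, the probability floor, and the example counts are hypothetical, and the averaging over prunings and accumulation over H-L bins are left out.

```python
import math

def neg_log_multinomial(counts, probs, floor=1e-300):
    """-ln of the multinomial probability of observing `counts` z-scores in bins
    whose model probabilities are `probs` (assumed to sum to one)."""
    n = sum(counts)
    log_pm = math.lgamma(n + 1)                    # ln n!
    for c, p in zip(counts, probs):
        log_pm -= math.lgamma(c + 1)               # - ln c_i!
        if c > 0:
            log_pm += c * math.log(max(p, floor))  # + c_i ln p_i, guarding p = 0
    return -log_pm

# Two-bin check: one observation in each of two equiprobable bins has
# probability 2!/(1!1!) * 0.5 * 0.5 = 0.5, so the cost is ln 2.
print(neg_log_multinomial([1, 1], [0.5, 0.5]))

# The cost is lower when model probabilities match the observed proportions.
print(neg_log_multinomial([8, 2], [0.8, 0.2]), neg_log_multinomial([8, 2], [0.5, 0.5]))
```

A function of the three model parameters that sums this cost over H-L bins could then be passed to a Nelder-Mead minimizer; in Python, `scipy.optimize.minimize` with `method="Nelder-Mead"` plays the role that `fminsearch()` plays in Matlab.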
Posterior Effect Sizes
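Model posterior effect sizes, given z (along with N, ℋ, and the model parameters), were calculated using numerical integration over the random variable δ:

$$ \delta_{\mathrm{expected}} \equiv E(\delta \mid z) = \int P(\delta \mid z)\, \delta\, d\delta = \frac{1}{P(z)} \int P(z \mid \delta)\, P(\delta)\, \delta\, d\delta \tag{26} $$

Here, since z|δ ∼ N(δ, σ0²), the posterior probability of z given δ is simply

$$ P(z \mid \delta) = \phi(z;\, \delta,\, \sigma_0^2) \tag{27} $$

P(z) is shorthand for pdf(z|N, ℋ, π1, σβ, σ0), given by Eq. 25. P(δ) is calculated by a similar procedure that led to Eq. 25 but ignoring the environmental contributions {Ej}. Specifically, let Fδ = (G1, …, Gn) denote the vector of products of Fourier transform values. Then, the vector of pdf values for genetic effect bins (indexed by i; numerically, these will be the same as the z-score bins) in the H-L bin, pdfδ = (f1, …, fn) where fi ≡ pdf(δi|ℋ), is

$$ \mathbf{pdf}_\delta = \mathcal{F}^{-1}\!\left[\mathbf{F}_\delta\right] \tag{28} $$

Similarly,

$$ E(\delta^2 \mid z) = \frac{1}{P(z)} \int P(z \mid \delta)\, P(\delta)\, \delta^2\, d\delta \tag{29} $$

which is used in power calculations.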
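The numerical integration over δ can be sketched on a fixed grid. The sketch assumes Bayes' rule in the form E(δᵐ|z) = ∫P(z|δ)P(δ)δᵐdδ / ∫P(z|δ)P(δ)dδ with P(z|δ) normal with mean δ and variance σ0². For the prior P(δ) it substitutes a single Gaussian rather than the model's Fourier-derived pdf, purely so that the result can be checked against the closed-form Gaussian posterior; names and values are hypothetical.

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def posterior_moments(z, deltas, prior, sigma0_2=1.0):
    """Return (E[delta | z], E[delta^2 | z]) by summation on an equally spaced
    grid `deltas`, where `prior` holds P(delta) evaluated on that grid."""
    weights = [normal_pdf(z, d, sigma0_2) * p for d, p in zip(deltas, prior)]
    norm = sum(weights)                              # proportional to P(z)
    m1 = sum(w * d for w, d in zip(weights, deltas)) / norm
    m2 = sum(w * d * d for w, d in zip(weights, deltas)) / norm
    return m1, m2

# Stand-in prior: delta ~ N(0, 4). Closed form for z = 5, sigma0^2 = 1:
# E[delta|z] = z * 4/5 = 4 and E[delta^2|z] = 4/5 + 4^2 = 16.8.
step = 0.05
deltas = [-38.0 + i * step for i in range(1521)]
prior = [normal_pdf(d, 0.0, 4.0) for d in deltas]
m1, m2 = posterior_moments(5.0, deltas, prior)
print(f"E[delta|z=5] = {m1:.4f}, E[delta^2|z=5] = {m2:.4f}")
```

The posterior mean is shrunk toward zero relative to the observed z, which is the familiar winner's-curse correction; with the model's mixture prior the shrinkage is stronger for small |z| and weaker for large |z|.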
GWAS Replication
A related matter has to do with whether z-scores for SNPs reaching genome-wide significance in a discovery-sample are compatible with the SNPs’ z-scores in a replication-sample, particularly if any of those replication-sample z-scores are far from reaching genome-wide significance, or whether any apparent mismatch signifies some overlooked inconsistency. The model pdf allows one to make a principled statistical assessment in such cases. We present the details for this application, and results applied to studies of bipolar disorder, in the Supplementary Material.
GWAS Power
Chip heritability, h2, is the proportion of phenotypic variance that in principle can be captured additively by the nsnp SNPs under study [17]. It is of interest to estimate the proportion of h2 that can be explained by SNPs reaching genome-wide significance, p≤5×10−8 (i.e., for which |z|>zt=5.45), at a given sample size [58, 59]. In Eq. 1, for SNP i with genotype vector gi over N samples, let ygi ≡ giβi. If the SNP’s heterozygosity is Hi, then $\mathrm{var}(y_{g_i}) = \beta_i^2 H_i$. If we knew the full set {βi} of true β-values, then, for z-scores from a particular sample size N, the proportion of SNP heritability captured by genome-wide significant SNPs, A(N), would be given by
$$A(N) = \frac{\sum_{i:\,|z_i|>z_t} \beta_i^2 H_i}{\sum_{\text{all } i} \beta_i^2 H_i}.$$
Now, from Eq. 15, . If SNP i is causal and sufficiently isolated so that it is not in LD with other causal SNPs, then , and . When all causal SNPs are similarly isolated, Eq. 30 becomes
Of course, the true βi are not known and some causal SNPs will likely be in LD with others. Furthermore, due to LD with causal SNPs, many SNPs will have a nonzero (latent or unobserved) effect size, δ. Nevertheless, we can formulate an approximation to A(N) which, assuming the pdf for z-scores (Eq. 25) is reasonable, will be inaccurate to the degree that the average LD structure of genome-wide significant SNPs differs from the overall average LD structure. As before (see the subsection “Model PDF: Convolution”), consider a fixed set of n equally-spaced nominal z-scores covering a wide range of possible values (changing from the summations in Eq. 31 to the uniform summation spacing ∆z now requires bringing the probability density into the summations). For each z from the fixed set (and, as before, employing data reduction by averaging so that H and L denote values for the 10 × 10 grid), use E(δ2|z, N, H, L) given in Eq. 29 to define (emphasizing dependence on N, H, and L). Then, for any N, A(N) can be estimated by where ΣH,L denotes sum over the H-L grid elements. The ratio in Eq. 33 should be accurate if the average effects of LD in the numerator and denominator cancel – which will always be true as the ratio approaches 1 for large N. Plotting A(N) gives an indication of the power of future GWAS to capture chip heritability.
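The idealized case of isolated causal SNPs with known effects can be sketched as follows. This is not the estimator of Eq. 33: it assumes, for illustration only, that an isolated causal SNP has δ = √(N·H)·β and z ~ Normal(δ, σ0²), and it replaces the indicator |z| > zt by its probability, giving an expected captured fraction. The function names are hypothetical.

```python
import math

Z_T = 5.45  # threshold on |z| for genome-wide significance, as in the text

def prob_significant(delta, sigma0=1.0, z_t=Z_T):
    """P(|z| > z_t) when z ~ Normal(delta, sigma0^2)."""
    scale = sigma0 * math.sqrt(2.0)
    upper = 0.5 * math.erfc((z_t - delta) / scale)  # P(z > z_t)
    lower = 0.5 * math.erfc((z_t + delta) / scale)  # P(z < -z_t)
    return upper + lower

def expected_fraction_captured(betas, hets, n_samples, sigma0=1.0):
    """Expected share of sum(beta^2 * H) carried by SNPs with |z| > z_t."""
    total = 0.0
    captured = 0.0
    for beta, het in zip(betas, hets):
        var_explained = beta * beta * het
        delta = math.sqrt(n_samples * het) * beta  # isolated-SNP assumption
        total += var_explained
        captured += var_explained * prob_significant(delta, sigma0)
    return captured / total
```

Evaluating `expected_fraction_captured` over a range of `n_samples` traces a curve that rises toward 1, the same qualitative behavior described for A(N).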
Quantile-Quantile Plots and Genomic Control
One of the advantages of quantile-quantile (QQ) plots is that on a logarithmic scale they emphasize behavior in the tails of a distribution, and provide a valuable visual aid in assessing the independent effects of polygenicity, strength of association, and variance distortion – the roles played by the three model parameters – as well as showing how well a model fits data. QQ plots for the model were constructed using Eq. 25, replacing the normal pdf with the normal cdf, and replacing z with an equally-spaced vector of length 10,000 covering a wide range of nominal |z| values (0 through 38). SNPs were divided into a 10 × 10 grid of H × L bins, and the cdf vector (with elements corresponding to the z-values in ) accumulated for each such bin (using mean values of H and L for SNPs in a given bin).
For a given set of samples and SNPs, the genomic control factor, λ, for the z-scores is defined as the median z2 divided by the median for the null distribution, 0.455 [19]. This can also be calculated from the QQ plot. In the plots we present here, the abscissa gives the −log10 of the proportion, q, of SNPs whose z-scores exceed the two-tailed significance threshold p, transformed in the ordinate as −log10(p). The median is at qmed = 0.5, or −log10(qmed) ≃ 0.3; the corresponding empirical and model p-value thresholds (pmed) for the z-scores – and equivalently for the z-scores-squared – can be read off from the plots. The genomic inflation factor is then given by
$$\lambda = \frac{\left[\Phi^{-1}(1 - p_{\mathrm{med}}/2)\right]^2}{0.455},$$
where $\Phi^{-1}$ is the inverse of the standard normal cdf.
Note that the values of λ reported here are for pruned SNP sets; these values will be lower than for the total GWAS SNP sets.
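A minimal sketch of the two routes to λ described above: directly from the z-scores, and from the median p-value threshold read off a QQ plot. The conversion from p to |z| assumes the standard two-tailed relation; the function names are hypothetical.

```python
import statistics

NULL_MEDIAN_Z2 = 0.455  # median of z^2 under the null, as quoted in the text

def genomic_control_lambda(z_scores):
    """Median of z^2 divided by the null median."""
    return statistics.median(z * z for z in z_scores) / NULL_MEDIAN_Z2

def lambda_from_median_p(p_med):
    """Same quantity from the two-tailed p-value threshold at q = 0.5."""
    z_med = statistics.NormalDist().inv_cdf(1.0 - p_med / 2.0)
    return z_med * z_med / NULL_MEDIAN_Z2
```

Because λ depends on z², multiplying every z-score by a constant c multiplies λ by c², which is the behavior one would expect from a pure variance-scaling form of inflation.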
Knowing the total number, ntot, of p-values involved in a QQ plot (number of GWAS z-scores from pruned SNPs), any point (q, p) (log-transformed) on the plot gives the number, np = q ntot, of p-values that are as extreme as or more extreme than the chosen p-value. This can be thought of as np “successes” out of ntot independent trials (thus ignoring LD) from a binomial distribution with prior probability q. To approximate the effects of LD, we estimate the number of independent SNPs as ntot/f where f ≃ 10. The 95% binomial confidence interval for q is calculated as the exact Clopper-Pearson 95% interval, .
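The exact Clopper-Pearson interval can be sketched with the standard library alone by inverting the binomial tail probabilities with bisection. This is an illustrative implementation, not the authors' code; it assumes the number of "successes" and the LD-adjusted number of trials (ntot/f) have been rounded to integers, and the function names are hypothetical.

```python
import math

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), with terms computed in log space."""
    if p <= 0.0:
        return 1.0
    if p >= 1.0:
        return 1.0 if k >= n else 0.0
    log_p, log_q = math.log(p), math.log1p(-p)
    total = 0.0
    for i in range(k + 1):
        log_term = (math.lgamma(n + 1) - math.lgamma(i + 1)
                    - math.lgamma(n - i + 1) + i * log_p + (n - i) * log_q)
        total += math.exp(log_term)
    return min(total, 1.0)

def _bisect(func, lo, hi, iters=100):
    """Locate a sign change of a monotone function on [lo, hi]."""
    f_lo = func(lo)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        f_mid = func(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided (1 - alpha) interval for a binomial proportion."""
    # Lower bound: the p at which P(X >= k) equals alpha / 2
    lower = 0.0 if k == 0 else _bisect(
        lambda p: (1.0 - binom_cdf(k - 1, n, p)) - alpha / 2, 0.0, 1.0)
    # Upper bound: the p at which P(X <= k) equals alpha / 2
    upper = 1.0 if k == n else _bisect(
        lambda p: binom_cdf(k, n, p) - alpha / 2, 0.0, 1.0)
    return lower, upper
```

The direct summation in `binom_cdf` is adequate for a sketch but slow for very large counts; a production version would more likely use a beta-distribution quantile function from a numerical library.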
Number of Causal SNPs
The estimated number of causal SNPs is given by the polygenicity, π1, times the total number of SNPs, nsnp: ncausal = π1nsnp. Here nsnp is the total number of SNPs that went into building the heterozygosity/LD structure, in Eq. 25, i.e., the approximately 11 million SNPs selected from the 1000 Genomes Phase 3 reference panel, not the number of tag SNPs in the particular GWAS. The parameters estimated are to be seen in the context of the reference panel, which we assume contains all common causal variants. Stable quantities (i.e., fairly independent of the reference panel size, e.g., using the full panel or ignoring every second SNP) are the estimated effect size variance and number of causal variants – which we demonstrate below – and hence the heritability. Thus, the polygenicity will scale inversely with the reference panel size. A reference panel with a substantially larger number of samples would allow for inclusion of more SNPs (non-zero MAF), and thus the actual polygenicity estimated would change slightly.
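The inverse scaling of polygenicity with panel size can be checked with simple arithmetic. The sketch below uses the schizophrenia polygenicities quoted in the "Dependence on Reference Panel" subsection (2.8 × 10−3 for the full panel, 5.3 × 10−3 with every other SNP removed) and treats the panel size as a round 11 million, which is an approximation.

```python
def n_causal(polygenicity, n_snp):
    """Estimated number of causal SNPs: polygenicity times panel size."""
    return polygenicity * n_snp

N_SNP_FULL = 11_000_000  # approximate size of the full reference panel

full_panel = n_causal(2.8e-3, N_SNP_FULL)          # full panel
culled_panel = n_causal(5.3e-3, N_SNP_FULL // 2)   # every other SNP excluded
relative_gap = abs(full_panel - culled_panel) / full_panel
```

Both panels give roughly thirty thousand causal SNPs, so the estimated count is stable even though the polygenicity itself nearly doubles.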
Narrow-sense Chip Heritability
Since we are treating the β coefficients as fixed effects in the simple linear regression GWAS formalism, with the phenotype vector standardized with mean zero and unit variance, from Eq. 1 the proportion of phenotypic variance explained by a particular causal SNP whose reference panel genotype vector is g, q2=var(y; g), is given by q2 = β2H. The proportion of phenotypic variance explained additively by all causal SNPs is, by definition, the narrow sense chip heritability, h2. Since $E(\beta^2) = \sigma_\beta^2$ and ncausal = π1nsnp, and taking the mean heterozygosity over causal SNPs to be approximately equal to the mean over all SNPs, $\bar{H}$, the chip heritability can be estimated as
$$h^2 = \pi_1\, n_{snp}\, \bar{H}\, \sigma_\beta^2.$$
Mean heterozygosity from the ~11 million SNPs is .
For all-or-none traits like disease status, the estimated h2 from Eq. 34 for an ascertained case-control study is on the observed scale and is a function of the prevalence in the adult population, K, and the proportion of cases in the study, P. The heritability on the underlying continuous liability scale [61], $h_l^2$, is obtained by adjusting for ascertainment (multiplying by K(1 − K)/(P(1 − P)), the ratio of phenotypic variances in the population and in the study) and rescaling based on prevalence [62, 6]:
$$h_l^2 = h^2 \times \frac{K(1-K)}{P(1-P)} \times \frac{K(1-K)}{a^2},$$
where a is the height of the standard normal pdf at the truncation point zK defined such that the area under the curve in the region to the right of zK is K.
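The two heritability calculations can be sketched as below. The observed-scale function assumes the product form π1 · nsnp · (mean heterozygosity) · (effect size variance) implied by the text, and the liability-scale function assumes the ascertainment and prevalence factors are applied multiplicatively as described; function and argument names are hypothetical.

```python
from statistics import NormalDist

def chip_heritability(polygenicity, n_snp, mean_het, sigma_beta_sq):
    """Observed-scale estimate: n_causal * mean heterozygosity * effect variance."""
    return polygenicity * n_snp * mean_het * sigma_beta_sq

def liability_scale_h2(h2_observed, prevalence, case_fraction):
    """Convert observed-scale h2 from an ascertained case-control study."""
    std_normal = NormalDist()
    z_k = std_normal.inv_cdf(1.0 - prevalence)  # truncation point with upper-tail area K
    a = std_normal.pdf(z_k)                     # height of the normal pdf at z_K
    ascertainment = prevalence * (1.0 - prevalence) / (case_fraction * (1.0 - case_fraction))
    rescale = prevalence * (1.0 - prevalence) / (a * a)
    return h2_observed * ascertainment * rescale

# Schizophrenia inputs quoted in the Results: h2 = 0.37, K = 0.01,
# 51,900 cases and 71,675 controls
p_cases = 51_900 / (51_900 + 71_675)
h2_liability = liability_scale_h2(0.37, 0.01, p_cases)
```

With these inputs the sketch evaluates to approximately 0.21 on the liability scale.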
RESULTS
Simulations
Table 2 shows the simulation results, comparing true and estimated values for the model parameters, heritability, and the number of causal SNPs, for twelve scenarios where π1 and both range over three orders of magnitude, encompassing the range of values for the phenotypes; in Supplementary Material, Figure 6 shows QQ plots for a randomly chosen (out of 10) β-vector and phenotype instantiation for each of the twelve (π1, h2) scenarios. Most of the estimates are in very good agreement with the true values, though for the extreme scenario of high heritability and low polygenicity it is overestimated by factors of two-to-three. The numbers of estimated causal SNPs (out of 11 million) are in correspondingly good agreement with the true values, ranging in increasing powers of 10 from 110 through 110,158. The estimated discoverabilities are also in good agreement with the true values. In most cases, is close to 1, indicating little or no global inflation, though it is elevated for high heritability with high polygenicity, suggesting it is capturing some ubiquitous effects.
In Supplementary Material, we examine the issue of model misspecification. Specifically, we assign causal effects β drawn from a Gaussian whose variance is not simply a constant but depends on heterozygosity, such that rarer causal SNPs will tend to have larger effects [15]. The results – see Supplementary Material Table 3 – show that the model still makes reasonable estimates of the underlying genetic architecture. Additionally, we tested the scenario where true causal effects are distributed with respect to two Gaussians [14], a situation that allows for a small number of the causal SNPs to have quite large effects – see Supplementary Material Table 4. We find that heritabilities are still reasonably estimated using our model. In all these scenarios the overall data QQ plots were accurately reproduced by the model. As a counter example, we simulated summary statistics where the prior probability of a reference SNP being causal decreased linearly with total LD (see Supplementary Material Table 5). In this case, our single Gaussian fit (which assumes no LD dependence on the prior probability of a reference SNP being causal) did not produce model QQ plots that accurately tracked the data QQ plots (see Supplementary Material Figure 13). The model parameters and heritabilities were also poor. But this scenario is highly artificial; in contrast, in situations where the data QQ plots were accurately reproduced by the model, the estimated model parameters and heritability were plausible.
Phenotypes
Figures 1 and 2 show QQ plots for the pruned z-scores for eight qualitative and eight quantitative phenotypes, along with model estimates (Supplementary Material Figs. 14-30 each show a 4 × 4 grid breakdown by heterozygosity × total-LD of QQ plots for all phenotypes studied here; the 4 × 4 grid is a subset of the 10 × 10 grid used in the calculations). In all cases, the model fit (yellow) closely tracks the data (dark blue). For the sixteen phenotypes, estimates for the model polygenicity parameter (fraction of reference panel, with ≃11 million SNPs, estimated to have non-null effects) range over two orders of magnitude, from π1 ≃ 2 × 10−5 to π1 ≃ 4 × 10−3. The estimated SNP discoverability parameter (variance of β, or expected β2, for causal variants) also ranges over two orders of magnitude from to (in units where the variance of the phenotype is normalized to 1).
We find that schizophrenia and bipolar disorder appear to be similarly highly polygenic, with model polygenicities ≃ 2.84 × 10−3 and ≃ 2.70 × 10−3, respectively. The model polygenicity of major depressive disorder, however, is 40% higher, π1 ≃ 4 × 10−3 – the highest value among the sixteen phenotypes. In contrast, the model polygenicities of late onset Alzheimer’s disease and Crohn’s disease are almost thirty times smaller than that of schizophrenia.
In Supplementary Material Figure 10 we show results for Alzheimer’s disease exclusively for chromosome 19 (which contains APOE), and for all autosomal chromosomes excluding chromosome 19. We also show results with the same chromosomal breakdown for a recent GWAS involving 455,258 samples that included 24,087 clinically diagnosed LOAD cases and 47,793 AD-by-proxy cases (individuals who were not clinically diagnosed with LOAD but for whom at least one parent had LOAD) [65]. These GWAS give consistent estimates of polygenicity: π1 ∼ 1 × 10−4 excluding chromosome 19, and π1 ∼ 6 × 10−5 for chromosome 19 exclusively.
Of the quantitative traits, educational attainment has the highest model polygenicity, π1 = 3.2 × 10−3, similar to intelligence, π1 = 2.2 × 10−3. Approximately two orders of magnitude lower in polygenicity are the endophenotypes putamen volume and low- and high-density lipoproteins.
The model effective SNP discoverability for schizophrenia is , similar to that for bipolar disorder. Major depressive disorder, which has the highest polygenicity, has the lowest SNP discoverability, approximately one-eighth that of schizophrenia; it is this low value, combined with high polygenicity that leads to the weak signal in Figure 1 (A) even though the sample size is relatively large. In contrast, SNP discoverability for Alzheimer’s disease is almost four times that of schizophrenia. The inflammatory bowel diseases, however, have much higher SNP discoverabilities, 16 and 31 times that of schizophrenia respectively for ulcerative colitis and Crohn’s disease – the latter having the second highest value of the sixteen phenotypes: .
Additionally, for Alzheimer’s disease we show in Supplementary Material Figure 10 that the discoverability is two orders of magnitude greater for chromosome 19 than for the remainder of the autosome. Note that since two-thirds of the 2018 “cases” are AD-by-proxy, the discoverabilities for the 2018 data are, as expected, reduced relative to the values for the 2013 data (approximately 3.5 times smaller).
The narrow sense SNP heritability from the ascertained case-control schizophrenia GWAS is estimated as h2=0.37. Taking adult population prevalence of schizophrenia to be K=0.01 [66, 67] (but see also [68], for K=0.005), and given that there are 51,900 cases and 71,675 controls in the study, so that the proportion of cases in the study is P =0.42, the heritability on the liability scale for schizophrenia from Eq. 35 is . For bipolar disorder, with K=0.005 [51], 20,352 cases and 31,358 controls, . Major depressive disorder appears to have a much lower model-estimated SNP heritability than schizophrenia: . The model estimate of SNP heritability for height is 17%, lower than the oft-reported value ~50% (see Discussion). However, despite the huge differences in sample size, we find the same value, 17%, for the 2010 GWAS (N =133,735 [63]), and 19% for the 2018 GWAS (N = 707,868 [47, 44]) – see Table 1.
Figure 3 shows the sample size required so that a given proportion of chip heritability is captured by genome-wide significant SNPs for the phenotypes (assuming equal numbers of cases and controls for the qualitative phenotypes: Neff = 4/(1/Ncases + 1/Ncontrols), so that when Ncases = Ncontrols, Neff = Ncases + Ncontrols = N, the total sample size, allowing for a straightforward comparison with quantitative traits). At current sample sizes, only 4% of narrow-sense chip heritability is captured for schizophrenia and only 1% for bipolar disorder; using current methodologies, a sample size of Neff ~ 1 million would be required to capture the preponderance of SNP heritability for these phenotypes. Major depressive disorder GWAS currently is greatly underpowered, as shown in Figure 3(A). For education, we predict that 3.5% of phenotypic variance would be explained at N = 1.1 million, in good agreement with the value found from direct computation of 3.2% [64]. For other phenotypes, the proportions of total SNP heritability captured at the available sample sizes are given in Figure 3.
The sample size for ALS was quite low, and we restricted the analysis to chromosome 9, which had most of the genome-wide significant tag SNPs; we estimate that there are ~7 causal SNPs on chromosome 9 [69, 70], with very high discoverability, . In contrast, for AD restricted to chromosome 19, there were an estimated 14 causal SNPs with discoverability (see Supplementary Material Figure 10 (B)).
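The effective sample size formula quoted above is easy to verify numerically. The short sketch below (function name hypothetical) confirms that it reduces to the total sample size for a balanced study and evaluates it for the schizophrenia case and control counts reported earlier.

```python
def effective_sample_size(n_cases, n_controls):
    """N_eff = 4 / (1/N_cases + 1/N_controls)."""
    return 4.0 / (1.0 / n_cases + 1.0 / n_controls)

balanced = effective_sample_size(500_000, 500_000)    # equals the total N
schizophrenia = effective_sample_size(51_900, 71_675)
```

For an unbalanced design the effective size is smaller than the total: the schizophrenia study has 123,575 participants but an effective sample size of a little over 120,000.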
In this study, we assume that population stratification in the raw data has been corrected for in the publicly-available summary statistics. However, given that some of the sample sizes are extremely large, we allow for the possibility of residual cryptic relatedness. This would result in a scaling of the z-scores, Eq. 9 [19]. Thus, to test the modeling of inflation due to cryptic relatedness, we scaled the simulation z-scores as described earlier (z = σ0zu with σ0 > 1, where zu are the original z-scores, i.e., not artificially inflated) and reran the model. E.g., for education and schizophrenia we inflated the z-scores by a factor of 1.2. For schizophrenia we found , which is almost exactly as predicted (1.14 × 1.2 = 1.368), while the polygenicity and discoverability parameters are essentially unchanged: π1 = 2.81 × 10−3, . For education we found , which again is almost exactly as predicted (1.0 × 1.2 = 1.2), while the polygenicity and discoverability parameters are again essentially unchanged: π1 = 3.19 × 10−3, and .
A comparison of our results with those of [14] and [15] is in Supplementary Material Table 6. Critical methodological differences with model M2 in [14] are that we use a full reference panel of 11 million SNPs from 1000 Genomes Phase 3, we allow for the possibility of inflation in the data, and we provide an exact solution, based on Fourier Transforms, for the z-score pdf arising from the posited distribution of causal effects, resulting in better fits of the model and the data QQ plots – as can be seen by comparing our QQ plots with those reported in 6. Although our estimated numbers of causal SNPs are often within a factor of two of those from the nominally equivalent model M2 of Zhang et al., there is no clear pattern to the mismatch.
Dependence on Reference Panel
Given a liberal MAF threshold of 0.002, our reference panel should contain the vast majority of common SNPs for European ancestry. However, it does not include other structural variants (such as small insertions/deletions, or haplotype blocks) which may also be causal for phenotypes. To validate our parameter estimates for an incomplete reference, we reran our model on real phenotypes using the culled reference where we exclude every other SNP. The result is that all estimated parameters are as before except that π1 doubles, leaving the estimated number of causal SNPs and heritability as before. For example, for schizophrenia we get π1 = 5.3 × 10−3 and for the reduced reference panel, versus π1 = 2.8 × 10−3 and for the full panel, with heritability remaining essentially the same (37% on the observed scale).
DISCUSSION
Here we present a unified method based on GWAS summary statistics, incorporating detailed LD structure from an underlying reference panel of SNPs with MAF>0.002, for estimating: phenotypic polygenicity, π1, expressed as the fraction of the reference panel SNPs that have a non-null true β value, i.e., are “causal”; and SNP discoverability or mean strength of association (the variance of the underlying causal effects), . In addition the model can be used to estimate residual inflation of the association statistics due to variance distortion induced by cryptic relatedness, . The model assumes that there is very little, if any, inflation in the GWAS summary statistics due to population stratification (bias shift in z-scores due to ethnic variation).
We apply the model to sixteen diverse phenotypes, eight qualitative and eight quantitative. From the estimated model parameters we also estimate the number of causal common-SNPs in the underlying reference panel, ncausal, and the narrow-sense common-SNP heritability, h2 (for qualitative phenotypes, we re-express this as the proportion of population variance in disease liability, , under a liability threshold model, adjusted for ascertainment); in the event rare SNPs (i.e., not in the reference panel) are causal, h2 will be an underestimate of the true SNP heritability. In addition, we estimate the proportion of SNP heritability captured by genome-wide significant SNPs at current sample sizes, and predict future sample sizes needed to explain the preponderance of SNP heritability.
We find that schizophrenia is highly polygenic, with π1 = 2.8 × 10−3. This leads to an estimate of ncausal ≃ 31,000, which is in reasonable agreement with a recent estimate that the number of causals is >20,000 [71]. The SNP associations, however, are characterized by a narrow distribution, , indicating that most associations are of weak effect, i.e., have low discoverability. Bipolar disorder has similar parameters. The smaller sample size for bipolar disorder has led to fewer SNP discoveries compared with schizophrenia. However, from Figure 3, sample sizes for bipolar disorder are approaching a range where rapid increase in discoveries becomes possible. For educational attainment [72, 40, 73], the polygenicity is somewhat greater, π1 = 3.2 × 10−3, leading to an estimate of ncausal ≃ 35,000, half a recent estimate, ≃70,000, for the number of loci contributing to heritability [72]. The variance of the distribution for causal effect sizes is a quarter that of schizophrenia, indicating lower discoverability. Intelligence, a related phenotype [41, 74], has a larger discoverability than education while having lower polygenicity (~10,000 fewer causal SNPs).
In marked contrast are the lipoproteins and putamen volume, which have very low polygenicity: π1 < 5 × 10−5, so that only 250 to 550 SNPs (out of ~11 million) are estimated to be causal. However, causal SNPs for putamen volume and HDL appear to be characterized by relatively high discoverability, respectively 17-times and 23-times larger than for schizophrenia (see Supplementary Material for additional notes on HDL).
The QQ plots (which are sample size dependent) reflect these differences in genetic architecture. For example, the early departure of the schizophrenia QQ plot from the null line indicates its high polygenicity, while the steep rise for putamen volume after its departure corresponds to its high SNP discoverability.
For Alzheimer’s disease, our estimate of the liability-scale SNP heritability for the full 2013 dataset [37] is 15% for prevalence of 14% for those aged 71 and older, half from APOE, while the recent “M2” and “M3” models of Zhang et al [14] gave values of 7% and 10% respectively – see Supplementary Materials Table 6. A recent report from two methods, LD Score Regression (LDSC) and SumHer [75], estimated SNP heritability of 3% for LDSC and 12% for SumHer (assuming prevalence of 7.5%). A raw genotype-based analysis (GCTA), including genes that contain rare variants that affect risk for AD, reported SNP heritability of 53% [76, 7]; an earlier related study that did not include rare variants and had only a quarter of the common variants estimated SNP heritability of 33% for prevalence of 13% [77]. GCTA calculations of heritability are within the domain of the so-called infinitesimal model where all markers are assumed to be causal. Our model suggests, however, that phenotypes are characterized by polygenicities less than 5 × 10−3; for AD the polygenicity is ≃ 10−4. Nevertheless, the GCTA approach yields a heritability estimate closer to the twin-based (broad sense) value, estimated to be in the range 60-80% [78]. The methodology appears to be robust to many assumptions about the distribution of effect sizes [79, 80]; the SNP heritability estimate is unbiased, though it has larger standard error than methods that allow for only sparse causal effects [63, 81]. For the 2013 data analyzed here [37], a summary-statistics-based method applied to a subset of 54,162 of the 74,046 samples gave SNP heritability of almost 7% on the observed scale [82, 12]; our estimate is 12% on the observed scale – see Supplementary Material Figure 10 A and B.
Onset and clinical progression of sporadic Alzheimer’s disease is strongly age-related [83, 84], with prevalence in differential age groups increasing at least up through the early 90s [55]. Thus, it would be more accurate to assess heritability (and its components, polygenicity and discoverability) with respect to, say, five-year age groups beginning with age 65 years, and using a consistent control group of nonagenarians and centenarians. By the same token, comparisons among current and past AD GWAS are complicated because of potential differences in the age distributions of the respective case and the control cohorts. Additionally, the degree to which rare variants are included will affect heritability estimates. The summary-statistic-based estimates of polygenicity that we report here are, however, likely to be robust for common SNPs: π1 ≃ 1.1 × 10−4, with only a few causal SNPs on chromosome 19.
Our point estimate for the liability-scale SNP heritability of schizophrenia is (assuming a population risk of 0.01), and that 4% of this (i.e., 1% of overall disease liability) is explainable based on common SNPs reaching genome-wide significance at the current sample size. This estimate is in reasonable agreement with a recent result, [71, 85], also calculated from the PGC2 data set but using raw genotype data for 472,178 markers for a subset of 22,177 schizophrenia cases and 27,629 controls of European ancestry; and with an earlier result of from PGC1 raw genotype data for 915,354 markers for 9,087 schizophrenia cases and 12,171 controls [86, 7]. The recent “M2” (single non-null Gaussian) model estimate is [14] (see Supplementary Materials Table 6). No QQ plot was available for the M2 model fit to schizophrenia data, but such plots (truncated on the y-axis at −log10(p) = 10) for many other phenotypes were reported [14]. We note that for multiple phenotypes (height, LDL cholesterol, total cholesterol, years of schooling, Crohn’s disease, coronary artery disease, and ulcerative colitis) our single Gaussian model appears to provide a better fit to the data than M2: many of the M2 plots show a very early and often dramatic deviation between prediction and data, as compared with our model QQ plots which are also built from a single Gaussian, suggesting an upward bias in polygenicity and/or variance of effect sizes, and hence heritability as measured by the M2 model for these phenotypes. The LDSC liability-scale (1% prevalence) SNP heritability for schizophrenia has been reported as [12] and more recently as 0.19 [75], in very good agreement with our estimate; on the observed scale it has been reported as 45% [82, 12], in contrast to our corresponding value of 37%. 
Our estimate of 1% of overall variation on the liability scale for schizophrenia explainable by genome-wide significant loci compares reasonably with the proportion of variance on the liability scale explained by Risk Profile Scores (RPS) reported as 1.1% using the “MGS” sample as target (the median for all 40 leave-one-out target samples analyzed is 1.19% – see Extended Data Figure 5 and Supplementary Tables 5 and 6 in [34]; this was incorrectly reported as 3.4% in the main paper). These results show that current sample sizes need to increase substantially in order for RPSs to have predictive utility, as the vast majority of associated SNPs remain undiscovered. Our power estimates indicate that ~500,000 cases and an equal number of controls would be needed to identify these SNPs (note that there is a total of approximately 3 million cases in the US alone).
For educational attainment, we estimate SNP heritability h2 = 0.12, in good agreement with the estimate of 11.5% given in [40]. As with schizophrenia, this is substantially less than the estimate of heritability from twin and family studies of ≃40% of the variance in educational attainment explained by genetic factors [87, 72].
For putamen volume, we estimate the SNP heritability h2 = 0.11, in reasonable agreement with an earlier estimate of 0.1 for the same overall data set [45, 4]. For LDL and HDL, we estimate h2 = 0.06 and h2 = 0.07 respectively, in good agreement with the LDSC estimates h2 = 0.08 and h2 = 0.07 [75], and the M2 model of [14] – see Supporting Material Table 6.
For height (N=251,747 [44]) we find that its model polygenicity is π1 = 5.66 × 10−4, a quarter that of intelligence (and very far from “omnigenic” [88]), while its discoverability is five times that of intelligence, leading to a SNP heritability of 17%. For the 2010 GWAS (N=133,735 [63]) and 2018 GWAS (N=707,868 [47]), we estimate SNP heritability of 17% and 19% respectively (see Table 1 and Supplementary Material Fig. 11). These heritabilities are in considerable disagreement with the SNP heritability estimate of ≃50% [44] (average of estimates from five cohorts ranging in size from N=1,145 to N=5,668, with ~1 million SNPs). For the 2010 GWAS, the M2 model [14] gives h2 = 0.30 (see Supporting Material Table 6); the upward deviation of the model QQ plot in [14] suggests that this value might be inflated. For the 2014 GWAS, the M3 model estimate is h2 = 33% [14]; the Regression with Summary Statistics (RSS) model estimate is h2 = 52% (with ≃11,000 causal SNPs) [89], which, not taking any inflation into account, is definitely a model overestimate; and in [75] the LDSC estimate is reported as h2 = 20% while the SumHer estimate is h2 = 46% (in general across traits, the SumHer heritability estimates tend to be two-to-five times larger than the LDSC estimates). The M2, M3, and RSS models use a reference panel of ∼1 million common SNPs, in contrast with the ~11 million SNPs used in our analysis. Also, it should be noted that the M2, M3, and RSS model estimates did not take the possibility of inflation into account. For the 2014 height GWAS, the LDSC intercept is reported as 2.09 in [75], indicating considerable inflation; for the 2018 dataset we find ; the LD score regression intercept is 2.1116 (0.0458) on that dataset. Given the various estimates of inflation and the controversy over population structure in the height data [48, 49], it is not clear which results are definitely incorrect.
Our power analysis for height (2014) shows that 37% of the narrow-sense heritability arising from common SNPs is explained by genome-wide significant SNPs (p ≤ 5 × 10−8), i.e., 6.3% of total phenotypic variance, which is substantially less than the 16% direct estimate from significant SNPs [44]. It is not clear why these large discrepancies exist. One relevant factor, however, is that we estimate a considerable confounding in the height 2014 dataset. Our h2 estimates are adjusted for the potential confounding measured by , and thus they represent what is likely a lower bound of the actual SNP-heritability, leading to a more conservative estimate than what has previously been reported. We note that after adjustment, our h2 estimates are consistent across all three datasets (height 2010, 2014 and 2018), which otherwise would range by more than 2.5-fold. Another factor might be the relative dearth of typed SNPs with low heterozygosity and low total LD (see top left segment in Supporting Material Figure 23, n = 780): there might be many causal variants with weak effect that are only weakly tagged. Nevertheless, given the discrepancies noted above, caution is warranted in interpreting our model results for height.
CONCLUSION
The common-SNP causal effects model we have presented is based on GWAS summary statistics and detailed LD structure of an underlying reference panel, and assumes a Gaussian distribution of effect sizes at a fraction of SNPs randomly distributed across the autosomal genome. While not incorporating the effects of rare SNPs, we have shown that it captures the broad genetic architecture of diverse complex traits, where polygenicities and the variance of the effect sizes range over orders of magnitude.
The current model (essentially Eq. 4) and its implementation (essentially Eq. 25) are basic elements for building a more refined model of SNP effects using summary statistics. Higher accuracy in characterizing causal alleles in turn will enable greater power for SNP discovery and phenotypic prediction.
Funding
Research Council of Norway (262656, 248984, 248778, 223273) and KG Jebsen Stiftelsen; ABCD-USA Consortium (5U24DA041123).
Acknowledgments
We thank the consortia for making available their GWAS summary statistics, and the many people who provided DNA samples.