Beyond SNP Heritability: Polygenicity and Discoverability of Phenotypes Estimated with a Univariate Gaussian Mixture Model

Dominic Holland; Oleksandr Frei; Rahul Desikan; Chun-Chieh Fan; Alexey A. Shadrin; Olav B. Smeland; V. S. Sundar; Paul Thompson; Ole A. Andreassen; Anders M. Dale

doi:10.1101/133132

Abstract

Estimating the polygenicity (proportion of causally associated single nucleotide polymorphisms (SNPs)) and discoverability (effect size variance) of causal SNPs for human traits is currently of considerable interest. SNP-heritability is proportional to the product of these quantities. We present a basic model, using detailed linkage disequilibrium structure from an extensive reference panel, to estimate these quantities from genome-wide association studies (GWAS) summary statistics. We apply the model to diverse phenotypes and validate the implementation with simulations. We find model polygenicities ranging from ≃ 2 × 10⁻⁵ to ≃ 4 × 10⁻³, with discoverabilities similarly ranging over two orders of magnitude. A power analysis allows us to estimate the proportions of phenotypic variance explained additively by causal SNPs reaching genome-wide significance at current sample sizes, and map out sample sizes required to explain larger portions of additive SNP heritability. The model also allows for estimating residual inflation (or deflation from over-correcting of z-scores), and assessing compatibility of replication and discovery GWAS summary statistics.

INTRODUCTION

The genetic components of complex human traits and diseases arise from hundreds to likely many thousands of single nucleotide polymorphisms (SNPs) [1], most of which have weak effects. As sample sizes increase, more of the associated SNPs are identifiable (they reach genome-wide significance), though power for discovery varies widely across phenotypes. Of particular interest are estimating the proportion of common SNPs from a reference panel (polygenicity) involved in any particular phenotype; their effective strength of association (discoverability, or causal effect size variance); the proportion of variation in susceptibility, or phenotypic variation, captured additively by all common causal SNPs (approximately, the narrow sense heritability), and the fraction of that captured by genome-wide significant SNPs – all of which are active areas of research [2, 3, 4, 5, 6, 7, 8, 9]. The effects of population structure [10], combined with high polygenicity and linkage disequilibrium (LD), leading to spurious degrees of SNP association, or inflation, considerably complicate matters, and are also areas of much focus [11, 12, 13]. Despite these challenges, there have been recent significant advances in the development of mathematical models of polygenic architecture based on GWAS [14, 15]. One of the advantages of these models is that they can be used for power estimation in human phenotypes, enabling prediction of the capabilities of future GWAS.

Here, in a unified approach explicitly taking into account LD, we present a model relying on genome-wide association studies (GWAS) summary statistics (z-scores for SNP associations with a phenotype [16]) to estimate polygenicity (π₁, the proportion of causal variants in the underlying reference panel of approximately 11 million SNPs from a sample size of 503) and discoverability (σ_β², the causal effect size variance), as well as elevation of z-scores due to any residual inflation arising from variance distortion (σ₀², which for example can be induced by cryptic relatedness), which remains a concern in large-scale studies [10]. We estimate π₁, σ_β², and σ₀² by postulating a z-score probability distribution function (pdf) that explicitly depends on them, and fitting it to the actual distribution of GWAS z-scores.

Estimates of polygenicity and discoverability allow one to estimate compound quantities, like narrow-sense heritability captured by the SNPs [17]; to predict the power of larger-scale GWAS to discover genome-wide significant loci; and to understand why some phenotypes have higher power for SNP discovery and proportion of heritability explained than other phenotypes.

In previous work [18] we presented a related model that treated the overall effects of LD on z-scores in an approximate way. Here we take the details of LD explicitly into consideration, resulting in a conceptually more basic model to predict the distribution of z-scores. We apply the model to multiple phenotype datasets, in each case estimating the three model parameters and auxiliary quantities, including the overall inflation factor λ (traditionally referred to as genomic control [19]) and narrow sense heritability, h². We also perform extensive simulations on genotypes with realistic LD structure in order to validate the interpretation of the model parameters.

METHODS

Overview

Our basic model is a simple postulate for the distribution of causal effects (denoted β below) [20]. Our model assumes that only a fraction of all SNPs are in some sense causally related to any given phenotype. We work with a reference panel of approximately 11 million SNPs with 503 samples, and assume that all common causal SNPs (minor allele frequency (MAF) > 0.002) are contained in it. Any given GWAS will have z-scores for a subset (the “tag” SNPs) of these reference SNPs. When a z-score partially involves a latent causal component (i.e., not pure noise), we assume that it arises through LD with neighboring causal SNPs, or that it itself is causal.

We construct a pdf for z-scores that directly follows from the underlying distribution of effects. The z-score of any given tag SNP depends on the other SNPs with which that focal SNP is in LD, taking into account their LD with the focal SNP and their heterozygosity (i.e., it depends not just on the focal tag SNP’s total LD and heterozygosity, but also on the distribution of neighboring reference SNPs in LD with it and their heterozygosities). We present two ways of constructing the model pdf for z-scores: using multinomial expansion (in Supplementary Information), and using convolution. The former is perhaps more intuitive, but the latter is more numerically tractable, yielding an exact solution, and is used here to obtain all reported results. The problem then is to find the three model parameters that give a maximum likelihood best fit of the model’s predicted distribution of z-scores to the actual distribution of z-scores. Because we are fitting three parameters typically using ≳10⁶ data points, it is appropriate to incorporate some data reduction to facilitate the computations. To that end, we bin the data (z-scores) into a 10 × 10 grid of heterozygosity-by-total LD (having tested different grid sizes to ensure convergence of results). Also, when building the LD and heterozygosity structures of reference SNPs, we fine-grained the LD range (0 ≤ r² ≤ 1), again ensuring that bins were small enough that results were well converged. To fit the model to the data we bin the z-scores (within each heterozygosity/total LD window), calculate the multinomial probability of the actual distribution of z-scores (numbers of z-scores in the z-score bins) given the model pdf, and adjust the model parameters using a multidimensional unconstrained nonlinear minimization (Nelder-Mead) so as to maximize the likelihood of the data, given the parameters.

A visual summary of the predicted and actual distribution of z-scores is obtained by making quantile-quantile plots showing, for a wide range of significance thresholds going well beyond genome-wide significance, the proportion (x-axis) of tag SNPs exceeding any given threshold (y-axis) in the range. It is important also to assess the quantile-quantile sub-plots for SNPs in the heterozygosity-by-total LD grid elements (see Supplementary Material).

With the pdf in hand, various quantities can be calculated: the number of causal SNPs; the expected genetic effect (denoted δ below, where δ² is the noncentrality parameter of a Chi-squared distribution) at the current sample size for a tag SNP given the SNP’s z-score and its full LD and heterozygosity structure; the estimated SNP heritability, h²_SNP (excluding contributions from rare reference SNPs, i.e., with MAF<0.2%); and the sample size required to explain any percentage of that with genome-wide significant SNPs. The model can easily be extended using a more complex distribution for the underlying β’s, with multiple-component mixtures for small and large effects, and incorporating selection pressure through both heterozygosity dependence on effect sizes and linkage disequilibrium dependence on the prior probability of a SNP’s being causal; these are issues we will address in future work.

The Model: Probability Distribution for Z-Scores

To establish notation, we consider a bi-allelic genetic variant, i, and let β_i denote the effect size of allele substitution of that variant on a given quantitative trait. We assume a simple additive generative model (simple linear regression, ignoring covariates) relating genotype to phenotype [18, 21]. That is, assume a linear vector equation (no summation over repeated indices)

y = g_iβ_i + e_i    (1)

for phenotype vector y over N samples (mean-centered and normalized to unit variance), mean-centered genotype vector g_i for the i^th of n SNPs (vector over samples of the additively coded number of reference alleles for the i^th variant), true fixed effect β_i (regression coefficient) for the SNP, and residual vector e_i containing the effects of all the other causal SNPs, the independent random environmental component, and random error. Variants with non-zero fixed effect β_i are said to be “causal”. For SNP i, the estimated simple linear regression coefficient is

β̂_i = g_iᵀy/(g_iᵀg_i) = cov(g_i, y)/var(g_i)    (2)

where T denotes transpose and g_iᵀg_i/N = var(g_i) = H_i is the SNP’s heterozygosity (frequency of the heterozygous genotype): H_i = 2p_i(1 − p_i) where p_i is the frequency of either of the SNP’s alleles.

Consistent with the work of others [11, 15], we assume the causal SNPs are distributed randomly throughout the genome (an assumption that can be relaxed when explicitly considering different SNP categories, but that in the main is consistent with the additive variation explained by a given part of the genome being proportional to the length of DNA [22]). In a Bayesian approach, we assume that the parameter β for a SNP has a distribution (in that specific sense, this is similar to a random effects model), representing subjective information on β, not a distribution across tangible populations [23]. Specifically, we posit a normal distribution for β with variance given by a constant, σ_β²:

β ∼ N(0, σ_β²).    (3)

This is also how the β are distributed across the set of causal SNPs. Therefore, taking into account all SNPs (the remaining ones are all null by definition), this is equivalent to the two-component Gaussian mixture model we originally proposed [20]

β ∼ π₁N(0, σ_β²) + (1 − π₁)N(0, 0)    (4)

where N(0, 0) is the Dirac delta function, so that considering all SNPs, the net variance is var(β) = π₁σ_β². If there is no LD (and assuming no source of spurious inflation), the association z-score for a SNP with heterozygosity H can be decomposed into a fixed effect δ and a residual random environment and error term, ϵ ∼ N(0, 1), which is assumed to be independent of δ [18]:

z = δ + ϵ    (5)

with

δ = √(NH) β    (6)

so that

var(z) = var(δ) + var(ϵ) = σ² + 1    (7)

where

σ² = NHσ_β².    (8)

By construction, under null, i.e., when there is no genetic effect, δ = 0, so that var(ϵ) = 1.

If there is no source of variance distortion in the sample, but there is a source of bias in the summary statistics (e.g., the sample is composed of two or more subpopulations with different allele frequencies for a subset of markers – pure population stratification in the sample [24]), the marginal distribution of an individual’s genotype at any of those markers will be inflated. The squared z-score for such a marker will then follow a noncentral chi-square distribution; the noncentrality parameter will contain the causal genetic effect, if any, but biased up or down (confounding or loss of power, depending on the relative sign of the genetic effect and the bias term). The effect of bias shifts, arising for example due to stratification, is nontrivial, and currently not explicitly in our model; it is usually accounted for using standard methods [25].

Variance distortion in the distribution of z-scores can arise from cryptic relatedness in the sample (drawn from a population mixture with at least one subpopulation with identical-by-descent marker alleles, but no population stratification) [19]. If z_u denotes the uninflated z-scores, then the inflated z-scores are

z = σ₀z_u    (9)

where σ₀ ≥ 1 characterizes the inflation. Thus, from Eq. 7, in the presence of inflation in the form of variance distortion

var(z) = σ₀²(σ² + 1) ≡ σ̃² + σ₀²    (10)

where σ̃² ≡ σ₀²σ², so that σ̃_β² = σ₀²σ_β² and ϵ̃ ∼ N(0, σ₀²). In the presence of variance distortion one is dealing with inflated random variables β̃ ∼ N(0, σ̃_β²), but we will drop the tilde on the β’s in what follows.

Since variance distortion leads to scaled z-scores [19], we allow for this effect in some of the extremely large data sets. We can assess the ability of the model to detect this inflation by artificially inflating the z-scores (Eq. 9), and checking that the inflated σ₀² is estimated correctly while the other parameter estimates remain unchanged.
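The scaling argument can be checked numerically. The following is a minimal sketch, not the authors' code: it draws no-LD z-scores from the two-component model (z = δ + ϵ), multiplies them by a hypothetical inflation factor, and compares variances. The function name, parameter values, and the shorthand `sigma2` (standing for NHσ_β²) are assumptions made for illustration.

```python
import random
import statistics


def simulate_z(n_snps, pi1, sigma2, rng):
    """Draw uninflated, no-LD z-scores: z = delta + eps.

    sigma2 stands for N*H*sigma_beta^2, the variance of delta for a
    causal SNP; null SNPs have delta = 0.
    """
    z = []
    for _ in range(n_snps):
        delta = rng.gauss(0.0, sigma2 ** 0.5) if rng.random() < pi1 else 0.0
        z.append(delta + rng.gauss(0.0, 1.0))
    return z


rng = random.Random(0)
z_u = simulate_z(200_000, pi1=0.01, sigma2=25.0, rng=rng)

sigma0 = 1.1                                  # hypothetical inflation factor
z_inflated = [sigma0 * z for z in z_u]        # variance distortion as pure scaling

var_u = statistics.pvariance(z_u)             # expected near 1 + pi1 * sigma2
var_inflated = statistics.pvariance(z_inflated)
print(var_u, var_inflated, var_inflated / var_u)   # ratio equals sigma0 ** 2
```

Because the inflation acts as a pure rescaling, the variance ratio equals σ₀² regardless of the polygenicity or effect size variance used in the draw.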

Implicit in Eq. 8 is approximating the denominator, 1 − q², of the χ² statistic noncentrality parameter to be 1, where q² is the proportion of phenotypic variance explained by the causal variant, i.e., q ≡ √H β (so that q² = Hβ²). So a more correct δ is

δ = √N q/√(1 − q²).    (11)

Taylor expanding in q and then taking the variance gives

var(δ) = NHσ_β²[1 + 3Hσ_β² + O(H²σ_β⁴)].    (12)

The additional terms will be vanishingly small and so do not contribute in a distributional sense; (quasi-) Mendelian or outlier genetic effects represent an extreme scenario where the model is not expected to be accurate, but SNPs for such traits are by definition easily detectable. So Eq. 8 remains valid for the polygenicity of complex traits.
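For reference, the expansion can be written out explicitly. This is a sketch under the assumption that the corrected effect has the form δ = √N q/√(1 − q²) with q ≡ √H β and β ∼ N(0, σ_β²); it is a formal expansion for small Hσ_β² and uses only the Gaussian moments E[q²] = Hσ_β² and E[q⁴] = 3H²σ_β⁴.

```latex
\delta = \frac{\sqrt{N}\,q}{\sqrt{1-q^{2}}}
       = \sqrt{N}\left(q + \tfrac{1}{2}q^{3} + \tfrac{3}{8}q^{5} + \dots\right),
\qquad q \equiv \sqrt{H}\,\beta,\quad \beta \sim \mathcal{N}(0,\sigma_\beta^{2}).

% delta is odd in q, so E[delta] = 0 and var(delta) = E[delta^2]:
\operatorname{var}(\delta)
  = N\,\mathrm{E}\!\left[\frac{q^{2}}{1-q^{2}}\right]
  = N\left(\mathrm{E}[q^{2}] + \mathrm{E}[q^{4}] + \dots\right)
  = N H \sigma_\beta^{2}\left[1 + 3H\sigma_\beta^{2}
      + O\!\left(H^{2}\sigma_\beta^{4}\right)\right].
```

The leading term is the σ² of Eq. 8, and the relative size of the first correction is 3Hσ_β², which is negligible whenever a single causal SNP explains only a tiny fraction of phenotypic variance.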

Now consider the effects of LD on z-scores. The simple linear regression coefficient estimate for tag SNP i, β̂_i, and hence the GWAS z-score, implicitly incorporates contributions due to LD with neighboring causal SNPs. (A tag SNP is a SNP with a z-score, imputed or otherwise; generally these will compose a smaller set than that available in reference panels like 1000 Genomes used here for calculating the LD structure of tag SNPs.) In Eq. 1, e_i = Σ_{j≠i} g_jβ_j + ε, where g_j is the genotype vector for SNP j, β_j is its true regression coefficient, and ε is the independent true environmental and error residual vector (over the N samples). Thus, explicitly including all causal true β’s, Eq. 2 becomes

β̂_i = Σ_j (g_iᵀg_j)β_j/(NH_i) + g_iᵀε/(NH_i) ≡ β′_i + ε′_i    (13)

(the sum over j now includes SNP i itself). This is the simple linear regression expansion of the estimated regression coefficient for SNP i in terms of the independent latent (true) causal effects and the latent environmental (plus error) component; β′_i is the effective simple linear regression expression for the true genetic effect of SNP i, with contributions from neighboring causal SNPs mediated by LD. Note that g_iᵀg_j/N is simply cov(g_i,g_j), the covariance between genotypes for SNPs i and j. Since correlation is covariance normalized by the standard deviations, β′_i in Eq. 13 can be written as

β′_i = Σ_j √(H_j/H_i) r_ij β_j    (14)

where r_ij is the correlation between genotypes at SNP j and tag SNP i. Then, from Eq. 5, the z-score for the tag SNP’s association with the phenotype is given by:

z_i = √(NH_i) β′_i + ϵ_i = √N Σ_j √H_j r_ij β_j + ϵ_i.    (15)

We noted that in the absence of LD, the distribution of the residual in Eq. 5 is assumed to be univariate normal. But in the presence of LD (Eq. 15) there are induced correlations, so the appropriate extension would be multivariate normal for ϵ_i. A limitation of the present work is that we do not consider this complexity. This may account for the relatively minor misfit in the simulation results for cases of high polygenicity – see below.

Thus, for example, if the SNP itself is not causal but is in LD with k causal SNPs that all have heterozygosity H, and where its LD with each of these is the same, given by some value r² (0 < r² ≤ 1), then σ² in Eq. 10 will be given by

σ² = kr²NHσ_β².    (16)

For this idealized case, the marginal distribution, or pdf, of z-scores for a set of such associated SNPs is

pdf(z; N, ℋ, σ_β, σ₀) = ϕ(z; 0, σ² + σ₀²)    (17)

where ϕ(·; μ, σ²) is the normal distribution with mean μ and variance σ², and ℋ is shorthand for the LD and heterozygosity structure of such SNPs (in this case, denoting exactly k causals with LD given by r² and heterozygosity given by H). If a proportion α of all tag SNPs are similarly associated with the phenotype while the remaining proportion are all null (not causal and not in LD with causal SNPs), then the marginal distribution for all SNP z-scores is the Gaussian mixture

pdf(z) = αϕ(z; 0, σ² + σ₀²) + (1 − α)ϕ(z; 0, σ₀²),    (18)

dropping the parameters for convenience.
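The idealized two-component mixture is simple enough to evaluate directly. The sketch below is illustrative only: the function name and argument names are hypothetical, and it assumes the associated component has variance kr²NHσ_β² + σ₀² while the null component has variance σ₀².

```python
from statistics import NormalDist


def idealized_mixture_pdf(z, alpha, k, r2, n_samples, het, sigma_beta2, sigma0=1.0):
    """Mixture pdf for tag SNPs that are either null or in LD with k causals.

    Assumes every associated tag SNP has exactly k causal neighbors, all
    with the same LD r2 and heterozygosity het (the idealized case).
    """
    sigma2 = k * r2 * n_samples * het * sigma_beta2        # genetic variance of z
    associated = NormalDist(0.0, (sigma2 + sigma0 ** 2) ** 0.5).pdf(z)
    null = NormalDist(0.0, sigma0).pdf(z)
    return alpha * associated + (1.0 - alpha) * null


# Hypothetical values: 20% of tag SNPs associated, 2 causals at r2 = 0.5.
for z in (0.0, 2.0, 5.0):
    print(z, idealized_mixture_pdf(z, alpha=0.2, k=2, r2=0.5,
                                   n_samples=1e5, het=0.3, sigma_beta2=1e-4))
```

The heavier tails contributed by the associated component are what separate the mixture from the null distribution at large |z|.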

For real genotypes, however, the LD and heterozygosity structure is far more complicated, and of course the causal SNPs are generally numerous and unknown. Thus, more generally, the structure ℋ for each tag SNP will be a two-dimensional histogram over LD (r²) and heterozygosity (H), each grid element giving the number of SNPs falling within the edges of that (r², H) bin. Alternatively, for each tag SNP it can be built as two one-dimensional histograms, one giving the LD structure (counts of neighboring SNPs in each LD r² bin), and the other giving, for each r² bin, the mean heterozygosity for those neighboring SNPs, which should be accurate for sufficiently fine binning. We use the latter in what follows. We present two consistent ways of expressing the a posteriori pdf for z-scores, based on multinomial expansion and on convolution, that provide complementary views. The multinomial approach (see Supplementary Material) perhaps gives a more intuitive feel for the problem, but the convolution approach is considerably more tractable numerically and is used here to obtain all reported results.

Model PDF: Convolution

From Eq. 15, there exists an efficient procedure that allows for accurate calculation of a z-score’s a posteriori pdf (given the SNP’s heterozygosity and LD structure, and the phenotype’s model parameters). Any GWAS z-score is a sum of unobserved random variables (LD-mediated contributions from neighboring causal SNPs, and the additive environmental component), and the pdf for such a composite random variable is given by the convolution of the pdfs for the component random variables. Since convolution is associative, and the Fourier transform of the convolution of two functions is just the product of the individual Fourier transforms of the two functions, one can obtain the a posteriori pdf for z-scores as the inverse Fourier transform of the product of the Fourier transforms of the individual random variable components.

From Eq. 15, z is a sum of correlation- and heterozygosity-weighted random variables {β_j} and the random variable ϵ, where {β_j} denotes the set of true causal parameters for each of the SNPs in LD with the tag SNP whose z-score is under consideration. The Fourier transform F(k) of a Gaussian f(x) = c × exp(−ax²) is F(k) = c√(π/a) × exp(−π²k²/a). From Eq. 4, for each SNP j in LD with the tag SNP (1 ≤ j ≤ b, where b is the tag SNP’s block size),

√(NH_j) r_j β_j ∼ π₁N(0, NH_j r_j² σ_β²) + (1 − π₁)N(0, 0).    (19)

The Fourier transform (with variable k; see below) of the first term on the right hand side is

π₁ exp(−2π²NH_j r_j² σ_β² k²)    (20)

while that of the second term is simply (1 − π₁). Additionally, the environmental term is ϵ ∼ N(0, σ₀²) (ignoring LD-induced correlation, as noted earlier), and its Fourier transform is exp(−2π²σ₀²k²). For each tag SNP, one could construct the a posteriori pdf based on these Fourier transforms. However, it is more practical to use a coarse-grained representation of the data. Thus, in order to fit the model to a data set, we bin the tag SNPs whose z-scores comprise the data set into a two-dimensional heterozygosity/total LD grid (whose elements we denote “H-L” bins), and fit the model with respect to this coarse gridding instead of with respect to every individual tag SNP z-score; in the section “Parameter Estimation” below we describe using a 10 × 10 grid. Additionally, for each H-L bin the LD r² and heterozygosity histogram structure for each tag SNP is built, using w_max equally-spaced r² bins for r²_min ≤ r² ≤ 1; w_max = 20 is large enough to allow for converged results; r²_min = 0.05 is generally small enough to capture true causal associations in weak LD while large enough to exclude spurious contributions to the pdf arising from estimates of r² that are non-zero due to noise. This points up a minor limitation of the model stemming from the small reference sample size (N_R = 503 for 1000 Genomes) from which the LD and heterozygosity structure ℋ is built. Larger N_R would allow for more precision in handling very low LD (r² < 0.05), but this is an issue only for situations with extremely large σ_β² (high heritability with low polygenicity) that we do not encounter for the 16 phenotypes we analyze here. In any case, this can be calibrated for using simulations.

For any H-L bin with mean heterozygosity H and mean total LD L there will be an average LD and heterozygosity structure with a mean breakdown for the tag SNPs having n_w SNPs (not all of which necessarily are tag SNPs) with LD r² in the w^th r² bin whose average heterozygosity is H_w. Thus, one can re-express z-scores for an H-L bin as

z = √N Σ_{w=1}^{w_max} √H_w r_w Σ_{j=1}^{n_w} β_j + ϵ

where β_j and ϵ are unobserved random variables.

In the spirit of the discrete Fourier transform (DFT), discretize the set of possible z-scores into the ordered set of n (equal to a power of 2) values z₁, …, z_n with equal spacing between neighbors given by ∆z (z_n = −z₁ − ∆z, and z_{n/2+1} = 0). Taking z₁ = −38 allows for the minimum p-values of 5.8 × 10⁻³¹⁶ (near the numerical limit); with n = 2¹⁰, ∆z = 0.0742. Given ∆z, the Nyquist critical frequency is f_c = 1/(2∆z), so we consider the Fourier transform function for the z-score pdf at n discrete values k₁, …, k_n, with equal spacing between neighbors given by ∆k, where k₁ = −f_c (k_n = −k₁ − ∆k, and k_{n/2+1} = 0; the DFT pair ∆z and ∆k are related by ∆z∆k = 1/n). Define A_w ≡ −2π²NH_w r_w² σ_β² (see Eq. 20). Then the product (over r² bins) of Fourier transforms for the genetic contribution to z-scores, denoted G_j ≡ G(k_j), is

G_j = Π_{w=1}^{w_max} [π₁ exp(A_w k_j²) + (1 − π₁)]^{n_w}

and the Fourier transform of the environmental contribution, denoted E_j ≡ E(k_j), is

E_j = exp(−2π²σ₀²k_j²).

Let F_z = (G₁E₁, …, G_nE_n) denote the vector of products of Fourier transform values, and let ℱ⁻¹ denote the inverse Fourier transform operator. Then, the vector of pdf values for z-score bins (indexed by i) in the H-L bin with mean LD and heterozygosity structure ℋ, pdf_z = (f₁, …, f_n) where f_i ≡ pdf(z_i|ℋ), is

pdf_z = ℱ⁻¹[F_z].    (25)
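The convolution procedure can be sketched in a few lines. This is a minimal illustration, not the authors' Matlab implementation: it assumes the Fourier convention F(k) = ∫ f(x) exp(−2πikx) dx, takes the per-bin transform to be [π₁ exp(−2π²NH_w r_w² σ_β² k²) + (1 − π₁)]^{n_w}, and replaces the fast inverse DFT with a direct cosine sum (valid because the transform is real and even). All function names and the `ld_bins` input format are hypothetical.

```python
import math
from statistics import NormalDist

N_GRID = 2 ** 10                      # number of z-score (and k) grid points
Z1 = -38.0                            # smallest nominal z-score
DZ = -2.0 * Z1 / N_GRID               # 0.07421875, quoted as 0.0742 in the text
FC = 1.0 / (2.0 * DZ)                 # Nyquist critical frequency
DK = 1.0 / (N_GRID * DZ)              # DFT pair relation: DZ * DK = 1 / N_GRID
Z = [Z1 + i * DZ for i in range(N_GRID)]
K = [-FC + j * DK for j in range(N_GRID)]


def fourier_product(pi1, sigma_beta2, sigma0, n_samples, ld_bins):
    """Return F_z = (G_1 E_1, ..., G_n E_n) for one H-L bin.

    ld_bins is a list of (n_w, r2_w, H_w): SNP count, LD r^2 and mean
    heterozygosity for each r^2 bin (illustrative input format).
    """
    f_z = []
    for k in K:
        g = 1.0
        for n_w, r2_w, het_w in ld_bins:
            a_w = -2.0 * math.pi ** 2 * n_samples * het_w * r2_w * sigma_beta2
            g *= (pi1 * math.exp(a_w * k * k) + (1.0 - pi1)) ** n_w
        e = math.exp(-2.0 * math.pi ** 2 * sigma0 ** 2 * k * k)
        f_z.append(g * e)
    return f_z


def inverse_transform(f_z, z_values):
    """Inverse Fourier transform by direct summation (cosine sum, F_z even)."""
    return [DK * sum(f * math.cos(2.0 * math.pi * k * z) for f, k in zip(f_z, K))
            for z in z_values]


# Hypothetical H-L bin: 40 SNPs at r^2 = 0.1 and 5 SNPs at r^2 = 0.8.
f_z = fourier_product(pi1=0.01, sigma_beta2=1e-4, sigma0=1.05,
                      n_samples=1e5, ld_bins=[(40, 0.1, 0.3), (5, 0.8, 0.25)])
pdf_z = inverse_transform(f_z, Z)
print(sum(pdf_z) * DZ)                # total probability, very close to 1

# Sanity check: pi1 = 1 with a single SNP must give one Gaussian whose
# variance is N*H*r^2*sigma_beta^2 + sigma0^2 = 1.5 + 1.
gauss_f = fourier_product(1.0, 1e-4, 1.0, 1e5, [(1, 0.5, 0.3)])
gauss_pdf = inverse_transform(gauss_f, [0.0, 2.0])
exact_pdf = [NormalDist(0.0, 2.5 ** 0.5).pdf(z) for z in (0.0, 2.0)]
print(gauss_pdf, exact_pdf)
```

The direct sum costs O(n²) and is used here only to stay dependency-free; an FFT routine would give the same values on the grid far more quickly.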

Data Preparation

For real phenotypes, we calculated SNP minor allele frequency (MAF) and LD between SNPs using the 1000 Genomes phase 3 data set for 503 subjects/samples of European ancestry [26, 27, 28]. For simulations, we used HAPGEN2 [29, 30, 31] to generate genotypes; we calculated SNP MAF and LD structure from 1000 simulated samples. We elected to use the same intersecting set of SNPs for real data and simulation. For HAPGEN2, we eliminated SNPs with MAF<0.002; for 1000 Genomes, we eliminated SNPs for which the call rate (percentage of samples with useful data) was less than 90%. This left n_snp=11,015,833 SNPs.

Sequentially moving through each chromosome in contiguous blocks of 5,000 SNPs, for each SNP in the block we calculated its Pearson r² correlation coefficients with all SNPs in the central block itself and with all SNPs in the pair of flanking blocks of size up to 50,000 each. For each SNP we calculated its total LD (TLD), given by the sum of LD r²’s thresholded such that if r² < r²_min (the minimum LD threshold) we set that r² to zero. For each SNP we also built a histogram giving the numbers of SNPs in w_max equally-spaced r²-windows covering the range r²_min ≤ r² ≤ 1. These steps were carried out independently for 1000 Genomes phase 3 and for HAPGEN2 (for the latter, we used 1000 simulated samples).
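The per-SNP bookkeeping described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' pipeline: the function name is hypothetical, the defaults r2_min = 0.05 and w_max = 20 are taken as assumptions, and the r² values are supplied directly instead of being computed from genotypes.

```python
def ld_structure(r2_values, het_values, r2_min=0.05, w_max=20):
    """Total LD plus r^2 histogram and per-bin mean heterozygosity for one SNP.

    r2_values / het_values: LD r^2 with each neighboring SNP and that
    neighbor's heterozygosity. Values of r^2 below r2_min are treated as zero.
    """
    width = (1.0 - r2_min) / w_max
    counts = [0] * w_max
    het_sums = [0.0] * w_max
    total_ld = 0.0
    for r2, het in zip(r2_values, het_values):
        if r2 < r2_min:
            continue                                   # thresholded to zero
        total_ld += r2
        w = min(int((r2 - r2_min) / width), w_max - 1)  # r2 = 1 goes in last bin
        counts[w] += 1
        het_sums[w] += het
    mean_het = [s / c if c else 0.0 for s, c in zip(het_sums, counts)]
    return total_ld, counts, mean_het


total_ld, counts, mean_het = ld_structure(
    r2_values=[1.0, 0.9, 0.04, 0.05, 0.5],
    het_values=[0.4, 0.3, 0.2, 0.1, 0.5])
print(total_ld, counts)
```

In this toy call the neighbor with r² = 0.04 is dropped, so four SNPs contribute to both the total LD and the histogram.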

Employing a similar procedure, we also built binary (logical) LD matrices identifying all pairs of SNPs for which LD r² > 0.8, a liberal threshold for SNPs being “synonymous”.

In applying the model to summary statistics, we calculated histograms of TLD and LD block size (using 100 bins in both cases), ignoring SNPs whose TLD or block size was so large that their frequency was less than a hundredth of the respective histogram peak; typically this amounted to restricting to SNPs for which TLD ≤ 600 and LD block size ≤ 1,500. We also ignored summary statistics of SNPs for which MAF ≤ 0.01.

We analyzed summary statistics for sixteen phenotypes (in what follows, where sample sizes varied by SNP, we quote the median value): (1) major depressive disorder (N_cases = 59,851, N_controls = 113,154) [32]; (2) bipolar disorder (N_cases = 20,352, N_controls = 31,358) [33]; (3) schizophrenia (N_cases = 35,476, N_controls = 46,839) [34]; (4) coronary artery disease (N_cases = 60,801, N_controls= 123,504) [35]; (5) ulcerative colitis (N_cases = 12,366, N_controls = 34,915) and (6) Crohn’s disease (N_cases = 12,194, N_controls = 34,915) [36]; (7) late onset Alzheimer’s disease (LOAD; N_cases = 17,008, N_controls = 37,154) [37] (in the Supplementary Material we present results for a more recent GWAS with N_cases = 71,880 and N_controls= 383,378 [38]); (8) amyotrophic lateral sclerosis (ALS) (N_cases = 12,577, N_controls = 23,475) [39]; (9) number of years of formal education (N = 293,723) [40]; (10) intelligence (N = 262,529) [41, 42]; (11) body mass index (N = 233,554) [43]; (12) height (N = 251,747) [44]; (13) putamen volume (normalized by intracranial volume, N = 11,598) [45]; (14) low- (N = 89,873) and (15) high-density lipoprotein (N = 94,295) [46]; and (16) total cholesterol (N = 94,579) [46]. Most participants were of European ancestry.

For height, we focused on the 2014 GWAS [44], not the more recent 2018 GWAS [47], although we also report below model results for the latter. There are issues pertaining to population structure in the various height GWAS [48, 49], and the 2018 GWAS is a combination of GIANT and UKB GWAS, so some caution is warranted in interpreting results for these data.

For the ALS GWAS data, there is very little signal outside chromosome 9: the data QQ plot essentially tracks the null distribution straight line. The QQ plot for chromosome 9, however, shows a significant departure from the null distribution. Of 471,607 SNPs on chromosome 9 a subset of 273,715 have z-scores, of which 107 are genome-wide significant, compared with 114 across the full genome. Therefore, we restrict ALS analysis to chromosome 9.

For schizophrenia, for example, there were 6,610,991 SNPs with finite z-scores out of the 11,015,833 SNPs from the 1000 Genomes reference panel that underlie the model; the genomic control factor for these SNPs was λ_GC = 1.466. Of these, 314,857 were filtered out due to low MAF or very large LD block size. The genomic control factor for the remaining SNPs was λ_GC = 1.468; for the pruned subsets, with ≃ 1.49 × 10⁶ SNPs each, it was λ = 1.30. (Note that genomic control values for pruned data are always lower than for unpruned data.)
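For readers unfamiliar with the genomic control factor, the conventional definition is the median of the squared z-scores divided by the median of a χ² distribution with one degree of freedom (about 0.455). The sketch below uses that standard definition; the text does not spell out the exact computation used, so treat the function as an assumption.

```python
import random
from statistics import NormalDist, median

# Median of chi-square with 1 d.o.f.: the square of the normal 75th percentile.
CHI2_1DF_MEDIAN = NormalDist().inv_cdf(0.75) ** 2      # about 0.4549


def genomic_control(z_scores):
    """Conventional lambda_GC: median(z^2) / median(chi-square, 1 d.o.f.)."""
    return median(z * z for z in z_scores) / CHI2_1DF_MEDIAN


rng = random.Random(1)
z_null = [rng.gauss(0.0, 1.0) for _ in range(50_000)]   # pure-null z-scores
lam_null = genomic_control(z_null)                      # close to 1
lam_scaled = genomic_control([1.2 * z for z in z_null]) # scales as 1.2 ** 2
print(lam_null, lam_scaled)
```

Because λ_GC is driven by the bulk of the distribution, it reflects both polygenic signal spread through LD and any variance distortion, which is why it cannot by itself separate the two.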

A limitation in the current work is that we have not taken account of imputation inaccuracy, where lower MAF SNPs are, through lower LD, less certain. Thus, the effects from lower MAF causal variants will be noisier than for higher MAF variants.

Simulations

We generated genotypes for 10⁵ unrelated simulated samples using HAPGEN2 [31]. For narrow-sense heritability h² equal to 0.1, 0.4, and 0.7, we considered polygenicity π₁ equal to 10⁻⁵, 10⁻⁴, 10⁻³, and 10⁻². For each of these 12 combinations, we randomly selected n_causal = π₁ × n_snp “causal” SNPs and assigned them β-values drawn from the standard normal distribution (i.e., independent of H), with all other SNPs having β = 0. We repeated this ten times, giving ten independent instantiations of random vectors of β’s. Defining Y_G = Gβ, where G is the genotype matrix and β here is the vector of true coefficients over all SNPs, the total phenotype vector is constructed as Y = Y_G + ε, where the residual random vector ε for each instantiation is drawn from a normal distribution such that h² = var(Y_G)/var(Y). For each of the instantiations this implicitly defines the “true” value σ_β².
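The phenotype construction can be illustrated on a toy scale. The sketch below is not the authors' simulation: genotypes are drawn as independent binomial SNPs (no LD, no HAPGEN2), sample and SNP counts are tiny, and all names are hypothetical. It shows only how the residual variance is chosen so that h² = var(Y_G)/var(Y).

```python
import random
import statistics


def simulate_phenotype(genotypes, pi1, h2, rng):
    """Build Y = Y_G + eps with target heritability h2.

    genotypes: list of per-sample lists of allele counts (0, 1, 2).
    Causal betas are standard normal; all other betas are zero.
    """
    n_snps = len(genotypes[0])
    n_causal = max(1, round(pi1 * n_snps))
    causal = rng.sample(range(n_snps), n_causal)
    beta = [0.0] * n_snps
    for j in causal:
        beta[j] = rng.gauss(0.0, 1.0)
    y_g = [sum(g[j] * beta[j] for j in causal) for g in genotypes]
    var_g = statistics.pvariance(y_g)
    sd_e = (var_g * (1.0 - h2) / h2) ** 0.5     # so var_g / (var_g + var_e) = h2
    y = [v + rng.gauss(0.0, sd_e) for v in y_g]
    return y, y_g, beta


rng = random.Random(42)
freqs = [rng.uniform(0.05, 0.5) for _ in range(100)]          # toy allele frequencies
genotypes = [[(rng.random() < p) + (rng.random() < p) for p in freqs]
             for _ in range(3000)]                            # independent SNPs (assumption)
y, y_g, beta = simulate_phenotype(genotypes, pi1=0.05, h2=0.4, rng=rng)
h2_realized = statistics.pvariance(y_g) / statistics.pvariance(y)
print(h2_realized)
```

The realized ratio fluctuates around the target because the residual is itself a finite random draw; at the sample sizes used in the text this fluctuation would be much smaller.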

The sample simple linear regression slope, β̂, and the Pearson correlation coefficient, r̂, are assumed to be t-distributed. These quantities have the same t-value: t = β̂/se(β̂) = r̂/se(r̂) = r̂√(N − 2)/√(1 − r̂²), with corresponding p-value from Student’s t cumulative distribution function (cdf) with N − 2 degrees of freedom: p = 2×tcdf(−|t|, N − 2) (see Supplementary Material). Since we are not here dealing with covariates, we calculated p from correlation, which is slightly faster than from estimating the regression coefficient. The t-value can be transformed to a z-value, giving the z-score for this p: z = −Φ⁻¹(p/2) × sign(r̂), where Φ is the normal cdf (z and t have the same p-value).

Parameter Estimation

We randomly pruned SNPs using the threshold r² > 0.8 to identify “synonymous” SNPs, performing ten such iterations. That is, for each of ten iterations, we randomly selected a SNP (not necessarily the one with largest z-score) to represent each subset of synonymous SNPs. For schizophrenia, for example, pruning resulted in approximately 1.3 million SNPs in each iteration.
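One way to realize random pruning is a greedy pass over a shuffled SNP order, sketched below. This is an assumption about the mechanics, since the text specifies only the r² > 0.8 threshold and the random choice of representative; the function name and the pair-list input are hypothetical.

```python
import random


def random_prune(n_snps, synonymous_pairs, rng):
    """Keep one randomly chosen representative per group of synonymous SNPs.

    synonymous_pairs: iterable of (i, j) index pairs with LD r^2 above the
    pruning threshold. SNPs are visited in random order; a SNP is kept
    unless a previously kept SNP is synonymous with it.
    """
    neighbors = {i: set() for i in range(n_snps)}
    for i, j in synonymous_pairs:
        neighbors[i].add(j)
        neighbors[j].add(i)
    order = list(range(n_snps))
    rng.shuffle(order)
    kept, removed = [], set()
    for i in order:
        if i in removed:
            continue
        kept.append(i)
        removed.update(neighbors[i])        # drop everything synonymous with i
    return sorted(kept)


pairs = [(0, 1), (1, 2), (3, 4)]
for seed in range(3):                       # several independent prunings
    print(random_prune(6, pairs, random.Random(seed)))
```

Repeating the pass with different seeds yields the independent pruned subsets over which the cost function is later averaged.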

The postulated pdf for a SNP’s z-score depends on the SNP’s LD and heterozygosity structure (histogram), ℋ. Given the data (the set of z-scores for available SNPs, as well as their LD and heterozygosity structure) and the ℋ-dependent pdf for z-scores, the objective is to find the model parameters that best predict the distribution of z-scores. We bin the SNPs with respect to a grid of heterozygosity and total LD; for any given H-L bin there will be a range of z-scores whose distribution the model is intended to predict. We find that a 10 × 10 grid of equally spaced bins is adequate for converged results. (Using equally-spaced bins might seem inefficient because of the resulting very uneven distribution of z-scores among grid elements: for example, orders of magnitude more SNPs in grid elements with low total LD compared with high total LD. However, the objective is to model the effects of H and L: using variable grid element sizes so as to maximize balance of SNP counts among grid elements means that the true H- and L-mediated effects of the SNPs in a narrow range of H and L get subsumed with the effects of many more SNPs in a much wider range of H and L, a misspecification of the pdf leading to some inaccuracy.) In lieu of or in addition to total LD (L) binning, one can bin SNPs with respect to their total LD block size (total number of SNPs in LD, ranging from 1 to ~1,500).

To find the model parameters that best fit the data, for a given H-L bin we binned the selected SNPs’ z-scores into equally-spaced bins of width dz=0.0742 (between z_min=−38 and z_max=38, allowing for p-values near the numerical limit of 10⁻³¹⁶), and from Eq. 25 calculated the probability for z-scores to be in each of those z-score bins (the prior probability for “success” in each z-score bin). Then, knowing the actual numbers of z-scores (numbers of “successes”) in each z-score bin, we calculated the multinomial probability, p_m, for this outcome. The optimal model parameter values will be those that maximize the accrual of this probability over all H-L bins. We constructed a cost function by calculating, for a given H-L bin, −ln(p_m) and averaging over prunings, and then accumulating this over all H-L bins. Model parameters minimizing the cost were obtained from Nelder-Mead multidimensional unconstrained nonlinear minimization of the cost function, using the Matlab function fminsearch().
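The cost function described above can be sketched as follows. This is an illustrative reimplementation in Python, not the authors' Matlab code; the function names and the nested-list input layout are hypothetical, and the small probability floor is an added safeguard rather than something stated in the text.

```python
import math


def neg_log_multinomial(counts, probs, floor=1e-300):
    """Return -ln(p_m) for observed z-score bin counts given model bin probabilities."""
    n = sum(counts)
    log_p = math.lgamma(n + 1)                       # ln(n!)
    for c, p in zip(counts, probs):
        if c == 0:
            continue                                 # empty bins contribute nothing
        log_p += c * math.log(max(p, floor)) - math.lgamma(c + 1)
    return -log_p


def total_cost(hl_bins):
    """Average -ln(p_m) over prunings within each H-L bin, then sum over bins.

    hl_bins: list over H-L bins; each entry is a list over prunings of
    (counts, probs) pairs.
    """
    cost = 0.0
    for prunings in hl_bins:
        cost += sum(neg_log_multinomial(c, p) for c, p in prunings) / len(prunings)
    return cost


# Toy example: one H-L bin, two prunings, two z-score bins.
example = [[([1, 1], [0.5, 0.5]), ([2, 0], [0.5, 0.5])]]
print(total_cost(example))
```

In a Python setting the minimization step could be carried out with `scipy.optimize.minimize(..., method="Nelder-Mead")`, which plays the same role as Matlab's fminsearch().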

Posterior Effect Sizes

Model posterior effect sizes, given z (along with N, the LD and heterozygosity structure ℋ, and the model parameters), were calculated using numerical integration over the random variable δ:

δ_expected ≡ E(δ|z) = ∫ P(δ|z) δ dδ = (1/P(z)) ∫ P(z|δ) P(δ) δ dδ.    (26)

Here, since z|δ ∼ N(δ, σ₀²), the probability of z given δ is simply

P(z|δ) = ϕ(z; δ, σ₀²).    (27)

P(z) is shorthand for pdf(z|N, ℋ, π₁, σ_β, σ₀), given by Eq. 25. P(δ) is calculated by a procedure similar to the one that led to Eq. 25 but ignoring the environmental contributions {E_j}. Specifically, let F_δ = (G₁, …, G_n) denote the vector of products of Fourier transform values. Then, the vector of pdf values for genetic effect bins (indexed by i; numerically, these will be the same as the z-score bins) in the H-L bin, pdf_δ = (f₁, …, f_n) where f_i ≡ pdf(δ_i|ℋ), is

pdf_δ = ℱ⁻¹[F_δ].    (28)

Similarly,

E(δ²|z) = (1/P(z)) ∫ P(z|δ) P(δ) δ² dδ,    (29)

which is used in power calculations.

GWAS Replication

A related matter has to do with whether z-scores for SNPs reaching genome-wide significance in a discovery-sample are compatible with the SNPs’ z-scores in a replication-sample, particularly if any of those replication-sample z-scores are far from reaching genome-wide significance, or whether any apparent mismatch signifies some overlooked inconsistency. The model pdf allows one to make a principled statistical assessment in such cases. We present the details for this application, and results applied to studies of bipolar disorder, in the Supplementary Material.

GWAS Power

Chip heritability, h², is the proportion of phenotypic variance that in principle can be captured additively by the n_snp SNPs under study [17]. It is of interest to estimate the proportion of h² that can be explained by SNPs reaching genome-wide significance, p≤5×10⁻⁸ (i.e., for which |z|>z_t=5.45), at a given sample size [58, 59]. In Eq. 1, for SNP i with genotype vector g_i over N samples, let y_gi ≡ g_iβ_i. If the SNP’s heterozygosity is H_i, then var(y_gi) = β_i²H_i. If we knew the full set {β_i} of true β-values, then, for z-scores from a particular sample size N, the proportion of SNP heritability captured by genome-wide significant SNPs, A(N), would be given by

$$A(N) = \frac{\sum_{i:\,|z_i|>z_t} \beta_i^2 H_i}{\sum_{\text{all } i} \beta_i^2 H_i}. \tag{30}$$

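Now, from Eq. 15, δ_i ≡ √N Σ_j √H_j r_ij β_j. If SNP i is causal and sufficiently isolated so that it is not in LD with other causal SNPs, then δ_i = √(N H_i) β_i, and var(y_gi) = δ_i²/N. When all causal SNPs are similarly isolated, Eq. 30 becomes

$$A(N) = \frac{\sum_{i:\,|z_i|>z_t} \delta_i^2}{\sum_{\text{all } i} \delta_i^2}. \tag{31}$$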

Of course, the true β_i are not known and some causal SNPs will likely be in LD with others. Furthermore, due to LD with causal SNPs, many SNPs will have a nonzero (latent or unobserved) effect size, δ. Nevertheless, we can formulate an approximation to A(N) which, assuming the pdf for z-scores (Eq. 25) is reasonable, will be inaccurate to the degree that the average LD structure of genome-wide significant SNPs differs from the overall average LD structure. As before (see the subsection “Model PDF: Convolution”), consider a fixed set of n equally-spaced nominal z-scores covering a wide range of possible values (changing from the summations in Eq. 31 to the uniform summation spacing ∆z now requires bringing the probability density into the summations). For each z from the fixed set (and, as before, employing data reduction by averaging so that H and L denote values for the 10 × 10 grid), use E(δ²|z, N, H, L) given in Eq. 29 to define

$$C(z \mid N, H, L) \equiv E(\delta^2 \mid z, N, H, L)\, P(z \mid N, H, L) \tag{32}$$

(emphasizing dependence on N, H, and L). Then, for any N, A(N) can be estimated by

$$A(N) = \frac{\sum_{H,L} \sum_{z:\,|z|>z_t} C(z \mid N, H, L)}{\sum_{H,L} \sum_{\text{all } z} C(z \mid N, H, L)}, \tag{33}$$

where Σ_H,L denotes sum over the H-L grid elements. The ratio in Eq. 33 should be accurate if the average effects of LD in the numerator and denominator cancel – which will always be true as the ratio approaches 1 for large N. Plotting A(N) gives an indication of the power of future GWAS to capture chip heritability.
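The behaviour of A(N) can be illustrated with a deliberately simplified Python sketch. It assumes no LD at all (every causal SNP isolated), a single heterozygosity value for all SNPs, and Gaussian causal effects, so that E(δ²|z) has a closed form; the parameter values in the usage lines are hypothetical and are not estimates from this study.

```python
import math
from statistics import NormalDist

# Genome-wide significance threshold for two-tailed p = 5e-8.
Z_T = NormalDist().inv_cdf(1.0 - 5e-8 / 2.0)

def captured_fraction(n, het, sigma_beta2, sigma0=1.0, dz=0.01, z_max=38.0):
    """Toy estimate of A(N): ratio of sum over |z| > z_t to sum over all z
    of E(delta^2 | z) * P(z), for isolated causal SNPs (no LD assumed)."""
    s2 = n * het * sigma_beta2             # variance of delta for causal SNPs
    total_var = s2 + sigma0 ** 2           # variance of z for causal SNPs
    shrink = s2 / total_var                # posterior mean of delta is shrink * z
    post_var = shrink * sigma0 ** 2        # posterior variance of delta
    marginal = NormalDist(0.0, math.sqrt(total_var))
    numerator = denominator = 0.0
    for i in range(int(round(2.0 * z_max / dz)) + 1):
        z = -z_max + i * dz
        c = (post_var + (shrink * z) ** 2) * marginal.pdf(z)
        denominator += c
        if abs(z) > Z_T:
            numerator += c
    return numerator / denominator

# Hypothetical inputs: het = 0.2 and sigma_beta2 = 5e-5 are illustrative only.
a_small = captured_fraction(1e5, 0.2, 5e-5)
a_large = captured_fraction(1e6, 0.2, 5e-5)
```

Non-causal SNPs have δ = 0 in this toy setting and so drop out of both sums, which is why the polygenicity does not appear; in the full model LD gives many non-causal SNPs a nonzero δ.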

Quantile-Quantile Plots and Genomic Control

One of the advantages of quantile-quantile (QQ) plots is that on a logarithmic scale they emphasize behavior in the tails of a distribution, and provide a valuable visual aid in assessing the independent effects of polygenicity, strength of association, and variance distortion – the roles played by the three model parameters – as well as showing how well a model fits data. QQ plots for the model were constructed using Eq. 25, replacing the normal pdf with the normal cdf, and replacing z with an equally-spaced vector of length 10,000 covering a wide range of nominal |z| values (0 through 38). SNPs were divided into a 10 × 10 grid of H × L bins, and the cdf vector (with elements corresponding to the z-values in that vector) accumulated for each such bin (using mean values of H and L for SNPs in a given bin).
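For the data side of such a plot, a minimal Python sketch is given below. It covers only the empirical curve (the model curve requires the cdf form of Eq. 25), the function names are hypothetical, and the complementary error function is used so that two-tailed p-values stay nonzero for large |z|.

```python
import math
from bisect import bisect_left

def two_tailed_p(z):
    """Two-tailed p-value for a standard normal z-score."""
    return math.erfc(abs(z) / math.sqrt(2.0))

def empirical_qq_points(z_scores, z_nominal):
    """Return (-log10 q, -log10 p) pairs, where q is the proportion of
    SNPs whose |z| is at least the nominal value and p is the nominal
    two-tailed p-value. Nominal values exceeded by no SNP are skipped."""
    abs_z = sorted(abs(z) for z in z_scores)
    n = len(abs_z)
    points = []
    for z0 in z_nominal:
        n_exceeding = n - bisect_left(abs_z, z0)
        if n_exceeding == 0:
            continue
        q = n_exceeding / n
        points.append((-math.log10(q), -math.log10(two_tailed_p(z0))))
    return points

pts = empirical_qq_points([0.5, -1.5, 2.5, -3.5], [1.0, 3.0])
```

Under the null, q equals p at every threshold, so the points fall on the diagonal; departures above it indicate an excess of large |z|.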

For a given set of samples and SNPs, the genomic control factor, λ, for the z-scores is defined as the median z² divided by the median for the null distribution, 0.455 [19]. This can also be calculated from the QQ plot. In the plots we present here, the abscissa gives the -log₁₀ of the proportion, q, of SNPs whose z-scores exceed the two-tailed significance threshold p, transformed in the ordinate as −log₁₀(p). The median is at q_med = 0.5, or −log₁₀(q_med) ≃ 0.3; the corresponding empirical and model p-value thresholds (p_med) for the z-scores – and equivalently for the z-scores-squared – can be read off from the plots. The genomic inflation factor is then given by

$$\lambda = \frac{\left[\Phi^{-1}(p_{med}/2)\right]^2}{0.455},$$

where Φ⁻¹ is the inverse of the standard normal cdf.

Note that the values of λ reported here are for pruned SNP sets; these values will be lower than for the total GWAS SNP sets.
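Both routes to λ (directly from the z-scores, or from the median p-value threshold read off a QQ plot) can be sketched in a few lines of Python; the function names are hypothetical and 0.455 is the null median quoted in the text.

```python
from statistics import NormalDist, median

CHI2_1_MEDIAN = 0.455  # median of z^2 under the null, as quoted in the text

def lambda_from_z(z_scores):
    """Genomic control factor: median z^2 over the null median."""
    return median(z * z for z in z_scores) / CHI2_1_MEDIAN

def lambda_from_p_med(p_med):
    """Same factor from the two-tailed p-value threshold at q = 0.5."""
    z_med = NormalDist().inv_cdf(1.0 - p_med / 2.0)
    return z_med ** 2 / CHI2_1_MEDIAN
```

Multiplying every z-score by a constant c multiplies λ by c², which is how a global variance distortion shows up in this statistic.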

Knowing the total number, n_tot, of p-values involved in a QQ plot (number of GWAS z-scores from pruned SNPs), any point (q, p) (log-transformed) on the plot gives the number, n_p = q n_tot, of p-values that are as extreme as or more extreme than the chosen p-value. This can be thought of as n_p “successes” out of n_tot independent trials (thus ignoring LD) from a binomial distribution with prior probability q. To approximate the effects of LD, we estimate the number of independent SNPs as n_tot/f where f ≃ 10. The 95% binomial confidence interval for q is calculated as the exact Clopper-Pearson 95% interval, .
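A minimal Python sketch of this interval follows, using bisection on the binomial cdf rather than beta quantiles so that it needs only the standard library. The function names are hypothetical, and rounding the effective number of SNPs and the success count to integers is an assumption of the sketch.

```python
import math

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), summing terms computed in log space."""
    if k < 0:
        return 0.0
    if k >= n or p <= 0.0:
        return 1.0
    if p >= 1.0:
        return 0.0
    log_p, log_q = math.log(p), math.log1p(-p)
    total = 0.0
    for i in range(k + 1):
        log_term = (math.lgamma(n + 1) - math.lgamma(i + 1)
                    - math.lgamma(n - i + 1) + i * log_p + (n - i) * log_q)
        total += math.exp(log_term)
    return min(total, 1.0)

def _root_of_decreasing(g, iterations=60):
    """Bisection on [0, 1] for a function that decreases through zero."""
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided (1 - alpha) interval for a binomial proportion."""
    lower = 0.0 if k == 0 else _root_of_decreasing(
        lambda p: binom_cdf(k - 1, n, p) - (1.0 - alpha / 2.0))
    upper = 1.0 if k == n else _root_of_decreasing(
        lambda p: binom_cdf(k, n, p) - alpha / 2.0)
    return lower, upper

def qq_interval(q, n_tot, f=10.0, alpha=0.05):
    """Interval for q using n_tot / f effectively independent SNPs."""
    n_eff = max(1, round(n_tot / f))
    return clopper_pearson(round(q * n_eff), n_eff, alpha)
```

For large SNP counts the term-by-term sum becomes slow; the same bounds are available as beta quantiles, for example through `scipy.stats.beta.ppf`.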

Number of Causal SNPs

The estimated number of causal SNPs is given by the polygenicity, π₁, times the total number of SNPs, n_snp: n_causal = π₁n_snp. n_snp is given by the total number of SNPs that went into building the heterozygosity/LD structure, ℋ, in Eq. 25, i.e., the approximately 11 million SNPs selected from the 1000 Genomes Phase 3 reference panel, not the number of tag SNPs in the particular GWAS. The parameters estimated are to be seen in the context of the reference panel, which we assume contains all common causal variants. Stable quantities (i.e., fairly independent of the reference panel size, e.g., using the full panel or ignoring every second SNP) are the estimated effect size variance and number of causal variants – which we demonstrate below – and hence the heritability. Thus, the polygenicity will scale inversely with the reference panel size. A reference panel with a substantially larger number of samples would allow for inclusion of more SNPs (non-zero MAF), and thus the actual polygenicity estimated would change slightly.

Narrow-sense Chip Heritability

Since we are treating the β coefficients as fixed effects in the simple linear regression GWAS formalism, with the phenotype vector standardized with mean zero and unit variance, from Eq. 1 the proportion of phenotypic variance explained by a particular causal SNP whose reference panel genotype vector is g, q²=var(y; g), is given by q² = β²H. The proportion of phenotypic variance explained additively by all causal SNPs is, by definition, the narrow sense chip heritability, h². Since E(β²) = σ_β² and n_causal = π₁n_snp, and taking the mean heterozygosity over causal SNPs to be approximately equal to the mean over all SNPs, H̄, the chip heritability can be estimated as

$$h^2 = \pi_1\, n_{snp}\, \bar{H}\, \sigma_\beta^2. \tag{34}$$

Mean heterozygosity from the ~11 million SNPs is .
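The arithmetic linking polygenicity, number of causal SNPs, and chip heritability can be sketched as follows. The function names are hypothetical, and the mean heterozygosity and discoverability used in the usage lines are placeholder values chosen only for illustration, not values from this study; the polygenicity and panel size are those quoted in the text for schizophrenia.

```python
def n_causal(pi1, n_snp):
    """Estimated number of causal SNPs in the reference panel."""
    return pi1 * n_snp

def chip_heritability(pi1, n_snp, mean_het, sigma_beta2):
    """Sum of beta^2 * H over causal SNPs, using the mean heterozygosity
    and the expected beta^2 (discoverability) for each causal SNP."""
    return n_causal(pi1, n_snp) * mean_het * sigma_beta2

N_SNP = 11_015_833           # reference panel size quoted in the text
PLACEHOLDER_MEAN_HET = 0.2   # hypothetical value, for illustration only
PLACEHOLDER_SIGMA2 = 5e-5    # hypothetical value, for illustration only

scz_causal = n_causal(2.84e-3, N_SNP)
h2_full = chip_heritability(2.84e-3, N_SNP, PLACEHOLDER_MEAN_HET, PLACEHOLDER_SIGMA2)
# Dropping every second SNP while doubling the polygenicity leaves both
# the causal count and the heritability unchanged.
h2_half = chip_heritability(2 * 2.84e-3, N_SNP / 2, PLACEHOLDER_MEAN_HET, PLACEHOLDER_SIGMA2)
```

The last two lines show why the polygenicity scales inversely with panel size while the number of causal SNPs and the heritability are stable.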

For all-or-none traits like disease status, the estimated h² from Eq. 34 for an ascertained case-control study is on the observed scale and is a function of the prevalence in the adult population, K, and the proportion of cases in the study, P. The heritability on the underlying continuous liability scale [61], h²_l, is obtained by adjusting for ascertainment (multiplying by K(1 − K)/(P (1 − P)), the ratio of phenotypic variances in the population and in the study) and rescaling based on prevalence [62, 6]:

$$h_l^2 = h^2 \times \frac{K(1-K)}{P(1-P)} \times \frac{K(1-K)}{a^2}, \tag{35}$$

where a is the height of the standard normal pdf at the truncation point z_K defined such that the area under the curve in the region to the right of z_K is K.
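The two-step conversion described above (ascertainment adjustment, then rescaling by prevalence) can be sketched in Python as below. The function name is hypothetical; the usage lines take the observed-scale heritability, prevalence, and case and control counts quoted later in the text for schizophrenia.

```python
from statistics import NormalDist

def liability_scale_h2(h2_observed, prevalence, case_fraction):
    """Convert observed-scale heritability from an ascertained
    case-control study to the liability scale."""
    k, p = prevalence, case_fraction
    std_normal = NormalDist()
    z_k = std_normal.inv_cdf(1.0 - k)       # truncation point: upper-tail area is K
    a = std_normal.pdf(z_k)                 # height of the normal pdf at z_K
    ascertainment = k * (1.0 - k) / (p * (1.0 - p))
    rescaling = k * (1.0 - k) / a ** 2
    return h2_observed * ascertainment * rescaling

# Schizophrenia inputs quoted in the text: h2 = 0.37, K = 0.01,
# 51,900 cases and 71,675 controls.
p_cases = 51_900 / (51_900 + 71_675)
h2_liability = liability_scale_h2(0.37, 0.01, p_cases)
```

When the study is not ascertained (P = K) the first factor is 1 and only the prevalence rescaling remains.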

RESULTS

Simulations

Table 2 shows the simulation results, comparing true and estimated values for the model parameters, heritability, and the number of causal SNPs, for twelve scenarios where π₁ and σ_β² both range over three orders of magnitude, encompassing the range of values for the phenotypes; in Supplementary Material, Figure 6 shows QQ plots for a randomly chosen (out of 10) β-vector and phenotype instantiation for each of the twelve (π₁, h²) scenarios. Most of the polygenicity estimates are in very good agreement with the true values, though for the extreme scenario of high heritability and low polygenicity π₁ is overestimated by factors of two-to-three. The numbers of estimated causal SNPs (out of ~11 million) are in correspondingly good agreement with the true values, ranging in increasing powers of 10 from 110 through 110,158. The estimated discoverabilities are also in good agreement with the true values. In most cases, σ₀² is close to 1, indicating little or no global inflation, though it is elevated for high heritability with high polygenicity, suggesting it is capturing some ubiquitous effects.

In Supplementary Material, we examine the issue of model misspecification. Specifically, we assign causal effects β drawn from a Gaussian whose variance is not simply a constant but depends on heterozygosity, such that rarer causal SNPs will tend to have larger effects [15]. The results – see Supplementary Material Table 3 – show that the model still makes reasonable estimates of the underlying genetic architecture. Additionally, we tested the scenario where true causal effects are distributed with respect to two Gaussians [14], a situation that allows for a small number of the causal SNPs to have quite large effects – see Supplementary Material Table 4. We find that heritabilities are still reasonably estimated using our model. In all these scenarios the overall data QQ plots were accurately reproduced by the model. As a counterexample, we simulated summary statistics where the prior probability of a reference SNP being causal decreased linearly with total LD (see Supplementary Material Table 5). In this case, our single Gaussian fit (which assumes no LD dependence on the prior probability of a reference SNP being causal) did not produce model QQ plots that accurately tracked the data QQ plots (see Supplementary Material Figure 13). The estimated model parameters and heritabilities were also poor. But this scenario is highly artificial; in contrast, in situations where the data QQ plots were accurately reproduced by the model, the estimated model parameters and heritability were plausible.

Phenotypes

Figures 1 and 2 show QQ plots for the pruned z-scores for eight qualitative and eight quantitative phenotypes, along with model estimates (Supplementary Material Figs. 14-30 each show a 4 × 4 grid breakdown by heterozygosity × total-LD of QQ plots for all phenotypes studied here; the 4 × 4 grid is a subset of the 10 × 10 grid used in the calculations). In all cases, the model fit (yellow) closely tracks the data (dark blue). For the sixteen phenotypes, estimates for the model polygenicity parameter (fraction of reference panel, with ≃11 million SNPs, estimated to have non-null effects) range over two orders of magnitude, from π₁ ≃ 2 × 10⁻⁵ to π₁ ≃ 4 × 10⁻³. The estimated SNP discoverability parameter (variance of β, or expected β², for causal variants) also ranges over two orders of magnitude from to (in units where the variance of the phenotype is normalized to 1).

Figure 1:

QQ plots of (pruned) z-scores for qualitative phenotypes (dark blue, 95% confidence interval in light blue) with model prediction (yellow): (A) major depressive disorder; (B) bipolar disorder; (C) schizophrenia; (D) coronary artery disease (CAD); (E) ulcerative colitis (UC); (F) Crohn’s disease (CD); (G) late onset Alzheimer’s disease (AD), excluding APOE (see also Supplementary Material Fig. 10); and (H) amyotrophic lateral sclerosis (ALS), restricted to chromosome 9 (see also Supplementary Material Figure 12). The dashed line is the expected QQ plot under null (no SNPs associated with the phenotype). p is a nominal p-value for z-scores, and q is the proportion of z-scores with p-values exceeding that threshold. λ is the overall nominal genomic control factor for the pruned data (which is accurately predicted by the model in all cases). The three estimated model parameters are: polygenicity, π₁; discoverability, σ_β² (corrected for inflation); and SNP association χ²-statistic inflation factor, σ₀². h² is the estimated narrow-sense chip heritability, re-expressed as h²_l on the liability scale for these case-control conditions assuming a prevalence of: MDD 6.7% [50], BIP 0.5% [51], SCZ 1% [52], CAD 3% [53], UC 0.1% [54], CD 0.1% [54], AD 14% (for people aged 71 and older in the USA [55, 56]), and ALS 5 × 10⁻⁵ [57]. The estimated number of causal SNPs is given by n_causal = π₁n_snp, where n_snp = 11,015,833 is the total number of SNPs, whose LD structure and MAF underlie the model; the GWAS z-scores are for subsets of these SNPs. N_eff is the effective case-control sample size – see text. Reading the plots: on the vertical axis, choose a p-value threshold (more extreme values are further from the origin), then the horizontal axis gives the proportion of SNPs exceeding that threshold (higher proportions are closer to the origin). Numerical values for the model parameters are also given in Table 1. See also Supplementary Material Figs. 14-20.

Figure 2:

QQ plots of (pruned) z-scores and model fits for quantitative phenotypes: (A) educational attainment; (B) intelligence; (C) body mass index (BMI); (D) height; (E) putamen volume; (F) low-density lipoprotein (LDL); (G) high-density lipoprotein (HDL); and (H) total cholesterol (TC). N is the sample size. See Fig. 1 for further description. Numerical values for the model parameters are also given in Table 1. See also Supplementary Material Figs. 21-26.

Table 1:

Summary of model results for phenotypes shown in Figures 1 and 2. The subscript in h²_l indicates that for the qualitative phenotypes (the first eight) the reported SNP heritability is on the liability scale. MDD: Major Depressive Disorder; CAD: coronary artery disease; AD: Alzheimer’s Disease (excluding APOE locus; ^∗for the full autosomal reference panel, i.e., including APOE, for AD – see Supplementary Material Figure 10 (A) and (B)); BMI: body mass index; ^†ALS: amyotrophic lateral sclerosis, restricted to chromosome 9; LDL: low-density lipoproteins; HDL: high-density lipoproteins. ^$In addition to the 2014 height GWAS (N=251,747 [44]), we include here model results for the 2010 (N=133,735 [63]) and 2018 (N=707,868 [47]) height GWAS; there is remarkable consistency for the 2010 and 2014 GWAS despite very large differences in the sample sizes – see Supplementary Material Figure 11.

Table 2:

Simulation results: comparison of mean (std) true and estimated (^) model parameters and derived quantities. Results for each line, for specified heritability h² and fraction π₁ of causal SNPs, are from 10 independent instantiations with random selection of the n_causal causal SNPs that are assigned a β-value from the standard normal distribution. Defining Y_g = Gβ, where G is the genotype matrix, the total phenotype vector is constructed as Y = Y_g + ε, where the residual random vector ε for each instantiation is drawn from a normal distribution such that var(Y) = var(Y_g)/h² for predefined h². For each of the instantiations, i, this implicitly defines the true value σ_β,i², and σ̄_β² is their mean. An example QQ plot for each line entry is shown in Supplementary Material, Figure 6.

We find that schizophrenia and bipolar disorder appear to be similarly highly polygenic, with model polygenicities ≃ 2.84 × 10⁻³ and ≃ 2.70 × 10⁻³, respectively. The model polygenicity of major depressive disorder, however, is 40% higher, π₁ ≃ 4 × 10⁻³ – the highest value among the sixteen phenotypes. In contrast, the model polygenicities of late onset Alzheimer’s disease and Crohn’s disease are almost thirty times smaller than that of schizophrenia.

In Supplementary Material Figure 10 we show results for Alzheimer’s disease exclusively for chromosome 19 (which contains APOE), and for all autosomal chromosomes excluding chromosome 19. We also show results with the same chromosomal breakdown for a recent GWAS involving 455,258 samples that included 24,087 clinically diagnosed LOAD cases and 47,793 AD-by-proxy cases (individuals who were not clinically diagnosed with LOAD but for whom at least one parent had LOAD) [65]. These GWAS give consistent estimates of polygenicity: π₁ ∼ 1 × 10⁻⁴ excluding chromosome 19, and π₁ ∼ 6 × 10⁻⁵ for chromosome 19 exclusively.

Of the quantitative traits, educational attainment has the highest model polygenicity, π₁ = 3.2 × 10⁻³, similar to intelligence, π₁ = 2.2 × 10⁻³. Approximately two orders of magnitude lower in polygenicity are the endophenotypes putamen volume and low- and high-density lipoproteins.

The model effective SNP discoverability for schizophrenia is , similar to that for bipolar disorder. Major depressive disorder, which has the highest polygenicity, has the lowest SNP discoverability, approximately one-eighth that of schizophrenia; it is this low value, combined with high polygenicity that leads to the weak signal in Figure 1 (A) even though the sample size is relatively large. In contrast, SNP discoverability for Alzheimer’s disease is almost four times that of schizophrenia. The inflammatory bowel diseases, however, have much higher SNP discoverabilities, 16 and 31 times that of schizophrenia respectively for ulcerative colitis and Crohn’s disease – the latter having the second highest value of the sixteen phenotypes: .

Additionally, for Alzheimer’s disease we show in Supplementary Material Figure 10 that the discoverability is two orders of magnitude greater for chromosome 19 than for the remainder of the autosome. Note that since two-thirds of the 2018 “cases” are AD-by-proxy, the discoverabilities for the 2018 data are, as expected, reduced relative to the values for the 2013 data (approximately 3.5 times smaller).

The narrow sense SNP heritability from the ascertained case-control schizophrenia GWAS is estimated as h²=0.37. Taking adult population prevalence of schizophrenia to be K=0.01 [66, 67] (but see also [68], for K=0.005), and given that there are 51,900 cases and 71,675 controls in the study, so that the proportion of cases in the study is P =0.42, the heritability on the liability scale for schizophrenia from Eq. 35 is h²_l ≃ 0.21. For bipolar disorder, with K=0.005 [51], 20,352 cases and 31,358 controls, . Major depressive disorder appears to have a much lower model-estimated SNP heritability than schizophrenia: . The model estimate of SNP heritability for height is 17%, lower than the oft-reported value ~50% (see Discussion). However, despite the huge differences in sample size, we find the same value, 17%, for the 2010 GWAS (N =133,735 [63]), and 19% for the 2018 GWAS (N = 707,868 [47, 44]) – see Table 1.

Figure 3 shows the sample size required so that a given proportion of chip heritability is captured by genome-wide significant SNPs for the phenotypes (assuming equal numbers of cases and controls for the qualitative phenotypes: N_eff = 4/(1/N_cases + 1/N_controls), so that when N_cases = N_controls, N_eff = N_cases + N_controls = N, the total sample size, allowing for a straightforward comparison with quantitative traits). At current sample sizes, only 4% of narrow-sense chip heritability is captured for schizophrenia and only 1% for bipolar disorder; using current methodologies, a sample size of N_eff ~ 1 million would be required to capture the preponderance of SNP heritability for these phenotypes. Major depressive disorder GWAS currently is greatly underpowered, as shown in Figure 3(A). For education, we predict that 3.5% of phenotypic variance would be explained at N = 1.1 million, in good agreement with the value found from direct computation of 3.2% [64]. For other phenotypes, the proportions of total SNP heritability captured at the available sample sizes are given in Figure 3.
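The effective sample size formula quoted above is simple enough to check directly; the short Python sketch below uses a hypothetical function name and the schizophrenia case and control counts given earlier in the text.

```python
def effective_sample_size(n_cases, n_controls):
    """N_eff = 4 / (1/N_cases + 1/N_controls)."""
    return 4.0 / (1.0 / n_cases + 1.0 / n_controls)

# Balanced design: N_eff equals the total sample size.
balanced = effective_sample_size(50_000, 50_000)

# Schizophrenia counts quoted in the text (51,900 cases, 71,675 controls).
scz_n_eff = effective_sample_size(51_900, 71_675)
```

An unbalanced design always gives an N_eff below the total sample size, which is why the formula puts case-control and quantitative-trait studies on a comparable footing.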

Figure 3:

Proportion of narrow-sense chip heritability, A(N) (Eq. 33), captured by genome-wide significant SNPs as a function of sample size, N, for phenotypes shown in Figures 1 and 2. Values for current sample sizes are shown in parentheses. Left-to-right curve order is determined by decreasing σ_β². The prediction for education at sample size N=1.1 million is A(N) = 0.27, so that the proportion of phenotypic variance explained is predicted to be 3.5%, in good agreement with 3.2% reported in [64]. (The curve for AD excludes the APOE locus. For HDL, see Supplementary Material for additional notes.)

The sample size for ALS was quite low, and we restricted the analysis to chromosome 9, which had most of the genome-wide significant tag SNPs; we estimate that there are ~7 causal SNPs on chromosome 9 [69, 70], with very high discoverability, . In contrast, for AD restricted to chromosome 19, there were an estimated 14 causal SNPs with discoverability (see Supplementary Material Figure 10 (B)).

In this study, we assume that population stratification in the raw data has been corrected for in the publicly-available summary statistics. However, given that some of the sample sizes are extremely large, we allow for the possibility of residual cryptic relatedness. This would result in a scaling of the z-scores, Eq. 9 [19]. Thus, to test the modeling of inflation due to cryptic relatedness, we scaled the simulation z-scores as described earlier (z = σ₀z_u with σ₀ > 1, where z_u are the original z-scores, i.e., not artificially inflated) and reran the model. E.g., for education and schizophrenia we inflated the z-scores by a factor of 1.2. For schizophrenia we found , which is almost exactly as predicted (1.14 × 1.2 = 1.368), while the polygenicity and discoverability parameters are essentially unchanged: π₁ = 2.81 × 10⁻³, . For education we found , which again is almost exactly as predicted (1.0 × 1.2 = 1.2), while the polygenicity and discoverability parameters are again essentially unchanged: π₁ = 3.19 × 10⁻³, and .

A comparison of our results with those of [14] and [15] is in Supplementary Material Table 6. Critical methodological differences with model M2 in [14] are that we use a full reference panel of 11 million SNPs from 1000 Genomes Phase 3, we allow for the possibility of inflation in the data, and we provide an exact solution, based on Fourier transforms, for the z-score pdf arising from the posited distribution of causal effects, resulting in better fits of the model and the data QQ plots – as can be seen by comparing our QQ plots with those reported in 6. Although our estimated numbers of causal SNPs are often within a factor of two of those from the nominally equivalent model M2 of Zhang et al. [14], there is no clear pattern to the mismatch.

Dependence on Reference Panel

Given a liberal MAF threshold of 0.002, our reference panel should contain the vast majority of common SNPs for European ancestry. However, it does not include other structural variants (such as small insertions/deletions, or haplotype blocks) which may also be causal for phenotypes. To validate our parameter estimates for an incomplete reference, we reran our model on real phenotypes using the culled reference where we exclude every other SNP. The result is that all estimated parameters are as before except that π₁ doubles, leaving the estimated number of causal SNPs and heritability as before. For example, for schizophrenia we get π₁ = 5.3 × 10⁻³ and for the reduced reference panel, versus π₁ = 2.8 × 10⁻³ and for the full panel, with heritability remaining essentially the same (37% on the observed scale).

DISCUSSION

Here we present a unified method based on GWAS summary statistics, incorporating detailed LD structure from an underlying reference panel of SNPs with MAF>0.002, for estimating: phenotypic polygenicity, π₁, expressed as the fraction of the reference panel SNPs that have a non-null true β value, i.e., are “causal”; and SNP discoverability or mean strength of association (the variance of the underlying causal effects), σ_β². In addition, the model can be used to estimate residual inflation of the association statistics due to variance distortion induced by cryptic relatedness, σ₀². The model assumes that there is very little, if any, inflation in the GWAS summary statistics due to population stratification (bias shift in z-scores due to ethnic variation).

We apply the model to sixteen diverse phenotypes, eight qualitative and eight quantitative. From the estimated model parameters we also estimate the number of causal common-SNPs in the underlying reference panel, n_causal, and the narrow-sense common-SNP heritability, h² (for qualitative phenotypes, we re-express this as the proportion of population variance in disease liability, h²_l, under a liability threshold model, adjusted for ascertainment); in the event rare SNPs (i.e., not in the reference panel) are causal, h² will be an underestimate of the true SNP heritability. In addition, we estimate the proportion of SNP heritability captured by genome-wide significant SNPs at current sample sizes, and predict future sample sizes needed to explain the preponderance of SNP heritability.

We find that schizophrenia is highly polygenic, with π₁ = 2.8 × 10⁻³. This leads to an estimate of n_causal ≃ 31,000, which is in reasonable agreement with a recent estimate that the number of causals is >20,000 [71]. The SNP associations, however, are characterized by a narrow distribution, , indicating that most associations are of weak effect, i.e., have low discoverability. Bipolar disorder has similar parameters. The smaller sample size for bipolar disorder has led to fewer SNP discoveries compared with schizophrenia. However, from Figure 3, sample sizes for bipolar disorder are approaching a range where rapid increase in discoveries becomes possible. For educational attainment [72, 40, 73], the polygenicity is somewhat greater, π₁ = 3.2 × 10⁻³, leading to an estimate of n_causal ≃ 35,000, half a recent estimate, ≃70,000, for the number of loci contributing to heritability [72]. The variance of the distribution for causal effect sizes is a quarter that of schizophrenia, indicating lower discoverability. Intelligence, a related phenotype [41, 74], has a larger discoverability than education while having lower polygenicity (~10,000 fewer causal SNPs).

In marked contrast are the lipoproteins and putamen volume, which have very low polygenicity: π₁ < 5 × 10⁻⁵, so that only 250 to 550 SNPs (out of ~11 million) are estimated to be causal. However, causal SNPs for putamen volume and HDL appear to be characterized by relatively high discoverability, respectively 17-times and 23-times larger than for schizophrenia (see Supplementary Material for additional notes on HDL).

The QQ plots (which are sample size dependent) reflect these differences in genetic architecture. For example, the early departure of the schizophrenia QQ plot from the null line indicates its high polygenicity, while the steep rise for putamen volume after its departure corresponds to its high SNP discoverability.

For Alzheimer’s disease, our estimate of the liability-scale SNP heritability for the full 2013 dataset [37] is 15% for prevalence of 14% for those aged 71 and older, half from APOE, while the recent “M2” and “M3” models of Zhang et al. [14] gave values of 7% and 10% respectively – see Supplementary Material Table 6. A recent report from two methods, LD Score Regression (LDSC) and SumHer [75], estimated SNP heritability of 3% for LDSC and 12% for SumHer (assuming prevalence of 7.5%). A raw genotype-based analysis (GCTA), including genes that contain rare variants that affect risk for AD, reported SNP heritability of 53% [76, 7]; an earlier related study that did not include rare variants and had only a quarter of the common variants estimated SNP heritability of 33% for prevalence of 13% [77]. GCTA calculations of heritability are within the domain of the so-called infinitesimal model where all markers are assumed to be causal. Our model suggests, however, that phenotypes are characterized by polygenicities less than 5 × 10⁻³; for AD the polygenicity is ≃ 10⁻⁴. Nevertheless, the GCTA approach yields a heritability estimate closer to the twin-based (broad sense) value, estimated to be in the range 60-80% [78]. The methodology appears to be robust to many assumptions about the distribution of effect sizes [79, 80]; the SNP heritability estimate is unbiased, though it has larger standard error than methods that allow for only sparse causal effects [63, 81]. For the 2013 data analyzed here [37], a summary-statistics-based method applied to a subset of 54,162 of the 74,046 samples gave SNP heritability of almost 7% on the observed scale [82, 12]; our estimate is 12% on the observed scale – see Supplementary Material Figure 10 A and B.

Onset and clinical progression of sporadic Alzheimer’s disease are strongly age-related [83, 84], with prevalence in differential age groups increasing at least up through the early 90s [55]. Thus, it would be more accurate to assess heritability (and its components, polygenicity and discoverability) with respect to, say, five-year age groups beginning with age 65 years, and using a consistent control group of nonagenarians and centenarians. By the same token, comparisons among current and past AD GWAS are complicated because of potential differences in the age distributions of the respective case and control cohorts. Additionally, the degree to which rare variants are included will affect heritability estimates. The summary-statistic-based estimates of polygenicity that we report here are, however, likely to be robust for common SNPs: π₁ ≃ 1.1 × 10⁻⁴, with only a few causal SNPs on chromosome 19.

Our point estimate for the liability-scale SNP heritability of schizophrenia is h²_l ≃ 0.21 (assuming a population risk of 0.01), and we estimate that 4% of this (i.e., 1% of overall disease liability) is explainable based on common SNPs reaching genome-wide significance at the current sample size. This estimate is in reasonable agreement with a recent result, [71, 85], also calculated from the PGC2 data set but using raw genotype data for 472,178 markers for a subset of 22,177 schizophrenia cases and 27,629 controls of European ancestry; and with an earlier result of from PGC1 raw genotype data for 915,354 markers for 9,087 schizophrenia cases and 12,171 controls [86, 7]. The recent “M2” (single non-null Gaussian) model estimate is [14] (see Supplementary Material Table 6). No QQ plot was available for the M2 model fit to schizophrenia data, but such plots (truncated on the y-axis at −log₁₀(p) = 10) for many other phenotypes were reported [14]. We note that for multiple phenotypes (height, LDL cholesterol, total cholesterol, years of schooling, Crohn’s disease, coronary artery disease, and ulcerative colitis) our single Gaussian model appears to provide a better fit to the data than M2: many of the M2 plots show a very early and often dramatic deviation between prediction and data, as compared with our model QQ plots which are also built from a single Gaussian, suggesting an upward bias in polygenicity and/or variance of effect sizes, and hence heritability as measured by the M2 model for these phenotypes. The LDSC liability-scale (1% prevalence) SNP heritability for schizophrenia has been reported as [12] and more recently as 0.19 [75], in very good agreement with our estimate; on the observed scale it has been reported as 45% [82, 12], in contrast to our corresponding value of 37%.

Our estimate of 1% of overall variation on the liability scale for schizophrenia explainable by genome-wide significant loci compares reasonably with the proportion of variance on the liability scale explained by Risk Profile Scores (RPS) reported as 1.1% using the “MGS” sample as target (the median for all 40 leave-one-out target samples analyzed is 1.19% – see Extended Data Figure 5 and Supplementary Tables 5 and 6 in [34]; this was incorrectly reported as 3.4% in the main paper). These results show that current sample sizes need to increase substantially in order for RPSs to have predictive utility, as the vast majority of associated SNPs remain undiscovered. Our power estimates indicate that ~500,000 cases and an equal number of controls would be needed to identify these SNPs (note that there is a total of approximately 3 million cases in the US alone).

For educational attainment, we estimate SNP heritability h² = 0.12, in good agreement with the estimate of 11.5% given in [40]. As with schizophrenia, this is substantially less than the estimate of heritability from twin and family studies of ≃40% of the variance in educational attainment explained by genetic factors [87, 72].

For putamen volume, we estimate the SNP heritability h² = 0.11, in reasonable agreement with an earlier estimate of 0.1 for the same overall data set [45, 4]. For LDL and HDL, we estimate h² = 0.06 and h² = 0.07 respectively, in good agreement with the LDSC estimates h² = 0.08 and h² = 0.07 [75], and the M2 model of [14] – see Supporting Material Table 6.

For height (2014 GWAS, N=251,747 [44]) we find that its model polygenicity is π₁ = 5.66 × 10⁻⁴, a quarter that of intelligence (and very far from “omnigenic” [88]), while its discoverability is five times that of intelligence, leading to a SNP heritability of 17%. For the 2010 GWAS (N=133,735 [63]) and 2018 GWAS (N=707,868 [47]), we estimate SNP heritabilities of 17% and 19%, respectively (see Table 1 and Supplementary Material Fig. 11). These heritabilities are in considerable disagreement with the SNP heritability estimate of ≃50% [44] (average of estimates from five cohorts ranging in size from N=1,145 to N=5,668, with ~1 million SNPs). For the 2010 GWAS, the M2 model [14] gives h² = 0.30 (see Supporting Material Table 6); the upward deviation of the model QQ plot in [14] suggests that this value might be inflated. For the 2014 GWAS, the M3 model estimate is h² = 33% [14]; the Regression with Summary Statistics (RSS) model estimate is h² = 52% (with ≃11,000 causal SNPs) [89], which, not taking any inflation into account, is definitely a model overestimate; and in [75] the LDSC estimate is reported as h² = 20% while the SumHer estimate is h² = 46% (in general across traits, the SumHer heritability estimates tend to be two to five times larger than the LDSC estimates). The M2, M3, and RSS models use a reference panel of ~1 million common SNPs, in contrast with the ~11 million SNPs used in our analysis. It should also be noted that the M2, M3, and RSS model estimates did not take the possibility of inflation into account. For the 2014 height GWAS, the LDSC intercept is reported as 2.09 in [75], indicating considerable inflation; for the 2018 dataset we find ; the LD score regression intercept is 2.1116 (0.0458) on that dataset. Given the various estimates of inflation and the controversy over population structure in the height data [48, 49], it is not clear which results are definitely incorrect.
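
As a rough reading of the height numbers above, polygenicity times the size of the reference panel gives the expected number of causal SNPs, and, because SNP heritability is proportional to the product of polygenicity and discoverability, the two ratios quoted relative to intelligence combine into a heritability ratio. The sketch below only redoes that arithmetic. The variable names are illustrative, the panel size is the approximate ~11 million figure from the text, "a quarter" and "five times" are the rounded ratios quoted there, and the heritability ratio assumes comparable mean heterozygosity of causal SNPs for the two traits.

```python
# Values quoted in the text for the height GWAS.
PI1_HEIGHT = 5.66e-4            # model polygenicity
N_REFERENCE_SNPS = 11_000_000   # approximate reference panel size (~11 million)

# Expected number of causal SNPs implied by the model polygenicity.
expected_causal = PI1_HEIGHT * N_REFERENCE_SNPS

# Rounded ratios of height relative to intelligence, as stated in the text.
polygenicity_ratio = 0.25
discoverability_ratio = 5.0

# h2 is proportional to polygenicity x discoverability (assumes similar
# mean heterozygosity of causal SNPs for both traits).
h2_ratio = polygenicity_ratio * discoverability_ratio

print(f"expected causal SNPs for height: about {expected_causal:,.0f}")
print(f"height h2 relative to intelligence: about {h2_ratio:.2f}x")
```

On these rounded inputs the model implies a few thousand causal SNPs for height, well below the ≃11,000 quoted for the RSS model.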

Our power analysis for height (2014) shows that 37% of the narrow-sense heritability arising from common SNPs is explained by genome-wide significant SNPs (p ≤ 5 × 10⁻⁸), i.e., 6.3% of total phenotypic variance, which is substantially less than the 16% direct estimate from significant SNPs [44]. It is not clear why these large discrepancies exist. One relevant factor, however, is that we estimate considerable confounding in the height 2014 dataset. Our h² estimates are adjusted for the potential confounding measured by the model's residual inflation parameter, and thus they likely represent a lower bound on the actual SNP heritability, giving a more conservative estimate than has previously been reported. We note that after adjustment, our h² estimates are consistent across all three datasets (height 2010, 2014 and 2018), which otherwise would range by more than 2.5-fold. Another factor might be the relative dearth of typed SNPs with low heterozygosity and low total LD (see top left segment in Supporting Material Figure 23, n = 780): there might be many causal variants with weak effect that are only weakly tagged. Nevertheless, given the discrepancies noted above, caution is warranted in interpreting our model results for height.
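
The 6.3% figure above follows directly from the two quantities reported for the 2014 height GWAS. A minimal check of that arithmetic, with illustrative variable and function names that are not taken from the authors' software:

```python
def variance_explained_by_gws(h2_snp, fraction_of_h2):
    """Phenotypic variance explained by genome-wide significant SNPs."""
    return h2_snp * fraction_of_h2


H2_HEIGHT_2014 = 0.17    # model SNP heritability for height (2014)
FRACTION_GWS = 0.37      # share of h2 captured at p <= 5e-8
DIRECT_ESTIMATE = 0.16   # direct estimate from significant SNPs [44]

explained = variance_explained_by_gws(H2_HEIGHT_2014, FRACTION_GWS)
discrepancy = DIRECT_ESTIMATE / explained

print(f"variance explained by significant SNPs: {explained:.3f}")
print(f"direct estimate is {discrepancy:.1f} times larger")
```

The product is about 0.063 of total phenotypic variance, so the direct estimate of 16% is roughly two and a half times the model value, which is the discrepancy the paragraph discusses.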

CONCLUSION

The common-SNP causal effects model we have presented is based on GWAS summary statistics and detailed LD structure of an underlying reference panel, and assumes a Gaussian distribution of effect sizes at a fraction of SNPs randomly distributed across the autosomal genome. While not incorporating the effects of rare SNPs, we have shown that it captures the broad genetic architecture of diverse complex traits, where polygenicities and the variance of the effect sizes range over orders of magnitude.
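
The effect-size assumption summarized above (a fraction π₁ of SNPs carrying Gaussian effects, the rest null) can be pictured with a small simulation. This is only a sketch of that prior, not the authors' implementation: it ignores LD, heterozygosity, and the fitting procedure entirely, the function name is hypothetical, and the effect-size standard deviation is an arbitrary placeholder; only the polygenicity value is the height estimate quoted earlier.

```python
import random


def sample_causal_effects(n_snps, pi1, sigma_beta, seed=0):
    """Draw per-SNP effects: N(0, sigma_beta^2) with probability pi1, else 0."""
    rng = random.Random(seed)
    return [
        rng.gauss(0.0, sigma_beta) if rng.random() < pi1 else 0.0
        for _ in range(n_snps)
    ]


# pi1 from the height estimate in the text; sigma_beta is a placeholder value.
betas = sample_causal_effects(n_snps=1_000_000, pi1=5.66e-4, sigma_beta=0.01)
n_causal = sum(1 for b in betas if b != 0.0)
print(f"{n_causal} causal SNPs out of {len(betas)}")
```

Because causal status is drawn independently per SNP, the causal variants are scattered uniformly along the simulated panel, mirroring the random-placement assumption stated in the text.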

The current model (essentially Eq. 4) and its implementation (essentially Eq. 25) are basic elements for building a more refined model of SNP effects using summary statistics. Higher accuracy in characterizing causal alleles in turn will enable greater power for SNP discovery and phenotypic prediction.

Funding

Research Council of Norway (262656, 248984, 248778, 223273) and KG Jebsen Stiftelsen; ABCD-USA Consortium (5U24DA041123).

Acknowledgments

We thank the consortia for making available their GWAS summary statistics, and the many people who provided DNA samples.

References

[1].
Peter M Visscher, Matthew A Brown, Mark I McCarthy, and Jian Yang. Five years of gwas discovery. The American Journal of Human Genetics, 90(1):7–24, 2012.
[2].
Eli A Stahl, Daniel Wegmann, Gosia Trynka, Javier Gutierrez-Achury, Ron Do, Benjamin F Voight, Peter Kraft, Robert Chen, Henrik J Kallberg, Fina AS Kurreeman, et al. Bayesian inference analyses of the polygenic architecture of rheumatoid arthritis. Nature genetics, 44(5):483–489, 2012.
[3].
Jian Yang, Andrew Bakshi, Zhihong Zhu, Gibran Hemani, Anna AE Vinkhuyzen, Sang Hong Lee, Matthew R Robinson, John RB Perry, Ilja M Nolte, Jana V van Vliet-Ostaptchouk, et al. Genetic variance estimation with imputed variants finds negligible missing heritability for human height and body mass index. Nature genetics, 2015.
[4].
Hon-Cheong So, Miaoxin Li, and Pak C Sham. Uncovering the total heritability explained by all true susceptibility variants in a genome-wide association study. Genetic epidemiology, 35(6):447–456, 2011.
[5].
Doug Speed, Gibran Hemani, Michael R Johnson, and David J Balding. Improved heritability estimation from genome-wide snps. The American Journal of Human Genetics, 91(6):1011–1021, 2012.
[6].
Sang Hong Lee, Naomi R Wray, Michael E Goddard, and Peter M Visscher. Estimating missing heritability for disease from genome-wide association studies. The American Journal of Human Genetics, 88(3):294–305, 2011.
[7].
Jian Yang, S Hong Lee, Michael E Goddard, and Peter M Visscher. Gcta: a tool for genome-wide complex trait analysis. The American Journal of Human Genetics, 88(1):76–82, 2011.
[8].
Siddharth Krishna Kumar, Marcus W Feldman, David H Rehkopf, and Shripad Tuljapurkar. Limitations of gcta as a solution to the missing heritability problem. Proceedings of the National Academy of Sciences, 113(1):E61–E70, 2016.
[9].
Luigi Palla and Frank Dudbridge. A fast method that uses polygenic scores to estimate the variance explained by genome-wide marker panels and the proportion of variants affecting a trait. The American Journal of Human Genetics, 97(2):250–259, 2015.
[10].
Alkes L Price, Noah A Zaitlen, David Reich, and Nick Patterson. New approaches to population stratification in genome-wide association studies. Nature Reviews Genetics, 11(7):459–463, 2010.
[11].
Jian Yang, Michael N Weedon, Shaun Purcell, Guillaume Letre, Karol Estrada, Cristen J Willer, Albert V Smith, Erik Ingelsson, Jeffrey R O’connell, Massimo Mangino, et al. Genomic inflation factors under polygenic inheritance. European Journal of Human Genetics, 19(7):807–812, 2011.
[12].
Brendan K Bulik-Sullivan, Po-Ru Loh, Hilary K Finucane, Stephan Ripke, Jian Yang, Nick Patterson, Mark J Daly, Alkes L Price, Benjamin M Neale, Schizophrenia Working Group of the Psychiatric Genomics Consortium, et al. Ld score regression distinguishes confounding from polygenicity in genome-wide association studies. Nature genetics, 47(3):291–295, 2015.
[13].
Hyun Min Kang, Jae Hoon Sul, Noah A Zaitlen, Sit-yee Kong, Nelson B Freimer, Chiara Sabatti, Eleazar Eskin, et al. Variance component model to account for sample structure in genome-wide association studies. Nature genetics, 42(4):348–354, 2010.
[14].
Yan Zhang, Guanghao Qi, Ju-Hyun Park, and Nilanjan Chatterjee. Estimation of complex effect-size distributions using summary-level statistics from genome-wide association studies across 32 complex traits. Nature genetics, 50(9):1318, 2018.
[15].
Jian Zeng, Ronald Vlaming, Yang Wu, Matthew R Robinson, Luke R Lloyd-Jones, Loic Yengo, Chloe X Yap, Angli Xue, Julia Sidorenko, Allan F McRae, et al. Signatures of negative selection in the genetic architecture of human complex traits. Nature genetics, 50(5):746, 2018.
[16].
Bogdan Pasaniuc and Alkes L Price. Dissecting the genetics of complex traits using summary association statistics. Nature Reviews Genetics, 2016.
[17].
John S Witte, Peter M Visscher, and Naomi R Wray. The contribution of genetic variants to disease depends on the ruler. Nature Reviews Genetics, 15(11):765–776, 2014.
[18].
D. Holland, Y. Wang, W. K. Thompson, A. Schork, C. H. Chen, M. T. Lo, A. Witoelar, T. Werge, M. O’Donovan, O. A. Andreassen, and A. M. Dale. Estimating Effect Sizes and Expected Replication Probabilities from GWAS Summary Statistics. Front Genet, 7:15, 2016.
[19].
B. Devlin and K. Roeder. Genomic control for association studies. Biometrics, 55(4):997–1004, Dec 1999.
[20].
Dominic Holland, Chun-Chieh Fan, Oleksandr Frei, Alexey A. Shadrin, Olav B. Smeland, V. S. Sundar, Ole A. Andreassen, and Anders M. Dale. Estimating degree of polygenicity, causal effect size variance, and confounding bias in gwas summary statistics. bioRxiv, 2017.
[21].
Wesley K Thompson, Yunpeng Wang, Andrew Schork, Verena Zuber, Ole A Andreassen, Anders M Dale, Dominic Holland, and Xu Shujing. An empirical bayes method for estimating the distribution of effects in genome-wide association studies. PLoS Genetics, [in press], 2015.
[22].
Jian Yang, Teri A Manolio, Louis R Pasquale, Eric Boerwinkle, Neil Caporaso, Julie M Cunningham, Mariza De Andrade, Bjarke Feenstra, Eleanor Feingold, M Geoffrey Hayes, et al. Genome partitioning of genetic variation for complex traits using common snps. Nature genetics, 43(6):519–525, 2011.
[23].
Andrew Gelman, Hal S Stern, John B Carlin, David B Dunson, Aki Vehtari, and Donald B Rubin. Bayesian data analysis. Chapman and Hall/CRC, 2013.
[24].
Nan M Laird and Christoph Lange. The fundamentals of modern statistical genetics. Springer Science & Business Media, 2010.
[25].
Chengqing Wu, Andrew DeWan, Josephine Hoh, and Zuoheng Wang. A comparison of association methods correcting for population stratification in case–control studies. Annals of human genetics, 75(3):418–427, 2011.
[26].
1000 Genomes Project Consortium et al. A global reference for human genetic variation. Nature, 526(7571):68–74, 2015.
[27].
1000 Genomes Project Consortium et al. An integrated map of genetic variation from 1,092 human genomes. Nature, 491(7422):56–65, 2012.
[28].
Gardar Sveinbjornsson, Anders Albrechtsen, Florian Zink, Sigurjón A Gudjonsson, Asmundur Oddson, Gísli Másson, Hilma Holm, Augustine Kong, Unnur Thorsteinsdottir, Patrick Sulem, et al. Weighting sequence variants based on their annotation increases power of whole-genome association studies. Nature genetics, 2016.
[29].
Na Li and Matthew Stephens. Modeling linkage disequilibrium and identifying recombination hotspots using single-nucleotide polymorphism data. Genetics, 165(4):2213–2233, 2003.
[30].
Chris CA Spencer, Zhan Su, Peter Donnelly, and Jonathan Marchini. Designing genome-wide association studies: sample size, power, imputation, and the choice of genotyping chip. PLoS Genet, 5(5):e1000477, 2009.
[31].
Zhan Su, Jonathan Marchini, and Peter Donnelly. Hapgen2: simulation of multiple disease snps. Bioinformatics, 27(16):2304–2305, 2011.
[32].
Naomi R Wray, Stephan Ripke, Manuel Mattheisen, Maciej Trzaskowski, Enda M Byrne, Abdel Abdellaoui, Mark J Adams, Esben Agerbo, Tracy M Air, Till MF Andlauer, et al. Genome-wide association analyses identify 44 risk variants and refine the genetic architecture of major depression. Nature genetics, 50(5):668, 2018.
[33].
Eli Stahl, Gerome Breen, Andreas Forstner, Andrew McQuillin, Stephan Ripke, Sven Cichon, Laura Scott, Roel Ophoff, Ole A Andreassen, John Kelsoe, and Pamela Sklar. Genomewide association study identifies 30 loci associated with bipolar disorder. bioRxiv, 2018.
[34].
Schizophrenia Working Group of the Psychiatric Genomics Consortium. Biological insights from 108 schizophrenia-associated genetic loci. Nature, 511(7510):421–427, Jul 2014.
[35].
Majid Nikpay, Anuj Goel, Hong-Hee Won, Leanne M Hall, Christina Willenborg, Stavroula Kanoni, Danish Saleheen, Theodosios Kyriakou, Christopher P Nelson, Jemma C Hopewell, et al. A comprehensive 1000 genomes–based genome-wide association meta-analysis of coronary artery disease. Nature genetics, 47(10):1121, 2015.
[36].
Katrina M de Lange, Loukas Moutsianas, James C Lee, Christopher A Lamb, Yang Luo, Nicholas A Kennedy, Luke Jostins, Daniel L Rice, Javier Gutierrez-Achury, Sun-Gou Ji, et al. Genome-wide association study implicates immune activation of multiple integrin genes in inflammatory bowel disease. Nature genetics, 49(2):256, 2017.
[37].
Jean-Charles Lambert, Carla A Ibrahim-Verbaas, Denise Harold, Adam C Naj, Rebecca Sims, Céline Bellenguez, Gyungah Jun, Anita L DeStefano, Joshua C Bis, Gary W Beecham, et al. Meta-analysis of 74,046 individuals identifies 11 new susceptibility loci for alzheimer’s disease. Nature genetics, 45(12):1452–1458, 2013.
[38].
Iris Jansen, Jeanne Savage, Kyoko Watanabe, Julien Bryois, Dylan Williams, Stacy Steinberg, Julia Sealock, Ida Karlsson, Sara Hagg, Lavinia Athanasiu, et al. Genetic meta-analysis identifies 10 novel loci and functional pathways for alzheimer’s disease risk. bioRxiv, page 258533, 2018.
[39].
Wouter Van Rheenen, Aleksey Shatunov, Annelot M Dekker, Russell L McLaughlin, Frank P Diekstra, Sara L Pulit, Rick AA Van Der Spek, Urmo Võsa, Simone De Jong, Matthew R Robinson, et al. Genome-wide association analyses identify new risk variants and the genetic architecture of amyotrophic lateral sclerosis. Nature genetics, 48(9):1043, 2016.
[40].
Aysu Okbay, Jonathan P Beauchamp, Mark Alan Fontana, James J Lee, Tune H Pers, Cornelius A Rietveld, Patrick Turley, Guo-Bo Chen, Valur Emilsson, S Fleur W Meddens, et al. Genome-wide association study identifies 74 loci associated with educational attainment. Nature, 533(7604):539–542, 2016.
[41].
Suzanne Sniekers, Sven Stringer, Kyoko Watanabe, Philip R Jansen, Jonathan RI Coleman, Eva Krapohl, Erdogan Taskesen, Anke R Hammerschlag, Aysu Okbay, Delilah Zabaneh, et al. Genome-wide association meta-analysis of 78,308 individuals identifies new loci and genes influencing human intelligence. Nature genetics, 49(7):1107, 2017.
[42].
JE Savage, PR Jansen, S Stringer, K Watanabe, J Bryois, CA de Leeuw, M Nagel, S Awasthi, PB Barr, JRI Coleman, KL Grasby, AR Hammerschlag, J Kaminski, R Karlsson, et al. Genome-wide association meta-analysis (n=269,867) identifies new genetic and functional links to intelligence. Nature genetics, forthcoming, 2018.
[43].
Adam E Locke, Bratati Kahali, Sonja I Berndt, Anne E Justice, Tune H Pers, Felix R Day, Corey Powell, Sailaja Vedantam, Martin L Buchkovich, Jian Yang, et al. Genetic studies of body mass index yield new insights for obesity biology. Nature, 518(7538):197, 2015.
[44].
Andrew R Wood, Tonu Esko, Jian Yang, Sailaja Vedantam, Tune H Pers, Stefan Gustafsson, Audrey Y Chu, Karol Estrada, Jian’an Luan, Zoltán Kutalik, et al. Defining the role of common variation in the genomic and biological architecture of adult human height. Nature genetics, 46(11):1173–1186, 2014.
[45].
Derrek P Hibar, Jason L Stein, Miguel E Renteria, Alejandro Arias-Vasquez, Sylvane Desrivières, Neda Jahanshad, Roberto Toro, Katharina Wittfeld, Lucija Abramovic, Micael Andersson, et al. Common genetic variants influence human subcortical brain structures. Nature, 2015.
[46].
Cristen J Willer, Ellen M Schmidt, Sebanti Sengupta, Gina M Peloso, Stefan Gustafsson, Stavroula Kanoni, Andrea Ganna, Jin Chen, Martin L Buchkovich, Samia Mora, et al. Discovery and refinement of loci associated with lipid levels. Nature genetics, 45(11):1274, 2013.
[47].
Loic Yengo, Julia Sidorenko, Kathryn E Kemper, Zhili Zheng, Andrew R Wood, Michael N Weedon, Timothy M Frayling, Joel Hirschhorn, Jian Yang, Peter M Visscher, et al. Meta-analysis of genome-wide association studies for height and body mass index in ~700,000 individuals of european ancestry. bioRxiv, page 274654, 2018.
[48].
Mashaal Sohail, Robert M Maier, Andrea Ganna, Alex Bloemendal, Alicia R Martin, Michael C Turchin, Charleston WK Chiang, Joel Hirschhorn, Mark J Daly, Nick Patterson, et al. Polygenic adaptation on height is overestimated due to uncorrected stratification in genome-wide association studies. eLife, 8:e39702, 2019.
[49].
Jeremy J Berg, Arbel Harpak, Nasa Sinnott-Armstrong, Anja Moltke Joergensen, Hakhamanesh Mostafavi, Yair Field, Evan August Boyle, Xinjun Zhang, Fernando Racimo, Jonathan K Pritchard, et al. Reduced signal for polygenic adaptation of height in uk biobank. eLife, 8:e39725, 2019.
[50].
NIMH. Prevalence of Major Depressive Episode Among Adults, 2016. (accessed December 27, 2018).
[51].
Kathleen R Merikangas, Robert Jin, Jian-Ping He, Ronald C Kessler, Sing Lee, Nancy A Sampson, Maria Carmen Viana, Laura Helena Andrade, Chiyi Hu, Elie G Karam, et al. Prevalence and correlates of bipolar spectrum disorder in the world mental health survey initiative. Archives of general psychiatry, 68(3):241–251, 2011.
[52].
Doug Speed, Na Cai, Michael R Johnson, Sergey Nejentsev, David J Balding, UCLEB Consortium, et al. Reevaluation of snp heritability in complex human traits. Nature genetics, 49(7):986, 2017.
[53].
Fabian Sanchis-Gomar, Carme Perez-Quilis, Roman Leischik, and Alejandro Lucia. Epidemiology of coronary heart disease and acute coronary syndrome. Annals of translational medicine, 4(13), 2016.
[54].
Johan Burisch, Tine Jess, Matteo Martinato, Peter L Lakatos, and ECCO-EpiCom. The burden of inflammatory bowel disease in europe. Journal of Crohn’s and Colitis, 7(4):322–337, 2013.
[55].
Brenda L Plassman, Kenneth M Langa, Gwenith G Fisher, Steven G Heeringa, David R Weir, Mary Beth Ofstedal, James R Burke, Michael D Hurd, Guy G Potter, Willard L Rodgers, et al. Prevalence of dementia in the united states: the aging, demographics, and memory study. Neuroepidemiology, 29(1-2):125–132, 2007.
[56].
Alzheimer’s Association. 2018 alzheimer’s disease facts and figures. Alzheimer’s & Dementia, 14(3):367–429, 2018.
[57].
P Mehta, W Kaye, J Raymond, et al. Prevalence of amyotrophic lateral sclerosis 2014 united states. MMWR Morb Mortal Wkly Rep, 67:216–218, 2018.
[58].
Itsik Pe’er, Roman Yelensky, David Altshuler, and Mark J Daly. Estimation of the multiple testing burden for genomewide association studies of nearly all common variants. Genetic epidemiology, 32(4):381–385, 2008.
[59].
Mark I McCarthy, Gonçalo R Abecasis, Lon R Cardon, David B Goldstein, Julian Little, John PA Ioannidis, and Joel N Hirschhorn. Genome-wide association studies for complex traits: consensus, uncertainty and challenges. Nature reviews genetics, 9(5):356–369, 2008.
[60].
Charles J Clopper and Egon S Pearson. The use of confidence or fiducial limits illustrated in the case of the binomial. Biometrika, 26(4):404–413, 1934.
[61].
Douglas S Falconer. The inheritance of liability to certain diseases, estimated from the incidence among relatives. Annals of human genetics, 29(1):51–76, 1965.
[62].
Everett R Dempster and I Michael Lerner. Heritability of threshold characters. Genetics, 35(2):212, 1950.
[63].
J. Yang, B. Benyamin, B. P. McEvoy, S. Gordon, A. K. Henders, D. R. Nyholt, P. A. Madden, A. C. Heath, N. G. Martin, G. W. Montgomery, M. E. Goddard, and P. M. Visscher. Common SNPs explain a large proportion of the heritability for human height. Nat. Genet., 42(7):565–569, Jul 2010.
[64].
James J Lee, Robbee Wedow, Aysu Okbay, Edward Kong, Omeed Maghzian, Meghan Zacher, Tuan Anh Nguyen-Viet, Peter Bowers, Julia Sidorenko, Richard Karlsson Linnér, et al. Gene discovery and polygenic prediction from a genome-wide association study of educational attainment in 1.1 million individuals. Nature genetics, 50(8):1112, 2018.
[65].
Iris E Jansen, Jeanne E Savage, Kyoko Watanabe, Julien Bryois, Dylan M Williams, Stacy Steinberg, Julia Sealock, Ida K Karlsson, Sara Hägg, Lavinia Athanasiu, et al. Genome-wide meta-analysis identifies new loci and functional pathways influencing alzheimer’s disease risk. Nature genetics, page 1, 2019.
[66].
Shaun M Purcell, Naomi R Wray, Jennifer L Stone, Peter M Visscher, Michael C O’Donovan, Patrick F Sullivan, Pamela Sklar, Shaun M Purcell, Jennifer L Stone, Patrick F Sullivan, et al. Common polygenic variation contributes to risk of schizophrenia and bipolar disorder. Nature, 460(7256):748–752, 2009.
[67].
Harvey A Whiteford, Louisa Degenhardt, Jürgen Rehm, Amanda J Baxter, Alize J Ferrari, Holly E Erskine, Fiona J Charlson, Rosana E Norman, Abraham D Flaxman, Nicole Johns, et al. Global burden of disease attributable to mental and substance use disorders: findings from the global burden of disease study 2010. The Lancet, 382(9904):1575–1586, 2013.
[68].
Dennis K Kinney, Pamela Teixeira, Diane Hsu, Siena C Napoleon, David J Crowley, Andrea Miller, William Hyman, and Emerald Huang. Relation of schizophrenia prevalence to latitude, climate, fish consumption, infant mortality, and skin color: a role for prenatal vitamin d deficiency and infections? Schizophrenia bulletin, page sbp023, 2009.
[69].
Rubika Balendra and Adrian M Isaacs. C9orf72-mediated als and ftd: multiple pathways to disease. Nature Reviews Neurology, page 1, 2018.
[70].
Vitalay Fomin, Patricia Richard, Mainul Hoque, Cynthia Li, Zhuoying Gu, Mercedes Fissore-O’Leary, Bin Tian, Carol Prives, and James L Manley. The c9orf72 gene, implicated in amyotrophic lateral sclerosis and frontotemporal dementia, encodes a protein that functions in control of endothelin and glutamate signaling. Molecular and cellular biology, 38(22):e00155–18, 2018.
[71].
Po-Ru Loh, Gaurav Bhatia, Alexander Gusev, Hilary K Finucane, Brendan K Bulik-Sullivan, Samuela J Pollack, Teresa R de Candia, Sang Hong Lee, Naomi R Wray, Kenneth S Kendler, et al. Contrasting genetic architectures of schizophrenia and other complex diseases using fast variance-components analysis. Nature genetics, 2015.
[72].
Cornelius A Rietveld, Sarah E Medland, Jaime Derringer, Jian Yang, Tõnu Esko, Nicolas W Martin, Harm-Jan Westra, Konstantin Shakhbazov, Abdel Abdellaoui, Arpana Agrawal, et al. Gwas of 126,559 individuals identifies genetic variants associated with educational attainment. science, 340(6139):1467–1471, 2013.
[73].
David Cesarini and Peter M Visscher. Genetics and educational attainment. npj Science of Learning, 2(1):4, 2017.
[74].
Robert Plomin and Sophie von Stumm. The new genetics of intelligence. Nature Reviews Genetics, 2018.
[75].
Doug Speed and David J Balding. Sumher better estimates the snp heritability of complex traits from summary statistics. Nature genetics, 51(2):277, 2019.
[76].
Perry G Ridge, Kaitlyn B Hoyt, Kevin Boehme, Shubhabrata Mukherjee, Paul K Crane, Jonathan L Haines, Richard Mayeux, Lindsay A Farrer, Margaret A Pericak-Vance, Gerard D Schellenberg, et al. Assessment of the genetic variance of late-onset alzheimer’s disease. Neurobiology of aging, 41:200–e13, 2016.
[77].
Perry G Ridge, Shubhabrata Mukherjee, Paul K Crane, John SK Kauwe, et al. Alzheimer’s disease: analyzing the missing heritability. PloS One, 8(11):e79771, 2013.
[78].
Margaret Gatz, Chandra A Reynolds, Laura Fratiglioni, Boo Johansson, James A Mortimer, Stig Berg, Amy Fiske, and Nancy L Pedersen. Role of genes and environments for explaining alzheimer disease. Archives of general psychiatry, 63(2):168–174, 2006.
[79].
Luke M Evans, Rasool Tahmasbi, Scott I Vrieze, Gonçalo R Abecasis, Sayantan Das, Steven Gazal, Douglas W Bjelland, Teresa R Candia, Michael E Goddard, Benjamin M Neale, et al. Comparison of methods that use whole genome data to estimate the heritability and genetic architecture of complex traits. Nature genetics, 50(5):737, 2018.
[80].
Jian Yang, Jian Zeng, Michael E Goddard, Naomi R Wray, and Peter M Visscher. Concepts, estimation and interpretation of snp-based heritability. Nature genetics, 49(9):1304, 2017.
[81].
Xiang Zhou, Peter Carbonetto, and Matthew Stephens. Polygenic modeling with bayesian sparse linear mixed models. PLoS genetics, 9(2):e1003264, 2013.
[82].
Jie Zheng, A Mesut Erzurumluoglu, Benjamin L Elsworth, John P Kemp, Laurence Howe, Philip C Haycock, Gibran Hemani, Katherine Tansey, Charles Laurin, Beate St Pourcain, et al. Ld hub: a centralized database and web interface to perform ld score regression that maximizes the potential of summary level gwas data for snp heritability and genetic correlation analysis. Bioinformatics, 33(2):272–279, 2017.
[83].
Dominic Holland, Rahul S Desikan, Anders M Dale, Linda K McEvoy, Alzheimers Disease Neuroimaging Initiative, et al. Rates of decline in alzheimer disease decrease with age. PloS one, 7(8):e42325, 2012.
[84].
Rahul S Desikan, Chun Chieh Fan, Yunpeng Wang, Andrew J Schork, Howard J Cabral, L Adrienne Cupples, Wesley K Thompson, Lilah Besser, Walter A Kukull, Dominic Holland, et al. Genetic assessment of age-associated alzheimer disease risk: Development and validation of a polygenic hazard score. PLoS medicine, 14(3):e1002258, 2017.
[85].
David Golan, Eric S Lander, and Saharon Rosset. Measuring missing heritability: inferring the contribution of common variants. Proceedings of the National Academy of Sciences, 111(49):E5272–E5281, 2014.
[86].
S Hong Lee, Teresa R DeCandia, Stephan Ripke, Jian Yang, Patrick F Sullivan, Michael E Goddard, Matthew C Keller, Peter M Visscher, Naomi R Wray, Schizophrenia Psychiatric Genome-Wide Association Study Consortium, et al. Estimating the proportion of variation in susceptibility to schizophrenia captured by common snps. Nature genetics, 44(3):247–250, 2012.
[87].
Amelia R Branigan, Kenneth J McCallum, and Jeremy Freese. Variation in the heritability of educational attainment: An international meta-analysis. Social Forces, pages 109–140, 2013.
[88].
Evan A Boyle, Yang I Li, and Jonathan K Pritchard. An expanded view of complex traits: From polygenic to omnigenic. Cell, 169(7):1177–1186, 2017.
[89].
Xiang Zhu and Matthew Stephens. Bayesian large-scale multiple regression with summary statistics from genome-wide association studies. The annals of applied statistics, 11(3):1561, 2017.
[90].
Eli A Stahl, Gerome Breen, Andreas J Forstner, Andrew McQuillin, Stephan Ripke, Vassily Trubetskoy, Manuel Mattheisen, Yunpeng Wang, Jonathan RI Coleman, Héléna A Gaspar, et al. Genome-wide association study identifies 30 loci associated with bipolar disorder. Nature genetics, page 1, 2019.
[91].
Tanya M Teslovich, Kiran Musunuru, Albert V Smith, Andrew C Edmondson, Ioannis M Stylianou, Masahiro Koseki, James P Pirruccello, Samuli Ripatti, Daniel I Chasman, Cristen J Willer, et al. Biological, clinical and population relevance of 95 loci for blood lipids. Nature, 466(7307):707, 2010.

Posted May 13, 2019.

Subject Area

Genomics


[1] [1].↵
Peter M Visscher, Matthew A Brown, Mark I McCarthy, and Jian Yang. Five years of gwas discovery. The American Journal of Human Genetics, 90(1):7–24, 2012.
OpenUrl CrossRef PubMed

[2] [2].↵
Eli A Stahl, Daniel Wegmann, Gosia Trynka, Javier Gutierrez-Achury, Ron Do, Benjamin F Voight, Peter Kraft, Robert Chen, Henrik J Kallberg, Fina AS Kurreeman, et al. Bayesian inference analyses of the polygenic architecture of rheumatoid arthritis. Nature genetics, 44(5):483–489, 2012.
OpenUrl CrossRef PubMed

[3] [3].↵
Jian Yang, Andrew Bakshi, Zhihong Zhu, Gibran Hemani, Anna AE Vinkhuyzen, Sang Hong Lee, Matthew R Robinson, John RB Perry, Ilja M Nolte, Jana V van Vliet-Ostaptchouk, et al. Genetic variance estimation with imputed variants finds negligible missing heritability for human height and body mass index. Nature genetics, 2015.

[4] [4].↵
Hon-Cheong So, Miaoxin Li, and Pak C Sham. Uncovering the total heritability explained by all true susceptibility variants in a genome-wide association study. Genetic epidemiology, 35(6):447–456, 2011.
OpenUrl PubMed

[5] [5].↵
Doug Speed, Gibran Hemani, Michael R Johnson, and David J Balding. Improved heritability estimation from genome-wide snps. The American Journal of Human Genetics, 91(6):1011–1021, 2012.
OpenUrl CrossRef PubMed

[6] [6].↵
Sang Hong Lee, Naomi R Wray, Michael E Goddard, and Peter M Visscher. Estimating missing heritability for disease from genome-wide association studies. The American Journal of Human Genetics, 88(3):294–305, 2011.
OpenUrl CrossRef PubMed Web of Science

[7] [7].↵
Jian Yang, S Hong Lee, Michael E Goddard, and Peter M Visscher. Gcta: a tool for genome-wide complex trait analysis. The American Journal of Human Genetics, 88(1):76–82, 2011.
OpenUrl CrossRef PubMed

[8] [8].↵
Siddharth Krishna Kumar, Marcus W Feldman, David H Rehkopf, and Shripad Tuljapurkar. Limitations of gcta as a solution to the missing heritability problem. Proceedings of the National Academy of Sciences, 113(1):E61–E70, 2016.
OpenUrl Abstract/FREE Full Text

[9] [9].↵
Luigi Palla and Frank Dudbridge. A fast method that uses polygenic scores to estimate the variance explained by genome-wide marker panels and the proportion of variants affecting a trait. The American Journal of Human Genetics, 97(2):250–259, 2015.

[10]
Alkes L Price, Noah A Zaitlen, David Reich, and Nick Patterson. New approaches to population stratification in genome-wide association studies. Nature Reviews Genetics, 11(7):459–463, 2010.

[11]
Jian Yang, Michael N Weedon, Shaun Purcell, Guillaume Letre, Karol Estrada, Cristen J Willer, Albert V Smith, Erik Ingelsson, Jeffrey R O’connell, Massimo Mangino, et al. Genomic inflation factors under polygenic inheritance. European Journal of Human Genetics, 19(7):807–812, 2011.

[12]
Brendan K Bulik-Sullivan, Po-Ru Loh, Hilary K Finucane, Stephan Ripke, Jian Yang, Nick Patterson, Mark J Daly, Alkes L Price, Benjamin M Neale, Schizophrenia Working Group of the Psychiatric Genomics Consortium, et al. Ld score regression distinguishes confounding from polygenicity in genome-wide association studies. Nature genetics, 47(3):291–295, 2015.

[13]
Hyun Min Kang, Jae Hoon Sul, Noah A Zaitlen, Sit-yee Kong, Nelson B Freimer, Chiara Sabatti, Eleazar Eskin, et al. Variance component model to account for sample structure in genome-wide association studies. Nature genetics, 42(4):348–354, 2010.

[14]
Yan Zhang, Guanghao Qi, Ju-Hyun Park, and Nilanjan Chatterjee. Estimation of complex effect-size distributions using summary-level statistics from genome-wide association studies across 32 complex traits. Nature genetics, 50(9):1318, 2018.

[15]
Jian Zeng, Ronald Vlaming, Yang Wu, Matthew R Robinson, Luke R Lloyd-Jones, Loic Yengo, Chloe X Yap, Angli Xue, Julia Sidorenko, Allan F McRae, et al. Signatures of negative selection in the genetic architecture of human complex traits. Nature genetics, 50(5):746, 2018.

[16]
Bogdan Pasaniuc and Alkes L Price. Dissecting the genetics of complex traits using summary association statistics. Nature Reviews Genetics, 2016.

[17]
John S Witte, Peter M Visscher, and Naomi R Wray. The contribution of genetic variants to disease depends on the ruler. Nature Reviews Genetics, 15(11):765–776, 2014.

[18]
D. Holland, Y. Wang, W. K. Thompson, A. Schork, C. H. Chen, M. T. Lo, A. Witoelar, T. Werge, M. O’Donovan, O. A. Andreassen, and A. M. Dale. Estimating Effect Sizes and Expected Replication Probabilities from GWAS Summary Statistics. Front Genet, 7:15, 2016.

[19]
B. Devlin and K. Roeder. Genomic control for association studies. Biometrics, 55(4):997–1004, Dec 1999.

[20]
Dominic Holland, Chun-Chieh Fan, Oleksandr Frei, Alexey A. Shadrin, Olav B. Smeland, V. S. Sundar, Ole A. Andreassen, and Anders M. Dale. Estimating degree of polygenicity, causal effect size variance, and confounding bias in gwas summary statistics. bioRxiv, 2017.

[21]
Wesley K Thompson, Yunpeng Wang, Andrew Schork, Verena Zuber, Ole A Andreassen, Anders M Dale, Dominic Holland, and Xu Shujing. An empirical bayes method for estimating the distribution of effects in genome-wide association studies. PLoS Genetics, [in press], 2015.

[22]
Jian Yang, Teri A Manolio, Louis R Pasquale, Eric Boerwinkle, Neil Caporaso, Julie M Cunningham, Mariza De Andrade, Bjarke Feenstra, Eleanor Feingold, M Geoffrey Hayes, et al. Genome partitioning of genetic variation for complex traits using common snps. Nature genetics, 43(6):519–525, 2011.

[23]
Andrew Gelman, Hal S Stern, John B Carlin, David B Dunson, Aki Vehtari, and Donald B Rubin. Bayesian data analysis. Chapman and Hall/CRC, 2013.

[24]
Nan M Laird and Christoph Lange. The fundamentals of modern statistical genetics. Springer Science & Business Media, 2010.

[25]
Chengqing Wu, Andrew DeWan, Josephine Hoh, and Zuoheng Wang. A comparison of association methods correcting for population stratification in case–control studies. Annals of human genetics, 75(3):418–427, 2011.

[26]
1000 Genomes Project Consortium et al. A global reference for human genetic variation. Nature, 526(7571):68–74, 2015.

[27]
1000 Genomes Project Consortium et al. An integrated map of genetic variation from 1,092 human genomes. Nature, 491(7422):56–65, 2012.

[28]
Gardar Sveinbjornsson, Anders Albrechtsen, Florian Zink, Sigurjón A Gudjonsson, Asmundur Oddson, Gísli Másson, Hilma Holm, Augustine Kong, Unnur Thorsteinsdottir, Patrick Sulem, et al. Weighting sequence variants based on their annotation increases power of whole-genome association studies. Nature genetics, 2016.

[29]
Na Li and Matthew Stephens. Modeling linkage disequilibrium and identifying recombination hotspots using single-nucleotide polymorphism data. Genetics, 165(4):2213–2233, 2003.

[30]
Chris CA Spencer, Zhan Su, Peter Donnelly, and Jonathan Marchini. Designing genome-wide association studies: sample size, power, imputation, and the choice of genotyping chip. PLoS Genet, 5(5):e1000477, 2009.

[31]
Zhan Su, Jonathan Marchini, and Peter Donnelly. Hapgen2: simulation of multiple disease snps. Bioinformatics, 27(16):2304–2305, 2011.

[32]
Naomi R Wray, Stephan Ripke, Manuel Mattheisen, Maciej Trzaskowski, Enda M Byrne, Abdel Abdellaoui, Mark J Adams, Esben Agerbo, Tracy M Air, Till MF Andlauer, et al. Genome-wide association analyses identify 44 risk variants and refine the genetic architecture of major depression. Nature genetics, 50(5):668, 2018.

[33]
Eli Stahl, Gerome Breen, Andreas Forstner, Andrew McQuillin, Stephan Ripke, Sven Cichon, Laura Scott, Roel Ophoff, Ole A Andreassen, John Kelsoe, and Pamela Sklar. Genomewide association study identifies 30 loci associated with bipolar disorder. bioRxiv, 2018.

[34]
Schizophrenia Working Group of the Psychiatric Genomics Consortium. Biological insights from 108 schizophrenia-associated genetic loci. Nature, 511(7510):421–427, Jul 2014.

[35]
Majid Nikpay, Anuj Goel, Hong-Hee Won, Leanne M Hall, Christina Willenborg, Stavroula Kanoni, Danish Saleheen, Theodosios Kyriakou, Christopher P Nelson, Jemma C Hopewell, et al. A comprehensive 1000 genomes–based genome-wide association meta-analysis of coronary artery disease. Nature genetics, 47(10):1121, 2015.

[36]
Katrina M de Lange, Loukas Moutsianas, James C Lee, Christopher A Lamb, Yang Luo, Nicholas A Kennedy, Luke Jostins, Daniel L Rice, Javier Gutierrez-Achury, Sun-Gou Ji, et al. Genome-wide association study implicates immune activation of multiple integrin genes in inflammatory bowel disease. Nature genetics, 49(2):256, 2017.

[37]
Jean-Charles Lambert, Carla A Ibrahim-Verbaas, Denise Harold, Adam C Naj, Rebecca Sims, Céline Bellenguez, Gyungah Jun, Anita L DeStefano, Joshua C Bis, Gary W Beecham, et al. Meta-analysis of 74,046 individuals identifies 11 new susceptibility loci for alzheimer’s disease. Nature genetics, 45(12):1452–1458, 2013.

[38]
Iris Jansen, Jeanne Savage, Kyoko Watanabe, Julien Bryois, Dylan Williams, Stacy Steinberg, Julia Sealock, Ida Karlsson, Sara Hagg, Lavinia Athanasiu, et al. Genetic meta-analysis identifies 10 novel loci and functional pathways for alzheimer’s disease risk. bioRxiv, page 258533, 2018.

[39]
Wouter Van Rheenen, Aleksey Shatunov, Annelot M Dekker, Russell L McLaughlin, Frank P Diekstra, Sara L Pulit, Rick AA Van Der Spek, Urmo Võsa, Simone De Jong, Matthew R Robinson, et al. Genome-wide association analyses identify new risk variants and the genetic architecture of amyotrophic lateral sclerosis. Nature genetics, 48(9):1043, 2016.

[40]
Aysu Okbay, Jonathan P Beauchamp, Mark Alan Fontana, James J Lee, Tune H Pers, Cornelius A Rietveld, Patrick Turley, Guo-Bo Chen, Valur Emilsson, S Fleur W Meddens, et al. Genome-wide association study identifies 74 loci associated with educational attainment. Nature, 533(7604):539–542, 2016.

[41]
Suzanne Sniekers, Sven Stringer, Kyoko Watanabe, Philip R Jansen, Jonathan RI Coleman, Eva Krapohl, Erdogan Taskesen, Anke R Hammerschlag, Aysu Okbay, Delilah Zabaneh, et al. Genome-wide association meta-analysis of 78,308 individuals identifies new loci and genes influencing human intelligence. Nature genetics, 49(7):1107, 2017.

[42]
JE Savage, PR Jansen, S Stringer, K Watanabe, J Bryois, CA de Leeuw, M Nagel, S Awasthi, PB Barr, JRI Coleman, KL Grasby, AR Hammerschlag, J Kaminski, R Karlsson, et al. Genome-wide association meta-analysis (n=269,867) identifies new genetic and functional links to intelligence. Nature genetics, forthcoming, 2018.

[43]
Adam E Locke, Bratati Kahali, Sonja I Berndt, Anne E Justice, Tune H Pers, Felix R Day, Corey Powell, Sailaja Vedantam, Martin L Buchkovich, Jian Yang, et al. Genetic studies of body mass index yield new insights for obesity biology. Nature, 518(7538):197, 2015.

[44]
Andrew R Wood, Tonu Esko, Jian Yang, Sailaja Vedantam, Tune H Pers, Stefan Gustafsson, Audrey Y Chu, Karol Estrada, Jian’an Luan, Zoltán Kutalik, et al. Defining the role of common variation in the genomic and biological architecture of adult human height. Nature genetics, 46(11):1173–1186, 2014.

[45]
Derrek P Hibar, Jason L Stein, Miguel E Renteria, Alejandro Arias-Vasquez, Sylvane Desrivières, Neda Jahanshad, Roberto Toro, Katharina Wittfeld, Lucija Abramovic, Micael Andersson, et al. Common genetic variants influence human subcortical brain structures. Nature, 2015.

[46]
Cristen J Willer, Ellen M Schmidt, Sebanti Sengupta, Gina M Peloso, Stefan Gustafsson, Stavroula Kanoni, Andrea Ganna, Jin Chen, Martin L Buchkovich, Samia Mora, et al. Discovery and refinement of loci associated with lipid levels. Nature genetics, 45(11):1274, 2013.

[47]
Loic Yengo, Julia Sidorenko, Kathryn E Kemper, Zhili Zheng, Andrew R Wood, Michael N Weedon, Timothy M Frayling, Joel Hirschhorn, Jian Yang, Peter M Visscher, et al. Meta-analysis of genome-wide association studies for height and body mass index in ~700,000 individuals of european ancestry. bioRxiv, page 274654, 2018.

[48]
Mashaal Sohail, Robert M Maier, Andrea Ganna, Alex Bloemendal, Alicia R Martin, Michael C Turchin, Charleston WK Chiang, Joel Hirschhorn, Mark J Daly, Nick Patterson, et al. Polygenic adaptation on height is overestimated due to uncorrected stratification in genome-wide association studies. eLife, 8:e39702, 2019.

[49]
Jeremy J Berg, Arbel Harpak, Nasa Sinnott-Armstrong, Anja Moltke Joergensen, Hakhamanesh Mostafavi, Yair Field, Evan August Boyle, Xinjun Zhang, Fernando Racimo, Jonathan K Pritchard, et al. Reduced signal for polygenic adaptation of height in uk biobank. eLife, 8:e39725, 2019.

[50]
NIMH. Prevalence of Major Depressive Episode Among Adults, 2016. (accessed December 27, 2018).

[51]
Kathleen R Merikangas, Robert Jin, Jian-Ping He, Ronald C Kessler, Sing Lee, Nancy A Sampson, Maria Carmen Viana, Laura Helena Andrade, Chiyi Hu, Elie G Karam, et al. Prevalence and correlates of bipolar spectrum disorder in the world mental health survey initiative. Archives of general psychiatry, 68(3):241–251, 2011.

[52]
Doug Speed, Na Cai, Michael R Johnson, Sergey Nejentsev, David J Balding, UCLEB Consortium, et al. Reevaluation of snp heritability in complex human traits. Nature genetics, 49(7):986, 2017.

[53]
Fabian Sanchis-Gomar, Carme Perez-Quilis, Roman Leischik, and Alejandro Lucia. Epidemiology of coronary heart disease and acute coronary syndrome. Annals of translational medicine, 4(13), 2016.

[54]
Johan Burisch, Tine Jess, Matteo Martinato, Peter L Lakatos, and ECCO-EpiCom. The burden of inflammatory bowel disease in europe. Journal of Crohn’s and Colitis, 7(4):322–337, 2013.

[55]
Brenda L Plassman, Kenneth M Langa, Gwenith G Fisher, Steven G Heeringa, David R Weir, Mary Beth Ofstedal, James R Burke, Michael D Hurd, Guy G Potter, Willard L Rodgers, et al. Prevalence of dementia in the united states: the aging, demographics, and memory study. Neuroepidemiology, 29(1-2):125–132, 2007.

[56]
Alzheimer’s Association. 2018 alzheimer’s disease facts and figures. Alzheimer’s & Dementia, 14(3):367–429, 2018.

[57]
P Mehta, W Kaye, J Raymond, et al. Prevalence of amyotrophic lateral sclerosis 2014 united states. MMWR Morb Mortal Wkly Rep, 67:216–218, 2018.

[58]
Itsik Pe’er, Roman Yelensky, David Altshuler, and Mark J Daly. Estimation of the multiple testing burden for genomewide association studies of nearly all common variants. Genetic epidemiology, 32(4):381–385, 2008.

[59]
Mark I McCarthy, Gonçalo R Abecasis, Lon R Cardon, David B Goldstein, Julian Little, John PA Ioannidis, and Joel N Hirschhorn. Genome-wide association studies for complex traits: consensus, uncertainty and challenges. Nature reviews genetics, 9(5):356–369, 2008.

[60]
Charles J Clopper and Egon S Pearson. The use of confidence or fiducial limits illustrated in the case of the binomial. Biometrika, 26(4):404–413, 1934.

[61]
Douglas S Falconer. The inheritance of liability to certain diseases, estimated from the incidence among relatives. Annals of human genetics, 29(1):51–76, 1965.

[62]
Everett R Dempster and I Michael Lerner. Heritability of threshold characters. Genetics, 35(2):212, 1950.

[63]
J. Yang, B. Benyamin, B. P. McEvoy, S. Gordon, A. K. Henders, D. R. Nyholt, P. A. Madden, A. C. Heath, N. G. Martin, G. W. Montgomery, M. E. Goddard, and P. M. Visscher. Common SNPs explain a large proportion of the heritability for human height. Nat. Genet., 42(7):565–569, Jul 2010.

[64]
James J Lee, Robbee Wedow, Aysu Okbay, Edward Kong, Omeed Maghzian, Meghan Zacher, Tuan Anh Nguyen-Viet, Peter Bowers, Julia Sidorenko, Richard Karlsson Linnér, et al. Gene discovery and polygenic prediction from a genome-wide association study of educational attainment in 1.1 million individuals. Nature genetics, 50(8):1112, 2018.

[65]
Iris E Jansen, Jeanne E Savage, Kyoko Watanabe, Julien Bryois, Dylan M Williams, Stacy Steinberg, Julia Sealock, Ida K Karlsson, Sara Hägg, Lavinia Athanasiu, et al. Genome-wide meta-analysis identifies new loci and functional pathways influencing alzheimer’s disease risk. Nature genetics, page 1, 2019.

[66]
Shaun M Purcell, Naomi R Wray, Jennifer L Stone, Peter M Visscher, Michael C O’Donovan, Patrick F Sullivan, Pamela Sklar, et al. Common polygenic variation contributes to risk of schizophrenia and bipolar disorder. Nature, 460(7256):748–752, 2009.

[67]
Harvey A Whiteford, Louisa Degenhardt, Jürgen Rehm, Amanda J Baxter, Alize J Ferrari, Holly E Erskine, Fiona J Charlson, Rosana E Norman, Abraham D Flaxman, Nicole Johns, et al. Global burden of disease attributable to mental and substance use disorders: findings from the global burden of disease study 2010. The Lancet, 382(9904):1575–1586, 2013.

[68]
Dennis K Kinney, Pamela Teixeira, Diane Hsu, Siena C Napoleon, David J Crowley, Andrea Miller, William Hyman, and Emerald Huang. Relation of schizophrenia prevalence to latitude, climate, fish consumption, infant mortality, and skin color: a role for prenatal vitamin d deficiency and infections? Schizophrenia bulletin, page sbp023, 2009.

[69]
Rubika Balendra and Adrian M Isaacs. C9orf72-mediated als and ftd: multiple pathways to disease. Nature Reviews Neurology, page 1, 2018.

[70]
Vitalay Fomin, Patricia Richard, Mainul Hoque, Cynthia Li, Zhuoying Gu, Mercedes Fissore-O’Leary, Bin Tian, Carol Prives, and James L Manley. The c9orf72 gene, implicated in amyotrophic lateral sclerosis and frontotemporal dementia, encodes a protein that functions in control of endothelin and glutamate signaling. Molecular and cellular biology, 38(22):e00155–18, 2018.

[71]
Po-Ru Loh, Gaurav Bhatia, Alexander Gusev, Hilary K Finucane, Brendan K Bulik-Sullivan, Samuela J Pollack, Teresa R de Candia, Sang Hong Lee, Naomi R Wray, Kenneth S Kendler, et al. Contrasting genetic architectures of schizophrenia and other complex diseases using fast variance-components analysis. Nature genetics, 2015.

[72]
Cornelius A Rietveld, Sarah E Medland, Jaime Derringer, Jian Yang, Tõnu Esko, Nicolas W Martin, Harm-Jan Westra, Konstantin Shakhbazov, Abdel Abdellaoui, Arpana Agrawal, et al. Gwas of 126,559 individuals identifies genetic variants associated with educational attainment. Science, 340(6139):1467–1471, 2013.

[73]
David Cesarini and Peter M Visscher. Genetics and educational attainment. npj Science of Learning, 2(1):4, 2017.

[74]
Robert Plomin and Sophie von Stumm. The new genetics of intelligence. Nature Reviews Genetics, 2018.

[75]
Doug Speed and David J Balding. Sumher better estimates the snp heritability of complex traits from summary statistics. Nature genetics, 51(2):277, 2019.

[76]
Perry G Ridge, Kaitlyn B Hoyt, Kevin Boehme, Shubhabrata Mukherjee, Paul K Crane, Jonathan L Haines, Richard Mayeux, Lindsay A Farrer, Margaret A Pericak-Vance, Gerard D Schellenberg, et al. Assessment of the genetic variance of late-onset alzheimer’s disease. Neurobiology of aging, 41:200–e13, 2016.

[77]
Perry G Ridge, Shubhabrata Mukherjee, Paul K Crane, John SK Kauwe, et al. Alzheimer’s disease: analyzing the missing heritability. PloS One, 8(11):e79771, 2013.

[78]
Margaret Gatz, Chandra A Reynolds, Laura Fratiglioni, Boo Johansson, James A Mortimer, Stig Berg, Amy Fiske, and Nancy L Pedersen. Role of genes and environments for explaining alzheimer disease. Archives of general psychiatry, 63(2):168–174, 2006.

[79]
Luke M Evans, Rasool Tahmasbi, Scott I Vrieze, Gonçalo R Abecasis, Sayantan Das, Steven Gazal, Douglas W Bjelland, Teresa R Candia, Michael E Goddard, Benjamin M Neale, et al. Comparison of methods that use whole genome data to estimate the heritability and genetic architecture of complex traits. Nature genetics, 50(5):737, 2018.

[80]
Jian Yang, Jian Zeng, Michael E Goddard, Naomi R Wray, and Peter M Visscher. Concepts, estimation and interpretation of snp-based heritability. Nature genetics, 49(9):1304, 2017.

[81]
Xiang Zhou, Peter Carbonetto, and Matthew Stephens. Polygenic modeling with bayesian sparse linear mixed models. PLoS genetics, 9(2):e1003264, 2013.

[82]
Jie Zheng, A Mesut Erzurumluoglu, Benjamin L Elsworth, John P Kemp, Laurence Howe, Philip C Haycock, Gibran Hemani, Katherine Tansey, Charles Laurin, Beate St Pourcain, et al. Ld hub: a centralized database and web interface to perform ld score regression that maximizes the potential of summary level gwas data for snp heritability and genetic correlation analysis. Bioinformatics, 33(2):272–279, 2017.

[83]
Dominic Holland, Rahul S Desikan, Anders M Dale, Linda K McEvoy, Alzheimers Disease Neuroimaging Initiative, et al. Rates of decline in alzheimer disease decrease with age. PloS one, 7(8):e42325, 2012.

[84]
Rahul S Desikan, Chun Chieh Fan, Yunpeng Wang, Andrew J Schork, Howard J Cabral, L Adrienne Cupples, Wesley K Thompson, Lilah Besser, Walter A Kukull, Dominic Holland, et al. Genetic assessment of age-associated alzheimer disease risk: Development and validation of a polygenic hazard score. PLoS medicine, 14(3):e1002258, 2017.

[85]
David Golan, Eric S Lander, and Saharon Rosset. Measuring missing heritability: inferring the contribution of common variants. Proceedings of the National Academy of Sciences, 111(49):E5272–E5281, 2014.

[86]
S Hong Lee, Teresa R DeCandia, Stephan Ripke, Jian Yang, Patrick F Sullivan, Michael E Goddard, Matthew C Keller, Peter M Visscher, Naomi R Wray, Schizophrenia Psychiatric Genome-Wide Association Study Consortium, et al. Estimating the proportion of variation in susceptibility to schizophrenia captured by common snps. Nature genetics, 44(3):247–250, 2012.

[87]
Amelia R Branigan, Kenneth J McCallum, and Jeremy Freese. Variation in the heritability of educational attainment: An international meta-analysis. Social Forces, pages 109–140, 2013.

[88]
Evan A Boyle, Yang I Li, and Jonathan K Pritchard. An expanded view of complex traits: From polygenic to omnigenic. Cell, 169(7):1177–1186, 2017.

[89]
Xiang Zhu and Matthew Stephens. Bayesian large-scale multiple regression with summary statistics from genome-wide association studies. The annals of applied statistics, 11(3):1561, 2017.

[90]
Eli A Stahl, Gerome Breen, Andreas J Forstner, Andrew McQuillin, Stephan Ripke, Vassily Trubetskoy, Manuel Mattheisen, Yunpeng Wang, Jonathan RI Coleman, Héléna A Gaspar, et al. Genome-wide association study identifies 30 loci associated with bipolar disorder. Nature genetics, page 1, 2019.

[91]
Tanya M Teslovich, Kiran Musunuru, Albert V Smith, Andrew C Edmondson, Ioannis M Stylianou, Masahiro Koseki, James P Pirruccello, Samuli Ripatti, Daniel I Chasman, Cristen J Willer, et al. Biological, clinical and population relevance of 95 loci for blood lipids. Nature, 466(7307):707, 2010.
