Contrasting regional architectures of schizophrenia and other complex diseases using fast variance components analysis

Po-Ru Loh; Gaurav Bhatia; Alexander Gusev; Hilary K Finucane; Brendan K Bulik-Sullivan; Samuela J Pollack; Schizophrenia Working Group of the Psychiatric Genomics Consortium; Teresa R de Candia; Sang Hong Lee; Naomi R Wray; Kenneth S Kendler; Michael C O’Donovan; Benjamin M Neale; Nick Patterson; Alkes L Price

doi:10.1101/016527

Abstract

Heritability analyses of GWAS cohorts have yielded important insights into complex disease architecture, and increasing sample sizes hold the promise of further discoveries. Here, we analyze the genetic architecture of schizophrenia in 49,806 samples from the PGC, and nine complex diseases in 54,734 samples from the GERA cohort. For schizophrenia, we infer an overwhelmingly polygenic disease architecture in which ≥76% of 1Mb genomic regions harbor at least one variant influencing schizophrenia risk. We also observe significant enrichment of heritability in GC-rich regions and in higher-frequency SNPs for both schizophrenia and GERA diseases. In bivariate analyses, we observe significant genetic correlations (ranging from 0.18 to 0.85) for 13 of 36 pairs of GERA diseases; genetic correlations were consistently stronger (1.3x on average) than correlations of overall disease liabilities. To accomplish these analyses, we developed a novel, fast algorithm for multi-component, multitrait variance components analysis that overcomes prior computational barriers that made such analyses intractable at this scale.

Over the past five years, variance components analysis has had considerable impact on research in human complex trait genetics, yielding rich insights into the heritable variation explained by SNPs [1–3], its distribution across chromosomes, allele frequencies, and functional annotations [4–6], and its correlation across traits [7, 8]. These analyses have complemented genome-wide association studies (GWAS): while GWAS have identified individual loci explaining significant portions of trait heritability, variance components methods have aggregated signal across large SNP sets, revealing information about polygenic SNP effects invisible to association studies. The utility of both approaches has been particularly clear in studies of schizophrenia, for which early GWAS achieved few genome-wide significant findings, yet variance components analysis indicated a large fraction of heritable variance spread across common SNPs in numerous loci, over 100 of which have now been discovered in large-scale GWAS [5, 9–12].

Despite these advances, much remains unknown about the genetic architecture of schizophrenia and other complex diseases. For schizophrenia, known GWAS loci are collectively estimated to explain only 3% of variation in disease liability [12]; of the remaining variation, a sizable fraction has been shown to be hidden among thousands of common SNPs [5, 11], but the distribution of these SNPs across the genome and across the allele frequency spectrum has remained uncertain. Even for traits such as lipid levels and type 2 diabetes for which loci of somewhat larger effect have been identified, the spatial and allelic distribution of variants responsible for the bulk of known SNP-heritability has remained a mystery [13, 14]. Variance components methods have the potential to shed light on these questions using the increased statistical resolution offered by tens or hundreds of thousands of samples [15, 16]. However, while study sizes have increased beyond 50,000 samples, existing variance components methods [2] are becoming computationally intractable at such scales. Computational limitations have thus forced previous studies to split and then meta-analyze data sets [6], a procedure that results in loss of precision for variance components analysis, which relies on pairwise relationships for inference (in contrast to meta-analysis in association studies) [15, 16].

Here, we introduce a novel and much faster variance components method, BOLT-REML, and apply it to analyze roughly 50,000 samples in each of two very large data sets—from the Psychiatric Genomics Consortium (PGC2) [12] and the Genetic Epidemiology Research on Aging cohort (GERA; see URLs)—obtaining several new insights into the genetic architectures of schizophrenia and nine other complex diseases. We harnessed the computational efficiency and versatility of BOLT-REML variance components analysis to estimate components of heritability, infer levels of polygenicity, partition SNP-heritability across the common allele frequency spectrum, and estimate genetic correlations among GERA diseases.

Results

Overview of Methods

The BOLT-REML algorithm employs the conjugate gradient-based iterative framework for fast mixed model computations [17,18] that we previously harnessed for mixed model association analysis [19]. However, in contrast to that work, BOLT-REML robustly estimates variance parameters for models involving multiple variance components and multiple traits [20,21]. BOLT-REML uses a Monte Carlo average information restricted maximum likelihood (AI REML) algorithm [22], which is an approximate Newton-type optimization of the restricted log likelihood [23] with respect to the variance parameters being estimated. (In contrast, our previous work [19] used a rudimentary quasi-Newton approach that sufficed for univariate optimization.) In each iteration, BOLT-REML rapidly approximates the gradient of the log likelihood using pseudorandom Monte Carlo sampling [24] and approximates the Hessian of the log likelihood using the average information matrix [25]. Full details, including simulations verifying the accuracy of BOLT-REML heritability parameter estimates and standard errors (which are nearly identical to standard REML), are provided in Online Methods and the Supplementary Note. We have released open-source software implementing the method (see URLs).
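
For orientation, the generic AI REML update that this kind of framework approximates can be written as below. This is a textbook-style sketch rather than a transcription of the Supplementary Note: the symbols V, K_i, X, P, and θ are notation assumed here for the phenotypic covariance, the per-component covariance matrices, the fixed-effect covariates, the REML projection matrix, and the vector of variance parameters.

```latex
% Assumed model: Var(y) = V = \sum_i \sigma_i^2 K_i, with fixed-effect covariates X
V = \sum_{i} \sigma_i^2 K_i, \qquad
P = V^{-1} - V^{-1} X \left( X^\top V^{-1} X \right)^{-1} X^\top V^{-1}

% Gradient of the restricted log likelihood with respect to each variance parameter
\frac{\partial \ell}{\partial \sigma_i^2}
  = -\tfrac{1}{2} \left[ \operatorname{tr}(P K_i) - y^\top P K_i P y \right]

% Average information matrix, used in place of the exact Hessian
\mathrm{AI}_{ij} = \tfrac{1}{2} \, y^\top P K_i P K_j P y

% Newton-type update of the variance parameters \theta = (\sigma_1^2, \sigma_2^2, \dots)
\theta^{(t+1)} = \theta^{(t)} + \mathrm{AI}^{-1} \, \nabla \ell \big( \theta^{(t)} \big)
```

In Monte Carlo variants of this scheme, the trace term is typically the quantity estimated by sampling, while the quadratic forms in y require only linear solves against V.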

Computational efficiency of BOLT-REML variance components analysis

We assessed the computational performance of BOLT-REML, comparing it to the GCTA software [2] (see URLs) for REML variance components analyses of GERA disease phenotypes on subsets of the GERA cohort of increasing size. We observed that across three types of analyses, BOLT-REML achieved order-of-magnitude reductions in running time and memory use compared to GCTA, with relative improvements increasing with sample size (Figure 1). The running times we observed for BOLT-REML scale roughly as MN^1.5, consistent with previously reported empirical results for BOLT-LMM association analysis [19], whereas standard REML analysis requires O(MN² + N³) running time (Figure 1a and Supplementary Table 1). BOLT-REML also requires only ≈MN/4 bytes of memory (nearly independent of the number of variance components used), in contrast to standard REML analysis, which requires O(N²) memory per variance component (Figure 1b and Supplementary Table 1). Consequently, GCTA could only analyze at most half of our available samples; indeed, computational constraints have forced previous studies to split large cohorts into multiple subgroups for analysis [6], increasing standard errors and reducing statistical power. In contrast, BOLT-REML enabled us to perform a full suite of heritability analyses of N=50,000 samples with tight error bounds [15, 16].
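
The memory scaling quoted above can be turned into a back-of-the-envelope comparison. The sketch below is illustrative only: the MN/4 figure comes from the text, whereas the dense-matrix cost (one N x N matrix of 4-byte entries per variance component) is an assumption made here to give the O(N²) term a concrete constant, and it is not a measurement of GCTA.

```python
def bolt_reml_memory_bytes(m_snps: int, n_samples: int) -> float:
    """Approximate BOLT-REML memory use stated in the text: about M*N/4 bytes."""
    return m_snps * n_samples / 4


def dense_matrix_memory_bytes(n_samples: int, n_components: int,
                              bytes_per_entry: int = 4) -> float:
    """Hypothetical cost of one dense N x N matrix per variance component.

    The 4-byte entry size is an assumption for illustration only.
    """
    return n_components * n_samples ** 2 * bytes_per_entry


GB = 1e9
M_SNPS, N_SAMPLES = 597_736, 54_734  # GERA marker and sample counts from the text

bolt_gb = bolt_reml_memory_bytes(M_SNPS, N_SAMPLES) / GB
dense_gb = dense_matrix_memory_bytes(N_SAMPLES, n_components=22) / GB
print(f"BOLT-REML (M*N/4 bytes): {bolt_gb:.1f} GB")
print(f"22 dense N x N matrices (assumed 4-byte entries): {dense_gb:.1f} GB")
```

Under these assumptions the genotype-based storage stays in the single-digit gigabyte range at full GERA scale, while a 22-component dense-matrix representation would exceed the 96 GB machine mentioned in the Figure 1 caption.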

Figure 1. Computational performance of BOLT-REML and GCTA heritability analysis algorithms.

Benchmarks of BOLT-REML and GCTA in three heritability analysis scenarios: partitioning across 22 chromosomes, partitioning across six MAF bins, and bivariate analysis. Run times (a) and memory (b) are plotted for runs on subsets of the GERA cohort with fixed SNP count M=597,736 and increasing sample size (N) using dyslipidemia as the phenotype in the univariate analyses and hypertension as the second phenotype in the bivariate analysis. Reported run times are medians of five identical runs using one core of a 2.27 GHz Intel Xeon L5640 processor. Reported run times for GCTA are total times required for computing the GRM and performing REML analysis; time breakdowns and numeric data are provided in Supplementary Table 1. Data points not plotted for GCTA indicate scenarios in which GCTA required more memory than the 96 GB available. Software versions: BOLT-REML, v2.0; GCTA, v1.24.

Estimates of SNP-heritability for schizophrenia and GERA diseases

We analyzed 22,177 schizophrenia cases and 27,629 controls with well-imputed genotypes at 472,178 markers of minor allele frequency (MAF) ≥2% in the PGC2 data [12] (Supplementary Table 2) as well as nine complex diseases in 54,734 randomly ascertained samples typed at 597,736 SNPs in the GERA cohort (see Online Methods; QC procedures included filtering both data sets to unrelated samples of European ancestry and LD-pruning markers to r²≤0.9). We computed liability-scale SNP-heritability estimates (h²_g; ref. [1]) for schizophrenia in the PGC2 data set and all 22 disease phenotypes in the GERA data set by assuming a liability threshold model and applying the linear transformation of ref. [3] to convert raw observed-scale heritability parameter estimates (based on applying BOLT-REML directly to observed case/control status) to liability-scale (Table 1 and Supplementary Table 3). All BOLT-REML analyses included 10 principal component covariates and PGC2 analyses further included study indicators to remove possible effects of population stratification (see Online Methods). Given the very low values of h²_g for many GERA diseases, we restricted further GERA analyses to the nine individual diseases with highest h²_g (Table 1). We assumed schizophrenia population risk of 1% (ref. [5, 11, 12]) and assumed that population risks of GERA diseases matched case fractions in the GERA cohort. In light of the known downward bias of large-sample REML estimates for ascertained case-control traits [26, 27], we performed REML analyses on full data sets as well as with subsamples of each data set with 2x–10x fewer samples. As expected, we observed significant downward bias of schizophrenia h²_g estimates with increasing sample size, whereas we observed no such trend for data from GERA, which is a cohort study not subject to case-control ascertainment (Supplementary Table 4). We therefore estimated h²_g for schizophrenia by averaging the results of 100 REML analyses of different 10x-downsampled (N=5K) subsamples of the PGC2 data (Online Methods). We observed in simulations that ascertainment-induced REML bias is negligible at this sample size (Supplementary Table 5).
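
The observed-to-liability conversion described above can be sketched in a few lines. The code assumes the commonly used form of the linear transformation for ascertained case-control data, with population risk K, sample case fraction P, and z the standard normal density at the liability threshold; the function names are hypothetical and the exact expression should be checked against ref. [3].

```python
from statistics import NormalDist


def liability_scale_factor(population_risk: float, case_fraction: float) -> float:
    """Multiplier converting observed-scale to liability-scale heritability.

    Assumed form: K^2 (1-K)^2 / (z^2 P (1-P)), where z is the standard normal
    density at the threshold t satisfying Pr(liability > t) = K.
    """
    std_normal = NormalDist()
    threshold = std_normal.inv_cdf(1.0 - population_risk)
    z = std_normal.pdf(threshold)
    k, p = population_risk, case_fraction
    return (k * (1.0 - k)) ** 2 / (z ** 2 * p * (1.0 - p))


def observed_to_liability(h2_observed: float, population_risk: float,
                          case_fraction: float) -> float:
    """Rescale an observed-scale estimate; standard errors rescale identically."""
    return h2_observed * liability_scale_factor(population_risk, case_fraction)


# PGC2 case fraction from the counts in the text: 22,177 cases, 27,629 controls.
pgc2_case_fraction = 22_177 / (22_177 + 27_629)
factor_1pct = liability_scale_factor(0.01, pgc2_case_fraction)
factor_04pct = liability_scale_factor(0.004, pgc2_case_fraction)
print(f"recalibration ratio, 0.4% vs 1% population risk: {factor_04pct / factor_1pct:.2f}")
```

Because the transformation is a single multiplicative factor, changing the assumed population risk only rescales the estimate and its standard error by the ratio of the two factors. For a cohort without case-control ascertainment (P equal to K, as assumed for GERA), the factor reduces to K(1−K)/z².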

Table 1. Estimated proportions of variance in disease liability explained by SNPs.

These analyses help explain a previously mysterious observation of decreasing estimated schizophrenia SNP-heritability with increasing aggregation of cohorts [5]. This phenomenon was attributed to phenotypic heterogeneity [5, 11], as suggested by estimates of between-cohort genetic correlation <1 (ref. [5]). Our analyses implicate ascertainment-induced downward bias of estimated SNP-heritability (worsening with sample size) as an additional explanation of this effect (Supplementary Tables 4 and 6). In theory, the extent of ascertainment-induced bias (obtained via comparing results from the 10x-downsampled analysis to results from the full data) could be used to infer the extent of case over-ascertainment and hence infer population risk, but we were unable to conclusively infer population risk (Supplementary Table 5). Finally, we note that while our reported schizophrenia SNP-heritability assumes a population risk of 1% (ref. [5, 11, 12]), this estimate and all subsequent results can be easily recalibrated for a different assumed population risk by multiplying by an appropriate transformation factor (e.g., assuming 0.4% population risk [28] rescales the estimate to 26.0%, s.e.=0.5%). Similarly, our use of an LD-pruned marker set (Online Methods) results in higher SNP-heritability estimates than using unpruned markers (Supplementary Table 6); this choice alleviates LD bias [29–31] but does not affect the analyses that follow.

Contrasting polygenicity of schizophrenia and GERA diseases

We next turned to a detailed investigation of the polygenicity of schizophrenia and the GERA diseases. Specifically, we estimated SNP-heritability explained by each 1Mb region of the genome (defined in Online Methods; Fig. 2a); we confirmed in simulations that 1Mb regions are sufficiently wide to ensure negligible leakage of heritability across region boundaries due to linkage disequilibrium or incomplete tagging of variants (Supplementary Table 7). We restricted our primary analyses of GERA diseases to dyslipidemia and hypertension, the diseases with the highest observed-scale SNP-heritability (Supplementary Table 3), because we had insufficient statistical power to make inferences for diseases with lower SNP-heritability (Supplementary Fig. 1). As expected, SNP-heritability estimates for individual 1Mb regions were noisy (mean per-region estimate / mean per-region s.e. = 0.88 for schizophrenia and 0.51 for dyslipidemia and hypertension), although we did see substantial SNP-heritability in some 1Mb regions (particularly for dyslipidemia, which has relatively large-effect SNPs [13]; in contrast, no 1Mb region was estimated to explain more than 0.12% of schizophrenia liability). We therefore sought to draw inferences from the bulk distribution of per-megabase SNP-heritability estimates (Supplementary Fig. 2). (We note that a limitation of BOLT-REML is that it does not compute likelihood ratio test statistics for testing whether individual variance components contribute nonzero variance; see Supplementary Note.)

Figure 2. Differing levels of polygenicity of complex diseases.

(a) Manhattan-style plots of estimated SNP-heritability per 1Mb region of the genome for dyslipidemia, hypertension, and schizophrenia. The APOE region of chromosome 19 is an outlier with an estimate of 0.022. (b) Fractions of 1Mb regions with estimated SNP-heritability equal to its lower bound constraint of zero in disease phenotypes (solid) and simulated phenotypes with varying degrees of polygenicity and with SNP-heritability matching that of each disease (dashed). Simulation data plotted are means over 5 simulations; error bars, 95% prediction intervals assuming Bernoulli sampling variance and taking into account s.e.m. (c) Conservative 95% confidence intervals for the cumulative fraction of SNP-heritability explained by the 1Mb regions that contain the most SNP-heritability. Lower bounds are from a cross-validation procedure involving only the disease phenotypes while upper bounds are inferred from the empirical sampling variance of per-region estimates (Online Methods).

To understand the effect of different levels of polygenicity on the distribution of per-megabase SNP-heritability estimates, we simulated quantitative traits of varying polygenicity (2K–600K causal SNPs) with SNP-heritability matching the genome-wide observed-scale estimates for schizophrenia, dyslipidemia, and hypertension (Supplementary Table 3) using PGC2 and GERA genotypes. We then applied the same procedure we applied to the real phenotypes to obtain per-megabase SNP-heritability estimates for the simulated traits (Online Methods) and compared the simulated distributions of per-megabase estimates to the observed distributions, focusing on the fraction of 1Mb regions with estimates of zero (Figure 2b). Intuitively, more polygenic traits have heritability spread more uniformly across 1Mb regions and hence have fewer estimates of 0, as our simulations confirmed. (Based on this statistic, our schizophrenia data is consistent with genetic architectures involving >20,000 causal SNPs; however, we caution that, unlike our analyses below, this estimate is contingent on our parameterization of simulated genetic architectures, as are previous estimates [11, 32].)

We further interrogated our real and simulated distributions of per-megabase SNP-heritability estimates to obtain nonparametric bounds on the cumulative fraction of SNP-heritability explained by varying numbers of true top 1Mb regions (i.e., those that harbor the most SNP-heritability in the population) for schizophrenia, dyslipidemia, and hypertension (Figure 2c). We observed that the probability of observing an estimate of zero for a given 1Mb region is a convex function of the true SNP-heritability of that region (Supplementary Fig. 3), and we harnessed this observation to obtain upper bounds on the cumulative heritability explained by true top regions. To obtain lower bounds on this quantity, we applied a cross-validation procedure (similar to ref. [33]) in which we selected top regions using subsets of the data and estimated heritability explained using left-out test samples (see Online Methods). Combining the upper and lower bounds allowed us to obtain conservative 95% confidence intervals for heritability explained by top regions (Figure 2c), as we verified in simulations (Supplementary Fig. 4). In particular, we inferred that schizophrenia has an extremely polygenic architecture, with most 1Mb regions (conservative 95% CI: 76–100%) containing nonzero contributions to the overall SNP-heritability and very little concentration of SNP-heritability into top 1Mb regions, in contrast to dyslipidemia (Figure 2c). Notably, these bounds are not contingent on any particular parametric model of genetic architecture. (We note that we report only conservative 95% confidence intervals, without parameter estimates, because obtaining point estimates would require assuming a parameterization of genetic architecture.) We repeated all of these analyses using 0.5Mb regions and observed no qualitative differences in the results (Supplementary Figures 2, 3, and 5 and Supplementary Table 7).

Having computed per-megabase estimates, we checked for correlations between estimated per-megabase SNP-heritability and genomic annotations that vary slowly across the genome. Specifically, we tabulated GC content, genic content [6], replication timing [34], recombination rate [35], and background selection [36] per megabase of the genome. (Each of these annotations had an autocorrelation across consecutive 1Mb segments of at least 0.3; see Supplementary Table 8.) For each of schizophrenia, dyslipidemia, and hypertension, we observed the greatest correlation with GC content (p < 10⁻⁵) (Supplementary Table 9). We also observed significant correlations of per-megabase SNP-heritability with genic content, replication timing and recombination rate; however, upon including GC content, which is correlated with each of the other annotations (Supplementary Table 10), as a covariate, all other correlations became non-significant (Supplementary Table 9). To further investigate this finding, we stratified 1Mb regions into GC content quintiles and partitioned SNP-heritability across these strata, observing a clear enrichment of heritability with increasing GC content (Figure 3), which we verified was not due to systematic differences in SNP counts across GC quintiles (Supplementary Table 11). To quantify this enrichment, we performed finer partitioning into 50 GC content strata and regressed SNP-heritability estimates against GC content (Online Methods). We found that a 1% increase in GC content (relative to the median) corresponded to 0.9%, 4.4%, and 3.2% increases in heritability explained (relative to the means) for schizophrenia, dyslipidemia, and hypertension (95% confidence intervals, 0.3–1.5%, 2.1–6.7%, and 1.8–4.6%). Once again, repeating these analyses using 0.5Mb regions produced no qualitative differences in results (Supplementary Fig. 6 and Supplementary Tables 9 and 10).

Figure 3. SNP-heritability of disease liabilities partitioned by GC content.

GC content was computed at 1Mb resolution, after which 1Mb regions were stratified into GC quintiles for variance components analysis. Quintiles 1–5 have median GC contents of 35.7%, 38.1%, 40.2%, 42.8%, and 47.2%, respectively. Error bars, 95% confidence intervals based on REML analytic standard errors.

Finally, we performed chromosome partitioning of SNP-heritability for each disease, as previously done for schizophrenia using N=21K samples [5]. We confirmed a strikingly linear relationship between SNP-heritability of schizophrenia explained per chromosome and chromosome length (Supplementary Fig. 7), consistent with a highly polygenic disease architecture. In contrast, the trend for dyslipidemia was noticeably less linear, consistent with the existence of large-effect loci (Supplementary Fig. 7).

Enrichment of SNP-heritability in higher-frequency SNPs

Given the high observed-scale heritability of schizophrenia on the full N=50K data set (Supplementary Table 3), we reasoned that analyses partitioning schizophrenia SNP-heritability by allele frequency would produce results with small enough standard errors to yield high-confidence conclusions, providing greater resolution than the results of ref. [5] based on N=21K samples. We began by running minor allele frequency (MAF)-partitioned heritability analyses of simulated quantitative phenotypes based on UK10K sequencing data (see Online Methods and URLs). We simulated genetic architectures in which causal SNPs were drawn from SNPs with MAF p≥0.1% and were randomly assigned allele effect sizes with variances proportional to (p(1 − p))^α for various values of α between −1 and 0 (ref. [29, 30]) (Online Methods). Under this parameterization, α = −1 corresponds to a model in which rare SNPs have larger per-allele effects, so that all SNPs have the same expected contribution to variance [1], while α = 0 corresponds to a model with no selection [37] in which all alleles have similar per-allele effects, so that on average rarer SNPs contribute less variance. We performed MAF-partitioned analyses [30] over six MAF bins (partitioning the 2–50% MAF range) using tag SNPs from the PGC2 data set, and we observed that the heritability captured by tag SNPs in each bin (defined in Online Methods) accounted for most but not all of the true heritability contributed by causal UK10K variants in each bin (defined in Online Methods) (Fig. 4a).
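
The two limiting cases of the α parameterization can be checked numerically. The sketch below is illustrative and not the simulation code used for the analyses: it assumes Hardy-Weinberg genotype variance 2p(1−p) and an arbitrary proportionality constant of 1, and the function names are hypothetical.

```python
def per_allele_effect_variance(maf: float, alpha: float) -> float:
    """Per-allele effect-size variance, proportional to (p(1-p))^alpha (constant set to 1)."""
    return (maf * (1.0 - maf)) ** alpha


def expected_variance_contribution(maf: float, alpha: float) -> float:
    """Expected phenotypic variance contributed by one causal SNP.

    Assumes Hardy-Weinberg equilibrium, so genotype variance is 2p(1-p).
    """
    genotype_variance = 2.0 * maf * (1.0 - maf)
    return genotype_variance * per_allele_effect_variance(maf, alpha)


for alpha in (-1.0, -0.34, 0.0):
    contributions = [expected_variance_contribution(p, alpha) for p in (0.02, 0.1, 0.5)]
    print(alpha, [round(c, 4) for c in contributions])
```

With α = −1 the contribution is the same at every frequency, and with α = 0 it grows in proportion to p(1−p). At an intermediate value such as α = −0.34, rarer SNPs have larger per-allele effect variance yet contribute less variance per SNP.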

Figure 4. Inferred heritability of schizophrenia liability due to SNPs of various allele frequencies.

(a) Simulated narrow-sense heritability per MAF bin (dashed blue curves) and estimated SNP-heritability per MAF bin (solid red curves) for quantitative phenotypes with genetic architectures in which SNPs of minor allele frequency p have average per-allele effect size variance proportional to (p(1 − p))^α. Simulations used causal SNPs with MAF≥0.1% in UK10K sequencing data and tag SNPs from our PGC2 analyses; error bars, 95% confidence intervals based on 4,000 runs. (b) SNP-heritability (red) and inferred narrow-sense heritability (blue) of schizophrenia liability partitioned across six MAF bins. Point estimates of narrow-sense heritability per bin are based on interpolated values of the ratio of the two quantities at α = −0.34, which provided the best least-squares fit between observed per-bin SNP-heritability and values interpolated from the simulations in panel (a) (Supplementary Fig. 8). (c) Inferred narrow-sense heritability of schizophrenia liability explained per SNP in each MAF bin, i.e., the narrow-sense heritability estimates in panel (b) normalized by UK10K SNP counts (Supplementary Table 12). Schizophrenia SNP-heritability error bars, 95% confidence intervals based on REML analytic standard errors. Schizophrenia narrow-sense heritability error bars (per bin and per SNP), unions of 95% confidence intervals assuming −1 ≤ α ≤ 0.

We next performed MAF-partitioning of schizophrenia SNP-heritability by running BOLT-REML on the full PGC2 data set with variance components corresponding to the same six MAF bins (Fig. 4b). We then estimated total narrow-sense heritability contributed per MAF bin (Fig. 4b) by performing a least-squares fit of observed per-bin SNP-heritability against data from our simulations, interpolated for −1 ≤ α ≤ 0; this procedure yielded a best-fit value of α = −0.34 (Supplementary Fig. 8), from which we inferred narrow-sense heritability per bin. To keep our inferences robust to model parameterization, we computed conservative 95% confidence intervals by taking the union of 95% confidence intervals assuming different values of α (−1 ≤ α ≤ 0). Finally, we divided these estimates by the number of UK10K SNPs per bin (Supplementary Table 12) to estimate the average heritability explained per SNP in each MAF bin (Fig. 4c), observing a clear increase in heritability explained per SNP with increasing allele frequency. We observed the same general trend in analyses of GERA diseases, although the results were noisier due to smaller SNP-heritability (Supplementary Fig. 9).

Genetic correlations across GERA diseases

The availability of multiple phenotypes across all GERA samples also allowed us to estimate the genetic correlations and total correlations (r_g and r_l, defined in Online Methods) among disease liabilities (Figure 5 and Supplementary Table 13). We estimated genetic correlations by running bivariate BOLT-REML for each pair of case-control traits [7] and total liability-scale correlations by Monte Carlo simulations to match total observed-scale correlations (Online Methods). We first ran the analysis using only our standard set of covariates (age, sex, 10 principal components, and Affymetrix kit type) (Fig. 5a) and then reran the analysis including BMI as an additional covariate (Fig. 5b). We verified that of the nine survey-derived covariates provided with the GERA data set, BMI was the only one relevant to our analysis (Supplementary Fig. 10). Interestingly, we observed that adjusting for BMI lowered genetic correlations by a multiplicative factor of 0.75 (s.e.=0.05) and total correlations by a factor of 0.81 (s.e.=0.03), as assessed by regressing BMI-adjusted correlations on unadjusted correlations, suggesting that some correlation signal among these diseases may be mediated by BMI. Of the 13 significant genetic correlations in the unadjusted analysis, six became non-significant upon adjusting for BMI, leaving a very strong genetic correlation between asthma and allergic rhinitis (r_g=0.85, s.e.=0.11) and a cluster of six moderate genetic correlations among cardiovascular disease, type 2 diabetes, dyslipidemia, and hypertension (r_g=0.27–0.43) (Supplementary Table 13).

Figure 5. Genetic correlations and total correlations of GERA disease liabilities.

(a) Correlations from bivariate analyses using only age, sex, 10 principal components, and Affymetrix kit type as covariates. (b) Correlations from bivariate analyses including BMI as an additional covariate. Genetic correlations are above the diagonals; total liability correlations are below the diagonals. Asterisks indicate genetic correlations that are significantly positive (z > 3) accounting for 36 trait pairs tested. Numeric data including standard errors are provided in Supplementary Table 13.

We further investigated the relationship between genetic correlations (r_g) and total correlations (r_l) among disease liabilities. We observed that r_g significantly exceeded r_l for asthma and allergic rhinitis (r_g=0.85 vs. r_l=0.46; p=0.008) after adjusting for 36 hypotheses tested; no other pair reached significance. We also observed an approximately linear relationship between genetic correlation and total liability correlation; regressing r_g on r_l yielded a proportionality constant of r_g/r_l=1.3 (s.e.=0.1, with the caveat that the 36 trait pairs are not independent) robust to the choice of whether or not to use BMI as a covariate (Supplementary Fig. 11).

Discussion

We have introduced a new fast algorithm, BOLT-REML, for variance components analysis involving multiple variance components and multiple traits, and demonstrated that it enables accurate large-sample heritability analyses that were previously computationally intractable. Such analyses will be essential to attaining the statistical resolution necessary to reveal deeper insights into the genetic architecture of complex traits [15, 16]. We have applied BOLT-REML to study human complex diseases in roughly 50K samples from each of the PGC2 and GERA data sets. At this sample size, we uncovered multiple insights into complex disease architecture, including extreme polygenicity of schizophrenia, enrichment of complex disease SNP-heritability in GC-rich regions and in higher-frequency SNPs, and significant genetic correlations among several GERA diseases.

Our per-megabase analyses of SNP-heritability in schizophrenia, dyslipidemia, and hypertension revealed contrasting levels of polygenicity, with schizophrenia exhibiting an exceptionally polygenic architecture. Our inference that the large majority of 1Mb regions of the genome (76–100%) contain schizophrenia loci evokes the concern that complex trait GWAS of increasing sample sizes will ultimately implicate the entire genome, becoming uninformative [38]. Recent very large-scale GWAS [12, 33, 39] have begun grappling with this problem by focusing on biological pathways or gene sets instead of individual SNPs [40]. No previous study has demonstrated the extreme level of polygenicity we have observed here, however; in light of this result, methods that further interrogate association signal at the pathway level will be essential to extracting further biological insights about schizophrenia [41]. An additional question that this finding raises is whether the polygenicity would diminish in analyses with more homogeneous sample recruitment or phenotype (e.g., treatment resistant); future studies may be sufficiently powered to answer this question. As to our observation of enrichment of SNP-heritability with increasing GC content, further study will be required to disentangle the mechanisms underlying this phenomenon; previous work has shown that GC architecture has complex effects on recombination and replication timing [34] as well as DNA methylation [42].

Our results from partitioning the SNP-heritability of schizophrenia and GERA diseases across the 2–50% allele frequency spectrum shed light on the extent to which rarer SNPs tend to have larger per-allele effects, as predicted by evolutionary models [43, 44]. Our analysis of schizophrenia, based on well-imputed SNPs with MAF≥2%, does not assess the contribution of rare variants (MAF<1%) due to the need for stringent QC in heritability analyses of ascertained case-control cohorts [3]; however, the trend for SNPs with MAF 2–50% (Fig. 4b,c) strongly suggests that rarer SNPs have larger effect sizes per allele, yet explain less variance per SNP (corresponding to a best-fit value of −0.34 for the MAF-effect parameter α; ref. [29, 30]). While further study of more phenotypes and rarer variants is needed, this observation implies that the implicit assumption of α = −1 made by standard analyses of heritability [1] and mixed model association [19, 26] may be suboptimal, leaving room for further improvement on both fronts.

Our correlation analyses of GERA disease phenotypes identified a very strong genetic correlation (r_g=0.85, s.e.=0.11) between asthma and allergic rhinitis. While the link between asthma and allergy has long been known and recent GWAS have identified many shared associations, the extent to which these two diseases are genetically related has not previously been quantified [45–47]. Among other disease pairs, our observation of significant genetic correlations among metabolic diseases confirms and adds resolution to previous estimates [48, 49], while our observation of significant broad decreases in genetic and total correlations upon including BMI as a covariate highlights the importance of carefully considering the effects of heritable covariates when conducting and interpreting genetic analyses [50]. Additionally, our empirical observation of an approximately linear relationship between correlations of total liability and genetic correlations [51], viewed in conjunction with a similar (but noisier) empirical observation among a set of seven quantitative metabolic traits [49], suggests the generality of such a trend for human complex traits.

Methodologically, while the variance components (REML) approach [1] that we have applied and accelerated here enjoys widespread use, three alternative approaches to heritability analysis (with various trade-offs) have recently been proposed. First, the Bayesian sparse linear mixed model [52] adapts the variance components approach to better model traits with large-effect loci, slightly reducing standard errors at the expense of much larger computational cost; integrating this approach into BOLT-REML is a potential future direction. Second, PCGC regression [27], which generalizes Haseman-Elston regression [53], is not subject to downward bias under case-control ascertainment (which we finessed by averaging multiple small-sample REML estimates); however, PCGC estimates have somewhat higher standard errors. Third, LD Score regression [48, 54] is a very different approach that makes inference using only GWAS summary statistics—not genotype data. LD Score regression has the disadvantage of somewhat higher standard errors (vs. REML) that further increase if inference is desired for small regions of the genome; as such, we are not currently aware of a method for assessing degree of polygenicity using summary statistics. All of these methods have the limitation that they assume independence of genetic and environmental effects; violation of this assumption may cause bias.

Compared to existing REML methods, the BOLT-REML algorithm we have proposed is much more computationally efficient; however, our approach does have limitations. First, because BOLT-REML achieves its speedup by avoiding direct computation of likelihoods, it is unable to compute likelihood ratio tests to assess whether variance parameters are significantly nonzero. In fact, the assumptions underlying REML analytic standard errors break down for parameter estimates of zero (and more generally, at the parameter space boundary; see Supplementary Note). GCTA [2] provides an unconstrained optimization feature that allows negative variance estimates, thereby sidestepping this issue and also reducing constraint-induced bias; incorporating such a feature into BOLT-REML is a potential future direction. Second, BOLT-REML, like all REML algorithms, occasionally fails to converge when variance parameters are poorly constrained, typically for multi-component models at small sample sizes (N≪5,000). Given that sample sizes are steadily increasing, however, we expect BOLT-REML to be a robust choice for harnessing the full power of large-scale cohorts to further elucidate complex trait architectures.

URLs. BOLT-REML software and source code (implemented in the BOLT-LMM v2.0 package), http://www.hsph.harvard.edu/alkes-price/software/.

GCTA software, http://www.complextraitgenomics.com/software/gcta/. PLINK2 software, https://www.cog-genomics.org/plink2.

KING software, http://people.virginia.edu/~wc9c/KING/.

EIGENSOFT v6.0.1, including open-source implementation of FastPCA, http://www.hsph.harvard.edu/alkes-price/software/.

GERA data set, http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000674.v1.p1.

UK10K project, http://www.uk10k.org/.

Online Methods

BOLT-REML algorithm

The overall framework of the BOLT-REML algorithm is Monte Carlo AI REML [22], a Newton-type iterative optimization of the (restricted) log likelihood with respect to the variance parameters sought. BOLT-REML computes initial parameter estimates using the single variance component estimation procedure of BOLT-LMM [19]. Then, in each iteration, BOLT-REML rapidly approximates the gradient of the log likelihood using pseudorandom Monte Carlo sampling [24] and the Hessian of the log likelihood using the average information matrix [25]. BOLT-REML efficiently computes both approximations using conjugate gradient iteration [17, 18] with the performance optimizations applied by BOLT-LMM [19]. The approximate gradient and Hessian produce a local quadratic model of the likelihood surface, which we optimize within an adaptive trust region radius (key to achieving robust convergence) to obtain a proposed step. To evaluate success of the proposed step (i.e., determine whether to accept the step, whether to change the trust region radius, and whether the optimization has converged) we introduce a gradient-based approximation to the change in log likelihood achieved by the step. Details are described in the Supplementary Note.
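As a rough illustration of the conjugate gradient step only, the following Python sketch solves Vx = y for a single-component covariance V = σ²_g XXᵀ/M + σ²_e I using matrix-vector products, so that V is never formed or inverted. The toy dimensions, random "genotypes", variance values, and function names are assumptions made for illustration; the sketch omits the Monte Carlo sampling, the average information matrix, the trust region logic, and the BOLT-LMM performance optimizations, and it is not the BOLT-REML implementation.

```python
import random

def matvec_V(X, v, var_g, var_e):
    """Return V v for V = var_g * X X^T / M + var_e * I without forming V."""
    n, m = len(X), len(X[0])
    t = [sum(X[i][j] * v[i] for i in range(n)) for j in range(m)]      # X^T v
    Xt = [sum(X[i][j] * t[j] for j in range(m)) / m for i in range(n)]  # X (X^T v) / M
    return [var_g * Xt[i] + var_e * v[i] for i in range(n)]

def conjugate_gradient(apply_A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive definite A given only x -> A x."""
    x = [0.0] * len(b)
    r = list(b)                      # residual b - A x (x starts at zero)
    p = list(r)                      # search direction
    rs_old = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:      # residual norm small enough
            break
        Ap = apply_A(p)
        alpha = rs_old / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = sum(ri * ri for ri in r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x

# Toy problem (all values hypothetical): N samples, M normalized markers.
random.seed(0)
N, M = 30, 50
X = [[random.gauss(0, 1) for _ in range(M)] for _ in range(N)]
y = [random.gauss(0, 1) for _ in range(N)]

def apply_V(v):
    return matvec_V(X, v, var_g=0.5, var_e=0.5)

x = conjugate_gradient(apply_V, y)
residual = max(abs(a - b) for a, b in zip(apply_V(x), y))
```

Each matrix-vector product costs time proportional to N × M, which is why iterative solvers of this kind scale to cohorts where forming and factoring an N × N matrix would be prohibitive.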

Accuracy of BOLT-REML variance components analysis

We verified the accuracy of BOLT-REML analysis by simulating quantitative traits with infinitesimal architectures using genotypes from subsets of the GERA data set and partitioning heritability by chromosome. On a first set of 50,000 simulations using genotypes from N=2,000 samples on chromosomes 21–22, BOLT-REML correctly estimated components of heritability, computing nearly identical results to GCTA [2] when run with 100 Monte Carlo trials, and incurring only 1.03 times higher standard errors when run with 15 Monte Carlo trials (Supplementary Table 14), consistent with theory (Supplementary Note). On additional sets of 100 simulations using genotypes from N=10,000 samples on chromosomes 1–2, BOLT-REML correctly estimated genetic correlations in bivariate analyses of simulated quantitative traits [7] (Supplementary Table 15) and randomly ascertained case-control traits using a liability threshold model [3] (Supplementary Table 16). Finally, in simulated N=50K case-control cohorts over-ascertained for cases (simulated using the same mosaic chromosome framework as ref. [19]), we observed that while absolute estimates of heritability were downward biased, as previously demonstrated [26, 27], relative contributions of variance components were still accurately estimated when partitioning heritability by chromosome or minor allele frequency (Supplementary Fig. 12).

PGC2 data set

We analyzed the PGC2 schizophrenia data set [12], applying the following filters. Of 39 European-ancestry cohorts available to us for analysis, we first eliminated 10 cohorts (containing 12% of the available samples) with the lowest numbers of well-imputed SNPs. We further filtered out samples with <90% European ancestry as determined by SNPweights v2.0 (ref. [55]). Finally, we extracted an unrelated subset of individuals (pairwise genetic similarity <0.0884) using KING v1.4 (--unrelated --degree 3; see URLs; refs. [56, 57]), comprising 22,177 cases and 27,629 controls (Supplementary Table 2). Of the imputed genotypes previously computed for each cohort, we restricted to well-imputed autosomal markers (genotype call confidence P>0.8 with <2% missing rate in the cohort), given that stringent QC is critical to avoid inflated estimates of components of heritability in ascertained case-control data [3]. We then merged the 29 cohorts, taking the union of remaining markers across cohorts and then restricting to markers with total missing rate <5%, leaving 4.4 million markers. We further imposed a >2% MAF threshold based on the imputation quality of typical arrays at low MAF [58], yielding million markers in substantial LD, to which we applied two rounds of LD-pruning at r²=0.9 (PLINK2 [59] --indep-pairwise 50 5 0.9; see URLs), reducing the number of markers to 596,583 and finally 472,178. Our primary motivation for pruning was to reduce susceptibility of REML estimation to LD bias [29–31]; additionally, pruning reduced computational costs.

GERA data set

We analyzed GERA samples (see URLs; dbGaP study accession phs000674.v1.p1) typed on the GERA EUR chip [58] with phenotypes available for each of 22 disease conditions based on electronic medical records. (Our primary analyses did not include survey-derived phenotypes such as BMI, as the data use conditions stipulated that these phenotypes could only be used as covariates.) We applied similar filters as above, eliminating samples with <90% European ancestry and samples with missing sex, and extracting an unrelated subset of 54,734 individuals using PLINK2 (--rel-cutoff 0.05). We removed SNPs deviating from Hardy-Weinberg equilibrium (p<10⁻⁶) and SNPs with missing rate >2%, leaving 597,736 autosomal SNPs.

UK10K data set

Our simulations used UK10K genotypes from sequencing data (see URLs); we merged the ALSPAC and TWINSUK cohorts, intersected marker sets and eliminated multi-allelic variants (leaving 18 million variants), and extracted 3,567 unrelated individuals using PLINK2.

Definitions of heritability parameters

We define h²_g as the proportion of population variance in disease liability (assuming a liability threshold model [60]) explained by the best linear predictor using typed variants [6]. We call this quantity “SNP-heritability” [1] (although the set of well-imputed variants in our PGC2 data set included a small fraction of biallelic indels). We define h²_g,MAF as the proportion of population variance in disease liability explained by the subset of variants in a particular MAF range within the same best linear predictor (jointly fit using all typed variants) and define h²_g,chr and h²_g,1Mb analogously [6]. We define h² as the total narrow-sense heritability, i.e., the proportion of population variance explained by the best linear predictor using all variants (including untyped variants), and we define h²_MAF as the proportion of population variance explained by all variants in the MAF range (within a predictor using all variants). Finally, we note that we abuse notation slightly by using the above symbols to refer to both true population parameter values and estimates thereof.

Estimating SNP-heritability of disease liabilities

We estimated h²_g for each GERA disease by running BOLT-REML on all samples and all markers in our filtered data set. In all our GERA analyses, we adjusted for age, sex, Affymetrix kit type, and 10 principal component (PC) covariates by residualizing genotypes and phenotypes accordingly. We included PC covariates to eliminate phenotypic variance explained by ancestry; we note that PCs can be computed in linear time using FastPCA [61] (see URLs). We transformed raw REML parameter estimates (denoted h²_g-cc) to h²_g using the linear transformation of ref. [3], assuming that the case fraction for each GERA disease matched population risk.
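For concreteness, the observed-scale to liability-scale conversion can be sketched as follows, assuming the standard form of the linear transformation in ref. [3]: the liability-scale value equals the observed-scale value times K²(1−K)² / (z² P(1−P)), where K is the population risk, P is the case fraction in the sample, and z is the standard normal density at the liability threshold. The function name and the example inputs are hypothetical.

```python
import math
from statistics import NormalDist

def observed_to_liability(h2_obs, K, P=None):
    """Convert an observed-scale (0/1 phenotype) heritability to the liability scale.

    K: population risk; P: case fraction in the sample (defaults to K, i.e.,
    the case fraction is assumed to match population risk).
    """
    if P is None:
        P = K
    nd = NormalDist()
    threshold = nd.inv_cdf(1.0 - K)   # liability threshold for risk K
    z = nd.pdf(threshold)             # normal density at the threshold
    return h2_obs * (K * (1.0 - K)) ** 2 / (z ** 2 * P * (1.0 - P))

# Hypothetical inputs: raw estimate 0.02 for a disease with 5% population risk.
h2_liability = observed_to_liability(0.02, K=0.05)
```

When P equals K the factor reduces to K(1−K)/z², which is π/2 for a trait with 50% risk and grows as the disease becomes rarer.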

For the PGC2 data set, which is over-ascertained for schizophrenia cases, we estimated h²_g by averaging results from 100 BOLT-REML analyses of different 10x-downsampled (N=5K) subsamples of the data in order to ameliorate ascertainment-induced REML bias [26, 27]. In all our PGC2 analyses, we included sex, study indicators, and 10 principal components (from ref. [12]) as covariates. We transformed raw REML estimates (h²_g-cc) to h²_g assuming schizophrenia population risk of 1% (refs. [5, 11, 12]). We estimated our standard error as the combination of sampling error from averaging a finite number of downsamples (s.e.m.=0.4% from 100 random downsamples) and error arising from the finite size of the full N=50K cohort; we approximated the latter error with the analytic s.e. of REML on the full cohort (s.e.=0.5%). Combining these error terms, which we assumed were roughly independent, we obtained s.e.≈0.6%. This estimate may be slightly conservative based on simulations in which we applied the full estimation procedure to independently simulated case-control ascertained data sets and observed a run-to-run s.d. of 0.4% (Supplementary Table 5).
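The combined standard error quoted above is consistent with adding the two roughly independent error terms in quadrature, as this short check shows (variable names are illustrative):

```python
import math

sem_downsampling = 0.4   # percent; s.e.m. from 100 random downsamples
se_full_cohort = 0.5     # percent; analytic REML s.e. on the full cohort

# Independent errors add in quadrature: sqrt(0.4^2 + 0.5^2).
se_combined = math.hypot(sem_downsampling, se_full_cohort)
print(round(se_combined, 2))  # 0.64
```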

Partitioning SNP-heritability across genomic regions

We estimated per-chromosome h²_g,chr by running BOLT-REML on all samples and markers using one variance component per chromosome and rescaling raw REML parameter estimates and standard errors by h²_g/h²_g-cc (Supplementary Table 3), noting that relative variance contributions are accurately estimated by REML even under case-control ascertainment (Supplementary Fig. 12). Estimating per-megabase h²_g,1Mb in an analogous manner would have required fitting a >2500-variance component model, which was computationally intractable, so we instead performed the computation on contiguous chromosomal segments of up to 100 regions at a time, parallelizing computations using GNU parallel [62]. For schizophrenia, we used one variance component per 1Mb region in the segment (discarding regions containing <5 markers) plus a single additional variance component containing all remaining markers. (This approach is similar to ref. [63] but computationally cheaper than directly applying ref. [63] using BOLT-REML.) Including all markers in the model was necessary because of ascertainment-induced genome-wide “linkage disequilibrium” among causal variants [26]; we observed that analyses without the all-remaining-markers variance component produced inflated h²_g,1Mb estimates. For the GERA diseases, we did not observe this phenomenon, as expected for a randomly ascertained trait, so for computational efficiency we included only markers in flanking 1Mb regions in the additional variance component. We ran BOLT-REML with 15 Monte Carlo trials for the extensive computations in this section; we used 100 Monte Carlo trials in all other analyses.

We estimated per-GC-quintile h²_g by stratifying 1Mb regions into GC quintiles and running BOLT-REML as above with one variance component per quintile. To obtain finer resolution for regression analyses, we further stratified 1Mb regions into 50 GC content strata. We then performed a series of BOLT-REML analyses with one variance component containing the first n strata and a second variance component containing the last 50 − n strata, and we estimated h²_g of the nth stratum as the difference between the SNP-heritability estimates for n and n − 1 strata.

Bounding SNP-heritability explained by top 1Mb regions

We bounded the population variance in disease liability explained by the 1Mb regions with largest true h²_g,1Mb using the following procedure. We inferred an upper bound by analyzing the observed distribution of h²_g,1Mb estimates and accounting for sampling variance. Explicitly, we analyzed the probability of obtaining a zero h²_g,1Mb estimate, P(0), as a function of the actual value of h²_g,1Mb (relative to its mean). Because of sampling noise and the nonnegativity constraint on our REML estimates, P(0) is always positive. In lieu of an analytic formula for P(0) as a function of actual h²_g,1Mb, we obtained Monte Carlo estimates of P(0) by simulating quantitative traits (for the samples analyzed, using their actual genotypes) with heritability equal to the SNP-heritability estimate of the actual disease status (Supplementary Table 3). We distributed heritability across varying numbers of causal variants (13 values ranging from 2,000 random markers to all available markers) and assigned each normalized causal variant a normally distributed effect size, repeating each simulation five times. For each of the 65 simulated traits, we estimated h²_g,1Mb for each 1Mb region. Combining these data with the actual h²_g,1Mb per region (i.e., the sum of squared simulated effect sizes), and aggregating the data from all simulations and all 1Mb regions, we obtained a clean empirical estimate of P(0) as a function of actual h²_g,1Mb, which we observed was well-fit by a sum of two exponentials (Supplementary Fig. 3). While the empirical curve was based on simulation data, we believe it is robust to choices such as simulation parameters and architecture, as it simply measures the sampling distribution of constrained REML estimates for our genotype data at a given actual h²_g,1Mb.

To interpret the observed fraction of zero h²_g,1Mb estimates in light of this information, we harnessed the fact that the decay curve of P(0) vs. actual h²_g,1Mb is convex (Supplementary Fig. 3). In particular, if a set of 1Mb regions has a fixed average actual h²_g,1Mb, their average P(0) is minimized when all the regions have equal actual h²_g,1Mb (by Jensen’s inequality). Conversely, an uneven distribution of actual h²_g,1Mb across regions tends to increase the number of zero estimates. These observations allowed us to bound the maximum fraction of h²_g that could be explained by top 1Mb regions and still be consistent with the observed fraction of zero estimates. Explicitly, if a certain number of top regions explain SNP-heritability h²_g,top, then the sum of P(0) over all regions is minimized by setting h²_g,1Mb of each top region to h²_g,top / (#top regions) and h²_g,1Mb of each remaining region to (h²_g − h²_g,top) / (#non-top regions). We therefore bounded h²_g,top by requiring this minimum expected number of zero estimates to be at most the observed number of zero estimates (plus 1.96 times its s.e. for a conservative 95% confidence bound). We checked the accuracy of this procedure using simulated case-control ascertained data sets (Supplementary Fig. 4).
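The bounding argument can be sketched in Python as follows. The P(0) curve below is a hypothetical sum of two exponentials with made-up coefficients (the fitted curve is in Supplementary Fig. 3 and is not reproduced here), and the region counts and observed number of zero estimates are likewise invented for illustration. Heritability per region is expressed relative to the mean across regions, and the bound is found by bisection, which is valid because the minimum expected number of zeros is nondecreasing in the top-region fraction once that fraction exceeds the equal-allocation value.

```python
import math

def p_zero(rel_h2):
    """Hypothetical convex, decreasing P(0) curve; rel_h2 is h2 relative to the mean."""
    return 0.3 * math.exp(-2.0 * rel_h2) + 0.2 * math.exp(-0.5 * rel_h2)

def min_expected_zeros(n_regions, n_top, frac_top):
    """Minimum expected number of zero estimates if top regions explain frac_top.

    By Jensen's inequality the minimum is attained when heritability is spread
    evenly within the top regions and evenly within the remaining regions.
    """
    n_rest = n_regions - n_top
    x_top = frac_top * n_regions / n_top            # relative h2 per top region
    x_rest = (1.0 - frac_top) * n_regions / n_rest  # relative h2 per other region
    return n_top * p_zero(x_top) + n_rest * p_zero(x_rest)

def upper_bound_frac_top(n_regions, n_top, zeros_obs, zeros_se, iters=60):
    """Largest frac_top whose minimum expected zero count stays within the limit."""
    limit = zeros_obs + 1.96 * zeros_se
    lo, hi = n_top / n_regions, 1.0                 # lo = equal allocation
    if min_expected_zeros(n_regions, n_top, lo) > limit:
        raise ValueError("fewer zeros observed than any allocation can produce")
    if min_expected_zeros(n_regions, n_top, hi) <= limit:
        return 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if min_expected_zeros(n_regions, n_top, mid) <= limit:
            lo = mid
        else:
            hi = mid
    return lo

# Hypothetical example: 2,500 regions, top 10% of regions, 600 observed zeros.
bound = upper_bound_frac_top(n_regions=2500, n_top=250, zeros_obs=600, zeros_se=20)
```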

We obtained lower bounds on the fraction of h²_g explained by top 1Mb regions by 3-fold cross-validation. For each fold in turn, we estimated h²_g,1Mb for each region using the remaining two folds, ranked regions accordingly, and then estimated the SNP-heritability explained by top-ranked regions using the left-out fold. We repeated this procedure three times, obtaining nine estimates per fraction of regions, and computed the mean minus 1.96 times the s.e.m. (assuming roughly independent sampling noise) as a conservative 95% confidence lower bound on SNP-heritability explained by top regions. This lower bound is weak because the finite sample size of the training folds prevents an accurate ranking of regions, especially those contributing small amounts of variance; however, it does capture concentration of heritability in large-effect loci, e.g., for dyslipidemia (Fig. 2c).
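The final step of the lower-bound calculation (mean minus 1.96 times the s.e.m. over the nine cross-validation estimates) amounts to the following; the nine values are placeholders, not results from this study.

```python
import math
from statistics import mean, stdev

def lower_bound_95(estimates):
    """Mean minus 1.96 s.e.m., treating the estimates as roughly independent."""
    sem = stdev(estimates) / math.sqrt(len(estimates))
    return mean(estimates) - 1.96 * sem

# Placeholder values: 3 folds x 3 repetitions for one fraction of top regions.
cv_estimates = [0.21, 0.18, 0.25, 0.19, 0.22, 0.20, 0.24, 0.17, 0.23]
cv_lower = lower_bound_95(cv_estimates)
```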

Partitioning SNP-heritability across allele frequency bins

We computed per-MAF bin h²_g,MAF estimates in a manner analogous to h²_g,chr estimates. To infer per-MAF bin h²_MAF explained by untyped as well as typed variants, we ran simulations using UK10K sequencing data to assess the tagging efficiency of our PGC2 and GERA marker sets in various MAF ranges. Specifically, we simulated fully heritable quantitative traits in which normalized SNPs with MAF p≥0.1% (in the UK10K data) were selected as causal with probability 0.5% and assigned normally distributed effect sizes with variance (p(1 − p))^α. (This setup assumes that UK10K SNPs explain all narrow-sense heritability, but given that we are only interested in tagging efficiency at MAF≥2%, our estimation procedure is robust to violations of this assumption.) We performed 4,000 simulations for each of α = 0, −0.25, −0.5, −1. For each marker set, we then computed REML estimates of h²_g,MAF for each simulated trait across six MAF bins (Fig. 4) using one variance component per bin [30] and restricting to SNPs in the marker set. A small subset of the PGC2 marker IDs (8%) and GERA SNP IDs (4%) were not present among the UK10K SNP IDs, so we did not include these markers in our REML analyses of simulated traits; we verified that the inclusion vs. exclusion of these markers had a negligible effect on schizophrenia h²_g,MAF estimates (Supplementary Fig. 13). We performed REML analyses of UK10K simulated traits using a slightly modified version of GCTA v1.21 [2] in order to perform robust unconstrained REML (i.e., allow negative estimates); at low sample sizes, constrained REML estimates are upward biased due to noise and the positivity constraint. (We modified GCTA to improve robustness in this setting by adding a trust region framework to its REML optimization.) Finally, we computed h²_MAF for the simulated traits by summing squared simulated effect sizes.
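A minimal sketch of the effect-size simulation and the "sum of squared effect sizes" bookkeeping is given below. It assumes per-normalized-genotype effects and, like the description above, attributes heritability to each bin by summing squared effects. The uniformly drawn allele frequencies, the number of variants, and the bin edges are placeholders (the actual six bins are those of Fig. 4), and the function names are illustrative.

```python
import random

def simulate_effects(mafs, alpha, p_causal=0.005, seed=0):
    """Draw causal effects with variance (p(1-p))^alpha, scaled to a fully heritable trait."""
    rng = random.Random(seed)
    betas = []
    for p in mafs:
        if rng.random() < p_causal:                 # variant chosen as causal
            sd = (p * (1.0 - p)) ** (alpha / 2.0)   # s.d. = sqrt of the variance
            betas.append(rng.gauss(0.0, sd))
        else:
            betas.append(0.0)
    total = sum(b * b for b in betas)
    if total == 0.0:
        return betas
    return [b / total ** 0.5 for b in betas]        # squared effects now sum to 1

def h2_by_maf_bin(mafs, betas, bin_edges):
    """Sum squared effect sizes within each [edge_k, edge_{k+1}) MAF bin."""
    h2 = [0.0] * (len(bin_edges) - 1)
    for p, b in zip(mafs, betas):
        for k in range(len(h2)):
            if bin_edges[k] <= p < bin_edges[k + 1]:
                h2[k] += b * b
                break
    return h2

# Placeholder inputs: 20,000 variants with MAF between 0.1% and 50%.
maf_rng = random.Random(1)
mafs = [maf_rng.uniform(0.001, 0.5) for _ in range(20000)]
bin_edges = [0.001, 0.02, 0.05, 0.1, 0.2, 0.3, 0.51]   # hypothetical bin edges
betas = simulate_effects(mafs, alpha=-0.5)
h2_bins = h2_by_maf_bin(mafs, betas, bin_edges)
```

More negative values of α give rarer variants larger per-normalized-genotype effects, shifting simulated heritability toward the low-MAF bins.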

Estimating genetic correlations and total correlations of disease liabilities

For each pair of GERA diseases, we estimated their genetic correlation (denoted r_g) directly from bivariate BOLT-REML, which models both genetic and residual covariance, using all samples and markers. Under a liability threshold model, the estimated genetic correlation (using observed case-control phenotypes) accurately reflects the genetic correlation of underlying disease liabilities, so we did not need to transform raw BOLT-REML r_g parameter estimates [7]. However, the total correlation of observed case-control phenotypes is damped relative to the total correlation of underlying disease liabilities (which we denote by r_l): assuming two diseases have bivariate normal liabilities l₁ and l₂ with correlation r_l, the correlation of case-control phenotypes is r_p = corr(l₁>z₁, l₂>z₂), where z₁ and z₂ are appropriate liability thresholds. In general, |r_p|≤|r_l| under a bivariate normal liability threshold model; e.g., two traits with the same liabilities (r_l=100%) but different thresholds (z₁≠z₂) have r_p<r_l. We recovered r_l from r_p by straightforward Monte Carlo simulation, performing a binary search to determine the value of r_l producing the observed r_p assuming values of z₁ and z₂ corresponding to GERA case fractions. Similarly, we obtained an s.e. for r_l by transforming the 95% confidence interval for r_p (based on its s.e.) in the same way. Finally, we note that for analyses in which we included BMI (coded on a 1–5 scale in the GERA data) as a covariate, we included an additional missing indicator covariate marking samples with missing BMI (5%).
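The recovery of r_l from r_p might be sketched as follows. This is an assumption-laden illustration rather than the procedure actually run: it reuses one fixed set of standard normal draws for every candidate r_l so that the simulated r_p varies smoothly with r_l, and the case fractions (10% and 20%), number of draws, and function names are hypothetical.

```python
import math
import random
from statistics import NormalDist

def make_draws(n, seed=0):
    """Pairs of independent standard normal draws, reused across candidate r_l values."""
    rng = random.Random(seed)
    return [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n)]

def case_control_corr(r_l, z1, z2, draws):
    """Monte Carlo estimate of corr(1[l1>z1], 1[l2>z2]) for liabilities with correlation r_l."""
    c = math.sqrt(1.0 - r_l * r_l)
    n = len(draws)
    n1 = n2 = n12 = 0
    for u, v in draws:
        case1 = u > z1                     # l1 = u
        case2 = r_l * u + c * v > z2       # l2 has correlation r_l with l1
        n1 += case1
        n2 += case2
        n12 += case1 and case2
    p1, p2, p12 = n1 / n, n2 / n, n12 / n
    return (p12 - p1 * p2) / math.sqrt(p1 * (1 - p1) * p2 * (1 - p2))

def recover_liability_corr(r_p, z1, z2, draws, iters=25):
    """Binary search for the r_l whose simulated case-control correlation equals r_p."""
    lo, hi = -0.999, 0.999
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if case_control_corr(mid, z1, z2, draws) < r_p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

nd = NormalDist()
z1, z2 = nd.inv_cdf(1 - 0.10), nd.inv_cdf(1 - 0.20)   # thresholds for 10% and 20% case fractions
draws = make_draws(20000)
r_p = case_control_corr(0.5, z1, z2, draws)           # damped relative to r_l = 0.5
r_l = recover_liability_corr(r_p, z1, z2, draws)
```

Applying `recover_liability_corr` to the endpoints of the 95% confidence interval for r_p gives the corresponding interval for r_l, from which a standard error can be read off.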

Acknowledgments.

We are grateful to T. Hayeck, P. Palamara, J. Listgarten, V. Anttila, S. Sunyaev, R. Walters, P. Sullivan, M. Keller, M. Goddard, P. Visscher, and J. Yang for helpful discussions. This research was supported by US National Institutes of Health grants R01 HG006399 and R01 MH101244 and US National Institutes of Health fellowship F32 HG007805. H. K. F. was supported by the Fannie and John Hertz Foundation. Members of the Schizophrenia Working Group of the Psychiatric Genomics Consortium are listed in the Supplementary Note. Statistical analyses of PGC2 data were carried out on the Genetic Cluster Computer (http://www.geneticcluster.org) hosted by SURFsara and financially supported by the Netherlands Scientific Organization (NWO 480-05-003 PI: Posthuma) along with a supplement from the Dutch Brain Foundation and the VU University Amsterdam. Analyses of GERA data were conducted on the Orchestra High Performance Compute Cluster at Harvard Medical School, which is partially supported by grant NCRR 1S10RR028832-01.

Footnotes

† A full list of members is provided in the Supplementary Note.

References

1.↵
Yang, J. et al. Common SNPs explain a large proportion of the heritability for human height. Nature Genetics 42, 565–569 (2010).
OpenUrl CrossRef PubMed Web of Science
2.↵
Yang, J., Lee, S. H., Goddard, M. E. & Visscher, P. M. GCTA: a tool for genome-wide complex trait analysis. American Journal of Human Genetics 88, 76–82 (2011).
OpenUrl CrossRef PubMed
3.↵
Lee, S. H., Wray, N. R., Goddard, M. E. & Visscher, P. M. Estimating missing heritability for disease from genome-wide association studies. American Journal of Human Genetics 88, 294–305 (2011).
OpenUrl CrossRef PubMed Web of Science
4.↵
Yang, J. et al. Genome partitioning of genetic variation for complex traits using common SNPs. Nature Genetics 43, 519–525 (2011).
OpenUrl CrossRef PubMed
5.↵
Lee, S. H. et al. Estimating the proportion of variation in susceptibility to schizophrenia captured by common SNPs. Nature Genetics 44, 247–250 (2012).
OpenUrl CrossRef PubMed
6.↵
Gusev, A. et al. Partitioning heritability of regulatory and cell-type-specific variants across 11 common diseases. American Journal of Human Genetics 95, 535–552 (2014).
OpenUrl CrossRef PubMed
7.↵
Lee, S. H., Yang, J., Goddard, M. E., Visscher, P. M. & Wray, N. R. Estimation of pleiotropy between complex diseases using single-nucleotide polymorphism-derived genomic relationships and restricted maximum likelihood. Bioinformatics 28, 2540–2542 (2012).
OpenUrl CrossRef PubMed Web of Science
8.↵
Lee, S. H. et al. Genetic relationship between five psychiatric disorders estimated from genome-wide SNPs. Nature Genetics (2013).
9.↵
Purcell, S. M. et al. Common polygenic variation contributes to risk of schizophrenia and bipolar disorder. Nature 460, 748–752 (2009).
OpenUrl CrossRef PubMed Web of Science
10.
Ripke, S. et al. Genome-wide association study identifies five new schizophrenia loci. Nature Genetics 43, 969 (2011).
OpenUrl CrossRef PubMed
11.↵
Ripke, S. et al. Genome-wide association analysis identifies 13 new risk loci for schizophrenia. Nature Genetics 45, 1150–1159 (2013).
OpenUrl CrossRef PubMed
12.↵
Ripke, S. et al. Biological insights from 108 schizophrenia-associated genetic loci. Nature 511, 421–427 (2014).
OpenUrl CrossRef PubMed Web of Science
13.↵
Willer, C. J. et al. Discovery and refinement of loci associated with lipid levels. Nature Genetics (2013).
14.↵
Mahajan, A. et al. Genome-wide trans-ancestry meta-analysis provides insight into the genetic architecture of type 2 diabetes susceptibility. Nature Genetics 46, 234–244 (2014).
OpenUrl CrossRef PubMed
15.↵
Visscher, P. M. et al. Statistical power to detect genetic (co)variance of complex traits using SNP data in unrelated samples. PLoS Genetics 10, e1004269 (2014).
OpenUrl
16.↵
Visscher, P. M. & Goddard, M. E. A general unified framework to assess the sampling variance of heritability estimates using pedigree or marker-based relationships. Genetics (2015).
17.↵
Legarra, A. & Misztal, I. Computing strategies in genome-wide selection. Journal of Dairy Science 91, 360–366 (2008).
OpenUrl CrossRef PubMed Web of Science
18.↵
VanRaden, P. Efficient methods to compute genomic predictions. Journal of Dairy Science 91, 4414–4423 (2008).
OpenUrl CrossRef PubMed Web of Science
19.↵
Loh, P.-R. et al. Efficient Bayesian mixed model analysis increases association power in large cohorts. Nature Genetics (2015).
20.↵
Henderson, C. Application of Linear Models in Animal Breeding (University of Guelph, 1984).
21.↵
Henderson, C. & Quaas, R. Multiple trait evaluation using relatives’ records. Journal of Animal Science (1976).
22.↵
Matilainen, K., Mäntysaari, E. A., Lidauer, M. H., Strandén, I. & Thompson, R. Employing a Monte Carlo Algorithm in Newton-Type Methods for Restricted Maximum Likelihood Estimation of Genetic Parameters. PLoS ONE 8, e80821 (2013).
OpenUrl CrossRef PubMed
23.↵
Patterson, H. D. & Thompson, R. Recovery of inter-block information when block sizes are unequal. Biometrika 58, 545–554 (1971).
OpenUrl CrossRef Web of Science
24.↵
García-Cortés, L. A., Moreno, C., Varona, L. & Altarriba, J. Variance component estimation by resampling. Journal of Animal Breeding and Genetics 109, 358–363 (1992).
OpenUrl
25.↵
Gilmour, A. R., Thompson, R. & Cullis, B. R. Average information REML: An efficient algorithm for variance parameter estimation in linear mixed models. Biometrics 1440–1450 (1995).
26.↵
Yang, J., Zaitlen, N. A., Goddard, M. E., Visscher, P. M. & Price, A. L. Advantages and pitfalls in the application of mixed-model association methods. Nature Genetics 46, 100– 106 (2014).
OpenUrl CrossRef PubMed
27.↵
Golan, D., Lander, E. S. & Rosset, S. Measuring missing heritability: Inferring the contribution of common variants. Proceedings of the National Academy of Sciences 111, E5272– E5281 (2014).
OpenUrl Abstract/FREE Full Text
28.↵
Saha, S., Chant, D., Welham, J. & McGrath, J. A systematic review of the prevalence of schizophrenia. PLoS Medicine 2, e141 (2005).
OpenUrl
29.↵
Speed, D., Hemani, G., Johnson, M. R. & Balding, D. J. Improved heritability estimation from genome-wide SNPs. American Journal of Human Genetics 91, 1011–1021 (2012).
OpenUrl CrossRef PubMed
30.↵
Lee, S. H. et al. Estimation of SNP heritability from dense genotype data. American Journal of Human Genetics 93, 1151–1155 (2013).
OpenUrl CrossRef PubMed
31.↵
Gusev, A. et al. Quantifying missing heritability at known GWAS loci. PLoS Genetics 9, e1003993 (2013).
OpenUrl CrossRef
32.↵
Stahl, E. A. et al. Bayesian inference analyses of the polygenic architecture of rheumatoid arthritis. Nature Genetics 44, 483–489 (2012).
OpenUrl CrossRef PubMed
33.↵
Wood, A. R. et al. Defining the role of common variation in the genomic and biological architecture of adult human height. Nature Genetics 46, 1173–1186 (2014).
OpenUrl CrossRef PubMed
34.↵
Koren, A. et al. Differential relationship of DNA replication timing to different forms of human mutation and variation. American Journal of Human Genetics 91, 1033–1040 (2012).
OpenUrl CrossRef PubMed
35.↵
The International HapMap Consortium, K. A., Frazer et al. A second generation human haplotype map of over 3.1 million snps. Nature 449, 851–861 (2007).
OpenUrl CrossRef PubMed Web of Science
36.↵
McVicker, G., Gordon, D., Davis, C. & Green, P. Widespread genomic signatures of natural selection in hominid evolution. PLoS Genetics 5, e1000471 (2009).
OpenUrl
37.↵
Zuk, O. et al. Searching for missing heritability: designing rare variant association studies. Proceedings of the National Academy of Sciences 111, E455–E464 (2014).
OpenUrl Abstract/FREE Full Text
38.↵
Goldstein, D. B. Common genetic variation and human traits. New England Journal of Medicine 360, 1696 (2009).
OpenUrl CrossRef PubMed Web of Science
39.↵
Locke, A. E. et al. Genetic studies of body mass index yield new insights for obesity biology. Nature 518, 197–206 (2015).
OpenUrl CrossRef PubMed
40.↵
Pers, T. H. et al. Biological interpretation of genome-wide association studies using predicted gene functions. Nature Communications 6 (2015).
41.↵
Sullivan, P. F. Puzzling over schizophrenia: schizophrenia as a pathway disease. Nature Medicine 18, 210–211 (2012).
OpenUrl CrossRef PubMed
42.↵
Gelfman, S., Cohen, N., Yearim, A. & Ast, G. DNA-methylation effect on cotranscriptional splicing is dependent on gc architecture of the exon–intron structure. Genome Research 23, 789–799 (2013).
OpenUrl Abstract/FREE Full Text
43.↵
Gibson, G. Rare and common variants: twenty arguments. Nature Reviews Genetics 13, 135–145 (2012).
OpenUrl CrossRef PubMed
44.↵
Lohmueller, K. E. The impact of population demography and selection on the genetic architecture of complex traits. PLoS Genetics 10, e1004379 (2014).
OpenUrl
45.↵
Ferreira, M. A. et al. Identification of IL6R and chromosome 11q13.5 as risk loci for asthma. Lancet 378, 1006–1014 (2011).
OpenUrl CrossRef PubMed Web of Science
46.
Bønnelykke, K. et al. Meta-analysis of genome-wide association studies identifies ten loci influencing allergic sensitization. Nature Genetics 45, 902–906 (2013).
OpenUrl CrossRef PubMed
47.↵
Hinds, D. A. et al. A genome-wide association meta-analysis of self-reported allergy identifies shared and allergy-specific susceptibility loci. Nature Genetics 45, 907–911 (2013).
OpenUrl CrossRef PubMed
48.↵
Bulik-Sullivan, B. et al. An atlas of genetic correlations across human diseases and traits. bioRxiv 014498 (2015).
49.↵
Vattikuti, S., Guo, J. & Chow, C. C. Heritability and genetic correlations explained by common snps for metabolic syndrome traits. PLoS Genetics 8, e1002637 (2012).
OpenUrl
50.↵
Aschard, H., Vilhjálmsson, B. J., Joshi, A. D., Price, A. L. & Kraft, P. Adjusting for heritable covariates can bias effect estimates in genome-wide association studies. American Journal of Human Genetics (2015).
51.↵
Cheverud, J. M. A comparison of genetic and phenotypic correlations. Evolution 958–968 (1988).
52.↵
Zhou, X., Carbonetto, P. & Stephens, M. Polygenic modeling with Bayesian sparse linear mixed models. PLoS Genetics 9, e1003264 (2013).
OpenUrl CrossRef PubMed
53.↵
Haseman, J. & Elston, R. The investigation of linkage between a quantitative trait and a marker locus. Behavior Genetics 2, 3–19 (1972).
OpenUrl CrossRef PubMed Web of Science
54.↵
Finucane, H. K. et al. Partitioning heritability by functional category using GWAS summary statistics. bioRxiv 014241 (2015).
55.↵
Chen, C.-Y. et al. Improved ancestry inference using weights from external reference panels. Bioinformatics btt144 (2013).
56.↵
Manichaikul, A. et al. Robust relationship inference in genome-wide association studies. Bioinformatics 26, 2867–2873 (2010).
OpenUrl CrossRef PubMed Web of Science
57.↵
Manichaikul, A. et al. Population structure of Hispanics in the United States: the multi-ethnic study of atherosclerosis. PLoS Genetics 8, e1002640 (2012).
OpenUrl
58.↵
Hoffmann, T. J. et al. Next generation genome-wide association tool: Design and coverage of a high-throughput European-optimized SNP array. Genomics 98, 79–89 (2011).
OpenUrl CrossRef PubMed Web of Science
59.↵
Chang, C. C. et al. Second-generation PLINK: rising to the challenge of larger and richer datasets. GigaScience (2015).
60.↵
Falconer, D. S. The inheritance of liability to certain diseases, estimated from the incidence among relatives. Annals of Human Genetics 29, 51–76 (1965).
OpenUrl CrossRef Web of Science
61.↵
Galinsky, K. J. et al. Fast PCA of very large samples in linear time. Abstract presented at the 64th Annual Meeting of The American Society of Human Genetics, October 18–22, 2014, San Diego, CA.
62.↵
Tange, O. GNU Parallel - The Command-Line Power Tool. The USENIX Magazine 36, 42–47 (2011). URL http://www.gnu.org/s/parallel.
OpenUrl
63.↵
Kostem, E. & Eskin, E. Improving the accuracy and efficiency of partitioning heritability into the contributions of genomic regions. American Journal of Human Genetics 92, 558– 564 (2013).
OpenUrl CrossRef PubMed
64.
Dempster, A. P., Laird, N. M. & Rubin, D. B. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B 1–38 (1977).
65.
Searle, S. R., Casella, G. & McCulloch, C. E. Variance components (John Wiley & Sons, 2006).
66.
Liu, J. S. & Wu, Y. N. Parameter expansion for data augmentation. Journal of the American Statistical Association 94, 1264–1274 (1999).
OpenUrl CrossRef Web of Science
67.
Foulley, J.-L. & Van Dyk, D. A. The PX-EM algorithm for fast stable fitting of Henderson’s mixed model. Genetics Selection Evolution 32, 1–21 (2000).
68.
Zhou, X. & Stephens, M. Efficient multivariate linear mixed model algorithms for genome-wide association studies. Nature Methods 11, 407–409 (2014).
69.
Meyer, K. et al. PX × AI: Algorithmics for better convergence in restricted maximum likelihood estimation. In 8th World Congress on Genetics Applied to Livestock Production (2006).
70.
Groeneveld, E. A reparameterization to improve numerical optimization in multivariate REML (co)variance component estimation. Genetics Selection Evolution 26, 537–545 (1994).
71.
Wei, G. C. & Tanner, M. A. A Monte Carlo implementation of the EM algorithm and the poor man’s data augmentation algorithms. Journal of the American Statistical Association 85, 699–704 (1990).
72.
Matilainen, K., Mäntysaari, E. A., Lidauer, M. H., Strandén, I. & Thompson, R. Employing a Monte Carlo algorithm in expectation maximization restricted maximum likelihood estimation of the linear mixed model. Journal of Animal Breeding and Genetics 129, 457–468 (2012).
73.
Kuk, A. Y. & Cheng, Y. W. The Monte Carlo Newton-Raphson algorithm. Journal of Statistical Computation and Simulation 59, 233–250 (1997).
74.
McCulloch, C. E. Maximum likelihood algorithms for generalized linear mixed models. Journal of the American Statistical Association 92, 162–170 (1997).
75.
McCulloch, C., Searle, S. & Neuhaus, J. Generalized, linear, and mixed models (Wiley, 2008), 2nd edn.
76.
Barry, R. P. & Kelley Pace, R. Monte Carlo estimates of the log determinant of large sparse matrices. Linear Algebra and its Applications 289, 41–54 (1999).
77.
Korte, A. et al. A mixed-model approach for genome-wide association studies of correlated traits in structured populations. Nature Genetics 44, 1066–1071 (2012).
78.
Listgarten, J. et al. A powerful and efficient set test for genetic markers that handles confounders. Bioinformatics 29, 1526–1533 (2013).
79.
Tucker, G., Price, A. L. & Berger, B. A. Improving the power of GWAS and avoiding confounding from population stratification with PC-Select. Genetics (2014).
80.
Johnson, S. G. The NLopt nonlinear-optimization package. URL http://ab-initio.mit.edu/nlopt.
81.
Svanberg, K. A class of globally convergent optimization methods based on conservative convex separable approximations. SIAM Journal on Optimization 12, 555–573 (2002).
82.
Kraft, D. Algorithm 733: TOMP–Fortran modules for optimal control calculations. ACM Transactions on Mathematical Software (TOMS) 20, 262–281 (1994).
83.
Gould, N. I., Orban, D., Sartenaer, A. & Toint, P. L. Sensitivity of trust-region algorithms to their parameters. 4OR 3, 227–241 (2005).
84.
Lippert, C. et al. FaST linear mixed models for genome-wide association studies. Nature Methods 8, 833–835 (2011).
85.
Kang, H. M. et al. Efficient control of population structure in model organism association mapping. Genetics 178, 1709–1723 (2008).
86.
Speed, D. & Balding, D. J. MultiBLUP: improved SNP-based prediction for complex traits. Genome Research gr–169375 (2014).

Posted March 14, 2015.

[1] 1.↵
Yang, J. et al. Common SNPs explain a large proportion of the heritability for human height. Nature Genetics 42, 565–569 (2010).
OpenUrl CrossRef PubMed Web of Science

[2] 2.↵
Yang, J., Lee, S. H., Goddard, M. E. & Visscher, P. M. GCTA: a tool for genome-wide complex trait analysis. American Journal of Human Genetics 88, 76–82 (2011).
OpenUrl CrossRef PubMed

[3] 3.↵
Lee, S. H., Wray, N. R., Goddard, M. E. & Visscher, P. M. Estimating missing heritability for disease from genome-wide association studies. American Journal of Human Genetics 88, 294–305 (2011).
OpenUrl CrossRef PubMed Web of Science

[4] 4.↵
Yang, J. et al. Genome partitioning of genetic variation for complex traits using common SNPs. Nature Genetics 43, 519–525 (2011).
OpenUrl CrossRef PubMed

[5] 5.↵
Lee, S. H. et al. Estimating the proportion of variation in susceptibility to schizophrenia captured by common SNPs. Nature Genetics 44, 247–250 (2012).
OpenUrl CrossRef PubMed

[6] 6.↵
Gusev, A. et al. Partitioning heritability of regulatory and cell-type-specific variants across 11 common diseases. American Journal of Human Genetics 95, 535–552 (2014).
OpenUrl CrossRef PubMed

[7] 7.↵
Lee, S. H., Yang, J., Goddard, M. E., Visscher, P. M. & Wray, N. R. Estimation of pleiotropy between complex diseases using single-nucleotide polymorphism-derived genomic relationships and restricted maximum likelihood. Bioinformatics 28, 2540–2542 (2012).
OpenUrl CrossRef PubMed Web of Science

[8] 8.↵
Lee, S. H. et al. Genetic relationship between five psychiatric disorders estimated from genome-wide SNPs. Nature Genetics (2013).

[9] 9.↵
Purcell, S. M. et al. Common polygenic variation contributes to risk of schizophrenia and bipolar disorder. Nature 460, 748–752 (2009).
OpenUrl CrossRef PubMed Web of Science

[10] 10.
Ripke, S. et al. Genome-wide association study identifies five new schizophrenia loci. Nature Genetics 43, 969 (2011).
OpenUrl CrossRef PubMed

[11] 11.↵
Ripke, S. et al. Genome-wide association analysis identifies 13 new risk loci for schizophrenia. Nature Genetics 45, 1150–1159 (2013).
OpenUrl CrossRef PubMed

[12] 12.↵
Ripke, S. et al. Biological insights from 108 schizophrenia-associated genetic loci. Nature 511, 421–427 (2014).
OpenUrl CrossRef PubMed Web of Science

[13] 13.↵
Willer, C. J. et al. Discovery and refinement of loci associated with lipid levels. Nature Genetics (2013).

[14] 14.↵
Mahajan, A. et al. Genome-wide trans-ancestry meta-analysis provides insight into the genetic architecture of type 2 diabetes susceptibility. Nature Genetics 46, 234–244 (2014).
OpenUrl CrossRef PubMed

[15] 15.↵
Visscher, P. M. et al. Statistical power to detect genetic (co)variance of complex traits using SNP data in unrelated samples. PLoS Genetics 10, e1004269 (2014).
OpenUrl

[16] 16.↵
Visscher, P. M. & Goddard, M. E. A general unified framework to assess the sampling variance of heritability estimates using pedigree or marker-based relationships. Genetics (2015).

[17] 17.↵
Legarra, A. & Misztal, I. Computing strategies in genome-wide selection. Journal of Dairy Science 91, 360–366 (2008).
OpenUrl CrossRef PubMed Web of Science

[18] 18.↵
VanRaden, P. Efficient methods to compute genomic predictions. Journal of Dairy Science 91, 4414–4423 (2008).
OpenUrl CrossRef PubMed Web of Science

[19] 19.↵
Loh, P.-R. et al. Efficient Bayesian mixed model analysis increases association power in large cohorts. Nature Genetics (2015).

[20] 20.↵
Henderson, C. Application of Linear Models in Animal Breeding (University of Guelph, 1984).

[21] 21.↵
Henderson, C. & Quaas, R. Multiple trait evaluation using relatives’ records. Journal of Animal Science (1976).

[22] 22.↵
Matilainen, K., Mäntysaari, E. A., Lidauer, M. H., Strandén, I. & Thompson, R. Employing a Monte Carlo Algorithm in Newton-Type Methods for Restricted Maximum Likelihood Estimation of Genetic Parameters. PLoS ONE 8, e80821 (2013).
OpenUrl CrossRef PubMed

[23] 23.↵
Patterson, H. D. & Thompson, R. Recovery of inter-block information when block sizes are unequal. Biometrika 58, 545–554 (1971).
OpenUrl CrossRef Web of Science

[24] 24.↵
García-Cortés, L. A., Moreno, C., Varona, L. & Altarriba, J. Variance component estimation by resampling. Journal of Animal Breeding and Genetics 109, 358–363 (1992).
OpenUrl

[25] 25.↵
Gilmour, A. R., Thompson, R. & Cullis, B. R. Average information REML: An efficient algorithm for variance parameter estimation in linear mixed models. Biometrics 1440–1450 (1995).

[26] 26.↵
Yang, J., Zaitlen, N. A., Goddard, M. E., Visscher, P. M. & Price, A. L. Advantages and pitfalls in the application of mixed-model association methods. Nature Genetics 46, 100– 106 (2014).
OpenUrl CrossRef PubMed

[27] 27.↵
Golan, D., Lander, E. S. & Rosset, S. Measuring missing heritability: Inferring the contribution of common variants. Proceedings of the National Academy of Sciences 111, E5272– E5281 (2014).
OpenUrl Abstract/FREE Full Text

[28] 28.↵
Saha, S., Chant, D., Welham, J. & McGrath, J. A systematic review of the prevalence of schizophrenia. PLoS Medicine 2, e141 (2005).
OpenUrl

[29] 29.↵
Speed, D., Hemani, G., Johnson, M. R. & Balding, D. J. Improved heritability estimation from genome-wide SNPs. American Journal of Human Genetics 91, 1011–1021 (2012).
OpenUrl CrossRef PubMed

[30] 30.↵
Lee, S. H. et al. Estimation of SNP heritability from dense genotype data. American Journal of Human Genetics 93, 1151–1155 (2013).
OpenUrl CrossRef PubMed

[31] 31.↵
Gusev, A. et al. Quantifying missing heritability at known GWAS loci. PLoS Genetics 9, e1003993 (2013).
OpenUrl CrossRef

[32] 32.↵
Stahl, E. A. et al. Bayesian inference analyses of the polygenic architecture of rheumatoid arthritis. Nature Genetics 44, 483–489 (2012).
OpenUrl CrossRef PubMed

[33] 33.↵
Wood, A. R. et al. Defining the role of common variation in the genomic and biological architecture of adult human height. Nature Genetics 46, 1173–1186 (2014).
OpenUrl CrossRef PubMed

[34] 34.↵
Koren, A. et al. Differential relationship of DNA replication timing to different forms of human mutation and variation. American Journal of Human Genetics 91, 1033–1040 (2012).
OpenUrl CrossRef PubMed

[35] 35.↵
The International HapMap Consortium, K. A., Frazer et al. A second generation human haplotype map of over 3.1 million snps. Nature 449, 851–861 (2007).
OpenUrl CrossRef PubMed Web of Science

[36] 36.↵
McVicker, G., Gordon, D., Davis, C. & Green, P. Widespread genomic signatures of natural selection in hominid evolution. PLoS Genetics 5, e1000471 (2009).
OpenUrl

[37] 37.↵
Zuk, O. et al. Searching for missing heritability: designing rare variant association studies. Proceedings of the National Academy of Sciences 111, E455–E464 (2014).
OpenUrl Abstract/FREE Full Text

[38] 38.↵
Goldstein, D. B. Common genetic variation and human traits. New England Journal of Medicine 360, 1696 (2009).
OpenUrl CrossRef PubMed Web of Science

[39] 39.↵
Locke, A. E. et al. Genetic studies of body mass index yield new insights for obesity biology. Nature 518, 197–206 (2015).
OpenUrl CrossRef PubMed

[40] 40.↵
Pers, T. H. et al. Biological interpretation of genome-wide association studies using predicted gene functions. Nature Communications 6 (2015).

[41] 41.↵
Sullivan, P. F. Puzzling over schizophrenia: schizophrenia as a pathway disease. Nature Medicine 18, 210–211 (2012).
OpenUrl CrossRef PubMed

[42] 42.↵
Gelfman, S., Cohen, N., Yearim, A. & Ast, G. DNA-methylation effect on cotranscriptional splicing is dependent on gc architecture of the exon–intron structure. Genome Research 23, 789–799 (2013).
OpenUrl Abstract/FREE Full Text

[43] 43.↵
Gibson, G. Rare and common variants: twenty arguments. Nature Reviews Genetics 13, 135–145 (2012).
OpenUrl CrossRef PubMed

[44] 44.↵
Lohmueller, K. E. The impact of population demography and selection on the genetic architecture of complex traits. PLoS Genetics 10, e1004379 (2014).
OpenUrl

[45] 45.↵
Ferreira, M. A. et al. Identification of IL6R and chromosome 11q13.5 as risk loci for asthma. Lancet 378, 1006–1014 (2011).
OpenUrl CrossRef PubMed Web of Science

[46] 46.
Bønnelykke, K. et al. Meta-analysis of genome-wide association studies identifies ten loci influencing allergic sensitization. Nature Genetics 45, 902–906 (2013).
OpenUrl CrossRef PubMed

[47] 47.↵
Hinds, D. A. et al. A genome-wide association meta-analysis of self-reported allergy identifies shared and allergy-specific susceptibility loci. Nature Genetics 45, 907–911 (2013).
OpenUrl CrossRef PubMed

[48] 48.↵
Bulik-Sullivan, B. et al. An atlas of genetic correlations across human diseases and traits. bioRxiv 014498 (2015).

[49] 49.↵
Vattikuti, S., Guo, J. & Chow, C. C. Heritability and genetic correlations explained by common snps for metabolic syndrome traits. PLoS Genetics 8, e1002637 (2012).
OpenUrl

[50] 50.↵
Aschard, H., Vilhjálmsson, B. J., Joshi, A. D., Price, A. L. & Kraft, P. Adjusting for heritable covariates can bias effect estimates in genome-wide association studies. American Journal of Human Genetics (2015).

[51] 51.↵
Cheverud, J. M. A comparison of genetic and phenotypic correlations. Evolution 958–968 (1988).

[52] 52.↵
Zhou, X., Carbonetto, P. & Stephens, M. Polygenic modeling with Bayesian sparse linear mixed models. PLoS Genetics 9, e1003264 (2013).
OpenUrl CrossRef PubMed

[53] 53.↵
Haseman, J. & Elston, R. The investigation of linkage between a quantitative trait and a marker locus. Behavior Genetics 2, 3–19 (1972).
OpenUrl CrossRef PubMed Web of Science

[54] 54.↵
Finucane, H. K. et al. Partitioning heritability by functional category using GWAS summary statistics. bioRxiv 014241 (2015).

[55] 55.↵
Chen, C.-Y. et al. Improved ancestry inference using weights from external reference panels. Bioinformatics btt144 (2013).

[56] 56.↵
Manichaikul, A. et al. Robust relationship inference in genome-wide association studies. Bioinformatics 26, 2867–2873 (2010).
OpenUrl CrossRef PubMed Web of Science

[57] 57.↵
Manichaikul, A. et al. Population structure of Hispanics in the United States: the multi-ethnic study of atherosclerosis. PLoS Genetics 8, e1002640 (2012).
OpenUrl

[58] 58.↵
Hoffmann, T. J. et al. Next generation genome-wide association tool: Design and coverage of a high-throughput European-optimized SNP array. Genomics 98, 79–89 (2011).
OpenUrl CrossRef PubMed Web of Science

[59] 59.↵
Chang, C. C. et al. Second-generation PLINK: rising to the challenge of larger and richer datasets. GigaScience (2015).

[60] 60.↵
Falconer, D. S. The inheritance of liability to certain diseases, estimated from the incidence among relatives. Annals of Human Genetics 29, 51–76 (1965).
OpenUrl CrossRef Web of Science

[61] 61.↵
Galinsky, K. J. et al. Fast PCA of very large samples in linear time. Abstract presented at the 64th Annual Meeting of The American Society of Human Genetics, October 18–22, 2014, San Diego, CA.

[62] 62.↵
Tange, O. GNU Parallel - The Command-Line Power Tool. The USENIX Magazine 36, 42–47 (2011). URL http://www.gnu.org/s/parallel.
OpenUrl

[63] 63.↵
Kostem, E. & Eskin, E. Improving the accuracy and efficiency of partitioning heritability into the contributions of genomic regions. American Journal of Human Genetics 92, 558– 564 (2013).
OpenUrl CrossRef PubMed

[64] 64.
Dempster, A. P., Laird, N. M. & Rubin, D. B. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B 1–38 (1977).

[65] 65.
Searle, S. R., Casella, G. & McCulloch, C. E. Variance components (John Wiley & Sons, 2006).

[66] 66.
Liu, J. S. & Wu, Y. N. Parameter expansion for data augmentation. Journal of the American Statistical Association 94, 1264–1274 (1999).
OpenUrl CrossRef Web of Science

[67] 67.
Foulley, J.-L. & Van Dyk, D. A. The PX-EM algorithm for fast stable fitting of Henderson’s mixed model. Genetics Selection Evolution 32, 1–21 (2000).
OpenUrl

[68] 68.
Zhou, X. & Stephens, M. Efficient multivariate linear mixed model algorithms for genome-wide association studies. Nature Methods 11, 407–409 (2014).
OpenUrl

[69] 69.
Meyer, K. et al. PX × AI: Algorithmics for better convergence in restricted maximum likelihood estimation. In 8th World Congress on Genetics Applied to Livestock Production (2006).

[70] 70.
Groeneveld, E. A reparameterization to improve numerical optimization in multivariate REML (co)variance component estimation. Genetics Selection Evolution 26, 537–545 (1994).
OpenUrl CrossRef Web of Science

[71] 71.
Wei, G. C. & Tanner, M. A. A Monte Carlo implementation of the EM algorithm and the poor man’s data augmentation algorithms. Journal of the American Statistical Association 85, 699–704 (1990).
OpenUrl CrossRef Web of Science

[72] 72.
Matilainen, K., Mäntysaari, E. A., Lidauer, M. H., Strande´n, I. & Thompson, R. Employing a Monte Carlo algorithm in expectation maximization restricted maximum likelihood estimation of the linear mixed model. Journal of Animal Breeding and Genetics 129, 457–468 (2012).
OpenUrl

[73] 73.
Kuk, A. Y. & Cheng, Y. W. The Monte Carlo Newton-Raphson algorithm. Journal of Statistical Computation and Simulation 59, 233–250 (1997).
OpenUrl

[74] 74.
McCulloch, C. E. Maximum likelihood algorithms for generalized linear mixed models. Journal of the American Statistical Association 92, 162–170 (1997).
OpenUrl CrossRef Web of Science

[75] 75.
McCulloch, C., Searle, S. & Neuhaus, J. Generalized, linear, and mixed models (Wiley, 2008), 2nd edn.

[76] 76.
Barry, R. P. & Kelley Pace, R. Monte Carlo estimates of the log determinant of large sparse matrices. Linear Algebra and its Applications 289, 41–54 (1999).
OpenUrl

[77] 77.
Korte, A. et al. A mixed-model approach for genome-wide association studies of correlated traits in structured populations. Nature Genetics 44, 1066–1071 (2012).
OpenUrl CrossRef PubMed

[78] 78.
Listgarten, J. et al. A powerful and efficient set test for genetic markers that handles confounders. Bioinformatics 29, 1526–1533 (2013).
OpenUrl CrossRef PubMed Web of Science

[79] 79.
Tucker, G., Price, A. L. & Berger, B. A. Improving the power of GWAS and avoiding confounding from population stratification with PC-Select. Genetics (2014).

[80] 80.
Johnson, S. G. The NLopt nonlinear-optimization package. URL http://ab-initio.mit.edu/nlopt.

[81] 81.
Svanberg, K. A class of globally convergent optimization methods based on conservative convex separable approximations. SIAM Journal on Optimization 12, 555–573 (2002).
OpenUrl CrossRef Web of Science

[82] 82.
Kraft, D. Algorithm 733: TOMP–Fortran modules for optimal control calculations. ACM Transactions on Mathematical Software (TOMS) 20, 262–281 (1994).
OpenUrl

[83] 83.
Gould, N. I., Orban, D., Sartenaer, A. & Toint, P. L. Sensitivity of trust-region algorithms to their parameters. 4OR 3, 227–241 (2005).
OpenUrl

[84] 84.
Lippert, C. et al. FaST linear mixed models for genome-wide association studies. Nature Methods 8, 833–835 (2011).
OpenUrl

[85] 85.
Kang, H. M. et al. Efficient control of population structure in model organism association mapping. Genetics 178, 1709–1723 (2008).
OpenUrl Abstract/FREE Full Text

[86] 86.
Speed, D. & Balding, D. J. MultiBLUP: improved SNP-based prediction for complex traits. Genome Research gr–169375 (2014).