Comparison of methods that use whole genome data to estimate the heritability and genetic architecture of complex traits

Luke M. Evans; Rasool Tahmasbi; Scott I. Vrieze; Gonçalo R. Abecasis; Sayantan Das; Doug W. Bjelland; Teresa R. deCandia; Haplotype Reference Consortium; Michael E. Goddard; Benjamin M. Neale; Jian Yang; Peter M. Visscher; Matthew C. Keller

doi:10.1101/115527

ABSTRACT

Heritability, h², is a foundational concept in genetics, critical to understanding the genetic basis of complex traits. Recently developed methods that estimate heritability from genotyped SNPs, h²_SNP, explain substantially more genetic variance than genome-wide significant loci, but less than classical estimates from twins and families. However, h²_SNP estimates have yet to be comprehensively compared under a range of genetic architectures, making it difficult to draw conclusions from sometimes conflicting published estimates. Here, we used thousands of real whole genome sequences to simulate realistic phenotypes under a variety of genetic architectures, including those from very rare causal variants. We compared the performance of ten methods across different types of genotypic data (commercial SNP array positions, whole genome sequence variants, and imputed variants) and under differing causal variant frequencies, levels of stratification, and relatedness thresholds. These results provide guidance in interpreting past results and choosing optimal approaches for future studies. We then applied the two methods (GREML-MS and GREML-LDMS) that best estimated overall h²_SNP and the causal variant frequency spectra to six phenotypes in the UK Biobank using imputed genome-wide variants. Our results suggest that as imputation reference panels become larger and more diverse, estimates of the frequency distribution of causal variants will become increasingly unbiased and the vast majority of trait narrow-sense heritability will be accounted for.

INTRODUCTION

Narrow-sense heritability, h², the proportion of the total phenotypic variance due to additive genetic variation, is a fundamental concept of medical and quantitative genetics. In addition to providing an understanding of the genetic basis of traits, h² determines the response to selection, the potential utility of individual genetic risk and trait prediction, and how much of the phenotypic variability could theoretically be accounted for in genome-wide association studies (GWAS)^1,2. Importantly, while GWAS have now identified thousands of variants associated with complex traits^3–5, the loci identified by these studies have typically explained only a small fraction of traits’ total heritability, with the remaining genetic variance termed “missing heritability.” This remaining unaccounted-for genetic variance may be attributable to a variety of causes, including the role of (typically rare) variants poorly tagged by arrays, small effect common variants that do not reach genome-wide significance due to insufficient sample sizes, or inflated family-based h² estimates^1,6–8.

While traditional family-based estimates of heritability, h²_FAM, have provided valuable insights⁹, the use of close relatives means that estimates of additive genetic variance can be biased by factors shared by close relatives—for example, the joint action of non-additive genetic and common environmental effects can inflate estimates of additive genetic variation^10,11. Recently-developed approaches that utilize unrelated individuals to estimate the variance explained by all genotyped single nucleotide polymorphisms (SNPs), denoted as h²_SNP, have the advantage of being unaffected by these sources of bias, and for many traits have found that a large proportion of the heritability is captured by common variants^6,12,13. For certain complex traits, such as height, little unexplained additive genetic variance remains, as h²_SNP approaches h²_FAM^7,12. Despite this, h²_SNP estimates for most traits are still below h²_FAM, with BMI a typical example where h²_SNP ~0.27 while h²_FAM ~0.4-0.6 (ref. ¹²). Thus, for many complex traits, including disease traits, much of the heritability remains unaccounted for.

A second application of these approaches is to better understand the genetic architecture of complex traits. Genetic architecture refers to the number, frequencies, effect sizes, and locations of causal variants (CVs) underlying trait variation. Methods for estimating heritability from SNPs have found that estimated genetic variance is proportional to chromosome length for numerous complex traits, including height, BMI, schizophrenia, depression, and metabolic traits, consistent with the hypothesis that these traits are influenced by hundreds to thousands of variants with small effects spread throughout the genome^5,6,8,12-16. More recently, these methods have allowed insight into the frequency distribution and functional annotation of causal variants by partitioning SNPs into MAF bins and annotation categories^17,18. Such methods have allowed insight into gene networks involved in complex traits¹⁹, and helped determine optimal strategies for large-scale genotyping, such as whether genotyped SNPs on commercial arrays with subsequent imputation can capture the genetic variation from all frequency classes of causal variants or if whole genome sequences instead are needed¹².

A variety of methods to estimate h²_SNP and partition the genetic variance among sets of markers have been developed for these purposes. Many of these methods use one or more genetic relatedness matrices (GRMs) to estimate variances using restricted maximum likelihood (GREML)^6,12,17,20. Manipulations of the GRM via treelet covariance smoothing²¹ or weighting by linkage disequilibrium (LD) tagging of SNPs¹³ have also been proposed. A much different approach, LD-score regression, estimates h²_SNP from GWAS summary statistics²². The performance of these methods has typically been evaluated via simulation by assuming that causal variants have the same properties, on average, as common SNPs found on commercial genotyping arrays. However, such an approach is problematic because SNPs are specifically selected because they are common, have unusually high LD with untyped SNPs, or have been implicated in disease (e.g., the Affymetrix Axiom chip used in the UK Biobank²³). SNPs on arrays are therefore probably not reflective of typical CVs across the genome, and thus the ability of these methods to estimate h²_SNP or determine the genetic architecture of complex traits has not yet been properly assessed, nor have these methods been directly compared across conditions, such as levels of stratification or environmental confounding, that can cause biases. In particular, how the various methods perform with traits derived from very rare CVs may be quite different than how they perform on traits derived from common, well-tagged CVs, such as those used on SNP arrays.

Here, we utilize thousands of recently-sequenced whole genomes to simulate complex phenotypes to test the performance of the most widely used SNP heritability estimation methods. We examine each method’s ability to estimate h²_SNP while varying the amount of population stratification, the frequency distributions of causal variants, and the type of whole-genome data analyzed (SNP array, imputed, and sequence). By using real sequence data to simulate phenotypes, the genotypic data we use are highly realistic with respect to LD, allele frequency distributions (with minor allele frequencies down to 3×10⁻⁴), variant density, and other genomic properties found in real data. Finally, we use the best-performing methods to estimate h²_SNP and examine genetic architecture for six complex traits using the UK Biobank. While h²_SNP estimation following imputation can account for the majority of the heritability, larger sample sizes and reference panels, or novel methods, will be needed to fully account for all the additive genetic variance in complex traits involving very rare causal variants.

MATERIALS AND METHODS

Samples and Population Structure

We simulated continuous phenotypes derived from whole genome sequence (WGS) data in the Haplotype Reference Consortium (HRC) dataset. Full details of the HRC can be found in McCarthy et al.²⁴. Briefly, this resource comprises roughly 32,500 individual whole genome sequences from multiple whole-genome sequencing studies, with phased genotype calls available at all sites with a minor allele count of at least 5. The HRC contains world-wide populations, but the majority are of European (EUR) origin. This large collection allowed us to simulate phenotypes with differing genomic architectures under realistic patterns of LD structure, stratification, and relatedness with the whole genomes. We obtained permission to access the following HRC cohorts (recruitment region; sample size): AMD (Europe & worldwide; 3,189), BIPOLAR (European ancestry; 2,487), GECCO (European ancestry; 1,112), GOT2D (Europe; 2,709), HUNT (Norway; 1,023), SARDINIA (Sardinia; 3,445), TWINS (Minnesota; 1,325), 1000 Genomes (worldwide; 2,495), UK10K (UK; 3,715) (see web resources for HRC information including specific cohorts). The subset of the HRC data we accessed totaled 21,500 whole genome sequences comprising 38,913,048 biallelic SNPs.

Our goal was to assess the bias and precision of various h²_SNP estimation methods using data similar to that typically used in GWAS and h²_SNP analyses. In order to mimic this kind of data, we first extracted variant positions corresponding to a widely-used commercially available genotyping array, the UK Biobank Affymetrix Axiom array. We performed principal components analysis using flashpca²⁵ on 133,603 SNPs after LD and MAF pruning (plink2²⁶ commands --maf 0.05 --indep-pairwise 1000 400 0.2), extracting the first ten PCs, and performing K-means clustering in R²⁷. We used the 1000 Genomes individuals in the HRC as anchor points for ancestry and identified 19,478 individuals of European descent, including individuals of Finnish and Sardinian ancestry (Figure S1).

To identify subsets of these 19,478 individuals spanning different levels of genetic heterogeneity, we reran PCA with only these individuals, then proceeded to identify four increasingly homogenous subgroups within them using K-means clustering (Fig. 1). The most stratified group contained all EUR samples (N=19,478). The somewhat stratified group excluded Sardinian and Finnish samples (N=14,424). The low stratification group contained only northern/western European samples (N=11,243), and the least stratified (homogeneous) group was a subset of British ancestry samples (N=8,506). We used GCTA²⁰ to estimate relatedness and remove samples so that the maximum relatedness was 0.1 within each of the four samples. In the most homogeneous (smallest) sample, this left 8,201 individuals. To avoid confounding sample size with degree of stratification, we randomly chose 8,201 of the unrelated individuals from within each of the other three more stratified subsamples. Our purpose in identifying these groups was to vary the amount of genetic heterogeneity within a sample, similar to what might be found across a range of different GWAS samples, rather than formal population assignment or classification of individuals. We also identified individuals with relatedness less than 0.05 within each group, and used both subsets to examine how a 0.1 or 0.05 relatedness cutoff influences h²_SNP estimates. Sample sizes when using the 0.05 relatedness cutoff were 7792, 8115, 8129, and 8186 for the four genetic structure subsamples.

Figure 1.

Population structure subsamples of European ancestry individuals in the HRC (A-D) and UK Biobank individuals projected onto these axes (E). Total sample sizes are shown in each panel. To keep sample size constant across stratification level, we randomly sampled 8,201 individuals with relatedness < 0.1 (the number of unrelated individuals in the most homogeneous and smallest set in panel D) from each subsample to create the subsamples used in the simulations.

Simulated Phenotypes Using Whole Genome Sequencing Data

To assess how methods performed on a range of genetic architectures, we simulated phenotypes from CVs drawn randomly from five MAF ranges from the whole genome sequence data: common (MAF≥0.05), uncommon (0.01≤MAF<0.05), rare (0.0025≤MAF<0.01), very rare (0.0003≤MAF<0.0025), and all variants that had a minor allele count (MAC) of at least 5 (MAF≥0.0003) (Fig. S2). Phenotypes were generated from 1,000 CVs from the model y_i = g_i + e_i, where g_i = Σ_k w_ik β_k, w_ik is the genotype (coded as 0, 1, or 2) of individual i at the k^th CV, and β_k is the k^th allelic effect size, drawn from N(0, 1/[2p_k(1-p_k)]), where p_k is the MAF of allele k within a population subset. This model therefore assumes larger average additive effect sizes for rarer variants. The g_i’s were standardized and added to residual error drawn from N(0, (1-h²)/h²) for a h² of 0.5 for simulated phenotypes. A total of 100 repetitions were simulated for phenotypes derived from each CV MAF range and for each of the four population stratification subsets. It is important to note that we did not simulate any phenotypic effects as a function of ancestry within any of the subsamples, and thus biases related to stratification in our results were due to the genotypic (e.g., long-range LD), not phenotypic, effects of stratification.
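The phenotype model above can be illustrated with a minimal Python sketch. It is not the authors' simulation code: the function name `simulate_phenotype`, the toy genotype matrix, and the random seed are assumptions made for illustration, and the toy genotypes are independent binomial draws rather than real sequence data.

```python
import numpy as np

def simulate_phenotype(genotypes, h2=0.5, rng=None):
    """Simulate y = g + e from an (individuals x CVs) matrix of 0/1/2 genotypes."""
    rng = np.random.default_rng() if rng is None else rng
    w = np.asarray(genotypes, dtype=float)
    p = w.mean(axis=0) / 2.0                       # allele frequency of each CV
    # Effect sizes ~ N(0, 1/[2p(1-p)]): rarer variants get larger effects on average.
    beta = rng.normal(0.0, np.sqrt(1.0 / (2.0 * p * (1.0 - p))))
    g = w @ beta                                   # g_i = sum_k w_ik * beta_k
    g = (g - g.mean()) / g.std()                   # standardize genetic values
    # Residual variance (1 - h2) / h2 gives var(g) / var(y) close to h2.
    e = rng.normal(0.0, np.sqrt((1.0 - h2) / h2), size=g.shape[0])
    return g + e, g

# Toy data (hypothetical): 2,000 individuals, 200 common CVs.
rng = np.random.default_rng(1)
freqs = rng.uniform(0.05, 0.5, size=200)
toy_genotypes = rng.binomial(2, freqs, size=(2000, 200))
y, g = simulate_phenotype(toy_genotypes, h2=0.5, rng=rng)
realized_h2 = g.var() / y.var()
```

In any single replicate the realized ratio var(g)/var(y) fluctuates around the simulated h² because the residuals are sampled.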

SNPs, WGS, and Imputed Variants

Most marker heritability studies utilize commonly available commercial arrays, and estimates of h²_SNP reflect how well SNPs on these arrays tag CVs. In particular, CVs with low MAF or that exist in regions of low LD are typically tagged poorly by SNP arrays^6,13 and h²_SNP < h² in these situations. Alternatively, as large WGS reference panels (e.g., 1KG, UK10K, HRC) become increasingly available, imputing genome-wide variants based on SNP arrays is an attractive option for capturing more and rarer genetic variants than possible on arrays, although imputation accuracy declines with MAF¹². Finally, using WGS data to estimate GRMs should reflect relatedness at all CVs, including those that are rare or in low LD with other SNPs. Although WGS data in phenotyped samples is not yet widely available at the sample sizes required for precise estimation of h²_SNP, we include it as a benchmark for results based on array and imputed data and because large WGS samples are likely to become increasingly available in the future. We therefore tested each of these data types (array, imputed, and WGS variants) using each of the methods described below to determine how much of the heritability can be captured from each data type, and how closely results from imputed data mimic those from WGS data.

From the HRC sequence data (the WGS dataset), we extracted positions corresponding to the Axiom array as noted above (the array SNP dataset) with MAF>0.01. To impute, we used the 8,201 unrelated individuals in each population stratification set and added their close relatives (relatedness > 0.1) back into the sample as described below in the GREML-SC method description. We added these close relatives back in to the target imputation set a) to remove close relatives from the reference panel, whose presence would artificially increase imputation accuracy, and b) because some of the methods described below require the use of closely related individuals. We phased these individuals using SHAPEIT2²⁸, imputed using minimac3²⁹, and retained variants with imputation R²≥0.3 (ref.¹²). We used the HRC sequence data as our imputation reference panel after removing all target (8,201 unrelated + relatives) individuals, thereby ensuring approximate independence (no relatedness) between the target and reference panels.

Final reference panel sizes for the four structure subsamples were 11,584; 12,799; 12,785; and 12,994. Reducing the sample size of the reference panel likely resulted in poorer imputation than had we used the full HRC panel, but the reduced panel was nevertheless substantially larger than reference panels used in most past imputation procedures (e.g., 1000 Genomes). Moreover, because the target and reference samples were from the same populations and the same cohorts, the imputation quality is likely higher than most GWAS samples would obtain. However, given that the HRC has become a widely-used imputation reference panel, our imputation quality is probably roughly reflective of imputation quality using modern procedures.

The amount of tagging throughout the genome differs between the various commercial arrays¹², and these differences may lead to differing h²_SNP estimates. To assess this, for the GREML-SC and GREML-MS methods (see below) using array positions data, we compared results from the Axiom array to those from the Illumina Omni2.5 array. For reference, MAF distributions of the different data types for two of the structure subsamples are shown in Figure S2.

Heritability Estimation Methods Tested

Numerous methods have recently been developed to estimate h²_SNP and partition genetic variance using genomic data. Among these, we compared the most widely used, including the various single and multiple component GREML approaches implemented in the GCTA software^6,12,17, approaches that specifically take into account how LD influences the tagging of nearby sites by SNPs, those that use related and unrelated samples to account for rare and common variant effects⁸, those that denoise the GRM using treelet covariance smoothing²¹, those that relate the effect sizes of SNPs from a GWAS to their degree of LD tagging^19,22, and computationally efficient mixed model approaches¹⁸. Here, we briefly describe our implementation of each of these methods; for additional information on the methods themselves, see the above references. For all methods except LD-Score Regression and BOLT-REML (described below), we generated GRMs following the procedures of each method, and estimated h²_SNP using GCTA²⁰. In all models, variance component estimates were unconstrained (e.g., by using the --reml-no-constrain option of GCTA), and included 20 PCs (10 from worldwide PCA and 10 from the specific subsample PCA) as continuous covariates and sequencing cohort as a categorical covariate.

Single Component GREML (GREML-SC)

Yang et al.⁶ introduced the single component GRM approach using a mixed-effects model, with GRM entries:

A_ij = (1/m) Σ_k (x_ik − 2p_k)(x_jk − 2p_k) / [2p_k(1 − p_k)]   (1)

where m is the number of SNPs, x_jk is the genotype (coded as 0, 1, or 2) of individual j at the k^th locus, and p_k is the MAF of the k^th locus. The variance of the phenotypes is

var(y) = Aσ²_v + Iσ²_e

where A is the GRM, I is the identity matrix, and the variance explained by the SNPs (σ²_v) and error variance (σ²_e) are estimated using restricted maximum likelihood (REML) implemented in the GCTA package²⁰. The proportion of the total variance explained by all SNPs is then a measure of heritability (h²_SNP = σ²_v / (σ²_v + σ²_e)). Typically, the set of m SNPs used to build the GRM is the set of SNPs with MAF≥0.01 (hereafter “common SNPs”) and unrelated individuals (relatedness ≤ 0.05). Because the Axiom array contains some rare markers, we compared this approach to one using all SNPs with MAC≥5 (hereafter “all SNPs”) in each particular stratification subsample, as well as to an approach using less stringent relatedness thresholds (relatedness < 0.10 and no relatedness threshold). For analyses that used no relatedness threshold, inclusion of close relatives increased our sample sizes to 9916, 8701, 8715, and 8506 for the samples with most, some, low, and least stratification, respectively (Fig. 1).
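As an illustration of how a GRM of this standardized form can be computed, the following Python sketch builds one from a small genotype matrix. It is a toy under stated assumptions, not GCTA's implementation: the function name `grm` and the simulated genotypes are hypothetical, allele frequencies are estimated from the sample itself, and the REML step is not shown.

```python
import numpy as np

def grm(genotypes):
    """Genetic relatedness matrix from an (individuals x SNPs) matrix of 0/1/2 genotypes."""
    x = np.asarray(genotypes, dtype=float)
    p = x.mean(axis=0) / 2.0                     # sample allele frequencies
    keep = (p > 0.0) & (p < 1.0)                 # drop monomorphic SNPs (zero variance)
    x, p = x[:, keep], p[keep]
    # Standardize each SNP: (x - 2p) / sqrt(2p(1-p)).
    z = (x - 2.0 * p) / np.sqrt(2.0 * p * (1.0 - p))
    # Average the products of standardized genotypes over the m SNPs.
    return z @ z.T / z.shape[1]

# Toy data (hypothetical): 100 unrelated individuals, 500 independent SNPs.
rng = np.random.default_rng(0)
freqs = rng.uniform(0.05, 0.5, size=500)
x = rng.binomial(2, freqs, size=(100, 500))
A = grm(x)
```

For unrelated individuals the diagonal entries are close to 1 and the off-diagonal entries scatter around 0, which is why relatedness cutoffs such as 0.05 or 0.1 can be applied directly to the off-diagonal values.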

MAF-Stratified GREML (GREML-MS)

Biased estimates of h²_SNP are expected when using the GREML-SC method if the MAF distribution of the CVs does not match the MAF distribution of SNPs used to generate the GRM¹⁷. Stratifying variants into MAF classes and using a multiple GRM GREML approach can mitigate this bias and can also partition the genetic variance into that explained by different MAF categories of SNPs, lending insight into the genetic architecture of complex traits^12,30. We applied this approach using 4 MAF categories, matching the CV MAF categories used for phenotype simulation.
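The MAF stratification step can be sketched as follows, using the four CV MAF categories from the phenotype simulation as bin boundaries. The names `MAF_EDGES`, `MAF_LABELS`, and `maf_bin` are hypothetical; in practice one GRM would then be built from the variants in each bin and all GRMs fitted jointly.

```python
import numpy as np

# Lower bounds of the four MAF categories used for the simulated CVs.
MAF_EDGES = np.array([0.0003, 0.0025, 0.01, 0.05])
MAF_LABELS = ["very rare", "rare", "uncommon", "common"]

def maf_bin(maf):
    """Return the bin index (0-3) of each MAF value, or -1 if below the lowest bound."""
    maf = np.asarray(maf, dtype=float)
    # side="right" makes each bin closed on the left, e.g. 0.0025 <= MAF < 0.01 is "rare".
    return np.searchsorted(MAF_EDGES, maf, side="right") - 1

example_mafs = [0.0001, 0.0003, 0.002, 0.0025, 0.02, 0.05, 0.5]
bins = maf_bin(example_mafs)
labels = [MAF_LABELS[b] if b >= 0 else "excluded" for b in bins]
```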

LD- and MAF-Stratified GREML (GREML-LDMS)

Extending the GREML-MS method to account for different levels of LD throughout the genome, Yang et al.¹² introduced an LD score-stratified method to the GREML-MS approach. GREML-LDMS stratifies variants according to both MAF categories as well as an LD-score, defined as the sum of r² between the focal variant and all other variants in a window. We estimated LD scores using the default settings in GCTA (10Mb block size with a 5Mb overlap), and stratified variants into LD score quartiles. Combined with the four MAF categories above, we used 16 GRMs for this approach.
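A rough sketch of the LD-score stratification is given below. It is not GCTA's algorithm: the window here is an assumed number of neighbouring variants by index rather than a 10Mb physical block, r² is computed from a dense correlation matrix (feasible only for small toy data), and the function names are hypothetical.

```python
import numpy as np

def ld_scores(genotypes, window=50):
    """LD score of each variant: sum of r^2 with all other variants within `window` indices."""
    x = np.asarray(genotypes, dtype=float)
    r2 = np.corrcoef(x, rowvar=False) ** 2        # pairwise squared correlations
    idx = np.arange(r2.shape[0])
    in_window = np.abs(idx[:, None] - idx[None, :]) <= window
    # Subtract 1 to remove each variant's r^2 with itself.
    return (r2 * in_window).sum(axis=1) - 1.0

def ld_quartile(scores):
    """Assign each variant to an LD-score quartile (0 = lowest, 3 = highest)."""
    cuts = np.quantile(scores, [0.25, 0.5, 0.75])
    return np.searchsorted(cuts, scores, side="right")

# Toy data (hypothetical): 300 individuals, 200 polymorphic variants.
rng = np.random.default_rng(3)
freqs = rng.uniform(0.1, 0.5, size=200)
toy = rng.binomial(2, freqs, size=(300, 200))
scores = ld_scores(toy)
quartiles = ld_quartile(scores)
```

Crossing the four LD-score quartiles with the four MAF categories gives the 16 variant sets from which the 16 GRMs are built.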

Single Component and MAF-Stratified LD-Adjusted Kinships (LDAK-SC and LDAK-MS)

Speed et al.¹³ noted that because LD varies across the genome, CVs in regions of high LD are given disproportionate weight by eqn. (1) above. They proposed a method to weight SNPs according to local LD, which potentially corrects for the bias introduced when there is variation in how well CVs are tagged by SNPs. We used LDAK5¹³ to estimate these LD-weighted GRMs. This approach thins SNPs in very high LD first to reduce redundant tagging, then estimates SNP weights that are inversely proportional to their average LD with other SNPs. We also applied the MAF-stratified approach described above with the LDAK method (LDAK-MS). For the single component model (LDAK-SC), we used all SNPs (MAC≥5) as well as only common SNPs (MAF≥0.01) to build the GRM. For the MAF-stratified approach, following recommendations in the LDAK documentation, we estimated variant weights over the union of all variants (MAC≥5), then computed GRMs for each MAF class separately. We then applied the multiple GRM method with these LDAK-weighted GRMs to estimate h²_SNP using GCTA.

Extended Genealogy with Thresholded GRMs

Zaitlen et al.⁸ introduced a method to simultaneously estimate the full narrow-sense heritability (incorporating the effects of poorly tagged SNPs) and h²_SNP using two GRMs in a sample containing close relatives. The first GRM contains relatedness from SNPs for all individuals while relatedness estimates below a threshold, t, are set to 0 in the second GRM. The first GRM, therefore, contains information on allele sharing of (mostly common) variants in unrelated and related individuals and is used to estimate h²_SNP, while the second only contains information from closely related individuals, presumably reflecting sharing of both common and rare CVs, and provides an estimate of what we call h²_IBS>t (following Zaitlen et al.⁸). The sum of h²_IBS>t and h²_SNP should therefore provide an estimate of total h², similar to h²_FAM, with all the same potential biases that exist in h²_FAM estimates from designs that use close relatives. We tested two relatedness thresholds (t = 0.05 and 0.1) for the second GRM. By necessity, all analyses using the relatedness thresholded GRM approach included close relatives.
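The construction of the second, thresholded GRM can be sketched in a few lines of Python. The function name `threshold_grm` and the 3×3 example matrix are hypothetical; the variance component estimation with both GRMs is not shown.

```python
import numpy as np

def threshold_grm(A, t=0.05):
    """Copy of the GRM in which relatedness estimates at or below t are set to 0."""
    A = np.asarray(A, dtype=float)
    # Diagonal entries are near 1 and therefore survive the threshold.
    return np.where(A > t, A, 0.0)

# Toy GRM (hypothetical): individuals 1 and 2 are first-degree relatives, 3 is unrelated.
A = np.array([[1.00, 0.50, 0.02],
              [0.50, 1.00, -0.01],
              [0.02, -0.01, 1.00]])
A_t = threshold_grm(A, t=0.05)
```

The full GRM `A` and the thresholded GRM `A_t` would then be fitted jointly as two variance components.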

Treelet Covariance Smoothing (TCS)

Crossett et al.²¹ noted that the GRM estimates (particularly for unrelated individuals) are inherently noisy. They proposed a method to smooth the estimates using treelet covariance smoothing (TCS) to obtain more accurate estimates of relatedness. Their method takes advantage of the hierarchical nature of relatedness in samples to obtain better estimates of A_ij among unrelated individuals. We replicated their methods, using common SNPs (MAF≥0.01) and including related individuals, and implemented the TCS method in the treelet R package³¹. TCS requires identifying a smoothing parameter, λ (distinct from the genomic control inflation factor λ_GC). Crossett et al.²¹ propose two methods to optimize λ, one based on minimizing the GREML likelihood and one based on minimizing a loss function (H(λ)) at different levels of λ based on subsamples of the SNPs. With the large number of simulations across stratification subsamples and genetic architectures, minimizing the GREML likelihood for each simulated phenotype was not feasible. Minimizing H(λ) using the second approach requires estimating the GRM and applying the TCS method to over 50 subsets of data, which is also computationally impractical with over 8,000 individuals. We therefore used a modification of the second approach. We built GRMs from 2,000 randomly chosen individuals from each stratification subsample and optimized λ for each subsample following the published methodology (Fig. S3), then applied the optimal λ to the full GRM of over 8,000 individuals.

LD-Score Regression

LD-score regression uses a different approach to estimating h²_SNP. Rather than estimating relatedness within a sample for use in mixed-model GREML analysis, LD-score regression regresses GWAS test statistics (χ²) on SNPs’ LD scores, which reflect the degree to which each SNP is correlated with surrounding SNPs^19,22. For a polygenic model, the expected GWAS test statistic of variant j, χ²_j, is

E[χ²_j] = N h²_SNP l_j / M + N a + 1

where N is the sample size, M is the number of SNPs, l_j is the LD score (= Σ_k r²_jk) measuring the tagging of surrounding variants by SNP j, and a is a measure of confounding biases arising from stratification and cryptic relatedness. Thus, regressing GWAS test statistics on per-variant LD scores allows for both estimation of h²_SNP and assessing the degree of confounding or polygenicity of a trait²². Bulik-Sullivan et al.²² argue that LD-score regression provides unbiased estimates of h²_SNP regardless of whether GWAS test statistics are estimated with or without controlling for ancestry or environmental covariates or relatedness. Here, we estimated GWAS test statistics using plink2 without controlling for ancestry covariates, controlling for ancestry covariates (20 PCs and sequencing cohort as above), and controlling for ancestry covariates as fixed effects in a mixed model that included a kinship matrix. For the latter, we applied the GCTA leave-one-chromosome-out (LOCO) approach³²; because the GCTA-LOCO approach is computationally intensive, we ran only 20 repetitions of each phenotype rather than 100, and did so only for the array SNP dataset. We used the ldsc package with default parameters (see URLs) to perform LD score regression. We calculated LD scores for all variants using the whole genome sequence data, including common and rare variants. As recommended by Bulik-Sullivan et al., we used unrelated individuals (relatedness ≤ 0.05) and only common variants to perform the LD score regression itself, because the relationship between the GWAS χ² and LD-score is unclear for rare (MAF<0.01) SNPs.
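The relationship between the regression slope and h²_SNP can be illustrated with a simplified Python sketch. It uses ordinary least squares on simulated test statistics, whereas the ldsc package uses a weighted regression with block-jackknife standard errors; the function name `ldsc_h2`, the simulated LD scores, and all parameter values are assumptions for illustration.

```python
import numpy as np

def ldsc_h2(chi2, ld, n_samples, n_snps):
    """Estimate h2 from the slope of chi^2 on LD score: slope = N * h2 / M."""
    slope, intercept = np.polyfit(ld, chi2, 1)    # unweighted least squares fit
    return slope * n_snps / n_samples, intercept

# Simulated summary statistics (hypothetical values).
rng = np.random.default_rng(2)
M, N, true_h2, a = 50_000, 8_000, 0.5, 0.0
ld = rng.gamma(shape=2.0, scale=50.0, size=M)     # toy LD scores
expected = N * true_h2 * ld / M + N * a + 1.0     # E[chi^2_j] under the polygenic model
chi2 = expected * rng.chisquare(1, size=M)        # 1-df statistics with that expectation
h2_hat, intercept = ldsc_h2(chi2, ld, N, M)
```

With no confounding (a = 0) the fitted intercept is expected to be near 1; an intercept above 1 would indicate inflation from stratification or cryptic relatedness.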

LD score regression can also be used to partition heritability among annotations¹⁹. We applied this approach using the four MAF categories described above. Because our MAF categories included very rare variants, for this MAF-stratified LD score regression, we used GWAS test statistics from all variants (MAF≥0.0003, using the --not-M-5-50 flag in the ldsc package) while controlling for covariates as above.

BOLT-REML

Unlike other GREML approaches, BOLT-REML uses a Monte Carlo approximation of the gradient for the likelihood function to reduce computation time and memory requirements in variance component estimation¹⁸. When using whole genome sequence and imputed variant data with >14M variants (see below), time required by BOLT-REML, even when highly parallelized, was prohibitive for 100 repetitions of each combination of variables we tested, as it scales with MN^1.5, where M is the number of markers and N is the number of samples (see Supplementary Table 1 of Loh et al.¹⁸ for computational performance). Note that GREML takes longer for a single sample due to the length of time to create the GRM; in our simulations with GCTA-style approaches, the GRM computation was done only once, and therefore was much faster when estimating heritability for many repetitions created from randomly-drawn CVs with a single GRM. We therefore only applied BOLT-REML to the array dataset. We applied the method with a single component using either all array positions or only common markers (MAF>0.01) as well as a MAF-stratified approach with the same four MAF partitions and same covariates described above.

Confounding between relatedness and shared environments

Many of the methods we tested use unrelated individuals to avoid the assumption of no shared environmental effect among near relatives⁶. However, several, such as the extended genealogy with thresholding, require the use of near relatives⁶. This could lead to confounding between relatedness estimates and shared environmental effects within families or closely related individuals if shared environmental effects are not modeled^7,33. Indeed, Zaitlen et al.⁸ argue that such shared environmental effects were the likely cause of higher h²_FAM estimates among relatives who shared an environment through cohabitation (e.g., half-siblings) compared to equally related relatives that did not share a cohabitation environment (e.g., grand-parents and grand-children). We therefore assessed whether h²_SNP and h²_FAM estimates are biased for methods that use closely related individuals when extended shared environmental effects are present but unmodeled.

We first identified all groups of individuals connected by at least one pairwise relatedness value > 0.2 (“extended families”). Note that many of the pairwise relationships within these extended families were below 0.2. For example, spouses are typically unrelated but are nevertheless defined as being in the same family if their offspring are present, and cousins would be defined as being in the same family if their parents were present in the sample. We then simulated phenotypes with a shared extended family environmental effect that accounted for 10% of the variance (c²=0.1). Simulations were similar to those described above, with genotypic values exactly the same as above, but with shared effects for each family drawn from N(0, V_c), where V_c = c²*V_g/h², V_g is the variance of genetic values, and c² is the proportion of the phenotypic variance due to shared environments, and residual error added as N(0, (1-h²-c²)*V_g/h²), for a simulated h²=0.5, c²=0.1, and e²=0.4. We applied GREML-SC, LD score regression, and extended genealogy with thresholded GRMs using common variants from array SNPs controlling for the same covariates as above and without modeling the shared environmental effect. This tested whether methods are robust to violations of the assumption of no shared environmental effects on the phenotype.
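The variance decomposition in this simulation can be checked with a short Python sketch. The function name `add_shared_environment`, the toy family structure (2,000 families of three), and the standard-normal genetic values are hypothetical stand-ins for the genotypic values and extended families described above.

```python
import numpy as np

def add_shared_environment(g, family_ids, h2=0.5, c2=0.1, rng=None):
    """Return y = g + c + e, where c is shared within family and e is individual noise."""
    rng = np.random.default_rng() if rng is None else rng
    g = np.asarray(g, dtype=float)
    v_g = g.var()
    families, fam_index = np.unique(family_ids, return_inverse=True)
    # One shared effect per family, with variance V_c = c2 * V_g / h2.
    fam_effect = rng.normal(0.0, np.sqrt(c2 * v_g / h2), size=families.size)
    c = fam_effect[fam_index]
    # Residual variance (1 - h2 - c2) * V_g / h2.
    e = rng.normal(0.0, np.sqrt((1.0 - h2 - c2) * v_g / h2), size=g.size)
    return g + c + e, c

# Toy data (hypothetical): 6,000 individuals in 2,000 families of three.
rng = np.random.default_rng(4)
g = rng.normal(size=6000)
family_ids = np.repeat(np.arange(2000), 3)
y, c = add_shared_environment(g, family_ids, h2=0.5, c2=0.1, rng=rng)
prop_g = g.var() / y.var()    # expected near h2 = 0.5
prop_c = c.var() / y.var()    # expected near c2 = 0.1
```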

Heritability of Complex Traits in the UK Biobank

We estimated h²_SNP for six continuous phenotypes in the UK Biobank using the methods (GREML-MS and GREML-LDMS) that produced consistently unbiased estimates of h² and partitioned the genetic variance most accurately in the simulations above. The UK Biobank is a large, publicly available resource of ~500,000 UK adults, with deep phenotyping, family history, and genotype data²³. The current release includes ~150,000 individuals, primarily of European ancestry, genotyped on the Affymetrix Axiom platform, phased using SHAPEIT2 and imputed to a combined 1000 Genomes and UK10K reference panel (N=6,285 individuals). The details of the official UK Biobank genotyping and imputation methods in the released data can be found at http://biobank.ctsu.ox.ac.uk/crystal/docs/genotyping_qc.pdf and http://biobank.ctsu.ox.ac.uk/crystal/docs/impute_ukb_v1.pdf (accessed 29 Feb. 2016). We excluded individuals with no genetic data and those whose self-reported and genetic sex conflicted (data fields f.31.0.0 and f.22001.0.0). Poor quality samples identified by the UK Biobank and Affymetrix were also removed (f.220010.0.0) as were UKBiLEVE poor-quality samples (f.22051.0.0), leaving a total of 151,661 individuals. To reduce population stratification, we included only individuals of European ancestry in our analyses. The UK Biobank identified self-reported “British” individuals as “Caucasian” based on grouping of individuals with CEU individuals in PCA (see UK Biobank documentation). To these individuals (f.22006.0.0), we added those who self-identified as “White,” “Irish,” or “Any other white background” whose PC scores on the first four axes (f.22009.0.1-4) were within the range of the UK Biobank-identified “Caucasian” individuals, resulting in 126,338 individuals. 
We projected the UK Biobank samples onto the HRC PCA axes using the loadings from the HRC EUR individuals, demonstrating that the UK Biobank individuals we used in the analyses below are similar to the least stratified or unstratified subsamples of the HRC we used (Fig. 1). To estimate the GRMs, we separately used directly genotyped Axiom array positions as well as imputed genome-wide variants with IMPUTE info score ≥0.3.
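The ancestry filter described above (retaining the UK Biobank-identified reference group plus self-identified white individuals whose first four PC scores fall within that group's range) might be sketched as follows. The function names and the boolean input arrays are hypothetical, and inclusive range boundaries are an assumption.

```python
import numpy as np

def within_reference_pc_range(pcs, is_reference, n_axes=4):
    """Flag individuals whose first n_axes PC scores all fall inside the
    per-axis [min, max] range of the reference group (inclusive bounds
    are an assumption)."""
    pcs = np.asarray(pcs, dtype=float)[:, :n_axes]
    is_reference = np.asarray(is_reference, dtype=bool)
    lo = pcs[is_reference].min(axis=0)
    hi = pcs[is_reference].max(axis=0)
    return np.all((pcs >= lo) & (pcs <= hi), axis=1)

def select_european(pcs, is_reference, is_self_reported_white):
    """Keep the reference group, plus self-reported white individuals
    whose PC scores lie within the reference range."""
    in_range = within_reference_pc_range(pcs, is_reference)
    is_reference = np.asarray(is_reference, dtype=bool)
    is_white = np.asarray(is_self_reported_white, dtype=bool)
    return is_reference | (is_white & in_range)
```

A per-axis range check of this kind defines a rectangular region in PC space, so it is a simple criterion rather than a clustering method.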

We estimated h²_SNP for the following traits in the UK Biobank (field ID number): height (f.50.0.0), body mass index (BMI; f.21001.0.0), whole-body impedance (f.23106.0.0), trunk fat percentage (f.23127.0.0), fluid intelligence (f.20016.0.0), and neuroticism (f.20127.0.0). We normalized phenotypes and removed observations more than 5 standard deviations from the mean. We included sex (f.31.0.0), UK Biobank assessment centre (f.54.0.0), genotype measurement batch (f.22000.0.0), and educational attainment (“qualification”, f.6138.0.0) as categorical covariates, and the Townsend deprivation index (f.189.0.0), age at assessment (f.21003.0.0), age at assessment squared, and the 15 PC scores from the UK Biobank (f.22009.0.1-15) as quantitative covariates.
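The outlier removal step can be sketched as below. The text does not specify the normalization used, so z-score standardization before trimming is an assumption here, and the function name `standardize_and_trim` is hypothetical.

```python
import numpy as np

def standardize_and_trim(values, max_sd=5.0):
    """Z-score a phenotype and drop observations beyond max_sd.

    Assumes z-score standardization; the original normalization may differ.
    Returns the retained z-scores and a boolean mask over the input.
    """
    x = np.asarray(values, dtype=float)
    z = (x - np.nanmean(x)) / np.nanstd(x, ddof=1)
    # Missing values (NaN) are excluded along with the outliers.
    keep = np.isfinite(z) & (np.abs(z) <= max_sd)
    return z[keep], keep
```

Returning the mask alongside the trimmed values makes it straightforward to apply the same exclusions to the covariates and to the rows of the GRMs.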

For GREML-MS, we binned variants into eight MAF categories: MAC≥5 & MAF<0.0001, 0.0001-0.001, 0.001-0.01, 0.01-0.1, 0.1-0.2, 0.2-0.3, 0.3-0.4, & 0.4-0.5. For GREML-LDMS, computational constraints (1 TB of RAM) limited the number of predictor GRMs we could fit; we therefore used 4 MAF bins (common: MAF>0.05, uncommon: 0.01<MAF<0.05, rare: 0.0001<MAF<0.01, and very rare: MAC>5 & MAF<0.0001) and 2 LD-score bins (above and below the median LD-score).
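As an illustration of the GREML-LDMS binning, the sketch below assigns each variant to one of the 4 × 2 = 8 MAF-by-LD-score bins, each of which would correspond to one GRM. Several details are assumptions not stated in the text: the median LD score is taken over all input variants rather than within each MAF bin, variants falling exactly on a MAF boundary go to the higher bin, and the function and label names are hypothetical.

```python
import numpy as np

# MAF boundaries between very rare / rare / uncommon / common variants.
MAF_EDGES = np.array([0.0001, 0.01, 0.05])
MAF_LABELS = np.array(["very_rare", "rare", "uncommon", "common"])

def assign_ldms_bins(maf, mac, ld_score, min_mac=5):
    """Label variants by MAF bin and LD-score bin.

    Returns (labels, usable): a label per variant, and a mask that is False
    for very rare variants failing the minor allele count filter (MAC > 5).
    """
    maf = np.asarray(maf, dtype=float)
    mac = np.asarray(mac)
    ld = np.asarray(ld_score, dtype=float)

    # np.digitize returns 0..3 for the four MAF intervals.
    maf_bin = MAF_LABELS[np.digitize(maf, MAF_EDGES)]

    # Assumption: a single genome-wide median splits high from low LD.
    ld_bin = np.where(ld > np.median(ld), "high_ld", "low_ld")

    labels = np.array([f"{m}_{l}" for m, l in zip(maf_bin, ld_bin)])
    usable = (maf >= MAF_EDGES[0]) | (mac > min_mac)
    return labels, usable
```

Each distinct label among the usable variants would then define the variant set for one GRM.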

RESULTS

Simulation Results

We found clear differences across methods, degree of stratification, and data types (array SNP, WGS, or imputed variants) in their ability to estimate the simulated h² for different CV MAF architectures (Figs. 2–3 and S4-S6, Tables S1-S3). Below, we describe results for each method in detail. Please refer to Figures 2–4, Figures S4-S6, and Tables S1-S5 for estimates of heritability, and Figures S7-S9 for estimates of the heritability standard errors.

Figure 2.

Average h²_SNP estimates across 100 replicates (± SEM) from GRMs built from Axiom array positions (left), whole genome sequence data (center), or imputed genome-wide variants (right). Horizontal panels show MAF ranges (specified in inset) of 1,000 randomly chosen causal variants (CVs). Methods are listed on the X-axis as follows: Single component GREML (GREML-SC); MAF-stratified GREML (GREML-MS); LD- & MAF-stratified GREML (GREML-LDMS); Single-component Linkage Disequilibrium-Adjusted Kinships (LDAK-SC); MAF-stratified LDAK (LDAK-MS); Treelet Covariance Smoothing (TCS); Extended Genealogy with Thresholded GRMs; LD Score Regression using no PCs as covariates in GWAS, using PCs as covariates, or using both PCs and the kinship matrix; and Single Component and MAF-stratified BOLT-REML. Estimates are from samples of unrelated individuals (relatedness <0.05) except for samples used in the Threshold GRM method, which included all individuals. For the Threshold GRM method we plot h²_SNP rather than total h² (h²_SNP + h²_IBS>t) from models where t = 0.05. The dotted line is the simulated (true) h² = 0.5. Colors represent the 4 subsamples varying in genetic structure.

See Figs. S4-6 for estimates using different relatedness thresholds.

Figure 3.

Average of 100 h²_SNP estimates (± SEM) from GRMs constructed from imputed genome-wide variants of different MAF ranges (different symbols) in samples of unrelated (<0.05) individuals. Horizontal panels show MAF ranges (specified in inset) of 1,000 randomly chosen CVs, and colors represent the 4 subsamples varying in genetic structure. GREML-MS & GREML-LDMS partitioned the genetic variance to the correct MAF-range GRM, while LDAK-MS often attributed genetic variance to incorrect GRMs.

Figure 4.

Mean heritability estimates (± SEM) from 100 replicates of phenotypes simulated with or without confounding shared environmental effects among families for three different methods (x axis) for different genetic architectures. GRMs were estimated using common (MAF>0.01) array SNP positions for the most structured and most homogeneous stratification subsamples only. Different symbols indicate the relatedness cutoffs used. For GREML-SC, we used three thresholds, including no relatedness cutoff (all individuals included). For LD Score Regression, we did not apply a 0.1 relatedness cutoff, as most studies will use a 0.05 or lower threshold for individuals included in GWAS. The threshold GRM approach requires all individuals, and the different symbols indicate the relatedness threshold (t) below which the thresholded GRM was set to 0. h²_Total is the sum of both variance components; h²_SNP is the variance component of the unthresholded GRM. Each horizontal panel indicates the minor allele frequency (MAF) range of the 1,000 randomly chosen causal variants (CV), with the range specified in the inset.