Abstract
Dense genotype data and thousands of phenotypes from large biobanks, coupled with increasingly accessible summary association statistics from genome-wide association studies (GWAS), provide great opportunities to dissect the complex relationships among human traits and diseases. We introduce BADGERS, a powerful method to perform polygenic score-based biobank-wide scans for disease-trait associations. Compared to traditional regression approaches, BADGERS uses GWAS summary statistics as input and does not require multiple traits to be measured on the same cohort. We applied BADGERS to two independent datasets for Alzheimer’s disease (AD; N=61,212). Among the polygenic risk scores (PRS) for 1,738 traits in the UK Biobank, we identified 48 significant trait PRSs associated with AD after adjusting for multiple testing. Family history, high cholesterol, and numerous traits related to intelligence and education showed strong and independent associations with AD. Further, we identified 41 significant PRSs associated with AD endophenotypes. While family history and high cholesterol were strongly associated with neuropathologies and cognitively-defined AD subgroups, only intelligence and education-related traits predicted pre-clinical cognitive phenotypes. These results provide novel insights into the distinct biological processes underlying various risk factors for AD.
Introduction
Late-onset Alzheimer’s disease (AD) is a prevalent, complex, and devastating neurodegenerative disease without a current cure. Millions of people are currently living with AD worldwide, and the number is expected to grow rapidly as the population continues to age [1, 2]. With the failure of numerous drug trials, it is of great interest to identify modifiable risk factors that can be potential therapeutic targets for AD [3-5]. Epidemiological studies that directly test associations between measured risk factors and AD are difficult to conduct and interpret because identified associations are in many cases affected by confounding and reverse causality. These challenges in risk factor studies for complex disease are ubiquitous and are particularly critical for AD due to its extended pre-clinical stage – irreversible pathologic changes have already occurred in the decade or two prior to clinical symptoms [6]. In an effort to combat confounding and reverse causality, Mendelian randomization (MR) methods [7-9] have been developed to identify causal risk factors for disease using data from genome-wide association studies (GWAS). Despite the favorable theoretical properties in identifying causal relationships, these methods have limited statistical power, therefore they are not suitable for hypothesis-free screening of risk factors.
Motivated by transcriptome-wide association study – an analysis strategy that identifies genes with genetically-regulated expression values associated with disease [10-12] – we sought a systematic and statistically-powerful approach to identify AD risk factors using summary association statistics from large-scale GWAS. GWAS for late-onset AD have been successful, and dozens of associated loci have been identified to date [13-19]. Although direct information about risk factors is limited in these studies, dense genotype data on a large number of samples, in conjunction with independent reference datasets for thousands of complex human traits, such as the UK Biobank [20], make it possible to genetically impute potential risk factors and test their associations with AD. This strategy allows researchers to study risk factors that are not directly measured in AD studies. Furthermore, it reduces reverse causality because the imputation models are trained on independent, younger, and mostly dementia-free reference cohorts, thereby improving the interpretability of findings.
Here, we introduce BADGERS (Biobank-wide Association Discovery using GEnetic Risk Scores), a statistically-powerful and computationally-efficient method to identify associations between a disease of interest and a large number of genetically-predicted complex traits using GWAS summary statistics. We applied BADGERS to identify associated risk factors for AD from 1,738 heritable traits in the UK Biobank and replicated our findings in independent samples. Furthermore, we performed multivariate conditional analysis, Mendelian randomization, and follow-up association analysis with a variety of AD biomarkers, AD-related pathology findings, and cognitive subgroup phenotypes to provide mechanistic insights into our findings.
Results
Method overview
Here, we briefly introduce the BADGERS model. Complete statistical details are discussed in the Methods section. BADGERS is a two-stage method to test associations between traits. First, polygenic risk score (PRS) models are trained to impute complex traits using genetic data. Next, we test the associations between a disease or trait of interest and the PRSs of various traits. Given a PRS model, the genetically-imputed trait can be denoted as: where X is the N × M genotype matrix for N individuals in a GWAS, and W denotes pre-calculated weight values of M SNPs in the PRS model. Then, we test the association between measured trait Y and imputed trait via a univariate linear model
The test statistic for γ can be expressed as: where Z̃ is the vector of SNP-level association z-scores for trait Y, and Γ is a diagonal matrix with the jth diagonal element being the ratio between standard deviation of the jth SNP and that of .
This model can be further generalized to perform multivariate analysis. If K imputed traits are included in the analysis, we use a similar notation as in univariate analysis:
Here, W* is a M × K matrix and each column of W* is the pre-calculated weight values of SNPs for each imputed trait. Then, the associations between Y and K imputed traits (1 ≤ i ≤ K) are tested via a multivariate linear model where γ* = (γ1,⋯,γK)T is the vector of regression coefficients. The z-score for γi(l ≤ i ≤ K) can be denoted as: where U is the inverse variance-covariance matrix of ; Ik is the K × 1 vector with the kth element being 1 and all other elements equal to 0; Θ is a M × M diagonal matrix with the ith diagonal element being and Z̃ is defined similarly to the univariate case as the vector of SNP-level association z-scores for trait Y.
Simulations
We used real genotype data from the Genetic Epidemiology Research on Adult Health and Aging (GERA) to conduct simulation analyses (Methods). First, we evaluated the performance of our method on data simulated under the null hypothesis. We tested the associations between randomly simulated traits and 1,738 PRSs from the UK Biobank and did not observe inflation of type-I error (Supplementary Table 1). Similar results were also observed when we simulated traits that are heritable but not directly associated with any PRS. Since BADGERS only uses summary association statistics and externally estimated linkage disequilibrium (LD) as input, we also compared effect estimates in BADGERS with those of traditional regression analysis based on individual-level data. Regression coefficient estimates and association p-values from these two methods were highly consistent in both simulation settings (all correlations > 0.94; Figure 1A and Supplementary Figures 1-2), showing minimal information loss in summary statistics compared to individual-level data. To evaluate the statistical power of BADGERS, we simulated traits by combining effects from randomly selected PRS and a polygenic background (Methods). We set the effect size of PRS to be 0.02, 0.015, 0.01, 0.008, and 0.005. BADGERS showed comparable statistical power to the regression analysis based on individual-level genotype and phenotype data (Figure 1B). Overall, our results suggest that using summary association statistics and externally estimated LD as a proxy for individual-level genotype and phenotype data does not inflate type-I error rate or decrease power. The performance of BADGERS is comparable to regression analysis based on individual-level data.
Identify risk factors for late-onset AD among 1,738 heritable traits in the UK Biobank
We applied BADGERS to conduct a biobank-wide association scan (BWAS) for AD risk factors from 1,738 heritable traits (p<0.05; Methods) in the UK Biobank. We repeated the analysis on two independent GWAS datasets for AD and further combined the statistical evidence via meta-analysis (Supplementary Figure 3). We used stage-I association statistics from the International Genomics of Alzheimer’s Project (IGAP; N=54,162) as the discovery phase, then replicated the findings using 7,050 independent samples from the Alzheimer’s Disease Genetics Consortium (ADGC). We identified 50 significant PRS-AD associations in the discovery phase after correcting for multiple testing, among which 14 had p<0.05 in the replication analysis. Despite the considerably smaller sample size in the replication phase, top trait PRSs identified in the discovery stage showed strong enrichment for p<0.05 in the replication analysis (enrichment=2.5, p=2.2e-5; hypergeometric test). In the meta-analysis, a total of 48 trait PRSs reached Bonferroni-corrected statistical significance and showed consistent effect directions in the discovery and replication analyses (Figure 2 and Supplementary Table 2).
Unsurprisingly, many identified associations were with dementia-related PRSs. PRSs for Family history of AD and dementia showed the most significant associations with AD (meta-analytic p=3.7e-77 and 5.2e-28 for illnesses of mother and father, respectively). The PRS for having any dementia diagnosis is also strongly and positively associated (p=8.5e-11). In addition, we observed consistent and negative associations between PRSs for better performance on cognitive tests and AD. These cognitive traits include those for fluid intelligence score (p=2.4e-14), time to complete a card matching task (p=2.8e-9), ability to remember and comply with an instruction despite interference information (p=9.1e-11), and many others. Consistently, PRSs for educational attainment showed strong inverse associations with AD. The PRS for higher age at which the person completed full time education was inversely associated with AD (p=2.5e-7). PRSs for four out of seven traits based on a survey about education and professional qualifications were significantly associated with AD (Supplementary Figure 4). Higher education traits, such as having a university degree (p=4.4e-12), A levels/AS levels or equivalent education in the UK (p=6.9e-9), and professional qualifications (p=7.1e-6), were inversely associated with AD. In contrast, the PRS for choosing “none of the above” in this survey (i.e. do not have any listed education or qualification) was directly associated with AD (p=1.6e-11). Other PRSs with notable strong associations included those for high cholesterol (p=2.5e-15; positive), lifestyle traits such as cheese intake (p=2.5e-10; negative), occupational traits such as jobs involving heavy physical work (p=2.7e-10; positive), anthropometric traits including height (p=5.3e-7; negative), and traits related to pulmonary function, e.g. forced expiratory volume in 1 second (FEV1; p=1.9e-6; negative). Detailed information for all associations is summarized in Supplementary Table 2.
Multivariate conditional analysis identifies independently associated risk factors
Of note, associations identified in marginal analysis are not guaranteed to be independent. We observed clear correlational structures among the PRSs of identified traits (Figure 3). For example, PRS of various intelligence and cognitive traits are strongly correlated, and the PRS for consumption of cholesterol lowering medication is correlated with the PRS for self-reported high cholesterol. To account for the correlations among trait PRSs and identify risk factors that are independently associated with AD, we performed multivariate conditional analysis using GWAS summary statistics (Methods). First, we applied hierarchical clustering to the 48 traits we identified in marginal association analysis and divided these traits into 15 representative clusters. The traits showing the most significant marginal association in each cluster were included in the multivariate analysis (Supplementary Figure 5). Similar to the marginal analysis, we analyzed IGAP and ADGC data separately and combined the results using meta-analysis (Supplementary Table 3). All 15 representative trait PRSs remained nominally significant (p<0.05) and showed consistent effect directions between marginal and conditional analyses (Supplementary Table 4). However, several traits showed substantially reduced effect estimates and inflated p-values in multivariate analysis, including PRSs for fluid intelligence score, mother still alive, unable to work because of sickness or disability, duration of moderate activity, and intake of cholesterol-lowering spreads. Interestingly, major trait categories that showed the strongest marginal associations with AD (i.e. family history, high cholesterol, and education/cognition) were independent from each other. Paternal and maternal family history also showed independent associations with AD, consistent with the low correlation between their PRSs (correlation= 0.052).
Influence of the APOE region on identified associations
Further, we evaluated the impact of APOE on identified associations. We removed the extended APOE region (chr19: 45,147,340–45,594,595; hg19) from summary statistics of the 48 traits showing significant marginal associations with AD and repeated the analysis. We observed a substantial drop in the significance level of many traits, especially family history of AD, dementia diagnosis, and high cholesterol (Supplementary Table 5 and Figure 4). 38 out of 48 traits remained significant under stringent Bonferroni correction after APOE removal. Interestingly, the associations between AD and almost all cognitive/intelligence traits were virtually unchanged, suggesting a limited role of APOE in these associations. Removal of an even wider region surrounding APOE showed similar results (Supplementary Figure 6).
Causal inference via Mendelian randomization
Next, we investigated the evidence for causality among identified associations. We performed MR (Methods) in IGAP and ADGC datasets separately and meta-analyzed the results. Among the 48 significant traits identified by BADGERS, inverse variance-weighted (IVW) MR identified 23 nominally significant causal effects (p<0.05) on AD (Supplementary Table 6). The signs of all significant causal effects were consistent with results from BADGERS. The most significant effect was family history (p=1.1e-233 and 1.7e-69 for illnesses of mother and father, respectively). Having any dementia diagnosis (p=9.1e-7), high cholesterol (p=4.1e-6), A levels/AS levels education (p=1.7e-4), and time spent watching television (p=2.4e-4) were also among top significant effects. Of note, fluid intelligence score, one of the most significant associations identified by BADGERS, did not reach statistical significance in MR (p=0.06).
Associations with AD subgroups, biomarkers, and pathologies
To further investigate mechanistic pathways for the identified risk factors, we applied BADGERS to a variety of AD subgroups, biomarkers, and neuropathologic features (Supplementary Table 7). Overall, 29 significant associations were identified under a false discovery rate (FDR) cutoff of 0.05, and these AD-related phenotypes showed distinct association patterns with PRSs for AD risk factors (Figure 5; Supplementary Figure 7). First, we tested the associations between the 48 PRSs for AD-associated traits and five AD subgroups defined in the Executive Prominent Alzheimer’s Disease (EPAD) study of cognitively-defined AD subgroups, i.e. isolated relative impairments in memory, language, and visuospatial functioning, no domain with an isolated relative impairment, and multiple domains with relative impairments (Methods) [21, 22]. The PRS for maternal family history of AD and dementia was strongly and consistently associated with all five cognitively defined AD subgroups (Supplementary Table 8), with the group with isolated relative memory impairment showing the strongest association (p=3.3e-16), which is consistent with the higher frequency of APOE ε4 in this subgroup [21]. The PRS for paternal family history was not strongly associated with any of the cognitively defined subgroups but the effect directions were consistent with those found for maternal history. Interestingly, PRSs for intelligence and cognitive traits, such as ability to remember and comply with an instruction despite interference information (p=2.7e-5) and fluid intelligence score (p=6.8e-5), were specifically associated with the group with no domain with an isolated relative impairment. PRSs for high cholesterol and related traits were positively associated with the groups with isolated relative impairments in language and memory and the group with multiple domains with relative impairments but showed weaker associations with the group with isolated relative impairment in visuospatial functioning and the group with no domain with an isolated relative impairment.
Next, we extended our analysis to a different GWAS dataset for three AD-related biomarkers in cerebrospinal fluid (CSF): amyloid beta (Aβ42), tau, and phosphorylated tau (ptau181) [23]. Somewhat surprisingly, AD risk factors did not show strong associations with Aβ42 and tau (Supplementary Table 9). Maternal family history of AD and dementia was associated with ptau181 (p=4.2e-4), but associations were absent for Aβ42 and tau. It has been recently suggested that CSF biomarkers have sex-specific genetic architecture [24]. However, no association passed an FDR cutoff of 0.05 in sex-stratified analyses (Supplementary Table 10).
Further, we applied BADGERS to a variety of neuropathologic features of AD and related dementias (Methods), including neuritic plaques (NPs), neurofibrillary tangles (NFTs), cerebral amyloid angiopathy (CAA), Lewy body disease (LBD), hippocampal sclerosis (HS), and vascular brain injury (VBI) [25]. PRSs for family history of AD/dementia (p=3.8e-8, maternal; p=1.4e-5, paternal) and high cholesterol (p=2.1e-5) were strongly associated with NFT Braak stages (Supplementary Table 11). NP also showed very similar association patterns with PRSs for these traits (p=2.7e-19, maternal family history; p=2.6e-7, paternal family history; p=0.001, high cholesterol). Of note, the association between the PRS of maternal family history of AD/dementia and NP remained significant even after APOE removal (p=2.3e-4; Supplementary Table 12). The other neuropathologic features did not show strong associations. Despite not being statistically significant, PRSs for family history of AD/dementia was negatively associated with VBI, and PRSs for multiple intelligence traits were positively associated with LBD, showing distinct patterns with other pathologies (Supplementary Figure 7). We also note that various versions of the same neuropathology findings all showed consistent associations in our analyses (Supplementary Figure 8). The complete association results for all the aforementioned endophenotypes are summarized in Supplementary Table 13.
Associations with cognitive traits in a pre-clinical cohort
Finally, we studied the associations between PRSs for AD risk factors and pre-clinical cognitive phenotypes using 1,198 samples from the Wisconsin Registry for Alzheimer’s Prevention (WRAP), a longitudinal study of initially dementia-free middle-aged adults [26]. Assessed phenotypes include mild cognitive impairment (MCI) status and three cognitive composite scores for executive function, delayed recall, and learning (Methods). A total of 12 significant associations reached an FDR cutoff of 0.05 (Supplementary Table 14). Somewhat surprisingly, although the effect directions were consistent with other related associations reported in our study, parental history and high cholesterol, the risk factors that showed the strongest associations with various AD endophenotypes, were not significantly associated with MCI or cognitive composite scores in WRAP. Instead, PRSs for education and intelligence-related traits strongly predicted pre-clinical cognition (Figure 6). PRSs for A-levels education and no education both showed highly significant associations with delayed recall (p=4.0e-5 and 7.7e-7) and learning (p=7.6e-6 and 5.0e-8). No education was also associated with higher risk of MCI (p=2.5e-4). Additionally, fluid intelligence score was positively associated with the learning composite score (p=7.5e-4), and time to complete a card matching task was negatively associated with the executive function (p=1.1e-5).
Discussion
In this work, we introduced BADGERS, a new method to perform PRS-based association scans at the biobank scale using GWAS summary statistics. Through simulations, we demonstrated that our method provides consistent effect estimates and similar statistical power compared to regression analysis based on individual-level data. Additionally, we applied BADGERS to two large and independent GWAS datasets for late-onset AD. Among 1,738 heritable traits in the UK Biobank, we identified 48 trait PRSs showing statistically significant associations with AD. These trait PRSs covered a variety of categories including family history, cholesterol, intelligence, education, occupation, and lifestyle. Although many of the identified traits are genetically correlated, multivariate conditional analysis confirmed multiple strong and independent associations for AD. PRS for family history showing strong associations with AD is not a surprise, and many other associations are supported by the literature as well. The protective effect of higher educational and occupational attainment on the risk and onset of dementia is well studied [27, 28]. Hypercholesterolemia is also known to associate with β-amyloid plaques in the brain and higher AD risk [29-31].
More interestingly, these identified trait PRSs had distinct association patterns with various AD subgroups, biomarkers, pathologies, and pre-clinical cognitive traits. Five cognitively-defined AD subgroups were consistently associated with maternal family history, but only the group without substantial relative impairment in any domain was associated with PRSs for intelligence and education. In addition, PRSs for family history and high cholesterol were strongly associated with classic AD neuropathologies including NP and NFT while PRSs for intelligence and educational attainment were associated with pre-clinical cognitive scores and MCI. We also investigated the influence of APOE on the identified associations. Effects of PRSs for family history and high cholesterol were substantially reduced after APOE removal. In contrast, associations with PRSs for cognition and education were virtually unchanged. These results suggest that various AD risk factors may affect the disease course at different time points and via distinct biological processes, and genetically predicted risk factors for clinical AD include at least two separate components. While some risk factors (e.g. high cholesterol and APOE) may directly contribute to the accumulation of pathologies, other factors (e.g. intelligence and education) may buffer the adverse effect of brain pathology on cognition [28].
Further, we note that the association results in BADGERS need to be interpreted with caution. Although PRS-based association analysis is sometimes treated as causal inference in the literature [32], we do not consider BADGERS a tool to identify causal factors. Key assumptions in causal inference are in many cases violated when analyzing complex, highly polygenic traits, which may lead to complications when interpreting results. In our analysis, BADGERS showed superior statistical power compared to MR-IVW – among the 48 traits identified by BADGERS, only 7 reached Bonferroni-corrected statistical significance in MR (Supplementary Table 6). We envision BADGERS as a tool to prioritize associations among a large number of candidate risk factors so that robust causal inference methods can be applied to carefully assess causal effects. Limited sample size in the AD endophenotypes datasets is another limitation in our study. We have used data from the largest available GWAS for CSF biomarkers and neuropathologies. Still, small sample size made it challenging to assess the effects of traits that were weakly associated with AD. Finally, emerging evidence has highlighted sex-specific genetic architecture of AD [24, 33]. In our analysis, maternal family history of AD showed stronger associations with various phenotypes than paternal family history. This is consistent with what has been reported in the literature [34, 35]. However, we note that this may also be explained by the sample size difference in UK Biobank (Ncase=28,507 and 15,022 for samples with maternal and paternal family history, respectively). We also performed sex-stratified analyses for CSF biomarkers but identified limited associations, possibly due to the small sample size. Overall, sex-specific effects of risk factors remain to be investigated in the future using larger datasets. Finally, BADGERS requires training data for genetic prediction models and the downstream disease GWAS to be independent but of similar genetic ancestry. Development of methods that are more robust to sample overlap and diverse genetic ancestry remains an open problem for future research.
In conclusion, BADGERS is a statistically powerful method to identify associated risk factors for complex disease. Large-scale biobanks continue to provide rich data on various human traits that may be of interest in disease research. Our method uses GWAS to bridge large biobanks with studies on specific diseases, lessens the limitation of insufficient disease cases in biobanks and lack of risk factor measurements in disease studies, and provides a statistically-justified approach to identifying risk factors for disease. We have demonstrated the effectiveness of BADGERS through extensive simulations, a two-stage BWAS for late-onset AD, and various follow-up analyses of identified risk factors. Our results provide new insights into the genetic basis of AD and reveal distinct mechanisms for the involvement of risk factors in AD etiologies. The ever-growing sample size in GWAS and biobanks, in conjunction with increasingly accessible summary association statistics, makes BADGERS a powerful and valuable tool in human genetics research.
Methods
BADGERS framework
The goal of this method is to study the association between Y, a measured trait in the study, and , a trait imputed from genetic data via a linear prediction model:
Here, XN×M is the genotype matrix for N individuals in a study of trait Y. WM×1 is the precalculated weight values on SNPs in the imputation model. M denotes the number of SNPs. We use Y, a N × 1 vector, to denote the trait values measured on the same group of individuals. We test the association between Y and via a linear model where α is the intercept, δ is the term for random noise, and regression coefficient γ is the parameter of interest. The ordinary least squares (OLS) estimator for γ can be denoted as
Here, Xj is the jth column of X. Additionally, we derive the formula for the standard error of :
The approximation in this formula is based on the assumption that trait Y has complex etiology and imputed trait only explains a small proportion of its phenotypic variance. When an accurate estimate of var(δ) is difficult to obtain, this approximation approach provides conservative results and controls type-I error in the analysis.
In practice, individual-level genotype (i.e. X) and phenotype data (i.e. Y) may not be accessible due to policy and privacy concerns. Therefore, it is of practical interest to perform the aforementioned association analysis using summary association statistics. Standard genetic association analysis tests the association between trait Y and each SNP via the following linear model:
The OLS estimator for βj and its standard error have the following forms.
Again, the approximation is based on the empirical observation in complex trait genetics – each SNP explains little variability of Y [36].
Next, we derive the test statistic (i.e. z-score) for γ: where Γ is a diagonal matrix with the jth diagonal element being and Z̃ is the vector of SNP-level z-scores obtained from the GWAS of trait Y, i.e.
Without access of individual-level genotype data, var(Xj) and var( ) need to be estimated using an external panel with a similar ancestry background. We use X̃ to denote the genotype matrix from an external cohort, then var(Xj) can be approximated using the sample variance of X̃j. Variance of can be approximated as follows where D̃ is the variance-covariance matrix of all SNPs estimated using X̃. However, when the number of SNPs is large in the imputation model for trait T, calculation of D̃ is computationally intractable. Instead, we use an equivalent but computationally more efficient approach. We first impute trait T in the external panel using the same imputation model
Then, var( ) can be approximated by sample variance var( ).
Thus, we can test association between Y and without having access to individual-level genotype and phenotype data from the GWAS. The required input variables for BADGERS include a linear imputation model for trait T, SNP-level summary statistics from a GWAS of trait Y, and an external panel of genotype data. With these, association tests can be performed.
Multivariate analysis in BADGERS
To adjust for potential confounding effects, it may be of interest to include multiple imputed traits in the same BADGERS model. We still use Y to denote the measured trait of interest. The goal is to perform a multiple regression analysis using K imputed traits (i.e. ) as predictor variables:
Here, we use to denote a N × K matrix for K imputed traits. Regression coefficients γ* = (γ1, ⋯, γK)T (are the parameters of interest. To simplify algebra, we also assume trait Y and all SNPs in the genotype matrix X are centered so there is no intercept term in the model, but the conclusions apply to the general setting. Similar to univariate analysis, traits are imputed from genetic data via linear prediction models: where are imputation weights assigned to SNPs. The ith column of W denotes the imputation model for trait Ti. Then, the OLS estimator and its variance-covariance matrix can be denoted as follows:
The approximation is based on the assumption that imputed traits collectively explain little variance in Y, which is reasonable in complex trait genetics if K is not too large. We further denote:
All elements in matrix U can be approximated using a reference panel X̃:
Therefore, the z-score for γk (1 ≤ k ≤ K) is where Ik is the K × 1 vector with the kth element being 1 and all other elements equal to 0, Θ is a M × M diagonal matrix with the ith diagonal element being and similar to the notation in univariate analysis, Z̃ is the vector of SNP-level z-scores from the GWAS of trait Y. Given imputation models for K traits (i.e. W*), GWAS summary statistics for trait Y (i.e. Z̃), and an external genetic dataset to estimate U and Θ, multivariate association analysis can be performed without genotype and phenotype data from the GWAS.
Genetic prediction
Any linear prediction model can be used in the BADGERS framework. With access to individual-level genotype and phenotype data, users can train the model using their preferred statistical learning methods, e.g. penalized regression or linear mixed model. When only GWAS summary statistics are available for risk factors (i.e. T), PRS can be used for imputation. We used PRS to impute complex traits in all analyses reported throughout the paper. Of note, more advanced PRS methods that explicitly model LD [37] and functional annotations [38] to improve prediction accuracy have been developed. However, additional independent datasets may be needed if there are tuning parameters in PRS. In general, higher imputation accuracy will improve statistical power in association testing [12]. The BADGERS software allows users to choose their preferred imputation model.
Simulation settings
We simulated quantitative trait values using genotype data of 62,313 individuals from the GERA cohort (dbGap accession: phs000674). Summary association statistics for each of the simulated traits were generated using PLINK [39]. We ran BADGERS on summary statistics based on the simulated traits and PRS of 1,738 traits in the UK Biobank. To compare BADGERS with the traditional approach that uses individual-level data as input, we also directly regressed simulated traits on the PRS of UK Biobank traits to estimate association effects.
Setting 1. We simulated values of a quantitative trait as identically and independently distributed samples from a normal distribution with mean 0 and variance 1. In this setting, simulated trait values were independent from genotype data.
Setting 2. We simulated values of a quantitative trait based on an additive random effect model commonly used in heritability estimation [40]. We fixed heritability to be 0.1. In this setting, the simulated trait is associated with SNPs, but is not directly related to PRSs of UK Biobank traits.
Setting 3. We randomly selected 100 PRSs from 1,738 UK Biobank traits and calculated PRSs on GERA data. For each of these 100 PRSs, we simulated a quantitative trait by summing up the effect of a PRS, a polygenic background, and a noise term.
Here, X denotes the genotype of samples; β is the effect size of each variant; P is the PRS of one of the selected traits; ρ is the effect size of the PRS; and ε is the error term following a standard normal distribution. The polygenic background and random noise (i.e. Xβ + ε) were simulated using the same model described in setting 2. This term and the PRS were normalized separately. The standardized effect size (i.e. ρ) was set as 0.02, 0.015, 0.01, 0.008, and 0.005 in our simulations. In this setting, simulated traits are directly associated with SNPs and PRSs. For each value of ρ, statistical power was calculated as the proportion of significant results (p<0.05) out of 100 traits.
GWAS datasets
Summary statistics for 4,357 UK Biobank traits were generated by Dr. Benjamin Neale’s group and were downloaded from http://www.nealelab.is/uk-biobank. AD summary statistics from the IGAP stage-I analysis were downloaded from the IGAP website (http://web.pasteur-lille.fr/en/recherche/u744/igap/igap_download.php). ADGC phase 2 summary statistics were generated by first analyzing individual datasets using logistic regression adjusting for age, sex, and the first three principal components in the program SNPTest v2 [41]. Meta-analysis of the individual dataset results was then performed using the inverse-variance weighted approach [42].
GWAS summary statistics for neuropathologic features of AD and related dementias were obtained from the ADGC. Details on these data have been previously reported [25]. We analyzed a total of 13 neuropathologic features, including four NP traits, two traits for NFT Braak stages, three traits for LBD, CAA, HS, and two VBI traits. Among different versions of the same pathology, we picked one dataset for each pathologic feature to show in our primary analyses, but results based on different datasets of the same pathologic feature were consistent (Supplementary Figure 8). Six AD subgroups were defined in a recent paper on cognitively defined AD subgroups [21] on the basis of relative performance in memory, executive functioning, visuospatial functioning, and language at the time of Alzheimer’s diagnosis. Each person’s average across these four domains was calculated, and differences identified between performance in each domain and the individual average score. A threshold of 0.80 SD units was identified as an indicator of a substantial relative impairment. Each person could have a substantial relative impairment in zero, one, or two or more domains. Those with zero were assigned to the “no domain with a relative impairment” group. Those with exactly one were assigned to the associated group, e.g. isolated relative memory impairment, isolated relative language impairment, etc. Those with two or more domains with a substantial relative impairment were assigned to the multiple domains with a substantial impairment group. Each of these subgroups was compared with healthy controls in case-control association analyses. We did not include the executive functioning subgroup in our analysis due to its small sample size in cases. GWAS of these subgroups is described in [20]; we analyzed the summary statistics from those analyses here. Detailed information about the design of CSF biomarker GWAS and the recent sex-stratified analysis has been described previously [23, 24]. Details on the association statistics for AD subgroups, CSF biomarkers, and neuropathological features are summarized in Supplementary Table 7.
Analysis of GWAS summary statistics
We applied LD score regression implemented in the LDSC software [43] to estimate the heritability of each trait. Among 4,357 traits, we selected 1,738 with nominally significant heritability (p<0.05) to include in our analyses. We removed SNPs with association p-values greater than 0.01 from each of the 1,738 summary statistics files, clumped the remaining SNPs using a LD cutoff of 0.1 and a radius of 1 Mb in PLINK [39], and built PRS for each trait using the effect size estimates of remaining SNPs.
Throughout the paper, we used samples of European ancestry in the 1000 Genomes Project as a reference panel to estimate LD [44]. In univariate analyses, we tested marginal associations between each PRS and AD using the IGAP stage-I dataset and replicated the findings using the ADGC summary statistics. Association results in two stages were combined using inverse variance-weighted meta-analysis [42]. A stringent Bonferroni-corrected significance threshold was used to identify AD-associated risk factors. For associations between identified risk factors and AD endophenotypes, we used an FDR cutoff of 0.05 to claim statistical significance. We applied hierarchical clustering to the covariance of 48 traits we identified from marginal association analysis, then divided the result into 15 clusters and selected one most significant trait from each cluster and used them to perform multivariate conditional analysis. We analyzed IGAP and ADGC datasets separately, and combined the results using meta-analysis.
We used MR-IVW approach [45] implemented in the Mendelian Randomization R package [46] to study the causal effects of 48 risk factors identified by BADGERS. For each trait, we selected instrumental SNP variables as the top 30 most significant SNPs after clumping all SNPs using a LD cutoff of 0.1.
Analysis of WRAP data
WRAP is a longitudinal study of initially dementia-free middle-aged adults that allows for the enrollment of siblings and is enriched for a parental history of AD. Details of the study design and methods used have been previously described [26, 47]. After quality control, a total of 1,198 participants whose genetic ancestry was primarily of European descent were included in our analysis. On average, participants were 53.7 years of age (SD=6.6) at baseline and had a bachelor’s degree, and 69.8% (n=836) were female. Participants had two to six longitudinal study visits, with an average of 4.3 visits, leading to a total of 5,184 observations available for analyses.
DNA samples were genotyped using the Illumina Multi-Ethnic Genotyping Array at the University of Wisconsin Biotechnology Center. Thirty-six blinded duplicate samples were used to calculate a concordance rate of 99.99%, and discordant genotypes were set to missing. Imputation was performed with the Michigan Imputation Server v1.0.3 [48], using the Haplotype Reference Consortium (HRC) v. r1.1 2016 [49] as the reference panel and Eagle2 v2.3 [50] for phasing. Variants with a quality score R2<0.80, MAF<0.001, or that were out of HWE were excluded, leading to 10,499,994 imputed and genotyped variants for analyses. Data cleaning and file preparation were completed using PLINK v1.9 [51] and VCFtools v0.1.14 [52]. Coordinates are based on the hg19 genome build. Due to the sibling relationships present in the WRAP cohort, genetic ancestry was assessed and confirmed using Principal Components Analysis in Related Samples (PC-AiR), a method that makes robust inferences about population structure in the presence of relatedness [53].
Composite scores were calculated for executive function, delayed recall, and learning based on a previous analysis [54]. Each composite score was calculated from three neuropsychological tests, which were each converted to z-scores using baseline means and standard deviations. These z-scores were then averaged to derive executive function, delayed recall, and learning composite scores at each visit for each individual. Cognitive impairment status was determined based on a consensus review by a panel of dementia experts. Resulting cognitive statuses included cognitively normal, early MCI, clinical MCI, impairment that was not MCI, or dementia, as previously defined [55]. Participants were considered cognitively impaired if their worst consensus conference diagnosis was early MCI, clinical MCI, or dementia (n=387). Participants were considered cognitively stable if their consensus conference diagnosis was cognitively normal across all visits (n=803).
The 48 PRSs were developed within the WRAP cohort using PLINK v1.9 [51] and tested for associations with the three composite scores (i.e. executive function, delayed recall, and learning) and cognitive impairment statuses. MCI status was tested using logistic regression models in R, while all other associations, which utilized multiple study visits, were tested using linear mixed regression models implemented in the lme4 package in R [56]. All models included fixed effects for age and sex, and cognitive composite scores additionally included a fixed effect for practice effect (using visit number). Mixed models included random intercepts for within-subject correlations due to repeated measures and within-family correlations due to the enrollment of siblings.
Software availability
The BADGERS software is freely available at https://github.com/qlu-lab/BADGERS
Competing financial interests
The authors declare no competing financial interests.
Author contribution
Q.L. conceived the study.
D.Y., B.H., and Q.L. developed the statistical framework.
D.Y., B.H., B.D., and Q.L. performed the statistical analyses.
D.Y. implemented the software.
S.M., B.W.K., Y.D., L.D., A.N., A.K., Y.Z., C.C., and T.J.H., assisted in preparation and curation of GWAS association statistics.
S.M. and P.K.C. assisted in data application and interpretation of association results.
B.F.D., S.C.J., and C.D.E. assisted in WRAP data preparation and interpretation.
Y.W. assisted in literature review and results interpretation.
Q.L. advised on statistical and genetic issues.
H.K. advised on causal inference.
D.Y. B.H., and Q.L. wrote the manuscript.
All authors contributed in manuscript editing and approved the manuscript.
Acknowledgments
This project was supported by the Clinical and Translational Science Award (CTSA) program, through the NIH National Center for Advancing Translational Sciences (NCATS), grant UL1TR000427, and by the University of Wisconsin-Madison Office of the Chancellor and the Vice Chancellor for Research and Graduate Education with funding from the Wisconsin Alumni Research Foundation. B.F.D. was supported by an NLM training grant to the Computation and Informatics in Biology and Medicine Training Program [NLM 5T15LM007359]. This research was also supported by the NIH grants R01 AG054047, R01 AG27161, and P2C HD047873, Helen Bader Foundation, Northwestern Mutual Foundation, Extendicare Foundation, and State of Wisconsin. C.C. was supported by NIH grants P01 AG003991, R01 AG057777, R01 AG044546, RF1 AG053303, RF1 AG058501, and U01 AG058922. T.J.H. was supported by NIH grants R01 AG059716, R21 AG059941, and K01 AG049164. The authors thank the University of Wisconsin Madison Biotechnology Center Gene Expression Center for providing Illumina Infinium genotyping services. We thank the International Genomics of Alzheimer’s Project (IGAP) for providing summary results data for these analyses. The investigators within IGAP contributed to the design and implementation of IGAP and/or provided data but did not participate in analysis or writing of this report. IGAP was made possible by the generous participation of the subjects and their families. The i-Select chips were funded by the French National Foundation on Alzheimer’s disease and related disorders. EADI was supported by the LABEX (laboratory of excellence program investment for the future) DISTALZ grant, Inserm, Institut Pasteur de Lille, Université de Lille 2, and the Lille University Hospital. GERAD was supported by the Medical Research Council (Grant n° 503480), Alzheimer’s Research UK (Grant n° 503176), the Wellcome Trust (Grant n° 082604/2/07/Z), and German Federal Ministry of Education and Research (BMBF): Competence Network Dementia (CND) grant n° 01GI0102, 01GI0711, 01GI0420. CHARGE was partly supported by the NIH/NIA grant R01 AG033193 and the NIA AG081220 and AGES contract N01–AG–12100, the NHLBI grant R01 HL105756, the Icelandic Heart Association, and the Erasmus Medical Center and Erasmus University. ADGC was supported by the NIH/NIA grants: U01 AG032984, U24 AG021886, U01 AG016976, and the Alzheimer’s Association grant ADGC–10–196728. We thank contributors who collected samples used in this study, as well as patients and their families, whose help and participation made this work possible; Data for this study were prepared, archived, and distributed by the National Institute on Aging Alzheimer’s Disease Data Storage Site (NIAGADS) at the University of Pennsylvania (U24-AG041689-01). We are also grateful for ADGC and its investigators for providing GWAS summary statistics for various AD phenotypes. The full acknowledgement to ADGC is included in the supplementary material. Summary statistics for cognitively defined subgroups was generated with support from R01AG042437. Data were from the Adult Changes in Thought (ACT) study, the Religious Orders Study (ROS), the Rush Memory and Aging Project (MAP), the Alzheimer’s Disease Neuroimaging Initiative (ADNI), and the University of Pittsburgh Alzheimer’s Center. ACT data collection was supported by U01 AG006781. ACT genotyping was supported by U01 HG006375. Cerebellum samples for genotyping for some ACT samples were prepared with support from P50 AG005136. ADNI data collection and genotyping were supported by U01 AG024904. Data collection and sharing was funded by U01 AG024904 and W81XWH-12-2-0012. ADNI is funded by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, and through generous contributions from the following: AbbVie, Alzheimer’s Association; Alzheimer’s Drug Discovery Foundation; Araclon Biotech; BioClinica, Inc.; Biogen; Bristol-Myers Squibb Company; CereSpir, Inc.; Cogstate; Eisai Inc.; Elan Pharmaceuticals, Inc.; Eli Lilly and Company; EuroImmun; F. Hoffmann-La Roche Ltd and its affiliated company Genentech, Inc.; Fujirebio; GE Healthcare; IXICO Ltd.; Janssen Alzheimer Immunotherapy Research & Development, LLC.; Johnson & Johnson Pharmaceutical Research & Development LLC.; Lumosity; Lundbeck; Merck & Co., Inc.; Meso Scale Diagnostics, LLC.; NeuroRx Research; Neurotrack Technologies; Novartis Pharmaceuticals Corporation; Pfizer Inc.; Piramal Imaging; Servier; Takeda Pharmaceutical Company; and Transition Therapeutics. The Canadian Institutes of Health Research is providing funds to support ADNI clinical sites in Canada. Private sector contributions are facilitated by the Foundation for the National Institutes of Health (www.fnih.org). The grantee organization is the Northern California Institute for Research and Education, and the study is coordinated by the Alzheimer’s Therapeutic Research Institute at the University of Southern California. ADNI data are disseminated by the Laboratory for Neuro Imaging at the University of Southern California. MAP data collection and genotyping were supported by R01 AG017917. Additional genotyping was supported by Kronos, Zinfandel, and U01 AG032984. ROS data collection and genotyping were supported by P30 AG10161, R01 AG15819, and R01 AG30146. Additional genotyping was supported by Kronos, Zinfandel, and U01 AG032984. PITT data collection were funded by P50 AG05133, R01 AG030653, and R01 AG041718.