Abstract
Genome-wide association studies (GWAS) have now been conducted for hundreds of phenotypes of relevance to human health. Many such GWAS involve multiple closely-related phenotypes collected on the same samples. However, the vast majority of these GWAS have been analyzed using simple univariate analyses, which consider one phenotype at a time. This is despite the fact that, at least in simulation experiments, multivariate analyses have been shown to be more powerful at detecting associations. Here, we conduct multivariate association analyses on 13 different publicly-available GWAS datasets that involve multiple closely-related phenotypes. These data include large studies of anthropometric traits (GIANT), plasma lipid traits (GlobalLipids), and red blood cell traits (HaemgenRBC). Our analyses identify many new associations (433 in total across the 13 studies), many of which replicate when follow-up samples are available. Overall, our results demonstrate that multivariate analyses can help make more effective use of data from both existing and future GWAS.
Author Summary
Genome-wide association studies (GWAS) have become a common and powerful tool for identifying significant correlations between markers of genetic variation and physical traits of interest. Often these studies are conducted by comparing genetic variation against single traits one at a time (‘univariate’); however, it has previously been shown that it is possible to increase power to detect significant associations by comparing genetic variation against multiple traits simultaneously (‘multivariate’). Despite this apparent increase in power, researchers still rarely conduct multivariate GWAS, even when studies have multiple traits readily available. Here, we reanalyze 13 previously published GWAS using a multivariate method and find >400 additional associations. Our method makes use of univariate GWAS summary statistics and is available as a software package, thus making it accessible to other researchers interested in conducting the same analyses. We also show, using studies that have multiple releases, that our new associations have high rates of replication. Overall, we argue that multivariate approaches in GWAS should no longer be overlooked, and that there is often low-hanging fruit, in the form of new associations, to be gained by running these methods on data already collected.
2 Introduction
Genome-wide association studies (GWAS) have been widely used to identify genetic factors – particularly single nucleotide polymorphisms (SNPs) and copy number variations (CNVs) – associated with human disease risk and other phenotypes of interest (Price et al., 2015; Visscher et al., 2017). Indeed, at the time of writing, over 24,000 such associations have been identified as ‘genome-wide significant’ (MacArthur et al., 2017).
The vast majority of these many genetic association analyses consider only one phenotype at a time (“univariate association analysis”). This is despite the fact that measurements on multiple phenotypes are often available, and joint association analysis of multiple phenotypes (“multivariate association analysis”) can substantially increase power (Jiang and Zeng, 1995; Zhu and Zhang, 2009; Shriner, 2012; Yang and Wang, 2012; Galesloot et al., 2014). There are likely multiple reasons for the preponderance of univariate analyses. One possible reason is that initial association analyses are usually performed under tight time constraints, and at a time when many other analysis issues (e.g. quality control, population stratification) are competing for attention. In these conditions it makes sense to focus on the simplest possible approach that will quickly yield new associations, without overly worrying about loss of efficiency. In addition analysts may be legitimately concerned that deviation from the most widely adopted analysis pipeline may invite unwanted additional reviewer attention.
Nonetheless, we believe that multivariate association analysis has an important role to play in making the most of costly and time-consuming GWAS studies. One way forward is to conduct multivariate analyses of previously-published GWAS, checking for additional associations that may have been missed by the initial univariate association analyses. This is greatly facilitated by the fact that many GWAS now make summary data from single-SNP tests freely available (Willer et al., 2013; Wood et al., 2014; Locke et al., 2015; Shungin et al., 2015; Astle et al., 2016), and that simple multivariate analysis can be conducted using such summary data (Stephens, 2013; Pickrell et al., 2016; Hormozdiari et al., 2016).
Here we demonstrate the potential benefits of reanalyzing published GWAS using multivariate methods. Specifically, we apply multivariate methods from Stephens 2013 to reanalyze 13 different GWAS whose initial publications reported only univariate results. In most cases our multivariate analyses find many new associations. For example, in GIANT 2014/5 we find over 150 new associations. In studies with multiple data releases, we find that new multivariate associations found in initial releases typically replicate in subsequent releases, supporting the view that many of the new associations are likely real. We also demonstrate that the multivariate approach is not equivalent to simply relaxing the univariate GWAS significance threshold. Finally, we point out some limitations of the specific framework we used here, and suggest some alternative strategies that may help address those limitations in future multivariate GWAS analyses.
3 Results
Multivariate association analyses
To facilitate multivariate association analyses using the methods from Stephens 2013, we implemented them in an R package bmass (Bayesian multivariate analysis of summary statistics). The software requires as input univariate GWAS summary statistics, for the same set of SNPs, on d related phenotypes. Then, for each SNP, it attempts to categorize each phenotype as belonging to one of three categories: Unassociated (U), Directly Associated (D), or Indirectly Associated (I) with the SNP. The difference between D and I is that an indirect association disappears after controlling for associations with other phenotypes (see Online Methods and Supplementary Figure 1).
For d phenotypes, there are $3^d$ possible assignments of phenotypes to these 3 categories, and each assignment corresponds to a different “model” γ. For example, one model corresponds to the “null” that all phenotypes are Unassociated; another model corresponds to the model that all phenotypes are Directly associated; another model corresponds to just the first phenotype being Directly associated, etc. The goal of the association analysis is to determine which of these models is consistent with the data and, in particular, to assess overall evidence against the null model.
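As a rough illustration of the size of this model space, the sketch below enumerates every assignment of d phenotypes to the three categories. It is not taken from bmass; the function name `enumerate_models` and the single-letter labels are assumptions made for this example only, and it simply lists all assignments counted in the text.

```python
from itertools import product

# Single-letter labels are an assumption of this sketch:
# U = Unassociated, D = Directly Associated, I = Indirectly Associated.
CATEGORIES = ("U", "D", "I")


def enumerate_models(d):
    """Return every assignment of d phenotypes to the three categories."""
    return list(product(CATEGORIES, repeat=d))


# Example with four phenotypes (e.g. four lipid traits).
models = enumerate_models(4)
null_model = ("U",) * 4          # all phenotypes Unassociated
all_direct = ("D",) * 4          # all phenotypes Directly Associated
first_only = ("D", "U", "U", "U")  # only the first phenotype Directly Associated

print(len(models))  # number of candidate models for d = 4
```

Because the count grows as a power of three, an exhaustive enumeration of this kind is only practical for a moderate number of phenotypes.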
The support in the data for model γ, relative to the null model, is summarized by a Bayes Factor (BFγ). Large values of BFγ indicate strong evidence for model γ compared against the null. One advantage of Bayes Factors over p-values is that the Bayes Factors from different models can be easily compared and combined. For example, the overall evidence against the null is given by the (weighted) average of these BFs:
$$\mathrm{BF}_{\mathrm{av}} = \sum_{\gamma} w_{\gamma} \, \mathrm{BF}_{\gamma},$$
where the weights wγ are chosen to reflect the relative plausibility of each model γ. In bmass we implemented the Empirical Bayes approach from Stephens 2013 that learns appropriate weights from the data (see Online Methods).
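The weighted average of Bayes Factors described above can be sketched numerically as follows. This is a minimal illustration rather than the bmass implementation: the function name `log10_bf_average` is hypothetical, working on the log10 scale is a choice made here to avoid overflow with very large Bayes Factors, and normalizing the weights to sum to one is an assumption of the sketch.

```python
import math


def log10_bf_average(log10_bfs, weights):
    """Return log10 of sum_gamma w_gamma * BF_gamma, computed stably.

    log10_bfs: log10 Bayes Factors, one per model gamma.
    weights:   non-negative model weights (normalized here; an assumption).
    """
    if len(log10_bfs) != len(weights):
        raise ValueError("need one weight per Bayes Factor")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    # log10(w_gamma * BF_gamma) for every model with non-zero weight
    terms = [lb + math.log10(w / total)
             for lb, w in zip(log10_bfs, weights) if w > 0]
    largest = max(terms)
    # log-sum-exp trick in base 10
    return largest + math.log10(sum(10 ** (t - largest) for t in terms))


# Two models with BF = 1 and BF = 10 and equal weights give BF_av = 5.5.
example = log10_bf_average([0.0, 1.0], [0.5, 0.5])
```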
Comparisons with published univariate analyses
To provide a benchmark against which to compare our multivariate analysis results, we compiled a list of “previous univariate associations”: SNPs that were both reported as significant in the original publication and exceeded the original publication’s definition for genome-wide significance in at least one phenotype in the publicly-available (univariate) summary data analyzed here. This does not include all SNPs reported in every original publication because in some studies SNPs became genome-wide significant only after adding additional samples not included in the publicly available summary data.
We used these previous univariate associations to determine a significance threshold for our multivariate associations. Specifically, we declared a multivariate association as significant if its BFav exceeds that of any previous univariate association’s BFav in the same study (Stephens, 2013). The rationale is that the evidence for these multivariate associations exceeds the evidence for previously-reported genome-wide significant associations, which are generally regarded as likely to be (mostly) real associations.
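The thresholding rule above can be sketched as follows. This is an illustrative reading, not the authors' code: it interprets "exceeds that of any previous univariate association" as exceeding the weakest previous association, and the function names and toy values are hypothetical.

```python
def significance_threshold(bfav_by_snp, previous_hits):
    """Smallest BF_av among the previous univariate associations.

    Taking the minimum is an assumption of this sketch.
    """
    return min(bfav_by_snp[snp] for snp in previous_hits)


def candidate_new_associations(bfav_by_snp, previous_hits):
    """SNPs whose BF_av exceeds the threshold and that are not previous hits."""
    threshold = significance_threshold(bfav_by_snp, previous_hits)
    previous = set(previous_hits)
    return sorted(snp for snp, bf in bfav_by_snp.items()
                  if bf > threshold and snp not in previous)


# Toy log10 BF_av values; the rsIDs are placeholders, not real results.
toy_bfav = {"rs1": 8.0, "rs2": 5.0, "rs3": 6.5, "rs4": 3.0}
toy_previous = ["rs1", "rs2"]
toy_new = candidate_new_associations(toy_bfav, toy_previous)
```

Because the threshold is derived from each study's own published hits, it differs between studies rather than being a single fixed cut-off.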
Finally, we defined a list of “new multivariate associations”, which are SNPs that are significant in our multivariate analysis but are not a “previous univariate association”. To avoid double-counting of signals due to linkage disequilibrium (LD), we pruned the list of new multivariate associations so that they are all at least 0.5Mb apart. For additional details, see Online Methods.
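One way to enforce the 0.5Mb spacing is a greedy pass that keeps the strongest signal first. The sketch below is an assumption about how such pruning could be done, not a description of the procedure in Online Methods; the tuple layout and the function name `prune_by_distance` are hypothetical.

```python
def prune_by_distance(snps, min_gap=500_000):
    """Greedily keep SNPs so that kept SNPs on a chromosome are >= min_gap apart.

    snps: iterable of (snp_id, chromosome, position, score) tuples, where a
          larger score means stronger evidence (e.g. log10 BF_av).
    """
    kept = []
    # Visit SNPs from strongest to weakest evidence.
    for snp in sorted(snps, key=lambda s: s[3], reverse=True):
        _, chrom, pos, _ = snp
        far_enough = all(c != chrom or abs(pos - p) >= min_gap
                         for _, c, p, _ in kept)
        if far_enough:
            kept.append(snp)
    return kept


toy_snps = [
    ("a", "1", 1_000_000, 9.0),
    ("b", "1", 1_200_000, 7.0),   # within 0.5Mb of "a", so dropped
    ("c", "1", 1_600_000, 6.0),   # 0.6Mb from "a", so kept
    ("d", "2", 1_100_000, 5.0),   # different chromosome, so kept
]
kept_ids = [s[0] for s in prune_by_distance(toy_snps)]
```

Physical distance is only a proxy for LD, so a pruning rule of this kind is conservative in some regions and permissive in others.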
Many new loci identified in reanalyzing 13 publicly available GWAS studies
We applied bmass to 13 publicly available GWAS studies, representing 10 different collections of phenotypes (Table 1). Phenotypic collections include blood lipid traits (GlobalLipids: (Teslovich et al., 2010; Willer et al., 2013)), body morphological traits (GIANT: (Lango Allen et al., 2010; Speliotes et al., 2010; Heid et al., 2010; Wood et al., 2014; Locke et al., 2015; Shungin et al., 2015)), red blood cell traits (HaemgenRBC: (van der Harst et al., 2012; Astle et al., 2016)), blood pressure traits (International Consortium for Blood Pressure Genome-Wide Association et al., 2011; Wain et al., 2011), bone density traits (Zheng et al., 2015), and kidney function traits (Kottgen et al., 2010; Boger et al., 2011). For three of these phenotypic collections (GlobalLipids, GIANT, and HaemgenRBC), two different releases were available from the source consortiums. We conducted basic QC as described in Online Methods.
Our multivariate analyses identify, in total, hundreds of new associations. The numbers of previous univariate associations and new multivariate associations are summarized in Figure 1 (see also Supplementary Table 2). For example, we identify 162 new multivariate associations in GIANT2014/5, 65 in GlobalLipids2013, and 60 in HaemgenRBC2016. These represent power increases from 10% to 45% compared with previous univariate analyses.
Replication of multivariate associations across releases
To demonstrate that many of these new multivariate associations are likely to be real we take advantage of three datasets that each have two releases separated by several years (GlobalLipids, GIANT, and HaemgenRBC). In each case we performed multivariate association analysis of the earlier release and checked how the new multivariate associations fared in univariate analyses of the later release (Figure 2). Since later releases include the samples from earlier releases, to assess “replication” we focus on whether the association in the new release is more significant than the original release – that is, whether the signal in the new (non-overlapping) samples provides additional evidence over and above the original signal. By this measure the results show high replication rates for the new multivariate associations: in total, 84 of 94 new associations have smaller minimum univariate p-values in the later release (at exactly the same SNP), and indeed the majority of these reach univariate GWAS significance in the later release.
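The replication measure described above amounts to comparing, SNP by SNP, the minimum univariate p-value in the earlier and later releases. A minimal sketch follows; the function name, the dictionary inputs, and the default threshold of 5e-8 are assumptions for illustration (individual studies use their own thresholds).

```python
def replication_summary(early_min_p, late_min_p, gwas_threshold=5e-8):
    """Count SNPs whose minimum univariate p-value shrinks in the later release.

    early_min_p, late_min_p: dicts mapping SNP id -> minimum univariate
    p-value across phenotypes in the earlier / later release.
    Returns (n_more_significant, n_compared, n_reaching_threshold).
    """
    shared = [snp for snp in early_min_p if snp in late_min_p]
    improved = [snp for snp in shared if late_min_p[snp] < early_min_p[snp]]
    significant = [snp for snp in improved if late_min_p[snp] < gwas_threshold]
    return len(improved), len(shared), len(significant)


# Toy values only; these are not results from any of the studies.
toy_early = {"rs1": 1e-6, "rs2": 3e-7, "rs3": 2e-6}
toy_late = {"rs1": 1e-9, "rs2": 5e-7, "rs3": 1e-7}
toy_summary = replication_summary(toy_early, toy_late)
```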
Multivariate analysis is different from multiple univariate analyses
Because multivariate analysis takes account of joint patterns across phenotypes, its ranking of significance of SNPs can change compared with that from the univariate p-values alone. That is, multivariate analysis is not simply equivalent to multiple univariate analyses. To illustrate this we examined, in three well-powered studies, the associations that would be declared significant if the univariate significance threshold were relaxed, and assessed which of them would also be significant in our multivariate analysis (i.e. we assess whether, if we go deeper into the univariate results, we find the same SNPs as appear in our multivariate results). The results are shown in Figure 3. Although there is, understandably, substantial overlap between the significant SNPs, any non-trivial relaxation of the univariate threshold includes many SNPs that are not multivariate significant in our analysis; for example, at a univariate threshold of $5 \times 10^{-7}$ only 66% of the univariate significant SNPs are also multivariate significant across these three studies. This demonstrates that, indeed, our multivariate approach reorders significance of SNPs compared with multiple univariate analyses.
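The overlap comparison can be sketched as a simple set calculation: relax the univariate threshold, collect the SNPs that pass, and ask what fraction are also multivariate significant. The function name and toy inputs below are hypothetical and are not the values underlying Figure 3.

```python
def fraction_also_multivariate(min_univariate_p, multivariate_hits,
                               relaxed_threshold):
    """Fraction of SNPs passing a relaxed univariate threshold that are
    also multivariate significant; None if no SNP passes."""
    relaxed = {snp for snp, p in min_univariate_p.items()
               if p < relaxed_threshold}
    if not relaxed:
        return None
    return len(relaxed & set(multivariate_hits)) / len(relaxed)


# Toy data: three SNPs pass p < 5e-7, two of which are multivariate hits.
toy_p = {"a": 1e-8, "b": 2e-7, "c": 4e-7, "d": 1e-5}
toy_hits = {"a", "b", "d"}
toy_fraction = fraction_also_multivariate(toy_p, toy_hits, 5e-7)
```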
Reanalysis also identifies new univariate associations
During our multivariate reanalyses we noticed many SNPs that appeared to be genome-wide univariate significant but were – somewhat mysteriously – not reported as such by the original studies (i.e. SNPs whose univariate p-values crossed the significance threshold, as defined by the given study, in at least one trait). Supplementary Table 1 reports 79 such associations.
There may be many reasons why such variants went unreported, but one reason may be physical proximity to a variant with a stronger signal. Indeed, more than half of the variants described above are within 1Mb of a previously-reported univariate GWAS association. Refraining from reporting multiple near-by associations seems a reasonable – if conservative – strategy to avoid reporting redundant associations due to LD. Further, even when redundant associations due to LD can be ruled out (e.g. by directly examining LD rather than by simply using physical distance), it might be assumed that multiple nearby associated variants may all act through the same biological mechanism and therefore provide redundant biological insights. However, we found that multi-phenotype patterns of association can differ between nearby SNPs, suggesting that they act through different mechanisms.
To highlight just one example, consider rs7515577 – which is an original univariate association in GlobalLipids2010 – and rs12038699 – which is a new multivariate association in GlobalLipids2013. We note that rs12038699 actually reached univariate genome-wide significance in the GlobalLipids2013 dataset, but was not reported (Supplementary Table 6). This is possibly because the latter SNP is relatively close, in genomic terms, to the former SNP (549kb). However, these SNPs are not in strong LD ($r^2 = 0.08$), and so these associations almost certainly represent non-redundant associations. This is further supported by the effect sizes in each phenotype, which clearly reveal very different multivariate patterns of effect sizes among phenotypes (Supplementary Figure 2 & Supplementary Table 6). Indeed the very different multivariate patterns of effect size suggest that not only are these associations non-redundant but likely involve different biological mechanisms as well.
These results suggest that, moving forward, it may pay to be more careful in designing filters designed to avoid reporting redundant associations, and that multi-phenotype analyses may have a helpful role to play here.
Limitations
One goal of the multivariate approach introduced in Stephens 2013 was to increase interpretability of multivariate analyses; in particular, the goal was to not only test for associations but also to help explain the associations by partitioning the phenotypes into “Unassociated”, “Directly Associated”, and “Indirectly Associated” categories. In principle one might hope to use these classifications to gain insights into the relationships among phenotypes and also perhaps to identify different “types” of multivariate association - effectively clustering associations into different groups. However, in practice we find that these discrete classifications are often not as helpful as one might hope. One reason is the difficulty of reliably distinguishing between direct and indirect effects (Stephens, 2013). Another reason is widespread associations with multiple phenotypes. Indeed, we find that, consistently across data sets, the most common multivariate models involve associations – either direct or indirect – with many phenotypes (Supplementary Table 7) and many SNPs are classified as being associated with many phenotypes (Figure 4A). Further, SNPs are very rarely confidently classified as “Unassociated” with any phenotype (Figure 4B). This last observation can be explained by the fact that it is essentially impossible to distinguish ‘unassociated’ from ‘weakly associated’. Nonetheless when all SNPs show similar classifications it is difficult to get insights into different patterns of multivariate association.
Moving forward, rather than relying on the discrete classifications of “Unassociated”, “Directly Associated”, and “Indirectly Associated” to identify different patterns of multivariate association, we believe it will be more fruitful to use multivariate methods that take a more quantitative approach, such as identifying different patterns of effect size (including direction of effect) among phenotypes (Urbut et al., 2017). Focusing on effect sizes has the potential to be much more informative than discrete classification, which can hide effect size differences. For example, when multiple SNPs are classified as associated with all phenotypes, they can still show very different patterns of estimated effect sizes/direction (see Supplementary Figure 3).
Another limitation of our multivariate methods is that they can lead to (what appear to be) false positive associations when applied to test SNPs with very low minor allele frequencies. Specifically we saw examples where very low-frequency SNPs (e.g. MAF < .001) showed strong signals of multivariate association despite showing very little signal in any univariate test. Although such results are not impossible, we believe that most of these cases were likely false positives, and we applied a MAF cut-off (of 0.01 or 0.005) to avoid these issues. Consequently we recommend caution in interpreting results of multivariate analyses at very low-frequency SNPs, and more generally we recommend that multivariate results be compared against univariate results to check they make sense – highly significant multivariate associations that do not also show at least a moderate level of univariate association should be treated with caution.
4 Discussion
We reanalyzed 13 publicly available GWAS datasets using a Bayesian multivariate approach and identified many new genetic associations. Turning genetic associations into biological discoveries remains, of course, a challenging problem. Nonetheless, our results suggest that the increased power of multivariate association analysis that has been reported in many simulation studies (Stephens, 2013; Galesloot et al., 2014; Porter and O’Reilly, 2017) also translates to discovery of many new associations in practice.
Our results exploit the public availability of summary data from several large GWAS. Despite progress toward easier availability of individual-level data for large studies (Sudlow et al., 2015), in many cases summary data remain much easier to obtain and work with; there are big practical advantages as well to modular pipelines that first compute summary data and then use these as inputs to subsequent (more sophisticated) analyses. For example, the multivariate analyses we present here are simplified by assuming that the summary data were computed while adequately adjusting for population stratification. And our results illustrate the potential for reanalysis of summary data to yield novel inferences. In this regard we also emphasize the importance of consortia releasing carefully-chosen summaries. For example, Z-scores are much more helpful than p-values because they preserve information on the direction of the effect. Even better would be both the effect size and standard error that created the Z-score. More generally, although not necessarily essential for our analyses here, it is always helpful to include additional key meta-data (e.g. the reference allele, or effect allele, the minor allele frequency, and sample size).
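The point about Z-scores versus p-values can be made concrete with a short sketch. It assumes the usual Wald form Z = effect size / standard error and a two-sided normal p-value; the function names are hypothetical. Two effects of equal magnitude and opposite sign give different Z-scores but the same p-value, so the direction cannot be recovered from the p-value alone.

```python
import math


def z_score(beta, se):
    """Wald Z-score from an effect size estimate and its standard error."""
    return beta / se


def two_sided_p(z):
    """Two-sided p-value for a standard normal Z-score."""
    return math.erfc(abs(z) / math.sqrt(2))


z_up = z_score(0.10, 0.02)     # trait-increasing effect
z_down = z_score(-0.10, 0.02)  # trait-decreasing effect of the same size

# The sign differs between the Z-scores but the p-values are identical.
same_p = two_sided_p(z_up) == two_sided_p(z_down)
```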
The specific multivariate methods used here were derived under the assumption that the summary data from each phenotype have been obtained from the same sampled individuals (which is true, at least approximately, for studies analyzed here). However, multivariate analysis of summary data is also possible even when data were obtained from different samples for each phenotype. The main difference between these settings is that, for data from overlapping samples, the “noise” is correlated as well as the signal: i.e. the summary data are correlated under the null due to sample overlap, and correlated under the alternative due to both sample overlap and any shared genetic effects. In contrast, for data from non-overlapping samples the noise is uncorrelated (whereas the signal may remain correlated if genetic factors are shared). Our methods use data at (empirically) null SNPs to estimate the noise correlation, and so their overall assessment of associations should be relatively robust to whether samples for different phenotypes overlap (however, our definitions of D (direct) vs I (indirect) associations require the same samples to be measured across phenotypes).
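Estimating the noise correlation from empirically null SNPs can be sketched as below for a pair of phenotypes. The rule used here to call a SNP "null" (both |Z| below 2) is an assumption of this example, as are the function name and the toy values; the exact criterion used in the analysis is given in Online Methods.

```python
import math


def null_correlation(z_a, z_b, max_abs_z=2.0):
    """Pearson correlation of two phenotypes' Z-scores over SNPs that look
    null in both (|Z| < max_abs_z; this cut-off is an assumption)."""
    pairs = [(a, b) for a, b in zip(z_a, z_b)
             if abs(a) < max_abs_z and abs(b) < max_abs_z]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two null SNPs")
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    var_a = sum((a - mean_a) ** 2 for a, _ in pairs)
    var_b = sum((b - mean_b) ** 2 for _, b in pairs)
    if var_a == 0 or var_b == 0:
        raise ValueError("Z-scores have zero variance among null SNPs")
    return cov / math.sqrt(var_a * var_b)


# The last SNP has large |Z| in both phenotypes and is excluded as non-null.
toy_z_a = [0.5, -1.0, 1.5, 3.5]
toy_z_b = [0.25, -0.5, 0.75, -4.0]
toy_r = null_correlation(toy_z_a, toy_z_b)
```

With non-overlapping samples this estimate would be expected to sit near zero, which is why the same procedure can accommodate either setting.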
Moving forward, we expect multivariate association analyses to play an increasingly important role in detecting and understanding genetic associations and relationships among phenotypes. Large studies are now collecting, and making available, rich human genetic and phenotypic information on many complex phenotypes, most notably the UKBioBank (Sudlow et al., 2015). In addition, there are increasingly large studies linking genetic variation and molecular phenotypes such as gene expression (e.g. the GTEx project (GTEx Consortium, 2013)), as well as epigenetic modifications and transcript degradation (Gaffney, 2013; Pai et al., 2015; Birney et al., 2016; Stricker et al., 2017). Analysis of multiple molecular traits can help yield insights into causal connections among traits (Li et al., 2016), and joint analysis of molecular traits with complex phenotypes may also shed light on functional mechanisms (as in “co-localization” analyses (Hormozdiari et al., 2016; Li and Kellis, 2016; Zhu et al., 2016; Wen et al., 2017)). Even simply moving from single phenotype to pairwise analysis can shed considerable light on sharing of genetic effects and possible causal connections (Pickrell et al., 2016; Shi et al., 2017).
These increasingly-complex new data also bring new analytic and computational challenges. Here we have restricted our analyses to relatively small sets of closely-related traits, and indeed the specific multivariate framework we used here – which performs an exhaustive search over all possible multivariate models – is fully tractable for only moderate numbers of traits (up to about 10). Scaling methods up to dealing with larger number of traits may well be helpful for some settings, and recent multivariate analysis methods can deal with dozens of outcomes (Dahl et al., 2016; Urbut et al., 2017). In addition, developing multivariate methods to perform fine-mapping of genetic associations simultaneously across multiple phenotypes (Lewin et al., 2016) seems an important and challenging area for future work.
5 URLs
bmass R package: https://github.com/mturchin20/bmass
7 Author Contributions
MS conceived the original statistical framework. MS and MCT conceived the study design. MCT performed the data collection, processing, and analyses. MCT wrote the R package bmass. MS supervised the project. MCT and MS wrote the paper.
8 Materials and Methods
8.1 GWAS Datasets
Below are specific details regarding retrieval and data-processing for each dataset analyzed. Where applicable, these details include the sample size (N), minor allele frequency (MAF), and p-value thresholds that were applied (based on the thresholds used in the original publications). For each dataset variants were dropped if they satisfied at least one of the following criteria: did not contain information for every phenotype; had missing MAF; were fixed (MAF of 0); had effect size exactly 0 (i.e. direction of effect would be indeterminable); or did not contain the same reference and alternative alleles across each phenotype. For a handful of studies, external databases were used to retrieve chromosome, basepair information, and MAF based on rsID#; in these studies SNPs for which this information could not be retrieved were also dropped.
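The variant-level exclusion criteria listed above can be expressed as a single predicate. The sketch below assumes a simple dictionary layout per variant (a `maf` entry plus per-phenotype `beta`, `ref`, and `alt`); this layout and the function name `passes_qc` are hypothetical and do not reflect the file formats of any particular consortium.

```python
def passes_qc(variant, phenotypes):
    """Return True if a variant survives the exclusion criteria.

    variant: {"maf": float or None,
              "stats": {phenotype: {"beta": float, "ref": str, "alt": str}}}
    """
    stats = variant.get("stats", {})
    # Must have information for every phenotype.
    if any(stats.get(p) is None for p in phenotypes):
        return False
    # MAF must be present and the variant must not be fixed.
    maf = variant.get("maf")
    if maf is None or maf == 0:
        return False
    # An effect size of exactly 0 leaves the direction indeterminable.
    if any(stats[p]["beta"] == 0 for p in phenotypes):
        return False
    # Reference and alternative alleles must agree across phenotypes.
    alleles = {(stats[p]["ref"], stats[p]["alt"]) for p in phenotypes}
    return len(alleles) == 1


toy_phenotypes = ["HDL", "LDL"]
toy_variant = {
    "maf": 0.2,
    "stats": {
        "HDL": {"beta": 0.10, "ref": "A", "alt": "G"},
        "LDL": {"beta": -0.05, "ref": "A", "alt": "G"},
    },
}
toy_ok = passes_qc(toy_variant, toy_phenotypes)
```

Study-specific filters such as the minimum N and MAF thresholds would be applied in addition to this predicate.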
GlobalLipids2010 (Teslovich et al., 2010): Original merged, processed, and GWAS-hit annotated summary data from Stephens 2013 (Stephens, 2013) for HDL, LDL, TG, and TC was downloaded from https://github.com/stephens999/multivariate (dtlesssignif.annot.txt and RSS0.txt).
GlobalLipids2013 (Willer et al., 2013): Summary data for HDL, LDL, TG, and TC was downloaded from http://csg.sph.umich.edu/abecasis/public/lipids2013/. We used a minimum N threshold of 50,000, a MAF threshold of 1%, and a univariate significant GWAS p-value threshold of $5 \times 10^{-8}$. All variants were oriented to the HDL minor allele. The final merged and QC’ed datafile contained 2,004,701 SNPs. rsID#’s of published GWAS SNPs were retrieved for all four phenotypes from https://www.nature.com/ng/journal/v45/n11/full/ng.2797.html via Supplementary Tables 2 and 3.
GIANT2010 (Lango Allen et al., 2010; Speliotes et al., 2010; Heid et al., 2010): Summary data for Height, BMI, and WHRadjBMI were downloaded from https://www.broadinstitute.org/collaboration/giant/index.php/GIANT_consortium_data_files. We used a minimum N threshold of 50,000, a MAF threshold of 1%, and a univariate significant GWAS p-value threshold of $5 \times 10^{-8}$. Chromosome and basepair position per variant were retrieved from dbSNP130 (Sherry et al., 2001). All variants were oriented to the Height minor allele. The final merged and QC’ed datafile contained 2,363,881 SNPs. rsID#’s of published GWAS SNPs were retrieved for Height from https://www.nature.com/nature/journal/v467/n7317/full/nature09410.html via Supplementary Table 1, for BMI from https://www.nature.com/ng/journal/v42/n11/full/ng.686.html via Table 1, and for WHRadjBMI from https://www.nature.com/ng/journal/v42/n11/full/ng.685.html via Table 1.
GIANT2014/5 (Wood et al., 2014; Locke et al., 2015; Shungin et al., 2015): Summary data for Height, BMI, and WHRadjBMI were downloaded from https://www.broadinstitute.org/collaboration/giant/index.php/GIANT_consortium_data_files. We used a minimum N threshold of 50,000, a MAF threshold of 1%, and a univariate significant GWAS p-value threshold of $5 \times 10^{-8}$. Chromosome and basepair position per variant were retrieved from dbSNP130 (Sherry et al., 2001). All variants were oriented to the Height minor allele. The final merged and QC’ed datafile contained 2,340,715 SNPs. rsID#’s of published GWAS SNPs were retrieved for Height from https://www.nature.com/ng/journal/v46/n11/full/ng.3097.html via Supplementary Table 1, for BMI from https://www.nature.com/nature/journal/v518/n7538/full/nature14177.html via Supplementary Tables 1 and 2, and for WHRadjBMI from https://www.nature.com/nature/journal/v518/n7538/full/nature14132.html via Supplementary Table 4.
HaemgenRBC2012 (van der Harst et al., 2012): Summary data for RBC, PCV, MCV, MCH, MCHC, and Hb were downloaded from the European Genome-Phenome Archive via accession number EGAS00000000132 (https://www.ebi.ac.uk/ega/studies/EGAS00000000132). We used a minimum N threshold of 10,000, a MAF threshold of 1%, and a univariate significant GWAS p-value threshold of $1 \times 10^{-8}$. Chromosome, basepair position, and MAF per variant were retrieved from HapMap release 22 (International HapMap, 2003). All variants were oriented to the RBC minor allele. The final merged and QC’ed datafile contained 2,327,567 SNPs. rsID#’s of published GWAS SNPs were retrieved for all six phenotypes from https://www.nature.com/nature/journal/v492/n7429/full/nature11677.html via Table 1.
HaemgenRBC2016 (Astle et al., 2016): Summary data for RBC, PCV, MCV, MCH, MCHC, and Hb were shared via personal communication with the authors. We used a MAF threshold of 1% and a univariate significant GWAS p-value threshold of $8.319 \times 10^{-9}$. Since sample size was not provided per variant, the following overall study sample sizes were used as proxies per phenotype: 172,952 for RBC, 172,433 for PCV, 173,039 for MCV, 172,332 for MCH, 172,925 for MCHC, and 172,851 for Hb. All variants were oriented to the RBC minor allele. Only SNPs were analyzed. The final merged and QC’ed datafile contained 8,649,095 SNPs. We then used these summary data to create a list of (non-redundant) “Previous univariate associations”. This was done separately for each phenotype by collecting all SNPs that exceeded the univariate significant GWAS p-value threshold and greedily pruning the SNPs: i.e. we went down the list, removing SNPs that were less significant than another SNP within 500kb. The pruned lists of previous univariate associations for each phenotype were then combined to produce the final SNP list of “published GWAS results”. Published CNVs that tagged regions that were not identified by this ‘final published SNP list’ were also included to avoid erroneously claiming downstream a region as a ‘new unpublished result’; these CNVs however were only used to mask additional loci as being ‘nearby a published univariate GWAS result’ and for nothing else in the bmass analysis pipeline.
ICBP2011 (International Consortium for Blood Pressure Genome-Wide Association et al., 2011; Wain et al., 2011): Summary data for SBP, DBP, PP, and MAP were downloaded from dbGaP via accession number phs000585.v1.p1 (https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000585.v1.p1). We used a minimum N threshold of 10,000, a MAF threshold of 1%, and a univariate significant GWAS p-value threshold of $5 \times 10^{-8}$. Chromosome and basepair position per variant were retrieved from HapMap release 21 (International HapMap, 2003). All variants were oriented to the SBP minor allele. The final merged and QC’ed datafile contained 2,387,851 SNPs. rsID#’s of published GWAS SNPs were retrieved for SBP and DBP from https://www.nature.com/nature/journal/v478/n7367/full/nature10405.html via Supplementary Table 5, and for PP and MAP from https://www.nature.com/ng/journal/v43/n10/full/ng.922.html via Table 1 and Supplementary Table 2F. Additionally, we gratefully acknowledge the International Consortium for Blood Pressure Genome-Wide Association Studies (Nature. 2011 Sep 11;478(7367):103-9, Nat Genet. 2011 Sep 11;43(10):1005-11) for generating and sharing these data.
MAGIC2010 (Dupuis et al., 2010): Summary data for FstIns, FstGlu, HOMA_B, and HOMA_IR were downloaded from https://www.magicinvestigators.org/downloads/. We used a MAF threshold of 1% and a univariate significant GWAS p-value threshold of $5 \times 10^{-8}$. Since sample size was not provided per variant, the overall study sample size of 46,186 was used as a proxy. Chromosome and basepair position per variant were retrieved from HapMap release 22 (International HapMap, 2003). All variants were oriented to the FstIns minor allele. The final merged and QC’ed datafile contained 2,333,328 SNPs. rsID#’s of published GWAS SNPs were retrieved for all four phenotypes from https://www.nature.com/ng/journal/v42/n2/full/ng.520.html via Table 1.
GEFOS2015 (Zheng et al., 2015): Summary data for FA, FN, and LS were downloaded from http://www.gefos.org/?q=content/data-release-2015. We used a MAF threshold of 0.5% and a univariate significant GWAS p-value threshold of $1.2 \times 10^{-8}$. Since sample size was not provided per variant, the overall study sample size of 32,965 was used as a proxy. All variants were oriented to the FA minor allele. The final merged and QC’ed datafile contained 8,938,035 SNPs. rsID#’s of published GWAS SNPs were retrieved for all three phenotypes from https://www.nature.com/nature/journal/v526/n7571/full/nature14878.html via Supplementary Table 13.
GIS2014 (Benyamin et al., 2014): Summary data for Iron, Sat, TrnsFrn, and Log10Frtn were shared via personal communication with the authors. We used a MAF threshold of 1% and a univariate significant GWAS p-value threshold of 5 × 10−8. Since sample size was not provided per variant, the overall study sample size of 48,972 was used as a proxy. All variants were oriented to the Iron minor allele. The final merged and QC’ed datafile contained 1,985,313 SNPs. rsID#’s of published GWAS SNPs were retrieved for all four phenotypes from https://www.nature.com/articles/ncomms5926/ via Table 1.
SSGAC2016 (Barban et al., 2016): Summary data for NEB_Pooled and AFB_Pooled were downloaded from https://www.thessgac.org/data. We used a MAF threshold of 1% and a univariate significant GWAS p-value threshold of 5 × 10−8. Since sample size was not provided per variant, the following overall study sample sizes were used as proxies per phenotype: 251,151 for NEB_Pooled and 343,072 for AFB_Pooled. All variants were oriented to the NEB_Pooled minor allele. The final merged and QC’ed datafile contained 2,395,561 SNPs. rsID#’s of published GWAS SNPs were retrieved for both phenotypes from https://www.nature.com/ng/journal/v48/n12/full/ng.3698.html via Table 1.
CKDGen2010/1 (Kottgen et al., 2010; Boger et al., 2011): Summary data for Crea, Cys, CKD, UACR, and MA were downloaded from https://www.nhlbi.nih.gov/research/intramural/researchers/pi/fox-caroline/datasets. We used a MAF threshold of 1% and a univariate significant GWAS p-value threshold of 5 × 10−8. Since sample size was not provided per variant, the following overall study sample sizes were used as proxies per phenotype: 67,093 for Crea, 20,957 for Cys, 62,237 for CKD, 31,580 for UACR, and 30,482 for MA. All variants were oriented to the Crea minor allele. The final merged and QC’ed datafile contained 2,333,498 SNPs. rsID#’s of published GWAS SNPs were retrieved for Crea, Cys, and CKD from https://www.nature.com/ng/journal/v42/n5/full/ng.568.html via Table 2.
ENIGMA22015 (Hibar et al., 2015): Summary data for ICV, Accumbens, Amygdala, Caudate, Hippocampus, Pallidum, Putamen, and Thalamus were downloaded from http://enigma.ini.usc.edu/research/download-enigma-gwas-results/. We used a minimum N threshold of 10,000, a MAF threshold of 1%, and a univariate significant GWAS p-value threshold of 5 × 10−8. All variants were oriented to the ICV minor allele. The final merged and QC’ed datafile contained 6,271,117 SNPs. rsID#’s of published GWAS SNPs were retrieved for all 8 phenotypes from https://www.nature.com/nature/journal/v520/n7546/full/nature14101.html via Table 1.
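The per-dataset QC steps described above (a MAF threshold, an optional minimum N threshold, a study-level sample size used as a proxy when per-variant N is missing, and orientation to one phenotype's minor allele) can be illustrated with a minimal Python sketch. This is not the pipeline used for the analyses: the names VariantRecord, passes_qc, and orient are hypothetical, variants are assumed to be biallelic, and a variant is assumed to be dropped when its MAF falls strictly below the threshold.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VariantRecord:
    """Hypothetical container for one variant's summary data in one phenotype."""
    rsid: str
    maf: float             # minor allele frequency as a fraction (0.01 == 1%)
    n: Optional[int]       # per-variant sample size; None when not provided
    effect_allele: str     # allele the reported effect direction refers to
    beta_sign: int         # +1 or -1, direction of effect for effect_allele


def passes_qc(rec, maf_threshold, study_n=None, min_n=None):
    """Return True if the variant survives the MAF and optional minimum-N filters.

    When the per-variant sample size is missing, the overall study sample
    size (study_n) is used as a proxy.
    """
    n = rec.n if rec.n is not None else study_n
    if rec.maf < maf_threshold:
        return False
    if min_n is not None and (n is None or n < min_n):
        return False
    return True


def orient(rec, reference_allele):
    """Report the effect relative to reference_allele.

    Assumes a biallelic variant: if the record is expressed relative to the
    other allele, the sign of the effect is flipped.
    """
    if rec.effect_allele == reference_allele:
        return rec
    return VariantRecord(rec.rsid, rec.maf, rec.n, reference_allele, -rec.beta_sign)
```

For example, a MAGIC2010-style call would be passes_qc(rec, 0.01, study_n=46186), while an ENIGMA22015-style call would add min_n=10000.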
8.2 bmass
bmass implements in an R package the statistical methods described in Stephens 2013, which should be consulted for full details. In particular, the sections “Computation” and “Detailed Methods (Global Lipids Analysis)” in Stephens 2013 describe how multivariate analyses are applied to GWAS summary data, and bmass implements the data analysis pipeline described in the “Detailed Methods (Global Lipids Analysis)” section. The bmass R package also includes two vignettes to help users begin processing GWAS summary data and implementing these methods.
8.3 Additional Details for Figure 3
For each dataset we made a list of “marginally-significant” SNPs, with p-values smaller than 1 × 10−6 but not genome-wide significant at the relevant dataset’s GWAS threshold. We then greedily pruned these lists of marginally-significant SNPs: that is, we repeatedly went through the lists removing SNPs that were less significant than another SNP within 500kb. We then removed any SNPs that were within 500kb of a new multivariate association, merged the resulting list with the list of new multivariate associations, and sorted this merged list of SNPs by their minimum univariate p-value.
This results in a non-redundant list of marginally-significant SNPs – some of which are new multivariate associations and some of which are not – sorted by their smallest univariate p-value. The plot shows how the number of SNPs of each type varies as the p-value threshold is relaxed from the GWAS threshold to 10−6 (the HaemgenRBC2016 results show only the top 500 SNPs due to the abundance of SNPs between 8.31 × 10−9 and 1 × 10−6).
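The list construction described in this section can be sketched in Python as follows. This is an illustrative reading of the procedure, not the code used to produce Figure 3: the names Snp, near, greedy_prune, and figure3_list are hypothetical, "within 500kb" is assumed to mean a distance of at most 500,000 bp on the same chromosome, and the pruning is implemented as a single pass in order of increasing p-value, which may treat chains of neighbouring SNPs differently from the original repeated passes.

```python
from typing import NamedTuple

WINDOW = 500_000  # 500kb pruning window


class Snp(NamedTuple):
    """Hypothetical record for one SNP."""
    rsid: str
    chrom: str
    pos: int
    min_p: float  # minimum univariate p-value across phenotypes


def near(a, b, window=WINDOW):
    """True if two SNPs lie on the same chromosome within `window` bp."""
    return a.chrom == b.chrom and abs(a.pos - b.pos) <= window


def greedy_prune(snps, window=WINDOW):
    """Keep a SNP only if no more significant SNP already kept is nearby."""
    kept = []
    for snp in sorted(snps, key=lambda s: s.min_p):
        if not any(near(snp, k, window) for k in kept):
            kept.append(snp)
    return kept


def figure3_list(snps, gwas_threshold, new_multivariate, marginal_p=1e-6):
    """Build the non-redundant, p-value-sorted list described in the text."""
    # Marginally significant: below 1e-6 but not genome-wide significant.
    marginal = [s for s in snps if gwas_threshold <= s.min_p < marginal_p]
    pruned = greedy_prune(marginal)
    # Drop pruned SNPs that sit within 500kb of a new multivariate association.
    pruned = [s for s in pruned
              if not any(near(s, m) for m in new_multivariate)]
    # Merge with the new multivariate associations and sort by min p-value.
    return sorted(pruned + list(new_multivariate), key=lambda s: s.min_p)
```

Because a new multivariate association is within 0 bp of itself, it is removed from the pruned list before the merge and so appears exactly once in the output.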
9 Supporting Information Legends
Supplementary Figure 1: Graphical Model of Multivariate Categories. Shown here is a directed acyclic graph (DAG) model of our multivariate categories in the context of our vector of phenotypes Y (e.g. Y = {YU, YD, YI}) and their connections with the variant of interest g. The relationships described in-text can be seen here. YU, our unassociated phenotypes, have no connection with g. YD, our directly associated phenotypes, have a direct connection with g. And YI, our indirectly associated phenotypes, have a connection with g only by going through YD first. Note that if YD were not observed, YI would appear to be directly connected to g.
Supplementary Figure 2: Refining Association Signals – GlobalLipids2013 rs7515577 & rs12038699. Shown are the -log10 univariate p-values from the GlobalLipids2013 analysis for both the previous univariate association rs7515577 (“Previous Univariate SNP”) and the new multivariate association rs12038699 (“New Multivariate SNP”) across all four phenotypes analyzed. rs7515577 is represented as a triangle and rs12038699 is represented as a square. Also shown are the -log10 univariate p-values of SNPs within 1Mb of the midpoint between rs7515577 and rs12038699. Color-coding of the SNPs represents the degree of linkage disequilibrium between variants and the new association rs12038699 based on the GBR cohort of 1000Genomes (Genomes Project et al., 2015); for color coding details, see legend.
Supplementary Figure 3: Effect Size Heterogeneity Among SNPs With Identical Multivariate Model Assignments. Shown are the phenotype effect sizes (points), and ±2 standard errors (bars), for four significantly associated SNPs from HaemgenRBC2016. All four SNPs were classified as being “associated” with all six phenotypes (i.e. marginal posterior probability of association ≥ 95% for each phenotype). However, they clearly show different patterns of effect sizes. Therefore, focusing simply on binary calls of “associated” vs “unassociated” can hide different patterns of multivariate association.
Supplementary Table 1: Summary of Associations in Each Dataset.
aNumber of new multivariate associations discovered by our analysis. Note that we required a multivariate association to be at least 500kb from a previous reported association to be considered “new”.
bUnivariate GWAS significance p-value threshold used by the original study publication.
cThese are new multivariate SNPs that were not reported by the original study despite having a univariate association (in the public summary data) that was genome-wide significant by the original study’s univariate significance threshold.
dA “previous association” means an association reported by the original GWAS; “near” means within 1Mb (but these are all more than 500kb away from a previous association since our classification of new multivariate SNPs requires this).
Supplementary Tables 2a-m: Lists of New bmass Multivariate Associations, per Dataset. Attached Excel sheets list new bmass associations for each dataset analyzed.
Supplementary Tables 3a-m: Lists of Retrieved Univariate Associations From Original Publications, per Dataset. Attached Excel sheets list the rsID#’s of the univariate significant SNPs that were retrieved from the original publication(s) associated with each dataset (see Online Methods for details).
Supplementary Tables 4a-m: Results for Previous Univariate Associations, per Dataset. Attached Excel sheets give bmass results for previous univariate associations, per dataset. Note that these results may not include all SNPs from Tables 3a-m, because some SNPs were dropped during QC and other SNPs were dropped because they did not reach univariate significance in the publicly available summary data (see Online Methods for details).
Supplementary Table 5: Replication of New Multivariate Associations. Shown are example metrics of how well our new multivariate associations replicate in datasets that allow such an evaluation. Specifically, for three of the studies used (GlobalLipids, GIANT, and HaemgenRBC), there are multiple dataset releases. To examine how well our new multivariate bmass associations replicate, we compared the results from the first releases (“1st”) with the univariate GWAS associations of the second releases (“2nd”). In essence, each of these approaches aims to increase power – one by using a multivariate approach (bmass) and the other by increasing sample size (the 2nd releases) – thus allowing us to compare the results against one another. Univariate p-Value Threshold: univariate GWAS significance p-value thresholds used by the original publication(s) for both the earlier (1st) and later (2nd) releases. New Multivariate SNPs in 1st: number of new multivariate associations from the earlier release. Lower Univariate p-Value in 2nd: number of new multivariate associations from the earlier release that also have lower univariate p-values in the later release. Below 2nd Univariate Threshold: number of new multivariate associations from the earlier release that also cross the later release’s univariate GWAS significance threshold.
Supplementary Table 6: p-Values for rs7515577 & rs12038699 in 2010 and 2013 GlobalLipids Releases. In the 2010 release, rs7515577 has a univariate p-value that crosses the 5 × 10−8 threshold (TC), whereas rs12038699 does not. Since rs12038699 is near rs7515577, it may be masked in future analyses; however, in the 2013 data rs12038699 not only has a lower minimum univariate p-value, but also has a different multivariate p-value pattern as compared to rs7515577. Both of these signals suggest that rs12038699 should be viewed as a separate GWAS hit for GlobalLipids2013.
Supplementary Table 7: Top Multivariate Model Examples per SNP. List of multivariate models that most frequently have the highest posterior probabilities per SNP. Top 5 models are shown from across both the previous univariate associations analyzed and the new multivariate associations discovered in the GlobalLipids2013, GIANT2014/5, and HaemgenRBC2016 datasets. Phenotype ordering is shown in the header, where 0, 1, and 2 refer to the multivariate categories of Unassociated, Directly Associated, and Indirectly Associated. n is the number of SNPs that show the specified model as having the largest posterior probability, with Mean Posterior displaying the average posterior probability of the given model across the n SNPs, and Original Prior showing the prior established for the given model from training on all the previous univariate associations from that dataset.
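The summary reported in Supplementary Table 7 can be illustrated with a short Python sketch, given per-SNP posterior probabilities over multivariate models. This is a hedged illustration rather than the code used to build the table: the function name top_model_summary and the input layout (a dict mapping each rsID to a dict of model string to posterior probability, with models written as strings of 0/1/2 in phenotype order) are assumptions made for the example.

```python
from collections import defaultdict


def top_model_summary(posteriors, priors, top_k=5):
    """Summarise which models most often have the largest posterior per SNP.

    posteriors: {rsid: {model: posterior probability}}
    priors:     {model: prior probability learned from previous associations}
    Returns up to top_k rows sorted by n (descending), then by model string.
    """
    best_posteriors = defaultdict(list)
    for model_post in posteriors.values():
        # Model with the largest posterior probability for this SNP.
        best = max(model_post, key=model_post.get)
        best_posteriors[best].append(model_post[best])

    rows = [
        {
            "model": model,
            "n": len(values),                            # SNPs with this top model
            "mean_posterior": sum(values) / len(values),  # average over those SNPs
            "original_prior": priors.get(model),          # None if no prior given
        }
        for model, values in best_posteriors.items()
    ]
    rows.sort(key=lambda r: (-r["n"], r["model"]))
    return rows[:top_k]
```

Here n corresponds to the table's count of SNPs for which the model has the largest posterior probability, and mean_posterior to the Mean Posterior column.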
6 Acknowledgments
We thank John Novembre, Anna Di Rienzo, and Xin He for helpful feedback during the development of this project. We also thank Peter Carbonetto for helpful feedback on the bmass R package and the manuscript. This work was supported by National Institutes of Health (NIH) Grant R01 HG002585 to MS, NIH Grants T32 GM007197, TL1 TR000432, and F31 AI118375 to MCT, and NIH Grant R01 GM118652.