Abstract
Genetic interactions have been reported to underlie phenotypes in a variety of systems, but the extent to which they contribute to complex disease in humans remains unclear. In principle, genome-wide association studies (GWAS) provide a platform for detecting genetic interactions, but existing methods for identifying them from GWAS data tend to focus on testing individual locus pairs, which undermines statistical power. Importantly, the global genetic networks mapped for a model eukaryotic organism revealed that genetic interactions often connect genes between compensatory functional modules in a highly coherent manner. Taking advantage of this expected structure, we developed a computational approach called BridGE that identifies pathways connected by genetic interactions from GWAS data. Applying BridGE broadly, we discovered significant interactions in Parkinson’s disease, schizophrenia, hypertension, prostate cancer, breast cancer, and type 2 diabetes. Our novel approach provides a general framework for mapping complex genetic networks underlying human disease from genome-wide genotype data.
Genome-wide association studies (GWAS) have been increasingly successful at identifying single-nucleotide polymorphisms (SNPs) with statistically significant association to a variety of diseases1-5 and gene sets significantly enriched for SNPs with moderate association6-10. However, for most diseases, there remains a substantial disparity between the disease risk explained by the discovered loci and the estimated total heritable disease risk based on familial aggregation11-16. While there are a number of possible explanations for this “missing heritability”, including many loci with small effects or rare variants11-15,17, genetic interactions between loci are one potential culprit13,14,16,18,19. Genetic interactions generally refer to a combination of two or more genes whose contribution to a phenotype cannot be completely explained by their independent effects16,20,21. An extreme example of a genetic interaction is synthetic lethality, in which two mutations, neither of which is lethal on its own, combine to generate a lethal double mutant phenotype. Genetic interactions allow relatively benign variation to combine and generate more extreme phenotypes, including complex human diseases11-13,16,22. While several studies have reported interactions between genetic variants in various disease contexts20,23-26, and though efficient and scalable computational tools have been developed for searching for interactions amongst genome-wide SNPs20,26-28, discovering them systematically with statistical significance remains a major challenge. For example, recent work estimated through simulation studies that approximately 500,000 subjects would be needed to detect significant genetic interactions under reasonable assumptions16, which remains beyond the cohort sizes available for a typical GWAS or even the large majority of meta-GWAS analyses.
Genome-wide reverse genetic screens in model organisms have produced rich insights into the prevalence and organization of genetic interactions29,30. Specifically, the mapping and analysis of the yeast genetic interaction network revealed that genetic interactions are numerous and tend to cluster in highly organized network structures, connecting genes in two different but compensatory functional modules (e.g. pathways or protein complexes) as opposed to appearing as isolated instances29,31–33. For example, nonessential genes belonging to the same pathway often exhibit negative genetic interactions with the genes of a second nonessential pathway that impinges on the same essential function (Fig. 1A). Due to their functional redundancy, the two different pathways can compensate for the loss of the other, and thus, only simultaneous perturbation of both pathways would result in an extreme loss of function phenotype, which could be associated with either increased or decreased disease risk. Importantly, the same phenotypic outcome could be achieved by several different combinations of genetic perturbations in both pathways (e.g. A-X, A-Z, B-X, B-Y, B-Z, as summarized in Fig. 1B). This model for the local topology of genetic networks, called the “between pathway model” (BPM), has been widely observed in yeast genetic interaction networks29,34. Indeed, as many as ~70% of negative genetic interactions observed in yeast occur in BPM structures, indicating that genetic interactions are highly organized and this type of local clustering is the rule rather than the exception31. Combinations of mutations in genes within the same pathway or protein complex also exhibit a high frequency of genetic interaction, a scenario we refer to as the “within-pathway model” (WPM)29,34. Indeed, ~80% of essential protein complexes in yeast exhibit a significantly elevated frequency of within-pathway interactions35.
In the context of human disease, this scenario may arise for an individual inheriting two variants in the same pathway, resulting in reduced flux or function of a particular pathway and an increase or decrease in disease risk.
The prevalence of BPM and WPM structures observed in the yeast global genetic network has important practical implications that can be exploited to explore disease-associated genetic interactions in humans based on GWAS data. Although tests to identify interactions between specific SNP or gene pairs are statistically under-powered, we may be able to detect genetic interactions by leveraging the fact that pairwise interactions between genome variants are likely to cluster into larger BPM and WPM network structures similar to those observed in the yeast global genetic network. Indeed, other studies exploited similar structural properties to derive genetic interaction networks from phenotypic variation in a yeast recombinant inbred population36. We note that the method we propose here is also broadly similar to previous approaches that have used gene set enrichment or GO enrichment analysis to interpret SNP sets arising from univariate or interaction analyses6-10,37-40 or aggregation tests for rare variants15,41,42 (See Methods). Other existing approaches have successfully identified interactions by reducing the test space for SNP-SNP pairs, through either knowledge or data-driven prioritization43-46 (See Methods). However, to our knowledge, no existing method has been developed to systematically identify between-pathway interaction structures based on human genetic data, which is the focus of this study.
Results
BridGE: a novel method for systematic discovery of pathway level genetic interactions from GWAS
We developed a method called BridGE (Bridging Gene sets with Epistasis) to explicitly search for coherent sets of SNP-SNP interactions within GWAS cohorts that connect groups of genes corresponding to characterized pathways or functional modules. Specifically, although many pairs of loci do not have statistically significant interactions when considered individually, they can be collectively significant if there is an enrichment of SNP-SNP interactions between two functionally related sets of genes (Fig. 1B). Thus, we imposed prior knowledge of pathway membership and exploited structural and topological properties of genetic networks to gain statistical power to detect genetic interactions that occur between or within pathways in GWAS associated with diverse diseases. Our algorithm specifically focuses on identifying BPM structures, where two distinct pathways are bridged by several SNP-level interactions connecting them, as well as WPM structures, where interactions densely connect SNPs linked to genes in the same functional module or pathway.
Our approach involves five main components (See Methods, Fig. 1C): (I) Data processing consisting of sample quality control and adjustment for population substructure between the cases and controls to avoid false discoveries due to population stratification47,48. Linkage disequilibrium (LD) was also accounted for by pruning the full set of SNPs into a subset, as LD could otherwise result in spurious BPM structures. (II) SNP-SNP interaction networks were constructed based on SNP-SNP interactions scored under different disease model assumptions (additive, recessive, dominant, or combined recessive and dominant models). The additive disease model was implemented as previously described, and SNP-SNP interaction scores were derived based on likelihood ratio tests for models with and without an interaction term20. Interactions based on recessive and dominant disease models were estimated using a hypergeometric-based metric that directly tests for disease association for individuals that are either homozygous (recessive and dominant models) or heterozygous (dominant only) for the minor allele at two loci and compares the observed degree of association to the marginal effects of both loci. (III) The SNP-SNP network was thresholded by applying a lenient significance cutoff to generate a low-confidence, high-coverage SNP-SNP interaction network. This binary network is expected to contain a large number of false positive interactions, but it enables assessment of the significance of SNP-SNP interactions collectively at the pathway level. 
(IV) Pairs of pathways (for BPMs) or single pathways (for WPMs), as defined by curated functional standards49-51, were tested for enrichment of SNP-SNP pair interactions connecting between them (or within the single pathway) with a chi-squared test, compared to both the global interaction density and the marginal interaction density of the two pathways, as well as a permutation test (pperm) conducted by randomly shuffling the SNP-pathway assignment. These tests produced three statistics to measure the significance of each candidate BPM or WPM. (V) Finally, a sample permutation strategy was applied to estimate false discovery rate, to correct for multiple hypothesis testing and assess the significance of the candidate BPMs or WPMs. Multiple hypothesis test correction is conducted only at the level of pathway or pathway pairs; the number of hypothesis tests performed for all possible pathways and all possible between-pathway combinations is substantially less than the number of tests for all possible SNP pairs (~10⁵ as compared to ~10¹¹), which increases our power for discovering interactions relative to approaches that operate on individual SNP-SNP interactions. As part of BridGE, in addition to discovering BPM and WPM structures, we can also identify individual pathways that have significantly elevated marginal density of SNP-SNP interactions even where the interaction partners do not necessarily have clear coherence in terms of pathways (called PATH structures, See Methods). In this case, we are not focused on pathway-pathway interactions but simply assess whether a particular pathway is a highly connected hub and associated with numerous SNP-level interactions.
These five steps enabled us to extract statistically significant pathway-level interactions that can be associated with either increased risk of disease when pairs of minor alleles linked to two pathways occur more frequently in the diseased population or, conversely, decreased risk of disease when pairs of minor alleles annotated to two pathways occur more frequently in the control population.
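As an illustration of step (IV), the following is a minimal sketch of a between-pathway enrichment test on a thresholded binary SNP-SNP network. It is not the BridGE implementation: the toy network, the function names, and the choice of a 1-degree-of-freedom goodness-of-fit statistic are assumptions made for illustration, and the comparison against the marginal densities of the two pathways is omitted for brevity.

```python
import math
import random

def chi2_sf_1df(stat):
    """Upper-tail p-value for a chi-squared statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))

def count_between(edges, set_a, set_b):
    """Number of network edges with one endpoint in each SNP set."""
    return sum(1 for u, v in edges
               if (u in set_a and v in set_b) or (u in set_b and v in set_a))

def enrichment_p(observed, n_pairs, background_density):
    """Goodness-of-fit p-value against a background density (assumed in (0, 1)).

    Returns 1.0 when the observed count does not exceed the expectation,
    because only enrichment is of interest here.
    """
    expected = n_pairs * background_density
    if observed <= expected:
        return 1.0
    diff_sq = (observed - expected) ** 2
    stat = diff_sq / expected + diff_sq / (n_pairs - expected)
    return chi2_sf_1df(stat)

def membership_permutation_p(edges, snps, set_a, set_b, n_perm=1000, seed=0):
    """Empirical p-value from shuffling the SNP-to-pathway assignment."""
    rng = random.Random(seed)
    observed = count_between(edges, set_a, set_b)
    hits = 0
    for _ in range(n_perm):
        shuffled = snps[:]
        rng.shuffle(shuffled)
        relabel = dict(zip(snps, shuffled))
        perm_a = {relabel[s] for s in set_a}
        perm_b = {relabel[s] for s in set_b}
        if count_between(edges, perm_a, perm_b) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Hypothetical toy network: 40 SNPs, two 5-SNP pathways, planted cross edges.
snps = [f"s{i}" for i in range(40)]
pathway_a = set(snps[0:5])
pathway_b = set(snps[5:10])
edges = [(f"s{i}", f"s{j}") for i in range(5) for j in range(5, 10)
         if (i + j) % 2 == 0]
edges += [(f"s{k}", f"s{k + 1}") for k in range(10, 39)]  # background edges

global_density = len(edges) / math.comb(len(snps), 2)
observed = count_between(edges, pathway_a, pathway_b)
n_pairs = len(pathway_a) * len(pathway_b)
p_global = enrichment_p(observed, n_pairs, global_density)
p_perm = membership_permutation_p(edges, snps, pathway_a, pathway_b)
print(observed, n_pairs, p_global, p_perm)
```

In this sketch the multiple-testing burden scales with the number of pathway pairs rather than the number of SNP pairs, which is the source of the power gain described above.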
Discovery of between-pathway interactions in a Parkinson’s disease cohort
We first applied BridGE to identify between-pathway interactions in a genome-wide association study of Parkinson’s disease (PD)52, denoted as PD-NIA (Supplementary Table 1). Recent work estimated a substantial heritable contribution to PD risk across a variety of GWAS designs (20%–40%)53,54, and although a relatively large number of variants have been individually associated with PD, the loci discovered to date explain only a small fraction (6%–7%) of the total heritable risk53. The PD-NIA cohort used in this analysis consists of 519 patients and 519 ancestry-matched controls after balancing the population substructure (See Methods). We compiled a collection of 833 curated gene sets (MSigDB Canonical pathways)55 representing established pathways or functional modules from KEGG49, BioCarta50 and Reactome51 (Supplementary Table 2) and found that 658 of these pathways were represented in the PD-NIA cohort after filtering based on gene set size (minimum: 10 genes or SNPs, maximum: 300 genes or SNPs). After using both SNP-pathway membership permutations (NP=150,000) and sample permutations (NP=10) to establish global significance and correct for the multiple hypotheses tested (See Methods), BridGE reported 173 total significant BPMs at a false discovery rate (FDR) of < 0.25 (pperm ≤ 4.7 × 10⁻⁵) using a combined disease model (QQ plot in Fig. 2A, Supplementary Table 3). Due to overlap among the pathways, these could be summarized by a less redundant set of 23 BPMs involving 32 unique pathways (a maximum overlap coefficient of 0.25, Fig. 3, Supplementary Table 4, See Methods). Some of the identified BPMs persisted at even the most stringent FDR cutoffs (FDR ≤ 0.05). For example, a high confidence BPM was identified between the Golgi associated vesicle biogenesis gene set and FcεRI signaling. More specifically, we observed 2281 SNP-SNP interactions between the vesicle biogenesis and FcεRI signaling gene sets (Fig. 2B), which is 1.5-fold higher than the expected number of SNP-SNP interactions (1510) based on the global density of the SNP-SNP interaction network and 1.3- and 1.2-fold higher than expected given the marginal densities of the two pathways (5.9% and 6.5%, respectively; Fig. 2C). In contrast to the significance of this BPM, none of the individual SNPs supporting this BPM were significant on their own after multiple hypothesis correction based on single-locus tests on this cohort (Fig. 2B). Furthermore, none of the individual SNP-SNP interactions between the two pathways were significant when tested independently under an additive disease model (Fig. 2D, FDR ≥ 0.94), or recessive or dominant models (See Methods) (Supplementary Fig. 1). Thus, the variants involved in this pathway-pathway interaction observed in the PD-NIA Parkinson’s disease cohort would be missed based on traditional univariate analysis or interaction tests that focus on individual SNP pairs, but were highly significant when assessed collectively by BridGE.
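The figures quoted above can be checked with a few lines of arithmetic. The sketch below uses only numbers stated in the text (2281 observed and 1510 expected SNP-SNP interactions, 658 pathways); the variable names are illustrative, and counting one WPM test per pathway plus one BPM test per unordered pathway pair is an assumption about how the pathway-level hypotheses are enumerated.

```python
from math import comb

observed_pairs = 2281  # SNP-SNP interactions between the two gene sets (from the text)
expected_pairs = 1510  # expectation under the global network density (from the text)
fold_enrichment = observed_pairs / expected_pairs
print(f"fold enrichment over global density: {fold_enrichment:.2f}")

n_pathways = 658                 # pathways retained after size filtering (from the text)
bpm_tests = comb(n_pathways, 2)  # one test per unordered pathway pair
wpm_tests = n_pathways           # one test per single pathway
print(f"pathway-level tests: {bpm_tests + wpm_tests}")
```

The resulting count of pathway-level tests is on the order of 10⁵, which is consistent with the comparison to roughly 10¹¹ possible SNP pairs made earlier.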
Furthermore, few of the pathways that we discovered as parts of significant BPMs (Fig. 3, Supplementary Table 4) would be discovered using approaches based on pathway enrichment tests of single locus effects6,7. For example, only three pathways were enriched among the single-locus effects associated with PD (Golgi associated vesicle biogenesis, Clathrin-derived vesicle budding and the Rac-1 cell motility signaling pathway; Supplementary Table 5) at the same FDR applied to the discovery of BPMs (FDR < 0.25), and only one of these was represented as part of a BPM identified by our analysis (Supplementary Table 4). We failed to identify any of the remaining 31 BPM-involved pathways through gene set enrichment analysis of single locus effects.
Strikingly, the large majority (22 of 23) of discovered BPMs were associated with decreased risk for Parkinson’s disease (Fig. 3). This may suggest that, in the case of Parkinson’s disease, genetic interactions may be more frequently associated with protective effects, or alternatively, simply that there is more heterogeneity across the population in genetic interactions leading to increased risk, which would limit our ability to discover such interactions. Several BPM interactions were highly relevant to the biology of Parkinson’s disease. In particular, the FC epsilon receptor I (FcεRI) signaling pathway represented a hub in the pathway interaction network (Fig. 3). FcεRI is the high-affinity receptor for Immunoglobulin E and is the major controller of the allergic response and associated inflammation. In general, immune-related inflammation has been frequently associated with Parkinson’s disease and several immuno-modulating therapies have been pursued, but it remains unclear whether this is a causal driver of the disease or is rather a result of the neurodegeneration associated with disease progression56,57. There has been relatively little focus on the specific role of FcεRI in Parkinson’s, but recent observations support the relevance of this pathway to the disease58. For example, Bower et al. reported an association between the occurrence of allergic rhinitis and increased susceptibility to PD59. Furthermore, reduction of IL-13, one of the cytokines activated by FcεRI and a member of the FcεRI signaling pathway, was shown to have a protective effect in mouse models of PD60, and galectin-3, which is known to modulate the FcεRI immune response, was shown to promote microglia activation induced by α-synuclein, a cellular phenotype associated with PD61,62. 
These observations indicate that a hyperactive allergic response may predispose individuals to PD, and suggest that protective interactions reported by our method may result from variants that subtly reduce the activity of this pathway. Aberrant events in the Golgi and related transport processes have been known to play an important role in the pathology of various neurodegenerative diseases, including Parkinson’s disease63,64. Also, glycolytic and gluconeogenic metabolic intermediates have been found to be cytoprotective against 1-methyl 4-phenylpyridinium (MPP+) ion toxicity in Parkinson’s disease65. Our BridGE approach also identified three protective interactions involving the IL-12 and STAT4 signaling pathway; IL-12 is a pro-inflammatory cytokine that plays a major role in regulating both the innate and adaptive immune responses66. Specifically, microglial cells both produce and respond to IL-12 and IFN-gamma, and these comprise a positive feedback loop that can support stable activation of microglia67,68, a hallmark of Parkinson’s disease, particularly in later stages69-73. The prevalence of the FcεRI and IL-12 interactions among the discovered interactions suggests that immune signaling may act as a causal driver of PD.
In addition to significant between-pathway interactions, we also discovered 3 significant WPMs associated with Parkinson’s disease risk: Golgi-associated vesicle biogenesis (FDR < 0.01), collagen mediated activation cascade (FDR = 0.13), and the HCMV and MAP kinase pathway (FDR = 0.25) (Fig. 3, Supplementary Table 4). In all three cases, minor allele combinations within the pathways were associated with decreased risk of PD. All three of these pathways were also implicated in high confidence protective BPM interactions with other pathways, suggesting they play important roles in PD risk.
Replication of pathway-pathway interactions in an independent Parkinson’s disease cohort
To validate our findings, we determined if the BPM interactions discovered in the PD-NIA cohort could be replicated in an independent PD cohort (PD-NGRC74; 1947 cases and 1947 controls, all of European ancestry; subjects overlapping with the PD-NIA cohort were removed). Indeed, 8 of the 173 total BPM interactions discovered in the PD-NIA cohort were nominally significant in the PD-NGRC cohort based on all three significance criteria (See Methods). To assess the significance of this level of replication across the entire set of discoveries, we compared the number of observed replicated BPMs at several different FDR cutoffs to the number expected by chance, which was estimated based on 10 random sample permutations of the validation cohort (See Methods). Indeed, this analysis confirmed that the discovered interactions replicated more frequently than expected (Fig. 4A, Supplementary Table 6). For example, at an FDR cutoff of 0.05, the number of replicated BPMs was ~7-fold higher than expected (p = 0.02). BPMs identified at more stringent FDR cutoffs showed a stronger tendency to replicate in the independent cohort (Fig. 4A, Supplementary Table 6), including the top-ranked BPM interaction we discovered between Golgi associated vesicle biogenesis and the FC epsilon receptor I (FcεRI) signaling pathway. Intriguingly, another between-pathway interaction for the FcεRI signaling pathway, with a Glycolysis/gluconeogenesis gene set, also replicated (Supplementary Table 6).
While we confirmed replication of a significant fraction of the discovered interactions at the pathway level, this does not necessarily imply that the individual SNP pairs supporting these pathway-level effects are shared across cohorts. For the 8 BPMs that were validated in the PD-NGRC cohort, we evaluated the significance of the overlap between the specific SNP-SNP pair interactions supporting each of the validated BPMs in the PD-NIA and the PD-NGRC cohorts and contrasted the observed overlap to comparable statistics from 10 random sample permutations of the PD-NGRC cohort. Several individual BPMs exhibited significant overlap in their supporting SNP-SNP interactions, and collectively, the set of 8 replicated BPMs was strongly shifted toward higher than expected SNP-SNP interaction overlap (See Methods) (p = 1.4 × 10⁻³) (Fig. 4B, see Supplementary Table 6 for a list of SNP-SNP pairs in common across cohorts). However, despite statistically significant overlap among SNP-SNP interactions identified in replicated BPMs, the extent of the observed overlap in terms of fraction of pairs was relatively low for most cases, with all of them exhibiting an overlap coefficient of less than 0.15 (See Methods) (Fig. 4C). Thus, the same pathway-pathway interaction may be supported by different sets of SNP-SNP interactions in different populations, or alternatively, this may reflect that the power for reliably pinpointing specific locus pairs is limited. In either case, these results highlight the primary motivation for our method: genetic interactions, in particular those in a BPM structure, can be more efficiently detected from GWAS when discovered at a pathway or functional module level rather than at the level of individual genomic loci.
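To make the overlap statistic concrete, here is a minimal sketch of an overlap coefficient between two sets of SNP-SNP pairs. It assumes the standard definition (size of the intersection divided by the size of the smaller set); the exact definition used in the Methods may differ, and the SNP identifiers below are hypothetical.

```python
def overlap_coefficient(pairs_a, pairs_b):
    """Overlap coefficient |A ∩ B| / min(|A|, |B|) for unordered SNP pairs."""
    # frozenset makes ("rs1", "rs9") and ("rs9", "rs1") the same pair.
    a = {frozenset(p) for p in pairs_a}
    b = {frozenset(p) for p in pairs_b}
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

# Hypothetical supporting SNP pairs for one BPM in two cohorts.
discovery_pairs = [("rs1", "rs9"), ("rs2", "rs8"), ("rs3", "rs7"), ("rs4", "rs6")]
replication_pairs = [("rs9", "rs1"), ("rs2", "rs7"), ("rs10", "rs11"),
                     ("rs12", "rs13"), ("rs14", "rs15")]

print(overlap_coefficient(discovery_pairs, replication_pairs))
```

Under this definition, a value below 0.15 means that fewer than 15% of the pairs in the smaller supporting set are shared between cohorts.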
Discovery of pathway-level genetic interactions in five other diseases
We applied BridGE more broadly to an additional twelve GWAS cohorts representing seven different diseases (Parkinson’s disease, schizophrenia, breast cancer, hypertension, prostate cancer, pancreatic cancer and type 2 diabetes)75-80 (Supplementary Table 1) (See Methods). Including PD-NIA, of the thirteen cohorts, analysis of eleven cohorts (covering six different diseases) resulted in significant discoveries for at least one of the three types of interactions (BPM, WPM or PATH) at FDR < 0.25. More specifically, significant BPMs were discovered for eight cohorts (covering six different diseases), significant WPMs for six cohorts (covering four different diseases) and significant PATH structures for six cohorts (covering three different diseases) at FDR ≤ 0.25 (Fig. 5, Fig. 6A, Supplementary Tables S7-S20). The number of interaction discoveries per cohort varied substantially, from as low as two in one of the schizophrenia cohorts to as many as 50 interactions in one of the breast cancer cohorts. While we tested multiple disease models (additive, dominant, recessive, and combined dominant-recessive), the most significant discoveries for the majority of diseases examined were reported when using a dominant or combined model as measured by our SNP-SNP interaction metric (See Methods). The relative frequency of interactions under a dominant vs. a recessive model may be largely due to our increased power to detect interactions between SNPs with dominant effects compared to recessive effects (See Methods).
We obtained appropriate replication cohorts for three additional diseases beyond Parkinson’s disease, including prostate cancer, breast cancer and schizophrenia, and were able to successfully replicate discovered genetic interactions for all three diseases (replication summary in Supplementary Table 21). For example, three of eleven BPMs (FDR ≤ 0.25) discovered in the ProC-CGEMS prostate cancer cohort were replicated in the ProC-BPC3 cohort (7.5-fold enrichment, p = 0.01) while three of ten WPMs discovered from the ProC-BPC3 cohort (FDR ≤ 0.25) could be replicated in ProC-CGEMS (3-fold enrichment, p = 0.0001). For breast cancer, six of 108 significant BPMs (FDR ≤ 0.20) discovered from the BC-MCS-JPN cohort replicated in the BC-MCS-LTN cohort (2-fold enrichment, p = 0.07) and the sole significant PATH interaction discovered from the BC-MCS-LTN cohort replicated in the BC-MCS-JPN cohort. For schizophrenia, one of eight significant BPMs discovered from the SZ-GAIN cohort replicated (fold-enrichment > 10, p = 0.02), and the top significant WPM (FDR ≤ 0.1) also replicated in the SZ-CATIE cohort.
The vast majority of the genetic interactions we discovered appear to be disease-specific (Fig. 5, Supplementary Table 7), and many of the pathways implicated in genetic interactions showed strong relevance to the corresponding disease. For example, we identified several cancer-related gene sets involved in replicated BPMs predicted to affect breast cancer risk, including p53 signaling, a basal cell carcinoma gene set, as well as an increased-risk interaction between MTA-3 related genes and T cell receptor activation initiated by Lck and Fyn. MTA-3 is a Mi-2/NuRD complex subunit that regulates an invasive growth pathway in breast cancer81, and Lck and Fyn are members of the Src family of kinases whose expression has been found to be associated with breast cancer progression and response to treatment82-84.
We also identified and replicated multiple prostate cancer risk-associated interactions that involved DNA repair, PD-1 (Programmed cell death protein 1) signaling, and insulin regulation pathways. Consistent with our findings, metabolic syndrome has been recently associated with prostate cancer85, and serum insulin levels have been shown to correlate with risk of prostate cancer86. We also identified a replicating interaction associated with decreased risk of prostate cancer between the p38 MAPK signaling and AKAP95 chromosome dynamics pathways. P38 MAPK signaling has been associated with a variety of cancers87, and AKAP95 is an A kinase-anchoring protein involved in chromatin condensation and maintenance of condensed chromosomes during mitosis88 whose expression has been previously implicated in the development and progression of rectal and ovarian cancers89. We also discovered and replicated two WPMs associated with prostate cancer risk. The first involves the antigen processing and presentation pathway (associated with increased risk), and the second involves a gene set associated with activation of ATR in response to replication stress (associated with decreased risk). Both of these pathways have strong relevance to cancer risk90,91.
For schizophrenia, we discovered and replicated a BPM interaction comprising a gene set associated with the HIV life cycle and a vitamin and cofactor metabolism pathway. Interestingly, a recent large Danish schizophrenia study reported that schizophrenia patients are at a 2-fold increased risk of HIV infection, and conversely, that individuals infected with HIV exhibited increased risk of schizophrenia, especially in the year following diagnosis92. Our finding suggests a common genetic basis between risk factors for schizophrenia and host response to the HIV virus, which may help to explain the observed co-morbidity of these diseases. We also discovered and replicated a protective WPM for schizophrenia in the nicotinate and nicotinamide metabolism pathway. Nicotinic acid (vitamin B3) supplements have been pursued as a treatment for schizophrenia dating back to the 1950s93. Interestingly, after an initial series of reports of promising treatments, several follow-up studies had difficulty reproducing the beneficial effects of nicotinic acid94, which could be a result of modifier effects within this pathway.
Although we did not conduct replication analyses for hypertension or type 2 diabetes, we found that many of the pathways involved in interactions from the discovery cohorts were also highly relevant to the corresponding disease. For example, in the hypertension cohort, we identified a risk-associated BPM interaction involving hypoxia inducible factor (HIF) signaling, whose aberrant expression has been previously associated with hypertension95. Two BPMs and one WPM, all associated with increased risk, involved the Rho cell motility signaling pathway, which has been previously implicated in the pathogenesis of hypertension96. For type 2 diabetes, we discovered BPMs associated with protective effects involving an autoimmune thyroid disease gene set, glycosaminoglycan biosynthesis, and the mTOR signaling pathway, all of which have strong links to diabetes97-99. In summary, BridGE was able to detect all possible types of pathway-level genetic interactions (BPM, WPM and PATH) across several diverse disease cohorts, highlighting the utility of our method and the potential for genetic interactions to underlie complex human diseases.
Simulation study to evaluate the power of BridGE approach
Several of our results indicate that the additional power gained by aggregating SNPs connecting between or within pathways is critical for discovering genetic interactions from GWAS, at least based on the cohort sizes analyzed here. To fully explore the limits of our approach, we carried out a simulation study to estimate the statistical power afforded by the BridGE method with respect to sample size, interaction effect size, minor allele frequency, and pathway size, all of which should affect the sensitivity of detection of pathway-level genetic interactions.
We focused our power analysis on the detection of BPMs, which comprise most of our discoveries. Briefly, our simulations involved two components: one in which individual SNP-SNP pairs were embedded in a simulated population cohort with varying allele frequency100, and another component that simulated the rate of detection of increasingly larger BPM interaction structures given the corresponding level of false positives in the SNP-level network as determined by the first component (See Methods). Indeed, we found that each of the evaluated parameters (sample size, interaction effect size, minor allele frequency, and pathway size) affected the power of our approach (Fig. 6B). As expected, the sensitivity of our method increases with increasing pathway size, which is a key motivation for the approach. For example, our power analysis indicated that a minimum cohort size of 5000 individuals (2500 cases, 2500 controls) is required to detect a 25×25 BPM (i.e. two interacting pathways with 25 SNPs mapping to each pathway) that confers a 2X increase in risk with a minor allele frequency (MAF) of 0.05 (FDR < 25%) while a 300×300 BPM with the same effect size would require only 1000 individuals (500 cases, 500 controls) for detection at the same level of significance (simulation results for more stringent FDR cutoffs). As expected, the sensitivity of the approach also increases for interactions involving SNPs with higher MAF. For example, the same 25×25 BPM involving variants at MAF of 0.15 conferring 2X increase in risk can be detected from cohorts as small as 2000 individuals (1000 cases, 1000 controls), and a 300×300 BPM with these characteristics could be detected from a cohort as small as 500 individuals (250 cases, 250 controls).
A key parameter affecting these power estimates is the assumed biological density of interactions, which we define as the fraction of SNP-SNP pairs crossing two pathways of interest that actually have a functional impact on the disease phenotype relative to all possible SNP-SNP pairs. We assumed a density of 5% for the power analysis reported here (analysis based on 2.5% and 10% are included in Supplementary Fig. 2), meaning that the fraction of SNP-pairs that have the potential to jointly influence the phenotype comprise only a small minority of all possible SNP pairs. In practice, we anticipate that this frequency varies substantially across different pathways, depending on the frequency of functionally deleterious SNPs that are present in the population for each pathway. A higher density of functionally deleterious SNPs will result in higher sensitivity of our approach and vice versa, a lower density of functionally deleterious SNP combinations can substantially reduce the sensitivity of our approach (Supplementary Fig. 3). Notably, while statistical power increases with pathway size (i.e. number of SNPs mapping to each pathway), this is only true under the assumption that the SNPs (and the corresponding genes) actually contribute in a functionally coherent manner to the particular pathway or functional module. On the real disease cohorts, we discovered interactions for a large range of pathway sizes (Supplementary Fig. 4), suggesting there are even relatively small functional modules (e.g. less than 20 associated SNPs) that have sufficiently strong interaction effects to be detected. 
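The effect of the assumed interaction density can be illustrated with simple arithmetic. The sketch below computes the expected number of functional SNP-SNP pairs for the BPM sizes and densities mentioned in the text; the function name is illustrative, and treating the expected count as the product of the two pathway sizes and the density is a simplifying assumption rather than the simulation procedure described in the Methods.

```python
def expected_functional_pairs(n_snps_a, n_snps_b, density):
    """Expected number of truly interacting SNP pairs crossing two pathways."""
    return n_snps_a * n_snps_b * density

# BPM sizes and densities taken from the text (25x25 and 300x300; 2.5%, 5%, 10%).
for size in (25, 300):
    for density in (0.025, 0.05, 0.10):
        pairs = expected_functional_pairs(size, size, density)
        print(f"{size}x{size} BPM, density {density:.1%}: "
              f"{pairs:.1f} functional pairs of {size * size} possible")
```

At a density of 5%, a 25×25 BPM is supported by only about 31 functional pairs, whereas a 300×300 BPM is supported by about 4500, which helps explain why larger pathways can be detected in smaller cohorts.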
In general, these power analyses confirm that our approach is sufficiently powered to discover pathway-level genetic interactions at moderate effect size (~1.5-2X increased/decreased risk) for relatively small cohorts (~1000 or more individuals), which suggests it could be broadly applied to discover interactions in hundreds of existing GWAS cohorts that have been previously analyzed using only univariate approaches101.
Discussion
We described a novel and systematic approach for discovering human disease-specific, pathway-level genetic interactions from genome-wide association data. Results from eleven GWAS cohorts representing six different diseases confirmed that interaction structures prevalent in genetic networks of model organisms are indeed apparent in human disease populations and that these structures can be leveraged to discover significant genetic interactions either between or within biological pathways or functional modules. Genetic interactions discovered for these six diseases have the potential to contribute substantially to our understanding of their genetic basis. For example, to date, there have been approximately 85 singly associated loci (p ≤ 1.0×10⁻⁷) and one genetic interaction (between FGF20 and MAOB) reported for Parkinson’s Disease102,103. Here, we discovered 23 more pathway-level genetic interactions, emphasizing the potential of our approach to expand our knowledge of the contribution of genetic variation associated with diseases such as PD. Indeed, many of the pathways discovered by our approach have not been previously implicated in these diseases. For example, the median percentage of BridGE-identified pathways for which there was at least one linked SNP reported in dbGaP across the six diseases was 22% (Supplementary Table 22), indicating that the large majority of our discoveries represent novel insights that could not be made using standard single-locus approaches.
There are several ways the BridGE method could be expanded and improved upon to better detect genetic interactions. First, our approach currently depends on literature-curated collections of biological pathways as a major input. The potential of our method to detect genetic interactions within or between well-defined pathways and functional modules could be substantially improved as more complete curated or data-derived functional standards are developed and integrated with the approach, which will be a focus of future work. Second, to avoid spurious network structures related to SNPs that map to genes located in close physical proximity or linkage disequilibrium (LD), we sampled a conservatively sized subset of tag SNPs to run our analysis for each dataset. This conservative approach has undoubtedly missed functional variants that may contribute to disease risk. More sophisticated approaches for retaining a larger set of tag SNPs while still controlling for LD structure could improve the sensitivity of our method. Finally, we emphasize that our study focuses exclusively on detecting pathway-level genetic interactions between common variants assayed by typical GWAS. Continued development to examine the contribution of rare variants or interactions between rare variants and other loci, or to leverage the full set of variants identified through whole-genome or exome sequencing, represents a logical extension of the BridGE approach.
Developing mechanistic or clinically actionable disease insights based on the genetic interactions we have discovered will require additional strategies that build on pathway-level discoveries to generate more targeted hypotheses, followed by functional studies in disease models. One potential strategy to generate more targeted hypotheses involves leveraging an approach like BridGE to find pathways with robust disease-associated genetic interactions followed by a more targeted search for individual SNP-SNP or gene-gene pairs within these pathways that explain these structures. Our analysis of the Parkinson’s cohort indicated that there is indeed significant overlap among the strongest SNP-SNP interactions underlying replicated pathway level interactions, supporting the potential utility of this hierarchical approach.
The extent to which genetic interactions contribute to the genetic basis of human disease has been the subject of recent debate16,104,105. This debate is in part fueled by differences in language among geneticists that regularly encounter physiological epistasis between specific alleles and statistical geneticists who instead study statistical epistasis, which measures the non-additive component of genetic variance in a population104,106. The target of our method is to discover disease-relevant physiological epistasis between sets of specific alleles in biological pathways based on population genetic data. Robust estimates of the additional heritability explained by pathway level genetic interactions discovered by our method will be a focus of future work, but we anticipate this still remains just one of many contributions to heritability. Even in cases where the contribution to disease heritability is modest, genetic interactions define genetically distinct disease subtypes and point toward new insights about disease mechanism that can seed the search for new, targeted therapies. Also, recent studies suggest that accurately predicting the phenotypes of individuals from genotypes can depend critically on understanding interactions between genetic loci104,107, and thus, progress in personalized genome interpretation and medicine depends on our understanding of how specific alleles interact to cause phenotypes. Our work establishes a new paradigm for approaching this problem and provides a systematic method for detecting genetic interactions that can be applied to existing population genetic data for a variety of human diseases.
Methods
1. Brief Summary of existing methods
Although efficient and scalable computational tools have been developed for searching for interactions amongst genome wide SNPs26–28, 108, detecting them with statistical significance remains a major challenge. There are previous methods that have approached this problem, although from different perspectives than the method proposed here. We briefly summarize those methods and describe the novelty of our approach relative to this body of existing work.
Three general directions taken by previous methods for genetic interaction analysis that are the most similar to our approach are: (1) gene set enrichment-based approaches applied to loci derived from univariate tests, (2) gene set enrichment-based approaches applied to SNP-level summary statistics from interactions, and (3) methods that use pathways as a prior to study SNP or gene level interactions or reduce the number of hypothesis tests.
(1) Gene set enrichment-based approaches applied to loci derived from univariate tests
Gene set enrichment analysis (GSEA) was originally developed for case-control gene expression datasets55,109 but has previously been adapted to summarize sets of loci (and their linked genes) derived from univariate tests applied to GWAS datasets6,7. There are two key differences between these approaches and the method we propose. First, traditional approaches for GSEA start from univariate statistics of genes or SNPs, while our approach is built on interactions between pairs of SNPs that could have little or no single locus association with a disease phenotype. Second, approaches for GSEA target the enrichment of single gene/SNP associations in each individual pathway while our approach explores the enrichment of SNP-SNP interactions crossing each pair of pathways (between-pathway model or BPMs).
(2) Gene set enrichment-based approaches applied to SNP-level summary statistics from interactions
The gene set enrichment approach has also been applied beyond loci derived from univariate analysis. Another class of methods first measures genetic interactions based on pairwise SNP analysis, derives summary statistics at the individual SNP level based on specific interaction properties, and follows this with gene set enrichment analysis (GSEA) using pathway-associated SNP (or gene) interaction-based scores. For example, such approaches were recently applied to a bipolar study and a sporadic Amyotrophic Lateral Sclerosis study110,111. In the bipolar study, whole genome SNPs were first filtered based on their ECML scores112 and only the top 1000 SNPs with the strongest main effects and gene-gene interactions were retained for studying SNP-SNP interactions. Then, a SNP-SNP interaction network was constructed using a logistic regression model, and SNPs were ranked based on their centrality in this network. Finally, candidate pathways were evaluated using a gene-set enrichment analysis based on pathway members’ rankings. A similar GO enrichment approach was applied to the sporadic Amyotrophic Lateral Sclerosis study111, but SNP interaction strength was first estimated using a multifactor dimensionality reduction (MDR) model and then summarized at the gene level. GO annotation enrichment approaches were then applied to these gene-level scores. Again, these studies have not introduced the key concept that motivates our method: that genetic interactions connect coherently across pairs of distinct pathways.
(3) Methods that use pathways as a prior to study SNP or gene level interactions or reduce the number of hypothesis tests
Another strategy implemented by other existing methods to address the multiple hypothesis testing challenge presented by pairwise SNP analysis is to reduce the number of hypothesis tests, based on a variety of different criteria113. These methods typically employ a filtering step, either data driven43-45 or knowledge driven46,114, before applying statistical analysis of interactions. Other illustrative examples of this class of approaches are from a recent autism spectrum disorder study where all possible SNPs were tested for interactions with the Ras/MAPK pathway39, and a melanoma risk study where SNP-SNP interactions were studied within the five pathways that were significant based on the traditional individual SNP-based GSEA analysis40. Most studies implementing this approach investigate interactions among a small set of genetic variants (genes or SNPs) that either statistically demonstrate evidence for individual association with the disease phenotype or are known to be relevant to the disease based on prior knowledge. Hence, genetic interactions among novel genes, or genes that show no marginal association, will not be systematically detected by these approaches.
In summary, existing approaches are related to the proposed approach in the general sense that they leverage existing knowledge of pathways or other sets of functionally related genes to either perform enrichment on univariate effects or interaction-based SNP summary statistics (e.g. interaction degree), or simply use pathways as a prior to reduce the number of SNP pairs tested for interactions. To our knowledge, no existing methods explicitly test for higher-level interactions within or between multiple pathways and are sufficiently powered to perform this systematically across comprehensive pathway databases.
2. Genome-wide association studies (GWAS) datasets
Twelve GWAS datasets, representing 13 different cohorts covering seven diseases, were used in this paper: Parkinson’s disease (PD-NIA: phs000089.v3.p2, PD-NGRC: phs000196.v1.p1), breast cancer (BC-CGEMS-EUR, BC-MCS-JPN and BC-MCS-LTN: phs000517.v3.p1), schizophrenia (SCHZ-GAIN: phs000021.v3.p2; SCHZ-CATIE: CATIE study), hypertension (HT-eMERGE: phs000297.v1.p1; HT-WTCCC: cases are from EGAD00000000006, controls are from EGAD00000000001 and EGAD00000000002), prostate cancer (ProC-CGEMS: phs000207.v1.p1; ProC-BPC3: phs000812.v1.p1), pancreatic cancer (PanC-PanScan: phs000206.v3.p2) and type 2 diabetes (T2D-WTCCC: cases are from EGAD00000000009, controls are from EGAD00000000001 and EGAD00000000002). These data sets were obtained from three resources: dbGaP101, the Wellcome Trust Case Control Consortium, and the National Institute of Mental Health (NIMH)115. Details of each dataset (e.g. sample size, genotyping platform) are summarized in Supplementary Table 1.
3. Data processing
We used the same set of pre-processing steps for all GWAS data sets analyzed in this paper. Each of the steps is outlined in detail in the sections that follow.
3.1 Sample quality control
We first controlled data quality using the standard PLINK inclusion procedure with the following parameters: 0.02 as the maximal missing genotyping rate for each individual/SNP (--mind, --geno), 0.05 as the minimum minor allele frequency (--maf), and 1.0×10⁻⁶ as the Hardy-Weinberg equilibrium cutoff (--hwe 1e-6).
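To make the SNP-level thresholds concrete, the following is a minimal Python sketch of the missingness and minor allele frequency filters only. It is not PLINK's implementation: the function names, the genotype representation (minor-allele counts 0/1/2 with None for a missing call), and the omission of the per-individual filter and the Hardy-Weinberg test are all simplifying assumptions made for illustration.

```python
def snp_missing_rate(genotypes):
    """Fraction of missing calls (None) for one SNP across all samples."""
    return sum(g is None for g in genotypes) / len(genotypes)


def minor_allele_frequency(genotypes):
    """MAF from allele counts (0, 1 or 2 copies per sample); missing calls are ignored."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    freq = sum(called) / (2 * len(called))
    # Report the frequency of whichever allele is rarer in this sample.
    return min(freq, 1 - freq)


def passes_qc(genotypes, max_missing=0.02, min_maf=0.05):
    """Apply the two SNP-level thresholds quoted in the text (assumed semantics)."""
    return (snp_missing_rate(genotypes) <= max_missing
            and minor_allele_frequency(genotypes) >= min_maf)


common = [0, 1, 1, 2, 0, 0, 1, 0, 0, 0]   # MAF = 5 / 20 = 0.25
rare = [0] * 19 + [1]                     # MAF = 1 / 40 = 0.025
sparse = [None] + [1] * 9                 # 10% of calls missing
print(passes_qc(common), passes_qc(rare), passes_qc(sparse))
```

In practice these filters would be applied with PLINK itself; the sketch only shows what each threshold measures.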
To identify outlier samples that were not consistent with the reported study population, we mapped SNPs in each GWAS dataset to Genome Reference Consortium GRCh37116 and combined the samples with the 1000 Genomes data117 (all ancestry groups). We then used PLINK to perform multi-dimensional scaling (MDS) analysis. Based on the MDS plot, we removed samples that were not tightly clustered with the corresponding ancestry groups in the 1000 Genomes data. For the two Parkinson’s disease cohorts, we followed the previous study118 to remove samples that are likely outliers. For these cohorts, duplicate subjects were kept in just one cohort with priority given to PD-NIA over the PD-NGRC cohort, so that we could retain as many samples as possible for the smaller cohort.
3.2 Population stratification
Checking relatedness among individuals
Relatedness among each pair of subjects was tested by calculating IBD119. For subject pairs with a proportion IBD score greater than 0.2, one was randomly chosen and removed from the data, and the other was kept.
Matching population structure between cases and controls
Because spurious allelic associations can be discovered due to unknown population structure47,120,121, recent GWAS analyses suggest the use of a procedure to ensure balanced population structure between cases and controls119. Here, all subjects were clustered into groups of size 2, each containing one case and one control that are from the same sub-population (based on pairwise identity-by-state distance and the corresponding statistical test), as is implemented in PLINK119.
Future extensions of our method could include parameters capturing population structure directly in the model for genetic interactions, for example, as is described in122. The primary concern in developing and applying our current approach was to ensure that population structure was not introducing spurious between-pathway interactions, so we took this relatively conservative approach to adjust for population stratification. More sophisticated approaches could reduce the number of samples lost in filtering based on population stratification and improve the sensitivity of the method.
3.3 Filtering SNPs in linkage disequilibrium (LD)
For each data set, we selected all SNPs that could be mapped to at least one of the 6744 genes in the collection of pathways used in the pathway-pathway interaction search. A SNP was mapped to all genes that overlap with a +/- 50kb window centered at the SNP, and then mapped to pathways to which the corresponding gene(s) were annotated. For the purposes of computing pathway-level statistics, a SNP was only associated once with each pathway, even if it mapped to multiple genes in the pathway.
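The mapping rule above can be illustrated with a short Python sketch. The data structures (tuples for gene coordinates, dictionaries for pathway membership) and the function names are hypothetical, and the window is interpreted here as 50 kb on each side of the SNP, which is an assumption about the wording above.

```python
def map_snp_to_genes(snp, genes, flank=50_000):
    """Return the genes overlapping a +/- `flank` bp window around the SNP.

    snp   : (chromosome, position)
    genes : list of (name, chromosome, start, end)
    """
    chrom, pos = snp
    lo, hi = pos - flank, pos + flank
    return [name for name, g_chrom, start, end in genes
            if g_chrom == chrom and start <= hi and end >= lo]


def snps_per_pathway(snp_to_genes, pathways):
    """Associate each SNP at most once with every pathway containing one of its genes."""
    result = {name: set() for name in pathways}
    for snp_id, gene_names in snp_to_genes.items():
        for name, members in pathways.items():
            if members.intersection(gene_names):
                result[name].add(snp_id)  # a set, so repeated gene hits count once
    return result


genes = [("G1", "1", 100_000, 120_000),
         ("G2", "1", 200_000, 210_000),
         ("G3", "2", 100_000, 120_000)]
hits = map_snp_to_genes(("1", 160_000), genes)
membership = snps_per_pathway({"rs1": hits}, {"P": {"G1", "G2"}, "Q": {"G3"}})
print(hits, membership)
```

Using a set per pathway reflects the rule that a SNP is associated only once with a pathway even when it maps to several of its genes.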
To avoid the discovery of trivial bipartite structures, SNPs in linkage disequilibrium (LD) need to be removed before between or within-pathway enrichment of SNP-SNP interactions is conducted. Two general approaches can be pursued towards this goal: 1) removing SNPs in LD before calculating pairwise SNP-SNP interactions; and 2) removing structures that emerge as a result of SNPs in LD after calculating pairwise SNP-SNP interactions.
The first alternative is more likely to miss informative SNP-SNP interactions than the second because it only considers a subset of all SNPs, but is more computationally efficient and scalable. It is worth noting that a biclustering algorithm pursuing the second approach was designed in36 to condense a yeast SNP-SNP interaction network into an LD-LD network. The algorithm described in that work took the SNP-SNP interaction matrix as input and searched for sets of consecutive SNPs that had a statistically significant number of across-set SNP-SNP interactions based on a hypergeometric test. The algorithm was applied on a yeast SNP-SNP interaction network (originally constructed in123) with 1977 SNPs, where the LD effect was assumed to be localized to fewer than 60 SNPs for computational reasons1. We attempted to apply this algorithm to the human genotype datasets used in this paper and observed that the algorithm could handle about 1500 SNPs (with a threshold of σ below 60) but not beyond. For example, on a data set with 2000 SNPs, the program did not finish in two days with σ = 100. Given issues with scalability of this approach, we adopted the first alternative, which is to select a subset of SNPs that are not in LD. To accomplish this, we used a procedure in PLINK119 to select a subset of unlinked SNPs from each GWAS dataset, specifically “--indep-pairwise 50 5 0.1”. With this procedure, PLINK searches each window of 50 SNPs with a sliding step of 5 SNPs, and selects a subset of SNPs with pairwise r² below 0.1 within each sliding window. After this procedure, ~15,000-20,000 SNPs were left in each dataset, and the highest r² between any pair of SNPs within any window of 1Mb was lower than the commonly used threshold for controlling LD (r² < 0.2)7,124, demonstrating that the LD was effectively controlled. Note that by using a stringent r² threshold of 0.1, we are undoubtedly ignoring many informative SNPs. However, we chose this conservative approach to minimize the chance that spurious BPMs resulted from remaining LD structure. Future work that explores less conservative approaches to handling SNPs in LD would be worthwhile.
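The sliding-window selection described above can be sketched in a few lines of Python. This is a simplified greedy illustration rather than PLINK's actual algorithm: it uses the squared Pearson correlation of genotype counts as r², always drops the later SNP of a correlated pair, and the function names are hypothetical.

```python
def r_squared(a, b):
    """Squared Pearson correlation between two genotype vectors (0/1/2 counts)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a monomorphic SNP carries no LD information
    return cov * cov / (var_a * var_b)


def prune_ld(snps, window=50, step=5, r2_max=0.1):
    """Greedy window-based pruning; `snps` is a list of genotype vectors in genomic order."""
    kept = set(range(len(snps)))
    start = 0
    while start < len(snps):
        in_window = [i for i in range(start, min(start + window, len(snps))) if i in kept]
        for pos, i in enumerate(in_window):
            if i not in kept:
                continue
            for j in in_window[pos + 1:]:
                if j in kept and r_squared(snps[i], snps[j]) > r2_max:
                    kept.discard(j)  # simplification: always drop the later SNP
        start += step
    return sorted(kept)


snp_a = [0, 1, 2, 0, 1, 2]
snp_b = [0, 1, 2, 0, 1, 2]   # perfectly correlated with snp_a
snp_c = [1, 0, 1, 1, 2, 1]   # uncorrelated with snp_a
print(prune_ld([snp_a, snp_b, snp_c]))
```

The defaults mirror the window size, step, and r² threshold quoted in the text; a real analysis would rely on PLINK for this step.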
For diseases in which we tested replication of discovered interactions on independent cohorts of the same ancestry, cohorts were first combined and then processed using the procedures described above to select the subset of SNPs on which the analysis was run, so that the discovery and replication analyses were consistent. After selection of SNPs, population stratification and discovery of interactions were then performed independently. We followed this procedure for three of the diseases analyzed: Parkinson’s disease, schizophrenia, and breast cancer.
For prostate cancer, our access to ProC-CGEMS and ProC-BPC3 was gained at different times, so SNPs used in ProC-BPC3 were selected based on the CGEMS cohort. A summary of all processed datasets used in this study is included in Supplementary Table 1.
3.4 Selection of Pathways
833 human pathways (gene sets) were collected from the Kyoto Encyclopedia of Genes and Genomes (KEGG)125,126, Biocarta127, and Reactome51 (Supplementary Table 2). We excluded from our analysis any pathway with fewer than 10 or more than 300 genes, or fewer than 10 or more than 300 SNPs mapping to the pathway after LD control, to avoid pathways that were too small to provide sufficient statistical power or too large to provide specific biological insights.
4. SNP-SNP genetic interaction estimation
MM, Mm and mm are used to denote the three genotypes of each SNP, i.e., majority homozygous, heterozygous and minority homozygous, respectively. Our method implements multiple disease models, which affect how interactions are estimated at the SNP-SNP interaction level. A minor allele (m) at each locus could be additive, dominant or recessive in the context of different diseases. For the additive model, we used the standard logistic regression-based model implemented in CASSI28 to quantify the interaction between two SNPs coded as follows, mm=2, Mm=1, MM=0. In this model, the goodness-of-fit was compared between a standard logistic regression model with an interaction term between the two loci of interest and a standard logistic regression without an interaction term, and the significance of the interaction was measured by a likelihood ratio test28. We refer to this type of SNP-SNP interaction as an additive-additive (AA) model based interaction. In the dominant model, a SNP is encoded as mm=1, Mm=1, MM=0. In the recessive model, a SNP is encoded as mm=1, Mm=0, MM=0. Because the minor allele could have recessive (R) or dominant (D) contribution to disease at two different loci comprising an interaction, four types of SNP-SNP interactions were examined: recessive-recessive (RR), dominant-dominant (DD), recessive-dominant (RD), and dominant-recessive (DR) model-based interaction for each pair of SNPs. The interactions under these four models can also be estimated by a logistic regression-based model similar to the AA case described above except with the appropriate encoding of the SNP genotypes. Alternatively, the RR, DD, DR and RD interactions can be estimated by explicit statistical tests (e.g. hypergeometric tests) of the association between a specific genotype combination of two SNPs and a disease of interest, where this association is compared to the association between each of the individual SNPs and the disease (marginal effect). 
Interactions estimated by logistic regression based models directly capture non-additive effects between two SNPs considering different combinations of SNP genotypes. In contrast, interactions estimated by explicit statistical tests have the flexibility of specifically testing certain combinations of genotypes for association with the phenotype. We explored alternative approaches both in representing different disease models and in the estimation of SNP-SNP interactions, and found that RR, DD, DR and RD interactions estimated by explicit statistical tests more likely led to the discovery of significant BPMs/WPMs in the context of our BridGE approach. The measure we developed based on explicit statistical tests, called hygeSSI, is described in detail below. The relationship between hygeSSI and logistic regression based models is explored in more depth in section 8.
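The genotype encodings used by the different disease models in this section can be summarized in a short Python sketch. The dictionary layout and the helper names (`encode`, `joint_carriers`) are illustrative assumptions and are not taken from the BridGE implementation.

```python
# Genotype codes per disease model, as given in the text
# (M = major allele, m = minor allele).
ENCODINGS = {
    "additive":  {"MM": 0, "Mm": 1, "mm": 2},
    "dominant":  {"MM": 0, "Mm": 1, "mm": 1},
    "recessive": {"MM": 0, "Mm": 0, "mm": 1},
}


def encode(genotypes, model):
    """Encode one SNP's genotype calls under the chosen disease model."""
    table = ENCODINGS[model]
    return [table[g] for g in genotypes]


def joint_carriers(snp_x, snp_y, model_x, model_y):
    """Mark samples carrying the '1' state at both binary-coded SNPs.

    For example, model_x='recessive' and model_y='dominant' gives the RD combination.
    """
    x, y = encode(snp_x, model_x), encode(snp_y, model_y)
    return [int(a == 1 and b == 1) for a, b in zip(x, y)]


snp_x = ["mm", "Mm", "mm"]
snp_y = ["Mm", "mm", "MM"]
print(encode(snp_x, "additive"))                            # [2, 1, 2]
print(joint_carriers(snp_x, snp_y, "recessive", "dominant"))  # [1, 0, 0]
```

The binary joint-carrier vector is the kind of genotype combination that an explicit statistical test, as opposed to a regression interaction term, would relate to case-control status.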
4.1 hygeSSI
We designed a hypergeometric-based measurement (hygeSSI) to estimate the interactions between two binary-coded SNPs (dominant or recessive as described above). The hypergeometric p-value for a pair of binary-coded SNPs with respect to a case-control cohort is calculated as follows:
\[ P_T(S_x, S_y, C) = 1 - \sum_{i=0}^{X-1} \frac{\binom{K}{i}\binom{M-K}{N-i}}{\binom{M}{N}} \]
where $S_x$ and $S_y$ are two SNPs; $M$ is the total number of samples; $N$ is the total number of samples in class $C$; $K$ is the total number of samples that have genotype $T$; and $X$ is the total number of samples that have genotype $T$ in class $C$.
We use $P_1(S_x, C)$ and $P_1(S_y, C)$ to represent the main effects of the individual SNPs $S_x$ and $S_y$, and $P_{11}(S_x, S_y, C)$, $P_{10}(S_x, S_y, C)$, $P_{01}(S_x, S_y, C)$, and $P_{00}(S_x, S_y, C)$ to represent the effects of all genotype combinations of the pair. With a nominal p-value threshold ($\alpha$), we first require a SNP pair to have significant association with the phenotype: $P_{11}(S_x, S_y, C) \le \alpha$. In addition, we specifically exclude instances where other allele combinations show significant association with the trait, i.e. we require $P_{10}(S_x, S_y, C) > \alpha$, $P_{01}(S_x, S_y, C) > \alpha$, and $P_{00}(S_x, S_y, C) > \alpha$. Given a binary-coded SNP pair ($S_x$, $S_y$) and a binary class label $C$, the following measure hygeSSI (Hypergeometric SNP-SNP Interaction) was defined to estimate the genetic interaction between two SNPs $S_x$ and $S_y$ (specifically for $P_{11}$):
As described in a recent comprehensive review20, algorithms based on logistic/linear regression, multifactor dimensionality reduction (MDR)128, entropy or information theory129 have been developed to measure genetic interactions. All of these approaches quantify the synergistic effect of SNP pairs by comparing the relative strength of the association between a pair of SNPs and a disease trait with the strength of the associations between two individual SNPs and the disease trait. A few of these alternatives were tested in the context of our method and did not provide the significant results we achieved with the metric above. We designed the above hygeSSI measure because it explicitly captures the interaction between combinations of specific genotypes of two loci.
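As an illustration of the hypergeometric test underlying hygeSSI, the sketch below computes the one-sided enrichment p-value from the four counts defined in this section (total samples, samples in the class, carriers of the genotype combination, and carriers within the class). The scoring function that follows it is only one plausible form, assumed here for illustration: it takes the negative log ratio of the joint p-value to the stronger of the two marginal p-values and returns zero whenever the joint combination is not significant or any other genotype combination is. The default threshold of 0.05 is likewise a placeholder, not a value stated in the text.

```python
from math import comb, log10


def hypergeom_pvalue(M, N, K, X):
    """P(at least X of the K carriers fall among the N class members), M samples in total."""
    total = comb(M, N)
    tail = sum(comb(K, i) * comb(M - K, N - i) for i in range(X, min(K, N) + 1))
    return tail / total


def hygessi_11(p11, p10, p01, p00, p_x, p_y, alpha=0.05):
    """Assumed scoring form: joint effect relative to the stronger marginal effect.

    Returns 0.0 unless the joint combination is significant (p11 <= alpha) and
    none of the other genotype combinations is (all > alpha).
    """
    if p11 > alpha or min(p10, p01, p00) <= alpha:
        return 0.0
    return -log10(p11 / min(p_x, p_y))


# 10 samples, 5 cases, 4 carriers of the genotype combination, all 4 of them cases.
p = hypergeom_pvalue(M=10, N=5, K=4, X=4)
print(round(p, 4))
```

Under this assumed form, a positive score means the joint genotype combination is more strongly associated with the phenotype than either SNP alone, which matches the stated intent of capturing interactions between specific genotypes of two loci.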
4.2 Construction of SNP-SNP interaction networks
We constructed SNP-SNP interaction networks to serve as the basis for the pathway level BPM tests based on each of the disease model assumptions described above. An additive-additive (AA) interaction network was constructed by the described logistic regression based approach, where SNP-SNP edge scores were derived from the -log10 p-value resulting from the likelihood ratio test. The recessive-recessive (RR) and dominant-dominant (DD) interaction networks were computed based on the hygeSSI metric described above, and only positive interactions were kept in the network (i.e. where the joint effect of the SNP-SNP pair under the corresponding disease model was stronger than any marginal or alternative combination of SNPs). In addition to the above three networks, we also constructed a hybrid SNP-SNP interaction network in which interactions under recessive and dominant disease model could coexist. To do this, we integrated all four networks (RR, DD, RD and DR) into a single network (RD-combined) by taking the maximum hygeSSI among the four interaction networks for each pair of SNPs.
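The construction of the hybrid RD-combined network can be sketched as a per-pair maximum over the four model-specific networks. The dictionary representation (SNP pair mapped to hygeSSI score), the treatment of pairs as unordered, and the function name are assumptions made for this illustration.

```python
def combine_networks(networks):
    """Keep, for every SNP pair, the largest positive score seen in any input network.

    Each network is a dict mapping a (snp_a, snp_b) tuple to an interaction score.
    """
    combined = {}
    for network in networks:
        for pair, score in network.items():
            key = tuple(sorted(pair))          # treat (a, b) and (b, a) as the same pair
            if score > combined.get(key, 0.0):  # only positive interactions are kept
                combined[key] = score
    return combined


rr = {("rs1", "rs2"): 0.5}
dd = {("rs1", "rs2"): 1.2, ("rs2", "rs3"): 0.3}
rd = {}
dr = {("rs3", "rs2"): 0.9}
rd_combined = combine_networks([rr, dd, rd, dr])
print(rd_combined)
```

Taking the maximum lets a single network capture pairs whose strongest signal arises under different disease-model combinations.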
5. Measuring pathway-pathway interactions
5.1 Estimating pathway-pathway interactions based on the SNP-SNP interaction network
For each pair of pathways, we want to test if the number of SNP-SNP interactions between them is significantly higher than expected given the overall density of the SNP-SNP network as well as the marginal interaction density of the two pathways involved. Enrichment analysis based on SNP-SNP interactions is computationally challenging, and thus we chose to binarize the hygeSSI values (based on a lenient threshold) to make follow-up computation efficient and scalable. After binarization, we divided the SNP-SNP interaction network into two networks based on whether the joint mutation of a SNP pair is more prevalent in the case or control group, which we refer to as the risk and protective networks, respectively.
For each pathway-pathway interaction, we first removed the common SNPs shared between the two pathways. Then, we tested if the observed SNP-SNP interaction density between the two pathways was significantly higher than expected globally (the global network density) and locally (the marginal density of SNP-SNP interactions of the two pathways). Specifically, the marginal density of a pathway is calculated as the SNP-SNP interaction density between the SNPs mapped to the pathway and all other SNPs in the network. We computed a chi-square statistic to test differences from both global and local density, namely $\chi^2_{global}$ and $\chi^2_{local}$. The chi-square test assumes the SNP-SNP interactions in a network are independent, which may not be true for a variety of reasons. So, in addition to these chi-square statistics, we used permutation tests to derive an empirical p-value for each pathway-pathway interaction. To do this, we randomly shuffled the SNP-pathway membership (NP = 100,000-200,000 times), and for a given pathway-pathway interaction ($bpm_i$), we compared its observed $\chi^2_{global}$ and $\chi^2_{local}$ with the values from these random permutations to obtain a permutation-based p-value ($p_{perm}$). We used $p_{perm}$ together with $\chi^2_{global}$ and $\chi^2_{local}$ for BPM discovery, as described in detail in the next two sections.
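The permutation test over SNP-pathway membership can be sketched as follows. This is a simplified illustration under stated assumptions: it scores a pathway pair by the raw count of binarized interactions between the two pathways rather than by the chi-square statistics, it uses an add-one estimator for the empirical p-value, and all names are hypothetical.

```python
import random


def between_pathway_count(edges, path_a, path_b):
    """Count interactions linking the two pathways after dropping shared SNPs.

    Returns (observed count, number of possible cross-pathway pairs).
    """
    shared = path_a & path_b
    a, b = path_a - shared, path_b - shared
    count = sum(1 for x, y in edges
                if (x in a and y in b) or (x in b and y in a))
    return count, len(a) * len(b)


def permutation_pvalue(edges, snps, path_a, path_b, n_perm=1000, seed=0):
    """Empirical p-value obtained by shuffling which SNPs belong to the pathways."""
    rng = random.Random(seed)
    observed, _ = between_pathway_count(edges, path_a, path_b)
    snps = sorted(snps)
    hits = 0
    for _ in range(n_perm):
        shuffled = snps[:]
        rng.shuffle(shuffled)
        relabel = dict(zip(snps, shuffled))  # random one-to-one relabelling of SNPs
        perm_a = {relabel[s] for s in path_a}
        perm_b = {relabel[s] for s in path_b}
        count, _ = between_pathway_count(edges, perm_a, perm_b)
        if count >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Toy network: 20 SNPs, every SNP in pathway A interacts with every SNP in pathway B.
path_a, path_b = set(range(5)), set(range(5, 10))
edges = {(i, j) for i in path_a for j in path_b}
p_perm = permutation_pvalue(edges, range(20), path_a, path_b, n_perm=200)
print(between_pathway_count(edges, path_a, path_b), p_perm)
```

Because the relabelling preserves pathway sizes and overlaps, the null distribution reflects the network's structure while breaking the link between SNPs and pathways.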
5.2 Correction for multiple hypothesis testing
Because a large number of pathway pairs (all possible pathway-pathway combinations) are tested in the search for significant BPMs, correction for multiple hypothesis testing is needed. To estimate a false discovery rate, we employed sample permutations (NP = 10 times) to derive the number of expected BPMs discovered by chance at each level of significance. We randomly shuffled the original case-control groups 10 times while maintaining the matched case-control population structure. For each permuted dataset, the same, complete pipeline for BPM discovery was performed, including calculation of the SNP-SNP interaction network after permutation, which was then thresholded at a fixed interaction density matching the density chosen for the real sample labels. From these sample permutations, we obtained three null distributions ($\chi^2_{global}$, $\chi^2_{local}$, and $p_{perm}$), from which we estimated the false discovery rate (FDR) for each BPM (e.g., $bpm_i$). Specifically, we compared the number of BPMs observed in each real dataset that have better overall statistics than $bpm_i$ with the corresponding random expectation estimated from the three null distributions derived from sample permutations ($\chi^2_{global}$, $\chi^2_{local}$, and $p_{perm}$):
A simpler approach to estimate FDR would be to use only the SNP permutation-based p-value, $p_{perm}$, in the above formula. However, we chose to use all three measurements ($\chi^2_{global}$, $\chi^2_{local}$, and $p_{perm}$) because we observed that in some cases the permutation-based p-value alone did not provide enough resolution to differentiate among top BPMs (this could be improved with additional SNP permutations, but this is computationally expensive). $\chi^2_{global}$ and $\chi^2_{local}$ provide higher resolution measures of significance of each BPM and, when combined with the permutation-based p-value, can differentiate among the top-most significant discoveries.
We emphasize that we have used a hybrid permutation strategy to assess significance of the discovered structures. The primary permutation applied was to permute the SNP labels, for which 100,000-200,000 permutations were used for each dataset analyzed. The sample (case-control label) permutation approach mentioned above was used in addition to the SNP permutation strategy to estimate our false discovery rate across all discovered interactions. For each of the 10 sample permutations, we ran the full set of 100,000-200,000 SNP permutations. This hybrid approach provides a robust estimate of significance of the discovered pathway interactions and properly corrects for multiple testing.
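The FDR estimate based on sample permutations can be illustrated with a brief Python sketch. It assumes that a BPM counts as "at least as good" as a reference BPM when both chi-square statistics are no smaller and the permutation p-value is no larger, and that the FDR is the ratio of the average number of such BPMs in the permuted datasets to the number in the real dataset. These details and the function names are assumptions for illustration.

```python
def at_least_as_good(stats, ref):
    """stats and ref are (chi2_global, chi2_local, p_perm) triples."""
    return stats[0] >= ref[0] and stats[1] >= ref[1] and stats[2] <= ref[2]


def estimate_fdr(ref, real_bpms, permuted_bpm_sets):
    """Expected number of chance discoveries divided by the observed number."""
    observed = sum(at_least_as_good(s, ref) for s in real_bpms)
    if observed == 0:
        return 1.0
    expected = sum(
        sum(at_least_as_good(s, ref) for s in perm_bpms)
        for perm_bpms in permuted_bpm_sets
    ) / len(permuted_bpm_sets)
    return min(1.0, expected / observed)


# Toy statistics for three BPMs in the real data and one BPM in each of two
# sample-permuted datasets.
real = [(30.0, 20.0, 0.0001), (10.0, 8.0, 0.01), (2.0, 1.0, 0.5)]
perms = [[(12.0, 9.0, 0.005)], [(1.0, 1.0, 0.9)]]
print(estimate_fdr(real[0], real, perms))  # 0.0
print(estimate_fdr(real[1], real, perms))  # 0.25
```

Requiring all three statistics to be at least as extreme is what gives the combined criterion finer resolution than the permutation p-value alone.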
We also conducted a study to explore the sensitivity of our FDR estimation to the number of sample permutations. Specifically, for the PD-NIA dataset, we performed 1000 sample permutations (and 200,000 SNP permutations within each of these) to derive an estimate of FDR for discoveries in this dataset (Supplementary Table 25). As shown in Supplementary Fig. 5, the FDRs estimated from 10 sample permutations show reasonable agreement with FDRs estimated from 1000 sample permutations (Pearson’s correlation of 0.81).
5.3 Selection of disease models and density thresholds
The method we proposed for pathway-level detection of genetic interactions is general in the sense that any disease model (e.g. RR, DD, RD-combined, and AA) or interaction statistic could be used to discover pathway-level interactions. In this study, we focus on prioritizing a single disease model per disease cohort for full analysis by our pipeline to limit the complexity of data analysis across the 13 GWAS cohorts we explored with our method. Here, we describe the strategy we used to select the disease model to focus on for each GWAS dataset.
To prioritize the disease model and SNP-SNP interaction network density threshold for each data set, we first performed a pilot experiment in which we examined combinations of different disease models and different density thresholds, but with fewer SNP permutations (Supplementary Table 23). To exclude SNP pairs with little or weak interactions from our analysis, we required each SNP pair’s hygeSSI score to be at least 0.2 before applying density-based binarization. For each combination, we performed 10,000 SNP-pathway membership permutations (as compared to 100,000-200,000 for a complete run) to estimate FDRs using a similar procedure as that described in section 5.2, except that SNP permutations were used to estimate FDR instead of sample permutations, as sample permutations are much more computationally expensive. Based on this pilot experiment in each cohort, we chose the disease model and density threshold combination that resulted in the lowest estimated FDR for the top-most significant pathway-pathway interaction. The rationale of using such a pilot experiment is to identify the disease model that is most likely to discover significant pathway-level interactions while limiting the computational burden of applying our approach to several GWAS cohorts under multiple disease models. Based on these pilot experiments, which were performed for all 13 cohorts, we ran the complete BridGE pipeline, including 100,000-200,000 SNP permutations and 10 sample permutations, with the disease model and network density threshold chosen from the pilot experiments. The results of pilot experiments for all cohorts are reported in Supplementary Table 23, and all full BPM discovery results for all diseases can be found in Supplementary Tables 3 and 9-20, as well as a summary in Supplementary Table 8. We note that for focused application of our approach on a single or small number of cohorts of interest, we would suggest exploring all possible disease models with complete runs.
5.4 Replication in independent cohorts
The significant BPMs discovered from one cohort could be evaluated in another independent cohort for replication. To determine if a discovered BPM was replicated in an independent cohort, we required the BPM to satisfy , , and pperm ≤ 0.05 on the validation cohort. We also performed sample permutation tests (NP=10) for each validation cohort, from which we could generate null distributions for , and pperm in the validation cohort. Given a set of discovered BPMs (e.g. FDR ≤ 0.25), we calculated fold enrichment by comparing the number of BPMs discovered from the original dataset that passed the validation criteria to the average number of BPMs that passed the same validation criteria in the random sample permutations. More specifically, given a set of significant BPMs (bpm1,2,…,k) that were discovered from the original cohort, the fold enrichment for replication is defined as:
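Written out from the verbal description above, the fold enrichment takes the following form. The indicator notation V and V^{(j)} is introduced here only for illustration and is not necessarily the notation of the original formula: V(bpm_i) equals 1 if bpm_i passes the validation criteria in the real validation cohort and 0 otherwise, and V^{(j)}(bpm_i) is the same indicator evaluated in the j-th sample permutation of the validation cohort.

```latex
\mathrm{FE} \;=\;
\frac{\displaystyle\sum_{i=1}^{k} V(\mathrm{bpm}_i)}
     {\displaystyle\frac{1}{N_P}\sum_{j=1}^{N_P}\sum_{i=1}^{k} V^{(j)}(\mathrm{bpm}_i)},
\qquad N_P = 10 .
```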
We also evaluated the significance of the fold enrichment using 10,000 bootstrapped BPM sets. Specifically, we randomly selected the same number of BPMs and used the above procedure to evaluate the fold enrichment, and we repeated this 10,000 times to generate a null distribution for the fold enrichment scores in the validation cohort. We then evaluated the significance of the fold enrichment score for our discovered BPM set based on this empirical null distribution. All replication results can be found in Supplementary Tables 6 and 21.
For the BPMs that replicated in an independent cohort, we further checked whether the SNP-SNP interactions supporting the discovered pathway-level interactions were similar between the cohort used for discovery and the independent cohort used for replication. For example, we used the BPMs discovered from PD-NIA (FDR ≤ 0.25) and, for each BPM replicated in PD-NGRC, computed the number of SNP-SNP interactions in common between the PD-NIA and PD-NGRC interaction networks as supporting interactions for the BPM. We used the same permutation approach as that described above for BPM-level validation, except that the SNP-SNP interactions supporting each BPM were compared between the discovery and validation cohorts by a hypergeometric test. This was done for the real validation cohort PD-NGRC first and then repeated 10 times under sample permutations of the validation cohort to estimate a null distribution. A Wilcoxon rank-sum test was then used to evaluate the significance of the SNP-SNP interaction overlap between the replicated BPMs in the real validation cohort and in the randomly sample-permuted validation cohorts (Fig. 4B).
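The resampling step can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the BridGE implementation: the validation outcome of every candidate BPM is assumed to be available as a boolean per BPM (one list for the real validation cohort, one list per sample permutation), random BPM sets are assumed to be drawn without replacement from all candidates, and the handling of a zero denominator is a choice made here. All function and argument names are hypothetical.

```python
import random
from typing import Sequence


def fold_enrichment(idx: Sequence[int],
                    real: Sequence[bool],
                    perms: Sequence[Sequence[bool]]) -> float:
    """Validated count in the real cohort over the mean count across permutations."""
    observed = sum(real[i] for i in idx)
    expected = sum(sum(p[i] for i in idx) for p in perms) / len(perms)
    if expected > 0:
        return observed / expected
    # Zero expected count: treated as infinite enrichment if anything validated
    # (an assumption of this sketch, not stated in the text).
    return float("inf") if observed > 0 else 0.0


def bootstrap_p_value(discovered: Sequence[int],
                      real: Sequence[bool],
                      perms: Sequence[Sequence[bool]],
                      n_boot: int = 10_000,
                      seed: int = 0) -> float:
    """Empirical p-value of the observed fold enrichment against random BPM sets."""
    rng = random.Random(seed)
    fe_obs = fold_enrichment(discovered, real, perms)
    candidates = range(len(real))
    hits = 0
    for _ in range(n_boot):
        # Random BPM set of the same size as the discovered set.
        sample = rng.sample(candidates, len(discovered))
        if fold_enrichment(sample, real, perms) >= fe_obs:
            hits += 1
    # Add-one correction keeps the empirical p-value strictly positive.
    return (hits + 1) / (n_boot + 1)
```

Here `discovered` holds the indices of the significant BPMs among all candidate BPMs, so the same validation flags serve both the observed set and the random sets.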
5.5 BPM redundancy
Because many of the curated gene sets overlap, we needed to control for redundancy in the discovered BPMs. To do this, when reporting total discoveries, we filtered BPMs based on their relative overlap in terms of SNP-SNP interactions using an overlap coefficient. The overlap coefficient between two BPMs is defined as the number of overlapping SNP pairs divided by the number of possible SNP pairs in the smaller BPM.
For the significant BPMs discovered, we computed all pairwise overlap coefficients and used a maximum allowed similarity score of 0.25 as a cutoff. BPM pairs whose overlap coefficient exceeded this cutoff were treated as linked, and we reported the number of unique BPMs as the number of connected components in the resulting graph. For visualization purposes (Fig. 3), we selected representative BPMs from each connected component, prioritizing BPMs that validated in the independent cohort (PD-NGRC). Significance of the validation of the set of BPMs was evaluated on the entire set of discovered BPMs using the permutation procedures described above, which directly accounts for the redundancy among the discovered BPMs.
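The redundancy filter described above can be illustrated with a short Python sketch. It assumes each BPM is represented by the set of all possible SNP pairs bridging its two pathways, and that two BPMs are linked when their overlap coefficient is strictly greater than the cutoff; the function names and the union-find bookkeeping are choices made for this illustration rather than a description of the BridGE code.

```python
from itertools import combinations, product
from typing import Dict, FrozenSet, Iterable, Set

Pair = FrozenSet[str]


def bpm_pairs(pathway_a: Iterable[str], pathway_b: Iterable[str]) -> Set[Pair]:
    """All possible SNP pairs bridging the SNP sets of two pathways."""
    return {frozenset((a, b)) for a, b in product(pathway_a, pathway_b) if a != b}


def overlap_coefficient(x: Set[Pair], y: Set[Pair]) -> float:
    """Shared SNP pairs divided by the number of pairs in the smaller BPM."""
    smaller = min(len(x), len(y))
    return len(x & y) / smaller if smaller else 0.0


def count_unique_bpms(bpms: Dict[str, Set[Pair]], cutoff: float = 0.25) -> int:
    """Number of connected components when BPMs with overlap > cutoff are linked."""
    parent = {name: name for name in bpms}

    def find(n: str) -> str:
        # Union-find lookup with path halving.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for a, b in combinations(bpms, 2):
        if overlap_coefficient(bpms[a], bpms[b]) > cutoff:
            parent[find(a)] = find(b)
    return len({find(n) for n in bpms})
```

Counting components rather than individual BPMs means that a chain of partially overlapping BPMs is reported as a single discovery, even if its two ends share few SNP pairs.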
6. Measuring within-pathway interactions
In addition to the between-pathway model (BPM), we also tested for enrichment of genetic interactions within each pathway34 (within-pathway models, WPMs). All of the measures and procedures described above for BPMs apply directly to the WPM case, except that we specifically consider SNP pairs connecting genes within the same pathway/gene set instead of between pathway pairs. For WPMs, the false discovery rate and validation statistics were computed separately from BPMs. All WPM discovery results can be found in Supplementary Tables 3 and 9-20.
7. Identifying pathway hubs in the SNP-SNP interaction network
Since both the “between-pathway model” and “within-pathway model” analyses were designed to avoid discoveries caused by the higher marginal interaction density of individual pathways, pathways that frequently interact with many loci across the genome (as opposed to localized interactions with functionally coherent gene sets) are less likely to appear in our pathway-pathway or within-pathway interactions. However, such pathways may also be disease relevant, as they reflect pathways that modify the disease risk associated with a large number of other variants, so we also report pathways exhibiting these characteristics with BridGE (we refer to these as “PATH” discoveries in BridGE output files). For PATH discovery, the procedure is similar to that for BPMs and WPMs, with a minor modification to the scoring of each pathway. Specifically, each pathway is represented by a vector of its associated SNPs’ degrees in the SNP-SNP interaction network. We then applied a one-tailed rank-sum test to compare each pathway-associated degree vector with the non-pathway-associated degree vector to see if the pathway-associated SNPs exhibited significantly more interactions than the entire set of SNPs. PATH discovery and validation are then done by repeating the same steps as BPM/WPM discovery but replacing the and statistics with the rank-sum test p-value (in –log10 scale). All PATH discovery results can also be found in Supplementary Tables 3 and 9-20. Many of these pathways also have clear relevance to the disease cohort in which they were discovered. For example, applying BridGE to discover such hub pathways in the context of Parkinson’s disease resulted in 3 significant pathways after removing redundancy (FDR ≤ 0.25), including the same Golgi-associated vesicle biogenesis gene set as well as the IL-12 and STAT4 signaling pathway (Biocarta) discussed in the main text.
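The PATH score can be illustrated with a self-contained Python sketch of a one-tailed rank-sum test on SNP degrees. The normal approximation with midranks and a tie correction is an assumption of this sketch (the text does not specify how the rank-sum p-value was computed), and the function names are hypothetical.

```python
import math
from typing import Sequence


def one_tailed_rank_sum(pathway_deg: Sequence[float],
                        other_deg: Sequence[float]) -> float:
    """P-value for 'pathway SNP degrees tend to be larger' (normal approximation)."""
    n1, n2 = len(pathway_deg), len(other_deg)
    if n1 == 0 or n2 == 0:
        raise ValueError("both degree vectors must be non-empty")
    n = n1 + n2
    labelled = sorted([(d, 0) for d in pathway_deg] + [(d, 1) for d in other_deg])
    rank_sum, tie_term, i = 0.0, 0.0, 0
    while i < n:
        j = i
        while j < n and labelled[j][0] == labelled[i][0]:
            j += 1
        t = j - i                      # size of this group of tied values
        mid_rank = (i + 1 + j) / 2     # average of the 1-based ranks i+1..j
        rank_sum += mid_rank * sum(1 for k in range(i, j) if labelled[k][1] == 0)
        tie_term += t ** 3 - t
        i = j
    u = rank_sum - n1 * (n1 + 1) / 2   # Mann-Whitney U for the pathway SNPs
    mean = n1 * n2 / 2
    var = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var == 0:                       # all degrees identical
        return 1.0
    z = (u - mean) / math.sqrt(var)
    return 0.5 * math.erfc(z / math.sqrt(2))   # upper-tail probability


def path_score(pathway_deg: Sequence[float], other_deg: Sequence[float]) -> float:
    """Rank-sum p-value on the -log10 scale, as used for PATH scoring."""
    return -math.log10(one_tailed_rank_sum(pathway_deg, other_deg))
```

Because the test is one-tailed, a pathway whose SNPs have systematically fewer interactions than the rest of the network receives a p-value near 1 and a score near 0.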
8. Comparison of hygeSSI interactions with logistic regression-based interactions
We examined whether the interactions captured by hygeSSI were non-additive as measured by a standard logistic regression-based interaction measure. We applied the logistic regression model to the PD-NIA data and computed RR, DD, RD and DR interaction networks (binary encoding as described earlier). We also integrated these 4 logistic regression-based networks to form an RD-combined network. We then checked (1) whether the top SNP-SNP interactions based on hygeSSI were significant (p ≤ 0.05) in logistic regression-based tests, and (2) whether the significant BPMs discovered from a hygeSSI interaction network show significance (, , and pperm ≤ 0.05) based on SNP-SNP interactions estimated from logistic regression. This analysis revealed that among the top 1% of hygeSSI interactions, 93% are significant based on a logistic regression-based test for interaction. For the significant BPMs (FDR ≤ 0.05), 100% are also significant if only SNP-SNP interactions also supported by a logistic regression model are considered. These data suggest that SNP-SNP interactions captured by hygeSSI do represent non-additive interactions as defined by a logistic regression model. Detailed results from this comparison can be found in Supplementary Table 24. Further evaluation of different disease models and different measures for estimating SNP-SNP interactions in the context of BridGE will be the focus of future work.
9. Evaluation of significance of individual SNP-SNP interaction tests
For SNP-SNP pairs that supported the between-pathway interaction reported in Fig. 2B, we checked the statistical significance of SNP-SNP interaction pairs tested individually. We measured all pairwise additive-additive (AA), recessive-recessive (RR), dominant-dominant (DD) interactions. We then performed a permutation test in which sample labels were permuted 10 times and for each permutation, all pairwise AA, RR, DD interactions were computed for each SNP pair. These permutations were used to estimate a false discovery rate (FDR) for those SNP-SNP pairs supporting the reported BPM. No individual SNP-SNP pairs were significant after FDR-based multiple hypothesis correction (Fig. 2D, Supplementary Fig. 1).
10. Pathway enrichment analysis of single locus effects
To check if the pathways involved in the significant BPMs discovered in PD-NIA were enriched for SNPs with moderate univariate association with Parkinson’s disease, we performed single pathway enrichment analysis for the same set of 685 pathways used for BPM discovery. In the single pathway enrichment analysis, we used a hypergeometric test as the SNP-level statistic for measuring univariate association (risk and protective associations were evaluated separately) for three different disease models: 1) recessive; 2) dominant; and 3) a combination of recessive and dominant, in which each SNP was tested under both the recessive and dominant disease models and the more significant result was assigned to the SNP. We then used Wilcoxon’s rank-sum test to check if a pathway was enriched for SNPs with higher association than the background (all SNPs). With 10,000 sample permutations, we computed the FDR for each individual pathway (both risk and protective associations) using the same procedure described in section 5.2. The results are summarized in Supplementary Table 5.
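A hypergeometric SNP-level statistic of the kind described above might look like the following Python sketch. The exact formulation used in the analysis is not given in the text, so the setup here is an assumption: for a risk association, the statistic is the upper-tail probability of seeing at least the observed number of cases among carriers of the risk genotype, where carriers are defined by the disease model (for example, homozygous minor under the recessive model). The function name and argument names are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(n_total: int, n_cases: int,
                         n_carriers: int, n_carrier_cases: int) -> float:
    """P(X >= n_carrier_cases) for X ~ Hypergeometric(n_total, n_cases, n_carriers).

    n_total         -- number of individuals in the cohort
    n_cases         -- number of cases among them
    n_carriers      -- individuals carrying the genotype defined by the disease model
    n_carrier_cases -- carriers who are cases
    """
    upper = min(n_cases, n_carriers)
    denom = comb(n_total, n_carriers)
    numer = sum(comb(n_cases, k) * comb(n_total - n_cases, n_carriers - k)
                for k in range(n_carrier_cases, upper + 1))
    return numer / denom
```

Under the same assumptions, a protective association could be scored by applying the function with controls in place of cases.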
11. Comparison of pathways discovered by BridGE with previously reported disease risk loci from the GWAS catalog
To check if previously reported singly-associated SNPs also appear in our discovered pathway-level interactions, we compared our BridGE-discovered pathways with pathways that could be linked to disease risk loci reported in the NHGRI-EBI GWAS catalog130 (Ensembl release version 87, retrieved on Feb 6, 2017). Based on the GWAS catalog, the numbers of genes linked to known risk loci (p ≤ 2.0 × 10⁻⁵) in each disease are: 143 (144 SNPs, Parkinson’s disease), 1009 (824 SNPs, Schizophrenia), 134 (172 SNPs, Breast cancer), 71 (57 SNPs, Hypertension), 249 (234 SNPs, Prostate cancer) and 294 (288 SNPs, Type II diabetes). For each disease, we summarized all pathways that were discovered by BridGE (FDR ≤ 0.25) and identified pathways that were implicated by individually associated SNPs reported in the GWAS catalog (a SNP mapping to a single gene in a given pathway was assumed to implicate the corresponding pathway). For context, for each disease, we also summarize the total number of genes implicated by GWAS-identified SNPs, how many of these map to the 833 pathways we used in our study, and how many of them can be linked to the significant pathways identified by BridGE. These results are presented in Supplementary Table 22.
12. Dependence of interaction discoveries on the assumed disease model
While we tested multiple disease models (additive, dominant, recessive, and combined dominant-recessive), the most significant discoveries for the majority of diseases examined were reported when using a dominant or combined model as measured by our SNP-SNP interaction metric131. The relative frequency of interactions under a dominant vs. a recessive model may be largely due to our increased power to detect interactions between SNPs with dominant effects compared to recessive effects. More specifically, individuals with both heterozygous and homozygous (minor allele) genotypes at two interacting loci would be affected under a dominant disease model, while only individuals with homozygous (minor allele) genotypes would be affected in a recessive disease model. The number of individuals homozygous at two interacting loci can be quite small depending on the allele frequency, which limits our power to discover them. Thus, the larger number of discoveries based on a dominant model assumption relative to a recessive model is likely a reflection of difference in statistical power and not an indication that genetic interactions among alleles with dominant effects are contributing more strongly to disease risk. We observed that interactions derived from an additive disease model provided the fewest significant discoveries when used in the context of BridGE based on the pilot experiments (Supplementary Table 23). To understand this, we investigated whether the SNP-SNP interactions supporting the BPMs discovered under the combined dominant-recessive model for the PD-NIA cohort were non-additive when evaluated using a logistic-regression based interaction test as opposed to the direct association tests used for our dominant and recessive disease models131. 
Most SNP-SNP interactions supporting the PD-NIA discoveries were indeed non-additive when assessed using the logistic regression framework, but these were not necessarily ranked among the highest SNP-SNP pairs when assessed in the context of a logistic regression model131 (Supplementary Table 24), which may explain the difference in results under the additive vs. recessive or dominant disease models. An important distinguishing feature of the SNP-level interaction metric we use is that we specifically identify the small subset of individuals with the appropriate combination of genotypes (dominant model: heterozygous for minor allele at two candidate loci; recessive model: homozygous for minor allele at two candidate loci) and directly test for association with the disease phenotype, whereas for the additive model, an interaction term must explain a sufficient fraction of the variance across the entire population for it to reach significance. This distinction may play a role in why we are able to discover pathway-level genetic interactions with the metric proposed here but rarely with a standard additive model. It is worth noting that the core of the BridGE approach, discovering genetic interactions in aggregate rather than in isolation, is readily adaptable to other disease models or other statistical measures of interaction. Further exploration of different disease models as well as different statistical measures of interaction123,132 would be worthwhile.
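The power argument made earlier in this section, that far fewer individuals carry the relevant genotype combination under a recessive model than under a dominant model, can be made concrete with a small calculation. The sketch below assumes Hardy-Weinberg equilibrium at each locus and independence between the two loci; both are simplifying assumptions introduced here, and the function name is hypothetical. Following the description earlier in this section, a dominant-model carrier is taken to be an individual with at least one minor allele.

```python
def joint_carrier_fraction(maf1: float, maf2: float, model: str) -> float:
    """Expected fraction of individuals carrying the relevant genotype at both loci."""

    def carrier(q: float) -> float:
        if model == "dominant":
            return 1 - (1 - q) ** 2    # heterozygous or homozygous minor
        if model == "recessive":
            return q ** 2              # homozygous minor only
        raise ValueError(f"unknown model: {model}")

    # Independence between the two loci is assumed here.
    return carrier(maf1) * carrier(maf2)
```

With a minor allele frequency of 0.2 at both loci, the expected fraction is 0.36 × 0.36 ≈ 0.13 under the dominant model but only 0.04 × 0.04 = 0.0016 under the recessive model, so a cohort of 2,000 individuals would be expected to contain roughly 259 double carriers in the first case and about 3 in the second.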
13. Power analysis based on interaction simulation study
To characterize the power of our BridGE approach with respect to sample size, effect size, minor allele frequency and pathway size, we used a two-stage simulation approach. We first generated synthetic GWAS datasets with embedded SNP-SNP interaction pairs using GWAsimulator100. Specifically, we used PD-NIA as input to GWAsimulator and embedded SNP-SNP interactions with different minor allele frequencies (e.g. 0.05, 0.1, 0.15, 0.2 and 0.25) and a range of interaction effects (e.g. d11=d12=d21=d22=1.1, 1.5, 2, 2.5, 3 and 5, where 0, 1, 2 refer to the number of minor alleles present in a given genotype for an individual SNP, and d11, d12, d21, and d22 are defined as the relative risk of that genotype (11, 12, 21 or 22) versus 00)100. We also varied the number of samples (genotypes) in the simulation (e.g. 200, 500, 1000, 2000, 5000 and 10000). In all simulations, we specified the disease prevalence to be 0.05 and a dominance effect for all disease SNPs with PR1=1 (see GWAsimulator for more details)100. Under different scenarios (combinations of different minor allele frequencies, interaction effects and sample sizes), we embedded 100 SNP pairs and measured the percentage of SNP-SNP interactions that were identified by our pairwise SNP-SNP interaction measure, hygeSSI, at a 1% network density (i.e. SNP-SNP pairs whose hygeSSI is greater than or equal to the 99th percentile of all possible interactions) (Supplementary Fig. 6). These simulations provide a direct measure of the sensitivity and specificity of the SNP-SNP interaction level measure that forms the basis of the pathway-level statistics.
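The sensitivity measurement described above, the share of embedded SNP pairs recovered at a 1% network density, can be sketched in Python as follows. The data layout (a dictionary from SNP pair to interaction score) and the function name are assumptions made for illustration; the scores stand in for hygeSSI values, which are not computed here.

```python
import math
from typing import Dict, Hashable, Iterable


def recall_at_density(scores: Dict[Hashable, float],
                      embedded: Iterable[Hashable],
                      density: float = 0.01) -> float:
    """Fraction of embedded pairs scoring at or above the density threshold."""
    embedded = list(embedded)
    if not scores or not embedded:
        raise ValueError("scores and embedded pairs must be non-empty")
    ranked = sorted(scores.values(), reverse=True)
    # Number of pairs retained at the requested network density (at least one).
    n_keep = max(1, math.ceil(density * len(ranked)))
    threshold = ranked[n_keep - 1]
    # Pairs tied with the threshold are counted as recovered.
    found = sum(1 for pair in embedded
                if scores.get(pair, float("-inf")) >= threshold)
    return found / len(embedded)
```

Repeating this calculation over the simulated grid of minor allele frequencies, interaction effects and sample sizes gives one sensitivity value per scenario.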
The SNP-SNP level power statistics were complemented with a second set of simulations in which we directly assessed the sensitivity of BridGE in detecting BPMs with different levels of noise in the SNP-SNP level network (derived from the process described above). To characterize the statistical power of our approach as a function of pathway size, we first generated a synthetic interaction network with the same degree distribution as the PD-NIA DD network at 1% density. Then, we embedded a set of non-overlapping BPMs into this SNP-SNP interaction network while retaining the same degree distribution and density of the network. Each set had 90 BPMs at 9 different sizes (number of SNPs mapped to the two pathways in each BPM: 10×10, 25×25, 50×50, 75×75, 100×100, 150×150, 200×200, 250×250 and 300×300) and 10 different background densities (0.01, 0.012, 0.014, 0.016, 0.018, 0.02, 0.025, 0.03, 0.04 and 0.05). We applied 150,000 SNP-pathway membership permutations to assess the significance of these embedded patterns. The SNP permutation-derived p-values of the simulations are reported in Supplementary Fig. 3 and provide an estimate of the BPM density required for detecting interactions between pathways of different sizes. We used the average p-value (p = 3.0×10⁻⁵, SNP-permutation) of the significant BPM discoveries across all GWAS cohorts (FDR ≤ 0.25) as the discovery significance cutoff for the simulation analysis.
We derived power estimates for each combination of parameter settings by integrating the results from the two simulation studies above. More specifically, we estimated the minimum sample size needed to discover significant BPMs at different pathway sizes under each of the scenarios (e.g. minor allele frequency, relative disease risk). To connect the two simulation studies, we require a scaling parameter (here, we explored s = 0.025, 0.05 and 0.1), which corresponds to the biological density of genetic interactions crossing each pair of truly interacting pathways. This represents the fraction of all possible SNP-SNP pairs crossing the pair of pathways of interest for which the combination of variants actually has a functional deleterious impact on the phenotype. This quantity is expected to be relatively small but is difficult to estimate, which is why we explored three scenarios (s = 0.025, 0.05 and 0.1). For a given BPM of a specific size (10×10, 25×25, 50×50, 75×75, 100×100, 150×150, 200×200, 250×250 and 300×300), from the second simulation, we identified the corresponding BPM density needed for it to rise to the level of statistical significance required for a 25% FDR based on the PD-NIA cohort. We then scaled the required density by the parameter s and, based on the first set of simulation results, identified the minimum sample size required under each scenario (combinations of minor allele frequency, interaction effect, and sample size) to support the discovery of the corresponding BPM (results summarized in Fig. 6B).
Simulation results for additional scaling parameters (s = 0.1 and s = 0.025) are included in Supplementary Fig. 2. Together, these plots provide an estimate of the power of the BridGE approach to detect pathway-pathway interactions in these different scenarios. We note that this power analysis was conducted for the dominant disease model, which accounts for the majority of the BPM interactions discovered across all cohorts. Sensitivity of our method under a recessive model assumption is expected to be lower, which is consistent with the relative rate of discoveries of both types.
PD-NIA (phs000089.v3.p2)
The genotyping of samples was provided by the National Institute of Neurological Disorders and Stroke (NINDS). The datasets used for the analyses described in this manuscript were obtained from the NINDS Database found at https://www.ncbi.nlm.nih.gov/gap
PD-NGRC (phs000196.v3.p1)
This work utilized in part data from the NINDS DbGaP database from the CIDR:NGRC PARKINSON’S DISEASE STUDY.
SZ-GAIN (phs000021.v3.p2)
Funding support for the Genome-Wide Association of Schizophrenia Study was provided by the National Institute of Mental Health (R01 MH67257, R01 MH59588, R01 MH59571, R01 MH59565, R01 MH59587, R01 MH60870, R01 MH59566, R01 MH59586, R01 MH61675, R01 MH60879, R01 MH81800, U01 MH46276, U01 MH46289, U01 MH46318, U01 MH79469, and U01 MH79470) and the genotyping of samples was provided through the Genetic Association Information Network (GAIN). The datasets used for the analyses described in this manuscript were obtained from the database of Genotypes and Phenotypes (dbGaP) found at http://www.ncbi.nlm.nih.gov/gap through dbGaP accession number phs000021.v3.p2. Samples and associated phenotype data for the Genome-Wide Association of Schizophrenia Study were provided by the Molecular Genetics of Schizophrenia Collaboration (PI: Pablo V. Gejman, Evanston Northwestern Healthcare (ENH) and Northwestern University, Evanston, IL, USA).
BC-CGEMS-EUR (phs000147.v3.p1)
This dataset was from the Cancer Genetic Markers of Susceptibility (CGEMS) Breast Cancer Genome-wide Association Study with dbGaP accession number phs000147.v3.p1.
BC-MCS-LTN, BC-MCS-JPN (phs000517.v3.p1)
The Multiethnic Cohort and the genotyping in this study were funded by grants from the National Institute of Health (CA63464, CA54281, CA098758, CA132839 and HG005922) and the Department of Defense Breast Cancer Research Program (W81XWH-08-1-0383).
HT-eMERGE (phs000297.v1.p1)
Group Health Cooperative/University of Washington – Funding support for Alzheimer's Disease Patient Registry (ADPR) and Adult Changes in Thought (ACT) study was provided by a U01 from the National Institute on Aging (Eric B. Larson, PI, U01AG006781). A gift from the 3M Corporation was used to expand the ACT cohort. DNA aliquots sufficient for GWAS from ADPR Probable AD cases, who had been enrolled in Genetic Differences in Alzheimer's Cases and Controls (Walter Kukull, PI, R01 AG007584) and obtained under that grant, were made available to eMERGE without charge. Funding support for genotyping, which was performed at Johns Hopkins University, was provided by the NIH (U01HG004438). Genome-wide association analyses were supported through a Cooperative Agreement from the National Human Genome Research Institute, U01HG004610 (Eric B. Larson, PI).
Mayo Clinic – Samples and associated genotype and phenotype data used in this study were provided by the Mayo Clinic. Funding support for the Mayo Clinic was provided through a cooperative agreement with the National Human Genome Research Institute (NHGRI), Grant #: UOIHG004599; and by grant HL75794 from the National Heart Lung and Blood Institute (NHLBI). Funding support for genotyping, which was performed at The Broad Institute, was provided by the NIH (U01HG004424).
Marshfield Clinic Research Foundation – Funding support for the Personalized Medicine Research Project (PMRP) was provided through a cooperative agreement (U01HG004608) with the National Human Genome Research Institute (NHGRI), with additional funding from the National Institute for General Medical Sciences (NIGMS). The samples used for PMRP analyses were obtained with funding from Marshfield Clinic, Health Resources Service Administration Office of Rural Health Policy grant number D1A RH00025, and Wisconsin Department of Commerce Technology Development Fund contract number TDF FYO10718. Funding support for genotyping, which was performed at Johns Hopkins University, was provided by the NIH (U01HG004438).
Northwestern University – Samples and data used in this study were provided by the NUgene Project (www.nugene.org). Funding support for the NUgene Project was provided by the Northwestern University’s Center for Genetic Medicine, Northwestern University, and Northwestern Memorial Hospital. Assistance with phenotype harmonization was provided by the eMERGE Coordinating Center (Grant number U01HG04603). This study was funded through the NIH, NHGRI eMERGE Network (U01HG004609). Funding support for genotyping, which was performed at The Broad Institute, was provided by the NIH (U01HG004424).
Vanderbilt University - Funding support for the Vanderbilt Genome-Electronic Records (VGER) project was provided through a cooperative agreement (U01HG004603) with the National Human Genome Research Institute (NHGRI) with additional funding from the National Institute of General Medical Sciences (NIGMS). The dataset and samples used for the VGER analyses were obtained from Vanderbilt University Medical Center's BioVU, which is supported by institutional funding and by the Vanderbilt CTSA grant UL1RR024975 from NCRR/NIH. Funding support for genotyping, which was performed at The Broad Institute, was provided by the NIH (U01HG004424).
Assistance with phenotype harmonization and genotype data cleaning was provided by the eMERGE Administrative Coordinating Center (U01HG004603) and the National Center for Biotechnology Information (NCBI). The datasets used for the analyses described in this manuscript were obtained from dbGaP at http://www.ncbi.nlm.nih.gov/gap through dbGaP accession number phs000297.v1.p1.
ProC-CGEMS (phs000207.v1.p1)
This data was from the Cancer Genetic Markers of Susceptibility (CGEMS) Prostate Cancer Genome-Wide Association Study.
ProC-BPC3 (phs000812.v1.p1)
The Breast and Prostate Cancer Cohort Consortium (BPC3) genome-wide association studies of advanced prostate cancer and estrogen-receptor negative breast cancer were supported by the National Cancer Institute under cooperative agreements U01-CA98233, U01-CA98710, U01-CA98216, and U01-CA98758 and the Intramural Research Program of the National Cancer Institute, Division of Cancer Epidemiology and Genetics.
PanC-PanScan (phs000206.v5.p3)
This project was funded in whole or in part with federal funds from the National Cancer Institute (NCI), US National Institutes of Health (NIH) under contract number HHSN261200800001E. Additional support was received from NIH/NCI K07 CA140790, the American Society of Clinical Oncology Conquer Cancer Foundation, the Howard Hughes Medical Institute, the Lustgarten Foundation, the Robert T. and Judith B. Hale Fund for Pancreatic Cancer Research and Promises for Purple. A full list of acknowledgments for each participating study is provided in the Supplementary Note of the manuscript with PubMed ID: 25086665.
Conflict of Interest
The authors declare that they have no conflict of interest.
List of Supplementary Tables
Supplementary Table 1. Information about the 13 genome-wide association studies (GWAS) data sets used in this study.
Supplementary Table 2. List of 833 gene sets from KEGG, BioCarta and Reactome.
Supplementary Table 3. BridGE results from PD-NIA cohort based on recessive/dominant combined disease model.
BridGE results are reported for the PD-NIA cohort, with the following tabs (in order): summary of discoveries, between-pathway model (BPM) interactions, within-pathway model (WPM) interactions, and hub pathways (pathways exhibiting elevated density of SNP-SNP interactions across the genome) (PATH). Decreased risk (protective) and increased risk (risk) interactions are listed separately. These results were derived using the combined recessive-dominant disease model.
Supplementary Table 4. List of BPMs and WPMs after filtering for redundancy for the PD-NIA cohort.
This file contains a list of BPMs obtained from the PD-NIA cohort after controlling for redundancy based on a maximum overlap coefficient of 0.25. These correspond to the set visualized in Fig. 3A of the manuscript.
Supplementary Table 5. Pathway enrichment analysis for single locus effects for PD-NIA.
Pathway enrichment analysis on single locus effects was computed for several different disease models and subsets of SNPs. Each of the following tabs appears in this file: (A) combined disease model, LD controlled SNP set, (B) dominant disease model, LD controlled SNP set, (C) recessive disease model, LD controlled SNP set, (D) combined disease model, genome-wide SNP set, (E) dominant disease model, genome-wide SNP set, (F) recessive disease model, genome-wide SNP set.
Supplementary Table 6. Replication statistics and lists of replicated BPMs for BridGE discoveries from PD-NIA.
BPMs discovered from the PD-NIA cohort were tested for replication in the independent PD-NGRC cohort. Tab (A) contains a summary of replication statistics and tab (B) contains a list of replicated BPMs.
Supplementary Table 7. Summary of between and within-pathway interactions discovered across six diseases.
This file contains a list of BPMs and WPMs (top 10) discovered across six diseases. These correspond to the set visualized in Fig. 5 of the manuscript.
Supplementary Table 8. Summary of interactions discovered across 13 GWAS cohorts.
The number of between-pathway model (BPM) interactions, within-pathway model (WPM) interactions, and hub pathways (pathways exhibiting elevated density of SNP-SNP interactions across the genome) (PATH) discovered are reported for each of the 13 GWAS cohorts at a range of FDR cutoffs.
Supplementary Table 9. BridGE results from PD-NGRC cohort based on dominant disease model.
BridGE results are reported for the PD-NGRC cohort, with the following tabs (in order): summary of discoveries, between-pathway model (BPM) interactions, within-pathway model (WPM) interactions, and hub pathways (pathways exhibiting elevated density of SNP-SNP interactions across the genome) (PATH). Decreased risk (protective) and increased risk (risk) interactions are listed separately. These results were derived using the dominant disease model.
Supplementary Table 10. BridGE results from SZ-GAIN cohort based on combined disease model.
BridGE results are reported for the SZ-GAIN cohort, with the following tabs (in order): summary of discoveries, between-pathway model (BPM) interactions, within-pathway model (WPM) interactions, and hub pathways (pathways exhibiting elevated density of SNP-SNP interactions across the genome) (PATH). Decreased risk (protective) and increased risk (risk) interactions are listed separately. These results were derived using the combined recessive-dominant disease model.
Supplementary Table 11. BridGE results from SZ-CATIE cohort based on recessive disease model.
BridGE results are reported for the SZ-CATIE cohort, with the following tabs (in order): summary of discoveries, between-pathway model (BPM) interactions, within-pathway model (WPM) interactions, and hub pathways (pathways exhibiting elevated density of SNP-SNP interactions across the genome) (PATH). Decreased risk (protective) and increased risk (risk) interactions are listed separately. These results were derived using the recessive disease model.
Supplementary Table 12. BridGE results from BC-CGEMS-EUR cohort based on recessive disease model.
BridGE results are reported for the BC-CGEMS-EUR cohort, with the following tabs (in order): summary of discoveries, between-pathway model (BPM) interactions, within-pathway model (WPM) interactions, and hub pathways (pathways exhibiting elevated density of SNP-SNP interactions across the genome) (PATH). Decreased risk (protective) and increased risk (risk) interactions are listed separately. These results were derived using the recessive model.
Supplementary Table 13. BridGE results from BC-MCS-JPN cohort based on dominant disease model.
BridGE results are reported for the BC-MCS-JPN cohort, with the following tabs (in order): summary of discoveries, between-pathway model (BPM) interactions, within-pathway model (WPM) interactions, and hub pathways (pathways exhibiting elevated density of SNP-SNP interactions across the genome) (PATH). Decreased risk (protective) and increased risk (risk) interactions are listed separately. These results were derived using the dominant model.
Supplementary Table 14. BridGE results from BC-MCS-LTN cohort based on dominant disease model.
BridGE results are reported for the BC-MCS-LTN cohort, with the following tabs (in order): summary of discoveries, between-pathway model (BPM) interactions, within-pathway model (WPM) interactions, and hub pathways (pathways exhibiting elevated density of SNP-SNP interactions across the genome) (PATH). Decreased risk (protective) and increased risk (risk) interactions are listed separately. These results were derived using the dominant model.
Supplementary Table 15. BridGE results from HT-eMERGE cohort based on dominant disease model.
BridGE results are reported for the HT-eMERGE cohort, with the following tabs (in order): summary of discoveries, between-pathway model (BPM) interactions, within-pathway model (WPM) interactions, and hub pathways (pathways exhibiting elevated density of SNP-SNP interactions across the genome) (PATH). Decreased risk (protective) and increased risk (risk) interactions are listed separately. These results were derived using the dominant model.
Supplementary Table 16. BridGE results from HT-WTCCC cohort based on combined disease model.
BridGE results are reported for the HT-WTCCC cohort, with the following tabs (in order): summary of discoveries, between-pathway model (BPM) interactions, within-pathway model (WPM) interactions, and hub pathways (pathways exhibiting elevated density of SNP-SNP interactions across the genome) (PATH). Decreased risk (protective) and increased risk (risk) interactions are listed separately. These results were derived using the recessive-dominant combined model.
Supplementary Table 17. BridGE results from ProC-CGEMS cohort based on dominant disease model.
BridGE results are reported for the ProC-CGEMS cohort, with the following tabs (in order): summary of discoveries, between-pathway model (BPM) interactions, within-pathway model (WPM) interactions, and hub pathways (pathways exhibiting elevated density of SNP-SNP interactions across the genome) (PATH). Decreased risk (protective) and increased risk (risk) interactions are listed separately. These results were derived using the dominant model.
Supplementary Table 18. BridGE results from ProC-BPC3 cohort based on dominant disease model.
BridGE results are reported for the ProC-BPC3 cohort, with the following tabs (in order): summary of discoveries, between-pathway model (BPM) interactions, within-pathway model (WPM) interactions, and hub pathways (pathways exhibiting elevated density of SNP-SNP interactions across the genome) (PATH). Decreased risk (protective) and increased risk (risk) interactions are listed separately. These results were derived using the dominant model.
Supplementary Table 19. BridGE results from PanC-PanScan cohort based on dominant disease model.
BridGE results are reported for the PanC-PanScan cohort, with the following tabs (in order): summary of discoveries, between-pathway model (BPM) interactions, within-pathway model (WPM) interactions, and hub pathways (pathways exhibiting elevated density of SNP-SNP interactions across the genome) (PATH). Decreased risk (protective) and increased risk (risk) interactions are listed separately. These results were derived using the dominant model.
Supplementary Table 20. BridGE results from T2D-WTCCC cohort based on combined disease model.
BridGE results are reported for the T2D-WTCCC cohort, with the following tabs (in order): summary of discoveries, between-pathway model (BPM) interactions, within-pathway model (WPM) interactions, and hub pathways (pathways exhibiting elevated density of SNP-SNP interactions across the genome) (PATH). Decreased risk (protective) and increased risk (risk) interactions are listed separately. These results were derived using the recessive-dominant combined model.
Supplementary Table 21. Replication statistics and lists of replicated BPMs, WPMs or PATHs for BridGE discoveries from prostate cancer, breast cancer and schizophrenia.
BPMs, WPMs and PATHs discovered from each disease cohort were tested for replication in the corresponding independent cohort, for each of the three diseases. Both a summary of replication statistics and a list of replicated BPMs, WPMs or PATHs are reported, with one disease cohort per tab.
Supplementary Table 22. Comparison between BridGE pathways and SNPs reported in the GWAS catalog.
Summary of the comparison (A) and lists of pathways identified by BridGE with FDR < 0.25, together with their association with GWAS SNPs, for the six diseases studied: (B) Parkinson’s disease, (C) Schizophrenia, (D) Breast cancer, (E) Hypertension, (F) Prostate cancer and (G) Type II diabetes.
Supplementary Table 23. Results of pilot experiments for 13 GWAS cohorts.
As described in Methods, all 13 cohorts to which BridGE was applied were first explored in pilot runs that used a smaller number of SNP permutations. Based on initial estimates of FDR, the disease model and density combination with the strongest statistical significance was then run in full. Pilot results from all 13 cohorts are included in this file, one per tab.
Supplementary Table 24. Summary of evaluation of hygeSSI SNP-SNP interactions by a logistic regression-based interaction test.
Supplementary Table 25. BridGE results from PD-NIA cohort based on recessive/dominant combined disease model using 1000 sample permutations.
BridGE results are reported for the PD-NIA cohort, with the following tabs (in order): summary of discoveries, between-pathway model (BPM) interactions, within-pathway model (WPM) interactions, and hub pathways (pathways exhibiting elevated density of SNP-SNP interactions across the genome) (PATH). Decreased risk (protective) and increased risk (risk) interactions are listed separately. These results were derived using the combined recessive-dominant disease model.
Acknowledgments
We thank Dr. Frank Albert and Dr. Jing Hou for constructive comments on the manuscript. This work was partially supported by NSF grants DBI 0953881 (CLM) and IIS 0916439 (VK), NIH grants R01HG005084 (CLM) and R01HG005853 (CLM, CB), R01MH097276 (GF, EES) and R01GM114472 (GF), a University of Minnesota Rochester Biomedical Informatics and Computational Biology Program Traineeship Award (GF) and a Walter Barnes Lang Fellowship (GF). CLM and CB are supported by the CIFAR Genetic Networks program. The content is solely the responsibility of the authors and does not necessarily represent the official views of the funders. Computing resources and data storage services were partially provided by the Minnesota Supercomputing Institute and the UMN Office of Information Technology, respectively.
The genome-wide association datasets (PD-NIA, PD-NGRC, SZ-GAIN, BC-CGEMS-EUR, BC-MCS-JPN, BC-MCS-LTN, HT-eMERGE, ProC-CGEMS, ProC-BPC3 and PanC-PanScan) used in this study were obtained from https://www.ncbi.nlm.nih.gov/gap through dbGaP accession numbers: phs000089.v3.p2, phs000196.v3.p1, phs000021.v3.p2, phs000147.v3.p1, phs000517.v3.p1, phs000297.v1.p1, phs000207.v1.p1, phs000812.v1.p1, and phs000206.v5.p3. We acknowledge the Contributing Investigators who submitted data from their original study to dbGaP, the primary funding organization that supported the Contributing Investigators, and the NIH data repository.
The genome-wide association datasets (SZ-GAIN, HT-WTCCC, T2D-WTCCC) used in this study were provided by the Wellcome Trust Case Control Consortium through Dataset Accession numbers EGAD00000000006, EGAD00000000009, EGAD00000000001 and EGAD00000000002. These were funded by the Wellcome Trust under award 076113, and a full list of the investigators who contributed to the generation of the data is available from www.wtccc.org.uk.