Abstract
Motivation Genome-wide association study (GWAS) has been a great success in the past decade. However, significant challenges still remain in both identifying new risk loci and interpreting results. Bonferroni-corrected significance level is known to be conservative, leading to insufficient statistical power when the effect size is moderate at risk locus. Complex structure of linkage disequilibrium also makes it challenging to separate causal variants from nonfunctional ones in large haplotype blocks.
Results We describe GenoWAP, a post-GWAS prioritization method that integrates genomic functional annotation and GWAS test statistics. The effectiveness of GenoWAP is demonstrated through its applications to Crohn’s disease and schizophrenia using the largest studies available, where highly ranked loci show substantially stronger signals in the whole dataset after prioritization based on a subset of samples. At the single nucleotide polymorphism (SNP) level, top ranked SNPs after prioritization have both higher replication rates and consistently stronger enrichment of eQTLs. Within each risk locus, GenoWAP is also able to distinguish functional sites from groups of correlated SNPs.
Availability and Implementation GenoWAP is freely available on the web at http://genocanyon.med.yale.edu/GenoWAP
Introduction
In the past ten years, genome-wide association studies (GWAS) have been designed and applied to identify disease genes for almost all complex diseases. As of January 15, 2015, 15,216 single nucleotide polymorphisms (SNP) from over 2,000 publications have been documented in the GWAS Catalog (Hindorff, et al., 2009). Despite its great success in identifying disease-associated loci, scientists have noted several limitations of current GWAS approaches. First, although linkage disequilibrium (LD) is the basis of GWAS, it also hinders the interpretation of association results. Due to the complex LD structure among SNPs, it is the disease-associated haplotype blocks containing hundreds of thousands of nucleotides that are identified in GWASs. Therefore, the resolution of GWAS is not sufficient for distinguishing causal variants from a large group of correlated SNPs, especially in non-coding regions where the mechanism of genomic function is still largely unknown (Cooper and Shendure, 2011; Visscher, et al., 2012; Ward and Kellis, 2012). Second, although Bonferroni-corrected significance threshold (e.g. 5×10−8) is widely accepted as the standard cutoff in GWAS analysis, it is well known that Bonferroni correction is too conservative when the number of hypotheses is large and there are many weak to moderate signals. In fact, for most complex diseases, numerous genomic loci are involved in disease etiology while each locus only has a moderate effect size. Therefore, studies based on high-throughput genomic scan may be underpowered if the sample size is not large enough. This has led to so-called missing heritability, which refers to the gap between the narrow-sense heritability estimated from twin/pedigree analysis and the proportion of the variance explained by significant SNPs identified from GWAS, that has been reported for many diseases (Manolio, et al., 2009; Witte, et al., 2014). One explanation of missing heritability is the insufficient statistical power to identify all the disease-associated SNPs (Eichler, et al., 2010).
Variant prioritization techniques are crucial for post-GWAS analysis on different scales. Locally, it can reveal truly functional variants within each significant locus. Globally, signals at some loci can be enhanced if proper prior information is used. Many variant prioritization methods have been proposed (Hou and Zhao, 2013). Supervised-learning-based statistical tools for predicting deleterious variants are probably the richest among available approaches. So far, most of the existing deleteriousness prediction tools only focus on protein-coding genes in the human genome. However, coding-region-based tools are not sufficient for post-GWAS prioritization because nearly 90% of the significant SNPs identified in GWAS reside in the non-coding genome (Hindorff, et al., 2009). A few tools targeting non-coding variants have been proposed (Fu, et al., 2014; Kircher, et al., 2014; Ritchie, et al., 2014; Shihab, et al., 2015). Detailed comparisons of these methods were reviewed elsewhere (Cooper and Shendure, 2011; Wang, et al., 2015). Unlike the extensively studied protein-altering variants, very few non-coding pathogenic variants have been revealed so far (Ward and Kellis, 2012). Therefore, existing non-coding variant prioritization tools based on supervised-learning may suffer from the potentially biased training data. Their performance in post-GWAS prioritization remains to be further investigated. Finally, although deleteriousness of a single SNP is crucial for identifying causal variants, it does not provide all the information needed in post-GWAS prioritization, where each SNP in GWAS also carries information of nearby variants that are not genotyped. A better informed post-GWAS prioritization method should be able to measure the functional potential for the surrounding region of each genotyped marker.
Recently, Lu et al. developed GenoCanyon, a statistical framework to predict functional non-coding regions in the human genome through integrated analysis of multiple biochemical signals and genomic conservation measures (Lu, et al., 2015). Its unsupervised-learning framework makes GenoCanyon suffer less from our limited knowledge of non-coding genome. Moreover, since the resolution of its functional prediction is at the nucleotide level, it is possible to use GenoCanyon scores to evaluate the surrounding region of each genotyped SNP. In this paper, we propose GenoWAP (Genome Wide Association Prioritizer), a post-GWAS prioritization method that integrates GenoCanyon functional prediction and GWAS p-values. We apply the method on two smaller GWASs of Crohn’s disease and schizophrenia, respectively, to prioritize SNPs. The performance is evaluated using the results from large GWAS meta-analyses of these two diseases. Compared to the top loci ranked on p-values only, top ranked loci after prioritization tend to show substantially stronger signals in large GWAS studies. Within each locus, GenoWAP is able to distinguish true signals among highly correlated SNPs. The method has the potential to reduce noises caused by LD and rescue marginal signals in GWASs with insufficient sample sizes.
Methods
Statistical model
For each SNP, we define Z to be the indicator of general functionality, and define ZD to be the indicator of disease-specific functionality. More specifically, if a SNP or its surrounding region is active in any genomic functional pathway, then Z equals to 1. If this SNP or the surrounding region is involved in the disease pathway, then ZD equals to 1. For each SNP, we use Z to denote its p-value obtained from the standard GWAS analysis.
The goal of post-GWAS prioritization is to assign each SNP a new score that measures its importance. A reasonable quantity is the conditional probability of being disease-specific functional given the p-value, i.e. P(ZD = 1|p). Using Bayes formula, we can rewrite the conditional probability as below:
Based on the definitions of Z and ZD, we know that the SNPs satisfying ZD = 1 must be a subset of the SNPs satisfying Z = 1. This is because if a SNP is disease-specific functional, then it has to be functional in the general sense. Therefore, we get the following formula.
Therefore, in order to calculate the conditional probability P(ZD = 1/p) for a marker, we need its prior probability of being functional, i.e. P(Z = 1); we also need the p-value density for disease-specific functional markers, i.e. f(p|ZD = 1), and the p-value density for markers that are not related to the disease, i.e. f(p|ZD = 0); finally, we need an estimate for the conditional probability of being disease-specific functional given the marker is functional in the general sense, i.e. P(ZD = 1|Z = 1).
Estimation
Recently, Lu et al. developed GenoCanyon, an unsupervised-learning-based statistical framework that predicts the functional potential for each nucleotide in the human genome (Lu, et al., 2015). For each SNP in our dataset, we use the mean GenoCanyon functional score of its surrounding 10,000 base pairs as the prior probability P(Z = 1). Different from using variant-based annotation tools as the prior knowledge, this prior information not only measures the importance of the genotyped marker, but also evaluates its surrounding region where the ungenotyped causal variants may reside.
Next, we partition all the SNPs into functional (Z = 1) and non-functional (Z = 0) subgroups based on the calculated mean GenoCanyon score with cutoff 0.1. Since the GenoCanyon functional score has a bimodal pattern, this partition is not sensitive to the cutoff choice. There are two major reasons why we do the partition. First, this can be viewed as a noise reduction step. After removing the non-functional markers, the signal pattern in the functional subgroup is amplified (Figure 1). The proportion of disease-related markers in the remaining (functional) subgroup also increased, which leads to more stable estimates in the following steps. Second, we can now empirically estimate the p-value density for non-functional markers, i.e. f(p|Z = 0). Since the p-values are acquired from a disease-specific case-control study, we assume that the p-values for markers that are not related to the disease should behave just like the p-values for markers that are not functional entirely. Mathematically, this assumption is characterized as the equation below.
Based on this assumption, we can estimate f(p|ZD = 0) using the p-values for SNPs in the non-functional subgroup. Notably, it may seem natural to assume (p|ZD = 0) follows a uniform distribution. However, the p-value of a marker with ZD = 0 can actually be driven by a nearby disease-related marker due to LD. The empirically estimated density can capture a certain amount of LD information, which is complex and non-trivial to model. Moreover, it is common to see some variants with low minor allele frequencies in GWAS samples. The p-values for these markers will form a spike near 1 in the p-value density. The empirically estimated density is also able to account for this artifact. We propose to use histogram for density estimation, because it has stable performance near the boundary. In fact, the p-value boundary near 0 is where the real signals reside, and the boundary near 1 occasionally has the artifact issue caused by rare variants. Histogram is able to capture both issues. Moreover, the sample size in this framework is the number of markers, which is usually large in GWAS studies. Therefore histogram is a reasonable choice for density estimation. The number of bins is chosen based on cross-validation. It still remains to estimate the p-value density for disease-related markers f(p|ZD = 1), and the conditional probability P(ZD = 1|Z = 1). Now, we partition the functional subgroup (Z = 1) into finer subgroups. First, based on equation (3), it is straightforward to show that
Therefore, the p-value density for functional markers is the following mixture.
In formula (5), f(p|ZD = 0) has already been estimated in previous steps. We further assume a parametric form of f(p|ZD = 1). In a recent work of Chung et al., they showed that beta distribution is a robust approximation of p-value distribution under some general assumptions of SNP effect size (Chung, et al., 2014). We adopt the same assumption.
The constraint 0 < Z α 1 guarantees that a smaller p-value is more likely to occur than a larger p-value. Then, we apply the EM algorithm on all the p-values in the functional subgroup. One advantage of beta distribution assumption is that each iteration in the EM algorithm has a closed-form expression. In this way, we acquire the estimates for both P(ZD = 1|Z = 1) and P(p|ZD = 1). Then, all missing pieces in formula (1) have been estimated. We calculate the conditional probability P(ZD = 1|p) for all the SNPs using these estimates. This quantity is referred to as the posterior score in this paper.
Results
Application to Crohn’s disease
Several GWASs of different scales have been performed for Crohn’s disease. The largest GWAS meta-analysis, which identified 71 disease-associated loci, is one of the most successful GWASs to date (Franke, et al., 2010). We applied GenoWAP on a smaller Crohn’s disease GWAS conducted by the North American National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK) IBD Genetics Consortium, and tested the results using the large meta-analysis done by the International Inflammatory Bowel Disease Genetics Consortium (IIBDGC). Cohort information is listed in Supplementary Table 1. Details of both studies have also been reported previously (Franke, et al., 2010; Rioux, et al., 2007). It is worth noting that the samples in these two studies overlap with each other. However, the goal for this paper is not to replicate the detected signals in an independent cohort. Instead, we seek to better prioritize signals using only a small sample size. In order to test the performance, the results from the largest study available are used as the gold standard.
For each SNP in the dataset, define p to be the GWAS p-value, and define ZD to be the indicator of disease-specific functionality. The posterior probability of being disease-specific functional, i.e. P(ZD = 1|p), is used to prioritize SNPs (See Methods). This score will be referred to as the posterior score in the following sections. Test statistics of the NIDDK study were downloaded from dbGap (Supplementary Table 1). Among the 298,391 SNPs, 70 were deleted due to unavailable hg19 genomic locations. We calculated the posterior scores for the remaining 298,321 SNPs (Supplementary Figure 1). Test statistics of the IIBDGC meta-analysis were downloaded from the IIBDGC website (http://www.ibdgenetics.org). The dataset contains 953,241 SNPs, including 262,621 SNPs overlapping with the NIDDK dataset.
A total of 71 loci passed genome-wide significance level in the validation stage of IIBDGC meta-analysis, including 32 previously reported risk loci and 39 newly confirmed risk loci (Franke, et al., 2010). We ranked the 298,321 SNPs in the NIDDK study based on their p-values and posterior scores, respectively. Then, within each locus among the 71 loci, we compared the rank of the lowest p-value to the rank of the largest posterior score. 56 out of 71 loci (79%) had an improved rank, 3 loci (4%) had an equal rank, while only 12 loci (17%) had a reduced rank (Supplementary Table 2). The probability of having an increased rank is significantly higher than that of having a decreased rank (p-value = 3.11×10−8, one-sided binomial test).
Next, we compared the top 20 loci with the smallest p-values to the top 20 loci with the largest posterior scores in the NIDDK study. The locus information and the lowest meta-analysis p-value at each locus are listed in Table 1. 14 out of 20 loci are shared between the two lists. Interestingly, the posterior-specific loci, i.e. the loci that show up only in the list based on posterior score, showed substantially stronger signals in the IIBDGC meta-analysis compared to the p-value-specific loci (Table1, Figure 2a). For example, the risk locus on chromosome 10q22 was a genome-wide significant locus in the meta-analysis (rs1250550, Pmeta= 2.00×10−10). Although the same SNP, rs1250550, had the lowest p-value at this locus in the NIDDK dataset (PNIDDK = 5.95×10−5), the signal was not strong enough to make this locus surpass other loci such as the one on chromosome 2q24 (rs6733000, PNIDDK = 2.01×10−5 Table 1). However, with posterior scores, locus 10q22 was ranked as the 17th top locus, while the highest posterior score at locus 2q24 was only 0.0142, which agrees with its weak signal in the meta-analysis result (Pmeta = 0.019). Overall, two posterior-specific loci were genome-wide significant in the meta-analysis, while the lowest Pmeta among the six p-value-specific loci was only 1.10× 10−4. These results show that our method can effectively reduce noises likely due to LD and chance and enhance true signals at disease risk loci.
To see if SNPs with high posterior scores are more enriched of eQTLs, we downloaded the whole-blood eQTL data from GTEx (http://www.gtexportal.org). The top 1000 SNPs based on p-values are not statistically significantly enriched for eQTLs (p-value = 0.076; hypergeometric test; fold enrichment = 1.39), while the enrichment for the top 1000 SNPs based on posterior scores is highly significant (p-value = 1.46×10–4; fold enrichment = 2.17). The difference becomes even more drastic when using the top 2000 SNPs, with p-values 0.018 and 6.19×10−11 (fold enrichment 1.43 and 2.56), respectively. When the number of top SNPs increases, the posterior-based approach dominates the p-value-based approach in both enrichment p-value and fold change (Figures 2b and 2c).
In order to show how our method performs locally, we chose two genome-wide significant loci from the IIBDGC meta-analysis. First, within the risk locus on chromosome 1q23, two SNPs had substantially stronger signals than others, i.e. rs2274910 (PNIDDK = 4.40×10−4) and rs955371 (PNIDDK = 4.84×10−4). According to the p-values, these two SNPs are undistinguishable, because the signal at rs2274910 is only slightly stronger. However, the results from the meta-analysis clearly show the existence of two SNP clusters with strong signals at this locus (Figure 3a). The cluster closer to gene CD244, in which rs955371 resides, actually has stronger signals than the cluster where rs2274910 is located. Interestingly, the posterior scores capture this difference between two SNPs very well. In fact, the posterior scores for rs955371 and rs2274910 are 0.272 and 0.208, suggesting rs955371 is more likely to be functional even though its p-value is larger. The second example is the risk locus on chromosome 14q35, which is one of the 12 loci with a reduced rank under the posterior scores (Supplementary Table 2). Signals at this locus were not strong in the NIDDK study, with the smallest p-value only at 4.70×10−3 (rs1959715). Moreover, the signal peak in the NIDDK study (near 88.2M) was quite far from that in the meta-analysis, which resides in genes GALC and GPR65 (Figure 3b). However, the posterior scores once again capture the signal pattern in the meta-analysis. Signals near 88.2M on chromosome 14 are shrunk substantially, while the SNPs in GALC and GPR65 are pushed up as the strongest signal (rs4904410). Since these SNPs have very weak signals in their p-values, the posterior score is still low (See Methods). This explains the reduced rank, because the p-value-based rank of rs1959715 was compared with the posterior-based rank of rs4904410. It is worth noting that the SNPs with the strongest signals in the meta-analysis, e.g. rs8005161, were either not genotyped or dropped in the quality control steps in the NIDDK study. It is reasonable to believe that the posterior scores would have had an even better performance if imputations had been done for the NIDDK dataset.
Application to schizophrenia
In addition to Crohn’s disease, we also applied GenoWAP to schizophrenia, a major psychiatric disorder. Psychiatric Genomics Consortium (PGC), the largest international consortium in psychiatry, focuses on genetic studies of many psychiatric disorders including schizophrenia. Two large-scale GWAS mega-analyses of schizophrenia have been published. We applied GenoWAP to the earlier and smaller PGC2011 study (Consortium, 2011), and evaluated the performance using results from the larger mega-analyses published in 2014 (Consortium, 2014). Test statistics for both studies were downloaded from the PGC website (Supplementary Table 3). Among the 1,252,901 SNPs in PGC2011 study, 264 were removed due to unavailable hg19 locations. Posterior scores were calculated for all the remaining 1,252,637 SNPs (Supplementary Figure 2). PGC2014 study contains 9,444,230 SNPs, including 1,179,913 SNPs overlapping with the PGC2011 dataset.
PGC2014 study identified 108 schizophrenia-associated loci, from which we removed three loci on chromosome X because the PGC2011 dataset did not contain any SNP on sex chromosomes. We ranked the 1,252,637 SNPs in PGC2011 study based on their p-values and posterior scores, respectively. Within each locus, the rank of the lowest p-value was compared to the rank of the largest posterior score. Across the 105 loci, 68 (65%) had an improved rank, 1 locus (1%) had an equal rank, and the other 36 loci (34%) had a reduced rank (Supplementary Table 4). The probability of having an increased rank is significantly higher than that of having a reduced rank (p-value = 0.001, one-sided binomial test). Interestingly, among the 10 loci with the strongest signals in the PGC2014 study, 8 had an increased rank (80%). The proportion of increased or equal ranks gradually drops when more top loci in the PGC2014 study were considered, showing less confidence in weaker signals (Supplementary Figure 3).
Next, we compared the top 20 loci with the smallest p-values to the top 20 loci with the largest posterior scores in the PGC2011 study. In order to identify 20 independent loci, 582 SNPs were needed when using p-value as the criterion. When posterior scores were used to choose top signals, 548 SNPs were sufficient to identify 20 loci, showing better efficiency (Figure 4a). A total of 14 loci could be identified using both p-values and posterior scores. As for the comparisons between the 6 posterior-specific loci and the 6 p-value-specific loci, the posterior-specific loci showed better signals than the p-value-specific loci (Table2, Figure 4b) in the PGC2014 study. Four of the 6 posterior-specific loci were genome-wide significant in the PGC2014 study, whereas 2 p-value-specific loci passed the genome-wide significance level. Among the 6 p-value-specific loci, the locus on chromosome 3q26 had the strongest signal in the PGC2014 study (P2014 = 5.35× 10−11). This locus will be discussed in detail later.
Since imputation was done for both PGC2011 and PGC2014 studies, and the total number of SNPs is large, it is possible to compare the SNP-level replication rates when the SNPs were ranked based on p-values and posterior scores. Among the top 500 SNPs with the largest posterior scores, 327, 267, and 152 had a p-value lower than 5×10−2, 5×10−8, and 5×10−15 in the PGC2014 study, respectively. When choosing the top 500 SNPs based on their p-values, the corresponding numbers were 290, 237, and 120 (Figure 4c), respectively. A similar pattern can be observed for the top 200 SNPs (Supplementary Table 4). We further performed enrichment analysis for whole-blood eQTLs. The top 1000 SNPs based on the p-values were significantly enriched for eQTLs (p-value = 2.30×10−22, fold enrichment = 4.57), but the enrichment for the top 1000 SNPs based on the posterior scores was even stronger (p-value = 3.32×10−25, fold enrichment = 4.88). As the number of top SNPs increased, the posterior-based top SNPs always had stronger enrichment of eQTL than the p-value-based list (Figures 4d and 4e).
Finally, we compared PGC2011 p-values, PGC2011 posterior scores, and PGC2014 p-values at two loci to further illustrate the performance of our method. The first locus is on chromosome 3q26. It had the strongest signal in PGC2014 among the p-value-specific top 20 loci (Table 2, P2014 = 5.35×10-11). Based on the p-values in the PGC2011 study, the strongest signals reside in the intergenic region upstream of FXR1. But the posterior scores brought down those intergenic SNPs, and enhanced the signals in FXR1 instead, which is in agreement with the results from PGC2014 (Figure 5a). In fact, from the PGC2014 p-values, we can clearly see that the strongest signals reside in FXR1 while the significant results for the SNPs upstream or downstream of FXR1 are likely due to LD. The second example is on chromosome 8q21 (Figure 5b). In the PGC2011 study, the strongest signal at this locus resides in the intergenic region between 89.7M and 89.8M. However, posterior scores removed most of the correlated SNPs at this locus, leaving three separate peaks as candidate functional spots. The first peak lies right upstream of MMP16. The second peak is more upstream (∼89.6M), and is suggested to be the strongest signal source. The SNPs with the lowest p-values in PGC2011 remained as a signal peak, but their posterior scores were not as strong as the peak in the middle. Most interestingly, the results from the posterior scores perfectly matched the signal patterns in the PGC2014 study. From the lowest panel in Figure 5b, we can clearly see two separate peaks at the same locations suggested by the posterior scores, with the one near 89.6M being the strongest signal source. Also, the SNPs between 89.7M and 89.8M had weaker signals than the peak in the middle. Notably, this entire risk locus resides in an intergenic region. This example shows that our method can effectively prioritize SNPs in the non-coding genome.
Discussion
In this study, we developed and applied GenoWAP to two sets of GWAS data to illustrate its performance in post-GWAS prioritization. Compared to p-values, GenoWAP posterior scores can better prioritize SNPs in many different ways. At the locus level, posterior score is more efficient in the sense that fewer SNPs are needed to identify the same number of top loci. Moreover, noises due to chance are effectively reduced, and the highly ranked loci using posterior score are more likely to be functional than the top loci selected purely based on p-values. At the SNP level, markers with high posterior scores have both better replication rates and consistently stronger enrichment of eQTLs than the top SNPs based on p-values. More importantly, within each risk locus identified in GWAS, posterior scores can effectively suggest real signals among a large number of correlated SNPs.
The performance of GenoWAP depends on the accuracy of functional annotation and the quality of GWAS data. Due to our limited understanding of non-coding genome, it is challenging to provide accurate genomic functional annotation. GenoCanyon is the first functional prediction tool at the nucleotide level. When more accurate or tissue-specific functional annotation becomes available in the future, the performance of GenoWAP may be further improved. On the other hand, GenoWAP does not play magic. If no information is contained in the GWAS dataset, then GenoWAP can only provide very limited insight.
More than 2,000 GWASs have been published in the past decade, and the number continues to grow. It is well known that our ability to identify new risk loci for complex diseases has surpassed our ability to interpret the results. However, although we are overwhelmed by the large amount of information detected in GWASs, evidence such as missing heritability still suggests that many risk loci remain to be discovered. Therefore, there is pressing need for post-GWAS prioritization tools and our method has great potential for future application. Since GenoWAP uses only p-values as the input, it is convenient to apply our method on published results, which may help reveal truly functional variants within large haplotype blocks, and ultimately help understand disease etiology. Moreover, for multi-stage GWASs, GenoWAP can be used to better prioritize SNPs from the discovery stage to the validation planning and increase the replication rates. Finally, next-generation sequencing is widely recognized as the future of genomic epidemiology. However, the high cost of sequencing usually leads to insufficient sample sizes and many other challenging issues (Sboner, et al., 2011). The combination of GenoWAP and the rich collection of publicly available GWAS data have the potential to provide functional candidates and guide sequencing analysis in the future.
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
QL conceived and wrote the original manuscript. XY, QL, and YH developed the GenoWAP software. HZ advised on genetic and statistical issues.
Acknowledgements
We would like to thank Dr. Katerina Kechris and all the members in the Data Integration – COPD working group at SAMSI for their advice and useful discussions on this work. This study was supported by the National Institutes of Health grants R01 GM59507 and U01 HG005718, the VA Cooperative Studies Program of the Department of Veterans Affairs, Office of Research and Development, and the Yale World Scholars Program sponsored by the China Scholarship Council.