Abstract
Protein-truncating variants can have profound effects on gene function and are critical for clinical genome interpretation and generating therapeutic hypotheses, but their relevance to medical phenotypes has not been systematically assessed. We characterized the effect of 18,228 protein-truncating variants across 135 phenotypes from the UK Biobank and found 27 associations between medical phenotypes and protein-truncating variants in genes outside the major histocompatibility complex. We performed phenome-wide analyses and directly measured the effect of homozygous carriers, commonly referred to as “human knockouts,” across medical phenotypes for genes implicated to be protective against disease or associated with at least one phenotype in our study and found several genes with strong pleiotropic or non-additive effects. Our results illustrate the importance of protein-truncating variants in a variety of diseases.
Protein-truncating variants (PTVs), genetic variants predicted to shorten the coding sequence of genes, are a promising set of variants for drug discovery since identification of PTVs that protect against human disease provides in vivo validation of therapeutic targets1,2,3,4. Although tens of thousands of standing germline PTVs have been identified5,6, their medical relevance across a broad range of phenotypes has not been characterized. Because most PTVs are present at low frequency, assessing the effects of PTVs requires genotype data from a large number of individuals with linked phenotype data for a variety of diseases and physiological measurements. The recent release of genotype and linked clinical and questionnaire data for 488,377 individuals in the UK Biobank provides an unprecedented opportunity to assess the clinical impact of truncating protein-coding genes at a resolution not previously possible.
Results
To assess the clinical relevance of PTVs, we cataloged predicted PTVs present in the Affymetrix UK Biobank array and their effects on medical phenotypes from 337,208 unrelated individuals in the UK Biobank study 7,8. We defined PTVs as single-nucleotide variants (SNVs) predicted to introduce a premature stop codon or to disrupt a splice site or small insertions or deletions (indels) predicted to disrupt a transcript’s reading frame 5. We identified 18,228 predicted PTVs in the UK Biobank array that were polymorphic across 8,750 genes after filtering (Methods, Figure S1). Each participant had 95 predicted PTVs with minor allele frequency (MAF) less than 1% on average, and 778 genes were predicted to be homozygous or compound heterozygous for PTVs with MAF less than 1% in at least one individual. The observed number of PTVs per individual is consistent with the ~100 loss-of-function variants observed in the 1000 Genomes project 9. In contrast, the number of PTV singletons (or observed allele counts less than 10) in ExAC suggests approximately five singletons per individual and only ~0.2 per individual in highly constrained genes 10,11. These observations indicate that the majority of PTVs in an individual are common (or common and low frequency) such that they can be assessed via genotyping.
We used computational matching and manual curation based on hospital in-patient record data, self-reported verbal questionnaire data, and cancer and death registry data to define a broad set of medical phenotypes including various cancers, cardiometabolic diseases, and autoimmune diseases (Table S1) 12. We then performed association analyses between the 3,724 PTVs with MAF greater than 0.01% and 135 medical phenotypes with at least 2,000 case samples (Figure 1, Figure S2) and stratified the association results into three bins based on PTV MAF greater than 1% (463 PTVs), between 0.1% and 1% (700 PTVs), and between 0.01% and 0.1% (2,561 PTVs) to account for expected differences in the statistical power to detect associations for PTVs with different MAFs (Figure S3). We adjusted the nominal association p-values separately for each MAF bin using the Benjamini-Yekutieli (BY) procedure to correct for multiple hypothesis testing and identified 74 significant associations between PTVs and medical phenotype (BY-adjusted p < 0.05, Figure 2A-C, Table S2).
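The stratified multiple-testing correction described above can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the function names, the tuple layout of `tests`, and the handling of values falling exactly on a bin boundary are assumptions. The adjustment itself is the standard Benjamini-Yekutieli step-up procedure, applied separately within each MAF bin.

```python
from collections import defaultdict


def maf_bin(maf):
    """Assign a MAF (as a fraction, e.g. 0.0048 for 0.48%) to one of three bins.

    Boundary handling (strictly greater than each threshold) is an assumption.
    """
    if maf > 0.01:
        return "gt_1pct"
    if maf > 0.001:
        return "0.1_to_1pct"
    if maf > 0.0001:
        return "0.01_to_0.1pct"
    return None  # below the 0.01% analysis threshold


def by_adjust(pvalues):
    """Return Benjamini-Yekutieli adjusted p-values in the input order."""
    m = len(pvalues)
    if m == 0:
        return []
    # Harmonic correction factor c(m) = sum_{i=1..m} 1/i
    c_m = sum(1.0 / i for i in range(1, m + 1))
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity and a cap of 1
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m * c_m / rank)
        adjusted[idx] = running_min
    return adjusted


def adjust_by_bin(tests):
    """tests: list of (variant, phenotype, maf, p) tuples (hypothetical layout).

    Returns (variant, phenotype, bin, adjusted_p) for every test at or above
    the MAF threshold, with the adjustment computed within each bin.
    """
    groups = defaultdict(list)
    for i, (_, _, maf, _) in enumerate(tests):
        b = maf_bin(maf)
        if b is not None:
            groups[b].append(i)
    out = {}
    for b, idxs in groups.items():
        adj = by_adjust([tests[i][3] for i in idxs])
        for i, a in zip(idxs, adj):
            out[i] = (tests[i][0], tests[i][1], b, a)
    return [out[i] for i in sorted(out)]
```

Adjusting within bins means that a rare PTV is compared only against tests with similar statistical power, rather than against the full set of variant-phenotype pairs.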
Among the 74 PTV-phenotype associations we identified, 27 involved PTVs in genes outside of the MHC. We identified five PTVs with seven associations consistent with protective effects (odds ratio [OR]<1, BY-adjusted p<0.05, Figure 2D, Table S2). We found that the rare splice-disrupting PTV rs146597587 in IL33 is strongly associated with protection against asthma (MAF=0.48%, p=7.6x10−13, OR=0.64, 95% CI: 0.57-0.72). This PTV has previously been reported to be negatively associated with eosinophil counts (β=-0.21 SD, p=2.5×10−16) and to have suggestive evidence of an association with asthma (p=1.8x10−4, OR=0.47, 95% CI: 0.32-0.70) 13. Our results provide strong evidence in an independent sample that this PTV protects against asthma and suggest that knocking down IL33 function may be a useful therapeutic approach for asthma. We also identified protective associations for the PTV rs11078928 (MAF=47.1%) in GSDMB against asthma (p=6.3x10−50, OR=0.90, 95% CI: 0.88-0.91) and bronchitis (p=2.6x10−6, OR=0.91, 95% CI: 0.87-0.95). GSDMB is associated with asthma in humans and induces an asthma phenotype in mice when overexpressed 14,15. We identified additional protective associations between PTVs in IFIH1 and hypothyroidism (labeled as hypothyroidism/myxoedema) (MAF=1.5%, p=1.7x10−6, OR=0.80, 95% CI: 0.73-0.88) and VKORC1 and hypertension (MAF=25.3%, p=1.4x10−6, OR=0.97, 95% CI: 0.96-0.98).
We also found 20 risk associations for PTVs in 12 genes outside the MHC (Figure 2D, Table S2). We identified clinically relevant PTV-phenotype associations such as FLG, whose protein product contributes to the structure of epidermal cells, and eczema/dermatitis (MAF=0.48%, p=6.7x10−15, OR=1.80, 95% CI: 1.55-2.08) 16 and TSHR, thyroid stimulating hormone receptor, and hypothyroidism/myxoedema (MAF=0.046%, p=1.2x10−13, OR=3.30, 95% CI: 2.41-4.53) 17. We replicated known risk genome-wide association study (GWAS) associations such as BRCA2 and family history of lung cancer (MAF=0.93%, p=7.3x10−11, OR=1.19, 95% CI: 1.13-1.25) 18 and rs33966350 in ENPEP and hypertension (MAF=1.3%, p=4.8x10−11, OR=1.17, 95% CI: 1.12-1.23) 19 and identified risk associations between FANCM, a member of the same gene family as BRCA2, and lung cancer (MAF=0.11%, p=9.7x10−10, OR=1.58, 95% CI: 1.36-1.83) as well as NOL3, a regulator of apoptosis in muscle cells, and muscle or soft tissue injury (MAF=0.11%, p=6.5x10−8, OR=3.43, 95% CI:2.19-5.36) 20,21. Even in the context of variants with strong predicted effects such as PTVs, it is critical to evaluate whether the associated variant is causal in the context of neighboring variants. We initially identified an association between the PTV rs34358 in ANKDD1B and high cholesterol, although this association disappeared upon conditional analysis with rs17238484, an intronic variant in HMGCR known to be associated with cholesterol levels 22. Another association between rs34358 and family history of diabetes remained upon conditional analysis with rs17238484 (p=9.1x10−5, OR=1.03, 95% CI: 1.02-1.05). Overall we found both PTV-phenotype associations that reflect known biology or disease associations and PTV-phenotype associations that implicate genes in disease.
We identified five significant associations between PTVs and family history phenotypes included in our analysis (Table S2). For two of these associations, the variant associated with the family history phenotype was also associated directly with the phenotype. rs180177132 in PALB2 was associated with a family history of breast cancer (MAF=0.037%, p=2.5x10−8; OR=2.14, 95% CI: 1.64-2.79) as well as breast cancer diagnosis (p=9.0x10−12; OR=4.25, 95% CI: 2.80-6.43) and FUT2 was associated with family history of high blood pressure (MAF=49.1%, p=1.3x10−7; OR=1.03, 95% CI: 1.02-1.04), hypertension diagnosis (p=5.7x10−13; OR=1.04, 95% CI: 1.03-1.05), and essential hypertension (p=5.2x10−8, OR=1.04, 95% CI: 1.02-1.05). We also found that the PTV rs11571833 in BRCA2 was associated with family history of lung cancer (MAF=0.934%, p=7.3x10−11, OR=1.19, 95% CI: 1.13-1.25). These results demonstrate that previous approaches for identifying genetic associations using family history information (e.g. 23) can be applied even to relatively rare PTVs.
To further characterize the PTV-phenotype associations, we asked whether missense variants with MAF greater than 0.01% in the genes with significant PTV associations were also associated with the same phenotypes. For each of the 27 PTV-phenotype associations in our GWAS, we performed association analyses between the missense variants in that gene and the phenotype that the PTV was associated with and found 23 missense variant-phenotype associations with p<0.001 (Table S2). Thirteen of these 23 associations remained significant after a conditional analysis including the PTV genotype as a covariate, indicating that a number of genes with PTV associations also contain independent missense associations. For instance, we found two different missense variants in TSHR that were both associated with hypothyroidism independent of the PTV association. We also identified independent missense associations for genes and phenotypes such as ENPEP and hypertension; GSDMB and asthma; IFIH1 and hypothyroidism; and PALB2 and lung cancer (Table S2). In total, we found at least one missense association for seven genes implicated in our PTV GWAS, providing more evidence that these genes are likely important to the etiology of these conditions.
47 of the 74 significant associations involved PTVs in genes in or near the MHC (Table S2). To investigate whether these associations are caused by linkage between these PTVs and HLA susceptibility alleles, we performed association analyses for each of these PTVs conditional on the presence of each of 344 HLA alleles that were polymorphic among the 337,208 subjects (Table S3). We found that the p-values for all five associations with MAF between 0.1% and 1% were greater than 0.05 for at least one HLA allele (Figure 2E). Similarly, the p-values for 30 of 42 associations with MAF greater than 1% were greater than 0.05 for at least one HLA allele and only three were less than 0.001 (Figure 2F). For instance, we identified an association between rs72841509 in BTN3A2 and Celiac disease (coded malabsorption/coeliac disease) in our initial GWAS (MAF=0.13, p=1.8x10−119, OR=2.33, 95% CI: 2.17-2.50). However, conditioning upon the presence of the well-known Celiac disease risk allele HLA-B8 reduced the p-value of the association between rs72841509 and Celiac disease to p=0.92 24. These results indicate that the majority of the associations identified here for PTVs in MHC genes are likely due to LD with HLA susceptibility alleles and show that it is important to carefully consider the genomic context of associated variants, even for variants with strong predicted effects 25.
We next investigated whether we could identify PTV-phenotype associations using imputed genotypes. After filtering (Methods), we identified 546 PTVs outside the MHC with MAF greater than 0.01% among the UK Biobank imputed genotypes. We stratified these PTVs into the same MAF bins as above (0.01%-0.1%, 0.1%-1%, and 1%-50%) and applied the BY adjustment to the association p-values for each bin. We found nine significant associations for imputed PTVs (BY-adjusted p<0.05, Table S2) including rs74315329 in MYOC and glaucoma (MAF=0.0012, p=1.8x10−30, OR=4.71, 95% CI: 3.61-6.14) 26, a well-known risk variant for glaucoma 27, and D2HGDH and asthma (MAF=0.445, p=1.6x10−12, OR=0.95, 95% CI: 0.94-0.96) and hay fever (coded hayfever/allergic rhinitis) (p=8.4x10−9, OR=0.94, 95% CI: 0.92-0.96). The D2HGDH PTV is in partial LD with an intronic variant rs34290285 in D2HGDH (r2=0.366, LDlink) that has been associated with asthma and allergic disease28,29. We also identified an association between the PTV rs754512 in MAPT and Parkinson’s disease (MAF=0.23, p=1.1x10−6; OR=0.94, 95% CI: 0.92-0.97) 30. This variant is predicted to be a PTV but is in the intron of the canonical MAPT transcript and lies on the same haplotype as three MAPT missense variants (rs17651549, rs62063786, rs10445337) so conditional analysis could not establish the causal allele. We found associations between a PTV in RPL3L and atrial flutter (MAF=0.0021, p=5.0x10−10, OR=0.54, 95% CI: 0.44-0.66) and atrial fibrillation (p=2.3x10−9, OR=0.55, 95% CI: 0.46-0.67). The missense variant rs140185678 in RPL3L is also independently associated with atrial fibrillation (MAF=0.0363, p=5.4x10−9, OR=1.21, 95% CI: 1.14-1.30, unpublished data) and atrial flutter (p=1.1x10−7, OR=1.20, 95% CI: 1.12-1.28). Overall, we were able to recover a small number of associations using imputed PTVs, indicating that better imputation methods are likely needed in the absence of direct genotyping of PTVs.
To further assess the role of PTVs across medical phenotypes, we performed a phenome-wide association analysis (pheWAS) to determine whether PTVs that have been implicated in disease predisposition may impact other diseases or commonly measured traits 31. We focused this analysis on PTVs with minor allele frequency greater than 0.01% in the 17 genes with significant associations in our GWAS. In addition to PTVs in the genes identified here, we also investigated PTVs in genes with previously identified protective effects: CARD9, RNF186, and IL23R, shown to confer protection against Crohn’s disease and/or ulcerative colitis 2,1; ANGPTL4, PCSK9, LPA, and APOC3, shown to confer protection against coronary heart disease 4,32,33,34,35,36; and SCN9A, where homozygous PTV carriers show an inability to experience pain 37 (Table S4).
We identified all associations (p<0.01) for PTVs in these 25 genes with a MAF greater than 0.01% and found that PTVs in many of these genes were associated with a broad range of phenotypes (Table S2, Figure S4). PTVs in eight of the 25 genes were associated with eight or more phenotypes. We observed associations between the viral receptor IFIH1 and 10 phenotypes including protective effects against hypothyroidism, hypertension, gastric reflux, and psoriasis (Figure 3, Table S4). Despite minor allele frequencies ranging from 0.02% to 1.5%, three of these associations were observed for more than one IFIH1 PTV. PTVs in IFIH1 were also associated with increased risk for ulcerative colitis, inflammatory bowel disease, and endometriosis. We identified new protective effects for IL33 for hay fever (coded hayfever/allergic rhinitis), nasal polyps, and angina as well as weak risk effects for bowel/intestinal obstruction and shoulder/scapula fracture (Figure S4). Overall, these results demonstrate that PTVs can have pleiotropic effects across diverse phenotypes and that PTVs in the same gene can both protect against and increase risk for different diseases.
Homozygous carriers of PTVs, referred to as homozygous knockouts (KOs), may have dramatically altered medical outcomes compared to carriers with only one PTV (heterozygous KOs) 38. Genetic association analyses typically assume that genetic effects are additive; that is, the log OR of a homozygote is expected to be twice the log OR of a heterozygote. Given the large difference between having one functional copy and no functional copies of a gene, however, we expect that homozygote KOs may have non-additive effects that are stronger or weaker than would be predicted given the effect size for heterozygote KOs. To assess whether any of the 17 genes with significant associations in our GWAS or the eight genes with published protective effects (Table S4) have evidence for non-additive effects on medical phenotypes, we estimated the KO status in each subject for each of these 25 genes. Subjects with one PTV in a gene were considered heterozygote KOs for that gene and subjects with two or more PTVs were considered homozygote KOs. In total, 16 of the 25 genes had at least one predicted homozygous KO carrier. We fit additive and non-additive models to test for associations between KO status for these 16 genes and 206 medical phenotypes (minimum 1,000 cases, Figure S5) and found 13 associations (6 distinct genes, 12 distinct phenotypes) with potential non-additive effects (Figure S6, Table S5, Methods).
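The KO classification and the two genotype codings described above can be sketched as follows. This is a hedged illustration: the function names and input layout are hypothetical, the regression fit itself is omitted, and counting two PTV alleles at different sites as a homozygote KO follows the "two or more PTVs" rule in the text without any phasing. The two-indicator coding mirrors the "dosage1" and "dosage2" columns of Table S5, and the AIC difference follows the sign convention stated for "aic_diff" there.

```python
def ko_status(ptv_allele_counts):
    """Classify one subject for one gene.

    ptv_allele_counts: per-variant PTV allele counts (0, 1, or 2) for the
    PTVs in that gene. Returns 0 (no PTV), 1 (heterozygote KO), or
    2 (homozygote KO, i.e. two or more PTV alleles in the gene).
    """
    total = sum(ptv_allele_counts)
    if total == 0:
        return 0
    if total == 1:
        return 1
    return 2


def additive_coding(status):
    """One covariate: the log OR scales linearly with the KO dosage."""
    return [float(status)]


def non_additive_coding(status):
    """Two indicator covariates, so heterozygote and homozygote KOs each
    get their own log OR (dosage1 and dosage2)."""
    return [1.0 if status == 1 else 0.0, 1.0 if status == 2 else 0.0]


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * n_params - 2 * log_likelihood


def aic_diff(loglik_non_additive, k_non_additive, loglik_additive, k_additive):
    """AIC(non-additive) minus AIC(additive); negative values favor the
    non-additive model."""
    return aic(loglik_non_additive, k_non_additive) - aic(loglik_additive, k_additive)
```

The non-additive model spends one extra parameter per gene, so the AIC comparison rewards it only when separating the two KO classes improves the fit enough to offset that cost.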
We identified 87,176 predicted homozygous KOs for FUT2 caused by a common PTV rs601338 with MAF 49.1% and identified non-additive risk associations between FUT2 KO status and eight phenotypes including hypertension and mumps (Figure 4, Table S5). FUT2 regulates the expression of the H antigen on the gastrointestinal mucosa and genetic variation in FUT2 is associated with Crohn’s disease 39,40, psoriasis 41, plasma vitamin B12 levels 42,43, levels of two tumor biomarkers 44,45, and urine fucose levels 46. Under a non-additive model, the ORs for heterozygous FUT2 KOs are all nearly one while FUT2 homozygous KOs have ORs ranging from 1.05 (95% CI: 1.03-1.07) to 1.51 (95% CI: 1.29-1.77). Given the frequency of the rs601338 PTV, our results indicate that FUT2 function may play an important role in a wide range of phenotypes.
We also found evidence that the association between GSDMB KO and asthma described in our GWAS analysis above is non-additive (Figure S6, Table S5). In total, we identified 168,025 heterozygous KOs and 74,534 homozygous KOs for GSDMB. Under an additive model, GSDMB heterozygote KOs are predicted to have a decreased risk for asthma with OR=0.90 (p=5.9x10−50; 95% CI: 0.88-0.91). Under a non-additive model, however, GSDMB heterozygote KOs are predicted to have OR=0.86 (p=4.3x10−38; 95% CI: 0.84-0.88) while GSDMB homozygote KOs have only modestly higher protection (p=9.7x10−46, OR=0.81, 95% CI: 0.79-0.84). Variants that increase expression of GSDMB in humans are associated with asthma risk 47, and increased GSDMB expression causes an asthma phenotype in mice 48. Our results suggest that knocking out just one copy of GSDMB provides most of the effect on asthma risk. Overall, we identified non-additive PTV associations for six of 16 genes tested, demonstrating that the effect of PTVs on disease risk can be complex.
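The GSDMB example can be checked with a short calculation on the reported point estimates. This is only an arithmetic sketch: the helper names are hypothetical, and the confidence intervals are ignored. Under additivity the homozygote OR equals the heterozygote OR squared, so the share of the homozygote log OR reached with one copy would be 0.5.

```python
import math


def additive_homozygote_or(het_or):
    """Homozygote OR implied by additivity on the log-odds scale."""
    return het_or ** 2


def one_copy_fraction(het_or, hom_or):
    """Share of the homozygote log OR already reached by heterozygotes
    (0.5 under a strictly additive model)."""
    return math.log(het_or) / math.log(hom_or)


# Point estimates for GSDMB and asthma from the non-additive model in the text
het_or, hom_or = 0.86, 0.81

expected_hom = additive_homozygote_or(het_or)
fraction = one_copy_fraction(het_or, hom_or)
print(f"additive expectation for homozygote OR: {expected_hom:.2f}")
print(f"share of homozygote log OR from one copy: {fraction:.2f}")
```

With these inputs the additive expectation is about 0.74 against a reported homozygote OR of 0.81, and one copy accounts for roughly 72% of the homozygote effect on the log-odds scale, consistent with the statement that a single knocked-out copy provides most of the effect.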
Discussion
Assessing the medical relevance of protein-truncating variants is critical for prioritizing putative drug targets and clinical interpretation. We systematically characterized the association of PTVs, a class of variants with functional consequences likely to be consistent with inhibition, with medical phenotypes using data from the UK Biobank study. We estimated the effects of PTVs across 135 phenotypes and identified 27 associations between PTVs in 17 genes and 20 different phenotypes. We found four associations for PTVs with minor allele frequency less than 0.1%, indicating that more subjects or case/control study designs may be needed to test for associations between ultra-rare PTVs and relatively low prevalence diseases that are not well-represented in biobank datasets. We performed 25 phenome-wide association analyses for the genes implicated by GWAS in this study plus eight genes curated from the literature (Table S4) and identified eight genes that were associated with eight or more phenotypes (p < 0.01). Six of the genes tested showed evidence for non-additive associations across several phenotypes including non-additive associations between GSDMB and asthma and FUT2 and eight phenotypes including hypertension and cholesterol.
The genetic associations reported here directly link gene function to disease etiology and provide attractive targets for drug discovery. Naturally occurring human knockouts that protect against disease provide in vivo validation of safety and efficacy and may be relatively simple to target with drugs. Protective associations between PTVs in IL33 and asthma; GSDMB and asthma; and IFIH1 and hypothyroidism represent particularly attractive drug targets while risk associations between PTVs in FANCM and lung cancer and NOL3 and muscle injuries implicate these genes as important to the development of these conditions. Our results illustrate the value of deep population-scale health and genomic datasets for prioritizing genetic variants and genes with translational potential.
Author Contributions
M.A.R. conceived and designed the study. C.D., Y.T., G.M., A.L., and M.A.R. designed and carried out the statistical and computational analyses. C.C. optimized and implemented computational methods. C.D., Y.T., G.M., A.L., and M.J.D. carried out quality control of the data. Browser features in Global Biobank Engine were led and developed by G.M. and M.A.R., with assistance from A.L., Y.T., and C.D. M.A.R. supervised all aspects of the study. M.J.D. and C.D.B. provided analysis and commented on the manuscript. The manuscript was written by C.D., Y.T., and M.A.R.
Conflicts of Interest
C.D.B. is a member of the scientific advisory boards for Liberty Biosecurity, Personalis, 23andMe Roots into the Future, Ancestry.com, IdentifyGenomics, and Etalon and is a founder of CDB Consulting. M.J.D. is a member of the scientific advisory board for Ancestry.com. M.A.R. is a paid consultant for Genomics PLC and Prime Genomics.
Supplemental Tables
Table S1. Medical Phenotypes. 206 medical phenotypes used in this study. Category indicates whether the phenotype was derived from family history questionnaire information (FH) or from diagnosis of a cancer (CA) or other disease (HC). See Methods and Figures S2 and S5 for more information.
Table S2. Significant GWAS and pheWAS associations. Significant associations from PTV and missense GWASs and pheWAS analysis. Variant IDs are from the UK Biobank data release. “gwas_protective” and “gwas_risk” tabs contain significant PTV associations (BY-adjusted p<0.05) for genes outside the MHC. “imputed” tab contains significant associations (BY-adjusted p<0.05) for imputed PTVs. “missense” tab contains significant (p<0.001) missense associations. “phewas” tab contains significant (p<0.01) pheWAS associations.
Table S3. HLA conditional analysis. p-values for genetic effects for PTVs in MHC genes from our initial GWAS (“P_variant_gwas”) and from an analysis conditional on the HLA allele in the “HLA_subtype” column (“P_variant_conditional”). “P_subtype_conditional” contains the p-value for the association between the given HLA subtype and phenotype in the conditional analysis.
Table S4. Genes with known protective associations. We identified eight genes that did not have significant associations in our single-variant analysis (BY-adjusted p<0.05) but have been reported in the literature to protect against different diseases. We included these genes in our pheWAS and additivity analyses.
Table S5. Non-additive associations. Results from fitting additive and non-additive models of association for PTV carrier status (no PTVs, heterozygous knockout, or homozygous knockout) in 25 genes against 206 medical phenotypes with at least 1,000 cases. Columns that begin with “dosage1” and “dosage2” correspond to results for the non-additive model while columns that begin with “additive” correspond to the additive model. “aic_diff” is the AIC of the additive model subtracted from the AIC of the non-additive model.
Acknowledgements
This research has been conducted using the UK Biobank resource. We thank all the participants in the UK Biobank study. We would like to thank Stefan Stender for suggesting that the association between the PTV rs34358 and high cholesterol might be due to the LDL GWAS variant rs17238484 in HMGCR (https://gitter.im/UK-Biobank/Lobby). M.A.R. is supported by Stanford University and a National Institutes of Health Center for Multi- and Trans-ethnic Mapping of Mendelian and Complex Diseases grant (5U01 HG009080). C.D. is supported by a postdoctoral fellowship from the Stanford Center for Computational, Evolutionary, and Human Genomics. Y.T. is supported by the Funai Overseas Scholarship from the Funai Foundation for Information Technology and the Stanford University Biomedical Informatics Training Program. A.L. and G.M. are supported by NIH BD2K grant number T32 LM 012409. The primary and processed data used to generate the analyses presented here are available in the UK Biobank access management system (https://amsportal.ukbiobank.ac.uk/) for application 24983, “Generating effective therapeutic hypotheses from genomic and hospital linkage data” (http://www.ukbiobank.ac.uk/wp-content/uploads/2017/06/24983-Dr-Manuel-Rivas.pdf), and the results are displayed in the Global Biobank Engine (https://biobankengine.stanford.edu). We would like to thank the Customer Solutions Team from Paradigm4, who helped us implement efficient databases for queries and application of inference methods to the data.