ABSTRACT
Cancer risk is determined by a complex interplay of genetic and modifiable risk factors. Combining individual germline risk variants into polygenic risk scores (PRS) creates a personalized genetic susceptibility profile that can be leveraged for disease prediction. Using data from the UK Biobank cohort (413,753 individuals; 22,755 incident cases), we systematically quantify the added predictive value of augmenting conventional cancer risk factors with PRS for 16 cancer types. Our results indicate that incorporating PRS in addition to family history of cancer and modifiable risk factors improves prediction accuracy, but the magnitude of incremental improvement varies substantially between cancers. We also demonstrate the utility of PRS for risk stratification. Individuals with high genetic risk (PRS≥80th percentile) have significantly divergent 5-year absolute risk trajectories across strata based on family history and modifiable risk factors. Finally, we estimate that high genetic risk accounts for 4.0% to 30.3% of new cancer cases, which exceeds the impact of many lifestyle-related risk factors. In summary, we provide novel quantitative data illustrating the importance of integrating PRS into personalized cancer risk assessment.
INTRODUCTION
Cancer susceptibility is inherently complex, but it is well accepted that heritable genetic factors and modifiable exposures contribute to cancer development. While our knowledge of causal modifiable risk factors has gradually evolved over the past decades, genome-wide association studies (GWAS) have rapidly produced a wealth of germline genetic risk variants for different cancers. These studies have shed light on genetic mechanisms of cancer susceptibility; however, the public health impact of GWAS findings has been modest. In response, GWAS results have been leveraged to create polygenic risk scores (PRS) by combining weighted genotypes for risk alleles into a single, integrated measure of an individual’s genetic predisposition to a specific phenotypic profile. Such genetic risk scores are not designed to reflect the complexity of molecular susceptibility mechanisms, but they are highly amenable to phenotypic prediction.
Multiple studies have demonstrated that PRS can generate informative predictions for heritable traits1, 2 and diseases3, 4, prompting many to advocate for increased integration of genetic risk scores into clinical practice5, 6. An important step towards realizing the promise of PRS in precision medicine lies in systematically assessing the added value of genetic information in comparison to conventional risk factors and examining how it affects lifetime risk trajectories6. The recent development of large, prospective cohorts with both genome-wide genotyping and deep phenotyping data, such as the UK Biobank7, provides an opportunity for integrative analyses of genetic variation and modifiable risk factors. In addition to evaluating PRS predictive performance, these data also provide a unique opportunity to answer etiological questions about the relative contribution of genetic and modifiable risk factors to cancer susceptibility.
Our overarching aim was to quantify the relative contribution of common, low-penetrance risk variants to cancer risk prediction and overall disease susceptibility. To address this aim, we assembled PRS for 16 cancer types, based on results from previously published GWAS, and applied them to 413,870 individuals in the UK Biobank (UKB) cohort. First, we assessed the degree to which PRS can improve risk prediction and stratification based on established cancer risk factors, such as family history and modifiable health-related characteristics. Next, we estimated the proportion of cancer cases at the population level that can be attributed to high genetic susceptibility, captured by the PRS, and compared this to modifiable determinants of cancer.
RESULTS
Characteristics of the UKB study population are presented in Supplementary Table 1. Over the course of the follow-up period a total of 22,755 incident cancers were diagnosed in 413,753 individuals, after excluding participants outside of the age enrollment criteria and those who withdrew consent after enrollment. Established cancer risk factors (listed in Supplementary Table 2) exhibited associations of expected magnitude and direction with each cancer (Supplementary Table 3). Family history of cancer in first-degree relatives, at the corresponding site, conferred a significantly higher risk of prostate (HR=1.84, 95% CI: 1.68-2.00, p=9.1×10-46), breast (HR=1.56, 1.44-1.69, p=3.0×10-29), lung (HR=1.61, 1.43-1.81, p=7.4×10-15), and colorectal (HR=1.26, 1.14-1.40, p=1.2×10-5) cancers. Metrics of tobacco use, such as smoking status, intensity, and duration were positively associated with risks of lung, colorectal, bladder, kidney, pancreatic, and oral cavity/oropharyngeal cancers. Weekly alcohol intake was associated with higher risks of breast (HR per 70 grams = 1.04, p=2.3×10-5), colorectal (HR=1.04, p=5.9×10-9), and oral cavity/pharyngeal (HR=1.05, p=3.0×10-10) cancers. Adiposity was associated with cancer risk at multiple sites, including endometrium (BMI: HR per 1-unit = 1.09, 1.08-1.10, p=1.6×10-49), colon/rectum (waist-to-hip ratio: HR per 10% increase = 1.17, 1.11-1.24, p=2.2×10-8), and kidney (BMI: HR=1.04, 1.02-1.05, p=1.7×10-6). Particulate matter (PM2.5) was associated with lung cancer risk8 (PM2.5: HR per 1 micro-g/m3 = 1.10, 1.05-1.15, p=1.9×10-5) in the model that included smoking status and intensity.
All PRS associations with the target cancer reached at least nominal statistical significance (Figure 1; Supplementary Table 4). We considered three PRS approaches (see Methods for details): standard weights corresponding to reported risk allele effect sizes (PRSβ); unweighted sum of risk alleles (PRSunw); and inverse variance weights that incorporate the standard error of the risk effect size (PRSIV). The latter approach resulted in stronger or equivalent (HR ± 0.01) associations for most cancers, except non-Hodgkin lymphoma (NHL). Compared to standard PRSβ, substantial differences were observed for prostate (PRSIV: HR=1.77, P=4.3×10-366 vs. PRSβ: HR=1.39, P=2.0×10-105), colon/rectum (PRSIV: HR=1.48, P=1.8×10-94 vs. PRSβ: HR=1.32, P=5.5×10-50), leukemia (PRSIV: HR=1.70, P=6.3×10-23 vs. PRSβ: HR=1.45, P=8.0×10-13), and thyroid (PRSIV: HR=1.75, P=1.9×10-15 vs. PRSβ: HR=1.57, P=5.7×10-10). All subsequent analyses use PRSIV since this approach appears to improve PRS performance by appropriately downweighting the contribution of variants with less precisely estimated effects.
Improvement in risk prediction
The predictive performance of each risk model was evaluated based on its ability to accurately estimate risk (calibration) and distinguish cancer cases from cancer-free individuals (discrimination). All cancer-specific risk models were well-calibrated (Goodness of fit p>0.05; Supplementary Figure 1). Model discrimination was assessed by Harrell’s C-index, estimated as a weighted mean between 1 and 5 years of follow-up time. For completeness, we also report the AUC at 5 years of follow-up time9. Proportionality violations (p<0.05) were detected for age in the breast cancer model and PRSIV for cervical cancer. For breast cancer this was resolved by incorporating an interaction term with follow-up time. As a sensitivity analysis for cervical cancer we modelled a time-varying PRS effect (Supplementary Figure 2).
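As an illustration of the discrimination metric described above, the following is a minimal sketch of Harrell's C-index for right-censored data. It is not the authors' implementation: it omits the weighting between 1 and 5 years of follow-up used in this study, ignores competing events, and the function name `harrell_c` and its arguments are hypothetical.

```python
def harrell_c(time, event, risk):
    """Plain Harrell's C-index (illustrative sketch, O(n^2)).

    time  : follow-up time for each individual
    event : 1 if the cancer was observed, 0 if censored
    risk  : predicted risk score (higher = higher predicted risk)

    ASSUMPTION: no time weighting and no competing-risk handling,
    unlike the analysis described in the text.
    """
    concordant = 0.0
    comparable = 0
    n = len(time)
    for i in range(n):
        if not event[i]:
            continue  # a pair is comparable only if the earlier time is an event
        for j in range(n):
            if time[i] < time[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0   # earlier case had the higher predicted risk
                elif risk[i] == risk[j]:
                    concordant += 0.5   # ties in predicted risk count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 0.5 corresponds to no discrimination and 1.0 to perfect ranking of cases ahead of individuals who remain cancer-free for longer.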
The C-index reached 0.60 with age and/or sex, for all cancers except for breast and thyroid (Supplementary Table 5). For cancers with available information on family history of cancer at the same site (prostate, breast, colon/rectum, and lung), incorporating this had a modest impact on the C-index (ΔC<0.01). In fact, replacing family history with the PRS resulted in an improvement in discrimination for prostate (C=0.763, ΔC=0.047), breast (C=0.618, ΔC=0.060), and colorectal (C=0.708, ΔC=0.029), but not lung (C=0.711, ΔC=-0.002) cancers.
Next, we assessed the change in the C-index (ΔC) after incorporating the PRS into prediction models with all available risk factors for each cancer (Figure 2; Supplementary Table 5). The resulting improvement in prediction performance was variable. The largest increases in the C-index were observed for cancer sites with few available predictors, such as testes (CPRS=0.766, ΔC=0.138), thyroid (CPRS=0.692, ΔC=0.099), prostate (CPRS=0.768, ΔC=0.051) and lymphocytic leukemia (CPRS=0.756, ΔC=0.061). However, adding the PRS also improved prediction accuracy for melanoma (CPRS=0.664, ΔC=0.042), breast (CPRS=0.631, ΔC=0.060), and colorectal (CPRS=0.716, ΔC=0.030) cancers, which have multiple environmental risk factors. The highest overall C-index was observed for lung (CPRS=0.849) and bladder (CPRS=0.814) cancers, which was primarily attributed to non-genetic predictors (C without PRS: lung = 0.846; bladder = 0.808). Changes in the AUC at 5 years of follow-up were of similar magnitude (Supplementary Table 5).
As a complementary metric of model performance, Royston’s R2 was calculated to quantify the variation in the time-to-event outcome captured by each risk model10. Across all 16 sites, the median change in R2 (ΔR2) was 0.066. Large improvements, defined as ΔR2 >0.10, were observed for cancers of the breast (R2PRS=0.146; ΔR2=0.103), pancreas (R2PRS=0.439; ΔR2=0.103), leukemia (R2PRS=0.415; ΔR2=0.160), prostate (R2PRS=0.510; ΔR2 =0.161), thyroid (R2PRS=0.310; ΔR2 =0.230), and testis (R2PRS=0.605; ΔR2 =0.421). These results parallel the trend in improvement observed based on C-index and AUC.
For 15 out of 16 cancers, incorporating the PRS resulted in significant improvement in reclassification, as indicated by positive percentile-based net reclassification index (NRI)11 values with 95% bootstrapped confidence intervals excluding 0 (Supplementary Table 6). The overall NRI was primarily driven by the event NRI (NRIe), which is the increase in the proportion of cancer cases reclassified to a higher risk group. Positive NRIe values >0.25 were observed for prostate, thyroid, breast, testicular, leukemia, melanoma, and colorectal cancers. The largest reclassification improvements in non-event NRI (NRIne) were observed for the lung PRS (NRIne=0.015) and breast PRS (NRIne=0.012). Four cancers (testes, leukemia, kidney, oral cavity/pharynx) had significantly negative NRIne values, indicating that adding the PRS decreased classification accuracy in cancer-free individuals.
Refinement of risk stratification
The ability of the PRS to refine risk estimates was assessed by examining 5-year absolute risk trajectories as a function of age, across strata defined by percentiles of PRS (high risk: ≥80%; average: >20% to <80%; low risk: ≤20%) and family history of cancer (Figure 3). Significantly diverging risk trajectories, overall and at age 60, were observed for prostate (P<4.5×10-25), breast (P<9.3×10-36), colorectal (P<2.0×10-21), and lung cancers (P<0.031). For all cancers except lung, risk stratification was primarily driven by PRS. For instance, 60-year-old men with a high PRS but no family history of prostate cancer had a higher mean 5-year disease risk (4.74%) compared to men with a positive family history and an average PRS (3.66%). For lung cancer, on the other hand, participants with a positive family history had higher average 5-year risks, even with a low PRS (0.54%), compared to those without (high PRS: 0.46%; low PRS: 0.29%). There was evidence of interaction between the PRS and family history of cancer for prostate (P = 9.0×10-128), breast (P = 1.7×10-104), and colorectal (P = 8.7×10-14) cancers (Supplementary Table 7). For lung cancer the interaction with family history was limited to the high PRS group (P = 5.9×10-3).
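The percentile-based PRS strata used for these trajectories can be sketched as follows. This is only an illustrative assumption of how the cut-points might be applied; the function name `prs_strata` is hypothetical, and percentiles are computed within whatever sample is passed in.

```python
import numpy as np

def prs_strata(prs, low_pct=20, high_pct=80):
    """Label each individual as 'low', 'average', or 'high' genetic risk.

    ASSUMPTION: cut-points are empirical percentiles of the supplied PRS
    values (low: <= 20th percentile, high: >= 80th percentile).
    """
    prs = np.asarray(prs, dtype=float)
    lo, hi = np.percentile(prs, [low_pct, high_pct])
    return np.where(prs >= hi, "high", np.where(prs <= lo, "low", "average"))
```

Absolute risks would then be summarized within each stratum, optionally cross-classified by family history.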
We also compared 5-year risk projections across strata of PRS and modifiable risk factors. Effects of multiple risk factors were combined into a single score by generating summary linear predictors for each cancer (see Methods for details). For several common cancers, individuals with a high PRS were predicted to have higher cancer risk, even with modifiable risk factor scores below the median (Figure 4). PRS achieved significant risk stratification for breast cancer (pre-menopausal: P<5.9×10-12; post-menopausal: P<4.3×10-50), colorectal cancer (P<1.8×10-42), and melanoma (P<4.6×10-105) (Figure 4). The same pattern of stratification was observed for NHL, leukemia, pancreatic, thyroid, and testicular cancers (Supplementary Figure 3). For other phenotypes, lifestyle-related risk factors had a stronger overall influence on risk trajectories than PRS (Figure 5). However, stratifying by levels of PRS still resulted in significantly diverging risk projections for several cancers (lung: P<1.1×10-13; oral cavity/pharynx: P<1.2×10-12; kidney: P<1.1×10-13). For bladder cancer, the risk trajectories for high PRS/reduced modifiable risk and low PRS/high modifiable risk were overlapping (P=0.98).
There was evidence of larger than additive risk differences at age 60 between elevated modifiable risk factor profiles and all ordinal PRS categories for melanoma (P=3.3×10-122), post-menopausal breast (P=1.3×10-21), colorectal (P=1.3×10-208), lung (P=1.1×10-37), bladder (P=1.5×10-50), kidney (P=5.5×10-29), and oral cavity/pharynx cancers (P=5.2×10-11) (Supplementary Table 7). For pre-menopausal breast cancer the interaction was limited to women in the high PRS group (P=4.4×10-4).
Quantifying population-level impact
Population attributable fractions (PAF) were used to summarize the relative contribution of genetic susceptibility and modifiable risk factors to cancer risk at the population level. In order to allow comparisons between PAF estimates, the PRS and modifiable risk score distributions were both dichotomized at ≥80th percentile. All risk factors nominally contributed (P<0.05) to cancer incidence (Figure 6; Supplementary Table 8), with the exception of the PRS for oral cavity/pharynx cancer (P=0.78) and PM2.5 for lung cancer in never smokers (P=0.44).
PAF for high genetic risk exceeded the contribution of modifiable exposures for several cancers, such as thyroid (PAFPRS=0.268, P=1.7×10-9), prostate (PAFPRS=0.232, P=5.5×10-158), colon/rectum (PAFPRS=0.167, P=9.2×10-50), breast (PAFPRS=0.166, P=2.6×10-85), and melanoma (PAFPRS=0.139, P=1.3×10-23). For testicular cancer (PAFPRS=0.303, P=4.5×10-4), leukemia (PAFPRS=0.269, P=4.5×10-4), lung cancer in never smokers (PAFPRS=0.077, P=0.045), and NHL (PAFPRS=0.053, P=1.9×10-3), PRS was the only significant risk factor other than demographic factors. Cancers for which modifiable risk factors had a substantially larger impact on disease burden than PRS included oral cavity/pharynx (PAFmod=0.310 vs. PAFPRS=0.006), lung (PAFmod=0.636 vs. PAFPRS=0.040), endometrium (PAFmod=0.353 vs. PAFPRS=0.043), kidney (PAFmod=0.210 vs. PAFPRS=0.046), and bladder cancers (PAFmod=0.189 vs. PAFPRS=0.085). For other sites, such as pancreas (PAFmod=0.118 vs. PAFPRS=0.133) and ovary (PAFmod=0.100 vs. PAFPRS=0.082), the contributions of PRS and modifiable risk factors were more balanced.
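To make the PAF concept concrete, the sketch below implements the textbook (unadjusted) Levin formula, PAF = p(RR − 1) / (1 + p(RR − 1)), where p is the exposure prevalence and RR the relative risk. This is an assumption for illustration only: this section does not specify the estimator used, and an adjusted, model-based estimator would generally give different values. The function name `levin_paf` is hypothetical.

```python
def levin_paf(prevalence, relative_risk):
    """Unadjusted population attributable fraction (Levin's formula).

    ASSUMPTION: a single binary exposure with no confounding; this is not
    necessarily the estimator behind the PAF values reported in the text.
    """
    if not 0.0 <= prevalence <= 1.0:
        raise ValueError("prevalence must lie in [0, 1]")
    if relative_risk <= 0.0:
        raise ValueError("relative_risk must be positive")
    excess = prevalence * (relative_risk - 1.0)
    return excess / (1.0 + excess)
```

Because both the PRS and the modifiable risk score were dichotomized at the 80th percentile, the exposure prevalence is 0.20 by construction, so differences in PAF between the two reflect differences in the strength of association rather than in prevalence.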
DISCUSSION
Cancer is a multifactorial disease with a complex web of etiological factors, from macro-level determinants, such as health policy, to individual-level characteristics, such as health-related behaviors and heritable genetic profiles. Heritable and modifiable risk factors act in concert to influence cancer development, but their relative contributions to disease risk are rarely compared directly in the same population. In this study we provide new insight into the potential utility of PRS for cancer risk prediction and into the relative contribution of genetic and modifiable risk factors to cancer incidence at the population level.
Our first major finding is that cancer-specific PRS comprised of lead GWAS variants improve risk prediction for all 16 cancers examined. However, the magnitude of the resulting improvement in prediction varies substantially between sites. In evaluating the added predictive value of the PRS it is important to keep in mind that achieving the same incremental increase in the C-index/AUC is more difficult when the baseline model already performs well12. This was applicable to most cancers, where age and/or sex alone achieved non-trivial risk discrimination (C-index/AUC>0.60). Expanding the set of predictors to include modifiable risk factors further improved discrimination, as previously shown13. By adding the PRS to the most comprehensive risk factor models facilitated by our data, we adopted a conservative approach for quantifying its added predictive value, which provides an informative benchmark for future efforts seeking to incorporate genetic predisposition in cancer risk assessment.
Cancer sites for which the PRS resulted in the largest gains in prediction performance included prostate, testicular, and thyroid cancers, as well as leukemia and melanoma. This is consistent with high heritability estimates reported for these cancers in twin studies14 and our analyses in the UK Biobank15. Modelling the PRS in addition to established risk factors yielded very modest improvements in risk discrimination for cancers of the lung, endometrium, bladder, oral cavity/pharynx, and kidney. These cancers have strong environmental risk factors, such as smoking, alcohol consumption, obesity, and HPV infection, some of which were captured in our analysis. Limited predictive ability for cervical and endometrial cancers may also be due to a low number of variants included in the PRS (9 and 10, respectively). The association of the lung cancer PRS with cigarettes per day16 may have diminished its apparent predictive value when added to a model with smoking status and intensity, which already achieved an AUC>0.80, making it difficult to elicit further improvement. Furthermore, PRS may be particularly relevant for assessing lung cancer risk in never smokers, since other risk factors have a limited impact in this population.
Few pan-cancer PRS studies have been conducted in prospective cohorts and none have considered the breadth of modifiable risk factors that we evaluated. Shi et al.17 tested 11 cancer PRS in cases from The Cancer Genome Atlas and controls from the Electronic Medical Records and Genomics Network. This analysis was limited by fewer risk variants in each PRS, as well as potential for bias due to selection of cases and controls from different populations. A phenome-wide analysis in the Michigan Genomics Initiative cohort by Fritsche et al.18 examined PRS for 12 cancers and reported similar associations for the target phenotype. However, risk stratification was not formally evaluated. Considering cancer-specific studies, the PRS presented here achieved superior prediction performance for some cancers19–22, but not others23, 24. For pancreatic cancer25 and melanoma26, our results are consistent with previous analyses using PRS of similar composition. Generally, comparison of prediction performance is complicated by differences in PRS content, population characteristics, and inclusion of different non-genetic predictors. Outside the cancer literature, our conclusions align with a recent study of ischemic stroke, which demonstrated that the PRS is similarly or more predictive than multiple established risk factors, including family history27.
Our second major finding advances the idea of using germline genetic information to refine individual risk estimates. We show that incorporating PRS improves risk stratification provided by conventional risk factors alone, as illustrated by significantly diverging 5-year risk projections within strata based on family history or modifiable risk factors. For certain cancers, including some with strong environmental risk factors, such as melanoma, breast, colorectal, and pancreatic cancers, PRS was the primary determinant of risk stratification. For others, such as lung and bladder cancers, modifiable risk factors had a stronger impact on 5-year risk trajectories. A consistent finding for all cancers was that individuals in the top 20% of the PRS distribution with an unfavorable modifiable risk factor profile had the highest level of risk, with evidence that the effects of PRS and modifiable risk factors may be synergistic. Taken together, these findings highlight the potential for attenuating high genetic risk by adhering to a healthier lifestyle. Similar risk stratification results based on genetic and modifiable risk factors have also been reported for coronary disease28 and Alzheimer’s29.
In addition to evaluating predictive performance and risk stratification, our work demonstrates the relevance of common genetic risk variants at the population level. High genetic risk (PRS≥80th percentile) explained between 4.0% and 30.3% of new cancer cases, and for many phenotypes this exceeded PAF estimates for modifiable risk factors or family history. The contribution of genetic variation to disease risk is typically conveyed by heritability, which is an informative metric, although not easily translated into a measure of disease burden useful in a public health context. Recent work on cancer PAF in the UK30 and a series of publications from the ComPARe initiative in Canada31, 32 examined a wide range of modifiable risk factors. Despite providing useful data, these studies overlook the contribution of genetic susceptibility. Our work addresses these limitations by providing a more complete perspective on the determinants of cancer and the potential impact of future prevention policies.
In evaluating the contributions of our study, several limitations should be acknowledged. First, we did not account for the impact of workplace exposures and socio-economic determinants of health, thereby underestimating the role of non-genetic risk factors. We also lacked data on several known carcinogens, such as ionizing radiation, and clinical biomarkers, such as prostate-specific antigen, thus limiting the extent to which our results inform risk discrimination for certain cancers. Information on family history was also not available for all cancer types. Second, since the UK Biobank cohort is unrepresentative of the general UK population due to low participation and resulting healthy volunteer bias33, we may have underestimated PAFs for modifiable risk factors. Finally, the models presented here are calibrated to the UKB population and we urge caution in extrapolating prediction performance and absolute risk projections to other populations. Since our analytic sample is restricted to individuals of predominantly European ancestry, this limits the applicability of our findings to diverse populations.
This work has several important strengths. The UK Biobank resource enabled us to simultaneously evaluate heritable and modifiable cancer risk factors in a population-based cohort with uniform deep phenotyping. We report a series of metrics that comprehensively characterize different dimensions of predictive performance that can be improved by incorporating genetic risk scores. While our results are promising, we anticipate that the performance of the PRS reported here may be enhanced by adopting less stringent p-value thresholding to include additional risk variants, optimizing subtype-specific weights, and implementing more sophisticated PRS models that incorporate linkage disequilibrium structure, functional annotations, or SNP interactions. Some of these strategies are already being successfully implemented4, 23. We also provide insight into PRS modelling by showing that accounting for the variance in risk allele effect sizes improves PRS performance. This approach may be particularly advantageous for PRS derived from multiple sources rather than a single GWAS. Throughout this study we consider a relatively lenient definition of high genetic risk, corresponding to the top 20% of the PRS distribution. Exploring other cut-points will be informative; however, our results are valuable for demonstrating that the utility of PRS for stratification is not limited to the most extreme ends of the genetic susceptibility spectrum. This threshold is also compelling from a population-health perspective, as it allows us to quantify the proportion of cases attributed to a risk factor with a 20% prevalence.
Genetic risk scores have the potential to become a powerful tool for precision health, but only if the resulting information can be understood and acted on appropriately. One important consideration is the accuracy and stability of PRS-based risk classifications, especially at clinically actionable risk thresholds that exist for certain cancers. For instance, there are established screening programs for breast and colorectal cancers, and increasing evidence supporting the effectiveness of low-dose computed tomography for lung cancer screening34, 35. For these cancers PRS could be used to adjust the optimal age for screening initiation and/or intensity. However, to justify this, studies are needed to demonstrate the benefit of using PRS to supplement conventional screening criteria. Such trials are already underway for breast cancer, where genetic risk scores are being incorporated to personalize risk-based screening36. For other cancers, such as prostate, screening remains controversial and PRS may prove useful in identifying a subset of high-risk individuals who may benefit the most from screening.
Another area where PRS may prove useful is for prioritizing individuals for targeted health and lifestyle-related interventions. In support of this, our study demonstrates that those with the highest levels of genetic risk, based on the PRS, may also experience larger decreases in risk from shifting to a healthier lifestyle. However, there is also accumulating evidence that simply reporting genetic risk information to individuals does not induce behavior change that could lead to meaningful reductions in risk37. Therefore, progress in our ability to construct and apply PRS to identify high-risk individuals must also be accompanied by the development of effective behavioral interventions that can be implemented in response to high disease risk, in addition to early detection and screening protocols.
Ultimately, the impact of PRS on clinical decision-making should be carefully evaluated in randomized trials prior to deployment in healthcare settings. By demonstrating cancer-specific improvements in risk prediction, as well as the substantial proportion of cancer incidence that is captured by known genetic susceptibility variants, we provide novel evidence that contextualizes the potential for using genetic information to improve cancer outcomes.
METHODS
Study Population
The UK Biobank (UKB) is a population-based prospective cohort of individuals aged 40 to 69 years, enrolled between 2006 and 2010. All participants completed extensive questionnaires, in-person physical assessments, and provided blood samples for DNA extraction and genotyping7. Health-related outcomes were ascertained via individual record linkage to national cancer and mortality registries and hospital in-patient encounters7. Details of the quality control and phenotyping procedures for this dataset have been previously described15, 16. Briefly, individuals with at least one recorded incident diagnosis of a borderline, in situ, or malignant primary cancer were defined as cases. Cancer diagnoses coded by International Classification of Diseases (ICD)-9 or ICD-10 codes were converted into ICD-O-3 codes using the SEER site recode paradigm in order to classify cancers by organ site.
Participants were genotyped on the UKB Affymetrix Axiom array (89%) or the UK BiLEVE array (11%)7. Genotype imputation was performed using the Haplotype Reference Consortium as the main reference panel, supplemented with the UK10K and 1000 Genomes phase 3 reference panels7. Genetic ancestry principal components (PCs) were computed using fastPCA38 based on a set of 407,219 unrelated samples and 147,604 genetic markers7. All analyses were restricted to self-reported European ancestry individuals with concordant self-reported and genetically inferred sex. To further minimize potential for population stratification, we excluded individuals with values for either of the first two ancestry PCs outside of five standard deviations of the population mean. Based on a subset of genotyped autosomal variants with minor allele frequency (MAF)≥0.01 and genotype call rate ≥97%, we excluded samples with call rates <97% and/or heterozygosity more than five standard deviations from the mean of the population. With the same subset of SNPs, we used KING38 to estimate relatedness among the samples. We excluded one individual from each pair of first-degree relatives, preferentially retaining individuals to maximize the number of cancer cases remaining, resulting in a total of 413,870 UKB participants.
Polygenic Risk Scores
In order to derive polygenic risk scores (PRS) for each of the 16 cancers, we extracted previously associated variants by searching the National Human Genome Research Institute (NHGRI)-European Bioinformatics Institute (EBI) Catalog of published GWAS. For every eligible GWAS, both the original primary manuscript and supplemental materials were reviewed. Additional relevant studies were identified by examining the reference section of each article and via PubMed searches of other studies in which each article had been cited. We abstracted all autosomal variants with MAF≥0.01 and P<5×10-8 identified in populations of at least 70% European ancestry and published by June 2018, with the exception of one colorectal cancer GWAS39 (published in December 2018). For inclusion in the PRS we preferentially selected independent SNPs (LD r2<0.3) with the highest imputation score, and we excluded SNPs with allele mismatches or MAF differences >0.10 relative to the 1000 Genomes reference population, as well as palindromic SNPs with MAF≥0.45. For associations reported in more than one study of the same ancestry and phenotype, we selected the one with the most information (i.e., which reported the risk allele and effect estimate) and the smallest p-value. Further details of the PRS development approach, including a list of source studies, are described by Graff et al16.
We considered three approaches for combining risk variants in the PRS. First, we used standard PRS weights, corresponding to the log odds ratio (β) for each risk allele:

$$\mathrm{PRS}_{\beta} = \sum_{i=1}^{n} \beta_i \, \mathrm{SNP}_i$$

where $\mathrm{SNP}_i$ denotes the number of risk alleles carried at variant $i$ and $n$ is the number of variants in the score. We compared this to an unweighted score corresponding to the sum of the risk alleles, which is equivalent to assigning all variants an equal weight of 1:

$$\mathrm{PRS}_{unw} = \sum_{i=1}^{n} \mathrm{SNP}_i$$

Lastly, we applied inverse variance (IV) weights that incorporated the standard error (SE) of the SNP log(OR) to account for uncertainty in risk allele effect sizes and downweight the contribution of variants with less precisely estimated associations (weights provided in Supplementary Data 1). Each PRS was standardized across the entire analytic cohort to have a mean of 0 and standard deviation (SD) of 1.
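The three weighting schemes described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `polygenic_risk_scores` is hypothetical, and the exact form of the IV weight is an assumption (β/SE is used here; the weights actually applied are those in Supplementary Data 1).

```python
import numpy as np

def polygenic_risk_scores(dosages, beta, se):
    """Compute standardized PRS under three weighting schemes.

    dosages : (n_individuals, n_variants) risk-allele counts or dosages (0-2)
    beta    : per-variant log odds ratios from the source GWAS
    se      : per-variant standard errors of the log odds ratios
    """
    dosages = np.asarray(dosages, dtype=float)
    beta = np.asarray(beta, dtype=float)
    se = np.asarray(se, dtype=float)

    weights = {
        "beta": beta,                # standard weights: log(OR)
        "unw": np.ones_like(beta),   # unweighted: every risk allele counts as 1
        "iv": beta / se,             # ASSUMPTION: illustrative IV-style weight
    }

    scores = {}
    for name, w in weights.items():
        raw = dosages @ w                                # weighted sum of risk alleles
        scores[name] = (raw - raw.mean()) / raw.std()    # mean 0, SD 1 in the cohort
    return scores
```

Because each score is standardized, hazard ratios estimated for the different schemes are all expressed per 1 SD increase and can be compared directly.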
Statistical Analysis
Development of risk models for each cancer
Cancer-specific prediction models consisting of four classes of risk factors were developed: i) demographic factors (age and sex); ii) family history of cancer in first-degree relatives; iii) modifiable risk factors; and iv) genetic susceptibility, represented by the PRS. Family history of cancer was derived based on self-reported illnesses in non-adopted first-degree relatives, which only listed cancers of the prostate, breast, bowel, or lung. In addition to these four cancer sites, family history of breast cancer was included as a predictor for ovarian cancer40, 41. Models for pancreatic cancer included a composite variable for family history of cancer at any of these four sites42, 43. Selection of modifiable risk factors was informed by literature review and reports, such as the European Code Against Cancer44, with an emphasis on risk factors that are likely to have a causal role. Final models included established environmental and lifestyle-related characteristics that were collected for the entire UK Biobank cohort (Supplementary Table 1).
Cause-specific Cox proportional hazards models were used to estimate the hazard ratios (HR) and corresponding 95% confidence intervals (CI) for genetic and lifestyle factors associated with each incident cancer. Death from any cause, other than cancer site-specific mortality, was treated as a competing event. Information on primary and contributing causes of death was used to identify cancer site-specific mortality. Follow-up time was calculated from the date of enrollment to the date of cancer diagnosis, date of death, or end of follow-up (January 1, 2015). For each cancer, individuals with a past or prevalent cancer diagnosis at that same site were excluded from the analysis, while individuals diagnosed with cancers at other sites were retained in the population. All models including the PRS were also adjusted for genotyping array and the first 15 genetic ancestry principal components (PCs). For the PRS, HR estimates correspond to a 1 SD increase in the standardized genetic score.
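The follow-up definition above can be sketched in Python as follows. The event codes, function name, and argument names are hypothetical, and the conversion from days to years uses 365.25 days per year as an assumption. The sketch covers only diagnosis, competing death, and administrative censoring. It does not address how deaths from the cancer of interest are handled.

```python
from datetime import date

END_OF_FOLLOW_UP = date(2015, 1, 1)           # end of follow-up stated in the text
CENSORED, CANCER, COMPETING_DEATH = 0, 1, 2   # hypothetical event codes

def follow_up(enrolled, diagnosis=None, competing_death=None):
    """Return (years of follow-up, event code) for one participant.

    The earliest of diagnosis, competing death, and end of follow-up
    determines both the exit date and the event type.
    """
    candidates = [(END_OF_FOLLOW_UP, CENSORED)]
    if diagnosis is not None:
        candidates.append((diagnosis, CANCER))
    if competing_death is not None:
        candidates.append((competing_death, COMPETING_DEATH))
    exit_date, event = min(candidates, key=lambda c: c[0])
    years = (exit_date - enrolled).days / 365.25
    return years, event

print(follow_up(date(2010, 1, 1), diagnosis=date(2012, 1, 1)))
```

Events dated after the end of follow-up are ignored, so such participants are treated as censored on January 1, 2015.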
Risk model evaluation
The predictive performance of each risk model was evaluated based on its ability to accurately estimate risk (calibration) and distinguish cancer cases from cancer-free individuals (discrimination). Calibration was assessed with a Hosmer-Lemeshow goodness-of-fit statistic modified for time-to-event outcomes45, and by plotting the expected event status against the observed event probability46 across risk deciles. For rarer cancers, calibration was assessed across quantiles of risk to ensure a minimum of 5 cases per group. Violation of the proportional hazards assumption was assessed by examining the association between standardized Schoenfeld residuals and time.
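To illustrate calibration across risk groups, the sketch below bins individuals by predicted risk and computes a Hosmer-Lemeshow-type statistic. It is a simplified version for binary outcomes that ignores censoring, so it is not the time-to-event modification cited above. Function names and toy data are hypothetical.

```python
def calibration_table(predicted, observed, n_groups=10):
    """Return (mean predicted risk, observed event proportion, group size) per risk group."""
    order = sorted(range(len(predicted)), key=lambda i: predicted[i])
    n = len(order)
    rows = []
    for g in range(n_groups):
        idx = order[g * n // n_groups:(g + 1) * n // n_groups]
        if not idx:
            continue  # skip empty groups when n < n_groups
        expected = sum(predicted[i] for i in idx) / len(idx)
        obs_rate = sum(observed[i] for i in idx) / len(idx)
        rows.append((expected, obs_rate, len(idx)))
    return rows

def hosmer_lemeshow(rows):
    """Sum over groups of (O - E)^2 / (n * p * (1 - p)), written in terms of proportions."""
    return sum(n * (obs - exp) ** 2 / (exp * (1 - exp)) for exp, obs, n in rows)

# Toy data: ten individuals with predicted risk 0.2, five of whom had the event.
rows = calibration_table([0.2] * 10, [1] * 5 + [0] * 5, n_groups=1)
print(rows, hosmer_lemeshow(rows))
```

Reducing `n_groups` for rarer outcomes mirrors the idea of using fewer, larger groups so that each contains enough cases.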
We evaluated nested models starting with the most minimal set of predictors, such as demographic factors, followed by models including family history of cancer and modifiable risk factors, and finally models incorporating the PRS. Risk discrimination was assessed based on Harrell’s C-index, calculated as a weighted average between 1 and 5 years of follow-up time, and Area Under the Curve (AUC) at 5 years. We also report pseudo R² coefficients based on Royston’s measure of explained variation for survival models10. The percentile-based net reclassification improvement (NRI) index11 was used to quantify improvements in reclassification. NRI summarizes the proportion of appropriate directional changes in predicted risks. Any upward movement in risk categories for cases indicates improved classification, and any downward movement implies worse reclassification. The opposite is expected for non-cases:
\[ \mathrm{NRI}_{e} = \frac{n_{U,e} - n_{D,e}}{n_{e}}, \qquad \mathrm{NRI}_{ne} = \frac{n_{D,ne} - n_{U,ne}}{n_{ne}} \]
where nU is the number of individuals up-classified and nD is the number down-classified, counted separately among cases (e) and non-cases (ne), and n_e and n_ne are the total numbers of cases and non-cases. Overall NRI is the sum of the NRI in cases and NRI in non-cases: NRI = NRIe + NRIne. Bootstrapped confidence intervals were obtained based on 1000 replicates.
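The NRI components described above can be sketched in Python. Here, risk categories are assumed to be ordered integers (for example, percentile-based groups) under the baseline and the PRS-augmented model. The function name and toy data are hypothetical, and the bootstrap step is omitted.

```python
def net_reclassification(old_cat, new_cat, is_case):
    """Return (NRI in cases, NRI in non-cases, overall NRI).

    old_cat / new_cat: ordered risk categories under the baseline and updated model.
    is_case: True for individuals who developed the cancer of interest.
    """
    up_e = down_e = n_e = 0
    up_ne = down_ne = n_ne = 0
    for old, new, case in zip(old_cat, new_cat, is_case):
        if case:
            n_e += 1
            up_e += new > old       # upward movement is appropriate for cases
            down_e += new < old
        else:
            n_ne += 1
            up_ne += new > old
            down_ne += new < old    # downward movement is appropriate for non-cases
    nri_e = (up_e - down_e) / n_e
    nri_ne = (down_ne - up_ne) / n_ne
    return nri_e, nri_ne, nri_e + nri_ne

old = [1, 1, 2, 2, 1, 2, 2, 3]
new = [2, 1, 3, 1, 1, 1, 1, 3]
case = [True] * 4 + [False] * 4
print(net_reclassification(old, new, case))  # (0.25, 0.5, 0.75)
```

In the toy data, two of four cases move up and one moves down (NRI in cases 0.25), while two of four non-cases move down and none move up (NRI in non-cases 0.5).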
Risk stratification: genetic vs. modifiable factors
For each individual, we estimated the 5-year absolute risk of being diagnosed with a specific cancer using the formula of Benichou & Gail47, as implemented by Ozenne et al48. Absolute risk trajectories were examined as a function of age across strata defined by genetic and modifiable risk profiles, as well as family history. Individuals in the top 20% of the PRS distribution (PRS≥80th percentile) for a given cancer were classified as having high genetic risk, those in the bottom 20% (PRS≤20th percentile) were classified as low risk, and those in the middle category (>20th to <80th percentile) were classified as having average genetic risk.
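The three genetic risk categories can be illustrated with a short Python sketch. The nearest-rank percentile rule used here is one of several conventions and is an assumption, as are the function names. In a small toy sample the group sizes will not match 20/60/20 exactly.

```python
def nearest_rank_percentile(values, q):
    """Nearest-rank percentile for an integer q in 1..100 (one possible convention)."""
    ordered = sorted(values)
    rank = max(1, (q * len(ordered) + 99) // 100)  # integer ceiling of q * n / 100
    return ordered[rank - 1]

def genetic_risk_categories(prs_values):
    """Label each PRS as 'high' (>= 80th pct), 'low' (<= 20th pct), or 'average'."""
    p20 = nearest_rank_percentile(prs_values, 20)
    p80 = nearest_rank_percentile(prs_values, 80)
    labels = []
    for prs in prs_values:
        if prs >= p80:
            labels.append("high")
        elif prs <= p20:
            labels.append("low")
        else:
            labels.append("average")
    return labels

print(genetic_risk_categories(list(range(1, 11))))
```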
Modifiable risk factors were summarized by generating summary linear predictors (predicted log-hazard ratios) based on risk factors in Supplementary Table 1, excluding age, sex, and family history. Individuals above the median of this risk score distribution were considered to have an unfavorable modifiable risk profile. Risk trajectories in each stratum were visualized by fitting linear models with smoothing splines across individual risk estimates as a function of age. Differences in mean absolute risk at age 60 were tested using a two-sample t-test. We also tested for interaction between the 3-level ordinal PRS variable and the modifiable risk score (dichotomized at the median) in a linear model with the predicted absolute risk as the outcome.
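As a minimal sketch of the modifiable risk score, the code below forms a linear predictor from log-hazard ratios and flags individuals above the cohort median. The covariate names and coefficient values are hypothetical and are not taken from the fitted models.

```python
from statistics import median

def linear_predictor(covariates, log_hazard_ratios):
    """Sum of coefficient * covariate value over the modifiable risk factors."""
    return sum(log_hazard_ratios[name] * value for name, value in covariates.items())

def unfavorable_profile(scores):
    """True for individuals strictly above the median of the score distribution."""
    cut = median(scores)
    return [s > cut for s in scores]

# Hypothetical coefficients (log-hazard ratios) for two illustrative risk factors.
coefficients = {"current_smoker": 0.7, "bmi_per_5_units": 0.1}
person = {"current_smoker": 1, "bmi_per_5_units": 2}
print(linear_predictor(person, coefficients))
print(unfavorable_profile([0.1, 0.5, 0.9, 1.3]))
```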
Etiology: contribution of genetic vs. modifiable risk factors
The relative contribution of genetic and modifiable cancer risk factors at the population level was quantified with population attributable fractions (PAF) using the method of Sjölander & Vansteelandt49, 50 based on the counterfactual framework. To obtain comparable PAF estimates, thresholds for high genetic risk and high burden of modifiable risk factors corresponded to the top 20% (≥80th percentile) of each risk score distribution.
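The counterfactual idea behind an attributable fraction can be illustrated with a simplified Python sketch. It compares the mean predicted risk under the observed exposure with the mean predicted risk had nobody been exposed. This is a simplification for illustration, not the cited estimation method. The function name and the predicted risks are hypothetical.

```python
def attributable_fraction(risk_observed, risk_unexposed):
    """1 - (mean risk if nobody were exposed) / (mean risk as observed)."""
    mean_observed = sum(risk_observed) / len(risk_observed)
    mean_counterfactual = sum(risk_unexposed) / len(risk_unexposed)
    return 1 - mean_counterfactual / mean_observed

# Toy cohort of five: one person (20%) is in the high-risk group, with a
# predicted risk of 0.04 that would be 0.02 without the exposure.
observed = [0.04, 0.02, 0.02, 0.02, 0.02]
unexposed = [0.02, 0.02, 0.02, 0.02, 0.02]
print(attributable_fraction(observed, unexposed))
```

In this toy cohort the exposure doubles risk in 20% of individuals, so about one sixth of the expected cases would be attributed to it.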
DATA AVAILABILITY
The UK Biobank is an open access resource, available at https://www.ukbiobank.ac.uk/researchers/. This research was conducted with approved access to UK Biobank data under application number 14105.
COMPETING INTERESTS
The authors declare no competing interests.
ACKNOWLEDGEMENTS
Disclaimer: Where authors are identified as personnel of the International Agency for Research on Cancer / World Health Organization, the authors alone are responsible for the views expressed in this article and they do not necessarily represent the decisions, policy or views of the International Agency for Research on Cancer / World Health Organization.
This research was supported by funding from the National Institutes of Health (US NCI R25T CA112355 and R01 CA201358; PI: Witte) and Cancer Research UK (C18281/A19169).