Abstract
Background Over 180 single nucleotide polymorphisms (SNPs) associated with breast cancer susceptibility have been identified; these SNPs can be combined into polygenic risk scores (PRS) to predict breast cancer risk. Since most SNPs were identified in predominantly European populations, little is known about the performance of PRS in non-Europeans. We tested the performance of a 180-SNP PRS in Latinas, a large ethnic group with variable levels of Indigenous American, European, and African ancestry.
Methods We conducted a pooled case-control analysis of U.S. Latinas and Latin-American women (4,658 cases, 7,629 controls). We constructed a 180-SNP PRS consisting of SNPs associated with breast cancer risk (p < 5 x 10−8). We evaluated the association between the PRS and breast cancer risk using multivariable logistic regression and assessed discrimination using area under the receiver operating characteristic curve (AUROC). We also assessed PRS performance across quartiles of Indigenous American genetic ancestry.
Results Of 180 SNPs tested, 140 showed directionally consistent associations compared with European populations, and 43 were nominally significant (p < 0.05). The PRS was associated with breast cancer risk, with an odds ratio (OR) per standard deviation increment of 1.58 (95% CI 1.52-1.64) and AUCROC of 0.63 (95% CI 0.62 to 0.64). The discrimination of the PRS was similar between the top and bottom quartiles of Indigenous American ancestry.
Conclusions The 180-SNP PRS predicts breast cancer risk in Latinas, with similar performance as reported for Europeans. The performance of the PRS did not vary substantially according to Indigenous American ancestry.
Introduction
Over 180 single nucleotide polymorphisms (SNPs) associated with breast cancer susceptibility have been discovered in genome-wide association studies (GWAS) [1–4]. Though each SNP has a modest effect, multiple SNPs can be combined into a polygenic risk score (PRS) [5]. PRS has emerged as a promising tool for breast cancer risk stratification. The risk associated with having a PRS in the upper 20-25th percentile is similar to that of strong clinical risk factors such as having extremely dense breasts [6], and adding PRS to risk models improves discrimination and reclassification [6–8]. Ongoing clinical trials are studying the use of PRS to personalize breast cancer screening and prevention [9]. Some commercial genetic testing laboratories are already returning PRS results to those who tested negative for pathogenic moderate- or high-penetrance mutations [10, 11].
A major barrier to the widespread use of PRS is the relative paucity of knowledge regarding its performance in non-European populations. To date, SNP discovery has overwhelmingly occurred in European populations [12]. However, the effect sizes, allele frequencies, and linkage disequilibrium patterns of SNPs vary by ancestry [12, 13]. Though relatively few studies have examined PRS performance in non-Europeans, they suggest that PRS constructed using European SNP summary statistics (effect size, allele frequency) typically perform worse in non-European populations [14, 15]. Currently, commercial testing laboratories only provide breast cancer PRS results to women of European ancestry [10, 11].
Disparities in the use and performance of PRS could especially affect Latinas. Latino/Latinas comprise the largest minority group in the U.S., representing 17.8% of the population in 2016 [16]. This diverse group includes genetically admixed individuals who have varying degrees of Indigenous American, European, African, and Asian ancestry [17–19]. We previously identified SNPs in the 6q25 locus associated with breast cancer risk exclusively in Latinas [20]. Most SNPs discovered in European populations display directional consistency in Latinas, with some also being nominally significant [20, 21]. One previous study assessed the performance of a breast cancer PRS in Latinas, finding that a 71-SNP PRS had worse prediction in Latinas as comparable PRS in Europeans [5, 15]. However, it included only 147 cases and did not account for genetic ancestry [15].
We sought to test the performance of PRS in U.S. Latinas and Latin American women (collectively referred to hereafter as Latinas). To that end, we conducted a pooled case-control analysis of 8 studies comprising 13,631 Latinas. We examined the predictive performance of a 71-SNP and a 180-SNP PRS, and whether PRS performance varies by genetic ancestry.
Methods
Participants
Our analysis included 13,631 self-identified Latinas, of whom 5,697 women with invasive breast cancer were considered cases and 7,934 without breast cancer were controls. Participants came from 8 studies (Tables 1 and S1). Recruitment details and patient characteristics have been previously reported for each study except for PGEN-BC. Studies are briefly described below and in more detail in the Supplement.
The San Francisco Bay Area Breast Cancer Study (SFBCS) plus the Northern California Breast Cancer Family Registry (NC-BCFR), a population-based case-control study recruiting from the San Francisco Bay Area [22, 23].
The Kaiser Permanente Research Project on Genes, Environment, and Health (RPGEH), a biobank recruiting from Northern California and the Pacific Northwest [24].
The Multiethnic Cohort (MEC) study, a prospective cohort study recruiting from Southern California and Hawaii [25].
The Cancer de Mama (CAMA) study, a population-based case-control study in Mexico [26].
The Post-Columbian Study of Environmental and Heritable Causes of Breast Cancer (COLUMBUS-Colombia), a population-based case-control study in southern Colombia [20].
The Post-Columbian Study of Environmental and Heritable Causes of Breast Cancer (COLUMBUS-Mexico), a population-based case-control study in Mexico [20]. The COLUMBUS substudies (Colombia and Mexico) were analyzed as separate datasets given differences in study populations and genotyping methods.
The Peru Genetics and Genomics of Breast Cancer Study (PEGEN-BC), a case-series from a Peruvian cancer center. Unrelated Peruvian individuals from 1000 Genomes [27] were used as controls.
The City of Hope Clinical Cancer Genetics Community Research Network (COH/CCGCRN), the Southern California site of a multisite cancer center and community-based registry for familial breast cancer [28].
All studies obtained local institutional review board approval and written informed consent from participants.
Genotyping and genetic ancestry
For all studies except COH/CCGCRN, genotyping was performed using high-density arrays (Table S1). Genotyping of COH/CCGCRN samples was performed using next-generation sequencing with a targeted capture kit that included all 89 SNPs identified as of 2016, prior to publication of the OncoArray GWAS results [3]. Further information about genotyping is provided in the Supplementary Methods.
We estimated genetic ancestry from genome-wide markers using the program ADMIXTURE [29] in unsupervised mode with a model including 4 ancestral populations: European, Indigenous American (IA), African, and East Asian. We used genotype data from 90 European Americans (CEU) and 90 Nigerian Yorubans (YRI) from HapMap [30] to represent European and African populations, respectively. We also included a subset of 504 East Asian individuals from 1000 Genomes [27] and 71 Indigenous Americans previously genotyped on the Affymetrix Axiom LAT1 array [31, 32]. Women with >75% East Asian ancestry were excluded given that the limited influence of the East Asian component in the Hispanic/Latino population did not allow for a subgroup analysis.
Polygenic risk score
We used a 180-SNP PRS for our primary analysis (Table S2). SNP selection is discussed in further detail in the Supplementary Methods. We performed sensitivity analyses of different imputation r2 cutoffs for inclusion of SNPs in our PRS (Table S3). We ultimately included all SNPs regardless of imputation quality, as we did not find substantive differences in the associations between the 180-SNP PRS and a 168-SNP PRS constructed using an imputation r2 threshold of > 0.5.
Since targeted genotyping was performed within COH/CCGCRN samples, genotypes were available for 89 SNPs. We dropped 1 SNP due to missingness. Of the remaining 88 SNPs, 63 overlapped and 8 had LD proxies (r2 > 0.7) with the 180 SNPs comprising the main PRS. We used these 71 SNPs to construct a PRS within the COH/CCGCRN dataset. We then constructed a comparator 71-SNP PRS in the 7 remaining datasets using the 63 shared SNPs and 8 respective LD proxies, and pooled all 8 datasets to evaluate the performance of the 71-SNP PRS.
We constructed the PRS as previously described [7, 33]. Briefly, the PRS represents the product of the likelihood ratios across multiple SNPs, assuming each SNP exerts an independent effect. The likelihood ratio for each SNP was calculated based on the number of risk alleles present, and the allele frequency and effect size (odds ratio, OR) of the risk allele. We used risk allele frequencies derived from the Latin American (AMR) population in 1000 Genomes [27] and published ORs [3]. The latter predominantly reflects the effect of the SNP within a European population, except for those discovered in Latina studies (Table S2) [20, 21].
Statistical analysis
First, we tested the associations between individual SNPs and breast cancer risk using multivariable logistic regression models adjusted for genetic ancestry and study. We used METAL [34] to perform inverse variance based meta-analysis of 180 SNPs across 3 studies: COLUMBUS-Colombia, COLUMBUS-Mexico, and pooled SFBCS/NC-BCFR, Kaiser RPGEH, MEC, CAMA, and PEGEN-BC studies.
To test the associations of PRS with breast cancer, we adjusted for genetic ancestry and study, given that both remained independently associated with breast cancer risk when included in the same model as the PRS. We first performed linear regression of study and ancestry on the PRS (dependent variable). We then used the residual as the main predictor in univariate logistic regression with breast cancer as the outcome. We analyzed the residual as a continuous variable normalized to the mean and standard deviation (SD) in controls. We tested the discrimination of the adjusted PRS by estimating the area under the receiver operating characteristic curve (AUROC). We also tested calibration using the Hosmer-Lemeshow test across deciles of the adjusted PRS, with the 40-50th and 50-60th deciles combined and used as the reference group.
To examine the ancestry-specific performance of the PRS, we divided the pooled dataset into quartiles of IA ancestry. We performed logistic regression within each quartile of IA ancestry and compared the resulting coefficients using a Wald test of linear hypothesis. To compare AUROC estimates, we performed a test of equality of AUROC as described by DeLong [35]. Given differences in the population structures between U.S. Latina and Latin-American studies, we also examined ancestry-specific performance of the PRS by geographic origin of study, specifically U.S. (SFBCS/NC-BCFR, RPGEH, MEC) versus Latin-American (CAMA, COLUMBUS, PEGEN-BC).
All tests for significance used two-sided α = 0.05. We developed the script to calculate the PRS using R (The R Foundation). We performed all statistical analyses using Stata 14.1 (StataCorp, College Station, TX).
Results
Study characteristics
Our pooled data included 13,631 women from 8 studies, for a total of 5,697 cases and 7,934 controls (Table 1). Across all studies, ancestry was mostly European and Indigenous American (IA). There was substantial variation in ancestry within and across studies (Supplementary Figure S2). For instance, PEGEN-BC in Peru had the highest average IA ancestry (76% in cases and controls) while RPGEH in Northern California had the lowest (27% in cases, 29% in controls). Within each study, cases tended to have similar or lower IA ancestry than controls, as we have previously reported [36, 37]. In the pooled analysis, cases had higher IA ancestry since nearly half the controls came from RPGEH, the study with the lowest IA ancestry.
Association of PRS with breast cancer risk
We first examined the associations between individual SNPs and breast cancer risk. Of 180 SNPs tested, 140 had associations that were directionally consistent with those reported in European populations (Table S2) [3]. Forty-eight SNPs were nominally significant (p < 0.05) in our dataset, with 43 being also directionally consistent. Six SNPs remained significant to p < 2.8×10-4 after Bonferroni correction for multiple testing. Thirteen SNPs displayed heterogeneous associations across studies (Phet < 0.05). For both PRSs, the mean unadjusted PRS was higher in cases than controls (Table 1, Figure S1).
Our main analysis evaluated the performance of a 180-SNP PRS in 12,287 women (4,658 cases and 7,629 controls) from 7 studies, excluding COH/CCGCRN given that 89 SNPs were genotyped in that study. After normalization and adjustment for genetic ancestry and study, the 180-SNP PRS was strongly associated with breast cancer risk, OR per SD increment = 1.58 (95% CI 1.52 to 1.64) (Table 2). The associations with breast cancer were especially pronounced among extremes of the PRS. Compared with women with a PRS in the 40-60th percentile, women with a PRS in the bottom decile had an OR of 0.44 (95% CI 0.37 to 0.53), while those with a PRS in the top decile had an OR of 2.03 (95% 1.79 to 2.31). The AUROC for the 180-SNP PRS was 0.63 (95% CI 0.62 to 0.64), Figure 1A. The Hosmer-Lemeshow test suggested good fit, with χ2 = 8.06 (p = 0.53), Figure 2A.
Our secondary analysis evaluated the performance of a 71-SNP PRS in 13,631 women (5,697 cases and 7,934 controls) from 8 studies, including COH/CCGCRN. The 71-SNP PRS had a similar, albeit slightly weaker, association with breast cancer risk (Table 2, Figure 1B), while the Hosmer-Lemeshow test was again suggestive of good fit, χ2 = 5.82 (p = 0.76), Figure 2B.
Performance of PRS by Indigenous American ancestry
The 180-SNP PRS displayed similar performance regardless of IA ancestry, with comparable ORs and AUROCs across the top (>55%) and bottom (<29%) quartiles of IA ancestry (Table 3). In contrast, the 71-SNP PRS performed worse in the top compared to the bottom quartile, [OR 1.46 (95% CI 1.36 to 1.56) vs OR 1.68 (95% CI 1.54 to 1.83), p = 0.01]. This corresponded to top versus bottom quartile AUROCs of 0.61 (95% CI 0.59 to 0.63) and 0.64 (95% CI 0.62 to 0.66), respectively (p = 0.02).
Given differences in ancestry structure between U.S. Latinas and Latin-American women, we stratified the analysis by geographic origin of study. Among 7,427 women from the U.S. studies (SFBCS/NC-BCFR, RPGEH, and MEC), the 180-SNP PRS performed best in the bottom quartile of IA ancestry (Table S4). However, among the 4,970 women from the Latin-American studies (CAMA, PEGEN-BC, COLUMBUS), the 180-SNP PRS performed similarly across quartiles of IA ancestry (Table S5).
Discussion
We found that PRSs primarily consisting of SNPs identified in European populations were predictive of breast cancer risk in Latinas. Our 180-SNP PRS had an adjusted OR of 1.58 (95% CI 1.52 to 1.64) and an AUROC of 0.63 (95% CI 0.62 to 0.64). These results are comparable to those of European studies, which tested PRSs including 77 to 3820 SNPs and reported ORs per SD between 1.46-1.66 and AUROCs between 0.60-0.64 [5, 38]. Our 71-SNP PRS performed worse than the 180-SNP PRS, though the difference was modest.
Ours is the largest study to date on breast cancer PRS in Latinas and extends the literature by refining estimates of PRS performance in this population. Allman, et al [15] reported that a 71-SNP PRS had an OR per SD increment of 1.39 (95% CI 1.18 to 1.64) and AUROC of 0.59 (95% CI 0.54 to 0.64) among U.S. Latinas. Their 71-SNP PRS shares 62 SNPs (two by LD proxy) with our 71-SNP PRS, including a SNP (rs140068132) previously identified in Latina GWAS [20]. Given the degree of overlap in PRS composition, our results likely represent a more precise estimate of PRS performance, given we had substantially more cases (4,658 vs. 147) and controls (7,629 vs. 3,201) representing wider Latina/Latin-American ancestry. Indeed, our observed ORs and AUROCs for the 71- and 180-SNP PRSs fall in the upper range of the confidence interval reported by Allman.
We could not definitively determine whether PRS performance varies by ancestry. Differential PRS performance by genetic ancestry might be expected for two reasons: first, differences in LD structures between European and non-European populations can attenuate the associations between GWAS hits discovered in Europeans and causal SNPs in LD; secondly, causal alleles may only be present in certain populations. However, the 180-SNP PRS performed similarly across quartiles of IA ancestry. In contrast, the 71-SNP performed better in the bottom quartile of IA ancestry, corresponding to higher European ancestry. One explanation for the latter finding could be that analysis of the 71-SNP PRS analysis included 1,039 additional cases from COH/CCGCRN and therefore had greater statistical power to detect differences in performance by IA ancestry.
A major strength of our study was the size and diversity of our study population. Additionally, we accounted for genetic ancestry, which can bias associations in genetic studies [39]. Given that ancestry was a confounder and an independent predictor of breast cancer risk, we used a novel approach to calculate an “ancestry-adjusted” PRS. We also examined PRS performance by IA ancestry, which has not been previously done. Another strength was the inclusion of several large, diverse breast cancer studies representing populations from several geographic areas (Western U.S., Central and South America) and including women with varying degrees of IA versus European ancestry.
Our results should be interpreted in light of three limitations. First, the generalizability of our findings is limited to Latina populations with similar distributions of genetic ancestry, although the ancestry composition of our study resembled that of other large studies of Latinas from the western U.S. and Central/South America [19, 40]. However, our results may not be generalizable to Caribbean Latinas, whose population structures have a higher proportion of African ancestry [17–19]. We did not test the performance of PRS according to African ancestry given that our study population predominantly consisted of women originating from Latin American countries, where African ancestry is limited. Secondly, our analysis included women recruited from community-based and familial breast cancer clinics and may include moderate or high-penetrance mutation carriers. While PRS is associated with breast cancer risk in mutation carriers and women with elevated familial risk, the magnitudes of these associations vary slightly from those in the average-risk population [41]. Finally, we tested a PRS containing 180 SNPs representing all known GWAS hits at the time of analysis. However, others have constructed expanded PRSs comprising 313 and 3820 SNPs by including SNPs that did not have genome-wide significant associations with breast cancer [38]. Though these expanded PRSs performed better than a 77-SNP PRS, there was little difference in performance between the 313-SNP and 3820-SNP PRSs [38]. We included only SNPs with genome-wide significant associations in our PRS since we reasoned that these signals may be more robust across ancestry. The AUROC for our 180-SNP PRS (0.63) was similar to that of the 313-SNP PRS [38].
Our results suggest that the PRS has predictive value in Latinas, a large and rapidly-growing population in the U.S. Although studies on the ability of the PRS to inform decisions around screening and prevention are underway [9], several commercial genetic testing laboratories are already returning PRS results to women of European descent who tested negative for deleterious mutations. If this practice were extended to Latinas, one could expect the PRS to perform comparably well. Even if the performance of the PRS were slightly attenuated in Latinas of higher Indigenous American ancestry, this does not necessarily preclude its use in this population. Instead, results could account for this attenuation and model the joint effects of PRS and ancestry.
Though our findings lend optimism to the utility of the PRS in predicting breast cancer risk among Latinas, they do not nullify the prospect of disparities in genetic discovery research [42]. Whereas we studied mostly common variants, rare variants display more geographic clustering [43]. As genetic association studies identify more rare variants, those discovered in European populations will be less generalizable to other populations.
Thus, high-quality genetic studies in non-European populations should remain a priority. Fine-mapping in large datasets may enhance the identification of causal SNPs associated with breast cancer risk. Likewise, GWAS should be intentional about including Latinas, particularly those with higher IA and/or African ancestry. In addition, future studies should prospectively assess prediction and examine the contribution of PRS to clinical risk models. Though one such trial is currently using the PRS to tailor decision-making around breast cancer screening and prevention [9], similar clinical effectiveness studies should also aim to recruit diverse women.
Notes
We thank the participants of the SFBCS/NC-BCFR, Kaiser Permanente RPGEH, MEC, CAMA, COLUMBUS, PEGEN-BC, and COH-CCGCRN studies. The authors declare no competing interests.
The contributors from the COLUMBUS Consortium (in alphabetical order) include: Jennyfer Benavides (Universidad del Tolima, Ibagué, Colombia), Mabel Bohorquez (Universidad del Tolima, Ibagué, Colombia), Fernando Bolaños (Hospital Hernando Moncaleano Perdomo, Neiva, Colombia), Luis G Carvajal-Carmona (Universidad del Tolima, Ibagué, Colombia, University of California Comprehensive Cancer Center, Sacramento, USA, Fundación de Genética y Genómica, Medellín, Colombia, Genome Center and Department of Biochemistry and Molecular Medicine, University of California, Davis, Davis, CA, USA), Jenny Carmona (Dinámica IPS, Medellín, Colombia), Ángel Criollo (Universidad del Tolima, Ibagué, Colombia), Magdalena Echeverry (Universidad del Tolima, Ibagué, Colombia), Ana Estrada (Universidad del Tolima, Ibagué, Colombia),Gilbert Mateus (Hospital Federico Lleras Acosta, Ibagué, Colombia), Raúl Murillo (Pontificia Universidad Javeriana, Bogotá, Colombia), Justo Ramirez (Hospital Hernando Moncaleano Perdomo, Neiva, Colombia), Yesid Sánchez (Universidad del Tolima, Ibagué, Colombia), Carolina Sanabria (Instituto Nacional de Cancerología, Bogotá, Colombia), Martha Lucia Serrano (Instituto Nacional de Cancerología, Bogotá, Colombia), John Jairo Suarez (Universidad del Tolima, Ibagué, Colombia), Alejandro Vélez (Dinámica IPS, Medellín, Colombia, Hospital Pablo Tobón Uribe, Medellín, Colombia).
Funding
This work was funded in part by grants from the National Cancer Institute (K24CA169004, R01CA120120 to E.Z. and R01CA184545 to E.Z. and S.N.). Y.S. was supported by the National Center for Advancing Translational Sciences of the NIH under award KL2TR001870.
The Northern California Breast Cancer Family Registry was supported by grant UM1 CA164920 from the National Cancer Institute. The San Francisco Bay Area Breast Cancer Study was funded by grants CA063446 and CA077305 from the National Cancer Institute, grant DAMD17-96-1-6071 from the U.S. Department of Defense, and grant 7PB-0068 from the California Breast Cancer Research Program.
The Kaiser Permanente Research Program on Genes, Environment and Health was supported by Kaiser Permanente national and regional Community Benefit programs, and grants from the Ellison Medical Foundation, the Wayne and Gladys Valley Foundation, and the Robert Wood Johnson Foundation. Genotyping in the GERA cohort was supported by grant RC2 AG03667 from the National Institutes of Health.
The Multiethnic Cohort Study was supported by the National Institutes of Health grants R01 CA63464 and R37 CA54281, R01 CA132839, 5UM1CA164973.
The CAMA Study was funded by Consejo Nacional de Ciencia y Tecnología (SALUD-2002-C01-7462).
The PEGEN-BC study was supported by the National Cancer Institute [R01CA204797 (L.F.)] and the Instituto Nacional de Enfermedades Neoplásicas (Lima, Peru).
The COLUMBUS Consortium was supported by grants from School of Medicine (Dean’s Fellowship in Precision Health Equity to LGC-C) and support from the Office of the Provost for LGC-C’s Latino Cancer Health Equity Initiative); The V Foundation for Cancer Research (V Foundation Scholarship to LGC-C); GSK Oncology (Ethnic Research Initiative to LGCC and ME); The U.S. National Institutes of Health (Cancer Center Support Grant P30CA093372 from the National Cancer Institute). LGC-C, MEB and ME are also grateful for support Colciencias (Graduate Studentship to Jeniffer Benavides, member of COLUMBUS, from Convocatoria para la Formación de Capital Humano de Alto Nivel para el Departamento de Tolima-COLCIENCIAS – 755/2016), Universidad del Tolima (Grants to MEB and ME, project 10112), and Sistema Nacional de Regalías, Gobernación del Tolima (Grants to MEB and ME, project 520115). JT was supported by Coordinacion Nacional de Investigación en Salud, IMSS, México, grant FIS/IMSS/PROT/PRIO/13/027 and by the Consejo Nacional de Ciencia y Tecnologia (Fronteras de la Ciencia grant 773), México.
The study funders and sponsors did not participate in the collection, analysis, or interpretation of data, or in the writing of the manuscript. The contents of this article are solely the responsibility of the authors and do not reflect the official views of the National Institutes of Health.