ABSTRACT
Breast cancer is a complex disease and studying DNA methylation (DNAm) in tumors is complicated by disease heterogeneity. We compared DNAm in breast tumors with normal-adjacent breast samples from The Cancer Genome Atlas (TCGA). We constructed models stratified by tumor stage and PAM50 molecular subtype and performed cell-type reference-free deconvolution on each model. We identified nineteen differentially methylated gene regions (DMGRs) in early stage tumors across eleven genes (AGRN, C1orf170, FAM41C, FLJ39609, HES4, ISG15, KLHL17, NOC2L, PLEKHN1, SAMD11, WASH5P). These regions were consistently differentially methylated in every subtype and all implicated genes are localized on chromosome 1p36.3. We also validated seventeen DMGRs in an independent data set. Identification and validation of shared DNAm alterations across tumor subtypes in early stage tumors advances our understanding of common biology underlying breast carcinogenesis and may contribute to biomarker development. We also provide evidence on the importance and potential function of 1p36 in cancer.
INTRODUCTION
Invasive breast cancer is a complex disease characterized by diverse etiologic factors1. Key genetic and epigenetic alterations are recognized to drive tumorigenesis and serve as gate-keeping events for disease progression2. Early DNA methylation (DNAm) events have been shown to contribute to breast cancer development3. Importantly, DNAm alterations have been implicated in the transition from normal tissue to neoplasia4,5 and from neoplasia to metastasis6. Furthermore, patterns of DNAm are known to differ across molecular subtypes of breast cancer7 - Luminal A (LumA), Luminal B (LumB), Her2-enriched and Basal-like - identified based on the prediction analysis of microarray 50 (PAM50) classification8. However, while DNAm differences across breast cancer subtypes have been explored, similarities across subtypes are less clear9. Such similarities found in early stage tumors can inform shared biology underpinning breast carcinogenesis and – as similarities would be agnostic to subtype – potentially contribute to biomarkers for early detection.
Studying DNAm in bulk tumors is complicated by disease heterogeneity. Heterogeneity is driven by many aspects of cancer biology including variable cell-type proportions found in the substrate used for molecular profiling10. Different proportions of stromal, tumor, and infiltrating immune cells may confound molecular profile classification when comparing samples11 because cell types have distinct DNAm patterns12–14. The potential for cell-type confounding prompted the development of statistical methods to adjust for variation in cell-type proportions in blood15 and solid tissue16. One such method, RefFreeEWAS, is a reference-free deconvolution method that does not require a reference population of cells with known methylation patterns and is agnostic to genomic location when performing deconvolution17. Instead, the unsupervised method infers underlying cell-specific methylation profiles through constrained non-negative matrix factorization (NMF) to separate cell-specific methylation differences from actual aberrant methylation profiles observed in disease states. This method has previously been shown to effectively determine the cell of origin in breast tumor phenotypes18.
We applied RefFreeEWAS to The Cancer Genome Atlas (TCGA) breast cancer DNAm data and estimated cell proportions across the set. We compared tumor DNAm with adjacent normal tissue stratified by tumor subtype9 and identified common early methylation alterations across molecular subtypes that are independent of cell type composition. We identified a specific chromosomal location, 1p36.3, that harbors all 19 of the differentially methylated regions that are common to early stage breast cancer subtypes. 1p36 is a well-studied and important region in many different cancer types, but there remain questions about how it may impact carcinogenesis and disease progression19. Our study provides evidence that methylation in this region may provide important clues about early events in breast cancer. We also performed RefFreeEWAS on an independent validation set (GSE60185) and confirmed these results20.
RESULTS
DNA methylation deconvolution
Subject age and tumor characteristic data stratified by PAM50 subtype and stage are provided in Table 1 for the 523 TCGA tumors analyzed. TCGA breast tumor sample purity, estimated by pathologists from histological slides, was consistent across PAM50 subtypes and stages, indicating that observed methylation differences are not predominantly a result of large differences in tumor purity (Supplementary Fig. S1). To correct for cell-proportion differences across tumor samples, we estimated the number of cellular methylation profiles contributing to the mixture differences by applying NMF to the matrix of beta values, which resulted in model-specific dimensionality estimates indicating diverse cellular methylation profiles (Supplementary Table S1). The reference-free deconvolution altered the number and extent of significant differentially methylated CpGs across all models that compared breast tumor methylation with adjacent normal samples (Supplementary Fig. S2).
Subtype specific methylation patterns
In early stage tumors, we identified a set of nineteen DMGRs shared among Luminal A, Luminal B, Her2, and Basal-like subtypes (DMGRs Q < 0.01, Figure 1A). In the late stage tumors, we identified 31,931 DMGRs in common across subtypes (Figure 1B). Subtype specific methylation patterns in early stage tumors were most divergent for Basal-like tumors versus other types, while in late stage tumors methylation alterations in Luminal B tumors were most divergent (Supplementary Table S2). To test if collapsing by genomic region had an appreciable effect on detecting DMGRs, we compared DMGR results to results derived from regions defined by CpG island status (i.e. CpG island, Shore, Shelf, Open Sea). Using CpG island context designations indicated similar results (Supplementary Fig S3), though a lower number of common DMGRs were observed. Therefore, downstream analyses used DMGRs identified based on probe position in relation to TSS.
We identified nineteen DMGRs with common methylation alterations among tumor subtypes in comparison with normal tissues that were annotated to eleven genes: AGRN, C1orf170, FAM41C, FLJ39609, HES4, ISG15, KLHL17, NOC2L, PLEKHN1, SAMD11, and WASH5P (Supplementary Table S3).
Depending on tumor subtype, some gene regions had a different directional change in tumor methylation compared to normal tissue (e.g. C1orf170, HES4, and ISG15). Additionally, across the eleven genes identified, we observed differential methylation in different regions including gene body, promoter (TSS1500 and TSS200), and 3’UTR (Table 2 and Supplementary Table S3). All nineteen DMGRs also had differential methylation in at least one late stage tumor subtype, and thirteen of the nineteen DMGRs were significantly differentially methylated across all tumor subtypes in late stage tumors (Table 2 and Supplementary Table S4). A heatmap of the unadjusted beta values for individual CpGs from the nineteen DMGRs demonstrated grouping of most of the Basal-like tumors separate from a group of mixed Luminal and Her2 tumors (Figure 2).
DMGRs cluster on chromosome 1p36 and on gene bodies
Of the nineteen DMGRs identified, all of them are in eleven genes located on the p36.3 cytoband of chromosome 1 (Supplementary Figure S4). Chromosome 1p36.3 is the start section of chromosome 1 and of the eleven genes identified, one (WASH5P) is located near the very start of the chromosome (chr1:14,362 - 29,370) and the other ten genes are located end-to-end between chr1:868,071 - 1,056,116 (Supplementary Figure S4).
Most of the DMGRs tracked to gene body regions: AGRN, C1orf170, FAM41C, ISG15, KLHL17, NOC2L, PLEKHN1, SAMD11, and WASH5P all had gene body methylation differences. Gene body regions were enriched among early stage tumor DMGRs compared to all other regions (Fisher’s Exact Test OR = 4.15, 95% CI = 1.04 – 23.83, P = 0.04). All differentially methylated CpG probe IDs are given in Supplementary Table S5. DAVID pathway analysis applied to the top 400 most aberrantly methylated genes in common to the four PAM50 subtypes identified the GO term for the regulation of hormone levels to be significantly enriched (GO:0010817, FDR = 0.035, Supplementary Table S6).
Breast cancer copy number alterations in 1p36
Among these 523 tumors, the prevalence of 1p36.3 copy number alteration was only 1.2% (n=6); all were amplifications that affected ten of the eleven genes most distal to the chromosome end. Among the six tumors with 1p36.3 amplification, three were Basal-like, two were Her2-enriched, and one was Luminal A. Exclusive of tumors with copy number alterations, there was one tumor (Her2-enriched) with a truncating mutation in KLHL17 and one tumor (Basal-like) with a missense mutation in PLEKHN1.
DMGRs impact gene expression
We identified CpG sites with significant correlation of methylation with gene expression for five genes (AGRN, PLEKHN1, KLHL17, SAMD11, and FAM41C), associated with eight DMGRs (Supplementary Table S7 and Supplementary Figures S6-9).
Validating DMGR hits in an independent dataset
We validated our findings in an independent 450K methylation data set from 186 tumors and 46 normal tissues described in Fleischer et al. (GSE60185). Seventeen of nineteen DMGRs were significantly differentially methylated between tumor and normal tissues in the replication set (all DMGRs at Q < 0.01; Table 2), and CpGs in these DMGRs had similar patterns of beta value distributions (Supplementary Figure S10). The remaining two gene regions were also highly ranked in the Q-value distribution (WASH5P Body: Q = 0.07; ISG15 Body: Q = 0.10).
Reproducibility
All TCGA and validation data are publicly available. We also provide software under an open source license for analysis reproducibility and to build upon our work21.
DISCUSSION
We were interested in identifying common biology underlying breast cancer independent of molecular subtype and cell-type proportion. After applying a reference-free deconvolution algorithm, we observed that early stage tumors harbor differentially methylated gene regions localized entirely to a small region on 1p36.3 shared across four major subtypes. Although DNA methylation alterations are widespread in early stage tumors and prior work has demonstrated alterations that differ among breast tumor subtypes9,22, we observed only 19 DMGRs that overlapped molecular subtypes. That all DMGRs track to the same region on 1p36.3 suggests that altered regulation of this region contributes to breast carcinogenesis irrespective of disease subtype.
Previously, alterations on chromosome 1 have been observed in breast cancer cell lines and tumors23. Additionally, copy number deletions in this region have been shown to be an important precursor in DCIS tumors 24 and in follicular lymphomas 25. However, the most prevalent copy number alterations on chromosome 1 are gains on the q arm and losses on the p arm that do not typically fully encompass our implicated genes on 1p36.323,26,27. The region is also well-studied and significantly altered in neuroblastoma - the most common solid tissue tumor of childhood28–31. The biological underpinnings of this region remain elusive19,32 but a systematic understanding of how these specific DMGRs may impact early cancer development may be important for other cancer types and not just breast cancer.
Of the nineteen DMGRs identified, eighteen of them replicated in either one or both late stage and independent validation sets. The one DMGR that did not replicate was the WASH5P body. This region is located more than 830,000 base pairs (bp) away from the much tighter region spanned by the remaining eighteen DMGRs (~188,000 bp), suggesting a loose association between WASH5P and the other ten genes.
There is also additional evidence implicating the potential importance of the identified genes assigned to the differentially methylated regions. For example, in a study of mutational profiles in metastatic breast cancers, AGRN was more frequently mutated in metastatic cancers compared with early breast cancers33. Similarly, expression of the HES4 Notch gene is known to be significantly correlated with the presence of activating mutations in multiple breast cancer cell lines, and is associated with poor patient outcomes34. In addition, ISG15 has been implicated as a key player in breast carcinogenesis35, though there is conflicting evidence36. However, the conflicting evidence to date may be related to our observation of ISG15 hypomethylation in Basal-Like, Her2, and LumB tumors, and hypermethylation in LumA tumors (Supplementary Table S3). Opposing methylation states among tumor subtypes relative to normal tissue may contribute to subtype-specific roles of ISG15 dysregulation in breast carcinogenesis. Additionally, the NOC2L gene has been identified as a member of a group of prognostic genes derived from an integrated microarray of breast cancer studies37. We also identified three DMGRs – TSS1500, Body, & 5’UTR - in the SAMD11 gene, which has significantly reduced expression in breast cancer cells compared to normal tissues38, consistent with our findings of SAMD11 hypermethylation across all four breast cancer subtypes. As DNAm changes were observed consistently and robustly across subtypes, it is likely that several of the other identified genes are cancer initiation factors that require additional study.
Importantly, we validated the identified DMGRs in an independent set of invasive breast tumors and normal tissues. Our validation is strengthened by the lack of molecular subtype assignments in the validation set. The validation of DMGRs in a setting agnostic to intrinsic subtype indicates that differential magnitude or direction of methylation alterations that may be present in different subtypes did not limit our ability to identify significant alterations. A limitation of the validation set is a lack of gene expression data to further investigate relationships between expression and methylation for each gene region. Nevertheless, additional targeted studies on this set of validated genes and gene regions can enhance the understanding of methylation alterations at these DMGRs in breast carcinogenesis.
Caution should be exercised in interpreting the results of the adjusted beta coefficients from the reference-free algorithm. It is unclear if specific disease states are a result of aberrant methylation profiles in specific cell types which then cause changes to cell mixtures, or if the disease state is a result of cell-type proportion differences. Additionally, the unsupervised clustering heatmaps plot unadjusted methylation beta values and do not account for cell type adjustment. Lastly, the DMGR analysis drops CpGs that do not track to gene regions, which may reduce detection of non-genic regions related to breast carcinogenesis.
We identified and validated DMGRs in early stage breast tumors across PAM50 subtypes that are located on chromosome 1p36.3. The observed differential methylation suggests that this region may contribute to the initiation or progression to invasive breast cancer. Additional work is needed to investigate the scope of necessary and sufficient alterations to 1p36.3 for transformation and to more clearly understand the implications of 1p36.3 methylation alterations to gene regulation. Further investigation of DNAm changes to 1p36.3 may identify opportunities for early identification of breast cancer or risk assessment. Lastly, the reference-free approach we used could be applied to methylation datasets from other tumor types to identify potential drivers of carcinogenesis common across histologic or intrinsic molecular subtypes.
PATIENTS & METHODS
Data Processing
We accessed breast invasive carcinoma Level 1 Illumina HumanMethylation450 (450K) DNAm data (n = 870) from the TCGA data access portal and downloaded all sample intensity data (IDAT) files. We processed the IDAT files with the R package minfi using the “Funnorm” normalization method on the full dataset 39. We filtered CpGs with a detection P-value > 1.0E-05 in more than 25% of samples, CpGs with high frequency SNP(s) in the probe, probes previously described to be potentially cross-hybridizing, and sex-specific probes 40,41. We filtered samples that did not have full covariate data (PAM50 subtype, pathologic stage42,12) and full demographic data (age and sex). All tumor adjacent normal samples were included regardless of missing data (n = 97, Table 1).
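The detection P-value filter described above can be illustrated with a minimal Python sketch. The probe names and P-values are hypothetical toy data, and the functions `failed_fraction` and `keep_probe` are illustrative names, not part of minfi or of the authors' pipeline; only the two thresholds come from the text.

```python
# Thresholds taken from the text; everything else below is hypothetical.
DETECTION_P_CUTOFF = 1.0e-05   # a sample fails a probe if its P-value exceeds this
MAX_FAILED_FRACTION = 0.25     # probes failing in more than 25% of samples are dropped


def failed_fraction(p_values, cutoff=DETECTION_P_CUTOFF):
    """Fraction of samples whose detection P-value exceeds the cutoff."""
    failed = sum(1 for p in p_values if p > cutoff)
    return failed / len(p_values)


def keep_probe(p_values):
    """Keep a probe unless it fails detection in more than 25% of samples."""
    return failed_fraction(p_values) <= MAX_FAILED_FRACTION


# Hypothetical detection P-values for three probes across four samples.
detection_p = {
    "cg_example_1": [1e-12, 1e-10, 1e-09, 1e-11],  # passes in every sample
    "cg_example_2": [1e-12, 0.02, 1e-09, 1e-11],   # fails in 1 of 4 (25%): kept
    "cg_example_3": [0.30, 0.02, 1e-09, 1e-11],    # fails in 2 of 4 (50%): dropped
}

kept = sorted(name for name, p in detection_p.items() if keep_probe(p))
print(kept)
```

The comparison is written as "more than 25%", so a probe failing in exactly one quarter of samples is retained in this sketch.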
From an original set of 485,512 measured CpG sites on the Illumina 450K array, our filtering steps removed 2,932 probes exceeding the detection P-value limit, and 93,801 probes that were SNP-associated, cross-hybridizing, or sex-specific, resulting in a final analytic set of 388,779 CpGs. From 870 TCGA breast tumors, we restricted to primary tumors with available PAM50 intrinsic subtype assignments of Basal-like (n = 86), Her2 (n = 31), Luminal A (n = 279), and Luminal B (n = 127), excluding Normal-like tumors due to limited sample size (n = 18). Lastly, we restricted the final total tumor set to only those with stage assignments, resulting in a final analytic sample size of n = 523.
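As a quick consistency check, the probe and sample counts reported above can be reproduced with simple arithmetic. The variable names in this short Python sketch are illustrative; all numbers are taken from the text.

```python
# Probe counts reported in the text.
total_probes = 485_512
removed_detection = 2_932          # exceeded the detection P-value limit
removed_snp_cross_sex = 93_801     # SNP-associated, cross-hybridizing, or sex-specific

analytic_cpgs = total_probes - removed_detection - removed_snp_cross_sex

# Tumor counts per PAM50 subtype reported in the text.
subtype_counts = {
    "Basal-like": 86,
    "Her2": 31,
    "Luminal A": 279,
    "Luminal B": 127,
}
analytic_tumors = sum(subtype_counts.values())

print(analytic_cpgs)    # 388779
print(analytic_tumors)  # 523
```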
Reference-free cell type adjustment modeling
We stratified samples by PAM50 subtype (Basal-like, Luminal A, Luminal B, Her2) and then by tumor stage dichotomizing as early (stage I and II tumors) and late (stage III and IV tumors)42, resulting in eight distinct models. To analyze DNAm differences between tumor and normal tissue and to adjust for effects of cellular heterogeneity across samples, we applied the reference-free deconvolution algorithm from the RefFreeEWAS R package to each model adjusting for age16. The method estimates the number of underlying tissue-specific cell methylation states contributing to methylation heterogeneity through a constrained variant of NMF43. Briefly, the method assumes the sample methylome is composed of a linear combination of the constituent methylomes. It decomposes the matrix of sample methylation values (Y) into two matrices (Y = MΩ^T), where M is an m × K matrix of CpG-specific methylation states at m CpGs for K cell types and Ω is an n × K matrix of subject-specific cell-type proportions for n samples. K is selected by bootstrapping over K = 2…10 and choosing the K that minimizes the bootstrapped deviance. To correct for multiple comparisons, we converted all extracted P-values to Q-values using the R package qvalue44.
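The linear mixing assumption behind the decomposition can be sketched in a few lines of Python. This shows only the forward model (building bulk methylation values from cell-type profiles and per-sample proportions) and the final pick of K; it is not the RefFreeEWAS fitting procedure, which solves the inverse problem. The matrices, the deviance values, and the function name `mix` are hypothetical.

```python
def mix(M, Omega):
    """Return Y = M * Omega^T for M (m x K) and Omega (n x K) as nested lists."""
    K = len(M[0])
    if any(len(row) != K for row in M) or any(len(row) != K for row in Omega):
        raise ValueError("M and Omega must share the same number of columns K")
    return [
        [sum(M[j][k] * Omega[i][k] for k in range(K)) for i in range(len(Omega))]
        for j in range(len(M))
    ]


# Hypothetical cell-type methylation profiles: m = 3 CpGs, K = 2 cell types.
M = [
    [0.9, 0.1],
    [0.2, 0.8],
    [0.5, 0.5],
]

# Hypothetical cell-type proportions: n = 3 samples, K = 2 cell types.
# Rows sum to one here for simplicity.
Omega = [
    [1.00, 0.00],
    [0.50, 0.50],
    [0.25, 0.75],
]

# Bulk methylation matrix (m CpGs x n samples) implied by the mixing model.
Y = mix(M, Omega)
for row in Y:
    print([round(value, 3) for value in row])

# Hypothetical bootstrapped deviances per candidate K; the selected K is the
# one with the smallest deviance.
deviance_by_k = {2: 41.7, 3: 35.2, 4: 36.9, 5: 40.1}
best_k = min(deviance_by_k, key=deviance_by_k.get)
print(best_k)
```

In this toy example the third CpG has the same methylation level in both cell types, so its bulk value does not change with cell composition, whereas the first two CpGs shift as the proportions change. That shift is the confounding the adjustment is meant to remove.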
Identifying differentially methylated gene regions
To understand the genomic regions with common DNAm alterations, we grouped CpGs by gene and region relative to genomic location (transcription start site 1500 (TSS1500), TSS200, 3’ untranslated region (3’UTR), 5’UTR, 1st exon, and gene body). We used this gene-region taxonomy to collapse differentially methylated CpGs, as defined by our Q-value cutoff, into specific differentially methylated gene regions (DMGRs). This extended the Illumina 450K CpG annotation file to allow for a given CpG to be associated with up to two genes depending on the proximity of the CpG site to neighboring genes (Figure 3).
We defined a differentially methylated CpG as one with a Q-value < 0.01 following cell-type adjustment in a specific subtype model compared to normal tissue. To identify DMGR sets for each stage and subtype, we analyzed all eight models independently.
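The collapse from CpGs to gene regions, followed by the intersection across subtype models, might look like the following Python sketch. It assumes that a gene region counts as a DMGR in a model when at least one of its CpGs has Q < 0.01; the text does not spell out this rule, so treat it as an assumption. The annotation, Q-values, gene names, and the function `dmgrs_for_model` are all hypothetical.

```python
Q_CUTOFF = 0.01  # from the text

# Hypothetical annotation: each CpG maps to one or two (gene, region) pairs.
annotation = {
    "cg_a": [("GENE1", "Body")],
    "cg_b": [("GENE1", "Body")],
    "cg_c": [("GENE1", "TSS200"), ("GENE2", "Body")],
    "cg_d": [("GENE3", "3'UTR")],
}

# Hypothetical cell-type-adjusted Q-values for one stage, per subtype model.
q_values = {
    "Basal-like": {"cg_a": 0.001, "cg_b": 0.2, "cg_c": 0.004, "cg_d": 0.5},
    "Her2": {"cg_a": 0.3, "cg_b": 0.002, "cg_c": 0.008, "cg_d": 0.009},
    "Luminal A": {"cg_a": 0.005, "cg_b": 0.4, "cg_c": 0.03, "cg_d": 0.6},
    "Luminal B": {"cg_a": 0.0001, "cg_b": 0.009, "cg_c": 0.002, "cg_d": 0.7},
}


def dmgrs_for_model(model_q):
    """Collect gene regions with at least one CpG below the Q-value cutoff.

    Assumption: one significant CpG is enough to call the region a DMGR.
    """
    regions = set()
    for cpg, q in model_q.items():
        if q < Q_CUTOFF:
            regions.update(annotation.get(cpg, []))
    return regions


# Each subtype model is analyzed independently, then the sets are intersected.
per_model = {subtype: dmgrs_for_model(q) for subtype, q in q_values.items()}
shared = set.intersection(*per_model.values())
print(sorted(shared))
```

In the toy data the shared region is supported by different CpGs in different subtypes (for example `cg_b` rather than `cg_a` in the Her2 model). That is one consequence of collapsing to regions instead of requiring the same CpG to be significant in every model.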
Pathway Analysis
We performed a DAVID (the database for annotation, visualization and integrated discovery) analysis45,46 for the 400 genes with the lowest median CpG Q-values that are in common to all early stage tumors regardless of PAM50 subtype, and extracted enriched Gene Ontology (GO)47 and Kyoto Encyclopedia of Genes and Genomes (KEGG)48 terms. We selected the top 400 genes based on recommended gene list sizes49.
Copy number, gene expression, and genomic location
We downloaded TCGA Breast Invasive Carcinoma CNV data9 and normalized RNAseq using cBioPortal50. For the DMGRs we identified, we analyzed the prevalence of copy number alterations and mutations in each gene across all samples, stratified by molecular subtype. Similarly, to determine whether these DMGRs affect gene expression of their target gene, we calculated Spearman correlations of DNAm beta values in significant CpGs (Q < 0.01) to matched sample Illumina HiSeq gene expression data. We used a Bonferroni correction to determine significant expression differences, resulting in an acceptance alpha value of 9.36E-5.
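The correlation and multiple-testing steps can be sketched in plain Python as follows. The beta and expression values are hypothetical. The number of tests is not stated in the text; 534 is an inferred placeholder that reproduces the reported threshold only under an assumed family-wise alpha of 0.05. The sketch computes the Spearman coefficient but not its P-value.

```python
def average_ranks(values):
    """Rank values from 1..n, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    sd_x = sum((a - mean_x) ** 2 for a in rx) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in ry) ** 0.5
    return cov / (sd_x * sd_y)


def bonferroni_alpha(family_alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    return family_alpha / n_tests


# Hypothetical matched values for one CpG across five samples.
beta = [0.10, 0.25, 0.40, 0.55, 0.80]
expression = [9.1, 8.4, 7.9, 6.0, 6.5]
rho = spearman_rho(beta, expression)
print(round(rho, 3))

# Assumed inputs: family-wise alpha of 0.05 and 534 tests (both inferred).
alpha = bonferroni_alpha(0.05, 534)
print(f"{alpha:.2e}")
```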
Validation
To confirm the identified early stage DMGRs in common among intrinsic molecular subtypes we applied the analysis workflow to TCGA late stage tumors and an independent validation set (GSE60185)20. The validation set includes samples of ductal carcinoma in situ (DCIS), mixed, invasive, and normal histology collected from Akershus University Hospital and from the Norwegian Radium Hospital. We analyzed only the invasive samples compared to normal samples using the same bioinformatics pipeline of quality control CpG filtering steps and normalization procedures. However, we did not have complete age information or intrinsic subtype assignments for the validation set and the models are not adjusted for age or stratified by subtype. This resulted in a single model comparing 186 invasive tumors with 46 normal controls measured across 390,253 CpGs.
COMPETING INTERESTS
The authors declare that they have no competing interests.
SUPPLEMENTAL TABLES
Due to size limitations of this document and the size of the supplemental tables available for this manuscript, supplemental tables may be found at the following DOI: 10.5281/zenodo.400247
ACKNOWLEDGEMENTS
Funding was provided by P20GM104416 and R01DE02277 (BCC), by the Quantitative Biomedical Sciences graduate program, and through a BD2K Fellowship to AJT (T32LM012204).
Footnotes
AUTHOR EMAILS: AT: Alexander.J.Titus.gr{at}dartmouth.edu, GW: GregWay{at}upenn.edu, KJ: Kevin.C.Johnson{at}jax.org, BC: Brock.Christensen{at}dartmouth.edu