ELMER v.2: An R/Bioconductor package to reconstruct gene regulatory networks from DNA methylation and transcriptome profiles

Tiago C Silva; Simon G Coetzee; Lijing Yao; Nicole Gull; Dennis J Hazelett; Houtan Noushmehr; De-Chen Lin; Benjamin P Berman

doi:10.1101/148726

Abstract

Motivation: DNA methylation has been used to identify functional changes at transcriptional enhancers and other cis-regulatory modules (CRMs) in tumors and other disease tissues. Our R/Bioconductor package ELMER (Enhancer Linking by Methylation/Expression Relationships) provides a systematic approach that reconstructs altered gene regulatory networks (GRNs) by combining enhancer methylation and gene expression data derived from the same sample set.

Results: We present a completely revised version 2 of ELMER that provides numerous new features including an optional web-based interface and a new Supervised Analysis mode to use pre-defined sample groupings. We show that this approach can identify GRNs associated with many new Master Regulators including KLF5 in breast cancer.

Availability: ELMER v.2 is available as an R/Bioconductor package at http://bioconductor.org/packages/ELMER/

1 Introduction

Motivated by the identification of transcription factor binding sites (TFBSs), enhancers, and other cis-regulatory modules (CRMs) from DNA methylation data in tumor samples (Berman et al., 2012; Hovestadt et al., 2014; Johann et al., 2016), and the strong association between DNA methylation and target gene expression in tumors (Aran et al., 2013; Aran and Hellman, 2013), we previously developed an R/Bioconductor package ELMER (Enhancer Linking by Methylation/Expression Relationships) to infer regulatory element landscapes and GRNs from cancer methylomes (Yao et al., 2015). ELMER version 1 has been adopted by other groups (Dhingra et al., 2017; Mishra and Guda, 2017; Malta et al., 2018), and remains the only publicly available software tool to use matched DNA methylation and expression profiles to reconstruct TF networks (reviewed in Teschendorff and Relton, 2018). Other tools such as TENET (Rhie et al., 2016) and RegNetDriver (Dhingra et al., 2017) have incorporated ELMER principles and code into cancer network analysis.

We present here a substantially re-written ELMER v. 2 (Fig. 1A) that implements new features and improvements including: (i) support for Infinium HM450 or EPIC arrays and RNA-seq using the gold-standard MultiAssayExperiment (MAE) data structure, (ii) integration with our TCGAbiolinks package (Colaprico et al., 2015) for cohort selection and data importing from the NCI Genomic Data Commons (Grossman et al., 2016), (iii) integration with our TCGAbiolinksGUI tool (Silva et al., 2018) to run ELMER via a web-based interface, (iv) output of all results in a single interactive HTML file that includes all data tables, figures, and source code, (v) adoption of software engineering best practices including unit testing and better exception handling, (vi) annotation of cell-type specific chromatin context for resulting genomic elements, and (vii) a new Supervised mode where the user can explicitly define sample groups for comparison. In this brief Note, we highlight several of these new features by analyzing TCGA Breast Cancer data to identify molecular subtype-specific networks. A complete description of new methods and features, along with computational benchmarking, is presented in the Supplementary Methods and Notes (Supplementary Figures 1–16 and Supplementary Tables S1–S5). ELMER v. 2 has been publicly available starting with v. 2.2.7 in Bioconductor Release 3.6 (October 2017). Complete result reports for the BRCA analyses are available in the Supplementary Materials and at http://bit.ly/ELMER_reports.

Figure 1.

(A) ELMER architecture, showing external data sources (gray) and Bioconductor packages (blue). (B) Association of enhancer probe methylation and expression of the nearby GATA3 gene, showing sample groups used in the Unsupervised vs. Supervised analysis modes. In Unsupervised mode, the 20% of samples with the lowest (blue) and highest (red) methylation levels are compared; in Supervised mode, the predefined Luminal A (blue) and Basal-like (red) tumors are compared. (C) A selected set of subtype-specific Master Regulator candidates identified from TCGA BRCA, comparing Unsupervised vs. various Supervised analysis runs. The complete table is available as Supplementary Table S3. (D) StateHub chromatin state enrichment analysis for 1,076 regulatory elements identified in the Unsupervised analysis. (E) Master Regulator analysis for the top motif in the Unsupervised analysis, FOXA2. All TFs are ranked by their correlation with methylation changes of distal probes within 250 bp of a FOXA2 binding motif. Colored dots indicate the top 3 most anti-correlated TFs (FOXA1, GATA3 and ESR1), and all TFs classified in the same family as FOXA2.

2 Feature highlights

Supervised vs Unsupervised mode

ELMER first identifies Differentially Methylated CpGs (DMCs) occurring at distal (non-promoter) probes (Step 1), then searches for downstream gene targets for each DMC (Step 2), and finally identifies Master Regulator TFs based on enriched binding motifs and TF expression (Step 3), as shown in Supplementary Fig. 1. ELMER v. 1 identified DMCs by comparing methylation in all cancer vs. non-cancer samples, while the subsequent steps used correlation between methylation and expression in the n% of tumors with the most extreme methylation values (by default, n=20). The rationale was that any particular GRN might only be altered in a subset of tumors with a specific molecular phenotype, which would not always be known a priori. While 20% was an arbitrary definition, we found this to be a useful exploratory strategy given the heterogeneity of cancer molecular phenotypes.

In ELMER v. 2, we continue to support this original Unsupervised strategy. However, we have found many practical use cases where the group structure is known in advance, and a Supervised search strategy is preferable. This is especially true for “case-control” experimental designs such as treated vs. untreated samples. The major difference is that in Supervised mode, all samples must be contained in one of the two comparison groups, whereas Unsupervised mode still uses only the n% most extreme. Furthermore, this subset of samples with the most extreme methylation values changes from one genomic locus to the next.
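The difference in sample selection between the two modes can be illustrated with a minimal Python sketch. This is not ELMER's R implementation; the function names, the dictionary-based inputs, and the toy beta values are hypothetical and chosen only for illustration. It shows that the Unsupervised groups are recomputed for every probe, while the Supervised groups depend only on the sample labels.

```python
from typing import Dict, List, Tuple


def unsupervised_groups(
    methylation: Dict[str, float], fraction: float = 0.20
) -> Tuple[List[str], List[str]]:
    """Return (lowest, highest) sample IDs for ONE probe.

    `methylation` maps sample ID -> beta value at that probe.
    `fraction` is the n% of samples taken from each extreme (default 20%).
    """
    ranked = sorted(methylation, key=methylation.get)  # ascending beta value
    k = max(1, int(len(ranked) * fraction))
    return ranked[:k], ranked[-k:]


def supervised_groups(
    labels: Dict[str, str], group1: str, group2: str
) -> Tuple[List[str], List[str]]:
    """Return the sample IDs of the two predefined groups.

    The result does not depend on the probe. Samples carrying any other
    label are simply not returned (a simplification of this sketch).
    """
    g1 = sorted(s for s, lab in labels.items() if lab == group1)
    g2 = sorted(s for s, lab in labels.items() if lab == group2)
    return g1, g2


if __name__ == "__main__":
    # Made-up beta values for two probes measured on the same ten samples.
    probe_a = {"s1": 0.10, "s2": 0.85, "s3": 0.40, "s4": 0.92, "s5": 0.15,
               "s6": 0.55, "s7": 0.30, "s8": 0.70, "s9": 0.05, "s10": 0.60}
    probe_b = {"s1": 0.80, "s2": 0.20, "s3": 0.95, "s4": 0.10, "s5": 0.50,
               "s6": 0.45, "s7": 0.60, "s8": 0.35, "s9": 0.70, "s10": 0.05}
    print(unsupervised_groups(probe_a))  # extreme samples for probe A
    print(unsupervised_groups(probe_b))  # a different subset for probe B
```

Because the extreme samples differ between probes, each locus in Unsupervised mode is effectively tested on its own subset of tumors, whereas Supervised mode compares the same two groups everywhere.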

To compare Supervised vs. Unsupervised modes, we used ELMER v. 2.4.3 to analyze TCGA BRCA (Breast Invasive Carcinoma) data (Supplementary Figures 2–15 and Supplementary Tables S2–S3). Based on enhancer-gene pairing, Unsupervised mode had lower statistical power (Fig. 1B), but was able to identify molecular subtype-specific networks without explicit a priori subtype labels (Fig. 1C). As expected, Supervised mode is best suited to explore well-understood molecular phenotypes, while Unsupervised mode can be a powerful tool to discover networks in unknown tumor subtypes. When molecular subtypes are known, the two modes can be used in conjunction and compared (as we have done in Supplementary Table S3).

Functional interpretation of chromatin states

While ELMER v.1 was limited to analyzing only probes overlapping known enhancers, ELMER v.2 analyzes all distal probes, and thus it is now important to provide a functional interpretation of the resulting regions. We perform a chromatin state enrichment analysis using states automatically downloaded from the StateHub database (http://StateHub.org), a publicly available resource that integrates histone modification and other publicly available epigenomic data for over 1,000 different human samples (Coetzee et al., 2018). Enrichment of these states is calculated against a randomly sampled background set drawn from the same distal probe set used as input. We used ELMER v.2 to perform this state enrichment analysis for the BRCA dataset, yielding insights into the cell-type specificity of the genomic regions identified (Fig. 1D, and Supplementary Fig. 5). The strongest enrichment was for active enhancer and promoter states having cell-type specificity for MCF7, a Luminal Breast Cancer cell line.
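The idea of scoring enrichment against a randomly sampled background can be sketched as follows. This is a simplified Python illustration and not the ELMER or StateHub code: the function name, the single-state-per-probe annotation, the number of random draws, the fold-enrichment summary, and the empirical p-value are all assumptions made for the example.

```python
import random
from collections import Counter
from typing import Dict, Iterable


def state_enrichment(
    probe_state: Dict[str, str],
    selected: Iterable[str],
    n_draws: int = 1000,
    seed: int = 0,
) -> Dict[str, Dict[str, float]]:
    """Compare chromatin-state counts in `selected` probes with random
    background sets of the same size drawn from all distal probes.

    `probe_state` maps every distal probe ID to one state label
    (hypothetical input format).
    """
    rng = random.Random(seed)
    selected = list(selected)
    all_probes = sorted(probe_state)
    states = sorted(set(probe_state.values()))
    observed = Counter(probe_state[p] for p in selected)

    totals: Counter = Counter()    # summed background counts per state
    at_least: Counter = Counter()  # draws with count >= observed count
    for _ in range(n_draws):
        draw = rng.sample(all_probes, len(selected))
        counts = Counter(probe_state[p] for p in draw)
        for s in states:
            totals[s] += counts[s]
            if counts[s] >= observed[s]:
                at_least[s] += 1

    result = {}
    for s in states:
        expected = totals[s] / n_draws
        result[s] = {
            "observed": observed[s],
            "expected": expected,
            "fold": observed[s] / expected if expected > 0 else float("inf"),
            # Add-one empirical p-value, so it is never exactly zero.
            "p_empirical": (at_least[s] + 1) / (n_draws + 1),
        }
    return result
```

Drawing the background from the same distal probe set, rather than from the whole genome, keeps the comparison from being driven by the array's probe design.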

Motif enrichment analysis and identification of Master Regulator TFs

The final step of ELMER identifies enriched TF binding motifs within candidate regulatory regions, followed by correlation with TF expression to identify upstream Master Regulators (Supplementary Fig. 1). ELMER v. 1 used a hand-curated selection of 145 TF motifs, which were grouped into binding domain families manually. We re-implemented these sections in ELMER v. 2 to use publicly available databases for these steps, making the package more comprehensive and easier to update in future versions. ELMER v. 2 uses 771 human binding models from HOCOMOCO v11 (Kulakovskiy et al., 2017). Each of these is associated with one or more of the 1,639 transcription factors defined in Lambert et al. (2018), which are grouped into 82 different binding domain families and 331 sub-families using the TFClass database (Wingender et al., 2017). We use Fisher’s exact test and Benjamini-Hochberg multiple hypothesis correction to compare the frequency of each motif flanking the positive CpG probes to a background defined by all distal probes on the array, plotting the top hits as odds ratios with 95% confidence intervals (Supplementary Fig. 13).
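The statistics named in this paragraph can be sketched in pure Python. This is an illustration rather than ELMER's code, and it makes several assumptions that are not stated in the text: the test is one-sided (enrichment only), the background row of the 2x2 table holds the distal probes that are not in the target set, and the 95% interval is a Wald interval on the log odds ratio. All function names are hypothetical.

```python
import math
from typing import List, Tuple


def fisher_greater(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher's exact p-value for motif enrichment.

    a: target probes with motif       b: target probes without motif
    c: background probes with motif   d: background probes without motif
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    denom = math.comb(n, row1)
    # Hypergeometric upper tail: P(X >= a) with the margins held fixed.
    tail = sum(
        math.comb(col1, k) * math.comb(n - col1, row1 - k)
        for k in range(a, min(row1, col1) + 1)
    )
    return tail / denom


def odds_ratio_ci(
    a: int, b: int, c: int, d: int, z: float = 1.96
) -> Tuple[float, float, float]:
    """Odds ratio with a Wald 95% confidence interval (assumed method)."""
    cells = [float(x) for x in (a, b, c, d)]
    if 0.0 in cells:
        # Haldane-Anscombe correction avoids division by zero.
        cells = [x + 0.5 for x in cells]
    a_, b_, c_, d_ = cells
    log_or = math.log((a_ * d_) / (b_ * c_))
    se = math.sqrt(1 / a_ + 1 / b_ + 1 / c_ + 1 / d_)
    return (math.exp(log_or),
            math.exp(log_or - z * se),
            math.exp(log_or + z * se))


def benjamini_hochberg(pvalues: List[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

In a full analysis, `fisher_greater` would be evaluated once per motif and the resulting list of p-values passed to `benjamini_hochberg` before selecting the top hits to plot.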

For each enriched motif, we then calculate a mean DNA methylation value for all probes having a motif instance within ±250 bp, and correlate this value with the expression of each of the 1,639 TFs in our database. This helps to distinguish between different members of the same TF family, which often have nearly indistinguishable binding motifs. For instance, in the BRCA analysis, the most highly enriched motif corresponded to FOXA2, but our Master Regulator (MR) analysis showed the likely family member to be FOXA1 (Fig. 1E), which has been extensively validated as an MR in luminal subtypes of breast cancer (Meyer and Carroll, 2012; Nakshatri and Badve, 2009). We ran the same analysis in Supervised mode to compare explicit changes in each of the known molecular subtypes from Ciriello et al. (2015), which had a significant overlap with the Unsupervised analysis but yielded many additional MRs (Fig. 1C, Supplementary Table S3). Examples included SOX11 and KLF5, whose functional roles in basal-like BRCA were recently described (Shepherd et al., 2016; Ben-Porath et al., 2008), and Androgen Receptor (AR), which has been implicated in ER-positive BRCA (Feng et al., 2017; Vera-Badillo et al., 2013). In addition to these known regulators, many completely unexplored TFs were identified as candidate MRs (Supplementary Table S3), highlighting the power of Unsupervised analysis.
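The ranking step described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not ELMER's implementation: it uses a plain Pearson correlation as the association measure, the function names and input layout are hypothetical, and the TF identifiers in any usage would be supplied by the caller.

```python
import math
from typing import Dict, List, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def rank_tfs(
    motif_probe_meth: List[List[float]],
    tf_expr: Dict[str, List[float]],
) -> List[Tuple[str, float]]:
    """Rank TFs from most anti-correlated to most correlated.

    `motif_probe_meth`: one row per probe carrying the motif within
    +/-250 bp, one column per sample (beta values).
    `tf_expr`: TF name -> expression values in the same sample order.
    """
    n_probes = len(motif_probe_meth)
    # Mean methylation over all motif-bearing probes, per sample.
    mean_meth = [sum(col) / n_probes for col in zip(*motif_probe_meth)]
    scores = {tf: pearson(mean_meth, expr) for tf, expr in tf_expr.items()}
    return sorted(scores.items(), key=lambda kv: kv[1])
```

TFs at the top of the returned list are those whose expression rises as motif-associated methylation falls, which is the pattern expected of an activating regulator bound at those sites; members of the same family can then be compared by their position in this ranking.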

3 Conclusions and Future Directions

ELMER v. 2 has been substantially re-written based on Bioconductor standards and user needs. The new Supervised mode and improved TF analysis identified additional known and novel Master Regulator candidates in TCGA BRCA analyses. ELMER v. 2 has only been tested on data from Illumina methylation arrays, which cover only 5–15% of all enhancer regions based on whole-genome bisulfite sequencing (WGBS). While ELMER does not currently support WGBS due to lack of sufficient test data, the number of WGBS datasets is quickly growing, and we expect the same basic ELMER approach will scale well in the future to take advantage of this more comprehensive data type.

Funding

The project was funded by Cedars-Sinai’s Samuel Oschin Comprehensive Cancer Institute, by the São Paulo Research Foundation (FAPESP) (2016/01389-7 to T.C.S. & H.N. and 2015/07925-5 to H.N.), by the NIH/NCI Informatics Technology for Cancer Research (1U01CA184826 to B.P.B., D.J.H. & S.G.C.) and Genomic Data Analysis Network (1U24CA210969 to B.P.B. & T.C.S.) programs, as well as NIH/NCI grant R01CA190182 to D.J.H.

Competing interests

No competing interests were disclosed.

Footnotes

* dchlin11@gmail.com or Benjamin.Berman@csmc.edu

References

Aran, D. et al. (2013). DNA methylation of distal regulatory sites characterizes dysregulation of cancer genes. Genome Biology, 14(3):R21.
Aran, D. and Hellman, A. (2013). DNA methylation of transcriptional enhancers and cancer predisposition. Cell, 154(1):11–13.
Ben-Porath, I. et al. (2008). An embryonic stem cell–like gene expression signature in poorly differentiated aggressive human tumors. Nature Genetics, 40(5):499–507.
Berman, B. P. et al. (2012). Regions of focal DNA hypermethylation and long-range hypomethylation in colorectal cancer coincide with nuclear lamina-associated domains. Nature Genetics, 44(1):40–46.
Ciriello, G. et al. (2015). Comprehensive molecular portraits of invasive lobular breast cancer. Cell, 163(2):506–519.
Coetzee, S. et al. (2018). StateHub-StatePaintR: rapid and reproducible chromatin state evaluation for custom genome annotation. F1000Research, 7(214).
Colaprico, A. et al. (2015). TCGAbiolinks: an R/Bioconductor package for integrative analysis of TCGA data. Nucleic Acids Research, page gkv1507.
Dhingra, P. et al. (2017). Identification of novel prostate cancer drivers using RegNetDriver: a framework for integration of genetic and epigenetic alterations with tissue-specific regulatory network. Genome Biology, 18(1):141.
Feng, J. et al. (2017). Androgen and AR contribute to breast cancer development and metastasis: an insight of mechanisms. Oncogene, 36(20):2775.
Grossman, R. L. et al. (2016). Toward a shared vision for cancer genomic data. New England Journal of Medicine, 375(12):1109–1112.
Hovestadt, V. et al. (2014). Decoding the regulatory landscape of medulloblastoma using DNA methylation sequencing. Nature, 510(7506):537.
Johann, P. D. et al. (2016). Atypical teratoid/rhabdoid tumors are comprised of three epigenetic subgroups with distinct enhancer landscapes. Cancer Cell, 29(3):379–393.
Kulakovskiy, I. V. et al. (2017). HOCOMOCO: towards a complete collection of transcription factor binding models for human and mouse via large-scale ChIP-Seq analysis. Nucleic Acids Research, 46(D1):D252–D259.
Lambert, S. A. et al. (2018). The human transcription factors. Cell, 172(4):650–665.
Malta, T. M. et al. (2018). Machine learning identifies stemness features associated with oncogenic dedifferentiation. Cell, 173(2):338–354.
Meyer, K. B. and Carroll, J. S. (2012). FOXA1 and breast cancer risk. Nature Genetics, 44:1176.
Mishra, N. K. and Guda, C. (2017). Genome-wide DNA methylation analysis reveals molecular subtypes of pancreatic cancer. Oncotarget, 8(17):28990.
Nakshatri, H. and Badve, S. (2009). FOXA1 in breast cancer. Expert Reviews in Molecular Medicine, 11:e8.
Rhie, S. K. et al. (2016). Identification of activated enhancers and linked transcription factors in breast, prostate, and kidney tumors by tracing enhancer networks using epigenetic traits. Epigenetics & Chromatin, 9(1):50.
Shepherd, J. H. et al. (2016). The SOX11 transcription factor is a critical regulator of basal-like breast cancer growth, invasion, and basal-like gene expression. Oncotarget.
Silva, T. et al. (2018). TCGAbiolinksGUI: A graphical user interface to analyze cancer molecular and clinical data. F1000Research, 7(439).
Teschendorff, A. E. and Relton, C. L. (2018). Statistical and integrative system-level analysis of DNA methylation data. Nature Reviews Genetics, 19(3):129.
Vera-Badillo, F. E. et al. (2013). Androgen receptor expression and outcomes in early breast cancer: a systematic review and meta-analysis. Journal of the National Cancer Institute, 106(1):djt319.
Wingender, E. et al. (2017). TFClass: expanding the classification of human transcription factors to their mammalian orthologs. Nucleic Acids Research, 46(D1):D343–D347.
Yao, L. et al. (2015). Inferring regulatory element landscapes and transcription factor networks from cancer methylomes. Genome Biology, 16(1):105.
