Abstract
Epithelial-mesenchymal plasticity (EMP) underlies embryonic development, wound healing, fibrosis, and cancer metastasis. Cancer cells exhibiting EMP often behave more aggressively, showing drug resistance as well as tumor-initiating and immuno-evasive traits. Thus, the EMP status of cancer cells can be a critical indicator of patient prognosis. Here, we compare three distinct transcriptomic-based metrics – each derived using a different gene list and algorithm – that quantify the EMP spectrum. Our results for 96 cancer-related RNA-seq datasets reveal a high degree of concordance among these metrics in quantifying the extent of EMP. Moreover, each metric, despite being trained on cancer expression profiles, recapitulates the expected changes in EMP scores for non-cancer contexts such as lung fibrosis and cellular reprogramming into induced pluripotent stem cells. Thus, we offer a scoring platform to quantify the extent of EMP in vitro and in vivo for diverse biological applications including cancer.
Introduction
Epithelial-Mesenchymal Plasticity (EMP) is an important feature of cancer metastasis and therapy resistance, the two major clinical challenges that claim the majority of cancer-related deaths (Gupta and Massague, 2006). EMP involves dynamic and reversible switching among multiple phenotypes along the epithelial-hybrid-mesenchymal spectrum. It encompasses both EMT (Epithelial-to-Mesenchymal Transition) and MET (Mesenchymal-to-Epithelial Transition). Originally considered to be binary transitions, EMT and MET are both now understood as multistep processes, and cells can execute these programs to varying degrees, enabling one or more hybrid epithelial/mesenchymal (E/M) phenotypes (Jolly and Levine, 2017; Nieto et al., 2016; Pal et al., 2021). EMP usually entails changes in cell-cell adhesion, migration, and invasion; while EMT is often involved with cells escaping the primary tumor and initiating metastasis, MET is thought to be important for colonization, the last step of metastasis. Besides these features, EMP is also implicated in conferring tumor-initiation potential (Morel et al., 2008; Pasani et al., 2021), immune evasion (Chen et al., 2014; Dongre et al., 2017; S. C. Tripathi et al., 2016), and resistance to various chemotherapeutic drugs and targeted therapies (Creighton et al., 2009; Sahoo et al., 2021a; Wang et al., 2009). Thus, EMP can be considered as the “motor of cellular plasticity” (Brabletz and Brabletz, 2010), which enhances cancer cell fitness in a variety of biological contexts.
Recent preclinical and clinical observations have suggested high metastatic potential of hybrid E/M phenotypes, and their association with worse patient survival across cancer types (Bierie et al., 2017; Godin et al., 2020; Huang et al., 2013; Kröger et al., 2019; Pastushenko and Blanpain, 2019; Puram et al., 2017; Sahoo et al., 2021b; Simeonov et al., 2021). Hybrid E/M phenotypes have also been observed in circulating tumor cells (CTCs); their higher frequency is often concomitant with worse clinicopathological features (Bocci et al., 2021; Lecharpentier et al., 2011; Saxena et al., 2019; Yu et al., 2013). The ability of hybrid E/M cells to form clusters of CTCs can also escalate their metastatic fitness, given the disproportionately high metastatic burden of CTC clusters (Aceto et al., 2014; Cheung et al., 2016; Jolly et al., 2017). Given the context-specific diversity of hybrid E/M phenotypes (Jolly et al., 2021), it is imperative that EMP be quantified as a continuum or spectrum, through integrating various experimental and/or computational methods.
At the transcriptomic level, various computational methods have been proposed to quantify the EMP spectrum by calculating the extent to which a given sample has undergone EMT/MET (hereafter, referred to as the ‘EMT score’). First, using gene expression from non-small cell lung cancer (NSCLC) cell lines and patients, a 76-gene EMT signature was identified and then used to derive a score (hereafter, referred to as the ‘76GS score’) based on the relative enrichment of expression levels of epithelial-associated genes. Thus, the higher the 76GS score, the more epithelial a given sample is (Byers et al., 2013; Guo et al., 2019). Second, a two-sample Kolmogorov-Smirnov method was used to calculate a score (hereafter, referred to as the ‘KS score’) on the interval [-1, +1] to depict the EMP status of cell lines and tumors. The higher the KS score, the more mesenchymal a sample is (Tan et al., 2014). Third, a multinomial logistic regression method implemented on NCI-60 expression data quantified the extent of EMT on the interval [0, 2] (hereafter, referred to as the ‘MLR score’) by calculating the probabilities for a given sample to belong to E, M or hybrid E/M states (George et al., 2017). Higher MLR scores depict a relatively enriched mesenchymal phenotype. A previous study compared these three methods – each of which utilizes a distinct gene list and algorithm – and observed that these methods were largely well-correlated with one another in terms of quantifying EMP across multiple microarray datasets (Chakraborty et al., 2020). This analysis suggested that 76GS scores correlated negatively with their MLR and KS counterparts, both of which positively correlated with one another. However, two key questions remain unanswered: a) can all three of these scoring metrics quantify the EMP spectrum for bulk and single-cell RNA-seq data with the same level of consistency? 
b) can these scores, all constructed on cancer-related datasets, be helpful in estimating the extent of EMP in non-cancer scenarios as well?
Here, we have addressed these questions by analyzing multiple bulk and single-cell RNA-seq datasets, as well as investigating both microarray and RNA-seq datasets for two non-cancer cases where EMP has been reported: a) lung diseases - chronic obstructive pulmonary disease (COPD) and fibrosis (Jolly et al., 2018; Sohal, 2017) and b) reprogramming into induced pluripotent stem cells (iPSCs) (Lai et al., 2020). We demonstrate consistency amongst the EMT scoring metrics in quantifying the EMP spectrum across these biological contexts, as well as heterogeneity of EMP phenotypes in single-cell RNA-seq datasets. Finally, through a pan-cancer analysis of RNA-seq data available via The Cancer Genome Atlas (TCGA), we show that the association of EMP with patient survival is context-specific. Despite using diverse gene-sets and methodology to quantify EMP, a convergence of these three methods suggests possible commonalities in the different trajectories that cells undergoing EMT/MET can take in a high-dimensional landscape. Moreover, our results offer proof-of-principle that these metrics, all of which were derived based on cancer cells, can successfully quantify EMP in non-cancer biological contexts too.
Results
EMT scoring methods show concordant trends across bulk RNA-seq datasets
We used the three different EMT scoring methods – 76GS, KS, and MLR – to quantify the extent of EMP in multiple RNA-seq datasets, as was previously done for microarray data (Chakraborty et al., 2020). Each method utilizes a distinct gene signature and underlying algorithm to compute an EMT score. The 76GS score is a weighted sum of the expression of 76 genes, where the weight factor is the correlation coefficient of that gene with the expression levels of CDH1 (E-cadherin), a canonical epithelial marker. Thus, a higher 76GS score corresponds to a more epithelial sample (Byers et al., 2013; Guo et al., 2019). The KS scoring method compares the empirical distribution function to the cumulative distribution function for epithelial and mesenchymal signatures identified in cell lines and tumors. The KS score is constructed by taking the maximal difference in these distributions for each predictor, followed by normalization by the number of predictors, thus taking values between −1 and +1. Positive (resp. negative) KS scores correspond to a relative enrichment of the mesenchymal (resp. epithelial) signature (Tan et al., 2014). The Multinomial Logistic Regression-based (MLR) method quantifies the extent of EMT on a scale of 0-2. MLR scores are calculated based on the probability of a given sample being assigned to the E, E/M, and M phenotypes. Thus, the higher the score, the more mesenchymal the sample is (George et al., 2017). While the KS and 76GS methods operate on gene lists and can therefore be directly applied to both microarray and RNA-seq data, the MLR method utilizes the NCI-60 microarray data as the training set for regression. Therefore, applying the MLR method to RNA-seq data requires further customization.
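To make the scoring logic concrete, the following is a minimal sketch of a 76GS-style weighted sum and a signed two-sample KS statistic. It is not the published implementation: the function names (`score_76gs_like`, `signed_ks`), the toy gene names, and the input layout are illustrative assumptions, and details such as gene-list curation, centering, and normalization are omitted.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def score_76gs_like(expr, signature, anchor="CDH1"):
    """Weighted sum of signature-gene expression, one score per sample.

    expr: {gene: [expression per sample]} (hypothetical layout).
    Each gene's weight is its correlation with the anchor gene (CDH1)
    across the samples of the dataset, so higher scores = more epithelial.
    """
    weights = {g: pearson(expr[g], expr[anchor]) for g in signature}
    n_samples = len(expr[anchor])
    return [sum(weights[g] * expr[g][j] for g in signature)
            for j in range(n_samples)]

def signed_ks(epi_values, mes_values):
    """Signed two-sample KS statistic in [-1, 1] for a single sample.

    Compares the empirical CDFs of epithelial and mesenchymal gene
    expression; the sign convention (positive = mesenchymal genes shifted
    to higher expression) is an assumption chosen to match the text.
    """
    best = 0.0
    for x in sorted(set(epi_values) | set(mes_values)):
        f_e = sum(v <= x for v in epi_values) / len(epi_values)
        f_m = sum(v <= x for v in mes_values) / len(mes_values)
        d = f_e - f_m
        if abs(d) > abs(best):
            best = d
    return best

# Toy example: three samples ordered from epithelial to mesenchymal.
toy = {"CDH1": [8, 6, 2], "EPCAM": [7, 5, 1], "VIM": [1, 4, 9]}
scores = score_76gs_like(toy, ["CDH1", "EPCAM", "VIM"])
```

In this toy example the 76GS-like score decreases from the first to the last sample, while `signed_ks` returns a positive value when the mesenchymal genes are the more highly expressed set.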
We extended the previous MLR framework trained on microarray-based transcriptomics of NCI-60 series to impute log2-normalized FPKM or TPM RNA-seq data. To achieve this, the log2 RNA-seq values for 3 predictors (CLDN7, VIM, CDH1) and 20 normalizers were linearly mapped to their corresponding microarray values (Fig 1A). This mapping was estimated for both FPKM- and TPM-normalized data by averaging over 24 previously published samples (Zhao et al., 2014) where log2 microarray and log2 RNA-seq expression signatures were simultaneously available (Fig S1, S2). The output of the updated MLR approach assigns a numerical EMT score, S, on the scale of [0, 2] based on the probability of a sample’s categorization into one of three groups: E, E/M and M.
To check the concordance of these three EMT scoring metrics, we calculated the 76GS, KS and MLR scores for 77 bulk high-throughput transcriptomics datasets. As expected, we found 76GS scores to be negatively correlated (r < −0.3; p < 0.045) with the MLR and KS scores, and found a positive correlation (r > 0.3; p < 0.05) between the MLR and KS scores across most datasets that contain cell lines and primary tumors across cancer types (Fig 1B-D; Table S1). 44 out of 77 (57.14%) datasets showed all three trends significantly (KS vs. MLR, MLR vs. 76GS and 76GS vs. KS). Additionally, 52 (67.53%) cases exhibit expected trends for 76GS vs. KS, compared to 56 (72.73%) for MLR vs. 76GS, and 57 (74.03%) for MLR vs. KS (Fig 1E). Thus, the MLR, 76GS, and KS scoring metrics show strong concordance among themselves for these 77 datasets.
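The per-dataset concordance check described above can be sketched as follows. This is an illustrative outline only: the function name `expected_trends` and the toy score vectors are assumptions, and only the correlation-coefficient threshold (|r| > 0.3) is applied here; the significance test on p-values is left out because it needs a t-distribution routine that the standard library does not provide.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def expected_trends(gs76, ks, mlr, r_cut=0.3):
    """Return pairwise correlations and whether each has the expected sign.

    Expected: 76GS vs KS negative, 76GS vs MLR negative, KS vs MLR positive.
    """
    r = {
        "76GS_vs_KS": pearson(gs76, ks),
        "76GS_vs_MLR": pearson(gs76, mlr),
        "KS_vs_MLR": pearson(ks, mlr),
    }
    ok = {
        "76GS_vs_KS": r["76GS_vs_KS"] < -r_cut,
        "76GS_vs_MLR": r["76GS_vs_MLR"] < -r_cut,
        "KS_vs_MLR": r["KS_vs_MLR"] > r_cut,
    }
    return r, ok

# Toy dataset of four samples (values are made up for illustration).
r, ok = expected_trends(
    gs76=[3.0, 2.0, 1.0, 0.0],
    ks=[-0.5, -0.1, 0.2, 0.6],
    mlr=[0.2, 0.8, 1.1, 1.9],
)
all_three = all(ok.values())
```

Counting the datasets for which `all_three` is true, and dividing by the number of datasets, gives the kind of fraction reported in the text (for example, 44 of 77).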
Next, we investigated several individual datasets where EMT/MET phenomenon was induced in different tissues and cell lines. We found that the Py2T murine epithelial tumor cells that exhibit reversible EMT upon treatment with TGF-β in vitro had lower KS and MLR scores but higher 76GS scores as compared to MTΔEcad cells that represent irreversible EMT murine mammary gland tumor cells (Fig 2A; GSE118612) (Ishay-Ronen et al., 2019). Thus, all three EMT scores captured the expected trend of Py2T cells being more epithelial relative to MTΔEcad (murine breast cancer cells with ablated E-cadherin). Further, in mammary epithelial cells (MCF10A), depletion of Runx1 results in striking morphological changes consistent with EMT (Hong et al., 2017). Consistently, Runx1 depleted MCF10A cells had higher KS and MLR scores, but lower 76GS scores (Fig 2B; GSE85857). Similarly, TGF-β treated primary airway epithelial cells as well as TGF-β and EGF treated HeLa cells had a more mesenchymal profile as assessed by 76GS, KS and MLR scores, consistent with their reported experimental trends (Fig 2C-D; GSE72419, GSE61220) (Tian et al., 2015; V. Tripathi et al., 2016). These scores were also able to recapitulate in vitro observations that, while TGF-β treatment was able to induce EMT in MCF10A cells, the extent of EMT induced was decreased upon knockdown of ZEB1 (Fig 2E, GSE1248423) (Watanabe et al., 2019), a key EMT-inducing transcription factor in many cancers (Drápela et al., 2020). ZEB1 forms a mutually inhibitory feedback loop with GRHL2, a crucial MET-inducing factor, and knockdown of GRHL2 is known to push epithelial or hybrid E/M cells into a more mesenchymal phenotype (Chung et al., 2016; Cieply et al., 2013; Jolly et al., 2016; Mooney et al., 2017). Therefore, OVCA429 cells with GRHL2 knockdown had higher MLR and KS scores, but reduced 76GS scores, as compared to control, reflective of their more mesenchymal status (Fig 2F, GSE118407) (Chung et al., 2019). 
Similarly, Grhl2-null embryos had reduced levels of other gatekeepers of epithelial phenotype (Ovol1, Ovol2 and miR-200 family (Aue et al., 2015; Chung et al., 2016; Jia et al., 2015)) and elevated levels of Zeb1, commensurate with their altered KS, 76GS and MLR scores (Fig 2G, GSE106130) (Carpinelli et al., 2020). ZEB1 is directly activated by Twist (Dave et al., 2011), another well-characterized EMT inducer (Yang et al., 2004). Thus, activation of Twist in HMLE (human mammary gland epithelial cells) corresponded to higher KS and MLR scores and reduced 76GS scores (Fig 2I; GSE139074).
Similarly, these scoring metrics could recapitulate activation of EMT in pre-malignant immortalized and Ras-transformed HMECs (human mammary epithelial cells) as compared to primary HMECs (GSE110677; Fig 2H). Finally, in the context of renal fibrosis caused by loss of HNF-1β (Chan et al., 2018), HNF-1β deficient mIMCD3 renal epithelial cells showed upregulated mesenchymal traits relative to wild-type cells, as again captured by KS, 76GS and MLR scores (Fig 2J; GSE97770) (Table S2). Together, these case studies demonstrate that each scoring metric can capture the extent of EMT induced upon various perturbations, consistent with enrichment of EMT depicted by the Hallmark EMT gene set reported in MSigDB (Molecular Signatures Database) (Fig S3) (Liberzon et al., 2011).
Single-cell RNA-seq data analysis reveals heterogeneity along the EMP spectrum
After investigating bulk RNA-seq datasets, we calculated EMT scores for 17 single-cell RNA-seq datasets using 76GS, KS and MLR metrics. For example, in a dataset containing 5902 single cells isolated from 18 patients with oral cavity tumors (head and neck squamous cell carcinoma), we observed a negative correlation between 76GS and KS scores and between 76GS and MLR scores, but a positive correlation between the KS and MLR scores (Fig 3A; GSE103322) (Puram et al., 2017). This trend was largely seen across other single-cell RNA-seq datasets as well, where, like our previous results for bulk RNA-seq datasets, roughly 65% (11/17) of datasets showed negative correlation for 76GS vs. KS scores, 59% (10/17) of datasets had negative correlation for MLR vs. 76GS, and 53% (9/17) exhibited a positive correlation for MLR vs. KS (Fig 3B-C, S4A). Thus, the concordant trends observed for these metrics using bulk transcriptomics were found to be conserved for single-cell RNA-seq datasets as well (Table S3).
Next, we plotted the histograms for EMT scores of various single-cell RNA-seq datasets to decipher the heterogeneity seen along the EMP spectrum across a variety of biological contexts: 1. human embryonic stem-cell-derived progenitor cells differentiating to endoderm (GSE75748) (Chu et al., 2016); 2. human fetal pituitary gland development including progenitors of many endocrine cell types and subtypes (GSE142653) (Zhang et al., 2020); 3. cells from different tissues and organs of E9.5 to E11.5 mouse embryos (GSE87038) (Dong et al., 2018); 4. MCF10A cells treated with TGF-β for varying durations and exhibiting a gradual change in their EMT status (PRJNA698642) (Deshmukh et al., 2021); 5. murine pancreatic duct cells with variations along the EMP spectrum (GSE159343) (Hendley et al., 2021); 6. EpCAM+ and EpCAM− squamous skin carcinoma cells with varied epithelial and/or mesenchymal features (GSE110357) (Pastushenko et al., 2018); 7. cells from oral cavity tumors/head and neck squamous cell carcinoma (GSE103322) (Puram et al., 2017); 8. human colorectal cancer cell lines and tumors (GSE81861) (Li et al., 2017); and 9. mouse hair follicle stem cells and transit-amplifying cells (GSE90848) (Yang et al., 2017). Across these cases, we observed two distinct peaks in KS scoring metrics (Fig 3C), suggesting the presence of at least two major subpopulations with varied EMT status.
Of note, several plots for the 76GS and MLR metrics appeared saturated, which we hypothesized was related to the relative sparsity of predictor signal in the single-cell datasets. For the MLR approach, we then restricted our analysis to datasets with at least 90% of all single-cell samples containing nonzero entries for each predictor, indicating the presence of measurable signal. In these cases, MLR and 76GS metrics were able to recapitulate the trends observed in KS for many such datasets (Fig S5A-B).
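One way to express this sparsity criterion is sketched below. The function name, the per-cell dictionary layout, and the reading of the criterion (a cell counts only if all three MLR predictors are nonzero in it) are assumptions made for illustration; an alternative reading would apply the 90% threshold to each predictor separately.

```python
PREDICTORS = ("CLDN7", "VIM", "CDH1")  # MLR predictors named in the text

def passes_sparsity_filter(cells, predictors=PREDICTORS, min_fraction=0.9):
    """Return True if enough cells carry signal for every predictor.

    cells: list of {gene: expression} dicts, one per single cell
           (hypothetical layout). A cell counts as informative only if
           all predictors have a nonzero entry in that cell.
    """
    if not cells:
        return False
    informative = sum(
        all(cell.get(gene, 0) > 0 for gene in predictors) for cell in cells
    )
    return informative / len(cells) >= min_fraction

# Toy dataset: 9 of 10 cells have nonzero values for all three predictors.
good = {"CLDN7": 2.1, "VIM": 0.7, "CDH1": 3.4}
dropout = {"CLDN7": 2.0, "VIM": 0.0, "CDH1": 1.2}
dataset_kept = passes_sparsity_filter([good] * 9 + [dropout])
dataset_dropped = passes_sparsity_filter([good] * 8 + [dropout] * 2)
```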
Quantifying the EMP spectrum during lung diseases and cellular reprogramming
All three EMT metrics (76GS, KS, MLR) have been designed and/or trained for quantifying the EMT status in cancer samples (Byers et al., 2013; George et al., 2017; Tan et al., 2014), but our single-cell RNA-seq data analysis suggests their applicability in various developmental contexts. Thus, we investigated whether they can be applied more broadly to quantify EMT status in biological processes other than cancer. For both microarray and RNA-seq datasets (Table S4), we used these metrics to calculate EMT status for lung diseases including chronic obstructive pulmonary disease (COPD) and idiopathic pulmonary fibrosis (IPF) where EMT is reported to be involved in initiating and/or aggravating the disease (Jolly et al., 2018).
As compared to normal lung tissues, the fibrotic lung tissues from IPF patients had higher MLR and KS scores but lower 76GS scores, indicating their enhanced mesenchymal status (Fig 4A, i; GSE72073). Fibrotic lung tissues had reduced levels of USP13, a deubiquitylase that stabilizes PTEN, and in vitro analysis suggested that USP13 deficiency increased invasive and migratory capacities of fibroblasts, traits usually associated with EMT (Geng et al., 2015). Similarly, relative to healthy volunteers, COPD patients showed increased EMT in their bronchoalveolar lavage (BAL) cells (Fig 4A, ii; GSE73395). Consistently, as compared to normal lung tissue, patients with any of the three lung pathological situations – IPF, non-specific interstitial pneumonia (NSIP) and mixed IPF-NSIP – exhibited trends of enhanced EMT (Fig 4A, iii; GSE110147) (Cecchini et al., 2018). Further, RNA-seq analysis of lung tissues of patients with acute lung injury (ALI) and IPF had higher MLR and KS scores but reduced 76GS scores (Fig 4A, iv; GSE134692), consistent with earlier reports (Cabrera-benítez et al., 2012; Gouda et al., 2018; Sivakumar et al., 2019).
After investigating these few examples, we analyzed the trends among KS, MLR and 76GS scores obtained for 46 microarray or RNA-seq datasets associated with lung injury. Reinforcing the trends seen for cancer-related datasets, 76GS and MLR scores were negatively correlated in roughly 72% (33/46) of datasets. Similarly, 76GS and KS scores correlated negatively in ~41% (19/46) of datasets. Further, KS and MLR scores correlated positively in ~54% (25/46) of datasets (Fig 4B, S4B). Overall, we see a strong concordance among the three EMT scoring metrics for non-cancerous lung diseases too.
Further, we investigated a set of datasets related to cellular reprogramming of differentiated cell types to induced pluripotent stem cells (iPSCs), where EMT/MET are reportedly involved (Lai et al., 2020) (Table S5). Across 92 datasets for which we calculated the 76GS, KS and MLR EMT scores, roughly 62% (57/92) showed a positive correlation between MLR and KS, while 67% (62/92) showed a negative correlation between KS and 76GS, and approximately 74% (68/92) showed a negative correlation between 76GS and MLR. Overall, 54% (50/92) of datasets demonstrated all three pairwise correlations to be strong (Fig 4C, S4C), thus endorsing that these EMT scoring metrics can be quite consistent with one another in terms of identifying the EMP status of cells en route to cellular reprogramming.
Context-specific association of EMP status with patient survival
Next, we quantified EMT scores in patient samples using TCGA datasets of various cancer types. Here too, we found the expected trends: 76GS scores correlate negatively with the MLR and KS scores, while KS and MLR scores correlate positively with each other (Fig 5A), reinforcing our observations for a pan-cancer analysis of microarray datasets (Table S6) (Vasaikar et al., 2021). We also calculated single-sample Gene Set Enrichment Analysis (ssGSEA) scores (Subramanian et al., 2005) using the EMT gene set from MSigDB (Liberzon et al., 2011). Each ssGSEA enrichment score represents the degree to which the genes in a particular gene set are coordinately regulated within a sample. We find that the ssGSEA scores for EMT show, as expected, negative correlation with 76GS scores and positive correlation with MLR and KS scores.
We also assessed the association between EMT scores and patient survival using different survival data types (Overall survival (OS), Disease-Specific Survival (DSS), Progression free interval (PFI) and Disease-free interval (DFI)) in various TCGA cancer cohorts. The samples were scored using all three methods and segregated into high and low groups based on the mean value of each EMT score. The 76GSlow subgroup can be thought of as comparable to the MLRhigh and KShigh groups, given their relatively strong M signature. In bladder cancer (BLCA; Fig 5B, top), we see consistent trends in overall survival (OS) for all three types of EMT scores: the stronger the M phenotype, the poorer the survival probability. In Low-Grade Glioma (LGG), by contrast, we see the opposite trend for 76GS and MLR: the stronger the M phenotype, the better the survival probability (Fig 5B, middle). Similarly, in Thyroid Cancer (THCA) and Kidney Chromophobe (KICH), higher MLR scores reflect worse survival outcomes, but in pancreatic adenocarcinoma (PAAD), higher MLR scores associate with better outcomes (Fig 5B, bottom), indicating a context-specific association of the extent of EMT with patient survival. These trends for OS were also seen in 76GS and KS scores, where the hazard ratio (HR) > 1 and HR < 1 scenarios were both observed (Fig 5C, S6) depending on the cancer subtype in TCGA.
After investigating OS data, we calculated the survival probabilities through other metrics as well – DSS, PFI and DFI, wherever available. For DSS, we found that Kidney Papillary Cell Carcinoma (KIRP) samples having higher MLR scores (or, correspondingly, lower 76GS scores) showed poorer survival. This trend held for PFI as well, with high KS scores reflecting poorer survival (Fig 6A; columns 1, 2). In contrast, in LGG samples, higher (resp. lower) MLR (resp. 76GS) scores were associated with improved DSS and PFI (Fig 6B; columns 1, 2). The DFI for Head and Neck Cancer (HNSC) indicates a worse prognosis associated with high MLR scores, opposite to that seen for Uterine Carcinosarcoma (UCS) (Fig 6C), indicating that UCS samples with enriched M phenotypes correspond to improved survival.
Discussion
Quantifying the spectrum of epithelial-hybrid-mesenchymal cell states in cancer has garnered recent interest due to a surge in the availability of in vitro and in vivo spatial and/or temporal dynamic and high-throughput data at multiple levels - transcriptomic, proteomic, epigenetic, metabolic, and morphological (Bocci et al., 2019; Cook and Vanderhyden, 2020; Deshmukh et al., 2021; Devaraj and Bose, 2019; Jia et al., 2019; Johnson et al., 2021; Karacosta et al., 2019; McFaline-Figueroa et al., 2019; Serresi et al., 2021; Stylianou et al., 2019; Wang et al., 2020). Phenotypic plasticity and heterogeneity along the EMP spectrum have been postulated to be a more important criterion for defining the survival fitness of a cancer cell population than the predominance of a specific phenotype (Brown et al., 2021; Chakraborty et al., 2021), suggesting possible benefits to a more heterogeneous population through cooperation among cancer cells with varying EMP phenotypes (Neelakantan et al., 2017; Tsuji et al., 2008). For single cells, hybrid E/M phenotypes are believed to be the most plastic relative to their more ‘extreme’ epithelial and mesenchymal counterparts; such plasticity can amplify tumor-initiation ability (Kröger et al., 2019; Ruscetti et al., 2016). Therefore, characterizing EMP as a continuous spectrum instead of as an ‘all-or-none’ process becomes imperative for an improved understanding of emergent dynamics of EMT and MET, and their relevance to patient survival.
Here, we used three different EMT transcriptomic-based scoring metrics, each of which was developed using cancer cell lines and/or tumor samples, to quantify the extent of EMT on a continuum – 76GS, KS, MLR. Using these metrics, we calculated EMT scores for over 100 RNA-seq datasets – both at bulk and individual cell levels – across multiple biological contexts (cancer, fibrosis, COPD, and cellular reprogramming to iPSC). We observed that these methods show a high degree of concordance among themselves in their ability to identify the extent of EMT/MET a sample has undergone, despite using different gene lists and algorithms. This concordance suggests an overlap of core expression patterns central to EMT in a high-dimensional feature space and indicates that these metrics – initially developed for cancer samples – can be applied more generally to a broader range of biological contexts. Using these metrics in biological contexts where hybrid E/M states have been proposed (Aban et al., 2020; Grande et al., 2015; Jolly et al., 2018) may be helpful in mapping the corresponding trajectories of EMT/MET. The hybrid E/M state has been previously documented at both bulk and single-cell levels during various stages of development as well (Dong et al., 2018; Leroy and Mostov, 2007); here, we have shown proof-of-principle that these scoring metrics can successfully quantify the extent of EMP in mouse models. However, whether they can be adapted to adequately investigate the role of EMT in applications for other non-human model organisms remains to be investigated.
In our analysis of single-cell RNA-seq data, bimodal distributions consistent with two sub-populations were best resolved by the KS scoring metric. Improvement in the MLR and 76GS approaches was observed when restricting the analysis to datasets having non-zero MLR predictors for a majority (>90%) of single-cell samples. These scores, while concentrated in the middle of the EMT interval, were able to recover features of the distributions observed via the KS method. Together, this suggests that the development of optimized signal-to-noise criteria may improve the absolute placement of samples on the MLR EMT spectrum, which is a focus of ongoing research effort. Future efforts should consider how these metrics can be adapted to investigate different cell-state transition trajectories, for instance, by defining a two-dimensional EMT score that can deconvolute gains in the mesenchymal program vs. losses in the epithelial one (Foroutan et al., 2018).
These metrics have been helpful in investigating the association of EMT/MET with other axes of cellular plasticity such as stemness (Bocci et al., 2018), immune evasion (Li et al., 2019) and sensitivity to anti-cancer agents (Wang et al., 2021). Intriguingly, EMT status of primary tumors was not found to be universally correlated with worse patient survival, but instead showed a context-dependent trend, consistent with previous reports (Tan et al., 2014). EMP is a highly dynamic trait. Thus, capturing static snapshots of gene expression profiles may not be sufficient for recapitulating the dynamic dependence of cancer cell fitness on EMT and/or MET. Consequently, the EMT status and/or heterogeneity of a primary tumor may not reflect that of circulating tumor cells (CTCs) and their metastatic potential, leading to such observed context-specific trends. Moreover, transcriptomic profiles alone may not be sufficient to indicate phenotypic variability; incorporating epigenetic and/or metabolic status may help elucidate the manifestations of dynamic adaptation during metastasis. Understanding the interplay among EMT, metabolic and epigenetic reprogramming (Dumont et al., 2008; Jia et al., 2021, 2019; Peixoto et al., 2019; Serresi et al., 2021) will be key for better patient stratification and therapeutic strategies.
Materials and Methods
Software and Datasets
We downloaded high-throughput transcriptomics data (bulk and single cell) from GEO and EMBL-EBI databases. Microarray datasets were downloaded using the GEOquery R/Bioconductor package (Davis and Meltzer, 2007). TCGA expression and survival data were obtained from the UCSC Xena browser (https://xena.ucsc.edu/). Statistical analysis, survival analysis and plotting were all done in R version 4.0.3; ggplot2 was used for plots.
Preprocessing of datasets
After downloading the HTS datasets, quality checks were done with FastQC (Andrews, 2010) (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/). Bulk and single-cell RNA-seq data were aligned to the appropriate reference genome (hg38 or mm10) using the STAR aligner (Dobin et al., 2013). Samtools (Li et al., 2009) was used to modify alignment files (SAM/BAM) and htseq-count (Anders et al., 2015) was used to calculate the read counts. Using these read counts, TPM expression was calculated using custom scripts and log2-normalized TPM values were used for calculation of EMT scores. Microarray datasets were preprocessed to obtain the gene-wise expression for each sample from the probe-wise expression matrix. If multiple probes mapped to one gene, the mean expression of all the mapped probes was used for that gene.
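The two preprocessing steps described here, converting read counts to log2 TPM and collapsing probes to genes by their mean, can be sketched as below. This is not the authors' custom script: the function names, the use of gene lengths in kilobases, and the pseudocount of 1 before taking log2 are assumptions made for illustration.

```python
import math
from collections import defaultdict

def log2_tpm(counts, lengths_kb, pseudocount=1.0):
    """Convert raw read counts to log2(TPM + pseudocount) for one sample.

    counts:     {gene: read count}
    lengths_kb: {gene: gene length in kilobases} (assumed to be available)
    """
    rate = {g: counts[g] / lengths_kb[g] for g in counts}  # reads per kb
    total = sum(rate.values())
    if total == 0:
        raise ValueError("sample has no mapped reads")
    return {g: math.log2(rate[g] / total * 1e6 + pseudocount) for g in rate}

def collapse_probes(probe_expr, probe_to_gene):
    """Average all probes that map to the same gene (microarray data)."""
    grouped = defaultdict(list)
    for probe, value in probe_expr.items():
        gene = probe_to_gene.get(probe)
        if gene is not None:  # skip probes without a gene annotation
            grouped[gene].append(value)
    return {gene: sum(vals) / len(vals) for gene, vals in grouped.items()}

# Toy RNA-seq sample: equal read density per kb gives equal TPM.
tpm = log2_tpm({"A": 100, "B": 300}, {"A": 1.0, "B": 3.0})

# Toy microarray sample: two probes for CDH1, one for VIM.
genes = collapse_probes(
    {"p1": 2.0, "p2": 4.0, "p3": 5.0},
    {"p1": "CDH1", "p2": "CDH1", "p3": "VIM"},
)
```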
Calculation of EMT scores
EMT scores were calculated using all three methods (76GS, KS and MLR), as previously done for microarray datasets (Chakraborty et al., 2020). The MLR method, which was designed for microarray datasets (George et al., 2017), was adjusted to work for HTS transcriptomics data as well.
MLR model applied to RNA-Seq
We adapted a previously developed method of quantifying the EMT spectrum that was trained on, and designed to predict, microarray samples (George et al., 2017). Very briefly, VIM, CDH1, and CLDN7 transcripts were identified to maximally predict NCI-60 holdout samples in leave-one-out assessment utilizing two-dimensional multinomial logistic regression (MLR). These, together with a list of 20 normalizers, enable the assignment of each input (CLDN7, VIM/CDH1) to an ordered triple (P_E, P_E/M, P_M) that characterizes the probability that a signature belongs to the Epithelial (E), hybrid (E/M), or Mesenchymal (M) group. This ordered triple is then projected onto the interval [0, 2], with 0 designating a fully epithelial signature, 1 a maximally hybrid signature, and 2 a fully mesenchymal signature.
In order to apply the microarray-based MLR model to RNA-seq data, we utilized transcriptomic data available in both formats on biological replicates (Zhao et al., 2014). Using the data available in Fig. 2 (2 biological replicates for each of 6 time points), we restricted our analysis to the intersection of microarray and RNA-seq transcripts for genes represented in positive abundance for both datasets. Linear regression was applied to the average of the biological replicates at each time point, producing a total of 6 slope-intercept pairs, which were then averaged to be used as the fit parameters for cross-platform assessment.
From these, unique microarray values, x_μArray, representative of RNA-seq values may be calculated by inversion of the fitted linear relation.
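The cross-platform mapping can be illustrated with the sketch below. It assumes, based on the word "inversion", that each fit has the form x_RNAseq ≈ m · x_μArray + b on log2 values, so that the averaged parameters give x_μArray = (x_RNAseq − b) / m. The function names and the toy numbers are hypothetical, and the fitted parameter values used in the study are not stated in the text.

```python
def fit_line(x, y):
    """Ordinary least-squares slope and intercept for y = m * x + b."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    m = (sum((a - mx) * (b - my) for a, b in zip(x, y))
         / sum((a - mx) ** 2 for a in x))
    return m, my - m * mx

def average_fit(pairs):
    """Fit one line per time point and average the slope-intercept pairs.

    pairs: list of (microarray_values, rnaseq_values), one entry per
           time point, each already averaged over biological replicates.
    """
    fits = [fit_line(array_vals, seq_vals) for array_vals, seq_vals in pairs]
    slope = sum(f[0] for f in fits) / len(fits)
    intercept = sum(f[1] for f in fits) / len(fits)
    return slope, intercept

def rnaseq_to_microarray(x_seq, slope, intercept):
    """Invert the averaged linear map to get a microarray-scale value."""
    return (x_seq - intercept) / slope

# Toy data for two time points (made-up log2 values).
slope, intercept = average_fit([
    ([1.0, 2.0, 3.0], [3.0, 5.0, 7.0]),   # fit: m = 2, b = 1
    ([0.0, 1.0, 2.0], [3.0, 5.0, 7.0]),   # fit: m = 2, b = 3
])
x_array = rnaseq_to_microarray(8.0, slope, intercept)
```

The mapped values for the three predictors and the normalizers would then be passed to the microarray-trained MLR model unchanged.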
Survival Analysis
Different metrics of survival data were obtained from TCGA cohorts. All samples were divided into 76GShigh and 76GSlow, MLRhigh and MLRlow, KShigh and KSlow groups based on the mean (or median) of the respective scores of the samples. Kaplan–Meier analysis was done using R package “survival” and plotted using R package “ggfortify”. The log-rank test was used to calculate the p-values. The reported hazard ratios (HR) and 95% confidence intervals (CI) were estimated using Cox regression.
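As an illustration of the grouping and testing steps (the study itself used the R package “survival”), the sketch below splits samples at the mean score and computes a two-group log-rank statistic in plain Python. The function names and toy data are assumptions; the p-value uses the chi-square distribution with one degree of freedom, and Cox regression is not shown.

```python
import math

def split_by_mean(scores):
    """Label each sample 'high' or 'low' relative to the mean score."""
    cut = sum(scores) / len(scores)
    return ["high" if s > cut else "low" for s in scores]

def logrank(times, events, groups):
    """Two-group log-rank test.

    times:  follow-up time per sample
    events: 1 if the event was observed, 0 if censored
    groups: 'high' or 'low' per sample
    Returns (chi-square statistic, p-value with 1 degree of freedom).
    """
    event_times = sorted({t for t, e in zip(times, events) if e})
    obs_minus_exp, var = 0.0, 0.0
    for t in event_times:
        n = sum(1 for ti in times if ti >= t)                 # at risk
        n1 = sum(1 for ti, g in zip(times, groups)
                 if ti >= t and g == "high")                  # at risk, high
        d = sum(1 for ti, e in zip(times, events) if ti == t and e)
        d1 = sum(1 for ti, e, g in zip(times, events, groups)
                 if ti == t and e and g == "high")
        obs_minus_exp += d1 - d * n1 / n
        if n > 1:
            var += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    if var == 0:
        return 0.0, 1.0
    chi2 = obs_minus_exp ** 2 / var
    return chi2, math.erfc(math.sqrt(chi2 / 2))

# Toy cohort: the two high-score samples have the shortest survival.
groups = split_by_mean([0.1, 0.2, 1.8, 1.9])
chi2, p = logrank([10, 12, 2, 3], [1, 1, 1, 1], groups)
```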
T-test
A two-tailed Student’s t-test with unequal variance was used to compare samples in the bar plots. Error bars denote the standard deviation (statistical significance at p < 0.05).
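For reference, the unequal-variance (Welch) form of the t-test can be written out as in the sketch below, which computes the test statistic and the Welch–Satterthwaite degrees of freedom. The function name and toy values are illustrative; converting the statistic to a two-tailed p-value needs a t-distribution routine (for example, `scipy.stats.ttest_ind` with `equal_var=False`), which is not included here.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic and degrees of freedom for two samples."""
    va = variance(a) / len(a)   # squared standard error of group a
    vb = variance(b) / len(b)   # squared standard error of group b
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Toy comparison of EMT scores in two conditions (made-up values).
t_stat, dof = welch_t([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
```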
ssGSEA
Single-sample GSEA (ssGSEA), an extension of Gene Set Enrichment Analysis (GSEA), calculates separate enrichment scores for each pairing of a sample and gene set. Each ssGSEA enrichment score represents the degree to which the genes in a particular gene set are coordinately up- or down-regulated within a sample. We used the “HALLMARK_EMT” gene set from the Molecular Signatures Database (MSigDB), and the scores were calculated using R package “ssgsea”.
Code availability
All code used in the manuscript is available at https://github.com/sushimndl/EMT_Scoring_RNASeq
Conflict of Interest
SSS reports personal fees for lectures from Chiesi, outside the submitted work. All the other authors declare no conflict of interest.
Author contributions
MKJ, JTG, and HL conceived of and designed the research; MKJ and JTG supervised the research; SM, TT, RJ, SS and PT performed the research. SM, TT, SS, PC and SSS analyzed and interpreted the data. All authors contributed to manuscript writing.
Supplementary Figures
Legends for Supplementary Tables
Table S1. Details of 77 bulk RNA-seq datasets (GSE ID, PMID, no. of samples, and pairwise correlation coefficient and p-values).
Table S2. Details of sample IDs in individual datasets that have been considered for comparative analysis in Fig 2.
Table S3. Details of 19 single-cell RNA-seq datasets (GSE ID, PMID, no. of samples, and pairwise correlation coefficient and p-values).
Table S4. Details of 46 RNA-seq and microarray datasets pertaining to COPD, IPF and other lung diseases (GSE ID, PMID, no. of samples, and pairwise correlation coefficient and p-values).
Table S5. Details of 92 RNA-seq and microarray datasets pertaining to cellular reprogramming (GSE ID, PMID, no. of samples, and pairwise correlation coefficient and p-values).
Table S6. Details of datasets included in EMT-ome (Vasaikar et al., 2021) together with pairwise correlation coefficient and p-values among EMT scoring metrics.
Acknowledgements
MKJ was supported by Ramanujan Fellowship awarded by SERB, DST, Government of India (SB/S2/RJN-049/2018) and by the InfoSys Young Investigator Fellowship awarded by InfoSys Foundation, Bangalore. SSS is supported by grants from Clifford Craig Foundation Launceston General Hospital and Rebecca L Cooper Medical Research Foundation.