Abstract
Purpose Cancer is a highly complex disease caused by multiple genetic factors. MicroRNA (miRNA) and mRNA expression profiles are useful for identifying prognostic biomarkers for cancer. Kidney renal clear cell carcinoma (KIRC) was selected for our analysis because it accounts for more than 70% of all renal malignant tumor cases.
Methods Traditional methods of identifying cancer prognostic markers may not be accurate. Tensor decomposition (TD) is a useful method for uncovering the underlying low-dimensional structures in a tensor. A TD-based unsupervised feature extraction method was applied to analyze mRNA and miRNA expression profiles. Biological annotations of the prognostic miRNAs and mRNAs were examined using pathway and oncogenic signature databases, i.e., DIANA-miRPath and MSigDB.
Results TD identified the miRNA signatures and the associated genes. These genes were found to be involved in cancer-related pathways, and 23 genes were significantly correlated with the survival of KIRC patients. We demonstrated that the results are robust and not highly dependent upon the database we selected. Compared with the t-test, we showed that TD achieves a much better performance in selecting prognostic miRNAs and mRNAs.
Conclusion These results suggest that integrated analysis using the TD-based unsupervised feature extraction technique is an effective strategy for identifying prognostic signatures in cancer studies.
1. Introduction
Cancer is a highly complicated and heterogeneous disease. It is the result of a loss of cell cycle control (Vargas-Rondon, Villegas, & Rondon-Lagos, 2017), which is due to accumulation of genetic mutations, gene duplication (Hanahan & Weinberg, 2011), and aberrant epigenetic regulation (Feinberg & Vogelstein, 1983; Rouhi, Mager, Humphries, & Kuchenbauer, 2008). Genetic mutations involving activation of proto-oncogenes to oncogenes (OCG) and inactivation of tumor suppressor genes (TSG) may cause cancer by altering transcription factors (TF), such as the p53 and ras oncoproteins, which in turn control the expression of other genes. Gene duplication causes an elevated level of the protein product and thus favors the proliferation of cancer cells. MicroRNAs (miRNAs) are a class of small non-coding RNAs that bind to messenger RNA (mRNA) and induce either its cleavage or translational repression. Several studies have indicated that abnormal miRNA expression is associated with carcinogenesis (Medina & Slack, 2008). miRNAs can contribute to cancer by acting as either OCGs or TSGs. For example, an miRNA that targets the mRNA of a TSG would induce loss of the protective effect of the TSG (Medina & Slack, 2008; Zhang, Dahlberg, & Tam, 2007). Although there have been many advancements in cancer therapy and diagnosis, many patients do not recover or experience recurrence after treatment. Accordingly, miRNA expression profiles are useful for identifying prognostic biomarkers for cancer. For instance, dysregulated miRNAs were identified in urothelial carcinoma of the bladder (Inamoto et al., 2018). Recent studies also suggested that miRNAs could be used as prognostic biomarkers for patients with pancreatic adenocarcinoma (Shi et al., 2018; Yu, Feng, & Cang, 2018).
Furthermore, a meta-analysis reported that a panel of eight miRNA signatures could serve as an effective marker for predicting overall survival in bladder cancer patients (Zhou et al., 2015). In this study, we selected kidney renal clear cell carcinoma (KIRC) for our analysis. KIRC is the most common subtype of renal malignant tumors, accounting for more than 70% of cases (Zhang et al., 2013). Several studies have identified a few miRNA signatures that are associated with the overall survival of KIRC patients (Lokeshwar et al., 2018; Luo et al., 2019; Xie et al., 2018).
Typical data structures in bioinformatics are difficult to analyze because of the small number of samples with many variables. Supervised feature extraction methods are effective for reducing the number of features, but overfitting can occur when supervised learning is applied. Regularization (sparse modeling) attempts to minimize the number of features by restricting the sum of coefficients attributed to features, thereby penalizing the use of additional variables. The disadvantage of regularization is that we must select the values of parameters that balance the prediction accuracy and the number of variables. There are two further issues with supervised feature extraction methods: (i) class labels may not always be correct and (ii) the dataset may contain more classes than those labeled. In contrast, unsupervised methods such as principal component analysis (PCA) are often used to generate a smaller number of variables through linear combinations of the original variables. The problem with this approach is that a linear combination of many variables is often difficult to interpret. An unsupervised methodology that is suitable for dimension reduction problems is tensor decomposition (TD)-based unsupervised feature extraction (FE) (Y. Taguchi, 2017; Y. Taguchi & Ng, 2018; Y.-h. Taguchi, 2019a, 2019b, 2019c; Y.-h. Taguchi & T. Turki, 2019; Y. H. Taguchi, 2017a, 2017b, 2017c, 2018a, 2018b, 2018c, 2019; Y. H. Taguchi & T. Turki, 2019). This method allows a smaller number of variables to be selected effectively and stably.
2. Materials and Methods
2.1 Tensors and tensor decomposition (TD)
A tensor [17] is a mathematical structure for storing datasets associated with more than two properties. If we measure miRNA and mRNA expression for the same samples using matrices, we cannot avoid storing the two measurements in two separate matrices. With a tensor, however, we can store both datasets in a single object, because a tensor can have more than two indices, whereas a matrix has only two.
TD [17] is a mathematical technique that approximates a tensor as a sum of terms, each of which is the outer product of vectors. Each vector represents an individual property (in this specific example, the vectors correspond to mRNAs, miRNAs, and samples).
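As a minimal illustration of the sum-of-outer-products idea, the Python sketch below builds a small three-way tensor from two weighted rank-one terms. The function name `rank_one_sum`, the weights, and the toy vectors are hypothetical and chosen only for illustration; the three axes stand in for mRNAs, samples, and miRNAs.

```python
def rank_one_sum(components):
    """Sum of weighted outer products.

    components is a list of (weight, a, b, c); each term contributes
    weight * a[i] * b[j] * c[k] to entry (i, j, k) of the tensor.
    """
    _, a0, b0, c0 = components[0]
    tensor = [[[0.0] * len(c0) for _ in b0] for _ in a0]
    for weight, a, b, c in components:
        for i, ai in enumerate(a):
            for j, bj in enumerate(b):
                for k, ck in enumerate(c):
                    tensor[i][j][k] += weight * ai * bj * ck
    return tensor


# Toy example (hypothetical values): 2 "mRNAs" x 2 "samples" x 3 "miRNAs".
toy_components = [
    (2.0, [1, 0], [1, 1], [1, 2, 3]),
    (1.0, [0, 1], [1, -1], [1, 0, 0]),
]
toy_tensor = rank_one_sum(toy_components)
```

A decomposition runs this construction in reverse: given the tensor, it looks for a small number of such vectors whose weighted outer products reproduce it as closely as possible.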
2.2 Tensor decomposition method
The miRNAseq and mRNAseq expression data for KIRC were retrieved from the TCGA Data Portal Research Network (https://gdcportal.nci.nih.gov/).
TD is a natural extension of matrix factorization and is regarded as a generalization of the singular value decomposition (SVD) method. It is a useful technique for uncovering the underlying low-dimensional structures in a tensor. There are two popular tensor decomposition algorithms: canonical polyadic decomposition (CPD) and Tucker decomposition (Rabanser, Shchur, & Günnemann, 2017). CPD, a rank decomposition method, expresses a tensor as the sum of a finite number of rank-one tensors. Tucker decomposition decomposes a tensor into a so-called core tensor and multiple matrices.
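For orientation, the two decompositions can be written for a generic three-way tensor as below. The symbols ($R$, $\lambda_r$, $a$, $b$, $c$, and the unlabeled $u$ matrices) are generic illustrative notation rather than the notation of any specific dataset. CPD uses a single shared index $r$ across all three modes, whereas Tucker decomposition gives each mode its own index and couples them through the core tensor $G$; when $G$ is superdiagonal, the Tucker form reduces to the CPD form.

```latex
% CPD: a sum of R rank-one tensors
x_{ijk} \approx \sum_{r=1}^{R} \lambda_r \, a_{ri} \, b_{rj} \, c_{rk}

% Tucker: a core tensor G combined with one factor matrix per mode
x_{ijk} \approx \sum_{l_1} \sum_{l_2} \sum_{l_3}
    G(l_1, l_2, l_3) \, u_{l_1 i} \, u_{l_2 j} \, u_{l_3 k}
```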
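TD-based unsupervised FE was applied to analyze mRNA and miRNA expression profiles. Let $x_{ij}^{(\mathrm{mRNA})}$ denote the expression of the $i$th mRNA ($i = 1, \ldots, N$) in the $j$th sample ($j = 1, \ldots, M$), and let $x_{kj}^{(\mathrm{miRNA})}$ denote the expression of the $k$th miRNA ($k = 1, \ldots, K$) in the $j$th sample. Both $x_{ij}^{(\mathrm{mRNA})}$ and $x_{kj}^{(\mathrm{miRNA})}$ were standardized to zero mean and unit variance. Next, we generated a case II type I tensor, that is,

$$x_{ijk} = x_{ij}^{(\mathrm{mRNA})} \, x_{kj}^{(\mathrm{miRNA})}$$

$x_{ijk}$ is subjected to Tucker decomposition as follows:

$$x_{ijk} = \sum_{l_1=1}^{N} \sum_{l_2=1}^{M} \sum_{l_3=1}^{K} G(l_1, l_2, l_3) \, u_{l_1 i}^{(\mathrm{mRNA})} \, u_{l_2 j} \, u_{l_3 k}^{(\mathrm{miRNA})}$$

where $G \in \mathbb{R}^{N \times M \times K}$ is the core tensor and $u_{l_1 i}^{(\mathrm{mRNA})} \in \mathbb{R}^{N \times N}$, $u_{l_2 j} \in \mathbb{R}^{M \times M}$, and $u_{l_3 k}^{(\mathrm{miRNA})} \in \mathbb{R}^{K \times K}$ are singular value matrices that are orthogonal. Because Tucker decomposition is not unique, we have to specify how it was derived. In particular, we chose higher-order singular value decomposition (HOSVD). Given that $x_{ijk}$ is too large for TD to be applied directly, we generated a case II type II tensor, which is given by:

$$x_{ik} = \sum_{j=1}^{M} x_{ijk}$$

By applying SVD to $x_{ik}$, we can obtain $u_{l_1 i}^{(\mathrm{mRNA})}$ and $u_{l_3 k}^{(\mathrm{miRNA})}$ (with $l_3 = l_1$) as

$$x_{ik} = \sum_{l_1=1}^{\min(N,K)} \lambda_{l_1} \, u_{l_1 i}^{(\mathrm{mRNA})} \, u_{l_1 k}^{(\mathrm{miRNA})}$$

Then, we can also obtain two sample-dependent vectors that correspond to mRNA and miRNA expression:

$$u_{l_1 j}^{(\mathrm{mRNA})} = \sum_{i=1}^{N} u_{l_1 i}^{(\mathrm{mRNA})} \, x_{ij}^{(\mathrm{mRNA})}, \qquad u_{l_3 j}^{(\mathrm{miRNA})} = \sum_{k=1}^{K} u_{l_3 k}^{(\mathrm{miRNA})} \, x_{kj}^{(\mathrm{miRNA})}$$

Genes and miRNAs are selected using the following quantities:

$$P_i = P_{\chi^2}\left[ > \left( \frac{u_{l_1 i}^{(\mathrm{mRNA})}}{\sigma_{l_1}} \right)^2 \right], \qquad P_k = P_{\chi^2}\left[ > \left( \frac{u_{l_3 k}^{(\mathrm{miRNA})}}{\sigma_{l_3}} \right)^2 \right]$$

where $P_{\chi^2}[> x]$ is the cumulative probability that the argument is greater than $x$ in a $\chi^2$ distribution, and $\sigma_{l_1}$ and $\sigma_{l_3}$ denote the standard deviations of $u_{l_1 i}^{(\mathrm{mRNA})}$ and $u_{l_3 k}^{(\mathrm{miRNA})}$, respectively. After the P-values are adjusted by means of the Benjamini–Hochberg (BH) criterion, miRNAs and mRNAs associated with adjusted P-values less than 0.01 are selected as those showing differences in expression between controls (normal tissues) and treated samples (tumors).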
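The procedure of this subsection can be sketched in Python with NumPy as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: all function names are hypothetical, standardization is assumed to be applied within each sample, the χ2 distribution is assumed to have one degree of freedom, and `ell = 1` (zero-based) corresponds to l1 = l3 = 2. It also assumes that the case II type II tensor is the product of the two expression profiles summed over the shared sample index, which is an ordinary matrix product, so the full three-way tensor never has to be stored.

```python
import math
import numpy as np


def standardize_per_sample(x):
    """Scale each column (sample) to zero mean and unit variance (assumed convention)."""
    return (x - x.mean(axis=0, keepdims=True)) / x.std(axis=0, keepdims=True)


def chi2_upper_tail_1dof(x):
    """P(X > x) for a chi-squared variable with one degree of freedom (assumption)."""
    return math.erfc(math.sqrt(x / 2.0))


def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    adjusted = [0.0] * n
    running_min = 1.0
    for rank in range(n, 0, -1):  # walk from the largest P-value down
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


def td_unsupervised_fe(x_mrna, x_mirna, ell=1):
    """x_mrna is N x M and x_mirna is K x M, with samples in the same order."""
    xm = standardize_per_sample(np.asarray(x_mrna, dtype=float))
    xk = standardize_per_sample(np.asarray(x_mirna, dtype=float))
    x_ik = xm @ xk.T  # N x K matrix: product summed over the sample index j
    u, _, vt = np.linalg.svd(x_ik, full_matrices=False)
    u_mrna, u_mirna = u[:, ell], vt[ell, :]  # feature-side singular vectors
    p_mrna = [chi2_upper_tail_1dof((v / u_mrna.std()) ** 2) for v in u_mrna]
    p_mirna = [chi2_upper_tail_1dof((v / u_mirna.std()) ** 2) for v in u_mirna]
    return {
        "sample_mrna": u_mrna @ xm,    # length-M vector attributed to mRNA
        "sample_mirna": u_mirna @ xk,  # length-M vector attributed to miRNA
        "adj_p_mrna": bh_adjust(p_mrna),
        "adj_p_mirna": bh_adjust(p_mirna),
    }


# Usage on random toy data (hypothetical sizes: 50 mRNAs, 8 miRNAs, 12 samples).
rng = np.random.default_rng(0)
result = td_unsupervised_fe(rng.normal(size=(50, 12)), rng.normal(size=(8, 12)))
selected_mrna = [i for i, p in enumerate(result["adj_p_mrna"]) if p < 0.01]
```

In a real analysis, the component `ell` would be chosen by inspecting which sample-side vectors separate tumors from normal tissues, rather than being fixed in advance.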
2.3 mRNA and miRNA expression
Expression profiles of the mRNA and miRNA were retrieved from TCGA. The samples consisted of 253 kidney tumors and 71 normal kidney tissues (M = 324). The number of mRNAs measured was N = 19536, and the number of measured miRNAs was K = 825.
Another dataset was downloaded from GEO (GEO ID GSE16441), and two files, GSE16441-GPL6480_series_matrix.txt.gz (for mRNA) and GSE16441-GPL8659_series_matrix.txt.gz (for miRNA), were used. A total of N = 33698 mRNAs and K = 319 miRNAs were measured for 17 patients and 17 healthy controls (M = 34).
2.4 Analysis of the correlation between miRNA and gene expression
Correlations between the sample-dependent singular value vectors $u_{l_1 j}^{(\mathrm{mRNA})}$ and $u_{l_3 j}^{(\mathrm{miRNA})}$ ($l_1 = l_3 = 2$) were quantified by Pearson's correlation coefficient (PCC). The PCC and P-values were calculated using the cor and cor.test functions in R, respectively.
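For readers not using R, the Pearson coefficient can be written out in a few lines of Python. This is a minimal sketch; the function names and the toy vectors are hypothetical. The t statistic shown is the standard one for testing a Pearson correlation against zero, with n − 2 degrees of freedom; converting it to a P-value requires a t-distribution routine, which is omitted here.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlation_t_statistic(r, n):
    """t statistic for H0: correlation = 0, with n - 2 degrees of freedom."""
    return r * sqrt((n - 2) / (1.0 - r * r))


# Toy vectors standing in for the two sample-dependent vectors (hypothetical values).
toy_mrna = [1, 2, 3, 4, 5]
toy_mirna = [2, 1, 4, 3, 5]
r = pearson_r(toy_mrna, toy_mirna)
t = correlation_t_statistic(r, len(toy_mrna))
```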
2.5 Biological function analysis
We evaluated the biological significance of the set of differentially expressed miRNAs and their correlated mRNAs. Biological annotations of the prognostic miRNAs and mRNAs were examined by employing the DIANA-miRPath (Vlachos et al., 2015) and MSigDB (Liberzon et al., 2015) databases, respectively.
3. Results
We applied TD-based unsupervised FE to the KIRC dataset retrieved from TCGA. The sample-dependent singular value vectors $u_{l_1 j}^{(\mathrm{mRNA})}$ and $u_{l_3 j}^{(\mathrm{miRNA})}$ with $l_1 = l_3 = 2$ differed between the normal and tumor samples. The P-values derived from the t-test were $7.10 \times 10^{-39}$ for mRNA and $2.13 \times 10^{-71}$ for miRNA. To see whether these two vectors are significantly correlated, we computed the PCC between them, which was 0.905 ($P = 1.63 \times 10^{-121}$), indicating that they are highly correlated.
The miRNA signatures and their significantly correlated genes are shown in Table 1. A total of 11 miRNAs and 72 genes were identified. To determine whether these miRNAs and mRNAs are significantly correlated, we computed the PCC for all 11 × 72 = 792 pairs. Among them, 353 pairs were positively correlated and 358 pairs were negatively correlated (P-values less than 0.01 after correction with the BH criterion). Therefore, 711 of the 792 pairs (approximately 90%) are significantly correlated, showing that the method successfully identified significantly correlated pairs of miRNAs and mRNAs. Among the 11 predicted miRNAs, one (miR-155) matched the result reported by Lokeshwar et al. (2018).
Next, in order to evaluate the biological significance of selected mRNAs, we determined the top 10 oncogenic signatures of the 72 genes reported by MSigDB (Table 2).
The results of the top 10 REACTOME pathways reported by MSigDB are summarized in Table 3.
These results suggest that the selected 72 mRNAs are likely related to oncogenesis. In order to further confirm if these 72 mRNAs are related to kidney cancer, we checked if these genes were linked to survival rates (Table 4). Among 72 mRNAs, 23 were significantly correlated with the survival of kidney cancer patients. This also highlights the effectiveness of our analysis.
We also evaluated the 11 identified miRNAs using DIANA-miRPath. Table 5 shows the enriched disease-related KEGG pathways (P-value < 0.05). The renal cell carcinoma pathway was identified with a significant P-value of 0.01613.
4. Discussion
The top signature in Table 2 is related to the cAMP signaling pathway. Targeting the cAMP pathway is an effective treatment for kidney cancer (Piazzon, Maisonneuve, Guilleret, Rotman, & Constam, 2012; Torres & Harris, 2014). The second signature in Table 2 is the Snf5 gene expression profile of a murine model (Mouse Embryonic Fibroblast (MEF) cells) that closely resembles that of human SNF5-deficient rhabdoid tumors (pediatric soft tissue sarcoma that arises in the kidney, the liver, and the peripheral nerves) (Isakoff et al., 2005). Impairment of the SWI/SNF chromatin remodeling complex plays an important role in the development and aggressiveness of clear cell renal cell carcinoma (Sarnowska et al., 2017). The sixth signature in Table 2 comes from a study of the effects of knockdown of the gene family of eukaryotic translation initiation factors (EIF) by RNAi in MCF10A cells. EIF3b is a promising prognostic biomarker and a potential therapeutic target for patients with clear cell renal cell carcinoma (Zang et al., 2017), and EIF4GI is a target for cancer therapeutics (Jaiswal, Koul, Palanisamy, & Koul, 2019).
The top pathway in Table 3 is the ‘Pathway of regulation of IGF activity by IGFBP’. Studies show that insulin-like growth factors (IGFs) and insulin play a stimulatory role for renal cancer cells (Braczkowski, Bialozyt, Plato, Mazurek, & Braczkowska, 2016; Solarek, Koper, Lewicki, Szczylik, & Czarnecka, 2019). Patients with IGF-1 receptor overexpression have a 70% increased risk of death (Tracz, Szczylik, Porta, & Czarnecka, 2016). Moreover, this overexpression has been shown to increase kidney cancer risk in middle-aged male smokers (Major, Pollak, Snyder, Virtamo, & Albanes, 2010). The second pathway in Table 3 is ‘Cytokine Signaling in Immune system’. Cytokines are important biomolecules that play essential roles in tumor formation (Lee & Rhee, 2017), and they are therapeutic targets (Doehn, Kausch, Melz, Behm, & Jocham, 2004; Macleod et al., 2015). The IL-6 cytokine family can serve as useful diagnostic and prognostic biomarkers. In fact, IL-6 is a potential target in cancer therapy (Kaminska, Czarnecka, Escudier, Lian, & Szczylik, 2015; Unver & McAllister, 2018). Ishibashi et al. reported that IL-6 suppresses the expression of the suppressor of cytokine signaling-3 (SOCS3) gene and is associated with poor prognosis of kidney cancer patients (Ishibashi et al., 2018).
Table 4 shows the significant relationships between the 23 predicted mRNAs and the patients’ survival rates. For some of the 23 genes, significant P-values in the Kaplan-Meier analysis could be obtained only when patients were divided unequally according to the expression of the gene considered. A majority of the mRNAs (15 out of 23) are associated with P-values less than 0.05 with 50/50 divisions based on the level of gene expression. Among the 16 KEGG pathways predicted by DIANA-miRPath (Table 5), 14 are directly related to cancers; the exceptions are Hepatitis B and Hepatitis C. Therefore, we correctly identified miRNA signatures that are cancer-related.
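To make the 50/50 division concrete, the Python sketch below splits patients at the median expression of one gene and computes a Kaplan-Meier survival curve for each group. It is a generic illustration, not a description of how OncoLnc works internally; the function names and the toy data are hypothetical, and the significance test between the two curves (for example, a log-rank test) is omitted.

```python
from statistics import median


def kaplan_meier(times, events):
    """Return [(time, survival_probability)] at each distinct event time.

    events[i] is 1 if the patient died at times[i] and 0 if censored.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        while i < len(data) and data[i][0] == t:  # group ties at the same time
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed
    return curve


def split_by_median(expression, times, events):
    """Split patients into low and high groups at the median expression."""
    cut = median(expression)
    low = [(t, e) for x, t, e in zip(expression, times, events) if x <= cut]
    high = [(t, e) for x, t, e in zip(expression, times, events) if x > cut]
    return low, high


# Toy data (hypothetical): expression, follow-up time, and event indicator.
low, high = split_by_median([1, 2, 3, 4], [5, 6, 7, 8], [1, 0, 1, 1])
low_curve = kaplan_meier([t for t, _ in low], [e for _, e in low])
high_curve = kaplan_meier([t for t, _ in high], [e for _, e in high])
```

An unequal division, as mentioned for some genes above, would replace the median by another quantile of the expression values.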
To validate the robustness of our findings, we employed an independent dataset to confirm that our results are, to some extent, independent of the dataset used. The alternative dataset was downloaded from GEO (GSE16441). The procedures applied to the GEO dataset were the same as those applied to the TCGA dataset; the only differences are the numbers of samples, miRNAs, and mRNAs. After repeating the same procedures, we found that the sample-dependent singular value vectors $u_{l_1 j}^{(\mathrm{mRNA})}$ and $u_{l_3 j}^{(\mathrm{miRNA})}$ with $l_1 = l_3 = 2$ also differed between normal and tumor samples (Fig 1). P-values computed by the t-test were $6.74 \times 10^{-22}$ for mRNA and $2.54 \times 10^{-18}$ for miRNA. To ascertain whether the two vectors are significantly correlated, we calculated the PCC between them, which was 0.931 ($P = 1.58 \times 10^{-15}$), indicating that they are highly correlated.
Next, we checked whether the selected miRNAs and mRNAs were common between the TCGA and GEO datasets. Three of the miRNAs listed in Table 1 were identified: hsa-miR-141, hsa-miR-210, and hsa-miR-200c. For mRNAs, 209 genes were identified. After restricting the analysis to genes included in both the TCGA and GEO datasets, we summarized the overlap as a confusion matrix (Table 6).
The P-value determined using the Fisher exact test was $8.97 \times 10^{-11}$ and the odds ratio was 19.7. Therefore, the coincidence between the genes selected in the TCGA and GEO datasets is significant, and the results obtained for TCGA are robust and not highly dependent upon specific samples.
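Table 6 is not reproduced here, so the sketch below uses a made-up 2×2 table purely to illustrate how such an overlap can be tested. The function name `fisher_overlap` and the counts are hypothetical. The one-sided alternative (more overlap than expected by chance) is an assumption, since the text does not state which alternative was used, and the sample odds ratio ad/(bc) computed here can differ slightly from the conditional estimate that some statistics packages report.

```python
from math import comb


def fisher_overlap(a, b, c, d):
    """One-sided Fisher exact test for the 2x2 table [[a, b], [c, d]].

    a: selected in both datasets, b: only in the first,
    c: only in the second, d: selected in neither.
    Returns (P-value for an overlap of at least a, sample odds ratio).
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    total = comb(n, col1)
    tail = sum(
        comb(row1, k) * comb(n - row1, col1 - k)
        for k in range(a, min(row1, col1) + 1)
    )
    odds_ratio = (a * d) / (b * c) if b * c else float("inf")
    return tail / total, odds_ratio


# Hypothetical counts, not the values of Table 6.
p_value, odds_ratio = fisher_overlap(3, 1, 1, 3)
```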
To compare the proposed method with a conventional method, we applied the t-test to the TCGA and GEO datasets. P-values were calculated and adjusted based on the BH criterion. As a result, 13,895 genes and 399 miRNAs for TCGA and 12,152 genes and 78 miRNAs for GEO were associated with adjusted P-values less than 0.01. Thus, with the same P-value threshold, the t-test selected far more genes and miRNAs than the TD method. If only a restricted number of top-ranked genes and miRNAs were selected by the t-test, the coincidence between TCGA and GEO might be comparable to that obtained with TD. Therefore, we selected the same numbers of genes and miRNAs by the t-test as those selected by TD. Only one miRNA and no genes were common between the TCGA and GEO datasets. Hence, the sets of genes and miRNAs selected by the t-test were less consistent between TCGA and GEO than those selected by TD. This strongly suggests that the proposed method is superior to the t-test.
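The comparison above amounts to ranking features by a t statistic in each dataset, keeping the same number of top-ranked features as TD selected, and counting the features common to both datasets. A minimal Python sketch follows. Welch's unequal-variance t statistic is an assumption (the text does not state which t-test variant was used), and the function names, feature identifiers, and toy values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Welch's t statistic for two independent groups (e.g. tumor vs normal)."""
    standard_error = sqrt(
        variance(group_a) / len(group_a) + variance(group_b) / len(group_b)
    )
    return (mean(group_a) - mean(group_b)) / standard_error


def top_n(t_by_feature, n):
    """Identifiers of the n features with the largest absolute t statistic."""
    ranked = sorted(t_by_feature, key=lambda f: abs(t_by_feature[f]), reverse=True)
    return set(ranked[:n])


def overlap(t_dataset_1, t_dataset_2, n):
    """Features ranked in the top n by |t| in both datasets."""
    return top_n(t_dataset_1, n) & top_n(t_dataset_2, n)


# Toy t statistics for three features in two datasets (hypothetical values).
common = overlap(
    {"g1": 5.0, "g2": -4.0, "g3": 0.1},
    {"g1": -3.0, "g2": 0.2, "g3": 2.5},
    n=2,
)
```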
5. Conclusions
In this study, we applied the TD-based unsupervised FE method to KIRC miRNA and gene expression data. The TD-based method can identify miRNA signatures that are differentially expressed between normal tissues and tumors and that are significantly correlated with gene expression. The selected mRNAs and miRNAs are not only mutually correlated but also significantly related to various aspects of cancer. This suggests that integrated analysis by TD-based unsupervised FE is, despite its simplicity, an effective strategy for identifying biologically significant pairs of miRNAs and mRNAs, which is not easy with other strategies.
Supplementary Materials
Supplementary figures. The results of the Kaplan-Meier plots of the 23 KIRC survival-associated genes by using OncoLnc.
Author Contributions
Ka-Lok Ng oversaw the research, prepared the data, and contributed to writing (original draft preparation, review and editing). Y-h Taguchi performed the formal analysis and contributed to writing (original draft preparation, review and editing).
Conflicts of Interest
The authors declare no conflict of interest.
Acknowledgments
Dr. Ka-Lok Ng is funded by the Ministry of Science and Technology, Taiwan (MOST), grant number MOST 108-2221-E-468-020, and also supported by the Asia University, grant numbers 107-asia-02 and 107-asia-09. Dr. Y-h Taguchi is supported by Kakenhi 19H05270 and 17K00417. We would like to thank Editage (www.editage.com) for English language editing.
Footnotes
ppiddi{at}gmail.com
This is the major revision based upon the reviewers' comments to the original version.