ABSTRACT
The primary function of microRNAs (miRNAs) is to maintain cell homeostasis. In cancerous tissues miRNAs’ expression undergo drastic alterations. In this study, we used miRNA expression profiles from The Cancer Genome Atlas (TCGA) of 24 cancer types and 3 healthy tissues, collected from >8500 samples. We seek to classify the cancer’s origin and tissue identification using the expression from 1046 reported miRNAs. Despite an apparent uniform appearance of miRNAs among cancerous samples, we recover indispensable information from lowly expressed miRNAs regarding the cancer/tissue types. Multiclass support vector machine classification yields an average recall of 58% in identifying the correct tissue and tumor types. Data discretization has led to substantial improvement reaching an average recall of 91% (95% median). We propose a straightforward protocol as a crucial step in classifying tumors of unknown primary origin. Our counter-intuitive conclusion is that in almost all cancer types, highly expressing miRNAs mask the significant signal that lower expressed miRNAs provide.
INTRODUCTION
Mature microRNAs (miRNAs) are short, non-coding RNA molecules. They lead to post transcriptional repression by reducing the stability of mRNAs and attenuating the translation machinery (1). In humans, there are about 1900 miRNA genes that yield ~2600 mature miRNAs. The expression levels of miRNAs in healthy tissues span 5 to 6 orders of magnitude (2).
In all multicellular organisms including humans, miRNAs were implicated in embryogenesis, tissue identity and development. As the gatekeepers of cell homeostasis (3), miRNAs respond to alteration in cell regulation that occurs during viral infection, inflammation, and numerous pathologies. In mammals, cell transformation and carcinogenesis are accompanied by drastic alterations in miRNA expression profiles (4,5). Those miRNAs that are associated with carcinogenesis and metastasis are called oncomiRs (6). They were implicated in targeting components of the cell cycle, DNA repair, oncogenes or tumor suppressor genes (7,8). Most oncomiRs are expressed in multiple cancer tissues. However, some are specific to only certain cancer tissues. For example, human miR-21 over-expression is associated with almost all cancers while miR-15 and miR-16 expressions are mostly associated with B-cell neoplasm. Other miRNAs (miR-143 and miR-145) directly regulate other oncomiRs and thus are candidates for anti-tumor therapy (9). Therefore, altered miRNA expression in cancerous versus healthy tissues is suggested as invaluable biomarkers, for cancer diagnosis and prognosis and as a lead for novel therapeutic approach (10). Despite advances in understanding cell deregulation by miRNAs in cancers for most miRNAs, a relation between expression level, diagnosis and prognosis for specific cancer types cannot be drawn (see discussion in (11,12)).
From a clinical perspective, profiling miRNAs is important for: (i) Selecting optimal treatment; (ii) Monitoring the disease’s progression; (iii) Identifying the primary origin of a metastatic cancer (12). For example, miR-10b level was shown to be a prognosis indicator for chemotherapeutic resistance in colorectal cancer (13) while miR-30c inhibits tumor chemotherapy resistance in breast cancer (14). The monitoring of minute miRNAs levels in patient’s body fluids allows a follow up for treatment and a direct assessment for disease’s progression (5,15).
Many of the samples collected from cancer patients are metastatic. Evidently, it is more challenging to treat metastatic than primary cancers (see reviews (16,17)). Importantly, the origin of up to 12-15% of all newly diagnosed cancers is unknown or uncertain (18). Due to the difficulty of identifying the origin of metastatic tissues, cancers of unknown primary origin (coined CUP) are associated with poor prognosis and inconclusive protocol for disease management (19). Consequently, there is an active line of research seeking ways to identify the type and origin of tissues from metastatic biopsies through their molecular signatures (for example (12,20,21)). Success in classifying lung squamous cell carcinomas from lung adenocarcinoma (22) was further expanded to four main subtypes of lung cancer using miRNA profiles (23). In another study, a classifier based on 47 miRNAs successfully separated 10 different tissues from ~100 samples from primary and metastatic samples (24). Expanding the task to 42 cancer types resulted in an extended set of 68 miRNAs (25) that together are used as indicators for tissue of origin in numerous biopsies from cancer patients (26). Although most measurements of miRNAs originate from cancer biopsies (27), miRNAs circulating in body fluids has become attractive candidates for early diagnosis (28,29).
The goal of our study is to use the miRNA expression profile data from The Cancer Genome Atlas (TCGA) toward the task of classifying different cancerous tissues and their disease/healthy states. TCGA provides rich molecular data from cancer patients on an unprecedented scale (30), with samples of >25 cancer types. We focused on miRNA profiles that report on the normalized expression level of each of the 1046 analyzed miRNAs. Specifically, we ask: (i) Which miRNAs best capture the information needed to distinguish the various cancerous types and origin? (ii) Which tissues are prone to false classifications? (iii) Can we find a biological interpretation of the set of informative miRNAs?
MATERIALS AND METHODS
Data collection
MicroRNAs (miRNAs) profiles were extracted from The Cancer Genome Atlas (TCGA, collected between October 2013 and January 2014) (31–34). We defined a class by coupling the originating tissue with the label of the sample’s state (i.e., cancerous or healthy). We limited our analysis to classes with over 50 samples. In total we collected 27 classes. Three classes of healthy tissues are abbreviated as Kidney, Thyroid and Breast. Additionally, 24 cancerous classes and their acronyms are listed: adrenocortical carcinoma (ACC), bladder urothelial carcinoma (BLCA), brain lower grade glioma (LGG), breast invasive carcinoma (BRCA), cervical squamous cell carcinoma and endocervical adenocarcinoma (CESC), colon adenocarcinoma (COAD), esophageal carcinoma (ESCA), head and neck squamous cell carcinoma (HNSC), kidney chromophobe (KICH), kidney renal clear cell carcinoma (KIRC), kidney renal papillary cell carcinoma (KIRP), liver hepatocellular carcinoma (LIHC), lung adenocarcinoma (LUAD), lung squamous cell carcinoma (LUSC), pancreatic adenocarcinoma (PAAD), pheochromocytoma and paraganglioma (PCPG), prostate adenocarcinoma (PRAD), rectum adenocarcinoma (READ), sarcoma (SARC), skin cutaneous melanoma (SKCM), stomach adenocarcinoma (STAD), thyroid carcinoma (THCA), uterine carcinosarcoma (UCS) and uterine corpus endometrial carcinoma (UCEC).
The 27 classes include 8522 distinct samples. Each sample is associated with the expression of 1046 miRNAs. We assessed the similarity among patient samples by calculating the cosine of the angle between the vectors of miRNA expression. These cosines were used as a proxy to the initial separation ability and similarity among classes. The lists of all samples and miRNAs’ TCGA identifiers are provided in supplementary data Table S1.
Data transformation
We apply a rough discretization to the data, according to miRNA expression level, measured in reads per million (RPM). We start with a dichotomy - one category for expression levels strictly between 0 and 30 RPM, and a second category for everything else. A somewhat more refined classification is a trichotomy, that partitions the data to non-expressed, lowly expressed (expression levels from 0 to 30 RPM) and highly expressed (>30 RPM). On average, each sample has 306 lowly expressed miRNAs (s.d. of 43.91), and 163 highly expressed miRNAs (s.d. of 18.58).
Machine learning model
Human TCGA samples were classified to the 27 classes (see above) by training a multiclass support vector machine (SVM) classifier (35). We used a Python implementation based on the scikit-learn package (36). The training sets for the raw data, the dichotomy and the trichotomy were limited to 40% of the samples of each class. The sensitivity of the SVM classifier to the size of the training set was tested. The tests were repeated with training set comprised of 20% to 80% of each class of samples, with increments of 10%. Our results for both data transformations were consistent and stable for training sets of at least 40% of the samples and somewhat less stable for training sets below the 40% threshold. Both dichotomy and trichotomy schemes were run 1000-10,000 times each. Scikit-learn cross-validation mechanism was used in order to prevent over-fitting.
Prior to data analysis we randomly partition all data into 40% that are unseen and 60% that are seen. We create 5 training sets each of which is comprised of a random subset of two-thirds of seen part. For each set, two classification matrices were created, one based on the 30 RPM threshold, and the other based on a 60 RPM threshold, for separating the lowly-expressed from the highly-expressed. We then run our machine-learning procedure on each of these 5 sets and test them on the unseen part of the data. Our final classification is chosen by carrying a plurality vote among these 10 classification results.
Performance evaluation and error analysis
We assessed our models according to their average recall rate for each of the classes. Recall is defined as [true positive] / [true positive + false negative]. The average, median and standard deviation (s.d.) were measured for over 300 independent runs of the SVM protocol on raw or discretized data.
We carried out two types of error analysis: At the sample level and at the class level. First, we analyzed the errors made per each of the samples over distinct runs (Sample-based Error). A second error analysis concerns a class misclassifications and the identity of the faulty assignment (Class-based Error). The plurality vote was implemented in our final classification in order to overcome Sample-based Errors. We have merged the classes READ and COAD in order to reduce the extent of Class-based Errors.
Informative miRNA extraction
Several methodologies were applied to identify and select a minimal set of miRNAs that are most informative towards a successful classification task. Our best results were obtained using the above described trichotomy. Specifically, for each miRNA we create an 81-dimensional vector, with 3 coordinates for each of the 27 classes (24 cancer types and 3 healthy tissues). These three coordinates are defined as follows: If there are N samples in this class for the miRNA at hand, of which N_0 are not expressed, N_1 are between 0 and 30 RPM and N_2 are highly expressed, then we set the first coordinate at N_0/N, the second to N_1/N and the third to N_2/N. We then calculate the standard deviation (s.d.) of each of the expression levels among the 27 classes separately, and calculate the sum of the s.d. Formally:
We expect miRNAs of higher summed s.d. values to be more informative, since the s.d. captures the miRNA’s variability among different classes. We chose not to assign distinct weights to different classes despite their differing samples’ size (ranging from 57 for Uterine Carcinosarcoma to 1066 for Breast Invasive Carcinoma).
RESULTS
Many cancerous tissues have similar miRNA profiles
The main goal of our study is to classify cancer classes using the information encoded by miRNA profiles from patients. For most cancer types, a small number of highly expressed miRNAs undergo a drastic change in expression level along the progression of the disease. Therefore, it is a commonly held view that uniquely expressed miRNAs might be used for diagnosis, prognosis and possibly also for disease treatment. Here we revisit this view and test the miRNA heterogeneity and expression consistency in an unbiased way. We analyze many different cancer types and tissues, and thousands of patient samples.
We wanted to get a sense of the similarity between the vectors of miRNA reads among samples from each given tissue. We found high correlations among the miRNA expression vectors when both samples are healthy or when both are diseased (within the same tissue). The correlation is much lower for healthy vs. cancerous one (data not shown). Figure 1 shows an analysis for all pairs of patients with lung adenocarcinoma samples (521 samples, abbreviated LUAD) and matched healthy samples (~50 samples). The similarity between each pair is indicated (cosine values, see Materials and Methods). We show that healthy and cancerous tissues exhibit different patterns. Importantly, healthy samples are similar to each another as are the diseased samples.
We tested whether this property (Figure 1) extends to all other cancerous and healthy tissues. We therefore restricted our analysis to tissues for which samples from both healthy and cancerous patients are available. As Figure 2A shows, each healthy tissue is well characterized by its miRNAs’ profile. For example, breast tissues in different patients have very similar miRNA profiles, and differ significantly from liver profiles. Even the two types of lung tissues are visibly distinguishable from each other. However, the distance matrix for the diseased samples (5 classes, ~2700 patients, Figure 2B) is relatively uniform. While all the diseased tissues are rather similar using our similarity matrix, a somewhat higher similarity is visible between the two lung diseases, the adenocarcinoma (LUAD) and lung squamous cell carcinoma (LUSC). We concluded that the trend that we observed in Figure 1 applies to other tissues. Namely, the distance among all pairs of diseased tissues is lower than that among the corresponding healthy tissues.
Figure 2C offers a bird's eye view of miRNA-based information for all 8522 samples that were included in our analysis, according to 27 classes discussed in this study. The entries show the average of the correlations over all samples in each pair of classes. Note that the values on the diagonal are only slightly higher than the rest of the values in this symmetric matrix. It is noteworthy that that brain lower grade glioma (LGG) markedly differs from other cancerous tissues, yet exhibiting a significant self-similarity. Additionally, sarcoma samples (SARC, diagonal, Figure 2C) exhibit rather low (0.61) average correlation, suggesting high variability of the miRNA expression in this cancer type.
Lowest expressed miRNAs provide differentiating information
In view of the relatively homogenous pattern of miRNA profiles among diseased tissues (Figure 1), it is clear that the identification of cancer classes cannot be trivially based on a predefined simple set of miRNAs. Thus, we sought a classifier that can correctly label a tissue of origin based on information that is globally captured by the miRNAs’ profiles. To this end, we applied a multiclass support vector machine (SVM) classifier with a randomly chosen training sets (Figure 3A). The flow of the methodologies used in this study is shown. Our classifier employs two variations of data transformation, following by a step that benefits from studying the typical errors in our classification task (Figure 3A).
SVM applied to the raw data to identify the tissue of origin achieves an average success rate of 58%, and a median of 69% (27 classes) (Figure 3B, blue). Note that naïve guessingwould yield ~4% success rate for classifying a sample to any of the 27 classes. Interestingly, for 10 out of the 27 classes attains recall rate higher than 80%, and for 5 classes the success rate is even above 90%. However, the prediction for some classes is very poor (<10%). The best performance is associated with LGG, as could be anticipated from the observation in Figure 2C.
However, for clinical purposes, a success rate of 58% is unsatisfactory. We thus focused on searching a subset of miRNAs that are most informative towards the task of classifying tissue of origin and sample status (diseases, healthy). The expression levels of miRNA exhibit a high dynamic range of 5-6 order of magnitudes and expression is expressed in RPM (reads per millions). Furthermore, it turns out that only a small number of miRNAs (20–25) account for the vast majority of the reads in the samples (96-98%, not shown). Based on these observation, we tested the potential of discretizing miRNAs’ raw values, thus allowing a representation that emphasizes the contribution of lowly expressed miRNAs. We anticipated that the long tail of miRNAs might be more informative for the classification purposes with respect to a small set of dominating miRNAs.
We applied a simple dichotomy that partitions the lowly expressed miRNA (< 30 RPM) vs. the rest of the miRNAs. Typically, ~70% of the miRNAs are expressed at the range of >0 to 30 RPM. This simple transformation (see Materials and Methods) dramatically reduces the volume of data, and concentrates on only ~ 0.02% of the original signal (i.e., the sum total of miRNAs expression values). Importantly, the application of naïve dichotomy already improves the performance of our classifier to an average success rate of 66% (median of 80%, Figure 3B).
We sought additional data processing procedures to further improve the classification performance. In dichotomy we treat equally miRNAs that are expressed at levels above the threshold (>30 RPM) and those that are not expressed at all in a specific class. We next separate these two sets, and replace the data transformation to a trichotomy (Figure 3A, see Materials and Methods). This transformation yields a substantial improvement. The classifier thus reaches an average success rate of 88% (median 91%, Figure 3B and Table1). Figure 3C shows the improvement in the classification success for the trichotomy mode with respect to the original raw data. It is clear that some classes benefit more from this transformation than others. For 8 out of the27 classes the observed improvement was >3 folds (Figure 3C).
We analyzed the dependence of the success rate on the chosen threshold (<30 RPM) and observed that the recall rate remained virtually unchanged for thresholds ranging between 5 to 300 RPM. However, raising the threshold above 300 RPM affects the performance negatively, suggesting that at this range of parameter we are actually adding noisy information (Figure 4). It is evident that for both the dichotomy and trichotomy modes, information within the very low expression levels is critical for the classification task. The contribution of the very lowly-expressed miRNAs to the classification task was further assessed. We zoomed on miRNAs expressed at levels x, where x ranges from >0 to 30. We trained the classifier again for different values of x and artificially relabeled miRNAs that are smaller than x as non-expressed. When marking as non-expressed miRNAs with expression levels under 10 RPM, there was no significant change to the classification results, while applying x > 10 RPM led to a decline in performance. Typically, only ~8% of the miRNAs are listed in the range of 10-30 RPM, while 25% of the entire list of miRNAs are within the range of 1 to 10 RPM. We conclude that lowly expressed miRNAs that carry fundamental information for the classifier, mostly (but not entirely) reside in the 10 to 30 RPM range.
Not all tissue types are equally hard to classify
As mentioned, trichotomy has resulted in a substantial improvement for many classified tissues. For 11 out of the 27 classes, we consistently reached a success rate of 95% or above (Table 1). Importantly, the classification success does not immediately reflect the sample size. For example, a 96% success for Adrenocortical Carcinoma (ACC) is based on only 80 samples while lung squamous cell carcinoma (LUSC) achieved only 84% success with a sample size which is 6-fold larger (Table 1). Specifically, we are doing poorly on classifying esophageal carcinoma (ESCA) and rectum adenocarcinoma (READ), with a success rate of 43% and 39%, respectively.
We tested the sensitivity of the classification performance with respect to alternative machine learning methods: Random Forest and SVM with polynomial and radial basis function (RBF) kernels. The results for Random Forest and SVM with polynomial kernel were somewhat worse than the used SVM (see Materials and Methods). RBF kernel gave an improved performance on the raw data, but failed to improve following data discretization and data transformation.
A third of the classification errors are interpretable
We tried to refine our understanding on features that govern the success / failure in the classification task. To this end we inspected the results in view of the different errors that are made. Figure 5 presents a “wheel view” for all 27 classes with a visual indication on the samples’ size for each class, and the typical errors. Cross-edges in the graph and their color indicate the extent and nature of all misclassifications. The wheel indicates the true-positive and false-positive (inner arc), as well as the true-positives and false-negatives (middle arc). From the outermost arc one can estimate the sum of the two error types. The source data is available in Supplemental Table S2.
We are often unable to interpret the errors in classification. We categorized the errors by biological relevance of the classification errors: (a) Cancerous state errors: Correct classified tissue but failure in determining healthy vs. diseased state. These are restricted to the 3 instances of tissues that are represented by both healthy and cancerous instances. (b) Intra-tissue errors: misclassification of one class to another that is anatomically adjacent. Such errors are limited to classes with multiple diseases of the same/ related organ. (c) Inter-class errors: classification mistakes that are not obviously interpretable. Figure 6A shows schematically the possible error types for all 27 classes. Figure 6B indicates the distribution of different error types among the relevant categories of errors. Figure 6C focuses on an example for inter-class error and the source for false positives. It shows that in the case of BLCA, the majority of the false positives (10.6% out of 12.7%) derive from misclassification from 6 classes of cancer carcinomas (additional classes, that contribute <1% of false positives each are not shown).
Improving classification success from errors’ consistency
As Figures 5-6 demonstrate, many errors are recurrent among specific classes. We tested the effect of merging classes that are characterized by abundant, bi-directional false assignments by the classifier. The most prominent case involves Adenocarcinoma of the Rectum and Colon (READ and COAD, respectively). Merging these two classes has improved the average recall rate, while merging other pairs of classes has worsened the overall classification success (not shown).
When we examined the errors per sample it transpired that for certain samples, classification varied with the (random) choice of training sets. A plurality vote protocol was applied in order to reduce this dependency (see Materials and Methods). The plurality vote led to an additional improvement in the average classification success. For most classes the improvement is rather modest. However, merging the Rectum (READ) and Colon Adenocarcinoma (COAD) classes that were reported with 39% and 80% success, respectively was highly beneficial. Actually the merged collection of samples reached 97% recall. The average recall rate for the valid 26 classes has reached a further improvement to 91%, with a median of 95% (Table 1).
Additional tests in which the classifier was trained to identify the clinical stages of the disease (i.e., cancer stage 1 versus stage 3) and other global features, like age and gender, resulted in poor performance and will not be further discussed.
Lowest expressing miRNA unveil informative miRNAs
Notwithstanding the impressive success rate of the classifier (91%, with 95% median), we are still unable to quantify the contribution of any specific miRNA to this success. To this end, we designed a ranking score for miRNAs according to their potential in contributing to the classifier success (see Materials and Methods). Empirically, we compared the results achieved when classifying with a reduced vector of miRNA expression. We randomly drew various groups of 50 miRNAs out of the 1046 miRNA discussed in this study (Table S1) and tested the classification results when using the data as is (raw data), as well as following the trichotomy transformation (Figure 7A). We also tested the results when selecting the 50 most informative miRNA (according to rank by sum of the statistical variation, see Materials and Methods). We repeated this randomization test for a growing set of miRNAs. Note that as the set size grows, a random set tends to include more informative miRNAs (e.g., for 200 miRNAs). Consequently, the differences between the performance of the randomized set and the most informative set is reduced. Already the 50 most informative miRNAs attain an impressive 75% success rate, suggesting that much of the classification signal is already captured by the set that is characterized by maximal variance (Figure 7A). With 300 of the most informative miRNAs, success rate reached on average 87%, similar to what is achieved when all the miRNAs are employed along with the trichotomy transformation.
We tested whether the miRNAs that contribute most to the classification task are indeed those that were implicated in cancer progression, prognosis, and diagnosis. To this end, we analyzedthe 20 most informative miRNAs by the criterion used in Figure 7A. The source of variability is shown as reflected by the s.d. associated with the 3 coordinates for each class (no expression, lowly expressed (0<x<30) and highly expressed (>30 RPM). For a complete rank of informative miRNAs see Supplemental Table S3. It is clear that some informative miRNAs rely on variability in the vector of highly expressed (Figure 7B, compare miR-205 to miR 934). For others, no variability is observed in the vector for non-expressed (Figure 7B, blue). While both hsa-mir-141 and hsa-mir-215 are among the most informative miRNAs, their source of variability is rather different. While hsa-miR-205 is characterized by a very large variation to in the highly expressed section, yet a substantial variability is recorded for this miRNA also in the other two categories (non-expressed and lowly expression).
Discussion
Cellular quantities of miRNAs
Our results uncover the power of the least expressed miRNAs in distinguishing a large collection of tissues and cancerous types. The common notion is that the quantity of mature miRNAs in the cell is critical for cell regulation. Specifically, excess of a specific type of miRNA potentially titrates out genuine mRNA targets (37). As a secondary effect, an overflow of specific miRNA will most likely result in its binding to non-genuine targets, the so-called called “off target” effect (38). The overall quantitative effect of miRNA levels is discussed in terms of competition (39,40) and cooperativity (41). Most studies that consider miRNAs in living cells tend to be biased toward the analysis of highly expressed miRNAs. Within such framework, lowly expressed miRNAs are most likely discarded with the premise that they have no impact of the mass effect.
In this study, we show for the first time that for the task of cancer classification, miRNA-based information is encoded by the minute expression of miRNAs that together account for only 0.02% of the total reads. Moreover, it is apparently the long distributional tail of miRNA expression that carries the signal for correctly identifying multiple cancer / tissue types. We actually show that including the entire profile of miRNAs leads to poorer performance as the principal, informative signal for class identification is masked.
A practical consequence of our observation concerns the preferred sequencing depth needed for a routine miRNA profiling. The high coverage that is reported for all samples in TCGA, entails a high sensitivity. Specifically, about 45-50% of the miRNAs, in almost all classes are expressed at a level of 0 to 1 RPM. The miRNA list with values of 1-10 RPM accounts for an additional 25% of the miRNA list. Classically, when sequencing short RNA (<200 nt), most protocols call for a coverage ranging from 0.5 to 1.5M reads, under the assumption that discovery rate for new miRNAs is extremely low above 2M reads (42). We argue that the high coverage provided by the TCGA is essential for (i) increasedreliability of miRNA identification (i.e., reporting on expression from a list of 1046 miRNAs); (ii) highlighting non-classical and often low expressing miRNAs.
Low expressing miRNA act as intrinsic markers for tissue specificity
An unexpected outcome of our study refers to the stability and consistency in the expression of the lowly expressed miRNAs through many samples of the same cancer class. Ample observations show that extreme conditions (e.g., stress, transformation, viral infection, differentiation) lead to drastic changes in miRNA expression profiles in cell types. We postulate that in conditions where miRNA expression patterns dramatically change (e.g., cancer, stem cell differentiation, hypoxia etc.), the tail of the miRNA distribution is hardly affected and consequently cell identity and the origin of the cells remain robust. This interpretation is supported by the gradual manipulating of the data by removal of miRNAs that belong to the lowly expressed tail and assessing the robustness of the classifier to such manipulations (Figure 4).
miRNA profile can be used for sub-classification of cancer types
TCGA is a fast growing resource and has already reached >10,000 samples which are assigned with 33 cancer types. The methodology presented in this study is applicable for a classification task for any number of diseases’ classes that are supported by enough instances. We showed that the classification performance is very high for a small set of classes (Figure 3B and Table 1). For example, the success rate for Breast Invasive Carcinoma (BRCA) is 96-98%. However, a wide spectrum of survival rates (43,44) is associated with breast cancers, supporting the notion of heterogeneous molecular basis of this disease. Therefore, managing the disease is likely to benefit from a refined classification within the broad class of BRCA. A prognosis profile combining mRNAs, miRNAs and DNA methylation had been proposed (45). In addition to the genes that were implicated as driver mutations, the authors identified number of informative miRNAs such as hsa-miR-328, hsa-miR-484, and hsa-miR-874. While these miRNAs were associated with cell proliferation, angiogenesis and tumor-suppressive functions (46,47), they are poor separators of BRCA from other major tumor types (48). Actually, these miRNAs were implicated in Colon Adenocarcinoma (COAD) (49) and mostly in renal and lung carcinomas (48).
Importantly, each cancer type displays its unique miRNA set. For example, overexpressed miRNAs that were implicated in gastric cancer include miR-17-5p, miR-18a, miR-18b, miR-19a, miR-20b, miR-20a, miR-21, miR-106a, miR-106b, miR-135b, miR-183, miR-340-3p, miR-421 and miR-658. In colon cancer, in addition to the validated oncomiRs, miR-16, miR-31, miR-34a, miR-96, miR-125b and miR-133b were overexpressed. The list of cancer associated miRNAs for pancreatic ductal adenocarcinoma (PDAC) includes miR-15b, miR-95, miR-155, miR-186, miR-190, miR-196a, miR-200b, miR-221 and miR-222 (reviewed (50)). Among the underrepresented miRNAs (called anti-oncomiRs) let-7a-1, miR-143 and miR-145 were validated for a number of cancer types.
We found no statistical evidence for an overlap between the ‘most informative’ miRNAs (e.g., supplemental Table S3, Figure 7B) and miRNAs that are reported in the literature to govern cancer progression. The later includes an extended list of oncomiRs including miR-15, miR-16, miR-17, miR-18, miR-19a, miR-19b, miR-20, miR-21, miR-92, miR-125b, miR-155 and miR-569. We postulate that there is no direct correspondence between the informative tail of lowly expressing miRNAs and the dominating miRNAs that best associated with a transition of a cell to its transformed, cancerous state.
Among the top 20 most informative miRNAs, miR-205 and miR-141 exhibit maximal variability for the 27 classes (by a vector of 81 values, see Materials and Methods). Addition informative miRNAs (Figure 7B) include miR-7-3, miR-31, miR-135a/135b, miR-149, miR-196a-1, miR-200a/200b/200c, miR-215, miR-224, miR-328, miR-429, miR-552, miR-934, miR-944 and miR-1251. In view of the lack of any statistical enrichment between miRNAs assigned as the most significant ones (Figure 7B) and the collection of oncomiRs/ anti-oncomiRs from the literature, we postulate that the multiclass typing and markers for tumorigenesis representing different aspects in the characterization of the disease. Inspecting the top 50 miRNAs from our results addresses (indirectly) a question on the nature of miRNAs that capture most of the classification information. Among the most variable miRNAs, three of the miRNAs (hsa-miR-141, hsa-mir-200a and hsa-mir-200b) belong to one family (mir-8) with a high correlation in their appearance in all 27 classes (r2=0.71-0.74).
Clinical application and premises
Our results can be used in clinical treatment of patients having cancer of unknown primary origin (CUP). However, toward improvements in clinical guidelines, additional measurements (e.g., immune-histochemical, PCR for mRNA expression) are likely to be beneficial. A set of miRNAs that best classify cancer samples by tissue of origin was presented in view of the success in diagnosis of CUP (20,25,26). A significant overlap between the most informative miRNAs (extracted from (26)) and our SVM protocol was found. By comparing informative 50 candidates from the two studies, we detected 10 shared miRNAs (miR-122, miR-138-1, miR-141, miR-146a, miR-196a, miR-200a, miR-200c, miR-205, miR-31 and miR-9-3). Despite a large difference in the analyzed data (500 versus 8000 samples) and the methodology used (decision tree vs SVM), the consistency among informative miRNAs is intriguing (p-value 2E^5). Ample evidence demonstrated the impact of alteration in hsa-miR-200/hsa-miR-141 expression on proliferation, morphology and aberrant histone acetylation in numerous cancers and cell types. An overview on the contribution of miRNAs for cancer diagnosis, the advantages and limitations of several complementary methods was recently reported (51).
Lastly, we inspected the features that specify the least informative miRNAs in our study. There are 90 miRNAs (out of 1046) that are completely non informative to our task. Namely, in the trichotomy transformation they are uniformly expressed with respect to the 3*27 classes. For these miRNAs the sum of the standard deviations (Table S3) for each of the trichotomy label is zero. Obviously, these miRNAs cannot contribute to the classification task. Interestingly, several validated omcomiRs are among these miRNAs (e.g., miR-16, miR-17, miR-18 and miR-21). This emphasizes the uncoupling between the informative low expressing miRNAs we have identified for the classification task and highly expressed oncomiRs. The level of expression of oncomiRs are used as valuable markers for cell dysregulation and as such are the hallmark of proliferation and transformation in cancer.
Funding
Supported by ERC grant 339096 on High Dimensional Combinatorics and ELIXIR-Excelerate grant.