SUMMARY
Small non-coding RNAs (sRNAs) play a vital role in a broad range of biological processes in both health and disease. A comprehensive quantitative reference of sRNA expression would significantly advance our understanding of sRNA roles in shaping tissue functions. Here, we systematically profiled the expression of four sRNA classes across eleven mouse tissues by RNA-seq. Using fourteen biological replicates spanning both genders, we identified 3,962 distinct sRNAs, 473 of which are novel and reported here for the first time. We found that 40% of these transcripts were distributed across the body in a tissue-specific manner, and this tissue specificity extends through multiple sRNA classes; furthermore, some sRNAs are also sexually dimorphic. By combining these findings with machine learning, we were able to accurately classify tissue types from sRNA data generated by other studies. These results yield the most comprehensive catalog of specific and ubiquitous small RNAs in individual tissues to date, and we expect that this catalog will be a resource for the further identification of sRNAs involved in tissue function in health and dysfunction in disease.
INTRODUCTION
Small non-coding RNAs (sRNAs) are a large family of endogenously expressed transcripts, 18 to 40 nucleotides long, that play a crucial role in regulating cell function (Bartel, 2018; Cech and Steitz, 2014). Unknown to researchers two decades ago, today sRNAs are believed to be involved in nearly all developmental and pathological processes in mammals (Cech and Steitz, 2014; Esteller, 2011; He and Hannon, 2004; Ng et al., 2016). The main function of sRNAs in a cell is to tightly regulate gene expression at the levels of post-transcriptional RNA processing and translation. Aberrant expression of sRNAs, in turn, has been associated with diseases such as cancer, autoimmune disease and several neurodegenerative disorders (Almeida et al., 2011; Liu et al., 2017; Somel et al., 2010).
Mammalian cells express several classes of sRNAs, including microRNA (miRNA) (Ha and Kim, 2014), small interfering RNAs (siRNA), small nucleolar RNAs (snoRNA) (Matera et al., 2007), sno-derived RNA (sdRNA) (Taft et al., 2009), PIWI-interacting RNA (piRNA) (Ishizu et al., 2012) and tRNA-derived small RNAs (Kumar et al., 2014), with some shown to be expressed in a tissue- (Dittmar et al., 2006; Landgraf et al., 2007), cell-type- (Faridani et al., 2016; Volinia et al., 2006) or even cell-state-specific manner (Hayes et al., 2014; Palfi et al., 2016; Sherstyuk et al., 2017). Through their interactions with the messenger RNAs (mRNAs) that encode proteins, these small non-coding molecules shape the dynamic molecular spectrum of tissues (Faridani et al., 2016; Sharma et al., 2016). Despite extensive knowledge of sRNA biogenesis and function (Bartel, 2018; Jorjani et al., 2016), much remains to be explored about tissue- and gender-specific sRNA expression. Given the emerging role of sRNAs as biomarkers (Anfossi et al., 2018) and potent therapeutic targets (Janssen et al., 2013), a comprehensive reference catalog of tissue sRNA expression would be a highly valuable resource for both fundamental and clinical research.
The first attempts to establish a catalog of tissue-specific mammalian sRNAs began a decade ago (Ach et al., 2008; Hsu et al., 2007; Landgraf et al., 2007; Liang et al., 2007). While these pioneering microarray-, qPCR- and Sanger sequencing-based studies mapped only a limited number of highly expressed miRNAs, they nevertheless established a “gold standard” reference for the following 10 years of miRNA research. Efforts to characterize tissue-specific sRNAs have recently resumed with the use of RNA-seq, which has greatly advanced the discovery of novel and previously undetected lowly expressed transcripts (Londin et al., 2015; McCall et al., 2017; de Rie et al., 2017). However, tissue-specific patterns of miRNA expression were mostly cataloged using publicly available RNA-seq data originating from various experiments designed to target a tissue or a condition of interest, rather than by systematically analyzing a spectrum of normal tissues from the same individual (Ludwig et al., 2016; McCall et al., 2017; de Rie et al., 2017). As a result, tissue expression patterns reported by these studies are mainly based on one replicate, suffer from protocol biases (Giraldez et al., 2018) and still yield an incomplete picture of bodily miRNA patterns. In addition, none of the prior studies encompasses a spectrum of mammalian tissues from both female and male individuals, let alone non-coding RNA species other than miRNA.
Here, we describe a comprehensive atlas of sRNA expression across eleven normal mouse tissues. Using multiple biological replicates (n=14), we mapped tissue-specific as well as broadly transcribed sRNA attributed to different classes and spanning a large spectrum of expression levels. We have also, for the first time, provided evidence of gender-based differences in sRNA expression across multiple tissues. Finally, we used machine learning to accurately predict the tissue and, in certain cases, even its functionality based on sRNA expression.
RESULTS
sRNA expression atlas of mouse tissues
We profiled the expression of sRNA across eleven tissues from adult female (n=10) and male (n=4) C57BL/6J mice (Figure 1A, Table S1). We generated a dataset comprising a total of 142 sRNA sequencing libraries from brain, lung, heart, muscle, kidney, pancreas, liver, small intestine, spleen, bone marrow and testes RNA. Each library yielded ~1–10 million sRNA reads, resulting in an average of 21 million sRNA reads per tissue (Figure S1A). Using the ENCODE GRCm38 annotation, we mapped the expression of four distinct sRNA classes (miRNA, snRNA, snoRNA and scaRNA) in the profiled tissues. Across all tissues we identified 1551 distinct pre-miRNA, 941 snRNA, 953 snoRNA and 44 scaRNA, which correspond to 70.3%, 67.9%, 63.2% and 86.3% of ENCODE-annotated transcripts of the respective class (Figure 1B). With respect to protein-coding genes, the majority of detected snoRNA were of intronic origin, snRNA and scaRNA were intronic and intergenic (63/35% and 64/36% respectively) and miRNA were transcribed from either introns (53%), exons (11%) or intergenic regions (11%) (Figure S1B). The number of distinct sRNA varied greatly across tissues: for example, spleen and lung contained the largest number of distinct sRNA (~1300), while pancreas and liver contained the lowest (~500) (Figure S1C). Furthermore, within the profiled tissues we detected 95.1% of the pre-miRNA denoted by the miRBase v22 database as high-confidence transcripts (Kozomara and Griffiths-Jones, 2014). Using the obtained data we have reconstructed the most complete genome-wide tissue map of mammalian sRNA expression (Figure 1C, Table S2).
Tissue-specific expression of sRNA
We first assessed the differences in sRNA patterns across profiled tissues based on the expression of all four sRNA classes (Figure 2A). Unsupervised clustering of the most variable sRNAs across all samples (Methods) demonstrated that a large number of sRNA transcripts are shared between brain and pancreas, while not being expressed in other tissues (Figure 2B and Figure S2). Similar patterns were observed for bone marrow and spleen. Clustering of the most variable transcripts within each sRNA class allowed us to further identify miRNA as the main driver of brain and pancreas similarity, while bone marrow and spleen clustered together within snoRNAs (Figure S2).
Dimensionality reduction via t-distributed stochastic neighbor embedding (tSNE) (van der Maaten and Hinton, 2008) (Methods) on all sRNA genes revealed a robust clustering of samples according to tissue type (Figure 2C). Further analysis, performed on each sRNA class separately (Figure S3A), showed that, out of all four sRNA classes, miRNA separated samples into eleven clearly identifiable clusters corresponding to the profiled tissues, while snRNA showed no clear separation by tissue type. snoRNA and scaRNA tSNEs resolved the majority of tissue types but failed to identify pancreas, intestine and kidney.
For each sRNA we next computed the tissue specificity index (TSI), as described previously (Ludwig et al., 2016) (Table S2). We observed that ~21% of all detected sRNA were expressed in only one tissue (TSI = 1), ~19% of all sRNA were highly expressed in one but also present in other tissues (0.95 < TSI < 1), while the remaining ~60% were either ubiquitously expressed or had high expression scores in multiple tissues (Figure S3B). Brain contained the highest number of tissue-specific transcripts (437 with TSI > 0.95), followed by lung (257 with TSI > 0.95), bone marrow (161) and spleen (152) (Figure 2D). Interestingly, despite having the lowest number of transcript counts across profiled tissues, lung contained the largest number of distinct sRNAs (2532), followed by spleen (2430) and brain (2081) (Figure 2D). At the level of individual RNA classes, brain, lung, spleen and bone marrow remained the top four tissues harboring the largest number of unique tissue-specific transcripts (Figure S3C), except for scaRNA, for which only a few transcripts were found to have TSI > 0.95.
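The three TSI groups described above amount to a simple binning rule. The following is a minimal Python sketch of that rule, not the authors' code: the function names, the group labels and the small floating-point tolerance used to recognize TSI = 1 are all assumptions made for illustration.

```python
def tsi_category(tsi, high=0.95, tol=1e-9):
    """Assign a TSI value to one of the three groups used in the text.

    `tol` is an assumed tolerance so that values such as 0.9999999999
    produced by floating-point arithmetic still count as TSI = 1.
    """
    if tsi >= 1.0 - tol:
        return "single-tissue"      # expressed in only one tissue (TSI = 1)
    if tsi > high:
        return "tissue-enriched"    # 0.95 < TSI < 1
    return "broad"                  # ubiquitous or high in several tissues


def category_fractions(tsi_values):
    """Return the fraction of transcripts falling into each group."""
    counts = {"single-tissue": 0, "tissue-enriched": 0, "broad": 0}
    for value in tsi_values:
        counts[tsi_category(value)] += 1
    total = len(tsi_values)
    return {name: count / total for name, count in counts.items()}
```

Applied to a full table of TSI values, `category_fractions` would give the kind of percentage breakdown reported in Figure S3B.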
Tissue-specific miRNAs
Comparing the expression of miRBase-annotated (Kozomara and Griffiths-Jones, 2014) pre-miRNAs across all eleven tissues, we found well-described tissue-specific miRNAs, such as miR-122, miR-375 and miR-10a, enriched respectively in liver, pancreas and kidney (Landgraf et al., 2007), to be among the most variable pre-miRNAs (Figure 3A and Figure S4). We also found a large number of highly expressed miRNAs not previously known to localize preferentially within a particular tissue; these include miR-6236 in the bone marrow, miR-194 and miR-215 in the intestine, miR-381 in the brain, and miR-203 and miR-23b in testes (Figure 3A and Figure S4). In addition, we found several lowly expressed miRNAs specific to either one or two tissues (Figure 3B and 3C). Moreover, among the top fifty miRNAs with the highest expression scores in each tissue, only a minority of transcripts appeared to be tissue-specific (TSI > 0.9), while the majority were also expressed in the other ten tissues (Figure S4).
We next asked how the identified tissue expression patterns compare to those of individual cell types. To investigate that, we correlated our data to the miRNA-seq data generated for primary mammalian cells by FANTOM5 consortium (de Rie et al., 2017). Comparing mouse samples first, we found that FANTOM5 embryonic and neonatal cerebellum tissues strongly correlated with our brain samples (rs=0.89–0.9), while erythroid cells had the strongest correlation with spleen and bone marrow (rs=0.93) (Figure S5A). To perform a comparison with human samples, we focused on the expression scores of 531 orthologs detected in both the current study and the FANTOM5 samples (Figure S5B). Spearman correlation coefficients reflected the cell-type composition of tissues (Figure S5C). As such, we observed that mouse bone marrow and spleen had the highest correlation with human B-cells, T-cells, dendritic cells and macrophages (0.5<rs<0.6), muscle correlated the most with myoblasts and myotubes (rs=0.47), while brain – with neural stem cells, spinal cord, pineal and pituitary glands (rs=0.49) (Figure S5C).
Tissue-specific snoRNAs
Given the ability of snoRNA to separate the majority of profiled tissues based on expression (Figure S3A), we next focused on identifying the most variable snoRNAs across tissues. Among the verified snoRNAs we detected sixty that varied across tissues (Figure 3D, Methods). The top variable snoRNAs included Snora35 and Snord116, which were highly expressed in the brain and previously shown to be specific to neural tissues (Cavaille et al., 2000). Bone marrow contained the largest fraction of tissue-specific snoRNAs, while a few were also present in spleen, pancreas, liver and testis. Importantly, we found that Snord70 and Snord66, often used as normalization controls in qPCR-based assays (Chen et al., 2012; Emde et al., 2015), are also expressed in a tissue-specific manner. Another example of an identified tissue-specific snoRNA is Snord123, located 3 kb upstream of the pancreatic cancer-associated Sema5a gene, which we found to be expressed predominantly in the pancreas (Figure 3E). We also discovered several snoRNAs whose existence had been previously predicted, but which had not yet been detected experimentally, to be specifically expressed across profiled tissues (Figure 3E).
miRNAs are expressed in a gender-specific manner
To address a long standing question of gender bias in miRNA expression (Guo et al., 2017; Kolhe et al., 2017), we compared the miRNA expression levels between female and male mice. In each somatic tissue, except pancreas, we identified at least two miRNAs to be differentially expressed (log2FoldChange > 1, normalized counts > 100) at FDR < 0.01 between genders (Figure 4, Figure S6). Kidney and lung contained the highest number of gender-biased miRNAs (27 and 18 respectively), while only two were detected in the heart, five in the muscle and seven in the brain (Figure S7A). Among the identified gender-biased miRNAs, we found miR-411, miR-186, miR-340, miR-182, miR-183, miR-148a, miR-145a, miR-101b to be systematically overrepresented in female compared to male tissues and miR-379, miR-195a, miR-99a, miR-let-7g, miR-666, miR-15b, miR-151 to be expressed higher in male over female tissues (Figure S7B). Interestingly, three out of eight female-dominant miRNAs: miR-182, miR-148a and miR-145a, were also shown previously to be estrogen regulated (Klinge, 2009) while another miRNA, miRNA-340, was reported to be downregulated in response to elevated androgen levels (Fletcher et al., 2014).
Given the innate ability of miRNA to lower the levels of target mRNA (Guo et al., 2010), we hypothesized that the levels of protein-coding transcripts targeted by gender-biased miRNAs would also differ between male and female tissues. To test this hypothesis we correlated the expression of gender-biased miRNAs with the levels of their respective target mRNAs across profiled tissues (Methods). Among the anticorrelated targets (rs < −0.8, FDR < 0.1) we identified two genes previously shown to be sexually dimorphic (Figure S8C). Specifically, we found miR-423, upregulated in male tissues, to negatively correlate with its target, estrogen-related receptor gamma (Esrrg) (rs = −0.9, FDR < 0.1), and female-specific miR-340 to negatively correlate with androgen-associated ectodysplasin A2 receptor (Eda2r) (Prodi et al., 2008). However, we also found that the majority of the predicted targets were in fact positively correlated with the respective miRNAs (rs > 0.8, FDR < 0.1), suggesting that the tested miRNAs are not involved in silencing of these targets through degradation.
Novel miRNAs
To search for novel miRNA in our data, unmapped reads were analyzed with the miRDeep2 framework (Friedländer et al., 2012). We identified 473 novel miRNAs supported by at least five sequencing reads, with the majority being present in only one tissue (312), but a small number (4) being found in all eleven tissues (Figure 5A). Principal component analysis on the newly identified miRNAs, supported by at least 50 reads, showed a clear separation of brain, lung and muscle from other tissues based on the expression values. Similar to annotated transcripts, novel miRNAs demonstrate a spectrum of tissue specificity with some being ubiquitously expressed, while others are only present in one tissue (Figure S8A). Differential expression analysis on putative novel miRNAs identified six miRNAs to be also expressed in a gender-specific manner. Strikingly, all six were male-dominant, with one of them even found to be consistently upregulated in two tissues, male muscle and pancreas (Figure 5B and S8C). We speculate that the prevalence of male-specific novel miRNAs identified in our study reflects the inconsistent sampling of both genders by prior murine miRNA research.
miRNA-based tissue classification
We finally asked whether the observed variation in miRNA expression across tissues (Figure 2B and C) would be sufficient to accurately predict the tissue type based solely on miRNA-seq data. To address this question we set out to construct an algorithm that can learn tissue characteristics from the data reported in the current study and make predictions on new data sets. We first trained a support vector machine (SVM) model (Cortes and Vapnik, 1995) on 134 data sets generated in this study, each containing the expression scores for 1973 miRNAs (Figure S9A). As a validation dataset we used available miRNA-seq data released by the ENCODE consortium for multiple mouse tissues (Dunham et al., 2012). Notably, the ENCODE datasets contained data generated for the postnatal and embryonic life stages, as opposed to the adult stage profiled in the current study (Table S4). Nonetheless, our SVM model accurately classified postnatal forebrain, midbrain, hindbrain and neural tube as brain tissue, and accurately inferred the tissue types for heart, intestine, kidney, liver and muscle samples, yielding an overall accuracy of 0.96 (Methods). For the embryonic tissues, however, our model reached an accuracy of only 0.69. This was mainly due to the inability of the model to correctly classify liver tissues, which it instead assigned to bone marrow (Figure 6A). Strikingly, in this case our model accurately predicted the hematopoietic function of the organ, known to shift from the liver at the embryonic stages to the bone marrow in adulthood (Baron et al., 2012), rather than the tissue type itself. Furthermore, we identified hematopoiesis-associated miR-150 and miR-155 (Bissels et al., 2012) as having the highest weights among the features defining the bone marrow in our model (Figure S9B).
The random forest (RF) regression model (Tin Kam Ho, 1998) weighted miRNA differently (Figure S9C) and was able to predict the postnatal and embryonic tissues with respective accuracies 1.0 and 0.9, guessing correctly 27/27 and 44/49 samples and misassigning one of the embryonic kidney, lung, liver and intestine samples (Figure 6B).
DISCUSSION
Small non-coding RNA plays an indispensable role in shaping cellular identity by altering the levels of protein-coding transcripts (Matera et al., 2007; Qureshi and Mehler, 2012). Recent efforts in profiling the miRNA content of cells and tissues demonstrated the existence of tissue- and cell type-specific short non-coding transcripts (Landgraf et al., 2007; Londin et al., 2015; Ludwig et al., 2016; McCall et al., 2017; de Rie et al., 2017). In this work, we show that this phenomenon extends beyond one sRNA class and involves not only tissue-specific but also gender-specific sRNA expression. Compared to previous studies, we use a large number of biological replicates (n=14) and focus on deriving a quantitative rather than qualitative reference of sRNA expression across eleven normal murine tissues. We demonstrate that the obtained quantitative data can be used to train machine learning algorithms to recognize tissues or even their functions based on sRNA expression.
We found a large number of new tissue-specific sRNA missed by previous studies (Figure 3 and S4, Table S2) because of low sequencing coverage (Landgraf et al., 2007) or due to insignificant expression scores derived from a single biological replicate (McCall et al., 2017). By analyzing the expression of several classes of sRNA we discovered a large number of snoRNA expressed mainly in the brain, pancreas or bone marrow (Figure 3D). Taken together with previous observations (Jorjani et al., 2016), this finding raises additional questions regarding the biogenesis pathways of snoRNA as well as its potential specialized functions across distinct tissues.
More than 60% of protein-coding genes in the mammalian genome harbor predicted miRNA target sites (Friedman et al., 2008). However, only a handful of them are bona fide miRNA targets, while up to 70% are falsely assigned by prediction algorithms (Agarwal et al., 2015; Betel et al., 2010; Dweep and Gretz, 2015). Currently, the validation of miRNA:mRNA interactions is still mainly based on low-throughput, labor-intensive approaches such as knock-down or over-expression assays (Thompson et al., 2015) and thus has been done for only a limited number of miRNAs. Growing amounts of both mRNA- and miRNA-seq data, generated for various cell and tissue types, now make it possible to narrow down the list of putative targets by identifying those whose levels go down with elevated miRNA levels (Guo et al., 2010). Here, using our tissue miRNA-seq dataset and publicly available mRNA-seq data, we demonstrate the implementation of this approach. By correlating the expression of miRNA:target pairs across tissues, we show that at least half of the predicted targets are not affected by increasing miRNA levels. In parallel, we were able to identify targets that show a strong negative correlation with miRNAs. Using sexually dimorphic miRNAs as an example, we further demonstrate that the levels of some protein-coding transcripts indeed decrease with increased miRNA levels (Figure S6C), suggesting that miRNAs contribute to gender biases in gene expression.
Our work, to our knowledge, is the first to demonstrate that the expression of short non-coding transcripts can be used to accurately predict tissue types (Figure 6). Within the current study we show that machine learning algorithms applied to quantitative sRNA expression estimates yield robust tissue classifiers applicable to data generated by other groups. Given the emerging evidence of strong miRNA dysregulation in disease (Mendell and Olson, 2012), particularly in cancers (Jansson and Lund, 2012; Peng and Croce, 2016), we anticipate that in the future sRNA-based classifiers could be expanded to recognize affected tissues or various disease types.
sRNAs have long been known to regulate the development and functions of the brain (Qureshi and Mehler, 2012). Our study finds that brain, in fact, contains the largest number of unique mammalian sRNA transcripts that are absent in other tissues. We found that lung also contains a large number of tissue-specific sRNAs, and expresses the largest number of distinct sRNAs among the eleven profiled tissues (Figure 2D). Given the complexity of tissues, one would expect that the majority of specific transcripts reside within a particular cell type or state uniquely present in the tissue. However, our knowledge of cell-specific miRNA expression is not complete and does not yet allow us to identify all the cell types driving the complexity (McCall et al., 2017; de Rie et al., 2017). We directly observed this phenomenon in miR-92b, miR-448 and miR-1298, which we identified as expressed in both brain and lung tissues (Figure 3B and S4). According to the cell-type-based studies, however, these transcripts are specific to neural and stem cells (de Rie et al., 2017), which explains why we see them in the brain, but does not explain their presence in the lung. This inability to fully explain the roots of tissue complexity underscores the need for further characterization of the sRNA content of specific cell types or even, similarly to mRNA, that of single cells (Faridani et al., 2016; Trapnell, 2015). This atlas, meanwhile, will provide a solid foundation for future sRNA studies and will serve as a powerful resource of sRNA tissue identity for fundamental and clinical research.
Author Contributions
A.I. designed and performed the experiments and data analysis. A.I. and S.Q. interpreted the data and wrote the manuscript.
Competing interest
The authors declare no conflict of interest.
STAR Methods
Subject details
Animals
All procedures followed animal care and biosafety guidelines approved by Stanford University’s Administrative Panel on Laboratory Animal Care and Administrative Panel of Biosafety. Wild-type C57BL/6J mice, 4 males and 10 females, aged ~3 months, were used (Table S1).
Methods details
Tissue handling and RNA extraction
Upon collection, tissue samples were submerged and preserved at −80 °C in RNAlater stabilization solution (ThermoFisher cat # AM7021) until further processing. Total RNA was isolated from ~100 mg of tissue using the Qiagen miRNeasy mini kit (cat # 217004) and the Qiagen tissue lyser with 5 mm stainless steel beads. RNA integrity was assessed on an Agilent Bioanalyzer using the RNA 6000 pico kit (Agilent Technologies cat # 5067–1513).
Library preparation and sequencing
Short RNA libraries were prepared using the Illumina TruSeq Small RNA Library Preparation kit (cat # RS-200–0012, RS-200–0024, RS-200–0036, RS-200–0048) according to the manufacturer’s protocol and size-selected using a Pippin Prep 3% Agarose Gel Cassette (Sage Science) in the range 135–210 bp. Samples were pooled in batches of 48 and sequenced on the Illumina NextSeq500 instrument in single-read, 50- or 75-base mode.
Data processing
Sequencing reads were demultiplexed by BaseSpace (Illumina). Reads were trimmed of adaptor sequences and aligned to the mouse genome (GRCm38) using STAR_v2.5.1 (Dobin et al., 2013) with the following parameters: --outFilterMismatchNoverLmax 0.05 --outFilterMatchNmin 16 --outFilterMatchNminOverLread 0 --outFilterScoreMinOverLread 0 --alignIntronMax 1 --outMultimapperOrder Random --clip3pAdapterSeq TGGAATTCTC --clip3pAdapterMMp 0.1. Spliced alignments and hard/soft-clipping were disabled. Reads mapping with insertions or deletions were removed. We used ENCODE GRCm38 and miRBase v22 annotations to count the number of sRNA transcripts. Reads were assigned to one of the annotated biotypes (miRNAs, snoRNAs, snRNAs and scaRNAs from ENSEMBL) or to pre-miRNAs from miRBase v22 using featureCounts v1.6.1 (Liao et al., 2014). We first counted reads from both intronic and exonic regions of protein-coding and lincRNA genes, attempting to capture small RNAs transcribed from these regions (Figure S1A and B), using the following command: featureCounts -a Mus_musculus.GRCm38.90.gtf -M --primary -s 0. We used the -M --primary options to count reads mapping to multiple genomic locations that were marked as “primary alignment” by STAR. Next, we counted only uniquely mapping sRNAs using the following parameters: featureCounts -a Mus_musculus.GRCm38.90.gtf -s 0. We excluded libraries with fewer than 10,000 mapped reads. All tissue specificity and differential expression analyses in the manuscript were carried out on uniquely mapping counts (i.e., reads mapped to only one genomic location), except where otherwise noted.
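The alignment settings and the library depth filter described above can be collected in one place, which makes them easier to audit. The sketch below is illustrative Python rather than the pipeline used in the study: the STAR parameter values are copied from the text, whereas `--genomeDir`, `--readFilesIn` and `--outFileNamePrefix` are standard STAR options that the text does not list, and the paths and function names are hypothetical placeholders.

```python
import shlex

# Parameter values as listed in the text.
STAR_PARAMS = {
    "--outFilterMismatchNoverLmax": "0.05",
    "--outFilterMatchNmin": "16",
    "--outFilterMatchNminOverLread": "0",
    "--outFilterScoreMinOverLread": "0",
    "--alignIntronMax": "1",
    "--outMultimapperOrder": "Random",
    "--clip3pAdapterSeq": "TGGAATTCTC",
    "--clip3pAdapterMMp": "0.1",
}


def star_command(genome_dir, fastq, prefix):
    """Build the STAR argument list for one library.

    genome_dir, fastq and prefix are placeholders supplied by the caller;
    the input/output options are assumptions, not taken from the text.
    """
    cmd = ["STAR", "--genomeDir", genome_dir,
           "--readFilesIn", fastq, "--outFileNamePrefix", prefix]
    for flag, value in STAR_PARAMS.items():
        cmd += [flag, value]
    return cmd


def passes_depth_filter(mapped_reads, minimum=10_000):
    """Keep a library only if it has at least `minimum` mapped reads."""
    return mapped_reads >= minimum


if __name__ == "__main__":
    # Print a shell-safe command line for inspection (hypothetical paths).
    args = star_command("GRCm38_index", "sample.fastq", "sample_")
    print(" ".join(shlex.quote(a) for a in args))
```

Returning an argument list rather than a single string lets the command be passed directly to `subprocess.run` without shell quoting issues.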
Unsupervised clustering and dimensionality reduction analysis
sRNA raw counts were normalized and log-transformed using the DESeq2 package (Love et al., 2014). Batch effects were corrected using the limma R package (Ritchie et al., 2015). Hierarchical clustering was performed on log2-transformed expression values using complete linkage as the distance measure between clusters. We computed Euclidean distances between samples and used these values to perform t-distributed stochastic neighbor embedding (t-SNE) (van der Maaten and Hinton, 2008) with the following parameters: perplexity = 20, initial dimensions of 50 and a maximum of 1,000 iterations. Transcripts detected in one or more samples with overall log2 expression scores <1 were excluded from this analysis.
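To make the two distance notions in this paragraph concrete, the following Python sketch computes pairwise Euclidean distances between samples and the complete-linkage distance between two clusters (the largest pairwise distance between their members). It is a stdlib-only illustration with hypothetical function names; the study itself used R, and normalization and batch correction are assumed to have been applied beforehand.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two expression vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distance_matrix(samples):
    """Symmetric matrix of pairwise distances between samples.

    `samples` is a list of log2 expression vectors, one per library.
    """
    n = len(samples)
    return [[euclidean(samples[i], samples[j]) for j in range(n)]
            for i in range(n)]


def complete_linkage(cluster_a, cluster_b, dist):
    """Complete-linkage distance: the farthest pair across two clusters.

    cluster_a and cluster_b are lists of sample indices into `dist`.
    """
    return max(dist[i][j] for i in cluster_a for j in cluster_b)
```

Agglomerative clustering repeatedly merges the two clusters with the smallest such distance; the same distance matrix can also serve as input to t-SNE.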
Tissue specificity index
To compute the tissue specificity index we used the formula described previously (Ludwig et al., 2016):
$$\mathrm{TSI}_j = \frac{\sum_{i=1}^{N} \left(1 - x_{j,i}\right)}{N - 1}$$
where N is the total number of tissues measured and x_{j,i} is the expression score of tissue i normalized by the maximal expression of any tissue for miRNA j.
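A worked example may help with the definitions of N and x_{j,i}. The Python sketch below assumes the commonly used form of the index, TSI_j = Σ_i (1 − x_{j,i}) / (N − 1), with x_{j,i} the expression in tissue i divided by the maximum across tissues; the function name is hypothetical and this is not the authors' implementation.

```python
def tissue_specificity_index(expression):
    """Compute the TSI of one transcript from its per-tissue expression.

    `expression` holds one non-negative expression score per tissue.
    Assumed form: sum over tissues of (1 - x_i / max(x)), divided by N - 1.
    """
    n_tissues = len(expression)
    if n_tissues < 2:
        raise ValueError("TSI needs at least two tissues")
    peak = max(expression)
    if peak <= 0:
        raise ValueError("transcript is not expressed in any tissue")
    return sum(1 - value / peak for value in expression) / (n_tissues - 1)
```

A transcript detected in a single tissue, such as `[10, 0, 0, 0]`, gives a TSI of 1, whereas uniform expression, such as `[5, 5, 5, 5]`, gives 0; intermediate profiles fall in between.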
Comparison with available miRNA data
To compute Spearman correlation coefficients between samples generated in the current study and mouse miRNA data generated by the FANTOM5 consortium (de Rie et al., 2017) we used DESeq2-normalized scores of 2207 annotated miRNAs. To compare miRNA expression between mouse tissues and human cell types we generated a curated list of orthologous miRNAs that contained a maximum of two mismatches per orthologous mature miRNA. 531 miRNAs passed this criterion and were used to compute the Spearman correlation coefficients shown in Figure S5.
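Spearman correlation is the Pearson correlation of the ranks, which is why it suits comparisons between datasets produced with different protocols. The following is a small stdlib-only Python sketch of that computation, with ties given their average rank; the function names are hypothetical, and in practice a library routine would normally be used.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the two rank vectors.

    Assumes neither vector is constant (otherwise the denominator is zero).
    """
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in rx))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in ry))
    return cov / (sd_x * sd_y)
```

Each argument would be the vector of normalized scores for the shared miRNAs in one sample, in the same miRNA order.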
Differential expression analysis with DESeq2
miRNAs differentially expressed between female and male tissues were identified using DESeq2 (Love et al., 2014). To estimate the distribution of p-values under the null hypothesis, we performed a permutation test in which we randomly re-assigned the sex labels of the 14 samples within each tissue and plotted the distribution of DESeq2 p-values computed for the two groups (i.e. female and male) (Figure S6). We used Benjamini-Hochberg-corrected p-values to assess the statistical significance of the computed DE scores (Figure 4 and S7A). The differentially expressed miRNAs were visualized on volcano plots, where male- and female-specific miRNAs (adjusted P-value < 0.01 and absolute log2 fold change > 1) were labeled accordingly.
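The multiple-testing correction, the label permutation and the significance filter can each be written in a few lines. The Python sketch below is illustrative only: DESeq2 performs the actual testing in the study, the function names are hypothetical, and the thresholds are the ones quoted in the text (adjusted P < 0.01, absolute log2 fold change > 1, normalized counts > 100).

```python
import random


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def permute_labels(labels, seed=None):
    """Randomly re-assign sex labels while keeping the group sizes."""
    return random.Random(seed).sample(labels, len(labels))


def is_sex_biased(log2_fold_change, padj, mean_count,
                  lfc_cut=1.0, fdr_cut=0.01, count_cut=100):
    """Apply the thresholds quoted in the text to one miRNA."""
    return (abs(log2_fold_change) > lfc_cut
            and padj < fdr_cut
            and mean_count > count_cut)
```

Re-running the test on permuted labels should yield few or no miRNAs passing `is_sex_biased`, which is the expectation a null distribution is meant to confirm.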
Analysis of correlation between miRNA expression and the expression of its targets
Putative miRNA target genes were extracted from the TargetScan, DIANA, miRanda and miRDB databases (Agarwal et al., 2015; Griffiths-Jones et al., 2007; Paraskevopoulou et al., 2013; Wong and Wang, 2015). Only targets present in two or more databases were used. The gene expression scores of the respective targets in various tissues were extracted from the ENCODE database (Pennisi, 2012) (Table S6). Spearman correlation coefficients were computed between FPKM values retrieved from the ENCODE mRNA expression tables and DESeq2-normalized miRNA counts across ten profiled tissues using the corr.test() function from the ‘psych’ R package (Revelle, W.). Pairs were retained if they had a Benjamini-Hochberg adjusted P-value below 0.1 and a Spearman correlation coefficient of rs < −0.8 or rs > 0.8.
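The two filters in this paragraph, agreement between prediction databases and a correlation threshold, can be sketched as follows. This is a hedged Python illustration, not the authors' R code: the function names and the input layout (a dictionary from database name to a set of miRNA-gene pairs) are assumptions, and the cutoffs follow the thresholds quoted in the Results (|rs| > 0.8, FDR < 0.1).

```python
from collections import Counter


def consensus_targets(predictions, min_databases=2):
    """Keep miRNA-target pairs predicted by at least `min_databases` sources.

    `predictions` maps a database name to a set of (miRNA, gene) pairs.
    """
    support = Counter(pair
                      for pairs in predictions.values()
                      for pair in set(pairs))
    return {pair for pair, n in support.items() if n >= min_databases}


def classify_pair(rs, padj, r_cut=0.8, fdr_cut=0.1):
    """Label a miRNA-target pair by the sign of a significant correlation."""
    if padj >= fdr_cut or abs(rs) <= r_cut:
        return "not significant"
    return "anticorrelated" if rs < 0 else "positively correlated"
```

Only pairs labeled "anticorrelated" are consistent with miRNA-mediated reduction of target mRNA levels; the positively correlated pairs are the ones the Results describe as unlikely to be silenced through degradation.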
Identification of candidate novel miRNA
Candidate novel miRNA were identified using miRDeep2 software (Friedländer et al., 2012). Only miRNAs supported by > 5 reads were reported in this study. We analyzed tissue- and gender-specificity of novel miRNAs based on transcripts supported by at least 50 sequencing reads across all samples. Statistical analysis and data visualization were performed as described above for annotated miRNAs.
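The two read-support thresholds act as successive filters: a lower one for reporting a candidate and a higher one for including it in the specificity analyses. The Python sketch below illustrates this under stated assumptions: support is taken as the total read count summed across samples, the reporting threshold is treated as "at least five reads" (the Results wording), and the function name and input layout are hypothetical.

```python
def split_by_read_support(read_counts, report_min=5, analysis_min=50):
    """Separate candidates into reported and analysed sets.

    `read_counts` maps a candidate name to its per-sample read counts.
    Support is assumed to be the sum of reads across all samples.
    """
    reported, analysed = set(), set()
    for name, counts in read_counts.items():
        total = sum(counts)
        if total >= report_min:
            reported.add(name)
        if total >= analysis_min:
            analysed.add(name)
    return reported, analysed
```

With `analysis_min` at or above `report_min`, every analysed candidate is also a reported one, so the second set is always a subset of the first.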
Machine learning
We trained the radial kernel SVM and the Random Forest models on 136 samples corresponding to different tissue types (Figure S9A) using the e1071 (Meyer et al., 2017) and caret (Kuhn, 2018) R packages, respectively. We used z-scores of DESeq2-normalized counts obtained in this study as the training dataset and those obtained from ENCODE miRNA-seq data as the test dataset (Table S4). We normalized and scaled the training and test datasets separately.
To measure the predictive power of each model we used the accuracy measure, calculated as follows:
$$\mathrm{Accuracy} = \frac{\text{number of correctly classified samples}}{\text{total number of samples}}$$
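Scaling the training and test datasets separately means that each dataset is standardized with its own means and standard deviations. The Python sketch below shows one way this could look, assuming z-scores are computed per miRNA across the samples of a dataset and using the sample standard deviation (n − 1 denominator); both choices, and the function name, are assumptions rather than details given in the text.

```python
import statistics


def zscore_columns(matrix):
    """Standardize each feature (column) across the samples (rows).

    `matrix` is a list of samples, each a list of normalized counts in a
    fixed miRNA order. Features with zero variance are set to 0.0.
    """
    columns = list(zip(*matrix))
    means = [statistics.mean(col) for col in columns]
    sds = [statistics.stdev(col) for col in columns]  # needs >= 2 samples
    return [[(value - mean) / sd if sd > 0 else 0.0
             for value, mean, sd in zip(row, means, sds)]
            for row in matrix]
```

Calling `zscore_columns` once on the training matrix and once on the test matrix keeps the two datasets independent, at the cost of assuming that each covers a comparable mix of tissues.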
We tuned the SVM model to derive the optimal cost and gamma using the tune.svm() function, searching within gamma ∈ [2^(−10): 2^10] and cost ∈ [10^(−5):10^3]. We tuned the RF model using first random and then grid search, with the evaluation metric set to “Accuracy”. The accuracy was computed using a 10-fold cross-validation procedure (Japkowicz and Shah, 2011). The reported accuracy is the mean over the 10 testing sets, where in each round nine folds are used for training and the held-out fold is used as the test set. The R script used to train the models and compute the predictions is included in the supplement.
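The cross-validation procedure and the search grid can be sketched independently of any particular classifier. In the Python illustration below, accuracy is taken as the fraction of correctly classified samples, `fit_predict` is a hypothetical callable standing in for the SVM or RF training and prediction step, and the step size of the gamma and cost grids (one power of 2 or 10 per step) is an assumption, since the text gives only the endpoints.

```python
import random


def accuracy(true_labels, predicted_labels):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and deal them into k folds (needs n >= k)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validated_accuracy(features, labels, fit_predict, k=10, seed=0):
    """Mean accuracy over k held-out folds.

    fit_predict(train_x, train_y, test_x) must return predicted labels
    for test_x; it is a placeholder for the actual model.
    """
    scores = []
    for held_out in k_fold_indices(len(labels), k, seed):
        held = set(held_out)
        train = [i for i in range(len(labels)) if i not in held]
        predictions = fit_predict([features[i] for i in train],
                                  [labels[i] for i in train],
                                  [features[i] for i in held_out])
        scores.append(accuracy([labels[i] for i in held_out], predictions))
    return sum(scores) / len(scores)


# Search grid endpoints from the text; the unit exponent step is assumed.
GAMMA_GRID = [2.0 ** e for e in range(-10, 11)]
COST_GRID = [10.0 ** e for e in range(-5, 4)]
```

A tuning loop would evaluate `cross_validated_accuracy` for each (gamma, cost) pair and keep the pair with the highest mean accuracy.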
Data availability
The datasets generated and analyzed in the study are available in the NCBI Gene Expression Omnibus (GEO) under the entry GEO:GSE119661.
Supplementary figures
Figure S9. Machine learning statistics. (A) RNA-seq samples used to train machine learning models. Left: number of training datasets corresponding to each tissue; right: examples of feature scores. (B) miRNAs assigned the highest weights in defining bone marrow in the SVM model. (C) Top 20 miRNAs attributed the highest weights in defining tissue types by the RF model.
Supplementary tables
Table S1. Animals used in the study.
Table S2. Mean normalized expression counts.
Table S3. Novel miRNAs, sequence and expression counts.
Table S4. ENCODE data used for ML predictions.
Table S5. Targets of sexually dimorphic miRNAs.
Table S6. ENCODE mRNA datasets used in the study.
Acknowledgements
We thank Dylan Henderson for assistance in RNA extraction and library preparation; Norma Neff and Jennifer Okamoto for sequencing expertise; and Geoff Stanley and Kiran Kocherlakota for kind advice on tissue dissection and preservation. This study was supported by the Howard Hughes Medical Institute and the Chan Zuckerberg Biohub. A.I. was supported by the Swiss National Foundation Early PostDoc Mobility Fellowship.