Abstract
Long non-coding RNAs (lncRNAs) that drive tumorigenesis are a growing focus of cancer genomics studies. To facilitate further discovery, we have created the “Cancer LncRNA Census” (CLC), a manually-curated and strictly-defined compilation of lncRNAs with causative roles in cancer. CLC has two principle applications: first, as a resource for training and benchmarking de novo identification methods; and second, as a dataset for studying the fundamental properties of these genes.
CLC Version 1 comprises 122 lncRNAs implicated in 31 distinct cancers. LncRNAs are included based on functional or genetic evidence for different causative roles in cancer progression. All belong to the GENCODE reference annotation, to facilitate integration across projects and datasets. For each entry, the evidence type, biological activity (oncogene or tumour suppressor), source reference and cancer type are recorded. CLC genes are significantly enriched amongst de novo predicted driver genes from PCAWG. CLC genes are distinguished from other lncRNAs by a series of features consistent with biological function, including gene length, expression and sequence conservation of both exons and promoters. We identify a trend for CLC genes to be co-localised with known protein-coding cancer genes along the human genome. Finally, by integrating data from transposon-mutagenesis functional screens, we show that mouse orthologues of CLC genes tend also to be cancer driver genes.
Thus CLC represents a valuable resource for research into long non-coding RNAs in cancer. Their evolutionary and genomic properties have implications for understanding disease mechanisms and point to conserved functions across ~80 million years of evolution.
Introduction
Tumorigenesis is driven by a series of genetic mutations that promote cancer phenotypes and consequently experience positive selection (Yates & Campbell 2012). The systematic discovery of such driver mutations, and the genes whose functions they alter, has been made possible by tumour genome sequencing. By collecting the entirety of such genes for every cancer type, we aim to develop a comprehensive view of underlying processes and pathways, and thereby formulate effective, targeted therapeutic strategies.
The cast of genetic elements implicated in tumorigenesis has recently grown as diverse new classes of non-coding RNAs and regulatory features have been discovered. These include the long non-coding RNAs (lncRNAs), of which tens of thousands have been catalogued (Derrien et al. 2012; Cabili et al. 2011; Guttman et al. 2009; Jia et al. 2010). LncRNAs are >200 nt long transcripts with no protein-coding capacity. Their evolutionary conservation and regulated expression, combined with a number of well-characterised examples, have together led to the view that lncRNAs are bona fide functional genes (Grote et al. 2013; Sauvageau et al. 2013; Ulitsky & Bartel 2013; Liu et al. 2017). Current thinking holds that lncRNAs function by forming complexes with proteins and RNA both inside and outside the nucleus (Guttman & Rinn 2012; Johnson & Guigó 2014).
LncRNAs have been shown to play important roles in various cancers. For example, MALAT1, a potent oncogene across numerous cancers, is restricted to the nucleus and plays a housekeeping role in splicing (Gutschner & Diederichs 2012; Engreitz et al. 2014). MALAT1 is overexpressed in a variety of cancer types, and its knockdown potently reduces not only proliferation but also metastasis in vivo (Gutschner et al. 2013).The MALAT1 gene is subjected to elevated mutational rates in human tumours, suggesting that genetic alteration in its function is an important step in tumorigenesis (Lanzós et al. 2017). LncRNAs may also function as tumour suppressors. LincRNA-p21 acts as a downstream effector of p53 regulation through recruitment of the repressor hnRNP-K (Huarte et al. 2010). These and other examples of lncRNAs linked to cancer, raise the question of how many more remain to be found amongst the ~99% of lncRNAs that are presently uncharacterised (Derrien et al. 2012; Quek et al. 2015; Iyer et al. 2015).
Recent tumour genome sequencing studies, in step with advanced bioinformatic driver-gene prediction methods, have yielded hundreds of new candidate protein-coding driver genes (Tamborero et al. 2013). For economic reasons, these studies initially restricted their attention to “exomes” or the ~2% of the genome covering protein-coding exons (Chang et al. 2013). Unfortunately such a strategy ignores mutations in the remaining ~98% of genomic sequence, home to the majority of lncRNAs (Gutschner & Diederichs 2012; Derrien et al. 2012). Driver gene identification methods rely on statistical models that make a series of assumptions about and simplifications of complex tumour mutation patterns (Lawrence et al. 2014). It is critical to test the performance of such methods using true-positive lists of known driver genes. For protein-coding genes, this role has been fulfilled by the Cancer Gene Census (CGC) (Futreal et al. 2004), collected and regularly updated by manual annotators based on experimentally-validated cases. Comparison of driver predictions to CGC genes facilitates further method refinement and comparison between methods (Sjoblom et al. 2006; Redon et al. 2006; Mularoni et al. 2016).
In addition to its benchmarking role, the CGC resource has also been useful in identifying unique biological features of cancer genes (Furney et al. 2006; Furney et al. 2008). For example, CGC genes tend to be more conserved and longer. Furthermore, they are enriched for genes with transcription regulator activity and nucleic acid binding functions (Furney et al. 2006; Furney et al. 2008).
Until very recently, efforts to discover cancer lncRNAs have depended on classical functional genomics approaches of differential expression using microarrays or RNA sequencing (Huarte et al. 2010; Iyer et al. 2015). While valuable, differential expression per se is not direct evidence for causative roles in tumour evolution. To more directly identify lncRNAs that drive cancer progression, methods have recently been developed to search for signals of positive selection using mutation maps created using tumour genome sequences. OncodriveFML utilises nucleotide-level functional impact scores inferred from predicted changes in RNA secondary structure (Sabarinathan et al. 2013) together with an empirical significance estimate, to identify lncRNAs with an excess of high-impact mutations (Mularoni et al. 2016). Another method, ExInAtor, identifies candidates with elevated mutational load, using trinucleotide-adjusted local background (Lanzós et al. 2017). A clear impediment in both cases has been the lack of true-positive set of known lncRNA driver genes, analogous to CGC. Although there do exist databases of cancer lncRNAs, notably LncRNADisease (Chen et al. 2013) and Lnc2Cancer (Ning et al. 2016), they mix unfiltered data from numerous sources, resulting in inconsistent criteria for inclusion (including expression changes), and inconsistent gene identifiers.
To facilitate the future discovery of cancer lncRNAs, and gain insights into their biology, we have compiled a highly-curated set of cases with roles in cancer processes. Here we present the Cancer LncRNA Census (CLC), the first compendium of lncRNAs with direct functional or genetic evidence for cancer roles. We demonstrate the utility of CLC in assessing the performance of driver lncRNA predictions. Through analysis of this geneset, we demonstrate that cancer lncRNAs have a unique series of features that may in future be used to assist de novo predictions. Finally, we show that CLC genes have conserved cancer roles across the approximately 80 million years of evolution separating humans and rodents.
Results
Definition of cancer related lncRNAs
As part of recent efforts to identify driver lncRNAs within the Pancancer Analysis of Whole Genomes (PCAWG) project, we discovered the need for a high-confidence set of cancer-related lncRNA genes (CRL) (Lanzós et al. 2017), which we henceforth refer to as “cancer lncRNAs”, and which we here present as Version 1 of the Cancer LncRNA Census (CLC).
Cancer lncRNAs were identified from the literature using defined and consistent criteria, being direct experimental or genetic evidence for roles in cancer progression or phenotypes (see Materials and Methods). Alterations in lncRNA expression alone were not considered sufficient evidence. Importantly, only lncRNAs with GENCODE identifiers were included. For every cancer lncRNA, one or more associated cancer types were collected.
Attesting to the value of this approach, we identified several cases in semi-automatically annotated cancer lncRNA databases of lncRNAs that were misassigned GENCODE identifiers, usually that of their overlapping protein-coding gene (Chen et al. 2013). We also excluded a number of published lncRNAs for which we could not find evidence to meet our criteria, for example CONCR, SRA1 and KCNQ1OT1 (Marchese et al. 2016; Lanz et al. 1999; Higashimoto et al. 2006).
Version 1 of CLC contains a total of 122 genes, or 0.76% of a total of 15,941 lncRNA gene loci annotated in GENCODE v24 (Derrien et al. 2012; Harrow et al. 2012) (Figure 1) (eight of the CLC genes are annotated as pseudogenes rather than lncRNAs by GENCODE). The entire remaining set of 15,827 lncRNA loci is henceforth referred to as “nonCLC” (Figure 1). The full CLC dataset is found in Supplementary Table 1. The CLC set is smaller than the Lnc2Cancer database (n= 667) (Ning et al. 2016), the LncRNADisease Database (n=121) (Chen et al. 2013) and lncRNAdb (n=191) (Quek et al. 2015). CLC covers between 17% and 31% of these databases (Lnc2Cancer and LncRNADisease respectively) but none of these resources contain the complete list of genes presented here (Figure 2). CLC is substantially larger than a previous “Cancer Related LncRNAs” set we produced (Lanzós et al. 2017)(Figure 2B). For comparison, the Cancer Gene Census (CGC) (COSMIC v78, downloaded Oct, 3, 2016) lists 561 or 2.8% of protein-coding genes (Futreal et al. 2004). Using the International Classification of Diseases for Oncology (World Health Organization 2013), we reassigned the cancer types described in the original research articles to a reduced set of 31(Figure 1, Supplementary Figure 1).
Altogether, CLC contains 337 unique lncRNA-cancer type relationships. Out of 122 genes, 77 (63.1%) were shown to function as oncogenes, 36 (29.5%) as tumour suppressors, and 9 (7.4%) with evidence for both activities (Figure 1 and Supplementary Figure 1).
The most prolific lncRNAs, with >15 recorded cancer types, are HOTAIR, MALAT1, MEG3 and H19 (Figure 1 and Supplementary Figure 1). It is not clear whether this reflects their unique pan-cancer functionality, or is simply a result of their being amongst the most early-discovered and widely-studied lncRNAs.
In vitro experiments were the most frequent evidence source, usually consisting of RNAi-mediated knockdown in cultured cell lines, coupled to phenotypic assays such as proliferation or migration (Supplementary Figure 1). Far fewer have been studied in vivo, or have cancer-associated somatic or germline mutations. 19 lncRNAs had 3 or more independent evidence sources (Supplementary Figure 1).
CLC for benchmarking lncRNA driver prediction methods
One of the primary motivations for CLC is as a true positive set for benchmarking and comparing methods for identifying driver lncRNAs. In the domain of protein-coding driver gene predictions, the Cancer Gene Census (CGC) has become the “gold standard” training set, against which new methods are tested (Futreal et al. 2004). Typically, the predicted driver genes belonging to CGC are judged to be true positives, and the fraction of these amongst predictions is used to estimate the Positive Predictive Value (PPV), or precision. This measure can be calculated for increasing cutoff levels, to assess the optimal cutoff.
First, we used CLC to examine the performance of the lncRNA driver predictor ExInAtor (Lanzós et al. 2017) in recalling CLC genes using PCAWG tumour mutation data (PCAWG Consortium, Manuscript In Preparation). A total of 2,687 GENCODE lncRNAs were included here, of which 82 (3.1%) belong to CLC. Driver predictions on several cancers at the standard False Discovery Rate (“q-value”) cutoff of 0.1 are shown for selected cancers in Figure 3A. That panel shows the CLC-defined precision (y-axis) as a function of predicted driver genes ranked by q-value (x-axis). We observe rather heterogeneous performance across cancer cohorts. This may reflect a combination of intrinsic biological differences and differences in cohort sizes, which differs widely between the datasets shown. For the merged Pancancer dataset (“PANCAN”), ExInAtor predicted three CLC genes amongst its top ten candidates (q-value < 0.1), a rate far in excess of the background expectation. Similar enrichments are observed for other cancer types. These results support both the predictive value of ExInAtor, and the usefulness of CLC in assessing lncRNA driver predictors. Comprehensive CLC-based assessments of lncRNA driver discovery, across all methods and tumour cohorts in PCAWG, may be found in the main PCAWG driver prediction publication (PCAWG Consortium, Manuscript In Preparation).
Finally, we assessed the precision (i.e. positive predictive value) of PCAWG lncRNA and protein-coding driver predictions across all cancers and all prediction methods (PCAWG Consortium, Manuscript In Preparation). Using the same q-value cutoff of 0.1, we found that across all cancer types and methods, a total of 8 (8.5%) of lncRNA predictions are in CLC (Figure 3B), while a total of 139 (23.1%) of protein-coding predictions are CGC genes (Figure 3C). In terms of sensitivity, 9.8% and 25. 1 % of CLC and CGC genes are predicted as candidates, respectively. Despite the lower detection of CLC genes in comparison to CGC genes, both sensitivity rates significantly exceed the prediction rate of nonCLC and nonCGC genes, again highlighting the usefulness of the CLC geneset.
CLC genes are distinguished by function- and disease-related features
We recently found evidence, using a smaller geneset than presented here, that cancer lncRNAs are distinguished by various genomic and expression features indicative of biological function (Lanzós et al. 2017). We here extended these findings using a large series of potential gene features, to search for those features distinguishing CLC from nonCLC lncRNAs (Figure 4A).
First, associations with expected cancer-related features were tested (Figure 4B). CLC genes are significantly more likely to have their transcription start site (TSS) within 100 kb of cancer-associated germline SNPs (“Cancer SNPs 100kb TSS”), and more likely to be either differentially-expressed or epigenetically-silenced in tumours (Yan et al. 2015) (Figure 4B). Intriguingly, we observed a tendency for CLC lncRNAs to be more likely to lie within 1 kb of known cancer protein-coding genes (“CGC 1kb TSS”) – this is explored in more detail below.
We next investigated the properties of the genes themselves. As seen in Figure 4C, and consistent with our previous findings (Lanzós et al. 2017), CLC genes (“Gene length”) and their spliced products (“Exonic length”) are significantly longer than average. No difference was observed in the ratio of exonic to total length (“Exonic content”), nor repetitive sequence coverage (“Repeats coverage”), nor GC content.
CLC genes also tend to have greater evidence of function, as inferred from evolutionary conservation. Base-level conservation at various evolutionary depths was calculated for lncRNA exons and promoters (Figure 4D). Across all measures tested, using either average base-level scores or percent coverage by conserved elements, we found that CLC genes’ exons are significantly more conserved than other lncRNAs (Figure 4D). The same was observed for conservation of promoter regions.
High levels of gene expression are known to correlate with lncRNA conservation, and are hypothesized to be a reflection of functionality (Managadze et al. 2011). We found that CLC genes have consistently higher steady-state expression levels, across both healthy organs and cultured cell lines (Figure 4E). This difference tended to be more marked for cell lines compared to organ samples. It is unclear whether this is a batch effect, since cell lines data comes from ENCODE (Djebali et al. 2012) and organ samples from Illumina Bodymap, or reflects a generalised upregulation of cancer lncRNA expression in cultured cells. It should be noted that these mainly comprise immortalised cancer cell lines but also includes lines such as IMR-90 that are neither immortal nor transformed.
Evidence for genomic clustering of non-coding and protein-coding cancer genes
In light of recent evidence for colocalisation and coexpression of disease-related lncRNAs and protein-coding genes (Tan et al. 2017), we were curious whether such an effect holds for cancer-related lncRNAs and protein-coding genes. We also asked, more specifically, whether CLC genes tend to be closer to CGC genes than expected by chance, and whether this is manifested in a more co-regulated expression.
Using TSS-TSS distances, we found that CLC genes on average tend to lie moderately closer to protein-coding genes of all types, compared to nonCLC lncRNAs (median 29 kb vs 38 kb respectively) (P=0.03, Wilcoxon test) (Supplementary Figure 2A). Interestingly, this effect is also observed when assessing specifically for distance to CGC genes (median 1,122 kb vs 1,599 kb, respectively) (P= 0.0004, Wilcoxon test) (Figure 5A).
It has been widely proposed that proximal lncRNA / protein-coding gene pairs are involved in cis-regulatory relationships, which is reflected in expression correlation (Ponjavic et al. 2009). We next asked whether proximal CLC-CGC pairs exhibit this behaviour. An important potential confounding factor is the known positive correlation between nearby gene pairs (Marques et al. 2013), and this must be controlled for. Using gene expression data across 11 human cell lines, we observed a positive correlation between CLC-CGC gene pairs for each cell type (Figure 5B). To control for the effect of proximity on correlation, we next randomly sampled a similar number of non-CLC lncRNAs with matched distances (TSS-TSS) from the same CGC genes, and found that this correlation was lost (Figure 5B, “nonCLC-CGC”). To further control for a possible correlation arising from the simple fact that both CGC and CLC genes are involved in cancer, we randomly shuffled the CLC-CGC pairs 1000 times, again observing no correlation (Figure 5B, “Shuffled CLC-CGC”). Together these results show that genomically-proximal protein-coding/non-coding gene pairs exhibit an expression correlation that exceeds that expected by chance, even when controlling for genomic distance.
These results prompted us to further explore the genomic localization of CLC genes relative to their proximal protein-coding gene and the nature of their neighbouring genes. First, considering gene pairs as proximal when both TSS lie within 10 kb, we find, in agreement with our previous results, that CLC genes are significantly more likely to lie near to a CGC gene, compared to other lncRNAs (6.6% vs 2.1%, respectively) (P=0.004, Fisher’s exact test) (Supplementary Figure 2B). Next, we observed an unexpected difference in the genomic organisation of CLC genes: when classified by orientation with respect to nearest protein-coding gene (Derrien et al. 2012), we found a significant enrichment of CLC genes immediately downstream and on the same strand as protein-coding genes (“Samestrand, pc up”, Figure 5C). Moreover, CLC genes are approximately twice as likely to lie in an upstream, divergent orientation to a protein-coding gene (“Divergent”, Figure 5C). Of these CLC genes, 20% are divergent to a CGC gene, compared to 5% for nonCLC genes (P=0.018, Fisher’s exact test) (Figure 5D), and several are divergent to protein-coding genes that have also been linked or defined to be involved in cancer, despite not being classified as CGCs (Supplementary Table 2).
Given this noteworthy enrichment of CGC genes in a divergent configuration to protein-coding genes, we next inspected the latters’ function annotation. Examining their Gene Ontology (GO) terms, molecular pathways and other gene function related terms, we found this group of genes to be enriched in GO terms for “sequence-specific DNA binding”, “DNA binding”, “tube development” and “transcriptional misregulation in cancer” (Figure 5E). These results were confirmed by another, independent GO-analysis suite (see Materials and Methods). Interestingly, three out of the top four functional groups were observed previously in a study of protein-coding genes divergent to long upstream antisense transcripts (LUAT) in primary mouse tissues (Lepoivre et al. 2013).
Thus, CLC genes appear to be non-randomly distributed with respect to protein-coding genes, and particularly their CGC subset.
Evidence for anciently conserved cancer roles of lncRNAs
Numerous studies have employed unbiased forward genetic screens to identify genes that either inhibit or promote tumorigenesis in mice (Copeland & Jenkins 2010). These studies use engineered, randomly-integrating transposons, carrying bidirectional polyadenylation sites as well as strong promoters. Insertions, or clusters of insertions, called “common insertion sites” (CIS) that are identified in sequenced tumour DNA, implicate the overlapping or neighbouring gene locus as either an oncogene or tumour-suppressor gene. Although these studies have traditionally been focused on identifying protein-coding genes, they can in principle also identify non-coding driver loci.
We reasoned that comparison of mouse CISs to orthologous human regions could yield independent evidence for the functionality of human cancer lncRNAs (Figure 6A). To test this, we collected a comprehensive set of CISs in mouse (Abbott et al. 2015), consisting of 2,906 loci from 7 distinct cancer types (Supplementary Table 4). These sites were then mapped to in orthologous regions in the human genome, resulting in 1,309 human CISs, or hCISs. 7.3% of these CISs lie outside of protein-coding gene boundaries, which were used for the following analyses.
Mapping hCISs to lncRNA annotations, we discovered altogether eight CLC genes (6.6%) carrying an insertion within their gene span: DLEU2, GAS5, MONC, NEAT1, PINT, PVT1, SLNCR1, XIST (Table 1). In contrast, just 61 (0.4%) nonCLC genes contained hCISs (Figure 6B). A good example is SLINCR1, shown in Figure 6C, which drives invasiveness of human melanoma cells (Schmidt et al. 2016), and whose mouse orthologue contains a CIS discovered in pancreatic cancer.
This analysis would suggest that CLC genes are enriched for hCISs; however, there remains the possibility that this is confounded by their greater length. To account for this, we performed two separate validations. First, sets of nonCLC genes with CLC-matched length were randomly sampled, and the number of intersecting hCISs per unit gene length (Mb) was counted (Figure 6D). Second, CLC genes were randomly relocated in the genome, and the number of genes intersecting at least one hCIS was counted (Figure 6E). Both analyses showed that the number of intersecting hCISs per Mb of CLC gene span is far greater than expected. In contrast, nonCLC genes show a depletion for hCIS sites (Figure 6F).
Together these analyses demonstrate that CLC genes are orthologous to mouse cancer-causing genomic loci at a rate greater than expected by random chance. These identified cases, and possibly other CLC genes, display cancer functions that have been conserved over tens of millions of years since human-rodent divergence.
Discussion
We have presented the Cancer LncRNA Census, the first controlled set of GENCODE-annotated lncRNAs with demonstrated roles in tumorigenesis or cancer phenotypes.
The present state of knowledge of lncRNAs in cancer, and indeed lncRNAs generally, remains highly incomplete. Consequently, our aim was to create a geneset with the greatest possible confidence, by eliminating the relatively large number of published “cancer lncRNAs” with as-yet unproven causative roles in disease processes. Thus, we used a rather strict definition of cancer lncRNA, being any having direct experimental or genetic evidence for a causative role in cancer phenotypes. By this measure, gene expression changes alone do not suffice. By introducing these well-defined inclusion criteria, we hope to ensure that CLC contains the highest possible proportion of bona fide cancer genes, giving it maximum utility for de novo predictor benchmarking. In addition, its basis in GENCODE ensures portability across datasets and projects. Inevitably some well-known lncRNAs did not meet these criteria (including SRA1, CONCR, KCNQ1OT1) (Marchese et al. 2016; Lanz et al. 1999; Higashimoto et al. 2006); these may be included in future when more validation data becomes available. We believe that CLC will complement the established lncRNA databases such as lncRNAdb, LncRNADisease and Lnc2Cancer, which are more comprehensive, but are likely to have a higher false-positive rate due to their more relaxed inclusion criteria (Chen et al. 2013; Quek et al. 2015; Ning et al. 2016).
De novo lncRNA driver gene discovery is likely to become increasingly important as the number of sequenced tumours grow. The creation and refinement of statistical methods for driver gene discovery will depend on the available of high-quality true positive genesets such as CLC. It will be important to continue to maintain and improve the CLC in step with anticipated growth in publications on validated cancer lncRNAs. Very recently, CRISPR-based screens (Zhu et al. 2016; Liu et al. 2017) have catalogued large numbers of lncRNAs contributing to proliferation in cancer cell lines, which will be incorporated in future versions.
We used CLC to estimate the performance of de novo driver lncRNA predictions from the PCAWG project, made using the ExInAtor pipeline (Lanzós et al. 2017). Supporting the usefulness of this approach, we found a strong enrichment for CLC genes amongst the top-ranked driver predictions. Extending this to the full set of PCAWG driver predictors, approximately ten percent of CLC genes (9.8%) are called as drivers by at least one method (PCAWG Consortium, Manuscript In Preparation), which is lower to the rate of CGC genes identified (25.1%). This difference may be due to technical or biological factors. For example, CLC genes that are not identified as drivers by PCAWG may be false negatives, whose identification simply awaits greater statistical power afforded by more genomes; or else many CLC genes may contribute to cancer phenotypes, but are not targeted by somatic mutations during tumorigenesis. Addressing these questions will be a key objective in future studies.
Analysis of CLC geneset has broadened our understanding of the unique features of cancer lncRNAs, and generally supports the notion that lncRNAs have intrinsic biological functionality. Cancer lncRNAs are distinguished by a series of features that are consistent with both (a) roles in cancer (eg tumour expression changes), and (b) general biological functionality (eg high expression, evolutionary conservation). Elevated evolutionary conservation in the exons of CLC genes would appear to support their functionality as a mature RNA transcript, in contrast to the act of their transcription alone (Latos et al. 2012). Another intriguing observation has been the colocalisation of cancer lncRNAs with known protein-coding cancer genes: these are genomically proximal and exhibit elevated expression correlation, even after normalising for proximity. This points to a regulatory link between cancer lncRNAs and protein-coding genes, perhaps through chromatin looping, as described in previous reports for CCAT1 and MYC, for example (Xiang et al. 2014).
One important caveat for all features discussed here is ascertainment bias: almost all lncRNAs discussed have been curated from published, single-gene studies. It is entirely possible that selection of genes for initial studies was highly non-random, and influenced by a number of factors – including high expression, evolutionary conservation and proximity to known cancer genes – that could bias our inference of lncRNA features. This may be the explanation for the observed excess of cancer lncRNAs in divergent configuration to protein-coding genes. However, the general validity of some of the CLC-specific features described here – including high expression and evolutionary conservation – were also observed recent unbiased genome-wide screens (Lanzós et al. 2017; Liu et al. 2017), suggesting that they are genuine.
In principle, lncRNAs may play two distinct roles in cancer: first, as “driver genes” whose mutations are early and positively-selected events in tumorigenesis; or second, as “downstream genes”, which do make a genuine contribution to cancer phenotypes, but through non-genetic alterations in cellular networks resulting from changes in expression, localisation or molecular interactions. These downstream genes may not be expected to display positively-selected mutational patterns, but would be expected to display cancer-specific alterations in expression. A key question for the future is how lncRNAs break down between these two categories, and whether this ratio is distinct from protein-coding genes. Despite the care taken to annotate CLC, it might be expected to contain both types of cancer lncRNA.
Despite the relatively low concordance of CLC genes with PCAWG driver predictions, the results of this study strongly support the value and key cancer role of identified lncRNAs in cancer. Most notably, the existence of a core set of eight lncRNAs with independently-identified mouse orthologues with similar cancer functions, is powerful evidence that these genes are bona fide cancer genes, whose overexpression or silencing can drive tumour formation. To our knowledge this is the most direct demonstration to date of anciently-conserved functions and disease roles for lncRNAs. It will be intriguing to investigate in future whether more human-mouse orthologous lncRNAs have been identified in such screens.
Materials and Methods
Manual Curation
All lncRNAs in lncRNAdb and those listed in Schmitt and Chang’s recent review article were collected (Quek et al. 2015; Schmitt et al. 2016). To these were added all cases from LncRNADisease and Lnc2Cancer databases (Chen et al. 2013; Ning et al. 2016). This primary list formed the basis for a manual literature search: all available publications for each gene were identified by keyword search in Pubmed. If publications were found conforming to at least one of the inclusion criteria (below) and the gene has a GENCODE ID, then it was added to CLC, with appropriate information on the associated cancer, biological activity. For the numerous cases where no GENCODE ID was supplied in the original publication, any available ID, or primer or siRNA sequence was used to identify the gene using the UCSC Genome Browser Blat tool (Kent et al. 2002).
Inclusion criteria sufficient to define a cancer lncRNA and link it to a cancer type were:
Class t: In vitro demonstration that their knockdown and/or overexpression in cultured cancer cells results in changes to cancer-associated phenotypes. These typically include proliferation rates, migration, sensitivity to apoptosis, or anchorage-independent growth.
Class v: In vivo demonstration that their knockdown and/or overexpression in cancer cells alters their tumorigenicity when injected into animal models.
Class g: Germline mutations or variants that predispose humans to cancer.
Class s: Somatic mutations that show evidence for positive selection during tumour formation.
An additional criterion was allowed to link an lncRNA to a cancer type, only if one of the above criteria was already met for another cancer:
Class p: Prognosis: The lncRNAs expression is statistically linked to disease progression or response to treatment.
If an lncRNA was found to promote tumorigenesis or cancer phenotype, it was defined as “oncogene” (og). Conversely those found to inhibit such phenotypes were defined as “tumour suppressor” (tsg). Several lncRNAs were found to have both activities recorded in different cancer types, and were given both labels (og/tsg). For every lncRNA-cancer association, a single representative publication is recorded. Finally, it is important to note that no lncRNAs were included based on evidence from previous driver gene discovery studies of the types represented by OncodriveFML, ExInAtor, ncdDetect or others described in PCAWG (Mularoni et al. 2016; Lanzós et al. 2017; Juul et al. 2017) (PCAWG Consortium, Manuscript In Preparation).
CLC set at this stage relies on GENCODE v24 annotation, and therefore all CLC genes have a GENCODE v24 ID assigned. However, data relative to GENCODE v24 was not available for all types of data used in this study (ie. all data relative to PCAWG is based on GENCODE v19). Thus, for some analysis only genes also present in GENCODE v19 could be used (specified in the corresponding methods section) and the total number of genes analysed in these cases is slightly lower (107 instead of 122 CLC genes and 13503 instead of 15827 nonCLC).
LncRNA and protein-coding driver prediction analysis
LncRNA and protein-coding predictions for ExInAtor and the rest of PCAWG methods, as well as the combined list of drivers, were extracted from the consortium database (PCAWG Consortium, Manuscript In Preparation). Parameters and details about each individual methods and the combined list of drivers can be found on the main PCAWG driver publication (PCAWG Consortium, Manuscript In Preparation) and false discovery rate correction was applied on each individual cancer type for each individual method in order to define candidates (q-value cutoff of 0.1). To calculate sensitivity (percentage of true positives that are predicted as candidates) and precision (percentage of predicted candidates that are true positives) for lncRNA and protein-coding predictions we used the CLC and CGC (COSMIC v78, downloaded Oct, 3, 2016) sets, respectively. To assess the statistical significance of sensitivity rates, we used Fisher’s exact test.
Feature Identification
We compiled several quantitative and qualitative traits of GENCODE lncRNAs and used them to compare CLC genes to the rest of lncRNAs (referred to as “nonCLC”). Analysis of quantitative traits were performed using Wilcoxon test while qualitative traits were tested using Fisher’ exact test. These methods principally refer to Figure 4.
Sequence / gene properties
Exonic positions of each gene were defined as the projection of the union of exons from all its transcripts. Introns were defined as all remaining non-exonic nucleotides within the gene span. Repeats coverage refers to the percent of exonic nucleotides of a given gene overlapping repeats and low complexity DNA sequence regions obtained from RepeatMasker data housed in the UCSC Genome Browser (Tyner et al. 2017). Exonic content refers to the fraction of total gene span covered by exons. For this section we used GENCODE v19.
Evolutionary conservation
Two types of PhastCons conservation data were used: base-level scores and conserved elements. These data for different multispecies alignments (GRCh38/hg38) were downloaded from UCSC genome browser (Tyner et al. 2017). Mean scores and percent overlap by elements were calculated for exons and promoter regions (GENCODE v24). Promoters were defined as the 200nt region centred on the annotated gene start.
Expression
We used polyA+ RNA-seq data from 10 human cell lines produced by ENCODE (Djebali et al. 2012; ENCODE Project Consortium et al. 2012) and from various human tissues by the Illumina Human Body Map Project (HBM) (www.illumina.com; ArrayExpress ID: E-MTAB-513) (GENCODE v19).
Cancer SNPs
On October, 4, 2016, we collected all 2,192 SNPs related to “cancer”, “tumour” and “tumor” terms in the NHGRI-EBI Catalog of published genome-wide association studies (Hindorff et al. 2009; Welter et al. 2014) (https://www.ebi.ac.uk/gwas/home). Then we calculated the closest SNP to each lncRNA TSS using closest function from Bedtools v2.19 (GENCODE v24).
Epigenetically-silenced lncRNAs
We obtained a published list of 203 cancer-associated epigenetically-silenced lncRNA genes present in GENCODE v24 (Yan et al. 2015). These candidates were identified due to DNA methylation alterations in their promoter regions affecting their expression in several cancer types.
Differentially expressed in cancer
We collected a list of 3,533 differentially-expressed lncRNAs in cancer (Yan et al. 2015) (GENCODE v24).
Distance to protein-coding genes and CGC genes
For each lncRNA we calculated the TSS to TSS distance to the closest protein-coding gene (GENCODE v24) or CGC gene (downloaded on October, 3, 2016 from Cosmic database)(Futreal et al. 2004) using closest function from Bedtools v2.19.
Coexpression with closest CGC gene
We took CLC-CGC gene pairs whose TSS-TSS distance was <200kb. RNAseq data from 11 human cell lines from ENCODE was used to assess expression levels (Djebali et al. 2012; ENCODE Project Consortium et al. 2012). ENCODE RNAseq data were obtained from ENCODE Data Coordination Centre (DCC) in September 2016,https://www.encodeproject.org/matrix/?type=Experiment. All data is relative to GENCODE v24. We calculated the expression correlation of gene pairs within each of the 11 cell lines, using the Pearson measure. To control for the effect of proximity, we randomly sampled a subset of non CLC-CGC pairs matching the same TSS-TSS distance distribution as above, and performed the same expression correlation analysis (“nonCLC-CGC”). Finally, to further control for the fact that CLC and CGC are both cancer genes, which may influence their expression correlation, we shuffled CLC-CGC pairs 1000 times, and tested expression correlation for each set (“Shuffled CLC-CGC”). Finally, we tested if correlation coefficients of the three groups were significantly different (Kolmogorov Smirnov test).
Genomic classification
We used an in-house script to classify lncRNA transcripts into different genomic categories based on their orientation and proximity to the closest protein-coding gene (GENCODE v24): a 10 kb distance was used to distinguish “genic’ from “intergenic’ lncRNAs. When transcripts belonging to the same gene had different classifications, we used the category represented by the largest number of transcripts.
Functional enrichment analysis
The list of protein-coding genes (GENCODE v24) that are divergent and closer than 10 kb to CLC genes (or nonCLC) was used for a functional enrichment analysis (20 unique genes in the case of CLC analysis and 1202 in the case of nonCLC analysis). We show data obtained using g:Profiler web server (Reimand et al. 2016), g:GOSt, with default parameters for functional enrichment analysis of protein-coding genes divergent to CLC and using Bonferroni correction for protein-coding gene divergent to nonCLC. For CLC analysis we performed the same test with independent methods: Metascape (http://metascape.org) (Tripathi et al. 2015) and GeneOntoloy (Panther classification system)(Mi et al. 2013; Mi et al. 2017). In both cases similar results were found.
Mouse mutagenesis screen analysis
We extracted the genomic coordinates of transposon common insertion sites (CISs) in Mouse (GRCm38/mm10) http://ccgd-starrlab.oit.umn.edu/about.php (Abbott et al. 2015). This database contains target sites identified by transposon-based forward genetic screens in mice. LiftOver (Kent et al. 2002) was used at default settings to obtain aligned human genome coordinates (hCISs) (GRCh38/hg38). We discarded hCIS regions longer than 1000 nucleotides and intersecting protein coding genes, and intersected the remainders with the genomic coordinates CLC and nonCLC genes.
To correctly assess the statistical enrichment of CLC in hCIS regions, we performed two control analyses:
Randomly repositioning of CLC and nonCLC genes
We randomly relocated CLC/nonCLC genes 10,000 times within the whole genome using the tool shuffle from BedTools v19. In each iteration, we calculated the number of genes that intersected at least one hCIS, and created the distribution of these simulated values. Finally, we calculated an empirical p-value by counting how many of the simulated values were higher or equal than the real values. This analysis was performed separately for CLC and nonCLC genes.
Length-matched sampling
To calculate if the enrichment of hCIS intersecting genes in CLC set is higher and statistically different from nonCLC set, while controlling by gene length, we created 1000 samples of nonCLC genes with the same gene length distribution as CLC genes. Each sample was intersected with hCIS, and the number of intersecting hCISs per Mb of gene length was calculated. To create the nonCLC samples we used the matchDistribution script: https://github.com/julienlag/matchDistribution.
Contributions
RJ conceived the project, performed manual annotation of CLC, and supervised with advice and suggestions of JS-P, RG, LF and CH. JCF and AL performed the feature analysis and evolutionary analysis. AL performed mutation analysis, with help from IM. RJ, AL and JCF drafted the manuscript and prepared the figures and supplementary material. All authors read and approved the final draft.
Acknowledgements
We wish to thanks Julien Lagarde (CRG) for help and advice in bioinformatic analysis. We acknowledge Romina Garrido (CRG), Deborah Re (DKF), Silvia Roesselet (DKF) and Marianne Zahn (Inselspital) for administrative support. We thank Ivo Buchhalter (DKFZ) and Sandra Koser (DKFZ) for preprocessing the SNV and expression data for the integrated analysis. Iñigo Martincorena (Sanger Institute) kindly provided the script for analysing driver prediction sensitivity. A.L. is supported by pre-doctoral fellowship FPU14/03371. This research was supported by the NCCR RNA & Disease funded by the Swiss National Science Foundation.