Abstract
In recent years, circular RNAs (circRNAs) have become an attractive new field of research in RNA biology and disease. Consequently, numerous studies addressing circRNA biogenesis and function have been published. Initially, circRNAs were described as a subclass of cytoplasmic non-coding RNA; however, a few recent observations have proposed that circRNAs may instead be templates for protein production. The extent to which this is the case is currently debated, and rigorous data analysis and proper experimental setups are therefore instrumental in settling the current controversies. Here, the conventional experiments used for detecting circRNA translation are outlined, and guidelines to distinguish signal from the inherent noise are discussed. While these guidelines are specific to circRNA translation, most also apply to other aspects of non-canonical translation.
Introduction
In recent years, circRNAs have emerged as a fascinating new class of molecules that, owing to their covalently closed structure, exhibit high stability in cells [1,2]. CircRNAs have been identified in almost all studied eukaryotes, and their expression has been shown to be highly tissue specific [3,4]. While the landscape and expression profiles of circRNAs have been examined in detail, the functional relevance of circRNAs is still largely unsettled. Initially, spurred by the discovery and characterization of ciRS-7/cdr1as [5–7], many circRNAs were proposed as putative miRNA regulators despite stoichiometric challenges [8,9] and the absence of miRNA response element enrichment in circRNA sequences [10,11]. Recently, a subset of circRNAs was reported to encode circRNA-specific peptides [12–14]. In most cases, the circRNAs share the translation initiation site with the corresponding host gene, and thus the resulting peptides are predicted to resemble the very N-terminal region of the mRNA-derived protein. To identify and study circRNA-derived translation, different approaches have been used, including ectopic circRNA expression, polysome gradients, ribosome footprinting (RiboSeq), and mass-spec analysis. Here, the common pitfalls of these approaches are discussed along with the requirements for claiming translation.
Results
Ectopic expression
The most convenient approach to study circRNA translation is to devise an overexpression vector, transfect the cell line of interest, and detect protein production using western blot analysis. Even without commercially available antibodies, the overexpression vector can include a tag immediately upstream of the circRNA-specific stop codon to specifically detect the circRNA-derived proteins. Importantly, however, all experiments contain unforeseen biases, aberrant outputs, off-targets, and technical and biological variation, collectively referred to here as noise. For overexpression-based setups in particular, the intended vector-derived output (and its cellular consequences) is presumably only a fraction of the noise-derived effects, which is why proper control experiments are essential. Control experiments are required to disentangle signal from noise and to determine the isolated effects of the intended perturbation. For circRNA overexpression, the control experiment is typically an empty vector, and consequently everything produced from the circRNA expression vector is categorized as signal and as circRNA-derived. Unfortunately, to our knowledge, no vector-based overexpression system is sufficiently clean to dismiss a contribution from linear artefacts. Indeed, we and others have shown [11,15] that upon overexpression, the specific sequence produced by back-splicing also occurs in linear concatemers, originating either from so-called ‘rolling circle’ plasmid transcription or from non-canonical trans-splicing (Fig. 1A). Here, all features assumed to be circRNA specific, namely the unique back-splicing region, are reconstructed in a linear context, and similarly, any open reading frames (ORFs) defined as circRNA-specific are now contained in a capped and poly-adenylated mRNA transcript. The problem is not the vector design used for overexpression; it is the poorly devised control experiment.
As previously suggested [11,15], an improved alternative is to use control vectors that contain the circRNA exon(s) of interest but lack the flanking regions required for circRNA biogenesis (Fig. 1B), in order to unequivocally detect the circRNA-derived proteins. Here, aberrant linear exon repeats are likely unaffected by the impaired back-splicing, and thus the observed difference between the circRNA expression vector and the control is now attributable to the circRNA itself.
Polysome profiling, ribotag and RiboSeq technologies
A classical approach to characterizing the translatome is to purify ribosomes and quantify the associated mRNA. One embodiment of this approach is polysome profiling by sucrose gradient fractionation, whereby polysomes are separated from ribosomal subunits and monosomes, and the fractional co-elution of translated mRNA species serves as a proxy for ribosome occupancy on ORFs and thus for the level of translation [16]. Another embodiment is the ribotag technique, where a ribosomal subunit protein, typically RPL22, is tagged, allowing immunoprecipitation of the full ribosome. Here, clever genetic engineering and tissue-specific Cre-recombinase expression allow for the interrogation of cell-type-specific translation in mouse models [17]. While polysome gradients and ribotag methods greatly enrich for translated RNA species, they do not provide a clear demarcation between the coding and non-coding landscape within cells; some transcripts are subjected to low-level translation while others may associate with the ribosome unspecifically or indirectly (the noise). When using polysome gradients or ribotag qualitatively for circRNA translation, it is imperative to pursue a mutational approach whereby essential elements, such as the AUG start codon, are disrupted to allow a signal-to-noise estimate. With current genome editing technologies, it has become possible to engineer such mutations endogenously, although successful identification of homozygous mutants is laborious. Moreover, in most cases, the circRNA and host mRNA ORFs overlap, which precludes a circRNA-specific mutational analysis. Consequently, the urge to utilize vector-based expression for these analyses may prevail, although, again, this requires the ability to separate bona fide circRNAs from exon concatemers (as discussed in the previous section) using either rigorous RNase R treatment or northern blotting.
In any case, the obtained output should be compared with a translation-incompetent mutant to demonstrate that the ribosome association is indeed coupled to translation.
More recently, high-throughput ribosome footprinting, coined RiboSeq analysis, has emerged as a powerful technique to discover ribosome-bound open reading frames globally and to quantify translation efficiencies [18,19]. In contrast to polysome gradient and ribotag approaches, RiboSeq only captures ribosome-protected fragments, and thus, to demarcate between circRNA and overlapping mRNA transcripts, only fragments spanning the back-splice junction are unequivocally derived from circRNA. Like most other experiments, RiboSeq has inherent noise, and inability or failure to estimate signal:noise ratios is potentially a severe problem, particularly when conducting high-throughput data analysis. A hallmark of active translation is the triplet periodicity, or phasing, of reads across the ORF. However, capturing the exact phasing of 25-35 nucleotide (nt) RiboSeq reads across the confined circRNA-specific back-splice junction (BSJ) has obvious limitations, which reduce signal and power significantly. Moreover, attempts to enhance signal by combining the signal from multiple circRNAs have failed to detect RiboSeq phasing [11]. Of note, combining putative ORFs across the back-splice junction of circRNAs presupposes the frame of translation, and arguably, this may not necessarily adhere to the coding frame of the host gene. The alternative approach is then simply to count all reads mapping across the back-splice junction, irrespective of phasing, as a proxy for translation. Here, albeit with generally low coverage compared to genuine ORFs, thousands of reads map uniquely to circRNA back-splice junctions (Fig. 2A). Taken at face value, this suggests that most circRNAs undergo translation, but at a low rate. The difficult question is then: how many of these reads derive from translating ribosomes and how many are background noise?
In an attempt to answer that question, one may assume that the signal:noise ratio (the dataset quality) correlates with the ability to capture phasing across bona fide coding sequences (CDS). Evidently, with this approach, RiboSeq datasets vary dramatically in quality, as exemplified by the P-site distribution across β-ACTIN in high-quality and low-quality datasets, respectively (Fig. 2B-C). Alarmingly, when the same datasets are used to quantify the total number of reads on the β-ACTIN CDS and across the back-splice junctions of circRNAs, there is a clear tendency for low-quality datasets to contain an increased number of BSJ-spanning reads, and vice versa for β-ACTIN (Fig. 2D). In fact, globally, when contrasting the fraction of reads (reads per mapped million, RPMM) mapping on annotated CDSs with dataset quality (i.e. the level of observed phasing), a clear positive correlation is observed. In contrast, the fraction of circRNA-mapping reads shows a negative correlation for all possible read lengths. In most analyses, the counterfactual scenario, also referred to as the negative control, is an important measure to demarcate signal from noise in high-throughput analysis. Thus, if the definition of signal is the mere presence of reads, the counterfactual scenario should be devoid of reads. Here, with small RNAs (miRNAs and snoRNAs) representing a counterfactual scenario, extensive RiboSeq coverage is still observed, but, consistently and similar to circRNAs, the fraction of mi/snoRNA-aligned reads drops subtly with increased RiboSeq quality (Fig. 2E). While circRNAs, miRNAs and snoRNAs could in principle serve as templates for protein production, assuming some level of noise potentially allows detection of all expressed RNA species, and consequently, using detection as a proxy for translation fails to appreciate the entire reason for conducting RiboSeq analysis in the first place.
Therefore, while reads across circRNA back-splice junctions may at first seem to be evidence for circRNA translation, without detectable phasing the parsimonious explanation is likely a noisy dataset.
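The notion of phasing used above as a measure of dataset quality can be made concrete with a minimal sketch. It assumes that P-site positions have already been inferred from the reads with a suitable per-read-length offset, and it simply reports the fraction of P-sites falling in each of the three frames relative to the annotated start codon. The function names and the choice of "fraction in frame 0" as the score are illustrative assumptions, not the scoring procedure used in the analysis described here.

```python
from collections import Counter


def frame_fractions(psite_positions, cds_start):
    """Return the fraction of P-sites in frames 0, 1 and 2.

    psite_positions: 0-based transcript coordinates of inferred P-sites.
    cds_start: 0-based coordinate of the first nucleotide of the start codon.
    """
    counts = Counter((pos - cds_start) % 3 for pos in psite_positions)
    total = sum(counts.values())
    if total == 0:
        return (0.0, 0.0, 0.0)
    return tuple(counts.get(frame, 0) / total for frame in range(3))


def phasing_score(psite_positions, cds_start):
    """Fraction of P-sites in the annotated frame (frame 0).

    Roughly 1/3 is expected without periodicity; values near 1.0
    indicate strong triplet periodicity.
    """
    return frame_fractions(psite_positions, cds_start)[0]
```

With only a handful of P-sites, as is typical across a single back-splice junction, such a fraction is too unstable to interpret, which is the power limitation discussed above.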
Mass-spectrometry
In the RNA research field, mass-spectrometry is typically considered a bulletproof approach to peptide identification. Yet, unlike most RNA-based techniques, mass-spec analysis has a built-in noise estimator to compute error and false discovery rates (FDR), the so-called target-decoy search strategy [20,21]: all identified peptides have an associated score that signifies the degree of match between the observed and the predicted spectrum. To estimate noise, the mass-spec search is performed in parallel on a positive list of peptide sequences (typically the annotated proteome) and a similarly sized list of decoys (often the reversed peptide sequences, to preserve amino-acid composition), i.e. the decoys are directly derived from the positive target list. Then, simply put, using the ratio between targets and decoys for any given score, the probability of a false assignment is deduced, resulting (when using Maxquant) in a posterior error probability (PEP) value. Additionally, by accumulating the number of detected decoys in a ranked list, the overall FDR, i.e. how many false discoveries are expected among the resulting peptides, is available for any particular PEP/score cutoff. While this is a very simplified account of the mass-spec statistics, the essential point is that the final FDR reflects the entire group of identified peptides, and obviously, each individual peptide is not called with equal confidence (PEP). With this in mind, when taking a non-random subsample from the total bulk of peptides, the initial FDR no longer applies. For instance, when extracting the top 5% or the bottom 5% based on PEP values, the signal-to-noise ratio in each subset is dramatically different, and thus the FDR must be re-calculated based on the ratio of positive and decoy peptides in each subsample.
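The need to re-calculate the FDR for a non-random subsample can be illustrated with a small sketch. It assumes each identified peptide is available as a (PEP, is_decoy, group) record and uses the simple decoys/targets estimate, capped at 1; other conventions exist, and this is not necessarily the exact estimator applied by Maxquant or in the re-analysis described here. The function names and the toy records are hypothetical and serve only to show the arithmetic.

```python
def target_decoy_fdr(records, pep_cutoff=1.0):
    """Estimate FDR as decoys / targets among records with PEP <= cutoff.

    records: iterable of (pep, is_decoy, group) tuples.
    Returns None when no target passes the cutoff.
    """
    targets = sum(1 for pep, is_decoy, _ in records
                  if pep <= pep_cutoff and not is_decoy)
    decoys = sum(1 for pep, is_decoy, _ in records
                 if pep <= pep_cutoff and is_decoy)
    if targets == 0:
        return None
    return min(1.0, decoys / targets)


def fdr_by_group(records, pep_cutoff=1.0):
    """Re-calculate the FDR separately for each group (subsample)."""
    groups = sorted({group for _, _, group in records})
    return {
        group: target_decoy_fdr(
            [r for r in records if r[2] == group], pep_cutoff)
        for group in groups
    }


# Toy records (invented for illustration only): a large, clean group
# and a small group in which decoys are as frequent as in noise.
toy = [(0.001, False, "proteome")] * 8
toy += [(0.2, False, "circRNA"), (0.3, False, "circRNA"),
        (0.25, True, "circRNA")]

bulk_fdr = target_decoy_fdr(toy)   # one decoy against ten targets
per_group = fdr_by_group(toy)      # the same decoy against two targets
```

In this toy case the bulk estimate is dominated by the large group, so it says little about the reliability of the small subset, which is the point made above.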
To exemplify the importance of FDR re-calculation, a re-analysis of the circRNA peptides recently published in Cell by van Heesch et al, 2019 [22] was performed. Again, as in the RiboSeq analysis, a counterfactual approach was included that surely should be devoid of signal, here the COVID-19 proteome. Similar to van Heesch et al, Maxquant mass-spec analysis was performed on an extensive dataset from human heart [23]. Here, almost 4 million uniquely assigned peptides were derived from the Uniprot proteome assembly, whereas 38 peptides were found to be circRNA specific, of which 14 span the back-splice junction (Fig. 3A). In addition, Maxquant identifies 79 COVID-19 peptides in total; however, for both circRNA- and COVID-19-derived peptides, the number of matched decoys exceeds the number of positive targets (Fig. 3A). Also, importantly, the distributions of PEP values obtained from circRNAs, COVID-19 and the Uniprot proteome are very different (Fig. 3B). Here, the Uniprot signal diverges greatly from noise when the PEP is small, whereas the distributions of signal and noise look very similar for circRNA and COVID-19. First of all, this shows that circRNA and COVID-19 peptides are a non-random subset of the bulk output, and second, that the signals obtained from circRNAs and COVID-19 are inseparable from noise. Consistently, when re-computing the FDR based on the level of circRNA- and COVID-19-derived decoys, the FDR increases to 0.67 and 0.65, respectively. In fact, not a single possible PEP-value cutoff results in any significant circRNA-derived peptides using FDR < 0.05 (Fig. 3C); however, alarmingly, four distinct COVID-19-specific peptides are called as significant using a PEP < 0.02 cutoff. This, surely, must reflect either insufficient FDR control by the target-decoy strategy (discussed below), or that the detection of only four peptides is very likely by chance, which prompts additional caution when only a few peptides are identified.
Consistently, and emphasizing the need for proper FDR control, three of the six BSJ-spanning peptides identified by van Heesch et al are preceded by stop codons, and thus translation initiation must occur cap-independently on CAU, ACU and UUA, respectively, if the peptides are assumed to be bona fide (Fig. 3D).
Collectively, this shows that without proper data analysis, anything is detectable by mass-spectrometry, and additional care should be taken when conducting these experiments or when encountering them in the published literature.
Conclusion
As outlined here, studying circRNA translation comes with severe pitfalls, and without meticulous data analysis, noise is easily interpreted as evidence for circRNA translation. On the other hand, exerting an unreasonable level of stringency may result in false negatives, potentially failing to disclose game-changing scientific discoveries. While there may not be a golden path towards a suitable level of stringency for each research question, it is critical to transparently estimate the contribution of noise, because there is always noise. For mass-spec, the target-decoy strategy may itself not always be sufficient for signal-noise estimates, as this procedure assumes similar score distributions for decoys and false discoveries, which in itself is controversial [24,25]. Therefore, this strategy may inflate signal and be more prone to false positives than to false negatives, and counterfactual analyses may thus provide important additional measures to ensure sound conclusions.
Evidence of circRNA translation is based on detecting the predicted circRNA-specific output and on showing association between circRNAs and translating ribosomes. As discussed above, the circRNA-specific ORFs may not necessarily adhere to the reading frame of the host gene, and therefore any reading frame across the BSJ is possible in theory. When searching for all possible scenarios, as done previously [22], it is crucial to ascertain the validity of the ORF, i.e. the presence of a canonical (preferably AUG) initiation codon. Here, for van Heesch et al, three of the six identified peptides lack initiation codons (NUG) altogether. Moreover, the cap-less structure of circRNAs implies that cap-independent processes facilitate the translation. Therefore, a coherent account of circRNA translation implies the usage of endogenous IRES elements, an already controversial subject matter not discussed further here.
CircRNAs are typically less abundant than their linear counterparts, and presumably, if circRNA translation does occur, cap-independent initiation may also be less effective than canonical cap-dependent translation. Collectively, circRNAs are, if at all, likely translated at very low levels, or the products are subjected to fast decay, and as a result the derived peptides may be borderline indistinguishable from noise. Therefore, alternative sources of evidence should be considered to support or dismantle the hypothesis. Many highly abundant circRNAs show cross-species conservation, and assuming that the postulated functional output from a circRNA is coupled to its biological relevance, some level of selection pressure to preserve this output should be evident in sequence conservation analyses. Therefore, functional ORFs from conserved circRNA species are likely to show a CDS-like codon-conservation pattern, which may serve as independent evidence.
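The ORF validity check described above can be sketched as follows. The sketch assumes the circRNA exon sequence is given as a single string whose last and first nucleotides are joined at the BSJ, and that the position of the first codon of an observed peptide is known. It walks upstream codon by codon in the peptide's frame, wrapping around the junction, until it meets either an initiation codon or a stop codon. The function name, the coordinate convention and the default of accepting only AUG are illustrative assumptions.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}


def upstream_start(circ_seq, peptide_start, start_codons=("AUG",)):
    """Find the in-frame initiation codon upstream of a peptide.

    circ_seq: circRNA exon sequence (DNA or RNA letters); the last and
        first nucleotides are treated as adjacent (the BSJ).
    peptide_start: 0-based index of the peptide's first codon.
    Returns the 0-based index of the initiation codon, or None if an
    in-frame stop codon (or no start codon at all) is encountered first.
    """
    seq = circ_seq.upper().replace("T", "U")
    n = len(seq)
    pos = peptide_start % n
    # At most n codons are inspected: three laps cover every frame
    # when the circle length is not a multiple of three.
    for _ in range(n):
        codon = "".join(seq[(pos + i) % n] for i in range(3))
        if codon in start_codons:
            return pos
        if codon in STOP_CODONS:
            return None
        pos = (pos - 3) % n
    return None
```

A return value of None corresponds to the situation described above, in which a peptide is preceded by a stop codon without an intervening initiation codon; passing a wider `start_codons` tuple would allow near-cognate (NUG) codons to be considered.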
CircRNA translation is an extremely interesting hypothesis. This would not only disclose an additional proteome layer in most animals, but also shed important light on the functional relevance of potentially thousands of circRNAs. While the thrill of discovering a non-canonical game-changer in gene regulation or proteome complexity is very desirable, it is our scientific duty to uphold a high level of critical data assessment before spending valuable time and money on studying noise and artefacts. Therefore, in our ongoing scrutiny of the functional properties of circRNAs, it would likely benefit the entire circRNA research field to address our future challenges with increased stringency and additional rigor.
Methods
RiboSeq analysis
RiboSeq analysis was based on the procedure and data described in Stagsted et al [11]. Briefly, for phasing estimates and visualization, trimmed RiboSeq reads were mapped specifically to β-ACTIN and GAPDH mRNA or across circRNA BSJs using bowtie (v1.2.3) with no mismatch tolerance (bowtie -S -a -v 0). Only reads with P-sites from −8 to +6 relative to the BSJ and without single-mismatch alignments in the mRNA reference were considered circRNA specific. Then, for each read length, P-site offsets and associated quality scores were determined based on best performance on the reference CDSs (Supplementary Table 1). To assess globally the fraction of reads mapping on CDS and sno/miRNAs, reads were mapped onto hg19 using STAR (v2.7.3a) with default settings, and each read length was extracted and quantified individually using gencode annotations (v28lift37) with featureCounts (v2.0.0, using options --minOverlap 0 --nonOverlap 0). Then, RiboSeq reads mapping specifically on annotated CDS and sno/miRNA regions (not overlapping any CDS) were counted.
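The P-site window criterion can be sketched as below, under stated assumptions. The actual per-read-length offsets are those of Supplementary Table 1 and are not reproduced here, so the `offsets` mapping must be supplied by the user. The coordinate convention (0-based positions on a BSJ-spanning reference, with `bsj_pos` taken as the first nucleotide downstream of the junction and a relative position of 0 allowed) is an assumption, as are the function names.

```python
def psite_relative_to_bsj(read_start, read_length, bsj_pos, offsets):
    """Return the P-site position relative to the BSJ, or None.

    read_start: 0-based position of the read 5' end on the reference.
    read_length: length of the read in nucleotides.
    bsj_pos: 0-based position of the first nucleotide after the junction.
    offsets: dict mapping read length to the P-site offset from the
        5' end; read lengths without an offset are skipped.
    """
    offset = offsets.get(read_length)
    if offset is None:
        return None
    return read_start + offset - bsj_pos


def is_bsj_specific(read_start, read_length, bsj_pos, offsets,
                    window=(-8, 6)):
    """True if the P-site lies within the window around the BSJ."""
    rel = psite_relative_to_bsj(read_start, read_length, bsj_pos, offsets)
    return rel is not None and window[0] <= rel <= window[1]
```

The additional requirement that a read has no single-mismatch alignment in the mRNA reference is an alignment-level filter and is not covered by this sketch.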
Mass-spec analysis
The raw mass-spec datasets were downloaded from PRIDE accession PXD006675. Maxquant (v1.6.2.10) was used for peptide identification using the complete Uniprot proteome (UP000005640, downloaded November 10th, 2020), all ORFs across the 40 RiboSeq-identified circRNAs published by van Heesch et al [22], and the COVID-19 proteome (UP000464024, downloaded November 10th, 2020). The Maxquant evidence files were merged and only unique peptides (targets and decoys) were kept (Supplementary Table 2).
Availability
Data and scripts used in the analysis are available on GitHub: https://github.com/ncrnalab/circRNA_translation
Acknowledgements
I would like to thank Prof. Gunter Meister for commenting on the manuscript, the circRtrain ITN network, and the Novo Nordisk Foundation (NNF16OC0019874) for funding.