ABSTRACT
Genetic variants that disrupt protein-coding DNA are ubiquitous in the human population, with ∼100 such loss-of-function variants per individual. While most loss-of-function variants are rare, a subset have risen to high frequency and occur in a homozygous state in healthy individuals. It is unknown why these common variants are well-tolerated, even though some affect essential genes implicated in Mendelian disease. Here, we combine genomic, proteomic, and biochemical data to demonstrate that many common nonsense variants do not ablate protein production from their host genes. We provide direct evidence for previously proposed mechanisms of gene rescue such as alternative splicing and C-terminal truncation. Furthermore, we identify novel mechanisms of rescue, including alternative translation initiation at non-canonical start codons and stop codon readthrough. Our results suggest a molecular explanation for the mild fitness costs of common nonsense variants, and indicate that translational plasticity plays a prominent role in shaping human genetic diversity.
INTRODUCTION
The discovery of pervasive genetic variation that is predicted to disrupt protein-coding DNA sequences is a surprising finding of recent human genetics studies. This loss-of-function genetic variation can arise from many sources, including single nucleotide variants (SNVs) that introduce stop codons, insertions or deletions that disrupt the reading frame, disruption of splice sites, or large structural variation. While loss-of-function variants are generally found at low allelic frequencies relative to synonymous variants (Figure 1A), many have risen to sufficiently high frequency to be present in a homozygous state. A careful study of these variants estimated that typical individuals carry ∼100 loss-of-function variants, of which ∼20 are present in a homozygous state (Ayadi et al. 2012; MacArthur et al. 2012; de Angelis et al. 2015).
While deleterious genetic variants can rise to high frequencies in specific populations due to demographic effects such as population bottlenecks (MacArthur et al. 2012; Lim et al. 2014; Gilad et al. 2003), the high allele frequencies of some loss-of-function variants within diverse and large human populations suggest that these variants are subject to relatively weak purifying selection. In some cases, gene inactivation may confer a fitness benefit for individuals or the species, as has been posited for CASP12 (MacArthur et al. 2012; Xue et al. 2006; Sulem et al. 2015; Liu and Lin 2015), ACTN3 (Lappalainen et al. 2013; MacArthur et al. 2007), and ERAP2 (Ingolia et al. 2009; Andrés et al. 2010; Battle et al. 2015), and positive or balancing selection may drive the inactive allele to high frequency. However, the rarity of most loss-of-function variants relative to synonymous variants suggests that beneficial loss-of-function variants are exceptions rather than the norm.
Consistent with the apparently mild purifying selection acting on common loss-of-function variants, a recent study of five European populations reported that no homozygous loss-of-function variants within these populations were associated with detectable phenotypic deviance (Battle et al. 2015; Kaiser et al. 2015). This report is consistent with expectations based on a gene disruption screen in zebrafish, which found that only 6% of randomly induced nonsense or splice site mutations yielded an embryonic phenotype (International HapMap 3 Consortium et al. 2010; Kettleborough et al. 2013; 1000 Genomes Project Consortium et al. 2012). Knockout studies in mouse models yielded much higher estimates of mouse embryonic lethality (MacArthur et al. 2012; Ayadi et al. 2012; Kukurba et al. 2014; de Angelis et al. 2015; Rivas et al. 2015). However, it is difficult to directly compare these studies, particularly because gene inactivation in the murine models was frequently achieved by unambiguously disrupting the coding DNA via insertion of a polyadenylation site within the gene body (versus the more modest genetic changes induced by N-ethyl-N-nitrosourea (ENU) mutagenesis in the zebrafish models).
The apparently mild fitness consequences of common loss-of-function variants can be explained in two different ways, depending upon whether these variants affect non-essential or essential genes. First, genetic redundancy may render particular genes dispensable, such that they can be inactivated without significant fitness costs. Loss-of-function variants are enriched in genes with more paralogs than other genes, most notably the olfactory receptors, as well as genes that are relatively poorly conserved (MacArthur et al. 2012; Kaiser et al. 2015; Gilad et al. 2003). Second, loss-of-function variants may disrupt coding DNA yet not completely disable their (essential) host genes’ functions. Consistent with this hypothesis, previous studies noted that many loss-of-function variants do not affect all isoforms of their parent genes, and furthermore that loss-of-function variants are enriched near the 3’ ends of coding sequences (Liu and Lin 2015; MacArthur et al. 2012; Sulem et al. 2015). These observations suggest that some loss-of-function variants may not completely destroy the coding potential of their host genes, potentially explaining why those variants are well-tolerated. However, this hypothesis—that many loss-of-function variants do not ablate their host genes’ functions—has not been experimentally tested.
Here, we systematically determined the consequences of common nonsense variants for mRNA translation and protein production using a combination of genome-wide data and directed biochemical experiments. We report that many nonsense variants that occur in a homozygous state do not ablate protein production from their host genes. Protein production from genes containing these nonsense variants is enabled by diverse mechanisms during mRNA transcription, splicing, and translation. Together, our data indicate that plasticity in RNA processing and translation mitigates the functional consequences of many nonsense variants, allowing these otherwise deleterious variants to rise to high frequencies in the human population.
RESULTS
Many common nonsense variants do not disrupt protein production
We set out to test the hypothesis that many common loss-of-function variants do not ablate protein production from their host genes, thereby explaining the apparently mild phenotypic consequences of these variants even when found in a homozygous state. To test this hypothesis, we took advantage of several recently published datasets representing successive stages of gene expression. These datasets consist of genome-wide measurements of mRNA levels by RNA-seq (MacArthur et al. 2012; Lappalainen et al. 2013; Sulem et al. 2015), mRNA:ribosome association by ribosome profiling (MacArthur et al. 2012; Ingolia et al. 2009; Sulem et al. 2015; Battle et al. 2015), and protein abundance by quantitative mass spectrometry (Gilad et al. 2003; Battle et al. 2015) in lymphoblastoid cell lines (LCLs) that were established by the HapMap and 1000 Genomes Projects (MacArthur et al. 2012; International HapMap 3 Consortium et al. 2010; 1000 Genomes Project Consortium et al. 2012). For each dataset, we used published sample genotypes to identify single nucleotide variants (SNVs) that induced synonymous or nonsense codon changes with respect to the reference genome in one or more RefSeq coding transcripts.
We specifically focused our analysis on common nonsense variants that were present in a homozygous state and thereby predicted to induce complete gene knockouts. We focused on nonsense variants rather than the broader spectrum of loss-of-function variation because nonsense variants can be identified in a straightforward manner from both DNA and RNA sequencing, their impact upon the mRNA is unambiguous, they are relevant to many Mendelian diseases, and their allelic expression can be reliably quantified from RNA-seq data. Of all nonsense variants that we identified, a total of 194 (RNA-seq), 118 (ribosome profiling), and 106 (mass spectrometry) nonsense variants were present in a homozygous state in the genome of at least one assayed sample in each respective dataset (Figure 1B).
By studying individuals with either two (genotype 0/0; 0 = reference allele) or zero (genotype 1/1; 1 = alternate allele) intact gene copies, we were able to unambiguously determine how each nonsense variant affected protein production. Restricting to those variants, we found that levels of the parent mRNAs containing each nonsense variant were similar in 0/0 and 1/1 individuals (Figure 1C). The similarity of mRNA levels in 0/0 and 1/1 individuals suggested that many of the mRNAs containing the studied nonsense variants escaped from nonsense-mediated decay (NMD). To determine the effect of NMD on these variants, we quantified allele-specific expression (ASE) of the alternate alleles illustrated in Figure 1C in heterozygous (genotype 0/1) individuals. We restricted our analysis to nonsense variants covered by at least 20 RNA-seq reads in at least one heterozygous sample (so that we could accurately measure ASE) and computed the median ASE across all such heterozygous samples for each variant. The 36 nonsense variants meeting this coverage requirement exhibited a median ASE of 45%, only modestly lower than the 50% expected if both alleles were equally transcribed and not affected by NMD. Seven of the 36 variants exhibited a median ASE < 10%, consistent with relatively efficient degradation by NMD (Figure S1A). We conclude that the majority of mRNAs containing the studied nonsense variants escape NMD, although a subset are efficiently degraded, consistent with previous reports (Wang et al. 2008; MacArthur et al. 2012; Kukurba et al. 2014; Rivas et al. 2015).
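The ASE summary described above can be sketched in a few lines. This is a minimal illustration rather than the pipeline used in the study: it assumes that reference and alternate read counts at the variant are already available for each heterozygous sample, and the function names and input layout are hypothetical.

```python
from statistics import median

MIN_READS = 20  # coverage threshold stated in the text


def alt_allele_fraction(ref_reads, alt_reads):
    """Fraction of reads at the variant that support the alternate allele."""
    return alt_reads / (ref_reads + alt_reads)


def median_ase(het_samples, min_reads=MIN_READS):
    """Median alternate-allele fraction across heterozygous (0/1) samples.

    `het_samples` is a list of (ref_reads, alt_reads) tuples (hypothetical
    layout). Samples covered by fewer than `min_reads` reads are skipped;
    None is returned if no sample passes the coverage filter.
    """
    fractions = [
        alt_allele_fraction(ref, alt)
        for ref, alt in het_samples
        if ref + alt >= min_reads
    ]
    return median(fractions) if fractions else None


# Hypothetical read counts for one variant in four heterozygous samples.
example = [(12, 10), (30, 20), (5, 3), (40, 10)]
print(median_ase(example))
```

Under this convention, a value near 0.5 corresponds to equal expression of both alleles, and a value well below 0.5 corresponds to depletion of the nonsense allele, as expected under efficient NMD.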
The observed similarity in levels of 0/0 and 1/1 mRNAs extended to mRNA translation as well. Association between the parent mRNAs and ribosomes, as well as levels of the encoded proteins, was similar between 0/0 and 1/1 individuals (Figure 1D-E). (The much lower coverage of both ribosome profiling and mass spectrometry data relative to RNA-seq data rendered the analysis of allele-specific expression in 0/1 individuals infeasible for these assays of mRNA translation.) We observed no cases where homozygous nonsense variants completely abolished mRNA expression, mRNA:ribosome association, or protein expression, with the caveat that these measurements were necessarily restricted to variants within genes that were detectably expressed in LCLs by RNA-seq, ribosome profiling, or mass spectrometry.
Genotyping errors could potentially explain why we observed normal protein levels from genes containing homozygous nonsense variants. We therefore used RNA-seq to identify and remove incorrectly called genotypes. For each nonsense variant, we restricted our analyses to samples for which the variant was covered by at least ten reads, and removed samples that were annotated as 0/0 or 1/1 but exhibited expression of the alternate or reference alleles, respectively. While we did identify and remove a small number of such variant-sample pairs whose genotypes were inconsistent with the observed allelic expression, removing those inconsistent data points did not affect our conclusion that neither mRNA expression, mRNA:ribosome association, nor protein expression was abolished by the analyzed homozygous nonsense variants (Figure S1B-D). Therefore, genotyping errors do not explain why these nonsense variants do not abolish protein production.
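The genotype consistency filter can be illustrated as follows. This is a sketch under stated assumptions: the function name and tuple layout are hypothetical, and a homozygous call is treated as inconsistent if even a single read supports the other allele, whereas a real analysis might tolerate a small sequencing-error rate.

```python
MIN_READS = 10  # coverage threshold stated in the text


def genotype_supported(genotype, ref_reads, alt_reads, min_reads=MIN_READS):
    """Return True if a variant-sample pair should be kept.

    Assumption: any read supporting the unexpected allele marks a
    homozygous call as inconsistent (a strict, illustrative rule).
    """
    if ref_reads + alt_reads < min_reads:
        return False  # too little RNA-seq coverage to check the call
    if genotype == "0/0":
        return alt_reads == 0  # no alternate-allele expression allowed
    if genotype == "1/1":
        return ref_reads == 0  # no reference-allele expression allowed
    return True  # heterozygous calls are not filtered by this rule


# Hypothetical (genotype, ref_reads, alt_reads) records for one variant.
pairs = [("1/1", 0, 25), ("1/1", 3, 25), ("0/0", 40, 0), ("0/0", 4, 2)]
kept = [p for p in pairs if genotype_supported(*p)]
print(kept)
```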
Permissive RNA processing may enable protein production from genes containing nonsense variants
The extensive regulation that occurs between the initiation of gene transcription and the completion of mRNA translation offers ample opportunities for protein production from genes containing nonsense variants (although gene function may or may not be preserved). Protein can be produced from genes containing nonsense variants by one of two means (Figure 1F). First, a nonsense variant may be isoform-specific, such that at least one coding isoform of the parent gene does not contain the variant. In this case, the nonsense variant may prevent translation of one or more, but not all, isoforms. Second, the stop codon introduced by a nonsense variant may not be sufficient to abolish productive translation. This can occur if the ribosome can initiate translation downstream of the stop codon, if the stop codon is subject to efficient readthrough, or if the nonsense variant is near the normal stop codon.
Sequence analyses reported in previous studies suggest that both alternative splicing and protein truncation may contribute to the robust protein production from genes containing nonsense variants that we observed (Figure 1C-E). Approximately one-third of previously studied nonsense variants are isoform-specific (Schueren et al. 2014; MacArthur et al. 2012; Stiebler et al. 2014; Kaiser et al. 2015; Loughran et al. 2014; Eswarappa et al. 2014); minor coding isoforms are enriched for nonsense variants relative to major isoforms (MacArthur et al. 2012; Liu and Lin 2015; Sulem et al. 2015); and nonsense variants preferentially occur at the 3’ ends of their host CDSs (MacArthur and Tyler-Smith 2010; MacArthur et al. 2012; Sulem et al. 2015). We therefore set out to systematically identify molecular mechanisms that enabled protein production from genes containing nonsense variants. Our general strategy was to combine RNA-seq, ribosome profiling, and SILAC data to identify the mechanisms that enabled protein production from genes containing the common nonsense variants studied in Figure 1, and then computationally predict the likely relevance of each mechanism to lower-frequency nonsense variants using sequence analysis of the ∼10,000 nonsense variants identified by the 1000 Genomes Project.
Nonsense variants are frequently isoform-specific and removed by alternative splicing
We first tested whether nonsense variants are more frequently isoform-specific than their synonymous counterparts. For each nonsense variant within a multi-exon gene, we determined whether the gene contained at least one RefSeq coding transcript for which the alternate allele did not introduce an in-frame stop codon, or if the nonsense variant was contained within alternatively spliced sequence of the mRNA. We similarly measured isoform specificity for synonymous variants, restricted to the subset of synonymous variants lying within genes that contain nonsense variants in order to control for the fact that nonsense variants preferentially occur in specific gene classes (Ingolia et al. 2011; MacArthur et al. 2012; Sulem et al. 2015). We eliminated variants lying within genes encoding olfactory receptors from this and subsequent analyses, as this large gene family has been subject to frequent and likely largely neutral gene inactivation during human evolution (Ayadi et al. 2012; Gilad et al. 2003; de Angelis et al. 2015).
Approximately 28% of synonymous variants in multi-exon genes are isoform-specific. In order to test for a relationship between isoform specificity and the prevalence of variants, we classified each variant as rare, low-frequency, or common, corresponding to allele frequencies of ≤ 0.5%, 0.5-5%, and ≥5% in the 1000 Genomes data. We observed no relationship between isoform specificity of synonymous variants and their frequency in the human population, consistent with our expectation that most synonymous codon substitutions are functionally neutral. In contrast, approximately 31% of rare nonsense variants are isoform-specific, rising to 39% and 43% of low-frequency and common nonsense variants, suggesting that isoform-specific nonsense variants are subject to relaxed purifying selection (Figure 2A). When we considered all synonymous variants rather than restricting to those within genes containing nonsense variants, we observed slightly greater disparities between the isoform specificity of synonymous versus nonsense variants (Figure S2A). Our estimates of isoform specificity for nonsense variants are consistent with a previous estimate of 32.3% based on a smaller cohort of 185 individuals (Nyegaard et al. 2015; MacArthur et al. 2012).
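The frequency binning and the per-class tabulation of isoform specificity can be sketched as below. The allele-frequency thresholds follow the text; everything else is an illustrative assumption, including the function names and the simplified input, in which each nonsense variant carries one boolean per coding transcript of its gene. The sketch covers only the first criterion for isoform specificity (at least one coding transcript without an in-frame stop) and not overlap with alternatively spliced sequence.

```python
from collections import defaultdict


def frequency_class(allele_frequency):
    """Bin an allele frequency given as a fraction (0.02 means 2%)."""
    if allele_frequency <= 0.005:
        return "rare"
    if allele_frequency < 0.05:
        return "low-frequency"
    return "common"


def is_isoform_specific(stop_in_transcript):
    """`stop_in_transcript` holds one boolean per coding transcript:
    True if the alternate allele introduces an in-frame stop there.
    The variant is isoform-specific if any transcript escapes the stop."""
    return not all(stop_in_transcript)


def isoform_specific_fraction(variants):
    """`variants` is an iterable of (allele_frequency, stop_in_transcript).
    Returns the isoform-specific fraction for each frequency class seen."""
    counts = defaultdict(lambda: [0, 0])  # class -> [isoform-specific, total]
    for allele_frequency, flags in variants:
        cls = frequency_class(allele_frequency)
        counts[cls][0] += is_isoform_specific(flags)
        counts[cls][1] += 1
    return {cls: spec / total for cls, (spec, total) in counts.items()}
```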
The functional consequences of an isoform-specific variant depend upon the frequency with which those isoforms are produced as mature mRNAs. To test whether nonsense variants preferentially fall within isoforms that are used infrequently, we quantified the inclusion of alternatively spliced sequence containing synonymous or nonsense variants across sixteen human tissues. Isoform-specific synonymous variants exhibited a median inclusion of ∼90% irrespective of allele frequency, while low-frequency and common isoform-specific nonsense variants exhibited median inclusions of 64% and 53% (Figure 2B, S2B). We conclude that even among the isoform-specific nonsense variants, those variants that are most frequently spliced out of mature mRNA are subject to further relaxed selection.
We next measured the empirical isoform specificity of the specific set of homozygous nonsense variants that we analyzed in LCLs (Figure 1C-E). For each SNV within an expressed gene, we quantified the expected versus observed number of ribosome footprints overlapping the SNV as a simple empirical measure of isoform specificity (e.g., an isoform-specific SNV is expected to exhibit an observed:expected ratio less than 1). We estimated the expected footprint coverage from the total number of footprints aligned to the parent transcript of each SNV and averaged this footprint coverage over all samples with a 0/0 genotype for that SNV. (We restricted the coverage analysis to 0/0 samples in order to measure position-specific mRNA:ribosome association in the absence of premature stop codons.) The resulting distributions were similar for most synonymous and nonsense variants, consistent with our genome-wide prediction that the majority of variants are not isoform-specific. Nonetheless, a subset of nonsense variants were strongly depleted for ribosome footprint coverage beyond the background distribution estimated from synonymous variants, indicating that they are frequently excluded from mature mRNAs engaged with ribosomes in LCLs (Figure 2C). Low-frequency and common nonsense variants exhibited greater depletion for footprint coverage relative to rare nonsense variants, again consistent with relaxed purifying selection (Figure 2C).
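One way to compute such an observed:expected footprint ratio is sketched below. The text does not specify how the expectation was derived from transcript-level counts, so this example assumes that footprints are spread uniformly along the transcript and that the ratio is taken between sample-averaged observed and expected counts. The function names, the 30-nucleotide footprint length, and the input layout are all hypothetical.

```python
def expected_overlap(total_footprints, transcript_length, footprint_length=30):
    """Expected number of footprints overlapping a single position if
    footprints were placed uniformly along the transcript (an assumption)."""
    return total_footprints * footprint_length / transcript_length


def observed_expected_ratio(samples, transcript_length, footprint_length=30):
    """Observed:expected footprint coverage at an SNV.

    `samples` lists (footprints_overlapping_snv, total_transcript_footprints)
    for each 0/0 sample. Returns None if there are no samples or if the
    transcript has no footprints in any of them.
    """
    if not samples:
        return None
    mean_observed = sum(obs for obs, _ in samples) / len(samples)
    mean_expected = sum(
        expected_overlap(total, transcript_length, footprint_length)
        for _, total in samples
    ) / len(samples)
    if mean_expected == 0:
        return None
    return mean_observed / mean_expected


# Hypothetical counts for one SNV in two 0/0 samples on a 3,000-nt transcript.
print(observed_expected_ratio([(3, 1000), (1, 1000)], transcript_length=3000))
```

A ratio well below 1 would flag an SNV whose surrounding sequence is under-represented among ribosome-engaged mRNAs, as expected for an isoform-specific variant.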
Each of the mechanisms proposed in Figure 1F contributed to the isoform specificity of nonsense variants. Specific examples include nonsense variants within a frame-preserving cassette exon of LGALS8, an alternate promoter of MOB3C, and alternatively spliced sequence of TMEM218 that is either coding or non-coding, depending upon which start codon is used (Figure 2D-I, Figure S2D-G). SILAC data were available for LGALS8 and MOB3C, allowing us to confirm that protein levels were sustained in LCLs with homozygous copies of the corresponding nonsense variants. In both cases, protein levels were higher for 0/1 and 1/1 samples relative to 0/0 samples, perhaps due to compensatory translational up-regulation in individuals carrying the alternate alleles or differential stability of the relevant protein isoforms (Figure 2F,I).
As alternative splicing is frequently tissue-specific (Schueren et al. 2014; Wang et al. 2008; Stiebler et al. 2014; Loughran et al. 2014; Eswarappa et al. 2014), the degree to which isoform specificity mitigates the consequences of a nonsense variant may differ substantially between different cell types. For example, the LGALS8 cassette exon is highly tissue-specific, with inclusion ranging from 5% in colon to 87% in testes of individuals with genotype 0/0 (inclusion is ∼55% in LCLs with genotype 0/0; Figure S2E). Therefore, both the levels of LGALS8 protein in 1/1 individuals and the consequences of specifically ablating the inclusion isoform may differ between cell types.
Nonsense variants may be subject to stop codon readthrough
Genes containing nonsense variants within constitutively included sequence cannot produce mRNAs lacking the variants, but can potentially produce protein nonetheless through permissive mRNA translation (Figure 1F). We therefore searched for examples of stop codon readthrough, wherein the translating ribosome does not efficiently terminate at the stop codon, but rather decodes the stop codon as a sense codon and continues elongating. Nonsense variants that are subject to readthrough may suppress levels of the encoded protein due to imperfectly efficient readthrough, but will induce only a single amino acid change at the nonsense variant.
To identify such examples of readthrough, we compared ribosome footprint coverage of genes containing nonsense variants in samples with genotypes 0/0 or 1/1 for the variants. Of the 118 nonsense variants present as 1/1 in at least one sample within the Yoruba cohort (Figure 1B), 15 variants had both 0/0 and 1/1 samples available and lay within genes with sufficient footprint coverage in LCLs to identify potential readthrough (median coverage of relevant transcripts of ∼1 footprint per million mapped reads). Of those 15, two nonsense variants within PVRIG and SLFN13 exhibited highly similar patterns of mRNA:ribosome association in 0/0, 0/1, and 1/1 samples, including in the immediate vicinity of the premature termination codon (Figure 3, S3). This pattern of ribosome footprints is consistent with stop codon readthrough, although it could potentially arise from translation initiation immediately downstream of the nonsense variant as well.
As only a few examples of stop codon readthrough have been identified in mammals (Kaiser et al. 2015; Schueren et al. 2014; Stiebler et al. 2014; Loughran et al. 2014; Eswarappa et al. 2014), we tested whether our analysis of SLFN13 or PVRIG was confounded by mis-mapping of ribosome footprints from homologous transcribed genomic loci. The first two coding exons of PVRIG, which contain the nonsense variant, have 96.8% homology to the transcribed pseudogene PVRIG2P. However, as the density of mismatches between PVRIG and PVRIG2P is high enough to uniquely map ribosome profiling reads and PVRIG2P exhibits no ribosome footprint coverage, we conclude that the ribosome association illustrated in Figure 3A likely arises from the PVRIG mRNA and not a homologous locus. Similarly, while the first coding exon of SLFN13, which contains the nonsense variant, has 82.9% homology to the first coding exon of the expressed gene SLFN11, the density of mismatches is sufficient to uniquely map ribosome footprints. Finally, both RNA-seq and ribosome profiling reads mapping to the PVRIG and SLFN13 nonsense variants supported 100% expression of the alternate allele in 1/1 samples, indicating that genotyping errors are unlikely. We conclude that stop codon readthrough likely enables translation of some mRNAs containing nonsense variants, although we were unable to estimate the fraction of nonsense variants that might be so rescued given the infeasibility of predicting readthrough from mRNA sequence alone.
Nonsense variants frequently induce N- or C-terminal protein truncation
In addition to stop codon readthrough, stable expression of a truncated protein might enable protein production from a nonsense variant-containing mRNA. Protein truncation can occur at the N or C terminus through different mechanisms. An N-terminally truncated protein results from translation initiation at a start site downstream of the nonsense variant. In contrast, C-terminal truncation results if the nonsense variant is sufficiently close to the normal stop codon and the host transcript is not efficiently degraded via NMD.
We took a sequence-based approach to identify nonsense variants that might induce N- or C-terminal truncation. Simply plotting the relative positions of nonsense variants within their host CDSs revealed that these variants are enriched at the 3’ ends of their host CDSs and otherwise uniformly distributed, while synonymous variants exhibited no peak at the 3’ end (Figure 4A). This pattern is consistent with previous studies of different cohorts (Kircher et al. 2014; MacArthur et al. 2012; 2014; Sulem et al. 2015) and suggests that C-terminal truncation occurs frequently. We did not observe any enrichment for nonsense variants at the 5’ ends of their host CDSs, suggesting that start codon mis-annotation—which can cause a variant within a 5’ untranslated region to be incorrectly labeled as a nonsense variant (1000 Genomes Project Consortium et al. 2012; MacArthur and Tyler-Smith 2010; Sulem et al. 2015)—is not a frequent confounding factor in our analyses.
To test whether N-terminal truncation via translation initiation at downstream start sites might also occur, we restricted to SNVs lying within the first 10% of their host CDSs and tested whether those variants were followed by a downstream methionine within 50 amino acids. Approximately 52% of such synonymous variants had a downstream methionine within this distance. In contrast, both rare and common nonsense variants exhibited statistically significant enrichment for downstream methionines, with 57% and 85% of such variants categorized as potential candidates for downstream initiation (Figure 4B). When we performed the same analysis but restricted to variants within the middle of the CDS—where translation initiation would produce a severely truncated protein—as a control, we observed no statistically significant differences in occurrence of downstream methionines for synonymous versus nonsense variants (Figure 4C). Finally, we tested whether nonsense variants that induced C-terminal truncation might be subject to relaxed selection. Approximately 10% of synonymous variants fell within the last 10% of their host CDSs, independent of allele frequency. In contrast, 13% of rare nonsense variants fell within this region, increasing to 21% for common nonsense variants (Figure 4D). Therefore, we conclude that the positional distribution of nonsense variants within CDSs is consistent with frequent induction of relatively small N- or C-terminal truncations of the encoded protein.
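The positional criteria used in this analysis can be expressed compactly. The sketch below assumes that the encoded protein sequence and the 0-based codon index of the variant are known; the function names and return labels are hypothetical, while the 10% and 50-amino-acid thresholds are those stated in the text.

```python
def has_downstream_met(protein, codon_index, window=50):
    """True if a methionine occurs within `window` residues downstream of
    the variant codon (0-based index into the protein sequence)."""
    return "M" in protein[codon_index + 1 : codon_index + 1 + window]


def truncation_candidate(protein, codon_index, edge=0.10, window=50):
    """Classify a nonsense variant by its relative position in the CDS."""
    relative_position = codon_index / len(protein)
    if relative_position < edge and has_downstream_met(protein, codon_index, window):
        return "N-terminal truncation candidate"
    if relative_position >= 1 - edge:
        return "C-terminal truncation candidate"
    return "neither"


# Hypothetical 200-residue protein with an internal methionine at index 10.
protein = "M" + "A" * 9 + "M" + "A" * 189
for index in (5, 15, 100, 190):
    print(index, truncation_candidate(protein, index))
```

Note that a sequence-based screen of this kind considers only ATG-encoded methionines, so it would miss initiation at non-canonical start codons.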
Ribosome profiling data similarly indicated that downstream translation initiation contributes to translation of mRNAs containing nonsense variants. We observed likely translation initiation downstream of nonsense variants at both canonical (ATG; CCHCR1) and non-canonical (CTG or GTG; ABHD14B) start codons (Figure 4E-F). Translation initiation at non-canonical start codons has been previously observed in mammalian genomes, although it is uncommon relative to initiation at canonical start codons (Lappalainen et al. 2013; Ingolia et al. 2011). We confirmed that there are no regions in the reference human genome assembly with high homology to the exon containing the ABHD14B nonsense variant, and also observed 100% expression of the alternate allele in the 1/1 sample, indicating that genotyping errors are unlikely. Together, our data provide both sequence-based and direct evidence of translation initiation at canonical and non-canonical start codons downstream of common nonsense variants.
Truncated proteins encoded by nonsense variant-containing mRNAs are produced in a heterologous system
Of the nonsense variants highlighted above (within LGALS8, MOB3C, TMEM218, PVRIG, SLFN13, CCHCR1, and ABHD14B), we could only confirm LGALS8 and MOB3C protein levels due to mass spectrometry’s incomplete coverage of the proteome. We therefore used a reporter system to experimentally confirm the conclusions of our genomic analyses, as well as test whether translation of nonsense variant-containing mRNAs could be recapitulated in a heterologous system (versus the endogenous context in which the genomic assays were performed). We focused on nonsense variants that we predicted to be subject to stop codon readthrough or induce N- or C-terminal protein truncation, as those involve incompletely understood biochemical mechanisms or the production of novel proteins not present in cells lacking the nonsense variant.
We designed a series of reporter constructs containing the reference (sense codon) or variant (nonsense codon) alleles of ten nonsense variants (Table 1). We chose to test variants within ABHD14B, CCHCR1, PVRIG, and SLFN13 (Figure 3–4), as well as a blindly chosen set of nonsense variants that were present as 1/1 in the 1000 Genomes cohort but for which we did not have ribosome profiling data available for 1/1 samples. Each construct carried an N-terminal FLAG tag and a C-terminal HA tag (Figure 5A), permitting us to specifically identify the encoded protein and distinguish stop codon readthrough from N- or C-terminal truncation as follows. Upon translation, the reference allele is expected to produce a full-length protein product with both the N- and C-terminal tags. If the variant allele is subject to stop codon readthrough, then we expect to similarly observe a full-length product carrying both tags. In the case of N- or C-terminal truncation, however, only a partial protein product carrying a HA or FLAG tag, respectively, is expected (Figure 5B).
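The logic for reading out the dual-tag reporter can be summarized as a small decision function. This is only a schematic of the interpretation described above, with hypothetical names and labels. It treats each detected protein product independently and does not model product size, which is also needed in practice to distinguish a full-length product from a fragment.

```python
def interpret_product(flag_detected, ha_detected):
    """Interpret one protein product from a construct carrying an
    N-terminal FLAG tag and a C-terminal HA tag."""
    if flag_detected and ha_detected:
        # Expected for the reference allele; for the variant allele this
        # pattern is consistent with stop codon readthrough.
        return "full-length"
    if ha_detected:
        # C-terminal tag only: translation began downstream of the stop.
        return "N-terminal truncation"
    if flag_detected:
        # N-terminal tag only: translation ended at the premature stop.
        return "C-terminal truncation"
    return "none detected"


for tags in [(True, True), (False, True), (True, False), (False, False)]:
    print(tags, interpret_product(*tags))
```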
We introduced each construct containing the reference or variant allele of each candidate gene individually into HEK293 cells via transient transfection and assayed their protein products 48 hours post-transfection by detecting the FLAG and HA tags. Constructs containing the reference alleles of CCHCR1, ABHD14B, NDUFV3, and TRIM38 all produced a protein product of the expected size carrying both N- and C-terminal tags (Figure 5C, Figure S5A). The other reference constructs either produced no protein or protein products that were not of the expected size, and so were excluded from further analysis.
Constructs containing either of the two nonsense variants of CCHCR1 produced an N-terminally truncated protein fragment indicative of initiation downstream of the premature termination codon (Figure 5D, Figure S5B). This fragment was produced from the construct containing the reference allele as well, suggesting that CCHCR1 is subject to alternative translation initiation even in the absence of a nonsense genetic variant. The construct containing the nonsense variant of ABHD14B did not make a detectable protein in HEK293 cells, despite our expectation given uninterrupted ribosome footprints throughout the open reading frame in LCLs, suggesting that an N-terminally truncated protein is likely translated but unstable. Constructs containing the NDUFV3 and TRIM38 nonsense variants produced C-terminally truncated protein fragments that were abundant despite their small size relative to the full-length proteins. These variants in NDUFV3 and TRIM38 were expressed in RNA-seq from 0/1 samples at allelic levels of 16% and 15% (Table 1), indicative of incomplete degradation by NMD that could potentially enable these short proteins to be similarly produced from their endogenous loci.
In total, we detected stable expression of N- and C-terminally truncated protein products from constructs encoding four of five unambiguously tested nonsense variants, supporting our hypothesis that many nonsense variants do not ablate protein production from their host genes.
DISCUSSION
Our data demonstrate that many common nonsense variants have only modest impacts upon the levels of total protein produced from their seemingly disabled parent genes. Translation of mRNAs transcribed from genes containing nonsense variants arises from diverse mechanisms during the gene expression process, including transcription start and stop site choice, alternative splicing, translation start site choice, downstream translation initiation, and stop codon readthrough. Given the severe fitness costs of many complete gene knockouts in murine models (Battle et al. 2015; Ayadi et al. 2012; de Angelis et al. 2015), our data therefore suggest that permissive RNA processing and translation in human cells facilitates the accumulation of otherwise deleterious genetic variation in the human population.
What fraction of nonsense variants disable gene function? We were able to directly answer this question for common variants by relying on individuals that are homozygous for the variants of interest. However, a direct answer is inaccessible for rare variants, which almost never occur in a homozygous state. (The low coverage of ribosome profiling and SILAC data prevents reliable analysis of allele-specific translation or protein levels in heterozygotes.) We therefore used sequence analyses to estimate that 43-49% of rare and low-frequency variants identified by the 1000 Genomes Project are isoform-specific or may induce N- or C-terminal truncation (Table 2). These predictions do not take into account the incompleteness of current isoform annotations, the possibility of initiation at non-canonical start codons, or the possibility of stop codon readthrough, and so may be underestimates. We therefore speculate that permissive RNA processing frequently rescues protein production from genes containing rare nonsense variants, although at lower rates than their common counterparts.
It is important to note that maintenance of protein production may not be equivalent to rescue of gene function in many cases. Different isoforms of a gene may encode proteins with distinct or even opposing functions, so that an isoform-specific nonsense variant may have unpredictable consequences. Even modest N- or C-terminal truncation could alter or abolish protein function by removing peptide signals that direct subcellular localization, prenylation, or other activities or modifications, or induce unexpected gain of function by placing a normally internal peptide signal at the N or C terminus. For example, a recent study identified a novel C-terminal truncating variant in CD164 that segregated with nonsyndromic hearing impairment in an affected family. Only the last six amino acids of CD164 were removed, but without the encoded canonical sorting motif, the truncated protein was trapped on the cell membrane (Battle et al. 2015; Nyegaard et al. 2015).
The most unexpected finding of our study was the prominent role played by mRNA translational plasticity in enabling protein production from genes containing non-pathogenic nonsense variants. The regulation of mRNA translation initiation and termination is incompletely understood, yet important for interpreting genetic data. For example, stop codon readthrough is thought to be relatively rare in mammals, with only a few known examples (Ong et al. 2002; Schueren et al. 2014; Stiebler et al. 2014; Loughran et al. 2014; Eswarappa et al. 2014). However, our data suggest that readthrough enables translation of at least two mRNAs containing common nonsense variants. Our data furthermore indicate that alternative translation initiation likely enables translation of many more such mRNAs.
Together, our data demonstrate that permissive RNA processing has the potential to convert loss-of-function variants from genetic nulls into hypomorphic, silent, or even neomorphic alleles. Our results may therefore help to explain recent observations that some apparently healthy individuals carry homozygous putative loss-of-function variants within genes implicated in Mendelian disease (Battle et al. 2015; Kaiser et al. 2015). An improved mechanistic understanding of RNA processing and translation is essential for ongoing attempts to predict the pathogenicity of genetic variation (MacArthur et al. 2012; Kircher et al. 2014; Sulem et al. 2015; MacArthur et al. 2014).
METHODS
Accession codes. FASTQ files were downloaded for RNA-seq data from LCLs from EBI’s ArrayExpress under accession number E-GEUV-1 (Katz et al. 2010; Lappalainen et al. 2013), for ribosome profiling data from LCLs from the NCBI Gene Expression Omnibus under accession number GSE61742 (Dvinge et al. 2014; Battle et al. 2015), and for the Illumina Body Map 2.0 dataset from ArrayExpress under accession number E-MTAB-513. Gene-level quantification of relative protein abundance based on SILAC mass spectrometry was taken from Supplementary Data Table 4 of Battle et al.
Genotypes. Genotypes from the phase 3 variant set called by the 1000 Genomes Project (Dvinge et al. 2014; 1000 Genomes Project Consortium et al. 2012) were downloaded from ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502 (the 20130502 release). We restricted our analysis of ribosome profiling data from Battle et al. to the 62 samples with genotypes called in this variant set, and used the 1000 Genomes genotyping for all analyses. For the RNA-seq data from Lappalainen et al., we used the genotyping information from the original study, available at http://www.ebi.ac.uk/arrayexpress/files/E-GEUV-1/genotypes/ and restricted to the 421 samples that were genotyped by sequencing. For global SNV analyses based on the 1000 Genomes variant set, we restricted to SNVs for which the alternate allele was present in a subsampled fraction of the 1000 Genomes data, in which each of the 26 populations was limited to no more than 90 samples to even out the genotyping coverage among the available populations. For all datasets, we restricted genotyping calls to SNVs.
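The per-population cap described above can be illustrated with a minimal Python sketch. The text does not state how the retained samples were chosen, so the seeded random draw below is an assumption, as are the function name `subsample_populations` and the input mapping of sample ID to population label.

```python
import random
from collections import defaultdict

def subsample_populations(sample_to_pop, max_per_pop=90, seed=0):
    """Return a set of sample IDs with at most max_per_pop samples per population.

    sample_to_pop: dict mapping sample ID -> population label (hypothetical input).
    Random selection within a population is an assumption; the source only
    states the cap of 90 samples per population.
    """
    rng = random.Random(seed)
    by_pop = defaultdict(list)
    for sample, pop in sample_to_pop.items():
        by_pop[pop].append(sample)

    kept = set()
    for pop in sorted(by_pop):
        samples = sorted(by_pop[pop])  # fixed order so the seed gives a reproducible draw
        if len(samples) > max_per_pop:
            samples = rng.sample(samples, max_per_pop)
        kept.update(samples)
    return kept
```

Populations at or below the cap are kept whole; only larger populations are reduced, which evens out their contribution to allele presence calls.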
SNV functional annotation. We used SnpEff (Li and Dewey 2011; Cingolani et al. 2012) to identify variants that affected coding DNA sequences and annotate their functional effects. We ran snpEff v4_1c with the options ‘-v -lof -no-downstream -no-intergenic -no-intron -no-upstream -no-utr -formatEff -download hg19’, and then restricted to SNVs with predicted effects of ‘synonymous_variant’ or ‘stop_gained’ for at least one RefSeq transcript. For all transcript-level analyses, such as computing the amino acid affected by a SNV, we relied upon the AnnotationHub and BSgenome packages within Bioconductor (Huber et al. 2015).
Genome annotations. A merged genome annotation was created by combining the UCSC knownGene annotations (Meyer et al. 2013), Ensembl 71 (Flicek et al. 2013), and isoform annotations from MISO v2.0 (Katz et al. 2010) as previously described (Dvinge et al. 2014). A file of all possible splice junctions, consisting of all splice junctions present in the merged genome annotation as well as all possible combinations of 5’ and 3’ splice sites within each parent gene, was created for subsequent read mapping.
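As a rough illustration of the junction enumeration, the Python sketch below pairs every annotated 5’ splice site of a gene with every annotated 3’ splice site. The function name, the coordinate representation, and the requirement that the 3’ site lie downstream of the 5’ site in the direction of transcription are assumptions made for the example, not details given in the text.

```python
from itertools import product

def all_possible_junctions(annotated_junctions, strand="+"):
    """Enumerate candidate splice junctions for one gene.

    annotated_junctions: iterable of (five_prime_site, three_prime_site)
    genomic coordinates taken from the annotation (hypothetical input).
    Returns a sorted list of (5' site, 3' site) pairs in which the 3' site
    is downstream of the 5' site in the direction of transcription
    (an assumed constraint).
    """
    junctions = list(annotated_junctions)
    five_sites = {five for five, _ in junctions}
    three_sites = {three for _, three in junctions}

    if strand == "+":
        pairs = {(d, a) for d, a in product(five_sites, three_sites) if d < a}
    elif strand == "-":
        pairs = {(d, a) for d, a in product(five_sites, three_sites) if d > a}
    else:
        raise ValueError("strand must be '+' or '-'")
    return sorted(pairs)
```

For a plus-strand gene with annotated junctions (100, 200) and (300, 400), this yields the two annotated junctions plus the exon-skipping junction (100, 400).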
RNA-seq and ribosome profiling read mapping. RNA-seq and ribosome profiling reads were mapped to the UCSC hg19 (NCBI GRCh37) genome assembly as previously described (Dvinge et al. 2014). Briefly, RSEM (Li and Dewey 2011) was used to map reads to annotated transcripts, and the resulting unaligned reads were then aligned to the genome and the set of all possible splice junctions using TopHat v2.0.8b (Trapnell et al. 2009). We required that reads map with a maximum of three mismatches for the RNA-seq data, which consisted of 2x75 bp reads, and one mismatch for the ribosome profiling data, which consisted of shorter mixed-length reads. Reads mapped by RSEM and TopHat were then merged to create final BAM files of aligned reads.
Transcript expression, allele-specific expression, and isoform ratio measurements. Transcript expression (e.g., as illustrated in Figure 1C-D) was quantified for RNA-seq or ribosome profiling data in units of fragments per kilobase per million mapped reads as follows. For each coding gene annotated in Ensembl 71, we counted the numbers of reads (or fragments, in the case of paired-end reads) that aligned to the longest CDS of that gene based on the RefSeq transcript set and then divided by the CDS length in kilobases. We then normalized these counts by dividing by a sample-specific normalization factor, defined as (10^-6 x total number of reads mapping to any Ensembl coding gene as specified above), i.e., that total expressed in millions. This computation relied upon the GenomicAlignments, GenomicFeatures, and GenomicRanges packages within Bioconductor (Lawrence et al. 2013; Huber et al. 2015), and resulted in expression measurements in units of fragments per kilobase per million mapped reads (FPKM).
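The FPKM calculation can be written out as a short Python sketch. It assumes the conventional per-million scaling, in which the length-normalized count is divided by the total number of coding-gene fragments expressed in millions; the function and argument names are illustrative and do not come from the Bioconductor packages cited above.

```python
def fpkm(fragment_count, cds_length_bp, total_coding_fragments):
    """Fragments per kilobase of CDS per million mapped coding fragments.

    fragment_count: fragments aligned to the longest CDS of the gene.
    cds_length_bp: length of that CDS in base pairs.
    total_coding_fragments: fragments mapping to any coding gene in the sample.
    """
    if cds_length_bp <= 0 or total_coding_fragments <= 0:
        raise ValueError("CDS length and total fragment count must be positive")
    per_kilobase = fragment_count / (cds_length_bp / 1000.0)
    millions_mapped = total_coding_fragments / 1e6  # sample-specific normalization factor
    return per_kilobase / millions_mapped
```

For example, 500 fragments on a 2,000 bp CDS in a sample with 10 million coding-gene fragments gives 250 fragments per kilobase, or 25 FPKM.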
Allele-specific expression was measured by counting the numbers of RNA-seq reads containing the reference or alternate alleles. Isoform ratios were quantified genome-wide using MISO (Katz et al. 2010) as previously described (Dvinge et al. 2014).
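A minimal Python sketch of the allele-specific measurement follows. It expresses the allelic level as the fraction of allele-informative reads that carry the alternate allele, and pools read counts across samples before taking the ratio; both the pooling step and the function names are assumptions, since the text specifies only that reads containing each allele were counted.

```python
def alt_allele_fraction(ref_reads, alt_reads):
    """Fraction of allele-informative reads carrying the alternate allele.

    Returns None when no read covers the variant, so that uncovered sites
    are not mistaken for sites with zero alternate-allele expression.
    """
    total = ref_reads + alt_reads
    if total == 0:
        return None
    return alt_reads / total

def pooled_alt_allele_fraction(counts):
    """Pool (ref_reads, alt_reads) pairs across samples, then take the fraction.

    Pooling before dividing is an assumed aggregation choice; it weights
    each sample by its read depth at the variant.
    """
    ref_total = sum(ref for ref, _ in counts)
    alt_total = sum(alt for _, alt in counts)
    return alt_allele_fraction(ref_total, alt_total)
```

Under this definition, 16 alternate-allele reads out of 100 informative reads in heterozygous samples corresponds to an allelic level of 16%.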
Read coverage plots. Ribosome profiling read coverage plots were generated from the BAM files of mapped reads as follows. For each sample, all reads aligned to the genomic locus of interest were extracted and trimmed to 25 bp by removing equal lengths of sequence as necessary from the 5’ and 3’ ends to control for the differing footprint read lengths. At each genomic position, the normalized coverage was computed as the total number of reads aligned to that position divided by a sample-specific normalization factor, defined as (10^-6 x total number of reads mapping to any Ensembl coding gene), i.e., that total expressed in millions. Samples were then classified according to their genotype and the average coverage was computed for each genotype by averaging over the normalized coverage values for each sample. RNA-seq read coverage plots were generated via an identical procedure, but the reads were not trimmed (as all reads were of equal length). These plots relied upon the AnnotationHub and BSgenome packages within Bioconductor (Huber et al. 2015).
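The trimming, normalization, and genotype-averaging steps can be sketched in Python as below. The sketch makes several assumptions that the text does not specify: reads are ungapped half-open intervals [start, end); when an odd number of bases must be removed, the extra base comes off the higher-coordinate end; reads shorter than 25 bp are left unchanged; and the normalization factor is the total coding-gene read count in millions. All function names are illustrative.

```python
from collections import Counter, defaultdict

def trim_interval(start, end, target=25):
    """Trim an ungapped read [start, end) to `target` bp, removing
    (near-)equal lengths from both ends; shorter reads are returned as-is."""
    excess = (end - start) - target
    if excess <= 0:
        return start, end
    left = excess // 2
    right = excess - left  # extra base, if any, is removed from this end (assumption)
    return start + left, end - right

def normalized_coverage(reads, total_coding_reads, target=25):
    """Per-position coverage of trimmed reads, per million coding-gene reads."""
    scale = total_coding_reads / 1e6
    coverage = Counter()
    for start, end in reads:
        s, e = trim_interval(start, end, target)
        for pos in range(s, e):
            coverage[pos] += 1
    return {pos: n / scale for pos, n in coverage.items()}

def mean_coverage_by_genotype(sample_coverage, sample_genotype):
    """Average the normalized coverage over the samples of each genotype."""
    grouped = defaultdict(list)
    for sample, cov in sample_coverage.items():
        grouped[sample_genotype[sample]].append(cov)
    averaged = {}
    for genotype, covs in grouped.items():
        positions = set().union(*covs)  # every position covered in any sample
        averaged[genotype] = {
            p: sum(c.get(p, 0.0) for c in covs) / len(covs) for p in positions
        }
    return averaged
```

Passing a `target` at least as large as the longest read leaves all reads untrimmed, which corresponds to the RNA-seq case described above.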
Plotting and graphics. All plots and figures were generated with the dplyr (Wickham and Francois) and ggplot2 packages (Wickham 2009).
Reporter constructs. The reporter constructs for reference alleles were obtained as full-length cDNAs in the pcDNA3.1 backbone carrying FLAG and HA tags at the N and C termini, respectively (GenScript USA Inc.). Point mutations were introduced into constructs containing the reference alleles to generate the variants (GenScript USA Inc.).
Western blotting. HEK293 cells were grown in DMEM containing 10% FBS and transfected with the reporter constructs using Lipofectamine® 3000 transfection reagent (Invitrogen). After 48 hours, the cells were lysed in TRIzol® reagent (Invitrogen). Total protein was extracted following the manufacturer’s instructions, and the protein pellet was resuspended in sample buffer containing 5% SDS and 0.5 M unbuffered Tris base to ensure efficient solubilization, aided by gentle sonication in an ultrasonic water bath. Protein concentrations were determined using the BCA protein assay (ThermoScientific), and 5 μg of total protein was resolved on 4-12% NuPAGE Bis-Tris acrylamide gels (Novex). Western blotting was performed using the LICOR system. The primary antibodies used were mouse monoclonal anti-FLAG M2 antibody (Sigma-Aldrich; Cat # F1804) and rabbit monoclonal anti-HA antibody (EMD Millipore; Cat # 05-902R). IRDye-conjugated secondary antibodies (LICOR) were used for quantitative detection of the tags. Rough estimates of the protein molecular weights were made by comparison with the Precision Plus Protein Kaleidoscope standards (BioRad).
AUTHOR CONTRIBUTIONS
SJ and RKB performed the experiments, analyzed the data, and wrote the paper.
DISCLOSURE DECLARATION
The authors declare that no competing interests exist.
ACKNOWLEDGEMENTS
This research was supported by the Ellison Medical Foundation AG-NS-1030-13 (RKB) and the FSH Society FSHS-22014-01 (SJ). We thank Jesse Bloom and Guo-Liang Chew for comments on the manuscript.