Abstract
Pre-mRNA splicing is an important mechanism by which genetic variation influences complex traits. We developed a Multiplexed Functional Assay of Splicing using Sort-seq (MFASS) that allows us to quantify exon inclusion in large libraries of human exons and surrounding intronic contexts. We used MFASS to explore >10,000 designed mutations intended to alter regulatory elements that govern splicing. Many classes of mutations led to large-effect splicing disruptions including mutations far from canonical splice sites, and these effects were not easily predicted. We assayed 29,531 extant variants in the Exome Aggregation Consortium, and found that >1000 variants (3.6%) within or adjacent to 2393 assayed human exons led to almost complete loss of exon recognition. While most variants at the canonical splice site disrupt splicing, they represent <20% of splice-disrupting variants overall because genetic variation elsewhere dominates. Our results indicate that loss of exon recognition caused by rare genetic variation may play a larger role in trait diversity than previously appreciated, and that MFASS may provide a scalable way to functionally test such variants.
Main Text
Any individual’s genome contains ∼4-5 million deviations from the reference human genome, almost all of which are very rare1. How this collection of differences gives rise to trait diversity and disease susceptibility is a central question in human genetics. Recent genetic studies implicate pre-mRNA splicing as a major and underappreciated means through which variation imparts functional consequences 2–5. However, genetic variation is depleted at the major splice sites 2,6. If genetic variation is having major impacts on splicing, how does it impart its effects if not through the major sites known to affect splicing?
In humans, genetic and biochemical studies show that exons are first recognized in a process called exon definition, and then introns between them are removed7–11. The major exon recognition elements, including the splice donor, acceptor, branchpoint and polypyrimidine tract, are, even taken together, too degenerate to discriminate true exons from those not utilized in vivo12–14. Numerous computational, in vitro, and genetic studies have shown that other cis-regulatory elements are required to distinguish false exons from included ones12,13,15. These sequences are short motifs that are broadly classified as exonic splicing enhancers (ESEs) and suppressors (ESSs) as well as their intronic counterparts 16,17 (ISEs & ISSs). Machine learning methods use these and other genomic features trained against genome-wide RNA sequencing datasets to build predictive models of splicing regulation7–9. However, the predictive power of these models may come almost entirely from sequence conservation rather than from a mechanistic understanding of splicing18,19. These models predict that human genetic variation, and especially rare variation, often disrupts sequence features required for proper exon recognition, but it is difficult to verify the accuracy of these predictions at large scales7.
Several groups have developed massively parallel reporter assays of splicing8,14,20,21. Most of these assays look at a small set of exons and mutate them to understand which elements are important for splicing. Importantly, these methods have allowed us to better quantify how individual ESEs and ESSs combine to contribute to exon recognition in a small number of exon contexts, and can be used to build more general predictive models for exon splicing. Recently, a survey of disease variants within a much broader set of human exons found that ∼10% of these variants had exon recognition defects 20. Despite the recent progress, there are still several limitations inherent to these large-scale approaches. First, these reporters often assay exons in the contexts of short background intronic sequences, which have been shown to impact exon skipping and intron retention22. Second, most previous studies use transient transfections that do not reflect physiological chromatin contexts 23 and are usually highly overexpressed, which can lead to saturation of the splicing machinery24,25. Finally, most of these assays cannot screen both intronic and exonic changes simultaneously.
Here we develop MFASS (Multiplexed Functional Assay of Splicing by Sort-seq), a novel multiplexed assay that overcomes many of these shortcomings and builds upon several previous approaches (Fig. 1A). MFASS allows testing of tens of thousands of chemically-synthesized exons and surrounding introns in the context of a reporter with long constant introns, stably integrated at single copy at a precise genomic locus with high efficiency (Supp. Fig. 1). Briefly, we split a GFP coding sequence with a constant intron backbone, with a downstream mCherry fluorescent marker to act as a control. Thus, the ratio of green to red fluorescence is a direct measure of exon inclusion. This is reminiscent of past approaches13,14 but optimized for large libraries 26, readout by next-generation sequencing, and optimized to study exon definition13 (Supp. Fig. 2). The library of exons and surrounding native intronic sequences is cloned into this constant intron backbone. We then integrated the plasmid library into an engineered serine integrase-based landing pad at the AAVS1 locus in HEK293T cells, ensuring only one integrant per cell, similar to recently published high-efficiency integration methods 26,27 (Supp. Fig. 1, 3). We sorted the integrated cell library into bins based on the GFP:mCherry ratio, followed by DNA-Seq of the integrated library (similar to past Sort-seq approaches28–31) to build a quantitative measure of exon inclusion level of any designed sequence.
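The sort-seq readout described above can be illustrated with a minimal sketch. This section does not give the exact formula used to turn per-bin sequencing reads into an inclusion index, so the weighted-average scheme below, the bin weights, and the names `inclusion_index`, `bin_reads`, `bin_totals`, and `bin_cell_fraction` are assumptions made purely for illustration.

```python
def inclusion_index(bin_reads, bin_totals, bin_cell_fraction, bin_weights):
    """Estimate an inclusion index for one construct from per-bin read counts.

    Assumed scheme (not taken from the paper):
      1. Normalize the construct's reads in each bin by that bin's read depth.
      2. Rescale by the fraction of sorted cells that fell into the bin.
      3. Average the per-bin weights, weighted by the construct's share per bin.
    """
    shares = []
    for reads, total, cell_frac in zip(bin_reads, bin_totals, bin_cell_fraction):
        if total <= 0:
            raise ValueError("each bin needs a positive total read count")
        shares.append(reads / total * cell_frac)
    norm = sum(shares)
    if norm == 0:
        return None  # construct was not observed in any bin
    return sum(w * s for w, s in zip(bin_weights, shares)) / norm


# Hypothetical three-bin sort, bins ordered from lowest to highest assumed
# inclusion. Which fluorescence gate maps to which weight depends on the
# reporter design and is not asserted here.
weights = [0.0, 0.5, 1.0]                    # assumed inclusion level per bin
totals = [1_000_000, 1_000_000, 1_000_000]   # total reads sequenced per bin
cell_fraction = [0.3, 0.1, 0.6]              # assumed share of cells per bin

mostly_included = inclusion_index([5, 10, 400], totals, cell_fraction, weights)
mostly_skipped = inclusion_index([300, 8, 2], totals, cell_fraction, weights)
print(mostly_included, mostly_skipped)
```

Weighting by the fraction of cells in each bin is one way to keep a deeply sequenced but sparsely populated bin from dominating the estimate; other normalizations are equally plausible.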
We first designed, built and assayed a library to explore how Splicing Regulatory Elements (SRE) individually govern exon recognition across a randomly-chosen library of 205 natural human exons and surrounding intronic sequences (Figure 2A). We used fluorescence-activated cell sorting (FACS) to sort our pooled sequence library of splicing reporters into three bins (GFP<sub>neg</sub>, GFP<sub>int</sub> and GFP<sub>+</sub>). We expanded these sorted bins over several passages and observed that the sorted populations remained stable (Fig. 1B). We also performed bulk RT-PCR for each bin and found that the observed RNA splicing efficiencies corresponded almost directly with observed fluorescence of the bins (Fig. 1C, Supp. Fig. 4). In addition, we constructed individual reporters corresponding to individual library sequences, and evaluated both fluorescence and RNA splicing under transient expression and site-specific genome integration (Supp. Fig. 5). While the level of exon inclusion as measured by RT-PCR is consistent between transient and stable expression, reporter fluorescence in stably integrated constructs is more consistent with RT-PCR results because the transient transfections included signals at very high gene dosage (Supp. Fig. 4, 5).
For our SRE library studies, we first tested a variety of short constant intron contexts, but found that these resulted in ∼10-fold lower expression indicative of intron retention (Supp. Fig. 6), which is usually a rarer event in higher eukaryotes that contain longer introns32. We chose two longer intronic backbones (∼300-600 bp) shown previously to not suffer from such intronic retention (C. griseus DHFR and human SMN1 intron backbones), and found that the longer intron lengths improved both expression and assay reproducibility33,34. Exon inclusion metrics obtained from both of these intron contexts were highly reproducible between biological replicates (Fig. 1E) (r = 0.94, p < 10<sup>-16</sup>, DHFR intron backbone, and r = 0.89, p < 10<sup>-16</sup>, SMN1 intron backbone). Exon inclusion level for the entire library also correlates highly across DHFR and SMN1 constant intron contexts (Fig. 1E) (r = 0.85, p < 10<sup>-16</sup>), indicating our reporter assay is robust across broader intron contexts. Notably, most library sequences are represented predominantly in one exclusive bin showing either complete exon inclusion or skipping (Fig. 1D), consistent with bimodality in splicing behavior in our flow cytometry readout (Fig. 1B) and in single cells35–37. For all subsequent analyses, we only include constructs with Δinclusion indices that agree within 0.30 for both biological replicates and across intron backbones.
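The reproducibility checks in this paragraph (Pearson correlation between replicates, and keeping only constructs whose Δinclusion indices agree within 0.30) can be sketched as follows. Reading "agree within 0.30" as "the spread between the largest and smallest measurement is at most 0.30" is an assumption, and the construct names and values are hypothetical.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)  # undefined if either input is constant


def agrees_within(measurements, tolerance=0.30):
    """True if all measurements of one construct lie within `tolerance`."""
    return max(measurements) - min(measurements) <= tolerance


# Hypothetical Δinclusion indices per construct:
# [DHFR replicate 1, DHFR replicate 2, SMN1 replicate 1, SMN1 replicate 2]
constructs = {
    "construct_A": [-0.62, -0.55, -0.70, -0.58],
    "construct_B": [-0.05, 0.10, -0.45, 0.02],   # backbones disagree
    "construct_C": [0.03, -0.02, 0.08, 0.01],
}

kept = {name: vals for name, vals in constructs.items() if agrees_within(vals)}
print(sorted(kept))

rep1 = [vals[0] for vals in constructs.values()]
rep2 = [vals[1] for vals in constructs.values()]
print(round(pearson_r(rep1, rep2), 3))
```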
We designed the SRE library using a software tool that we developed, Splicemod, that can iteratively mutate specific classes of regulatory elements that govern splicing without unintentionally creating new ones (Fig. 2A; Supp. Table 2). As expected, reducing the strength of the splice acceptor (SA) and splice donor (SD) adversely affects exon inclusion (Fig. 2B). We observe a significant correlation between decreased MaxEnt38 score (relative to wild-type) and Δinclusion index for both SA (r = 0.33, p < 10<sup>-16</sup>) and SD (r = 0.36, p < 10<sup>-16</sup>) (Fig. 2B). The change in score for both SA and SD combined explains 14% of the variation in Δinclusion index (multiple linear regression, p < 10<sup>-16</sup>). Variants designed to mutate SA and/or SD but retain comparable strength (i.e. same MaxEnt score) show that while the majority (79.2%, 236/298) shows little change relative to wild-type (-0.20 ≤ Δinclusion index ≤ 0.20), 16% (48/298) of variants exhibit large effects with Δinclusion index ≤ -0.50 (Splice-Disrupting Variants, SDVs). Taken together, while MaxEnt scores do correlate with function, there seems to be a context dependence that is not accounted for in the score alone.
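The thresholds in this paragraph can be written as a small classifier. The cut-offs (Δinclusion index ≤ -0.50 for an SDV, and a band of ±0.20 for little change) come from the text; defining Δinclusion index as mutant minus wild-type is inferred from the sign convention, and the "intermediate" label and function names are hypothetical.

```python
def delta_inclusion(mutant_index, wildtype_index):
    """Change in inclusion index relative to wild-type (assumed: mutant - WT)."""
    return mutant_index - wildtype_index


def classify_variant(delta, sdv_threshold=-0.50, neutral_band=0.20):
    """Bucket a variant by its Δinclusion index using the thresholds in the text."""
    if delta <= sdv_threshold:
        return "SDV"                # splice-disrupting variant
    if abs(delta) <= neutral_band:
        return "little change"
    return "intermediate"           # hypothetical label for everything else


# Hypothetical constructs: (mutant inclusion index, wild-type inclusion index)
examples = {
    "strong_loss": (0.15, 0.95),
    "near_wildtype": (0.90, 0.95),
    "partial_loss": (0.60, 0.95),
}
for name, (mutant, wildtype) in examples.items():
    print(name, classify_variant(delta_inclusion(mutant, wildtype)))

# Proportions reported in the text for strength-matched SA/SD variants.
print(f"{236 / 298:.1%} little change, {48 / 298:.1%} SDVs")
```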
Perturbations to ESEs result in a significant decrease in exon inclusion compared to random exonic changes (Mann-Whitney U test, p < 10<sup>-16</sup>), while weakening or destroying ESSs results in a small but significant increase in exon inclusion (Mann-Whitney U test, p = 1.33 × 10<sup>-4</sup>). Interestingly, disrupting only the strongest ESE results in a significant decrease in Δinclusion index (Mann-Whitney U test, p = 2.42 × 10<sup>-7</sup>). We calculated an average exon hexamer score for each sequence using the HAL model, which is learned from synthetic mini-genes focused on alternative 5’ and 3’ splicing8 (Fig. 2C). We quantified the change in average exon hexamer score as the difference relative to the wild-type (Δaverage exon hexamer score) and found a correlation with Δinclusion index (r = 0.26, p < 10<sup>-16</sup>) and a significant difference between mutants that increase or decrease the average score (two-tailed Student’s t test, p < 10<sup>-16</sup>). Compared to random intronic changes, we found that weakening or destroying intronic motifs does not have an overall significant effect on exon inclusion (Mann-Whitney U test), although 9.4% (63/672) of these mutants are SDVs. Additionally, we designed mutations that disrupt 53 RNA-binding protein (RBP) motifs and found small changes in Δinclusion index relative to random mutations (Mann-Whitney U test, p = 2.08 × 10<sup>-4</sup> (intronic), p = 3.80 × 10<sup>-2</sup> (exonic)), with 14.1% (48/341) being SDVs. We synthesized 109 dbSNP mutations but do not observe significant changes in Δinclusion index (as compared to random changes) for either exonic or intronic single nucleotide polymorphisms (SNPs)39 (Mann-Whitney U test).
Given the appreciable proportions of SDVs across many classes of elements, we sought to examine the extent to which rare human variants act as SDVs. We first examined a larger library of 4660 natural human exons and found that 2902 exons (62.2%) have an inclusion index of ≥ 0.80 in our assay (Fig. 3A). Based on these human sequences, we designed and synthesized all possible exonic and intronic single nucleotide variants (SNVs) from the Exome Aggregation Consortium 2 (ExAC, v0.3.1) (Fig. 3B), which represents a rich resource of genetic diversity from 60,706 individuals. We were able to quantify the effects of 29,531 SNVs across 2393 reference sequences, which is more than half (54.7%, 29,531/54,021) of those found in the ExAC for these exons (Fig. 3B). We evaluated all SNVs in the DHFR intron backbone, because the backbone provided more replicable data in the SRE datasets. We also only report data for variants with calculated Δinclusion index within 0.20 between biological replicates to be more conservative with potential SDVs (r = 0.80, p < 10<sup>-16</sup>) (Fig. 1E; Supp. Fig. 11). We also included four control sets: (1) random nucleotides, (2) a previously tested set of skipped exons in the SRE library, (3) systematic mutations of both the splice donor and acceptor of wild-type sequences, and (4) two reporter constructs that split at distinct positions of GFP to assess how reading frame affects exon inclusion. 100% of random sequences (n = 27), 98.6% of skipped exons (n = 95), and 97.3% of broken SD/SA sequences (n = 1391) demonstrate exon skipping (inclusion index < 0.50) (Supp. Fig. 10). Moreover, Δinclusion indices across two separate reporter constructs located in different parts of GFP and in different frames demonstrate robust correlation (r = 0.95, p < 10<sup>-16</sup>, Supp. Fig. 11).
Overall, we found that 3.6% (1050/29,531) of ExAC SNVs lead to large-effect splicing disruptions in exon recognition, and are spread broadly across human exon backgrounds (Fig. 3B). The annotations in ExAC use the Variant Effect Predictor classification40, and we find that 67.8% of splice site SNVs (2 bp of intron adjacent to exon) are SDVs (Fig. 3D). Note that in our assays, alternative 5’ and 3’ splice site usage will be called as false negatives and thus we may be missing other potential SDVs. Variants in the broader splice region category, which includes variants located 2 bp into the exon and 8 bp into the intron (excluding splice sites), only disrupt splicing 8.5% of the time. Synonymous, non-synonymous, and further intronic SNVs disrupt splicing more rarely at 3.0%, 3.1%, and 1.5% respectively. The increased sensitivity at splice site locations mirrors added evolutionary constraints at these sites (Fig. 3C). However, SNVs at splice sites are rare in our library and also for all ExAC variants as a whole (Fig. 3C, Supp. Fig. 12), and the larger number of SNVs in other regions makes up for their reduced sensitivity (Fig. 3D). Notably, SNVs at splice sites only constitute 17% of the SDVs revealed by our assay, whereas intron variants, which are the least sensitive to genetic variation, contribute 19% of the SDVs (Fig. 3D). Overall, we observe almost equal contributions from intronic (53%) and exonic (47%) SDVs.
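The point that abundance can outweigh sensitivity follows from simple arithmetic: a category's contribution to the SDV pool is its per-variant SDV rate multiplied by the number of variants in it. The sketch below back-calculates rough category sizes from the rounded percentages quoted above. These implied counts are illustrative estimates only, not values reported in the study, and the function name is hypothetical.

```python
TOTAL_SNVS = 29_531   # assayed ExAC SNVs (from the text)
TOTAL_SDVS = 1_050    # SDVs among them (from the text)

# category: (fraction of the category's SNVs that are SDVs, share of all SDVs)
reported = {
    "splice site": (0.678, 0.17),
    "intron": (0.015, 0.19),
}


def implied_counts(sdv_rate, share_of_sdvs, total_sdvs=TOTAL_SDVS):
    """Back-calculate approximate SDV and SNV counts for one category."""
    sdvs = share_of_sdvs * total_sdvs
    snvs = sdvs / sdv_rate
    return sdvs, snvs


print(f"overall SDV rate: {TOTAL_SDVS / TOTAL_SNVS:.1%}")
for category, (rate, share) in reported.items():
    sdvs, snvs = implied_counts(rate, share)
    print(f"{category}: ~{sdvs:.0f} SDVs from ~{snvs:.0f} assayed SNVs "
          f"(~{snvs / TOTAL_SNVS:.1%} of the library)")
```

Because the inputs are rounded percentages, the implied counts carry substantial uncertainty, especially for the intron category where the rate is small.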
Evolutionary conservation does correlate with whether an SNV will be an SDV, and this is most clearly seen within introns, which are enriched for highly conserved SDVs (Fig. 4A) (two-sided Fisher’s exact test, p < 10<sup>-16</sup>). However, this conservation has limited predictive power, as there are more lowly conserved intronic SDVs than highly conserved ones especially for upstream intronic regions, while there are few poorly conserved exonic sites (Fig. 4B). Looking at gene level population genetic constraints, for exons within those genes that are predicted to be intolerant to loss-of-function (pLI ≥ 0.9), we observe significantly fewer SDVs (Fig. 4C) (two-sided Fisher’s exact test, p = 2.67 × 10<sup>-12</sup>). Finally, while a vast majority of SDVs are rare, the proportion of SNVs that are SDVs is significantly different across ExAC allele frequency bins (p = 1.12 × 10<sup>-3</sup>, chi-squared test) ranging from extremely rare variants (singletons) to more common variants with allele frequency of ≥ 0.1% (Fig. 4D).
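The enrichment tests cited in this paragraph are two-sided Fisher's exact tests on 2×2 tables (for example, SDV versus non-SDV against highly versus lowly conserved). A minimal standard-library sketch is shown below. The counts in the example table are hypothetical and are not taken from the study.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same margins
    that are no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Probability that the top-left cell equals x given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Small relative tolerance guards against floating-point ties.
    return sum(p for p in (prob(x) for x in range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-7))


# Hypothetical counts:      SDV   not SDV
# highly conserved           40      460
# lowly conserved            60     4440
p_value = fisher_exact_two_sided(40, 460, 60, 4440)
print(f"p = {p_value:.3g}")
```

For real analyses an established implementation such as `scipy.stats.fisher_exact` would normally be preferred; the hand-written version is here only to make the calculation explicit.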
We compared multiple prediction algorithms to our human variant dataset, some designed specifically for splicing (SPANR7 and HAL8) and others to predict the impact of non-coding genetic variation (CADD41, DANN42, FATHMM-MKL43, fitCons44, and LINSIGHT45) (Fig. 4E, Supp. Fig. 13). Overall, we find that the two algorithms specifically designed for and trained on splicing data perform the best, mostly due to their ability to distinguish exonic SDVs (HAL only predicts exonic SNVs). Most of the models that use conservation and other functional attributes perform equally well on intronic SNVs. In particular, SPANR works best overall largely due to its increased ability to differentiate exonic SDVs (Fig. 4E, right; Supp. Fig. 13). At equivalent effect size (>50%), SPANR achieves 44.5% precision, though only 11.8% of the SDVs are called. However, SPANR is trained on bulk RNA-Seq data, and thus effect sizes can be skewed. As we lower the threshold for calling an SDV (i.e., the predicted effect size of an SNV), SPANR can achieve 14.9% precision at 50% recall level (of the SDVs called). For the other prediction algorithms, precision is below 10% at most appreciable recall levels.
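The precision and recall figures above depend on where the score threshold for calling an SDV is placed. The sketch below shows that trade-off on toy data. The scores and labels are hypothetical, and treating a higher score as a stronger predicted disruption is an assumption that would need to be adapted to each predictor's actual output.

```python
def precision_recall(scores, is_sdv, threshold):
    """Precision and recall when calling every variant with score >= threshold."""
    called = [s >= threshold for s in scores]
    tp = sum(c and t for c, t in zip(called, is_sdv))
    fp = sum(c and not t for c, t in zip(called, is_sdv))
    fn = sum((not c) and t for c, t in zip(called, is_sdv))
    precision = tp / (tp + fp) if tp + fp else None  # undefined with no calls
    recall = tp / (tp + fn) if tp + fn else None     # undefined with no SDVs
    return precision, recall


# Hypothetical predictor scores and assay-derived SDV labels.
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
is_sdv = [True, False, True, False, True, False]

for threshold in (0.7, 0.5, 0.2):
    precision, recall = precision_recall(scores, is_sdv, threshold)
    print(f"threshold {threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

Lowering the threshold calls more variants, which can only keep or raise recall, while precision may move in either direction depending on how many non-SDVs are pulled in.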
As with other functional approaches, our assay has several limitations which must be considered46. First, we only perform this assay in a single cell type (HEK293T), and thus there might be trans-factors that mitigate or exacerbate splicing47. Using MFASS in other cell types will be important to understand the scope of these effects. Second, the tested regions are surrounded by non-native intron sequence that might affect the propensity of variants that affect splicing48. Third, because MFASS depends upon FACS, our limit of detection can only reliably observe large effect sizes. For calling SDVs this is tolerable, and it seems likely that only large-effect changes will translate across cell types. However, small-effect changes might be important both functionally and for constraining predictive models. Fourth, MFASS as designed can only observe full exon skipping events. Even though these events dominate a majority of splicing perturbations, other types of splicing disruptions, including alternative 3’ and 5’ splice-site usage, are likely to be false negatives from MFASS. Other multiplexed splicing assays that use barcoded RNAs can alleviate such issues, but are currently limited to short intronic regions8,21. Fifth, in this study we only examine exons starting and ending on frame 0. Since skipping an exon that preserves frame might be less deleterious than for frame-shifting exons, our library selected here may suffer from selection bias, even though we find no appreciable differences in conservation profiles between the two (Supp. Fig. 14). We also found during this study that several of the plasmids developed for MFASS can be directly used to screen for frame-shifting exons. Finally, oligonucleotide libraries such as those used here are limited to ∼200nt in length. This limits the size of exons we can explore, which can also lead to selection bias in that short exons of <100 bp may be more sequence constrained. 
This also limits the length of the surrounding intronic sequences, which could serve to buffer or alter the effects of sequence variation (Supp. Fig. 15). As oligonucleotide and gene library synthesis improves, we expect to include additional genetic context in the assays49,50.
Despite the limitations, we see clear indications that many more rare variants than we expected can lead to large-effect splicing disruptions. More than 1000 SDVs discovered in this study are variants that directly eliminate exon recognition, and we reason that such large-effect SDVs are the most likely to translate to other cell types and/or play a role in human traits and diseases. In addition, because almost all the candidate SDVs are extremely rare, genome-wide splicing quantitative trait loci (sQTL) studies may be underestimating much of how mutations affect traits through splicing3,51. More broadly, multiplexed empirical models of important biological processes, such as ones derived from MFASS, can both help build improved computational models and provide an alternative to them. Finally, given the prevalence of large-effect regulatory variants that disrupt splicing discovered here, MFASS provides a scalable platform to functionally screen and aid precise clinical interpretation and prioritization of rare genetic variants 52.
Acknowledgments
This work was supported by the National Institutes of Health (5U01HG007912 & DP2GM114829 to S.K.), the NIH Biomedical Big Data Training Grant (T32CA201160 to C.B.), Searle Scholars Program (to S.K.), Department of Energy (DE-FC02-02ER63421 to S.K.), UCLA, and Linda and Fred Wudl. We thank Ron Weiss for the original landing pad cell line, Felicia Codrea and Jessica Scholes (UCLA BSCRC flow cytometry core), and the BSCRC high throughput sequencing core for technical assistance. We thank Xinshu Grace Xiao, Douglas Black, and George Church for guidance while developing MFASS.
Footnotes
† The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors.