Abstract
Cis-regulatory changes have long been suggested to contribute to organismal adaptation. While cis-regulatory changes can now be identified on a transcriptome-wide scale, in most cases the adaptive significance and mechanistic basis of rapid cis-regulatory divergence remains unclear. Here, we have characterized cis-regulatory changes associated with recent adaptive floral evolution in the selfing plant Capsella rubella, which diverged from the outcrosser Capsella grandiflora less than 200 kya. We assessed allele-specific expression (ASE) in leaves and flower buds at a total of 18,452 genes in three interspecific F1 C. grandiflora x C. rubella hybrids. After accounting for technical variation and read-mapping biases using genomic reads, we estimate that an average of 44% of these genes show evidence of ASE; however, only 6% show strong allelic expression biases. Flower buds, but not leaves, show an enrichment of genes with ASE in genomic regions responsible for phenotypic divergence between C. rubella and C. grandiflora. We further detected an excess of heterozygous transposable element (TE) insertions in the vicinity of genes with ASE, and TE insertions targeted by uniquely mapping 24-nt small RNAs were associated with reduced allelic expression of nearby genes. Our results suggest that cis-regulatory changes have been important for recent adaptive floral evolution in Capsella and that differences in TE dynamics between selfing and outcrossing species could be an important mechanism underlying rapid regulatory divergence.
Author Summary
The role of regulatory changes for adaptive evolution has long been debated. Cis-regulatory changes have been proposed to be especially likely to contribute to phenotypic adaptation, because they are expected to have fewer negative side effects than protein-coding changes. So far relatively few studies have investigated the role of cis-regulatory changes in wild plants. Here we assess the regulatory divergence between two closely related plant species that differ in their mating system and floral traits. We directly assess cis-regulatory divergence by quantifying the expression levels of both alleles in F1 hybrids of these species, and we find that genes showing cis-regulatory divergence are enriched in genomic regions that are responsible for floral and reproductive differences between the species. In combination with information on gene function for genes with cis-regulatory divergence in flower buds, this suggests that cis-regulatory changes might have been important for morphological differentiation between these species. Additionally we discover that transposable elements, which accumulate differently depending on mating system, might be involved in rapid regulatory divergence. These findings are an important step towards a better understanding of the role and the mechanisms of rapid regulatory divergence between plant species.
Introduction
The molecular nature of genetic changes that contribute to adaptation is a topic of long-standing interest in evolutionary biology. Ever since the discovery of regulatory sequences by Jacob and Monod in the early 1960s [1], there has been a strong focus on the role of regulatory changes for organismal adaptation (e.g. [2, 3, 4, 5, 6, 7]).
This work has mostly centered on changes in cis-regulatory elements (CREs), regulatory regions such as promoters or enhancers that are linked to a focal gene.
Due to the modular nature of CREs, cis-regulatory changes can alter the expression of the focal gene in a very specific manner, affecting only a particular tissue, cell type, or developmental stage. These changes therefore potentially have fewer negative pleiotropic effects than nonsynonymous mutations in coding regions [3]. For this reason, cis-regulatory changes have been suggested to contribute disproportionally to organismal adaptation ([3, 4, 5, 8, 9] but see [10]).
Numerous detailed investigations of single genes have identified causal cis-regulatory changes responsible for changes in animal form and color (e.g. Drosophila wing pigmentation [11]; pelvic reduction [12]; pigmentation [13] and tooth number in stickleback [14]). In yeast, the molecular mechanisms for and mode of selection on cis-regulatory variation have begun to be clarified in detail [15, 16]. Cis-regulatory changes in individual genes contributing to phenotypic evolution have also been identified in plants, with perhaps the most well-known example being an insertion of a transposable element (TE) into the regulatory region of the teosinte branched 1 (tb1) gene causing increased apical dominance in maize [17, 18]. Other examples include increased tolerance of heavy-metal polluted soils in Arabidopsis halleri due to a combination of copy number expansion and cis-regulatory changes at the gene HMA4 [19], cis-regulatory variation at the RCO-A gene conferring a change in leaf morphology in Capsella [20], and cis-regulatory variation at FLC conferring variation in vernalization response in A. thaliana [21].
With the advent of high-throughput methods for assessing gene expression, the prospects for identifying cis-regulatory changes on a transcriptome-wide scale have greatly improved [22]. Genes with cis-regulatory changes can be identified based on mapping local expression QTL (cis-eQTL) or by assessing allele-specific expression (ASE). Whereas map-based approaches can identify QTL for all genes with expression data, resolution is typically limited. In contrast, ASE studies require the presence of transcribed polymorphisms as well as rigorous bioinformatic approaches, but have greater resolution and can identify individual genes with cis-regulatory changes [23].
In Drosophila and yeast, transcriptome-wide studies have found that cis-regulatory changes or concordant cis- and trans-regulatory changes may be disproportionately fixed between lineages, which implies the action of directional selection on gene expression during divergence (e.g. [25, 26, 27]). Evidence for positive selection on cis-regulatory changes has also been found in crop plants, including rice [27] and maize [28].
Recent transcriptome-scale studies have begun to shed light on the mechanistic basis of cis-regulatory variation in plants. Studies in Arabidopsis have shown that silencing of transposable elements through the RNA-directed methylation pathway may be particularly important, as silencing of TEs through targeting by 24-nt small interfering RNA (siRNA) and subsequent methylation also affects the expression of nearby genes [29, 30]. Transcriptional gene silencing through the RNA-directed methylation pathway has been suggested to be an important mechanism by which regulatory variation is generated both within [31] and between species [30, 32, 33]. While analyses of the population frequencies and age distribution of methylated TE insertions suggest that most methylated TE insertions near genes are deleterious [29, 30], it has been suggested that some proportion of TE insertions might also contribute to organismal adaptation [34]. TE insertions have been selected for during domestication (e.g. maize [18]; domesticated silkworm [35]), and patterns of population differentiation suggest that TEs have contributed to adaptation to temperate environments in Drosophila [36]. Studies in Arabidopsis [37], maize [38], and rice [39] have also shown that TE insertions can influence stress-induced expression of nearby genes. However, the extent to which TEs contribute to adaptation in the wild is currently not clear for most species.
The crucifer genus Capsella is a promising system for assessing the role of cis-regulatory changes in association with plant mating system shifts and adaptation. In Capsella, genetic and genomic studies are greatly facilitated by the availability of the sequenced reference genome of Capsella rubella [40] and because it is feasible to generate crosses among closely related species. Capsella harbors four closely related species that vary in both mating system and ploidy: the self-incompatible outcrossing diploid Capsella grandiflora, the self-compatible diploids Capsella rubella and Capsella orientalis, and finally the allopolyploid Capsella bursa-pastoris [41].
In C. rubella, the transition to selfing occurred relatively recently (∼100 kya), and was associated with speciation from an outcrossing progenitor similar to present-day C. grandiflora [40, 42, 43, 44, 45]. Despite the recent shift to selfing, C. rubella already exhibits a derived reduction in petal size and an elevated pollen-ovule ratio, as well as a reduction of the degree of flower opening [46, 47]. C. rubella therefore exhibits floral characteristics typical of self-fertilizing plants, a so-called “selfing syndrome”. The selfing syndrome of C. rubella is associated with improved efficacy of autonomous self-pollination [46], and regions with quantitative trait loci for floral divergence between C. rubella and C. grandiflora exhibit an excess of fixed differences and reduced polymorphism in C. rubella [47]. Together, these observations suggest that the rapid evolution of the selfing syndrome in C. rubella was driven by positive selection. While the molecular genetic basis of the selfing syndrome in C. rubella has not been identified, it has been suggested that cis-regulatory changes could be involved, and a previous study found many flower and pollen development genes to be differentially expressed in flower buds of C. grandiflora and C. rubella [40]. As the two species differ in their genomic distribution of TEs, with C. rubella harboring fewer TEs close to genes than C. grandiflora [48], it is possible that TE silencing through the RNA-directed methylation pathway could constitute a mechanism for cis-regulatory divergence in this system.
In this study we assess cis-regulatory divergence between C. grandiflora and C. rubella and investigate the role of cis-regulatory changes for floral and reproductive trait divergence in C. rubella. We conduct deep sequencing of transcriptomes as well as genomes of C. grandiflora x C. rubella F1 hybrids to identify genes with cis-regulatory divergence in flower buds and leaves, and test whether cis-regulatory changes in flowers are overrepresented in genomic regions responsible for adaptive phenotypic divergence. We further conduct small RNA sequencing and test whether TE insertions targeted by uniquely mapping 24-nt siRNAs are associated with cis-regulatory divergence. Our results provide insight into the mechanisms and adaptive significance of cis-regulatory divergence in association with recent adaptation and phenotypic divergence in a wild plant system.
Results
Many genes exhibit allele-specific expression in interspecific F1 hybrids
In order to quantify ASE between C. grandiflora and C. rubella, we generated deep whole transcriptome RNAseq data from flower buds and leaves of three C. grandiflora x C. rubella F1 hybrids (total 52.1 vs 41.8 Gbp with Q ≥30 for flower buds and leaves, respectively). We included three technical replicates for one F1 in order to examine the reliability of our expression data. For all F1s and their C. rubella parents, we also generated deep (38-68x) whole genome resequencing data in order to reconstruct parental haplotypes and account for read mapping biases.
F1 RNAseq reads were mapped with high stringency to reconstructed parental haplotypes specific for each F1, i.e. reconstructed reference genomes containing whole-genome haplotypes for both the C. grandiflora and the C. rubella parent of each F1 (see Methods). We conducted stringent filtering of genomic regions where SNPs were deemed unreliable for ASE analyses due to e.g. high repeat content, copy number variation, or a high proportion of heterozygous genotypes in an inbred C. rubella line (for details, see Methods and S1 text); this mainly resulted in removal of pericentromeric regions (S2 Fig - S5 Fig). After filtering, we identified ∼18,200 genes with ∼274,000 transcribed heterozygous SNPs that were amenable to ASE analysis in each F1 (Table 1). The mean allelic ratio of genomic read counts at these SNPs was 0.5 (S6 Fig), suggesting that our bioinformatic procedures efficiently minimized read mapping biases. Furthermore, technical reliability of our RNAseq data was high, as indicated by a mean Spearman’s ρ between replicates of 0.98 (range 0.94-0.99).
We assessed ASE using a Bayesian statistical method with a reduced false positive rate compared to the standard binomial test [49]. The method uses genomic read counts to model technical variation in ASE and estimates the global proportion of genes with ASE, independent of specific significance cutoffs, and also yields gene-specific estimates of the ASE ratio and the posterior probability of ASE. The model also allows for and estimates the degree of variability in ASE along the gene, through the inclusion of a dispersion parameter.
Based on this method, we estimate that on average, the proportion of assayed genes with ASE is as high as 44.6% (S6 Table). In general, most allelic expression biases were moderate, and only 5.9% of assayed genes showed ASE ratios greater than 0.8 or less than 0.2 (Fig. 1; Fig. 2). There was little variation in ASE ratios along genes, as indicated by the distribution of the dispersion parameter estimates having a mode close to zero and a narrow range (Fig. 1, Fig. 2). This suggests that unequal expression of differentially spliced transcripts is not a major contributor to regulatory divergence between C. rubella and C. grandiflora (Fig. 1, Fig. 2). It also suggests limits to ASE patterns arising as stochastic artifacts, which might also tend to create variation in ASE ratios within genes.
For genes with evidence for ASE (hereafter defined as posterior probability of ASE ≥ 0.95), there was a moderate shift toward higher expression of the C. rubella allele (mean ratio C. rubella/total=0.56; Fig. 1, Fig. 2). This shift was present for all F1s, for both leaves and flowers (Fig. 1, Fig. 2). No such shift was apparent for genomic reads, and ratios of genomic read counts for SNPs in genes with ASE were very close to 0.5 (mean ratio C. rubella/total=0.51; Fig. 1, Fig. 2). Furthermore, qPCR with allele-specific probes for five genes validated our ASE results empirically (S8 Table). This suggests that C. rubella alleles are on average expressed at a slightly higher level than C. grandiflora alleles in our F1s.
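As an illustration of the quantities used above, the following minimal Python sketch computes an allelic ratio (C. rubella reads over total reads) and labels a gene using the thresholds quoted in the text (posterior probability of ASE ≥ 0.95; ratios above 0.8 or below 0.2 counted as strong biases). The function names are hypothetical, the posterior probability is assumed to come from a separate model such as the Bayesian method of [49], and combining the two thresholds into one label is an assumption made here for illustration rather than a description of the authors' pipeline.

```python
def allelic_ratio(rubella_reads, grandiflora_reads):
    """Return the fraction of allele-specific reads carrying the C. rubella allele."""
    total = rubella_reads + grandiflora_reads
    if total == 0:
        raise ValueError("no allele-specific reads for this gene")
    return rubella_reads / total


def classify_gene(ratio, posterior, posterior_cutoff=0.95,
                  strong_low=0.2, strong_high=0.8):
    """Label a gene from its ASE ratio and an externally supplied posterior.

    Assumption: the posterior probability of ASE is computed elsewhere;
    this function only applies the cutoffs quoted in the text.
    """
    if posterior < posterior_cutoff:
        return "no ASE"
    if ratio > strong_high or ratio < strong_low:
        return "strong ASE"
    return "moderate ASE"


if __name__ == "__main__":
    # Toy read counts, not data from the study.
    r = allelic_ratio(56, 44)
    print(r, classify_gene(r, posterior=0.99))
```

With the toy counts shown, the ratio matches the mean shift toward the C. rubella allele reported above (0.56) and would count as a moderate bias under these cutoffs.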
The mean ASE proportion, as well as the absolute number of genes with ASE was greater for leaves (49%; 6010 genes) than for flower buds (40%; 5216 genes), although this difference was largely driven by leaf samples from one of our F1s (Table 1). Most instances of ASE were specific to either leaves or flower buds, and on average, only 15% of genes expressed in both leaves and flower buds showed consistent ASE in both organs (Fig. 3). Many cases of ASE were also specific to a particular F1, and across all three F1s, there were 1305 genes that showed consistent ASE in flower buds, and 1663 in leaves (Fig. 3).
Enrichment of genes with ASE in genomic regions responsible for phenotypic divergence
We used permutation tests to check for an excess of genes showing ASE within previously-identified narrow (<2 Mb) QTL regions responsible for floral and reproductive trait divergence [47]. As the selfing syndrome seems to have a shared genetic basis in independent C. rubella accessions [46, 47], we reasoned that genes with consistent ASE across all F1s would be most likely to represent candidate cis-regulatory changes underlying QTL. Out of the 1305 genes with ASE in flower buds of all F1s, 85 were found in narrow QTL regions, and this overlap was significantly greater than expected by chance (permutation test, P=0.03; Fig. 4; see Methods for details). In contrast, for leaves, there was no significant excess of genes showing ASE in narrow QTL (permutation test, P=1; Fig. 4). Thus, the association between QTL and ASE in flower buds is unlikely to be an artifact of locally elevated heterozygosity facilitating both ASE and QTL detection, which should affect analyses of both leaf and flower samples.
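The logic of such a test can be sketched in a few lines of Python. This is a simplified illustration, not the procedure described in the Methods: it assumes that the null distribution is obtained by redrawing the same number of ASE genes at random from all assayed genes, ignoring gene order and linkage, and the function name and toy inputs are hypothetical.

```python
import random


def permutation_enrichment(in_qtl, has_ase, n_perm=10000, seed=0):
    """One-sided permutation test for an excess of ASE genes inside QTL regions.

    in_qtl, has_ase: equal-length lists of booleans, one entry per assayed gene.
    Returns (observed overlap, permutation P-value).
    Assumption: ASE labels are exchangeable across genes under the null.
    """
    if len(in_qtl) != len(has_ase):
        raise ValueError("inputs must have one entry per gene")
    rng = random.Random(seed)
    observed = sum(1 for q, a in zip(in_qtl, has_ase) if q and a)
    n_ase = sum(1 for a in has_ase if a)
    indices = range(len(in_qtl))
    at_least_as_extreme = 0
    for _ in range(n_perm):
        # Draw a random gene set of the same size as the ASE gene set.
        sampled = rng.sample(indices, n_ase)
        overlap = sum(1 for i in sampled if in_qtl[i])
        if overlap >= observed:
            at_least_as_extreme += 1
    # Add-one correction so the estimated P-value is never exactly zero.
    p_value = (at_least_as_extreme + 1) / (n_perm + 1)
    return observed, p_value


if __name__ == "__main__":
    # Toy example: 100 genes, 10 in QTL regions, all 10 ASE genes inside them.
    in_qtl = [i < 10 for i in range(100)]
    has_ase = [i < 10 for i in range(100)]
    print(permutation_enrichment(in_qtl, has_ase, n_perm=2000))
```

A permutation scheme that preserves the physical clustering of genes, for example by shifting QTL intervals along chromosomes, would be more conservative than this label-shuffling sketch.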
List enrichment analyses reveal floral candidate genes with ASE
We conducted list enrichment analyses to characterize the functions of genes showing ASE. There was an enrichment of Gene Ontology (GO) terms involved in defense and stress responses for genes with ASE in flower buds and in leaves (S9 Table). GO terms related to hormonal responses, including brassinosteroid and auxin biosynthetic processes, were specifically enriched among genes with ASE in flower buds (S9 Table). We further identified nineteen genes involved in floral and reproductive development in A. thaliana, which are located in QTL regions (see above), and show ASE in flower buds (Table 2). These genes are of special interest as candidate genes for detailed studies of the genetic basis of the selfing syndrome in C. rubella.
Intergenic divergence is elevated near genes with ASE
To assess the role of polymorphisms in regulatory regions for ASE, we assessed levels of heterozygosity in intergenic regions within 1 kb of genes that likely contain an elevated proportion of cis-regulatory elements, and in previously identified conserved noncoding regions [50] within 5 kb and 10 kb of genes. Genes with ASE were not significantly more likely to be associated with conserved noncoding regions with heterozygous SNPs than genes without ASE. However, levels of intergenic heterozygosity were slightly but significantly higher for genes with ASE than for those without ASE (median heterozygosity values 1 kb upstream of genes of 0.016 vs. 0.014, respectively, S10 Table), suggesting that polymorphisms in regulatory regions upstream of genes might have contributed to cis-regulatory divergence.
Enrichment of TEs near genes with ASE
To test whether differences in TE content might contribute to cis-regulatory divergence between C. rubella and C. grandiflora, we examined whether heterozygous TE insertions near genes were associated with ASE. We identified TE insertions specific to the C. grandiflora or C. rubella parents of our F1s using genomic read data, as in Agren et al (2014) [48] (Table 3; see Methods). Consistent with their results [48], we found that C. rubella harbored fewer TE insertions close to genes than C. grandiflora (on average, 482 vs 1154 insertions within 1 kb of genes in C. rubella and C. grandiflora, respectively). Among heterozygous TE insertions, Gypsy insertions were the most frequent (Table 3). There was a significant association between heterozygous TE insertions within 1 kb of genes and ASE, for both leaves and flower buds, and the strength of the association was greater for TE insertions closer to genes (Table 4; Fig. 5). This was true for individual F1s, as well as for all F1s collectively (Table 4; Fig. 5; S11 Table).
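To make the distance-based comparison concrete, the sketch below shows one way to measure how far a TE insertion lies from a gene and to summarize the association between nearby insertions and ASE as an odds ratio from a 2x2 table. Both helper names are hypothetical, the coordinates and counts are toy values, and the odds ratio is used here only as an illustrative summary statistic; this section does not state which test underlies Table 4.

```python
def distance_to_gene(gene_start, gene_end, te_position):
    """Distance in bp from a TE insertion to a gene; 0 if the TE lies within the gene."""
    if gene_start > gene_end:
        raise ValueError("gene_start must not exceed gene_end")
    if te_position < gene_start:
        return gene_start - te_position
    if te_position > gene_end:
        return te_position - gene_end
    return 0


def odds_ratio(ase_with_te, ase_without_te, no_ase_with_te, no_ase_without_te):
    """Odds ratio for ASE given a nearby heterozygous TE insertion.

    Values above 1 indicate that genes with a nearby TE insertion show ASE
    more often than genes without one.
    """
    if ase_without_te == 0 or no_ase_with_te == 0:
        raise ValueError("odds ratio undefined when an off-diagonal cell is zero")
    return (ase_with_te * no_ase_without_te) / (ase_without_te * no_ase_with_te)


if __name__ == "__main__":
    # Toy coordinates and counts, not data from the study.
    print(distance_to_gene(1000, 2000, 800))   # TE 200 bp upstream of the gene
    print(odds_ratio(30, 70, 10, 90))
```

Repeating the odds ratio calculation for insertions within the gene, within 200 bp, and within 1 kb would show whether the association strengthens with proximity, as reported above.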
TEs targeted by uniquely mapping 24-nt small RNAs are associated with reduced allelic expression of nearby genes
To test whether siRNA-based silencing of TEs might be responsible for the association between TE insertions and ASE in Capsella, we analyzed data for flower buds from one of our F1s, for which we had matching small RNA data (see Methods). We selected only those 24-nt siRNA reads that mapped uniquely, without mismatch, to one site within each of our F1s, because uniquely mapping siRNAs have been shown to have a more marked association with gene expression in Arabidopsis [29]. For each gene, we then assessed the ASE ratio of the allele on the same chromosome as a TE insertion (i.e. ASE ratios were polarized such that relative ASE was equal to the ratio of the expression of the allele with a TE insertion on the same chromosome over the total expression of both alleles), and then further examined the influence of nearby siRNAs. Overall, the mean relative ASE was reduced for genes with nearby TE insertions (Fig. 6A) with a more pronounced effect for TE insertions within 1 kb (within the gene: Wilcoxon rank sum test, W = 1392103, p-value = 8.76 × 10⁻³; within 200 bp: Wilcoxon rank sum test, W = 1903047, p-value = 7.17 × 10⁻³; within 1 kb: Wilcoxon rank sum test, W = 3687972, p-value = 8.19 × 10⁻³). The magnitude of the effect on ASE was more pronounced for genes near TE insertions targeted by uniquely mapping 24-nt siRNAs (Fig. 6B; for genes with a TE insertion within the gene: Wilcoxon rank sum test, W = 423369, p-value = 1.36 × 10⁻⁴; within 200 bp: W = 540926, p-value = 1.82 × 10⁻⁵; within 1 kb: W = 983938, p-value = 3.13 × 10⁻³). In contrast, no significant effect on ASE was apparent for genes near TE insertions that were not targeted by uniquely mapping 24-nt siRNAs (Fig. 6C). Thus, uniquely mapping siRNAs targeting TE insertions appear to be responsible for the association we observe between ASE and TE insertions.
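The polarization step described above can be written compactly. The Python sketch below computes relative ASE as the expression of the allele carrying the TE insertion over the total expression of both alleles, and then computes a rank-based two-sample statistic by pair counting (the Mann-Whitney U statistic). The function names and toy values are hypothetical; whether this U corresponds exactly to the reported W depends on the statistical software used, which this section does not state, and no P-value is computed here.

```python
def relative_ase(rubella_reads, grandiflora_reads, te_allele):
    """Expression of the TE-carrying allele divided by total allelic expression.

    te_allele: "rubella" or "grandiflora", the haplotype with the TE insertion.
    """
    total = rubella_reads + grandiflora_reads
    if total == 0:
        raise ValueError("no allele-specific reads for this gene")
    if te_allele == "rubella":
        return rubella_reads / total
    if te_allele == "grandiflora":
        return grandiflora_reads / total
    raise ValueError("te_allele must be 'rubella' or 'grandiflora'")


def mann_whitney_u(x, y):
    """U statistic for sample x versus y: pairs with x > y, ties counted as 0.5."""
    u = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u += 1.0
            elif xi == yj:
                u += 0.5
    return u


if __name__ == "__main__":
    # Toy values: genes near a TE insertion versus genes without one.
    with_te = [relative_ase(30, 70, "rubella"), relative_ase(60, 40, "grandiflora")]
    without_te = [0.5, 0.55]
    print(with_te, mann_whitney_u(with_te, without_te))
```

A relative ASE below 0.5 means the allele on the TE-carrying haplotype is expressed at a lower level than the alternative allele, which is the direction of the effect reported in Fig. 6.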
Discussion
Understanding the causes and consequences of cis-regulatory divergence is a longstanding aim in evolutionary genetics. In this study, we have quantified allele-specific expression in order to understand the mechanisms and adaptive significance of cis-regulatory changes in association with a recent plant mating system shift.
Our results indicate that many genes, on average over 40%, harbor cis-regulatory changes between C. rubella and C. grandiflora. The proportion of genes with ASE may seem high given the recent divergence (∼100 kya) between C. rubella and C. grandiflora [40, 45]. However, the majority of genes with ASE showed relatively mild allelic expression biases, and while our estimates are higher than those in a recent microarray-based study of interspecific Arabidopsis hybrids (<10%) [32], our results are consistent with recent analyses of RNAseq data from intraspecific F1 hybrids of Arabidopsis accessions (∼30%) [51]. Somewhat higher levels of ASE were found in a recent study of maize and teosinte (∼70% of genes showed ASE in at least one tissue and F1 individual [28]), and using RNAseq data and the same hierarchical Bayesian analysis that we employed, Skelly et al (2011) [49] estimated that a substantially higher proportion, >70% of assayed genes, showed ASE among two strains of Saccharomyces cerevisiae. Thus, our estimates of the proportion of genes with ASE fall within the range commonly observed for recently diverged accessions or lines based on RNAseq data.
One of the key motivations for this study was to investigate whether cis- regulatory changes contributed to floral and reproductive adaptation to selfing in C. rubella. Two lines of evidence support this hypothesis; first, we find an excess of genes with ASE in flower buds within previously identified narrow QTL regions for floral and reproductive traits that harbor a signature of selection [47]. In contrast, no such excess is present for genes with ASE in leaves, suggesting that this observation is not simply a product of higher levels of divergence among C. rubella and C. grandiflora in certain genomic regions facilitating both QTL delimitation and ASE analysis. Second, we find that genes involved in hormonal responses, including brassinosteroid biosynthesis, are overrepresented among genes with ASE in flower buds, but not in leaves. Based on a study of differential expression and functional information from Arabidopsis thaliana, regulatory changes in this pathway were previously suggested to be important for the selfing syndrome in C. rubella [40].
While we do not find evidence for ASE at the specific genes detected as differentially expressed in [40], our work nonetheless provides additional support for regulatory changes in the brassinosteroid pathway contributing to the selfing syndrome of C. rubella. Future studies should conduct fine-scale mapping and functional validation to fully explore this hypothesis. To facilitate this work, we have identified a set of candidate genes with ASE that are located in genomic regions harboring QTL for floral and reproductive trait divergence between C. rubella and C. grandiflora. Of particular interest in this list is the gene JAGGED (JAG). In A. thaliana, this gene is involved in determining petal growth and shape by promoting cell proliferation in the distal part of the petal [52, 53]. As C. rubella has reduced petal size due to a shortened period of proliferative growth [46], and the C. rubella allele is expressed at a lower level than the C. grandiflora allele, this gene is a very promising candidate gene for detailed studies of the genetic basis of the selfing syndrome.
Many instances of ASE were specific to a particular individual or tissue, an observation also supported by recent studies (e.g. [28, 32]). This suggests that there is substantial variation in ASE depending on genotype and developmental stage, consistent with the reasoning that cis-regulatory changes can have very specific effects, but expression noise is probably also a contributing factor. In our analyses, we took several steps to model and account for technical variation in order to reduce the incidence of false positives. However, it is difficult to completely rule out the possibility that some cases of subtle ASE may not represent biologically meaningful cis-regulatory variation. We also cannot fully rule out imprinting effects as potential causes of ASE, because generating reciprocal F1 hybrids was not possible due to seed abortion in our C. rubella x C. grandiflora crosses. However, we do not expect imprinting to make a major contribution to the patterns we observed: in Arabidopsis, imprinting effects are prevalent only in endosperm tissue, and are rare in more advanced stage tissues such as those analyzed here [51, 54, 55].
One somewhat unexpected finding was the subtle global shift in expression levels toward higher relative expression of the C. rubella allele in our F1 hybrids. While it is difficult to completely rule out systematic biases in ASE estimation as the cause for this shift, no marked bias was present for the same SNPs and genes in our genomic data, suggesting that if systematic bioinformatic biases are the cause, the effect is specific to transcriptomic reads. While this remains a possibility, it seems unlikely to completely explain the shift in expression that we observe, as we made considerable effort to avoid reference mapping bias, including high stringency mapping of transcriptomic reads to reconstructed parental haplotypes specific to each F1. Similar global shifts toward higher expression of the alleles from one parent have also been observed in F1s of maize and teosinte [28] and Drosophila [56]. An even stronger bias toward higher expression of the A. lyrata allele was recently observed in F1s of A. thaliana and A. lyrata [32], and was attributed to interspecific differences in gene silencing.
To investigate potential mechanisms for cis-regulatory divergence, we first examined heterozygosity in regulatory regions and conserved noncoding regions close to genes. While genes with ASE in general showed slightly elevated levels of heterozygosity in putatively regulatory regions, there was no enrichment of conserved noncoding regions with heterozygous SNPs close to genes with ASE. It thus seems likely that divergence in regulatory regions in the proximity of genes, but not specifically in conserved noncoding regions, has contributed to global cis-regulatory divergence between C. rubella and C. grandiflora.
To examine biological explanations for the shift toward a higher relative expression of C. rubella alleles, we examined the relationship between TE insertions and ASE. As C. rubella harbors a lower number of TE insertions near genes than C. grandiflora, we reasoned that TE silencing might contribute to the global shift in expression toward higher relative expression of the C. rubella allele, with C. grandiflora alleles being preferentially silenced due to targeted methylation of nearby TEs, through transcriptional gene silencing mediated by 24-nt siRNAs. Our results are consistent with this hypothesis. Not only is there an association between genes with ASE and heterozygous TE insertions in our F1s, there is also reduced expression of alleles that reside on the same haplotype as a nearby TE insertion, and the reduction is particularly strong for TEs that are targeted by uniquely mapping siRNAs. In contrast, no effect on ASE is apparent for TEs that are not targeted by uniquely mapping siRNAs. Moreover, the relatively limited spatial scale over which siRNA-targeted TE insertions are associated with reduced expression of nearby genes (< 1 kb) is consistent with previous results from Arabidopsis [29, 30, 31]. We did not directly assess methylation patterns in this study, but it has been shown that data on siRNA targeting is a reliable proxy for TE methylation [29]. While other factors have probably also contributed, these findings suggest that TEs have been important for global cis-regulatory divergence between C. rubella and C. grandiflora.
Why then do C. rubella and C. grandiflora differ with respect to silenced TEs near genes? In Arabidopsis, methylated TE insertions near genes appear to be predominantly deleterious, and exhibit a signature of purifying selection [29]. It is tempting to speculate that the reduced prevalence of TE insertions near genes in C. rubella could be due to purging of recessive deleterious alleles that have become exposed to purifying selection due to increased homozygosity in this self-fertilizing species. Indeed, a recent simulation study has shown that such purging can occur rapidly upon the shift to selfing [57]. However, we prefer the alternative interpretation that deleterious alleles that were rare in the outcrossing ancestor were preferentially lost in C. rubella, mainly as a consequence of the reduction in effective population size associated with the shift to selfing in this species. The latter interpretation is more in line with analyses of polymorphism and divergence at nonsynonymous sites, for which C. rubella exhibits patterns consistent with a general relaxation of purifying selection [40]. We also note that none of the genes in narrow QTL regions that show ASE in all three F1s harbor nearby heterozygous TE insertions. Our study thus provides no evidence for a contribution of TE silencing to putatively adaptive cis-regulatory divergence.
If TE dynamics are generally important for cis-regulatory divergence in association with plant mating system shifts, we might expect different effects on cis-regulatory divergence depending not only on the genome-wide distribution of TEs, but also on the efficacy of silencing mechanisms in the host [29, 30, 58]. For instance, He et al (2012) [32] found a shift toward higher relative expression of alleles from the outcrosser A. lyrata, which harbors a higher TE content, a fact which they attributed to differences in silencing efficacy between A. thaliana and A. lyrata; indeed, TEs also showed upregulation of the A. lyrata allele [33] and A. lyrata TEs were targeted by a lower fraction of uniquely mapping siRNAs [30]. In contrast, we found no evidence for a difference in silencing efficacy between C. rubella and C. grandiflora, which harbor similar fractions of uniquely mapping siRNAs (12% vs 10% uniquely mapping/total 24-nt RNA reads for C. rubella and C. grandiflora, respectively). Thus, in the absence of strong divergence in silencing efficacy, differences in the spatial distribution of TEs might be more important for cis-regulatory divergence. More studies of ASE in F1s of selfers of different ages and their outcrossing relatives are needed to assess the general contribution of differences in silencing efficacy versus genomic distribution of TE insertions for cis-regulatory divergence in association with mating system shifts.
Conclusions
We have shown that many genes exhibit cis-regulatory changes between C. rubella and C. grandiflora and that there is an enrichment of genes with floral ASE in genomic regions responsible for phenotypic divergence. In combination with analyses of the function of genes with floral ASE, this suggests that cis-regulatory changes might have contributed to the evolution of the selfing syndrome in C. rubella. We further observe a general shift toward higher relative expression of the C. rubella allele, an observation that can at least in part be explained by elevated TE content close to genes in C. grandiflora and reduced expression of C. grandiflora alleles due to silencing of nearby TEs. These results support the idea that TE dynamics and silencing are of general importance for cis-regulatory divergence in association with plant mating system shifts.
Methods
Plant Material
We generated three interspecific C. grandiflora x C. rubella F1s by crossing two accessions of the selfer C. rubella with three different accessions of the outcrosser C. grandiflora from different populations (S12 Table). All crosses had C. grandiflora as the seed parent and C. rubella as the pollen donor, as no viable seeds were obtained from reciprocal crosses [47]. Seeds from F1s and their C. rubella parental lines were surface-sterilized and germinated on 0.5 x Murashige-Skoog medium. We transferred one-week-old seedlings to soil in pots that were placed in randomized order in a growth chamber (16 h light: 8 h dark; 20° C: 14° C). After four weeks, but prior to bolting, we sampled young leaves for RNA sequencing. Mixed-stage flower buds were sampled three weeks later, when all F1s were flowering. To assess data reliability, we collected three separate samples of leaves and flower buds from one F1 individual, and three biological replicates of one C. rubella parental line. For genomic DNA extraction, we sampled leaves from all three F1 individuals as well as from their C. rubella parents. For small RNA sequencing, we germinated six F2 offspring from one of our F1 individuals and sampled flower buds as described above.
Sample Preparation and Sequencing
We extracted total RNA for whole transcriptome sequencing with the RNeasy Plant Mini Kit (Qiagen, Hilden, Germany), according to the manufacturer’s instructions. For small RNA sequencing, we extracted total RNA using the mirVana kit (Life Technologies). For whole genome sequencing, we used a modified CTAB DNA extraction [59] to obtain predominantly nuclear DNA. RNA sequencing libraries were prepared using the TruSeq RNA v2 protocol (Illumina, San Diego, CA, USA). DNA sequencing libraries were prepared using the TruSeq DNA v2 protocol. Sequencing was performed on an Illumina HiSeq 2000 instrument (Illumina, San Diego, CA, USA) to obtain 100 bp paired-end reads, except for small RNA samples, for which single-end 50 bp reads were obtained. Sequencing was done at the Uppsala SNP & SEQ Technology Platform, Uppsala University, except for accession C. rubella Cr39.1, for which genomic DNA sequencing was done at the Max Planck Institute for Developmental Biology, Tübingen. In total, we obtained 93.9 Gbp (Q ≥30) of RNAseq data, with an average of 9.3 Gbp per sample. In addition, we obtained 45.6 Gbp (Q ≥30) of DNAseq data, corresponding to a mean expected coverage per individual of 52x, and 106,110,000 high-quality (Q ≥30) 50 bp small RNA reads. All sequence data have been submitted to the European Bioinformatics Institute (www.ebi.ac.uk), with study accession number: PRJEB9020.
Sequence Quality and Trimming
We merged read pairs from fragments spanning less than 185 nt (this also removes potential adapter sequences) in SeqPrep (https://github.com/jstjohn/SeqPrep) and trimmed reads based on sequence quality (phred cutoff of 30) in CutAdapt 1.3 [60]. For DNA and RNAseq reads, we removed all read pairs where either of the reads was shorter than 50 nt. We then analyzed each sample individually using FastQC v. 0.10.1 (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) to identify potential errors that could have occurred in the process of amplifying DNA and RNA. We assessed RNA integrity by analyzing the overall depth of coverage over annotated coding genes, using geneBody_coverage.py, which is part of the RSeQC package v. 2.3.3 [61]. For DNA reads we analyzed the genome coverage using bedtools v.2.17.0 [62] and removed all potential PCR duplicates using Picard v.1.92 (http://picard.sourceforge.net). Small RNA reads were trimmed using custom scripts and CutAdapt 1.3 and filtered to retain only reads of 24 nt length.
Read Mapping and Variant Calling
We mapped both genomic reads and RNAseq reads to the v1.0 reference C. rubella assembly [40] (http://www.phytozome.net/capsella) using STAR v.2.3.0.1 [63], with default parameters except as follows. For genomic reads we modified the default STAR settings to avoid splitting up reads, and for mapping 24-nt small RNAs we used STAR with settings modified to require perfect matches to the parental haplotypes of the F1s as well as to a TE library based on multiple Brassicaceae species and previously used in Slotte et al (2013) [40].
Variant calling was done in GATK v. 2.5-2 [64] according to GATK best practices [65, 66]. Briefly, after duplicate marking, local realignment around indels was undertaken, and base quality scores were recalibrated, using a set of 1,538,085 SNPs identified in C. grandiflora [50] as known variants. Only SNPs considered high quality by GATK were kept for further analysis. Variant discovery was done jointly on all samples using the UnifiedGenotyper, and for each F1, genotypes were phased by transmission, by reference to the genotype of its highly inbred C. rubella parental accession.
We validated our procedure for calling variants in genomic data by comparing our calls for the inbred line C. rubella 1GR1 at 176,670 sites that had been Sanger sequenced in a different individual from the same line [67]. Overall, we found 29 calls that differed between the two sets, resulting in an error rate of 0.00016, considerably lower than the level of divergence between C. rubella and C. grandiflora (0.02 [45]).
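The error rate quoted above follows directly from the two counts given in the text. The short calculation below reproduces it and expresses the divergence as a multiple of the error rate; the variable names are ours and serve only as illustration.

```python
# Counts and divergence value as quoted in the text.
discordant_calls = 29
sites_compared = 176_670
divergence = 0.02  # C. rubella vs C. grandiflora, from [45]

# Fraction of compared sites at which the two call sets disagree.
error_rate = discordant_calls / sites_compared

# How many times larger the interspecific divergence is than the error rate.
fold_difference = divergence / error_rate

print(f"error rate = {error_rate:.5f}")
print(f"divergence / error rate = {fold_difference:.0f}")
```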
Reconstruction of parental haplotypes of interspecific F1s
We reconstructed genome-wide parental haplotype sequences for each interspecific F1 and used these as a reference sequence for mapping genomic and transcriptomic reads for ASE analyses. The purpose of this was to reduce effects of read mapping biases on our analyses of ASE by increasing the number of mapped reads and reducing mismapping that can result when masking heterozygous SNPs in F1s [68].
To reconstruct parental genomes for each F1, we first conducted genomic read mapping, variant calling and phasing by reference to the inbred C. rubella parent as described in the section “Read Mapping and Variant Calling” above. The resulting phased vcf files were used in conjunction with the C. rubella reference genome sequence to create a new reference for each F1, containing both of its parental genome-wide haplotypes. Read mapping of both genomic and RNA reads from each F1 was then redone to its specific parental haplotype reference genome, and read counts at all reliable SNPs (see section “Filtering” below) were obtained using Samtools mpileup and custom software written in JavaScript by Johan Reimegard. The resulting files with allele counts for genomic and transcriptomic data were used in all downstream analyses of allelic expression biases (see section “Analysis of Allele-Specific Expression” below).
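As an illustration of the haplotype reconstruction step, the following minimal sketch substitutes phased SNP alleles into a reference sequence to produce two haplotype sequences. It is not the software used in this study. It assumes SNPs only (no indels), 1-based positions, and phased genotypes written as "0|1", and the names `PhasedSNP` and `build_haplotypes` are hypothetical.

```python
from typing import Iterable, NamedTuple, Tuple


class PhasedSNP(NamedTuple):
    pos: int       # 1-based position on the reference sequence
    ref: str       # reference base
    alt: str       # alternate base
    genotype: str  # phased genotype, e.g. "0|1" (first haplotype | second haplotype)


def build_haplotypes(reference: str, snps: Iterable[PhasedSNP]) -> Tuple[str, str]:
    """Return two haplotype sequences obtained by substituting phased SNP alleles."""
    hap_first = list(reference)
    hap_second = list(reference)
    for snp in snps:
        idx = snp.pos - 1  # convert to 0-based index
        if reference[idx].upper() != snp.ref.upper():
            raise ValueError(f"reference base mismatch at position {snp.pos}")
        first, second = snp.genotype.split("|")
        if first == "1":
            hap_first[idx] = snp.alt
        if second == "1":
            hap_second[idx] = snp.alt
    return "".join(hap_first), "".join(hap_second)


# Toy example (invented sequence and variants).
toy_reference = "ACGTACGT"
toy_snps = [PhasedSNP(2, "C", "T", "0|1"), PhasedSNP(5, "A", "G", "1|1")]
hap_a, hap_b = build_haplotypes(toy_reference, toy_snps)
print(hap_a, hap_b)
```

Writing both haplotypes into a single reference lets each read align to the parental copy it matches best, which is the rationale given above for avoiding SNP masking.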
Filtering
We used two approaches to filter the genome assembly to identify regions where we have high confidence in our SNP calls. Genomic regions with evidence for large-scale copy number variation were identified using Control-FREEC [69], and repeats and selfish genetic elements were identified using RepeatMasker (http://www.repeatmasker.org). Additionally, we identified genomic regions with unusually high proportions of heterozygous genotype calls in a lab-inbred C. rubella line, which is expected to be highly homozygous. Regions with evidence for high proportions of repeats, copy number variation or high proportion of heterozygous calls in the inbred line mainly corresponded to centromeric and pericentromeric regions, and these were removed from consideration in further analyses of allele-specific expression (S2 Fig. - S5 Fig.).
Analysis of Allele-Specific Expression
Analyses of allele-specific expression (ASE) were done using a hierarchical Bayesian method developed by Skelly et al (2011) [49]. This method has a reduced rate of false positives and naturally incorporates replicated data. It requires read counts at heterozygous coding SNPs for both genomic and transcriptomic data. Genomic read counts are used to fit the parameters of a beta-binomial distribution, in order to obtain an empirical estimate of the variation in allelic ratios that is due to technical variation alone (genomic read counts at heterozygous SNPs cannot reflect true ASE). This distribution is then used in analyses of RNAseq data, where genes are assigned posterior probabilities of exhibiting ASE. Ultimately, this results in an estimate of the posterior probability of ASE for each gene, the mean level of ASE, and the degree of variability of ASE along the gene.
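To make the role of the genomic read counts concrete, the sketch below evaluates a beta-binomial log-likelihood and picks the best-fitting dispersion parameter from a small grid. This is a deliberate simplification and not the method of Skelly et al (2011) [49], which estimates parameters by MCMC within a hierarchical Bayesian model. Here we assume a symmetric beta-binomial (both shape parameters equal, i.e. no systematic allelic bias) and use maximum likelihood over a grid. The function names and toy counts are invented for illustration.

```python
import math


def betabinom_logpmf(k, n, a, b):
    """Log-probability of k reference-allele reads out of n under BetaBinomial(n, a, b)."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    log_beta_num = math.lgamma(k + a) + math.lgamma(n - k + b) - math.lgamma(n + a + b)
    log_beta_den = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return log_choose + log_beta_num - log_beta_den


def fit_symmetric_betabinom(counts, grid):
    """Return the grid value a maximizing the likelihood of BetaBinomial(n, a, a).

    counts is a list of (allele_reads, total_reads) pairs, one per SNP.
    Larger a means allelic ratios cluster more tightly around 0.5.
    """
    def loglik(a):
        return sum(betabinom_logpmf(k, n, a, a) for k, n in counts)
    return max(grid, key=loglik)


# Toy genomic counts (invented): one tightly clustered set, one overdispersed set.
tight_counts = [(50, 100), (49, 100), (51, 100), (50, 100)]
noisy_counts = [(20, 100), (80, 100), (35, 100), (65, 100)]
grid = [1, 5, 20, 100, 500]

a_tight = fit_symmetric_betabinom(tight_counts, grid)
a_noisy = fit_symmetric_betabinom(noisy_counts, grid)
print(a_tight, a_noisy)
```

A noisier genomic data set yields a smaller fitted parameter, so a larger allelic imbalance in the RNAseq counts is needed before it stands out from technical variation.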
We conducted ASE analyses using the method of Skelly et al (2011) [49] for each of our three F1 individuals. Prior to analyses, we filtered the genomic data to only retain read counts for heterozygous SNPs in coding regions that did not overlap with neighboring genes, and following Skelly et al (2011) [49], we also removed SNPs that were the most strongly biased in the genomic data (specifically, in the 1% tails of a beta-binomial distribution fit to all heterozygous SNPs in each sample), as such highly biased SNPs may result in false inference of variable ASE if retained. The resulting data set showed very little evidence for read mapping bias affecting allelic ratios: the mean ratio of C. rubella alleles to total was 0.507 (S6 Fig).
All analyses were run in triplicate and MCMC convergence was checked by comparing parameter estimates across independent runs from different starting points, and by assessing the degree of mixing of chains. For all analyses of RNA counts, we used median estimates of the parameters of the beta-binomial distribution from analyses of genomic data for all three F1s (S7 Table). Runs were completed on a high-performance computing cluster at Uppsala University (UPPMAX) using the pqR implementation of R (http://www.pqr-project.org), for 200,000 generations or a maximum runtime of 10 days. We discarded the first 10% of each run as burn-in prior to obtaining parameter estimates.
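The burn-in and cross-run comparison described above can be sketched as follows. This is only an illustration of the bookkeeping: the 10% burn-in fraction is from the text, whereas the numeric tolerance used to compare runs is a hypothetical criterion of ours, as are the function names and the toy chains.

```python
from statistics import median


def discard_burn_in(samples, fraction=0.10):
    """Drop the first `fraction` of an MCMC chain (10% in the text)."""
    n_burn = int(len(samples) * fraction)
    return samples[n_burn:]


def compare_runs(runs, tolerance):
    """Return (agree, medians): post-burn-in medians per run and whether they
    differ by no more than `tolerance` (a hypothetical agreement criterion)."""
    medians = [median(discard_burn_in(run)) for run in runs]
    return (max(medians) - min(medians)) <= tolerance, medians


# Toy chains (invented): each starts far from its stationary value.
run_1 = [5.0] * 10 + [1.00] * 90
run_2 = [0.0] * 10 + [1.02] * 90
run_3 = [9.0] * 10 + [0.99] * 90

agree, medians = compare_runs([run_1, run_2, run_3], tolerance=0.05)
print(agree, medians)
```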
ASE Validation by qPCR
We validated ASE results by qPCR. mRNA was converted into cDNA with TaqMan® Reverse Transcription Reagents (Life Technologies, Carlsbad, CA, USA) using oligo(dT)16 primers according to the manufacturer's protocol, and qPCR was performed with Custom TaqMan® Gene Expression Assays (Life Technologies, Carlsbad, CA, USA) with FAM- and VIC-labeled probes, also according to the manufacturer's protocol. The qPCR for both alleles was multiplexed in one well to directly compare the two alleles using a Bio-Rad CFX96 Touch™ Real-Time PCR Detection System (Bio-Rad, Hercules, CA, USA). For further details see S1 Text. To exclude dye bias, we tested 5 genes using reciprocal probes labeled with VIC and FAM (S12 Table). The expression difference between the C. rubella and C. grandiflora alleles was quantified using the difference in relative expression between the two alleles, as well as the Quantification Cycle (Cq value). A lower Cq value corresponds to a higher amount of starting material in the sample. If the direction of allelic imbalance inferred by qPCR was the same as that inferred by the method of Skelly et al (2011) [49], we considered that the qPCR supported the ASE results.
Enrichment of genes with ASE in genomic regions responsible for phenotypic divergence
We tested whether there was an excess of genes with evidence for ASE (posterior probability of ASE ≥ 0.95 in all three F1 hybrids) in previously identified genomic regions harboring QTL for phenotypic divergence between C. rubella and C. grandiflora [47]. For this purpose, we concentrated on narrow QTL regions defined in a previous study [47] (i.e. QTL regions with 1.5-LOD confidence intervals <2 Mb). Significance was based on a permutation test (1000 permutations) in R 3.1.2.
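A permutation test of this kind can be sketched as below. The exact permutation scheme used in the study is not described here, so this sketch makes an explicit assumption: ASE labels are shuffled among all assessed genes while QTL membership is held fixed, and the p-value is the proportion of permutations with at least as many ASE genes inside QTL regions as observed. The function name and toy data are invented, and the sketch is in Python rather than R.

```python
import random


def permutation_enrichment(has_ase, in_qtl, n_perm=1000, seed=1):
    """One-sided permutation test for an excess of ASE genes inside QTL regions.

    has_ase, in_qtl: equal-length lists of booleans, one entry per assessed gene.
    Returns (observed_overlap, p_value).
    """
    rng = random.Random(seed)
    observed = sum(a and q for a, q in zip(has_ase, in_qtl))
    labels = list(has_ase)
    at_least_as_extreme = 0
    for _ in range(n_perm):
        rng.shuffle(labels)  # break any association between ASE and location
        overlap = sum(a and q for a, q in zip(labels, in_qtl))
        if overlap >= observed:
            at_least_as_extreme += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    p_value = (at_least_as_extreme + 1) / (n_perm + 1)
    return observed, p_value


# Toy data (invented): 200 genes, 20 with ASE, all of which fall in QTL regions.
toy_has_ase = [True] * 20 + [False] * 180
toy_in_qtl = [True] * 20 + [False] * 180
observed_overlap, p_value = permutation_enrichment(toy_has_ase, toy_in_qtl)
print(observed_overlap, p_value)
```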
List enrichment tests of GO terms
We tested for enrichment of GO biological process terms among genes with ASE in all of our F1s using Fisher exact tests in the R module TopGO [70]. GO terms were downloaded from TAIR (http://www.arabidopsis.org) on September 3rd, 2013, for all A. thaliana genes that have orthologs in the C. rubella v1.0 annotation, and we only considered GO terms with at least two annotated members in the background set. Separate tests were conducted for leaf and flower bud samples, and background sets consisted of all genes where we could assess ASE.
Intergenic heterozygosity in regulatory and conserved noncoding regions
We quantified intergenic heterozygosity 1 kb upstream of genes using VCFTools [71], and compared levels of polymorphism among genes with and without ASE using a Wilcoxon rank sum test. We further assessed whether there was an enrichment of conserved noncoding elements (identified in Williamson et al (2014) [50]) with heterozygous SNPs within 5 kb of genes with ASE, using Fisher exact tests. Separate tests were conducted for each F1.
Identification of TE insertions and association with ASE
We used PoPoolationTE [72] to identify transposable elements in our F1 parents. While intended for pooled datasets, this method can also be used on genomic reads from single individuals [48]. For this purpose we used a library of TE sequences based on several Brassicaceae species [40]. We used the default pipeline for PoPoolationTE, modified to require a minimum of 5 reads to call a TE insertion, and the procedure in Agren et al (2014) [48] to determine heterozygosity or homozygosity of TE insertions. Parental origins of TE insertions were inferred by combining information from runs on F1s and their C. rubella parents.
We tested whether heterozygous TE insertions within a range of different window sizes close to genes (200 bp, 1 kbp, 2 kbp, 5 kbp, and 10 kbp) were associated with ASE by performing Fisher exact tests in R 3.0.2. Using a Wilcoxon rank sum test, we then tested whether the expression of the allele on the same chromosome as a nearby (within 1 kbp) TE insertion was reduced relative to allelic expression at genes without nearby TE insertions. Similar tests were conducted to test for an effect on relative ASE of TE insertions with uniquely mapping siRNAs.
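The association test between nearby TE insertions and ASE can be illustrated with a 2 x 2 table and a Fisher exact test. The sketch below is in Python rather than R and rests on several assumptions of ours: a gene counts as having a nearby TE if an insertion lies inside the gene or within the window on either side, the test is one-sided (excess of TEs near ASE genes), and all coordinates are invented toy values. The function names are hypothetical.

```python
from math import comb


def has_te_within(gene_start, gene_end, te_positions, window):
    """True if any TE insertion lies in the gene or within `window` bp of it."""
    return any(gene_start - window <= pos <= gene_end + window for pos in te_positions)


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher exact p-value P(X >= a) for the table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    numerator = sum(comb(row1, x) * comb(n - row1, col1 - x) for x in range(a, upper + 1))
    return numerator / comb(n, col1)


# Toy data (invented): (gene_start, gene_end, has_ase) and TE insertion positions.
genes = [(1000, 2000, True), (5000, 6000, True), (9000, 9500, False), (20000, 21000, False)]
te_positions = [2500, 4800, 15000]
window = 1000  # one of the window sizes listed in the text (1 kbp)

# Rows: genes with ASE / without ASE. Columns: TE nearby / no TE nearby.
table = [[0, 0], [0, 0]]
for start, end, ase in genes:
    near = has_te_within(start, end, te_positions, window)
    table[0 if ase else 1][0 if near else 1] += 1

p_value_te = fisher_exact_greater(table[0][0], table[0][1], table[1][0], table[1][1])
print(table, p_value_te)
```

Repeating the same tabulation for each window size gives one test per window, as in the analysis described above.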
Acknowledgements
The authors thank Michael Nowak, Stockholm University, for valuable comments on the manuscript and Daniel Skelly, Duke University, for helpful advice on ASE analyses. Sequencing was performed by the SNP&SEQ Technology Platform in Uppsala. The facility is part of the National Genomics Infrastructure (NGI) Sweden and Science for Life Laboratory. The SNP&SEQ Platform is also supported by the Swedish Research Council and the Knut and Alice Wallenberg Foundation. Computations were performed on resources provided by SNIC through Uppsala Multidisciplinary Center for Advanced Computational Science (UPPMAX) under Project b2012122. T.S. acknowledges financial support from the Swedish Research Council, the Erik Philip-Sorensen foundation, the Nilsson-Ehle foundation, the Magnus Bergvall foundation, and the Royal Swedish Academy of Sciences. D.K. acknowledges financial support from the Human Frontier Science Program (LT000783) and Deutsche Forschungsgemeinschaft (German Research Foundation) Priority Program 1529 - ‘Adaptomics’ (WE 2897). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.