Abstract
The strength of purifying selection varies among loci and leads to differing frequencies of deleterious alleles within genomes. Selection is generally stronger for highly and broadly expressed genes but can be less efficient against deleterious alleles expressed in the diploid phase, where they can be masked when heterozygous. In plants, expression level, tissue specificity and ploidy level differ between pollen-specific and sporophyte-specific genes. This may explain why the reported strength and direction of the relationship between selection and the specificity of a gene to either pollen or sporophytic tissues varies between studies and species. In this study, we investigate the individual effects of expression level and tissue specificity on selection efficacy within pollen genes and sporophytic genes of Arabidopsis thaliana. Due to high homozygosity levels caused by selfing, masking is expected to play a lesser role. We find that expression level and tissue specificity independently influence selection in A. thaliana. Furthermore, contrary to expectations, pollen genes are evolving faster due to relaxed purifying selection and have accumulated a higher frequency of deleterious alleles. This suggests that high homozygosity levels resulting from high selfing rates reduce the effects of pollen competition and masking in A. thaliana, so that the high tissue specificity and expression noise of pollen genes lead to lower selection efficacy compared to sporophyte genes.
Introduction
Gene expression is arguably the most important component in how variation at the genetic level leads to variation at the phenotypic level, and therefore in how selection acts (Fay and Wu, 2003; Drummond et al., 2005; Rocha, 2006). Likewise, variation in expression among genes will lead to varying levels of selection within the same genome. Indeed, a significant correlation between expression level and the evolutionary rate of proteins has been reported for a wide range of taxa including bacteria (Rocha and Danchin, 2004), yeast (Pál et al., 2001; Drummond et al., 2006), Drosophila (Marais et al., 2004) and Arabidopsis thaliana (Wright et al., 2004; Wright and Andolfatto, 2008; Slotte et al., 2011; Yang and Gaut, 2011). Furthermore, selection on broadly expressed genes is generally stronger than on tissue-specific genes (Duret and Mouchiroud, 2000; Liao et al., 2006).
The restriction of a gene’s expression to reproductive tissues also has an effect on selection strength. Across a broad range of taxa, including mammals, Drosophila, mollusks and fungi, genes involved in reproduction have been reported to evolve more rapidly than somatic genes due to increased positive selection (Swanson and Vacquier, 2002; Haerty et al., 2007; Turner and Hoekstra, 2008). However, isolating the strength and direction of the relationship between the involvement of a gene in reproduction and the efficacy of selection acting on that gene is not straightforward for some plant species (Arunkumar et al., 2013; Gossmann et al., 2013; Szövényi et al., 2013). This is because of the potentially confounding effects of differences between pollen genes and sporophytic genes in expression level and breadth, but also in ploidy level. Whether a gene is expressed in a haploid or a diploid stage can also affect its visibility to selection. The masking hypothesis describes the less efficient purging of deleterious alleles in diploids than in haploids due to masking by a dominant homologue when heterozygous (Kondrashov and Crow, 1991). For example, in the outcrossing crucifer Capsella grandiflora, genes with expression restricted to the male gametophyte revealed evidence for more efficient purifying and adaptive selection than sporophytic genes (Arunkumar et al., 2013). The stronger selection on male gametophytic genes was interpreted as resulting from the combined effects of haploid expression and pollen competition. However, the relative contributions of these two factors were difficult to disentangle. In the moss Funaria hygrometrica, on the other hand, little or no difference was observed between the divergence rates of gametophyte-specific and sporophyte-specific proteins (Szövényi et al., 2013), but variation in tissue specificity, a potentially important confounding factor, was not considered.
Self-compatible Arabidopsis thaliana may offer the opportunity to isolate the contribution of the reproductive role of pollen-specific genes on selection efficacy from differences in ploidy. This is because high selfing rates lead to high homozygosity in A. thaliana populations (Nordborg, 2000; Wright et al., 2008; Platt et al., 2010), so the masking of deleterious alleles in diploid sporophyte stages compared to the haploid gametophyte stage is a priori likely to be much reduced. Furthermore, selfing reduces the magnitude of pollen competition, and so the strength of selection acting on pollen, as fewer genotypes compete for fertilization (Charlesworth and Charlesworth, 1992). In a recent study pollen-specific genes were found to contain a higher number of non-synonymous sites under purifying and adaptive selection than random genes sampled from the A. thaliana genome (Gossmann et al., 2013). Importantly though, differences in expression level and tissue specificity between gene groups were not controlled for in that study. In contrast, a further study found pollen-specific genes to be evolving faster than sporophytic genes due to relaxed purifying selection in A. thaliana (Szövényi et al., 2013). This was believed to be caused by a combination of high tissue specificity and higher expression noise in pollen compared to sporophytic genes. However, the individual effect of tissue specificity was not isolated.
In this study we aimed to isolate the individual effects of expression level, tissue specificity and the reproductive role of a gene on selection in A. thaliana. To investigate the efficacy of selection, we analyzed levels of polymorphism within 269 A. thaliana strains and sequence divergence from the sister taxon A. lyrata. We also compared the frequency of deleterious mutations (premature stop codons and frameshift mutations) among loci. Expression level was expected to correlate positively and tissue specificity negatively with selection pressure. We therefore controlled for expression level and tissue specificity when comparing pollen and sporophyte genes.
Results
Expression level, tissue specificity and life-stage limited expression are inter-related
Within the total data set containing 19,970 genes, expression level per gene (see Methods for details) ranged from 0 (not reliably detectable) to 19,470 with a median of 794.5 (IQR: 1,454) and a mean of 1,449 ± 14.6 (standard error of the mean, sem). Tissue specificity (τ), which ranged from 0 to 1.0 with a mean of 0.572 ± 0.002 (sem) and a median of 0.566 (IQR: 0.510), was significantly negatively correlated with expression level (ρ = -0.41; p < 2.2 × 10^-16; Spearman’s rank correlation). That is, broadly expressed genes were generally expressed at a higher level.
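The definition of τ is not given in this section. As a minimal sketch, assuming a commonly used form of the index, τ = Σ(1 − x_i / x_max) / (n − 1), where x_i is the expression level in tissue i and n is the number of tissues, τ could be computed per gene as follows. The function name and the handling of unexpressed genes are illustrative assumptions, not the authors' implementation.

```python
def tissue_specificity_tau(expression):
    """Return tau for one gene, given its expression level in each tissue.

    Assumed definition (not stated in this section):
        tau = sum(1 - x_i / x_max) / (n - 1)
    tau is 0 for uniform expression and 1 for expression in a single tissue.
    """
    n = len(expression)
    if n < 2:
        raise ValueError("tau needs at least two tissues")
    x_max = max(expression)
    if x_max <= 0:
        return None  # gene not detected anywhere, tau is undefined
    return sum(1 - x / x_max for x in expression) / (n - 1)


# Made-up expression profiles, for illustration only.
broad = tissue_specificity_tau([10, 10, 10, 10])    # evenly expressed gene
specific = tissue_specificity_tau([0, 0, 0, 500])   # single-tissue gene
```

Under this definition a gene detected in only one data set always has τ = 1, which is consistent with the high τ values reported for pollen-specific genes.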
Of the 16,360 genes with reliably detectable expression (see Methods), 1,503 genes were expressed only in pollen and a further 5,398 were limited to sporophytic tissues (referred to as pollen-specific genes and sporophyte-specific genes in this study). Pollen-specific and sporophyte-specific genes were randomly distributed among the five chromosomes (table 1), and their distributions within the chromosomes also did not differ significantly from each other (table 2).
Pollen and sporophyte-limited genes differed significantly from each other in terms of expression level and tissue specificity. Naturally, tissue specificity was higher among pollen genes (median: 0.934, IQR: …) than sporophyte genes (median: 0.812, IQR: 0.301), and the difference was highly significant (W = 5.7 × 10^6; p = 8.3 × 10^-130; Mann Whitney U test; fig. 1). Although broadly expressed genes were generally highly expressed, the sporophyte genes were expressed at a significantly lower level than the highly tissue-specific pollen genes (pollen median: 1,293, IQR: 2,590; sporophyte median: 659, IQR: 1,022; W = 5.3 × 10^6; p = 1.8 × 10^-69; Mann Whitney U test; fig. 1).
Expression level correlates with dN/dS, pN/pS and frequency of deleterious alleles
Sequence divergence, measured via interspecific dN/dS (rate of non-synonymous substitutions per non-synonymous site versus rate of synonymous substitutions per synonymous site between A. thaliana and A. lyrata), was significantly negatively correlated with expression level (ρ = -0.32; p < 2.2 × 10^-16; Spearman’s rank correlation; table 3). This means that genes expressed at a low level have evolved more quickly than highly expressed genes between the two taxa. To determine whether stronger purifying selection among highly expressed genes is causing their slower evolution or lowly expressed genes are in fact evolving quickly as a consequence of elevated positive selection, intraspecific pN/pS (as with dN/dS but using within-species substitution rates) was analyzed. pN/pS was also significantly negatively correlated with expression (ρ = -0.17; p < 2.2 × 10^-16; Spearman’s rank correlation; table 3, first row), meaning highly expressed genes not only diverge more slowly from the sister taxon A. lyrata, but are also less divergent between strains of A. thaliana. This is an indication of stronger purifying selection acting on highly expressed genes and relaxed selection among lowly expressed genes. This was corroborated by significant, negative correlations of expression level with the frequency of unique alleles resulting from premature stop codons (ρ = -0.12; p < 2.2 × 10^-16; Spearman’s rank correlation; table 3) and the frequency of frameshift mutations (ρ = -0.25; p < 2.2 × 10^-16; Spearman’s rank correlation; table 3). In order to control for τ, the correlations were calculated within ten sub-groups of genes according to their τ values. All correlations remained negative and the majority were significant (34 out of 40 correlations; table 3).
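The stratified analysis described above can be sketched as follows. This is an illustration only: it assumes the ten sub-groups are equal-width τ intervals of 0.1 (the text does not say whether the groups were equal-width or equal-sized), and the function names and input layout are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return None  # rho is undefined when one variable is constant
    return cov / sqrt(var_x * var_y)


def rho_within_tau_bins(genes, n_bins=10):
    """genes: iterable of (tau, expression_level, statistic) tuples.

    Returns {bin_index: rho} for expression level versus the statistic
    (for example dN/dS) within each tau bin. Equal-width bins are assumed.
    """
    bins = {}
    for tau, expression, statistic in genes:
        index = min(int(tau * n_bins), n_bins - 1)  # tau == 1.0 joins the last bin
        bins.setdefault(index, []).append((expression, statistic))
    result = {}
    for index, pairs in sorted(bins.items()):
        if len(pairs) >= 3:  # skip bins too small for a meaningful correlation
            result[index] = spearman_rho([p[0] for p in pairs],
                                         [p[1] for p in pairs])
    return result
```

If the correlation keeps its sign within every bin, as reported here, the relationship between expression level and the statistic is unlikely to be a by-product of the correlation between expression level and τ.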
Tissue specificity correlates with dN/dS, pN/pS and frequency of deleterious alleles
A significant, positive correlation existed between tissue specificity and sequence divergence (ρ = 0.25; p < 2.2 × 10^-16; Spearman’s rank correlation; table 4), suggesting more broadly expressed genes are subjected to stronger purifying selection. This was further supported by a positive correlation between τ and pN/pS (ρ = 0.17; p < 2.2 × 10^-16; Spearman’s rank correlation; table 4). The frequency of deleterious alleles also correlated positively and significantly with τ, the highest frequency of stop codons and frameshifts occurring among the most tissue-specific genes (stop codons: ρ = 0.07; p < 2.2 × 10^-16; frameshifts: ρ = 0.20; p < 2.2 × 10^-16; Spearman’s rank correlation; table 4). In order to control for the influence of expression level, the genes were allocated to four equally sized subgroups according to their expression level, and the correlations with τ were re-calculated within these subgroups. The correlations remained positive and significant for all four quartile groups for dN/dS, pN/pS, and frameshifts, and for the 3rd and 4th expression quartiles for stop codons (table 4).
Pollen genes under weaker selection
Pollen-specific genes seem to be evolving more quickly than sporophyte-specific genes in A. thaliana, as indicated by significantly higher dN/dS ratios (pollen median: 0.206, IQR: 0.217; sporophyte median: 0.164, IQR: 0.144; W = 1.3 × 10^6; p = 3.2 × 10^-14; Mann Whitney U test; fig. 2). This appears to be due to more relaxed purifying selection acting on pollen-specific genes, revealed by significantly higher pN/pS values (pollen median: 0.095, IQR: 0.208; sporophyte median: 0.072, IQR: 0.177; W = 4.0 × 10^6; p = 8.4 × 10^-6; Mann Whitney U test; fig. 2) and significantly higher frequencies of stop codons (pollen mean: 1.200 ± 0.041 sem; sporophyte mean: 0.873 ± 0.019 sem; W = 4.6 × 10^6; p = 1.1 × 10^-18; Mann Whitney U test; fig. 2) and frameshifts (pollen mean: 0.020 ± 0.002; sporophyte mean: 0.014 ± 0.001; W = 4.6 × 10^6; p = 8.3 × 10^-26; Mann Whitney U test; fig. 2) among pollen-specific genes compared to sporophyte-specific genes.
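For readers unfamiliar with the test statistic, the following sketch shows one way a Mann Whitney U statistic and an approximate two-sided p-value can be obtained. It assumes the reported W corresponds to the U statistic of the first sample (pairs in which the first-sample value is larger, with ties counted as one half), and it uses a normal approximation with tie correction but without continuity correction, so p-values may differ slightly from those of the software used for the paper. The function name is hypothetical.

```python
from collections import Counter
from math import erfc, sqrt


def mann_whitney_u(x, y):
    """Return (U, two-sided p) for two independent samples x and y.

    U counts the pairs (x_i, y_j) with x_i > y_j, plus half of the tied pairs.
    The p-value comes from a normal approximation with tie correction and
    without continuity correction.
    """
    n1, n2 = len(x), len(y)
    u = sum((a > b) + 0.5 * (a == b) for a in x for b in y)
    n = n1 + n2
    # Tie correction: t is the number of observations sharing one value.
    tie_term = sum(t ** 3 - t for t in Counter(list(x) + list(y)).values())
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance == 0:
        return u, 1.0  # all observations identical, no evidence of a shift
    z = (u - n1 * n2 / 2) / sqrt(variance)
    return u, erfc(abs(z) / sqrt(2))
```

Under the null hypothesis U is expected to be half the number of pairs. For 1,503 pollen-specific and 5,398 sporophyte-specific genes that is roughly 4.06 × 10^6 when no values are missing, which gives a sense of scale for the W values reported above.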
To test whether the more relaxed selection pressure on pollen-specific genes was due to their higher tissue specificity, divergence, polymorphism and the frequency of deleterious alleles were also compared using only the highly tissue-specific genes of each group. Among the 1,690 sporophyte-specific genes (31.3%) and 790 pollen-specific genes (52.6%) with a τ value between 0.9 and 1.0, divergence, polymorphism and frequency of deleterious alleles remained significantly higher among the pollen-specific gene subset (fig. 3).
Within these gene sub-groups of high tissue specificity, expression was significantly higher within pollen-specific genes than sporophyte-specific genes. In order to control for expression level, we further analyzed those highly tissue-specific genes (τ ≥0.9), which had an expression level over 1,000. Within this group neither expression nor τ differed significantly between pollen-specific and sporophyte-specific genes. However, dN/dS, pN/pS, stop codons and frameshifts were all significantly higher among pollen-specific than sporophyte-specific genes (fig. 4).
Discussion
We investigated the role of three factors on the efficacy of selection on genes in Arabidopsis thaliana: expression level, tissue specificity and the restriction of expression to pollen. Higher selection efficacy was expected among highly and broadly expressed genes and even more so in pollen genes compared to sporophyte genes.
First, we found a significant negative correlation between expression level and rates of protein evolution (dN/dS), polymorphism (pN/pS) and the frequency of deleterious alleles (stop codon and frameshift mutations). Second, there is a significant positive correlation between tissue specificity and dN/dS, pN/pS, stop codon frequency and frameshift frequency. Third, dN/dS, pN/pS and the frequency of deleterious mutations were all significantly higher among pollen genes than sporophyte genes, even when controlling for tissue specificity and expression level.
The importance of gene visibility to selection
The negative correlation of expression level with dN/dS and pN/pS indicates a positive relationship between expression level and purifying selection. This suggests that highly expressed genes are more likely to be constrained by purifying selection, whereas genes expressed at lower levels are less constrained. Indeed, more relaxed selection reducing the purging of deleterious alleles is confirmed by the significantly higher frequency of deleterious alleles among lowly expressed genes. Purifying selection was also stronger for broadly expressed genes, while the faster evolution of tissue-specific genes suggests relaxed selection. Importantly, however, although tissue specificity and expression level were significantly negatively correlated with each other, correlations with dN/dS and pN/pS remained significant when each was controlled for.
The effect of expression level (Rocha and Danchin, 2004; Pál et al., 2001; Drummond et al., 2006; Marais et al., 2004; Wright et al., 2004; Wright and Andolfatto, 2008; Slotte et al., 2011; Yang and Gaut, 2011) and tissue specificity (Duret and Mouchiroud, 2000; Liao et al., 2006) on selection has been confirmed in many previous studies for a broad range of taxa. Importantly, in this study we have confirmed that both factors independently have a significant effect on the efficacy of selection acting on genetic variation in A. thaliana.
Purifying selection is more relaxed for pollen-specific genes
Contrary to our expectations, we have discovered evidence for more relaxed purifying selection among genes exclusively expressed in pollen compared to sporophyte limited genes. This was true despite significantly higher expression levels among pollen genes and remained true when controlling for tissue specificity by comparing pollen genes only with the most tissue specific sporophyte genes. Therefore, the faster evolutionary rates of pollen-specific compared to sporophyte-specific genes due to relaxed purifying selection cannot be explained by differences in expression level or tissue specificity.
These results are in contrast to the findings of two recent studies, in which pollen genes were found to be under stronger purifying and adaptive selection than sporophyte genes in Capsella grandiflora (Arunkumar et al., 2013) and A. thaliana (Gossmann et al., 2013). The results of the A. thaliana study were based on a comparison between pollen-specific genes and a relatively small group of 476 random genes (excluding reproductive genes), presumably comprising mainly sporophytic genes (Gossmann et al., 2013). In this comparison differences in expression level and tissue specificity between gene groups were not controlled for. However, we have shown here that pollen-specific genes are expressed at a significantly higher level than sporophytic genes as previously shown for Arabidopsis (Honys and Twell, 2003), making them more visible to selection. This was even more apparent in the Gossmann et al. study (2013) because they separated sperm-specific genes, which are generally expressed at a lower level, from pollen-specific genes.
In the case of the outcrossing C. grandiflora, the more efficient purifying and adaptive selection on pollen genes was linked to two possible factors: haploid expression and pollen competition. A. thaliana is a highly self-fertilizing species with selfing rates generally in the range of 95 - 99% (Platt et al., 2010), so a priori haploid expression is unlikely to improve the efficacy of selection on pollen-specific genes relative to sporophyte genes. This is because most individuals found in natural populations are homozygous for the majority of loci, reducing the masking of deleterious alleles in heterozygous state when expressed in a diploid tissue (Platt et al., 2010).
But even in the complete absence of masking, pollen competition may be expected to generate more effective selection on pollen genes than sporophyte genes. A reduction in pollen competition can be expected due to the probably limited number of pollen genotypes in highly selfing populations (Charlesworth and Charlesworth, 1992; Mazer et al., 2010). However, outcrossing does occur in natural A. thaliana populations with one study reporting an effective outcrossing rate in one German population of 14.5% (Bomblies et al., 2010). Nevertheless, it appears that these generally rare outcrossing events may not be sufficient to prevent a reduction in pollen competition for A. thaliana.
So if we assume both masking and pollen competition are negligible forces when comparing selection on pollen-specific genes to sporophyte-specific genes, why is selection more relaxed among pollen-specific genes than sporophyte-specific genes? In fact, our results confirm recent findings indicating relaxed purifying selection in pollen-specific genes compared to sporophytic genes in A. thaliana (Szövényi et al., 2013), a pattern explained by a combination of high tissue specificity and higher expression noise in pollen compared to sporophytic genes. However, the authors did not compare selection on pollen genes to selection on tissue-specific sporophyte genes, which left tissue specificity alone as a possible alternative explanation. We have shown here that tissue specificity does not explain why selection is more relaxed among pollen genes, as divergence, polymorphism and the frequency of deleterious alleles were still significantly lower in tissue-specific sporophyte genes than in pollen-specific genes. Higher expression noise could, however, be an important factor influencing the level of deleterious alleles which exist for pollen genes in A. thaliana.
Expression noise has been found to reduce the efficacy of selection substantially and is expected to be considerably higher for haploid expressed genes (Wang and Zhang, 2011). It is, therefore, likely that in the absence of pollen competition and of the masking of deleterious alleles in sporophyte-specific genes, expression noise becomes a dominant factor for pollen-specific genes of selfing plants. This leads to a reduction in selection efficacy and the accumulation of deleterious alleles in pollen-specific genes.
Conclusion
Our results confirm the effect of both expression level and tissue specificity on selection efficacy. In outcrossing plants, haploid expression and pollen competition, combined with high expression levels, outweigh the negative impact of high tissue specificity and expression noise on the selection efficacy of pollen-specific genes. In the self-compatible A. thaliana, high homozygosity likely reduces the counteracting effects of pollen competition and haploid expression, leading to lower selection efficacy and increased accumulation of deleterious mutations in pollen-specific compared to sporophyte-specific genes.
Methods
Genomic data
Publicly available variation data were obtained for 269 inbred strains of A. thaliana. Besides the reference genome of the Columbia strain (Col-0), which was released in 2000 (Arabidopsis Genome Initiative, 2000), 250 were obtained from the 1001 genomes data center (http://1001genomes.org/datacenter/; accessed September 2013), 170 of which were sequenced by the Salk Institute (Schmitz et al., 2013) and 80 at the Max Planck Institute, Tübingen (Cao et al., 2011). A further 18 were downloaded from the 19 genomes project (http://mus.well.ox.ac.uk/; accessed September 2013; Gan et al., 2011). These 268 files contained information on SNPs and indels recorded for the separate inbred strains relative to the reference genome. A quality filter was applied to all files in order to retain only SNPs and indels with a phred score of at least 25.
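The quality filter can be illustrated with a short sketch. The record layout (a dictionary with a "phred" key) and the function names are assumptions made for the example, since the format of the variant files is not described here. Only the threshold of 25 comes from the text.

```python
MIN_PHRED = 25  # threshold stated in the text


def phred_to_error_probability(phred):
    """Convert a phred score Q into an error probability, P = 10^(-Q/10)."""
    return 10 ** (-phred / 10)


def filter_variants(variants, min_phred=MIN_PHRED):
    """Keep only SNP and indel records whose phred score reaches the threshold.

    variants: iterable of dicts with at least a "phred" key (hypothetical layout).
    """
    return [v for v in variants if v["phred"] >= min_phred]


# Made-up records, for illustration only.
example = [
    {"pos": 1201, "ref": "A", "alt": "G", "phred": 40},
    {"pos": 1544, "ref": "C", "alt": "CT", "phred": 12},
    {"pos": 2010, "ref": "T", "alt": "C", "phred": 25},
]
kept = filter_variants(example)  # the record with phred 12 is discarded
```

A phred score of 25 corresponds to an error probability of about 0.003, so the filter retains calls that are expected to be wrong in roughly 3 of 1,000 cases or fewer.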
Expression data
Normalized microarray data, covering 19,970 genes across different developmental stages and tissues of A. thaliana (table 5), were obtained from Borg et al. (2011). The expression data consisted of 7 pollen and 10 sporophyte data sets (table 5). Four of the pollen data sets represented expression patterns of the pollen developmental stages (uninucleate, bicellular, tricellular and mature pollen grain), one contained expression data of sperm cells and the remaining two were pollen tube data sets. There was a strong, significant correlation between the two pollen tube data sets (ρ = 0.976; p < 2.2 × 10^-16; Spearman’s rank correlation), so both were combined and the higher expression value of the two sets was used for each gene. Each of the 10 sporophyte data sets contained expression data for specific sporophytic tissues (table 5).
Each expression data point consisted of a normalized expression level (ranging from 0 to around 20,000, scalable and linear across all data points and data sets) and a presence score ranging from 0 to 1 based on its reliability of detection across repeats, as calculated by the MAS5.0 algorithm (Borg et al., 2011). In our analyses expression levels were conservatively considered as present if they had a presence score of at least 0.9, while all other values were regarded as zero expression. All analyses were repeated using the less conservative cut-off values of 0.7 and 0.5 (data not shown). This did not change the tendency of the results obtained with the 0.9 cut-off.
Genes were classed as either pollen or sporophyte-specific genes, if expression was reliably detectable in only pollen or only sporophyte tissues or developmental stages. The highest expression value across all tissues or developmental stages was used to define the expression level of a particular gene.
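The classification rules above can be sketched as follows. The input layout, the function name and the labels "both" and "not detected" are hypothetical choices for the example. The presence cut-off of 0.9, the zeroing of unreliable values and the use of the highest value as the expression level follow the text.

```python
PRESENCE_CUTOFF = 0.9  # conservative cut-off stated in the text


def classify_gene(pollen, sporophyte, cutoff=PRESENCE_CUTOFF):
    """Classify one gene and return (category, expression_level).

    pollen, sporophyte: lists of (expression_level, presence_score) tuples,
    one tuple per data set (hypothetical layout).
    """
    def reliable(data):
        # Values below the presence cut-off are regarded as zero expression.
        return [level if score >= cutoff else 0.0 for level, score in data]

    pollen_levels = reliable(pollen)
    sporophyte_levels = reliable(sporophyte)
    in_pollen = any(level > 0 for level in pollen_levels)
    in_sporophyte = any(level > 0 for level in sporophyte_levels)

    # Expression level of the gene: highest value across all data sets.
    expression_level = max(pollen_levels + sporophyte_levels, default=0.0)

    if in_pollen and not in_sporophyte:
        category = "pollen-specific"
    elif in_sporophyte and not in_pollen:
        category = "sporophyte-specific"
    elif in_pollen and in_sporophyte:
        category = "both"
    else:
        category = "not detected"
    return category, expression_level
```

For example, a gene with a pollen value of 500 at presence score 0.95 and a sporophyte value of 800 at presence score 0.6 is classed as pollen-specific with an expression level of 500, because the sporophyte value does not pass the cut-off. Lowering the cut-off to 0.5 would move the same gene into the "both" category, which is why the analyses were repeated with less conservative cut-offs.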
Detecting signatures of selection
Evolutionary Rates
To estimate evolutionary rates of genes, dN/dS ratios (ratio of non-synonymous to synonymous substitution rates relative to the number of corresponding non-synonymous and synonymous sites) were calculated for all orthologous genes (15,772) between A. thaliana and A. lyrata, based on the TAIR 9 genome release (Szövényi et al., 2013).
Intra-specific polymorphism
pN/pS ratios were calculated with the yn00 programme within PAML (Phylogenetic Analysis by Maximum Likelihood, version 4.6, Yang 2007) for each pairwise comparison of strains. The individual pN/pS estimates achieved via the Nei-Gojobori method were extracted from the output files and averaged across all pairwise comparisons for each gene.
Putatively deleterious alleles
To quantify the frequency of deleterious mutations for each gene, the occurrence of premature stop codons and frameshifts was calculated for each gene locus among all 269 strains. Stop codons were recorded as the number of unique alternative alleles occurring within the 269 strains as a result of a premature stop codon. Frameshifts were calculated as the proportion of strains containing a frameshift mutation for a particular gene. All analyses of coding regions were based on the representative splice models of the 27,202 A. thaliana genes (TAIR10 genome release, www.arabidopsis.org).
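The two measures can be illustrated with a simplified sketch. It assumes that each strain's coding sequence has already been reconstructed from the reference splice model and the strain's variants, that a premature stop is any in-frame stop codon before the final codon, and that a frameshift is an indel whose length change is not a multiple of three. All function names are hypothetical, and the actual pipeline may have differed.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def has_premature_stop(cds):
    """True if an in-frame stop codon occurs before the final codon."""
    cds = cds.upper()
    codons = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
    return any(codon in STOP_CODONS for codon in codons[:-1])


def is_frameshift(ref, alt):
    """True if an indel changes the sequence length by a non-multiple of three."""
    return (len(alt) - len(ref)) % 3 != 0


def count_unique_stop_alleles(reference_cds, strain_cds):
    """Number of distinct non-reference alleles carrying a premature stop."""
    alleles = {s.upper() for s in strain_cds} - {reference_cds.upper()}
    return sum(has_premature_stop(allele) for allele in alleles)


def frameshift_proportion(strain_has_frameshift):
    """Proportion of strains carrying a frameshift mutation in the gene."""
    return sum(strain_has_frameshift) / len(strain_has_frameshift)
```

Note that the two measures are on different scales: the stop codon measure is a count of distinct alleles per gene, whereas the frameshift measure is a proportion of strains. This matches the difference in magnitude between the stop codon means (around 1) and frameshift means (around 0.02) reported in the Results.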
Author contributions
All four authors developed the project idea and were involved in the interpretation of data and finalization of the manuscript. MCH analyzed the data and drafted the manuscript.
Acknowledgements
MCH was supported by a PhD research grant from the Natural Environment Research Council (NERC). DT would like to acknowledge financial support from the UK Biotechnology and Biological Science Research Council (BBSRC).