Abstract
Gene duplication has played an important role in the evolution and domestication of flowering plants. Yet little is known about how plant duplicate genes evolve and are retained over long timescales, particularly those arising from small-scale duplication (SSD) rather than whole-genome duplication (WGD) events. Here we address this question in the Poaceae (grass) family by analyzing gene expression data from nine tissues of Brachypodium distachyon, Oryza sativa japonica (rice), and Sorghum bicolor (sorghum). Consistent with theoretical predictions, expression profiles of most grass genes are conserved after SSD, suggesting that functional conservation is the primary outcome of SSD in grasses. However, we also uncover support for widespread functional divergence, much of which occurs asymmetrically via the process of neofunctionalization. Moreover, neofunctionalization preferentially targets younger (child) duplicate gene copies, is associated with RNA-mediated duplication, and occurs quickly after duplication. Further analysis reveals that functional divergence of SSD-derived genes is positively correlated with both sequence divergence and tissue specificity in all three grass species, and particularly with anther expression in B. distachyon. Therefore, as found in many animal species, SSD-derived grass genes often undergo rapid functional divergence that may be driven by natural selection on male-specific phenotypes.
INTRODUCTION
Angiosperms, or flowering plants, compose one of the most evolutionarily and phenotypically diverse groups of eukaryotes. Findings stemming from comparative genomic and experimental studies have led researchers to hypothesize that this extraordinary diversity is primarily a product of gene duplication events (Zhang 2003; Flagel and Wendel 2009; Van de Peer, et al. 2009). For one, duplicate genes are more abundant in angiosperms than in any other sequenced taxonomic group (Zhang 2003; Flagel and Wendel 2009), and differences in numbers of duplicates often contribute to genome sizes that differ by many orders of magnitude, even between closely related species (Flavell, et al. 1974; Bennetzen, et al. 2005). Second, a number of studies have shown that gene duplication can promote the origin of novel plant phenotypes (Flagel and Wendel 2009; Van de Peer, et al. 2009), and that it was likely a key driving factor in the domestication of flowering plants (Hilu 1993; Dubcovsky and Dvorak 2007; Meyer, et al. 2012; Salman-Minkov, et al. 2016). However, many of these findings are associated with studies of duplicates derived from whole-genome duplication (WGD) events, which occurred several times during the past 200 million years of angiosperm evolution (Paterson, et al. 2004; Lockton and Gaut 2005; Cui, et al. 2006; Van de Peer, et al. 2009; Jiao, et al. 2011; Rensing 2014; Panchy, et al. 2016). Yet substantial evidence shows that, in both plants and animals, duplicates deriving from WGD and small-scale duplication (SSD) events differ in quantifiable ways, such as evolutionary rate, essentiality, and function (Hakes, et al. 2007; Maere and Van de Peer 2010; Carretero-Paulet and Fares 2012; Rensing 2014). Therefore, an open question is how SSD-derived genes in angiosperms evolve and are retained over long evolutionary timescales.
In the simplest case, SSD creates two copies of an ancestral single-copy gene. Considering directionality of duplication, the copy representing the ancestral gene is often called the “parent”, whereas the copy generated by duplication is termed the “child” (Han and Hahn 2009; Assis and Bachtrog 2013). Four mechanisms may underlie the evolution and long-term retention of gene copies in such a scenario. First, under conservation, the ancestral function is preserved in each copy after duplication, perhaps due to a beneficial effect of increased gene dosage (Ohno 1970). Second, under neofunctionalization, one copy preserves the ancestral function, whereas the other copy acquires a new function (Ohno 1970). Third, under subfunctionalization, the ancestral function is divided between copies (Force, et al. 1999; Stoltzfus 1999). Last, under specialization, rapid subfunctionalization is followed by neofunctionalization, resulting in both copies having distinct functions from one another and from their ancestral gene (He and Zhang 2005; Rastogi and Liberles 2005).
Though examples of all of these hypothesized retention mechanisms exist in angiosperms (Duarte, et al. 2005; Throude, et al. 2009; Marcussen, et al. 2010; Bekaert, et al. 2011; Aklilu, et al. 2013; Ma, et al. 2015; Zhang, et al. 2015), their relative abundances on a genome-wide scale remain unknown. One of the reasons for this gap in knowledge is the lack of methods for assessing functional divergence after gene duplication. To overcome this obstacle and distinguish among retention mechanisms of duplicate genes, researchers developed a phylogenetic approach that compares expression profiles between the ancestral single-copy gene in one species and the parent and child copies arising from a SSD event in a closely related sister species (Assis and Bachtrog 2013). Application of their approach to RNA-seq data from two Drosophila species suggested that approximately 65% of duplicate genes underwent neofunctionalization (Assis and Bachtrog 2013). Further analyses revealed that neofunctionalization often occurs within a few million years of duplication, results in acquisition of new functions by child copies that arose via RNA-mediated mechanisms, and generates testis-specific gene functions (Assis and Bachtrog 2013; Assis 2014). In contrast, examination of RNA-seq data from eight mammals showed that only 33% of duplicate genes were retained by neofunctionalization (Assis and Bachtrog 2015). The majority of duplicates were instead retained by conservation, and expression divergence was found to occur more gradually in mammals than in Drosophila, result in acquisition of new functions equally by parents and children, and generate a diversity of tissue-specific gene functions (Assis and Bachtrog 2015).
Natural selection may act more efficiently on Drosophila duplicate genes because Drosophila species have much larger effective population sizes (Ne) than mammals (Lynch and Conery 2003), which may have contributed to the higher rates of expression divergence observed in Drosophila duplicate genes (Assis and Bachtrog 2013, 2015). In particular, the efficiency of selection is proportional to Ne×s, where s is the selective advantage of a beneficial mutation (Kimura 1983; Charlesworth 2009). Therefore, because angiosperms have Ne comparable to those of mammals (Lynch and Conery 2003; Ai et al. 2012; Adugna and Bekele 2015; Stritt et al. 2017), we might expect similar levels of expression conservation between duplicate genes of flowering plants and those of mammals.
In this study, we assess the genome-wide roles of duplicate gene retention mechanisms after SSD in three closely related self-pollinating (Beachell, et al. 1938; Dje, et al. 2000; Gordon, et al. 2014) angiosperms in the Poaceae (grass) family: Brachypodium distachyon, Oryza sativa japonica (rice), and Sorghum bicolor (sorghum). B. distachyon and O. sativa japonica share a more recent common ancestor 40-54 million years ago (MYA), and the most recent common ancestor of all three species occurred 45-60 MYA (Bowers, et al. 2005; Bennetzen 2007; Paterson, et al. 2009; International Brachypodium Initiative 2010). Grasses represent an interesting evolutionary system because they are agriculturally important (Davidson, et al. 2012) and, thus, have undergone domestication events in their recent evolutionary histories. Further, these three grass species are ideal for comparison due to the availability of RNA-seq data from the same nine tissues (leaf, anther, endosperm, early inflorescence, emerging inflorescence, pistil, embryo, seed 5 days after pollination, and seed 10 days after pollination) that were obtained in a single lab under similar experimental conditions (Davidson, et al. 2012). Hence, we have a powerful toolkit with which to assess expression divergence after SSD in grasses.
RESULTS
Retention mechanisms of SSD-derived duplicates in grasses
A primary goal of our study was to understand how a pair of SSD-derived grass duplicate genes evolves and is retained after its emergence from a single-copy ancestral gene. Therefore, considering the phylogenetic tree depicted in Figure 1, we were interested in pairs of duplicates that arose via SSD along the lineages of B. distachyon and O. sativa japonica after their divergence from S. bicolor (orange stars), B. distachyon after its divergence from O. sativa japonica (blue stars), O. sativa japonica after its divergence from B. distachyon (green stars), and S. bicolor after its divergence from B. distachyon and O. sativa japonica (purple stars). To identify such duplicates, we obtained a table of gene family sizes for 16 monocots and their full species phylogeny from the PLAZA 3.0 database (Proost, et al. 2014; Figure S1). Then, we used a maximum likelihood-based approach (Csűrös 2010) to ascertain all duplications and losses that occurred along the monocot phylogeny. We applied parsimony rules to identify pairs of duplicates that arose along the branches indicated in Figure 1 (see Figure S1 for full tree). It is important to note that the most recent WGD event in monocots occurred approximately 65 MYA (Jiao, et al. 2011), which is before the divergence of B. distachyon, O. sativa japonica, and S. bicolor. Therefore, given the size of the monocot tree and number of outgroups considered, the duplications that we extracted with this approach are more likely to have been created by SSD than by WGD events. Next, we required that both duplicate genes, as well as their single-copy ancestral gene in the closer of the two sister species considered, be expressed in at least one tissue (see Materials and Methods for details). This analysis yielded 272 SSD-derived gene pairs in B. distachyon, 289 pairs in O. sativa japonica, and 340 pairs in S. bicolor (Figure 1).
Using sequence and synteny information, we inferred the most likely parent and child copy for each pair of duplicates in this dataset (see Materials and Methods for details).
To classify retention mechanisms of SSD-derived grass duplicate genes, we applied the phylogenetic method developed by Assis and Bachtrog (Assis and Bachtrog 2013) to expression profiles constructed from RNA-seq data in nine tissues (Davidson, et al. 2012) of single-copy, ancestral, parent, and child genes of B. distachyon, O. sativa japonica, and S. bicolor. In particular, this method (Assis and Bachtrog 2013) first utilizes the distribution of Euclidian distances between expression profiles of single-copy genes to establish a cutoff that represents the expected expression divergence between two species. Next, it computes Euclidian distances between ancestral and parent expression profiles, ancestral and child expression profiles, and ancestral and combined parent-child expression profiles. Last, it classifies retention mechanisms of each pair of duplicates based on phylogenetic rules. Briefly, the expression profile of the ancestral gene is expected to be similar to those of both the parent and child under conservation, to those of one copy but not the other under neofunctionalization, and to those of neither copy under subfunctionalization or specialization. Distinguishing between subfunctionalization and specialization requires an additional comparison of ancestral and combined parent-child expression profiles. Similarity between these expression profiles suggests that the function of the ancestral gene was subdivided between parent and child copies due to subfunctionalization, whereas dissimilarity points to functional divergence among all three genes due to specialization (Assis and Bachtrog 2013).
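The distance step of this method can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that each expression profile is first converted to relative expression (each tissue's share of the gene's total), and that the combined parent-child profile is obtained by summing the two copies tissue by tissue before normalizing; neither detail is stated in this section, and the function names are hypothetical.

```python
import math

def relative_profile(expr):
    """Scale per-tissue expression values to sum to 1 (assumed normalization)."""
    total = sum(expr)
    if total <= 0:
        raise ValueError("gene is not expressed in any tissue")
    return [x / total for x in expr]

def euclidean_distance(profile_a, profile_b):
    """Euclidean distance between two equal-length expression profiles."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same tissues")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(profile_a, profile_b)))

def duplicate_distances(parent, child, ancestral):
    """Return the three distances compared against the divergence cutoff."""
    anc = relative_profile(ancestral)
    # Assumption: the combined profile is the tissue-wise sum of both copies.
    combined = [p + c for p, c in zip(parent, child)]
    return {
        "parent_vs_ancestral": euclidean_distance(relative_profile(parent), anc),
        "child_vs_ancestral": euclidean_distance(relative_profile(child), anc),
        "combined_vs_ancestral": euclidean_distance(relative_profile(combined), anc),
    }
```

In a toy two-tissue case where the parent is expressed only in the first tissue, the child only in the second, and the ancestral gene equally in both, each copy is distant from the ancestor while the combined profile matches it, which is the pattern expected under subfunctionalization.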
Application of the described classification approach (Assis and Bachtrog 2013) uncovered similar proportions of each retention mechanism among B. distachyon, O. sativa japonica, and S. bicolor SSD-derived duplicates (Table 1). Therefore, it appears that genes in all three grass species traverse similar evolutionary paths after SSD. In total, 60.6% of SSD-derived grass duplicates are conserved, 23.8% are neofunctionalized, 0.4% are subfunctionalized, and 15.2% are specialized. Hence, conservation is the most prevalent retention mechanism, indicating that SSD typically results in increased gene dosage in grasses. This level of functional conservation is higher than observed in Drosophila (Assis and Bachtrog 2013) and similar to that observed in mammals (Assis and Bachtrog 2015). Thus, our observation is consistent with the smaller Ne of grass and mammalian species compared with Drosophila (Lynch and Conery 2003; Ai et al. 2012; Adugna and Bekele 2015; Stritt et al. 2017).
Contribution of duplication mechanism to expression divergence of SSD-derived grass duplicates
Despite a prominent role of conservation, over one-third of SSD-derived grass duplicate genes undergo expression divergence, most of which occurs asymmetrically via neofunctionalization. This pattern of asymmetric expression divergence is consistent with findings in both Drosophila (Assis and Bachtrog 2013) and mammals (Assis and Bachtrog 2015). However, as in Drosophila (Assis and Bachtrog 2013) but not mammals (Assis and Bachtrog 2015), neofunctionalization in grasses is also biased in that approximately 72% of B. distachyon, 70% of O. sativa japonica, and 89% of S. bicolor neofunctionalized genes are child copies (Table 1). In Drosophila, this bias was associated with RNA-mediated duplication (Assis and Bachtrog 2013; Assis 2014), which produces child copies lacking the introns and regulatory elements of their ancestral genes. The new genomic context of RNA-mediated child duplicates may increase their likelihood of possessing or acquiring novel gene functions (Kaessmann, et al. 2009). Therefore, we hypothesized that RNA-mediated duplication may contribute to biased neofunctionalization of children in grasses as well. To test this hypothesis, we compared observed and expected counts of DNA- and RNA-mediated duplicates retained by conservation, neofunctionalization of parents, neofunctionalization of children, and specialization (Table 2; see Materials and Methods for details). Indeed, there is an overrepresentation of RNA-mediated duplicates retained by neofunctionalization of children (P = 0.01, χ2 test; see Materials and Methods for details), but not by any other mechanism. This finding indicates that RNA-mediated duplication is more likely to generate children with novel functions in grasses. Moreover, because this pattern exists in both grasses and Drosophila (Assis and Bachtrog 2013), it is possible that RNA-mediated duplication acts as a reservoir of functional innovation across many diverse taxonomic groups.
If RNA-mediated duplication contributes to neofunctionalization in grasses, then we might expect expression divergence to occur either as a byproduct of SSD or soon afterward. Therefore, next we were interested in ascertaining the timing of expression divergence after SSD in grasses. If expression divergence is rapid, then we expect the frequencies of retention mechanisms to be similar among duplicates that arose at different time points in monocot evolution, as was observed in Drosophila (Assis and Bachtrog 2013). Alternatively, if expression divergence occurs more gradually after SSD, then we expect higher frequencies of conservation in duplicates that arose more recently and higher frequencies of divergence in those that arose more distantly in the past, as was observed in mammals (Assis and Bachtrog 2015). To address this question in grasses, we divided the duplicates in our dataset into three age classes based on when SSD occurred along the monocot phylogeny and compared observed and expected counts of retention mechanisms in each age class (Tables S1-S3; see Materials and Methods for details). Consistent with findings in Drosophila (Assis and Bachtrog 2013), but not in mammals (Assis and Bachtrog 2015), proportions of retention mechanisms are similar among duplicates that arose by SSD at different time points in all three species. Therefore, it appears that functional divergence of SSD-derived grass duplicates often occurs either as a consequence of duplication or shortly afterward.
Sequence- and tissue-specific correlates of expression divergence of SSD-derived grass duplicates
Previous studies have demonstrated that expression divergence is often positively correlated with protein-coding sequence divergence of duplicate genes in many species (Gu, et al. 2002; Makova and Li 2003; Conant and Wagner 2004; Zhang, et al. 2004; Li, et al. 2005; Assis and Bachtrog 2013; Chau and Goodisman 2017). To assess this relationship in grasses, we calculated Pearson’s correlation coefficients (r) between expression divergence (Euclidian distance) and both nonsynonymous sequence divergence (Ka) and nonsynonymous-to-synonymous sequence divergence (Ka/Ks) rates of each SSD-derived duplicate gene and its ancestral gene in B. distachyon, O. sativa japonica, and S. bicolor (Figure 2; see Materials and Methods for details). In all three species, there is a moderately strong positive correlation between expression divergence and Ka (Figure 2A; r = 0.40 − 0.48; P < 0.001 for all comparisons, t tests; see Materials and Methods for details), and a weak positive correlation between expression divergence and Ka/Ks (Figure 2B; r = 0.10 − 0.17, P < 0.05 for all comparisons, t tests; see Materials and Methods for details). Thus, expression divergence of SSD-derived duplicates is significantly associated with protein-coding sequence divergence rates, suggesting that expression patterns and encoded proteins of grass duplicate genes evolve in tandem.
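As an illustration of the statistic reported here, the following Python sketch computes Pearson's r from paired values and the associated t statistic, t = r·sqrt((n − 2)/(1 − r²)), which has n − 2 degrees of freedom. The function names are hypothetical and the inputs are placeholders; this is not the authors' analysis code, and conversion of t to a P value is left to a statistics library.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 3:
        raise ValueError("need two sequences of equal length with n >= 3")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)

def t_statistic(r, n):
    """t statistic for testing r = 0 with n - 2 degrees of freedom (requires |r| < 1)."""
    return r * math.sqrt((n - 2) / (1 - r * r))
```

Here `xs` would hold the expression divergence (Euclidian distance) of each duplicate from its ancestral gene and `ys` the matching Ka or Ka/Ks values.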
Moreover, expression divergence of SSD-derived duplicate genes is associated with increased tissue specificity in both Drosophila (Assis and Bachtrog 2013) and mammals (Assis and Bachtrog 2015). To assess this relationship in SSD-derived grass duplicates, we computed Pearson’s correlation coefficients (r) between expression divergence (Euclidian distance) of each duplicate gene from its ancestral copy and its tissue specificity index τ (Yanai, et al. 2004; see Materials and Methods for details) in B. distachyon, O. sativa japonica, and S. bicolor (Figure 3A). Consistent with results in Drosophila (Assis and Bachtrog 2013) and mammals (Assis and Bachtrog 2015), there is a strong positive correlation between tissue specificity and expression divergence of SSD-derived duplicate genes in all three grass species (r = 0.80 − 0.87; P < 0.001 for all comparisons, t tests; see Materials and Methods for details). Thus, increased expression divergence of SSD-derived grass duplicates is associated with greater tissue specificity.
Whereas SSD-derived duplicate genes in Drosophila are primarily testis-specific (Betrán, et al. 2002; Levine, et al. 2006b; Zhou, et al. 2008; Assis and Bachtrog 2013), those in mammals are expressed specifically in a diversity of tissues (Assis and Bachtrog 2015). Therefore, our next question was whether there are particular tissues in which SSD-derived duplicates tend to be expressed in grasses. To answer this question, we designated the tissue in which each gene has its highest expression as its primary tissue, and compared the observed primary tissues to those expected based on primary tissues of single-copy genes (Figure 3B; see Materials and Methods for details). After correcting for multiple comparisons (see Materials and Methods for details), our analysis yielded two significant findings. First, there is an underrepresentation of leaf-expressed duplicates in S. bicolor (P = 1.84×10−6, binomial test; see Materials and Methods for details). Because leaf is the only tissue assayed that is not related to reproduction, this result suggests that duplicates in S. bicolor are typically expressed in reproductive tissues. Second, we discovered an overrepresentation of anther-expressed duplicates in B. distachyon (P = 0.02, binomial test; see Materials and Methods for details). Because the anther produces pollen grains (Goldberg, et al. 1993), this result suggests that SSD-derived B. distachyon duplicates are involved in male-specific reproduction, as is common in many animal species (Betrán, et al. 2002; Paulding, et al. 2003; Marques, et al. 2005; Levine, et al. 2006; Vinckenbosch, et al. 2006; Zhou, et al. 2008; Assis and Bachtrog 2013, 2015). Therefore, SSD may be associated with reproduction in plants, as it is in animals.
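The primary-tissue comparison can be sketched in Python as follows. This is an illustrative reconstruction under stated assumptions, not the authors' procedure: the two-sided P value is taken to be the summed probability of all outcomes no more likely than the observed count, the expected proportion for a tissue is taken to be the fraction of single-copy genes with that primary tissue, and the multiple-testing correction is omitted because its method is not given in this section. All function names are hypothetical.

```python
from math import comb

def primary_tissue(expr_by_tissue):
    """Return the tissue with the highest expression for one gene."""
    return max(expr_by_tissue, key=expr_by_tissue.get)

def expected_proportion(single_copy_primary_tissues, tissue):
    """Fraction of single-copy genes whose primary tissue is `tissue`."""
    return single_copy_primary_tissues.count(tissue) / len(single_copy_primary_tissues)

def binomial_two_sided_p(k, n, p):
    """Exact two-sided binomial P value for k successes in n trials."""
    pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    observed = pmf[k]
    # Sum outcomes that are no more probable than the observed one
    # (small relative tolerance guards against floating-point ties).
    total = sum(q for q in pmf if q <= observed * (1 + 1e-9))
    return min(1.0, total)
```

For one species and tissue, `k` would be the number of duplicates with that primary tissue, `n` the total number of duplicates, and `p` the value returned by `expected_proportion`.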
DISCUSSION
Despite the abundance of duplicate genes in angiosperms, and their prominent roles in evolution (Hilu 1993; Dubcovsky and Dvorak 2007; Flagel and Wendel 2009; Van de Peer, et al. 2009; Meyer, et al. 2012; Salman-Minkov, et al. 2016), their paths from genetic redundancy to functional divergence and long-term retention remain unclear. Studies in several animal species have uncovered evidence of rapid and asymmetric sequence and expression divergence after duplication that is consistent with natural selection (Conant and Wagner 2003; Blanc and Wolfe 2004; Kellis, et al. 2004; Li, et al. 2005; Assis and Bachtrog 2013, 2015; Jiang and Assis 2017). However, many angiosperms are unique in that they are self-pollinating, which may reduce their adaptive potentials (Nordborg 1997; Glémin 2012; Roze 2015; Hartfield, et al. 2017), and therefore hinder the evolutionary divergence of duplicate genes. Yet, largely due to the absence of approaches for assessing functional divergence after duplication until recently (Assis and Bachtrog 2013), no genome-wide studies have been performed to address how duplicate genes in angiosperms evolve and are retained over long evolutionary timescales. Further, previous studies in angiosperms have primarily focused on WGD-derived duplicates, whereas little emphasis has been placed on describing evolution after SSD. Therefore, our study represents the first genome-scale analysis of functional evolution after SSD in angiosperms.
Examination of expression profiles across nine tissues of B. distachyon, O. sativa japonica, and S. bicolor revealed that functional conservation is the primary long-term outcome of SSD in grasses. Conservation of duplicate genes can either be a product of selection for increased gene dosage or a consequence of slowed divergence due to a decreased efficiency of selection, which can be further exacerbated by nonallelic gene conversion. Either one or both of these mechanisms may hamper evolutionary divergence of duplicate genes in grasses. In particular, though our study focused on SSD, analyses of WGD often point to increased gene dosage as a mechanism for duplicate gene retention in plants (Bekaert, et al. 2011). On the other hand, levels of conservation in grasses are higher than those observed in Drosophila (Assis and Bachtrog 2013), consistent with their smaller Ne relative to Drosophila (Lynch and Conery 2003). Moreover, mammals and grasses have similar Ne (Lynch and Conery 2003), and the efficiency of selection in grasses is close to that in mammals. Therefore, the comparison among levels of conservation in Drosophila, mammals, and grasses provides additional support for a role of natural selection in evolution after gene duplication across diverse taxonomic groups.
Though our analysis suggests that most grass duplicates are functionally conserved, it also indicates that a large proportion of SSD-derived duplicates may have experienced functional divergence. Previous studies in Arabidopsis thaliana demonstrated that SSD-derived duplicates have greater sequence and expression divergence rates than WGD duplicates of the same age (Casneuf et al. 2006; Carretero-Paulet and Fares 2012), which may be attributed to relaxed constraint (Carretero-Paulet and Fares 2012). Therefore, it is not surprising that SSD-derived duplicates in the species considered here may have diverged functionally from their ancestral state, and it is possible that an analogous study of WGD-derived duplicates would reveal a similar trend to that observed in A. thaliana. Moreover, we found that expression divergence of SSD-derived grass duplicates primarily occurs asymmetrically via neofunctionalization, as has been uncovered in both Drosophila (Assis and Bachtrog 2013) and mammals (Assis and Bachtrog 2015). This finding is also consistent with the increased prevalence of neofunctionalization among A. thaliana duplicates generated by SSD (Rensing 2014). Therefore, asymmetric evolutionary divergence appears to be a common outcome of SSD in both plant and animal species. However, neofunctionalization often occurs in child copies and is associated with RNA-mediated duplication in grasses, as in Drosophila (Assis and Bachtrog 2013), but not in mammals (Assis and Bachtrog 2015). Further, evolutionary fates of grass duplicates are reached quickly after duplication, also consistent with findings in Drosophila (Assis and Bachtrog 2013), but not in mammals (Assis and Bachtrog 2015). Together, these results support the hypothesis that neofunctionalization may often occur as a byproduct of SSD itself, perhaps due to the placement of RNA-mediated duplicates in novel genomic contexts without their ancestral regulatory elements (Kaessmann, et al. 2009).
Thus, aside from their slower divergence rates, the evolutionary trajectories of grass duplicates more closely mirror those of Drosophila (Assis and Bachtrog 2013) than mammals (Assis and Bachtrog 2015). This is somewhat surprising because the Ne of grass species are smaller than those of Drosophila species (Lynch and Conery 2003; Ai et al. 2012; Adugna and Bekele 2015; Stritt et al. 2017). However, in mammals, functional divergence often occurs over longer evolutionary time (Assis and Bachtrog 2015), suggesting that neofunctionalization is only biased toward child copies when it happens rapidly. This is not unexpected, given that conserved duplicates are initially redundant and, thus, the probabilities of divergence of parent and child copies over time should be equal. Therefore, this comparison further highlights the role of asymmetric duplication events, such as those that are RNA-mediated, in asymmetric divergence and child-biased neofunctionalization.
Assessment of expression divergence of SSD-derived grass duplicate genes revealed that it is positively correlated with protein-coding sequence divergence and tissue specificity. Moreover, in B. distachyon, we found an enrichment of duplicates highly expressed in anther, which is the tissue that produces pollen in flowering plants. This finding is consistent with those in A. thaliana RNA-mediated duplicates (Abdelsamad and Pecinka 2014; Casola and Betrán 2017) and supports the “out of the pollen” hypothesis, in which new plant genes originate from the vegetative nucleus of the mature pollen due to increased activities of transposable elements (Wu, et al. 2014). Because anther is analogous to testis in animals, our result is also consistent with the “out of the testis” hypothesis, which posits that new genes often emerge with testis-related functions and acquire novel functions over time (Kaessmann 2010) and is supported by data in many species (Betrán, et al. 2002; Paulding, et al. 2003; Marques, et al. 2005; Levine, et al. 2006; Vinckenbosch, et al. 2006; Zhou, et al. 2008; Assis and Bachtrog 2013, 2015). Several hypotheses have been proposed to explain the male-biased origin of new genes, including increased mutation rates due to greater numbers of germline cell divisions in male tissues (Shimmin, et al. 1993), positive selection due to sexual selection (Pröschel, et al. 2006; Ellegren and Parsch 2007), and relaxed negative selection due to reduced functional pleiotropy (Ellegren and Parsch 2007; Gershoni and Pietrokovski 2014; Harrison, et al. 2015). As in animals (e.g., Kaessmann 2010; Assis and Bachtrog 2013, 2015), any of these proposed mechanisms may contribute to the male-biased origin of duplicate genes in grasses. In particular, the increased mutation rate hypothesis (Shimmin, et al. 1993) is consistent with more cell divisions during pollen than ovule production in grasses (Filatov and Charlesworth 2002; Whittle and Johnston 2002), positive selection (Pröschel, et al. 2006; Ellegren and Parsch 2007) with the positive correlation between expression divergence and protein-coding sequence divergence of duplicates (Figure 2), and relaxed negative selection (Ellegren and Parsch 2007; Gershoni and Pietrokovski 2014; Harrison, et al. 2015) with the positive correlation between expression divergence and tissue specificity of duplicates (Figure 3A). Therefore, comparison of our findings in grasses to those in diverse animal species (Betrán, et al. 2002; Paulding, et al. 2003; Marques, et al. 2005; Levine, et al. 2006; Vinckenbosch, et al. 2006; Zhou, et al. 2008; Assis and Bachtrog 2013; Assis and Bachtrog 2014; Assis and Bachtrog 2015; Jiang and Assis 2017) highlights a universal role for gene duplication in the origin of male-specific phenotypes across plant and animal kingdoms.
MATERIALS AND METHODS
Identification of single-copy and duplicate genes
Reference genome annotation and sequence data from B. distachyon (version 1.2) (Vogel, et al. 2010), O. sativa japonica (version 1.0) (Sasaki, et al. 2005), and S. bicolor (version 1.4) (Paterson, et al. 2009), as well as a table of gene family sizes for 16 monocots, were downloaded from PLAZA 3.0 (Proost, et al. 2014) at https://bioinformatics.psb.ugent.be/plaza/. Gene families consisting of one copy in B. distachyon, O. sativa japonica, and S. bicolor were considered as single-copy genes. In total, there are 5,132 single-copy genes annotated in B. distachyon, 11,672 single-copy genes annotated in O. sativa japonica, and 6,724 single-copy genes annotated in S. bicolor. Removal of lowly expressed genes (see Sequence and expression analyses) yielded 4,769 single-copy genes in B. distachyon, 5,439 single-copy genes in O. sativa japonica, and 5,976 single-copy genes in S. bicolor that we used in the tissue enrichment test. There are 3,466 annotated 1:1 orthologs in B. distachyon and O. sativa japonica, 3,166 annotated 1:1 orthologs in B. distachyon and S. bicolor, and 3,154 annotated 1:1 orthologs in O. sativa japonica and S. bicolor. Removal of lowly expressed genes (see Sequence and expression analyses) yielded 3,269 1:1 orthologs in B. distachyon and O. sativa japonica, 3,024 1:1 orthologs in B. distachyon and S. bicolor, and 3,015 1:1 orthologs in O. sativa japonica and S. bicolor.
To identify pairs of duplicate genes that arose via SSD along designated branches shown in Figure 1 (full tree depicted in Figure S1), we used the maximum-likelihood method Count (Csűrös 2010) to estimate rates of duplications and losses along the monocot phylogeny downloaded from PLAZA 3.0 (Proost, et al. 2014) and to perform asymmetric Wagner parsimony using these rates (Swofford and Maddison 1987). In total, this approach yielded 391 pairs of duplicate genes that arose along the B. distachyon lineage, 478 pairs of duplicate genes that arose along the O. sativa japonica lineage, and 462 pairs of duplicate genes that arose along the S. bicolor lineage. After removing lowly expressed genes (see Sequence and expression analyses), we obtained 272 pairs of B. distachyon duplicates, 289 pairs of O. sativa japonica duplicates, and 340 pairs of S. bicolor duplicates (see Figure 1). To assess directionality of duplications and assign parent and child copies, we used tables of orthologs from OrthoMCL (Li, et al. 2003), TribeMCL (Enright, et al. 2002), and i-ADHoRe (Fostier, et al. 2011) that were downloaded from the PLAZA 3.0 database (Proost, et al. 2014). When orthology predictions from all three methods were available yet conflicting, we applied a majority-voting scheme to infer the most likely orthologs. When predictions from only two methods were available and conflicting, we prioritized OrthoMCL orthologs above all others, and i-ADHoRe above TribeMCL.
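The rule for reconciling conflicting orthology predictions can be sketched in Python as below. This is an illustration, not the authors' code. The method keys (`"orthomcl"`, `"iadhore"`, `"tribemcl"`) and the function name are hypothetical labels, and the fallback used when all three methods disagree with one another (apply the same priority order) is an assumption, since the text does not cover that case.

```python
from collections import Counter

# Priority order when no majority exists: OrthoMCL, then the synteny-based
# method, then TribeMCL. The key names are illustrative labels only.
PRIORITY = ("orthomcl", "iadhore", "tribemcl")

def consensus_ortholog(predictions):
    """Pick one ortholog from per-method predictions.

    `predictions` maps a method label to a predicted ortholog ID,
    or to None when that method made no prediction.
    """
    available = {m: o for m, o in predictions.items() if o is not None}
    if not available:
        return None
    ortholog, votes = Counter(available.values()).most_common(1)[0]
    if votes > len(available) / 2:
        return ortholog  # strict majority (includes unanimous and single-method cases)
    # No majority: fall back on the priority order (assumed for three-way conflicts).
    for method in PRIORITY:
        if method in available:
            return available[method]
    return None
```

For example, if two of the three methods agree, their shared prediction is returned; if only the lower-priority two methods are available and they conflict, the synteny-based prediction is returned.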
Sequence and expression analyses
We performed all sequence alignments between duplicates and ancestral single-copy genes using MACSE 1.0 (Ranwez, et al. 2011), which accounts for frameshifts and stop codons. We estimated Ka and Ka/Ks ratios using the codeml program in PAML 4.0 (Yang 2007) with runmode = −2, model = 0, and NSsites = 0. To avoid saturation at synonymous sites, we only considered pairs with Ks < 3. Tables containing expression abundances estimated in transcripts per million (TPM) from RNA-seq data of protein-coding genes in nine tissues (leaf, anther, endosperm, early inflorescence, emerging inflorescence, pistil, embryo, seed 5 days after pollination, and seed 10 days after pollination) (Davidson, et al. 2012) of B. distachyon, O. sativa japonica, and S. bicolor were downloaded from Expression Atlas at https://www.ebi.ac.uk/gxa/home. These RNA-seq data were quantified with HTSeq 0.6 (Anders, et al. 2015), which only counts reads that unambiguously map to a single gene, thereby minimizing the probability of incorrect mapping between duplicate gene copies. Data were then log-transformed, and genes with log2(TPM + 1) < 1 in all nine tissues were removed. We estimated the expression breadth of each gene with the tissue specificity index τ (Yanai, et al. 2004), which is defined as $\tau = \sum_{i=1}^{n} (1 - x_i) / (n - 1)$, where $n$ is the number of tissues and $x_i$ represents the expression level in the $i$th tissue normalized by the maximal expression value. The range of τ is from 0 to 1, with larger τ signifying greater tissue specificity.
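The expression filter and the tissue specificity index of Yanai, et al. (2004) can be illustrated with a short sketch. The function names are hypothetical, and whether τ was computed on raw or log-transformed values is not stated in the text, so `tau` below simply accepts whichever profile is passed to it.

```python
import math

def log_transform(tpm):
    """Apply log2(TPM + 1) to each tissue's expression value."""
    return [math.log2(x + 1) for x in tpm]

def is_expressed(log_expr, threshold=1.0):
    """Keep a gene unless log2(TPM + 1) is below the threshold in all tissues."""
    return any(x >= threshold for x in log_expr)

def tau(expr):
    """Tissue specificity index: sum over tissues of (1 - x_i) / (n - 1),
    where x_i is expression in tissue i divided by the maximal expression."""
    n = len(expr)
    peak = max(expr)
    if n < 2 or peak <= 0:
        raise ValueError("tau needs at least two tissues and nonzero expression")
    return sum(1 - x / peak for x in expr) / (n - 1)
```

A gene expressed equally in all tissues gives τ = 0, and a gene expressed in a single tissue gives τ = 1, which matches the stated range of the index.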
We classified retention mechanisms of duplicate genes in our dataset using the CDROM R package (Perry and Assis 2016), which implements Assis and Bachtrog’s phylogenetic approach (Assis and Bachtrog 2013). In particular, CDROM takes as input tables of expression measurements for multiple conditions in two sister species, lists of orthologous single-copy genes in the two sisters, and a list of parent and child duplicate gene pairs in one sister and their ancestral genes in the second sister. We used B. distachyon as the sister species to O. sativa japonica and S. bicolor and applied CDROM to the RNA-seq data described above, which consist of log-transformed TPMs for genes in nine tissues of B. distachyon, O. sativa japonica, and S. bicolor (Davidson, et al. 2012). CDROM first calculates Euclidean distances between expression profiles of orthologous single-copy genes (ES1,S2), between expression profiles of each of the parent and child duplicate genes and the ancestral gene (EP,A and EC,A), and between the combined expression profile of both duplicate genes and the ancestral gene (EP+C,A). Next, it uses a user-specified cutoff on the distribution of ES1,S2 (Ediv) to classify retention mechanisms of duplicates. Specifically, duplicates with EP,A ≤ Ediv and EC,A ≤ Ediv are classified as functionally conserved; those with either EP,A ≤ Ediv and EC,A > Ediv or EC,A ≤ Ediv and EP,A > Ediv as neofunctionalized; those with EP,A > Ediv, EC,A > Ediv, and EP+C,A ≤ Ediv as subfunctionalized; and those with EP,A > Ediv, EC,A > Ediv, and EP+C,A > Ediv as specialized. We used distributions of Euclidean distances between gene expression profiles to choose Ediv for each species (Figure S2).
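The decision rule above can be sketched in a few lines. This is an illustration of the four-way classification only, not the CDROM implementation: the function names are hypothetical, and two details are assumptions made for the sketch, namely that profiles are converted to relative expression (proportions summing to 1) before distances are taken, and that the combined profile is the tissue-wise sum of the parent and child profiles.

```python
import math

def relative(profile):
    """Convert a profile to proportions of total expression (assumed normalization)."""
    total = sum(profile)
    if total <= 0:
        raise ValueError("profile must have nonzero total expression")
    return [x / total for x in profile]

def euclidean(a, b):
    """Euclidean distance between two relative expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(relative(a), relative(b))))

def classify(parent, child, ancestor, e_div):
    """Assign a retention mechanism from the three distances and the cutoff Ediv."""
    e_pa = euclidean(parent, ancestor)
    e_ca = euclidean(child, ancestor)
    combined = [p + c for p, c in zip(parent, child)]  # assumed: tissue-wise sum
    e_pca = euclidean(combined, ancestor)
    if e_pa <= e_div and e_ca <= e_div:
        return "conserved"
    if e_pa <= e_div or e_ca <= e_div:
        # Exactly one copy has diverged from the ancestral profile.
        return "neofunctionalized"
    if e_pca <= e_div:
        return "subfunctionalized"
    return "specialized"
```

In this sketch, a pair whose copies have each lost expression in complementary tissues is classified as subfunctionalized, because only their combined profile remains within Ediv of the ancestral profile.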
Determination of DNA- and RNA-mediated duplication mechanisms
Exon counts for parent and child duplicates were obtained from genome annotation files (B. distachyon version 1.2 (Vogel, et al. 2010), O. sativa japonica version 1.0 (Sasaki, et al. 2005), and S. bicolor version 1.4 (Paterson, et al. 2009)) downloaded from the PLAZA 3.0 database (Proost, et al. 2014). The child was considered to have arisen through DNA-mediated duplication when the parent and child copies both have multiple exons, and through RNA-mediated duplication when the parent copy has multiple exons and the child copy has one exon. When both the parent and child have one exon, the mechanism was considered to be unknown (43 pairs in B. distachyon, 39 in O. sativa japonica, and 50 in S. bicolor). Genes with unknown duplication mechanisms were not used in the analysis presented in Table 2.
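The exon-count rule above amounts to a small decision function, sketched here with a hypothetical name. The text does not say how pairs with a single-exon parent and a multi-exon child were treated, so the "unclassified" label for that case is an assumption of this sketch.

```python
def duplication_mechanism(parent_exons, child_exons):
    """Infer the duplication mechanism of a child copy from exon counts."""
    if parent_exons < 1 or child_exons < 1:
        raise ValueError("exon counts must be at least 1")
    if parent_exons > 1 and child_exons > 1:
        return "DNA-mediated"   # both copies retain introns
    if parent_exons > 1 and child_exons == 1:
        return "RNA-mediated"   # intronless child of a multi-exon parent
    if parent_exons == 1 and child_exons == 1:
        return "unknown"        # intron loss cannot be assessed
    # Single-exon parent with a multi-exon child: not covered by the stated rule.
    return "unclassified"
```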
Statistical analyses
We performed all statistical analyses in the R software environment (R Core Team 2013). χ2 tests were used to compare observed and expected counts of DNA- and RNA-mediated duplicates retained through different mechanisms (Table 2), as well as observed and expected retention mechanisms of duplicates in different age groups (Tables S1-3). Expected counts of DNA- and RNA-mediated duplicates were obtained by multiplying the number of duplicates retained by each mechanism by the total proportions of DNA- and RNA-mediated duplicates, respectively. Expected counts of retention mechanisms of duplicates in different age groups were obtained by multiplying the number of duplicates retained by each mechanism by the total proportions of duplicates in different age groups. Significance of the Pearson’s correlation coefficients depicted in Figure 2 was assessed via Student’s t tests. Two-tailed binomial tests were implemented to compare observed counts of highest-expressed duplicates to their expected probabilities. Each binomial test was performed by setting the number of trials as the total number of duplicates, the number of successes as the number of highest-expressed duplicates in the tissue of interest, and the probability of success as the frequency of single-copy genes in the tissue of interest. P-values from binomial tests were Bonferroni-adjusted to correct for the nine comparisons performed.
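The expected-count calculation and the Bonferroni-adjusted binomial test can be sketched as follows. The analyses themselves were run in R; this Python sketch only illustrates the arithmetic, with hypothetical function names. The two-tailed p-value uses one common exact definition (summing the probabilities of all outcomes no more likely than the observed one), which is an assumption about the convention rather than a statement of the procedure used.

```python
from math import comb

def expected_counts(observed):
    """observed maps mechanism -> {category: count}, e.g. categories DNA and RNA.
    Expected count = duplicates retained by the mechanism x overall category proportion."""
    categories = sorted({c for row in observed.values() for c in row})
    grand = sum(sum(row.values()) for row in observed.values())
    prop = {c: sum(row.get(c, 0) for row in observed.values()) / grand
            for c in categories}
    return {mech: {c: sum(row.values()) * prop[c] for c in categories}
            for mech, row in observed.items()}

def binomial_two_tailed(successes, trials, p):
    """Exact two-tailed binomial p-value (adequate for a few hundred trials)."""
    pmf = [comb(trials, k) * p ** k * (1 - p) ** (trials - k)
           for k in range(trials + 1)]
    cutoff = pmf[successes] * (1 + 1e-7)  # small tolerance for floating-point ties
    return min(1.0, sum(q for q in pmf if q <= cutoff))

def bonferroni(p_values):
    """Multiply each p-value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]
```

With nine tissues, `bonferroni` would be applied to the nine p-values obtained by calling `binomial_two_tailed` once per tissue, using the frequency of single-copy genes in that tissue as `p`.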