Abstract
Parallel evolution, defined as identical changes arising in independent populations, is often attributed to similar selective pressures favoring the fixation of identical genetic changes. However, some level of parallel evolution is also expected if mutation rates are heterogeneous across regions of the genome. Theory suggests that mutation and selection can have equal impacts on patterns of parallel evolution, however empirical studies have yet to jointly quantify the relative importance of these two processes. Here, we introduce several statistical models to examine the contributions of mutation and selection heterogeneity to shaping parallel evolutionary changes at the gene-level. Using this framework we analyze published data from forty experimentally evolved Saccharomyces cerevisiae populations. We can partition the effects of a number of genomic variables into those affecting patterns of parallel evolution via effects on the rate of arising mutations, and those affecting the retention versus loss of the arising mutations (i.e. selection). Our results suggest that gene-to-gene heterogeneity in both mutation and selection, associated with gene length, recombination rate, and number of protein domains drive parallel evolution at both synonymous and nonsynonymous sites.
Introduction
Documenting patterns of parallel evolution during the adaptive divergence of populations or during repeated bouts of adaptation in populations maintained in the lab is becoming increasingly feasible. Beyond the fascination for the pattern of repeatable evolution, an outstanding open question is to understand which underlying processes are driving the pattern of molecular evolution during adaptation. Theory makes clear cut predictions: in the absence of selective interference between beneficial mutations (the so called strong selection weak mutation, or SSWM, domain), heterogeneity in mutation rates and selection coefficients between loci are expected to have equal influence on patterns of parallel evolution (Chevin et al., 2010; Lenormand et al., 2016). So far very few empirical studies have attempted to jointly quantify the relative importance of these two processes in shaping patterns of parallel evolution in genetic data. One study has explored this indirectly by quantifying the contribution of these two processes in shaping the parallel evolution of heritable traits that are assumed to be associated with parallel genetic changes (Streisfeld and Rausher, 2011). Recent work by Bailey et al., 2017outlines an approach for quantifying the effects of mutation and selection heterogeneity in driving parallel evolution in experimental evolution data, but this alternate approach can not identify potential genomic drivers of that heterogeneity, as we do here. Other previous studies looking explicitly at parallel genetic changes have focused on the impacts of either selection or mutation separately.
A number of studies have examined gene-level mutation counts, looking for levels of parallel evolution that exceed what one would expect in the absence of selection (Caballero et al., 2015; Marvig et al., 2015; Woods et al., 2006), according to some null model, with an aim to identify genes that are under selection. For example, (Caballero et al., 2015) calculated the probability of instances of gene-level parallel evolution in whole genome sequences of Pseudomonas aeruginosa repeatedly sampled over the course of a year from the sputum of a cystic fibrosis (CF) patient assuming uniform re-sampling of ~150 mutation events across the approximately 6000 genes in the genome. The authors were able to identify 19 different genes for which there was significant deviation from their null model, and that pattern was interpreted as evidence for selection acting on these genes. However this study and other similar approaches do not account for the possibility of heterogeneity in mutation rate from gene-to-gene, a process that can generate false positives when using “abnormal” levels of parallel evolution as a means to detect selected genes.
Others have compared instances of parallel and convergent evolution across species (see Christin et al., 2010 for a review and examples). These studies also aim to identify genes under selection by searching for genes that exhibit a higher that expected number of instances of parallel evolution according to a specified null model for evolution. Many cross-species comparative studies report instances of parallel molecular evolution and readily interpret these as being driven by positive selection (e.g. Castoe et al., 2009; Feldman et al., 2012; Jost et al., 2008; Liu et al., 2014). However (Zou and Zhang, 2015) show that in this type of analysis the choice of null model is crucial and suggest that many previously reported instances of parallel evolution driven by selection could in fact have resulted simply from mutation biases and mutational heterogeneity in the absence of selection.
In contrast to studies aimed at identifying selection, other work has focused on examining how heterogeneity in mutation rate can effect the distribution of mutations across a genome, and so the probability of parallel evolution. These studies focus exclusively on either those mutations that are assumed to be to a first approximation neutral (e.g. synonymous mutations, Maddamsetti et al., 2015) or mutations arising in the course of experiments where selection is minimal (e.g. mutations arising in a mutation accumulation experiment, Ness et al., 2015). On the whole, these studies suggest substantial gene-to-gene heterogeneity in mutation rate and this can arguably also generate differences in the distribution of mutations across the genome (although studies differ in the factors identified that drive that heterogeneity). However, it is not clear what the relative contribution of mutation rate heterogeneity is when the mutations of interest also have the potential to be under varying degrees of selection.
In this study we aim to explore the effects of both mutation and selection in generating observed patterns of parallel evolution at the gene-level. To do this we propose a framework that explicitly considers both selection and mutational heterogeneity. Using both Poisson and negative binomial regression models, we analyze gene-level mutation count data obtained from whole genome sequencing of a large set of yeast (Saccharomyces cerevisiae) experimental populations that were adapted in parallel to a glucose environment in the lab (Lang et al., 2013). We find that the best predictor of parallel mutations at the gene-level is simply the length of the gene, and along with this, a few other genomic covariates – namely the number of protein domains and the rate of recombination – also affect patterns of parallel evolution.
Models for identifying processes underlying parallel evolution
We are interested in quantifying heterogeneity in mutation rate and selection, and how these in turn are driving patterns of parallel evolution, and identifying genomic variables that predict how these processes vary from gene-to-gene. To accomplish this, we need a framework that can explicitly separate the effects of variation in mutation rate and variation in selection. We do this by examining separately the observed synonymous and nonsynonymous mutations, making the assumption (which we then check) that gene-to-gene variation in the rate at which synonymous mutations rise to observable frequencies is driven solely by variation in the mutation rate per gene, while gene-to-gene variation in the rate at which nonsynonymous mutations have arisen may be driven by heterogeneity in both mutation and selection processes. We describe the number of mutations observed in gene i during the course of an experiment, as Xi = XiS + XiN, where XiS and XiN denote the synonymous and nonsynonymous mutation counts respectively. We assume these mutations are Poisson distributed with rates λiS and λiN respectively. For synonymous mutations, this Poisson rate can be modeled as
Here, M0 is a parameter that absorbs both time and population size at which the evolution occurred and that is constant across the genome, μ0 is the per-nucleotide mutation rate that we assume (and check) is constant across the genome, Li is the length of gene i in nucleotides, and π0 is the probability of a synonymous mutation rising to an observable frequency in the population (we assume that synonymous mutations are selectively neutral and so this probability is assumed to be constant across the genome). For nonsynonymous mutations, where πi is the probability of a nonsynonymous mutation in gene i rising to observable frequencies in the population. This probability, πi , is a function of the mean selection coefficient of gene i, si , and under strong-selection-weak-mutation (SSWM) conditions, πi ∝ Si (Gillespie, 1984). The type of data used and the underlying assumptions are summarized in Fig. 1.
Given these underlying assumptions about the processes giving rise to observable mutations in the experimental sequence data, we can then use Poisson and negative binomial (NB) regression to identify potential genomic variables that significantly explain variation in λiN and λiS, and thus ultimately in the mutation and selection processes from gene-to-gene. The Poisson regression is used to explore counts of rare events (i.e. the observed mutations) that have a fixed probability of being observed, while for a NB regression, the rate of those rare events is itself a random variable that is gamma-distributed. A NB regression incorporates an extra parameter beyond a Poisson rate, known as the dispersion parameter (here denoted by θ), reflecting the amount of underlying variation in the rate of observed mutations from gene-to-gene and governs the “extra” variance of the NB distribution relative to a Poisson distribution with identical mean. If there is no heterogeneity among the rate of observed mutations from gene-to-gene, the dispersion parameter θ goes to zero and we recover a Poisson regression model. Therefore, the Poisson regression model is a special case of the NB regression model, as NB(λi, θ) reduces to Poisson(λi) at the limit of θ → 0 (see for instance Zuur, 2009). As a consequence, the Poisson and NB models are “nested” and their relative fit can be compared using a likelihood ratio test when exploring the fit of both types of regression models in this study.
More precisely, we use the models Xi ~ Poisson (λi) or Xi ~ NB(λi, θ), fitting the following regression: where λ = (λ1, …,λi, …,λn) are the Poisson rates for all n genes, A1 … Aj are the j potential genomic explanatory variables, and α1 … αj , the estimated regression coefficients for those j variables. Thus, in the case of the synonymous mutations, constant = log(M0 π0 μ0), A1 = log(Li) setting α1 = 1. For nonsynonymous mutations, α2 A2 + … + αj Aj = log(πi). Details of the implementation of these models is provided below.
Methods
The data
Evolution experiment data
We analyzed data obtained from whole genome re-sequencing of forty populations of S. cerevisiae adapted in parallel to a glucose-limited environment in the lab (Lang et al., 2013). In our analysis we include all detected genic mutations, i.e. all genic mutations that were able to escape drift and so rise to frequencies of at least approximately 10% in the populations (mutations below this frequency could not reliably be detected, see Lang et al., 2013). Mutations were grouped by gene across all forty populations, and categorized as synonymous (SYN) or nonsynonymous (NS), i.e. those that do not confer amino acid changes, and those that do, respectively.
Comparative genomics data
We used a set of orthologuous gene alignments spanning four distinct yeast species (S. cerevisiae, S. paradoxus, S. bayanus, and S. mikatae; available from www.yeastgenome.org/download-data/genomics; Kellis et al., 2003; Cliften et al., 2003) to infer the gene-to-gene heterogeneity of the substitution rates at synonymous sites and nonsynonymous sites, hereafter dS and dN respectively. To do so, we first realigned the gene sequences using ClustalW (Larkin et al., 2007) on the translated protein sequence data and then applied a number of filters to the data with an aim at removing those gene alignments that might result in inaccurate codon substitution model predictions. We removed alignments for those genes where sequences were not available from all four species, alignments for which at least one sequence had <30% overlap with the one of the other 3 sequences, and alignments for which at least one sequence was <300 bps in length. We then used a maximum likelihood codon based method (CodeML in the PAML software package; Yang, 2007) to infer dS and dN, for each gene in our data set. We used a codon table model with a fixed tree topology (a comparison of AICs among alternative codon based models indicated this was the most appropriate model for the data set).
Additional genomic data
We included eight additional genomic variables in our analysis that we expected could have the potential to effect the probability of a gene to harbor mutations. Our collection of variables is not meant to be exhaustive, but simply meant to illustrate the potential for additional genomic information to improve our predictions of which genes bear mutations across the genome. For each gene we consider: gene length, % GC content, multi-functionality, degree of protein-protein interaction (PPI), codon adaptation index (CAI), number of domains, level of expression, local recombination rate, and essential genes. We expect some of these variables may capture heterogeneity in the per-gene mutation rates, for example: gene length, which likely captures variation in a gene’s mutational target size, and local recombination rate, which has been shown to be associated with mutability in yeast (Holbeck and Strathern, 1997; Strathern et al., 1995). We expect other variables may capture heterogeneity in selection from gene-to-gene, for example: multi-functionality and PPI, which may characterize aspects of how pleiotropic a given gene is and so the level of evolutionary constraint it is under. We expect still other genomic variables may capture heterogeneity in both mutation and selection. For example, level of expression of a gene may be correlated with gene-to-gene variation in selection as highly expressed genes have been shown to be more highly conserved, both specifically in yeast (Drummond et al., 2005; Pál et al., 2001) and as a more general phenomenon across species (Drummond and Wilke, 2008). On the other hand, level of expression of a gene has also been shown to be positively correlated with mutability (Ness et al., 2015). Descriptions of the variables used in this study and sources from which the data were obtained are provided in Table 1.
A data set integrating the mutation counts originally made available by Lang et al., 2013 (from their Supplementary Table 1) and all the genomic covariates that we aggregated for this study, as well as the gene alignments used for estimating dN and dS are available on Dryad (doi will be inserted here).
Regression models
Regression models and explanatory variables tested
We used the Poisson and negative binomial regression models described in the “Models” section above to examine how much of the variation in our explanatory variables could account for patterns of variation in mutation counts per gene. We used the ‘glm’ and ‘glm.nb’ functions in R (R Development Core Team, 2014) to implement these models. We fit a series of models to synonymous and nonsynonymous mutation count data separately. To start, we fit the synonymous mutations (model MS), testing our assumptions that rate of observed mutations per gene is proportional to number of nucleotide sites in the gene (Li), and the per nucleotide mutations does not vary significantly across the genome – i.e. a model assuming μ0 is a fixed parameter (Poisson regression) fits the data better than a model where μ0 for each gene is drawn from a gamma distribution (NB regression).
After these assumptions were confirmed, we moved on to fit the nonsynonymous mutation data (MN), testing the 11 genomic variables listed in Table 1. We then examined an alternate model (MNPC), fitting the nonsynonymous mutations using the principal components of the 11 genomic variables in place of the raw variables. The reason we explore this model is that many genomic variables tend to be correlated (for correlations between the particular variables used in this study, see supplementary Table S1), and one approach to reducing potential problems with co-linearity is to transform the raw variables into their principal components and use the resulting uncorrelated composite variables for the regression analysis. We performed a principal component analysis on 11 genomic variables using the ‘prcomp’ function in R to obtain 11 principal components (PCs).
Model selection and significance of variables
For each variable and parameter of interest we tested significance by comparing versions of the models with and without that variable or parameter of interest through a likelihood-ratio test (LRT). Significance testing for LRTs was done using permutation tests instead of relying on asymptotic distribution of the LRTs, approximating the null distribution and obtaining P-values by calculating the frequency of permutations where the model fit resulted in a likelihood-ratio greater than or equal to the observed value. Variables found to significantly improve model fit were retained in the final “best” model. We choose to test significance using permutations given that asymptotic results on the distribution of the likelihood ratio test may break down as the reduced model – the Poisson regression – lies at the boundaries of the parameter space for θ, included in the NB regression (see for instance Self and Liang, 1987). In practice, 1000 permutations were used to approximate the null and obtain p-values on each variable (more permutations might be required if needed to approximate p-values that are much smaller than 10^-3).
The two nonsynonymous mutation models MN and MNPC were compared with each other using Akaike information criterion (AIC; Akaike, 1973), and the proportion of variation explained (pseudo-R2) was estimated as the R2 obtained from a linear regression (using ‘lm’ in R) between the observed and predicted mutation counts for a given model. Note that this statistic is not used for any formal goodness-of-fit but as an illustrative way to report how much of the whole variation is accounted for by any model we fit to the mutation count data.
All statistical analyses, including the permutation tests, were scripted in R (R Development Core Team, 2014) and an example script for implementing our model framework and hypothesis testing is available on Dryad (doi will be inserted here).
Results
The data
Mutation counts data
We used experimental data comprising all mutations detected at a frequency over 10% in the forty evolved S. cerevisiae populations described in Lang et al., 2013. After removing those genes for which we had incomplete or unreliable data (see Methods), we were left with 2891 genes out of a total of 6603. The filtered data set contained 357 nonsynonymous mutations distributed across 267 genes, and 58 synonymous mutations distributed across 57 genes. The genes removed by our filtering rules had disproportionately more mutations compared to those genes that were retained in the data set (χ2 = 50.57, df = 1, P < 0.001). This is not unexpected as highly divergent genes are more likely to be filtered out due to alignment issues, and it is not surprising that highly divergent genes would tend see more mutations than average, whether it be as a result of mutation and / or selection mechanisms. This bias in the filtering means that our results are likely conservative in terms of detecting significant relationships between long-term (from comparative genomics data) and short-term (from experimental evolution data) measures of divergence.
Genomic variables
We used codon substitution models comparing four yeast species to estimate dS and dN/dS for each gene. Estimates for dS ranged widely, from 0.21 to 68.7, however the vast majority of dS estimates (~95%) were less than 4. Estimates for dN/dS ranged from 0.00010 to 0.43, and these values are weakly negatively correlated with dS (r = -0.043, P = 0.021). We collated and/ or calculated nine other genomic variables with the potential to effect the mutation and selection processes in this system and estimated correlation coefficients between all pairs of explanatory variables used in this study (Table S1). While the correlations between these variables tend to be quite weak, many are, in fact, significant due to the large number of observations in the data set.
Mutation counts analysis
Synonymous mutations
We used regression models to test our assumption that gene-level mutation rate can be adequately described as simply being directly proportional to gene length. Restricting the data to the synonymous mutations, we compared Poisson regression models with and without gene length included as an explanatory variable (MS0: λS = constant and MS1: λS = constant*(Li)α , respectively), and a Poisson regression model where rate is restricted to be directly proportional to gene length (i.e. MS2: λS = constant*Li) We also compared with negative binomial versions of these model to look for the possibility of additional unexplained variation in the rate λ. The results of these comparisons are shown in Table 2. Model MS2 was the best model according to a comparison of AICs, confirming our assumptions. The fits of these models to the distribution of synonymous mutation counts per gene are visualized in Fig. 2A.
Nonsynonymous mutations
We fit regression models to the nonsynonymous mutation data, including eleven genomic variables, trying to identify which of those variables could significantly explain variation in the number of observed mutations per gene. We found that gene length (L), number of domains in the encoded protein (num.dom), and recombination rate (r) were significant in our model (see model MN.NB in Table 3).
When we fit regression models using the principal components of the genomic variables in place of the raw variables, we found that only a single principal component, PC10, was significant in the model (see model MN.NBPC in Table 3). PC10 is fairly evenly loaded with a number genomic variables (see Fig. 3), however the three significant genomic variables from MN.NB (L, num.dom, and r) are among the variables more heavily loaded on PC10, so the two models seem to be roughly in agreement. A comparison of Poisson and negative binomial regression models, as well as models including the raw genomic variables versus the transformed principal component variables, suggests that the best model for these nonsynonymous mutation count data is a negative binomial regression using the raw genomic variables (see AIC values in Table 4). The fits of these models to the distribution of nonsynonymous mutation counts per gene are visualized in Fig. 2B.
Discussion
Here we present a modeling framework to infer what genomic variables may underlie gene to gene variation in mutation rate and intensity of selection. We use these models to provide evidence that parallel evolution at both nonsynonymous and synonymous sites is driven by non trivial amounts of gene-to-gene heterogeneity in the mutation and selection processes. Using our modeling approach, we identified a number of genomic variables that can significantly predict the distribution of mutations observed across genes in experimentally evolved populations of S. cerevisiae (Lang et al., 2013). We are also able to classify genomic variables into those that have affected mutation counts 1) through their effect on the mutation rate (variables that significantly predict synonymous mutations), and/ or 2) through their effect on the probability of a mutation being either observed/ lost due to selection (variables that significantly predict nonsynonymous mutations). Out of all the variables tested, we found that gene length explained the most variation in both synonymous and nonsynonymous mutation counts per gene – plainly speaking, longer genes accumulate more mutations. However, number of domains and recombination also had significant effects. Below we discuss in detail these genomic variables and their potential contributions to the probability of parallel evolution via the processes of mutation and selection.
Longer genes harbor more mutations
By far, the variable having the largest effect on variation in the number of synonymous and nonsynonymous mutations observed was gene length. More specifically, gene length positively affected the rate of mutation at the gene-level, meaning genes comprising more nucleotides were more likely to harbor mutations. This result is not surprising and is in agreement with recent analysis of synonymous mutation counts from Lenksi’s long term evolution experiment with E. coli (Maddamsetti et al., 2015).
Long-term divergence does not predict short-term mutation counts
Our model for synonymous mutation counts suggests that divergence estimates from long-term evolutionary comparisons at the species level do not provide insight into expected mutation counts on the shorter time scale of evolution in the lab, also in agreement with recent analysis of E. coli data (Maddamsetti et al., 2015). Maddamsetti et al found that their proxy for long-term per gene mutation rate, θs (a measure of within-species nucleotide diversity), did not explain gene-to-gene variation in synonymous mutation counts in their data. The authors argued that horizontal gene transfer (HGT) is therefore likely a more important process driving gene-to-gene variation in long-term divergence between naturally occurring E. coli strains, and since HGT did not occur in their evolution experiment, it is not surprising that the experiment’s synonymous mutation counts did not correlate with θs . However, rates of HGT tend to be higher in bacteria, and in particular E. coli, as compared to yeast and other eukaryotes (e.g. Boto 2010). Furthermore, a recent mutation accumulation experiment with the eukaryote Chlamydomonas reinhardtii showed a positive correlation between a proxy for long-term mutation rate (θs) and per site mutability (Ness et al., 2015). Thus, it is somewhat surprising that we do not see a significant relationship between dS and dN/dS and counts of synonymous and nonsynonymous mutations respectively in our examination of the S. cerevisiae data used in this study. One possibility might also be that dS and dN/dS are noisy to estimate at the gene level and that tends to downplay their predictive power in our analysis of counts in evolve and re-sequence experiment.
Nonsynonymous mutation counts show evidence of selection heterogeneity
As expected (Lenormand et al., 2016), we report strong evidence that the distribution of nonsynonymous mutations across the genome was driven in part by gene-to-gene heterogeneity in selection. Of those genomic variables tested, we found three that were significant predictors of nonsynonymous mutation counts, suggesting that those variables may drive or are correlated with processes that modulate the intensity of selection across genes. The significant variables were gene length, recombination rate, and number of protein domains.
We found that gene length predicts nonsynonymous mutation count via selection, over and above its effects on per gene mutation rate – as estimated from models aimed at explaining the synonymous mutation count only. While one might not expect gene length to have direct effects on selection, we suggest that gene length may show a significant effect here because it is correlated with other attributes of the genome that could have important effects on selection, for example gene expression levels and multifunctionality. Because of these correlations, it could be that gene length acts as a kind of summary variable for these covariates and other unidentified factors we have not captured in these models. Further evidence that gene length acts as a summary variable comes from the M3 results (summarized in Table 3), where we see that gene length is no longer significant when other summary variables - the principal components – are included in the model.
In contrast to the positive relationship between gene length and number of nonsynonymous mutations, we also found that the number of protein domains that a gene codes for (a variable that is positively correlated with gene length; Table S1) actually negatively predicts the number of nonsynonymous mutations. In other words, the more domains in the encoded protein of a gene, the fewer mutations that gene is expected to incur in the course of the yeast evolution experiment analyzed here. The mechanism behind this effect is not clear, but certainly protein structure has previously been reported to have significant impacts on evolutionary rates in yeast (Bloom et al., 2006) and one can also posit that genes encoding proteins with multiple domains and thereby involved in more numerous interactions are – all else being equal – more severely constrained by purifying selection. It is interesting that this effect can be observed in the course of relatively short time span (relative to between species divergence times) through the relative paucity of nonsynonymous mutations in these genes.
Our analysis also showed that recombination rate is a significant predictor of the observed number of nonsynonymous mutations observed in a given gene in these data. Genes with higher recombination rates are more likely to bear nonsynonymous mutations. We expect recombination rate to be correlated with mutation, as previous studies in yeast have shown that recombinational repair of double strand breaks in substantially increases the frequency of nearby point mutations in nearby intervals (e.g. Holbeck and Strathern, 1997; Strathern et al., 1995). However, it is not clear how high recombination rates might drive, or be correlated with other processes that drive, selection – as our models suggest is the case for this data set. Another non exclusive possibility might be the fact that biased gene conversion might vary from gene to gene and also – like selection – affect the probability of detecting variants in evolve and re-sequence experiments
Factors driving mutation and selection are complex
It is difficult to obtain any additional insights from models that include principal components of the genomic covariate data, however there is at least some level of agreement between those variables that are significant (i.e. length, recombination, and number of domains) and ones that are heavily weighted in PC10 – the principal component that was found to be significant (see Fig. 3). The local properties of the genome do appear to drive some heterogeneity in the selection processes, and in turn, shape the patterns of parallel evolution, however individual effects that can be ascribed to individual variables are not easy to parse out.
Finally we want to stress that while we were able to identify a number of factors affecting the count of mutations observed in this evolution experiment data set, the total explained variance is still low: 1 % and 16.0 % in the synonymous and nonsynonymous models respectively (calculated from pseudo-r2 estimates of the “best” models, see methods). While the models do capture the general distribution of mutation counts (Fig. 2) and so the degree of parallel evolution, accurately predicting on which genes those mutations will fall is still very difficult. This is not surprising given the amount of stochasticity involved in both the origin of new mutations and their evolutionary fate through drift and selection. A clearer picture might emerge when using our modeling approach in a meta-analysis approach where several evolve and re-sequence experiments are considered together (see Bailey et al., 2017 for a similar approach on summary statistics of the amount of parallel evolution at the gene level across a wide range of experimental studies in yeast and bacteria)
While we do find a number of genomic variables that significantly affect the distribution of mutations across the genome, it is noteworthy that these models are still unable to capture the more extreme patterns of parallel evolution observed in this data set. For example, one gene (IRA1) saw mutations in over 50% of the populations sequenced in this experimental data set (discussed in more detail in Lang et al., 2013). Such a mutation count is completely out of the range of likely outcomes predicted by our models. Some of this discrepancy may be because of the simplifying assumptions made about the process of selection. Our framework models the process of mutation and its heterogeneity but while we account for the fact that newly arising mutations may have different probabilities of reaching an observable frequency, the modeling of that process could be made more precise by incorporating an explicit underlying distribution of fitness effects of new mutations at each gene. Incorporating a selection process that allows for different amounts of both positive and negative selection, as well as further details about the selection pressures in the particular environment of interest - something we do not consider at all in this study – would likely improve prediction for some of these more extreme events.
Advantages of this regression framework
Relying on the assumption that synonymous mutations are selectively neutral (which does appear to be the case for these data), the regression models we use in this study allow us to distinguish between genomic variables influencing the observed distribution of mutations across a genome through their potential effects on both gene-to-gene heterogeneity in mutation rate and gene-to-gene heterogeneity in selection. The great advantage of this is that it allows us to begin to break down the importance of these two processes in shaping patterns of parallel evolution we see, and move closer the goal of predicting which genes will be involved in evolution when organisms adapt to new environments. It will be interesting to apply this model framework to other data sets of this type, as they become available, to see how general these patterns are across different organisms and selection environments (Bailey and Bataillon, 2016).
Data archival location
Dryad, doi to be included later
Acknowledgments
This work was supported by the European Research Council under the European Union’s Seventh Framework Program [FP7/20072013, ERC grant number 311341 to T.B.].