Abstract
Phylogenetic analyses of molecular data require a quantitative model for how sequences evolve. Traditionally, the details of the site-specific selection that governs sequence evolution are not known a priori, making it challenging to create evolutionary models that adequately capture the heterogeneity of selection at different sites. However, recent advances in high-throughput experiments have made it possible to quantify the effects of all single mutations on gene function. I have previously shown that such high-throughput experiments can be combined with knowledge of underlying mutation rates to create a parameter-free evolutionary model that describes the phylogeny of influenza nucleoprotein far better than commonly used existing models. Here I extend this work by showing that published experimental data on TEM-1 beta-lactamase (Firnberg et al., 2014) can be combined with a few mutation rate parameters to create an evolutionary model that describes beta-lactamase phylogenies much than most common existing models. This experimentally informed evolutionary model is superior even for homologs that are substantially diverged (about 35% divergence at the protein level) from the TEM-1 parent that was the subject of the experimental study. These results suggest that experimental measurements can inform phylogenetic evolutionary models that are applicable to homologs that span a substantial range of sequence divergence.
Introduction
Most approaches for the phylogenetic analysis of gene sequences require a quantitative evolutionary model specifying the rate at which each site substitutes from one identity to another. In maximum-likelihood and Bayesian approaches, the evolutionary model is used to calculate the likelihood of the observed sequences given the phylogenetic tree (Felsenstein, 1981; Huelsenbeck et al., 2001). In distance-based approaches, the evolutionary model is used to calculate the distances between pairs of sequences (Saitou and Nei, 1987; Hasegawa et al., 1985). For all these approaches, inaccurate evolutionary models can lead to errors in inferred phylogenetic properties, including incorrect estimates of evolutionary distances (Halpern and Bruno, 1998) and incorrect tree topologies (Felsenstein, 1978; Huelsenbeck and Hillis, 1993; Lartillot et al., 2007; Rokas and Carroll, 2008).
Unfortunately, existing phylogenetic evolutionary models are extreme simplifications of the actual process of mutation and selection that shapes sequence evolution (Thorne et al., 2007). At least two major unrealistic assumptions afflict most of these models. First, in order to make phylogenetic algorithms computationally tractable, it is generally assumed that each site within a gene evolves independently. Second, most widely used evolutionary models compound the first assumption of independence among sites with the second unrealistic assumption that all sites evolve identically – a severely flawed assumption since there is overwhelming evidence that proteins have strong preferences for certain amino acids at specific sites (Ashenberg et al., 2013; Halpern and Bruno, 1998). It is the second of these unrealistic assumptions that is addressed by the experimentally informed evolutionary model described here.
A major reason that most phylogenetic evolutionary models assume that sites evolve identically is because it has traditionally been thought that there is insufficient information to do better. In the absence of a priori knowledge about selection on individual sites, the parameters of an evolutionary model must be inferred from the same sequences that are being analyzed phyloge-netically. For instance, typical codon-level models infer parameters describing the equilibrium frequencies of different codons, the relative rates of transition and transversion mutations, the relative rates of nonsynonymous and synonymous mutations, and in many cases the shapes of distributions that allow some of these rates to be drawn from several categories (Goldman and Yang, 1994; Muse and Gaut, 1994; Yang et al., 2000; Kosiol et al., 2007). These parameters can easily be inferred for a single general model that applies to all sites in a gene, but it is much more challenging to infer them separately for each site without overfitting the available sequence data (Posada and Buckley, 2004; Rodrigue, 2013). Some studies have attempted to bypass this problem by predicting site-specific substitution rates or classifying sites based on knowledge of the protein structure (Thorne et al., 1996; Goldman et al., 1998; Scherrer et al., 2012; Rodrigue et al., 2009; Kleinman et al., 2010) – however, such approaches are limited by the fact that the relationship between protein structure and site-specific selection is complex, and cannot be reliably predicted even by state-of-the-art molecular modeling (Potapov et al., 2009).
A more recent and more promising alternative approach is to infer the site-specific substitution process directly from the sequence data (Lartillot and Philippe, 2004; Rodrigue et al., 2010; Le et al., 2008; Wu et al., 2013; Wang et al., 2008). Fully specifying a different substitution process for each site in a maximum-likelihood framework requires inferring a large number of free parameters (there are 19 × L parameters for a gene with L codons if selection is assumed to act on the amino-acid sequence). Therefore, the details of the site-specific substitution process must be inferred without overfitting the finite available data. Two strategies have been successfully employed to do this: constraining sites to fall in a fixed number of different substitution-model classes (Wang et al., 2008; Le et al., 2008), or using non-parametric Bayesian mixture models that treat the site-specific substitution probabilities as random variables that are integrated over a statistical distribution estimated from the data (Lartillot and Philippe, 2004; Rodrigue et al., 2010). Although both of these strategies are somewhat complex, they yield much more nuanced evolutionary models that eliminate some of the problems associated with the unrealistic assumption that sites evolve identically.
Even more recently, a new type of high-throughput experiment has begun to yield data that enables the creation of site-specific evolutionary models without any need to infer site-specific selection from the naturally occurring gene sequences that are the subject of the phylogenetic analysis. This new type of experiment is deep mutational scanning (Fowler et al., 2010; Araya and Fowler, 2011), a technique in which a gene is randomly mutagenized and subjected to functional selection in the laboratory, and then deep sequenced to quantify the relative frequencies of mutations before and after selection. In cases where the laboratory selection is sufficiently representative of the gene’s real biological function, these experiments provide information that can be used to approximate the site-specific natural selection on mutations. To date, deep mutational scanning has been used to quantify the impact of most nucleotide or codon mutations to several proteins or protein domains (Fowler et al., 2010; Roscoe et al., 2013; Starita et al., 2013; Melamed et al., 2013; Traxlmayr et al., 2012; McLaughlin Jr et al., 2012; Firnberg et al., 2014; Bloom, 2014). For a few of these studies, the experimental coverage of possible mutations is sufficiently comprehensive to define site-specific amino-acid preferences for all positions in a gene.
I have previously shown that such experimentally determined site-specific amino-acid preferences can be combined with measurements of mutation rates to create a parameter-free evolutionary model that describes the phylogeny of influenza nucleoprotein far better than existing non-site-specific models that contain numerous free parameters (Bloom, 2014). Here I extend that work by showing that it is also possible to create an experimentally informed evolutionary model for another gene. I do this using deep mutational scanning data published by Firnberg et al. (2014) that quantifies the effects of nearly all amino-acid mutations on TEM-1 beta-lactamase. In this case, no measurements of mutation rates are available, so I construct an evolutionary model that is informed by the experimentally measured site-specific amino-acid preferences but also contains a few free parameters representing the mutation rates. I also augment this model with an additional parameter that reflects the stringency of the site-specific amino-acid preferences in natural evolution versus the deep mutational scanning experiment used to measure these preferences. I show that this evolutionary model greatly improves the phylogenetic fit to both TEM and SHV beta-lactamases, the latter of which are substantially diverged (about 35% divergence at the protein level) from the TEM-1 parent that was the subject of the deep mutational scanning by Firnberg et al. (2014). These results generalize previous work on experimentally determined evolutionary models, and suggest that site-specific amino-acid preferences are sufficiently conserved during evolution to be applicable to gene homologs that span a substantial range of sequence divergence.
Results
Evolutionary model
Summary of evolutionary model
I have previously described a codon-level phylogenetic evolutionary model for influenza nucleo-protein for which both the site-specific amino-acid preferences and the nucleotide mutation rates (assumed to be identical across sites) were determined experimentally (Bloom, 2014). The current work examines a protein for which the site-specific amino-acid preferences have been measured experimentally, but for which the nucleotide mutation rates are unknown. It is therefore necessary to extend the evolutionary model to treat the nucleotide mutation rates as unknown free parameters. Here I describe this extension.
In the model used here, the rate Pr,xy of substitution from codon x to some other codon y at site r is where Qxy denotes the rate of mutation from x to y, and Fr,xy gives the probability that a mutation from x to y fixes if it occurs. This equation assumes that mutation rates are uniform across sites, and that the selection on mutations is site-specific but site-independent (i.e. the fixation probability at one site is not influenced by mutations at other sites).
Fixation probabilities from amino-acid preferences
The fixation probability of a mutation from codon x to y is assumed to depend only on the encoded amino acids and , as synonymous mutations are assumed to be selectively neutral. The fixation probabilities Fr,xy are defined in terms of the experimentally measured amino-acid preferences at site r, where πr,a denotes the preference for amino-acid a at site r, and the preferences at each site sum to one .
As in previous work (Bloom, 2014), I consider two different mathematical relationships between the amino-acid preferences and the fixation probabilities. However, novel to this work, I consider a generalization of these relationships that allows the stringency of the amino-acid preferences to differ between natural sequence evolution and the deep mutational scanning experiments used to measure πr,a. Specifically, I take the probability of fixation to depend on (πr,a)β where β is a free parameter (constrained to have values ≥ 0) that scales the stringency of the amino-acid preferences. A value of β = 1 implies equally stringent preferences in natural evolution and the deep mutational scanning experiments. A value of β < 1 corresponds to less stringent preferences in natural evolution than in the deep mutational scanning experiments, while a value of β > 1 corresponds to more stringent preferences in natural evolution in the deep mutational scanning experiments. Naively, one might conjecture that β will typically have values > 1, since laboratory experiments tend to be less stringent than natural evolutionary selection. In the following sections, I will describe testing evolutionary models models that constrain β = 1 versus models that treat β as a free parameter.
With the addition of the stringency parameter β, I consider the following relationships between the amino-acid preferences and the fixation probabilities. The first relationship derives from considering the amino-acid preferences to be directly related to selection coefficients, and is a generalization of the ninth equation derived by Halpern and Bruno (1998):
Note that in Equation 2 the mutation terms in the original equation derived by Halpern and Bruno (1998) are all set to be equal since the calculations of the amino-acid preferences πr,a from the deep mutational scanning experiments already correct for differences in the mutagenesis rates, whereas Halpern and Bruno (1998) were inferring the preferences from natural sequences and so had to account for the mutation rates during natural evolution. Note also that the ninth equation of Halpern and Bruno (1998) contains a typographical error in the denominator which is corrected in Equation 2. Halpern and Bruno (1998) derive Equation 2 with β = 1 by assuming that the sequences are evolving in the weak-mutation limit, and rigorous application of this relationship with β = 1 in the context of the current work requires assuming that the effective population size in the deep mutational scanning experiment is equivalent to that for the natural sequences that are being phylogenetically analyzed.
The second relationship is based on considering the amino-acid preferences to reflect the fraction of genetic backgrounds that tolerate specific mutations rather than selection coefficients in any one genetic background. Specifically, experiments have shown that mutations that are deleterious in one genetic background can sometimes be neutral (or even advantageous) in a related genetic background (Lunzer et al., 2010; Gong et al., 2013). One reason that mutational effects depend on genetic background is that most proteins are under selection to maintain their overall stability above some critical threshold (Gong et al., 2013). This type of threshold selection gives rise to the evolutionary dynamics described in Bloom et al. (2007), where stabilizing mutations are always tolerated but destabilizing mutations are only tolerated in a fraction of genetic back-grounds. In this case, mutations to higher preference (putatively more stabilizing) amino acids will always be able to fix without deleterious effect, but mutations to lower preference (putatively less stabilizing) amino acids are only sometimes tolerated. One way to represent these dynamics is to use a relationship equivalent to the Metropolis et al. (1953) sampling criterion:
Both of these relationships (Equation 2 and Equation 3) share the feature that mutations to higher-preference amino acids fix more frequently than mutations to lower-preference amino acids as long as β > 0.
Mutation rates
The rate of mutation Qxy from codon x to y is defined in terms of the underlying rates of nucleotide mutation. Let Rm→n denote the rate of mutation from nucleotide m to n. Then
Assuming that the same mutation process operates on both the sequenced and complementary strands of the nucleic acid gives the constraint where mc denotes the complement of nucleotide m, since for example a mutation from A to G on one strand induces a mutation from T to C on the other strand. There are therefore six unique nucleotide mutation rates: RA→C = RT→G, RA→G = RT→C, RC→A = RT→A, RC→A = RG→T, RC→G = RG→C, and RC→T = RG→A.
In principle, these six mutation rates could be measured experimentally for the system of interest. In my previous work on influenza nucleoprotein (Bloom, 2014), I was able to devise an experimental system for measuring the mutation rates for influenza in cell culture. The mutation rates measured for influenza in this system were roughly symmetric (Rm→n = Rn→m), which was sufficient to make the overall evolutionary model in Equation 1 reversible. However, in general it is unlikely to be feasible to measure the mutation rates for most systems of interest. Furthermore, it is known that mutation process is AT-biased for many species (Hershberg and Petrov, 2010), meaning that in general mutation rates will not be symmetric. Therefore, in general (and also for lactamase specifically) it is necessary to infer the mutation rates from the sequence data without assuming that they are symmetric.
In the absence of any constraints on the six mutation rates given above, the overall evolutionary model defined by Equation 1 will not necessarily be reversible. However, it turns out (see Methods) that placing the following constraint on the mutation rates is sufficient to make the overall evolutionary model reversible:
This constraint lacks a biological justification, and is assumed purely for the mathematical convenience that it makes the model reversible. Although there is no biological reason to believe that Equation 6 actually holds for real evolution, it is possible to give interpretations about what assuming this equation implies about the mutational process. One interpretation is that the probability of mutating from C to G via an intermediate mutation to T is equal to the probability of mutating from C to G via an intermediate mutation to A, as Equation 6 implies that RC→T × RT→G = RC→A × RA→G. Another interpretation is that the AT bias is the same for transitions and transversions, as Equation 6 implies that RC→T /RT→C = RC→A/RA→C.
In the absence of independent information to calibrate absolute values for the branch lengths or mutation rates, one of the rates is confounded with the branch-length scaling and so can be assigned an arbitrary value > 0 without affecting the tree or its likelihood. Here the choice is made to assign so that all other mutation rates are defined relative to this rate. With these constraints, there are now four independent mutation rates that must be treated as unknown free parameters:
In practice, these four mutation rate parameters will be estimated at their maximum likelihood values given the sequences and tree topology.
Equilibrium frequencies
Calculation of a phylogenetic likelihood requires assigning evolutionary equilibrium frequencies to the possible codons in addition to specifying the transition probabilities given by Equation 1. In many conventional phylogenetic models, these equilibrium frequencies are treated as free parameters that are estimated empirically from the sequence data. However, in reality the equilibrium frequencies are the result of mutation and selection, and so can be calculated as the stationary state of the stochastic process defined by the evolutionary model. Specifically, it can be shown (see Methods) that for the evolutionary model in Equation 1, the equilibrium frequency pr,x of codon x at site r is where qx is given by where is the number of A and T nucleotides in codon x, the proportionality constant is never needed since qx values always appear in the form of ratios, and the simplification from the first to second line follows f rom Equation 5, Equation 6, and Equation 7. The equilibrium frequencies pr,x are therefore completely determined by knowledge of the experimentally determined amino-acid preferences πr,a, the mutation rate parameters in Equation 8, and the value of the stringency parameter β.
Experimentally determined amino-acid preferences for beta-lactamase
The site-specific amino-acid preferences for beta-lactamase were determined using data from a previously published deep mutational scanning experiment performed by Firnberg et al. (2014). Specifically, Firnberg et al. (2014) created nearly all possible amino-acid mutants of TEM-1 beta-lactamase and then used antibiotic selection to enrich for functional variants at various antibiotic concentrations. Next, they used high-throughput sequencing to examine how the frequencies of mutations changed during this functional selection. They analyzed their data to estimate the impact of individual mutations on TEM-1 function, and had sufficient data to estimate the impact of 96% of the 297 × 19 = 5, 453 possible amino-acid mutations.
Firnberg et al. (2014) report the impact of mutations in terms of what they refer to as the “fitness” effects. Firnberg et al. (2014) calculate these fitness values from their deep mutational scanning experiment in such a way that a mutation’s fitness effect is approximately proportional to the highest concentration of antibiotic on which bacteria carrying that beta-lactamase variant are able to grow. Therefore, although the fitness values are not calculated in a true population-genetics framework, they certainly reflect effects of specific mutations on the ability of TEM-1 mutants to function.
Here I use the fitness values provided by Firnberg et al. (2014) to estimate the preferences for each of the 20 amino acids at each site in TEM-1. Specifically, let wr,a be the fitness value for mutation to amino-acid a at site r reported by Firnberg et al. (2014) in Data S2 of their paper. I calculate the preference πr,a for a at site r as where the sum over a′ ranges over all 20 amino acids, the wild-type amino acid at site r is assigned a fitness of w r,a = 1 in accordance with the normalization scheme used by Firnberg et al. (2014), and the wr,a values for the 4% of mutations for which no value is estimated by Firnberg et al. (2014) are set to the average wr,a of all non-wildtype amino acids at site r for which a wr,a value is provided.
The amino-acid preferences calculated in this manner are displayed graphically in Figure 1 along with information about residue secondary structure and solvent accessibility (see Supplementary file 1 for numerical d ata). As is extensively discussed by Firnberg et al. (2014) in their original description of the data, these preferences are qualitatively consistent with known information about highly constrained positions in TEM-1, and show the expected qualitative patterns of higher preferences for specific (particularly hydrophobic) amino acids at residues that are buried in the protein’s folded structure. Here I focus on using these amino-acid preferences in a quantitative phylogenetic evolutionary model as described in the next section.
Experimentally determined amino-acid preferences improve phylogenetic fit
TEM and SHV beta-lactamase phylogenetic trees
To test if evolutionary models informed by the experimentally determined amino-acid preferences are superior to existing alternative models, I compared the fit of various models to beta-lactamase sequence phylogenies. Firnberg et al. (2014) performed their deep mutational scanning on TEM-1 beta-lactamase. There are a large number of TEM beta-lactamases with high sequence identity to TEM-1; the next closest group of lactamases is the SHV beta-lactamases (Bush et al., 1995), which on average have 62% nucleotide and 65% protein identity to TEM beta-lactamases. I assembled a collection of TEM and SHV beta-lactamases from the manually curated Lahey Clinic database (http://www.lahey.org/Studies/). These sequences were aligned to TEM-1, and highly similar sequences (sequences that differed by less than four nucleotides) were removed. The resulting alignment contained 85 beta-lactamase sequences (Supplementary file 2), of which 49 were TEM and 36 were SHV.
Maximum-likelihood phylogenetic trees of the TEM and SHV beta-lactamases were constructed using codonPhyML (Gil et al., 2013) with the codon substitution model of either Goldman and Yang (1994) or Kosiol et al. (2007). The resulting trees are displayed in Figure 2. The two different substitution models give similar tree topologies – the Robinson-Foulds distance (Robinson and Foulds, 1981) between the trees inferred with the two different models is calculated by RAxML (Stamatakis, 2006) to be 0.14. In both cases, the trees partition into two clades of closely related sequences, corresponding to the TEM and SHV beta-lactamases.
Experimentally informed models are superior for combined TEM and SHV phylogeny
To compare the evolutionary models, HYPHY (Pond et al., 2005) was used to optimize the branch lengths and free parameters of the evolutionary models to their maximum likelihood values on the fixed tree topologies in Figure 2. This analysis showed that the evolutionary models informed by the experimentally determined amino-acid preferences were clearly superior to commonly used alternative codon-substitution models.
Specifically, Table 1 and Table 2 show that the experimentally informed evolutionary models fit the combined TEM and SHV phylogeny with higher likelihoods than any of a variety of commonly used alternative models, regardless of which tree topology from Figure 2 is used. This superiority is despite the fact that the alternative models (Goldman and Yang, 1994; Kosiol et al., 2007) contain many more free parameters. For instance, the most heavily parameterized alternative model contains 60 empirically estimated equilibrium frequency parameters plus an optimized parameter corresponding to the transition-transversion ratio, two optimized parameters corresponding to a gamma distribution of nonsynonymous-synonymous ratios across sites (Yang et al., 2000), and an optimized parameter corresponding to a distribution of substitution rates across sites (Yang, 1994). In contrast, the experimentally informed models only contain four or five free parameters (the mutation rates and optionally the stringency parameter β) – yet these experimentally informed models have substantially higher likelihoods. When AIC (Posada and Buckley, 2004) is used to penalize parameters, the superiority of the experimentally informed models is even more clear.
The experimentally informed models are superior to the non-site-specific models even when the stringency parameter β is fixed to one; however, the phylogenetic fit is substantially enhanced by treating β as a free parameter (Table 1 and Table 2). In all cases, fitting of the stringency parameter yields values of β > 1, indicating that natural evolution is more stringent than the experiments in its preferences for specific amino-acids. This result is consistent with the notion that selection during real evolution selection is more sensitive than the typical laboratory experiment.
To confirm that the superiority of the experimentally informed models is due to the fact that the deep mutational scanning of Firnberg et al. (2014) captures information about the site-specific amino-acid preferences, I tested evolutionary models in which these preferences were randomized among sites or were set to the average frequencies of all amino-acids in the lactamase alignments. In the former case, the πr,a values are randomly shuffled among sites, where in the latter case πr,a for all values of r is set to the average frequency of amino-acid a over the entire lactamase alignment. As can be seen from Table 1 and Table 2, these non-site-specific models perform substantially worse than even the simplest versions of the models of Goldman and Yang (1994) and Kosiol et al. (2007). This result shows that the improved performance of the experimentally informed evolutionary models is overwhelmingly due to incorporation of information on the site-specific amino-acid preferences rather than better modeling of the mutational process.
Experimentally informed models are superior for individual TEM and SHV phylogenies
The foregoing results show that experimentally informed models are superior for describing the combined TEM and SHV beta-lactamase phylogeny. Given that the amino-acid preferences were determined by experiments using a TEM-1 parent, it is worth asking whether these preferences accurately describe the evolution of both the TEM and SHV sequences, or whether they more accurately describe the TEM sequences (which are closely related to TEM-1, Figure 2) than the SHV sequences (which only have about 65% protein identity to TEM-1, Figure 2). This question is relevant because the extent to which site-specific amino-acid preferences are conserved during protein evolution remains unclear. For instance, while several experimental studies have suggested that such preferences are largely conserved among moderately diverged homologs (Ashenberg et al., 2013; Serrano et al., 1993), a simulation-based study has argued that preferences shift substantially during protein evolution (Pollock et al., 2012; Pollock and Goldstein, 2014). If the site-specific amino-acid preferences are largely conserved during the divergence of the TEM and SHV sequences, then the experimentally informed models should work well for both these groups – but if the preferences shift rapidly during evolution, then the experimentally informed models should be effective only for the closely related TEM sequences.
To test these competing possibilities, I repeated the analysis in the foregoing section separately for the TEM and SHV clades of the overall phylogenetic tree (the red versus blue clades in Figure 2). This analysis found that the experimentally informed evolutionary models were clearly superior to all alternative models for the SHV as well as the TEM clade (Table 3, Table 4, Table 5, Table 6). In fact, the extent of superiority of the experimentally informed model (as quantified by AIC) was greater for the SHV clade than the TEM clade, despite the fact that the former has fewer sequences. These results suggest that the applicability of the experimentally determined amino-acid preferences extends to beta-lactamase homologs that are substantially diverged from the TEM-1 parent that was the specific subject of the experiments of Firnberg et al. (2014).
The stringency parameter β generally improves experimentally informed models
The results in the foregoing sections show that use of the stringency parameter β improves the phylogenetic fit of the experimentally informed models of lactamase evolution. Previous work (Bloom, 2014) has shown that an experimentally informed evolutionary model without a stringency parameter (β = 1) improves the phylogenetic fit to influenza nucleoprotein sequences relative to non-site-specific models. To test whether the fitting of a stringency parameter further improves the experimentally informed evolutionary model of nucleoprotein, I repeated the analysis of Bloom (2014) but also included a model variant in which β was fit by maximum likelihood.
Table 7 shows that fitting the stringency parameter substantially improves the phylogenetic fit of the experimentally informed model of nucleoprotein evolution. This table also shows that as with the lactamase models described above, the fitted value of β was greater than one. Overall, this result suggests that inclusion of a stringency parameter generally improves the phylogenetic fit of experimentally informed evolutionary models. Presumably this improvement arises from the fact that deep mutational scanning experiments are generally less stringent than real evolution, and so values of β > 1 help scale the experimentally determined site-specific amino-acid preferences to a stringency more reflective of those that constrain real evolution.
Experimentally informed models are slightly better for many sites
The results described above and in Bloom (2014) demonstrate that experimentally informed evolutionary models improve phylogenetic relative to non-site-specific models when analyzing the entire gene sequences of lactamase or nucleoprotein. It is interesting to investigate which sites are described more accurately by the experimentally informed models. This can be done by examining the differences in per-site likelihoods between models after fixing the branch lengths and model parameters to their maximum likelihood values for the entire gene.
Figure 3 compares the per-site likelihoods for the best experimentally informed evolutionary model for lactamase and for nucleoprotein relative to the best non-site-specific model for each of these genes (using the models from Table 1 and Table 7). For both lactamase and nucleoprotein, the experimentally informed models lead to small improvements for many sites. Overall, 72% (207 of 286) lactamase sites are described better by the experimentally informed model (Supplementary file 3), and 82% (407 of 498) of nucleoprotein sites are described better by the experimentally informed model (Supplementary file 4). There appears to be a slight trend for improvements due to the experimentally informed models to be most common for buried sites, but the experimentally informed models lead to small improvements for sites spanning a range of solvent accessibilities and secondary structures.
Interestingly, for both genes there are also a few sites for which the experimentally informed models are far worse than the non-site-specific models. Presumably these sites are modeled poorly because the preferences determined by the deep mutational scanning experiments do not accurately capture the real preferences of natural evolution. Such discrepancies could arise from either a failure of the deep mutational scanning experiments to fully reflect natural selection pressures, or from epistatic interactions that strongly shift the site preferences during natural evolution away from those measured for the parent gene by the experiments. I was unable to observe any obvious features of the specific sites that were described poorly by the experimentally informed models (Supplementary file 3, Supplementary file 4). It therefore remains an open question why certain sites are described very poorly by the experimentally informed models even though the vast majority of sites are described better by these models.
Comparison of different methods for computing fixation probabilities
In the foregoing analyses, two different mathematical relationships were used to mathematically relate the experimentally determined amino-acid preferences to the substitution probabilities in the evolutionary models. One relationship (Equation 2) is based on a true population-genetics derivation by Halpern and Bruno (1998) under the assumption that the preferences are reflective of selection coefficients for amino acids at specific sites (an assumption that would only be strictly true only in the unlikely case that individual sites in a gene contribute independently to fitness). The other relationship (Equation 3) is a more ad hoc one that I suggested in previous work (Bloom, 2014) on the grounds that the amino-acid preferences might be best envisioned not as selection coefficients, but rather as measurements of the fraction of genetic backgrounds that tolerate a specific mutation, as would be implied by the evolutionary dynamics described in Bloom et al. (2007). Although both relationships share the qualitative feature that mutations to higher-preference amino acids are favored over mutations to lower-preference ones, they differ in their quantitative details. In previous work on influenza nucleoprotein (Bloom, 2014), I reported that the relationship in Equation 3 outperformed the one in Equation 2 derived by Halpern and Bruno (1998).
In contrast, for the beta-lactamase sequences studied here, the relationship of Halpern and Bruno (1998) outperforms the one in Equation 3 (Table 1, Table 2, Table 3, Table 4, Table 5, Table 6). The reason for and relevance of these discordant results remains unclear. There are almost certainly differences in the evolutionary conditions (population size, degree of polymorphism, etc) for influenza nucleoprotein and beta-lactamase that influence the relationship between selection coefficients and fixation probabilities. In addition, there are substantial differences between the experiments of Firnberg et al. (2014) on beta-lactamase and my previous work on nucleoprotein– in particular, Firnberg et al. (2014) examine the effects of single mutations to the parental gene, whereas the nucleoprotein experiments examined the average effects of individual mutations in variants that often contained multiple mutations. Finally, the experimental measurements are imperfect – for nucleoprotein, the preferences determined by independent biological replicates of the experiments only had a Pearson correlation coefficient of 0.79; Firnberg et al. (2014) do not provide data on the consistency of their measurements across biological replicates, but it seems safe to assume that their experiments are also imperfect. Therefore, further work is probably needed to determine if any meaning can be ascribed to the differences in fit for Equation 2 versus Equation 3, as well as to identify the optimal mathematical relationship for connecting experimentally measured amino-acid preferences to substitution probabilities in evolutionary models. However, both the results presented here and in Bloom (2014) strongly suggest that using any reasonable mathematical relationship to inform evolutionary models with experimentally determined amino-acid preferences is sufficient to lead to dramatic improvements in phylogenetic fit.
It is also interesting to speculate on the precise meaning of the stringency parameter β. According to population-genetics theory, the strength of selection increases with effective population size. It is therefore tempting to interpret β as reflecting differences in the effective population size in the deep mutational scanning experiments relative to natural evolution. Indeed, one could imagine even attempting to use the inferred value of β to indirectly quantify effective population size. However, it is important to temper this tantalizing possibility with the reminder that the fixation probabilities over the entire phylogeny can be related to the fitness effects of specific mutations only under the unrealistic assumption that the fitness contributions of mutations at different sites are entirely independent, since these probabilities reflect the effect of a mutation averaged over all genetic backgrounds in the phylogeny. So while the fact that using β > 1 consistently improves phylogenetic does indicate that natural evolution is more stringent in its site-specific amino-acid preferences than the experiments, the exact population-genetics interpretation of β is unclear.
Discussion
I have shown that an evolutionary model informed by experimentally determined site-specific amino-acid preferences fits beta-lactamase phylogenies better than a variety of existing models that do not utilize site-specific information. When considered in combination with prior work demonstrating that an experimentally determined evolutionary model dramatically improves phylogenetic fit for influenza nucleoprotein (Bloom, 2014), these results suggest that experimentally informed models are generally superior to non-site-specific models for phylogenetic analyses of protein-coding genes. The explanation for this superiority is obvious: proteins have strong preferences for certain amino acids at specific sites (Ashenberg et al., 2013; Halpern and Bruno, 1998) which are ignored by non-site-specific models. The use of experimentally measured site-specific amino-acid preferences improves evolutionary models by informing them about the complex selection that shapes actual sequence evolution. Interestingly, inclusion of a parameter that allows the stringency of site-specific amino-acid preferences during natural evolution to exceed those measured by experiments further improves phylogenetic fit, suggesting that deep mutational scanning experiments remain less sensitive than actual natural selection.
I have not compared the experimentally informed evolutionary models to more recent site-specific models that infer aspects of the substitution process from the sequence data itself (Lartillot and Philippe, 2004; Rodrigue et al., 2010; Le et al., 2008; Wu et al., 2013; Wang et al., 2008). Such a comparison is desirable since these site-specific models will almost certainly compare more favorably to the experimentally informed models used here. Unfortunately, such a comparison is sufficiently technically challenging to be beyond the scope of the current work, since it requires comparing between maximum-likelihood and Bayesian approaches. In any case, the experimentally informed evolutionary models described here should not be viewed purely as competitors for existing site-specific substitution models. Rather, one can imagine future approaches that integrate the results of a deep mutational scanning experiment with additional site-specific details inferred from natural sequence data to create even more nuanced evolutionary models.
An advantage of experimentally informed evolutionary models is that they naturally lend themselves to interpretation in terms of quantities that can be directly related to both specific biochemical measurements and the underlying processes of mutation and selection. This stands in contrast to most existing models, which are phrased in terms of heuristic parameters (such as codon “equilibrium frequencies”) that reflect the combined action of mutation and selection and are not accessible to direct experimental measurement. Experimentally informed evolutionary models therefore have the potential to facilitate connections between the phylogenetic substitution processes and the underlying biochemistry and population genetics of gene evolution (Thorne et al., 2007; Halpern and Bruno, 1998).
The major drawback of experimentally informed models is their more limited scope. Most existing codon-based evolutionary models can be applied to any gene (Goldman and Yang, 1994; Muse and Gaut, 1994; Kosiol et al., 2007) – but experimentally informed models require experimental data for the gene in question. However, this requirement may not be as crippling as it initially appears. The first experimentally determined evolutionary model for influenza nucleoprotein required direct measurement of both the site-specific amino-acid preferences and the underlying mutation rates (Bloom, 2014). However, the model presented here only requires measurement of the amino-acid preferences, as the mutation rates are treated as free parameters. Rapid advances in the experimental technique of deep mutational scanning are making such data available for an increasing number of proteins (Fowler et al., 2010; Roscoe et al., 2013; Starita et al., 2013; Melamed et al., 2013; Traxlmayr et al., 2012; McLaughlin Jr et al., 2012; Firnberg et al., 2014; Bloom, 2014).
In this respect, it is encouraging that the site-specific amino-acid preferences determined experimentally for TEM-1 improve phylogenetic fit to substantially diverged (35% protein sequence divergence) SHV beta-lactamases as well as highly similar TEM beta-lactamases. As discussed in the Introduction, there are two major limitations to most existing evolutionary models: they treat sites identically, and they treat sites independently. Experimentally informed evolutionary models of the type described here have the potential to completely remedy the first limitation as deep mutational scanning defines site-specific selection with increasing precision. However, such models still treat sites independently – and this limitation will never be completely overcome by experiments, since the unforgiving math of combinatorics means that no experiment can examine all arbitrary combinations of mutations (for example, TEM-1 has only 5453 single amino-acid mutants, but it has 14815801 double mutants, 26742520805 triple mutants, and over 1013 quadruple mutants). The utility of experimentally informed evolutionary models therefore depends on the extent to which site-specific amino-acid preferences measured for one protein can be extrapolated to other homologs – in other words, are sites sufficiently independent that the preferences at a given position remain similar after mutations at other positions? This question remains a topic of active debate, with experimental studies suggesting that site-specific preferences are largely conserved among closely and moderately related homologs (Ashenberg et al., 2013; Serrano et al., 1993), but some computational studies emphasizing substantial shifts in preferences during evolution (Pollock et al., 2012; Pollock and Goldstein, 2014). The fact that the TEM-1 experimental data informs a model that accurately describes the substantially diverged SHV homologs suggests reasonable conservation of site-specific amino-acid preferences among beta-lactamase homologs.
This apparent conservation of site-specific amino-acid preferences suggests that the phylogenetic utility of experimentally informed evolutionary models may extend well beyond the immediate proteins that were experimentally characterized. This type of experimental generalization would have precedent: only a tiny fraction of proteins have been crystallized, but because structure is largely conserved during protein evolution, it is frequently possible to use a structure determined for one protein to draw insights about a range of related homologs (Lesk and Chothia, 1980; Sander and Schneider, 1991). It seems plausible that the conservation of site-specific amino-acid preferences could similarly enable deep mutational scanning to provide the experimental data to inform evolutionary models of sufficient scope to improve the accuracy and interpretability of phylogenetic analyses for a substantial number of proteins of interest.
Methods
Availability of computer code and data
The phylogenetic analyses were performed using the software package phyloExpCM (phylogenetic analyses with experimental codon models, https://github.com/jbloom/phyloExpCM), which primarily serves as an interface to run HYPHY (Pond et al., 2005). Input data, computer code, and a description sufficient to enable replication of all analyses reported in this paper are available via http://jbloom.github.io/phyloExpCM/example_2014Analysis_lactamase.html and http://jbloom.github.io/phyloExpCM/example_2014Analysis_Influenza_NP_Human_1918_Descended_withbeta.html.
Equilibrium frequencies and reversibility
Here I show that the evolutionary model defined by Equation 1 is reversible (satisfies detailed balance), and has pr,x defined by Equation 9 as its equilibrium frequency.
First, note that the fixation probabilities Fr,xy defined by both Equation 2 and Equation 3 satisfy reversibility with respect to the stringency-adjusted amino-acid preferences namely that as can be verified by direct substitution. This relationship means that if all codon interchanges were equally likely (all Qxy values are equal), then the equilibrium frequency pr,x of codon x would simply be proportional to the stringency-adjusted preference for the encoded amino acid.
However, in practice all codon interchanges are not equally likely, so the actual equilibrium frequencies pr,x will also depend on the mutation rate parameters Rm→n listed in Equation 8. This dependence is given by the qx terms in Equation 9. These qx terms can be thought of as the expected equilibrium frequencies of the codons in a hypothetical situation in which there is no selection and all 64 codons are equally fit (all Fr,xy values are equal). In other words, the qx terms define the stationary state of the reversible stochastic process defined by the mutation rates Qxy. In order to show qx given Equation 10 defines this stationary state, it is necessary to show that
There are up to 12 possible types of single-nucleotide mutations that can be made to a codon x to create a different codon y (three possible mutations to each of the four possible nucleotides); however only four of these types of mutations require independent verification o f E quation 13. Specifically, the possible types of mutations that require independent verification of Equation 13 are when x differs from y by a mutation of A →T, of C →G, of A →C, or of A →G. The other eight types of mutations do not have to be verified because they are equivalent to one of these first four cases: C → A is equivalent to A → C by symmetry (i.e. interchange of the labels of codons x and y), G →A is equivalent to A → G by symmetry, T → A is equivalent to A → T by symmetry, G → C is equivalent to C → G by symmetry, T → G is equivalent to A → C because of Equation 5, T → C is equivalent to A → G because of Equation 5, G → T is equivalent to C → A by Equation 5 and then to A → C by symmetry, and C → T is equivalent to G → A by Equation 5 and then A → G by symmetry. So below I verify Equation 13 for the four independent types of mutations, full verifying Equation 13.
The first case is where x differs from y by a mutation of A → T, such that Q xy = R A→T and Qyx = RT→A. In this case, we have where the first l ine substitutes the definitions of Equation 10 and Equation 4, the second line follows from Equation 5 and the fact that if x and y differ only by an A →T mutation, and the final line again substitutes the definitions of Equation 10 and Equation 4.
The second case is where x differs from y by a mutation of C → G, such that Qxy = RC→G and Qyx = RG→C. In this case, we have where the justifications for the three lines are identical for those used for Equation 14.
The third case is where x differs from y by a mutation of A → C, such that Qxy = RA→C and Qyx = RC→A. In this case, we have where the first line substitutes the definitions of Equation 10 and Equation 4, the second line uses Equation 7, the third line is simple algebra, the fourth line follows from the fact that if x can be converted to y by an A → C mutation, and the final line again substitutes the definitions of Equation 10 and Equation 4.
The fourth and final case is where x differs from y by a mutation of A → G, such that Qxy = RA→G and Qy→x = RG→A. In this case, we have where the first line substitutes the definitions of Equation 10 and Equation 4, the second line uses Equation 6 and Equation 7, the third line is simple algebra, the fourth line follows from the fact that if x can be converted to y by an A → C mutation, the fifth line follows from Equation 5, and the final line again substitutes the definitions of Equation 10 and Equation 4.
Taken together, Equation 14, Equation 15, Equation 16, and Equation 17 establish that Equation 13 holds for all possible independent types of mutations.
Finally, to show that the overall evolutionary model in Equation 1 is reversible and has pr,x defined by Equation 9 as its equilibrium frequency, it is necessary to show that pr,x × Pr,xy = pr,y × Pr,yx. This follows trivially from Equation 12 and Equation 13:
The fact that Pr,xy defines a reversible Markov process with stationary state pr,x means that it is possible to define a symmetric matrix Sr such that where diag (…, pr,x,…) is the diagonal matrix with pr,x along its diagonal. Noting , we have
This matrix is symmetric since Sr,xy = Sr,yx as can be verified from the fact that and as is guaranteed by Equation 12 and Equation 13.
Acknowledgments
Thanks to T. Bedford and J. Felsenstein for helpful discussions, and the editor and two anonymous reviewers for exceptionally helpful comments. This work was supported by the National Institute of General Medical Sciences of the National Institutes of Health (grant number R01 GM102198).
Footnotes
E-mail: jbloom{at}fhcrc.org.