Abstract
Most methods for performing a phylogenetic analysis based on sequence alignments of gene data assume that the mechanism of evolution is constant through time. It is recognised that some sites do evolve somewhat faster than others, and this can be captured using a (gamma) rate heterogeneity model. Further, some species have shorter replication times than others, and this results in faster rates of substitution in some lineages. This feature of lineage specific rate variation can be captured to some extent, by using relaxed clock models. However, it is also clear that there are additional poorly characterised features of sequence data that can sometimes lead to extreme differences in lineage specific rates. This variation is poorly captured by constant time reversible substitution models. The significance of extreme lineage specific rate differences is that they lead both to errors in reconstructing evolutionary relationships as well as biased estimates for the age of ancestral nodes. We propose a new model that allows gamma rate heterogeneity to change on branches, thus offering a more realistic model of sequence evolution. It adds negligible computational cost to likelihood calculations. We illustrate its effectiveness with an example of green algae and land-plants. For many real world data sets, we find a much better fit with multi-gamma sites models as well as substantial differences in ancestral node date estimates.
1 Introduction
Systematic error is one of the major concerns of phylogenetic reconstruction, as it can mislead tree building. Heterotachy [22, 24] – the inference of lineage specific rate variation – is one problem and potential cause of systematic error. Heterotachy is present in many data sets [22]. Different underlying processes of evolution can explain heterotachy [20], one of these being that changes in intermolecular interactions over time can affect which and how many sites are free to vary [16, 15]. To some extent, the processes of nucleotide and amino acid substitution can be accommodated by covarion and gamma distributed substitution models [12, 20, 23, 28, 29]. To maintain time reversibility, these models assume that the number of sites assigned to a particular evolutionary rate class sequence remains constant through time [30]. However, this might not be a realistic biological assumption. Numerous observations suggest that the proportion of variable sites in homologues changes over larger evolutionary distances [15, 19, 21] (but see also [33] regarding sufficiency). A consequence of this form of non-stationarity, when modelling sequence evolution using a discrete gamma distribution, is that the alpha shape parameter, which describes the proportion of sites assigned to a particular rate class, will change across the underlying tree [30].
For the gamma rate heterogeneity model to be consistent, combining two samples should result in similar shape parameter estimates. However, it is well known that different sampling strategies result can in different estimates of the shape parameter [28]. This suggests that the rate heterogeneity is not constant, but changes throughout the tree, which justifies investigating models that allow such shape variation.
Here we investigate heterotachy and its impact on topology and divergence time estimates in the green plant evolutionary lineage. This is one of the oldest and most diverse branches of the tree of life. Researchers reconstructing evolutionary relationships with green plants face difficulties in phylogenetic reconstruction similar to those being faced elsewhere by others [24]. The challenge is worthwhile, as overcoming the problems will lead to insight into unresolved questions such as: What was the nature of the earliest angiosperms and land plants? How many times was land colonised from the water by “green algae”? Where did the key adaptive features for life on land come from? How many times has multicellularity arisen in the green plants? Did multicellularity ever reverse? How many times did alternation of generations and diploid-dominant life-cycles arise? While justifiable emphasis has been placed on the importance of taxon sampling, increasing attention is now being focussed on the problem of data-model fit [6]. Here, we examine the potential of multigamma and relaxed-gamma site models to improve data-model fit and in so doing improve phylogenetic inference. We describe our findings and implement these new models in the open source software BEAST and thus introduced a practical and user friendly way to model shape parameter variation throughout the tree.
2 Methods
First, we revisit the classic gamma rate heterogeneity model, before describing the multi-gamma model and the relaxed-gamma model.
2.1 Gamma site model (with invariant sites)
The likelihood of the data in an alignment D for tips of a tree g is: where Ω = {Q, µ} includes parameters Q of the substitution model and the overall rate µ. Consider edges 〈i, j〉 ϵ R of tree g = {R, t}, and node heights ti and tj, and let the entry si,k be a nucleotide base at site k of the sequence at node i. As usual, we assume the sites in the alignment are independent, and the alignment represents data at the tips, but not for internal nodes. Then we sum over all possible ancestral sequences DY for assignments of internal node to get Felsenstein’s tree likelihood [11] which can be written as
This likelihood assumes constant rates over all sites, which is not realistic in many situations. Yang et al. [34] introduced a model that assumes site rates are distributed according to a gamma distribution with mean one, leaving one shape parameter α. The likelihood can be written as which can be efficiently calculated using Felstenstein’s peeling algorithm [11]. Computationally, it is convenient to approximate the integral by a sum over K categories with rates rc(α) for category c representing the mean of the kth quantile of the gamma distribution with mean 1 and shape parameter α. The likelihood can be calculated as a mixture of likelihoods with rates rc(α) as
Often, there are many constant sites, and a rate of zero would fit best for these sites. Adding another category r1 = 0 with invariant sites [14, 32] for a proportion p1 of the mixture, and using the remainder for the gamma rate model with K categories, the likelihood becomes where for c > 1 and rc(α) the mean of the c – 1th quantile of the gamma distribution for c > 1.
2.2 Multi-gamma site model
In the gamma rate heterogeneity model, the shape parameter α is assumed to be constant throughout the tree. In the multi-gamma model we assume that for each branch and in node i in the tree, we have an individual shape parameter αi governing the heterogeneity of the rates. Practically, this means replacing the rate rc(α) in Equation (1) with rate rc(αi).
To distinguish this model from Yang’s model, we call the latter the single-gamma mode in the remainder of this paper. An obvious disadvantage is that in a tree with n branches, we need to estimate n parameters for multi-gamma model while only a single parameter needs to be estimated for the single-gamma model.
Our experience it that this is achievable in practice if there is sufficient heterotachy in the data, and it looks like a good option for maximum likelihood implementations. However, care must be taken with binary rooted trees, since the shape parameters for branches to the root appears to be unidentifiable in practice.
2.3 Relaxed-gamma site model
Instead of estimating the individual shape parameters, we can assume they are distributed according to a density p(α|θ) shared by all shape parameters and integrate them out. Pr{D|g, Ω} becomes
We call this model the relaxed-gamma model to distinguish it from the single and multi-gamma models. Note that the number of parameters in this model is just the cardinality of θ. If p(α|θ) is a log-normal or gamma distribution, there are just two parameters, which is just a single parameter more than the single-gamma model.
Unfortunately, calculating Equation (4) directly is not feasible, so we use MCMC in a Bayesian setting to approximate these integrals in a similar fashion as done for the uncorrelated relaxed model [8]. The density p(α) can be discretised into a number of quantiles. Earlier findings [18, 25] suggests that setting the number of quantiles equal to the number of branches is appropriate that number is larger than 100. For each quantile q, we calculate the mean value mq and for each branch i, and we maintain an index αi of a quantile for that branch. We use the following proposals
a proposal, that randomly selects a branch and randomly selects a new value for the index αi inside a window of its old value,
a proposal that randomly selects a branch and randomly selects a new value for the index αi in its range,
a proposal that randomly selects two branches and swaps their indices,
a proposal to scale the parameters of the density p(α|θ) for each of the parameters in θ.
2.4 Miscellaneous
Similar to the single gamma model with a category for invariant sites, the multigamma and relaxed gamma models can be extended to include invariant sites by generalising Equation (3) in a similar way we generalised Equation (2) by adding a category with rate zero for all branches.
Remark: We did not make any assumptions about the tree, so the theory applies to unrooted, rooted and rooted time trees [8], binary trees as well as polytopies and sampled ancestor trees [13].
Note that the computation complexity of the tree-likelihood – Equation (1) – adapted for the multi-gamma or relaxed gamma model using the Felstenstein’s peeling algorithm does not change, so no extra computational cost is incurred in calculating the likelihood. Of course, since the parameter space is larger for these models, longer computation is required to estimate these extra parameters.
3 Results
In this section, we show the effectiveness of the model using a simulation study and using randomly selected alignments from TreeBase [26], alignments from the PartitionFinder dataset repository https://github.com/roblanf/PartitionedAlignments and a case study concerning the evolutionary relationships of green plants and algae.
3.1 Simulation study
First, we establish how reliably the shape parameter of the single-gamma model can be estimated. The shape parameter α was sampled from the exponential distribution with mean 1, which is commonly used as prior for the shape parameter. Sequences of 10,000 sites were simulated under the HKY model on a tree with 3 taxa ((A, B), C) and a single-gamma model with four categories. The simulation study was performed using BEASTShell [2]. Then, an analysis under the same model was done in BEAST where α was estimated. Figure 1 shows the sampled α on the x-axis and estimated α on the y-axis. It shows that α can be estimated correctly over a large range, though it underestimates the alpha value for very large values and overestimates the alpha value for very small values. This is to be expected, since these extreme values, though rare, do not leave a large trace in the data, so the prior pulls the estimate towards the mean of 1.0. Note that smaller α values are estimated with lower error than higher α values.
We repeated the experiment, but instead of keeping α constant throughout the tree, only the α for a single branch in taxon A was varied. The left of Figure 2 shows what happens when we estimated α using the single-gamma model; it appears that the estimated α is the mean of the simulated α values, when branch lengths are taken into account. The α value that was varied does not dominate the estimate. The right of Figure 2 shows α values for individual branches. There is some increase in the uncertainty of the estimates, but simulated values remain in the 95%HPD of estimated values, again with the exception of extreme α values. In conclusion, our observations show that under the simulated conditions the multi-gamma model can recover shape parameters.
3.2 Nucleotide alignments from TreeBase
To get an impression of how the multi-gamma model performs on empirical data, we randomly selected 50 nucleotide alignments of various sizes from TreeBase [26]. Appendix A lists the details of the alignments. The analysis uses a HKY substitution model [17], the uncorrelated relaxed clock model with log normal distribution [9] and Yule tree prior [35]. The model was run both with single gamma and a category for invariant sites, and results compared with the multi-gamma and relaxed gamma model, both with a category for invariant sites. The analyses were run till convergence, judging from effective sample sizes of at least 200.
Figure 3 shows results for the 50 datasets in order of alignment size, which is the number of nucleotides contained in the alignment, that is, the number of taxa times the length of the alignment. The top shows differences in likelihood estimates and AICM [1] scores for these analyses. It appears that for smaller alignments, there is not much benefit of using the multi-gamma model. However, it does not cause much damage either. For larger alignments, there can be significant improvement of the fit. Given these data sets were randomly selected, we expect this to be true for many data sets in general.
The bottom of Figure 3 shows there are sometimes dramatic differences in divergence time estimates. This observation highlights the concern for the impact that model misspecification can have on molecular dating.
This conclusion is supported by similar analyses of the datasets from the partition finder repository (see Appendix D). Comparing results for single and relaxed gamma site models that assume a proportion of invariable sites, we find that for those datasets that converged (not all of them did, since some datasets were too large to converge over the period of analysis) relaxed gamma site models in general fit better for any substantial dataset, judging from AICM estimates. Further, time estimates can differ considerably between models.
3.3 Dating the origin of green plants and algae
We collected a set of 34 atpB and rpoC1 chloroplast genes from Genbank (see Appendix B for accession numbers), aligned their inferred protein sequences using MUSCLE [10] in MEGA [31] and then converted the protein sequences back to nucleotides. This produced data matrices for atpB with 1401 sites (749 patterns) and for rpoC1 with 1359 sites (1097 patterns). Furthermore, we constrained the tree to the topology shown in Figure 4 and we have two calibrations one on the stem leading to angiosperms following divergence from conifers and one on the stem leading to ferns subsequent to their divergence from seed plants [27] (Figure 4).
Data was analysed in BEAST [5] by assuming a separate partition for each codon position. A reversible-jump based substitution model [3] was used and GTR was found to be preferred in all cases, with the exception of partition 3 of the atpB dataset, for which the EVS model was preferred. A discrete four category gamma distribution model that also assumed invariable sites, was found to fit the data much better than a model that assumed no positional rate heterogeneity or a model that assumed a single rate class plus invariable sites. We used an uncorrelated relaxed clock (log normal) model [9] for branches, and estimated a coefficient of variation of around 0.7 for both genes. This result indicates that that the strict clock can be rejected. A Yule process was assumed for the tree prior. The main issue when using a single-gamma site model is that root height estimates differ considerably, with mean estimates leaving a gap of over 100M year. Also, the topology in the part of the tree that is not restricted differs between the genes. Given both genes are chloroplast genes, hence must share a common ancestry, one would expect the difference between time estimates to be much smaller, as well as showing support for the same topology. See Appendix C for details on the BEAST XML files.
Figure 5 shows the marginal likelihoods for the two genes, with single-gamma, multi-gamma and relaxed gamma models. The relaxed-gamma model was run as a single model shared by the three partitions, and a variant that has separate relaxed-gamma models for each of the three partitions. Figure 5 shows the latter model fits much better than the former as indicated by the non-overlapping densities of log likelihood, while both perform much better than the single-gamma model. The multi-gamma model applied to all three partitions (that is, three shape parameters per branch) fits similarly to the relaxed-gamma model; slightly worse for atpB and slightly better for rpoC1.
Figure 6 shows a DensiTree [4] of the atpB and rpoC1 tree-posteriors. Switching from single to relaxed-gamma-model does not change the inferred topology very much, and the differences in topology is illustrated in the figure.
Figure 7 shows marginal densities of root height estimates. It shows that mean root height estimates for the atpB and rpoC1 gene differ much less using the multi-gamma (79My) model than the single-gamma model (107My). Using the relaxed gamma model makes the estimates even less dramatic, reducing the difference to 28My. Note that the densities between atpB and rpoC1 overlap much more for the multi and relaxed gamma-models than the single-gamma model, indicating less discrepancy between the mdoels. However, the 95% HPDs of root heights becomes much larger under the latter models.
It is notable that there is a rate paradox for this data; estimates for clock rates are higher for multi-gamma and relaxed-gamma models (mean rate 6.86E-10 and 1.03E-9 substitutions per site per year for single gamma and 9.17E-10 and 1.04E-9 s/s/y for relaxed-gamma model for APT and rpoC1 respectively). The higher clock rate is consistent with a better fit, but one would expect a lower root to accommodate the mutations in the data, but the models also infer a higher root date instead of a lower one. This suggests that for this data the multi/relaxed gamma model accommodate more mutations. An alternative interpretation of our results is that the single-gamma model underestimates the amount of mutation.
The analysis was repeated with atpB and rpoC1 concatenated into a single alignment, which seems reasonable given both are from the chloroplast genome, hence must share a common history. Essentially, all results can be interpreted as the average of the results for the individual atpB and rpoC1 analyses. The fit for multi/relaxed-gamma is still much better than for single gamma, the root estimate higher, and in between that of atpB and rpoC1 individually, and topology estimate is partly compatible with the atpB topology and partly with the rpoC1 topology.
4 Discussion
We introduced a Bayesian framework for allowing shape parameters governing rate heterogeneity to vary across branches. The method allows us to get consistent time estimates for different chloroplast genes for green plants, while without the multi-gamma rate model we get substantially different estimates. This suggests that the heterotachy suspected for this data can be captured effectively through multi-gamma site models. Furthermore, for many datasets, the method gives a much better fit as well as different time estimates.
The method is implemented in the MGSM (multi-gamma site model) package for BEAST [5]. It is open source and freely available under LGPL license from https://github.com/BEAST2-Dev/MGSM/, and offers GUI support through BEAUti for setting up an analysis (see https://github.com/BEAST2-Dev/MGSM/wiki).
It did not escape our attention that the techniques for handling branch related parameters as in the multi-gamma and relaxed gamma models could be used for other features of substitution models over branches. Possible directions for generalising multi-gamma site models include site models with a limited number of gamma shapes, in a similar fashion as clock models such as the random local clock [7] can have a limited number of branch rates in between 1 and the total number of branches.
Acknowledgements
This work was assisted by stimulating discussions with Chris Simon and Dave Marshall as well as comments at the New Zealand Phylogenomics 2015 meeting. This work was funded by a Rutherford fellowship (http://www.royalsociety.org.nz/programmes/funds/rutherford-discovery/) from the Royal Society of New Zealand awarded to Prof. Alexei Drummond.
Appendix A: List of alignments from TreeBase
Summary table for nucleotide sequences from TreeBase. The number in first column coincides with numbers in Figure 3
Appendix B: Genbank sequences
Appendix C: XML for green plants and algae analysis
All BEAST XML files can be downloaded from here: https://www.cs.auckland.ac.nz/~remco/MGSMxml.tgz.
All analyses uses partitions for each 3 codon positions, a reversible-jump based substitution model, uncorrelated log normal clock model and Yule tree prior.
atpB = atpB chloroplast gene from green plants and algae
rpoc = rpoC1 chloroplast gene from green plants and algae
atprpo = atpB and rpoC1 genes concatenated
Appendix D: Results for partition finder repository data
Footnotes
remco{at}cs.auckland.ac.nz P.J.Lockhart{at}massey.ac.nz