Abstract
The pattern of molecular evolution varies among gene sites and genes in a genome. By taking into account the complex heterogeneity of evolutionary processes among sites in a genome, Bayesian infinite mixture models of genomic evolution enable robust phylogenetic inference. With large modern data sets, however, the computational burden of Markov chain Monte Carlo sampling techniques becomes prohibitive. Here, we have developed a variational Bayesian procedure to speed up the widely used PhyloBayes MPI program, which deals with the heterogeneity of amino acid propensity. Rather than sampling from the posterior distribution, the procedure approximates the (unknown) posterior distribution using a manageable distribution called the variational distribution. The parameters in the variational distribution are estimated by minimizing Kullback-Leibler divergence. To examine performance, we analyzed three large data sets consisting of mitochondrial, plastid-encoded, and nuclear proteins. Our variational method accurately approximated the Bayesian phylogenetic tree, mixture proportions, and the amino acid propensity of each component of the mixture while using orders of magnitude less computational time.
1 Introduction
Understanding the evolutionary variation of phenotypic characters and testing hypotheses about the underlying mechanism are some of the main concerns of evolutionary biology. Because this variation needs to be interpreted as an evolutionary history, accurately inferring the phylogenetic tree is important. Otherwise, the uncertainty of phylogenetic inference must be taken into account to obtain an unbiased picture of evolutionary variation.
The increasing amount of available genomic data enables reliable inference of phylogenetic trees. Because molecular evolution is largely driven by nearly neutral or slightly deleterious mutations (Ohta 1973), this process is less prone to convergent evolution than the evolution of phenotypic traits. The pattern of molecular evolution is statistically formulated with Markov processes. The pattern and rate of molecular evolution are complex, however, depending on various factors affecting mutation rates and functional constraints. To model protein evolution, Thorne, Goldman, and Jones introduced the concept of hidden states of secondary structure to describe site heterogeneity (Thorne et al. 1996; Goldman et al. 1996; Jones et al. 1996). Koshi and Goldstein (1998) developed a model of physico-chemical properties of amino acids, while Halpern and Bruno (1998) introduced a more advanced model with position-specific amino acid frequencies.
Equilibrium amino acid frequencies, which reflect structural and functional constraints, vary among sites within and among proteins. Inter-species comparative genomics approaches can analyze a huge number of alignment columns, but the number of taxa is often insufficient to estimate individual position-specific amino acid frequencies. To achieve a balance between variance and bias, Lartillot and Philippe (2004) proposed a Bayesian non-parametric approach based on a countably infinite mixture model, referred to as the CAT model. This model specifies K distinct processes (or classes), each characterized by a particular set of equilibrium frequencies, and sites are distributed according to a mixture of these K distinct processes. By adopting a truncated stick-breaking representation of the Dirichlet process prior on the space of equilibrium frequencies (Ferguson 1973; Green and Richardson 2001; Ishwaran and James 2001), the total number of classes can be treated as a free variable of the model. A hybrid framework combining Gibbs sampling and Metropolis-Hastings algorithms has been developed to estimate all parameters of the model (Papaspiliopoulos and Roberts 2008).
Existing approaches cannot take full advantage of the CAT model (Lartillot and Philippe 2004; Lartillot 2006), because the computational burden is prohibitive for inference from large data sets. Even well-designed sampling schemes need to generate a large number of posterior samples over the entire data set to reach convergence, and their convergence can be difficult to diagnose. To provide faster estimation, Lartillot et al. (2013) developed PhyloBayes MPI, a message passing interface (MPI) parallelization of the PhyloBayes program. By implementing Markov chain Monte Carlo (MCMC) samplers in a parallel environment, PhyloBayes MPI allows faster phylogenetic reconstruction under complex mixture models.
Here, we propose an alternative approach, a variational inference method (Jordan et al. 1999; Bishop 2006; Blei et al. 2006; Hoffman et al. 2013). The basic idea of variational inference is to formulate the estimation of marginal or conditional probabilities as an optimization problem rather than as sampling-based inference. Variational methods, originally used in statistical physics to approximate intractable integrals, have been successfully applied in a wide variety of settings, including complex networks (Gopalan and Blei 2013) and population genetics (Gopalan et al. 2016; Raj et al. 2014). In this article, we demonstrate that our algorithms are considerably faster than PhyloBayes MPI while achieving comparable accuracy.
2 New Approaches
The CAT model formulates the substitutional heterogeneity across sites of protein sequences as a mixture of different equilibrium amino acid frequencies, called profiles. By introducing a Dirichlet process prior on these profiles, the number of categories, the profile of each category, and the resultant phylogenetic tree are estimated from the data in a Bayesian framework. The standard Markov chain Monte Carlo (MCMC) approach facilitates parameter estimation of this parameter-rich model and enables robust inference of phylogenetic trees while allowing for the complexity of protein evolution.
The rapid growth of genomic databases theoretically enables accurate classification of amino acid sites in protein sequences, but the Monte Carlo integration becomes computationally more challenging. To allow the CAT model to extract the maximum amount of relevant information from the data, we have developed a variational Bayesian procedure. The core of the variational framework is a mean-field approximation of the posterior distribution: we approximate the posterior with a mean-field variational distribution, which is much easier to work with computationally. In this approximation, the parameters and hidden variables are assumed to be independent of one another. The parameters of the variational distribution are obtained by minimizing the Kullback-Leibler (KL) divergence between the true conditional distribution of the hidden variables given the observations and their variational distributions. Inference becomes a single optimization problem that gives us approximate analytical forms for the posterior distributions over the unknown variables of the CAT model, as well as an approximate estimate of the intractable marginal likelihood. To deal with the uncertainty of tree topologies, we have retained the Gibbs sampling algorithm for tree topologies (Lartillot et al. 2013).
Both variational inference and MCMC algorithms were run in a parallel environment. The properties of the parallel version were evaluated on a personal computer (Intel Core i7-6700 CPU, 3.40 GHz, 4 cores with 2 threads per core, 16 GB RAM) under Linux Mint 17.3 Rosa.
3 Results
3.1 Runtime Performance
To compare the performance of our version of variational inference with that of the MCMC algorithm of PhyloBayes MPI, we estimated the CAT model with both algorithms using real data sets. This portion of the study was carried out using three real data sets, the largest consisting of 38,330 amino acid positions from 66 species. The goals of this data analysis were to demonstrate the numerical feasibility of our implementations and to ascertain the accuracy of our variational inference approach. In our comparisons, all algorithms were timed under equivalent computational conditions. Because of the intensive nature of the estimations, further computational experiments will be required to test the performance of variational inference on much more massive data sets.
First, we explored whether our new approximation approach could significantly reduce the computational burden required to estimate all parameters of the CAT model. We focused our analysis on the three real data sets described in detail in the Data Sets section (5.6).
Table 1 reports the computational time required for estimation of all parameters of the CAT-Poisson model when optimized using variational inference compared with sampling under the MCMC algorithm across the three data sets. These data sets contained drastically different numbers of taxa and sites. For example, the numbers of taxa and sites in data set C were approximately three times larger than those in data set B. The time complexity of each algorithm was found to increase regularly with the number of genes, species, and total aligned amino acid positions. Run times were significantly reduced in the variational inference framework compared with those in the MCMC approach. While our method uses variational inference to estimate the parameters of the evolutionary process, we note that we have retained the Gibbs sampling algorithm for tree topologies. If this remaining MCMC component can also be replaced by an optimization procedure, the computational burden will be further reduced.
3.2 Accuracy of Estimated Topologies, Tree Lengths, and Profiles
The tree topology and branch lengths estimated by variational inference were almost the same as those obtained by the MCMC algorithm (Figure 1).
By introducing a Dirichlet process prior, the CAT model provides a posterior distribution of K, the number of separate categories, and of the size of each category. The PhyloBayes MPI program, which is based on a hybrid strategy combining Gibbs sampling and Metropolis-Hastings algorithms, proposes reallocations of sites to categories given the mixture weights and the stationary probabilities at all other sites. These site-to-category reallocation proposals, which are driven by the posterior weights of the mixture and the profiles associated with each component of the mixture, are performed by Gibbs sampling; Metropolis-Hastings updates are then used for the profiles and the other parameters of the model. This strategy guarantees that the samplers leave the posterior distribution invariant. Our approach, variational inference, instead proposes variational distributions for the allocation variables, the weights of the mixture, and the profiles. The choice among alternative allocations of sites to categories is driven by updating the parameters of these variational distributions and computing the expected values of these variables under the variational distribution.
Top-ranked estimated classes are listed along with the number of sites allocated to each class. The results are for real data set A; the size of each class was calculated by counting the sites allocated to it.
Table 2 compares some major categories estimated by MCMC and variational inference. The size of each category was approximated by the number of sites assigned to that class. The number of distinct categories was estimated for data set A, representing 6,622 amino acid positions. As can be seen in the table, variational inference accurately approximated the posterior mean sizes of these categories; their profiles were accurately estimated as well (Figure 2).
Taken together, these results demonstrate that the estimation time required by the variational inference framework compares favorably with that used by sampling algorithms such as MCMC, while a sufficient level of accuracy under the CAT model is still guaranteed.
4 Discussion
We have developed a new framework for estimating all parameters of the CAT model, namely, stochastic variational inference, that considerably improves runtime performance and significantly reduces the computational burden. In contrast to existing approaches designed for the same purpose that rely on a simulation framework, such as Gibbs sampling and Metropolis-Hastings algorithms (Lartillot and Philippe 2004; Lartillot 2006; Lartillot et al. 2013), stochastic variational inference recasts the problem of inference as an optimization problem, thus allowing us to apply powerful tools for convex optimization. In this way, our approach proposes a feasible family of variational distributions and then selects the family member closest to the true intractable posterior distribution of interest by minimizing the Kullback-Leibler (KL) divergence.
We have demonstrated through analysis of actual data sets that our method accurately approximates the posterior distribution of the CAT model with improved speed. This substantial runtime enhancement with no loss of accuracy allows our method to be applied to the large data sets that are steadily becoming the norm in phylogenetic and biological evolutionary studies. Finally, our results were obtained on a modest computing platform. The implementation of a variational inference version of PhyloBayes MPI to exploit advanced computing architectures holds the promise of analyzing even larger data sets than the examples in our paper.
Bayesian models of sequence evolution allow substitutional heterogeneity across protein sequence sites to be taken into account. In particular, the CAT model treats the number of substitutional categories as a free parameter and is able to uncover a level of heterogeneity much higher than that assumed by other mixture models. Under the variational inference approach, all of these special features of the CAT model are preserved. Given the increasing size of studied data sets, ensuring that statistical algorithms scale to large numbers of species with massive numbers of aligned positions is critical. We have shown that such analyses are time consuming when undertaken with MCMC algorithms, which perform a large number of sampling iterations over the entire data set. With their more efficient optimization framework, stochastic variational inference algorithms overcome this limitation without compromising the principles and statistical assumptions behind the model. For improved estimation of site-heterogeneous Bayesian mixture models with massive data sets, we recommend implementation of a variational inference version of PhyloBayes MPI.
5 Materials and Methods
5.1 CAT Model
We use an infinite mixture model that describes site heterogeneity with respect to the substitution process. This model is similar to the one proposed by Lartillot and Philippe (2004), but, instead of sampling-based inference, we have developed a new approach that allows for efficient inference of ancestral sequences. Our model does not assume that all sites of a protein evolve under the same substitution process; each process is characterized by its own 20×20 substitution matrix. In addition, the model does not assume a fixed number of distinct substitution processes (or classes) and respective amino-acid profiles; instead, these are treated as free variables of the model. Poisson (or F81; Felsenstein 1981) Markov processes are considered to apply to all substitution processes along the branches of a tree (Lartillot and Philippe 2004; Lartillot 2006). Each Markov process is characterized by a rate matrix Q = [Q_ab], which can be expressed in terms of a vector of stationary probabilities, or equilibrium frequencies, π_a, 1 ≤ a ≤ 20, such that ∑_{a=1}^{20} π_a = 1, and a set of relative rates, or exchangeability parameters, (ρ_ab), 1 ≤ a, b ≤ 20. We determine the size of the segments adaptively, as described below. This approach allows us to work in the framework of a Bayesian mixture model with parameters representing the mixture of distinct classes, the rates at each site, and the branch lengths.
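For concreteness, here is a minimal Python sketch (an illustration only; the paper's implementation is in C++) of how a Poisson (F81) rate matrix can be constructed from a given profile and exponentiated to obtain transition probabilities along a branch; the uniform profile and branch length are assumed example values:

```python
import numpy as np
from scipy.linalg import expm

def f81_rate_matrix(pi):
    """Poisson (F81) rate matrix: Q_ab proportional to pi_b for a != b."""
    n = len(pi)
    Q = np.tile(pi, (n, 1))              # off-diagonal rate to state b is pi_b
    np.fill_diagonal(Q, 0.0)
    np.fill_diagonal(Q, -Q.sum(axis=1))  # diagonal makes each row sum to zero
    return Q

pi = np.full(20, 0.05)                   # uniform 20-state amino acid profile (example)
Q = f81_rate_matrix(pi)
P = expm(Q * 0.1)                        # transition probabilities for branch length 0.1
assert np.allclose(P.sum(axis=1), 1.0)   # rows of P are probability distributions
```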
Given an amino-acid database including N aligned positions (columns) and P taxa, we denote by D_ip the observed state of the process operating at site i (i = 1, …, N) at the leaf indexed by p (1 ≤ p ≤ P). We consider l_j (1 ≤ j ≤ 2P − 3) and r_i (1 ≤ i ≤ N) to be random variables that denote the length of branch j and the relative rate of substitution at site i, respectively. Formally, the CAT-Poisson model assumes that (i) a gamma distribution of shape 1 and scale β > 0 is the prior distribution of branch lengths, (ii) a gamma distribution of shape α and scale α is the prior distribution of rates, and (iii) the prior distribution on a profile π is a flat Dirichlet distribution. Furthermore, Lartillot et al. (2013) developed a Dirichlet process mixture model, formulated in terms of a stick-breaking construction over the equilibrium frequency profiles, to generate an infinite mixture of Poisson processes describing sites, with each mixture component characterized by its own substitution matrix {Q_k}, k = 1, …, ∞, and only the stationary probabilities differing. By introducing a new random variable V_k, the stick-breaking proportion of the kth stick, the stick-breaking representation allows the construction of an infinite mixture structure. Moreover, each site i in an amino-acid sequence belongs to a category k that is specified by the allocation variable z_i ∈ {1, …, ∞}. The vector z = (z_i), i ∈ {1, …, N}, is called the allocation vector. The allocations z = (z_i) are drawn i.i.d. from a multinomial distribution over the infinite vector of mixing proportions, namely, φ = (φ_k), k ∈ {1, …, ∞}, with φ_k = V_k ∏_{j<k} (1 − V_j). In addition, we use the data augmentation algorithm proposed by Nielsen (2002) to obtain the substitution mapping in the case of Poisson processes. The substitution mapping Ξ records, for each branch j and site i, the number of substitutions n_ij together with the successive states of the process; from it, the random variable m_ka is obtained as the total number of substitutions to state a at sites assigned to cluster k, plus one if a is the state at the root of the tree. This algorithm is used to simulate the mutational history of a single site. The probability distribution of n_ij is a Poisson distribution with rate parameter r_i l_j, and each successive state is drawn from the profile of the category to which the site is allocated.
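The stick-breaking construction described above can be illustrated with a short simulation; the truncation level, concentration parameter, and variable names below are our own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
N, K_max, concentration = 1000, 50, 5.0   # sites, truncation level, DP concentration

# Stick-breaking: phi_k = V_k * prod_{j<k} (1 - V_j).
V = rng.beta(1.0, concentration, size=K_max)
V[-1] = 1.0                               # close the stick at the truncation level
remaining = np.concatenate(([1.0], np.cumprod(1.0 - V[:-1])))
phi = V * remaining
phi /= phi.sum()                          # guard against floating-point drift

# One flat-Dirichlet amino acid profile per category.
profiles = rng.dirichlet(np.ones(20), size=K_max)

# Draw the allocation variable z_i for each site and look up its profile.
z = rng.choice(K_max, size=N, p=phi)
site_profiles = profiles[z]
print("categories actually used:", len(np.unique(z)))
```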
Given a data set of amino-acid sequences, Markov chain Monte Carlo (MCMC) sampling methods have been proposed to approximate the full set of parameters of this model (Lartillot and Philippe 2004; Lartillot 2006). A parallel computing version has been developed to speed up the estimation process, thus allowing faster phylogenetic reconstruction under a Dirichlet process mixture (Lartillot et al. 2013). Fundamentally, however, MCMC methods, even parallel ones, rely on sampling schemes from a Markov chain whose stationary distribution is the posterior of interest and on iteratively updating estimates of the model parameters. When a database becomes too large for memory or iterative computation, these approaches significantly increase the time complexity of inference.
5.2 Variational Inference
Variational inference is a class of methods that reformulate the problem of approximating the posterior of a complex probabilistic model as an optimization problem. The central purpose of the variational inference algorithm is to approximate the true intractable posterior distribution p(Φ, Ξ|D), Φ = {V, z, π, l, r}, by finding an element of a tractable family of probability distributions q(Φ, Ξ|Θ), called the variational distribution. These distributions are parameterized by free parameters, called variational parameters Θ. Variational inference fits these parameters to find a distribution close to the true intractable posterior distribution of interest. The distance on probability space between a pair of probability distributions q(Φ, Ξ|Θ) and p(Φ, Ξ|D) is measured with the Kullback-Leibler (KL) divergence:

KL[q(Φ, Ξ|Θ) || p(Φ, Ξ|D)] = E_q[log q(Φ, Ξ|Θ)] − E_q[log p(Φ, Ξ, D)] + log p(D).   (1)
The term log p(D) in equation (1), which is the cause of computational difficulty in Bayesian analysis, does not depend on Θ and can therefore be treated as a constant when estimating the variational distribution that is closest to the posterior distribution:

q*(Φ, Ξ|Θ) = argmin_Θ KL[q(Φ, Ξ|Θ) || p(Φ, Ξ|D)].
Dropping this constant from the target function KL[q(Φ, Ξ|Θ) || p(Φ, Ξ|D)], variational inference equivalently maximizes the computationally feasible target function:

ELBO(Θ) = E_q[log p(Φ, Ξ, D)] − E_q[log q(Φ, Ξ|Θ)].   (2)
Equation (2) is called the Evidence Lower BOund (ELBO; Jordan et al. 1999) because it bounds the log marginal likelihood log p(D) from below. It should be noted that the value of this target function cannot be used for comparison between different families of variational distributions. Currently, the standard model-checking procedure is to compare the important aspects of q*(Φ, Ξ|Θ) with those of MCMC runs on example data.
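As a sanity check of the relationship between equations (1) and (2), the following toy computation on an arbitrary three-state example verifies numerically that ELBO + KL = log p(D), so the ELBO is indeed a lower bound on the log evidence:

```python
import numpy as np

p_joint = np.array([0.20, 0.15, 0.05])   # p(z, D) over 3 latent states, D fixed
p_evidence = p_joint.sum()               # p(D)
p_post = p_joint / p_evidence            # p(z | D)

q = np.array([0.5, 0.3, 0.2])            # an arbitrary variational distribution
elbo = np.sum(q * (np.log(p_joint) - np.log(q)))
kl = np.sum(q * (np.log(q) - np.log(p_post)))

assert np.isclose(elbo + kl, np.log(p_evidence))
print(elbo, "<=", np.log(p_evidence))    # ELBO bounds log p(D) from below
```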
5.3 Two Illustrative Examples
5.3.1 Variational Inference of Bayesian Ridge Regression
We consider the linear regression model. Given a data set of explanatory variables x_n = (x_n1, …, x_nM)^T and dependent variables t_n (n ∈ {1, …, N}), the likelihood is

p(t|w, β) = ∏_{n=1}^{N} N(t_n | w^T x_n, β^{-1}),

where w is the vector of regression coefficients and β is the noise precision parameter. To simplify the discussion, we assume that the noise precision parameter β is known and consider a conjugate Gaussian prior distribution over w, p(w|α) = N(w|0, α^{-1}I), where α determines the extent of shrinkage. When α is known, the posterior distribution of w follows a normal distribution. To allow for uncertainty in this extent, a gamma prior distribution is introduced over α, p(α) = Gam(α|a_0, b_0). In this case, the posterior distribution cannot be expressed explicitly.
In the variational framework, the mean field representation of w and α is

q(w, α) = q(w) q(α).
A practical variational distribution is

q(w) = N(w | m_N, S_N),   q(α) = Gam(α | a_N, b_N).
The joint density and the mean-field family are combined to form the ELBO for the Bayesian ridge regression model. It is a function of the variational parameters m_N, S_N and a_N, b_N.
Using the coordinate ascent algorithm, we update each variational parameter in turn as follows:
update m_N, S_N: Given the current value E[α] = a_N/b_N, the values of m_N and S_N are updated as

S_N = (E[α] I + β X^T X)^{-1},   m_N = β S_N X^T t,

where X is the design matrix (x_1, …, x_N)^T and t = (t_1, …, t_N)^T.
update a_N, b_N: Given the values m_N and S_N, the values of a_N and b_N are updated as

a_N = a_0 + M/2,   b_N = b_0 + (1/2) E[w^T w] = b_0 + (1/2) (m_N^T m_N + Tr(S_N)).
Using the optimized values of the parameters, the posterior means are estimated as

E[w] = m_N,   E[α] = a_N / b_N.
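The two updates assemble into a complete coordinate ascent loop. The sketch below, with simulated data, is a minimal illustration; the function name and hyperparameter defaults are our own choices:

```python
import numpy as np

def cavi_bayesian_ridge(X, t, beta, a0=1e-2, b0=1e-2, n_iter=50):
    """Coordinate ascent VI for Bayesian ridge regression.

    Alternates the q(w) = N(m_N, S_N) and q(alpha) = Gam(a_N, b_N) updates.
    """
    N, M = X.shape
    a_N = a0 + 0.5 * M                   # fixed by the first update
    b_N = b0
    for _ in range(n_iter):
        e_alpha = a_N / b_N              # E[alpha] under the current q(alpha)
        S_N = np.linalg.inv(e_alpha * np.eye(M) + beta * X.T @ X)
        m_N = beta * S_N @ X.T @ t
        # E[w^T w] = m_N^T m_N + Tr(S_N)
        b_N = b0 + 0.5 * (m_N @ m_N + np.trace(S_N))
    return m_N, S_N, a_N, b_N

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
w_true = np.array([1.0, -2.0, 0.0, 0.5, 3.0])
beta = 4.0                               # known noise precision
t = X @ w_true + rng.normal(scale=beta ** -0.5, size=100)

m_N, S_N, a_N, b_N = cavi_bayesian_ridge(X, t, beta)
print("E[w] =", m_N)
print("E[alpha] =", a_N / b_N)
```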
5.3.2 Variational Inference of Topic Model
The second example is the simplest topic model, latent Dirichlet allocation (LDA). In the LDA model, the observed variables are the words, organized into documents; the input data w_dn denote the nth word in the dth document. Across a collection, the documents share the same mixture components, which are called topics, β_k for k ∈ {1, …, K}. A vector of topic proportions θ_d for d ∈ {1, …, D} describes the degree to which each document exhibits those topics. The LDA model assumes Dirichlet priors for both β_k and θ_d:

β_k ~ Dirichlet(η),   θ_d ~ Dirichlet(α).
The topic assignment z_dn indicates the single topic from which each word in each document is assumed to have been drawn. The topics, topic proportions, and topic assignments are therefore latent variables. The posterior distribution is written generally as

p(β, θ, z | w) = p(β, θ, z, w) / p(w).

However, the denominator, the marginal likelihood p(w), is computationally infeasible.
In the variational framework for topic models, we first consider the variational distributions for the latent variables. The mean-field variational family contains approximate posterior densities of the form

q(β, θ, z) = ∏_{k=1}^{K} q(β_k|λ_k) ∏_{d=1}^{D} q(θ_d|γ_d) ∏_{n=1}^{N_d} q(z_{d,n}|φ_{d,n}).
The factors q(β_k|λ_k) and q(θ_d|γ_d) are Dirichlet distributions on the kth topic, with global per-topic Dirichlet parameter λ_k, and on the dth document, with local per-document Dirichlet parameter γ_d. The factor q(z_{d,n}|φ_{d,n}) is a multinomial distribution on the nth observation's topic assignment; its local assignment probabilities are a K-vector φ_{d,n}. We construct the general ELBO for the LDA model by combining the joint density and the mean-field variational family.
With the complete conditionals in hand, we now use the coordinate ascent variational inference algorithm, which iterates between updating each local variational parameter and updating the global variational parameters:
update the local variational parameters: Given the current values of λ_k, the values of φ_{d,n}^k and γ_{d,k} are updated as

φ_{d,n}^k ∝ exp{Ψ(γ_{d,k}) + Ψ(λ_{k,w_dn}) − Ψ(∑_v λ_{k,v})},
γ_{d,k} = α + ∑_{n} φ_{d,n}^k.
Here, Ψ(.) is the digamma function, the first derivative of the log Gamma function.
update the global variational parameters: Given the values of φ_{d,n}^k, the values of λ_{k,v} are updated as

λ_{k,v} = η + ∑_{d} ∑_{n} φ_{d,n}^k 1[w_dn = v].
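These two updates yield the following batch coordinate ascent sketch for LDA; the function name, initialization, and fixed inner-loop count are illustrative assumptions:

```python
import numpy as np
from scipy.special import digamma

def lda_cavi(docs, K, V, alpha=0.1, eta=0.01, n_iter=100, seed=0):
    """Batch coordinate ascent VI for LDA.

    docs: list of documents, each a list of word ids in {0, ..., V-1}.
    Returns the Dirichlet parameters lam (K x V) and gam (D x K).
    """
    rng = np.random.default_rng(seed)
    D = len(docs)
    lam = rng.gamma(100.0, 0.01, size=(K, V))   # q(beta_k) = Dir(lam_k)
    gam = np.ones((D, K))                        # q(theta_d) = Dir(gam_d)
    for _ in range(n_iter):
        e_log_beta = digamma(lam) - digamma(lam.sum(axis=1, keepdims=True))
        lam_new = np.full((K, V), eta)
        for d, words in enumerate(docs):
            for _ in range(20):                  # local phi/gamma updates
                e_log_theta = digamma(gam[d]) - digamma(gam[d].sum())
                log_phi = e_log_theta[:, None] + e_log_beta[:, words]
                phi = np.exp(log_phi - log_phi.max(axis=0))
                phi /= phi.sum(axis=0)           # K x N_d responsibilities
                gam[d] = alpha + phi.sum(axis=1)
            for n, v in enumerate(words):
                lam_new[:, v] += phi[:, n]       # accumulate expected counts
        lam = lam_new
    return lam, gam
```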
5.4 Variational Inference of CAT Model
Following previous work that uses simple families of distributions as mean-field variational approximations (Blei et al. 2006; Hoffman et al. 2013), each variable in the CAT-Poisson model is treated as independent and governed by its own parametric variational distribution. Moreover, we consider truncated stick-breaking representations, proposed previously by Blei et al. (2006), for the variational distributions only. The truncation level, that is, the largest number of categories K_max, can be freely chosen. The family of variational distributions in the CAT-Poisson model can be written as follows:

q(Φ) = ∏_{k=1}^{K_max−1} q(V_k | γ_k, γ′_k) ∏_{i=1}^{N} q(z_i | ω_i) ∏_{k=1}^{K_max} q(π_k | λ_k) ∏_{j=1}^{2P−3} q(l_j | ζ_j) ∏_{i=1}^{N} q(r_i | ζ′_i),   (3)

q(V_k | γ_k, γ′_k) = Beta(γ_k, γ′_k),   q(z_i | ω_i) = Mult(ω_i),   q(π_k | λ_k) = Dir(λ_k),   q(l_j | ζ_j) = Gam(ζ_j),   q(r_i | ζ′_i) = Gam(ζ′_i),   (4)

where (γ, γ′, ζ, ζ′, λ, ω) are free variational parameters. To guarantee the tractability of computing the expectations of variational distributions, we choose variational distributions from exponential families (Wainwright et al. 2008).
To estimate each variational parameter in the CAT-Poisson model (3, 4), we divide the set of variational variables into two subgroups: global variables [Φ_g = (Ξ, π, l, r)] and local variables [Φ_l = (V, z)]. The local variational variables (V, z) are per-data-point latent variables. The kth local variable V_k is the stick-breaking proportion of the kth stick, which is used to construct the infinite vector of mixing proportions. The ith local variable z_i represents the allocation of site i of the amino-acid alignment to a mixture component. Each local variable is governed by its own "local variational parameters" [Θ_l = (γ, γ′, ω)]. Bishop (2006) describes a coordinate ascent algorithm for solving the optimization problem over these variables. The coordinate ascent algorithm seeks a local optimum of the ELBO by optimizing each factor of the mean field variational distribution while holding the others fixed. The optimal q(z) and q(V) are then proportional to the exponentiated expected log of the joint distribution:

q*(z_i) ∝ exp{E_{−z}[log p(Φ, Ξ, D)]},   q*(V_k) ∝ exp{E_{−V}[log p(Φ, Ξ, D)]}.
Here, E_{−z} and E_{−V} denote expectations with respect to the variational distributions of all the variables except z or V, respectively. The global variables Φ_g potentially govern any of the data points. These variables are governed by the "global variational parameters" [Θ_g = (ζ, ζ′, λ)]. The coordinate ascent algorithm iterates to update the local variational parameters based on the mapping data by setting them to the expected natural parameters η(.) of the corresponding complete conditionals.
To estimate each global variational parameter in the CAT-Poisson model, we use the stochastic variational inference (SVI) algorithm to optimize the lower bound in equation (2) (Hoffman et al. 2013). The stochastic variational algorithm is based on stochastic gradient ascent with noisy realizations of the gradient (Robbins and Monro 1951). Natural gradients are adopted to account for the geometric structure of probability parameters; importantly, natural gradients are easy to compute and give faster convergence than standard gradients. The SVI algorithm repeatedly subsamples the data, updates the values of the local parameters based on the subsampled data, and adjusts the global parameters in an appropriate way. Such stochastic estimates can help the algorithm avoid shallow local optima of complex objective functions.
In our setting, we sample a mapping data point Ξ_n at each iteration and compute the conditional natural parameters for the global variational parameters given N replicates of Ξ_n. The noisy natural gradients are then obtained. Using these gradients, we update Θ_g at each iteration t (with step size ρ_t) as

Θ_g^{(t)} = (1 − ρ_t) Θ_g^{(t−1)} + ρ_t Θ̂_g,

where Θ̂_g is the intermediate estimate computed from the prior plus N times the expected sufficient statistics t(.) of the sampled data point.
Based on these subsampling techniques, this procedure reduces the computational burden by avoiding the expensive sums in the above lower bound. The SVI algorithm thus significantly accelerates variational analysis of large databases. Applying the previously proposed SVI framework (Hoffman et al. 2013), we can separate the computational cycle into the following steps (a minimal sketch follows the list):
Sample amino acid data from the whole set of input data.
Estimate how each site is assigned to a category, on the basis of observational data and the current approximation of variational parameters.
Update the variational parameters:
- Local parameters: assignment variables and stick-breaking proportions.
- Global parameters: equilibrium frequency profiles, branch lengths, and rates across sites.
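A generic sketch of this cycle for an exponential-family global parameter is given below; local_step and suff_stats are hypothetical callbacks standing in for the model-specific computations, and the step-size schedule is the standard Robbins-Monro choice, not a value from our implementation:

```python
import numpy as np

def svi_cycle(data, lam0, prior, local_step, suff_stats,
              n_iter=1000, kappa=0.7, tau=1.0, seed=0):
    """Generic stochastic variational inference cycle (Hoffman et al. 2013).

    lam0: initial global variational parameter (natural-parameter scale).
    local_step(x, lam): optimizes the local variational parameters for x.
    suff_stats(x, phi): expected sufficient statistics of the sampled point.
    """
    rng = np.random.default_rng(seed)
    lam = np.array(lam0, dtype=float)
    N = len(data)
    for t in range(1, n_iter + 1):
        x = data[rng.integers(N)]                 # 1. subsample a data point
        phi = local_step(x, lam)                  # 2. local (assignment) update
        lam_hat = prior + N * suff_stats(x, phi)  # intermediate global estimate
        rho = (t + tau) ** (-kappa)               # Robbins-Monro step size
        lam = (1.0 - rho) * lam + rho * lam_hat   # 3. noisy natural-gradient step
    return lam
```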
The lower bound of the data in terms of the variational parameters is specifically described in the Supplementary Material. Mathematical details of the variational objective function and computational methods of noisy derivatives and updating of variational parameters are also explained in that section.
5.5 Parallelization and Tree Topology
To parallelize the algorithm at the single-machine level and thus reduce runtimes, we adopted the MPI parallelization of the PhyloBayes MPI program (Lartillot et al. 2013). Specifically, we use one master process for dispatching computational tasks and for collecting and summing results, with multiple slave processes executing the orders and returning all essential information to the master. This parallel strategy divides the computational burden equally among the slaves.
In addition, a partial Gibbs sampling algorithm for subtree pruning and regrafting (SPR) is adopted to update the tree topology (Lartillot et al. 2013). In a parallel environment, the task of the master process is to randomly select a subtree for pruning and to send this information to all slaves. The task of each slave process is to update the conditional likelihood vectors of each resulting topology and to perform a complete scan of all possible regrafting points. The log likelihood for each regrafting point is arranged into an array and sent back to the master process. All arrays are collected and summed, and the Gibbs sampling decision rule is then applied to select the regrafting position.
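The master/slave exchange can be sketched with mpi4py as follows (an illustration only, not the actual PhyloBayes MPI code; the subtree choice and log likelihoods are random placeholders):

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Master randomly selects a subtree to prune and broadcasts the choice.
task = {"pruned_subtree": np.random.randint(100)} if rank == 0 else None
task = comm.bcast(task, root=0)

# Each slave scans its share of regrafting points (placeholder log likelihoods).
my_loglik = np.random.randn(10)
all_loglik = comm.gather(my_loglik, root=0)

if rank == 0:
    loglik = np.concatenate(all_loglik)
    # Gibbs decision rule: sample a regrafting point proportionally to its likelihood.
    w = np.exp(loglik - loglik.max())
    choice = np.random.choice(len(loglik), p=w / w.sum())
    print("selected regrafting point:", choice)
```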
5.6 Data Sets
Three real data sets were used for our computational experiments. Data set A was a mitochondrial data set consisting of 33 proteins and 6,622 amino acid positions from 13 species. Data set B was a plastid data set composed of 50 plastid-encoded proteins and 10,137 amino acid positions from 28 species. A total of 13% and 5% of amino acid positions were missing from the mitochondrial and plastid data sets, respectively (Rodríguez-Ezpeleta et al. 2006; Lartillot et al. 2013). Finally, data set C was a more challenging and larger set of nuclear protein sequences derived from a large alignment of EST and genome data constructed by Philippe et al. (2011); it consists of 197 genes and a total of 38,330 amino-acid positions from 66 species, with 30% missing data.
C++ code for the variational inference version of the CAT model to perform computational experiments with these data sets is available at https://github.com/tungtokyo1108/.
6 Supplementary Material
Supplementary data are available at Molecular Biology and Evolution online.
7 Acknowledgments
This study was supported by Grant-in-Aid for Scientific Research (B) 16H02788 from the Japan Society for the Promotion of Science. We thank Edanz Group (www.edanzediting.com/ac) for editing the English text of a draft of this manuscript.