Abstract
The large state space of gene genealogies is a major hurdle for inference methods based on Kingman’s coalescent. Here, we present a new Bayesian approach for inferring past population sizes which relies on a lower resolution coalescent process we refer to as “Tajima’s coalescent”. Tajima’s coalescent has a drastically smaller state space, and hence it is a computationally more efficient model, than the standard Kingman coalescent model. We provide a new algorithm for efficient and exact likelihood calculations, which exploits a directed acyclic graph and a correspondingly tailored Markov Chain Monte Carlo method. We compare the performance of our Bayesian Estimation of population size changes by Sampling Tajima’s Trees (BESTT) with a popular implementation of coalescent-based inference in BEAST using simulated data and human data. We empirically demonstrate that BESTT can accurately infer effective population sizes, and it further provides an efficient alternative to the Kingman’s coalescent. The algorithms described here are implemented in the R package phylodyn, which is available for download at https://github.com/JuliaPalacios/phylodyn.
1 Introduction
Modeling gene genealogies from an alignment of sequences — timed and rooted bifurcating trees reflecting the ancestral relationships among sampled sequences — is a key step in coalescent-based inference of evolutionary parameters such as effective population sizes. In the neutral coalescent model without recombination, observed sequence variation is produced by a stochastic process of mutation acting along the branches of the gene genealogy (Kingman, 1982a; Watterson, 1975), which is modeled as a realization of the coalescent point process at a neutral non-recombining locus. In the coalescent point process, the rate of coalescence (the merging of two lineages into a common ancestor at some time in the past) is a function that varies with time, and it is inversely proportional to the effective population size at time t, N(t) (Kingman, 1982b; Slatkin and Hudson, 1991; Donnelly and Tavaré, 1995). Our goal is to infer (N(t))t≥0 which we will refer to as the “effective population size trajectory”.
Multiple methods have been developed to infer (N(t))t≥0 using the standard coalescent model on genomic sequence datasets (Griffiths and Tavaré, 1996; Kuhner and Smith, 2007; Minin et al., 2008; Li and Durbin, 2011; Drummond et al., 2012; Palacios and Minin, 2013; Gill et al., 2013; Sheehan et al., 2013; Palacios et al., 2015). These methods must contend with two major challenges: (i) choosing a prior distribution or functional form for (N(t))t≥0, and (ii) integrating over the large hidden state space of genealogies. For example, several previous approaches have assumed exponential growth (Griffiths and Tavaré, 1996; Kuhner et al., 1998; Kuhner and Smith, 2007), in which case the estimation of (N(t))t≥0 is reduced to the estimation of one or two parameters. In general, the functional form of (N(t))t≥0 is unknown and needs to be inferred. A commonly used naive nonparametric prior on (N(t))t≥0 is a piecewise linear or constant function defined on time intervals of constant or varying sizes (Heled and Drummond, 2008; Sheehan et al., 2013; Schiffels and Durbin, 2014). The specification of change points in such time-discretized effective population size trajectories is inherently difficult because it can lead to runaway behavior or large uncertainty in the estimates of (N(t))t≥0 (Minin et al., 2008; Heled and Drummond, 2008; Li and Durbin, 2011; Sheehan et al., 2013; Palacios et al., 2015). These difficulties can be avoided by the use of Gaussian-process priors in a Bayesian nonparametric framework, allowing accurate and precise estimation (Palacios and Minin, 2013; Gill et al., 2013; Lan et al., 2014; Palacios et al., 2015).
The second challenge for coalescent-based inference of (N(t))t≥0 is the integration over the hidden state space of genealogies. Given molecular sequence data Y and a mutation model with vector of parameters μ, current methods rely on calculating the marginal likelihood function Pr(Y|(N(t))t≥0, μ) by integrating over all possible coalescence and mutation events. Under the infinite-sites mutation model without intra-locus recombination (Watterson, 1975), this integration requires a computationally expensive importance sampling technique or Markov Chain Monte Carlo (MCMC) techniques (Griffiths and Tavaré, 1994a; Stephens and Donnelly, 2000; Hobolth et al., 2008; Wu, 2010). Moreover, a maximum likelihood estimate of (N(t))t≥0 cannot be explicitly obtained; instead, it is obtained by exploring a grid of parameter values. For finite-sites mutation models, current methods approximate the marginal likelihood function by integrating over all possible genealogies via MCMC methods (Equation (1); Kuhner (2006); Drummond et al. (2012)). Both cases may be represented as
$$\Pr(Y \mid (N(t))_{t \ge 0}, \mu) = \int \Pr(Y \mid g, \mu)\, \Pr(g \mid (N(t))_{t \ge 0})\, dg, \tag{1}$$
in which g denotes a genealogy and Pr(·) is used to denote both the probability of discrete variables and the density of continuous variables. The integral above involves an (n − 1)-dimensional integral over n − 1 coalescent times and a sum over all possible tree topologies with n leaves. Therefore, these methods require a very large number of MCMC samples, and exploration of the posterior space of genealogies continues to be an active area of research (Kuhner et al., 1998; Rannala and Yang, 2003; Drummond et al., 2012; Whidden and Matsen, 2015; Aberer et al., 2016).
Current methods rely on the Kingman n-coalescent process to model the sample’s ancestry. However, the state space of genealogical trees grows superexponentially with the number of samples, making inference computationally challenging for large sample sizes. In this study, we develop a Bayesian nonparametric model that relies on Tajima’s coalescent, a lower resolution coalescent process with a drastically smaller state space than that of Kingman’s coalescent. In Cappello and Palacios (2019), the authors quantify this striking reduction in cardinality. In particular in this study, we infer the posterior distribution Pr((N(t))t≥0, gT, τ | Y, μ), where gT corresponds to the Tajima’s genealogy of the sample (see Figure 1A and Section 2.4), (log N(t))t≥0 has Gaussian process prior with precision hyperparameter τ, and mutations occur according to the infinite-sites model of Watterson (1975). This results in a new more efficient method for inferring (N(t))t≥0 called Bayesian Estimation by Sampling Tajima’s Trees (BESTT), with a drastic reduction in the state space of genealogies. We show using simulated data that BESTT can accurately infer effective population size trajectories and that it provides a more efficient alternative than Kingman’s coalescent models.
Next, we start with an overview of BESTT, detail our representation of molecular sequence data and define the Tajima coalescent process. We then introduce a new augmented representation of sequence data as a directed acyclic graph (DAG). This representation allows us to both calculate the conditional likelihood under the Tajima coalescent model, and to sample tree topologies compatible with the observed data. We then provide an algorithm for likelihood calculations and develop an MCMC approach to efficiently explore the state space of unknown parameters. Finally, we compare our method to other methods implemented in BEAST (Drummond et al., 2012) and estimate the effective population size trajectory from human mtDNA data. We close with a discussion of possible extensions and limitations of the proposed model and implementation.
2 Methods/Theory
2.1 Overview of BESTT
Our objective in the implementation of BESTT is to estimate the posterior distribution of model parameters by replacing Kingman’s genealogy with Tajima’s genealogy gT. A Tajima’s genealogy does not include labels at the tips (Figure 1): we do not order individuals in the sample but label only the lineages that are ancestral to at least two individuals (that is, we only label the internal nodes of the genealogy). Replacing Kingman’s genealogy by Tajima’s genealogy in our posterior distribution exponentially reduces the size of the state space of genealogies (Figure 1B). In order to compute Pr(Y|gT, μ), the conditional likelihood of the data conditioned on a Tajima’s genealogy, we assume the infinite sites model of mutations and leverage a directed acyclic graph (DAG) representation of sequence data and genealogical information. Note that the overall likelihood, Eq. (1), will differ only by a combinatorial factor from the corresponding likelihood under the Kingman coalescent. Our DAG represents the data with a gene tree (Griffiths and Tavaré, 1994a), constructed via a modified version of the perfect phylogeny algorithm of Gusfield (1991). This provides an economical representation of the uncertainty and conditional independences induced by the model and the observed data.
Under the infinite-sites mutation model, there is a one-to-one correspondence between observed sequence data and the gene tree of the data (Gusfield, 1991) (Sections 2.2–2.3). We further augment the gene tree representation with the allocation of the number of observed mutations along the Tajima’s genealogy to generate a DAG (Section 2.5). The conditional likelihood Pr(Y | gT, μ) is then calculated via a recursive algorithm that exploits the auxiliary variables defined in the DAG nodes, marginalizing over all possible mutation allocations (Section 2.6). We approximate the joint posterior distribution Pr((N(t))t≥0, gT, τ | Y, μ) via an MCMC algorithm using Hamiltonian Monte Carlo for sampling the continuous parameters of the model and a novel Metropolis-Hastings algorithm for sampling the discrete tree space.
2.2 Summarizing sequence data Y as haplotypes and mutation groups
Let the data consist of n fully linked haploid sequences or alignments of nucleotides at s segregating sites sampled from n individuals at time t = 0 (the present). Note that any labels we affix to the individuals are arbitrary in the sense that they will not enter into the calculation of the likelihood. We further assume the infinite sites mutation model of Watterson (1975) with mutation parameter μ and known ancestral states for each of the sites. Then we can encode the data into a binary matrix Y of n rows and s columns with elements yi,j ∈ {0, 1}, where 0 indicates the ancestral allele.
In order to calculate the Tajima’s conditional likelihood Pr(Y | gT, μ), we first record each haplotype’s frequency and group repeated columns to form mutation groups; a mutation group corresponds to a shared set of mutations in a subset of the sampled individuals. We record the cardinality of each mutation group (i.e., the number of columns grouped to form each mutation group). In Figure 2A, there are two columns labeled “b”, corresponding to two segregating sites which have the exact same pattern of allelic states across the sample. Further, two individuals carry the derived allele of mutation group “b”, so in this case the frequency of haplotype 7 and the cardinality of mutation group “b” are both equal to 2. We denote the number of haplotypes in the sample as h, the number of mutation groups as m, and the representation of Y as haplotypes and mutation groups as Yh×m.
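The summary step described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the phylodyn implementation: the function name `summarize` and the toy matrix are hypothetical, and the toy data are not those of Figure 2A.

```python
from collections import Counter

def summarize(Y):
    """Collapse a binary n x s matrix (list of rows) into haplotypes and mutation groups.

    Returns two Counters: haplotype -> frequency, and column pattern -> cardinality.
    """
    # Haplotype frequency: number of sampled sequences sharing each distinct row.
    hap_freq = Counter(tuple(row) for row in Y)
    # Mutation-group cardinality: number of sites sharing each distinct column.
    group_card = Counter(tuple(col) for col in zip(*Y))
    return hap_freq, group_card

# Hypothetical toy data: 4 sequences, 3 segregating sites (0 = ancestral allele).
Y = [
    [1, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
    [0, 0, 0],
]
hap_freq, group_card = summarize(Y)
h, m = len(hap_freq), len(group_card)  # number of haplotypes and mutation groups
```

In this toy example the first two sites share the same pattern and therefore form a single mutation group of cardinality 2, carried by a haplotype of frequency 2.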
2.3 Representing Yh×m as a gene tree
Yh×m (Figure 2A, previous section) can alternatively be represented as a gene tree (or perfect phylogeny; Gusfield (1991); Griffiths and Tavaré (1994b)). This representation relies on our assumption of the infinite sites mutation model in which, if a site mutates once in a given lineage, all descendants of that lineage also have the mutation and no other individuals carry that mutation. The haplotype data summarized in Figure 2A corresponds to the gene tree (perfect phylogeny) given in Figure 2B.
A gene tree for a matrix Yh×m of h haplotypes and m mutation groups is a rooted tree with h leaves such that (Figure 2B):
Each row of Yh×m corresponds to exactly one leaf of the gene tree. The black numbers at leaf nodes in Figure 2B are the haplotype frequencies.
Each mutation group of Yh×m is represented by exactly one edge of the gene tree. The red numbers along edges in Figure 2B give the cardinality of each mutation group (i.e. the number of segregating sites in each mutation group; see Figure 2A). The edges of the gene tree corresponding to each mutation group are labeled accordingly (letters in Figures 2A and 2B). Note that external edges need not be labeled if there are no mutations exclusively present in the group of individuals descending from those edges.
The labels and the numbers associated with the edges along the unique path from the root to a leaf exactly specify a row of Yh×m.
Dan Gusfield’s perfect phylogeny algorithm (Gusfield, 1991) transforms the sequence data Yh×m into a gene tree and this transformation is one-to-one.
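To make the gene tree idea concrete, the sketch below checks compatibility with the infinite-sites model and recovers the nesting of mutation groups. It is a naive pairwise check, not Gusfield's linear-time algorithm: with known ancestral states, a perfect phylogeny exists exactly when the carrier sets of any two sites are either disjoint or nested. The function name `gene_tree_parents` and the toy matrix are hypothetical.

```python
def gene_tree_parents(Y):
    """Map each mutation group (set of carrier rows) to its parent group, or None.

    Raises ValueError if two groups overlap without nesting, i.e. if the data
    are incompatible with the infinite-sites model with known ancestral states.
    """
    n = len(Y)
    # One carrier set per mutation group (distinct, non-empty column).
    groups = {frozenset(i for i in range(n) if Y[i][j]) for j in range(len(Y[0]))}
    groups.discard(frozenset())
    groups = sorted(groups, key=len)
    for a in groups:
        for b in groups:
            if a & b and not (a <= b or b <= a):
                raise ValueError("data violate the infinite-sites model")
    parents = {}
    for a in groups:
        # Strict supersets of a are nested in one another, so the smallest is unique.
        supersets = [b for b in groups if a < b]
        parents[a] = min(supersets, key=len) if supersets else None
    return parents

# Hypothetical toy data: rows are sequences, columns are segregating sites.
Y = [
    [1, 1, 0],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
    [0, 0, 0],
]
parents = gene_tree_parents(Y)
```

A group whose parent is `None` hangs directly from the root of the gene tree.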
2.4 Tajima’s genealogies
BESTT explores the state space of Tajima’s genealogies (gT) as opposed to Kingman’s genealogies (g) and calculates the conditional likelihood Pr(Y|gT, μ). Kingman’s n-coalescent is a continuous-time Markov chain taking its values in the set of partitions of the label set {1, …, n}. At time t = 0, the process starts at {{1}, …, {n}}, when there are n labeled lineages, and it stops at {{1, …, n}}, when there is a single lineage ancestral to all n individuals. A transition in the n-coalescent consists of a merger, or coalescence, of two lineages (i.e., two blocks of the partition are chosen uniformly at random to merge in a single block). Sainudiin et al. (2015) describe six different resolutions of the discrete coalescent process for n lineages; these different resolutions are Markovian lumpings of the states of the coalescent process. For example, the pure death process that tracks the number of lineages over time (Kingman, 1982a) is the coarsest resolution, while the partition-valued n-coalescent provides one of the highest resolutions.
The process that keeps track of the blocks formed at each time step (including the labels of the individuals of the sample within each block, which share a common ancestor) induces a ranked labeled rooted binary tree that we call a “labeled topology”. Note that when the labeled tree topology is presented together with the coalescent times, ranking of coalescent events is redundant. For this reason, only four of the six resolutions presented in Sainudiin et al. (2015) are distinguishable.
A Kingman’s genealogy is a pair g = {Kn, t}, consisting of a labeled tree topology Kn and a vector of coalescent times t = (tn, …, t2); its state space is the product of the set of labeled tree topologies and the set of ordered vectors of n − 1 positive times. In this study, we assume that all sequences are sampled at the same time (i.e. the present, or time t = 0) and that the coalescent times, tn, …, t2, are measured from the present back into the past. In addition, we define tk to be the time of the coalescent event which decreases the number of ancestral lineages from k to k − 1. Thus, tk − tk+1 is the length of time in the ancestry of the sample during which there are exactly k lineages. The number of possible labeled tree topologies for a sample of size n is n!(n − 1)!/2^{n−1}, and each of these is equally likely. The density of a Kingman’s genealogy with effective population size trajectory (N(t))t≥0, denoted by Pr(g | (N(t))t≥0), can be factored as the product of the probability of the labeled tree topology and the coalescent times density:
$$\Pr(g \mid (N(t))_{t \ge 0}) = \Pr(K_n)\, \Pr(t \mid (N(t))_{t \ge 0}),$$
where Pr(Kn) = 2^{n−1}/[n!(n − 1)!]. We refer to the distribution of a genealogy as a density, although the distribution is a mixed distribution of continuous random variables (coalescent times) and discrete random variables (topology). Here again, we use the notation Pr(·) to denote the probability or the density of the random variable of interest, be it continuous or discrete.
In contrast, Tajima’s genealogy is a pair gT = {Fn, t} of a ranked tree shape Fn, and a vector of coalescent times t (Figure 2; Sainudiin et al. (2015); Palacios et al. (2015)). A ranked tree shape is a bifurcating tree with internal nodes labeled by their rankings (i.e. their order in time from past to present) and leaf labels omitted. In matrix notation, the ranked tree shape Fn is encoded by a triangular matrix of size n × n (Figure 3). During the interval (ti+1, ti), there are exactly i lineages (i = 2, …, n and by convention we set tn+1 = 0). The number of lineages through time is encoded on the diagonal of F: Fi,i = i for i in {2, 3, …, n}. For j < i, the entry Fi,j denotes the number of lineages that do not coalesce in the time interval (ti+1, tj); in particular, Fi,1 = 0 and for every i in {2, 3, …, n}, Fn,i denotes the number of singletons (i.e., external branches that have not coalesced) in the time interval (ti+1, ti) (Figure 3). The number of possible ranked tree shapes for a sample of size n (also called unlabeled histories, evolutionary relationships or vintaged and sized coalescent; see Sainudiin et al. (2015)) corresponds to the n-th term of the sequence A000111 of Euler zig-zag numbers (Disanto and Wiehe, 2013). The density of a Tajima’s genealogy with effective population size trajectory (N(t))t≥0, denoted by Pr(gT | (N(t))t≥0), can be factored as the product of the probability of the ranked tree shape and the coalescent times density:
$$\Pr(g^T \mid (N(t))_{t \ge 0}) = \Pr(F_n)\, \Pr(t \mid (N(t))_{t \ge 0}), \qquad \text{with} \qquad \Pr(F_n) = \frac{2^{\,n-c-1}}{(n-1)!}, \tag{2}$$
where c is the number of cherries (nodes with two leaves; c = 3 in Figure 3A), which can be expressed in terms of the entries of the matrix Fn. Equation (2) was derived independently by both Sainudiin et al. (2015) and Palacios et al. (2015).
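The difference in cardinality between the two tree spaces can be checked numerically. The sketch below is illustrative only; the function names are hypothetical. It assumes the indexing E_0 = E_1 = 1 for the Euler zig-zag numbers, so that a sample of size n has E_{n−1} ranked tree shapes, and it uses the standard recurrence 2E_{m+1} = Σ_k C(m, k) E_k E_{m−k} for m ≥ 1.

```python
from math import comb, factorial

def labeled_topologies(n):
    """Number of ranked labeled tree topologies: n!(n-1)!/2^(n-1)."""
    return factorial(n) * factorial(n - 1) // 2 ** (n - 1)

def ranked_tree_shapes(n):
    """Number of ranked tree shapes with n leaves: Euler zig-zag number E_{n-1}."""
    E = [1, 1]  # E_0 and E_1 (assumed indexing)
    for m in range(1, n - 1):
        # Recurrence 2 E_{m+1} = sum_k C(m, k) E_k E_{m-k}, valid for m >= 1.
        E.append(sum(comb(m, k) * E[k] * E[m - k] for k in range(m + 1)) // 2)
    return E[n - 1]

for n in (4, 10):
    print(n, labeled_topologies(n), ranked_tree_shapes(n))
```

For n = 10 this gives 2,571,912,000 labeled topologies against 7,936 ranked tree shapes.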
Observe that the densities of Kingman’s and Tajima’s genealogies differ solely by the discrete probability corresponding to the tree topology. Using either Kingman’s or Tajima’s genealogies, the distribution of coalescent times can be viewed as the distribution of a point process of coalescent events at times t := (tn, tn−1, …, t2), where tk indicates the time, measured from the present time 0 back into the past, when two of the k extant lineages reach a common ancestor and merge. The rate at which pairs of lineages coalesce depends on the effective population size trajectory (N(t))t≥0. In contrast to the case of constant effective population size, the coalescent intervals tk − tk+1 for k = 2, …, n are not independent of each other when N(t) varies with time. However, the density of a realization of the coalescent point process can be decomposed into a product of conditional densities as follows:
$$\Pr(t \mid (N(t))_{t \ge 0}) = \prod_{k=2}^{n} \Pr(t_k \mid t_{k+1}, (N(t))_{t \ge 0}),$$
where again we set tn+1 = 0. The conditional density of the coalescent interval tk − tk+1 takes the following form (Slatkin and Hudson, 1991):
$$\Pr(t_k \mid t_{k+1}, (N(t))_{t \ge 0}) = \frac{C_k}{N(t_k)} \exp\left\{ -\int_{t_{k+1}}^{t_k} \frac{C_k}{N(t)}\, dt \right\},$$
where $C_k = \binom{k}{2}$ is the number of possible coalescent events when k ancestral lineages are present.
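As a numerical illustration, the log density of a vector of coalescent times can be evaluated by summing, over k = n, …, 2, the terms log(C_k/N(t_k)) − C_k ∫ dt/N(t), with C_k = k(k − 1)/2 and the integral taken over the interval during which k lineages are present. The sketch below is hypothetical code, not the phylodyn implementation; it assumes that N is supplied as a positive Python function and approximates the integral with the trapezoid rule.

```python
import math

def log_coalescent_density(times, N, steps=2000):
    """Log density of coalescent times under trajectory N.

    times: increasing list [t_n, ..., t_2]; N: positive function of time.
    The integral of 1/N is approximated by the trapezoid rule (an assumption
    of this sketch, adequate for smooth trajectories).
    """
    n = len(times) + 1
    prev = 0.0  # t_{n+1} = 0 by convention
    logp = 0.0
    for k, t in zip(range(n, 1, -1), times):
        c_k = k * (k - 1) / 2  # number of pairs among k lineages
        h = (t - prev) / steps
        inner = sum(1.0 / N(prev + i * h) for i in range(1, steps))
        integral = h * (0.5 / N(prev) + inner + 0.5 / N(t))
        logp += math.log(c_k / N(t)) - c_k * integral
        prev = t
    return logp

# Constant trajectory N(t) = 1: the intervals are independent exponentials.
times = [0.1, 0.3, 0.7]
value = log_coalescent_density(times, lambda t: 1.0)
```

With N(t) = 1 and n = 4 the result reduces to log 18 − 1.6, which gives a simple check of the implementation.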
2.5 An augmented data representation using directed acyclic graphs
A key component of BESTT is the calculation of the conditional likelihood Pr(Y|gT, μ). We compute the conditional likelihood recursively over a directed acyclic graph (DAG) D. Our DAG exploits the gene tree representation of the data (Figure 2B), incorporates the branch length information of the Tajima’s genealogy gT (Figure 2C) and facilitates the recursive allocation of mutations to the branches of gT. Here we detail the construction of the DAG.
We construct the DAG using three pieces of information: the gene tree, a Tajima’s genealogy gT and an “allocation” of mutations along the branches of the Tajima’s genealogy (Figure 2). An allocation refers to a possible mapping (compatible with the data) of the observed numbers of mutations (red numbers in Figure 2B) to branches in the Tajima’s genealogy. Figure 4A shows one possible mapping for the Tajima’s genealogy in Figure 2C; usually this mapping is not unique. Our construction of D enables an efficient recursive consideration of all possible allocations of mutations along gT when computing the conditional likelihood Pr(Y | gT, μ).
Constructing the DAG D
Our DAG D = {Z, E} (Figure 2D) with nodes Z and edges E is constructed from the gene tree. The number of internal nodes in the DAG D is the same as the number of internal nodes in the gene tree. However, sister leaf nodes in the gene tree with the same number of descendants are grouped together in D. For example, the leaves in Figure 2B subtending from edges i and j are grouped into Z6 in Figure 2D, as they both have haplotype frequency 2. However, the leaves subtending from the e and f edges are not grouped (and correspond to Z8 and Z9 in the DAG Figure 2D) since they have respective haplotype frequencies 2 and 1. We label the root node of D as Z0 and increase the index i of each node Zi from top to bottom, moving left to right. For i < j, we assign a directed edge Ei,j if the node in the gene tree corresponding to Zi is connected to the node in the gene tree corresponding to Zj. The internal nodes and the leaf nodes of D are indexed by two separate index sets.
Random variables in D
Each node in D represents a random vector, Zj, which includes the number of descendants, the number of mutations and the allocation of mutations. Although the number of descendants and number of mutations are part of the observed data rather than random variables, for ease of exposition, we use capital letters to denote all three types of information. We define the random vector Zj as follows:
$$Z_j = (D_j, X_j, A_j),$$
where Dj denotes the number of descendants of (i.e., of sampled sequences subtended by) node Zj, Xj denotes the number of mutations separating Zj from its parent node, and Aj denotes the allocation of mutations along gT (described in detail below). The number of descendants Dj is thus the number of individuals/sequences descending from node Zj (this information is part of the gene tree). For internal nodes, Xj records the cardinality of a mutation group, represented as a red number along the edge Ei,j of the gene tree in Figure 2B, where i is the index of the parent node of Zj. Leaf nodes in D may correspond to more than one leaf node in the gene tree, namely any sister nodes with the same number of descendants. In this case, Xj is a vector with the cardinalities of the corresponding mutation groups (see for example node Z6 in Figure 4B).
Allocation of mutation groups along gT
The allocation random variables {Aj} are constrained by the information in the Tajima’s genealogy gT. In a given gT, every subtree is labeled by its ranking from past to present (Figure 3). Subtree i is subtended by branch bi with length li, for i = 2, …, n. We will assume that l2, the length of the root branch, is 0. Let c be the number of cherries (nodes with two leaves) in gT; the two branches of a given cherry share the same label bj ∈ {bn+1, …, bn+c}. The actual label of external branches is arbitrary but, for ease of exposition, we first label the cherries’ branches from left to right by {bn+1, …, bn+c}; singleton branches are labeled from left to right by bn+c+1, …, b2n−c (Figure 2C). The allocation variables {Aj} determine a possible correspondence between subtrees in gT and nodes in D: in particular, Aj indicates the branches in gT that subtend the subtrees corresponding to nodes {Zk} if {Zk} are child nodes of Zj.
Allocations of mutations to branches are usually not unique and computation of the conditional likelihood Pr(Y | gT, μ) requires summing over all possible allocations. In Figure 4A we show one such possible allocation of the mutation groups of the gene tree in Figure 2B along the Tajima’s genealogy in Figure 2C. For example, mutation group “a” in Figure 2B with cardinality 1 (number in red) is a mutation observed in 7 individuals (sum of black numbers of leaves descending from edge marked a). This same mutation group, “a”, is shown as a red number 1 in Figure 4A allocated to branch b5. If Zj is an internal node, the number of mutations Xj separating it from its parent node is a vector of length 1. If Zj is a leaf node, Xj can be a vector of length greater than 1. The length of Xj is the number of the corresponding sister nodes in the gene tree that were grouped together in forming node Zj. Aj = (Aj,1, …, Aj,|ch(j)|) denotes a collection of |ch(j)| vectors of branch labels in gT subtending the child-node subtrees of node Zj. Aj,1 corresponds to the branch subtending from the leftmost child node of Zj on D, Aj,2 corresponds to the branch subtending from the next child node of Zj, etc., and Aj,|ch(j)| corresponds to the branch subtending from the rightmost child node of Zj on D. Observe that, since we group some of the leaf nodes in the gene tree into a single node in D, any Aj,k may be a vector of branch labels; for example A1,1 = (b12, b9) and A1,2 = b10 in Figure 4B.
2.6 Computing the conditional likelihood
Under the infinite-sites mutation model, mutations are superimposed independently on the branches of gT as a Poisson process with rate μ. In order to compute Pr(Y | gT, μ), we marginalize over the unspecified allocation information in the directed acyclic graph D; that is, we sum over all possible mappings of mutations in the gene tree to branches in gT as follows:
$$\Pr(Y \mid g^T, \mu) = \sum_{a} \prod_{i=1}^{n_I + n_L} P(Z_i \mid Z_{pa(i)}, g^T, \mu),$$
where the sum runs over all allocations a compatible with the data and gT, pa(i) denotes the index of the parent of node i in D and we set P(Z0 | gT, μ) = 1 because it is assumed that there are no mutations above the root node and the length of the root branch l2 = 0. Writing L for the tree length of gT (i.e., the sum of the lengths of all branches of gT) and factoring out a global factor e^{−μL} (due to the Poisson distribution of mutations across the genealogy) from each of the above products over i ∈ {1, …, nI + nL}, the term that remains for a node Zi is a sum, over the arrangements in Π(xi, k), of products of Poisson terms (μl)^x/x!, one for each branch of length l in apa(i) carrying x mutations. Here Π(xi, k) is the set of all permutations of xi = {xi1, …, xik} divided into mi groups of different sizes. The number of different permutations of the k values of xi divided into mi groups of sizes k1, …, kmi is
$$\frac{k!}{k_1! \, k_2! \cdots k_{m_i}!}.$$
For example, assume that xi = {2, 2, 2, 0, 3, 3} and apa(i) = (b3, b4, b5, b6, b7, b8) with branch lengths {l3, l4, l5, l6, l7, l8}. In this case, k1 = 3 because there will be 3 branches with 2 mutations, k2 = 1 because there will be 1 branch with 0 mutations and k3 = 2 because there will be 2 branches with 3 mutations. The number of permutations of k = 6 mutations groups divided into mi = 3 groups with cardinalities 2, 0, 3 of sizes 3, 1, 2 is 6!/(3!1!2!) = 60.
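The count in this example can be reproduced with a short script. The sketch below is illustrative; the function name `distinct_arrangements` is hypothetical. It counts the distinct orderings of a vector of mutation counts by dividing k! by the factorial of the size of each group of equal values.

```python
from collections import Counter
from math import factorial

def distinct_arrangements(x):
    """Number of distinct orderings of the values in x: k!/(k_1! ... k_m!)."""
    total = factorial(len(x))
    for size in Counter(x).values():
        # Divide by the orderings within each group of equal values.
        total //= factorial(size)
    return total

x_i = [2, 2, 2, 0, 3, 3]  # mutation counts from the example in the text
count = distinct_arrangements(x_i)
```

Each of these distinct orderings assigns the six mutation counts to the six branches (b3, …, b8) in a different way, which is why they must all be visited when marginalizing over allocations.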
The conditional likelihood Pr(Y | gT, μ) is calculated via a depth-first search algorithm that marginalizes over the allocations by traversing the DAG from the tips to the root. The pseudocode can be found in the Appendix.
2.7 The case of unknown ancestral states
Up to now, we have assumed that the ancestral state was known at every segregating site. The representation of the data Y that we use in this case records the cardinalities of each mutation group and the genealogical relations between these groups, but does not assign labels to the sequences. Hence, in the terminology of Griffiths and Tavaré (1995), our data corresponds to an unlabeled rooted gene tree.
When the ancestral types are not known, the data (now denoted Y0) may be represented as an unlabeled unrooted gene tree. By the remark following Equation (1) in Griffiths and Tavaré (1995), if s is the number of segregating sites, then there are at most s + 1 unlabeled rooted gene trees that correspond to the unrooted gene tree of the observed data. By the law of total probability (see also Equation (10) in Griffiths and Tavaré (1995)), the conditional likelihood of Y0 can be written as the sum over all compatible unlabeled rooted gene trees Y(i) of the probability of Y(i) conditionally on gT. That is:
$$\Pr(Y^0 \mid g^T, \mu) = \sum_{i} \Pr(Y^{(i)} \mid g^T, \mu),$$
where each of the Y(i) corresponds to a unique unlabeled rooted gene tree compatible with the unrooted gene tree Y0 and the sum runs over all such rooted gene trees. In the following sections, we shall assume that the ancestral type at each site is known.
2.8 Bayesian inference of the effective population size trajectory
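Our posterior distribution of interest is
$$\Pr(\gamma, g^T, \tau \mid Y, \mu) \propto \Pr(Y \mid g^T, \mu)\, \Pr(g^T \mid \gamma)\, \Pr(\gamma \mid \tau)\, \Pr(\tau),$$
where γ = (log N(t))t≥0 has a Gaussian process prior with mean 0 and covariance function C(τ). This specification ensures that (N(t))t>0 is positive. In our implementation, we assume a regular geometric random walk prior; that is, γ is evaluated at B regularly spaced time points in [0, T], with independent Gaussian increments between consecutive grid points whose precision is τ.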
The parameter τ is a precision parameter that controls the degree of regularity of the random walk. We place a Gamma prior with parameters α = 0.01 and β = 0.001 on τ, reflecting our lack of prior information about the smoothness of the logarithm of the effective population size trajectory.
We approximate the posterior distribution of model parameters via an MCMC sampling scheme. Model parameters are sampled in blocks within a random scan Metropolis-within-Gibbs framework.
To summarize the effective population size trajectory, we compute the posterior median and 95% credible intervals pointwise at each grid point between 0 and the maximum time to the most recent common ancestor sampled.
2.8.1 Metropolis-Hastings updates for ranked tree shapes
There is a large literature on local transition proposal distributions for Kingman’s topologies (Kuhner et al., 1998; Rannala and Yang, 2003; Drummond et al., 2012; Whidden and Matsen, 2015; Aberer et al., 2016). In this paper, we adapted the local transition proposal of Markovtsova et al. (2000) to Tajima’s topologies. We briefly describe the scheme below and provide a pseudocode algorithm in the Appendix (Algorithm 1).
Given the current state of the chain {γ, τ, gT} = {γ, τ, Fn, t}, we propose a new ranked tree shape F* in two steps: (1) we first sample a coalescent interval k uniformly on {1, …, n − 2}. Given k, we focus solely on the coalescent event sampled and the one that follows (thus, the last coalescent event cannot be sampled). For step (2), there are two possible scenarios: either vintage k coalesces at event k + 1, or it does not. In the former scenario, we choose a new pair of lineages at random to coalesce at k from the 3 lineages subtending k and k + 1 (excluding k), and we coalesce the remaining lineage with k at k + 1. In the latter scenario, we invert the order of the coalescent events; that is, vintage k is relabeled k + 1 and conversely. The transition probability, denoted q(· | ·) below, is given by the product of the probabilities of the two steps. The new ranked tree shape is accepted with probability given by the Metropolis-Hastings ratio defined below:
$$a = \min\left\{ 1,\ \frac{\Pr(Y \mid F^*, t, \mu)\, \Pr(F^*)\, q(F_n \mid F^*)}{\Pr(Y \mid F_n, t, \mu)\, \Pr(F_n)\, q(F^* \mid F_n)} \right\}.$$
2.8.2 Split Hamiltonian Monte Carlo updates of (γ, τ)
To make efficient joint proposals of γ and τ, we use the Split Hamiltonian Monte Carlo method proposed by Lan et al. (2014). The target density, π(γ, τ) ∝ Pr(t | γ)Pr(γ | τ)Pr(τ), is the same target density implemented in Karcher et al. (2017) for fixed coalescent times t.
2.8.3 Hamiltonian Monte Carlo updates of coalescent times
Given the current state {γ, τ, gT} = {γ, Fn, t, τ}, we propose a new vector of coalescent times with target density π(t′) ∝ P(Y | Fn, t′, μ)P(t′ | γ) by numerically simulating a Hamiltonian system with Hamiltonian
$$H(t', s) = -\log \pi(t') + \tfrac{1}{2}\, s^{\top} M^{-1} s,$$
where s is the momentum vector, assumed to be normally distributed. For our implementation, we set the mass matrix M = I, the identity matrix. We simulate the Hamiltonian dynamics of the logarithm of times to avoid proposals with negative values. Solving the equations of the Hamiltonian system requires calculating the gradient of the logarithm of the target density with respect to the vector of log coalescent times. The gradient of the log conditional likelihood (score function) is calculated at every marginalization step in the sum-product algorithm for the likelihood calculation.
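For readers unfamiliar with Hamiltonian Monte Carlo, the numerical simulation of the dynamics is commonly done with the leapfrog integrator. The sketch below is a generic illustration with identity mass matrix, not the BESTT implementation: the function names are hypothetical, and the target is a toy standard normal density standing in for the density of the log coalescent times, whose gradient would come from the likelihood recursion.

```python
def leapfrog(q, p, grad_log_target, eps, n_steps):
    """Simulate Hamiltonian dynamics with M = I using the leapfrog scheme.

    q: position (here, a stand-in for log coalescent times); p: momentum.
    grad_log_target: function returning the gradient of log pi at q.
    """
    q, p = list(q), list(p)
    g = grad_log_target(q)
    for _ in range(n_steps):
        p = [pi + 0.5 * eps * gi for pi, gi in zip(p, g)]  # half step, momentum
        q = [qi + eps * pi for qi, pi in zip(q, p)]        # full step, position
        g = grad_log_target(q)
        p = [pi + 0.5 * eps * gi for pi, gi in zip(p, g)]  # half step, momentum
    return q, p

def hamiltonian(q, p, log_target):
    """H(q, p) = -log pi(q) + p.p / 2 for identity mass matrix."""
    return -log_target(q) + 0.5 * sum(pi * pi for pi in p)

# Toy target (assumption of this sketch): standard normal, log pi(q) = -q.q/2.
log_target = lambda q: -0.5 * sum(qi * qi for qi in q)
grad_log_target = lambda q: [-qi for qi in q]

q0, p0 = [1.0], [0.5]
q1, p1 = leapfrog(q0, p0, grad_log_target, eps=0.01, n_steps=100)
h0 = hamiltonian(q0, p0, log_target)
h1 = hamiltonian(q1, p1, log_target)
```

The proposal (q1, p1) would then be accepted with probability min{1, exp(h0 − h1)}; since the leapfrog scheme nearly conserves H for small step sizes, this probability is close to 1.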
At the beginning of Section 2.8, we described how we assume a regular geometric random walk prior on (N(t))t>0 at B regularly spaced time points in [0, T]. Ideally, the window size T must be at least t2, the time to the most recent common ancestor (TMRCA). However, t2 is not known. Our initial values of coalescent times t are obtained from the UPGMA implementation in phangorn (Schliep, 2011) with times properly rescaled by the mutation rate, and we set T = t2. We initially discretize the time interval [0, T] into B intervals of length T/(B − 1). As we generate new samples of t, we expand or contract our grid according to the current value of t2 by keeping the grid interval length fixed to T/(B − 1), effectively increasing or decreasing the dimension of γ.
2.8.4 Local updates of coalescent times
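In addition to HMC updates of coalescent times, we propose a move of a single coalescent time (excluding the TMRCA t2) chosen uniformly at random and sampled uniformly in the intercoalescent interval; that is, we choose i ∼ U({n, n − 1, …, 3}) and $t_i^* \sim U(t_{i+1}, t_{i-1})$. This is a symmetric proposal and the corresponding Metropolis-Hastings acceptance probability is
$$a = \min\left\{ 1,\ \frac{P(Y \mid F_n, t^*, \mu)\, P(t^* \mid \gamma)}{P(Y \mid F_n, t, \mu)\, P(t \mid \gamma)} \right\},$$
where t* denotes the vector t with ti replaced by the proposed value.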
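The local move can be sketched as follows. This is an illustration under the stated convention that t_{n+1} = 0, with hypothetical function names; the log target values passed to the acceptance function are assumed to be computed elsewhere (log conditional likelihood plus log coalescent density).

```python
import math
import random

def propose_local_time(times, rng):
    """Resample one coalescent time uniformly between its two neighbours.

    times: increasing list [t_n, ..., t_2] with at least two entries.
    The last entry (the TMRCA t_2) is never moved.
    """
    n = len(times) + 1
    pos = rng.randrange(0, n - 2)                # positions of t_n, ..., t_3
    lower = times[pos - 1] if pos > 0 else 0.0   # t_{i+1}, with t_{n+1} = 0
    upper = times[pos + 1]                       # t_{i-1}
    proposal = list(times)
    proposal[pos] = rng.uniform(lower, upper)
    return proposal, pos

def acceptance_probability(log_target_new, log_target_old):
    """Metropolis-Hastings acceptance for a symmetric proposal."""
    return min(1.0, math.exp(log_target_new - log_target_old))

rng = random.Random(1)
times = [0.1, 0.3, 0.7, 1.5]
proposal, pos = propose_local_time(times, rng)
```

Because the new time is drawn between its neighbours, the ordering of the coalescent events, and hence the ranked tree shape, is left unchanged by this move.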
2.8.5 Multiple independent loci
Thus far, we have assumed our data consist of a single linked locus of s segregating sites. We can extend our methodology to l independent loci with si segregating sites for i = 1,…, l. In this case, our data consist of l sequence alignments with elements in {0, 1}, where 0 indicates the ancestral allele as before. We then jointly estimate the Tajima’s genealogies of the l loci, the precision parameter τ, and the vector of log effective population sizes γ through their posterior distribution:
In Equation (11), we enforce that all loci follow the same effective population size trajectory but every locus can have its own mutation rate μi.
3 Results
3.1 The performance of BESTT in applications to simulated data
We tested our new method, BESTT, on simulated data under four different demographic scenarios. Note that in this section, N(t) is rescaled to the coalescent time scale, meaning that 1/N(t) is the pairwise rate of coalescence at time t in the past relative to the rate at the present time zero. We simulated genealogies under the following population size trajectories:
A period of exponential growth followed by constant size:
A trajectory with instantaneous growth:
An exponential growth: N(t) = 25e^{−5t}
A constant trajectory: N(t) = 1
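Coalescent times under a variable N(t) can be drawn with the standard time-rescaling argument: with k lineages, the increment of the cumulative intensity Λ(t) = ∫ du/N(u) until the next coalescence is exponential with rate k(k − 1)/2. The Python sketch below illustrates this for the constant and exponential trajectories, for which Λ has a closed-form inverse; it is not the simulation code used for the paper, and the function names are hypothetical.

```python
import math
import random

def exp_growth_cum(t):
    # Cumulative intensity for N(t) = 25 * exp(-5 t): (exp(5 t) - 1) / 125.
    return math.expm1(5.0 * t) / 125.0

def exp_growth_inv(y):
    # Inverse of exp_growth_cum.
    return math.log1p(125.0 * y) / 5.0

def coalescent_times(n, cum_intensity, inv_cum_intensity, rng=random):
    """Return increasing times [t_n, ..., t_2] for n sampled lineages."""
    times, t = [], 0.0
    for k in range(n, 1, -1):
        pairs = k * (k - 1) / 2.0
        e = rng.expovariate(1.0)       # unit-rate exponential draw
        t = inv_cum_intensity(cum_intensity(t) + e / pairs)
        times.append(t)
    return times

# Constant trajectory N(t) = 1: the cumulative intensity is the identity.
constant_times = coalescent_times(10, lambda t: t, lambda y: y)
growth_times = coalescent_times(10, exp_growth_cum, exp_growth_inv)
```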
Given a genealogy of length $L = \sum_{j=2}^{n} j\,(t_j - t_{j+1})$, where $t_j - t_{j+1}$ is the intercoalescent length while there are j lineages and $t_{n+1} = 0$, we drew the total number of mutations (segregating sites) s according to a Poisson distribution with parameter μL. We then placed the mutations uniformly at random along the branches of the genealogy. For each of the s mutations, we assigned the mutant type to individuals descending from the branch where the mutation occurred and the ancestral type otherwise.
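The mutation step can be sketched in Python as follows. The genealogy is represented, as an assumption of this example, by a list of branches, each given by its length and the set of sampled individuals descending from it; placing a mutation uniformly on the tree then amounts to choosing a branch with probability proportional to its length. Function and variable names are hypothetical.

```python
import random

def poisson(mean, rng):
    # Poisson draw obtained by counting unit-rate exponential arrivals
    # that fall in the interval [0, mean].
    count, total = 0, rng.expovariate(1.0)
    while total <= mean:
        count += 1
        total += rng.expovariate(1.0)
    return count

def simulate_mutations(branches, n, mu, rng=random):
    """branches: list of (length, set_of_descendant_leaves); returns n x s 0/1 rows."""
    lengths = [length for length, _ in branches]
    tree_length = sum(lengths)                 # L, the total tree length
    s = poisson(mu * tree_length, rng)         # number of segregating sites
    hit = rng.choices(range(len(branches)), weights=lengths, k=s)
    data = [[0] * s for _ in range(n)]
    for site, b in enumerate(hit):
        for leaf in branches[b][1]:
            data[leaf][site] = 1               # mutant type below the branch
    return data

# Example with n = 3: leaves 0 and 1 coalesce at time 0.2, the root is at 1.0.
example_branches = [(0.2, {0}), (0.2, {1}), (1.0, {2}), (0.8, {0, 1})]
example_data = simulate_mutations(example_branches, 3, mu=10.0)
```

In the example, the branch lengths sum to 2.2, which agrees with 3 × 0.2 + 2 × 0.8 computed from the intercoalescent intervals.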
We summarize our posterior inference by the posterior median and 95% Bayesian credible intervals, computed from 200 thousand iterations, thinned every 10 iterations, with 100 iterations of burn-in. Our initial number of change points for N(t) was set to 50 over the time interval between 0 and the initial time to the most recent common ancestor t2 for all simulations; however, over the course of the MCMC iterations, this number could increase or decrease according to the posterior distribution of t2.
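We assess the accuracy and precision of our estimates at K time points ω1, …, ωK. First, we computed the sum of relative errors, $\mathrm{SRE} = \sum_{i=1}^{K} |\hat{N}(\omega_i) - N(\omega_i)| / N(\omega_i)$, where $\hat{N}(\omega_i)$ is the estimated effective population size trajectory at time ωi. Second, we computed the mean relative width, $\mathrm{MRW} = \frac{1}{K}\sum_{i=1}^{K} |\hat{N}_{97.5}(\omega_i) - \hat{N}_{2.5}(\omega_i)| / N(\omega_i)$, where $\hat{N}_{97.5}(\omega_i)$ corresponds to the 97.5% upper limit and $\hat{N}_{2.5}(\omega_i)$ corresponds to the 2.5% lower limit of the estimated posterior distribution of N(ωi). In addition, we measured how well the 95% credible intervals cover the truth with the envelope measure, $\mathrm{ENV} = \frac{1}{K}\sum_{i=1}^{K} \mathbf{1}\{\hat{N}_{2.5}(\omega_i) \le N(\omega_i) \le \hat{N}_{97.5}(\omega_i)\}$.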
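For concreteness, the three summaries can be computed as in the Python sketch below. It assumes the usual definitions: relative errors summed over a grid of time points, credible-interval widths divided by the true value and averaged, and the fraction of grid points at which the interval contains the truth. The function names and the toy numbers are illustrative only.

```python
def sum_relative_errors(estimate, truth):
    # SRE: sum over grid points of |estimate - truth| / truth.
    return sum(abs(e - n) / n for e, n in zip(estimate, truth))

def mean_relative_width(lower, upper, truth):
    # MRW: average over grid points of (upper - lower) / truth.
    widths = [abs(u - l) / n for l, u, n in zip(lower, upper, truth)]
    return sum(widths) / len(widths)

def envelope(lower, upper, truth):
    # ENV: fraction of grid points whose interval covers the truth.
    hits = [1 if l <= n <= u else 0 for l, u, n in zip(lower, upper, truth)]
    return sum(hits) / len(hits)

# Toy values at three grid points.
truth = [1.0, 2.0, 4.0]
median = [1.5, 2.0, 3.0]
lower = [0.5, 1.0, 1.0]
upper = [2.0, 3.0, 3.5]
print(sum_relative_errors(median, truth))   # 0.75
print(envelope(lower, upper, truth))        # two of three points covered
```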
We first simulated 3 datasets of n = 10 individuals with an average of 100 segregating sites under different types of population size trajectories: constant, exponential growth and instantaneous growth. Results are depicted in Figure 5. Posterior medians and 95% credible intervals are shown as black curves and gray shaded areas, respectively. The trajectory used to simulate the data is depicted as a dashed line. Figure 5 shows that BESTT recovers the constant and exponential growth trajectories very well, whereas the estimate for the instantaneous growth scenario is less accurate and has higher uncertainty (wide credible intervals). In all three cases, our envelope measure is above 95%. Performance measures on all simulations are summarized in Table 1.
We analyzed the effect of increasing the number of segregating sites, the number of samples and the number of independent genealogies on posterior inference with BESTT. In all three cases, we expect our method to better recover the truth. Figure 6 shows our results on simulated data under a population size trajectory with instantaneous growth (Equation 13) of n = 10 individuals with 31, 63 and 120 segregating sites. As expected, our method recovers the truth with higher precision (MRW) and accuracy (SRE) when we increase the number of segregating sites. Increasing the number of segregating sites may impose more constraints on the gene tree. For n = 10, there are 7936 possible ranked tree shapes; however, for the datasets simulated with 31, 63 and 102 segregating sites, there are only 2582 ± 32, 2670 ± 34 and 556 ± 7 ranked tree shapes compatible with their corresponding gene trees. These numbers were estimated by importance sampling (Cappello and Palacios, 2019).
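The count of 7936 ranked tree shapes for n = 10 can be checked directly: the number of ranked tree shapes with n leaves is the Euler zigzag number of index n − 1, whereas the number of ranked labeled trees is n!(n − 1)!/2^(n − 1). The Python sketch below computes both counts; the function names are hypothetical.

```python
from math import factorial

def zigzag_numbers(m):
    """Return the Euler zigzag numbers [A_0, ..., A_m] via the Entringer recurrence."""
    row, out = [1], [1]
    for n in range(1, m + 1):
        new = [0]
        for k in range(1, n + 1):
            # E(n, k) = E(n, k - 1) + E(n - 1, n - k)
            new.append(new[k - 1] + row[n - k])
        row = new
        out.append(row[n])
    return out

def count_ranked_tree_shapes(n):
    # Unlabeled ranked binary trees with n leaves (Tajima's coalescent).
    return zigzag_numbers(n - 1)[n - 1]

def count_ranked_labeled_trees(n):
    # Labeled ranked binary trees with n leaves (Kingman's coalescent).
    return factorial(n) * factorial(n - 1) // 2 ** (n - 1)

print(count_ranked_tree_shapes(10))     # 7936
print(count_ranked_labeled_trees(10))   # 2571912000
```

For n = 10 the labeled space is therefore more than five orders of magnitude larger than the space of ranked tree shapes, before any constraint from the data is taken into account.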
As another performance assessment, we simulated datasets from a population size trajectory with instantaneous growth and a varying number of samples. We simulated datasets with n = 10, 25 and 35 samples and an expected number of 215 segregating sites. Our results, depicted in Figure 7, show that our method performs better in terms of SRE and MRW when the number of samples increases. Similarly, precision (MRW) and accuracy (SRE) increase when inference is done from a larger number of independent datasets. Finally, Figure 8 shows our results from 1, 5 and 10 datasets simulated from 1, 5 and 10 independent genealogies of 10 individuals with a population size trajectory of growth followed by a constant period (Equation 12). As expected, our method’s performance improves substantially as the number of genealogies increases.
3.2 Comparison to other methods
To our knowledge, no other method infers a variable effective population size over time from haplotype data while assuming the infinite sites mutation model and a nonparametric prior on N(t); therefore, a direct comparison of our method to others is not possible. Moreover, our method is the only one that explicitly averages over Tajima genealogies instead of Kingman genealogies. BEAST (Drummond et al., 2012) is a program for analyzing molecular sequences that uses MCMC to average over the Kingman tree space, and it is therefore a good reference for comparison to our method. We compared our results to the Extended Bayesian Skyline method (Heled and Drummond, 2008) implemented in BEAST.
Since the infinite sites mutation model is not implemented in BEAST, we first converted our simulated sequences of 0s and 1s to sequences of nucleotides by sampling s ancestral nucleotides uniformly on {A, T, C, G} and assigning one of the remaining 3 types uniformly at random to be the mutant type. This corresponds to a simulation of the Jukes-Cantor mutation model (Jukes and Cantor, 1969) that is currently implemented in BEAST.
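The conversion can be sketched in Python as follows. The input is assumed to be a list of rows of 0s and 1s (one row per sampled sequence, one column per segregating site); the function name is hypothetical and the output is a plain list of strings rather than a BEAST input file.

```python
import random

NUCLEOTIDES = "ATCG"

def binary_to_nucleotides(data, rng=random):
    """Map each 0/1 column to an ancestral and a mutant nucleotide."""
    s = len(data[0]) if data else 0
    # One ancestral nucleotide per site, drawn uniformly from {A, T, C, G}.
    ancestral = [rng.choice(NUCLEOTIDES) for _ in range(s)]
    # The mutant type is drawn uniformly from the three remaining nucleotides.
    mutant = [rng.choice([b for b in NUCLEOTIDES if b != a]) for a in ancestral]
    return ["".join(mutant[j] if row[j] else ancestral[j] for j in range(s))
            for row in data]

sequences = binary_to_nucleotides([[0, 1, 0], [1, 1, 0], [0, 0, 0]])
```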
We compare the results of BESTT depicted in Figure 5 to those of BEAST (Drummond et al., 2005, 2012) in Figure 9. We note that results from BEAST are generated from 10 million iterations and thinned every 1000 iterations, while results from BESTT are generated from 200 thousand iterations.
We compared our point estimates from both methods to the ground truth for each simulation (Table 2). In all three cases, BESTT has a better envelope than BEAST. For the exponential growth simulation (Figure 9, second row), the BEAST result has better SRE and MRW; however, its credible intervals are uneven, with very wide intervals at the ends. For the instantaneous growth simulation (Figure 9, third row), BEAST does not provide results beyond the time point 0.06, so we recomputed the performance statistics for the overlapping time interval (0, 0.06). In this interval, BESTT outperforms BEAST in terms of envelope and SRE.
Other methods implemented in BEAST, such as Bayesian Skyride (Minin et al., 2008) and Bayesian Skygrid (Gill et al., 2013), are more comparable to BESTT. These methods assume Gaussian process priors on log N(t), as BESTT does; however, for the simulations shown in Figure 9, we were not able to obtain reliable results because the acceptance probability of the effective population size samplers in BEAST was 0.
4 Inferring human population demography from mtDNA
We selected n = 35 samples of mtDNA at random from 107 Yoruban individuals available from the 1000 Genomes Project phase 3 (The 1000 Genomes Project Consortium, 2015). We retained the coding region, base pairs 576–16,024 according to the rCRS reference of human mitochondrial DNA (Anderson et al., 1981; Andrews et al., 1999), and removed 38 indels. Of the 260 polymorphic sites, we retained 240 sites compatible with the infinite sites mutation model. The final file is available at https://github.com/JuliaPalacios/phylodyn. To encode our data as 0s and 1s, we use the inferred root sequence RSRS of Behar et al. (2012) to define the ancestral type at each site. To rescale our results in units of years, we assumed a mutation rate per site per year of 1.3 × 10^−8 (Rebolledo-Jaramillo et al., 2014). We compare our results with the Extended Bayesian Skyline method (Drummond et al., 2012) implemented in BEAST in Figure 10. When applying BEAST, we assumed the Jukes-Cantor mutation model. Both methods detect an inflection point around 20kya followed by exponential growth. The mean time to the most recent common ancestor (TMRCA) inferred for these YRI mtDNA samples with BESTT is around 170kya with a 95% BCI of (142868, 207455), while the mean TMRCA inferred with BEAST is around 160kya with a 95% BCI of (133239, 196900). In Appendix B, we include two more comparisons of BESTT and BEAST.
5 Discussion
The size of emergent sequencing datasets prohibits the use of standard coalescent modeling for inferring evolutionary parameters. The main computational bottleneck of coalescent-based inference of evolutionary histories lies in the large cardinality of the hidden state space of genealogies. In the standard Kingman coalescent, a genealogy is a random labeled bifurcating tree that models the set of ancestral relationships of the samples. The genealogy accounts for the correlated structure induced by the shared past history of organisms, and explicit modeling of genealogies is fundamental for learning about that history. However, the genomic era is producing large datasets that require approaches that integrate over the hidden state space of genealogies more efficiently.
In this manuscript we show that a lower resolution coalescent model on genealogies, the “Tajima’s coalescent”, can be used as an alternative to the standard Kingman coalescent model. In particular, we show that the Tajima coalescent model provides a feasible alternative that integrates over a smaller state space than the standard Kingman model. The main advantage of Tajima’s coalescent is that it models the ranked tree shape rather than the fully labeled tree topology used in Kingman’s coalescent.
A priori, the cardinality of the state space of ranked tree shapes is much smaller than the cardinality of the state space of labeled trees. Moreover, in this manuscript we show that when the Tajima coalescent model is coupled with the infinite sites mutation model, the space of ranked tree shapes is constrained by the data, and the reduction in the cardinality of the hidden state space of Tajima’s trees is even more pronounced than expected.
In order to leverage the constraints imposed by the data and the infinite-sites mutation model, we apply Dan Gusfield’s perfect phylogeny algorithm (Gusfield, 1991) to represent sequence alignments as a gene tree. We exploit the gene tree representation for conditional likelihood calculations and for exploring the state space of ranked tree shapes.
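As background for the perfect phylogeny step, recall that when the ancestral state of every site is known, a 0/1 data matrix admits a perfect phylogeny if and only if, for every pair of sites, the sets of individuals carrying the mutant allele are either disjoint or nested. The Python sketch below checks this condition pair by pair; it is a simple quadratic-time illustration with hypothetical function names, not Gusfield's algorithm, which performs the check and builds the tree more efficiently.

```python
def carriers(data, site):
    # Individuals (row indices) carrying the mutant allele at this site.
    return frozenset(i for i, row in enumerate(data) if row[site])

def sites_compatible(a, b):
    # Two carrier sets fit on one tree if they are disjoint or nested.
    return a.isdisjoint(b) or a <= b or b <= a

def admits_perfect_phylogeny(data):
    """data: list of rows of 0/1 entries, where 0 is the ancestral allele."""
    n_sites = len(data[0]) if data else 0
    sets = [carriers(data, j) for j in range(n_sites)]
    return all(sites_compatible(sets[i], sets[j])
               for i in range(n_sites) for j in range(i + 1, n_sites))

print(admits_perfect_phylogeny([[1, 1, 0], [1, 0, 0], [0, 0, 1]]))  # True
print(admits_perfect_phylogeny([[1, 1], [1, 0], [0, 1]]))           # False
```

In the second example the two sites are carried by overlapping but non-nested sets of individuals, which cannot arise under the infinite sites model without recombination.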
For the calculation of the likelihood of the data conditioned on a given Tajima’s genealogy, we augment the gene tree representation of the data with the Tajima’s genealogy and map observed mutations to branches. We define a directed acyclic graph (DAG) with the augmented gene tree. This new representation as a DAG allows for calculating the likelihood with a depth-first search algorithm that traverses the gene tree from the leaves to the root. The computational bottleneck of our present implementation lies in the likelihood calculation. In future studies, our proposed algorithm for likelihood calculation can be further optimized as a sum-product algorithm; however, we are already able to infer effective population size trajectories from samples of size n ≈ 35 on a regular laptop computer within a few hours.
Our statistical framework draws on Bayesian nonparametrics. We place a flexible geometric random walk process prior on the effective population size that allows us to recover population size trajectories with abrupt changes in simulations. The inference procedure proposed in this manuscript relies on Markov chain Monte Carlo (MCMC) methods with three large Gibbs block updates: coalescent times, effective population size trajectory and ranked tree shape topology. We use Hamiltonian Monte Carlo updates for the continuous random variables (coalescent times and effective population size) and a Metropolis-Hastings sampler for exploring the space of ranked tree shapes. For exploring the genealogical space, Markovtsova et al. (2000) suggest a joint local proposal for both coalescent times and topology. Here we restrict our attention to the topology alone. A future line of research includes the development of a joint local proposal of coalescent times and ranked tree shapes. We also envision that a joint sampler of coalescent times and effective population size trajectories should improve mixing and convergence.
Finally, haplotype data of many organisms are usually sparse, with few unique haplotypes present at high frequencies. Our proposed method is ideally suited for this scenario, where the space of ranked tree shapes is drastically smaller than the space of labeled topologies.
Acknowledgements
This research is supported in part by a National Institutes of Health grant R01-GM-131404 and the Alfred P. Sloan Foundation to J.A.P. We want to acknowledge the developers of R-ape, R-phangorn and R-phylodyn, which facilitated our implementations. A.V. was supported in part by the chaire program Modélisation Mathématique et Biodiversité of Veolia Environnement – École Polytechnique – Museum National d’Histoire Naturelle – Fondation X. A.V. and J.A.P. were supported by the France-Stanford Center for Interdisciplinary Studies. This work was also supported by the National Science Foundation CAREER Award DBI-1452622 to S.R.
6 Appendix A Markovian proposal of ranked tree shapes
The following algorithm generates a new ranked tree shape from a Markovian proposal and outputs the corresponding transition probabilities. This proposal is used in section 2.8.1.
Algorithms for conditional likelihood calculation
The following two algorithms detail the calculation of Pr(Y | gT, μ). The observed data Y are encoded in GeneTree, a tree structure in which each node carries its number of descendants (or lineages) and its mutation information. Tajima’s genealogy gT is encoded as Fpath, which contains the ranked tree shape Fn, and times, which contains the vector of coalescent times t multiplied by the mutation rate μ.
7 Appendix B
We replicated the BEAST EBSP analysis of the 35 Yoruban individuals from the 1000 Genomes Project phase 3 using the whole mtDNA coding region consisting of 15409 sites. In both cases we assumed the Jukes-Cantor mutation model (Jukes and Cantor, 1969). Figure 11 shows the comparison between EBSP inference from the whole coding region and EBSP inference from the 240 segregating sites retained in Section 4, which are compatible with the infinite sites mutation model assumption. In both cases we recover very similar trajectories.
In addition, we compared our results with the BEAST Bayesian Skyline Plot (BSP) (Drummond and Rodrigo, 2000). For our reduced dataset of 240 segregating sites, we could not generate valid inference of N(t) with a Metropolis-Hastings acceptance probability greater than 0. Instead, we were able to generate results with BEAST BSP from the complete dataset of 15409 sites. The comparison of our method from 240 segregating sites to BEAST BSP from 15409 sites is depicted in Figure 12.