Abstract
Divergence time estimation is an essential component of evolutionary studies. It generally involves several types of data (molecular, morphological, fossil…) and relies on stochastic models of evolution. An important part of these models is the prior distribution of divergence times that they use. The birth-death-sampling model, which is arguably the simplest realistic model of phylogenetic trees, is a natural choice for deriving such prior distributions. It has been well studied and is still widely used in this context.
The main result provided here is a method for computing the exact distribution of the divergence times of any phylogenetic tree under a birth-death-sampling model. This computation has cubic time complexity, which allows us to deal with phylogenies of hundreds of tips on standard computers. The approach can be used for dating phylogenetic trees from their topologies only, for visualizing the effects of diversification parameters, etc.
An additional result shows how to directly sample all the divergence times of a phylogenetic tree with linear complexity under the same model. This sampling procedure can be integrated into phylogenetic inference methods, e.g., for proposing accurate MCMC moves.
1 Introduction
Estimating divergence times is an essential and difficult stage of phylogenetic inference [18, 19, 12, 3, 16, 6]. In order to perform this estimation, current approaches use stochastic models for combining different types of information: molecular and/or morphological data, fossil calibrations, evolutionary assumptions, etc. [25, 20, 7, 9]. Both in a Bayesian and in a maximum likelihood context, an important component of these stochastic models is the “prior” probability distribution of divergence times (i.e., that which does not take into account information about the genotype or phenotype of species [26, 11, 9]), which is often derived from diversification models [4, 10, 26, 13, 14]. Among these, the birth-death-sampling model is arguably the simplest realistic model since it includes three important features shaping phylogenetic trees [27, 28]. Namely, it models cladogenesis and extinction of species by a birth-death process and takes account of the incompleteness of data by assuming a uniform sampling of extant taxa. The birth-death-sampling model has been further studied and is currently used for phylogenetic inference [22, 24, 10, 4].
The main result of this work is a method to compute the exact distributions of the divergence times of a given tree topology from the parameters of a birth-death-sampling model. Namely, for any internal node of the phylogeny and any time t, we provide an algorithm which computes the exact probability for the divergence time associated with this node to be anterior to t. The complexity of this algorithm is polynomial in the size of the phylogeny, namely cubic in time and quadratic in memory space. In practice, it can deal with phylogenetic trees with up to hundreds of tips on standard desktop computers.
The computation of divergence time distributions can be applied to various questions. First, it can be used for dating phylogenetic trees from their topology only, as does the method implemented in the function compute.brlen of the R-package APE [8, 17]. Second, it can provide prior distributions in phylogenetic inference frameworks. It also allows one to visualize the effects of the birth-death-sampling parameters on the prior divergence time distributions, to investigate the consequences of evolutionary assumptions, etc.
An additional result of this work is a method for directly drawing samples of the set of all the divergence times of a phylogenetic tree according to the birth-death-sampling model. Though based on the same ideas, this sampling procedure is independent of the computation of distributions referred to just above. The sampling procedure is very fast and can easily be integrated into phylogenetic inference software [6, 21], e.g., for proposing accurate MCMC moves.
The approaches presented here can be extended in several directions. In particular, it is quite straightforward to deal with heterogeneity in time and to model massive extinctions as in [23]. Taking into account heterogeneity in the speciation-extinction rates and/or the sampling probability between clades seems feasible but is a more difficult question which I plan to investigate soon. In collaboration with a co-author of [2], we are currently integrating fossil finds in the model in order to obtain better node-calibrations for phylogenetic inference.
C-source code of the software performing the computation of divergence time distributions and their sampling under a birth-death-sampling model is available at https://github.com/gilles-didier/DateBDS.
The rest of the paper is organized as follows. Birth-death-sampling models are formally introduced in Section 2. Section 3 presents definitions and some results about tree topologies. The “start- and end-patterns”, i.e., the subparts of the diversification process from which divergence time probabilities are computed, are introduced in Section 4. Sections 5 and 6 present the computation of the divergence time distributions and a polynomial algorithm to perform it, respectively. The method for directly sampling the divergence times is described in Section 7. The computation of divergence time distributions is illustrated with the phylogenetic tree of Hominoidea from [5] in Section 8.
2 Birth-death-sampling model
The dynamics of speciation and extinction of species are modelled as a birth-death process with rates λ and μ that are constant both through time and across lineages [15]. Following [27], each extant species is assumed to be independently sampled (i.e., included in the study) with probability ρ. The whole model will be referred to as the birth-death-sampling model and thus has three parameters:
λ: the speciation rate,
μ: the extinction rate and
ρ: the probability for an extant taxon to be sampled.
In all that follows, we make the technical assumption that λ > μ and ρ > 0.
An important point is to distinguish between the part of the process that actually happened, which will be referred to as the whole or the complete process (Figure 1-Left), and the part that can be observed from the available information at the present time (i.e., from the sampled extant taxa), which will be referred to as the observed or the reconstructed process (Figure 1-Right). It can be shown that the reconstructed process of a birth-death-sampling process is a pure-birth process with a time-inhomogeneous birth rate.
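Let us start by recalling some already derived probabilities of interest under this model. Under the simple birth-death model (i.e., with ρ = 1) with parameters (λ, μ), the probability P(n, t) that a single lineage at time 0 has exactly n descendants at time t was given in [15]. We have that
$$P(0,t) = \frac{\mu\left(1-e^{-(\lambda-\mu)t}\right)}{\lambda-\mu e^{-(\lambda-\mu)t}}$$
and, for all n > 0,
$$P(n,t) = \frac{(\lambda-\mu)^2\,\lambda^{n-1}\left(1-e^{-(\lambda-\mu)t}\right)^{n-1}e^{-(\lambda-\mu)t}}{\left(\lambda-\mu e^{-(\lambda-\mu)t}\right)^{n+1}}.$$
Under the birth-death-sampling model (λ, μ, ρ), the probability Q(n, t) that a single lineage at time 0 has exactly n descendants sampled at time t was given in [27]. We have that
$$Q(0,t) = 1-\frac{\rho(\lambda-\mu)}{\rho\lambda+\left(\lambda(1-\rho)-\mu\right)e^{-(\lambda-\mu)t}}$$
and, for all n > 0,
$$Q(n,t) = \frac{\rho^{n}(\lambda-\mu)^2\,\lambda^{n-1}\left(1-e^{-(\lambda-\mu)t}\right)^{n-1}e^{-(\lambda-\mu)t}}{\left(\rho\lambda+\left(\lambda(1-\rho)-\mu\right)e^{-(\lambda-\mu)t}\right)^{n+1}}.$$
In the case where all the tips are sampled, we obtain the same equations as those of [15] just above, i.e., if ρ = 1 then Q(n, t) = P(n, t) for all λ, μ, n and t.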
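The two count distributions recalled above can be evaluated numerically. The sketch below is only illustrative: it assumes the standard closed forms of [15] and [27] written in the docstrings, and the function names `kendall_p` and `sampled_q` are hypothetical (they are not taken from the DateBDS source code).

```python
import math


def kendall_p(n, t, lam, mu):
    """Assumed form of [15]: probability that one lineage at time 0 has
    exactly n descendants at time t, without sampling (rho = 1)."""
    x = math.exp(-(lam - mu) * t)
    d = lam - mu * x
    if n == 0:
        return mu * (1.0 - x) / d
    return (lam - mu) ** 2 * x / d ** 2 * (lam * (1.0 - x) / d) ** (n - 1)


def sampled_q(n, t, lam, mu, rho):
    """Assumed form of [27]: probability that one lineage at time 0 has
    exactly n descendants sampled at time t, each extant lineage being
    sampled independently with probability rho."""
    x = math.exp(-(lam - mu) * t)
    d = rho * lam + (lam * (1.0 - rho) - mu) * x
    if n == 0:
        return 1.0 - rho * (lam - mu) / d
    return (rho * (lam - mu) ** 2 * x / d ** 2
            * (rho * lam * (1.0 - x) / d) ** (n - 1))
```

Two quick sanity checks on such an implementation are that each distribution sums to 1 over n, and that `sampled_q` with `rho = 1` coincides with `kendall_p`.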
3 Tree topologies
Tree topologies arising from diversification processes are rooted and binary thus so are all the tree topologies considered here. Moreover, all the tree topologies considered below will be labelled, which means their tips, and consequently all their nodes, are unambiguously identified. From now on, “tree topology” has to be understood as “labelled-rooted-binary tree topology”.
Since the context will avoid any confusion, we still write $\mathcal{T}$ for the set of nodes of any tree topology $\mathcal{T}$. For all tree topologies $\mathcal{T}$, we put $L_{\mathcal{T}}$ for the set of tips of $\mathcal{T}$ and, for all nodes n of $\mathcal{T}$, $\mathcal{T}_n$ for the subtree of $\mathcal{T}$ rooted at n.
For all sets S, |S| denotes the cardinality of S. In particular, $|\mathcal{T}|$ denotes the size of the tree topology $\mathcal{T}$ (i.e., its total number of nodes, internal or tips) and $|L_{\mathcal{T}}|$ its number of tips.
3.1 Probability
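Let us define $\mathbf{T}(\mathcal{T})$ as the probability of a tree topology $\mathcal{T}$ given its number of tips under a lineage-homogeneous process with no extinction, such as the reconstructed birth-death-sampling process [2, Supp. Mat., Appendix 2].
Theorem 1. A tree topology $\mathcal{T}$ resulting from a pure-birth realization of a lineage-homogeneous process has probability
$$\mathbf{T}(\mathcal{T}) = \frac{2^{|L_{\mathcal{T}}|-1}\,R(\mathcal{T})}{|L_{\mathcal{T}}|!\,(|L_{\mathcal{T}}|-1)!}$$
conditioned on having $|L_{\mathcal{T}}|$ tips, where $R(\mathcal{T})$ denotes the number of divergence time rankings consistent with $\mathcal{T}$.
If $|\mathcal{T}| = 1$, i.e., $\mathcal{T}$ is a single tip, we have $\mathbf{T}(\mathcal{T}) = 1$.
Otherwise, by putting a and b for the two direct descendants of the root of $\mathcal{T}$, we have
$$\mathbf{T}(\mathcal{T}) = \frac{2\,|L_{\mathcal{T}_a}|!\,|L_{\mathcal{T}_b}|!}{(|L_{\mathcal{T}}|-1)\,|L_{\mathcal{T}}|!}\,\mathbf{T}(\mathcal{T}_a)\,\mathbf{T}(\mathcal{T}_b).$$
Theorem 1 implies that both $\mathbf{T}(\mathcal{T})$ and $R(\mathcal{T})$ can be computed in linear time through a post-order traversal of the tree topology $\mathcal{T}$.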
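The post-order computation can be sketched as follows. This is a minimal illustration under stated assumptions, not the author's implementation: a topology is encoded as nested 2-tuples with tip labels at the leaves, the ranking count is assumed to satisfy the classical recursion R = C(i + j - 2, i - 1) · R_a · R_b for child subtrees with i and j tips, and the probability of a labelled topology with n tips is assumed to be 2^(n-1) · R / (n! (n-1)!), i.e., all ranked labelled topologies are taken as equiprobable. The function names are hypothetical.

```python
from fractions import Fraction
from math import comb, factorial


def tips_and_rankings(tree):
    """Return (number of tips, number of consistent rankings) of a topology.

    Children are processed before their parent, i.e., in post-order.
    """
    if not isinstance(tree, tuple):          # a tip
        return 1, 1
    na, ra = tips_and_rankings(tree[0])
    nb, rb = tips_and_rankings(tree[1])
    # Interleave the (na - 1) and (nb - 1) internal nodes of the two subtrees.
    return na + nb, comb(na + nb - 2, na - 1) * ra * rb


def topology_probability(tree):
    """Probability of a labelled topology given its number of tips (exact)."""
    n, r = tips_and_rankings(tree)
    return Fraction(2 ** (n - 1) * r, factorial(n) * factorial(n - 1))


caterpillar = ((("a", "b"), "c"), "d")
balanced = (("a", "b"), ("c", "d"))
print(topology_probability(caterpillar), topology_probability(balanced))
```

With 4 tips there are 12 labelled caterpillar topologies and 3 labelled balanced ones, so the two values should satisfy 12 · T(caterpillar) + 3 · T(balanced) = 1.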
3.2 Start-sets
A start-set of a tree topology $\mathcal{T}$ is a possibly empty subset A of internal nodes of $\mathcal{T}$ such that if an internal node of $\mathcal{T}$ belongs to A then so do all its ancestors. Note that the empty set ∅ is a start-set of any tree topology.
Given a tree topology $\mathcal{T}$ and a non-empty start-set A, we define
the start-tree $\Delta_{\mathcal{T},A}$ as the subtree topology of $\mathcal{T}$ made of all nodes in A and their direct descendants (Figure 2-Center);
the set of end-trees $E_{\mathcal{T},A}$ as the set of all the subtrees of $\mathcal{T}$ rooted at the tips of $\Delta_{\mathcal{T},A}$, i.e., $E_{\mathcal{T},A} = \{\mathcal{T}_m \mid m \in L_{\Delta_{\mathcal{T},A}}\}$ (Figure 2-Right).
By convention, $\Delta_{\mathcal{T},\emptyset}$, the start-tree associated with the empty start-set, is the subtree topology made only of the root of $\mathcal{T}$. There is then only a single end-tree, which is the whole tree topology $\mathcal{T}$, i.e., $E_{\mathcal{T},\emptyset} = \{\mathcal{T}\}$.
In all cases, there are as many trees in $E_{\mathcal{T},A}$ as tips in the tree $\Delta_{\mathcal{T},A}$.
For all internal nodes n of the tree topology $\mathcal{T}$, we define $\Gamma_{\mathcal{T}}(n)$ as the set of all start-sets of $\mathcal{T}$ which contain n.
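The definitions of this subsection can be made concrete by enumerating the start-sets of a small topology. The sketch below is only an illustration under assumed conventions: the topology is a hypothetical dictionary `children` mapping each internal node to its two direct descendants, and the helper names are not taken from the paper. Start-sets are built recursively: either the empty set, or the root together with one start-set of each child subtree.

```python
def start_sets(children, m):
    """All start-sets of the subtree rooted at node m, as frozensets."""
    if m not in children:                    # a tip has no internal node
        return [frozenset()]
    a, b = children[m]
    result = [frozenset()]                   # the empty start-set
    for left in start_sets(children, a):
        for right in start_sets(children, b):
            result.append(frozenset({m}) | left | right)
    return result


def start_tree_tips(children, root, start_set):
    """Tips of the start-tree, i.e., roots of the end-trees (sorted)."""
    if not start_set:                        # convention for the empty set
        return [root]
    return sorted(c for m in start_set for c in children[m]
                  if c not in start_set)


# Topology ((a, b), c) with internal nodes r (root) and v.
children = {"r": ("v", "c"), "v": ("a", "b")}
for A in start_sets(children, "r"):
    print(sorted(A), start_tree_tips(children, "r", A))

# Start-sets containing a given internal node n, here n = "v".
print([sorted(A) for A in start_sets(children, "r") if "v" in A])
```

Since the number of start-sets can grow exponentially with the number of tips, such an explicit enumeration is only practical for small examples.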
4 Patterns
From now on, we shall consider diversification processes starting at origin time s and ending at time e, which evolve following a birth-death-sampling model (λ, μ, ρ). In practice, the ending time e is usually the present time. A pattern is a part of the observed diversification process starting from a single lineage at a given time and ending with a certain number of lineages at another given time. It consists of the resulting tree topology and of the bounding times (Figure 3). We shall consider two types of patterns. Start-patterns start from the origin time s of the diversification process and end at a given time t ∈ [s, e]. End-patterns go from a given time t ∈ [s, e] to the ending time e.
Start- and end-patterns are very similar to patterns of types c and a defined in [2], respectively. Proofs of Lemmas 1 and 2 are essentially the same as those of the corresponding claims in [2].
4.1 Start-patterns
Start-patterns encompass the observed part of the diversification process running from the origin time s until a given time t ∈ [s, e]. We recall that, by observed part, we mean the part that can be reconstructed from the ending time. More formally, for all times t ∈ [s, e], a lineage alive at time t is observable if itself or one of its descendants is sampled at e.
A start-pattern $(s, t, \mathcal{T})$ starts with a single lineage at the origin time s and ends with $|L_{\mathcal{T}}|$ observable lineages and a tree topology $\mathcal{T}$ at t (Figure 3-left).
The probability O(t) for a lineage living at time t in the complete diversification process (as in Figure 1-Left) to be observable at the ending time e is the complementary probability of having no descendant sampled at time e. We have that
$$O(t) = 1-Q(0,e-t) = \frac{\rho(\lambda-\mu)}{\rho\lambda+\left(\lambda(1-\rho)-\mu\right)e^{-(\lambda-\mu)(e-t)}}.$$
Let us now compute the probability X(k, s, t) that a single lineage at time s has k descendants observable from e at time t ∈ [s, e]. This probability is the sum, over all numbers n ≥ k, of the probability that the lineage at s has n descendants at t in the whole process (i.e., without sampling), which is the probability P(n, t − s) given in [15], multiplied by the probability that exactly k of them are observable (i.e., $\binom{n}{k}O(t)^{k}(1-O(t))^{n-k}$). By setting n = j + k, we thus have
$$X(k,s,t) = \sum_{j\geq 0} P(j+k,t-s)\binom{j+k}{k}O(t)^{k}\left(1-O(t)\right)^{j}.$$
Lemma 1. Under the birth-death-sampling model (λ, μ, ρ), the probability of the start-pattern $(s, t, \mathcal{T})$ is
$$\mathbf{T}(\mathcal{T})\,X(|L_{\mathcal{T}}|,s,t).$$
Proof. The probability of the start-pattern $(s, t, \mathcal{T})$ is the probability of the tree topology $\mathcal{T}$ conditioned on its number of tips, which is $\mathbf{T}(\mathcal{T})$ from Theorem 1, multiplied by the probability of observing this number of tips in a start-pattern, which is that of getting $|L_{\mathcal{T}}|$ observable lineages at t from a single lineage at s, i.e., $X(|L_{\mathcal{T}}|, s, t)$.
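The probability X(k, s, t) can be checked numerically. The sketch below is a hedged illustration, not the author's code: it assumes the standard closed forms of [15] and [27] for the count distributions, evaluates X as a truncated series over the number of unobservable lineages, and compares it with an assumed shortcut, namely that thinning the lineages alive at t with probability O(t) is the same as a birth-death-sampling process over a duration t − s with sampling probability O(t). All function names and the truncation parameter `terms` are hypothetical.

```python
import math


def p(n, t, lam, mu):
    """Assumed form of [15]: n descendants after time t, no sampling."""
    x = math.exp(-(lam - mu) * t)
    d = lam - mu * x
    if n == 0:
        return mu * (1.0 - x) / d
    return (lam - mu) ** 2 * x / d ** 2 * (lam * (1.0 - x) / d) ** (n - 1)


def q(n, t, lam, mu, rho):
    """Assumed form of [27]: n sampled descendants after time t."""
    x = math.exp(-(lam - mu) * t)
    d = rho * lam + (lam * (1.0 - rho) - mu) * x
    if n == 0:
        return 1.0 - rho * (lam - mu) / d
    return (rho * (lam - mu) ** 2 * x / d ** 2
            * (rho * lam * (1.0 - x) / d) ** (n - 1))


def observable(t, e, lam, mu, rho):
    """O(t): a lineage alive at t has at least one descendant sampled at e."""
    return 1.0 - q(0, e - t, lam, mu, rho)


def x_series(k, s, t, e, lam, mu, rho, terms=500):
    """X(k, s, t) as a truncated sum over j unobservable lineages at t."""
    o = observable(t, e, lam, mu, rho)
    return sum(p(j + k, t - s, lam, mu) * math.comb(j + k, k)
               * o ** k * (1.0 - o) ** j for j in range(terms))


def x_closed(k, s, t, e, lam, mu, rho):
    """Assumed shortcut: same model with sampling probability O(t) at t."""
    return q(k, t - s, lam, mu, observable(t, e, lam, mu, rho))


for k in range(4):
    print(k, x_series(k, 0.0, 1.0, 3.0, 1.0, 0.5, 0.5),
          x_closed(k, 0.0, 1.0, 3.0, 1.0, 0.5, 0.5))
```

The shortcut evaluates X in constant time, whereas the truncated series needs enough terms for the tail to be negligible.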
4.2 End-patterns
End-patterns encompass all the observed parts of the diversification process which start from a single lineage at a given time t ∈ [s, e] until the ending time of the process.
An end-pattern $(t, e, \mathcal{T})$ starts with a single lineage at time t and ends with $|L_{\mathcal{T}}|$ lineages and a tree topology $\mathcal{T}$ at the ending time e (Figure 3-right).
Lemma 2. Under the birth-death-sampling model (λ, μ, ρ), the probability of the end-pattern $(t, e, \mathcal{T})$ is
$$\mathbf{T}(\mathcal{T})\,Q(|L_{\mathcal{T}}|,e-t).$$
Proof. The probability of the end-pattern $(t, e, \mathcal{T})$ is the probability of the tree topology $\mathcal{T}$ conditioned on its number of tips, which is $\mathbf{T}(\mathcal{T})$ from Theorem 1, multiplied by the probability of observing this number of tips in an end-pattern, which is the probability that a single lineage at time t has $|L_{\mathcal{T}}|$ descendants sampled at time e, i.e., $Q(|L_{\mathcal{T}}|, e - t)$ [27].
5 Distribution of divergence times
Let be a tree topology, s < e be two times and n be an internal node of . The joint probability of observing the tree topology and that the divergence time τn associated with n is anterior to a time t ∈ [s, e] from a diversification process starting at s and ending at e following the birth-death-sampling model (λ, μ, ρ) is
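Proof. Under the notations of the theorem, let us define $A_t$ as the set of nodes of $\mathcal{T}$ whose divergence times are anterior to t (i.e., $A_t = \{m \mid \tau_m < t\}$). Since the divergence times of the ancestors of a given node are always anterior to its own divergence time, all sets $A_t$ are start-sets. Moreover, any event including “τn < t” implies that $n \in A_t$, thus that $A_t$ belongs to the set $\Gamma_{\mathcal{T}}(n)$ of the start-sets of $\mathcal{T}$ which contain n. By construction, the set $\Gamma_{\mathcal{T}}(n)$ contains all the possible configurations of nodes of $\mathcal{T}$ with divergence times anterior to t such that τn < t. Since all these possibilities are mutually exclusive, the law of total probability gives us that
$$P(\mathcal{T},\tau_n<t) = \sum_{A\in\Gamma_{\mathcal{T}}(n)} P(\mathcal{T},A_t=A). \qquad (1)$$
The entries of the second column of Figure 4 (just after the summation sign) represent all the start-sets A in $\Gamma_{\mathcal{T}}(n)$.
In order to compute the probability $P(\mathcal{T}, A_t = A)$ for a start-set $A \in \Gamma_{\mathcal{T}}(n)$, we remark that
the part of the diversification process anterior to t is the start-pattern formed by s, t and the start-tree of A, and
the part of the diversification process posterior to t consists of all the end-patterns $(t, e, \mathcal{T}_m)$, where m ranges over the tips of the start-tree of A.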
Since the birth-death-sampling process is a Markov process, the evolutions of all the end-patterns are independent of one another and of the part of the process anterior to t, conditional upon starting with an observable lineage at time t. By assuming that its set of tip labels is known and conditional upon starting with an observable lineage, the probability of the end pattern is
From Lemma 1, the probability of the start-pattern is under the assumption that is labelled. We do not have a direct labeling of here: tips of are identified through the labels of their tip descendants in , i.e., the tips of the subtrees in . Since all labellings of are equiprobable, the probability of a labeling of is the inverse of the number of ways of choosing a subset of labels from ones for all tips m of without replacement, i.e., the inverse of the corresponding multinomial coefficient, which is
By construction, a labeling of fully determines the set of tip-labels of all subtrees with . Putting all together, we eventually get that which, with Equation 1, ends the proof. The whole computation of a toy example is schematized in Figure 4.
Corollary 1. Let $\mathcal{T}$ be a tree topology, s and e be the origin and end times of the diversification process and n be an internal node of $\mathcal{T}$. The probability that the divergence time τn associated with n is anterior to a time t ∈ [s, e] conditioned on observing the tree topology $\mathcal{T}$ under the birth-death-sampling model (λ, μ, ρ) is
$$P(\tau_n<t\mid\mathcal{T}) = \frac{P(\mathcal{T},\tau_n<t)}{\mathbf{T}(\mathcal{T})\,Q(|L_{\mathcal{T}}|,e-s)},$$
where $P(\mathcal{T}, \tau_n < t)$ is the joint probability given by Theorem 2.
Proof. It is enough to remark that the probability of observing a tree topology $\mathcal{T}$ (without constraint on its divergence times other than belonging to [s, e]) for a diversification process starting from time s with a single lineage and ending at time e is, by definition, exactly that of the end-pattern $(s, e, \mathcal{T})$ given in Lemma 2.
We put Fn for the cumulative distribution function (CDF) of the divergence time associated with node n, namely, $F_n(t) = P(\tau_n < t \mid \mathcal{T})$.
Figure 5 displays the probability densities of all the divergence times of the tree at its left. In Figures 5 and 6, the densities are computed from the corresponding distributions by finite difference approximations.
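A finite-difference approximation of a density from its CDF can be sketched as below. This is a generic illustration, not the scheme actually used for Figures 5 and 6: the step `h`, the clamping to the interval [lo, hi] and the function names are assumptions, and an exponential CDF stands in for a divergence time CDF Fn.

```python
import math


def density_from_cdf(cdf, t, lo, hi, h=1e-5):
    """Approximate the density at t by a difference quotient of the CDF.

    A central difference is used inside [lo, hi]; near a bound, the
    evaluation points are clamped, which gives a one-sided difference.
    """
    a = max(lo, t - h)
    b = min(hi, t + h)
    return (cdf(b) - cdf(a)) / (b - a)


def exp_cdf(t, rate=2.0):
    """Stand-in CDF (exponential law), used only to exercise the function."""
    return 1.0 - math.exp(-rate * t)


print(density_from_cdf(exp_cdf, 0.5, 0.0, 10.0), 2.0 * math.exp(-1.0))
```

The choice of `h` trades truncation error (large `h`) against floating-point cancellation (small `h`), which matters when the CDF itself is computed from sums of many terms.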
6 A polynomial computation
Since the number of start-sets may be exponential in the size of the tree, notably for balanced trees, Theorem 2 does not directly provide a polynomial algorithm for computing the divergence time distributions. We shall show in this section that the sum in the equation of Theorem 2 can be factorized in order to obtain a polynomial computation.
Let us first introduce an additional notation. For all tree topologies , all internal nodes n of and all numbers k between 1 and the number of tips of , we put for the set of start-sets A of containing n and such that the corresponding start-tree has exactly tips. By construction, a start-tree of has at least one tip and at most tips. We have:
Let us set for all nodes m of , where stands here for the set of nodes of the subtree topology rooted at m. Since, by construction, the elements of are start-sets of the tree topology , the start-tree a and the set of end-trees are well-defined for all . For all numbers , we put for the set of start-sets such that the corresponding start-tree has exactly k tips.
Let us now define for all nodes m of and all , the quantity
Basically, by putting r for the root of , we have that
We shall see how to compute for all nodes m of .
Let us first consider the case where k = 1. Since for all nodes m of , we have that
Let us now assume that k > 1, which makes sense only if contains more than a single node, and let a and b be the two direct descendants of m. Since we assume k > 1, all start-sets of contain m. It follows that we have if and only if there exist two start-sets and with {m} ∪ I ∪ J = A. The tree topology has root m with two child-subtrees and . In particular, we have . From Theorem 1, we have that
Moreover, since by construction, , we get that
More generally, the start-sets of are in one-to-one correspondence with the set of pairs (I, J) of such that . This set of pairs is exactly the union over all pairs of positive numbers (i, j) such that i + j = k, of the product sets . It follows that
After factorizing the left hand side of the equation just above, we eventually get that for all k > 1,
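Theorem 3. Let $\mathcal{T}$ be a tree topology, s and e be the origin and end times of the diversification process and n be an internal node of $\mathcal{T}$. Both the joint probability $P(\mathcal{T}, \tau_n < t)$ and the conditional probability $P(\tau_n < t \mid \mathcal{T})$ can be computed with time-complexity $O(|\mathcal{T}|^3)$ and memory-space-complexity $O(|\mathcal{T}|^2)$.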
For all tips m of , reduces to (Wm,1), which is directly given by Equation 3.
For all internal nodes m of , (Wm,1) is also given by Equation 3. Equation 4 shows that for all , Wm,k can be computed from and , where a and b are the direct descendants of m, in operations.
In sum, for all nodes m of , the quantities require memory space to be stored, each entry Wm,k requiring operations to be computed. It follows that the quantities can be determined with a post-order traversal of in time by using memory space.
From Equation 2, the probability P(τn < t) can be computed with complexity from the quantities , which ends the proof.
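The factorization idea can be illustrated on a simplified quantity. The sketch below does not compute the weighted quantities of this section: as a stated simplification, it only counts, for each node m and each k, the start-sets of the subtree rooted at m whose start-tree has exactly k tips. It relies on the same two ingredients, namely the special case k = 1 (the empty start-set) and, for k > 1, a convolution over the pairs (i, j) with i + j = k of the tables of the two direct descendants. The dictionary `children` and the function name are hypothetical.

```python
def counts_by_start_tree_tips(children, m):
    """c[k] = number of start-sets of the subtree rooted at m whose
    start-tree has exactly k tips (index 0 is unused and kept at 0)."""
    if m not in children:                    # tip: only the empty start-set
        return [0, 1]
    a, b = children[m]
    ca = counts_by_start_tree_tips(children, a)
    cb = counts_by_start_tree_tips(children, b)
    c = [0] * (len(ca) + len(cb) - 1)
    c[1] = 1                                 # k = 1: the empty start-set
    for i in range(1, len(ca)):              # k > 1: m belongs to the set,
        for j in range(1, len(cb)):          # combine one choice per child
            c[i + j] += ca[i] * cb[j]
    return c


# Balanced topology ((a, b), (c, d)) with internal nodes x (root), u and w.
children = {"x": ("u", "w"), "u": ("a", "b"), "w": ("c", "d")}
print(counts_by_start_tree_tips(children, "x"))
```

Each node stores a table whose length is bounded by its number of tips plus one, and the tables are filled bottom-up, which is the structure of a post-order dynamic program.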
7 Direct sampling of divergence times
Theorems 2 and 3 and Corollary 1 show how to compute the marginal (with regard to the other divergence times) of the divergence time distribution of any internal node of a phylogenetic tree from given birth-death-sampling parameters and origin and end times of the diversification. In particular, this allows one to sample any divergence time of the phylogenetic tree while disregarding the other divergence times. We shall see in this section how to draw a sample of all the divergence times of any tree topology, still given the birth-death-sampling parameters and the origin and end times of the diversification.
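Lemma 3. Let $\mathcal{T}$ be a tree topology, s and e be the origin and end times of the diversification process and r be the root of $\mathcal{T}$. The probability that the root divergence time τr is anterior to a time t ∈ [s, e] conditioned on observing the tree topology $\mathcal{T}$ under the birth-death-sampling model (λ, μ, ρ) is
$$P(\tau_r<t\mid\mathcal{T}) = 1-\frac{X(1,s,t)\,Q(|L_{\mathcal{T}}|,e-t)}{O(t)\,Q(|L_{\mathcal{T}}|,e-s)}.$$
Proof. The probability that the divergence time τr associated with r is anterior to a time t ∈ [s, e] is the complement of the probability that τr > t. Observing τr > t means that the starting lineage at s has a single descendant observable at t from which descends the tree topology $\mathcal{T}$ sampled at e. By putting $\bullet$ for the tree topology made of a single tip/lineage, it follows that
$$P(\mathcal{T},\tau_r>t) = \mathbf{T}(\bullet)\,X(1,s,t)\,\frac{\mathbf{T}(\mathcal{T})\,Q(|L_{\mathcal{T}}|,e-t)}{O(t)},$$
where $\mathbf{T}(\bullet) = 1$ and where the second factor is the probability of the end-pattern $(t, e, \mathcal{T})$ conditional upon starting with an observable lineage at t. Dividing by the probability $\mathbf{T}(\mathcal{T})\,Q(|L_{\mathcal{T}}|, e - s)$ of observing $\mathcal{T}$ (Lemma 2) gives the result.
Remark that the computation of $P(\tau_r < t \mid \mathcal{T})$ requires only the number of tips of $\mathcal{T}$ (in particular, the shape of $\mathcal{T}$ does not matter). Lemma 3 implies that the CDF Fr: t → P(τr < t) can be computed at any time t with complexity O(1). Moreover, one can verify that Fr is strictly increasing under the assumptions on the birth-death-sampling parameters, i.e., λ > μ and ρ > 0.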
Let us first show how to sample the divergence time of the root of a tree topology. The marginal, with regard to the other divergence times, of the distribution of the root-divergence time conditioned on the tree topology is Fr. In order to sample τr under this distribution, we shall use inverse transform sampling, which is based on the fact that if a random variable U is uniform over [0, 1] then $F_r^{-1}(U)$ has distribution function Fr (e.g., [1, chapter 2]). Though Lemma 3 provides an expression of $F_r$, I did not find an explicit formula for $F_r^{-1}$. We thus have to rely on numerical inversion at a given precision level in order to get a sample of the distribution Fr from a uniform sample on [0, 1]. The current implementation uses the bisection method, which computes an approximate inverse with a number of Fr-computations smaller than minus the logarithm of the required precision [1, p 32].
In order to sample the other divergence times, let us remark that by putting a and b for the two direct descendants of the root of $\mathcal{T}$ and t for the time sampled for the root-divergence, we have two independent diversification processes both starting at t and giving the two subtree topologies $\mathcal{T}_a$ and $\mathcal{T}_b$ at e. By applying Lemma 3 to $\mathcal{T}_a$ and $\mathcal{T}_b$ between t and e, the divergence times of the roots of these subtrees, i.e., a and b, can thus be sampled in the same way as above. The very same steps can then be performed recursively in order to sample all the divergence times of $\mathcal{T}$.
In short, a pre-order traversal of $\mathcal{T}$ allows one to sample all its divergence times in a time linear in $|\mathcal{T}|$ with a multiplicative factor proportional to minus the logarithm of the precision required for the samples.
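The whole sampling procedure can be sketched as follows. This is a hedged illustration rather than the DateBDS implementation. It assumes the closed form of [27] for Q, assumes that the root CDF is 1 − X(1, s, t) Q(n, e − t) / (O(t) Q(n, e − s)) for a topology with n tips, as suggested by the argument of this section, and assumes that X(1, s, t) can be evaluated as Q(1, t − s) with sampling probability O(t). The dictionary `children`, the tolerance `tol` and all function names are hypothetical.

```python
import math
import random


def q(n, t, lam, mu, rho):
    """Assumed form of [27]: n sampled descendants after time t."""
    x = math.exp(-(lam - mu) * t)
    d = rho * lam + (lam * (1.0 - rho) - mu) * x
    if n == 0:
        return 1.0 - rho * (lam - mu) / d
    return (rho * (lam - mu) ** 2 * x / d ** 2
            * (rho * lam * (1.0 - x) / d) ** (n - 1))


def root_cdf(t, s, e, n, lam, mu, rho):
    """Assumed CDF of the root divergence time of a topology with n >= 2 tips."""
    o = 1.0 - q(0, e - t, lam, mu, rho)       # O(t)
    x1 = q(1, t - s, lam, mu, o)              # assumed closed form of X(1, s, t)
    return 1.0 - x1 * q(n, e - t, lam, mu, rho) / (o * q(n, e - s, lam, mu, rho))


def sample_root_time(s, e, n, lam, mu, rho, rng, tol=1e-9):
    """Inverse transform sampling of the root time by bisection on [s, e]."""
    u = rng.random()
    lo, hi = s, e
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if root_cdf(mid, s, e, n, lam, mu, rho) < u:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def sample_divergence_times(children, root, s, e, lam, mu, rho, rng):
    """Sample all divergence times with a pre-order traversal."""
    n_tips = {}

    def count(m):                             # number of tips below each node
        n_tips[m] = sum(count(c) for c in children[m]) if m in children else 1
        return n_tips[m]

    count(root)
    times, stack = {}, [(root, s)]
    while stack:
        m, start = stack.pop()
        if m in children:                     # internal node: sample its time
            times[m] = sample_root_time(start, e, n_tips[m], lam, mu, rho, rng)
            stack.extend((c, times[m]) for c in children[m])
    return times


children = {"r": ("v", "c"), "v": ("a", "b")}
print(sample_divergence_times(children, "r", 0.0, 5.0, 1.0, 0.5, 0.5,
                              random.Random(1)))
```

Each internal node costs one bisection, so the number of CDF evaluations per node grows with the logarithm of (e − s) / tol. A parent time sampled extremely close to e could make the normalizing term underflow, which a production implementation would need to guard against.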
8 Example – Influence of the birth-death-sampling parameters
In order to illustrate the computation of the divergence time distributions, let us consider the Hominoidea subtree from the Primates tree of [5]. The approach can actually compute the divergence time distributions of the whole Primates tree of [5] but they cannot be displayed legibly because of its size.
The divergence time distributions were computed under several sets of birth-death-sampling parameters, namely all combinations with λ = 0.1 or 1, μ = λ – 0.09 or λ – 0.01 and ρ = 0.1 or 0.9. Since λ – μ appears in the probability formulas, several sets of parameters are chosen in such a way that they have the same difference between their birth and death rates.
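The eight parameter sets described above can be enumerated as in the short sketch below, where the variable names are hypothetical and the extinction rate is derived from the speciation rate and the chosen difference λ − μ.

```python
from itertools import product

# (lambda, mu, rho) for lambda in {0.1, 1}, lambda - mu in {0.09, 0.01}
# and rho in {0.1, 0.9}; mu is rounded to absorb floating-point noise.
parameter_sets = [
    (lam, round(lam - gap, 2), rho)
    for lam, gap, rho in product((0.1, 1.0), (0.09, 0.01), (0.1, 0.9))
]

for lam, mu, rho in parameter_sets:
    print(f"lambda={lam} mu={mu} rho={rho}")
```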
Divergence time distributions obtained in this way are displayed in Figure 6 around their internal nodes (literally, since nodes are positioned at the median of their divergence times). Each distribution is plotted at its own scale in order to be optimally displayed. This representation allows one to visualize the effects of each parameter on the shape and the position of the distributions, to investigate which parameter values are consistent with a given evolutionary assumption, etc.
We observe in Figure 6 that, all other parameters being fixed, the greater the speciation/birth rate λ (resp. the sampling probability ρ), the closer the divergence time distributions are to the ending time.
Influence of the extinction/death rate on the divergence time distributions is more subtle and ambiguous, at least for this set of parameters. All other parameters being fixed, it seems that an increase of the extinction rate tends to push distributions of nodes close to the root towards the starting time and, conversely, those of nodes close to the tips towards the ending time.
The divergence time distributions obtained for λ = 0.1, μ = 0.01 and ρ = 0.9 (Figure 6, column 2, top) and for λ = 1, μ = 0.91 and ρ = 0.1 (Figure 6, column 1, bottom) are close to one another. The same remark holds for λ = 0.1, μ = 0.09 and ρ = 0.9 (Figure 6, column 4, top) and for λ = 1, μ = 0.99 and ρ = 0.1 (Figure 6, column 3, bottom). This point suggests that estimating the birth-death-sampling parameters from the divergence times might be difficult, even if the divergence times are accurately determined.
The variety of shapes of the divergence time probability densities observed in Figures 5 and 6 exceeds that of the standard prior distributions used in phylogenetic inference, e.g., uniform, lognormal, gamma, exponential [11, 9].
Footnotes
gilles.didier{at}univ-amu.fr