ABSTRACT
Placing a new species on an existing phylogeny has increasing relevance to several applications. Placement can be used to update phylogenies in a scalable fashion and can help identify unknown query samples using (meta-)barcoding, skimming, or metagenomic data. Maximum likelihood (ML) methods of phylogenetic placement exist, but these methods are not scalable to reference trees with many thousands of leaves, limiting their ability to enjoy the benefits of dense taxon sampling in modern reference libraries. They also rely on assembled sequences for the reference set and aligned sequences for the query. Thus, ML methods cannot analyze datasets where the reference consists of unassembled reads, a scenario relevant to emerging applications of genome-skimming for sample identification. We introduce APPLES, a distance-based method for phylogenetic placement. Compared to ML, APPLES is an order of magnitude faster and more memory efficient, and unlike ML, it is able to place on large backbone trees (tested for up to 200,000 leaves). We show that using dense references improves accuracy substantially, so that APPLES on dense trees is more accurate than ML on sparser trees, where ML can run. Finally, APPLES can accurately identify samples without assembled references or aligned queries using k-mer-based distances, a scenario that ML cannot handle. APPLES is publicly available at github.com/balabanmetin/apples.
Phylogenetic placement is the problem of finding the optimal position for a new query species on an existing backbone (or, reference) tree. Placement, as opposed to a de novo reconstruction of the full phylogeny, has two advantages. In some applications (discussed below), placement is all that is needed, and in terms of accuracy, it is as good as, and perhaps even better than, de novo reconstruction. Moreover, placement can be more scalable than de novo reconstruction when dealing with very large trees.
Earlier research on placement was motivated by scalability. For example, placement is used in greedy algorithms that start with an empty tree and add sequences sequentially (e.g., Felsenstein, 1981; Desper and Gascuel, 2002). Each placement requires polynomial (often linear) time with respect to the size of the backbone, and thus, these greedy algorithms are scalable (often requiring quadratic time overall). Despite computational challenges (Warnow, 2017), there has been much progress in the de novo reconstruction of ultra-large trees (e.g., thousands to millions of sequences) using both maximum likelihood (ML) (e.g., Price et al., 2010; Nguyen et al., 2015) and distance-based (e.g., Lefort et al., 2015) approaches. However, these large-scale reconstructions require significant resources. As new sequences continually become available, placement can be used to update existing trees without repeating previous computations on the full dataset.
More recently, placement has found a new application in sample identification: given one or more query sequences of unknown origins, detect the identity of the (set of) organism(s) that could have generated that sequence. These identifications can be made easily using sequence matching tools such as BLAST (Altschul et al., 1990) when the query either exactly matches or is very close to a sequence in the reference library. However, when the sequence is novel (i.e., has lower similarity to known sequences in the reference), this closest match approach is not sufficiently accurate (Koski and Golding, 2001), leading some researchers to adopt a phylogenetic approach (Sunagawa et al., 2013; Nguyen et al., 2014). Sample identification is essential to the study of mixed environmental samples, especially of the microbiome, both using 16S profiling (e.g., Gill et al., 2006; Krause et al., 2008) and metagenomics (e.g., von Mering et al., 2007). It is also relevant to barcoding (Hebert et al., 2003) and meta-barcoding (Clarke et al., 2014; Bush et al., 2017) and quantification of biodiversity (e.g., Findley et al., 2013). Driven by applications to microbiome profiling, placement tools like pplacer (Matsen et al., 2010) and EPA(-ng) (Berger et al., 2011; Barbera et al., 2018) have been developed. Researchers have also developed methods for aligning query sequences (e.g., Berger and Stamatakis, 2011; Mirarab et al., 2012) and for downstream steps (e.g., Stark et al., 2010; Matsen and Evans, 2013). These publications have made a strong case that for sample identification, placement is sufficient (i.e., de novo is not needed). Moreover, some studies (e.g., Janssen et al., 2018) have shown that when dealing with fragmentary reads typically found in microbiome samples, placement can be more accurate than de novo construction and can lead to improved associations of microbiome with clinical information.
Existing phylogenetic placement methods have focused on the ML inference of the best placement – a successful approach, which nevertheless, suffers from two shortcomings. On the one hand, ML can only be applied when the reference species are assembled into full-length sequences (e.g., an entire gene) and are aligned; however, in new applications that we will describe, assembling (and hence aligning) the reference set is not possible. On the other hand, ML, while somewhat scalable, is still computationally demanding, especially in memory usage, and cannot place on backbone trees with many thousands of leaves. As the density of the reference set substantially impacts the accuracy and resolution of placement, this inability to use ultra-large trees as backbone also limits accuracy. This limitation has motivated alternative methods using locality-sensitive hashing (Brown and Truszkowski, 2013) and divide-and-conquer (Mirarab et al., 2012).
Assembly-free and alignment-free sample identification using genome-skimming (Dodsworth, 2015) can also benefit from phylogenetic placement. A genome-skim is a shotgun sample of the genome sequenced at low coverage (e.g., 1X) – so low that assembling the nuclear genome is not possible (though, mitochondrial or plastid genomes can often be assembled). Genome-skimming promises to replace traditional marker-based barcoding of biological samples (Coissac et al., 2016), but restricting analyses to organelle genomes limits resolution. Sarmashghi et al. (2019) have recently shown that using shared k-mers, the distance between two unassembled genome-skims with low coverage can be accurately estimated. This approach, unlike assembling organelle genomes, uses data from the entire nuclear genome and hence promises to provide a higher resolution (e.g., at species or sub-species levels) while keeping sequencing costs low. However, ML and other methods that require assembled sequences cannot analyze genome-skims, where both the reference and the query species are unassembled genome-wide bags of reads.
Distance-based approaches to phylogenetics are well-studied, but no existing tool can perform distance-based placement of a query sequence on a given backbone. The distance-based approach promises to solve both shortcomings of ML methods. Distance-based methods are computationally efficient and do not require assemblies. They only need distances (however computed). Thus, they can take as input assembly-free estimates of genomic distance estimated from low coverage genome-skims using Skmer (Sarmashghi et al., 2019) or other alternatives (Haubold, 2014; Leimeister and Morgenstern, 2014; Leimeister et al., 2017; Yi and Jin, 2013; Benoit et al., 2016; Fan et al., 2015; Ondov et al., 2016; Jain et al., 2017). While alignment-based phylogenetics has been traditionally more accurate than alignment-free methods when both methods are possible, in these new scenarios, only alignment-free methods are applicable.
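For intuition on how an alignment-free distance can be computed at all, a minimal k-mer distance in the style of Mash (Ondov et al., 2016) can be sketched as below. This is a simplified stand-in for illustration only: Skmer additionally corrects for low coverage and sequencing error, which this sketch ignores.

```python
from math import log

def kmers(seq, k):
    """Set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def mash_distance(seq1, seq2, k=21):
    """Jaccard similarity j of the two k-mer sets, converted to a
    distance with the Mash formula d = -(1/k) * ln(2j / (1 + j))."""
    a, b = kmers(seq1, k), kmers(seq2, k)
    j = len(a & b) / len(a | b)
    return float('inf') if j == 0 else -log(2 * j / (1 + j)) / k
```

In a placement pipeline, such genomic distances would still be phylogenetically corrected (e.g., under JC69, as described below) before being fit to a tree.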
Here, we introduce a new method for distance-based phylogenetic placement called APPLES (Accurate Phylogenetic Placement using LEast Squares). APPLES uses dynamic programming to find the optimal distance-based placement of a sequence with running time and memory usage that scale linearly with the size of the backbone tree. We test APPLES in simulations and on real data, both for alignment-free and aligned scenarios.
MATERIALS AND METHODS
Problem Statement
Notations
Let an unrooted tree T = (V, E) be a weighted connected acyclic undirected graph with leaves denoted by ℒ = {1 … n}. We let T* be the rooting of T on a leaf 1 obtained by directing all edges away from 1. For node u ∈ V, let p(u) denote its parent, c(u) denote its set of children, sib(u) denote its siblings, and g(u) denote the set of leaves at or below u (i.e., those that have u on their path to the root), all with respect to T*. Also let l(u) denote the length of the edge (p(u), u).
Distances
The tree T defines an n × n matrix where each entry dij(T) corresponds to the path length between leaves i and j. We further generalize this definition so that duv(T*) indicates the length of the undirected path between any two nodes of T* (when clear, we simply write duv). Given some input data, we can compute a matrix of all pairwise sequence distances Δ, where the entry δij indicates the dissimilarity between species i and j. When the sequence distance δij is computed using (the correct) phylogenetic model, it will be a noisy but statistically consistent estimate of the tree distance dij(T) (Felsenstein, 2003). Given these “phylogenetically corrected” distances (e.g., δij = −¾ ln(1 − 4h/3), where h is the normalized Hamming distance, under the Jukes and Cantor (1969) model), we can define optimization problems to recover the tree that best fits the distances. A natural choice is minimizing the (weighted) least squares difference between tree and sequence distances: Q*(T) = Σ1≤i<j≤n wij (δij − dij(T))² (Eq. 1).
Here, weights (e.g., wij) are used to reduce the impact of large distances (expected to have high variance). A general weighting schema can be defined as wij = δij^−k for a constant value k ∈ ℕ. Standard choices of k include k = 0 for the ordinary least squares (OLS) method of Cavalli-Sforza and Edwards 1967, k = 1 due to Beyer et al. 1974 (BE), and k = 2 due to Fitch and Margoliash 1967 (FM).
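The distance correction and the weighting schema above can be illustrated with a minimal sketch (function names are ours, not part of APPLES):

```python
from math import log

def jc69_distance(h):
    """Phylogenetically correct a normalized Hamming distance h under
    the Jukes-Cantor (1969) model: d = -(3/4) * ln(1 - 4h/3)."""
    return -0.75 * log(1.0 - (4.0 / 3.0) * h)

def weight(delta, k):
    """General weighting schema w = delta**(-k); k = 0 gives OLS,
    k = 1 gives BE, and k = 2 gives FM."""
    return 1.0 if k == 0 else delta ** (-k)
```

For example, a Hamming distance of 0.10 corrects to roughly 0.107, and under FM weighting (k = 2) a pair at corrected distance 0.5 receives weight 4, down-weighting the squared error on longer, noisier distances.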
Finding arg minT Q*(T) is NP-Complete (Day, 1987). However, decades of research have produced heuristics like neighbor-joining (Saitou and Nei, 1987), alternative formulations like (balanced) minimum evolution (Cavalli-Sforza and Edwards, 1967; Desper and Gascuel, 2002), and several effective tools for solving the problem heuristically (e.g., FastME by Lefort et al. 2015, DAMBE by Xia 2018, and Ninja by Wheeler 2009).
Phylogenetic placement
We let P (u, x1, x2) be the tree obtained by adding a query taxon q on an edge (p(u), u), creating three edges (t, q), (p(u), t), and (t, u), with weights x1, x2, and l(u) – x2, respectively (Fig. 1). When clear, we simply write P and note that P induces T both in topology and branch length. We now define the problem.
Least Squares Phylogenetic Placement (LSPP)
Input: A backbone tree T on ℒ, a query species q, and a vector Δq* with elements δqi giving sequence distances between q and every species i ∈ ℒ;
Output: The placement tree P that adds q on T and minimizes Q(P) = Σi∈ℒ wqi (δqi − dqi(P))² (Eq. 2).
Linear Time Solution for LSPP
The number of possible placements of q is 2n – 3. Therefore, LSPP can be solved by simply iterating over all the topologies, optimizing the score for that branch, and returning the placement with the minimum least square error. A naive algorithm can accomplish this in Θ(n²) running time by optimizing Eq. 2 for each of the 2n – 3 branches. However, using dynamic programming, the optimal solution can be found in linear time.
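For a single candidate branch, Q is quadratic in (x1, x2), so the per-branch optimization in the naive algorithm reduces to a 2 × 2 system of normal equations. A sketch of that step (our own variable names; the non-negativity constraints discussed later in the text are omitted here): for a leaf i below the insertion edge, the modeled path length is x1 − x2 + (l(u) + dui), and for the remaining leaves it is x1 + x2 + dp(u)i, so each leaf contributes a residual δqi − x1 − a[i]·x2 − c[i] with a[i] = ∓1.

```python
def best_on_edge(deltas, a, c, w):
    """Minimize Q = sum_i w[i] * (deltas[i] - x1 - a[i]*x2 - c[i])**2
    over (x1, x2) via the 2x2 normal equations. a[i] = -1 and
    c[i] = l(u) + d(u, i) for leaves below the edge; a[i] = +1 and
    c[i] = d(p(u), i) otherwise."""
    A = sum(w)
    B = sum(wi * ai for wi, ai in zip(w, a))
    C = sum(wi * ai * ai for wi, ai in zip(w, a))
    y1 = sum(wi * (di - ci) for wi, di, ci in zip(w, deltas, c))
    y2 = sum(wi * ai * (di - ci) for wi, ai, di, ci in zip(w, a, deltas, c))
    det = A * C - B * B                 # positive when both sides non-empty
    x1 = (C * y1 - B * y2) / det
    x2 = (A * y2 - B * y1) / det
    q = sum(wi * (di - x1 - ai * x2 - ci) ** 2
            for wi, ai, di, ci in zip(w, a, deltas, c))
    return x1, x2, q
```

Running this for every branch and keeping the minimum q is exactly the Θ(n²) naive algorithm; the dynamic programming below avoids recomputing the sums from scratch for each branch.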
The LSPP problem can be solved with Θ(n) running time and memory.
The proof (given in Appendix A) follows easily from three lemmas that we state next. The algorithm starts with precomputing a fixed-size set of values for each node. For any node u and exponents a ∈ ℤ and b ∈ ℕ+, let S(a, b, u) = Σi∈g(u) δqi^a dui^b, and for b = 0, let S(a, 0, u) = Σi∈g(u) δqi^a. Note that S(0, 0, u) = |g(u)|. Similarly, for u ∈ V \ {1}, let R(a, b, u) = Σi∉g(u) δqi^a dp(u)i^b for b > 0 and let R(a, 0, u) = Σi∉g(u) δqi^a.
The set of all S(a, b, u) and R(a, b, u) values can be precomputed in Θ(n) time with two tree traversals using the dynamic programming given by S(a, b, u) = Σv∈c(u) Σ0≤j≤b C(b, j) l(v)^(b−j) S(a, j, v), computed in a post-order traversal (with S(a, 0, u) = δqu^a and S(a, b, u) = 0 for b > 0 when u is a leaf), and R(a, b, u) = Σ0≤j≤b C(b, j) l(p(u))^(b−j) R(a, j, p(u)) + Σv∈sib(u) Σ0≤j≤b C(b, j) l(v)^(b−j) S(a, j, v), computed in a subsequent pre-order traversal. Here, C(b, j) denotes the binomial coefficient; both recursions follow from binomial expansions of dui = l(v) + dvi and of dp(u)i.
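Because dui = l(v) + dvi for every leaf i below a child v, a binomial expansion gives a single bottom-up pass for the S values. The sketch below uses a toy nested-tuple tree encoding of our own (it is not APPLES code) and can be checked directly against the definition of S:

```python
from math import comb

# Toy encoding: a node is (edge_length, payload); payload is a leaf
# name (str) or a list of child nodes. delta maps each leaf name to
# its sequence distance from the query q.
def S_values(node, delta, a, bmax=2):
    """Return [S(a, 0, u), ..., S(a, bmax, u)] where
    S(a, b, u) = sum over leaves i below u of delta[i]**a * d(u, i)**b."""
    _, payload = node
    if isinstance(payload, str):                 # leaf: d(u, u) = 0
        return [delta[payload] ** a] + [0.0] * bmax
    S = [0.0] * (bmax + 1)
    for child in payload:
        l = child[0]
        Sv = S_values(child, delta, a, bmax)
        for b in range(bmax + 1):
            # d(u, i) = l(v) + d(v, i) below child v, so expand
            # (l + d)**b with the binomial theorem.
            for j in range(b + 1):
                S[b] += comb(b, j) * l ** (b - j) * Sv[j]
    return S
```

A symmetric pre-order pass fills in the R values for the leaves outside each subtree; together they give the Θ(n) precomputation.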
Equation 2 can be rearranged (see Eq. S2 in Appendix A) such that computing Q(P) for a given P = P (u, x1, x2) requires a constant-time computation using the S(a, b, u) and R(a, b, u) values for −k ≤ a ≤ 2 − k and 0 ≤ b ≤ 2.
Thus, after a linear time precomputation, we can compute the error for any given placement in constant time. It remains to show that for each node, the optimal placement on the branch above it (e.g., x1 and x2) can be computed in constant time.
For a fixed node u ∈ V \ {1}, if the system ∂Q(P (u, x1, x2))/∂x1 = ∂Q(P (u, x1, x2))/∂x2 = 0 has a solution, then that solution has a closed form in terms of the precomputed S and R values and hence can be computed in constant time.
Non-negative branch lengths
The solution to Equation 5 does not necessarily conform to the constraints 0 ≤ x1 and 0 ≤ x2 ≤ l(u). However, the following lemma (proof in Appendix A) allows us to easily impose the constraints by choosing optimal boundary points when the unrestricted solution falls outside the boundaries.
With respect to variables x1 and x2, Q(P (u, x1, x2)) is a convex function.
Minimum evolution
An alternative to directly using MLSE (Eq. 1) is the minimum evolution (ME) principle (Cavalli-Sforza and Edwards, 1967; Rzhetsky and Nei, 1992). Our algorithm can also optimize the ME criterion: after computing x1 and x2 by optimizing MLSE for each node u, we choose the placement with the minimum total branch length. This is equivalent to using arg minu x1, since the value of x2 does not contribute to total branch length. Other solutions for ME placement exist (Desper and Gascuel, 2002), a topic we return to in the Discussion section.
Hybrid
We have observed cases where ME is correct more often than MLSE, but when it is wrong, unlike MLSE, it has a relatively high error. This observation led us to design a hybrid approach. After computing x1 and x2 for all branches, we first select the top log2(n) edges with minimum Q(P (u, x1, x2)) values (this requires Θ(n log log n) time). Among this set of edges, we place the query on the edge satisfying the ME criterion.
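The hybrid selection step can be sketched with a bounded heap; `candidates` below is a hypothetical list pairing each branch with its already-optimized error Q and pendant length x1 (a toy illustration, not APPLES code):

```python
import heapq
from math import log2

def hybrid_pick(candidates):
    """Hybrid placement selection. candidates is a list of
    (q_error, x1, edge_id) triples, one per branch. Keep the log2(n)
    branches with smallest least-squares error Q, then apply the ME
    criterion (minimum x1, i.e., minimum added branch length) among them."""
    k = max(1, int(log2(len(candidates))))
    top = heapq.nsmallest(k, candidates)        # smallest Q first
    return min(top, key=lambda t: t[1])[2]      # minimum x1 among the top
```

`heapq.nsmallest` keeps a heap of size k = log2(n), which is what gives the Θ(n log log n) bound for this step.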
Datasets
We benchmark accuracy and scalability of APPLES in two settings: sample identification using assembly-free genome-skims on real biological data and placement using aligned sequences on simulated data.
Real genome-skim datasets for the assembly-free scenario
Columbicola genome-skims
We use a set of 61 genome-skims by Boyd et al. (2017), including 45 known lice species (some represented multiple times) and 7 undescribed species. We generate lower coverage skims of 0.1Gb or 0.5Gb by randomly subsampling the reads from the sequence read archives (SRA) provided by the original publication (NCBI BioProject PRJNA296666). We use BBTools (Bushnell, 2014) to filter subsampled reads for adapters and contaminants and remove duplicated reads. Since this dataset is not assembled, the coverage of the genome-skims is unknown; Skmer estimates the coverage to be between 0.2X and 1X for 0.1Gb samples (and 5 times that coverage with 0.5Gb).
Anopheles and Drosophila datasets
We also use two insect datasets used by Sarmashghi et al. (2019): a dataset of 22 Anopheles and a dataset of 21 Drosophila genomes (Table S1), both obtained from InsectBase (Yin et al., 2016). For both datasets, genome-skims with 0.1Gb and 0.5Gb sequence were generated from the assemblies using the short-read simulator tool ART, with the read length l = 100 and default error profile. Since species have different genome sizes, with 0.1Gb data, our subsampled genome-skims range in coverage from 0.35X to 1X for Anopheles and from 0.4X to 0.8X for Drosophila.
More recently, Miller et al. (2018) sequenced several Drosophila genomes, including 12 species shared with the InsectBase dataset. Sarmashghi et al. (2019) subsampled the SRAs from this second project to 0.1Gb or 0.5Gb and, after filtering contaminants, obtained artificial genome-skims. We can use these genome-skims as query and the genome-skims from the InsectBase dataset as the backbone. Since the reference and query come from two projects, the query genome-skim can have a non-zero distance to the same species in the reference set, providing a realistic test of sample identification applications.
Simulated datasets for the aligned sequence scenario
GTR
We use a 101-taxon dataset available from Mirarab and Warnow 2015. Sequences were simulated under the General Time Reversible (GTR) plus the Γ model of site rate heterogeneity using INDELible (Fletcher and Yang, 2009) on gene trees that were simulated using SimPhy (Mallo et al., 2016) under the coalescent model evolving on species trees generated under the Yule model. Note that the same model is used for inference under ML placement methods (i.e., no model misspecification). We took all 20 replicates of this dataset with mutation rates between 5 × 10⁻⁸ and 2 × 10⁻⁷, and for each replicate, randomly selected five estimated gene trees among those with ≤20% RF distance between estimated and true gene tree. Thus, we have a total of 100 backbone trees.
RNASim
Guo et al. 2009 designed a complex model of RNA evolution that does not make the usual i.i.d. assumptions of sequence evolution. Instead, it uses models of energy of the secondary structure to simulate RNA evolution by a mutation-selection population genetics model. This model is based on an inhomogeneous stochastic process without a global substitution matrix. The model complexity of RNASim allows us to test both ML and APPLES under a substantially misspecified model. An RNASim dataset of 10⁶ sequences is available from Mirarab et al. 2015. We created several subsets of the full RNASim dataset.
i) Heterogeneous: We first randomly subsampled the full dataset to create 10 datasets of size 10⁴. Then, we chose the largest clade of size at most 250 from each replicate; this gives us 10 backbone trees of mean size 249.
ii) Varied diameter: To evaluate the impact of the evolutionary diameter (i.e., the highest distance between any two leaves in the backbone), we also created datasets with low, medium, and high diameters. We sampled the largest five clades of size at most 250 from each of the 10 replicates used for the heterogeneous dataset. Among these 50 clades, we picked the bottom, middle, and top five clades in diameter, which had diameter in [0.3, 0.4] (mean: 0.36), [0.5, 0.52] (mean: 0.51), and [0.65, 1.07] (mean: 0.82), respectively.
iii) Varied size: We randomly subsampled the tree of size 10⁶ to create 5 replicates of datasets of size 5 × 10², 10³, 5 × 10³, 10⁴, 5 × 10⁴, and 10⁵, and 1 replicate (due to size) of size 2 × 10⁵. For replicates that contain at least 5 × 10³ species, we removed sites that contain gaps in 95% or more of the sequences in the alignment.
Methods
Alternative methods
For aligned data, we compare APPLES to two ML methods: pplacer (Matsen et al., 2010) and EPA-ng (Barbera et al., 2018). Matsen et al. (2010) found pplacer to be substantially faster than EPA (Berger and Stamatakis, 2011) while their accuracy was similar. EPA-ng improves the scalability of EPA; thus, we compare to EPA-ng in analyses that concern scalability (e.g., RNASim-varied size). We run pplacer and EPA-ng in their default mode using the GTR+Γ model (the only option for pplacer). We also compare with a simple method, referred to as CLOSEST, that places the query as the sister to the species with the minimum distance to it. CLOSEST is meant to emulate the use of BLAST (if it could be used). For the assembly-free setting, existing phylogenetic placement methods cannot be used, and we compare only against CLOSEST.
Distance calculation and models
We modified FastME to compute distances only between query and backbone sequences, not among backbone sequences. This version, called FastME* here, also ensures that when estimating model parameters, positions that have a gap in at least one of the two sequences are always ignored.
We compute phylogenetic distances under the parameter-free JC69 model, the six-parameter Tamura and Nei 1993 (TN93) model, and the 12-parameter general Markov model (Lockhart et al., 1994). We compute distances independently for all pairs, and not simultaneously as suggested by Tamura et al. (2004). We also use the Gamma model of rates-across-sites heterogeneity for JC69 and TN93 using the standard approach (Waddell and Steel, 1997). Pairing Gamma with GTR is theoretically possible in the absence of noise; however, the method can run into problems on real data (Waddell and Steel, 1997). Thus, we do not include a GTR model directly. Instead, we use the log-det approach, which can handle the most general (12-parameter) Markov model (Lockhart et al., 1994); however, log-det cannot account for rates-across-sites heterogeneity (Waddell and Steel, 1997). The α parameter of the Gamma model cannot be computed from pairwise sequence comparisons (Steel, 2009); instead, we use the α computed from the backbone tree, as estimated by RAxML (Stamatakis, 2014) run on the backbone alignment and given the backbone tree.
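As an illustration, the standard Gamma variant of the JC69 correction replaces the logarithm with a power law in the shape parameter α (a sketch; in our pipeline α would come from RAxML as described above):

```python
def jc69_gamma(h, alpha):
    """JC69 distance under Gamma rates-across-sites heterogeneity with
    shape alpha: d = (3*alpha/4) * ((1 - 4h/3)**(-1/alpha) - 1).
    As alpha grows large, this approaches the plain JC69 correction."""
    return 0.75 * alpha * ((1.0 - (4.0 / 3.0) * h) ** (-1.0 / alpha) - 1.0)
```

For a Hamming distance of 0.10, strong rate heterogeneity (α = 1) yields a larger corrected distance (≈0.115) than the rate-homogeneous JC69 correction (≈0.107), reflecting extra hidden substitutions at fast-evolving sites.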
In analyses on assembly-free datasets, we first compute genomic distances using Skmer (Sarmashghi et al., 2019). We then correct these distances using the JC69 model, without the Gamma model of rate variation.
Backbone trees
For genome-skimming experiments, we estimated the backbone tree using FastME* from the JC69 distance matrix computed from genome-skims using Skmer. For simulated datasets, we estimated the topology of the backbone tree by running RAxML (Stamatakis, 2014) on the true alignment using GTRGAMMA model and used this tree as the backbone for pplacer and EPA-ng. However, to handle large trees, we used FastTree-2 (Price et al., 2010) to estimate the backbone tree for RNASim-varied size and re-estimated branch lengths on the fixed topology using RAxML. For the backbone of APPLES, we always used the same tree topology but re-estimated branch lengths using FastTree-2 under the JC69 model.
APPLES parameters
We have chosen default parameter settings for APPLES and refer to this version as APPLES*. By default, we use FM weighting, the MLSE selection criterion, enforcement of non-negative branch lengths, and JC69 distances. When not specified otherwise, these default parameters are used.
Evaluation Procedure
To evaluate the accuracy, we use a leave-one-out strategy. We remove each leaf i from the backbone tree T and place it back on the resulting tree T \ i to obtain the placement tree P. However, on the RNASim-varied size dataset, due to its large size, we only removed and added back 200 randomly chosen leaves per replicate.
Delta error
We measure the accuracy of the placement using delta error (Δe): the number of branches of the true tree missing from P minus the number of branches of the true tree missing from T \ i (induced on the same leafset). Note that Δe ≥ 0 because adding i cannot decrease the number of missing branches in T \ i. Note that placing i at the same location it occupied in the backbone before leaving it out (i.e., T) can still have a non-zero delta error because the backbone tree is not the true tree. We refer to the placement of a leaf into its position in the backbone tree as the de novo placement.
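Counting missing branches can be sketched on toy topologies encoded as nested tuples (an encoding of our own, ignoring branch lengths); Δe is then the difference between two such counts, one against P and one against T \ i:

```python
def bipartitions(tree, all_leaves):
    """Non-trivial bipartitions of an unrooted tree given as a rooted
    nested tuple of leaf names; each bipartition is canonicalized as
    the side not containing the reference leaf min(all_leaves)."""
    ref = min(all_leaves)
    bips = set()

    def below(node):
        if isinstance(node, str):
            return frozenset([node])
        leaves = frozenset().union(*(below(c) for c in node))
        side = leaves if ref not in leaves else all_leaves - leaves
        if 1 < len(side) < len(all_leaves) - 1:   # both sides >= 2
            bips.add(side)
        return leaves

    below(tree)
    return bips

def missing_branches(true_tree, est_tree, leaves):
    """Number of branches (bipartitions) of the true tree absent from
    the estimated tree on the same leafset."""
    leaves = frozenset(leaves)
    return len(bipartitions(true_tree, leaves) - bipartitions(est_tree, leaves))
```

For example, with the true topology ((A,B),(C,D),E), the estimate ((A,C),(B,D),E) misses both non-trivial branches of the true tree.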
On biological data, where the true tree is unknown, we use a reference tree (Fig. S1). For Drosophila and Anopheles, we use the tree available from the Open Tree Of Life (Hinchliff et al., 2015) as the reference. For Columbicola, we use the ML concatenation tree published by Boyd et al. (2017) as the reference.
RESULTS
Assembly-free Placement of Genome-skims
On our three biological genome-skim datasets, APPLES* successfully places the queries on the optimal position in most cases (97%, 95%, and 71% for Columbicola, Anopheles, and Drosophila, respectively) and is never off from the optimal position by more than one branch. Other versions of APPLES are less accurate than APPLES*; e.g., APPLES with ME can have up to five wrong branches (Table 1). On genome-skims, where assembly and alignment are not possible, existing placement tools cannot be used, and the only alternative is the CLOSEST method (emulating BLAST if assembly was possible).
CLOSEST finds the optimal placement only in 54% and 57% of times for Columbicola and Drosophila; moreover, it can be off from the best placement by up to seven branches for the Columbicola dataset. On the Anopheles dataset, where the reference tree is unresolved (Fig. S1), all methods perform similarly.
APPLES* is less accurate on the Drosophila dataset than other datasets. However, here, simply placing each query on its position in the backbone tree would lead to identical results (Table 1). Thus, placements by APPLES* are as good as the de novo construction, meaning that errors of APPLES* are entirely due to the differences between our backbone tree and the reference tree. Moreover, these errors are not due to low coverage; increasing the genome-skim size 5x (to 0.5Gb) does not decrease error (Table S4).
On the Drosophila dataset, we next tested a more realistic sample identification scenario using the 12 genome-skims from the second study (which thus have non-zero distances to the corresponding species in the backbone tree). As desired, APPLES* places all 12 queries from the second study as sister to the corresponding species in the reference dataset.
Alignment-based Placement
We first compare the accuracy and scalability of APPLES* to ML methods and then compare various settings of APPLES. For ML, we use pplacer (shown everywhere) and EPA-ng (shown only when we study scalability).
Comparison to Maximum Likelihood (ML)
GTR dataset
On this dataset, where it faces no model misspecification, pplacer has high accuracy. It finds the best placement in 84% of cases and is off by one edge in 15% (Fig. 2a); its mean delta error (Δe) is only 0.17 edges. APPLES* is also accurate, finding the best placement in 78% of cases and resulting in a mean Δe = 0.28 edges. Thus, even though pplacer uses ML and faces no model misspecification and APPLES* uses distances based on a simpler model, the accuracy of the two methods is within 0.1 edges on average. In contrast, CLOSEST has poor accuracy and is correct only 50% of the time, with a mean Δe of 1.0 edge.
Model misspecification
On the small RNASim data (with subsampled clades of ≈250 species), both APPLES* and pplacer face model misspecification. Here, the accuracy of APPLES* is very close to that of ML using pplacer. On the heterogeneous subset (Fig. 2b and Table 2), pplacer and APPLES* find the best placement in 88% and 86% of cases and have a mean delta error of 0.13 and 0.17 edges, respectively. Both methods are much more accurate than CLOSEST, which has a delta error of 0.87 edges on average.
Impact of diameter
When we control the tree diameter, APPLES* and pplacer remain very close in accuracy (Fig. 2c). The changes in error are small and not monotonic as the diameters change (Table 2). The accuracies of the two methods at low and high diameters are similar. The two methods are most divergent in the medium diameter case, where pplacer has its lowest error (Δe = 0.11) and APPLES* has its highest error (Δe = 0.18).
To summarize results on the small RNASim datasets with model misspecification, although APPLES* uses a parameter-free model, its accuracy is extremely close to ML using pplacer with the GTR+Γ model.
Impact of taxon sampling
The real advantage of APPLES* over pplacer becomes clear for placing on larger backbone trees (Fig. 3 and Table 3). For backbone sizes of 500 and 1000, pplacer continues to be slightly more accurate than APPLES* (mean Δe of pplacer is better than APPLES* by 0.09 and 0.23 edges, respectively). However, with backbones of 5000 leaves, pplacer fails to run on 449/1000 cases, producing infinite likelihood values (perhaps due to numerical issues), and has 41 times higher error than APPLES* on the rest (Fig. S2).
Since pplacer could not scale to 5,000 leaves, we also tested the recent method, EPA-ng (Barbera et al., 2018). On datasets with up to 1000 leaves, EPA-ng was less accurate than pplacer and close in accuracy to APPLES* (Fig. 3ab). It also failed in 800/1000 replicates of the 5000-taxon backbone but had 4% less error than APPLES* in the minority of cases where it could run (Fig. S2).
For backbone trees with at least 10⁴ leaves, pplacer and EPA-ng were not able to run, and CLOSEST is not very accurate (finding the best placement in only 59% of cases). However, APPLES* continues to be accurate for all backbone sizes. As the backbone size increases, the taxon sampling of the tree improves (recall that these trees are all random subsets of the same tree). With denser backbone trees, APPLES* has increased accuracy despite placing on larger trees (Fig. 3a, Table 3). For example, using a backbone tree of 2 × 10⁵ leaves, APPLES* is able to find the best placement of query sequences in 87% of cases, which is better than the accuracy of either APPLES* or ML tools on any backbone size. Thus, increased taxon sampling helps accuracy, but ML tools are limited in the size of the tree they can handle.
Running time and memory
As the backbone size increases, the running times of all methods increase close to linearly with the size of the backbone tree (Fig. 3c). However, APPLES is on average 15 times faster than pplacer and 12 times faster than EPA-ng on backbone trees with 5000 leaves in cases where those methods could run. Similarly, the memory of all methods increases linearly with the backbone size, but APPLES requires dramatically less memory (Fig. 3d). For example, for placing on a backbone with 5000 leaves, pplacer requires 25GB of memory and EPA-ng requires 30GB, whereas APPLES requires only 0.25GB. APPLES easily scales to a backbone of 2 × 10⁵ sequences, running in only 4 minutes and using 8GB of memory per query (including all precomputations in the dynamic programming). These numbers also include the time and memory needed to compute the distance between the query sequence and all the backbone sequences.
Comparing parameters of APPLES
We now compare different settings of APPLES. Comparing five models of sequence evolution, we see similar patterns of accuracy across all models despite their varying complexity, ranging from 0 to 12 parameters (Fig. S3). Since the JC69 model is parameter-free and results in similar accuracy to others, we have used it as the default. Next, we ask whether imposing the constraint to disallow negative branch lengths improves the accuracy. The answer depends on the optimization strategy. Forcing non-negative lengths marginally increases the accuracy for MLSE but dramatically reduces the accuracy for ME (Fig. 4). Thus, we always impose non-negative constraints on MLSE but never for ME. Likewise, our Hybrid method includes the constraint for the first MLSE step but not for the following ME step (Fig. S4).
The next parameter to choose is the weighting scheme. Among the three methods available in APPLES, the best accuracy belongs to the FM scheme, closely followed by BE (Fig. S5). The OLS scheme, which does not penalize long distances, performs substantially worse than FM and BE. Thus, the most aggressive form of weighting (FM) results in the best accuracy. Fixing the weighting scheme to FM and comparing the three optimization strategies (MLSE, ME, and Hybrid), the MLSE approach has the best accuracy (Fig. 2), finding the correct placement 84% of the time (mean error: 0.18), and ME has the lowest accuracy, finding the best placement in only 67% of cases (mean error: 0.70). The Hybrid approach is between the two (mean error: 0.34) and fails to outperform MLSE on this dataset. However, when we restrict the RNASim backbone trees to only 20 leaves, we observe that Hybrid can have the best accuracy (Fig. S6).
DISCUSSION
We introduced APPLES: a new method for adding query species onto large backbone trees using both unassembled genome-skims and aligned data. The accuracy of APPLES was very close to ML using pplacer in most settings where ML could run; the accuracy advantages of ML were particularly small for the more realistic simulation, RNASim, where both methods face model misspecification. As expected by the substantial evidence from the literature (Hillis et al., 2003; Zwickl and Hillis, 2002), improved taxon sampling increased the accuracy of placement. Thus, overall, the best accuracy on the RNASim dataset was obtained by APPLES* run on the full reference dataset. This observation motivates the use of scalable methods such as APPLES* instead of ML methods, which have to restrict their backbone to at most several thousand species. It is possible to follow up the APPLES* placements with a round of ML placement on smaller trees, but the small differences in accuracy of pplacer and APPLES* on smaller trees did not give us compelling reasons to try such hybrid approaches.
Phylogenetic insertion under the ME criterion has been previously studied for the purpose of creating a greedy minimum evolution (GME) algorithm. Desper and Gascuel (2002) designed a method that, given a tree T with n leaves, can insert a new leaf to obtain a tree with n + 1 leaves in Θ(n) time after precomputing a data structure that stores the average sequence distances between all adjacent clusters in T. The formulation of Desper and Gascuel (2002) has a subtle but consequential difference from our ME placement. Their algorithm does not compute branch lengths for the inserted sequence (e.g., x1 and x2). It is able to compute the optimal placement topology without knowing the branch lengths of the backbone tree. Instead, it relies on pairwise distances among backbone sequences (Δ), which are precomputed and saved in the aforementioned data structure. In the context of the greedy algorithm for tree inference, this data structure can be updated in Θ(n) time per iteration, which does not impact the overall running time of the algorithm. However, if we were to start with a tree of n leaves, computing this structure from scratch would still require Θ(n²) time. Thus, computing the placement for a new query would need quadratic time, unless the Θ(n²) precomputation is amortized over Ω(n) queries. Our formulation, in contrast, uses the branch lengths of the backbone tree (which is assumed fixed) and thus never uses pairwise distances among the backbone sequences. Using tree distances is what allows us to develop a linear-time algorithm.
Our comparisons between versions of APPLES answered many questions but left others to future work. For example, we observed no advantage in using models more complex than JC69+G for distance calculation. However, these results may be due to our estimation of model parameters (e.g., base compositions) separately for each pair of sequences. More complex models may perform better if we instead estimate model parameters on the backbone alignment/tree and reuse them for queries (or estimate them simultaneously among all queries and the reference sequences). Simultaneous estimation of distances has many advantages over independent estimation in the de novo case (Tamura et al., 2004; Xia, 2009); these results give us hope that using simultaneous distances inside APPLES can further improve its accuracy.
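As a concrete point of reference for the distance calculation, the JC69 estimate for a pair of aligned sequences depends only on the fraction p of mismatching sites, via d = −(3/4) ln(1 − 4p/3). A self-contained sketch of this standard estimator (not the APPLES implementation) is:

```python
import math

def jc69_distance(seq1, seq2):
    """JC69 distance between two aligned sequences.

    Sites where either sequence has a gap are ignored; p is the fraction
    of differing sites among the remaining ones. Returns infinity when
    p >= 3/4, where the JC69 correction is undefined (saturation).
    """
    pairs = [(a, b) for a, b in zip(seq1, seq2) if a != '-' and b != '-']
    p = sum(a != b for a, b in pairs) / len(pairs)
    if p >= 0.75:
        return float('inf')
    return -0.75 * math.log(1 - 4 * p / 3)
```

Because the correction uses only p, it needs no per-pair parameter estimates, which is part of why JC69 remains competitive with the richer models in our experiments.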
In the aligned case, we were unable to test other methods. LSHPlace is theoretically fast, but we could not find an implementation of it. The distance-based insertion algorithm of FastME (Desper and Gascuel, 2002) is implemented only as part of a larger greedy algorithm and is not available as a stand-alone feature for placing onto a given tree. SEPP (Mirarab et al., 2012) performs alignment and placement simultaneously (using alignment scores to help the placement); however, our goal in these experiments was to test only the placement step, not the alignment. Thus, we used true alignments in all simulation tests and left an exploration of the impact of alignment error on different methods to future work. On a related note, future work can incorporate APPLES inside SEPP to perform alignment and placement in a unified pipeline.
In our assembly-free test, we used Skmer to get distances because alternative alignment-free methods of estimating distance generally either require assemblies (e.g., Haubold, 2014; Leimeister and Morgenstern, 2014; Leimeister et al., 2017) or higher coverage than Skmer (e.g., Benoit et al., 2016; Yi and Jin, 2013; Ondov et al., 2016); however, combining APPLES with other alignment-free methods can be attempted in the future (finding the best way of computing distances without assemblies was not our focus). Moreover, the Skmer paper describes a trick that can be used to compute log-det distances from genome-skims. Future studies should test whether using that trick with GTR instead of JC69 improves accuracy.
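For intuition on k-mer-based distances, a Mash-style estimator (Ondov et al., 2016) converts the Jaccard index j of two k-mer sets into a distance; Skmer builds on this idea but additionally corrects for coverage and sequencing error. The minimal sketch below implements only the uncorrected Mash-style estimate on assembled sequences, and is not a substitute for Skmer:

```python
import math

def kmer_set(seq, k):
    """All k-mers of a sequence, as a set."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def mash_distance(seq1, seq2, k=21):
    """Mash-style distance from the Jaccard index j of the k-mer sets:
    d = -ln(2j / (1 + j)) / k (Ondov et al., 2016)."""
    s1, s2 = kmer_set(seq1, k), kmer_set(seq2, k)
    j = len(s1 & s2) / len(s1 | s2)
    if j == 0:
        return float('inf')
    return -math.log(2 * j / (1 + j)) / k
```

Distances computed this way (or by Skmer from unassembled reads) can be fed directly to APPLES in place of alignment-based distances.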
Branch lengths of our backbone trees were computed using the same distance model as the one used for computing the distance of the query to backbone species. Using consistent models for the query and for the backbone branch lengths is essential for obtaining good accuracy (see Fig. S7 for evidence). Thus, in addition to having a large backbone tree at hand, we need to ensure that branch lengths are computed using the right model. Fortunately, FastTree-2 can compute both topologies and branch lengths on large trees in a scalable fashion, without a need for quadratic time/memory computation of distance matrices (Price et al., 2010).
APPLES was an order of magnitude or more faster and less memory-hungry than the ML tools (pplacer and EPA-ng), but it still has room for improvement. The Python implementation of APPLES is not optimized and can be improved dramatically. For example, APPLES can save the precomputed values of Equations 3 and 4 for each backbone tree in a file, eliminating the need to recompute them for every run. Also, online processing of the backbone alignment can dramatically reduce the memory usage of the distance calculation. The current version uses Dendropy (Sukumaran and Holder, 2010), which is not optimized for large trees; switching to other platforms such as ETE (Huerta-Cepas et al., 2010) can improve memory usage. Future implementations of APPLES will improve speed and memory by applying such optimizations.
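To illustrate the first optimization, per-backbone values could be computed once and serialized to disk. The sketch below is a hypothetical caching helper (`load_or_compute` and `compute_fn` are illustrative names, not part of APPLES):

```python
import os
import pickle

def load_or_compute(cache_path, compute_fn):
    """Load precomputed per-backbone values (e.g., dynamic-programming
    tables analogous to Equations 3 and 4) from disk, computing and
    saving them only on the first run.

    `compute_fn` is a hypothetical stand-in for the traversal that
    actually fills the tables for a given backbone tree.
    """
    if os.path.exists(cache_path):
        with open(cache_path, 'rb') as f:
            return pickle.load(f)
    values = compute_fn()
    with open(cache_path, 'wb') as f:
        pickle.dump(values, f)
    return values
```

With such a cache keyed by backbone tree, the per-query cost reduces to the placement traversal itself.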
Acknowledgments
This work was supported by the National Science Foundation (NSF) grant IIS-1565862 and National Institutes of Health (NIH) subaward 5P30AI027767-28 to M.B. and S.M., and NSF grant NSF-1815485 to M.B., S.S., and S.M. Computations were performed on the San Diego Supercomputer Center (SDSC) through XSEDE allocations, which is supported by the NSF grant ACI-1053575.
APPENDIX
Appendix A. Proofs and derivations
Recall the following notations.
For any node u and exponents a ∈ ℤ and b ∈ ℤ+, let
For b = 0, let S’(a, u) be a shorthand for S(a, 0, u). Similarly, let R’(a, u) be a shorthand for R(a, 0, u).
Proof of Lemma 2
Proof. Recall the dynamic programming recursions of Equations 3 and 4:
Since u is not a leaf, for each leaf i ∈ g(u), there exists a v ∈ c(u) such that the directed path from u to i passes through v. Therefore every leaf i can be grouped under its corresponding v.
Similarly, given the condition u ≠ 1, for each leaf i ∉ g(u), either (1) there exists v ∈ sib(u) such that the directed path from p(u) to i passes through v, or (2) the undirected path between i and p(u) passes through p(p(u)).
Boundary conditions follow from the definitions. For a leaf u ∈ ℒ \ {1}, since duu = 0, we have S(a, b, u) = 0, and it is trivial to see that S’(a, u) = δqu^a. For the R(·, ·, ·) recursions, the boundary case happens at the unique child of the root, which we denote by 1’. By definition, since the only i ∉ g(1’) is 1, and d_{p(1’)1} = d11 = 0, we trivially have R(a, b, 1’) = 0; for b = 0, R’(a, 1’) = δq1^a.
A post-order traversal on T* can compute S(a, b, u), and a subsequent pre-order traversal can compute R(a, b, u), both in constant time and with constant memory per node. Recall that a and b are both no more than k, which is a constant. Thus, the time and memory complexity of this dynamic programming is Θ(bn), which translates to Θ(n) in the least-squares setting, where b ≤ 2.
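Although Equations 3 and 4 are not reproduced in this excerpt, the idea behind the post-order traversal can be illustrated under the assumption that S(a, b, u) sums δqi^a · d(u, i)^b over the leaves i below u: expanding (d(v, i) + l(v))^b with the binomial theorem lets a parent combine each child's values in constant time per (a, b) pair. A toy sketch comparing this recursion against direct computation (the tree, edge lengths, and δ values are hypothetical):

```python
from math import comb

# Toy backbone: node -> list of (child, edge_length); leaves have no entry.
children = {'r': [('u', 0.5), ('i3', 1.0)], 'u': [('i1', 0.2), ('i2', 0.3)]}
delta = {'i1': 0.4, 'i2': 0.7, 'i3': 1.1}  # hypothetical query distances

def leaves_below(u):
    """Map each leaf i below u to its path distance d(u, i)."""
    if u not in children:
        return {u: 0.0}                 # a leaf is at distance 0 from itself
    below = {}
    for v, l in children[u]:
        for i, d in leaves_below(v).items():
            below[i] = d + l            # d(u, i) = d(v, i) + l(v)
    return below

def S_naive(a, b, u):
    """Direct computation: sum of delta_qi^a * d(u, i)^b over leaves below u."""
    return sum(delta[i] ** a * d ** b for i, d in leaves_below(u).items())

def S_dp(a, b, u):
    """Same quantity via the post-order recursion: the binomial expansion of
    (d(v, i) + l(v))^b lets each node combine its children in O(1) per (a, b)."""
    if u not in children:
        return delta[u] ** a if b == 0 else 0.0   # leaf boundary condition
    return sum(comb(b, j) * l ** (b - j) * S_dp(a, j, v)
               for v, l in children[u] for j in range(b + 1))
```

Since b ≤ 2 in the least-squares setting, each node does a constant amount of work, giving the Θ(n) total stated above.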
Proof of Lemma 3
Recall the definitions above and Equation 2:
Proof. Equation 2 can be re-written as:
By simple rearrangement of the terms, we can rewrite Equation S1 as follows.
Note that computing Q(P (u, x1, x2)) requires only the S(·, ·, u) and R(·, ·, u) values and l(u). Thus, computing Q(P) requires only the values S(a, b, u) and R(a, b, u) for −k ≤ a ≤ 2 − k and 0 ≤ b ≤ 2.
Proof of Lemma 4
Recall definitions and recall Eq. S1:
Proof. We take the derivative of Eq. S1 with respect to x1 and set it equal to zero:
Similarly,
These two linear equations have a unique solution for the pair (x1, x2) if and only if the following matrix H has full rank:
The determinant of H is det(H) = 4R’(−k, u)S’(−k, u). Assuming that δqi > 0 for all i ∈ ℒ, both R’(−k, u) > 0 and S’(−k, u) > 0 hold; therefore, H has full rank. However, δqi = 0 for q ≠ i can be encountered on real data, especially for low divergence times, low evolutionary rates, or short sequences. In this case, APPLES is designed to place q on the pendant edge of i with x1 = 0 and x2 = l(i). If multiple leaves i satisfy δqi = 0, we pick one of them arbitrarily.
Proof of Theorem 1
Proof. First, using two traversals of the tree, we compute all the S(a, b, u) and R(a, b, u) values by Lemma 2. To find the optimal placement edge, we first optimize Q(P (u, x1, x2)) for all u ∈ V \ {1}; by Lemma 4, this requires only constant time per node after the precomputations. Then, for each node, we compute Q(P (u, x1, x2)) for the optimal x1 and x2 in constant time by Lemma 3. Thus, each node is processed in constant time, and the whole optimization requires linear time. Note that the system of equations (shown in Lemma 4) will not have a unique solution iff δqi = 0 for some i; in that case, we make q a sister to i, breaking ties arbitrarily.
Proof of Lemma 5
Proof. The eigenvalues of the Hessian matrix of Q(P (u, x1, x2)) are 2R’(−k, u) and 2S’(−k, u), which are both non-negative since δqi ≥ 0 for i ∈ ℒ. Thus, the Hessian matrix is positive semidefinite, and therefore Q(P (u, x1, x2)) is a convex function of x1 and x2.
APPENDIX B. SUPPLEMENTARY FIGURES
APPENDIX C. SUPPLEMENTARY TABLES
APPENDIX D. COMMANDS
Sampling Clades
For sampling clades of size at most 250 from a tree "tree.nwk", we used the TreeCluster package, available at https://github.com/niemasd/TreeCluster.
#!/bin/bash
python TreeCluster/TreeCluster.py -i 250 -o clusters.txt -t tree.nwk -m count_max_clade
Backbone tree estimation
When a multiple sequence alignment is available, we used the following RAxML command to compute the backbone tree for all datasets except the RNASim varied-size dataset. We used RAxML version 7.2.6.
#!/bin/bash
raxmlHPC-PTHREADS -m GTRGAMMA -p 88 -n REF -s aln_dna.phy -T 6
For the RNASim varied-size dataset, we used FastTreeMP version 2.1.10 to estimate the backbone topology. We ran FastTreeMP with the following command:
#!/bin/bash
FastTreeMP -nosupport -gtr -gamma -nt -log tree.log < aln_dna.fa > tree.nwk
For alignment-free datasets such as the Drosophila dataset, we computed the backbone tree using FastME* (based on FastME version 2.1.6.1), which is available at https://github.com/balabanmetin/FastME-personal-copy. We ran FastME* with the following command:
#!/bin/bash
fastme -i dist.mat -o tree.nwk -T 1
Note that we performed the Jukes-Cantor correction on the distance matrix "dist.mat" before running FastME*.
Backbone tree branch length re-estimation
When a multiple sequence alignment is available, we used FastME* to recompute backbone tree branch lengths for all datasets except the RNASim varied-size dataset. We ran FastME* with the following command:
#!/bin/bash
fastme -dJ -i aln_dna.phy -u RAxML_result.REF -o tree_me.nwk
For the RNASim varied-size dataset, we used RAxML version 7.2.6 to re-estimate ML-based branch lengths and used that tree for performing placements with pplacer. RAxML was run with the following command:
#!/bin/bash
raxmlHPC-PTHREADS -f e -t tree.nwk -m GTRGAMMA -s aln_dna.phy -n REF -p 1984 -T 8
For the same dataset, we used FastTree again to re-estimate minimum-evolution-based branch lengths and used that tree for performing placements with APPLES. FastTree was run with the following command:
#!/bin/bash
FastTreeMP -nosupport -nt -nome -noml -log tree.log \
    -intree tree.nwk < aln_dna.fa > tree_me.nwk
Performing placement
We performed phylogenetic placement of a query using pplacer with the following commands:
#!/bin/bash
nw_prune RAxML_result.REF query > backbone.nwk
pplacer -m GTR -s RAxML_info.REF -t backbone.nwk -o query.jplace aln_dna.fa