ABSTRACT
Placing a new species on an existing phylogeny has increasing relevance to several applications. Placement can be used to update phylogenies in a scalable fashion and can help identify unknown query samples using (meta-)barcoding, skimming, or metagenomic data. Maximum likelihood (ML) methods of phylogenetic placement exist, but these methods are not scalable to reference trees with many thousands of leaves, limiting their ability to enjoy the benefits of dense taxon sampling in modern reference libraries. They also rely on assembled sequences for the reference set and aligned sequences for the query. Thus, ML methods cannot analyze datasets where the reference consists of unassembled reads, a scenario relevant to emerging applications of genome-skimming for sample identification. We introduce APPLES, a distance-based method for phylogenetic placement. Compared to ML, APPLES is an order of magnitude faster and more memory efficient, and unlike ML, it is able to place on large backbone trees (tested for up to 200,000 leaves). We show that using dense references improves accuracy substantially, so that APPLES on dense trees is more accurate than ML on sparser trees, where ML can run. Finally, APPLES can accurately identify samples without assembled references or aligned queries using k-mer-based distances, a scenario that ML cannot handle. APPLES is publicly available at github.com/balabanmetin/apples.
Phylogenetic placement is the problem of finding the optimal position for a new query species on an existing backbone (or, reference) tree. Placement, as opposed to a de novo reconstruction of the full phylogeny, has two advantages. In some applications (discussed below), placement is all that is needed, and in terms of accuracy, it is as good as, and perhaps even better than, de novo reconstruction. Moreover, placement can be more scalable than de novo reconstruction when dealing with very large trees.
Earlier research on placement was motivated by scalability. For example, placement is used in greedy algorithms that start with an empty tree and add sequences sequentially (e.g., Felsenstein, 1981; Desper and Gascuel, 2002). Each placement requires polynomial (often linear) time with respect to the size of the backbone, and thus, these greedy algorithms are scalable (often requiring quadratic time overall). Despite computational challenges (Warnow, 2017), there has been much progress in the de novo reconstruction of ultra-large trees (e.g., thousands to millions of sequences) using both maximum likelihood (ML) (e.g., Price et al., 2010; Nguyen et al., 2015) and distance-based (e.g., Lefort et al., 2015) approaches. However, these large-scale reconstructions require significant resources. As new sequences continually become available, placement can be used to update existing trees without repeating previous computations on the full dataset.
More recently, placement has found a new application in sample identification: given one or more query sequences of unknown origins, detect the identity of the (set of) organism(s) that could have generated that sequence. These identifications can be made easily using sequence matching tools such as BLAST (Altschul et al., 1990) when the query either exactly matches or is very close to a sequence in the reference library. However, when the sequence is novel (i.e., has lower similarity to known sequences in the reference), this closest match approach is not sufficiently accurate (Koski and Golding, 2001), leading some researchers to adopt a phylogenetic approach (Sunagawa et al., 2013; Nguyen et al., 2014). Sample identification is essential to the study of mixed environmental samples, especially of the microbiome, both using 16S profiling (e.g., Gill et al., 2006; Krause et al., 2008) and metagenomics (e.g., von Mering et al., 2007). It is also relevant to barcoding (Hebert et al., 2003) and meta-barcoding (Clarke et al., 2014; Bush et al., 2017) and quantification of biodiversity (e.g., Findley et al., 2013). Driven by applications to microbiome profiling, placement tools like pplacer (Matsen et al., 2010) and EPA(-ng) (Berger et al., 2011; Barbera et al., 2018) have been developed. Researchers have also developed methods for aligning query sequences (e.g., Berger and Stamatakis, 2011; Mirarab et al., 2012) and for downstream steps (e.g., Stark et al., 2010; Matsen and Evans, 2013). These publications have made a strong case that for sample identification, placement is sufficient (i.e., de novo is not needed). Moreover, some studies (e.g., Janssen et al., 2018) have shown that when dealing with fragmentary reads typically found in microbiome samples, placement can be more accurate than de novo construction and can lead to improved associations of microbiome with clinical information.
Existing phylogenetic placement methods have focused on the ML inference of the best placement – a successful approach, which nevertheless, suffers from two shortcomings. On the one hand, ML can only be applied when the reference species are assembled into full-length sequences (e.g., an entire gene) and are aligned; however, in new applications that we will describe, assembling (and hence aligning) the reference set is not possible. On the other hand, ML, while somewhat scalable, is still computationally demanding, especially in memory usage, and cannot place on backbone trees with many thousands of leaves. As the density of the reference set substantially impacts the accuracy and resolution of placement, this inability to use ultra-large trees as backbone also limits accuracy. This limitation has motivated alternative methods using locality-sensitive hashing (Brown and Truszkowski, 2013) and divide-and-conquer (Mirarab et al., 2012).
Assembly-free and alignment-free sample identification using genome-skimming (Dodsworth, 2015) can also benefit from phylogenetic placement. A genome-skim is a shotgun sample of the genome sequenced at low coverage (e.g., 1X) – so low that assembling the nuclear genome is not possible (though, mitochondrial or plastid genomes can often be assembled). Genome-skimming promises to replace traditional marker-based barcoding of biological samples (Coissac et al., 2016), but restricting analyses to organelle genomes limits resolution. Sarmashghi et al. (2019) have recently shown that using shared k-mers, the distance between two unassembled genome-skims with low coverage can be accurately estimated. This approach, unlike assembling organelle genomes, uses data from the entire nuclear genome and hence promises to provide a higher resolution (e.g., at species or sub-species levels) while keeping sequencing costs low. However, ML and other methods that require assembled sequences cannot analyze genome-skims, where both the reference and the query species are unassembled genome-wide bags of reads.
Distance-based approaches to phylogenetics are well-studied, but no existing tool can perform distance-based placement of a query sequence on a given backbone. The distance-based approach promises to solve both shortcomings of ML methods. Distance-based methods are computationally efficient and do not require assemblies. They only need distances (however computed). Thus, they can take as input assembly-free estimates of genomic distance estimated from low coverage genome-skims using Skmer (Sarmashghi et al., 2019) or other alternatives (Haubold, 2014; Leimeister and Morgenstern, 2014; Leimeister et al., 2017; Yi and Jin, 2013; Benoit et al., 2016; Fan et al., 2015; Ondov et al., 2016; Jain et al., 2017). While alignment-based phylogenetics has been traditionally more accurate than alignment-free methods when both methods are possible, in these new scenarios, only alignment-free methods are applicable.
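For intuition on how an alignment-free distance can be computed at all, a minimal k-mer distance in the style of Mash (Ondov et al., 2016) can be sketched as below. This is a simplified stand-in for illustration only: Skmer additionally corrects for low coverage and sequencing error, which this sketch ignores.

```python
from math import log

def kmers(seq, k):
    """Set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def mash_distance(seq1, seq2, k=21):
    """Jaccard similarity j of the two k-mer sets, converted to a
    distance with the Mash formula d = -(1/k) * ln(2j / (1 + j))."""
    a, b = kmers(seq1, k), kmers(seq2, k)
    j = len(a & b) / len(a | b)
    return float('inf') if j == 0 else -log(2 * j / (1 + j)) / k
```

In a placement pipeline, such genomic distances would still be phylogenetically corrected (e.g., under JC69, as described below) before being fit to a tree.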
Here, we introduce a new method for distance-based phylogenetic placement called APPLES (Accurate Phylogenetic Placement using LEast Squares). APPLES uses dynamic programming to find the optimal distance-based placement of a sequence with running time and memory usage that scale linearly with the size of the backbone tree. We test APPLES in simulations and on real data, both for alignment-free and aligned scenarios.
MATERIALS AND METHODS
Problem Statement
Notations
Let an unrooted tree T = (V, E) be a weighted connected acyclic undirected graph with leaves denoted by ℒ = {1 … n}. We let T* be the rooting of T on a leaf 1 obtained by directing all edges away from 1. For node u ∈ V, let p(u) denote its parent, c(u) denote its set of children, sib(u) denote its siblings, and g(u) denote the set of leaves at or below u (i.e., those that have u on their path to the root), all with respect to T*. Also let l(u) denote the length of the edge (p(u), u).
Distances
The tree T defines an n × n matrix where each entry dij(T) corresponds to the path length between leaves i and j. We further generalize this definition so that duv(T*) indicates the length of the undirected path between any two nodes of T* (when clear, we simply write duv). Given some input data, we can compute a matrix of all pairwise sequence distances Δ, where the entry δij indicates the dissimilarity between species i and j. When the sequence distance δij is computed using (the correct) phylogenetic model, it will be a noisy but statistically consistent estimate of the tree distance dij(T) (Felsenstein, 2003). Given these “phylogenetically corrected” distances (e.g., δij = −¾ ln(1 − 4h/3), where h is the normalized Hamming distance, under the Jukes and Cantor (1969) model), we can define optimization problems to recover the tree that best fits the distances. A natural choice is minimizing the (weighted) least squares difference between tree and sequence distances: Q*(T) = Σ1≤i<j≤n wij (δij − dij(T))² (Eq. 1).
Here, weights (e.g., wij) are used to reduce the impact of large distances (expected to have high variance). A general weighting schema can be defined as wij = δij^−k for a constant value k ∈ ℕ. Standard choices of k include k = 0 for the ordinary least squares (OLS) method of Cavalli-Sforza and Edwards 1967, k = 1 due to Beyer et al. 1974 (BE), and k = 2 due to Fitch and Margoliash 1967 (FM).
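The distance correction and the weighting schema above can be illustrated with a minimal sketch (function names are ours, not part of APPLES):

```python
from math import log

def jc69_distance(h):
    """Phylogenetically correct a normalized Hamming distance h under
    the Jukes-Cantor (1969) model: d = -(3/4) * ln(1 - 4h/3)."""
    return -0.75 * log(1.0 - (4.0 / 3.0) * h)

def weight(delta, k):
    """General weighting schema w = delta**(-k); k = 0 gives OLS,
    k = 1 gives BE, and k = 2 gives FM."""
    return 1.0 if k == 0 else delta ** (-k)
```

For example, a Hamming distance of 0.10 corrects to roughly 0.107, and under FM weighting (k = 2) a pair at corrected distance 0.5 receives weight 4, down-weighting the squared error on longer, noisier distances.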
Finding arg minT Q*(T) is NP-Complete (Day, 1987). However, decades of research have produced heuristics like neighbor-joining (Saitou and Nei, 1987), alternative formulations like (balanced) minimum evolution (Cavalli-Sforza and Edwards, 1967; Desper and Gascuel, 2002), and several effective tools for solving the problem heuristically (e.g., FastME by Lefort et al. 2015, DAMBE by Xia 2018, and Ninja by Wheeler 2009).
Phylogenetic placement
We let P (u, x1, x2) be the tree obtained by adding a query taxon q on an edge (p(u), u), creating three edges (t, q), (p(u), t), and (t, u), with weights x1, x2, and l(u) – x2, respectively (Fig. 1). When clear, we simply write P and note that P induces T both in topology and branch length. We now define the problem.
Least Squares Phylogenetic Placement (LSPP)
Input: A backbone tree T on ℒ, a query species q, and a vector Δq* with elements δqi giving sequence distances between q and every species i ∈ ℒ;
Output: The placement tree P that adds q on T and minimizes Q(P) = Σi∈ℒ wqi (δqi − dqi(P))² (Eq. 2).
Linear Time Solution for LSPP
The number of possible placements of q is 2n – 3. Therefore, LSPP can be solved by simply iterating over all the topologies, optimizing the score for that branch, and returning the placement with the minimum least square error. A naive algorithm can accomplish this in Θ(n²) running time by optimizing Eq. 2 for each of the 2n – 3 branches. However, using dynamic programming, the optimal solution can be found in linear time.
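For a single candidate branch, Q is quadratic in (x1, x2), so the per-branch optimization in the naive algorithm reduces to a 2 × 2 system of normal equations. A sketch of that step (our own variable names; the non-negativity constraints discussed later in the text are omitted here): for a leaf i below the insertion edge, the modeled path length is x1 − x2 + (l(u) + dui), and for the remaining leaves it is x1 + x2 + dp(u)i, so each leaf contributes a residual δqi − x1 − a[i]·x2 − c[i] with a[i] = ∓1.

```python
def best_on_edge(deltas, a, c, w):
    """Minimize Q = sum_i w[i] * (deltas[i] - x1 - a[i]*x2 - c[i])**2
    over (x1, x2) via the 2x2 normal equations. a[i] = -1 and
    c[i] = l(u) + d(u, i) for leaves below the edge; a[i] = +1 and
    c[i] = d(p(u), i) otherwise."""
    A = sum(w)
    B = sum(wi * ai for wi, ai in zip(w, a))
    C = sum(wi * ai * ai for wi, ai in zip(w, a))
    y1 = sum(wi * (di - ci) for wi, di, ci in zip(w, deltas, c))
    y2 = sum(wi * ai * (di - ci) for wi, ai, di, ci in zip(w, a, deltas, c))
    det = A * C - B * B                 # positive when both sides non-empty
    x1 = (C * y1 - B * y2) / det
    x2 = (A * y2 - B * y1) / det
    q = sum(wi * (di - x1 - ai * x2 - ci) ** 2
            for wi, ai, di, ci in zip(w, a, deltas, c))
    return x1, x2, q
```

Running this for every branch and keeping the minimum q is exactly the Θ(n²) naive algorithm; the dynamic programming below avoids recomputing the sums from scratch for each branch.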
The LSPP problem can be solved with Θ(n) running time and memory.
The proof (given in Appendix A) follows easily from three lemmas that we state next. The algorithm starts with precomputing a fixed-size set of values for each node. For any node u and exponents a ∈ ℤ and b ∈ ℕ+, let S(a, b, u) = Σi∈g(u) δqi^a dui^b, and for b = 0, let S(a, 0, u) = Σi∈g(u) δqi^a. Note that S(0, 0, u) = |g(u)|. Similarly, for u ∈ V \ {1}, let R(a, b, u) = Σi∉g(u) δqi^a dp(u)i^b for b > 0 and let R(a, 0, u) = Σi∉g(u) δqi^a.
The set of all S(a, b, u) and R(a, b, u) values can be precomputed in Θ(n) time with two tree traversals using the dynamic programming given by S(a, b, u) = Σv∈c(u) Σ0≤j≤b C(b, j) l(v)^(b−j) S(a, j, v), computed in a post-order traversal (with S(a, 0, u) = δqu^a and S(a, b, u) = 0 for b > 0 when u is a leaf), and R(a, b, u) = Σ0≤j≤b C(b, j) l(p(u))^(b−j) R(a, j, p(u)) + Σv∈sib(u) Σ0≤j≤b C(b, j) l(v)^(b−j) S(a, j, v), computed in a subsequent pre-order traversal. Here, C(b, j) denotes the binomial coefficient; both recursions follow from binomial expansions of dui = l(v) + dvi and of dp(u)i.
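Because dui = l(v) + dvi for every leaf i below a child v, a binomial expansion gives a single bottom-up pass for the S values. The sketch below uses a toy nested-tuple tree encoding of our own (it is not APPLES code) and can be checked directly against the definition of S:

```python
from math import comb

# Toy encoding: a node is (edge_length, payload); payload is a leaf
# name (str) or a list of child nodes. delta maps each leaf name to
# its sequence distance from the query q.
def S_values(node, delta, a, bmax=2):
    """Return [S(a, 0, u), ..., S(a, bmax, u)] where
    S(a, b, u) = sum over leaves i below u of delta[i]**a * d(u, i)**b."""
    _, payload = node
    if isinstance(payload, str):                 # leaf: d(u, u) = 0
        return [delta[payload] ** a] + [0.0] * bmax
    S = [0.0] * (bmax + 1)
    for child in payload:
        l = child[0]
        Sv = S_values(child, delta, a, bmax)
        for b in range(bmax + 1):
            # d(u, i) = l(v) + d(v, i) below child v, so expand
            # (l + d)**b with the binomial theorem.
            for j in range(b + 1):
                S[b] += comb(b, j) * l ** (b - j) * Sv[j]
    return S
```

A symmetric pre-order pass fills in the R values for the leaves outside each subtree; together they give the Θ(n) precomputation.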
Equation 2 can be rearranged (see Eq. S2 in Appendix A) such that computing Q(P) for a given P = P (u, x1, x2) requires a constant-time computation using the S(a, b, u) and R(a, b, u) values for −k ≤ a ≤ 2 − k and 0 ≤ b ≤ 2.
Thus, after a linear time precomputation, we can compute the error for any given placement in constant time. It remains to show that for each node, the optimal placement on the branch above it (e.g., x1 and x2) can be computed in constant time.
For a fixed node u ∈ V \ {1}, if the system ∂Q(P (u, x1, x2))/∂x1 = ∂Q(P (u, x1, x2))/∂x2 = 0 has a solution, then that solution has a closed form in terms of the precomputed S and R values and hence can be computed in constant time.
Non-negative branch lengths
The solution to Equation 5 does not necessarily conform to the constraints 0 ≤ x1 and 0 ≤ x2 ≤ l(u). However, the following lemma (proof in Appendix A) allows us to easily impose the constraints by choosing optimal boundary points when the unrestricted solution falls outside the boundaries.
With respect to variables x1 and x2, Q(P (u, x1, x2)) is a convex function.
Minimum evolution
An alternative to directly using MLSE (Eq. 1) is the minimum evolution (ME) principle (Cavalli-Sforza and Edwards, 1967; Rzhetsky and Nei, 1992). Our algorithm can also optimize the ME criterion: after computing x1 and x2 by optimizing MLSE for each node u, we choose the placement with the minimum total branch length. This is equivalent to using arg minu x1, since the value of x2 does not contribute to total branch length. Other solutions for ME placement exist (Desper and Gascuel, 2002), a topic we return to in the Discussion section.
Hybrid
We have observed cases where ME is correct more often than MLSE, but when it is wrong, unlike MLSE, it has a relatively high error. This observation led us to design a hybrid approach. After computing x1 and x2 for all branches, we first select the top log2(n) edges with minimum Q(P (u, x1, x2)) values (this requires Θ(n log log n) time). Among this set of edges, we place the query on the edge satisfying the ME criterion.
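The hybrid selection step can be sketched with a bounded heap; `candidates` below is a hypothetical list pairing each branch with its already-optimized error Q and pendant length x1 (a toy illustration, not APPLES code):

```python
import heapq
from math import log2

def hybrid_pick(candidates):
    """Hybrid placement selection. candidates is a list of
    (q_error, x1, edge_id) triples, one per branch. Keep the log2(n)
    branches with smallest least-squares error Q, then apply the ME
    criterion (minimum x1, i.e., minimum added branch length) among them."""
    k = max(1, int(log2(len(candidates))))
    top = heapq.nsmallest(k, candidates)        # smallest Q first
    return min(top, key=lambda t: t[1])[2]      # minimum x1 among the top
```

`heapq.nsmallest` keeps a heap of size k = log2(n), which is what gives the Θ(n log log n) bound for this step.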
Datasets
We benchmark accuracy and scalability of APPLES in two settings: sample identification using assembly-free genome-skims on real biological data and placement using aligned sequences on simulated data.
Real genome-skim datasets for the assembly-free scenario
Columbicola genome-skims
We use a set of 61 genome-skims by Boyd et al. (2017), including 45 known lice species (some represented multiple times) and 7 undescribed species. We generate lower coverage skims of 0.1Gb or 0.5Gb by randomly subsampling the reads from the sequence read archives (SRA) provided by the original publication (NCBI BioProject PRJNA296666). We use BBTools (Bushnell, 2014) to filter subsampled reads for adapters and contaminants and remove duplicated reads. Since this dataset is not assembled, the coverage of the genome-skims is unknown; Skmer estimates the coverage to be between 0.2X and 1X for 0.1Gb samples (and 5 times that coverage with 0.5Gb).
Anopheles and Drosophila datasets
We also use two insect datasets used by Sarmashghi et al. (2019): a dataset of 22 Anopheles and a dataset of 21 Drosophila genomes (Table S1), both obtained from InsectBase (Yin et al., 2016). For both datasets, genome-skims with 0.1Gb and 0.5Gb sequence were generated from the assemblies using the short-read simulator tool ART, with the read length l = 100 and default error profile. Since species have different genome sizes, with 0.1Gb data, our subsampled genome-skims range in coverage from 0.35X to 1X for Anopheles and from 0.4X to 0.8X for Drosophila.
More recently, Miller et al. (2018) sequenced several Drosophila genomes, including 12 species shared with the InsectBase dataset. Sarmashghi et al. (2019) subsampled the SRAs from this second project to 0.1Gb or 0.5Gb and, after filtering contaminants, obtained artificial genome-skims. We can use these genome-skims as query and the genome-skims from the InsectBase dataset as the backbone. Since the reference and query come from two projects, the query genome-skim can have a non-zero distance to the same species in the reference set, providing a realistic test of sample identification applications.
Simulated datasets for the aligned sequence scenario
GTR
We use a 101-taxon dataset available from Mirarab and Warnow 2015. Sequences were simulated under the General Time Reversible (GTR) plus the Γ model of site rate heterogeneity using INDELible (Fletcher and Yang, 2009) on gene trees that were simulated using SimPhy (Mallo et al., 2016) under the coalescent model evolving on species trees generated under the Yule model. Note that the same model is used for inference under ML placement methods (i.e., no model misspecification). We took all 20 replicates of this dataset with mutation rates between 5 × 10⁻⁸ and 2 × 10⁻⁷, and for each replicate, randomly selected five estimated gene trees among those with ≤20% RF distance between estimated and true gene tree. Thus, we have a total of 100 backbone trees.
RNASim
Guo et al. 2009 designed a complex model of RNA evolution that does not make the usual i.i.d. assumptions of sequence evolution. Instead, it uses models of energy of the secondary structure to simulate RNA evolution by a mutation-selection population genetics model. This model is based on an inhomogeneous stochastic process without a global substitution matrix. The model complexity of RNASim allows us to test both ML and APPLES under a substantially misspecified model. An RNASim dataset of 10⁶ sequences is available from Mirarab et al. 2015. We created several subsets of the full RNASim dataset.
i) Heterogeneous: We first randomly subsampled the full dataset to create 10 datasets of size 10⁴. Then, we chose the largest clade of size at most 250 from each replicate; this gives us 10 backbone trees of mean size 249.
ii) Varied diameter: To evaluate the impact of the evolutionary diameter (i.e., the highest distance between any two leaves in the backbone), we also created datasets with low, medium, and high diameters. We sampled the largest five clades of size at most 250 from each of the 10 replicates used for the heterogeneous dataset. Among these 50 clades, we picked the bottom, middle, and top five clades in diameter, which had diameter in [0.3, 0.4] (mean: 0.36), [0.5, 0.52] (mean: 0.51), and [0.65, 1.07] (mean: 0.82), respectively.
iii) Varied size: We randomly subsampled the tree of size 10⁶ to create 5 replicates of datasets of size 5 × 10², 10³, 5 × 10³, 10⁴, 5 × 10⁴, and 10⁵, and 1 replicate (due to size) of size 2 × 10⁵. For replicates that contain at least 5 × 10³ species, we removed sites that contain gaps in 95% or more of the sequences in the alignment.
Methods
Alternative methods
For aligned data, we compare APPLES to two ML methods: pplacer (Matsen et al., 2010) and EPA-ng (Barbera et al., 2018). Matsen et al. (2010) found pplacer to be substantially faster than EPA (Berger and Stamatakis, 2011) while their accuracy was similar. EPA-ng improves the scalability of EPA; thus, we compare to EPA-ng in analyses that concern scalability (e.g., RNASim-varied size). We run pplacer and EPA-ng in their default mode using the GTR+Γ model (the only option for pplacer). We also compare with a simple method, referred to as CLOSEST, that places the query as the sister to the species with the minimum distance to it. CLOSEST is meant to emulate the use of BLAST (if it could be used). For the assembly-free setting, existing phylogenetic placement methods cannot be used, and we compare only against CLOSEST.
Distance calculation and models
We modified FastME to compute distances only between query and backbone sequences, not among backbone sequences. This version, called FastME* here, also ensures that when estimating model parameters, positions that have a gap in at least one of the two sequences are always ignored.
We compute phylogenetic distances under the parameter-free JC69 model, the six-parameter Tamura and Nei 1993 (TN93) model, and the 12-parameter general Markov model (Lockhart et al., 1994). We compute distances independently for all pairs, and not simultaneously as suggested by Tamura et al. (2004). We also use the Gamma model of rates-across-sites heterogeneity for JC69 and TN93 using the standard approach (Waddell and Steel, 1997). Pairing Gamma with GTR is theoretically possible in the absence of noise; however, the method can run into problems on real data (Waddell and Steel, 1997). Thus, we do not include a GTR model directly. Instead, we use the log-det approach, which can handle the most general (12-parameter) Markov model (Lockhart et al., 1994); however, log-det cannot account for rates-across-sites heterogeneity (Waddell and Steel, 1997). The α parameter of the Gamma model cannot be computed from pairwise sequence comparisons (Steel, 2009); instead, we use the α computed from the backbone tree, as estimated by RAxML (Stamatakis, 2014) run on the backbone alignment and given the backbone tree.
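As an illustration, the standard Gamma variant of the JC69 correction replaces the logarithm with a power law in the shape parameter α (a sketch; in our pipeline α would come from RAxML as described above):

```python
def jc69_gamma(h, alpha):
    """JC69 distance under Gamma rates-across-sites heterogeneity with
    shape alpha: d = (3*alpha/4) * ((1 - 4h/3)**(-1/alpha) - 1).
    As alpha grows large, this approaches the plain JC69 correction."""
    return 0.75 * alpha * ((1.0 - (4.0 / 3.0) * h) ** (-1.0 / alpha) - 1.0)
```

For a Hamming distance of 0.10, strong rate heterogeneity (α = 1) yields a larger corrected distance (≈0.115) than the rate-homogeneous JC69 correction (≈0.107), reflecting extra hidden substitutions at fast-evolving sites.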
In analyses on assembly-free datasets, we first compute genomic distances using Skmer (Sarmashghi et al., 2019). We then correct these distances using the JC69 model, without the Gamma model of rate variation.
Backbone trees
For genome-skimming experiments, we estimated the backbone tree using FastME* from the JC69 distance matrix computed from genome-skims using Skmer. For simulated datasets, we estimated the topology of the backbone tree by running RAxML (Stamatakis, 2014) on the true alignment using GTRGAMMA model and used this tree as the backbone for pplacer and EPA-ng. However, to handle large trees, we used FastTree-2 (Price et al., 2010) to estimate the backbone tree for RNASim-varied size and re-estimated branch lengths on the fixed topology using RAxML. For the backbone of APPLES, we always used the same tree topology but re-estimated branch lengths using FastTree-2 under the JC69 model.
APPLES parameters
We have chosen default parameter settings for APPLES and refer to this version as APPLES*. By default, we use FM weighting, the MLSE selection criterion, enforcement of non-negative branch lengths, and JC69 distances. When not specified otherwise, these default parameters are used.
Evaluation Procedure
To evaluate the accuracy, we use a leave-one-out strategy. We remove each leaf i from the backbone tree T and place it back on the resulting tree T \ i to obtain the placement tree P. However, on the RNASim-varied size dataset, due to its large size, we only removed and added back 200 randomly chosen leaves per replicate.
Delta error
We measure the accuracy of the placement using delta error (Δe): the number of branches of the true tree missing from P minus the number of branches of the true tree missing from T \ i (induced on the same leafset). Note that Δe ≥ 0 because adding i cannot decrease the number of missing branches in T \ i. Note that placing i at the same location it occupied in the backbone before leaving it out (i.e., T) can still have a non-zero delta error because the backbone tree is not the true tree. We refer to the placement of a leaf into its position in the backbone tree as the de novo placement.
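Counting missing branches can be sketched on toy topologies encoded as nested tuples (an encoding of our own, ignoring branch lengths); Δe is then the difference between two such counts, one against P and one against T \ i:

```python
def bipartitions(tree, all_leaves):
    """Non-trivial bipartitions of an unrooted tree given as a rooted
    nested tuple of leaf names; each bipartition is canonicalized as
    the side not containing the reference leaf min(all_leaves)."""
    ref = min(all_leaves)
    bips = set()

    def below(node):
        if isinstance(node, str):
            return frozenset([node])
        leaves = frozenset().union(*(below(c) for c in node))
        side = leaves if ref not in leaves else all_leaves - leaves
        if 1 < len(side) < len(all_leaves) - 1:   # both sides >= 2
            bips.add(side)
        return leaves

    below(tree)
    return bips

def missing_branches(true_tree, est_tree, leaves):
    """Number of branches (bipartitions) of the true tree absent from
    the estimated tree on the same leafset."""
    leaves = frozenset(leaves)
    return len(bipartitions(true_tree, leaves) - bipartitions(est_tree, leaves))
```

For example, with the true topology ((A,B),(C,D),E), the estimate ((A,C),(B,D),E) misses both non-trivial branches of the true tree.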
On biological data, where the true tree is unknown, we use a reference tree (Fig. S1). For Drosophila and Anopheles, we use the tree available from the Open Tree Of Life (Hinchliff et al., 2015) as the reference. For Columbicola, we use the ML concatenation tree published by Boyd et al. (2017) as the reference.
RESULTS
Assembly-free Placement of Genome-skims
On our three biological genome-skim datasets, APPLES* successfully places the queries on the optimal position in most cases (97%, 95%, and 71% for Columbicola, Anopheles, and Drosophila, respectively) and is never off from the optimal position by more than one branch. Other versions of APPLES are less accurate than APPLES*; e.g., APPLES with ME can have up to five wrong branches (Table 1). On genome-skims, where assembly and alignment are not possible, existing placement tools cannot be used, and the only alternative is the CLOSEST method (emulating BLAST if assembly was possible).
CLOSEST finds the optimal placement only in 54% and 57% of times for Columbicola and Drosophila; moreover, it can be off from the best placement by up to seven branches for the Columbicola dataset. On the Anopheles dataset, where the reference tree is unresolved (Fig. S1), all methods perform similarly.
APPLES* is less accurate on the Drosophila dataset than other datasets. However, here, simply placing each query on its position in the backbone tree would lead to identical results (Table 1). Thus, placements by APPLES* are as good as the de novo construction, meaning that errors of APPLES* are entirely due to the differences between our backbone tree and the reference tree. Moreover, these errors are not due to low coverage; increasing the genome-skim size 5x (to 0.5Gb) does not decrease error (Table S4).
On the Drosophila dataset, we next tested a more realistic sample identification scenario using the 12 genome-skims from the second study (which thus have non-zero distances to the corresponding species in the backbone tree). As desired, APPLES* places all 12 queries from the second study as sister to the corresponding species in the reference dataset.
Alignment-based Placement
We first compare the accuracy and scalability of APPLES* to ML methods and then compare various settings of APPLES. For ML, we use pplacer (shown everywhere) and EPA-ng (shown only when we study scalability).
Comparison to Maximum Likelihood (ML)
GTR dataset
On this dataset, where it faces no model misspecification, pplacer has high accuracy. It finds the best placement in 84% of cases and is off by one edge in 15% (Fig. 2a); its mean delta error (Δe) is only 0.17 edges. APPLES* is also accurate, finding the best placement in 78% of cases and resulting in a mean Δe = 0.28 edges. Thus, even though pplacer uses ML and faces no model misspecification and APPLES* uses distances based on a simpler model, the accuracy of the two methods is within 0.1 edges on average. In contrast, CLOSEST has poor accuracy and is correct only 50% of the time, with a mean Δe of 1.0 edge.
Model misspecification
On the small RNASim data (with subsampled clades of ≈250 species), both APPLES* and pplacer face model misspecification. Here, the accuracy of APPLES* is very close to that of ML using pplacer. On the heterogeneous subset (Fig. 2b and Table 2), pplacer and APPLES* find the best placement in 88% and 86% of cases and have a mean delta error of 0.13 and 0.17 edges, respectively. Both methods are much more accurate than CLOSEST, which has a delta error of 0.87 edges on average.
Impact of diameter
When we control the tree diameter, APPLES* and pplacer remain very close in accuracy (Fig. 2c). The changes in error are small and not monotonic as the diameters change (Table 2). The accuracies of the two methods at low and high diameters are similar. The two methods are most divergent in the medium diameter case, where pplacer has its lowest error (Δe = 0.11) and APPLES* has its highest error (Δe = 0.18).
To summarize results on the small RNASim datasets with model misspecification, although APPLES* uses a parameter-free model, its accuracy is extremely close to ML using pplacer with the GTR+Γ model.
Impact of taxon sampling
The real advantage of APPLES* over pplacer becomes clear for placing on larger backbone trees (Fig. 3 and Table 3). For backbone sizes of 500 and 1000, pplacer continues to be slightly more accurate than APPLES* (mean Δe of pplacer is better than APPLES* by 0.09 and 0.23 edges, respectively). However, with backbones of 5000 leaves, pplacer fails to run on 449/1000 cases, producing infinite likelihood values (perhaps due to numerical issues), and has 41 times higher error than APPLES* on the rest (Fig. S2).
Since pplacer could not scale to 5,000 leaves, we also tested the recent method, EPA-ng (Barbera et al., 2018). On datasets with up to 1000 leaves, EPA-ng was less accurate than pplacer and close in accuracy to APPLES* (Fig. 3ab). It also failed in 800/1000 replicates of the 5000-taxon backbone but had 4% less error than APPLES* in the minority of cases where it could run (Fig. S2).
For backbone trees with at least 10⁴ leaves, pplacer and EPA-ng were not able to run, and CLOSEST is not very accurate (finding the best placement in only 59% of cases). However, APPLES* continues to be accurate for all backbone sizes. As the backbone size increases, the taxon sampling of the tree improves (recall that these trees are all random subsets of the same tree). With denser backbone trees, APPLES* has increased accuracy despite placing on larger trees (Fig. 3a, Table 3). For example, using a backbone tree of 2 × 10⁵ leaves, APPLES* is able to find the best placement of query sequences in 87% of cases, which is better than the accuracy of either APPLES* or ML tools on any backbone size. Thus, increased taxon sampling helps accuracy, but ML tools are limited in the size of the tree they can handle.
Running time and memory
As the backbone size increases, the running times of all methods increase close to linearly with the size of the backbone tree (Fig. 3c). However, APPLES is on average 15 times faster than pplacer and 12 times faster than EPA-ng on backbone trees with 5000 leaves in cases where those methods could run. Similarly, the memory of all methods increases linearly with the backbone size, but APPLES requires dramatically less memory (Fig. 3d). For example, for placing on a backbone with 5000 leaves, pplacer requires 25GB of memory and EPA-ng requires 30GB, whereas APPLES requires only 0.25GB. APPLES easily scales to a backbone of 2 × 10⁵ sequences, running in only 4 minutes and using 8GB of memory per query (including all precomputations in the dynamic programming). These numbers also include the time and memory needed to compute the distance between the query sequence and all the backbone sequences.
Comparing parameters of APPLES
We now compare different settings of APPLES. Comparing five models of sequence evolution, we see similar patterns of accuracy across all models despite their varying complexity, ranging from 0 to 12 parameters (Fig. S3). Since the JC69 model is parameter-free and results in similar accuracy to others, we have used it as the default. Next, we ask whether imposing the constraint to disallow negative branch lengths improves the accuracy. The answer depends on the optimization strategy. Forcing non-negative lengths marginally increases the accuracy for MLSE but dramatically reduces the accuracy for ME (Fig. 4). Thus, we always impose non-negative constraints on MLSE but never for ME. Likewise, our Hybrid method includes the constraint for the first MLSE step but not for the following ME step (Fig. S4).
The next parameter to choose is the weighting scheme. Among the three methods available in APPLES, the best accuracy belongs to the FM scheme, closely followed by BE (Fig. S5). The OLS scheme, which does not penalize long distances, performs substantially worse than FM and BE. Thus, the most aggressive form of weighting (FM) results in the best accuracy. Fixing the weighting scheme to FM and comparing the three optimization strategies (MLSE, ME, and Hybrid), the MLSE approach has the best accuracy (Fig. 2), finding the correct placement 84% of the time (mean error: 0.18), and ME has the lowest accuracy, finding the best placement in only 67% of cases (mean error: 0.70). The Hybrid approach is between the two (mean error: 0.34) and fails to outperform MLSE on this dataset. However, when we restrict the RNASim backbone trees to only 20 leaves, we observe that Hybrid can have the best accuracy (Fig. S6).
DISCUSSION
We introduced APPLES: a new method for adding query species onto large backbone trees using both unassembled genome-skims and aligned data. The accuracy of APPLES was very close to ML using pplacer in most settings where ML could run; the accuracy advantages of ML were particularly small for the more realistic simulation, RNASim, where both methods face model misspecification. As expected by the substantial evidence from the literature (Hillis et al., 2003; Zwickl and Hillis, 2002), improved taxon sampling increased the accuracy of placement. Thus, overall, the best accuracy on the RNASim dataset was obtained by APPLES* run on the full reference dataset. This observation motivates the use of scalable methods such as APPLES* instead of ML methods, which have to restrict their backbone to at most several thousand species. It is possible to follow up the APPLES* placements with a round of ML placement on smaller trees, but the small differences in accuracy of pplacer and APPLES* on smaller trees did not give us compelling reasons to try such hybrid approaches.
Phylogenetic insertion under the ME criterion has been previously studied for the purpose of creating a greedy minimum evolution (GME) algorithm. Desper and Gascuel (2002) designed a method that, given a tree T with n leaves, can insert a new leaf to obtain a tree with n + 1 leaves in Θ(n) time after precomputing a data structure that stores the average sequence distances between all adjacent clusters in T. The formulation of Desper and Gascuel (2002) has a subtle but consequential difference from our ME placement. Their algorithm does not compute branch lengths for the inserted sequence (e.g., x1 and x2). It is able to compute the optimal placement topology without knowing the branch lengths of the backbone tree. Instead, it relies on pairwise distances among backbone sequences (Δ), which are precomputed and saved in the aforementioned data structure. In the context of the greedy algorithm for tree inference, this data structure can be updated in Θ(n) time per iteration, which does not impact the overall running time of the algorithm. However, if we were to start with a tree of n leaves, computing this structure from scratch would still require Θ(n²) time. Thus, computing the placement for a new query would need quadratic time, unless the Θ(n²) precomputation is amortized over Ω(n) queries. Our formulation, in contrast, uses the branch lengths of the backbone tree (which is assumed fixed) and thus never uses pairwise distances among the backbone sequences. Using tree distances is what allows us to develop a linear-time algorithm.
Our comparisons between versions of APPLES answered many questions but left others to future work. For example, we observed no advantage in using models more complex than JC69+G for distance calculation. However, these results may be due to our estimation of model parameters (e.g., base compositions) separately for each pair of sequences. More complex models may perform better if we instead estimate model parameters on the backbone alignment/tree and reuse them for queries (or estimate them simultaneously among all queries and the reference sequences). Simultaneous estimation of distances has many advantages over independent estimation in the de novo case (Tamura et al., 2004; Xia, 2009); these results give us hope that using simultaneous distances inside APPLES can further improve its accuracy.
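As a concrete point of reference for the distance calculation, the JC69 estimate for a pair of aligned sequences depends only on the fraction p of mismatching sites, via d = −(3/4) ln(1 − 4p/3). A self-contained sketch of this standard estimator (not the APPLES implementation) is:

```python
import math

def jc69_distance(seq1, seq2):
    """JC69 distance between two aligned sequences.

    Sites where either sequence has a gap are ignored; p is the fraction
    of differing sites among the remaining ones. Returns infinity when
    p >= 3/4, where the JC69 correction is undefined (saturation).
    """
    pairs = [(a, b) for a, b in zip(seq1, seq2) if a != '-' and b != '-']
    p = sum(a != b for a, b in pairs) / len(pairs)
    if p >= 0.75:
        return float('inf')
    return -0.75 * math.log(1 - 4 * p / 3)
```

Because the correction uses only p, it needs no per-pair parameter estimates, which is part of why JC69 remains competitive with the richer models in our experiments.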
In the aligned case, we were unable to test other methods. LSHPlace is theoretically fast, but we could not find an implementation of it. The distance-based insertion algorithm of FastME (Desper and Gascuel, 2002) is implemented only as part of a larger greedy algorithm and is not available as a stand-alone feature for placing onto a given tree. SEPP (Mirarab et al., 2012) performs alignment and placement simultaneously (using alignment scores to help the placement); however, our goal in these experiments was to test only the placement step, not the alignment. Thus, we used true alignments in all simulation tests and left an exploration of the impact of alignment error on different methods to future work. On a related note, future work can incorporate APPLES inside SEPP to perform alignment and placement in a unified pipeline.
In our assembly-free test, we used Skmer to get distances because alternative alignment-free methods of estimating distance generally either require assemblies (e.g., Haubold, 2014; Leimeister and Morgenstern, 2014; Leimeister et al., 2017) or higher coverage than Skmer (e.g., Benoit et al., 2016; Yi and Jin, 2013; Ondov et al., 2016); however, combining APPLES with other alignment-free methods can be attempted in the future (finding the best way of computing distances without assemblies was not our focus). Moreover, the Skmer paper describes a trick that can be used to compute log-det distances from genome-skims. Future studies should test whether using that trick with GTR instead of JC69 improves accuracy.
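For intuition on k-mer-based distances, a Mash-style estimator (Ondov et al., 2016) converts the Jaccard index j of two k-mer sets into a distance; Skmer builds on this idea but additionally corrects for coverage and sequencing error. The minimal sketch below implements only the uncorrected Mash-style estimate on assembled sequences, and is not a substitute for Skmer:

```python
import math

def kmer_set(seq, k):
    """All k-mers of a sequence, as a set."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def mash_distance(seq1, seq2, k=21):
    """Mash-style distance from the Jaccard index j of the k-mer sets:
    d = -ln(2j / (1 + j)) / k (Ondov et al., 2016)."""
    s1, s2 = kmer_set(seq1, k), kmer_set(seq2, k)
    j = len(s1 & s2) / len(s1 | s2)
    if j == 0:
        return float('inf')
    return -math.log(2 * j / (1 + j)) / k
```

Distances computed this way (or by Skmer from unassembled reads) can be fed directly to APPLES in place of alignment-based distances.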
Branch lengths of our backbone trees were computed using the same distance model as the one used for computing the distance of the query to backbone species. Using consistent models for the query and for the backbone branch lengths is essential for obtaining good accuracy (see Fig. S7 for evidence). Thus, in addition to having a large backbone tree at hand, we need to ensure that branch lengths are computed using the right model. Fortunately, FastTree-2 can compute both topologies and branch lengths on large trees in a scalable fashion, without a need for quadratic time/memory computation of distance matrices (Price et al., 2010).
APPLES was an order of magnitude or more faster and less memory-hungry than the ML tools (pplacer and EPA-ng), but it still has room for improvement. The Python implementation of APPLES is not optimized and can be improved dramatically. For example, APPLES can save the precomputed values of Equations 3 and 4 for each backbone tree in a file, eliminating the need to recompute them for every run. Also, online processing of the backbone alignment can dramatically reduce the memory usage of the distance calculation. The current version uses Dendropy (Sukumaran and Holder, 2010), which is not optimized for large trees; switching to other platforms such as ETE (Huerta-Cepas et al., 2010) can improve memory usage. Future implementations of APPLES will improve speed and memory by applying such optimizations.
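To illustrate the first optimization, per-backbone values could be computed once and serialized to disk. The sketch below is a hypothetical caching helper (`load_or_compute` and `compute_fn` are illustrative names, not part of APPLES):

```python
import os
import pickle

def load_or_compute(cache_path, compute_fn):
    """Load precomputed per-backbone values (e.g., dynamic-programming
    tables analogous to Equations 3 and 4) from disk, computing and
    saving them only on the first run.

    `compute_fn` is a hypothetical stand-in for the traversal that
    actually fills the tables for a given backbone tree.
    """
    if os.path.exists(cache_path):
        with open(cache_path, 'rb') as f:
            return pickle.load(f)
    values = compute_fn()
    with open(cache_path, 'wb') as f:
        pickle.dump(values, f)
    return values
```

With such a cache keyed by backbone tree, the per-query cost reduces to the placement traversal itself.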
Acknowledgments
This work was supported by the National Science Foundation (NSF) grant IIS-1565862 and National Institutes of Health (NIH) subaward 5P30AI027767-28 to M.B. and S.M., and NSF grant NSF-1815485 to M.B., S.S., and S.M. Computations were performed on the San Diego Supercomputer Center (SDSC) through XSEDE allocations, which is supported by the NSF grant ACI-1053575.
APPENDIX
Appendix A. Proofs and derivations
Recall the following notations.
For any node u and exponents a ∈ ℤ and b ∈ ℤ+, let
For b = 0, let S’(a, u) be a shorthand for S(a, 0, u). Similarly, let R’(a, u) be a shorthand for R(a, 0, u).
Proof of Lemma 2
Proof. Recall the dynamic programming recursions of Equations 3 and 4:
Since u is not a leaf, for each leaf i ∈ g(u), there exists a v ∈ c(u) such that the directed path from u to i passes through v. Therefore every leaf i can be grouped under its corresponding v.
Similarly, given the condition u ≠ 1, for each leaf i ∉ g(u), either (1) there exists v ∈ sib(u) such that the directed path from p(u) to i passes through v, or (2) the undirected path between i and p(u) passes through p(p(u)).
Boundary conditions follow from the definitions. For a leaf u ∈ ℒ \ {1}, since duu = 0, we have S(a, b, u) = 0, and it is trivial to see that S’(a, u) = δqu^a. For the R(·, ·, ·) recursions, the boundary case happens at the unique child of the root, which we denote by 1’. By definition, since the only i ∉ g(1’) is 1, and d_{p(1’)1} = d11 = 0, we trivially have R(a, b, 1’) = 0; for b = 0, R’(a, 1’) = δq1^a.
A post-order traversal on T* can compute S(a, b, u), and a subsequent pre-order traversal can compute R(a, b, u), both in constant time and with constant memory per node. Recall that a and b are both no more than k, which is a constant. Thus, the time and memory complexity of this dynamic programming is Θ(bn), which translates to Θ(n) in the least-squares setting, where b ≤ 2.
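Although Equations 3 and 4 are not reproduced in this excerpt, the idea behind the post-order traversal can be illustrated under the assumption that S(a, b, u) sums δqi^a · d(u, i)^b over the leaves i below u: expanding (d(v, i) + l(v))^b with the binomial theorem lets a parent combine each child's values in constant time per (a, b) pair. A toy sketch comparing this recursion against direct computation (the tree, edge lengths, and δ values are hypothetical):

```python
from math import comb

# Toy backbone: node -> list of (child, edge_length); leaves have no entry.
children = {'r': [('u', 0.5), ('i3', 1.0)], 'u': [('i1', 0.2), ('i2', 0.3)]}
delta = {'i1': 0.4, 'i2': 0.7, 'i3': 1.1}  # hypothetical query distances

def leaves_below(u):
    """Map each leaf i below u to its path distance d(u, i)."""
    if u not in children:
        return {u: 0.0}                 # a leaf is at distance 0 from itself
    below = {}
    for v, l in children[u]:
        for i, d in leaves_below(v).items():
            below[i] = d + l            # d(u, i) = d(v, i) + l(v)
    return below

def S_naive(a, b, u):
    """Direct computation: sum of delta_qi^a * d(u, i)^b over leaves below u."""
    return sum(delta[i] ** a * d ** b for i, d in leaves_below(u).items())

def S_dp(a, b, u):
    """Same quantity via the post-order recursion: the binomial expansion of
    (d(v, i) + l(v))^b lets each node combine its children in O(1) per (a, b)."""
    if u not in children:
        return delta[u] ** a if b == 0 else 0.0   # leaf boundary condition
    return sum(comb(b, j) * l ** (b - j) * S_dp(a, j, v)
               for v, l in children[u] for j in range(b + 1))
```

Since b ≤ 2 in the least-squares setting, each node does a constant amount of work, giving the Θ(n) total stated above.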
Proof of Lemma 3
Recall the definitions above and Equation 2:
Proof. Equation 2 can be re-written as:
By simple rearrangement of the terms, we can rewrite Equation S1 as follows.
Note that computing Q(P (u, x1, x2)) requires only the S(·, ·, u) and R(·, ·, u) values and l(u). Thus, computing Q(P) requires only the values S(a, b, u) and R(a, b, u) for −k ≤ a ≤ 2 − k and 0 ≤ b ≤ 2.
Proof of Lemma 4
Recall definitions and recall Eq. S1:
Proof. We take the derivative of Eq. S1 with respect to x1 and set it equal to zero:
Similarly,
These two linear equations have a unique solution for the pair (x1, x2) if and only if the following matrix H has full rank:
The determinant of H is det(H) = 4R’(−k, u)S’(−k, u). Assuming that δqi > 0 for all i ∈ ℒ, both R’(−k, u) > 0 and S’(−k, u) > 0 hold; therefore, H has full rank. However, δqi = 0 for q ≠ i can be encountered on real data, especially for low divergence times, low evolutionary rates, or short sequences. In this case, APPLES is designed to place q on the pendant edge of i with x1 = 0 and x2 = l(i). If multiple leaves i satisfy δqi = 0, we pick one of them arbitrarily.
Proof of Theorem 1
Proof. First, using two traversals of the tree, we compute all the S(a, b, u) and R(a, b, u) values by Lemma 2. To find the optimal placement edge, we first optimize Q(P (u, x1, x2)) for all u ∈ V \ {1}; by Lemma 4, this requires only constant time per node after the precomputations. Then, for each node, we compute Q(P (u, x1, x2)) for the optimal x1 and x2 in constant time by Lemma 3. Thus, each node is processed in constant time, and the whole optimization requires linear time. Note that the system of equations (shown in Lemma 4) will not have a unique solution iff δqi = 0 for some i; in that case, we make q a sister to i, breaking ties arbitrarily.
Proof of Lemma 5
Proof. The eigenvalues of the Hessian matrix of Q(P (u, x1, x2)) are 2R’(−k, u) and 2S’(−k, u), which are both non-negative since δqi ≥ 0 for i ∈ ℒ. Thus, the Hessian matrix is positive semidefinite, and therefore Q(P (u, x1, x2)) is a convex function of x1 and x2.
APPENDIX B. SUPPLEMENTARY FIGURES
APPENDIX C. SUPPLEMENTARY TABLES
APPENDIX D. COMMANDS
Sampling Clades
For sampling clades of size at most 250 from a tree "tree.nwk", we used the TreeCluster package, available at https://github.com/niemasd/TreeCluster.
#!/bin/bash
python TreeCluster/TreeCluster.py -i 250 -o clusters.txt -t tree.nwk -m count_max_clade
Backbone tree estimation
When a multiple sequence alignment is available, we used the following RAxML command to compute the backbone tree for all datasets except the RNASim varied-size dataset. We used RAxML version 7.2.6.
#!/bin/bash
raxmlHPC-PTHREADS -m GTRGAMMA -p 88 -n REF -s aln_dna.phy -T 6
For the RNASim varied-size dataset, we used FastTreeMP version 2.1.10 to estimate the backbone topology. We ran FastTreeMP with the following command:
#!/bin/bash
FastTreeMP -nosupport -gtr -gamma -nt -log tree.log < aln_dna.fa > tree.nwk
For alignment-free datasets such as the Drosophila dataset, we computed the backbone tree using FastME* (based on FastME version 2.1.6.1), which is available at https://github.com/balabanmetin/FastME-personal-copy. We ran FastME* with the following command:
#!/bin/bash
fastme -i dist.mat -o tree.nwk -T 1
Note that we performed the Jukes-Cantor correction on the distance matrix "dist.mat" before running FastME*.
Backbone tree branch length re-estimation
When a multiple sequence alignment is available, we used FastME* to recompute backbone tree branch lengths for all datasets except the RNASim varied-size dataset. We ran FastME* with the following command:
#!/bin/bash
fastme -dJ -i aln_dna.phy -u RAxML_result.REF -o tree_me.nwk
For the RNASim varied-size dataset, we used RAxML version 7.2.6 to re-estimate ML-based branch lengths and used that tree for performing placements with pplacer. RAxML was run with the following command:
#!/bin/bash
raxmlHPC-PTHREADS -f e -t tree.nwk -m GTRGAMMA -s aln_dna.phy -n REF -p 1984 -T 8
For the same dataset, we used FastTree again to re-estimate minimum-evolution-based branch lengths and used that tree for performing placements with APPLES. FastTree was run with the following command:
#!/bin/bash
FastTreeMP -nosupport -nt -nome -noml -log tree.log \
    -intree tree.nwk < aln_dna.fa > tree_me.nwk
Performing placement
We performed phylogenetic placement of a query using pplacer with the following commands:
#!/bin/bash
nw_prune RAxML_result.REF query > backbone.nwk
pplacer -m GTR -s RAxML_info.REF -t backbone.nwk -o query.jplace aln_dna.fa