1 Abstract
The exponential decrease in molecular sequencing cost generates unprecedented amounts of data. Hence, scalable methods to analyze these data are required. Phylogenetic (or Evolutionary) Placement methods identify the evolutionary provenance of anonymous sequences with respect to a given reference phylogeny. This increasingly popular method is deployed for scrutinizing metagenomic samples from environments such as water, soil, or the human gut.
Here, we present novel and, more importantly, highly scalable methods for analyzing phylogenetic placements of metagenomic samples. More specifically, we introduce methods for (a) visualizing differences between samples and their correlation with associated meta-data on the reference phylogeny, (b) clustering similar samples using a variant of the fc-means method, and (c) finding phylogenetic factors using an adaptation of the Phylofactorization method. These methods enable to interpret metagenomic data in a phylogenetic context, to find patterns in the data, and to identify branches of the phylogeny that are driving these patterns.
To demonstrate the scalability and utility of our methods, as well as to provide exemplary interpretations of our methods, we applied them to 3 publicly available datasets comprising 9782 samples with a total of approximately 168 million sequences. The results indicate that new biological insights can be attained via our methods.
2 Introduction
The availability of high-throughput DNA sequencing technologies has revolutionized biology by transforming it into an ever more data-driven and compute-intense discipline [1]. In particular, Next Generation Sequencing (NGS) [2], as well as later generations [3–6], have given rise to novel methods for studying microbial environments [7–10]. These technologies are often used in metagenomic studies to sequence organisms in water [11–13] or soil [14,15] samples, in the human microbiome [16–18], and a plethora of other environments. These studies yield a large set of short anonymous DNA sequences, so-called reads, for each sample. Reads that are obtained from specific parts of the genome are called meta-barcoding reads; most often, reads are amplified before sequencing and later de-replicated again, resulting in so-called amplicons. A typical task in metagenomic studies is to identify and classify these sequences with respect to known reference sequences, either in a taxonomic or a phylogenetic context.
Conventional methods like BLAST [19] are based on sequence similarity or identity. Such methods are fast, but only attain satisfying accuracy levels if the query sequences (e.g., the environmental reads or amplicons) are sufficiently similar to the reference sequences. Furthermore, BLAST might yield suboptimal results [20], and the best BLAST hit does often not represent the most closely related species [21].
Alternatively, so-called phylogenetic (or evolutionary) placement methods [22–24] identify query sequences based on a phylogenetic tree of reference sequences. Thereby, they incorporate information about the evolutionary history of the species under study and hence provide a more accurate means for read identification. We outline the method in more detail below, and describe the standard data analysis pipeline in S2 Text. The result of a phylogenetic placement is a mapping of the query sequences to the branches of the reference tree. Such a mapping also elucidates the evolutionary distance between the query and the reference sequences.
The information provided by phylogenetic placement represents useful biological knowledge per se, which has already been used for developing several downstream analysis methods. For example, the placement of query sequences on the branches of the reference tree can be understood as a classification of the query sequences with respect to the given phylogeny. This classification can then be summarized by means of sequence abundances [25,26], or be used to obtain taxonomic annotations [27,28]. Phylogenetic placement can also be utilized to compare sets of environmental samples with each other (for example, from different locations or points in time). Note that distinct samples from one study are typically mapped to the same underlying reference tree, thus facilitating such comparisons. For instance, existing analysis methods such as Edge PCA and Squash Clustering [29] exploit the information provided by the reference tree to visualize differences between sets of metagenomic samples, or to cluster samples based on placement similarity. We show examples of Edge PCA and Squash Clustering results later in Fig 8.
In this paper, we present novel, scalable methods to analyze and visualize phylogenetic placement data. The remainder of this article is structured as follows. First, we introduce phylogenetic placement and related terminology, and provide an overview over existing post-analysis methods for phylogenetic placements. Then, we describe several novel methods for visualizing differences between the placement data of distinct environmental samples, and for visualizing their correlation with per-sample meta-data. Next, we propose a clustering algorithm for samples that is useful for analyzing extremely large environmental studies. Furthermore, we present an adaptation of the Phylogenetic Isometric Log-Ratio (PhILR) transformation and balances [30] to phylogenetic placement data. Lastly, we introduce an adaptation of Phylofactorization [31] to placements. Phylofactorization is a method that identifies those edges in the phylogeny that characterize patterns in the data related to per-sample meta-data. In order to evaluate our methods, we apply them to three real word datasets, namely, Bacterial Vaginosis (BV) [18], Tara Oceans (TO) [11,32,33], and Human Microbiome Project (HMP) [16,17], which are introduced in more detail later. We provide exemplary interpretations of the results obtained from these datasets, compare the results to existing methods, and analyze the run-time performance of our methods.
Our methods are implemented in our gappa tool, which is freely available under GPLv3 at http://github.com/lczech/gappa. We provide an overview of the tool in S2 Text. Furthermore, scripts, data, and other tools used for the tests and generating the figures presented here are available at http://github.com/lczech/placement-methods-paper.
2.1 Phylogenetic placement
In brief, phylogenetic placement calculates the most probable insertion branches for each given query sequence (QS) on a reference tree (RT). The QSs are reads or amplicons from environmental samples. The RT and the reference sequences it represents are typically assembled by the user so that they capture the expected species diversity in the samples. Nonetheless, we recently presented an automated approach for selecting and constructing appropriate reference sequences from large sequence databases [34]. Current implementations of phylogenetic placement furthermore expect the RT to be strictly bifurcating. Prior to the placement, the QSs need to be aligned against the reference alignment of the RT by programs such as PAPARA [35,36] or HMMALIGN, which is part of the HMMER suite [37,38]. The placement is then conducted by initially inserting one QS as a new tip into a branch of the tree, then re-optimizing the branch lengths that are most affected by this insertion, and thereafter evaluating the resulting likelihood score of the tree under a given model of nucleotide substitution, such as the Generalized Time-Reversible (GTR) model [39]. The QS is then removed from the current branch and subsequently placed into all other branches of the RT.
Thus, for each branch of the tree, the process yields a so-called placement of the QS, that is, an optimized position on the branch, along with a likelihood score for the whole tree. The likelihood scores for a QS are then transformed into probabilities, which quantify the uncertainty of placing the sequence on the respective branch [40, 41]. Those probabilities are called likelihood weight ratios (LWRs). The accumulated LWR sum over all branches for a single QS is 1.0. On the one hand, sequences that have one or a few closely related reference sequences in the reference tree thus exhibit high LWRs at the respective branches, while having LWRs close to 0 on most other branches. On the other hand, if suitable references are missing from the tree, or if the chosen genetic region of the sequences has a poor phylogenetic resolution with respect to the query sequences, the LWRs can be distributed across several parts of the tree, indicating a high degree of placement uncertainty. Fig 1 shows an example depicting the placements of one QS, including the respective LWRs.
This process is repeated for every QS. Note that the placement process is conducted independently for each QS, and that the phylogenetic relationships among the QSs are not resolved. That is, for each QS, the algorithm starts calculating placements from scratch on the original RT.
In summary, the result of a phylogenetic placement analysis is a mapping of the QSs in a sample to positions on the branches of the RT. Each such position, along with the corresponding LWR, is called a placement of the QS.
2.2 Normalization
When placing multiple samples, for instance, from different geographical locations, typically, the same RT is used, to allow for comparing the phylogenetic composition of these samples. Then, each sample results in a set of placements on the RT for all QSs contained in the sample. If one intends to compare different samples based on the placement of their sequences, it is important to consider how to properly normalize the placements per sample. Normalization is required as the sample size (often also called library size), that is, the number of sequences per sample, can vary by several orders of magnitude, due to efficiency variations in the sequencing process or biases introduced by the amplification process. As a consequence, metagenomic sequence data are inherently compositional [42], which can lead to spurious statistical analyses [43–46]. This impedes statistical analyses of the data and hence needs to be considered in all analysis steps [30].
We here briefly outline common types of problems due to normalization that also affect phylogenetic placement data. Selecting an appropriate normalization strategy constitutes a common general problem in metagenomic studies. The appropriateness depends on data characteristics [47], but also on the biological question. For example, estimating indices such as the species richness are often implemented via rarefaction and rarefaction curves [48], which potentially omit a large amount of the available valid data [49]. Furthermore, the specific type of input sequence data has to be taken into account for normalization: Biases induced by the amplification process can potentially be avoided if, instead of amplicons, data based on shotgun sequencing are used, such as mitags [50]. Moreover, the sequences can be clustered prior to a phylogenetic placement analysis, for instance, by constructing operational taxonomic units (OTUs) [51–54]. Analyses using OTUs focus on species diversity instead of simple abundances. OTU clustering substantially reduces the number of sequences, and hence greatly decreases the computational cost for placement analyses. Lastly, one may completely ignore the abundances (which are called the “multiplicities” of placements) of the placed sequences, reads, or OTUs, and only be interested in their presence/absence when comparing samples.
All of the above analysis strategies are also applicable to phylogenetic placement, for instance, by placing OTUs on the RT instead of sequences. Which of these strategies is deployed, depends on the specific design of the study and the research question at hand. The common challenge is that the number of sequences per sample differs, which affects most post-analysis methods, and can lead to conflicting interpretations and irreproducible results [45,55]. For example, if inappropriately normalized, the visualization methods presented here might highlight irrelevant branches or clades of the tree.
Before introducing our methods, we therefore first explain how the necessary normalizations of sample sizes can be performed. We also describe general techniques for interpreting and working with phylogenetic placement data. These are not methods of their own, but tools necessary for later. Some of these techniques have been used before as building blocks for methods such as Edge PCA and Squash Clustering [29,56].
2.3 Edge masses
Methods that compare samples directly based on their sequences, such as the UniFrac distance [57,58], can benefit from rarefaction [47]. However, in the context of phylogenetic placement, rarefaction is not necessary. Thus, a larger amount of valid data can be kept. For a thorough comparison of UniFrac to methods based on phylogenetic placement, see [29], which furthermore applies Edge PCA and Squash Clustering on the BV dataset that we later user for evaluation.
In order to compare the placements obtained from a set of samples, it is convenient to think of the reference tree as a graph (when exploiting graph properties of the tree, we refer to the branches of the tree as edges). Then, the per-branch LWRs for a single QS can be interpreted as a distribution of mass points over the edges of the RT, including their respective placement positions on the branches, cf. Fig 1. This implies that each QS has a total accumulated mass of 1.0 on the RT. We call this the mass interpretation of the placed QSs, and henceforth use mass and LWR interchangeably. The mass of an edge or edge mass refers to the sum of the LWRs on that edge for all QSs of a sample, as shown in Fig 2(a). The edge masses can serve as a simplification of the data that summarizes a whole sample in a vector with m entries, for an RT with m edges. That is, instead of describing a sample by the placements of all its sequences on the RT, we describe it by the total per-branch mass of all its sequences. A larger example for the full RT is shown in Fig 3(a). The total mass of a sample is then the sum over all edge masses, which is identical to the number of QSs in the sample.
The key idea is to use the distribution of placement mass points over the edges of the RT to characterize a sample. This allows for normalizing samples of different size by scaling the total sample mass to unit mass 1.0. This is done by dividing the mass of each placement location by the total sum of all masses in the sample. In other words, absolute abundances are converted into relative abundances. This way, rare species, which might have been removed by rarefaction, can be kept, as they only contribute a negligible mass to the branches into which they have been placed. This approach is analogous to using proportional values for methods based on OTU count tables, that is, scaling each sample/column of the table by its sum of OTU counts [47]. Most of the methods presented here use normalized samples, that is, they use relative abundances.
As relative abundances are compositional data, certain caveats occur [43, 55, 59], which we discuss where appropriate.
When working with large numbers of QSs, the mass interpretation allows to further simplify and reduce the data: The masses on each edge of the tree can be quantized into b discrete bins, as shown in Fig 2(b). That is, each edge is divided into b intervals (or bins) of the corresponding branch length, and all mass points on that edge are accumulated into their respective nearest bin. The parameter b controls the resolution and accuracy of this approximation. In the extreme case b := 1, all masses on an edge are grouped into one single bin (which is equivalent to only considering the edge mass as described above instead of individual LWRs). This branch binning process drastically reduces the number of mass points that need to be stored and analyzed in several of the methods we present, while only inducing a negligible decrease in accuracy. As shown in S2 Table, branch binning can yield a speedup of up to 75% for post-analysis run-times without altering the results of the analysis.
The interpretation of placements as masses on the edges of the tree further allows to summarize a set of samples by annotating the RT with their (weighted) average per-edge mass distribution, as shown in Fig 2(c). This procedure is called squashing [29]. Because phylogenetic placements represent point masses on the edges of the tree, squashing can be seen as joining the (weighted) set of placements of the samples on corresponding edges of the RT. The resulting masses can then be normalized again to obtain unit mass for the resulting average tree. This tree summarizes the (sub-)set of samples it represents; a larger example of squashing is shown in Fig 5.
2.4 Edge imbalances
So far, we have only considered the per-edge masses. Often, however, it is also of interest to “summarize” the mass of an entire clade. For example, sequences of the RT that represent species or strains might not provide sufficient phylogenetic signal for properly resolving the phylogenetic placement of short sequences [60]. In these cases, the placement mass of a sequence can be spread across different edges representing the same genus or species, thus blurring analyses based on per-edge masses.
Instead, a clade-based summary can yield clearer analysis results. It can be computed by using the tree structure to appropriately transform the edge masses. Each edge splits the tree into two parts, of which only one contains the root (or top-level trifurcation) of the tree. For a given edge, its mass difference is then calculated by summing all masses in the root part of the tree and subtracting all masses in the other part, while ignoring the mass of the edge itself [29], as shown in Fig 2(d). This difference is called the imbalance of the edge [29]. It is usually normalized to represent unit total mass, as the absolute (not normalized) imbalance otherwise propagates the effects of differing sample sizes all across the tree. A larger example for the full RT is shown in Fig 3(a).
The edge imbalance relates the masses on the two sides of an edge to each other. This implicitly captures the RT topology and reveals information about its clades. Furthermore, this transformation can also reveal differences in the placement mass distribution of nearby branches of the tree. This is in contrast to the KR distance [56], which we introduce later, as the KR distance yields low values for masses that are close to each other on the tree. Note that the imbalance of a leaf edge is simply the total mass of the tree minus the mass of the edge. It thus mostly contains irrelevant information and can often be omitted.
2.5 Placement data matrices
An example that shows all edge masses and edge imbalances of a sample on the reference tree is shown in Fig 3(a). Illustrations of the different use cases for edge mass and edge imbalance metrics are shown in the Results section. These values can be summarized by two matrices, which we use for several downstream edge- and clade-related analyses, respectively. These matrices have dimensions n × m for n samples and m edges of the RT. The edge masses matrix contains the sum of placement masses per edge for all QS in a sample, while the edge imbalance matrix transforms these masses as described before. Note that these matrices can either store absolute or relative abundances, depending on whether the placement mass was normalized.
Furthermore, many studies provide meta-data for their samples, for instance, the pH value or temperature of the samples’ environment. Such meta-data features can also be summarized in a per-sample matrix, where each column corresponds to one feature. The three matrices are shown in Fig 3(b). For example, the Edge principal components analysis (Edge PCA) [29] is a method that utilizes the imbalance matrix to detect and visualize edges with a high heterogeneity of mass difference between samples. Edge PCA further allows to annotate its plots with meta-data variables, for instance, by coloring, thus establishing a connection between differences in samples and differences in their meta-data [18]. In the following, we propose several new techniques to analyze placement data and their associated meta-data.
3 Methods
In this section, we introduce novel methods for analyzing and visualizing the phylogenetic placement data of a set of environmental samples. Each such sample represents a geographical location, a body site, a point in time, etc. In the following, we represent a sample by the placement locations of its metagenomic QSs, including the respective per-branch LWRs. Furthermore, for a specific analysis, we assume the standard use case, that is, all placements were computed on the same fixed reference tree (RT) and reference alignment. We initially describe our novel methods. We then assess their application to real world data and their computational efficiency in Section Results and discussion.
3.1 Visualization
A first step in analyzing phylogenetic placement data often is to visualize them. For small samples, it is possible to mark individual placement locations on the RT, as offered for example by iTOL [61], or even to create a tree where the most probable placement per QS is attached as a new branch, as implemented in the guppy tool from the pplacer suite [22], RAxML-EPA [23,62], and our tool gappa. For larger samples, one can alternatively display the per-edge placement mass, either by adjusting the line widths of the edges according to their mass, or by using a color scale, as offered in ggtree [63], guppy, and gappa. Using per-edge line widths or colors corresponds to binning all placements on an edge into one bin. For large datasets, the per-edge masses can vary by several orders of magnitude. In these cases, it is often preferable to use a logarithmic scaling, as shown in [15].
These simple visualizations directly depict the placement masses on the tree. When visualizing the accumulated masses of multiple samples at once, it is important to chose the appropriate normalization strategy for the task at hand. For example, if samples represent different locations, one might prefer to use normalized masses, as comparing relative abundances is common for this type of data. On the other hand, if samples from the same location are combined (e.g., from different points in time, or different size fractions), it might be preferable to use absolute abundances instead, so that the total number of sequences per sample can be visualized.
The visualizations provide an overview of the species abundances over the tree. They can be regarded as a more detailed version of classic abundance pie charts. Moreover, these visualizations can be used to assess the quality of the RT. For example, placements into inner branches of the RT may indicate that appropriate reference sequences (i) have not been included or (ii) are simply not yet available.
Here, we introduce visualization methods that highlight (i) regions of the tree with a high variance in their placement distribution (called Edge Dispersion), and (ii) regions with a high correlation to meta-data features (called Edge Correlation).
3.1.1 Edge Dispersion
The Edge Dispersion is derived from the edge masses or edge imbalances matrix by calculating a measure of dispersion for each of the matrix columns, for example the standard deviation σ. Because each column corresponds to an edge, this information can be mapped back to the tree, and visualized, for instance, via color coding. This allows to examine which edges exhibit a high heterogeneity of placement masses across samples, and indicates which edges discriminate samples.
As the abundances of different species, and hence also the edge mass values, can span many orders of magnitude, it might be necessary to scale the variance logarithmically. Often, one is more interested in the branches with high placement mass. In these cases, using the standard deviation or variance is appropriate, as they also indicate the mean mass per edge. On the other hand, by calculating the per-edge Index of Dispersion [64], that is, the variance-mean-ratio σ2/μ, differences on edges with small mass also become visible. As Edge Dispersion relates placement masses from different samples to each other, the choice of the normalization strategy is important. When using normalized masses, the magnitude of dispersion values needs to be cautiously interpreted [59].
Edge Dispersion can also be calculated for edge imbalances. As edge imbalances are typically normalized to [−1.0,1.0], their dispersion can be visualized directly without any further normalization steps. An example for an Edge Dispersion visualization is shown in Fig 4(a), and discussed in Section Visualization.
3.1.2 Edge Correlation
In addition to the per-edge masses, the Edge Correlation further takes a specific meta-data feature into account, that is, a column of the meta-data matrix. The Edge Correlation is calculated as the correlation between each edge column and the feature column, for example by using the Pearson Correlation Coefficient or Spearman’s Rank Correlation Coefficient [64]. This yields a per-edge correlation of the placement masses or imbalances with the meta-data feature, and can again be visualized via a color coding of the edges. It is inexpensive to calculate and hence scales well to large datasets. As typical correlation coefficients are within [−1.0,1.0], there is again no need for further normalization. This yields a tree where edges or clades with either a high linear or monotonic correlation with the selected meta-data feature are highlighted. Fig 4(b) shows an example of this method.
In contrast to Edge PCA [29] that can use meta-data features to annotate samples in its scatter plots, our Edge Correlation method directly assesses the influence of a feature on the branches or clades of the tree. It can thus, for example, help to identify and visualize dependencies between species abundances and environmental factors such as temperature or nutrient levels. Again, the choice of the normalization strategy is important to draw meaningful conclusions. However, the correlation is not calculated between samples or sequence abundances. Hence, even when using normalized samples, the pitfalls regarding correlations of compositional data [59] do not apply here.
The method further bears some conceptual similarity to Phylofactorization [31], for which we later present an adaptation to phylogenetic placements, called Placement-Factorization. Phylofactorization also takes meta-data features into account and can thereby identify relationships between changes in environmental variables and changes in abundances in clades of the tree. It typically uses linear regression in form of a Generalized Linear Model (GLM) to assess these relationships. Note that the correlation coefficient used in our Edge Correlation can be interpreted as the slope of the regression line of the standardized values, which establishes a connection between Edge Correlation and Phylofactorization. The advantage of using correlations here instead of a GLM lies in its simplicity regarding result interpretation: Here, we are interested in whether changes in a meta-data feature are connected to an increase or decrease of abundances on branches or in clades of the tree, which is readily accessible via correlation coefficients.
However, using a GLM instead of correlation constitutes a potentially useful alternative, for example by visualizing the fit of the model with the data. This can be understood as fitting a regression line to the scatter plot of meta-data variables on one axis, and edge masses (i.e., per-branch abundances) on the other axis, and measuring how well this line predicts the data. We later discuss this idea in more detail in the context of the objective function of Placement-Factorization. Using GLMs in that way allows to assess the influence of multiple meta-data features simultaneously. For some types of meta-data, this might be advantageous, but might also hinder a clear interpretation of the relationships between different meta-data features and the abundances. We show some results of this idea as a by-product of Placement-Factorization in Fig 10 and S15 Fig, but leave a full exploration of this as a method of its own as future work.
3.2 Clustering
Given a set of metagenomic samples, one key question is how much they differ from each other. A common distance metric for microbial communities is the (weighted) UniFrac distance [57,58]. It uses the fraction of unique and shared branch lengths between phylogenetic trees to determine their difference. UniFrac has been generalized and adapted to phylogenetic placements in form of the phylogenetic Kantorovich-Rubinstein (KR) distance [29,56]; see there for a thorough comparison of UniFrac and KR distance. In other contexts, the KR distance is also called Wasserstein distance, Mallows distance, or Earth Mover’s distance [65–68]. The KR distance between two metagenomic samples is a metric that describes by at least how much the normalized mass distribution of one sample has to be moved across the RT to obtain the distribution of the other sample. The distance is symmetrical, and increases the more mass needs to be moved, and the larger the respective displacement (moving distance) is. As the two samples being compared need to have equal masses, the KR distance operates on normalized samples; that is, it compares relative abundances.
Given such a distance measure between samples, a fundamental task consists in clustering samples that are similar to each other. Standard linkage-based clustering methods like UPGMA [69–71] are solely based on the distances between samples, that is, they calculate the distances of clusters as a function of pairwise sample distances. Squash Clustering [18,29] is a method that also takes into account the intrinsic structure of phylogenetic placement data. It uses the KR distance to perform agglomerative hierarchical clustering of samples. Instead of using pairwise sample distances, however, it merges (squashes) clusters of samples by calculating their average per-edge placement mass, as described in the Introduction. A further example of the squashing of two samples is shown in Fig 5. Thus, in each step, Squash Clustering operates on the same type of data, that is, on mass distributions on the RT: In the beginning, each item considered in the clustering is one sample, while later steps of the clustering operate on the merged (squashed) samples, until only one large cluster remains that combines all samples. This results in a hierarchical clustering tree, where tips correspond to samples, and branch lengths correspond to KR distances between clusters. We show examples of this clustering tree later in Fig 8.
3.2.1 Phylogenetic k-means
The number of tips in the resulting clustering tree obtained through Squash Clustering is equal to the number n of samples that are being clustered. Thus, for datasets with more than a few hundred samples, the clustering result becomes hard to inspect and interpret visually. We propose a variant of k-means clustering [72] to address this problem, which we call Phylogenetic k-means. It uses a similar approach as Squash Clustering, but yields a predefined number of k clusters. It is hence able to work with arbitrarily large datasets. Note that we are clustering samples here, instead of sequences [73]. We discuss the choice of a reasonable value for k later.
The underlying idea is to assign each of the n samples to one of k cluster centroids, where each centroid represents the average mass distribution of all samples assigned to it. Note that all samples and centroids are of the same data type, namely, they are mass distributions on a fixed RT. It is thus possible to calculate distances between samples and centroids, and to calculate their average mass distributions, as described earlier. Our implementation follows Lloyd’s algorithm [74], as shown in Algorithm 1.
Phylogenetic k-means
1: initialize k Centroids 2: while not converged do 3: assign each Sample to nearest Centroid 4: update Centroids as mass averages of their Samples 5:return Assignments and Centroids
By default, we use the k-means++ initialization algorithm [75] to obtain a set of k initial centroids. It works by subsequent random selection of samples to be used as initial centroids, until k centroids have been selected. In each step, the probability of selecting a sample is proportional to its squared distance to the nearest already selected sample. An alternative initialization is to select samples as initial clusters entirely at random. This is however more likely to yield sub-optimal clusterings [76].
Then, each sample is assigned to its nearest centroid, using the KR distance. Lastly, the centroids are updated to represent the average mass distribution of all samples that are currently assigned to them. This iterative process alternates between improving the assignments and improving the centroids. Thus, the main difference to normal k-means is the use of phylogenetic information: Instead of euclidean distances on vectors, we use the KR distance, and instead of averaging vectors to obtain centroids, we use the average mass distribution on the tree.
The process is repeated until it converges, that is, the cluster assignments do not change any more, or until a maximum number of iterations have been executed. The second stopping criterion is added to avoid the super-polynomial worst case running time of k-means, which however almost never occurs in practice [77,78].
The result of the algorithm is an assignment of each sample to one of the k clusters. As the algorithm relies on the KR distance, it clusters samples with similar relative abundances. The cluster centroids can be visualized as trees with a mass distribution, analogous to how Squash Clustering visualizes inner nodes of the clustering tree. That is, each centroid can be represented as the average mass distribution of the samples that were assigned to it, as shown in Fig 5. This allows to inspect the centroids and thus to interpret how the samples were clustered. Examples of this are shown in S8 Fig.
The key question is how to select an appropriate k that reflects the number of “natural” clusters in the data. There exist various suggestions in the literature [79–84]; we assessed the Elbow method [79] as explained in S10 Fig, which is a straight forward method and yields reasonable results for our test datasets. Additionally, for a quantitative evaluation of the clusterings, we used the k that arose from the number of distinct labels based on the available meta-data for the data. For example, the HMP samples are labeled with 18 distinct body sites, describing where each sample was taken from, c.f. Fig 9.
3.2.2 Algorithmic improvements
In each assignment step of the algorithm, distances from all n samples to all k centroids are computed, which has a time complexity of . In order to accelerate this step, we can apply branch binning as introduced in Section Edge masses. For the BV dataset, we found that even using just 2 bins per edge does not alter the cluster assignments. Branch binning reduces the number of mass points that have to be accessed in memory during KR distance calculations; however, the costs for tree traversals remain unaltered. Thus, we observed a maximal speedup of 75% when using one bin per branch, see S2 Table for details.
Furthermore, during the execution of the algorithm, empty clusters can occur, for example, if k is greater than the number of natural clusters in the data. Although this problem did not occur in our tests, we implemented the following solution: First, find the cluster with the highest variance. Then, choose the sample of that cluster that is furthest from its centroid, and assign it to the empty cluster instead. This process is repeated if multiple empty clusters occur at once in an iteration.
3.2.3 Imbalance k-means
We further propose Imbalance k-means, which is a variant of k-means that makes use of the edge imbalance transformation. In order to quantify the difference in imbalances between two samples, we use the Euclidean distance between their imbalance vectors (that is, rows of the imbalance matrix). This is a suitable distance measure, as the imbalances implicitly capture the tree topology as well as the placement mass distributions. As a consequence, the expensive tree traversals required for Phylogenetic k-means are not necessary here. The algorithm takes the edge imbalance matrix of normalized samples as input, as shown in Fig 3(b), and performs a standard Euclidean k-means clustering following Lloyd’s algorithm.
This variant of k-means tends to find clusters that are consistent with the results of Edge PCA, as both use the same input data (imbalances) as well as the same distance measure (Euclidean). Furthermore, as the method does not need to calculate KR distances, and thus does not involve tree traversals, it is several orders of magnitude faster than Phylogenetic k-means. For example, on the HMP dataset, it runs in a few seconds, instead of several hours needed for Phylogenetic k-means; see Section Performance for details.
3.3 Phylogenetic ILR transform and Phylofactorization
The concepts and methods presented above resemble two recent approaches for analyzing phylogenetic data: the Phylogenetic Isometric Log-Ratio (PhILR) transformation and balances [30], as well as Phylogenetic Factorization (Phylofactorization) [31]. These methods use a tree inferred from the OTU sequences of the samples (instead of a fixed reference tree), and annotate the abundances of OTUs per sample on the tips of this tree (instead of placement masses on the branches). The methods use these data to draw conclusions about compositional changes in clades of the tree in different samples, as well as relationships of per-clade OTU abundances with environmental meta-data variables.
In both of these approaches, a balance between OTU abundances in two subtrees of the underlying tree is computed. This is a measure of contrast that expresses which of the two subtrees comprises more OTUs. In the PhILR transform [30], these balances are computed for the two subtrees below each inner node of a rooted binary tree, while ignoring abundances in the respective remainder of the tree. In Phylofactorization however, these balances are computed for the two subtrees that are induced by the edges of the tree. This is highly similar to the concept of edge imbalances that we introduced above. Note that despite sharing a similar name and exhibiting conceptual similarities, balances and the previously introduced edge imbalances are distinct approaches that should not be confused. We later discuss respective similarities and differences in more detail.
Furthermore, we remark that the Balance Trees method [85] employs analogous concepts by calculating the balance of nodes using the isometric log-ratio of OTU abundances. However, instead of using a phylogenetic tree, it assumes any binary partitioning of the OTUs, e. g., obtained from a UPGMA clustering of the OTUs based on a meta-data feature. These nodes thus correspond to specific meta-data values, again allowing for statements about the changes in OTU abundances that occur with changing environmental variables. As we already have a binary partitioning in form of the reference tree, we do not further consider the Balance Trees approach here.
In the following, we present adaptations of the PhILR transform (balances) and of Phylofactorization to phylogenetic placement data. The main adaptation step consists in placing masses on the branches of our (fixed) reference tree, instead of only considering masses (abundances) at the tips of the OTU tree. Here, we focus on balances that contrast the subtrees induced by edges of the tree, as used by Phylofactorization [31], because this is more natural in the context of phylogenetic placement data. The same concepts could however also be employed for subtrees below nodes, as used by the PhILR transform [30].
3.3.1 Phylogenetic ILR transform for phylogenetic placements
In the introduction, we briefly explained the inherently compositional nature of metagenomic sequence data [42]. For a thorough discussion of the implications of this, see [30]. One solution is to transform the data into an unconstrained space that is not compositional. This can, for example, be achieved via the Isometric Log-Ratio (ILR) transform [86], which, given a compositional space, creates a new coordinate system with an orthonormal basis [87]. The ILR transform requires a sequential binary partitioning of the underlying original space [88]. As suggested in [30], a bifurcating phylogenetic tree (e.g., our RT) represents such a partitioning, which also provides a meaningful way of interpreting the resulting coordinates. This so-called Phylogenetic ILR (PhILR) transform yields an ILR coordinate system that captures the evolutionary relationships of the phylogeny [30]. The resulting coordinates are called balances [86,87]. The balances obtained from an ILR transform represent the log-ratio of the geometric means of the data in the two subtrees. Hence, they can be interpreted as a contrast (log-ratio) between two aggregates (geometric means). Furthermore, due to the orthogonality of the ILR basis vectors, the balances can be used by conventional statistical tools without suffering from compositional artifacts.
In the following, we present an adaptation of the PhILR transform and balances to placement data, based on the work of [30]. See there for more details on the method and the underlying mathematical concepts, such as the connection between the ILR transform and the centered log-ratio (CLR) transform. We now describe the computation of the Phylogenetic ILR transform, along with the changes needed for phylogenetic placement data. We focus on the computation for a single sample. We assume that a fixed reference tree (RT) (a sequential binary partitioning) along with the per-branch placements of the sequences in the sample are given. The placements are represented by a vector c of size m, containing the absolute (not normalized) edge masses, where m is the number of edges in the tree. In other words, our input is a single row (one sample) of the edge masses matrix, as shown in Fig 3(b). The absolute masses are transformed into relative abundances as described in Section Edge masses: Each element of c is divided by the sum of all elements, yielding the relative masses vector x for the given sample. In compositional data analysis, this operation is known as the closure of the data [43].
The original PhILR furthermore allows to use per-taxon weights p in order to down-weigh the impact of low abundant taxa/OTUs [30,89]. In our adaptation, this weighting scheme is accordingly changed to weights p per edge of the RT. Unfortunately, the nomenclature of existing publications (namely, [29] and [30]) creates a conflict here: These weights are not to be confused with our terminology of likelihood weights and edge masses. The default case of edge weights p = (1,…, 1) represents no weighting, where each edge equally contributes to the balance, while any p ≠ (1,…, 1) is a generalized form of the ILR transform [30]. We later describe an appropriate choice of weights in Section Edge Weights. These weights are applied to the relative masses x to obtain the shifted composition y = x/p, using element-wise division [89].
In the original PhILR, balances are calculated for the two subtrees below a given node of the tree [30]. In the context of Phylofactorization, this has been generalized to balances between any two disjoint sets R and S of taxa (tips of the tree) [31]. We here build on the latter, but again change R and S to refer to disjoint sets of edges of our reference tree. We use the notation yR and pR to refer to the subsets of masses and weights of the given sample at the edges in R. Then, the balance y* between the sets R and S is computed as:
The first term is a scaling term that, for a given edge, is constant across all samples. It ensures unit length of the ILR basis elements, and uses the sums of weights in p:
The second term is the log-ratio of geometric means, where gm(yR, pR) is the weighted geometric mean of the values in yR with weights pR:
Note that if p = (1,…, 1), Eq (1) represents the original ILR transform without a weighting scheme [86], Eq (2) equals the number of edges in R and S, respectively, and Eq (3) is the standard (unweighted) geometric mean.
Balances as defined here can be computed as a measure of contrast between any disjoint sets R and S of edges. Interchanging S and R flips the sign of the balance; this is irrelevant for the subsequent steps presented here, as long as the interchange is applied consistently. When computing the balance between the edges in the two subtrees induced by some given edge e, the conceptual similarity with the previously described Edge imbalances becomes apparent: Imbalances use the difference of sums for contrasting and aggregating, while balances use the ratio of means for the same purpose. We show an example of this computation in Fig 6. Hence, balances represent a similar transformation of the placement data, that can also be used to conduct analyses, such as the Phylofactorization, as presented in the next section.
We however remark that the using (unweighted) balances in our previously presented methods, such as Edge Correlation and k-means clustering, might lead to spurious results, due to the insensitivity of the geometric mean to singular large values. That is, individual branches that accumulate a large fraction of the placement mass (sequence abundances) might only insignificantly change the geometric mean of their clade.
However, such branches are typically the interesting ones, and hence should exert more influence on the transformation, which is exactly the purpose of the taxon weighting scheme. This further implies that balances are not indifferent to splittings of reference taxa into multiple representatives (pers. comm. with A. Washburne on 2018-11-23). We discuss the implications of this in more detail in the evaluation of the method.
3.3.2 Edge Weights
The PhILR also allows for incorporating two distinct weighting schemes for the balances, one based on taxon abundances, and one based on the branch lengths of the underlying phylogeny [30]. As mentioned above, we implemented the former, while leaving the latter as future work.
We now describe how to adapt the taxon weights of [30] to our placement-based approach, that is, how an appropriate vector p of edge weights can be constructed. Originally, this weighting scheme down-weighs the influence of low abundant taxa [30], which are known to be less reliable and more variable [90]. Here, we accordingly down weigh edges with low placement mass, for the same reasons. We follow the approach of [30], and construct the edge weights by multiplicatively combining two terms:
A measure for the central tendency of the absolute edge masses, for example, their mean across all samples. This is the main component of the weight that yields low values for edges with low mass and vice versa.
A vector norm of the relative edge masses across the samples. This term additionally weighs edges by their specificity.
Our implementation allows to use the median, the arithmetic mean, and the geometric mean, as well as different ℓp-norms (such as the Manhattan, Euclidean, and maximum norm), and the Aitchison norm [88]. We follow the advice of [30], and by default use the geometric mean (with pseudo-counts added to the masses to avoid skew from edges without any placement mass) and the Euclidean norm. In that case, the weights for edge j are computed as follows:
Here, n is the number of samples, is the vector of absolute masses at edge j across all samples, and the vector of relative masses at edge j across all samples, both of length n. That is, these measures use the masses of all n samples; consequently, we use columns instead of rows of the edge masses matrix of Fig 3(b), where each column is used for the weights of the corresponding edge. The resulting edge weights p are then fixed and used across the balance computation of all samples.
3.4 Phylogenetic Factorization
Phylofactorization is a method to identify edges in a phylogenetic tree that drive patterns in the composition of microbial communities [31]. An edge constitutes a separation or split of groups of taxa into the two subtrees induced by the edge. In an evolutionary context, an edge denotes a difference in (putative) traits that may have arisen along the edge. The goal of Phylofactorization is to identify edges that are related to differences in per-sample meta-data features. To this end, it aggregates and contrasts the abundances in the subtrees (groups of taxa) induced by an edge, and evaluates how changes in environmental variables across samples are reflected in abundance changes.
The original method uses a tree inferred from the OTUs that are present in the set of samples, and iteratively identifies edges that split the tree into nested subtrees which exhibit the largest predictable differences between the taxa in these subtrees. Each such edge can be interpreted as a phylogenetic factor (or short, phylofactor) for splitting the tree: Once an edge has been selected in one iteration, its induced subtrees are then considered separately in subsequent iterations. The resulting factors are hence independent of each other, which ensures orthogonality of the factors. In other words, each factor describes a different dimension in which samples differ. Furthermore, by iteratively considering subtrees of decreasing size, nested factors can be found, which correspond to relationships within a subtree that only affect the taxa in the subtree itself. The algorithm stops after a predefined number of iterations/factors, or until a stopping criterion is met.
In a typical use case, each environmental sample is represented by its OTU abundances at the tips of the tree. Given a per-sample meta-data feature such as the pH-value, Phylofactorization can be employed to find edges where a change of the pH-value between samples predicts a change in OTU abundances in the two subtrees induced by the edge. For example, an increasing ph-value might indicate a relative increase in the OTU abundances in one subtree compared to another subtree. The resulting factorization can serve as a dimensionality-reduction mechanism, as an ordination and visualization tool, and as an inferential tool that can identify edges corresponding to changes in functional ecological traits [31].
We now present an adaptation of Phylofactorization to phylogenetic placements, which we call Placement-Factorization. We explain our adaptation following the description of the Generalized Phylogenetic Factorization (GPF) [91,92]. The GPF is a recent generalization of Phylofactorization that also allows for other types of input data than relative OTU abundances, for example, presence/absence data. It is hence suited for a wider range of community ecological data [92]. Conceptually and algorithmically, Phylofactorization, GPF, and our Placement-Factorization, work the same; we here use the mathematical notation of GPF as a scaffold to explain our adaptation. In the following, we briefly introduce the original method, outline the necessary adaptations, and explain how to use the balances obtained from (our adaptation of) the PhILR transform (as explained above) in the context of Placement-Factorization.
3.4.1 Placement-Factorization
Phylofactorization can be understood as an iterative greedy graph-partitioning algorithm for a tree T [91]. In each iteration, a winning edge e* is identified that splits the tree into two disjoint groups R and S of edges. To determine the winning edge, an objective function is maximized that expresses the intensity of the relationship between abundances and meta-data variables. We later discuss this objective function in more detail.
The input to the original Phylofactorization is an n × m data matrix X, for j = 1,…,n samples, and i = 1,…,m species (corresponding to the OTUs at the tips of the tree). The matrix can represent abundances, presence/absence data, or other data related to the species in the tree [92]. In our adaptation, we use the per-edge masses from the phylogenetic placement of the samples, as shown before in Fig 3(b). That is, instead of m species representing the tips of the tree, we use an n × m data matrix X where the m columns correspond to the edges of our reference tree (for consistency of notation, we re-use and re-purpose the index m here, and transpose X compared to the original notation). Lastly, Phylofactorization uses an n × p meta-data matrix Z for the n samples and p per-sample meta-data variables.
In analogy to the Generalized Phylofactorization [91,92], our adapted algorithm requires three functions:
An aggregation function AR = A(Xj,R, T), which aggregates (summarizes) a subset R of edges for a sample j.
A contrast function CR,S = C(AR, AS, T, e), which contrasts (compares) the aggregates of two disjoint subsets R and S of edges on the two sides of an edge e.
An objective function ω(C, Z) that evaluates a contrast for all samples in the context of the per-sample meta-data, in order to determine the winning edge.
We later discuss appropriate choices for these functions. For now, we assume that we are given functions that allow identifying edges whose induced subtrees exhibit predictable differences in the edge masses X driven by changes in the meta-data Z of different samples. The algorithm starts by considering the entire tree T as one large “subtree”. Then, in each iteration, Phylofactorization and Placement-Factorization work as follows:
For each edge e that separates disjoint groups Re and Se of edges within the subtree that contains e:
Compute the aggregates ARe = A(Xj,Re, T) and ASe = A(Xj,Se, T).
Compute their contrast Ce = C(ARe, ASe, T, e).
Compute the objective value ωe = ω(Ce, Z).
The aggregates ARe and ASe, as well as the contrast Ce are computed separately for every sample. The value ωe of the objective function then expresses the relationship of the contrasts of all samples with their respective meta-data values in Z.
Select the winning edge e* = argmaxe(ωe) that maximizes the value of the objective function.
Partition the subtree that contains e* into two disjoint subtrees, separated by e*.
Repeat until a stopping criterion is met.
This closely follows the description of the algorithm in [91,92], see there for details. The difference between the algorithms is that the groups R and S in our case consist of reference tree edges, instead of species at the tips of the OTU tree. Because of this, the aggregates of edges that lead to tip nodes are empty, meaning that we do not consider those edges as candidates in the algorithm. This is analogous to “tip edges” not having a meaningful edge imbalance, as described above. An example of the first two iterations of the algorithm is shown in Fig 7.
Each iteration further splits a subtree at the respective winning edge, so that after i iterations, i + 1 subtrees are produced. It is important to note that the winning edges of previous iterations split the tree into disjoint subtrees, and that in later iterations, the aggregates and contrasts induced by an edge are only computed within their respective subtrees. This ensures the previously mentioned orthogonality of the phylogenetic factors (winning edges), meaning that systematic dependencies between the contrasts of any two factors are eliminated, and that instead, nested relationships can be identified.
The original publication proposes a stopping criterion using a Kolmogorov-Smirnov (KS) test based on test statistics of the identified phylofactors [31,92]. Although these could be implemented for Placement-Factorization, we leave this as future work; our implementation currently runs for a given number i of iterations, and hence computes i phylofactors.
So far, we have assumed to be given the three functions required for Phylofactorization. The choice of these functions depends on the data X, the data Z, and the research question at hand. In order to be consistent and comparable with the original implementation [31], in our evaluation we used the same set of functions, namely the balances of the ILR transformation as explained above for aggregating and contrasting subtrees, and an objective function based on Generalized Linear Models (GLMs), which we explain in the following.
3.4.2 Objective Function
Phylofactorization requires an objective function ω(Ce, Z) that quantifies the relationship between Ce and Z for a given edge e, where Ce are the contrasts between the two subtrees induced by e for all samples (for example, the balances), and Z are the per-sample meta-data variables. That is, both Ce and Z have size n, the number of samples, with Z potentially containing multiple columns (one for each meta-data feature). In order to identify the winning edge e* of an iteration (the phylofactor), the function is evaluated for all edges, and the edge maximizing ω is selected. The choice of the objective function depends on the research question at hand; see [31] and [91] for a thorough discussion.
Our implementation is as general as the original Phylofactorization [31], in that it allows for an arbitrary objective function. For simplicity, and in line with the original publication, we here focus on functions that treat the meta-data variables Z as independent variables and the contrasts Ce as dependent variables whose relationship with Z is assessed, for instance, via a predictive model. Then, the selected phylogenetic factors correspond to edges where a change in Z most strongly predicts a change in Ce across the samples, that is, where the effect of the (independent) meta-data variables on the (dependent) underlying data (e.g., per-clade abundances) is most pronounced.
A powerful approach is to model the relationship between Ce and Z via linear regression, that is, we assess how well Z can predict Ce. In the simple one-dimensional case, this can be thought of as fitting a line through a scatter plot of the meta-data feature on the x-axis and the contrasts on the y-axis, where each point represents one sample. This concept is generalized via Generalized Linear Model (GLM) [93–95].
In short, GLMs allow to predict a single (response) variable using multiple input (explanatory) variables. Typically, the response variable is assumed to follow any distribution from the exponential family (normal, exponential, Poisson, Binomial, etc), which is given for balances as used here. In contrast to this, the explanatory variables (the meta-data features) are assumed to have a linear relationship with the response. Note that this mathematical restriction of the model does not mean that only meta-data features can be used that behave linearly; transformations and interactions of the features basically allow for arbitrary types of data. For example, categorical variables such as the body site where a sample was taken from can be transformed into so-called dummy variables that fulfill the requirements.
Once the model parameters of the GLM have been estimated, that is, once it has been fit to the data via some optimization algorithm, we need to evaluate the GLM for the purposes of Phylofactorization. We are interested in a value for ω that expresses how well the meta-data variables explain the balances. To this end, Phylofactorization and our adaptation thereof use the difference between the null deviance of the balances and the deviance obtained from the GLM. This difference expresses how much better the model explains the balances than just predicting them from their mean. For details on the usage of GLMs for Phylofactorization, see [31].
Predictive models such as GLMs expect the response variable (that is, the predicted values; here, the contrasts) to have certain statistical properties. In particular, linear models assume the deviation of response from the predicted value to be normally distributed. The ILR transform for compositional data has been proven to behave asymptotically normal [86,96], which allows their application within standard multivariate methods, and within GLMs as presented here.
Lastly, we note that depending on the research question, other objective functions can be used, see [31,91] for some examples. For instance, simple test statistics such as the variation in Ce explained by regression on Z can be used. Furthermore, instead of predicting contrasts from meta-data, one could be interested in the opposite, that is, predicting a meta-data variable given the per-sample contrasts. In this case, the maximization of the objective function yields edges that best predict a certain feature of the data; this is suitable for identifying clades that can serve as a bio-indicator. Using GLMs for this allows to model any type of meta-data variable; for example, the binary information encoded in presence/absence data can be predicted using logistic regression. While our implementation supports all those use cases, they have been explored and discussed before [31,92]. For the sake of simplicity, we focus on linear (gaussian) modeling of Ce ~ Z, that is, predicting balances from meta-data.
3.4.3 Phylofactorization for phylogenetic placements
In summary, Phylofactorization and our adaptation Placement-Factorization identify edges of the phylogeny that exhibit a predictable relationship between changes in meta-data variables and abundance changes in the subtrees induced by these edges. Our adaptation can be understood as a generalization of the original method [31,92], where masses/counts can be placed along the edges of the tree, instead of just at its tips.
While the original method uses abundances of taxa/OTUs per sample on a tree inferred from the OTU sequences, we use the placement masses on a fixed reference tree (RT). For many use cases, this has several advantages: The RT can be inferred from reference sequences that are longer than typical metagenomic reads used for OTU-based analysis, such as the 16S or 18S regions of the genome; hence, phylogenetic inference will be more reliable. Furthermore, the size of the RT can be chosen as needed, for example via our Phylogenetic Automatic (Reference) Tree (PhAT) method [34], instead of having to use the number of OTUs that result from the clustering and preprocessing steps. This also eliminates the need for the (mostly arbitrary) OTU cutoff step that is common to many metagenomic analyses, where OTUs with low abundance or low spread across samples are filtered out in order to keep the number of OTUs manageable. That is, with our approach, all sequences in a dataset can be placed and analyzed. Another advantage of a fixed RT is the availability of taxonomic annotation for the reference sequences. Often, in metabarcoding studies, the environmental sequences are anonymous and might not be closely related to any known species [11,15,32], which can hinder common taxonomic assignment methods [21].
Placing the sequences onto an RT with known taxonomic labels allows to easily interpret results within a given taxonomic framework. Using a taxonomically constrained RT can further improve interpretability. Lastly, using a fixed reference tree better allows to conduct cross- or meta-studies that compare samples from different sources, or to easily run analyses for samples that were added to the dataset later on. Using a fixed tree means that the context of interpretation remains unaltered. This is not easily possible with trees inferred from OTUs, as those change depending on the input sequences.
For further details on Phylofactorization, in particular the mathematical properties of the method, we refer to [92], which also covers different objective functions, elaborates on stopping criteria, and compares the method to other phylogenetic methods for analyzing ecological data. Compared to other tools and methods that use the phylogeny as a guide for analyzing microbial data, both, the original Phylofactorization as well as our adaptation allow for a direct interpretation of the results in terms of the edges of the tree, while avoiding nested dependencies between overlapping subtrees and circumventing issues associated with the compositional nature of the data.
3.5 Methods summary
We presented several novel methods for analyzing phylogenetic placement data. The methods are complementary, as each of them can identify different aspects and patterns in the underlying metagenomic samples. They are hence best used in combination, thereby enabling a thorough and comprehensive analysis and interpretation of the results. We also presented two transformations of phylogenetic placements (imbalances and balances) that shift the focus of the methods from a per-branch view of the data to a per-clade view.
In the following evaluation, we exemplify how the results obtained from the distinct methods can be interpreted in light of each other. Furthermore, we discuss strengths, weaknesses, appropriateness, and limitations of the methods and transformations.
4 Results and discussion
We used three real world datasets to evaluate our methods:
Bacterial Vaginosis (BV) [18]. This small dataset was already analyzed with phylogenetic placement in the original publication. We used it as an example of an established study to compare our results to. It has 220 samples with a total of 15 060 unique sequences. See also [29] for a detailed analysis of this dataset which compares standard methods such as UniFrac [57,58] to methods like Edge PCA and Squash Clustering, which are based on phylogenetic placement.
Tara Oceans (TO) [11,32,33]. This world-wide sequencing effort of the open oceans provides a rich set of meta-data, such as geographic location, temperature, and salinity. Unfortunately, the sample analysis for creating the official data repository is still ongoing. We thus were only able to use 370 samples with 27 697 007 unique sequences.
Human Microbiome Project (HMP) [16,17]. This large data repository intends to characterize the human microbiota. It contains 9192 samples from different body sites with a total of 63 221 538 unique sequences. There is additional meta-data such as age and medical history, which is available upon special request. We only used the publicly available meta-data. See S1 Table for an overview of the dataset.
Details of the datasets (download links, data statistics, data preprocessing, etc.) are provided in S1 Text. At the time of writing, about one year after we initially downloaded the data, the TO repository has grown to 1170 samples, while the HMP even published a second phase and now comprises 23 666 samples of the 16S region. This further emphasizes the need for scalable data analysis methods.
Our test datasets represent a wide range of environments, number of samples, and sequence lengths. We use them to evaluate our methods and to exemplify which method is applicable to what kind of data. To this end, the sequences of the datasets were placed on appropriate phylogenetic RTs as explained in S1 Text, in order to obtain phylogenetic placements that our methods can be applied to. In the following, we present the respective results, and also compare our methods to other methods where applicable. As the amount and type of available meta-data differs for each dataset, we could not apply all methods to all datasets. Lastly, we also report the run-time performance of our methods on these data.
4.1 Visualization
4.1.1 BV dataset
We re-analyzed the Bacterial Vaginosis (BV) dataset by inferring a tree from the original reference sequence set and conducting phylogenetic placement of the 220 samples. Bacterial Vaginosis is a disease of the vagina that manifests itself in form of an abnormal vaginal microbiome [18]. The characteristics of this dataset were already thoroughly explored in [18] and [29]. We use it here to give exemplary interpretations of our Edge Dispersion and Edge Correlation methods, and to evaluate them in comparison to existing methods. An overview of the placement result of this dataset is given in S1 Fig, where we show a tree and a heat map that indicate abundances/placement masses per edge and per sample.
Fig 4 shows our novel visualizations of the BV dataset. Edge Dispersion is shown in Fig 4(a), while Fig 4(b) shows Edge Correlation with the so-called Nugent score. The Nugent score [97] is a clinical standard for the diagnosis of Bacterial Vaginosis, ranging from 0 (healthy) to 10 (severe illness).
The connection between the Nugent score and the abundance of placements on particular edges was already explored in [29], but only visualized indirectly (i.e., not on the RT itself). For example, Figure 6 of the original study plots the first two Edge PCA components colorized by the Nugent score. We recalculated this figure for comparison in S7(i) Fig. In contrast, our Edge Correlation measure directly reveals the connection between Nugent score and placements on the reference tree: The clade on the left hand side of the tree in Fig 4(b), to which the red and orange branches lead to, are Lactobacillus iners and Lactobacillus crispatus, respectively, which were identified in [18] to be associated with a healthy vaginal microbiome. Thus, their presence in a sample is anti-correlated with the Nugent score, which is lower for healthy subjects. The branches leading to this clade are hence colored in red. On the other hand, there is a multitude of different other clades that exhibit a positive correlation with the Nugent score, that is, were green and blue paths lead to in the figure, again a finding already reported in [18].
Both trees in Fig 4 highlight the same parts of the tree: The dark branches with high deviation in Fig 4(a) represent clades attached to either highly correlated (blue) or anti-correlated (red) paths Fig 4(b). This indicates that edges that have a high dispersion also vary between samples of different Nugent score.
We further compared our methods to the visualization of Edge PCA components on the reference tree. To this end, we recalculated Figures 4 and 5 of [29], and visualized them with our color scheme in S5 Fig for ease of comparison. They show the first two components of Edge PCA, mapped back to the RT. The first component reveals that the Lactobacillus clade represents the axis with the highest heterogeneity across samples, while the second component further distinguishes between the two aforementioned clades within Lactobacillus. Edge Correlation also highlights the Lactobacillus clade as shown in Fig 4(b), but does not distinguish further between its sub-clades. This is because a high Nugent score is associated with a high abundance of placements in either of the two relevant Lactobacillus clades.
Further examples of variants of Edge Dispersion and Edge Correlation on this dataset are shown in S2 Fig and S3 Fig. We also conducted Edge Correlation using Amsel’s criteria [98] and the vaginal pH value as shown in S4 Fig, both of which were used in [18] as additional indicators of Bacterial Vaginosis. We again found similar correlations compared to the Nugent score.
4.1.2 Tara Oceans dataset
We analyzed the Tara Oceans (TO) dataset to provide further exemplary use cases for our visualization methods. To this end, we used the unconstrained Eukaryota RT with 2059 taxa as provided by our Phylogenetic Automatic (Reference) Tree (PhAT) method [34]. The meta-data features of this dataset that best lend themselves to our methods are the sensor values for chlorophyll, nitrate, and oxygen concentration, as well as the salinity and temperature of the water samples. Other available meta-data features such as longitude and latitude are available, but would require more involved methods. This is because geographical coordinates yield pairwise distances between samples, whose integration into our correlation analysis methods is challenging. The Edge Correlation of the 370 samples with the nitrate concentration, the salinity, the chlorophyll concentration, and the water temperature are shown in S6 Fig.
We selected the diatoms and the animals as two exemplary clades for closer examination of the results. In particular, the diatoms show a high correlation with the nitrate concentration, as well as an anti-correlation with salinity, which represent well-known relationships [99,100]. See S6 Fig for details. These findings indicate that the method is able to identify known relationships. It will therefore also be useful to investigate or discover novel relationships between sequence abundances and environmental parameters.
4.1.3 Performance
Both methods (Edge Dispersion and Edge Correlation) are computationally inexpensive, and thus applicable to large datasets. The calculation of the above visualizations took about 30 s each, which were mainly required for reading in the data. Furthermore, in order to scale to large datasets, we reimplemented Edge PCA, which was originally implemented as a command in the guppy program [22]. For the BV dataset with 220 samples, guppy required 9 min and used 2.2 GB of memory, while our implementation only required 33 s on a single core, using less than 600 MB of main memory. For the HMP dataset, as it is only single-threaded, guppy took 11 days and 75.1 GB memory, while our implementation needed 7.5 min on 16 cores and used 43.5 GB memory.
4.2 Clustering
We here evaluate our Phylogenetic k-means clustering (which uses edge masses and KR distances) and Imbalance k-means clustering (which uses edge imbalances and euclidean distances) methods in terms of their clustering accuracy. We used the BV as an example of a small dataset to which methods such as Squash Clustering [29] are still applicable, and the HMP dataset to showcase that our methods scale to datasets that are too large for existing methods.
4.2.1 BV dataset
We again use the re-analyzed BV dataset to test whether our methods work as expected, by comparing them to the existing analysis of the data in [18] and [29]. To this end, we ran both Phylogenetic k-means and Imbalance k-means on the BV dataset. We chose k := 3, inspired by the findings of [18]. They distinguish between subjects affected by Bacterial Vaginosis and healthy subjects, and further separate the healthy ones into two categories depending on the dominating clade in the vaginal microbiome, which is either Lactobacillus iners or Lactobacillus crispatus. Any choice of k > 3 would simply result in smaller, more fine-grained clusters, but not change the general findings of these experiments. An evaluation of the number of clusters using the Elbow method is shown in S10 Fig. We furthermore conducted Squash Clustering and Edge PCA on the dataset, thereby reproducing previous results, in order to allow for a direct comparison between the methods, see Fig 8. Also, see [29] for a detailed interpretation of the results of Squash Clustering and Edge PCA on this dataset. The figure shows the results of Squash Clustering, Edge PCA, and two alternative dimensionality reduction methods, colorized by the cluster assignments PKM of Phylogenetic k-means (in red, green, and blue) and IKM of the Imbalance k-means (in purple, orange, and gray), respectively. We use two different color sets for the two methods, in order to make them distinguishable at first glance. Note that the mapping of colors to clusters is arbitrary and depends on the random initialization of the algorithm.
As can be seen in Fig 8(a), Squash Clustering as well as Phylogenetic k-means can distinguish healthy subjects from those affected by Bacterial Vaginosis. Healthy subjects constitute the lower part of the cluster tree. They have shorter branches between each other, indicating the smaller KR distance between them, which is a result of the dominance of Lactobacillus in healthy subjects. The same clusters are found by Phylogenetic k-means: As it uses the KR distance, it assigns all healthy subjects with short cluster tree branches to one cluster (shown in red). The green and blue clusters are mostly the subjects affected by the disease.
The distinguishing features between the green and the blue cluster are not apparent in the Squash cluster tree. This can however be seen in Fig 8(c), which shows a Multidimensional scaling (MDS) plot of the pairwise KR distances between the samples. MDS [64,102,103] is a dimensionality reduction method that can be used for visualizing levels of similarity between data points. Given a pairwise distance matrix, it finds an embedding into lower dimensions (in this case, 2 dimensions) that preserves higher dimensional distances as well as possible. Here, the red cluster forms a dense region, which is in agreement with its short branch lengths in the cluster tree. At the same time, the green and blue cluster are separated in the MDS plot, but form a coherent region of low density, indicating that k := 3 might be too large with Phylogenetic k-means on this dataset. That is, the actual clustering just distinguishes two clusters: healthy and sick patients (S10 Fig).
A similar visualization of the pairwise KR distances is shown in Fig 8(d). It is a recalculation of Figure 4 in the preprint [101], which did not appear in the final published version [29]. The figure shows a standard Principal Components Analysis (PCA) [64,103] applied to the distance matrix by interpreting it as a data matrix, and was previously used to motivate Edge PCA. However, although it is mathematically sound, the direct application of PCA to a distance matrix lacks a simple interpretation. Again, the red cluster is clearly separated from the rest, but this time, the distinction between the green and the blue cluster is not as apparent.
In Fig 8(b), we compare Squash Clustering to Imbalance k-means. Here, the distinction between the two Lactobacillus clades can be seen by the purple and orange cluster assignments. The cluster tree also separates those clusters into clades. The separate small group of orange samples above the purple clade is an artifact of the tree ladderization. The diseased subjects are all assigned to the gray cluster, represented by the upper half of the cluster tree. It is apparent that both methods separate the same samples from each other.
Lastly, Fig 8(e) compares Imbalance k-means to Edge PCA. The plot is a recalculation of Figure 3 of [101], which also appeared in Figure 6 in [29] and Figure 3 in [18], but colored using our cluster assignments. Because both methods work on edge imbalances, they group the data in the same way, that is, they clearly separate the two healthy groups and the diseased one from each other. Edge PCA forms a plot with three corners, which are colored by the three Imbalance k-means cluster assignments.
In S7 Fig, we report more details of the comparison of our k-means variants to the dimensionality reduction methods used here. Furthermore, exemplary visualizations of the cluster centroids are shown in S8 Fig, which further supports that our methods yield results that are in agreement with existing methods.
4.2.2 HMP dataset
The HMP dataset is used here as an example to show that our method scales to large datasets. To this end, we used the unconstrained Bacteria RT with 1914 taxa as provided by our Phylogenetic Automatic (Reference) Tree (PhAT) method [34]. The tree represents a broad taxonomic range of Bacteria, that is, the sequences were not explicitly selected for the HMP dataset, in order to test the robustness of our clustering methods. We then placed the 9192 samples of the HMP dataset with a total of 118 701818 sequences on that tree, and calculated Phylogenetic and Imbalance k-means on the samples. The freely available meta-data for the HMP dataset labels each sample by the body site were it was taken from. As there are 18 different body site labels, we used k := 18. The result is shown in Fig 9. Furthermore, in S9 Fig, we show a clustering of this dataset into k := 8 broader body site regions to exemplify the effects of using different values of k. This is further explored by using the Elbow method as shown in S10 Fig.
Ideally, all samples from one body site would be assigned to the same cluster, hence forming a diagonal on the plot. However, as there are several nearby body sites, which share a large fraction of their microbiome [16], we do not expect a perfect clustering. Furthermore, we used a broad reference tree that might not be able to resolve details in some clades. Nonetheless, the clustering is reasonable, which indicates a robustness against the exact choice of reference taxa, and can thus by used for distinguishing among samples. For example, stool and vaginal samples are clearly clustered. Furthermore, the sites that are on the surface of the body (ear, nose, and arm) also mostly form two blocks of cluster columns.
4.2.3 Performance
The complexity of Phylogenetic k-means is in , with k clusters, i iterations, and n samples, and e being the number of tree edges, which corresponds to the number of dimensions in standard euclidean k-means. As the centroids are randomly initialized, the number of iterations can vary; in our tests, it was mostly below 100. For the BV dataset with 220 samples and a reference tree with 1590 edges, using k := 3, our implementation ran 9 iterations, needing 35 s and 730 MB of main memory on a single core. For the HMP dataset with 9192 samples containing 119 million sequences, and a reference tree with 3824 edges, we used k := 18, which took 46 iterations and ran in 2.7 h on 16 cores, using 48 GB memory.
In contrast to this, Imbalance k-means does not need to conduct any expensive tree traversals, nor take single placement locations into account, but instead operates on compact vectors with one entry per edge, using euclidean distances. It is hence several orders of magnitude faster than Phylogenetic k-means, and only needs a fraction of the memory. For example, using again k := 18 for the HMP dataset, the algorithm executed 75 iterations in 2 s. It is thus also applicable to extremely large datasets.
Furthermore, as the KR distance is used in Phylogenetic k-means as well as other methods such as Squash Clustering, our implementation is highly optimized and outperforms the existing implementation in guppy [22] by orders of magnitude (see below for details). The KR distance between two samples has a linear computational complexity in both the number of QSs and the tree size. As a test case, we computed the pairwise distance matrix between sets of samples. Calculating this matrix is quadratic in the number of samples, and is thus expensive for large datasets. For example, in order to calculate the matrix for the BV dataset with 220 samples, guppy can only use a single core and required 86 min. Our KR distance implementation in genesis is faster and also supports multiple cores. It only needed 90 s on a single core; almost half of this time is used for reading input files. When using 32 cores, the matrix calculation itself only took 8 s. This allows to process larger datasets: The distance matrix of the HMP dataset with 9192 samples placed on a tree with 3824 branches for instance took less than 10 h to calculate using 16 cores in genesis. In contrast, guppy needed 43 days for this dataset. Lastly, branch binning can be used to achieve additional speedups, as shown in S2 Table.
4.3 Phylogenetic ILR Transform and Balances
As a first test of our adaptation of balances to placement data, we apply it to the BV dataset. Further assessment of balances for placement data, also with the HMP dataset, is implicitly conducted by the evaluation of Placement-Factorization below, which uses balances for aggregation and contrasting of subtrees.
As balances are conceptually similar to edge imbalances, we perform analogous evaluations. To this end, we computed the per-edge balance for all edges of the BV reference tree, across all 220 samples. That is, for each edge, we computed the balance between the two subtrees induced by the edge. This yields a matrix that we call balance matrix, and which corresponds to the imbalance matrix used for Edge PCA, c. f. Fig 3(b). Hence, a natural first visualization of the balances is to analyze their principal components, that is, to compute the PCA of the balance matrix. The first two components are shown in S11 Fig, for both variants of the balance computation (with and without taxon weighting). The components exhibit a separation of the samples by Nugent score, showing that they yield results comparable to Edge PCA.
In order to interpret what the axes of these principal components mean, we can again employ the visualization of PCA eigenvectors on the reference tree as used in Edge PCA [29], c. f., S5 Fig. We show the results for PCA of the balances in S12 Fig. As with Edge PCA, the principal components correspond to the Lactobacillus clade, with the first component mostly separating Lactobacillus from the rest of the tree, and the second component further distinguishing between Lactobacillus crispatus and Lactobacillus iners.
Furthermore, as mentioned in the methods description, balances could in principle be used as input to our previously presented methods, such as Edge Correlation and k-means clustering (which adequately might be called Balance k-means), in the same manner that we used imbalances before. We exemplify the correlation of the Nugent score with balances in S13 Fig. However, artifacts might arise from the underlying mathematical framework of balances, in particular the usage of the geometric mean without taxon weighting: The geometric mean is not sensitive to singular large values, such as the high amount of placement mass on one of the Lactobacillus branches. It only significantly increases if multiple high values are present, such as the multitude of bacterial taxa with high abundance in diseased patients of the BV dataset [18]. This can lead to spurious results, as shown in S13(b) Fig, where the correlation of the unweighted balances with the Nugent score yields unrealistically high negative correlations for almost all branches that have little placement mass on them. Moreover, this property of the geometric mean implies that it is sensitive to taxa splitting (pers. comm. with A. Washburne on 2018-11-23): For example, it does make a difference whether masses are focused on a single branch, or distributed across several representatives of the same species.
Hence, we do not recommend to use (unweighted) balances for computations such as correlations or k-means clustering. Note that when used with GLMs, such as in Phylofactorization, these issues do not arise: The winning edge is chosen to maximize the difference between the null deviance and the deviance of the linear model. That difference is small for clades with almost no mass (such as the ones affected by the issue above), so that the value of the objective function for such edges is lower than for edges with more mass. Hence, the factorization does not incorrectly identify these low-abundance clades as potential factors. The usage of balances in Placement-Factorization is further explored in the next section.
4.4 Placement-Factorization
In the following we present results from Placement-Factorization, and compare them to the above results from our other methods, as well as to the original Phylofactorization. For comparability with the original method, we solely use balances for aggregating and contrasting, and an objective function that maximizes the difference between the null deviance and the deviance obtained by a Generalized Linear Model (GLM). Other choices of functions for Phylofactorization have been explored in [31] and [92], see there for details. The exploration of their effect on Placement-Factorization is left as future work, although based on the consistency of our results with the original method, we conjecture that they behave according to the findings of the original publications.
Furthermore, we note that our implementation supports taxon weighting as introduced in the Phylogenetic ILR transform [30], which is however not (yet) supported by the original Phylofactorization [31]. We found this weighting scheme to be a natural and valuable addition in the balances computation that yields results closer to those obtained with imbalances. We suspect that this is because the weighting scheme can alleviate the issues of the geometric mean mentioned above.
4.4.1 BV dataset
We analyzed the BV dataset [18] with our Placement-Factorization with and without taxon weighting, using balances for aggregating and contrasting, and GLMs for the objective function. As GLMs support multiple predictors at the same time, we used all three available meta-data features of the dataset simultaneously for the regression, that is, Nugent score [97], Amsel’s criteria [98], and the pH-value of the samples. We also tested with only the Nugent score to be consistent with our previous analyses, and to asses the robustness of the method with respect to the specific choice of meta-data features. We observe only minor differences in the ordering of the identified factors, that is, which clades were “winning” in which iteration. Hence, we focus on the results obtained with all three meta-data features taken into account.
For comparing with the original method, we clustered the dataset into OTUs using two different OTU clustering methods, vsearch [54] and swarm [52,53], and inferred two trees from these OTU clusterings. We used two distinct OTU clustering methods to asses how they affect factorization; see S1 Text for details on the preprocessing steps. We then conducted an analysis of both trees with the original Phylofactorization, again using balances and GLMs. We compare the results of Placement-Factorization to our previous analyses of the data as well as to the original Phylofactorization on the two alternative OTU trees.
In S3 Table, we compare the clades found by the two Phylofactorization variants (with vsearch and with swarm) to the clades found by Placement-Factorization without taxon weighting. Moreover, in S14 Fig we visualize the clades found by Placement-Factorization. These outcomes show that our results are consistent with the existing Phylofactorization, in that similar clades are split from the tree, albeit with some variation in the order by which clades are selected. The clades being split are also consistent with previous analyses of the dataset [18], as all taxa found by the first 10 factors of Placement-Factorization were also found to be relevant in the context of Bacterial Vaginosis in [18]. However, the vsearch-based Phylofactorization is the only evaluated variant that split the Lactobacillus clade in the first factor and further Lactobacillus crispatus from Lactobacillus iners in the second factor. The swarm-based variant and our Placement-Factorization without taxon weighting also identified these clades, but not in the first two iterations.
When using taxon weighting on the other hand, Placement-Factorization also finds these two clades in the first two factors, and is hence more consistent with existing analyses. However, due to small differences in the value of the objective function, the first iteration splits a larger clade than expected. We observed a similar behavior with the vsearch-based Phylofactorization; see the long list of taxa of the first factor in S3 Table. In order to identify the provenance of this effect and to correctly interpret the factors, we developed a novel visualization of the results: In Fig 10 and in S15 Fig, we show the reference tree, where each edge is colored by the value of the objective function at that edge. This type of visualization helps to assess the uncertainty involved in identifying the winning edge of a specific iteration.
For example, the visualization in Fig 10(a) reveals that the large clade found by the first factor (marked with an arrow) does in fact include many branches and taxa with a low value of the objective function (yellow branches). These are branches with low placement mass that do not contribute much to this factor. Instead, there is a path of comparably high values of the objective function that leads down to the Lactobacillus clade. This indicates that there are several ‘good’ candidate edges for distinguishing patients by their health status, and that the smaller Lactobacillus clade is the actual clade of interest in this factor. The winning edge just happened to have a slightly higher value than other edges on this path. To address this issue, a proper statistical test of the significance of each winning edge compared to the other edges evaluated in the iteration could be employed. This is connected to the idea of confidence regions of each factor on the tree, as presented in [92], which we discuss later.
To further assess how the samples are split by individual factors, we used the balances at each iteration/factor as an ordination of the data, as suggested in [31], which we show in S16 Fig. These plots reveal that the splitting into healthy vs. diseased patients works both with and without taxon weighting, albeit the differences in the respective plot shapes are pronounced. As scatter plots of balances can only reasonably reveal the first two or three factors, S16 Fig serves as a caveat for this type of visualization: The BV dataset does indeed have two important features concerning the healthy patients, namely the Lactobacillus clade, and the further distinction into Lactobacillus crispatus and Lactobacillus iners. However, as discussed above and visible in S1 Fig, the diseased patients exhibit high abundances in a multitude of other clades, which cannot be expressed by just two or three factors. We later show a novel way for visualizing balances in the HMP dataset that can help to understand the balances of all factors.
In summary, we find that Placement-Factorization without taxon weighting behaves similar to the original Phylofactorization (which also does not employ a taxon weighting scheme), while Placement-Factorization with taxon weighting yields results that are more in line with our previous results based on imbalances. The latter is likely because taxon weighting has a similar effect of reducing the influence of low mass branches (low abundance taxa) as the summation-based aggregation step of imbalances.
4.4.2 HMP dataset
The original publication of Phylofactorization used a dataset comprising oral and fecal samples from the human microbiome as one of their case studies [104], see Figure 4 and supplementary figures S3–S8 of [31]. For our comparison, we selected a suitable subset of the HMP dataset [16,17]: In particular, we selected all 600 stool samples of the dataset, as well as 600 randomly chosen samples from the mouth region, that is, from the samples labeled “Mouth (back)” and “Mouth (front)” in S1 Table. We again used the placement of these samples on the unconstrained Bacteria tree of our Phylogenetic Automatic (Reference) Tree (PhAT) method [34], containing 1914 taxa, to conduct Placement-Factorization. We henceforth assume that the oral/fecal dataset of [104] and our oral/fecal subset of the HMP dataset exhibit comparable sequence compositions. Furthermore, as the tree used for Phylofactorization in [31] is based on the OTUs of the sequences, it only contains taxa that are sufficiently abundant in the input. It thus differs from the more general reference tree used for our evaluation here. Therefore, we had to map the taxa found by Phylofactorization to the underlying Silva taxonomy [105,106] that was used for constructing our tree.
Despite these differences, Placement-Factorization yielded factors that are similar to the ones found by Phylofactorization. We compare the taxa identified by the first 10 factors of each variant. For simplicity, we only compare the clades on the non-root side of the (arbitrarily rooted) reference tree; the paraphyletic “remainder” clade is not taken into account. Furthermore, we do not consider the order of the factors here. Similar to the findings of the BV dataset above, Placement-Factorization with taxon weighting yielded larger clades than without taxon weighting, which again yielded larger clades than the OTU-based Phylofactorization. The latter is a consequence of the OTU tree containing fewer taxa than our broad Bacteria tree. We found that 84% of the taxa identified by Phylofactorization were also part of the factors of our variants, with the major difference being a set of Proteobacteria that were part of the split in the first factor of Phylofactorization [31], but not by our variants. This is most likely an artifact of the differing trees being used in the factorization. Furthermore, 95% of the taxa found by Placement-Factorization without taxon weighting were also part of the clades with taxon weighting. We visualized the clades found by all three variants in S17 Fig, which also provides further details on the comparison. Most of the factors found by the three variants agree with each other, with their disagreement mostly concerning the clade sizes. The actual differences in taxa (such as the Proteobacteria not found by our variants) serve as a caveat for the importance of the underlying reference tree: Differences in topology will inevitably be reflected in different factors, which might in turn suggest a different interpretation of results. In an ideal world with a known phylogeny of all of life, alternative OTU clusterings and alternative trees would simply collapse nodes at different depths (pers. comm. with A. Washburne on 2019-03-01). Unfortunately, real world data, and particularly different OTU clustering methods and tree inference methods, will yield discordant trees. The influence of uncertainty in the phylogeny is further discussed in [92].
Next, we investigated how well the factors found by Placement-Factorization separate oral from fecal samples. To this end, we again employed the balances of the winning edge of each factor for an ordination visualization [31], which we show in S18(a) Fig and S18(b) Fig. The ordination clearly separates the samples, both with and without taxon weighting. Again, ordination scatter plots can only reveal up to three dimensions/factors. In order to evaluate the separation of samples at later factors, we use a visualization of the factor balances, which we call balance swarm plots, and which are similar to the per-factor ordination plots used in [92]. These plots can show the ordination of arbitrarily many factors at the same time, as shown in Fig 11(a), as well as S18(c) Fig and S18(d) Fig.
As shown in Fig 11(a) and S18(c) Fig, eight out of the first ten factors found by Placement-Factorization with taxon weighting clearly separate the oral from the fecal samples. The remaining two factors (PF7 and PF9) separate most of the samples, but also have an interval of balances that contains samples from both body sites. Placement-Factorization without taxon weighting also separates samples based on their body site, as shown in S18(d) Fig, but with a less clear distinction. This is also obvious from the ordination scatter plots shown in S18(b) Fig.
Finally, we conducted Placement-Factorization on the whole HMP dataset with all 9192 samples, instead of just the oral/fecal subset, in order to evaluate how the method performs on large datasets with more than two categories (body sites) to distinguish. See S1 Table for an overview of the samples, as well as a list of the eight body site labels that we used for classifying the samples. We do not discuss the taxa that were split by each factor, as such an in-depth biological discussion is beyond the scope of this manuscript. Instead, we evaluate how well different body sites were separated by the factors. In S19 Fig, we show ordination plots of the first two and three factors. These plots already reveal that Placement-Factorization indeed separates samples from each other based on their body site. However, given the eight body site labels that we used, these plots are overloaded. Hence, we extended on the idea of balance swarm plots (as introduced above) by separating them into individual plots per factor, each showing the balance distribution of groups of samples. An example for the first factor is shown in Fig 11(b); we furthermore show the first four factors in S20 Fig. These visualizations indicate that Placement-Factorization separates samples mainly based on the distinction oral vs. remaining body sites, with a further separation of plaque samples in the oral region. This can, for example, be seen in Fig 11(b), where the first three groups “Mouth (back)”, “Mouth (front)”, and “Saliva” exhibit balances above 0, while all other groups have balances below 0. Further factors then separate vaginal samples and skin and airways samples from the rest of the samples, as shown in S20 Fig. Overall, Placement-Factorization can distinguish these samples by body site, at least to the extent that can expected from abundance differences in the samples. For example, it would be unrealistic to expect the algorithm to perfectly separate samples from the back and front of the mouth from each other.
In conclusion, Placement-Factorization yields factors of the oral/fecal data that are mostly consistent with the findings of Phylofactorization [31], and is also able to reasonably separate the samples of larger datasets with several categorical labels.
4.4.3 Performance
The run time of Placement-Factorization depends on (a) the number of input samples, (b) the number of branches of the reference tree, and (c) the number of iterations to run. As the computations are conducted on the mass matrix instead of single placements, the performance and memory requirements of Placement-Factorization are independent from the total number of sequences/placements in the dataset. In each iteration, and for each edge of the tree (except the ones that won previous factors), the balances of all samples are computed, and the objective function is evaluated. In case of using a GLM to express the relationship of balances with meta-data, this involves fitting a model across all samples. In our implementation, all these computations are parallelized.
Our relatively small BV test dataset ran on a standard laptop with 4 cores, taking 30 s per iteration. The full HMP dataset with 9192 samples and our reference tree with 3825 branches required 13.0 GB of memory in our non-optimized prototype implementation, as took less than 90 s per iteration using 20 cores. Also, note that later iterations tend to become faster, as the splitting of the tree into subtrees reduces the number of edges that need to be taken into account in each balance computation. Hence, we conclude that Placement-Factorization is well suited even for very large datasets.
4.4.4 Future directions
Phylofactorization is a very recent method whose full potential has just begun being explored [92]. We contributed novel ideas by (a) adapting the concept to phylogenetic placement, which can be thought of as placing abundances along branches of the tree instead of just at its tips; and (b) suggesting a novel visualization for objective function value at each edge of the reference tree, which helps in the interpretation of the factors being split in each iteration.
As discussed above, we found that both, the original Phylofactorization, as well as our Placement-Factorization, can split clades that are larger than one would expect from other types of analyses of the data. Considering the distribution of objective function values, as shown in Fig 10, it is likely that such large clades are the result of random variability along a path of branches that are equally relevant for the factor. Further research is needed to confirm this.
These findings suggest that it might be beneficial to introduce a significance value for each factor, which assesses how relevant the particular winning edge is compared to other edges that yielded a high objective value in an iteration. This idea is intrinsically connected (pers. comm. with A. Washburne on 2019-03-01) to the stopping function of the original Phylofactorization [31], which uses a Kolmogorov-Smirnov (KS) test to conservatively estimate when a sufficient number of factors have been identified. Another strongly connected idea is that of confidence regions of the phylogeny, defined by regions of the tree in which the “true” winning edge falls with a certain confidence [92]. Such a significance value for the winning edge might also enable a form of soft factorization, that does not greedily pick one winning edge per iteration.
Furthermore, the paths of high objective values as seen in Fig 10 indicate that there is a gradient of the objective function along the branches of the tree. This could be exploited in a gradient-ascending graph-walking algorithm to identify the phylogenetic factors of extremely large datasets without having to exhaustively evaluate the objective function at every edge (pers. comm. with A. Washburne on 2019-03-01).
5 Conclusion
We presented novel, scalable methods to analyze and visualize phylogenetic placements of metagenomic samples. Phylogenetic placement of the sequences on a fixed reference tree allows for an interpretation of the data in a phylogenetic context. The methods are computationally inexpensive, and are thus, as we have demonstrated, applicable to large datasets. They are built on top of a common set of concepts, and hence gear well into each other.
Edge Dispersion highlights branches of the phylogenetic tree that exhibit variations in the number of placements, and thus allows to identify regions of the tree with a high placement heterogeneity. Edge Correlation additionally takes meta-data features into account, and identifies branches of the tree that correlate with quantitative features, such as the temperature or the pH value of the environmental samples. These methods complement existing methods such as Edge PCA [29], and represent data exploration tools that can help unravel new patterns in phylogenetic placement data, and hence, in metagenomic samples.
We presented adapted variants of the k-means method, called Phylogenetic k-means and Imbalance k-means, which exploit the structure of phylogenetic placement data to identify clusters of environmental samples. The methods build upon ideas such as Squash Clustering [29] and can be applied to substantially larger datasets, as they use a pre-defined number of clusters. For future exploration, other forms of cluster analyses could be adapted to phylogenetic placement data, for example, soft k-means clustering [107,108] or density-based methods [109]. The main challenge when adopting such methods consists in making them phylogeny-aware, that is, to use mass distributions on trees instead of the typical ℝn vectors.
We introduced an adaptation of the Phylogenetic ILR transform and balances [30] to phylogenetic placements. As balances are a transformation that yields orthogonal components (one for each node or branch of the tree), issues pertaining to the normalization of compositional data do not arise. With samples being represented as a vector of balances, numerous standard tools for data visualization, ordination, and clustering in the euclidean space can be readily applied to phylogenetic placement data. Applying these methods to placements instead of OTUs allows for more detailed analyses, as the entire original sequence data can be used. Furthermore, using a fixed reference tree instead of one inferred from the OTUs present in a set of samples enables comparative studies across datasets.
Lastly, we presented an adaption of Phylofactorization [31,92], which we call Placement-Factorization. Placement-Factorization identifies branches of the reference tree, called phylogenetic factors, that exhibit a relationship with environmental meta-data features, that is, branches along which putative functional traits might have arisen in conjunction with changes in environmental variables. This factorization of the tree can be used as an ordination tool to visualize how samples are separated by changes along the factors, and as a dimensionality-reduction tool [31]. It thus complements Edge Correlation, but further allows to identify nested dependencies within sub-clades of the reference tree. We leave the adaptation of some of the original concepts of Phylofactorization to phylogenetic placements as future work, such as binned phylogenetic units (BPUs), stopping criteria for the iterations, as well as further experimentation with different objective functions and aggregation and contrast functions [31,91]. Based on our findings and experiments, we conjecture that these concepts should be readily applicable to our Placement-Factorization.
The presented methods take either the edge masses as input, that is, the abundances of metagenomic sequences on each branch of the reference tree, or a transformation of theses masses, such as edge imbalances or balances, and can hence analyze different aspects of the placements and the environmental samples. While edge masses reveal information about single branches, edge imbalances and balances take entire reference tree clades into account. Depending on the task at hand, either of them might be preferable and better suited for identifying patterns. We also elaborated on the limitations and caveats of these transformations. Furthermore, we again emphasize the importance of appropriately normalizing the samples as required, in order to cope with the compositional nature of metagenomic data. That is, depending on the type of sequence data, using either absolute or relative abundances is critical to allow for meaningful interpretation of the results.
We evaluated our novel methods on three real-world datasets (see S1 Text for details on the datasets and their preprocessing), and gave exemplary interpretations of the results. We further showed that these results are consistent with existing methods as well as empirical biological knowledge. Hence, our methods will also be useful to unravel new, unexplored relationships in metagenomic data. Each of them has their strengths and weaknesses, and focuses on different aspects of the data. The methods and their variants are hence best used in combination with each other, in order to obtain a thorough and comprehensive analysis.
The methods are implemented in our tool gappa, which is freely available under GPLv3 at http://github.com/lczech/gappa. Furthermore, gappa has recently been bundled into a bioconda package, which is available at https://anaconda.org/bioconda/gappa; note that this package is not maintained by ourselves. In S2 Text, we briefly describe the software and its commands, and also describe a typical phylogenetic placement analysis pipeline. The methods and transformations presented here constitute a toolbox of different techniques that can be combined with each other, and allow for further extension and experimentation. The implementation in gappa is mainly based on our library genesis, which is also freely available under GPLv3 at http://github.com/lczech/genesis. Furthermore, scripts, data and other tools used for the tests and figures presented here are available at http://github.com/lczech/placement-methods-paper.
6 Acknowledgments
We thank S. Srinivasan and E. Matsen for providing the Bacterial Vaginosis dataset [18] and for helping us understanding their methods and implementations. We further thank A. Washburne and J. Silverman for fruitful exchange about the underlying mathematics of their methods [30,31,91,92]. We also thank M. Dunthorn, L. Rubinat-Ripoll, C. Berney, L. Guidi, G. Lentendu, A. Kozlov, and P. Barbera for their feedback on our methods and help with this manuscript, and we thank G. Douglas for bundling gappa as a bioconda package. Lastly, we want to express our gratitude towards one of the anonymous reviewers of this manuscript for inspiring us to our adaptation of balances and phylofactorization.
Footnotes
↵* lucas.czech{at}h-its.org, alexandros.stamatakis{at}h-its.org
References
- 1.↵
- 2.↵
- 3.↵
- 4.
- 5.
- 6.↵
- 7.↵
- 8.
- 9.↵
- 10.↵
- 11.↵
- 12.
- 13.↵
- 14.↵
- 15.↵
- 16.↵
- 17.↵
- 18.↵
- 19.↵
- 20.↵
- 21.↵
- 22.↵
- 23.↵
- 24.↵
- 25.↵
- 26.↵
- 27.↵
- 28.↵
- 29.↵
- 30.↵
- 31.↵
- 32.↵
- 33.↵
- 34.↵
- 35.↵
- 36.↵
- 37.↵
- 38.↵
- 39.↵
- 40.↵
- 41.↵
- 42.↵
- 43.↵
- 44.
- 45.↵
- 46.↵
- 47.↵
- 48.↵
- 49.↵
- 50.↵
- 51.↵
- 52.↵
- 53.↵
- 54.↵
- 55.↵
- 56.↵
- 57.↵
- 58.↵
- 59.↵
- 60.↵
- 61.↵
- 62.↵
- 63.↵
- 64.↵
- 65.↵
- 66.
- 67.
- 68.↵
- 69.↵
- 70.
- 71.↵
- 72.↵
- 73.↵
- 74.↵
- 75.↵
- 76.↵
- 77.↵
- 78.↵
- 79.↵
- 80.
- 81.
- 82.
- 83.
- 84.↵
- 85.↵
- 86.↵
- 87.↵
- 88.↵
- 89.↵
- 90.↵
- 91.↵
- 92.↵
- 93.↵
- 94.
- 95.↵
- 96.↵
- 97.↵
- 98.↵
- 99.↵
- 100.↵
- 101.↵
- 102.↵
- 103.↵
- 104.↵
- 105.↵
- 106.↵
- 107.↵
- 108.↵
- 109.↵
- 110.↵
- 111.↵
- 112.↵
- 113.↵
- 114.↵
- 115.↵
- 116.↵
- 117.↵
- 118.↵
- 119.↵
- 120.↵