Abstract
Reconstructing the phylogenetic relationships that unite all lineages (the tree of life) is a grand challenge. The paucity of homologous character data across disparately related lineages currently renders direct phylogenetic inference untenable. To reconstruct a comprehensive tree of life we therefore synthesized published phylogenies, together with taxonomic classifications for taxa never incorporated into a phylogeny. We present a draft tree containing 2.3 million tips -- the Open Tree of Life. Realization of this tree required the assembly of two additional community resources: 1) a novel comprehensive global reference taxonomy; and 2) a database of published phylogenetic trees mapped to this taxonomy. Our open source framework facilitates community comment and contribution, enabling the tree to be continuously updated when new phylogenetic and taxonomic data become digitally available. While data coverage and phylogenetic conflict across the Open Tree of Life illuminate gaps in both the underlying data available for phylogenetic reconstruction and the publication of trees as digital objects, the tree provides a compelling starting point for community contribution. This comprehensive tree will fuel fundamental research on the nature of biological diversity, ultimately providing up-to-date phylogenies for downstream applications in comparative biology, ecology, conservation biology, climate change, agriculture, and genomics.
Significance statement
Scientists have used gene sequences and morphological data to construct tens of thousands of evolutionary trees that describe the evolutionary history of animals, plants and microbes. This study is the first to apply an efficient and automated process for assembling published trees into a complete tree of life. This tree, and the underlying data, are available to browse and download from the web, facilitating subsequent analyses that require evolutionary trees. The tree can be easily updated with newly published data. Our analysis of coverage not only reveals gaps in sampling and naming biodiversity, but also further demonstrates that most published phylogenies are not available in digital formats that can be summarized into a tree of life.
Introduction
The realization that all organisms on Earth are related by common descent (1) was one of the most profound insights in scientific history. The goal of reconstructing the tree of life is one of the most daunting challenges in biology. The scope of the problem is immense: there are ∼1.8 million named species, and most species have yet to be described (2)(3)(4). Despite decades of effort and thousands of phylogenetic studies on diverse clades, we lack a comprehensive tree of life, or even a summary of our current knowledge. One reason for this shortcoming is lack of data. GenBank contains DNA sequences for ∼411,000 species, only 22% of estimated named species. While some gene regions (e.g., rbcL, 16S, COI) have been widely sequenced across some lineages, they are insufficient for resolving relationships across the entire tree (5). Most recognized species have never been included in a phylogenetic analysis because no appropriate molecular or morphological data have been collected.
There is extensive publication of new phylogenies, data, and inference methods, but little attention to synthesis. We therefore focus on constructing the first comprehensive tree of life through the integration of published phylogenies with taxonomic information. Phylogenies by systematists with expertise in particular taxa likely represent the best estimates of relationships for individual clades. By focusing on trees instead of raw data, we avoid issues of dataset assembly (6). However, most published phylogenies are available only as journal figures, rather than in electronic formats that can be integrated into databases and synthesis methods (7, 8), (9). Although there are efforts to digitize trees from figures (10), we focus instead on synthesis of published, digitally-available phylogenies.
When source phylogenies are absent or sparsely sampled, taxonomic hierarchies provide structure and completeness (11, 12). Given the limits of data availability, synthesizing phylogeny and taxonomic classification is the only way to construct a tree of life that includes all recognized species. One obstacle has been the absence of a complete, phylogenetically-informed taxonomy that spans traditional taxonomic codes (13). We therefore assembled a comprehensive global reference taxonomy via alignment and merging of multiple openly-available taxonomic resources. The Open Tree Taxonomy (OTT) is open, extensible, and updatable, and reflects the overall phylogeny of life. With the continued updating of phylogenetic information from published studies, this framework is poised to update taxonomy in a phylogenetically-informed manner far more rapidly than has occurred historically.
We used new graph methods (14) to synthesize a tree of life of over 2.3 million OTUs (operational taxonomic units) from the reference taxonomy and curated phylogenies. Taxonomies contribute to the structure only where we do not have phylogenetic trees. Advantages of graph methods include easy storage of topological conflict among underlying source trees in a single database, the construction of alternative synthetic trees, and the ability to continuously update the tree with new phylogenetic and/or taxonomic information. Importantly, our methodology also highlights the current state of knowledge for any given clade and reveals those portions of the tree that most require additional study. Although a massive undertaking in its own right, this draft tree of life represents only a first step. Through feedback, addition of new data, and development of new methods, the broader community can improve this tree.
Results
Open Tree Taxonomy
To align phylogenies from different sources, the tips, which may represent different taxonomic levels, must be mapped to a common taxonomic framework (14). For synthesizing phylogenetic data, taxonomy also provides completeness and structure where phylogenetic studies have not sampled all known lineages (true of most clades). Available taxonomies differ in completeness and how closely the hierarchy matches known evolutionary relationships. The Open Tree Taxonomy, OTT, is an automated synthesis of available taxonomies, maximizing the number of taxa and preferring input taxonomies that better align to phylogenetic hypotheses in various clades (see Methods). It contains taxa with traditional Linnaean names and unnamed taxa known only from sequence data. OTT v 2.8 has 2,722,024 OTUs without descendants and includes 382,564 higher taxa; 585,081 of the names are classified as non-phylogenetic units (e.g., incertae sedis) and were therefore not included in the synthesis pipeline. The taxonomy is available for download and through web services, including a taxonomic name resolution service for aligning other trees with our taxonomy (see Data and Software Availability, below).
Input Phylogenies
We built a user interface for collection and curation of potential trees for synthesis (https://tree.opentreeoflife.org/curator). The complete database contains 6810 trees from 3062 studies. At the time of publication, 484 studies in our database are incorporated into the draft tree of life. Our goal is to generate a best estimate of phylogenetic knowledge; based on our tests, we give several reasons not to use all available trees for synthesis. First, including trees that are incorrect does not improve the synthetic estimate. In each major clade, expert curators selected and ranked input trees for inclusion based on date of publication, underlying data, and methods of inference (see Methods for details). These rankings generally reflect community consensus about phylogenetic hypotheses. Second, including trees that merely confirm or are subsets of other analyses only increases computational difficulty without significantly improving the synthetic tree. For example, while we have many framework phylogenies spanning angiosperms, we did not include older trees where a newer tree extends the same underlying data. Third, inclusion of trees requires a minimum level of curation, where most OTU labels have been mapped to the taxonomic database, the root is correctly identified, and an ingroup clade has been identified. This information is not in the input file and requires manual curation from the associated publication. Not all trees are sufficiently well-curated; at this point, we have focused curation efforts on trees that will most improve the synthetic tree. The full set of trees in the database is important for other questions such as estimating conflict or studying history of inference in a clade, highlighting the importance of continued deposition and curation of trees into public data repositories. See Dataset 1 for a list of input trees and metadata.
A draft tree of life
We constructed a tree alignment graph (14), the graph of life, by loading the Open Tree Taxonomy and the 484 rooted phylogenies into a neo4j database. The graph of life contains 2,339,460 leaf nodes (after excluding non-phylogenetic units from OTT), plus 229,801 internal nodes. It preserves conflict among phylogenies and between phylogenies and the taxonomy. To create the synthetic tree, we traverse the graph, resolving conflict based on the rank of inputs, and label accepted branches that trace a synthetic tree summarizing the source information. This allows for clear communication of how conflicts are resolved through ranking, and of the source trees and / or taxonomies that support a particular resolution. The synthetic tree contains phylogenetic structure where we have published trees and taxonomic structure where we do not. See SOM for details. The tree is available to browse and download, and web services allow extraction of subtrees given lists of species (see Data and Software Availability, below).
A. Coverage
Of the 2,339,460 tips in the synthetic tree of life, 37,525 are represented in at least one input phylogeny, with an additional 4254 non-terminal taxa represented as tips in phylogenetic inputs (Figure 1). In Bacteria, Fungi, Nematoda, and Insecta, there is a large gap between the estimated number of species and what exists in taxonomic and sequence databases (Figure 2). In contrast, Chordata and Embryophyta are nearly fully sampled in databases and in OTT (Figure 2). Poorly sampled clades require more data collection and deposition and, in some cases, formal taxonomic codification and identification to be incorporated in taxonomic databases. Most tips in the synthetic tree are not represented by phylogenetic analyses. The limited number of input trees highlights the need for new sequencing efforts, additional phylogenetic studies, and the deposition of published tree files into data repositories.
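As a quick arithmetic check on the coverage figures above, the following sketch computes the share of tips that appear directly in at least one input phylogeny. The two counts are taken from the text; the variable names are illustrative only, and the 4254 non-terminal taxa mapped as tips are not counted here.

```python
# Counts quoted in the text above.
total_tips = 2_339_460          # tips in the synthetic tree
tips_in_phylogenies = 37_525    # tips present in at least one input phylogeny

# Share of tips that appear directly in an input phylogeny, and the remainder.
phylo_fraction = tips_in_phylogenies / total_tips
taxonomy_only_fraction = 1 - phylo_fraction

print(f"tips in input phylogenies: {phylo_fraction:.1%}")
print(f"remaining tips:            {taxonomy_only_fraction:.1%}")
```

Under these counts, about 1.6% of tips appear directly in an input phylogeny, which is in line with the statement in the Discussion that roughly 98% of tips come from taxonomy only.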
B. Resolution and conflicts
The tree of life we provide is only one representation of the Open Tree of Life data. Analysis of the full graph database (the graph of life) allows us to examine conflict between the synthetic tree of life, taxonomy, and source phylogenies. Figure 3 depicts the types of alternate resolutions that exist in the graph. We recovered 153,109 clades in the tree of life, of which 129,778 (84.8%) are shared between the tree of life and the Open Tree Taxonomy. There are 23,331 clades that either conflict with the taxonomy (4610 clades; 3.0%) or where the taxonomy is agnostic to the presence of the clade (18721 clades; 12.2%). The average number of children for each node in the taxonomy is 19.4, indicating a poor degree of resolution compared to an average of 2.1 in the input trees. When we combine the taxonomy and phylogenies into the synthetic tree, the resolution improves to an average of 16.0 children per internal node. See SOM for details.
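The resolution measure used above (average number of children per internal node) and the clade percentages can be illustrated with a minimal sketch. The toy trees and the dictionary representation are assumptions made for illustration; the clade counts are those reported in the text.

```python
def average_children(tree):
    """Mean number of children per internal node.

    `tree` maps each internal node to a list of its children; tips do not
    appear as keys. This representation is an assumption for illustration.
    """
    internal = [kids for kids in tree.values() if kids]
    return sum(len(kids) for kids in internal) / len(internal)


# Hypothetical toy data: an unresolved taxonomy node and a fully
# bifurcating phylogeny on the same four tips.
taxonomy = {"root": ["A", "B", "C", "D"]}
phylogeny = {"root": ["n1", "n2"], "n1": ["A", "B"], "n2": ["C", "D"]}

print(average_children(taxonomy))   # 4.0
print(average_children(phylogeny))  # 2.0

# Clade counts reported in the text, expressed as shares of all clades.
clade_counts = {
    "shared with taxonomy": 129_778,
    "conflict with taxonomy": 4_610,
    "taxonomy agnostic": 18_721,
}
total = sum(clade_counts.values())  # 153,109 clades in the synthetic tree
for label, count in clade_counts.items():
    print(f"{label}: {count / total:.1%}")
```

A value near 2 indicates a nearly bifurcating tree, whereas larger values indicate many unresolved nodes (polytomies), which is why the taxonomy scores much higher than the input trees.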
Alignment of nodes between the synthetic tree and taxonomy reveals how well taxonomy reflects current phylogenetic knowledge. Strong alignment is found in Primates and Mammalia, while our analyses reveal a wide gulf between taxonomy and phylogeny in Fungi, Viridiplantae (green plants), Bacteria, and various microbial eukaryotes (Table 1).
C. Comparison with supertree approaches
No existing supertree method scales to phylogenetic reconstruction of the entire tree of life, so our graph synthesis method was the only option for tree-of-life-scale analyses. To compare our method against existing supertree methods, we employed a hybrid MultiLevelSupertree (MLS, (15)) + synthesis approach (see Methods). The total number of internal nodes in the MLS tree is 151458, compared to 155830 in the graph synthesis tree, although the average number of children is the same (16.0 children / node). When we compare the source phylogenies against the MLS supertree and the draft synthetic tree, the synthesis method is better at capturing the signal in the inputs. The average topological error (normalized Robinson-Foulds distance, where 0 = share all clades and 100 = share no clades (16)) of the MLS vs input trees is 31, compared to 15 for the graph synthesis tree. See SOM for details.
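To make the error measure concrete, the following is a minimal sketch of a normalized Robinson-Foulds distance on the 0 to 100 scale described above. It assumes rooted trees written as nested tuples on the same set of tips, and it compares rooted clades; how the actual comparison handled input trees with differing taxon sets is described in the SOM and is not reproduced here.

```python
def clades(tree):
    """Return the non-trivial clades of `tree` as frozensets of tip labels.

    `tree` is a nested tuple; a string is a tip. Single-tip clades and the
    root clade are excluded because all trees on the same tips share them.
    """
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        tips = frozenset().union(*(visit(child) for child in node))
        found.add(tips)
        return tips

    all_tips = visit(tree)
    found.discard(all_tips)
    return found


def normalized_rf(tree_a, tree_b):
    """0 = the trees share all clades, 100 = they share none."""
    a, b = clades(tree_a), clades(tree_b)
    if not a and not b:
        return 0.0
    return 100.0 * len(a ^ b) / (len(a) + len(b))


# Hypothetical source tree and two candidate summary trees.
source = ((("A", "B"), "C"), "D")
summary_1 = ((("A", "C"), "B"), "D")   # shares clade ABC only
summary_2 = (("A", "C"), ("B", "D"))   # shares no clade with the source

print(normalized_rf(source, source))     # 0.0
print(normalized_rf(source, summary_1))  # 50.0
print(normalized_rf(source, summary_2))  # 100.0
```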
Discussion
Using novel graph database methods, we combine published phylogenetic data and the Open Tree Taxonomy to produce a first draft tree of life with 2.3 million tips -- the Open Tree of Life. This tree is comprehensive in terms of named species, but it is far from complete in terms of biodiversity or phylogenetic knowledge. It does not aim to infer novel phylogenetic relationships, but instead is a summary of published and digitally-available phylogenetic knowledge. This is the first time a comprehensive tree of life has been available for any analysis that requires a phylogeny, even if the species of interest have not been analyzed together in a single, published phylogeny.
As a result of data availability, data quality, and conflict resolution, there are many areas where relationships in the tree do not match current phylogenetic thinking (e.g., relationships within Fabaceae, Compositae, Arthropoda). This draft tree of life represents an initial step. The next step in this community-driven process is for experts to contribute trees and annotate areas of the tree they know best.
Limitations on coverage
Many microbial eukaryotes, Bacteria, and Archaea are not present in openly available taxonomic databases and are therefore not incorporated into the Open Tree Taxonomy and the synthetic tree. Most tips in the synthetic tree (98%) come from taxonomy only, reflecting both the need to incorporate more species into phylogenies and the need to make published phylogenies available. We obtained trees from digital repositories and also by contacting authors directly, but our overall success rate was only 16% (8). Many published relationships are not represented in the synthetic tree because this knowledge only exists as journal images. Our infrastructure allows for the synthetic tree to be easily and continuously updated via updated taxonomies and newly published phylogenies. The latter is dependent on authors making tree files available in repositories such as TreeBASE (17), Dryad (http://datadryad.org) or through direct upload to Open Tree of Life (http://tree.opentreeoflife.org/curator) and on having sufficient metadata for trees. We hope this synthetic approach will provide incentive for the community to change the way we view phylogenies: as resources to be cataloged in open repositories rather than simply as static images.
Conflicts in the tree of life
The synthetic tree of life is a bifurcating phylogeny (with “soft” polytomies reflecting uncertainty), but some relationships are more accurately described using reticulating networks. The Open Tree of Life contains areas with conflict (Figure 3). For example, the monophyly of Archaea is contentious: some data store trees indicate that eukaryotes are embedded within Archaea (18)(19) rather than forming a separate clade. Similarly, multiple resolutions of early diverging animal lineages have been proposed (20–23). Reticulations help visualize competing hypotheses, gene tree / species tree conflicts, and underlying processes such as HGT, recombination, and hybridization, which have had major impacts throughout the tree of life (e.g., hybridization in diverse clades of green plants (24) and animal lineages (25), including our own (26), and HGT in bacteria and archaea (27–29)). The graphical synthesis approach employed here naturally allows for storage of conflict and non-tree-like structure, enabling downstream visualization, analysis, and annotation of conflict (Figure 3) and highlighting the need for additional work in this area.
Resolving conflict is a challenge in supertree methods, including our graph method. The number of input trees that support a synthetic edge may be considered a reasonable criterion for resolving conflict, but the datasets used to construct each source tree may have overlapping data, making them non-independent. The number of taxa or gene regions involved cannot be used alone without other information to assess the quality of the particular analysis. Better methods for resolving conflict require additional metadata about the underlying data and phylogenetic inference methods.
Selection of input trees
We used only a subset of trees in the database for synthesis, filtering out trees that are redundant, erroneous or have insufficient metadata. Our current synthesis method relies on manual ranking of input trees by expert curators within major clades. The potential to automate this ranking, and to use metadata to resolve conflict, depends on the availability of machine-readable metadata for trees; such data currently must be entered manually by curators after reading the publication. Additional metadata would allow a comparison of synthesis trees based on, for example, morphological versus molecular data, inference method, or the number of underlying genes. Manual curation is time-consuming and labor-intensive; scalability would improve greatly by having standardized metadata (41) encoded in the files output by inference packages (e.g., in NeXML files (30)).
Source trees as a community resource
The availability of well-curated trees allows for many analyses other than synthesis, such as calculating the increase in information content for a clade over time or by a particular project or lab; comparing trees constructed by different approaches; or recording the reduction in conflict in clades over time. These analyses require that tips be mapped to a common taxonomy to compare across trees. Our database contains thousands of trees mapped to existing taxonomies through the Open Tree Taxonomy. The data curation interface is publicly available (http://tree.opentreeoflife.org/curator) as is the underlying data store (http://github.com/opentreeoflife/phylesystem).
Dark parts of the tree
Hyperdiverse, poorly understood groups including Fungi, microbial eukaryotes, Bacteria, and Archaea are not yet well represented in input taxonomies. Our effort also highlights where major research is needed to achieve a better understanding of existing biodiversity. Metagenomic studies routinely reveal numerous OTUs that cannot be assigned to named species (31, 32). For Archaea and Bacteria, there are additional challenges created by their immense diversity, lack of clarity regarding species concepts, and rampant horizontal gene transfer (HGT) (27)(33)(34). The operational unit is often strains (not species), which are not regulated by any taxonomic code; strain collections are not available to download, making it difficult to map taxa between trees and taxonomy and estimate named biodiversity. Open databases such as BioProject at NCBI (http://www.ncbi.nlm.nih.gov/bioproject) have the potential to better catalog biodiversity that does not fit into traditional taxonomic workflows.
Materials and methods
Input data: taxonomy
No single taxonomy is both complete and has a backbone well-informed by phylogenetic studies. We therefore constructed the Open Tree Taxonomy (OTT) by merging Index Fungorum (35), SILVA (36, 37), NCBI (38), GBIF (39), IRMNG (40), and two clade-specific resources (41) (42) using a fully-documented, repeatable process that includes both generalized merge steps and user-defined patches (see SOM). OTT (v 2.8.5) consists of 2,722,024 well-named entities and 1,360,819 synonyms, with an additional 585,081 entities having non-biological or taxonomically incomplete names (e.g., “environmental samples” or “incertae sedis”) that are not included in the synthetic phylogeny.
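The idea of preferring some input taxonomies over others during a merge can be sketched as follows. This is a deliberately simplified illustration, not the OTT merge procedure: it assumes each taxonomy is a child-to-parent map, ignores synonyms, homonyms, and user-defined patches, and uses hypothetical source and taxon names.

```python
def merge_taxonomies(sources):
    """Merge child -> parent maps, letting earlier (preferred) sources win.

    `sources` is a list of (source_name, parent_map) pairs in priority
    order. Returns {taxon: (parent, source_name)}.
    """
    merged = {}
    for name, parent_map in sources:
        for taxon, parent in parent_map.items():
            # Keep the placement from the highest-priority source that has it.
            merged.setdefault(taxon, (parent, name))
    return merged


# Hypothetical inputs: source_A is preferred; source_B adds a taxon that
# source_A lacks and disagrees about the placement of genus_1.
source_a = {"genus_1": "family_X", "family_X": "order_P"}
source_b = {"genus_1": "family_Y", "genus_2": "family_Y", "family_Y": "order_P"}

merged = merge_taxonomies([("source_A", source_a), ("source_B", source_b)])
print(merged["genus_1"])  # ('family_X', 'source_A')
print(merged["genus_2"])  # ('family_Y', 'source_B')
```

The preferred source determines the placement wherever sources disagree, while lower-priority sources still contribute taxa that the preferred source does not contain, which is how completeness and phylogenetic alignment can be balanced.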
Input data: phylogenetic trees
We imported and curated phylogenetic trees using a new interface that saves tree data directly into a GitHub repository (43). We obtained published trees from TreeBASE (17) and Dryad, and by direct appeal to authors. The data retrieved are by no means a complete representation of phylogenetic knowledge, as we obtained digital phylogeny files for only 16% of recently published trees (9). Even when available (as newick, NEXUS, NeXML files or via TreeBASE import), trees require significant curation to be usable for synthesis. We mapped taxon labels (which often include lab codes or abbreviations) to taxonomic entities in OTT. We rooted (or re-rooted) trees to match figures from papers. As relationships among outgroup taxa were often problematic, we identified the ingroup / focal clade for the study. For studies with multiple trees, we tagged the tree that best matched the conclusions of the study as “preferred”. Then, within major taxonomic groups (eukaryotic microbial clades, animals, plants and fungi) we ranked preferred trees to generate prioritized lists. In the absence of structured metadata about the phylogenetic methods and data used to infer the input trees, rankings were assembled by authors with expertise in specific clades and were based on date of publication, taxon sampling, the number of genes / characters in the alignment, whether the specific genomic regions are known to be problematic, support values, and phylogenetic reliability (agreement or disagreement with well-established relationships); see Table 2 for details. In general, rankings reflect community consensus about phylogenetic hypotheses. As we collect more metadata, such as that described by MIAPA (Minimum Information for a Phylogenetic Analysis) (44), either by manual entry into the system or by upload of tree files with structured, machine-readable metadata, automated filtering and weighting of trees based on metadata will become possible.
Synthesis
The goal of the supertree (or “synthesis”) operation is to summarize the ranked input trees and taxonomy (with the taxonomy given the lowest rank). We use an algorithmic approach to produce the synthetic tree rather than a search through tree space for an optimal tree. Given a set of edges labeled with the ranks of supporting trees, the algorithm is a greedy heuristic that tries to maximize the sum of the ranks of the included edges. We summarize the major steps of the method here and provide details in Supplementary Online Materials (SOM).
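The greedy, rank-based idea can be sketched on a toy example. This is a simplification under stated assumptions, not the synthesis implementation: it works on candidate clades (sets of tips) rather than on graph edges, treats a larger rank value as higher priority, and accepts a clade only if it is nested within or disjoint from every clade already accepted.

```python
def compatible(a, b):
    """Two clades can coexist in one tree if they are nested or disjoint."""
    return a <= b or b <= a or a.isdisjoint(b)


def greedy_synthesis(candidates):
    """Accept clades in decreasing rank order, skipping any that conflict.

    `candidates` is a list of (rank, clade) pairs, where a larger rank
    means a more preferred input and the taxonomy has the lowest rank.
    """
    accepted = []
    for rank, clade in sorted(candidates, key=lambda rc: rc[0], reverse=True):
        if all(compatible(clade, kept) for _, kept in accepted):
            accepted.append((rank, clade))
    return accepted


# Hypothetical candidates from two ranked trees and the taxonomy.
candidates = [
    (1, frozenset("ABC")),   # taxonomy (lowest rank)
    (3, frozenset("AB")),    # top-ranked tree
    (2, frozenset("BC")),    # lower-ranked tree, conflicts with AB
    (2, frozenset("ABCD")),  # lower-ranked tree, compatible with AB
]

result = greedy_synthesis(candidates)
for rank, clade in result:
    print(rank, sorted(clade))
print("sum of ranks:", sum(rank for rank, _ in result))  # 6
```

Here the clade BC is rejected because it conflicts with the higher-ranked clade AB, while the taxonomy clade ABC is retained because it conflicts with nothing already accepted. A greedy pass of this kind is fast and deterministic but is not guaranteed to find the globally optimal set of edges.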
The first steps include pre-processing the inputs. We prune non-biological or taxonomically incomplete names from OTT, and prune outgroups and unmapped taxa from input trees. Removal of outgroups reduces errors from unexpected relationships among outgroup taxa. Finally, we find uncontested nodes across the taxonomy + input trees and break the inputs at these nodes into a set of subproblems. This allows for a divide-and-conquer approach that shortens running time and reduces memory requirements.
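The pruning step can be illustrated with a minimal sketch that restricts a tree to a set of retained tips (for example, ingroup tips that have been mapped to the taxonomy). The nested-tuple representation and the tip labels are assumptions for illustration; the decomposition into subproblems at uncontested nodes is not shown.

```python
def prune(tree, keep):
    """Return `tree` restricted to the tips in `keep`, or None if empty.

    `tree` is a nested tuple with string tips. Internal nodes left with a
    single child are collapsed so that no unbranched nodes remain.
    """
    if isinstance(tree, str):
        return tree if tree in keep else None
    children = [c for c in (prune(child, keep) for child in tree) if c is not None]
    if not children:
        return None
    if len(children) == 1:
        return children[0]
    return tuple(children)


# Hypothetical input tree with one outgroup tip and one unmapped label.
tree = ("Outgroup_sp", (("A", "B"), ("C", "unmapped_sample_17")))
mapped_ingroup = {"A", "B", "C"}

print(prune(tree, mapped_ingroup))  # (('A', 'B'), 'C')
```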
We then build a tree alignment graph (14, 45), which we refer to as the graph of life. Tree alignment graphs allow for representation of both congruence and conflict in the same data structure, allow for non-overlapping taxon sets in the inputs (as well as tips mapped to higher taxa) and are computationally-tractable at the scale of 2.3 million tips and hundreds of input trees. We load the taxonomy nodes and edges into the graph, and then each subproblem, creating new nodes and edges and mapping tree nodes onto compatible taxonomy nodes. We also create new nodes and edges that reflect potential paths between the inputs.
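A heavily simplified sketch of the data structure may help fix ideas. It assumes that graph nodes are identified by their sets of descendant tips and that each edge records which inputs contain it; the actual node mapping (including tips mapped to higher taxa and the additional edges reflecting potential paths between inputs) is more involved and is described in refs. 14 and 45 and the SOM. The source names are hypothetical.

```python
from collections import defaultdict


def load_tree(graph, source, tree):
    """Add one rooted tree to `graph`, labeling each edge with `source`.

    `graph` maps (child, parent) pairs of tip sets to the set of sources
    that contain that edge. Nodes with identical tip sets are treated as
    the same graph node, which is a simplification of real node mapping.
    """
    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        child_sets = [visit(child) for child in node]
        parent = frozenset().union(*child_sets)
        for child in child_sets:
            graph[(child, parent)].add(source)
        return parent

    visit(tree)


graph = defaultdict(set)
load_tree(graph, "taxonomy", (("A", "B", "C"), "D"))
load_tree(graph, "tree_1", ((("A", "B"), "C"), "D"))
load_tree(graph, "tree_2", ((("B", "C"), "A"), "D"))

# Edges into the ABC node hold congruence (edges shared by several inputs)
# and conflict (the alternative children AB and BC) side by side.
abc = frozenset("ABC")
for (child, parent), sources in sorted(graph.items(), key=lambda kv: sorted(kv[0][0])):
    if parent == abc:
        print(sorted(child), "->", sorted(parent), sorted(sources))
```

Because conflicting resolutions are stored as alternative edges rather than discarded, different synthetic trees can later be extracted from the same graph by changing which sources are preferred.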
Once the graph is complete, generating the synthetic tree involves traversing the graph and preferring edges that originate from high-ranked inputs. This means that we always prefer phylogenies over taxonomy. Given additional digitized metadata about trees, this system allows for custom synthesis procedures based on preference for inference methods, data types or other factors.
As a comparison to this rank-based analysis, we also created a synthetic tree using MultiLevelSupertrees (MLS) (15), a supertree method where the tips in the source trees can represent different taxonomic hierarchies. We built MLS supertrees for the largest clades that were computationally feasible and then used these non-overlapping trees as input into the graph database and conducted synthesis. Due to the lack of taxon overlap between each MLS tree, there was no topological conflict, and creating the final MLS supertree simply involved traversing the graph and preferring phylogeny over taxonomy.
Data and software availability
The current version of the tree of life is available to browse, comment on, and download at https://tree.opentreeoflife.org. All software is open-source and available at https://github.com/opentreeoflife. The tree data store is available at https://github.com/opentreeoflife/phylesystem. Where not limited by pre-existing terms of use, all data are published with a CC-0 copyright waiver. The Open Tree of Life taxonomy, the synthetic tree and processed inputs are available from Dryad: http://dx.doi.org/10.5061/dryad.8j60q
Acknowledgements
We are grateful to Paul Kirk at Index Fungorum, Tony Rees at IRMNG, and Markus Doering at GBIF for taxonomy data and advice on taxonomy synthesis; Mark Holder for discussion, feedback and software development; Joseph Brown for data collection and curation, software development, data analysis, and writing; Pam Soltis for helpful comments on the manuscript; the authors who made their tree files available in TreeBASE or Dryad, or who provided tree files that were not otherwise available; the curators who imported trees and added metadata; and finally NSF AVATOL #1208809 for funding.