What is an archaeon and are the Archaea really unique?

Ajith Harish

doi:10.1101/256263

Abstract

The recognition of the group Archaea 40 years ago stimulated research in microbial evolution and molecular systematics that prompted a new classificatory scheme to organize biodiversity. Advances in DNA sequencing techniques have since significantly improved the genomic representation of the archaeal biodiversity. In addition, advances in phylogenetic modeling that facilitate large-scale phylogenomics have resolved many recalcitrant branches of the Tree of Life. Despite the technical advances and an expanded taxonomic representation, two important aspects of the origins and evolution of the Archaea remain controversial, even as we celebrate the 40th anniversary of the monumental discovery. The issues concern (i) the uniqueness (monophyly) of the Archaea, and (ii) the evolutionary relationships of the Archaea to the Bacteria and the Eukarya; both of these are relevant to the deep structure of the Tree of Life. The uncertainty is primarily due to a scarcity of information in standard datasets—the core-genes datasets—to reliably resolve the conflicts. These conflicts can be resolved efficiently by employing complex genomic features and genome-scale evolution models—a distinct class of phylogenomic characters and evolution models—that can be employed routinely to maximize the use of genome sequences as well as to minimize uncertainties in tests of evolutionary hypotheses.

Introduction

The recognition of the Archaea as the so-called “third form of life” was made possible in part by a new technology for sequence analysis, oligonucleotide cataloging, developed by Frederick Sanger and colleagues in the 1960s (1, 2). Carl Woese’s insight of using this method, and the choice of the small subunit ribosomal RNA (16S/SSU rRNA) as a phylogenetic marker, not only put microorganisms on a phylogenetic map (or tree), but also revolutionized the field of molecular systematics that Zuckerkandl and Pauling had previously alluded to (3). Comparative analysis of organism-specific (oligonucleotide) sequence-signatures in SSU rRNA led to the recognition of a distinct group of microorganisms (2, 4). Initially referred to as Archaebacteria, these unusual organisms had ‘oligonucleotide signatures’ distinct from those of other bacteria (Eubacteria), and these signatures were later found to be different from those of Eukarya (eukaryotes) as well. Many other features, including molecular, biochemical as well as ecological, corroborated the uniqueness of the Archaea. Thus the archaeal concept was established (2).

The study of microbial diversity and evolution has come a long way since then: sequencing microbial genomes, including directly from the environment without the need for culturing, is now routine (5, 6). This wealth of sequence information is exciting not only for cataloging and organizing biodiversity, but also for understanding the ecology and evolution of microorganisms – archaea and bacteria as well as eukaryotes – that make up a vast majority of the planetary biodiversity. Since large-scale exploration by means of environmental genome sequencing became possible almost a decade ago, there has also been palpable excitement and anticipation about the discovery of a fourth form of life or a “fourth domain” of life (7). The reference here is to a fourth form of cellular life, not to viruses, which some have already proposed to be the fourth domain of the Tree of Life (ToL) (7, 8). If a fourth form of life were to be found, what would the distinguishing features be, and how could it be measured, defined and classified?

Contrary to expectations, however, the current discussion is centered not on the discovery of a fourth domain but on a return to a dichotomous classification of life (9-11), despite the rapid expansion of sequenced biodiversity – hundreds of novel phyla descriptions (12, 13). The proposed dichotomous classification schemes, unfortunately, are in sharp contrast to each other, depending on: (i) whether the Archaea constitute a monophyletic group, that is, a unique line of descent that is distinct from those of the Bacteria as well as the Eukarya; and (ii) whether the Archaea form a sister clade to the Eukarya or to the Bacteria. Both issues stem from difficulties involved in resolving the deep branches of the ToL (10, 11, 14).

The twin issues, first recognized in the 1980s based on single-gene (SSU rRNA) analyses, continue to be the subjects of a long-standing debate, which remains unresolved despite large-scale analyses of multi-gene datasets (5, 15-19). In addition to the choice of genes to be analyzed, the choice of the underlying character evolution model is at the core of contradictory results that support either the Three-domains tree (5, 19) or the Eocyte tree (17, 20). In many cases, adding more data, either as enhanced taxon (species) sampling or enhanced character (gene) sampling, or both, can resolve ambiguities (21, 22). However, as the taxonomic diversity and evolutionary distance among the taxa studied increase, the number of conserved marker-genes that can be used for phylogenomic analyses decreases. Accordingly, resolving the phylogenetic relationships of the Archaea, Bacteria and Eukarya is restricted to a small set of genes (50 at most), in spite of the large increase in the number of genomes sequenced and the associated development of sophisticated phylogenomic methods.

Based on a closer scrutiny of the recent phylogenomic datasets employed in the ongoing debate, I will show here that one of the reasons for this persistent ambiguity is that the ‘information’ necessary to resolve these conflicts is practically nonexistent in the standard marker-genes (i.e. core-genes) datasets employed routinely for phylogenomics. Further, I discuss analytical approaches that maximize the use of the information that is in genome sequence data and simultaneously minimize phylogenetic uncertainties. In addition, I discuss simple but important, yet undervalued, aspects of phylogenetic hypothesis testing, which together with the new approaches hold promise to resolve these long-standing issues effectively.

Results

Information in core genes is inadequate to resolve the archaeal radiation

Data-display networks (DDNs) are useful to examine and visualize character conflicts in phylogenetic datasets, especially in the absence of prior knowledge about the source of such conflicts, ideally before downstream processing of the data for phylogenetic inference (23, 24). While congruent data will be displayed as a tree in a DDN, incongruences are displayed as reticulations in the tree. Fig. 1A shows a neighbor-net analysis of the SSU rRNA alignment used to resolve the phylogenetic position of the recently discovered Asgard archaea (20). The DDN is based on character distances calculated as the observed genetic distance (p-distance) of 1,462 characters, and shows the total amount of conflict in the dataset that is incongruent with character bipartitions (splits). The edge (branch) lengths in the DDN correspond to the support for the respective splits. Accordingly, two well-supported sets of splits for the Bacteria and the Eukarya are observed. The Archaea, however, do not form a distinct, well-resolved, well-supported group, and are unlikely to correspond to a monophyletic group in a phylogenetic tree.
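The observed genetic distance (p-distance) that underlies these networks can be sketched in a few lines. The example below is only an illustration under stated assumptions: the three-taxon alignment is a hypothetical toy dataset, and columns containing gaps are skipped pairwise, a choice the text does not specify. The neighbor-net construction itself is not reproduced here.

```python
def p_distance(seq_a: str, seq_b: str, gap_chars: str = "-?") -> float:
    """Proportion of compared alignment columns at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must come from the same alignment")
    compared = 0
    differing = 0
    for a, b in zip(seq_a, seq_b):
        if a in gap_chars or b in gap_chars:
            continue  # assumption: columns with missing data are skipped pairwise
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no comparable columns")
    return differing / compared


# Hypothetical toy alignment (not data from the paper).
alignment = {
    "taxon_A": "ACGTACGTAC",
    "taxon_B": "ACGTACGTTC",
    "taxon_C": "ACGAAC-TTC",
}

names = sorted(alignment)
distances = {
    (x, y): p_distance(alignment[x], alignment[y])
    for i, x in enumerate(names)
    for y in names[i + 1:]
}

for pair, d in distances.items():
    print(pair, round(d, 3))
```

A distance matrix of this kind is the input that a split-network method would then decompose into weighted splits.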

Figure 1.

Data-display networks depicting the character conflicts in different datasets that employ different character types. (A) SSU rRNA alignment of 1,462 characters. Concatenated protein sequence alignment of (B) 29 core-genes, 8,563 characters; (C) 48 core-genes, 9,868 characters and (D) also 48 core-genes, 9,868 SR4 recoded characters (data simplified from 20 to 4 character-states). Each network is constructed from a neighbor-net analysis based on the observed genetic distance (p-distance) and displayed as an equal angle split network. Edge (branch) lengths correspond to the support for character bipartitions (splits), and reticulations in the tree correspond to character conflicts. Datasets in (A), (C) and (D) are from Ref. 20, and in (B) is from Ref. 17.

Likewise, the concatenated protein sequence alignment of the so-called ‘genealogy defining core of genes’ (25), a set of conserved single-copy genes, also does not support a unique archaeal lineage. Fig. 1B is a DDN derived from a neighbor-net analysis of 8,563 characters in 29 concatenated core-genes (17), while Fig. 1C,D is based on 9,868 characters in 44 concatenated core-genes (also from (20)). Even taken together, none of the standard marker-gene datasets is likely to support the monophyly of the Archaea, a key assertion of the three-domains hypothesis (26). Simply put, there is not enough information in the core-gene datasets to resolve the archaeal radiation, or to determine whether the Archaea are really unique compared to the Bacteria and Eukarya. However, other complex features (including molecular, biochemical and phenotypic characters, as well as ecological adaptations) support the uniqueness of the Archaea. These idiosyncratic archaeal characters include the subunit composition of supramolecular complexes like the ribosome and the DNA- and RNA-polymerases, the biochemical composition of cell membranes and cell walls, and physiological adaptations to energy-starved environments, among other things (27, 28).

Complex phylogenomic characters minimize uncertainties regarding the uniqueness of the Archaea

A nucleotide is the smallest possible locus, and an amino acid is a proxy for a locus of a nucleotide triplet. Unlike the elementary amino acid- or nucleotide-characters in the core-genes dataset (Fig. 1), the DDN in Fig. 2 is based on complex molecular characters – genomic loci that correspond to protein domains, typically ~200 amino acids (600 nucleotides) long. Neighbor-net analysis of protein-domain data coded as binary characters (presence/absence) is based on the Hamming distance (identical to the p-distance used in Fig. 1). Here the Archaea form a distinct, well-supported cluster, as do the Bacteria and the Eukarya.
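The binary coding described above can be sketched as follows. This is a minimal illustration, assuming hypothetical species names and domain identifiers (`species_1`, `d1`, and so on); the Hamming distance is normalized by the number of characters so that it is directly comparable to a p-distance.

```python
def code_presence_absence(domains_by_species):
    """Code each species as a 0/1 vector over the union of all observed domains."""
    all_domains = sorted(set().union(*domains_by_species.values()))
    matrix = {
        species: [1 if domain in domains else 0 for domain in all_domains]
        for species, domains in domains_by_species.items()
    }
    return all_domains, matrix


def hamming_fraction(row_a, row_b):
    """Fraction of binary characters at which two species differ."""
    return sum(a != b for a, b in zip(row_a, row_b)) / len(row_a)


# Hypothetical domain cohorts (placeholders, not data from the paper).
domains_by_species = {
    "species_1": {"d1", "d2", "d3"},
    "species_2": {"d1", "d2", "d4"},
    "species_3": {"d1", "d5"},
}

all_domains, matrix = code_presence_absence(domains_by_species)
print(all_domains)
print(hamming_fraction(matrix["species_1"], matrix["species_2"]))
```

Each column of the matrix is one genomic locus (a protein domain), so the number of characters grows with the number of distinct domains rather than with alignment length.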

Figure 2.

Data-display networks (DDN) depicting character conflicts among complex phylogenomic characters – genomic loci corresponding to protein-domains in this case. (A) Neighbor-net analysis based on Hamming distance (identical to the p-distance used in Fig.1) of 1,732 characters sampled from 141 species. (B) DDN based on an enriched taxon sampling of 81 additional species totaling 222 species and a modest increase to 1,738 characters. The dataset in (A) is from Ref. 10, which was updated with novel species to represent the recently described archaeal and bacterial species (5, 12, 20).

Fig. 2A is a DDN based on the dataset that includes protein-domain cohorts of 141 species, used in a phylogenomic analysis to resolve the uncertainties at the root of the ToL (29). Compared to the data in Fig. 1, the taxonomic diversity sampled for the Bacteria and Eukarya is more extensive, but that sampled for the Archaea is less extensive; the archaeal sample is composed of the traditional groups Euryarchaeota and Crenarchaeota. Fig. 2B is a DDN of an enriched sampling of 81 additional species, which includes representatives of the newly described archaeal groups: TACK (30), DPANN (5), and the Asgard group including the Lokiarchaeota (20). In addition, species sampling was enhanced with representatives from the candidate phyla described for Bacteria, and with unicellular species of Eukarya. The complete list of species analyzed is in SI Table 1.

Notably, the extension of the protein-domain cohort was insignificant, from 1,732 to 1,738 distinct domains (characters). Based on the well-supported splits in the DDN that form a distinct archaeal cluster, the Archaea are likely to be a monophyletic group (clade) in phylogenies inferred from these datasets.

Data quality affects model complexity required to explain phylogenetic datasets

Resolving the paraphyly or monophyly of the Archaea is relevant to determining whether the Eocyte tree (Fig. 3A) or the Three-domains tree (Fig. 3B), respectively, is a better-supported hypothesis. Recovering the Eocyte tree typically requires implementing complex models of sequence evolution rather than their relatively simpler versions (11). In general, complex models tend to fit the data better. For instance, according to a model selection test for the 29 core-genes dataset, the LG model (31) of protein sequence evolution is a better-fitting model than other standard models, such as the WAG or JTT substitution model (SI-Table 2), as reported previously (17). Further, a relatively more complex version of the LG model, with multiple rate-categories was found to be a better-fitting model than the simpler single-rate-category model (Fig. 3C; SI-Table 2). The fit of the data is estimated as the likelihood of the best tree given the model.

Figure 3.

Comparison of concatenated-gene trees derived from amino acid characters and genome trees derived from protein-domain characters. Branch support is shown only for the major branches. Scale bars represent the expected number of changes per character. (A), (B) Core-genes-tree derived from a better-fitting model (LG+G4) and a worse-fitting model (LG), respectively, of amino acid substitutions. (C) Model fit to data is ranked according to the log likelihood ratio (LLR) scores. LLR scores are computed as the difference from the best-fitting model (LG+G12) of the likelihood scores estimated in PhyML. Thus, larger LLR values indicate less support for that model/tree relative to the most-likely model/tree. Substitution rate heterogeneity is approximated with 4, 8 or 12 rate categories in the complex models, but with a single rate category in the simpler model. (D), (E) are genome-trees derived from a better-fitting model (Mk+G4) and a worse-fitting model (Mk), respectively, of protein-domain innovation. (F) Model fit to data is ranked according to log Bayes factor (LBF) scores, which like LLR scores are the log odds of the hypotheses. LBF scores are computed as the difference in likelihood scores estimated in MrBayes.

A complex, multiple rate-categories model accounts for site-specific substitution rate variation. Substitution-rate heterogeneity across different sites in the multiple-sequence alignment (MSA) was approximated using a discrete Gamma model with 4, 8 or 12 rate categories (LG+G4, LG+G8 or LG+G12, respectively). The Archaea is consistent with a paraphyletic group in trees derived from the rate-heterogeneous versions of the LG model (Fig. 3A). Furthermore, the fit of the data improves with the increase in complexity of the substitution model (Fig. 3C). Model complexity increases with any increase in the number of rate categories and/or the associated numbers of parameters that need to be estimated. However, with a relatively simpler version – a rate-homogeneous LG model, in which the substitution-rates are approximated to a single rate-category, the Archaea are consistent with a monophyletic group (Fig. 3B).
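The ranking of models by log likelihood ratio (LLR), computed as the difference from the best-fitting model, can be sketched as below. The log-likelihood values are placeholders invented for illustration only; they are not the values reported in Fig. 3C or SI-Table 2.

```python
def llr_scores(log_likelihoods):
    """LLR of each model relative to the best-fitting (highest log-likelihood) model."""
    best = max(log_likelihoods.values())
    return {model: best - lnl for model, lnl in log_likelihoods.items()}


# Placeholder log-likelihoods (hypothetical, for illustration only).
log_likelihoods = {
    "LG": -1000.0,
    "LG+G4": -950.0,
    "LG+G8": -948.0,
    "LG+G12": -947.5,
}

scores = llr_scores(log_likelihoods)
ranked = sorted(scores, key=scores.get)  # best-fitting model first (LLR = 0)

for model in ranked:
    print(model, scores[model])
```

Larger LLR values indicate less support for a model relative to the best-fitting one, which is how the scores are read in the figure.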

In contrast, trees inferred from the protein-domain datasets are consistent with monophyly of the Archaea irrespective of the complexity of the underlying model (Fig. 3D-F). The Mk model (Markov k model) is the best-known probabilistic model of discrete character evolution, particularly of complex characters coded as binary-state characters (32, 33). Since the Mk model assumes a stochastic process of evolution, it is able to estimate multiple state changes along the same branch. Implementing a simpler rate-homogeneous version of the Mk model (Fig. 3D), as well as more complex rate-heterogeneous versions with 4, 8 or 12 rate categories (Mk+G4, Mk+G8 or Mk+G12, respectively), also recovered trees that are consistent with the monophyly of the Archaea (Fig. 3E). The tree derived from the Mk+G4 model is shown in Fig. 3E. While the tree derived from the Mk+G8 model is identical (SI-Fig. 1) to the Mk+G4 tree, the Mk+G12 tree is almost identical, with minor differences in the bacterial sub-groups (SI-Fig. 2).
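How the Mk model accounts for multiple state changes along a branch can be seen from its transition probabilities. The sketch below uses the standard closed form for a k-state model with equal rates, with branch length measured in expected changes per character; the function name is an assumption made for this example, and rate heterogeneity (the +G variants) is not included.

```python
import math


def mk_transition_probs(k: int, t: float):
    """Return (P_same, P_diff) for a k-state Mk model on a branch of length t.

    t is the expected number of changes per character. P_diff is the
    probability of ending in one particular different state.
    """
    decay = math.exp(-k * t / (k - 1))
    p_same = 1 / k + (k - 1) / k * decay
    p_diff = 1 / k - decay / k
    return p_same, p_diff


# Binary (presence/absence) characters correspond to k = 2.
for branch_length in (0.0, 0.1, 1.0, 10.0):
    same, diff = mk_transition_probs(2, branch_length)
    print(branch_length, round(same, 4), round(diff, 4))
```

On long branches both probabilities approach 1/k, reflecting repeated, unobserved changes; a simple count of observed differences would not capture this.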

In all cases, the bipartition for the Archaea is strongly supported, with a posterior probability (PP) of 0.99, while those for the Bacteria and the Eukarya are supported with a PP of 1.0, in spite of substantially different fits of the data. The uniqueness of the Archaea is almost unambiguous in this case (but see next section).

Siblings and cousins are indistinguishable when reversible models are employed

Although a DDN is useful to identify and diagnose character conflicts in phylogenetic datasets and to postulate evolutionary hypotheses, a DDN by itself cannot be interpreted as an evolutionary network, because the edges do not necessarily represent evolutionary phenomena and the nodes do not represent ancestors (23, 24). Therefore, evolutionary relationships cannot be inferred from a DDN. Likewise, evolutionary relationships cannot be inferred from unrooted trees, even though nodes in an unrooted tree do represent ancestors and an evolution model defines the branches (see Fig. 4A).

Figure 4.

Effect of alternative ad hoc rootings on the phylogenetic classification of archaeal biodiversity. (A) An unrooted tree is not fully resolved into bipartitions at the root of the tree (i.e. a polytomous rather than a dichotomous root branching) and thus precludes identification of sister group relationships. It is common practice to add a user-specified root a posteriori based on prior knowledge (or belief) of the investigator. Four possible (of many) rootings R1-R4 are shown. (B) Operationally, adding a root (rooting) a posteriori amounts to adding new information – a new bipartition and an ancestor as well as an evolutionary polarity – that is independent of the source data. (C-F) The different possible evolutionary relationships of the Archaea to other taxa, depending on the position of the root, are shown. Rooting is necessary to determine the recency of common ancestry as well as the temporal order of key evolutionary transitions that define phylogenetic relationships.

An unrooted tree, unlike a rooted tree, is not an evolutionary (phylogenetic) tree per se, since it is a minimally defined hypothesis of evolution or of relationships; it is, nevertheless, useful to rule out many possible bipartitions and groups (34, 35). Given that a primary objective of phylogenetic analyses is to identify clades and the relationships between these clades, it is not possible to interpret an unrooted tree meaningfully without rooting the tree (see Fig. 4A). Identifying the root is essential to: (i) distinguish between ancestral and derived states of characters, (ii) determine the ancestor-descendant polarity of taxa, and (iii) diagnose clades and sister-group relationships (Fig. 4). Yet, most phylogenetic software constructs only unrooted trees, which are then consistent with several rooted trees (Fig. 4C-F). Moreover, an unrooted tree cannot be fully resolved into bipartitions, because an unresolved polytomy (a trifurcation in this case) exists near the root of the tree (Fig. 4A), which otherwise corresponds to the deepest split (root) in a rooted tree (Fig. 4C-F).

Resolving the polytomy requires identifying the root of the tree. The identity of the root corresponds, in principle, to any one of the possible ancestors as follows:

Any one of the inferred-ancestors at the resolved bipartitions (open circles in Fig. 4A), or
Any one of the yet-to-be-inferred-ancestors that lies along the stem-branches of the unresolved polytomy (dashed lines in Fig. 4A) or along the internal-branches.

In the latter case, rooting the tree a posteriori on any of the branches amounts to inserting an additional bipartition and an ancestor that is neither inferred from the source data nor deduced from the underlying character evolution model. Since standard evolution models employed routinely cannot resolve the polytomy, rooting, and hence interpreting the Tree of Life depends on:

Prior knowledge: e.g., fossils or a known sister-group (outgroup), or
Prior beliefs/expectations of the investigators: e.g., simple is primitive (36, 37), bacteria are primitive (38, 39), archaea are primitive (1), etc.

Both of these options are independent of the data used to infer the unrooted ToL. Some possible rootings and the resulting rooted-tree topologies are shown as cladograms in Fig. 4C-F. If the root lies on any of the internal branches (e.g. R1 in Fig. 4A-C), or corresponds to one of the internal nodes, within the archaeal radiation, the Archaea would not constitute a unique clade (Fig. 4C). However, if the root lies on one of the stem-branches (R2/R3/R4 in Fig. 4A, B), monophyly of the Archaea would be unambiguous (Fig. 4D-F). Determining the evolutionary relationship of the Archaea to other taxa, though, requires identifying the root.
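The dependence of monophyly on root position can be made concrete with a small sketch. The unrooted tree below is a hypothetical seven-taxon toy (labels `A1`-`A3`, `B1`-`B2`, `E1`-`E2` and the internal node names are assumptions for illustration, not the tree in Fig. 4). The code roots the same unrooted tree on different branches and checks whether the archaeal taxa form a clade.

```python
def build_adjacency(edges):
    """Undirected adjacency map of an unrooted tree."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def rooted_clades(adj, root_edge):
    """All clades (as leaf sets) of the tree obtained by rooting on root_edge."""
    clades = set()

    def descend(node, parent):
        children = [n for n in adj[node] if n != parent]
        leaves = {node} if not children else set()
        for child in children:
            leaves |= descend(child, node)
        clades.add(frozenset(leaves))
        return leaves

    u, v = root_edge
    descend(u, v)
    descend(v, u)
    return clades


def is_monophyletic(adj, root_edge, group):
    return frozenset(group) in rooted_clades(adj, root_edge)


# Hypothetical unrooted tree: a trifurcation at "c" with three stem branches.
edges = [
    ("B1", "b"), ("B2", "b"), ("b", "c"),
    ("E1", "e"), ("E2", "e"), ("e", "c"),
    ("a2", "c"), ("A3", "a2"), ("a1", "a2"), ("A1", "a1"), ("A2", "a1"),
]
adj = build_adjacency(edges)
archaea = {"A1", "A2", "A3"}

for label, edge in [("archaeal stem", ("c", "a2")),
                    ("bacterial stem", ("c", "b")),
                    ("eukaryal stem", ("c", "e")),
                    ("internal archaeal branch", ("a2", "a1"))]:
    print(label, is_monophyletic(adj, edge, archaea))
```

In this toy case, rooting on any of the three stem branches keeps the archaeal taxa together as a clade, whereas rooting on a branch inside the archaeal radiation does not, even though the unrooted topology is unchanged.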

Directional evolution models, unlike reversible models, are able to identify the polarity of state transitions, and thus the root of a tree (40-42). Therefore, the uncertainty due to a polytomous root branching is not an issue (Fig 5A). Moreover, directional evolution models are useful to evaluate the empirical support for prior beliefs about the universal common ancestor (UCA) at the root of the ToL (29). A Bayesian model selection test implemented to detect directional trends (42) chooses the directional model, overwhelmingly (Fig. 5B), over the unpolarized model for the protein-domain dataset in Fig. 2B, as reported previously for the dataset in Fig. 2A (29). Further, the best-supported rooting corresponds to root R4 (Fig. 4F and Fig. 5A) — monophyly of the Archaea is maximally supported (PP of 1.0). Furthermore, the sister-group relationship of the Archaea to the Bacteria is maximally supported (PP 1.0). Accordingly, a higher order taxon, Akaryotes, proposed earlier (Forterre 1992) forms a well-supported clade. Thus Akaryotes (or Akarya) and Eukarya are sister clades that diverge from the UCA at the root of the ToL, also as reported previously (29).
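One simple way to see why a reversible model cannot locate the root, while a directional one can, is to compare likelihoods for a two-taxon tree as the root slides along the path between the taxa. The sketch below is an illustration only, not the model of ref. 42: it assumes a two-state character (0 = absent, 1 = present), hypothetical gain and loss rates, and treats "directional" as meaning that the state frequencies at the root differ from the stationary frequencies.

```python
import math


def transition_matrix(gain, loss, t):
    """P(t) for a 2-state chain with rates 0->1 = gain and 1->0 = loss."""
    total = gain + loss
    pi0, pi1 = loss / total, gain / total  # stationary frequencies
    decay = math.exp(-total * t)
    return [
        [pi0 + pi1 * decay, pi1 * (1 - decay)],
        [pi0 * (1 - decay), pi1 + pi0 * decay],
    ]


def pair_likelihood(root_freqs, gain, loss, t_left, t_right, x, y):
    """Likelihood of observing states x and y at two tips joined at a root."""
    left = transition_matrix(gain, loss, t_left)
    right = transition_matrix(gain, loss, t_right)
    return sum(root_freqs[r] * left[r][x] * right[r][y] for r in (0, 1))


gain, loss = 0.5, 1.5              # hypothetical rates
stationary = (loss / (gain + loss), gain / (gain + loss))
all_absent_at_root = (1.0, 0.0)    # assumed directional root: character starts absent

# Two root placements on the same path (total length 0.6 in both cases).
placements = [(0.1, 0.5), (0.3, 0.3)]

for name, freqs in [("stationary root", stationary),
                    ("directional root", all_absent_at_root)]:
    values = [pair_likelihood(freqs, gain, loss, tl, tr, 0, 1)
              for tl, tr in placements]
    print(name, [round(v, 5) for v in values])
```

With stationary root frequencies the likelihood depends only on the total path length, so the data carry no information about root position; with non-stationary root frequencies the two placements give different likelihoods and can therefore be compared.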

Figure 5.

(A) Rooted tree of life inferred from patterns of inheritance of unique genomic-signatures. A dichotomous classification of the diversity of life such that Archaea is a sister group to Bacteria, which together constitute a clade of akaryotes (Akarya). Eukarya and Akarya are sister-clades that diverge from the root of the tree of life. Each clade is supported by the highest posterior probability of 1.0. The phylogeny supports a scenario of independent origins and descent of eukaryotes and akaryotes. (B) Model selection tests identify, overwhelmingly, directional evolution models to be better-fitting models. (C) Alternative rootings, and accordingly alternative classifications or scenarios for the origins of the major clades of life, are much less probable and not supported.

Alternative rootings are much less likely, and are not supported (Fig. 5C). Accordingly, independent origin of the eukaryotes as well as the akaryotes is the best-supported scenario. The Three-domains tree (root R3, Fig. 4E) is 10¹⁷¹ times less likely, and the scenario proposed by the Eocyte hypothesis (root R1, Fig. 4C) is highly unlikely. The common belief that simple is primitive, as well as beliefs that archaea are primitive or that archaea and bacteria evolved before eukaryotes, are not supported either.
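A statement such as "10¹⁷¹ times less likely" can be related to a log Bayes factor by a change of logarithm base. The sketch below assumes that the odds are obtained from a difference of natural-log marginal likelihoods, which is an assumption about how the figure was derived rather than something stated in the text.

```python
import math


def log10_odds_from_ln_bayes_factor(ln_bf: float) -> float:
    """Convert a natural-log Bayes factor into log10 odds."""
    return ln_bf / math.log(10)


def ln_bayes_factor_from_log10_odds(log10_odds: float) -> float:
    """Convert log10 odds into a natural-log Bayes factor."""
    return log10_odds * math.log(10)


# Odds of 10^171 correspond to a natural-log difference of roughly 394 units.
print(round(ln_bayes_factor_from_log10_odds(171), 1))
```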

Employing complex molecular characters maximizes representation of orthologous, non-recombining genomic loci, and thus phylogenetic signal

Genomic loci that can be aligned with high confidence using MSA algorithms are typically more conserved than those loci for which alignment uncertainty is high. Such ambiguously aligned regions of sequences are routinely trimmed off before phylogenetic analyses (43). Typically, the conserved well-aligned regions correspond to protein domains with highly ordered three-dimensional (3D) structures with specific 3D folds (Fig. 6A). Regions of sequences that are trimmed usually show higher variability in length, are less ordered and are known to accumulate insertion and deletion (indel) mutations at a higher frequency than in the regions that correspond to folded domains (44). These variable, structurally disordered regions, which flank the structurally ordered domains, link different domains in multi-domain proteins (Fig. 6A). Multi-domain architecture (MDA), the N-to-C terminal sequence of domain arrangement, is distinct for a protein family, and differs in closely related protein families with similar functions (Fig. 6A). The variation in MDA also relates to alignment uncertainties.

Figure 6.

Alignment uncertainty in closely related proteins due to domain recombination. (A) Multi-domain architecture (MDA) of the translational GTPase superfamily based on recombination of 8 modular domains. 57 distinct families with varying MDAs are known, of which 6 canonical families are shown as a schematic on the left and the corresponding 3D folds on the right. Amino acid sequences of only 2 of the 8 conserved domains can be aligned with confidence for use in phylogenetic analysis. The length of the alignment varies from 200 to 300 amino acids depending on the sequence diversity sampled (14, 76). The EF-Tu/EF-G paralogous pair employed as pseudo-outgroups for the classical rooting of the rRNA tree is highlighted. (B) Phyletic distribution of 1,738 out of the 2,000 distinct SCOP-domains sampled from 222 species used for phylogenetic analyses in the present study. About 70 percent of the domains are widely distributed across the sampled taxonomic diversity.

Figure 7.

Redundant representation of protein-domains in concatenated core-genes datasets. The P-loop NTP hydrolase domain is one of the most prevalent domains. Genomic loci corresponding to the P-loop hydrolase domain are represented 8-9 times in each species among the single-copy genes employed in core-genes multiple sequence alignments. Redundant loci in the core-genes datasets vary depending on the genes and species sampled for phylogenomic analyses.

A closer look at the 29 core-genes dataset shows that the concatenated-MSA corresponds to a total of 27 distinct protein domains or genomic loci (Table 1). The number of loci sampled from different species varies between 20 and 27, since not all loci are found in all species. While some loci are absent in some species, some loci are redundant. For instance, the P-loop NTP hydrolase domain, one of the most prevalent protein domains, is represented up to 9 times in many species (Table 1). Many central cellular functions are driven by the conformational changes in proteins induced by the hydrolysis of nucleoside triphosphate (NTP) catalyzed by the P-loop domain. Out of a total of 27 distinct domains, 7 are redundant, with two or more copies represented per species. Similarly, 9 of the 50 domains have a redundant representation in the 44 core-gene dataset (Table 1). The observed redundancy of the genomic loci in the core-genes alignments is inconsistent with the common (and typically untested) assumption of using single-copy genes as a proxy for orthologous loci sampled for phylogenetic analysis.
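The redundancy check described above amounts to counting how often each domain occurs across the genes of a concatenated dataset. The sketch below is a minimal illustration; the gene names and domain annotations are hypothetical placeholders and do not reproduce Table 1.

```python
from collections import Counter


def domain_counts(domains_per_gene):
    """Count how many times each domain is represented across all genes."""
    return Counter(
        domain for domains in domains_per_gene.values() for domain in domains
    )


def redundant_domains(domains_per_gene):
    """Domains represented by two or more copies in the concatenated dataset."""
    return {d: n for d, n in domain_counts(domains_per_gene).items() if n > 1}


# Hypothetical annotations for one species (placeholders only).
domains_per_gene = {
    "gene_1": ["P-loop NTP hydrolase", "domain_A"],
    "gene_2": ["P-loop NTP hydrolase"],
    "gene_3": ["domain_B"],
    "gene_4": ["P-loop NTP hydrolase", "domain_B"],
}

counts = domain_counts(domains_per_gene)
print("distinct loci:", len(counts))
print("redundant loci:", redundant_domains(domains_per_gene))
```

In this toy case four single-copy genes map to only three distinct domains, two of which are sampled more than once, which is the kind of discrepancy between gene counts and locus counts that the text points out.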

In contrast, the protein-domain datasets are composed of unique loci (Fig. 6B). Despite the superficial similarity of the DDNs in Fig. 1 and Fig. 2, they are both qualitatively and quantitatively different codings of genome sequences. As opposed to tracing the history of 30-50 loci in the standard core-genes datasets (Fig. 1), up to 60-fold (1,738 loci) more information can be represented when genome sequences are coded as protein-domain characters (Fig. 2). Currently 2,000 unique domains are described by SCOP (Structural Classification of Proteins) (45). The phyletic distribution of 1,738 domains identified in the 222 representative species sampled here is shown in a Venn diagram (Fig. 6B).
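A phyletic distribution of the kind summarized in a Venn diagram can be tabulated by recording, for each domain, the set of major groups in which it occurs. The sketch below uses hypothetical domain identifiers and group memberships; it does not reproduce the counts in Fig. 6B.

```python
def venn_regions(domains_by_group):
    """Count domains in each Venn region, keyed by the set of groups sharing them."""
    regions = {}
    for domain in set().union(*domains_by_group.values()):
        key = frozenset(
            group for group, domains in domains_by_group.items() if domain in domains
        )
        regions[key] = regions.get(key, 0) + 1
    return regions


# Hypothetical domain repertoires (placeholders only).
domains_by_group = {
    "Archaea": {"d1", "d2", "d3"},
    "Bacteria": {"d1", "d2", "d4"},
    "Eukarya": {"d1", "d5"},
}

regions = venn_regions(domains_by_group)
for groups, count in sorted(regions.items(), key=lambda item: -len(item[0])):
    print(sorted(groups), count)
```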

Discussion

Improving data quality can be more effective for resolving recalcitrant branches than increasing model complexity

In the phylogenetic literature, the concept of data quality refers to the quality or the strength of the phylogenetic signal that can be extracted from the data. The strength of the phylogenetic signal is proportional to the confidence with which unique state-transitions can be determined for a given set of characters on a given tree. Ideally, historically unique character transitions that entail rare evolutionary innovations are desirable, to identify patterns of uniquely shared innovations (synapomorphies) among lineages. Synapomorphies are the diagnostic features used for assessing lineage-specific inheritance of evolutionary innovations. Therefore identifying character transitions that are likely to be low probability events is a basic requirement for the accuracy of phylogenetic analysis.

In their pioneering studies, Woese and colleagues identified unique features of the SSU rRNA – [oligonucleotide] “signatures” – that were six nucleotides or longer, to determine evolutionary relationships (2). An underlying assumption was that the probability of occurrence of the same set of oligomer signatures by chance, in non-homologous sequences, is low in a large molecule like SSU rRNA (1500-2000 nucleotides). Oligomers shorter than six nucleotides were statistically less likely to be efficient markers of homology (46). Thus SSU rRNA was an information-rich molecule to identify homologous signatures (characters) useful for phylogenetic analysis.
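The statistical intuition behind the six-nucleotide threshold can be illustrated with a rough calculation. The sketch below is not the original analysis: it assumes a random sequence with independent, equally frequent bases, and computes the expected number of times one specific oligomer would occur by chance in a molecule of a given length.

```python
def expected_chance_matches(oligo_length: int, sequence_length: int,
                            alphabet_size: int = 4) -> float:
    """Expected chance occurrences of one specific oligomer in a random sequence.

    Assumes independent positions with equally frequent letters.
    """
    positions = sequence_length - oligo_length + 1
    return positions / alphabet_size ** oligo_length


# An SSU rRNA-sized molecule of 1,500 nucleotides.
for k in (4, 5, 6, 8):
    print(k, round(expected_chance_matches(k, 1500), 3))
```

Under these assumptions a given tetramer is expected several times by chance in a 1,500-nucleotide molecule, whereas a given hexamer is expected less than once, so longer oligomers that are shared between sequences are more plausibly homologous.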

However, as sequencing of full-length rRNAs and statistical models of nucleotide substitution became common, complex oligomer-characters were replaced by elementary nucleotide-characters, and more recently by amino acid characters. Identifying rare or historically unique substitutions in empirical datasets has proven to be difficult (47, 48); consequently, the uncertainty of resolving the deeper branches of the Tree of Life using marker-gene sequences remains high. A primary reason is the prevalence of phylogenetic noise (homoplasy) in primary sequence datasets (Fig. 1), due to the characteristic redundancy of nucleotide and amino acid substitutions and the resulting difficulty in distinguishing phylogenetic noise from signal (homology) (49, 50). Better-fitting (or best-fitting) models are expected to extract phylogenetic signal more efficiently and thus explain the data better, but tend to be more complex than worse-fitting models (Fig. 3C, F). Increasingly sophisticated statistical models that have been developed over the years have only marginally improved the situation (51, 52). Although increasing model complexity can correct errors of estimation and improve the fit of the data to the tree, it cannot supply phylogenetic signal that is not present in the source data.

Character recoding is found to be effective in reducing the noise/redundancy in the data, and thus uncertainties in phylogenetic reconstructions. This is a form of data simplification wherein the amino acid alphabet is reduced to a smaller alphabet in which amino acids that are frequently substituted for each other are grouped into a single state, usually a reduction from 20 to 6 states. Character recoding into reduced alphabets is useful in cases where compositional heterogeneity or substitution saturation is high. However, datasets in which phylogenetic noise is inherently limited are more desirable, to minimize ambiguities. Like amino acids, protein domains are also modular alphabets, albeit higher order and more complex alphabets of proteins. Moreover, unlike the 20 standard amino acids, there are approximately 2,000 unique protein domains identified at present according to SCOP (45). The number is expected to increase; the theoretical estimates range between 4,000 and 10,000 distinct domain modules, depending on the classification scheme (53). Coding features as binary characters is the simplest possible representation of data for describing historically unique events.
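Recoding amino acids into a reduced alphabet is a simple lookup. The sketch below uses the widely used Dayhoff six-state grouping as an example of a 20-to-6 reduction; which six-state scheme the text has in mind is not stated, so this choice is an assumption, and the four-state SR4 recoding used for Fig. 1D would follow the same pattern with different groups.

```python
# Dayhoff six-state grouping (assumed here as an example of 20-to-6 recoding).
DAYHOFF6_GROUPS = {
    "AGPST": "0",
    "DENQ": "1",
    "HKR": "2",
    "ILMV": "3",
    "FWY": "4",
    "C": "5",
}

# Flatten the groups into a per-residue lookup table.
RECODE_TABLE = {
    residue: state
    for group, state in DAYHOFF6_GROUPS.items()
    for residue in group
}


def recode(sequence: str, table=RECODE_TABLE, unknown: str = "-") -> str:
    """Recode an amino acid sequence; gaps and ambiguous residues become `unknown`."""
    return "".join(table.get(residue, unknown) for residue in sequence.upper())


print(recode("MKVLAC"))
```

Substitutions within a group become invisible after recoding, which removes some saturated or compositionally biased changes at the cost of discarding information.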

The idea of ‘oligonucleotide-signatures’ used for estimating a gene phylogeny has been extended, naturally, to infer a genome phylogeny (54). The signatures were defined in terms of protein-coding genes that were shared among the Archaea. However, as proteins are mosaics of domains, domains are unique genomic signatures (Fig. 6). Protein domains defined by SCOP correspond to complex ‘multi-dimensional signatures’ defined by: (i) a unique 3D fold, (ii) a distinct sequence profile, and (iii) a characteristic function. Though domain recombination is frequent, substitution of one protein domain for another has not been observed in homologous proteins (Fig. 6). For phylogenomic applications, protein domains are ‘sequence signatures’ that essentially correspond to single-copy orthologous loci when coded as binary-state characters (presence/absence). These sequence signatures are consistent with unique, non-recombining genomic loci, and are identified using sophisticated statistical models, namely profile hidden Markov models (pHMMs) (55, 56), that can be used routinely to annotate and curate genome sequences in automated pipelines (57, 58).

For these reasons, protein domains are ideal molecular phylogenetic markers for which character-homology can be validated through more than one property: statistically significant (i) sequence similarity, (ii) 3D structure similarity, and (iii) function similarity. In addition, employing genomic loci for protein domains maximizes the genomic information that can be employed for phylogenetic analysis. Even though many other genomic features are known to be useful markers (59), protein domains are the most conserved as well as the most widely applicable genomic characters (Fig. 6B).

Sorting vertical evolution (signal) and horizontal evolution (noise)

Single-copy genes are employed as phylogenetic markers to minimize phylogenetic noise caused by reticulate evolution, including hybridization, introgression, recombination, horizontal transfer (HT), duplication-loss (DL), or incomplete lineage sorting (ILS) of genomic loci. However, the noise observed in the DDNs based on MSA of core-genes (Fig. 1) cannot be directly related to any of the above genome-scale reticulations, since the characters are individual nucleotides or amino acids. Apart from stochastic character conflicts, the observed conflicts are better explained by convergent substitutions, given the redundancy of substitutions. Convergent substitutions, caused either by stringent selection or by chance, are a well-recognized form of homoplasy in gene-sequence data (47, 50, 60), and recent genome-scale analyses show that they are rampant (61, 62).

The observed noise in the DDNs based on protein-domain characters (Fig. 2), however, can be related directly to genome-scale reticulation processes and homoplasies. In general, homoplasy implies evolutionary convergence, parallelism or character reversals caused by multiple processes. In contrast, homology implies only one process: inheritance of traits that evolved in the common ancestor and were passed to its descendants. Operationally, tree-based assessment of homology requires tracing the phylogenetic continuity of characters (and states), whereas homoplasy manifests as discontinuities along the tree. Since clades are diagnosed on the basis of shared innovations (synapomorphies) and defined by ancestry (63, 64), accuracy of a phylogeny depends on an accurate assessment of homology — unambiguous identification of relative synapomorphies on a best fitting tree.

Identifying homoplasies caused by character reversals, i.e. reversals to ancestral states, requires identification of the ancestral state of the characters under study. However, implementing reversible models precludes the estimation of ancestral states in the absence of sister groups (outgroups) or other external references. Thus, the critical distinction between shared ancestral homology (symplesiomorphy) and shared derived homology (synapomorphy) is not possible with unrooted trees derived from standard reversible models. Hence, unrooted trees (Fig. 3) are not evolutionary (phylogenetic) trees per se, as they are uninformative about the evolutionary polarity (34, 35, 65). Thus, identifying the root (or root-state) is crucial to (i) determine the polarity of state transitions, (ii) identify synapomorphies, and (iii) diagnose clades.
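
Why a reversible model is uninformative about the root can be seen in a toy calculation. The Python sketch below, offered under stated assumptions rather than as the method used in this study, computes the likelihood of one binary character on a two-leaf tree under a symmetric two-state Mk model. With stationary root frequencies (0.5, 0.5) the likelihood depends only on the total path length between the leaves, so moving the root changes nothing. With a directional root-state assumption (here, the character is absent at the root, an illustrative choice only) the likelihood changes with root position. All function names and numerical values are hypothetical.

```python
import math
from typing import Sequence


def transition_prob(i: int, j: int, t: float) -> float:
    """Symmetric two-state Mk model: probability of state i changing to
    state j along a branch of length t (expected changes per character)."""
    same = 0.5 + 0.5 * math.exp(-2.0 * t)
    return same if i == j else 1.0 - same


def two_leaf_likelihood(
    x: int, y: int, t1: float, t2: float, root_freqs: Sequence[float]
) -> float:
    """Likelihood of observing states x and y at two leaves that are
    t1 and t2 away from the root, summing over the unknown root state."""
    return sum(
        root_freqs[r] * transition_prob(r, x, t1) * transition_prob(r, y, t2)
        for r in (0, 1)
    )


if __name__ == "__main__":
    stationary = (0.5, 0.5)    # reversible model at equilibrium
    absent_at_root = (1.0, 0.0)  # toy directional assumption
    # Same total path length (0.6), two different root positions.
    for t1, t2 in ((0.1, 0.5), (0.3, 0.3)):
        rev = two_leaf_likelihood(1, 0, t1, t2, stationary)
        direc = two_leaf_likelihood(1, 0, t1, t2, absent_at_root)
        print(f"root at ({t1}, {t2}): reversible={rev:.5f} directional={direc:.5f}")
```

Because the reversible likelihood is identical for every root position, the data cannot discriminate among rootings under that model; only a model in which the root state distribution differs from the stationary distribution makes the root position testable.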

Moreover, because clades are associated with the emergence and inheritance of evolutionary novelties, the discovery of clades is fundamental for describing and diagnosing sister group differences, which is a primary objective of modern systematics (66). A well-recognized deficiency of phylogenetic inference based on primary sequences is the abstraction of evolutionary ‘information’ (54), often into less tangible quantitative measures. For instance, ‘information’ relevant to diagnosing clades and support for clades is abstracted to branch lengths. Branch-length estimation is, ideally, a function of the source data and the underlying model. However, in the core-genes dataset the estimated branch lengths and the resulting tree are an expression of the model rather than of the data (Fig. 3 A, B). Some pertinent questions then are: should diagnosis of clades and the features by which clades are identified be delegated to, and restricted to, substitution mutations in a small set of loci and substitution models? Are substitution mutations in 40-50 loci more informative than the birth and death of unique genomic loci?

Proponents of the total evidence approach recommend that all relevant information — molecular, biochemical, anatomical, morphological, fossils — should be used to reconstruct evolutionary history, yet genome sequences are the most widely applicable data at present (59, 67). Accordingly, phylogenetic classification is, in practice, a classification of genomes. There is no a priori theoretical reason that phylogenetic inference should be restricted to a small set of genomic loci corresponding to the core genes, nor is there a reason for limiting phylogenetic models to interpreting patterns of substitution mutations alone. The ease of sequencing and the practical convenience of assembling large character matrices, by themselves, are no longer compelling reasons to adhere to the traditional marker gene analysis.

Annotations for reference genomes of homologous protein domains identified by SCOP and other protein-classification schemes, as well as tools for identifying corresponding sequence signatures, are readily available in public databases. An added advantage is that the biochemical function and molecular phenotype of the domains are readily accessible as well, through additional resources including the Protein Data Bank (PDB) and InterPro. For complex characters such as protein domains, character homology can be determined with high confidence using sophisticated statistical models (HMMs). Homology of a protein domain implies that the de novo evolution of a genomic locus corresponding to that protein domain is a unique historical event. Therefore, homoplasy due to convergences and parallelisms is highly improbable (68, 69). Although a handful of cases of convergent evolution of 3D structures are known, these instances relate to relatively simple 3D folds coded for by relatively simple sequence repeats (70).

However, the vast majority of domains identified by SCOP correspond to polypeptides that are on average 200 residues long with unique sequence profiles (57, 68). Thus, identifying homoplasy in the protein-domain datasets depends largely on estimating reversals, which in this case will be cases of secondary gains/losses; for instance gain-loss-regain events caused by DL-HT or HT. Such secondary gains are more likely to correspond to HT events than to convergent evolution, for reasons specified above. Instances of reversals are minimal, as seen from the strong directional trends detected in the data (Fig. 5B and Fig. 6B).

Vertical and horizontal classification

For decades, biologists have been faced with a choice between so-called horizontal (Linnean) and vertical (Darwinian) classification of biodiversity (71). What the two schools of systematics have in common is the identification of “signatures”, or sets of characteristic features, that codify evolutionary relationships (54, 63, 71). But the former emphasizes the unity of contemporary groups, i.e. those at a similar evolutionary state, and therefore separates ancestors from descendants, while the latter emphasizes the unity of the ancestors and separates descendants that diverge from a common ancestry (71). Vertical classification is more consistent with the concept of lineal descent, and is the predominant paradigm for which the operational methodology and the algorithmic logic were laid out as the principles of phylogenetic systematics (63, 72). Accordingly, determining the ancestor-descendant polarity, starting from the universal common ancestor (UCA) at the root of the Tree of Life, is crucial to accurately reconstructing the path of evolutionary descent.

The classical rooting of the (rRNA) ToL based on the EF-Tu/EF-G paralogous pair (73, 74) is known to be error-prone and highly ambiguous, due to LBA artifacts (14, 75). Remarkably, sequences corresponding to only one of the two conserved domains common to EF-Tu and EF-G (200 residues in the P-loop-containing NTP hydrolase domain (Fig. 5A)) can be aligned with confidence (14). Implementing better-fitting substitution models results in two alternative rootings (R1 and R4 in Fig. 5), which relate to distinct, irreconcilable scenarios (14) similar to the scenarios in Fig. 4C and 4F. Moreover, EF-Tu and EF-G are only 2 of the 57 known paralogs of the translational GTPase protein superfamily (76). Thus the assumption that the EF-Tu/EF-G duplication is a unique event, which is essential for the paralogous outgroup-rooting method, is untenable.

In the absence of prior knowledge of outgroups or of fossils, rooting the Tree of Life is arguably one of the most difficult phylogenetic problems. Incorrect rooting may lead to profoundly misleading conclusions about evolutionary scenarios and taxonomic affinities, and it appears to be common in phylogenetic studies (77). Perhaps worse yet seems to be the preponderance of subjective a posteriori rooting based on untested preconceptions (e.g. (78, 79)) and scenario-driven erection of taxonomic ranks (e.g. (1, 30)) (80). The conventional practice of a posteriori rooting, wherein an unrooted tree is converted into a rooted tree by adding an ad hoc root, encourages a subjective interpretation of the ToL. For example, the so-called bacterial rooting of the ToL (root R3; Fig. 4) is the preferred rooting hypothesis to interpret the ToL even though that rooting is not well supported (14).

Untangling data bias, model bias and investigator bias (prior beliefs)

Phylogenies, and hence the taxonomies and evolutionary scenarios they support, are falsifiable hypotheses. Statistical hypothesis testing is now an integral part of phylogenetic inference, to quantify the empirical evidence in support of the various plausible evolutionary scenarios. However, common statistical models implemented for phylogenomic analyses are limited to modeling variation in patterns of point mutations, particularly substitution mutations. These statistical models are intimately linked to basic concepts of molecular evolution, such as the universal molecular clock (3), the universal chronometer (78), paralogous outgroup rooting (81), etc., which are gene-centric concepts that were developed to study the gene, during the age of the gene. Moreover, these idealized notions originated from the analyses of relatively small single-gene datasets.

Conventional phylogenomics of multi-locus datasets is a direct extension of the concepts and methods developed for single-locus datasets, which rely exclusively on substitution mutations (50). In contrast, the fundamental concepts of phylogenetic theory (homology, synapomorphy, homoplasy, character polarity, etc.), even if idealized, are more generally applicable. Moreover, they appear to be better suited to unique and complex genomic characters than to redundant, elementary sequence characters, with regard to determining both the qualitative and the statistical consistency of the data and the underlying assumptions.

Phylogenetic theory that was developed to trace the evolutionary history of organismal species, as well as related methods of discrete character analysis for classifying organismal families (63, 82), was adopted, although not entirely, to determine the evolution and classification of gene families (1, 3). The discovery and initial description of the Archaea was based on the comparative analysis of a single-gene (rRNA) family. However, in spite of the large number of characters that can be analyzed, neither the rRNA genes nor multi-gene concatenations of core-genes have proved to be efficient phylogenetic markers to reliably resolve the evolutionary history and phylogenetic affinities of the Archaea (83, 84).

Uncertainties and errors in phylogenetic inference are primarily errors in adequately distinguishing homologous similarities from homoplastic similarities (34, 50, 85). Homologies, synapomorphies and homoplasies are qualitative inferences, yet are inherently statistical (probabilistic). The probabilistic framework (maximum likelihood and Bayesian methods) has proven to be powerful for quantifying uncertainties and testing alternative hypotheses. Log odds ratios, such as the LLR and LBF, are measures of how one changes belief in a hypothesis in light of new evidence (86). By these measures, directional evolution models are better explanations of the observed distribution of genomic characters than reversible models, and such directional trends overwhelmingly support the monophyly of the Archaea, as well as the sisterhood of the Archaea and the Bacteria, i.e. monophyly of Akarya (Fig. 6).

Data quality is at least as important as the evolution models that are posited to explain the data. Although sophisticated statistical tests for evaluating tree robustness, and for selecting character-evolution models, are becoming a standard feature of phylogenetic software (e.g. IQ-TREE, MrBayes, PhyloBayes), tests for character evaluation are not common. Routines for collecting and curating data upstream of phylogenetic analyses are rather eclectic. Besides, it is an open question whether qualitatively different datasets (as in Fig. 1 and Fig. 2) can be compared effectively. Nevertheless, employing DDNs and other tools of exploratory data analysis could be useful to identify conflicts that arise due to data collection and/or curation errors (23, 24).

Conclusions

The Tree of Life is primarily a phylogenetic classification that is invaluable for organizing and describing the evolution of biodiversity, explicated through evolutionary scenarios. Phylogenies are hypotheses that mostly relate to extinct ancestors, while taxonomies are hypotheses that largely relate to extant species. Extant species contain distinct combinatorial mosaics of ancestral features (plesiomorphies) and evolutionary novelties (apomorphies). It is remarkable that the uniqueness of the Archaea was identified by the comparative analyses of oligonucleotide signatures in a single gene dataset (1). However, the same is not true of the phylogenetic classification of the Archaea, based on marker-genes and reversible evolution models that rely exclusively on point mutations, specifically substitution mutations, which may not be ideal phylogenetic markers (59).

The Three-domains of Life hypothesis (26), which was initially based on the interpretation of an unrooted rRNA tree (of life) (1), was put forward largely to emphasize the uniqueness of the Archaea, ascribed to an exclusive lineal descent. Although many lines of evidence, molecular or otherwise, support the uniqueness of the Archaea, phylogenetic analysis of genomic signatures supports neither the presumed primitive state of Archaea or Bacteria nor the common belief that Archaea and Bacteria are ancestors of Eukarya (1, 11, 39, 87). Models of evolution of genomic features support a Two-domains (or rather two empires) of Life hypothesis (9), as well as the independent origins and parallel descent of eukaryote and akaryote species (10, 14, 88, 89).

Data and methods

Data collection and curation

Marker domains datasets

Character matrices of homologous protein-domains, coded as binary-state characters, were assembled from genome annotations of SCOP-domains available through the SUPERFAMILY HMM library and genome assignments server, v. 1.75 (http://supfam.org/SUPERFAMILY/) (57, 90).

The 141-species dataset was obtained from a previous study (29). It was updated with representatives of novel species described recently, largely archaeal species from the TACK group (30), the DPANN group (5) and the Asgard group, including the Lokiarchaeota (20). In addition, species sampling was enhanced with representatives from the candidate phyla (unclassified) described for bacterial species and with unicellular species of eukaryotes, to a total of 222 species. The complete list of the species with their respective Taxonomy IDs is available in SI Table 1.

When genome annotations were unavailable from the SUPERFAMILY database, curated reference proteomes were obtained from the Universal Protein Resource (http://www.uniprot.org/proteomes/). SCOP-domains were annotated using the HMM library and the genome annotation tools and routines recommended by the SUPERFAMILY resource.
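
The binary (presence/absence) coding of domain annotations described in this section can be sketched as follows. This is a minimal illustration, not the pipeline used for the study: it assumes the domain assignments have already been parsed into a mapping from genome name to the set of domain identifiers found in that genome. The function name build_presence_absence_matrix and the toy identifiers (genome_1, sf_A, and so on) are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def build_presence_absence_matrix(
    assignments: Dict[str, Set[str]],
) -> Tuple[List[str], List[str], List[List[int]]]:
    """Code domain annotations as binary characters.

    Rows are taxa (genomes), columns are domains; a cell is 1 when the
    domain is annotated at least once in the genome and 0 otherwise,
    so copy number is deliberately ignored.
    """
    taxa = sorted(assignments)
    domains = sorted(set().union(*assignments.values())) if assignments else []
    matrix = [
        [1 if domain in assignments[taxon] else 0 for domain in domains]
        for taxon in taxa
    ]
    return taxa, domains, matrix


if __name__ == "__main__":
    # Hypothetical toy annotations, for illustration only.
    toy = {
        "genome_1": {"sf_A", "sf_B"},
        "genome_2": {"sf_B", "sf_C"},
        "genome_3": {"sf_A"},
    }
    taxa, domains, matrix = build_presence_absence_matrix(toy)
    print("domains:", domains)
    for taxon, row in zip(taxa, matrix):
        print(taxon, "".join(map(str, row)))
```

Sorting the taxa and the domains keeps the row and column order reproducible between runs, which matters when the matrix is exported for downstream phylogenetic software.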

Marker genes datasets

Marker gene datasets from previous studies were obtained as follows: (i) the 29 core-genes alignment (17), and (ii) the SSU rRNA alignment and the 48 core-genes alignments (20).

Exploratory data analysis

DDNs were constructed with SplitsTree v. 4.14. Split networks were computed using the NeighborNet method from the observed P-distances of the taxa for both nucleotide and amino acid characters. Split networks of the protein-domain characters were computed from the Hamming distance, which is identical to the P-distance. The networks were drawn with the equal angle algorithm.
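
For binary characters, the distance referred to above can be computed as in the following minimal Python sketch. It takes the P-distance to be the proportion of characters at which two taxa differ, i.e. the Hamming distance divided by the number of characters; that normalization is an assumption made here for clarity. The sketch produces only the pairwise distance matrix and does not implement NeighborNet or any part of SplitsTree. Function names and the toy rows are hypothetical.

```python
from typing import List, Sequence


def p_distance(a: Sequence[int], b: Sequence[int]) -> float:
    """Proportion of characters at which two taxa differ
    (Hamming distance normalized by the number of characters)."""
    if len(a) != len(b):
        raise ValueError("character rows must have equal length")
    if len(a) == 0:
        raise ValueError("character rows must not be empty")
    mismatches = sum(1 for x, y in zip(a, b) if x != y)
    return mismatches / len(a)


def distance_matrix(rows: Sequence[Sequence[int]]) -> List[List[float]]:
    """Symmetric matrix of pairwise P-distances between all taxa."""
    n = len(rows)
    return [[p_distance(rows[i], rows[j]) for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    # Hypothetical presence/absence rows for three taxa.
    rows = [
        [1, 1, 0, 0],
        [1, 0, 0, 1],
        [1, 1, 0, 0],
    ]
    for row in distance_matrix(rows):
        print(row)
```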

Phylogenetic analyses

Concatenated gene tree inference: Extensive analyses of the concatenated core-genes datasets are reported in the original studies (17, 20). Analysis here was restricted to the 29 core-genes dataset because of its smaller taxon sampling (44 species) compared to the 48 core-genes dataset (96 species): there is little difference in data quality, but the computational time and resources required are significantly lower. Moreover, the general conclusions based on these datasets are consistent despite the smaller taxon sampling, particularly of archaeal species (26 as opposed to 64 in the larger sampling).

Best-fitting amino acid substitution models were chosen using Smart Model Selection (SMS) (91), which is compatible with PhyML tree inference methods (92). Trees were estimated with a rate-homogeneous LG model as well as rate-heterogeneous versions of the LG model. Site-specific rate variation was approximated using the gamma distribution with 4, 8 and 12 rate categories (LG+G4, LG+G8 and LG+G12, respectively). More complex models (SI Table 2) that account for invariable sites (LG+GX+I) and/or compute alignment-specific state frequencies (LG+GX+F) were also used, but the trees inferred were identical to the trees estimated from LG+GX models, and are therefore not reported here. The log likelihood ratio (LLR) was calculated as the difference in the raw log likelihoods of the models being compared.

Genome tree inference: The Mk model (32) is the most widely implemented model for phylogenetic inference in the probabilistic framework (maximum likelihood (ML) and Bayesian methods) applicable to complex features coded as binary characters. However, only the reversible model is implemented in ML methods at present. Both reversible and directional evolution models, as well as the model selection routines implemented in MrBayes 3.2 (42, 93), were used. The Metropolis-coupled MCMC algorithm was used with two chains, sampling every 500th generation. The first half of the generations was discarded as burn-in. MCMC sampling was run until convergence, unless mentioned otherwise. Convergence was assessed through the average standard deviation of split frequencies (ASDSF, less than 0.01) for tree topology and the potential scale reduction factor (PSRF, equal to 1.00) for scalar parameters, unless mentioned otherwise. Bayes factors for model comparison were calculated using the harmonic mean estimator in MrBayes. The log Bayes factor (LBF) was calculated as the difference in the (harmonic mean) log likelihoods of the models being compared.
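
The model comparison described above can be sketched in a few lines of Python. The example assumes that post-burn-in log likelihood samples from each MCMC run are available as plain lists of numbers; it computes the log of the harmonic mean of the sampled likelihoods using the log-sum-exp trick to avoid numerical underflow, and takes the LBF as the difference between two such values. This is a minimal sketch of the estimator, not a reproduction of the MrBayes implementation, and the function names and sample values are hypothetical.

```python
import math
from typing import Sequence


def log_harmonic_mean(log_likelihoods: Sequence[float]) -> float:
    """Log of the harmonic mean of likelihoods, given sampled log likelihoods.

    harmonic mean = n / sum(1 / L_i), so
    log(harmonic mean) = log(n) - logsumexp(-log L_i).
    """
    if not log_likelihoods:
        raise ValueError("at least one sample is required")
    negated = [-value for value in log_likelihoods]
    largest = max(negated)
    # log-sum-exp: factor out the largest term before exponentiating.
    log_sum = largest + math.log(sum(math.exp(v - largest) for v in negated))
    return math.log(len(log_likelihoods)) - log_sum


def log_bayes_factor(
    samples_model_1: Sequence[float], samples_model_0: Sequence[float]
) -> float:
    """LBF of model 1 over model 0; positive values favor model 1."""
    return log_harmonic_mean(samples_model_1) - log_harmonic_mean(samples_model_0)


if __name__ == "__main__":
    # Hypothetical log likelihood samples for two competing models.
    directional = [-1000.2, -1001.5, -999.8, -1000.9]
    reversible = [-1020.4, -1019.7, -1021.1, -1020.0]
    print(f"LBF = {log_bayes_factor(directional, reversible):.2f}")
```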

Convergence between independent runs was generally slower for directional models compared to the reversible models. When convergence was extremely slow (requiring more than 100 million generations) topology constraints corresponding to the clusters derived in the unrooted trees (Fig. 3E) were applied to improve convergence rates. In general these clusters/constraints corresponded to named taxonomic groups e.g. Fungi, Metazoa, Crenarchaeota, etc. Convergence assessment between independent runs was relaxed for three specific cases that did not converge at the time of submission: the unrooted tree with Mk-uniform-rates model (ASDSF 0.05; PSRF 1.03), rooted trees corresponding to root-R2 (ASDSF 0.5; PSRF 1.04) and root-R3 (ASDSF 0.029; PSRF 1.03). In the three cases specified, the difference in bipartitions is in the shallow parts (minor branches) of the tree. For assessing well supported major branches of the tree, ASDSF values between 0.01 and 0.05 may be adequate, as recommended by the authors (94).
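
The ASDSF diagnostic used above can be illustrated with a short Python sketch. It assumes that each independent run has been summarized as a mapping from a split (bipartition, represented here as a frozenset of taxon names on one side) to its posterior frequency. Two details are assumptions of this sketch rather than statements about MrBayes: splits are included only if their frequency reaches min_freq (set to 0.1 here) in at least one run, and the sample standard deviation is used. The function name asdsf and the toy frequencies are hypothetical.

```python
from statistics import stdev
from typing import Dict, FrozenSet, List

Split = FrozenSet[str]


def asdsf(runs: List[Dict[Split, float]], min_freq: float = 0.1) -> float:
    """Average standard deviation of split frequencies across runs.

    A split missing from a run is treated as having frequency 0 in that run.
    Splits that never reach min_freq in any run are ignored (assumed cutoff).
    """
    if len(runs) < 2:
        raise ValueError("at least two independent runs are required")
    all_splits = set().union(*(run.keys() for run in runs))
    deviations = []
    for split in all_splits:
        freqs = [run.get(split, 0.0) for run in runs]
        if max(freqs) >= min_freq:
            deviations.append(stdev(freqs))  # sample standard deviation
    return sum(deviations) / len(deviations) if deviations else 0.0


if __name__ == "__main__":
    ab, cd, ac = frozenset({"A", "B"}), frozenset({"C", "D"}), frozenset({"A", "C"})
    run_1 = {ab: 1.0, cd: 0.6}
    run_2 = {ab: 1.0, cd: 0.4, ac: 0.05}
    print(f"ASDSF = {asdsf([run_1, run_2]):.4f}")
```

Values near zero indicate that independent runs have sampled splits at similar frequencies, which is why a threshold such as 0.01 is used as evidence of topological convergence.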

Funding

This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors. Work by this author was partially supported by The Swedish Research Council (to Måns Ehrenberg) and the Knut and Alice Wallenberg Foundation, RiboCORE (to Måns Ehrenberg and Dan Andersson).

Acknowledgements

I am grateful to Charles (Chuck) Kurland and Måns Ehrenberg for support and encouragement. I thank Chuck Kurland and Siv Andersson for the discussions in general; Chuck for the many stimulating debates and Siv for inspiring the article title, in part; Seraina Klopfstein for providing the algorithms for implementing the directional model in MrBayes and for helpful suggestions and Erling Wikman for help with computing equipment.

References

1. Woese CR, Fox GE. Phylogenetic structure of the prokaryotic domain: the primary kingdoms. Proceedings of the National Academy of Sciences. 1977;74(11):5088-90.
2. Woese CR. The Archaeal Concept and the World it Lives in: A Retrospective. Photosynthesis Research. 2004;80(1):361-72.
3. Zuckerkandl E, Pauling L. Molecules as documents of evolutionary history. Journal of theoretical biology. 1965;8(2):357-66.
4. Ragan MA, Bernard G, Chan CX. Molecular phylogenetics before sequences. RNA Biology. 2014;11(3):176-85.
5. Rinke C, Schwientek P, Sczyrba A, Ivanova NN, Anderson IJ, Cheng J-F, et al. Insights into the phylogeny and coding potential of microbial dark matter. Nature. 2013;499(7459):431-7.
6. Sunagawa S, Coelho LP, Chaffron S, Kultima JR, Labadie K, Salazar G, et al. Structure and function of the global ocean microbiome. Science. 2015;348(6237):1261359.
7. Wu D, Wu M, Halpern A, Rusch DB, Yooseph S, Frazier M, et al. Stalking the Fourth Domain in Metagenomic Data: Searching for, Discovering, and Interpreting Novel, Deep Branches in Marker Gene Phylogenetic Trees. PLOS ONE. 2011;6(3):e18011.
8. Boyer M, Madoui M-A, Gimenez G, La Scola B, Raoult D. Phylogenetic and Phyletic Studies of Informational Genes in Genomes Highlight Existence of a 4th Domain of Life Including Giant Viruses. PLOS ONE. 2010;5(12):e15530.
9. Mayr E. Two empires or three? Proceedings of the National Academy of Sciences of the United States of America. 1998;95(17):9720-3.
10. Harish A, Tunlid A, Kurland CG. Rooted phylogeny of the three superkingdoms. Biochimie. 2013;95(8):1593-604.
11. Williams TA, Foster PG, Cox CJ, Embley TM. An archaeal origin of eukaryotes supports only two primary domains of life. Nature. 2013;504(7479):231-6.
12. Hug LA, Baker BJ, Anantharaman K, Brown CT, Probst AJ, Castelle CJ, et al. A new view of the tree of life. Nature Microbiology. 2016;1:16048.
13. Parks DH, Rinke C, Chuvochina M, Chaumeil P-A, Woodcraft BJ, Evans PN, et al. Recovery of nearly 8,000 metagenome-assembled genomes substantially expands the tree of life. Nature Microbiology. 2017.
14. Gouy R, Baurain D, Philippe H. Rooting the tree of life: the phylogenetic jury is still out. Phil Trans R Soc B. 2015;370(1678):20140329.
15. Lake JA. An alternative to archaebacterial dogma. Nature. 1986;319(6055):626-.
16. Tourasse NJ, Gouy M. Accounting for evolutionary rate variation among sequence sites consistently changes universal phylogenies deduced from rRNA and protein-coding genes. Molecular phylogenetics and evolution. 1999;13(1):159-68.
17. Williams TA, Embley TM. Archaeal “dark matter” and the origin of eukaryotes. Genome Biology and Evolution. 2014;6(3):474-81.
18. Spang A, Saw JH, Jørgensen SL, Zaremba-Niedzwiedzka K, Martijn J, Lind AE, et al. Complex archaea that bridge the gap between prokaryotes and eukaryotes. Nature. 2015;521(7551):173-9.
19. Da Cunha V, Gaia M, Gadelle D, Nasir A, Forterre P. Lokiarchaea are close relatives of Euryarchaeota, not bridging the gap between prokaryotes and eukaryotes. PLOS Genetics. 2017;13(6):e1006810.
20. Zaremba-Niedzwiedzka K, Caceres EF, Saw JH, Bäckström D, Juzokaite L, Vancaester E, et al. Asgard archaea illuminate the origin of eukaryotic cellular complexity. Nature. 2017;541(7637):353-8.
21. Zwickl DJ, Hillis DM. Increased taxon sampling greatly reduces phylogenetic error. Systematic Biology. 2002;51(4):588-98.
22. Salichos L, Rokas A. Inferring ancient divergences requires genes with strong phylogenetic signals. Nature. 2013;497(7449):327-31.
23. Morrison DA. Using data-display networks for exploratory data analysis in phylogenetic studies. Molecular Biology and Evolution. 2009;27(5):1044-57.
24. Huson DH, Bryant D. Application of phylogenetic networks in evolutionary studies. Mol Biol Evol. 2006;23.
25. Woese CR. On the evolution of cells. Proceedings of the National Academy of Sciences of the United States of America. 2002;99(13):8742-7.
26. Woese CR, Kandler O, Wheelis ML. Towards a natural system of organisms: Proposal for the domains Archaea, Bacteria, and Eucarya. Proceedings of the National Academy of Sciences of the United States of America. 1990;87(12):4576-9.
27. Garrett RA. Molecular evolution: The uniqueness of Archaebacteria. Nature. 1985;318:233-5.
28. Valentine DL. Adaptations to energy stress dictate the ecology and evolution of the Archaea. Nature Reviews Microbiology. 2007;5(4):316-23.
29. Harish A, Kurland CG. Akaryotes and Eukaryotes are independent descendants of a universal common ancestor. Biochimie. 2017;138:168-83.
30. Guy L, Ettema TJG. The archaeal TACK superphylum and the origin of eukaryotes. Trends in microbiology. 2011;19(12):580-7.
31. Le SQ, Gascuel O. An Improved General Amino Acid Replacement Matrix. Molecular Biology and Evolution. 2008;25(7):1307-20.
32. Lewis PO. A Likelihood Approach to Estimating Phylogeny from Discrete Morphological Character Data. Systematic Biology. 2001;50(6):913-25.
33. Wright AM, Hillis DM. Bayesian Analysis Using a Simple Likelihood Model Outperforms Parsimony for Estimation of Phylogeny from Discrete Morphological Data. PLOS ONE. 2014;9(10):e109210.
34. Morrison DA. Phylogenetic Analyses of Parasites in the New Millennium. Advances in Parasitology. 2006. p. 1-124.
35. Wiley EO, Lieberman BS. Phylogenetics: theory and practice of phylogenetic systematics: John Wiley & Sons; 2011.
36. Whittaker RH. New concepts of kingdoms of organisms. Science. 1969;163(3863):150-60.
37. Nasir A, Kim K, Caetano-Anolles G. Giant viruses coexisted with the cellular ancestors and represent a distinct supergroup along with superkingdoms Archaea, Bacteria and Eukarya. BMC Evolutionary Biology. 2012;12(1):156.
38. Stanier RY, Niel Cv. The concept of a bacterium. Archives of Microbiology. 1962;42(1):17-35.
39. Sagan L. On the origin of mitosing cells. Journal of theoretical biology. 1967;14(3):225-75.
40. Yang Z, Roberts D. On the use of nucleic acid sequences to infer early branchings in the tree of life. Molecular Biology and Evolution. 1995;12(3):451-8.
41. Huelsenbeck JP, Bollback JP, Levine AM. Inferring the root of a phylogenetic tree. Systematic biology. 2002;51(1):32-43.
42. Klopfstein S, Vilhelmsen L, Ronquist F. A Nonstationary Markov Model Detects Directional Evolution in Hymenopteran Morphology. Systematic Biology. 2015;64(6):1089-103.
43. Criscuolo A, Gribaldo S. BMGE (Block Mapping and Gathering with Entropy): a new software for selection of phylogenetic informative regions from multiple sequence alignments. BMC Evolutionary Biology. 2010;10(1):210.
44. Light S, Sagit R, Sachenkova O, Ekman D, Elofsson A. Protein expansion is primarily due to indels in intrinsically disordered regions. Molecular Biology and Evolution. 2013;30(12):2645-53.
45. Murzin AG, Brenner SE, Hubbard T, Chothia C. SCOP: A structural classification of proteins database for the investigation of sequences and structures. Journal of Molecular Biology. 1995;247(4):536-40.
46. Woese CR, Fox GE, Zablen L, Uchida T, Bonen L, Pechman K, et al. Conservation of primary structure in 16S ribosomal RNA. Nature. 1975;254(5495):83-6.
47. Rokas A, Carroll SB. Frequent and widespread parallel evolution of protein sequences. Molecular Biology and Evolution. 2008;25(9):1943-53.
48. Parker J, Tsagkogeorga G, Cotton JA, Liu Y, Provero P, Stupka E, et al. Genome-wide signatures of convergent evolution in echolocating mammals. Nature. 2013;502(7470):228-31.
49. Rokas A, Carroll SB. Bushes in the tree of life. PLoS Biology. 2006;4(11):1899-904.
50. Philippe H, Roure B. Difficult phylogenetic questions: more data, maybe; better methods, certainly. BMC Biology. 2011;9(1):1–4.
51.↵
Shen X-X, Hittinger CT, Rokas A. Contentious relationships in phylogenomic studies can be driven by a handful of genes. Nature ecology & evolution. 2017;1(5):0126.
OpenUrl
52.↵
Springer MS, Gatesy J. On the importance of homology in the age of phylogenomics. Systematics and Biodiversity. 2017:1-19.
53.↵
Govindarajan S, Recabarren R, Goldstein RA. Estimating the total number of protein folds. Proteins: Structure, Function and Genetics. 1999;35(4):408-14.
OpenUrl
54.↵
Graham DE, Overbeek R, Olsen GJ, Woese CR. An archaeal genomic signature. Proceedings of the National Academy of Sciences. 2000;97(7):3304-8.
OpenUrl Abstract/FREE Full Text
55.↵
Park J, Karplus K, Barrett C, Hughey R, Haussler D, Hubbard T, et al. Sequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methods. Journal of Molecular Biology. 1998;284(4):1201-10.
OpenUrl CrossRef PubMed Web of Science
56.↵
Eddy SR. Accelerated profile HMM searches. PLoS Computational Biology. 2011;7(10).
57.↵
Gough J, Karplus K, Hughey R, Chothia C. Assignment of homology to genome sequences using a library of hidden Markov models that represent all proteins of known structure. Journal of Molecular Biology. 2001;313(4):903-19.
OpenUrl CrossRef PubMed Web of Science
58.↵
Fang H, Oates ME, Pethica RB, Greenwood JM, Sardar AJ, Rackham OJL, et al. A daily-updated tree of (sequenced) life as a reference for genome research. Scientific Reports. 2013;3.
59.↵
Rokas A, Holland PWH. Rare genomic changes as a tool for phylogenetics. Trends in Ecology and Evolution. 2000;15(11):454-9.
OpenUrl CrossRef
60.↵
Castoe TA, de Koning AJ, Pollock DD. Adaptive molecular convergence: Molecular evolution versus molecular phylogenetics. Communicative and Integrative Biology. 2010;3(1):12-7.
61.
Liu Y, Cotton JA, Shen B, Han X, Rossiter SJ, Zhang S. Convergent sequence evolution between echolocating bats and dolphins. Current Biology. 2010;20(2):R53-R4.
62.
Foote AD, Liu Y, Thomas GWC, Vinar T, Alfoldi J, Deng J, et al. Convergent evolution of the genomes of marine mammals. Nat Genet. 2015;advance online publication.
63.
Hennig W. Phylogenetic systematics. Annual review of entomology. 1965;10(1):97-116.
64.
Padian K, Lindberg DR, Polly PD. Cladistics and the fossil record: the uses of history. Annual Review of Earth and Planetary Sciences. 1994;22:63-91.
65.
Lienau EK, DeSalle R. Is the microbial tree of life verificationist? Cladistics. 2010;26(2):195-201.
66.
Sanderson MJ. Where have all the clades gone? A systematist’s take in Inferring Phylogenies. Evolution. 2005;59(9):2056-8.
67.
Wheeler Q, Assis L, Rieppel O. Phylogenetics: Heed the father of cladistics. Nature. 2013;496(7445):295-6.
68.
Pethica RB, Levitt M, Gough J. Evolutionarily consistent families in SCOP: Sequence, structure and function. BMC Structural Biology. 2012;12.
69.
Mackin KA, Roy RA, Theobald DL. An empirical test of convergent evolution in rhodopsins. Molecular Biology and Evolution. 2014;31(1):85-95.
70.
Mistry J, Finn RD, Eddy SR, Bateman A, Punta M. Challenges in homology search: HMMER3 and convergent evolution of coiled-coil regions. Nucleic acids research. 2013;41(12):e121-e.
71.
Simpson GG. The Principles of Classification and a Classification of Mammals. Bull Amer Museum Nat History. 1945;85:xvi+350.
72.
Felsenstein J. Inferring phylogenies. Sunderland, MA: Sinauer Associates; 2004.
73.
Iwabe N, Kuma K, Hasegawa M, Osawa S, Miyata T. Evolutionary relationship of archaebacteria, eubacteria, and eukaryotes inferred from phylogenetic trees of duplicated genes. Proceedings of the National Academy of Sciences. 1989;86(23):9355-9.
74.
Baldauf SL, Palmer JD, Doolittle WF. The root of the universal tree and the origin of eukaryotes based on elongation factor phylogeny. Proceedings of the National Academy of Sciences of the United States of America. 1996;93(15):7749-54.
75.
Forterre P, Philippe H. Where is the root of the universal tree of life? BioEssays. 1999;21(10):871-9.
76.
Atkinson GC. The evolutionary and functional diversity of classical and lesser-known cytoplasmic and organellar translational GTPases across the tree of life. BMC Genomics. 2015;16(1):78.
77.
Graham SW, Olmstead RG, Barrett SCH. Rooting Phylogenetic Trees with Distant Outgroups: A Case Study from the Commelinoid Monocots. Molecular Biology and Evolution. 2002;19(10):1769-81.
78.
Woese CR. Bacterial evolution. Microbiological reviews. 1987;51(2):221.
79.
Nasir A, Caetano-Anollés G. A phylogenomic data-driven exploration of viral origins and evolution. Science Advances. 2015;1(8).
80.
Gribaldo S, Brochier-Armanet C. Time for order in microbial systematics. Trends in microbiology. 2012;20(5):209-10.
81.
Schwartz R, Dayhoff M. Origins of prokaryotes, eukaryotes, mitochondria, and chloroplasts. Science. 1978;199(4327):395-403.
82.
Darwin C. On the Origin of Species by Means of Natural Selection, or the Preservation of Favoured Races in the Struggle for Life. London: John Murray; 1859.
83.
Gribaldo S, Poole AM, Daubin V, Forterre P, Brochier-Armanet C. The origin of eukaryotes and their relationship with the Archaea: are we at a phylogenomic impasse? Nat Rev Micro. 2010;8(10):743-52.
84.
Gupta RS. Impact of genomics on the understanding of microbial evolution and classification: the importance of Darwin’s views on classification. FEMS microbiology reviews. 2016;40(4):520-53.
85.
Avise JC, Robinson TJ. Hemiplasy: a new term in the lexicon of phylogenetics. Systematic Biology. 2008;57(3):503-7.
86.
Huelsenbeck JP, Larget B, Alfaro ME. Bayesian Phylogenetic Model Selection Using Reversible Jump Markov Chain Monte Carlo. Molecular Biology and Evolution. 2004;21(6):1123-33.
87.
Woese CR. Interpreting the universal phylogenetic tree. Proceedings of the National Academy of Sciences. 2000;97(15):8392-6.
88.
Brinkmann H, Philippe H. Archaea sister group of Bacteria? Indications from tree reconstruction artifacts in ancient phylogenies. Molecular biology and evolution. 1999;16(6):817-25.
89.
Harish A, Kurland CG. Mitochondria are not captive bacteria. Journal of Theoretical Biology. 2017;434:88-98.
90.
Oates ME, Stahlhacke J, Vavoulis DV, Smithers B, Rackham OJL, Sardar AJ, et al. The SUPERFAMILY 1.75 database in 2014: A doubling of data. Nucleic Acids Research. 2015;43(D1):D227-D33.
91.
Lefort V, Longueville J-E, Gascuel O. SMS: Smart Model Selection in PhyML. Molecular Biology and Evolution. 2017:msx149.
92.
Guindon S, Dufayard J-F, Lefort V, Anisimova M, Hordijk W, Gascuel O. New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. Systematic biology. 2010;59(3):307-21.
93.
Ronquist F, Teslenko M, van der Mark P, Ayres DL, Darling A, Höhna S, et al. MrBayes 3.2: Efficient Bayesian Phylogenetic Inference and Model Choice Across a Large Model Space. Systematic Biology. 2012;61(3):539-42.
94.
Ronquist F, Huelsenbeck J, Teslenko M. MrBayes version 3.2 manual: tutorials and model summaries. Available with the software distribution at mrbayes.sourceforge.net/mb3.2_manual.pdf. 2011.

Posted February 13, 2018.

Subject Area

Evolutionary Biology

