Abstract
The recognition of the group Archaea 40 years ago stimulated research in microbial evolution and molecular systematics that prompted a new classificatory scheme to organize biodiversity. Advances in DNA sequencing techniques have since significantly improved the genomic representation of the archaeal biodiversity. In addition, advances in phylogenetic modeling that facilitate large-scale phylogenomics have resolved many recalcitrant branches of the Tree of Life. Despite the technical advances and an expanded taxonomic representation, two important aspects of the origins and evolution of the Archaea remain controversial, even as we celebrate the 40th anniversary of the monumental discovery. The issues concern (i) the uniqueness (monophyly) of the Archaea, and (ii) the evolutionary relationships of the Archaea to the Bacteria and the Eukarya; both of these are relevant to the deep structure of the Tree of Life. The uncertainty is primarily due to a scarcity of information in standard datasets—the core-genes datasets—to reliably resolve the conflicts. These conflicts can be resolved efficiently by employing complex genomic features and genome-scale evolution models—a distinct class of phylogenomic characters and evolution models—that can be employed routinely to maximize the use of genome sequences as well as to minimize uncertainties in tests of evolutionary hypotheses.
Introduction
The recognition of the Archaea as the so-called “third form of life” was made possible in part by a new technology for sequence analysis, oligonucleotide cataloging, developed by Frederick Sanger and colleagues in the 1960s (1, 2). Carl Woese’s insight of using this method, and the choice of the small subunit ribosomal RNA (16S/SSU rRNA) as a phylogenetic marker, not only put microorganisms on a phylogenetic map (or tree), but also revolutionized the field of molecular systematics that Zuckerkandl and Pauling had previously alluded to (3). Comparative analysis of organism-specific (oligonucleotide) sequence-signatures in SSU rRNA led to the recognition of a distinct group of microorganisms (2, 4). Initially referred to as Archaebacteria, these unusual organisms had ‘oligonucleotide signatures’ distinct from other bacteria (Eubacteria), and they were later found to be different from those of Eukarya (eukaryotes) as well. Many other features, including molecular, biochemical as well as ecological, corroborated the uniqueness of the Archaea. Thus the archaeal concept was established (2).
The study of microbial diversity and evolution has come a long way since then: sequencing microbial genomes, including directly from the environment without the need for culturing, is now routine (5, 6). This wealth of sequence information is exciting not only for cataloging and organizing biodiversity, but also for understanding the ecology and evolution of microorganisms – archaea and bacteria as well as eukaryotes – that make up a vast majority of the planetary biodiversity. Since large-scale exploration by means of environmental genome sequencing became possible almost a decade ago, there has also been a palpable excitement and anticipation of the discovery of a fourth form of life or a “fourth domain” of life (7). The reference here is to a fourth form of cellular life, but not to viruses, which some have already proposed to be the fourth domain of the Tree of Life (ToL) (7, 8). If a fourth form of life were to be found, what would the distinguishing features be, and how could it be measured, defined and classified?
Rather than the discovery of a fourth domain, and contrary to expectations, however, current discussion is centered around the return to a dichotomous classification of life (9-11), despite the rapid expansion of sequenced biodiversity – hundreds of novel phyla descriptions (12, 13). The proposed dichotomous classification schemes, unfortunately, are in sharp contrast to each other, depending on: (i) whether the Archaea constitute a monophyletic group, that is, a unique line of descent that is distinct from those of the Bacteria as well as the Eukarya; and (ii) whether the Archaea form a sister clade to the Eukarya or to the Bacteria. Both issues stem from difficulties involved in resolving the deep branches of the ToL (10, 11, 14).
The twin issues, first recognized in the 1980s based on single-gene (SSU rRNA) analyses, continue to be the subjects of a long-standing debate, which remains unresolved despite large-scale analyses of multi-gene datasets (5, 15-19). In addition to the choice of genes to be analyzed, the choice of the underlying character evolution model is at the core of contradictory results that support either the Three-domains tree (5, 19) or the Eocyte tree (17, 20). In many cases, adding more data, either as enhanced taxon (species) sampling or enhanced character (gene) sampling, or both, can resolve ambiguities (21, 22). However, as the taxonomic diversity and evolutionary distance increase among the taxa studied, the number of conserved marker-genes that can be used for phylogenomic analyses decreases. Accordingly, resolving the phylogenetic relationships of the Archaea, Bacteria and Eukarya is restricted to a small set of genes (50 at most), in spite of the large increase in the numbers of genomes sequenced and the associated development of sophisticated phylogenomic methods.
Based on a closer scrutiny of the recent phylogenomic datasets employed in the ongoing debate, I will show here that one of the reasons for this persistent ambiguity is that the ‘information’ necessary to resolve these conflicts is practically nonexistent in the standard marker-genes (i.e. core-genes) datasets employed routinely for phylogenomics. Further, I discuss analytical approaches that maximize the use of the information that is in genome sequence data and simultaneously minimize phylogenetic uncertainties. In addition, I discuss simple but important, yet undervalued, aspects of phylogenetic hypothesis testing, which together with the new approaches hold promise to resolve these long-standing issues effectively.
Results
Information in core genes is inadequate to resolve the archaeal radiation
Data-display networks (DDNs) are useful to examine and visualize character conflicts in phylogenetic datasets, especially in the absence of prior knowledge about the source of such conflicts, ideally before downstream processing of the data for phylogenetic inference (23, 24). While congruent data will be displayed as a tree in a DDN, incongruences are displayed as reticulations in the tree. Fig. 1A shows a neighbor-net analysis of the SSU rRNA alignment used to resolve the phylogenetic position of the recently discovered Asgard archaea (20). The DDN is based on character distances calculated as the observed genetic distance (p-distance) of 1,462 characters, and shows the total amount of conflict in the dataset that is incongruent with character bipartitions (splits). The edge (branch) lengths in the DDN correspond to the support for the respective splits. Accordingly, two well-supported sets of splits for the Bacteria and the Eukarya are observed. The Archaea, however, do not form a distinct, well-resolved/well-supported group, and are unlikely to correspond to a monophyletic group in a phylogenetic tree.
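As an illustration of the distance measure named above, the following is a minimal sketch of how observed (p-) distances could be computed from an alignment before a network is built. The function names, the toy sequences, and the choice to skip sites with a gap in either sequence (pairwise deletion) are assumptions made for this example; the source does not specify how gaps were treated, and the neighbor-net step itself is not reproduced here.

```python
def p_distance(seq_a, seq_b, gap="-"):
    """Proportion of differing sites among sites where neither sequence has a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = differing = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue  # assumption: pairwise deletion of gapped sites
        compared += 1
        differing += a != b
    return differing / compared if compared else float("nan")


def distance_matrix(alignment):
    """Return {(name_x, name_y): p-distance} for every ordered pair of taxa."""
    names = list(alignment)
    return {(x, y): p_distance(alignment[x], alignment[y])
            for x in names for y in names}


# Hypothetical toy alignment, not data from the study.
toy_alignment = {
    "taxon1": "ACGTACGT",
    "taxon2": "ACGTTCGT",
    "taxon3": "AC-TTCGA",
}
D = distance_matrix(toy_alignment)
```

A matrix of this kind is the input that distance-based network methods such as neighbor-net operate on.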
Likewise, the concatenated protein sequence alignment of the so-called ‘genealogy defining core of genes’ (25) – a set of conserved single-copy genes – also does not support a unique archaeal lineage. Fig. 1B is a DDN derived from a neighbor-net analysis of 8,563 characters in 29 concatenated core-genes (17), while Fig. 1C,D is based on 9,868 characters in 44 concatenated core-genes (also from (20)). Even taken together, none of the standard marker gene datasets are likely to support the monophyly of the Archaea, a key assertion of the three-domains hypothesis (26). Simply put, there is not enough information in the core-gene datasets to resolve the archaeal radiation, or to determine whether the Archaea are really unique compared to the Bacteria and Eukarya. However, other complex features, including molecular, biochemical and phenotypic characters as well as ecological adaptations, support the uniqueness of the Archaea. These idiosyncratic archaeal characters include the subunit composition of supramolecular complexes like the ribosome, DNA- and RNA-polymerases, biochemical composition of cell membranes, cell walls, and physiological adaptations to energy-starved environments, among other things (27, 28).
Complex phylogenomic characters minimize uncertainties regarding the uniqueness of the Archaea
A nucleotide is the smallest possible locus, and an amino acid is a proxy for a locus of a nucleotide triplet. Unlike the elementary amino acid- or nucleotide-characters in the core-genes dataset (Fig. 1), the DDN in Fig. 2 is based on complex molecular characters – genomic loci that correspond to protein domains, typically ~200 amino acids (600 nucleotides) long. Neighbor-net analysis of protein-domain data coded as binary characters (presence/absence) is based on the Hamming distance (identical to the p-distance used in Fig. 1). Here the Archaea form a distinct, well-supported cluster, as do the Bacteria and the Eukarya.
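A minimal sketch of the Hamming distance on presence/absence profiles is given below. The genome names and the six-character profiles are hypothetical. Dividing the count of mismatching characters by the number of characters gives a proportion, which is the sense in which the measure matches the p-distance.

```python
def hamming(profile_a, profile_b):
    """Number of characters (domains) whose presence/absence state differs."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same set of characters")
    return sum(a != b for a, b in zip(profile_a, profile_b))


def normalized_hamming(profile_a, profile_b):
    """Hamming distance as a proportion of characters, analogous to a p-distance."""
    return hamming(profile_a, profile_b) / len(profile_a)


# Hypothetical presence (1) / absence (0) profiles over six domains.
profiles = {
    "genome_A": [1, 1, 0, 1, 0, 0],
    "genome_B": [1, 0, 0, 1, 1, 0],
    "genome_C": [0, 0, 1, 1, 1, 1],
}
d_ab = hamming(profiles["genome_A"], profiles["genome_B"])
d_ac = hamming(profiles["genome_A"], profiles["genome_C"])
```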
Fig. 2A is a DDN based on the dataset that includes protein-domain cohorts of 141 species, used in a phylogenomic analysis to resolve the uncertainties at the root of the ToL (29). Compared to the data in Fig. 1, the taxonomic diversity sampled for the Bacteria and Eukarya is more extensive, but less extensive for the Archaea; it is composed of the traditional groups Euryarchaeota and Crenarchaeota. Fig. 2B is a DDN of an enriched sampling of 81 additional species, which includes representatives of the newly described archaeal groups: TACK (30), DPANN (5), and Asgard group including the Lokiarchaeota (20). In addition, species sampling was enhanced with representatives from the candidate phyla described for Bacteria, and with unicellular species of Eukarya. The complete list of species analyzed is in SI Table 1.
Notably, the extension of the protein-domain cohort was insignificant, from 1,732 to 1,738 distinct domains (characters). Based on the well-supported splits in the DDN that form a distinct archaeal cluster, the Archaea are likely to be a monophyletic group (clade) in phylogenies inferred from these datasets.
Data quality affects model complexity required to explain phylogenetic datasets
Resolving the paraphyly or monophyly of the Archaea is relevant to determining whether the Eocyte tree (Fig. 3A) or the Three-domains tree (Fig. 3B), respectively, is a better-supported hypothesis. Recovering the Eocyte tree typically requires implementing complex models of sequence evolution rather than their relatively simpler versions (11). In general, complex models tend to fit the data better. For instance, according to a model selection test for the 29 core-genes dataset, the LG model (31) of protein sequence evolution is a better-fitting model than other standard models, such as the WAG or JTT substitution model (SI-Table 2), as reported previously (17). Further, a relatively more complex version of the LG model, with multiple rate-categories was found to be a better-fitting model than the simpler single-rate-category model (Fig. 3C; SI-Table 2). The fit of the data is estimated as the likelihood of the best tree given the model.
A complex, multiple rate-categories model accounts for site-specific substitution rate variation. Substitution-rate heterogeneity across different sites in the multiple-sequence alignment (MSA) was approximated using a discrete Gamma model with 4, 8 or 12 rate categories (LG+G4, LG+G8 or LG+G12, respectively). The Archaea are consistent with a paraphyletic group in trees derived from the rate-heterogeneous versions of the LG model (Fig. 3A). Furthermore, the fit of the data improves with the increase in complexity of the substitution model (Fig. 3C). Model complexity increases with any increase in the number of rate categories and/or the associated numbers of parameters that need to be estimated. However, with a relatively simpler version, a rate-homogeneous LG model in which the substitution rates are approximated by a single rate category, the Archaea are consistent with a monophyletic group (Fig. 3B).
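To make the idea of comparing model fit concrete, the sketch below ranks candidate models with the Akaike information criterion, AIC = 2k - 2 ln L, where k is the number of free parameters and ln L the maximized log-likelihood. This is one widely used criterion and is shown only as an assumption; the text does not state which criterion the model selection test used. The log-likelihoods and parameter counts are invented placeholders, not values from SI-Table 2.

```python
def aic(log_likelihood, n_free_params):
    """Akaike information criterion; lower values indicate a better trade-off."""
    return 2 * n_free_params - 2 * log_likelihood


def rank_models(fits):
    """fits maps model name -> (lnL, k). Returns (name, AIC, delta AIC), best first."""
    scores = {name: aic(lnl, k) for name, (lnl, k) in fits.items()}
    best = min(scores.values())
    return sorted(((name, score, score - best) for name, score in scores.items()),
                  key=lambda row: row[1])


# Hypothetical values for illustration only.
hypothetical_fits = {
    "LG": (-1000.0, 10),
    "LG+G4": (-950.0, 11),
}
ranking = rank_models(hypothetical_fits)
```

With these placeholder numbers the rate-heterogeneous model ranks first, because its gain in likelihood outweighs the penalty for the extra parameter.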
In contrast, trees inferred from the protein-domain datasets are consistent with monophyly of the Archaea irrespective of the complexity of the underlying model (Fig. 3D-F). The Mk model (Markov k model) is the best-known probabilistic model of discrete character evolution, particularly of complex characters coded as binary-state characters (32, 33). Since the Mk model assumes a stochastic process of evolution, it is able to estimate multiple state changes along the same branch. Implementing a simpler rate-homogeneous version of the Mk model (Fig. 3D), as well as more complex rate-heterogeneous versions with 4, 8 or 12 rate categories (Mk+G4, Mk+G8 or Mk+G12, respectively), also recovered trees that are consistent with the monophyly of the Archaea (Fig. 3E). The tree derived from the Mk+G4 model is shown in Fig. 3E. While the tree derived from the Mk+G8 model is identical (SI-Fig. 1) to the Mk+G4 tree, the Mk+G12 tree is almost identical, with minor differences in the bacterial sub-groups (SI-Fig. 2).
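For readers unfamiliar with the Mk model, the sketch below evaluates its transition probabilities for a single rate category, with the branch length t measured in expected changes per character. It assumes the standard symmetric formulation with equal rates between all k states; the function name is ours, and the Gamma rate-heterogeneous variants are not implemented.

```python
import math


def mk_transition_prob(k, t, same):
    """Probability of ending in the same (or a specific different) state after time t.

    Assumes a symmetric k-state model with equal rates, scaled so that t is the
    expected number of changes per character.
    """
    decay = math.exp(-k * t / (k - 1))
    if same:
        return 1.0 / k + (k - 1) / k * decay
    return 1.0 / k - decay / k


# Binary (presence/absence) characters: k = 2.
p_stay = mk_transition_prob(2, 0.3, same=True)
p_change = mk_transition_prob(2, 0.3, same=False)
```

Because the probability of a 0-to-1 change equals that of a 1-to-0 change in this formulation, the likelihood is the same wherever the root is placed, which is why a model of this kind yields an unrooted tree.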
In all cases, the bipartition for the Archaea is strongly supported, with a posterior probability (PP) of 0.99, while those for the Bacteria and the Eukarya are supported with a PP of 1.0, in spite of substantially different fits of the data. The uniqueness of the Archaea is almost unambiguous in this case (but see next section).
Siblings and cousins are indistinguishable when reversible models are employed
Although a DDN is useful to identify and diagnose character conflicts in phylogenetic datasets and to postulate evolutionary hypotheses, a DDN by itself cannot be interpreted as an evolutionary network, because the edges do not necessarily represent evolutionary phenomena and the nodes do not represent ancestors (23, 24). Therefore, evolutionary relationships cannot be inferred from a DDN. Likewise, evolutionary relationships cannot be inferred from unrooted trees, even though nodes in an unrooted tree do represent ancestors and an evolution model defines the branches (see Fig. 4A).
An unrooted tree, unlike a rooted tree, is not an evolutionary (phylogenetic) tree per se, since it is a minimally defined hypothesis of evolution or of relationships; it is, nevertheless, useful to rule out many possible bipartitions and groups (34, 35). Given that a primary objective of phylogenetic analyses is to identify clades and the relationships between these clades, it is not possible to interpret an unrooted tree meaningfully without rooting the tree (see Fig. 4A). Identifying the root is essential to: (i) distinguish between ancestral and derived states of characters, (ii) determine the ancestor-descendant polarity of taxa, and (iii) diagnose clades and sister-group relationships (Fig. 4). Yet, most phylogenetic software constructs only unrooted trees, which are then consistent with several rooted trees (Fig. 4C-F). However, an unrooted tree cannot be fully resolved into bipartitions, because an unresolved polytomy (a trifurcation in this case) exists near the root of the tree (Fig. 4A), which otherwise corresponds to the deepest split (root) in a rooted tree (Fig. 4C-F).
Resolving the polytomy requires identifying the root of the tree. The identity of the root corresponds, in principle, to any one of the possible ancestors as follows:
Any one of the inferred-ancestors at the resolved bipartitions (open circles in Fig. 4A), or
Any one of the yet-to-be-inferred-ancestors that lies along the stem-branches of the unresolved polytomy (dashed lines in Fig. 4A) or along the internal branches.
In the latter case, rooting the tree a posteriori on any of the branches amounts to inserting an additional bipartition and an ancestor that is neither inferred from the source data nor deduced from the underlying character evolution model. Since standard evolution models employed routinely cannot resolve the polytomy, rooting, and hence interpreting the Tree of Life depends on:
Prior knowledge, e.g., fossils or a known sister-group (outgroup), or
Prior beliefs/expectations of the investigators, e.g., simple is primitive (36, 37), bacteria are primitive (38, 39), archaea are primitive (1), etc.
Both of these options are independent of the data used to infer the unrooted ToL. Some possible rootings and the resulting rooted-tree topologies are shown as cladograms in Fig. 4C-F. If the root lies on any of the internal branches (e.g. R1 in Fig. 4A-C), or corresponds to one of the internal nodes, within the archaeal radiation, the Archaea would not constitute a unique clade (Fig. 4C). However, if the root lies on one of the stem-branches (R2/R3/R4 in Fig. 4A, B), monophyly of the Archaea would be unambiguous (Fig. 4D-F). Determining the evolutionary relationship of the Archaea to other taxa, though, requires identifying the root.
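The dependence of monophyly on root placement can be checked mechanically. The sketch below uses a hypothetical four-taxon unrooted tree (two archaea A1 and A2, one bacterium B, one eukaryote E, with internal nodes x and y), places the root on each branch in turn, and asks whether the two archaea form a clade. The tree and all names are illustrative assumptions and do not reproduce the topologies in Fig. 4.

```python
from itertools import chain

# Hypothetical unrooted quartet stored as an adjacency list.
# The internal branch x-y carries the split {A1, A2} | {B, E}.
ADJ = {
    "A1": ["x"], "A2": ["x"], "B": ["y"], "E": ["y"],
    "x": ["A1", "A2", "y"], "y": ["B", "E", "x"],
}


def leaves_below(adj, node, parent, clades):
    """Collect the leaf set of the subtree at node, recording every clade seen."""
    children = [n for n in adj[node] if n != parent]
    if children:
        clade = frozenset(chain.from_iterable(
            leaves_below(adj, child, node, clades) for child in children))
    else:
        clade = frozenset([node])
    clades.add(clade)
    return clade


def clades_for_root(adj, root_edge):
    """All clades implied by placing the root on the branch root_edge."""
    u, v = root_edge
    clades = set()
    leaves_below(adj, u, v, clades)
    leaves_below(adj, v, u, clades)
    return clades


def is_monophyletic(adj, group, root_edge):
    return frozenset(group) in clades_for_root(adj, root_edge)


def branches(adj):
    """Every branch of the unrooted tree, each a candidate root position."""
    return sorted({tuple(sorted((a, b))) for a in adj for b in adj[a]})


ARCHAEA = {"A1", "A2"}
for edge in branches(ADJ):
    print(edge, is_monophyletic(ADJ, ARCHAEA, edge))
```

In this toy case the archaea form a clade for a root on the internal branch or on the branches leading to B or E, but not for a root on a branch leading to either archaeon, so the same unrooted tree is compatible with both outcomes.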
Directional evolution models, unlike reversible models, are able to identify the polarity of state transitions, and thus the root of a tree (40-42). Therefore, the uncertainty due to a polytomous root branching is not an issue (Fig. 5A). Moreover, directional evolution models are useful to evaluate the empirical support for prior beliefs about the universal common ancestor (UCA) at the root of the ToL (29). A Bayesian model selection test implemented to detect directional trends (42) chooses the directional model, overwhelmingly (Fig. 5B), over the unpolarized model for the protein-domain dataset in Fig. 2B, as reported previously for the dataset in Fig. 2A (29). Further, the best-supported rooting corresponds to root R4 (Fig. 4F and Fig. 5A): monophyly of the Archaea is maximally supported (PP of 1.0). Furthermore, the sister-group relationship of the Archaea to the Bacteria is maximally supported (PP of 1.0). Accordingly, a higher order taxon, Akaryotes, proposed earlier (Forterre 1992), forms a well-supported clade. Thus Akaryotes (or Akarya) and Eukarya are sister clades that diverge from the UCA at the root of the ToL, also as reported previously (29).
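Bayesian model comparisons of this kind are commonly summarized as a Bayes factor, the ratio of the marginal likelihoods of two models. The sketch below converts a pair of natural-log marginal likelihoods into a log10 Bayes factor. It is a generic illustration with placeholder inputs; it is not the specific test of reference (42), and no values from Fig. 5 are used.

```python
import math


def log10_bayes_factor(log_marglik_m1, log_marglik_m0):
    """log10 of the Bayes factor for model M1 over M0.

    Inputs are natural-log marginal likelihoods, as typically reported by
    Bayesian phylogenetic software.
    """
    return (log_marglik_m1 - log_marglik_m0) / math.log(10)


# Hypothetical log marginal likelihoods, chosen so that M1 is favored 10**10-fold.
ln_ml_directional = -5000.0
ln_ml_unpolarized = -5000.0 - 10 * math.log(10)
support = log10_bayes_factor(ln_ml_directional, ln_ml_unpolarized)
```

Working on the log scale avoids numerical underflow, since the marginal likelihoods themselves are far too small to represent directly.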
Alternative rootings are much less likely, and are not supported (Fig. 5C). Accordingly, independent origin of the eukaryotes as well as akaryotes is the best-supported scenario. The Three-domains tree (root R3, Fig. 4E) is 10171 times less likely, and the scenario proposed by the Eocyte hypothesis (root R1, Fig. 5A) is highly unlikely. The common belief that simple is primitive, as well as beliefs that archaea are primitive or that archaea and bacteria evolved before eukaryotes, are not supported either.
Employing complex molecular characters maximizes representation of orthologous, non-recombining genomic loci, and thus phylogenetic signal
Genomic loci that can be aligned with high confidence using MSA algorithms are typically more conserved than those loci for which alignment uncertainty is high. Such ambiguously aligned regions of sequences are routinely trimmed off before phylogenetic analyses (43). Typically, the conserved well-aligned regions correspond to protein domains with highly ordered three-dimensional (3D) structures with specific 3D folds (Fig. 6A). Regions of sequences that are trimmed usually show higher variability in length, are less ordered and are known to accumulate insertion and deletion (indel) mutations at a higher frequency than in the regions that correspond to folded domains (44). These variable, structurally disordered regions, which flank the structurally ordered domains, link different domains in multi-domain proteins (Fig. 6A). Multi-domain architecture (MDA), the N-to-C terminal sequence of domain arrangement, is distinct for a protein family, and differs in closely related protein families with similar functions (Fig. 6A). The variation in MDA also relates to alignment uncertainties.
A closer look at the 29 core-genes dataset shows that the concatenated-MSA corresponds to a total of 27 distinct protein domains or genomic loci (Table 1). The number of loci sampled from different species varies between 20 and 27, since not all loci are found in all species. While some loci are absent in some species, some loci are redundant. For instance, the P-loop NTP hydrolase domain, one of the most prevalent protein domains, is represented up to 9 times in many species (Table 1). Many central cellular functions are driven by the conformational changes in proteins induced by the hydrolysis of nucleoside triphosphate (NTP) catalyzed by the P-loop domain. Out of a total of 27 distinct domains, 7 are redundant, with two or more copies represented per species. Similarly, 9 of the 50 domains have a redundant representation in the 44 core-gene dataset (Table 1). The observed redundancy of the genomic loci in the core-genes alignments is inconsistent with the common (and typically untested) assumption of using single-copy genes as a proxy for orthologous loci sampled for phylogenetic analysis.
In contrast, the protein-domain datasets are composed of unique loci (Fig. 6B). Despite the superficial similarity of the DDNs in Fig. 1 and Fig. 2, they are both qualitatively and quantitatively different codings of genome sequences. As opposed to tracing the history of 30-50 loci in the standard core-genes datasets (Fig. 1), up to 60-fold (1,738 loci) more information can be represented when genome sequences are coded as protein-domain characters (Fig. 2). Currently approximately 2,000 unique domains are described by SCOP (Structural Classification of Proteins) (45). The phyletic distribution of 1,738 domains identified in the 222 representative species sampled here is shown in a Venn diagram (Fig. 5B).
Discussion
Improving data quality can be more effective for resolving recalcitrant branches than increasing model complexity
In the phylogenetic literature, the concept of data quality refers to the quality or the strength of the phylogenetic signal that can be extracted from the data. The strength of the phylogenetic signal is proportional to the confidence with which unique state-transitions can be determined for a given set of characters on a given tree. Ideally, historically unique character transitions that entail rare evolutionary innovations are desirable, to identify patterns of uniquely shared innovations (synapomorphies) among lineages. Synapomorphies are the diagnostic features used for assessing lineage-specific inheritance of evolutionary innovations. Therefore identifying character transitions that are likely to be low probability events is a basic requirement for the accuracy of phylogenetic analysis.
In their pioneering studies, Woese and colleagues identified unique features of the SSU rRNA – [oligonucleotide] “signatures” – that were six nucleotides or longer, to determine evolutionary relationships (2). An underlying assumption was that the probability of occurrence of the same set of oligomer signatures by chance, in non-homologous sequences, is low in a large molecule like SSU rRNA (1500-2000 nucleotides). Oligomers shorter than six nucleotides were statistically less likely to be efficient markers of homology (46). Thus SSU rRNA was an information-rich molecule to identify homologous signatures (characters) useful for phylogenetic analysis.
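The intuition behind the length threshold can be illustrated with a back-of-the-envelope calculation. Under the simplifying assumption that bases are independent and equally frequent, a specific oligomer of length n matches a given position with probability 4^-n, so the expected number of chance matches in a sequence of length L is (L - n + 1) / 4^n. This is an illustrative approximation of ours and not the statistical argument of reference (46).

```python
def expected_chance_matches(oligo_len, seq_len, alphabet_size=4):
    """Expected number of chance occurrences of one specific oligomer.

    Assumes independent, equally frequent bases (a deliberate simplification).
    """
    positions = seq_len - oligo_len + 1
    if positions <= 0:
        return 0.0
    return positions / alphabet_size ** oligo_len


# Compare a tetramer and a hexamer in a 1,500-nucleotide molecule.
tetramer = expected_chance_matches(4, 1500)
hexamer = expected_chance_matches(6, 1500)
```

Under these assumptions a given tetramer is expected to occur several times by chance in a molecule of this size, whereas a given hexamer is expected less than once.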
However, as sequencing of full-length rRNAs and statistical models of nucleotide substitution became common, complex oligomer-characters were replaced by elementary nucleotide-characters, and more recently by amino acid characters. Identifying rare or historically unique substitutions in empirical datasets has proven to be difficult (47, 48); consequently, the uncertainty of resolving the deeper branches of the Tree of Life using marker-gene sequences remains high. A primary reason is the prevalence of phylogenetic noise (homoplasy) in primary sequence datasets (Fig. 1), due to the characteristic redundancy of nucleotide and amino acid substitutions and the resulting difficulty in distinguishing phylogenetic noise from signal (homology) (49, 50). Better-fitting (or best-fitting) models are expected to extract phylogenetic signal more efficiently and thus explain the data better, but tend to be more complex than worse-fitting models (Fig. 3C, F). Increasingly sophisticated statistical models that have been developed over the years have only marginally improved the situation (51, 52). Although increasing model complexity can correct errors of estimation and improve the fit of the data to the tree, it is not a solution to improve phylogenetic signal, especially when the signal is not present in the source data.
Character recoding is found to be effective in reducing the noise/redundancy in the data, and thus uncertainties in phylogenetic reconstructions. This is a form of data simplification wherein the amino acid alphabet is reduced to a smaller alphabet of groups of residues that are frequently substituted for each other, usually from 20 letters to 6. Character recoding into reduced alphabets is useful in cases where compositional heterogeneity or substitution saturation is high. However, datasets in which phylogenetic noise is inherently limited are more desirable, to minimize ambiguities. Like amino acids, protein domains are also modular alphabets, albeit higher order and more complex alphabets of proteins. Moreover, unlike the 20 standard amino acids, there are approximately 2,000 unique protein domains identified at present according to SCOP (45). The number is expected to increase; the theoretical estimates range between 4,000 and 10,000 distinct domain modules, depending on the classification scheme (53). Coding features as binary characters is the simplest possible representation of data for describing historically unique events.
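As an example of what recoding into a six-letter alphabet looks like in practice, the sketch below applies the widely used Dayhoff six-category grouping. The text does not name a particular scheme, so the choice of grouping is an assumption, as are the numeric category labels and the decision to pass gaps and ambiguous symbols through unchanged.

```python
# Dayhoff six-category grouping of the 20 standard amino acids (assumed scheme).
DAYHOFF6 = {
    "AGPST": "1",
    "C": "2",
    "DENQ": "3",
    "FWY": "4",
    "HKR": "5",
    "ILMV": "6",
}
LOOKUP = {aa: code for group, code in DAYHOFF6.items() for aa in group}


def recode(sequence):
    """Map each residue to its category; leave gaps and unknown symbols as they are."""
    return "".join(LOOKUP.get(residue, residue) for residue in sequence.upper())


recoded = recode("MKV-LC")
```

Substitutions within a category become invisible after recoding, which removes some saturated signal at the cost of discarding information.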
The idea of ‘oligonucleotide-signatures’ used for estimating a gene phylogeny has been extended, naturally, to infer a genome phylogeny (54). The signatures were defined in terms of protein-coding genes that were shared among the Archaea. However, as proteins are mosaics of domains, domains are unique genomic signatures (Fig. 6). Protein domains defined by SCOP correspond to complex ‘multi-dimensional signatures’ defined by: (i) a unique 3D fold, (ii) a distinct sequence profile, and (iii) a characteristic function. Though domain recombination is frequent, substitution of one protein domain for another has not been observed in homologous proteins (Fig. 6). For phylogenomic applications protein domains are ‘sequence signatures’ that essentially correspond to single-copy orthologous loci when coded as binary-state characters (presence/absence). These sequence signatures are consistent with unique, non-recombining genomic loci, and are identified using sophisticated statistical models, namely profile hidden Markov models (pHMMs) (55, 56), that can be used routinely to annotate and curate genome sequences in automated pipelines (57, 58).
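The coding step described above can be sketched as follows: given the list of domains annotated in each genome, build a binary character matrix in which each column is a domain and each entry records presence (1) or absence (0). The genome names and domain identifiers are hypothetical placeholders; in practice the annotations would come from profile-HMM searches, which are not shown.

```python
def presence_absence_matrix(domains_by_genome):
    """Return (sorted character list, {genome: [0/1, ...]}) from domain annotations."""
    characters = sorted(set().union(*domains_by_genome.values()))
    matrix = {
        genome: [int(domain in domains) for domain in characters]
        for genome, domains in domains_by_genome.items()
    }
    return characters, matrix


# Hypothetical annotations; repeated entries stand for multiple copies of a domain.
annotations = {
    "genome_A": ["dom1", "dom2", "dom2", "dom4"],
    "genome_B": ["dom1", "dom3"],
    "genome_C": ["dom2", "dom3", "dom4", "dom4"],
}
characters, matrix = presence_absence_matrix(annotations)
```

Multiple copies of a domain in one genome collapse to a single presence state, so each domain contributes exactly one character regardless of its copy number.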
For these reasons, protein domains are ideal molecular phylogenetic markers for which character-homology can be validated through more than one property: statistically significant (i) sequence similarity, (ii) 3D structure similarity, and (iii) function similarity. In addition, employing genomic loci for protein domains maximizes the genomic information that can be employed for phylogenetic analysis. Even though many other genomic features are known to be useful markers (59), protein domains are the most conserved as well as most widely applicable genomic characters (Fig. 6B).
Sorting vertical evolution (signal) and horizontal evolution (noise)
Single-copy genes are employed as phylogenetic markers to minimize phylogenetic noise caused by reticulate evolution, including hybridization, introgression, recombination, horizontal transfer (HT), duplication-loss (DL), or incomplete lineage sorting (ILS) of genomic loci. However, the noise observed in the DDNs based on MSA of core-genes (Fig. 1) cannot be directly related to any of the above genome-scale reticulations, since the characters are individual nucleotides or amino acids. Apart from stochastic character conflicts, the observed conflicts are better explained by convergent substitutions, given the redundancy of substitutions. Convergent substitutions, caused either by stringent selection or by chance, are a well-recognized form of homoplasy in gene-sequence data (47, 50, 60), and based on recent genome-scale analyses they are now known to be rampant (61, 62).
The observed noise in the DDNs based on protein-domain characters (Fig. 2), however, can be related directly to genome-scale reticulation processes and homoplasies. In general, homoplasy implies evolutionary convergence, parallelism or character reversals caused by multiple processes. In contrast, homology implies only one process: inheritance of traits that evolved in the common ancestor and were passed to its descendants. Operationally, tree-based assessment of homology requires tracing the phylogenetic continuity of characters (and states), whereas homoplasy manifests as discontinuities along the tree. Since clades are diagnosed on the basis of shared innovations (synapomorphies) and defined by ancestry (63, 64), accuracy of a phylogeny depends on an accurate assessment of homology — unambiguous identification of relative synapomorphies on a best fitting tree.
Identifying homoplasies caused by character reversals, i.e. reversal to ancestral states, requires identification of the ancestral state of the characters under study. However, implementing reversible models precludes the estimation of ancestral states, in the absence of sister groups (outgroups) or other external references. Thus, the critical distinction between shared ancestral homology (symplesiomorphy) and shared derived homology (synapomorphy) is not possible with unrooted trees derived from standard reversible models. Hence, unrooted trees (Fig. 3) are not evolutionary (phylogenetic) trees per se, as they are uninformative about the evolutionary polarity (34, 35, 65). Thus, identifying the root (or root-state) is crucial to (i) determine the polarity of state transitions, (ii) identify synapomorphies, and (iii) diagnose clades.
Moreover, because clades are associated with the emergence and inheritance of evolutionary novelties, the discovery of clades is fundamental for describing and diagnosing sister group differences, which is a primary objective of modern systematics (66). A well-recognized deficiency of phylogenetic inference based on primary sequences is the abstraction of evolutionary ‘information’ (54), often into less tangible quantitative measures. For instance, ‘information’ relevant to diagnosing clades and support for clades is abstracted to branch lengths. Branch-length estimation is, ideally, a function of the source data and the underlying model. However, in the core-genes dataset the estimated branch lengths and the resulting tree are an expression of the model rather than of the data (Fig. 3A, B). Some pertinent questions then are: should diagnosis of clades and the features by which clades are identified be delegated to, and restricted to, substitution mutations in a small set of loci and substitution models? Are substitution mutations in 40-50 loci more informative, or is the birth and death of unique genomic loci more informative?
Proponents of the total evidence approach recommend that all relevant information — molecular, biochemical, anatomical, morphological, fossils — should be used to reconstruct evolutionary history, yet genome sequences are the most widely applicable data at present (59, 67). Accordingly, phylogenetic classification is, in practice, a classification of genomes. There is no a priori theoretical reason that phylogenetic inference should be restricted to a small set of genomic loci corresponding to the core genes, nor is there a reason for limiting phylogenetic models to interpreting patterns of substitution mutations alone. The ease of sequencing and the practical convenience of assembling large character matrices, by themselves, are no longer compelling reasons to adhere to the traditional marker gene analysis.
Annotations for reference genomes of homologous protein domains identified by SCOP and other protein-classification schemes, as well as tools for identifying corresponding sequence signatures, are readily available in public databases. An added advantage is that the biochemical function and molecular phenotype of the domains are readily accessible as well, through additional resources including protein data bank (PDB) and InterPro. For complex characters such as protein domains, character homology can be determined with high confidence using sophisticated statistical models (HMMs). Homology of a protein domain implies that the de novo evolution of a genomic locus corresponding to that protein domain is a unique historical event. Therefore, homoplasy due to convergences and parallelisms is highly improbable (68, 69). Although a handful of cases of convergent evolution of 3D structures is known, these instances relate to relatively simple 3D folds coded for by relatively simple sequence repeats (70).
However, the vast majority of domains identified by SCOP correspond to polypeptides that are on average 200 residues long with unique sequence profiles (57, 68). Thus, identifying homoplasy in the protein-domain datasets depends largely on estimating reversals, which in this case are secondary gains or losses, for instance gain-loss-regain events caused by DL-HT or HT. Such secondary gains are more likely to correspond to HT events than to convergent evolution, for the reasons specified above. Instances of reversals are minimal, as seen from the strong directional trends detected in the data (Fig. 5B and Fig. 6B).
Vertical and horizontal classification
For decades, biologists have been faced with a choice between so-called horizontal (Linnean) and vertical (Darwinian) classification of biodiversity (71). The two schools of systematics are similar in that both rely on identifying “signatures”, or sets of characteristic features, that codify evolutionary relationships (54, 63, 71). But the former emphasizes the unity of contemporary groups, i.e. those at a similar evolutionary state, and therefore separates ancestors from descendants, while the latter emphasizes the unity of the ancestors and separates descendants that diverge from a common ancestry (71). Vertical classification is more consistent with the concept of lineal descent, and is the predominant paradigm for which the operational methodology and the algorithmic logic were laid out as the principles of phylogenetic systematics (63, 72). Accordingly, determining the ancestor-descendant polarity, starting from the universal common ancestor (UCA) at the root of the Tree of Life, is crucial to accurately reconstructing the path of evolutionary descent.
The classical rooting of the (rRNA)ToL based on the EF-Tu–EF-G paralogous pair (73, 74) is known to be error-prone and highly ambiguous, due to LBA artifacts (14, 75). Remarkably, sequences corresponding to only one of the two conserved domains common to EF-Tu and EF-G (200 residues in the P-loop-containing NTP hydrolase domain (Fig. 5A)) can be aligned with confidence (14). Implementing better-fitting substitution models results in two alternative rootings (R1 and R4 in Fig. 5), which relate to distinct, irreconcilable scenarios (14) similar to the scenarios in Fig. 4C and 4F. Moreover, the EF-Tu–EF-G paralogous pair represents only 2 of the 57 known paralogs of the translational GTPase protein superfamily (76). Thus the assumption that the EF-Tu–EF-G duplication is a unique event, which is essential for the paralogous outgroup-rooting method, is untenable.
In the absence of prior knowledge of outgroups or of fossils, rooting the Tree of Life is arguably one of the most difficult phylogenetic problems. Incorrect rooting may lead to profoundly misleading conclusions about evolutionary scenarios and taxonomic affinities, and it appears to be common in phylogenetic studies (77). Perhaps worse yet seems to be the preponderance of subjective a posteriori rooting based on untested preconceptions (e.g. (78, 79)) and scenario-driven erection of taxonomic ranks (e.g. (1, 30)) (80). The conventional practice of a posteriori rooting, wherein an unrooted tree is converted into a rooted tree by adding an ad hoc root, encourages a subjective interpretation of the ToL. For example, the so-called bacterial rooting of the ToL (root R3; Fig. 4) is the preferred rooting hypothesis to interpret the ToL even though that rooting is not well supported (14).
Untangling data bias, model bias and investigator bias (prior beliefs)
Phylogenies, and hence the taxonomies and evolutionary scenarios they support, are falsifiable hypotheses. Statistical hypothesis testing is now an integral part of phylogenetic inference, to quantify the empirical evidence in support of the various plausible evolutionary scenarios. However, common statistical models implemented for phylogenomic analyses are limited to modeling variation in patterns of point mutations, particularly substitution mutations. These statistical models are intimately linked to basic concepts of molecular evolution, such as the universal molecular clock (3), the universal chronometer (78), paralogous outgroup rooting (81), etc., which are gene-centric concepts that were developed to study the gene, during the age of the gene. Moreover, these idealized notions originated from the analyses of relatively small single-gene datasets.
Conventional phylogenomics of multi-locus datasets is a direct extension of the concepts and methods developed for single-locus datasets, which rely exclusively on substitution mutations (50). In contrast, the fundamental concepts of phylogenetic theory (homology, synapomorphy, homoplasy, character polarity, etc.), even if idealized, are more generally applicable. Moreover, they appear to be better suited to unique and complex genomic characters than to redundant, elementary sequence characters, with regard to determining both the qualitative and the statistical consistency of the data and the underlying assumptions.
Phylogenetic theory that was developed to trace the evolutionary history of organismal species, as well as related methods of discrete character analysis for classifying organismal families (63, 82), was adopted, although not entirely, to determine the evolution and classification of gene families (1, 3). The discovery and initial description of the Archaea was based on the comparative analysis of a single-gene (rRNA) family. However, in spite of the large number of characters that can be analyzed, neither the rRNA genes nor multi-gene concatenations of core-genes have proved to be efficient phylogenetic markers to reliably resolve the evolutionary history and phylogenetic affinities of the Archaea (83, 84).
Uncertainties and errors in phylogenetic inference are primarily errors in adequately distinguishing homologous similarities from homoplastic similarities (34, 50, 85). Homologies, synapomorphies and homoplasies are qualitative inferences, yet are inherently statistical (probabilistic). The probabilistic framework (maximum likelihood and Bayesian methods) has proven to be powerful for quantifying uncertainties and testing alternative hypotheses. Log odds ratios, such as LLR and LBF, are measures of how one changes belief in a hypothesis in light of new evidence (86). Accordingly, directional evolution models are better explanations of the observed distribution of genomic characters, and such directional trends overwhelmingly support the monophyly of the Archaea, as well as the sisterhood of the Archaea and the Bacteria, i.e. monophyly of Akarya (Fig. 6).
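The role of the log odds ratios mentioned above can be written out explicitly. The following is a sketch in our own notation, which does not appear in the original text: $D$ denotes the data, $M_0$ and $M_1$ two competing models (for instance a reversible and a directional model), and $\hat{\theta}_0$, $\hat{\theta}_1$ their maximum likelihood parameter estimates.

```latex
\begin{align}
\mathrm{LLR} &= \ln L(\hat{\theta}_1 \mid D, M_1) - \ln L(\hat{\theta}_0 \mid D, M_0) \\
\mathrm{LBF} &= \ln P(D \mid M_1) - \ln P(D \mid M_0) \\
\ln \frac{P(M_1 \mid D)}{P(M_0 \mid D)} &= \mathrm{LBF} + \ln \frac{P(M_1)}{P(M_0)}
\end{align}
```

The last line is Bayes' theorem in log odds form: the log Bayes factor is the amount by which the data shift the prior log odds of $M_1$ over $M_0$ to the posterior log odds.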
Data quality is at least as important as the evolution models that are posited to explain the data. Although sophisticated statistical tests for evaluating tree robustness, and for selecting character-evolution models, are becoming a standard feature of phylogenetic software (e.g. IQ-TREE, MrBayes, PhyloBayes), tests for character evaluation are not common. Routines for collecting and curating data upstream of phylogenetic analyses are rather eclectic. Besides, it is an open question as to whether qualitatively different datasets (as in Fig. 1 and Fig. 2) can be compared effectively. Nevertheless, employing DDNs and other tools of exploratory data analysis could be useful to identify conflicts that arise due to data collection and/or curation errors (23, 24).
Conclusions
The Tree of Life is primarily a phylogenetic classification that is invaluable to organize and to describe the evolution of biodiversity, explicated through evolutionary scenarios. Phylogenies are hypotheses that mostly relate to extinct ancestors, while taxonomies are hypotheses that largely relate to extant species. Extant species contain distinct combinatorial mosaics of ancestral features (plesiomorphies) and evolutionary novelties (apomorphies). It is remarkable that the uniqueness of the Archaea was identified by the comparative analyses of oligonucleotide signatures in a single-gene dataset (1). However, the same is not true of the phylogenetic classification of the Archaea, based on marker genes and reversible evolution models that rely exclusively on point mutations, specifically substitution mutations, which may not be ideal phylogenetic markers (59).
The Three-domains of Life hypothesis (26), which was initially based on the interpretation of an unrooted rRNA tree (of life) (1), was put forward largely to emphasize the uniqueness of the Archaea, ascribed to an exclusive lineal descent. Although many lines of evidence, molecular or otherwise, support the uniqueness of the Archaea, phylogenetic analysis of genomic signatures supports neither the presumed primitive state of Archaea or Bacteria nor the common belief that Archaea and Bacteria are ancestors of Eukarya (1, 11, 39, 87). Models of evolution of genomic features support a Two-domains (or rather two empires) of Life hypothesis (9), as well as the independent origins and parallel descent of eukaryote and akaryote species (10, 14, 88, 89).
Data and methods
Data collection and curation
Marker domains datasets
Character matrices of homologous protein domains, coded as binary-state characters, were assembled from genome annotations of SCOP domains available through the SUPERFAMILY HMM library and genome assignments server, v. 1.75 (http://supfam.org/SUPERFAMILY/) (57, 90).
The 141-species dataset was obtained from a previous study (29).
The 141-species dataset was updated with representatives of novel species described recently, largely with archaeal species from the TACK group (30), the DPANN group (5) and the Asgard group, including the Lokiarchaeota (20). In addition, species sampling was enhanced with representatives from the candidate phyla (unclassified) described for bacterial species and with unicellular species of eukaryotes, to a total of 222 species. The complete list of the species with their respective Taxonomy IDs is available in SI Table 1.
When genome annotations were unavailable from the SUPERFAMILY database, curated reference proteomes were obtained from the Universal Protein Resource, UniProt (http://www.uniprot.org/proteomes/). SCOP domains were annotated using the HMM library and the genome annotation tools and routines recommended by the SUPERFAMILY resource.
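The binary coding of protein-domain characters described above can be illustrated with a minimal sketch. It assumes that the domain assignments have already been parsed into a mapping from species to the set of domain identifiers found in its genome. The function name, the toy species labels and the domain labels are hypothetical, and no SUPERFAMILY file format or API is implied.

```python
from typing import Dict, List, Set, Tuple


def build_presence_absence_matrix(
    annotations: Dict[str, Set[str]],
) -> Tuple[List[str], List[str], List[List[int]]]:
    """Code domain annotations as binary characters (1 = present, 0 = absent).

    Rows are species and columns are domains, both in sorted order.
    """
    species = sorted(annotations)
    # Every domain observed in at least one genome becomes one character.
    domains = sorted(set().union(*annotations.values())) if annotations else []
    matrix = [
        [1 if domain in annotations[name] else 0 for domain in domains]
        for name in species
    ]
    return species, domains, matrix


# Hypothetical toy input, not data from the study.
toy_annotations = {
    "species_A": {"dom1", "dom2"},
    "species_B": {"dom2", "dom3"},
    "species_C": {"dom1", "dom2", "dom3"},
}

species, domains, matrix = build_presence_absence_matrix(toy_annotations)
for name, row in zip(species, matrix):
    print(name, row)
```

Each row of the resulting matrix is the character vector of one genome, which is the form of input expected by discrete-character models such as the Mk model.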
Exploratory data analysis
DDNs were constructed with SplitsTree v. 4.14. Split networks were computed using the NeighborNet method from the observed P-distances of the taxa for both nucleotide and amino acid characters. Split networks of the protein-domain characters were computed from the Hamming distance, which is identical to the P-distance. The networks were drawn with the equal angle algorithm.
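As a minimal sketch of the distance calculation for binary protein-domain characters, the code below computes the proportion of characters at which two taxa differ, i.e. the Hamming distance normalized by the number of characters, which is the sense in which it coincides with the P-distance. The function names and the toy character vectors are hypothetical, and the sketch assumes complete data without missing or ambiguous states. It does not reproduce the SplitsTree implementation.

```python
from typing import List, Sequence


def p_distance(a: Sequence[int], b: Sequence[int]) -> float:
    """Proportion of characters at which two taxa differ (normalized Hamming distance)."""
    if len(a) != len(b):
        raise ValueError("character vectors must have the same length")
    if len(a) == 0:
        raise ValueError("character vectors must not be empty")
    mismatches = sum(1 for x, y in zip(a, b) if x != y)
    return mismatches / len(a)


def distance_matrix(rows: Sequence[Sequence[int]]) -> List[List[float]]:
    """Symmetric matrix of pairwise p-distances between all taxa."""
    n = len(rows)
    return [[p_distance(rows[i], rows[j]) for j in range(n)] for i in range(n)]


# Hypothetical presence/absence vectors for three taxa and four domains.
toy_rows = [
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [1, 1, 0, 0],
]

distances = distance_matrix(toy_rows)
for row in distances:
    print(row)
```

A matrix of this kind is the input from which a distance-based method such as NeighborNet builds a split network.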
Phylogenetic analyses
Concatenated gene tree inference: Extensive analyses of the concatenated core-genes datasets are reported in the original studies (17, 20). Analysis here was restricted to the 29 core-genes dataset because of its relatively small taxon sampling (44 species) compared to the 48 core-genes dataset (96 species): there is little difference in data quality, but the computational time and resources required are significantly lower. Moreover, the general conclusions based on these datasets are consistent despite the smaller taxon sampling, particularly of archaeal species (26 as opposed to 64 in the larger sampling).
Best-fitting amino acid substitution models were chosen using Smart Model Selection (SMS) (91), which is compatible with PhyML tree inference methods (92). Trees were estimated with a rate-homogeneous LG model as well as rate-heterogeneous versions of the LG model. Site-specific rate variation was approximated using the gamma distribution with 4, 8 and 12 rate categories: LG+G4, LG+G8 and LG+G12, respectively. More complex models (SI Table 2) that account for invariable sites (LG+GX+I) and/or models that compute alignment-specific state frequencies (LG+GX+F) were also used, but the trees inferred were identical to trees estimated from LG+GX models, and are therefore not reported here. The log likelihood ratio (LLR) was calculated as the difference in the raw log likelihoods of the models being compared.
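The LLR calculation described above amounts to a subtraction of log likelihoods. The sketch below makes that explicit for a set of models compared against a chosen reference model. The function name is hypothetical and the log likelihood values are placeholders chosen for illustration, not results from this study.

```python
from typing import Dict


def log_likelihood_ratios(
    log_likelihoods: Dict[str, float], reference: str
) -> Dict[str, float]:
    """Difference in raw log likelihood between each model and a reference model."""
    reference_value = log_likelihoods[reference]
    return {
        model: value - reference_value
        for model, value in log_likelihoods.items()
    }


# Placeholder values for illustration only, not results from the study.
example_log_likelihoods = {
    "LG": -1000.0,
    "LG+G4": -950.0,
    "LG+G8": -948.5,
}

llr = log_likelihood_ratios(example_log_likelihoods, reference="LG")
for model, value in llr.items():
    print(f"{model}: LLR = {value:.1f}")
```

A positive value indicates that the model fits the data better than the reference, without any penalty for additional parameters.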
Genome tree inference: The Mk model (32) is the most widely implemented model for phylogenetic inference in the probabilistic framework (maximum likelihood (ML) and Bayesian methods) applicable to complex features coded as binary characters. However, only the reversible model is implemented in ML methods at present. Both reversible and directional evolution models, as well as the model selection routines implemented in MrBayes 3.2 (42, 93), were used. The Metropolis-coupled MCMC algorithm was used with two chains, sampling every 500th generation. The first half of the generations was discarded as burn-in. MCMC sampling was run until convergence, unless mentioned otherwise. Convergence was assessed through the average standard deviation of split frequencies (ASDSF, less than 0.01) for tree topology and the potential scale reduction factor (PSRF, equal to 1.00) for scalar parameters, unless mentioned otherwise. Bayes factors for model comparison were calculated using the harmonic mean estimator in MrBayes. The log Bayes factor (LBF) was calculated as the difference in the estimated log marginal likelihoods of the models being compared.
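The ASDSF diagnostic used above can be sketched as follows. For every split (bipartition) sampled in the independent runs, the standard deviation of its frequency across runs is computed, and these standard deviations are averaged. This is an illustration under stated assumptions rather than a reproduction of the MrBayes code: the use of the sample standard deviation (n - 1 denominator), the 0.10 minimum frequency cutoff for including a split, the function name and the toy split frequencies are all assumptions made here.

```python
import math
from typing import Dict, List


def asdsf(run_split_freqs: List[Dict[str, float]], min_freq: float = 0.10) -> float:
    """Average standard deviation of split frequencies across independent runs.

    Each dictionary maps a split label to its posterior frequency in one run.
    Splits absent from a run are given a frequency of 0.0 in that run.
    """
    if len(run_split_freqs) < 2:
        raise ValueError("at least two independent runs are required")
    all_splits = set().union(*run_split_freqs)
    deviations = []
    for split in all_splits:
        freqs = [run.get(split, 0.0) for run in run_split_freqs]
        # Assumed cutoff: skip splits that are rare in every run.
        if max(freqs) < min_freq:
            continue
        mean = sum(freqs) / len(freqs)
        variance = sum((f - mean) ** 2 for f in freqs) / (len(freqs) - 1)
        deviations.append(math.sqrt(variance))
    if not deviations:
        raise ValueError("no split reaches the minimum frequency in any run")
    return sum(deviations) / len(deviations)


# Hypothetical split frequencies from two runs on a four-taxon problem.
run_1 = {"AB|CD": 0.90, "AC|BD": 0.10}
run_2 = {"AB|CD": 0.80, "AC|BD": 0.18, "AD|BC": 0.02}

value = asdsf([run_1, run_2])
print(f"ASDSF = {value:.4f}")
```

Values approaching zero indicate that the independent runs sample the same splits at similar frequencies, which is why a threshold such as 0.01 is used as evidence of topological convergence.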
Convergence between independent runs was generally slower for directional models compared to the reversible models. When convergence was extremely slow (requiring more than 100 million generations), topology constraints corresponding to the clusters derived in the unrooted trees (Fig. 3E) were applied to improve convergence rates. In general these clusters/constraints corresponded to named taxonomic groups, e.g. Fungi, Metazoa, Crenarchaeota, etc. Convergence assessment between independent runs was relaxed for three specific cases that did not converge at the time of submission: the unrooted tree with Mk-uniform-rates model (ASDSF 0.05; PSRF 1.03), rooted trees corresponding to root-R2 (ASDSF 0.5; PSRF 1.04) and root-R3 (ASDSF 0.029; PSRF 1.03). In the three cases specified, the difference in bipartitions is in the shallow parts (minor branches) of the tree. For assessing well supported major branches of the tree, ASDSF values between 0.01 and 0.05 may be adequate, as recommended by the authors (94).
Funding
This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors. Work by this author was partially supported by The Swedish Research Council (to Måns Ehrenberg) and the Knut and Alice Wallenberg Foundation, RiboCORE (to Måns Ehrenberg and Dan Andersson).
Acknowledgements
I am grateful to Charles (Chuck) Kurland and Måns Ehrenberg for support and encouragement. I thank Chuck Kurland and Siv Andersson for the discussions in general; Chuck for the many stimulating debates and Siv for inspiring the article title, in part; Seraina Klopfstein for providing the algorithms for implementing the directional model in MrBayes and for helpful suggestions and Erling Wikman for help with computing equipment.
References
- 1.
- 2.
- 3.
- 4.
- 5.
- 6.
- 7.
- 8.
- 9.
- 10.
- 11.
- 12.
- 13.
- 14.
- 15.
- 16.
- 17.
- 18.
- 19.
- 20.
- 21.
- 22.
- 23.
- 24.
- 25.
- 26.
- 27.
- 28.
- 29.
- 30.
- 31.
- 32.
- 33.
- 34.
- 35.
- 36.
- 37.
- 38.
- 39.
- 40.
- 41.
- 42.
- 43.
- 44.
- 45.
- 46.
- 47.
- 48.
- 49.
- 50.
- 51.
- 52.
- 53.
- 54.
- 55.
- 56.
- 57.
- 58.
- 59.
- 60.
- 61.
- 62.
- 63.
- 64.
- 65.
- 66.
- 67.
- 68.
- 69.
- 70.
- 71.
- 72.
- 73.
- 74.
- 75.
- 76.
- 77.
- 78.
- 79.
- 80.
- 81.
- 82.
- 83.
- 84.
- 85.
- 86.
- 87.
- 88.
- 89.
- 90.
- 91.
- 92.
- 93.
- 94.