Abstract
The trait of oxygenic photosynthesis was acquired by the last common ancestor of Archaeplastida through endosymbiosis of the cyanobacterial progenitor of modern-day plastids. Although a single origin of plastids by endosymbiosis is broadly supported, recent phylogenomic studies report contradictory evidence that plastids branch either early or late within the cyanobacterial Tree of Life. Here we describe CYANO-MLP, a general-purpose phyloclassifier of cyanobacterial genomes implemented using a Multi-Layer Perceptron. CYANO-MLP exploits consistent phylogenetic signals in bioinformatically estimated structure-function maps of tRNAs. CYANO-MLP accurately classifies cyanobacterial genomes into one of eight well-supported cyanobacterial clades in a manner that is robust to missing data, unbalanced data and variation in model specification. CYANO-MLP supports a late-branching origin of plastids: we classify 99.32% of 440 plastid genomes into one of two late-branching cyanobacterial clades with strong statistical support, and confidently assign 98.41% of plastid genomes to one late-branching clade containing unicellular starch-producing marine/freshwater diazotrophic Cyanobacteria. CYANO-MLP correctly classifies the chromatophore of Paulinella chromatophora and rejects a sister relationship between plastids and the early-branching cyanobacterium Gloeomargarita lithophora. We show that recently applied phylogenetic models and character recoding strategies fit cyanobacterial/plastid phylogenomic datasets poorly, because of heterogeneity both in substitution processes over sites and compositions over lineages.
Introduction
The acquisition of a cyanobacterial endosymbiont by the last common ancestor of Archaeplastida [1, 36, 38] transferred the trait of oxygenic photosynthesis to eukaryotes over one billion years ago [20]. The diversity of eukaryotic photoautotrophs radiating from this event profoundly transformed the terrestrial biosphere through changes to primary biomass production, atmospheric oxygenation, and the colonization of new ecosystems [26].
It is widely accepted both that plastids originated in a single primary endosymbiotic event [37], and that the photosynthetic chromatophore of the freshwater amoeba Paulinella chromatophora evolved later in a second primary endosymbiotic event [25, 40]. However, despite substantial progress on a robust cyanobacterial Tree of Life (CyanoToL) [9, 39, 45, 47], the root of plastids within the CyanoToL remains controversial. Recent phylogenomic studies strongly support contradictory conclusions, with plastids branching either early [12, 43, 47, 52] or late [8, 13, 20, 39] within the CyanoToL. In contrast, orthogonal evidence from endosymbiotic gene transfers [16] and eukaryotic evolution of glycogen and starch metabolic pathways [6, 15] consistently supports a late-branching origin of plastids within the CyanoToL.
Phylogenetic inferences concerning plastid origins are complicated by large evolutionary distances accumulated over at least one billion years of vertical descent, by extreme reduction of genomes in plastids [53] and Cyanobacteria [18, 44], and by secondary and tertiary endosymbiotic acquisitions of plastids. Furthermore, reductive genome evolution alters the stationary nucleotide composition of genomes and gene products [7], violating the assumptions and applicability of many phylogenetic models [10, 17, 21, 28, 42].
Recently, we introduced a machine learning approach to the phyloclassification of genomes based on scoring tRNA gene complements against bioinformatically estimated taxon-specific tRNA functional signatures called tRNA Class-Informative Features (CIFs) [3]. tRNA CIFs, as visualized in function logos [22], contain information [46] about the functional identity of tRNAs for tRNA-interacting proteins. We demonstrated the strong recall and accuracy of a tRNA-CIF-based alpha-proteobacterial phyloclassifier despite convergent non-stationary compositions of alpha-proteobacterial tRNA genes, and likely horizontal transfers of genes for tRNAs and tRNA-interacting proteins [3].
In the present work, we improved our tRNA-based phyloclassifier approach and applied it to investigate the origin of plastids within the CyanoToL. Based on 5,270 tRNA gene sequences from 113 cyanobacterial genomes, our CYANO-Multi-Layer Perceptron (CYANO-MLP) phyloclassifier consistently classifies 433 plastid genomes within the B2 and B3 sister clades of Cyanobacteria [47]. These clades include marine/freshwater unicellular diazotrophic species previously noted to share synapomorphic starch metabolic pathway traits with plastids [15, 20]. We reconciled our results with prior work by demonstrating that recently applied phylogenetic models and character recoding strategies fit cyanobacterial/plastid phylogenomic datasets poorly because of heterogeneity of substitution processes over sites and lineage-specific compositional biases.
Materials and Methods
tRNA Gene Data and Genome Sets
From NCBI, we downloaded the set S of 117 cyanobacterial genomes analyzed in [47], the set Gl of one genome of the cyanobacterium Gloeomargarita lithophora, the set Pc of one genome of the chromatophore of the freshwater amoeba Paulinella chromatophora, and the set P of 440 complete plastid genomes containing representatives from all three lineages of Archaeplastida (Glaucocystophyta, Rhodophyta, and Viridiplantae). Let C ≡ S ∪ Gl ∪ Pc. For every genome g ∈ C, we annotated a set Tg of tRNA genes as the union of predictions from tRNAscan-SE v1.31 [34] in bacterial mode and ARAGORN v1.2.36 [31] with default settings. We annotated tRNA genes in the set P of plastid genomes similarly, except that we discarded as false positives those gene predictions from ARAGORN that contained introns in tRNA isotypes not previously described to contain introns [35, 48, 54]. We additionally filtered out tRNA gene predictions for land plant plastid genomes that contained anticodons not previously observed in land plant plastid tRNA genes [2, 50].
We annotated the functional types of tRNA genes either as elongator isotypes by anticodon alone or, for those containing the CAU anticodon, into initiator tRNA-Met ("X"), elongator tRNA-Met, or tRNA-Ile(CAU) ("J") using TFAM v1.4 [4] with the TFAM model used in [3, 5]. We aligned tRNA sequences using COVEA v2.4.4 [19] and the prokaryotic tRNA covariance model from tRNAscan-SE [34]. We edited the alignment by first removing sites containing 99% or more gaps using FAST v1.6 [32], and then removing sequences with unusual secondary structure. Lastly, we mapped sites to Sprinzl coordinates [49] and manually removed the variable arm, CCA tail, and sites not mapping to a Sprinzl coordinate using Seaview v4.6.1 [24]. The alignment is available as supplementary data.
We partitioned cyanobacterial tRNA genes into sets Tg for each genome g of origin, and separately into sets TX for each cyanobacterial clade X, with X ∈ CC ≡ {A, B1, B2+3, C1, C3, E, F, G} corresponding to clades identified in [47], except that we fused clades B2 and B3 into their union B2+3 and excluded four genomes in two clades, C2 and D, for insufficient data, defined as yielding fewer than 120 tRNA genes (Fig. 1). Let R ⊂ S be the set of all 113 cyanobacterial genomes not excluded. For every genome g ∈ R and each clade X ∈ CC, we also created leave-one-out cross-validation training sets TX ∖ Tg, that is, the tRNA genes of clade X with those of genome g removed.
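The partitioning and filtering described above can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the dictionary-based inputs, and the representation of a tRNA gene as a `(genome, gene)` pair are assumptions made for illustration. Fusing B2 and B3 is assumed to be handled upstream by labeling both with the same clade name.

```python
from collections import defaultdict

# Threshold from the text: clades yielding fewer than 120 tRNA genes are excluded.
MIN_TRNA_GENES = 120


def partition_by_clade(genome_clade, genome_trnas, min_genes=MIN_TRNA_GENES):
    """Pool tRNA genes by clade and drop clades with too few genes.

    genome_clade: dict mapping genome id -> clade label (hypothetical input).
    genome_trnas: dict mapping genome id -> list of tRNA gene records.
    Returns a dict mapping clade label -> list of (genome id, gene) pairs.
    """
    clade_sets = defaultdict(list)
    for genome, trnas in genome_trnas.items():
        clade_sets[genome_clade[genome]].extend((genome, t) for t in trnas)
    return {c: genes for c, genes in clade_sets.items() if len(genes) >= min_genes}


def leave_one_out_set(clade_genes, held_out_genome):
    """Return a clade's tRNA genes with those of one genome removed."""
    return [(g, t) for g, t in clade_genes if g != held_out_genome]
```

Keeping the genome of origin attached to each gene makes the leave-one-out sets a simple filter rather than a second bookkeeping structure.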
Genome Scoring
Following Amrine et al. [3], we produced training input vectors by first calculating clade-dependent Gorodkin heights [3, 23], denoted here $h^{X}_{f,i}$, in function logos [22] for all clade-specific tRNA gene sets $T_X$ or their leave-one-out counterparts $T_X \setminus T_g$ with $X \in CC$, for all features $f \in F \equiv \{A, C, G, U\} \times SC$, where $SC$ is the set of Sprinzl coordinates [49], and for all functional types $i \in I \equiv A \cup \{J, X\}$, where $A$ is the set of short IUPAC amino acid symbols standing for aminoacylation identities of elongators. We computed function logos using the custom software TSFM, available at https://github.com/tlawrence3/tsfm/tree/v0.9.6.
To score the tRNA gene complement $T_g$ of genome $g$, we calculated a vector of tRNA CIF-based scores $S_g$, in which element $S_g^{X}$ is the average, over all genes $t \in T_g$ of any type $i_t \in I$, where $i_t$ is the type of gene $t$, of the sum over all features $f \in t \subset F$ contained in that gene, of the Gorodkin heights [23] of those features for genes of that type in clade $X \in CC$:
$$S_g^{X} = \frac{1}{|T_g|} \sum_{t \in T_g} \sum_{f \in t} h^{X}_{f,\, i_t}$$
Following recommended practice [11], we standardized score vectors of both training and query data by subtracting the mean score vector of the training data and dividing element-wise by the standard deviations of scores by clade. We refer to the result as the standardized score vector of $S_g$.
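As a rough illustration of the scoring and standardization steps, the following Python sketch averages per-gene sums of Gorodkin heights and then standardizes with training-set statistics. The data layout (a per-clade dictionary keyed by `(feature, functional_type)`), the function names, the choice to give a height of zero to features absent from a clade's table, and the use of the population standard deviation are all assumptions for illustration, not details taken from the study.

```python
from statistics import mean, pstdev


def score_genome(genes, heights, clades):
    """Return one CIF-based score per clade for a genome.

    genes: list of (functional_type, features), where features is a list of
        (nucleotide, sprinzl_coordinate) pairs present in that gene.
    heights: dict clade -> dict {(feature, functional_type): Gorodkin height}.
    Features missing from a clade's table contribute 0.0 (an assumption).
    """
    scores = []
    for clade in clades:
        table = heights[clade]
        per_gene = [
            sum(table.get((feature, ftype), 0.0) for feature in features)
            for ftype, features in genes
        ]
        scores.append(mean(per_gene))  # average over all genes in the genome
    return scores


def standardize(train_scores, query_scores):
    """Standardize both sets using the training means and standard deviations."""
    columns = list(zip(*train_scores))  # one column of scores per clade
    mus = [mean(col) for col in columns]
    sds = [pstdev(col) for col in columns]  # population SD (an assumption)

    def z(vector):
        return [(x - m) / s for x, m, s in zip(vector, mus, sds)]

    return [z(v) for v in train_scores], [z(v) for v in query_scores]
```

Query vectors are standardized with the training statistics only, so no information from plastid or other query genomes leaks into the scaling. A clade whose training scores have zero variance would need special handling that this sketch omits.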
Phyloclassifier Model Training and Optimization
We implemented our multilayer neural network phyloclassifier using the MLPClassifier API of scikit-learn v0.18.1 [41] in Python v3.5.2. We trained models for up to 2000 training epochs, stopping early if for two consecutive iterations the cross-entropy loss function value did not decrease by a minimum of 1 × 10^−4, and with random shuffling of data between epochs. We used the rectifier activation function for hidden layer neurons, the L-BFGS algorithm for weight optimization, and an alpha value of 0.01 for the L2 regularization penalty parameter. Lastly, we used the soft-max function to calculate classification probability vectors. Using leave-one-out cross-validation (LOOCV), we optimized neural network architecture for accuracy averaged over genomes g ∈ R, considering all architectures with from one to four hidden layers and each layer individually containing from eight to sixteen nodes. To test the statistical significance of the average accuracy from LOOCV of the architecture-optimized CYANO-MLP, we permuted clade labels over training data in 100,000 replicates, followed by LOOCV and model retraining for each replicate.
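The architecture search and model settings above might be sketched as follows. This is an illustrative outline rather than the authors' pipeline: the function names are hypothetical, and the LOOCV loop holds score vectors fixed across folds, whereas the Methods re-estimate function logos from leave-one-out training sets for each held-out genome. The keyword arguments passed to scikit-learn's `MLPClassifier` are the ones that correspond to the settings stated in the text.

```python
from itertools import product


def candidate_architectures(min_layers=1, max_layers=4, min_nodes=8, max_nodes=16):
    """Enumerate hidden-layer size tuples: 1-4 layers, 8-16 nodes per layer."""
    sizes = range(min_nodes, max_nodes + 1)
    return [
        arch
        for depth in range(min_layers, max_layers + 1)
        for arch in product(sizes, repeat=depth)
    ]


def build_model(hidden_layer_sizes, seed=0):
    """Configure an MLP with the settings stated in the text."""
    # Imported here so the grid enumeration above has no third-party dependency.
    from sklearn.neural_network import MLPClassifier

    return MLPClassifier(
        hidden_layer_sizes=hidden_layer_sizes,
        activation="relu",  # rectifier activation
        solver="lbfgs",     # L-BFGS weight optimization
        alpha=0.01,         # L2 penalty
        max_iter=2000,
        tol=1e-4,
        random_state=seed,  # seed is an assumption; none is given in the text
    )


def loocv_accuracy(make_model, X, y):
    """Average leave-one-out accuracy for any model with fit() and predict()."""
    correct = 0
    for i in range(len(X)):
        model = make_model()
        model.fit(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        correct += int(model.predict([X[i]])[0] == y[i])
    return correct / len(X)
```

A search would then evaluate `loocv_accuracy(lambda: build_model(arch), X, y)` for each `arch` in `candidate_architectures()` and keep the most accurate one. The full grid contains 9 + 81 + 729 + 6561 = 7380 architectures, so in practice this loop is expensive.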
Phyloclassification and Bootstrapping
For each genomic tRNA gene set Tg, with g ∈ P ∪ Pc ∪ Gl, we computed a standardized score vector, input this to CYANO-MLP, and classified the genome to the clade with the largest classification probability. To examine the consistency of phylogenetic signals in our data, we computed 100 bootstrap replicates of sites in our alignment of training and test tRNA gene data, followed by CIF estimation, model retraining, and genome scoring and classification with each bootstrap replicate of CYANO-MLP. We summarized bootstrap results for cyanobacterial genomes by the number of replicates in which the most probable classification for a genome was its true clade of origin.
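The bootstrap procedure resamples alignment sites, not genomes. A minimal Python sketch of the three pieces involved (site resampling, choosing the most probable clade, and counting bootstrap support) is given below. The function names and the representation of the alignment as equal-length strings are assumptions for illustration, and the CIF estimation and retraining that would run on each replicate are omitted.

```python
import random


def bootstrap_sites(alignment, rng):
    """Resample alignment columns with replacement.

    alignment: list of equal-length aligned sequences (strings).
    The same resampled columns are applied to every sequence, so that
    training and test genes stay aligned to one another.
    """
    n_sites = len(alignment[0])
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return ["".join(seq[c] for c in columns) for seq in alignment]


def most_probable_clade(probabilities):
    """Return the clade label with the largest classification probability."""
    return max(probabilities, key=probabilities.get)


def bootstrap_support(replicate_calls, clade):
    """Count the replicates in which a genome was classified to `clade`."""
    return sum(1 for call in replicate_calls if call == clade)
```

For example, `bootstrap_sites(alignment, random.Random(1))` yields one replicate alignment, and after 100 replicates `bootstrap_support(calls, "B2+3")` gives a count out of 100 for one genome.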
Leave-Clade-Out and Balanced Model Variants
To examine the sensitivity of CYANO-MLP to missing data and model mis-specification, we re-optimized and re-trained models after leaving out one cyanobacterial clade or using only cyanobacterial clades A, B1, and B2+3. To produce clade-balanced training datasets, we randomly resampled training score vectors so that each cyanobacterial clade had sample sizes equal to the best-sampled clade, and then re-optimized and re-trained models.
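One way to build a clade-balanced training set is to oversample each smaller clade up to the size of the largest one. The sketch below does this in Python. It assumes that the original vectors are kept and that the extra vectors are drawn with replacement from the same clade; the text does not specify the resampling scheme, and the function name is hypothetical.

```python
import random
from collections import defaultdict


def balance_by_resampling(vectors, labels, rng):
    """Oversample each clade up to the size of the best-sampled clade.

    vectors: list of training score vectors.
    labels: list of clade labels, one per vector.
    rng: a random.Random instance, passed in for reproducibility.
    """
    by_clade = defaultdict(list)
    for vector, label in zip(vectors, labels):
        by_clade[label].append(vector)
    target = max(len(members) for members in by_clade.values())

    balanced_vectors, balanced_labels = [], []
    for label, members in by_clade.items():
        # Draw the shortfall with replacement from the clade's own vectors.
        extras = [rng.choice(members) for _ in range(target - len(members))]
        for vector in members + extras:
            balanced_vectors.append(vector)
            balanced_labels.append(label)
    return balanced_vectors, balanced_labels
```

Because oversampling duplicates vectors, copies of a held-out genome can end up in the training fold if balancing is done before the cross-validation split, which would inflate accuracy estimates. Balancing inside each fold avoids this.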
Evaluation of Phylogenetic Model Adequacy
We examined the goodness of fit of the phylogenomic datasets of Shih et al. [47], Ponce-Toledo et al. (chloroplast-marker dataset) [43] and Ochoa de Alda et al. (dataset 11) [39] with the substitution models originally used in those studies, namely LG+4Γ [33] and CAT-GTR+4Γ [28, 29]. Posterior Predictive Analyses (PPA) were performed to test fits for site-specific constraint biases using PPA-DIV [28] and across-lineage compositional biases using PPA-MAX and PPA-MEAN [10]. Additionally, we assessed model adequacy under three amino acid recoding strategies, Dayhoff-6 (Day6) [14], the six-state recoding strategy of Susko and Roger (SR6) [51], and the six-state recoding strategy of Kosiol et al. (KGB6) [27]. PPA results were interpreted using Z-scores under the assumption that the test statistics follow a normal distribution. We used a Z-score threshold of Z ≥ 5 as strong evidence for rejecting the model. We performed phylogenetic analyses using Phylobayes MPI v1.8 [30] with at least 1000 replicates and running two MCMC chains in parallel for each analysis. Convergence of chain trajectories was assessed using the TRACECOMP and BBCOMP utilities provided with Phylobayes MPI. Convergence was assumed when the discrepancies of model parameters and bipartition frequencies between independent chains were less than 0.18. The number of cycles to discard as burn-in was determined by visually examining the traces of the log-likelihood and other model parameters for stationarity using Tracer v1.6.0.
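The Z-score interpretation of a posterior predictive test can be written in a few lines of Python. The sketch below assumes that a test statistic has already been computed on the observed alignment and on each posterior predictive replicate. The function names are hypothetical, and the sign convention (observed minus simulated mean) is an assumption that may differ from the convention of a particular tool.

```python
from statistics import mean, stdev


def ppa_z_score(observed, simulated):
    """Z-score of an observed statistic against posterior predictive replicates.

    observed: test statistic computed on the real alignment.
    simulated: the same statistic computed on each simulated replicate.
    Assumes the replicate statistics are approximately normal, as in the text.
    """
    return (observed - mean(simulated)) / stdev(simulated)


def rejects_model(z, threshold=5.0):
    """Apply the Z >= 5 rejection threshold stated in the text."""
    return z >= threshold
```

For instance, an observed value of 7.0 against replicates `[1.0, 2.0, 3.0]` gives a Z-score of 5.0 and would count as strong evidence against the model under this threshold.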
Results
tRNA Data and CIF estimation
We annotated and extracted 5,476 tRNA genes from the 117 cyanobacterial genomes analyzed in [47], averaging 46.80 tRNA genes per cyanobacterial genome, 14,841 tRNA genes in 440 Archaeplastida plastid genomes averaging 33.73 tRNA genes per plastid genome, 44 tRNA genes from the cyanobacterium Gloeomargarita lithophora, and 42 tRNA genes from the chromatophore genome of the freshwater amoeba P. chromatophora (Table 1; Supplemental File 1). We excluded four genomes from further analysis (Fig. 1) and estimated function logos for cyanobacterial clades A, B1, B2+3, C1, C3, E, F, and G (Fig. 1, S1-S8; Table S1, S2) using the clade nomenclature of [47]. We fused clades B2 and B3 because they are sister clades and B3 contained only one genome. The C1 clade had the largest sample with a divergent nucleotide composition, elevated in G and C content (Table 1). Interestingly, clade C1 exhibited many gains of uracil CIFs (Fig. 1B) and also adenine CIFs (Fig. S4).
Training and Validation of a tRNA-Based Cyanobacterial Phyloclassifier
We trained CYANO-MLP on input vectors generated by scoring cyanobacterial genomic tRNA gene complements against clade-specific cyanobacterial CIFs as described in Methods. We systematically optimized the parameters and architecture of CYANO-MLP on the training data, settling on a single hidden layer of 13 nodes (Fig. 1), which achieved an average accuracy of 0.8673 (permutation test; p = 0.0001), calculated using Leave-One-Out Cross-Validation (LOOCV; Fig. S9, S10; Table S3). To examine the effects of unbalanced training data on the performance of CYANO-MLP, we created a separate clade-balanced version of the model (CYANO-MLP-BAL) by resampling data from under-represented clades. CYANO-MLP-BAL achieved a LOOCV average accuracy of 0.9875 with improvements in precision and recall for all clades (Fig. S19; Table S3), suggesting that biased sampling is an important and addressable limitation to the accuracy of CYANO-MLP. In addition, cyanobacterial reclassifications were correct in at least 97 of 100 bootstrap replicates of CYANO-MLP, showing that phylogenetic signals are consistent across tRNA CIFs (Supplemental File 1).
A desirable attribute of a phyloclassifier is an ability to signal “none-of-the-above” when the true clade of a query is unrepresented in the model. To address robustness to model specification and investigate the ability of CYANO-MLP to signal “none-of-the-above”, we trained additional versions of CYANO-MLP that leave out the largest-sampled clades, namely A (CYANO-MLP[!A]), B1 (CYANO-MLP[!B1]), B2+3 (CYANO-MLP[!B2+3]), or C1 (CYANO-MLP[!C1]). For each model variant, we then reclassified all genomes, including from the clade that had been left out. Overall, average LOOCV accuracies were similar to CYANO-MLP for each classifier variant (Fig. S14-S18; Table S3) with CYANO-MLP[!A] having the largest gain in accuracy (LOOCV: 0.9216; Fig. S16; Table S3) over CYANO-MLP. This was not unexpected, given that clade A had the lowest precision (Fig. S15-S18) and the smallest sample size among left-out clades (Table 1). Furthermore, the recall of clade A was most improved in CYANO-MLP-BAL. Generally with CYANO-MLP, classifications of cyanobacterial genomes from excluded clades were more equivocal than those of genomes from represented clades (Table S9-S12; Supplemental File 1). We claim that equivocal classifications with CYANO-MLP signal “none-of-the-above.”
The Paulinella chromatophora Chromatophore Phyloclassifies to the Marine C1 Prochlorococcus/Synechococcus Clade
The phylogenetic origin of the P. chromatophora chromatophore from the marine Prochlorococcus/Synechococcus clade (clade C1; Fig. 1) is well-supported by several phylogenomic analyses [39, 47, 52]. Concordantly, CYANO-MLP classified the P. chromatophora chromatophore to clade C1 with a 99.98% probability and 100% bootstrap support (Fig. 2; Table S5). Additionally, this phyloclassification was robust to model specification and was also obtained with CYANO-MLP-BAL, CYANO-MLP[!A], CYANO-MLP[!B1], and CYANO-MLP[!B2+3]. Finally, the P. chromatophora chromatophore classified similarly to other C1 genomes when using CYANO-MLP[!C1] (Table S11, S13).
CYANO-MLP Robustly Phyloclassifies Plastid Genomes to Late-Branching Cyanobacterial Clades
Using CYANO-MLP, we phyloclassified 437/440 (99.32%) plastid genomes to late-branching clades of Cyanobacteria, with 433 plastid genomes classifying to the B2+3 clade and four plastid genomes classifying to the A clade with high probabilities (Fig. 2; Table S5). Plastid genomes from all three Archaeplastida lineages phyloclassified to the late-branching cyanobacterial B2+3 clade. The majority of plastid bootstrap replicates classified to late-branching clades A, B1, and B2+3 with the median bootstrap frequency of all plastid groups against clade B2+3 at or above 70, except for the Glaucocystophyta genome (Fig. 2, S11-S14). The three remaining plastid genomes classified to early-diverging cyanobacterial clades: two to clade F and one to clade G (Fig. 2; Table S5). With CYANO-MLP-BAL, 18 and 384 plastid genomes classified to clades A and B2+3, respectively (Table S8).
Plastid genome classifications were mostly robust to model specifications. Classifications of clade-represented genomes were mostly unchanged in the CYANO-MLP[!A] and CYANO-MLP[!C1] leave-clade-out models (Fig. S21; Table S10, S11). In contrast, plastid classifications with CYANO-MLP[!B1] were ambiguous, with equal probabilities between clades A and B2+3 (Table S12). However, after retraining CYANO-MLP[!B1] using balanced training data (CYANO-MLP-BAL[!B1]), phyloclassifications were restored to be similar to those with CYANO-MLP and CYANO-MLP-BAL (Table S12, S13). We then developed two phyloclassifiers including training data only from late-branching clades A, B1, and B2+3, one with balanced training data (CYANO-MLP-BAL[AB1B2+3]) and one without (CYANO-MLP[AB1B2+3]). CYANO-MLP[AB1B2+3] classified plastids equivocally between clades A and B2+3, similarly to CYANO-MLP[!B1], though slightly favoring clade B2+3 (Table S13, S14). After balancing data by resampling, CYANO-MLP-BAL[AB1B2+3] more decisively phyloclassified 331 plastid genomes to clade B2+3 and only 106 genomes to clade A (Table S8, S13, S14). Remarkably, phyloclassifications of both plastid and B2+3-cyanobacterial genomes with the CYANO-MLP[!B2+3] leave-clade-out model were equivocal and similar to one another (Fig. S21; Table S6-S9), consistent with “none-of-the-above” classification and providing compelling additional support that plastids belong to clade B2+3.
Phyloclassification of G. lithophora is Consistent with its Early Divergence within Cyanobacteria
Recent phylogenomic analyses support a sister relationship between plastids and an early-diverging lineage containing G. lithophora as its only member [43, 52]. With only one genome, there was insufficient tRNA sequence data to estimate CIFs for this lineage. Instead, we classified the G. lithophora genome using CYANO-MLP to determine whether it classified similarly to plastids, which would be consistent with a sister relationship of G. lithophora and plastids. We found that the G. lithophora genome obtained greater than 75% total classification probability against three early-diverging clades, classifying to clade F with probability 57.3%, to clade G with probability 18.4%, and to clade E with probability 3.2%. In addition, G. lithophora classified to the late-diverging clade A with probability 20.3% (Fig. 2). We interpreted the results as consistent with a “none-of-the-above” classification, yet favoring an early branching of G. lithophora, in agreement with recent phylogenomic analyses [43, 52]. Notably, the incongruity of our results for G. lithophora and plastids rejects their sister relationship.
Inadequate modeling of systematic biases can explain discrepancies with prior work
We examined goodness of fit of various evolutionary models to published combined cyanobacterial/plastid phylogenomic datasets by posterior predictive analysis [10, 28]. We found evidence that site-specific amino acid constraints are critical to fitting all three cyanobacterial/plastid phylogenomic datasets (Figure 3A; Table S15). The empirical matrix model LG+4Γ [33], with site-rate heterogeneity, fails to model site-specific substitution processes [28, 29] in all three phylogenomic datasets and fits them poorly (Fig. 3A, Table S15). The inadequacy of empirical matrix models to fit data with site-specific constraints was previously reported [29]; their use to fit such data results in long-branch attraction artifacts caused by underestimation of homoplasy [28]. In contrast, the CAT model [28, 29] specifically accommodates site-specific constraints, fitting all three datasets adequately (Fig. 3A; Table S15). However, even in combination with CAT, none of the amino acid recoding methods adequately mitigate lineage-specific compositional biases (Z≥5, Fig. 3B; Table S15). When lineage-specific compositional biases are not adequately modeled, unrelated sequences with similar compositions may artifactually cluster during phylogenetic tree reconstruction [10].
Discussion
We recovered strong support for a late-branching origin of plastids within or closely related to the B2+3 clade of the CyanoToL (Fig. 1, 2; Table S5). Furthermore, our result of a late-branching clade B2+3 origin of plastids is robust to bootstrap resampling of tRNA structural positions (Fig. 2; Supplemental File 1), missing data (Fig. S15-S18, S21; Table S7-S12), and unbalanced training data (Fig. S19-S21; Table S7, S8, S12, S13). Additionally, we were able to reject recent hypotheses supporting the early-branching G. lithophora as sister to plastids [43, 52] (Fig. 2). Our results conform to independent metabolic evidence that plastids originated from a unicellular starch-producing diazotrophic cyanobacterial species [6, 15], and independent comparative evidence that photosynthetic eukaryotes originated and diversified rapidly in a low-salinity habitat [8, 52].
Importantly, the significantly lower classification accuracy of CYANO-MLP on class-permuted training datasets (Fig. S9) supports the conclusion that CYANO-MLP phyloclassifications depend on learned phylogenetic signals in cyanobacterial tRNA CIFs. Furthermore, we argue against the interpretation that plastid genomes have experienced distinctive selection pressures yielding idiosyncratic score vectors and artifactual results, for two reasons: plastids classified consistently in the various re-specifications of CYANO-MLP, and C1 clade Cyanobacteria with reduced genomes and the P. chromatophora chromatophore genome, presumably under similar selection pressures as plastid genomes, classified consistently to the C1 clade (Fig.), in concert with previous work [39, 47, 52].
To reconcile recent studies with our results, we reexamined the fits of recently used models and recoding strategies to three published cyanobacterial/plastid phylogenomic datasets (Fig. 3; Table S15). We found that the CAT model [28, 29] accommodated site-specific constraints (Fig. 3A; Table S15); however, amino acid recoding strategies were unable to mitigate lineage-specific compositional biases (Fig. 3B; Table S15). Only one prior phylogenomic study took into account both sources of bias [39]; in it, 16S rDNA nucleotide data were modeled using CAT-GTR while removing compositionally divergent taxa to achieve compositional homogeneity. Notably, the findings of [39] are consistent with ours in supporting a late-branching origin of plastids within Cyanobacteria.
Acknowledgments
DHA and TJL were supported by the National Science Foundation (INSPIRE-1344279). DHA was supported by NIH/NIAID 1R21AI127582-0. Computational research was performed on the MERCED HPC cluster supported by the National Science Foundation (ACI-1429783). The authors thank Harish Bhat, Emily Jane McTavish, Suzanne Sindi, Carolin Frank, Dana Carper, Jeanne Milostan and David Noelle for discussions.
Footnotes
* dardell{at}ucmerced.edu