Abstract
The type III secretion system transports effector proteins of pathogenic and endosymbiotic Gram-negative bacteria into the cytoplasm of host cells. During infection, effectors convert host resources to work to the bacteria's advantage. Existing computational methods for the prediction of type III effectors mainly employ information encoded in the N-terminal protein sequence. Here we introduce pEffect, a method that predicts type III effector proteins using the entire amino acid sequence. It combines homology-based inference with de novo predictions, reaching 87±7% accuracy at 95±5% coverage for a large non-redundant set of proteins. This performance is up to 3-fold higher than that of other methods. pEffect also sheds new light on effector secretion mechanisms. We establish that “signals” for the recognition of type III effectors are distributed over the entire protein sequence instead of being confined to the N-terminus. Our method, therefore, maintains high performance even when used with sequence fragments like metagenomic reads, and potentially facilitates studies of microbial community interactions. Explorations into the evolutionary origins of type III secretion identify a variety of recently evolved effectors and highlight the possibility of a type III secretion ancestor dating to times prior to the archaea/bacteria split. pEffect is available at http://www.bromberglab.org/services/pEffect.
Introduction
Six secretion systems have been identified in pathogenic and endosymbiotic Gram-negative bacteria (Cornelis, 2006; Holland et al., 2005; Leo et al., 2012; Low et al., 2014; Nivaskumar and Francetic, 2014). The type III secretion system mediates a wide range of bacterial infections in humans, animals and plants (Buttner and He, 2009; Hueck, 1998; Marshall and Finlay, 2014). This system comprises a hollow needle-like structure localized on the surface of bacterial cells that injects specific bacterial proteins, effectors, directly into the cytoplasm of a host cell (Cornelis, 2006). During infection, effectors act in concert to convert host resources to the bacteria's advantage and promote pathogenicity (Troisfontaines and Cornelis, 2005).
Advances in sequencing techniques are producing an ever-growing number of bacterial genome sequences (Wang et al., 2012). As a result, the identification of bacterial type III effectors has shifted away from experimental discovery of individual proteins to whole genome computational screens. Various machine learning algorithms, including Naive Bayes (Arnold et al., 2009), Support Vector Machines (SVMs) (Wang et al., 2011), Artificial Neural Networks (Lower and Schneider, 2009) and Markov models (Wang et al., 2013) have been deployed to identify type III effectors in silico. These methods use sequence similarity to experimentally known effectors as input; this similarity is defined on the basis of different features, such as GC content (coding genes), as well as amino acid composition, secondary structure, and solvent accessibility (proteins). Methods often focus on features in the protein N-terminus, assumed to be most informative for the translocation of effectors through type III secretion (Ghosh, 2004). An independent benchmark revealed that state-of-the-art methods predict type III effectors at similar levels of up to 80% accuracy at 80% coverage (McDermott et al., 2011); thus, there still seems to be room for substantial improvement.
Here, we introduce pEffect, a method that combines sequence similarity-based inference (PSI-BLAST) with de novo prediction using machine learning techniques (Support Vector Machines; SVM). Our method uses information about the entire amino acid sequence of each protein. To allow users to focus on the most relevant results, it provides a score reflecting the strength of each prediction. pEffect was developed using a positive data set comprising type III effectors extracted from the literature and UniProt (UniProt Consortium, 2012) and a negative data set combining bacterial non-effector proteins and eukaryotic proteins sequence-similar to bacterial effectors. It attains 87±7% accuracy at 95±5% coverage in predicting type III effectors, significantly outperforming its components (PSI-BLAST and SVM). When tested on sequence fragments similar in length to shotgun sequencing reads, pEffect's performance was not significantly different. This result suggests that the information required for distinguishing effectors is not confined to any particular part of the amino acid sequence. Our method provides a basis for the identification of exported pathogenic proteins as targets for future therapeutic treatments. We also suggest using pEffect as a starting point for studies of interactions within microbial communities, detected directly from metagenomic reads and without need for individual genome assembly.
Methods
Development data sets. Our positive data set of known type III effector proteins was extracted from scientific publications (Angot et al., 2007; Arnold et al., 2009; Chang et al., 2005; Greenberg and Vinatzer, 2003; Gurlebeck et al., 2006; Guttman et al., 2002; Miao and Miller, 2000; Sato and Frank, 2004; Tobe et al., 2006) and the Pseudomonas-Plant Interaction web site (http://www.pseudomonas-syringae.org/). The corresponding amino acid sequences were taken from the UniProt database (UniProt Consortium, 2012), 2012_01 release. We additionally queried UniProt with keywords 'type III effector', 'type three effector' and 'T3SS effector' and manually curated the results for experimentally identified effectors. In total, our positive (effector) data set contained 1,388 proteins.
To compile our negative data set of non-type III effectors we used the experimentally annotated Swiss-Prot proteins (Bairoch and Apweiler, 2000) from the 2012_01 UniProt release. We extracted all bacterial proteins that were NOT annotated as type III effectors and had no significant sequence similarity (BLAST (Altschul et al., 1990) e-value > 10) to any type III effector in our positive set. We also added all eukaryotic proteins applying no sequence similarity filters. Our negative set thus contained roughly 470,000 proteins.
We removed from our sets all proteins that were annotated as 'uncharacterized', 'putative', or 'fragment'. We reduced sequence redundancy independently in each set using UniqueProt (Mika and Rost, 2003), ascertaining that no pair of proteins within a set aligned over more than 35 residues with a non-negative HSSP-value (HVAL≥0) (Rost, 1999; Sander and Schneider, 1991). After redundancy reduction our sequence-unique sets contained 115 type III effector proteins from 43 different bacterial species and 3,460 non-effector proteins (of which 37% were bacterial). Note that proteins from positive and negative sets are sometimes similar as homology reduction was only applied within sets and not across sets. Here, this set of sequences (positive and negative sets together) is termed the Development set. All pEffect performance results were compiled on stratified cross-validation of this Development set (five-fold cross-validation, i.e. we split the entire set into five similarly-sized subsets and trained five models, each on a different combination of four of these subsets, testing each model on the remaining fifth subset so that every subset was used for testing exactly once).
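The stratified five-fold split described above can be sketched as follows. This is a minimal illustration, not the procedure actually used for pEffect: the function names, the round-robin assignment and the random seed are assumptions, and only the class sizes (115 effectors, 3,460 non-effectors) are taken from the text.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=5, seed=0):
    """Assign sample indices to n_folds folds with similar class ratios."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal indices round-robin so every fold gets a near-equal class share.
        for pos, idx in enumerate(indices):
            folds[pos % n_folds].append(idx)
    return folds


def cross_validation_splits(labels, n_folds=5, seed=0):
    """Yield (train, test) index lists; each fold is the test set exactly once."""
    folds = stratified_folds(labels, n_folds, seed)
    for held_out in range(n_folds):
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield sorted(train), sorted(folds[held_out])


# Class sizes of the Development set: 1 = effector, 0 = non-effector.
labels = [1] * 115 + [0] * 3460
splits = list(cross_validation_splits(labels))
```

With these class sizes every test fold holds 23 effectors and 692 non-effectors, and no protein appears in both the training and the test part of a split.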
Additional data sets. Comparing pEffect performance to that of other methods using our cross-validation approach has only limited value due to the possible overlap between our testing and other methods' development/training sets. A more meaningful way is to use non-redundant sets of effector and non-effector proteins that have never been used for the development of any method. Toward this end, we extracted the following data sets:
(1) We collected all type III effectors added to UniProt between releases 2012_01 and 2014_08 and non-type III bacterial and eukaryotic proteins added between the corresponding releases of Swiss-Prot. These were redundancy reduced at HVAL<0 to produce the UniProt'14 HVAL0 test set (107 effectors and 1,159 non-effectors). Note that additionally reducing this set to be sequence dissimilar to the Development set would retain only 30 type III effectors, too few for reliable performance estimates. However, even for this smaller and completely independent set, the performance of pEffect was higher than that of other tools, making pEffect a uniquely reliable method for determining new effectors (Supplementary Table S1).
(2) To answer the question “how well will pEffect perform on protein sequences added to databases within the next six months?” we collected the proteins added to UniProt (type III effectors) and Swiss-Prot (non-effector bacterial and eukaryotic sequences) after the 2014_08 release, producing the set UniProt'15Full (498 effectors and 1,509 non-effectors).
(3) We also extracted all bacterial type III effectors from the T3DB database (Wang et al., 2012) - T3DBFull set (218 effectors and 831 non-effectors). We deliberately kept the redundancy in this set (up to HVAL = 66, i.e. over 85% pairwise sequence identity over 450 residues aligned).
(4) Finally, we redundancy reduced the T3DB set at HVAL<0. This gave the T3DBHVAL0 set (66 effectors and 128 non-effectors).
T3DB Ortholog clusters of the type III secretion system (T3SS) machinery. T3DB is a database of experimentally annotated T3SS-related proteins in 36 bacterial taxa. Proteins of the same function and the same evolutionary origin are clustered in T3DB into T3 Ortholog clusters (http://biocomputer.bio.cuhk.edu.hk/T3DB/T3-ortholog-clusters.php). The proteins of these clusters form ten components of the T3SS. Proteins of five of these components (export apparatus, inner membrane ring, outer membrane ring, cytoplasmic ring, and ATPase) are present in all 36 taxa in T3DB (Supplementary Table S2). We thus defined the minimum number of five components necessary for the formation of the T3SS machinery. With the exception of the outer membrane ring, these components have also been defined as the core before (McCann and Guttman, 2008).
Prediction methods. We tested several ideas for prediction, including the following.
Homology-based inference. We transferred type III effector annotations by homology using PSI-BLAST (Altschul et al., 1997) alignments. For every query sequence we generated a PSI-BLAST profile (two iterations, inclusion threshold e-value ≤ 10⁻³) using an 80% non-redundant database combining UniProt (Bairoch and Apweiler, 2000) and PDB (Berman et al., 2000). We then aligned this profile (inclusion e-value ≤ 10⁻³) against all type III effectors extracted from the literature and the UniProt 2012_01 release. For known effectors, we excluded the PSI-BLAST self-hits. We transferred annotation to the query protein from the hit with the highest pairwise sequence identity of all retrieved alignments.
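The hit selection step can be illustrated with a short sketch. It assumes that PSI-BLAST alignments against known effectors have already been parsed into simple records; the Hit class, its field names and the identifiers in the example are hypothetical and not part of pEffect.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    subject_id: str          # identifier of the aligned known effector
    evalue: float            # PSI-BLAST e-value of the alignment
    percent_identity: float  # pairwise sequence identity in percent


def best_effector_hit(query_id, hits, max_evalue=1e-3):
    """Return the hit with the highest identity, or None if none qualifies.

    Self-hits are skipped and only alignments with e-value <= max_evalue count.
    """
    candidates = [
        h for h in hits
        if h.evalue <= max_evalue and h.subject_id != query_id
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda h: h.percent_identity)


# Hypothetical example: the self-hit and the weak hit are both ignored.
hits = [
    Hit("query1", 0.0, 100.0),
    Hit("effectorA", 1e-20, 42.0),
    Hit("effectorB", 1e-5, 57.5),
    Hit("effectorC", 0.5, 90.0),
]
best = best_effector_hit("query1", hits)
```

In this toy example the annotation would be transferred from effectorB, the qualifying hit with the highest pairwise identity.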
De novo prediction. We used the WEKA (Frank et al., 2004) Support Vector Machine (SVM) (Cortes and Vapnik, 1995) implementation to discriminate between type III effector and non-effector proteins. For each protein sequence, we created a PSI-BLAST profile (as described above) and applied the Profile Kernel function (Hamp et al., 2013; Kuang et al., 2004) to map the profile to a vector indexed by all possible subsequences of length k from the alphabet of amino acids; we found that k = 4 amino acids provides best results. Each element in the vector represents one particular k-mer, and its value gives the number of positions in the profile at which this k-mer scores below a user-defined threshold σ; we found that σ = 7 provides best results. This score is calculated as the ungapped cumulative substitution score in the corresponding sequence profile. Thus, the dot product between two k-mer vectors reflects the similarity of two protein sequence profiles. Essentially, the method identifies those stretches of k adjacent residues in profiles of type III effectors that are most informative for prediction and matches these to the profile of a query protein. The parameters for the SVM and the kernel function were determined separately for each fold in our 5-fold cross-validation and, thus, were never optimized for the test sets.
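The k-mer mapping can be made concrete with a simplified sketch. It assumes that each profile column is given as residue probabilities and that the cumulative score of a k-mer is the sum of negative natural-log probabilities, in the spirit of Kuang et al. (2004); the published implementation differs in detail (smoothing, full 20-letter alphabet, efficient data structures), and all function names here are hypothetical.

```python
import math
from collections import Counter


def kmer_neighborhood(window, sigma):
    """All k-mers whose summed cost over the window columns is below sigma."""
    results = []

    def extend(prefix, cost, depth):
        if depth == len(window):
            results.append(prefix)
            return
        for residue, probability in window[depth].items():
            if probability <= 0.0:
                continue
            new_cost = cost - math.log(probability)
            # Costs are non-negative, so a prefix at or above sigma is pruned.
            if new_cost < sigma:
                extend(prefix + residue, new_cost, depth + 1)

    extend("", 0.0, 0)
    return results


def profile_features(profile, k=4, sigma=7.0):
    """Count, for each k-mer, the windows in which it scores below sigma."""
    features = Counter()
    for start in range(len(profile) - k + 1):
        for kmer in kmer_neighborhood(profile[start:start + k], sigma):
            features[kmer] += 1
    return features


def kernel(features_a, features_b):
    """Dot product of two sparse k-mer count vectors."""
    return sum(count * features_b[kmer] for kmer, count in features_a.items())


# Toy profile over a two-letter alphabet; real profiles have 20 residues.
toy_profile = [{"A": 0.9, "C": 0.1}] * 3
strict = profile_features(toy_profile, k=2, sigma=1.0)
loose = profile_features(toy_profile, k=2, sigma=3.0)
```

With the strict threshold only the k-mer AA is counted in each of the two windows, whereas the looser threshold also admits AC and CA. A larger σ therefore widens the neighborhood of k-mers that a profile position can match.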
pEffect. Our final method, pEffect, combined sequence similarity-based and de novo predictions. Toward this end, over-fitting was avoided through the simplest possible combination: if any known type III effector is sequence similar to the query use this (similarity-based prediction), otherwise use the de novo prediction.
Reliability index. The strength of a pEffect prediction is represented by a reliability index (RI) ranging from 0 (weak prediction) to 100 (strong prediction). For de novo predictions, we computed the RI by multiplying the SVM output by 100 for positive (type III effector) predictions and by subtracting this product from 100 for negative predictions. For sequence similarity-based inferences, the RI is the percentage of pairwise sequence identity normalized to the interval [50, 100], to agree with the SVM prediction range.
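The combination rule and the reliability index can be sketched together. Two assumptions are made that the text does not spell out: the SVM output is treated as a value in [0, 1] with 0.5 as the decision boundary, and sequence identity is mapped linearly from [0, 100] onto [50, 100]. The function names are hypothetical.

```python
def ri_from_svm(svm_output):
    """Return (is_effector, RI) for a de novo prediction.

    Assumes svm_output lies in [0, 1] and values >= 0.5 mean 'effector'.
    """
    score = 100.0 * svm_output
    is_effector = svm_output >= 0.5
    return is_effector, (score if is_effector else 100.0 - score)


def ri_from_identity(percent_identity):
    """Map identity in [0, 100] linearly onto [50, 100] (assumed scheme)."""
    return 50.0 + percent_identity / 2.0


def peffect_predict(best_hit_identity, svm_output):
    """Use the homology hit if there is one, otherwise the de novo result.

    best_hit_identity is the percent identity of the best PSI-BLAST hit to a
    known effector, or None when no hit passed the e-value threshold.
    """
    if best_hit_identity is not None:
        return True, ri_from_identity(best_hit_identity), "homology"
    is_effector, ri = ri_from_svm(svm_output)
    return is_effector, ri, "de novo"


with_hit = peffect_predict(80.0, 0.10)     # SVM output is ignored here
without_hit = peffect_predict(None, 0.25)  # falls back to the SVM
```

Under these assumptions a query with an 80% identical effector hit is reported as an effector with RI = 90, while a query without any hit and an SVM output of 0.25 is reported as a non-effector with RI = 75.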
Existing methods. We benchmarked pEffect against three state-of-the-art publicly available methods for type III effector prediction, using their default parameters: BPBAac (Wang et al., 2011), Effective T3 (Arnold et al., 2009) and T3_MM (Wang et al., 2013) (Supplementary Section S1).
Evolutionary distances. For the discovery of novel type III effectors in entirely sequenced organisms, we extracted evolutionary distances from the phylogenetic tree of 2,966 bacterial and archaeal taxa, inferred from 38 concatenated genes and available in the Newick format (Lang et al., 2013).
Results
pEffect succeeded in linking homology-based and de novo predictions. Most functional annotations of new proteins originate from homology-based transfer, i.e. on the basis of their homology (shared ancestry) to proteins with experimental characterization. For type III effector prediction, homology-based inference implies finding a sequence-similar experimentally annotated type III effector (Methods).
The accuracy of homology-based inference by PSI-BLAST was comparable to that of our de novo prediction method on the cross-validation Development set (Table 1: 91% vs. 92%). However, at this level of accuracy, the coverage of PSI-BLAST was significantly higher (Table 1: 84% vs. 60%). This result encouraged combining these two approaches as introduced in our recent work, LocTree3 (Goldberg et al., 2014): use PSI-BLAST when sequence similarity suffices (e-value ≤ 10⁻³; Table 1: F1 = 0.87 on the complete set) and the SVM otherwise (Table 1: F1 = 0.67 on the subset of proteins without a PSI-BLAST hit). The combined method, pEffect, outperformed both its components, reaching an F1 measure of 0.91 (Table 1).
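As a worked check of these numbers, the F1 measure can be recomputed from the accuracy and coverage values quoted in the text. The sketch assumes that accuracy and coverage correspond to what is elsewhere called precision and recall, so that F1 is their harmonic mean; the variable names are illustrative.

```python
def f1_score(accuracy, coverage):
    """Harmonic mean of accuracy (precision) and coverage (recall)."""
    return 2.0 * accuracy * coverage / (accuracy + coverage)


psi_blast_f1 = f1_score(0.91, 0.84)  # homology-based inference
svm_f1 = f1_score(0.92, 0.60)        # de novo prediction, complete set
peffect_f1 = f1_score(0.87, 0.95)    # combined method

print(round(psi_blast_f1, 2), round(svm_f1, 2), round(peffect_f1, 2))
# prints: 0.87 0.73 0.91
```

The value of 0.73 for the SVM refers to the complete set; the lower F1 = 0.67 is reported only for the subset of proteins without a PSI-BLAST hit.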
pEffect outperformed other methods. We compared pEffect to publicly available methods: BPBAac (Wang et al., 2011), Effective T3 (Arnold et al., 2009) and T3_MM (Wang et al., 2013). In contrast to pEffect, all these methods focus exclusively on N-terminal features (Supplementary Section S1). BPBAac and T3_MM rely solely on amino acid composition, while Effective T3 combines amino acid composition and secondary structure information. We compared performance for UniProt proteins that had NOT been used to develop any method, and for T3DB proteins, some of which all methods (incl. pEffect) had used for development. In our hands, pEffect significantly outperformed its competitors on all data sets (Figure 1, Supplementary Table S3). The F1 performance of pEffect exceeded that of the other methods by at least 0.58 when tested on any data set with eukaryotic proteins (ΔF1(pEffect, T3_MM) = 0.58 for both UniProt sets, Supplementary Table S3). Thus, pEffect excelled over existing tools in distinguishing type III effectors from other bacterial proteins (F1>0.64) and from eukaryotic proteins (F1>0.85). This improvement is particularly important to, e.g., annotate results from metagenomic studies (Zhou et al., 2014).
pEffect excelled even for protein fragments. To evaluate pEffect's ability to annotate effectors from incomplete or erroneous genomic assemblies, we fragmented the proteins from the T3DBFull set, i.e. the data set for which other methods were at their best (Figure 1, Supplementary Table S4). We started with protein rather than gene sequence fragments because we did not expect incorrect gene translations of DNA reads, even if sufficiently long, to trigger incorrect effector predictions from any method. Four different approaches were used to generate protein fragments: (i) remove the first 30 residues (N-terminus) from the full protein sequence, (ii) remove the last 30 residues (C-terminus), (iii) randomly remove residues from the N- and C-terminus until two thirds of the protein are left, and (iv) randomly choose from each protein a single fragment of a typical translated read length (Supplementary Figure S1).
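The four fragmentation schemes can be sketched as follows. The details are assumptions made for illustration: in scheme (iii) each removed residue is taken from a randomly chosen terminus, and the read_length of 50 residues in scheme (iv) is a placeholder, since the text refers to Supplementary Figure S1 for the actual length distribution.

```python
import random


def drop_n_terminus(seq, n=30):
    """Scheme (i): remove the first n residues."""
    return seq[n:]


def drop_c_terminus(seq, n=30):
    """Scheme (ii): remove the last n residues."""
    return seq[:-n] if n > 0 else seq


def trim_to_two_thirds(seq, rng):
    """Scheme (iii): trim randomly chosen ends until two thirds remain."""
    target = (2 * len(seq) + 2) // 3  # ceil(2/3 * length)
    start, end = 0, len(seq)
    while end - start > target:
        if rng.random() < 0.5:
            start += 1
        else:
            end -= 1
    return seq[start:end]


def random_read_fragment(seq, rng, read_length=50):
    """Scheme (iv): one random fragment of read_length (placeholder value)."""
    if len(seq) <= read_length:
        return seq
    start = rng.randrange(len(seq) - read_length + 1)
    return seq[start:start + read_length]


rng = random.Random(0)
protein = "ACDEFGHIKLMNPQRSTVWY" * 6  # hypothetical 120-residue sequence
fragments = {
    "i": drop_n_terminus(protein),
    "ii": drop_c_terminus(protein),
    "iii": trim_to_two_thirds(protein, rng),
    "iv": random_read_fragment(protein, rng),
}
```

For the 120-residue toy sequence this gives fragments of 90, 90, 80 and 50 residues, each a contiguous stretch of the original protein.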
pEffect outperformed all other methods for all fragment sets (i-iv). All methods performed best on fragments with C-terminal cleavage (set ii, Figure 1, Supplementary Table S4). Performance was lowest for random fragments of typical read lengths (set iv). However, pEffect performed almost as well on this set as on full-length sequences (F1 = 0.67 on set iv vs. F1 = 0.69 on full length, Supplementary Table S4). For all fragment sets the pEffect and PSI-BLAST performances were within the standard error of what was obtained using full-length sequences (T3DBFull set; Figure 1, Supplementary Table S3). Furthermore, for smaller protein fragments (sets iii and iv) using de novo prediction in addition to PSI-BLAST did not improve pEffect. These results suggest that the features distinguishing type III effectors are spread over the entire protein sequence and are picked up by local alignment, i.e. PSI-BLAST.
Reliability index identified confident predictions. pEffect provides a reliability index (RI) to measure the confidence of a prediction; the value of RI ranges from 0 (uncertain) to 100 (most reliable). For PSI-BLAST searches, RIs are normalized values of percentage pairwise sequence identities read off the alignments. For de novo predictions, RIs are values corresponding to SVM scores (Methods). Including predictions with low RIs gives many trusted results at reduced accuracy. Higher accuracy predictions are obtained by sampling at higher RIs, thus reducing the total number of trusted samples. For example, at the threshold of RI ≥ 50, over 87% of all predictions of type III effectors are correct and 95% of all effectors in our set are identified (Figure 2: black arrow). On the other hand, at RI>80 effector predictions are correct 96% of the time, but only 78% of all effectors in the set are identified (Figure 2: gray arrow). Thus, users can choose the most appropriate threshold for a given study. Users can also focus on previously unidentified effectors (de novo predictions) or, vice versa, on validated homologs of known effectors (PSI-BLAST matches; Supplementary Figure S2).
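The trade-off between accuracy and coverage at a given RI threshold can be illustrated with a small sketch. The record layout and the toy predictions are hypothetical and unrelated to the data behind Figure 2; accuracy is taken as the fraction of accepted effector predictions that are correct, and coverage as the fraction of all true effectors that are recovered.

```python
def performance_at_threshold(predictions, min_ri):
    """Return (accuracy, coverage) of effector predictions with RI >= min_ri.

    predictions: iterable of (predicted_effector, ri, is_true_effector).
    """
    true_pos = false_pos = total_effectors = 0
    for predicted_effector, ri, is_true_effector in predictions:
        if is_true_effector:
            total_effectors += 1
        if predicted_effector and ri >= min_ri:
            if is_true_effector:
                true_pos += 1
            else:
                false_pos += 1
    accepted = true_pos + false_pos
    accuracy = true_pos / accepted if accepted else 0.0
    coverage = true_pos / total_effectors if total_effectors else 0.0
    return accuracy, coverage


# Hypothetical toy predictions, not taken from the manuscript.
toy = [
    (True, 95, True), (True, 85, True), (True, 60, True),
    (True, 55, False), (False, 70, True), (False, 90, False),
]
lenient = performance_at_threshold(toy, 50)  # (0.75, 0.75)
strict = performance_at_threshold(toy, 80)   # (1.0, 0.5)
```

Raising the threshold from 50 to 80 removes the one false positive in the toy set but also discards a true effector, mirroring the behavior described above.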
Scanning proteomes for type III effector proteins. We used pEffect to annotate type III effectors in 862 bacterial (274 Gram-positive and 588 Gram-negative bacteria) and 90 archaeal proteomes from the European Bioinformatics Institute (EBI: http://www.ebi.ac.uk/genomes/). Our predictions are available at the pEffect website (http://www.bromberglab.org/services/pEffect/proteomes).
Each bacterium was predicted to contain at least one type III effector (Figure 3; Supplementary Table S5, a minimum of 0.8% - 2 out of all 240 proteins in a proteome are predicted as effectors). For some Gram-negative bacteria over 750 type III effectors were predicted (e.g. Sorangium cellulosum So ce56 - 1,207 effectors, Stigmatella aurantiaca DW4/3-1 - 870, Corallococcus coralloides DSM 2259 - 826 and Haliangium ochraceum DSM 14365 - 792, Supplementary Table S5). Stigmatella aurantiaca DW4/3-1 is hypothesized to have a type III secretion system (T3SS) and effectors (Konovalova et al., 2010). We could not find any literature record for the other three species.
Overall, the number of predicted type III effectors was 1% to 10% of the whole proteome in Gram-positive bacteria, and 1% to 15% in Gram-negative bacteria (Figure 3, Supplementary Table S5). To further understand our predictions, we retrieved UniProt keywords of predicted effectors. Their annotations varied widely. The most common keywords for both types of bacteria were transferase (a large class of enzymes responsible for the transfer of specific functional groups from one molecule to another), nucleotide-binding (a common functionality of effector proteins), ATP-binding (ATP binding is also essential to the T3SS) and kinase (kinase activity is necessary for the expression of T3SS genes). About one fourth (26-29% per proteome) of predicted type III effectors are functionally 'unknown' (Supplementary Table S6).
We also predicted type III effectors in all archaeal proteomes, with over 100 effectors identified in the proteomes of Haloterrigena turkmenica DSM 5511 and Methanosarcina acetivorans C2A (126 and 105 effectors, respectively; Supplementary Table S5). On average, there were fewer effectors predicted in archaea than in bacteria: 1.9% is the overall per-organism number for archaea vs. 3.4% for Gram-positive and 4.6% for Gram-negative bacteria (Figure 3). The most frequent annotations of predicted archaeal effectors were similar to those for predicted bacterial effectors, namely 'unknown', nucleotide-binding, ATP-binding and transferase (Supplementary Table S6). We address the unexpected predictions of effectors in archaea further in the Discussion section.
T3SS is most likely to exist in organisms with ≥5% predicted effectors and five type III machinery components. We BLASTed proteins representative of five T3DB Ortholog clusters (e-value ≤ 10⁻³; Supplementary Table S2) against the full proteomes of our 862 bacteria and 90 archaea set. We thus aimed to identify those proteomes likely equipped with the type III secretion system (T3SS) machinery (Figure 4).
We found that, as expected, archaea never contain a full T3SS (maximum three out of five components). In Gram-negative bacteria, the number of predicted effectors correlated much better with the number of type III machinery components (Pearson correlation r = 0.37) than in Gram-positive bacteria (r = 0.13). The combination of a high percentage of predicted type III effectors and a high number of conserved type III machinery components provides strong evidence for the presence of the type III secretion abilities (Figure 4). As a rule of thumb, based on our observations in archaea and Gram-positive bacteria, we suggest that these abilities can be reliably identified by the presence of the complete T3SS and ≥5% of the genome dedicated to effectors. With these cutoffs, 20% (120 species) of the Gram-negative bacteria in our set are identified as type III secreting. No archaeal species and only five Gram-positive bacteria fit these cutoffs. We searched the literature for annotation of ten randomly chosen Gram-negative bacteria from this set (Supplementary Table S7). We found evidence of type III machinery in seven of the ten organisms (Attree and Attree, 2001; Bertelli et al., 2010; Block et al., 2010; Brugirard-Ricaud et al., 2004; Dai and Li, 2014; Mavrodi et al., 2011; Salinero et al., 2009). For three bacteria the secretion machinery has not been studied. Overall, our results suggest that the experimental annotation of the type III secretion in isolated and cultured organisms is incomplete, leaving significant room for improvement.
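The rule of thumb stated above translates into a simple check. The sketch below is only an illustration of the two cutoffs (all five core machinery components and at least 5% of the proteome predicted as effectors); the function name, parameter names and example numbers are hypothetical.

```python
def likely_type_iii_secreting(n_components, n_predicted_effectors,
                              proteome_size, required_components=5,
                              min_effector_fraction=0.05):
    """Apply the rule of thumb: complete core T3SS and >= 5% effectors."""
    if proteome_size <= 0:
        raise ValueError("proteome_size must be positive")
    effector_fraction = n_predicted_effectors / proteome_size
    return (n_components >= required_components
            and effector_fraction >= min_effector_fraction)


# Hypothetical proteomes, not taken from Supplementary Table S5.
complete_and_rich = likely_type_iii_secreting(5, 300, 5000)   # 6% effectors
incomplete = likely_type_iii_secreting(4, 300, 5000)          # one part missing
few_effectors = likely_type_iii_secreting(5, 2, 240)          # about 0.8%
```

Only the first toy proteome satisfies both cutoffs; the other two fail on machinery completeness and on effector fraction, respectively.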
Discussion
pEffect combines homology-based and de novo prediction. PSI-BLAST is commonly used to annotate protein function through sequence similarity (Radivojac et al., 2013). Applied to our sequence unique Development set, PSI-BLAST correctly annotated most type III effector proteins (F1 = 0.87 ± 0.09) through sequence comparisons against a set of known type III effectors. The de novo prediction with the profile kernel SVM annotated type III effectors slightly worse (F1 = 0.73 ± 0.11). Our new method, pEffect, successfully combined the complementary homology-based and de novo predictions, reaching sustained high levels of performance (F1 = 0.91 ± 0.08), better than each of its individual components (Table 1).
Predictions succeed even for fragments from metagenomic analyses. pEffect distinguishes type III effectors from other bacterial and eukaryotic proteins using either full length proteins or protein fragments. The detection of N-terminal signals, often used as the only source of evidence for predicting type III effectors computationally, presents a special problem for metagenomic data because of the erroneous gene predictions and potentially absent reads in contig assemblies. For all fragment sets tested, pEffect performed within one standard error of the level for full-length sequences (Figure 1, Supplementary Tables S3-S4). This result suggests that the features distinguishing type III effectors are present throughout the protein sequence and are not solely confined to the N-terminal region.
The finding that the secretion signal is somehow 'distributed' over the entire protein was surprising and extremely relevant for the analysis of metagenomic read data. Deep Sequencing (or NGS) produces immense amounts of DNA reads, which need to be assembled and annotated to be useful. Erroneous (chimeric) gene assemblies or wrong gene predictions are common in sequencing projects (Nielsen and Krogh, 2005). To bypass the assembly errors when identifying type III secretion activity in a particular metagenomic sample it would help to annotate effectors from raw protein fragments translated directly from the DNA reads. Since pEffect succeeds in our tests on fragments, our new method might just enable such a direct analysis. To ultimately establish this point, we will have to compare predictions from raw read translations to those from translations of assembled genomes. Clearly, the results from pEffect can help establish the presence or absence of pathogenic organisms in a particular environment.
Most predictions are de novo without sequence similarity to known effectors. Type III effectors were predicted in all types of prokaryotes that we tested. As expected, the number of effectors in Gram-positive bacteria and archaea that are not known to utilize T3SS was lower than in Gram-negative bacteria that do use the system (Figures 3-4). Interestingly, homology searches, i.e. PSI-BLAST results, have identified roughly equal numbers of effectors (1%; Figure 5, Supplementary Table S5) in both types of bacterial genomes. As some effectors often co-localize with the T3SS machinery in “pathogenicity islands” (Figueira and Holden, 2012; Okada et al., 2009; Reis and Horn, 2010), these findings are in line with the inheritance of the early complete secretory system, including the machinery and the secreted proteins.
Overall, the percentage of effectors predicted by sequence similarity (homology-based) ranged from 3% to 71% for bacteria with an average of 21% (maximum for Onion yellows phytoplasma OY-M, an intracellular Gram-negative plant pathogen (Oshima et al., 2013); Supplementary Table S5). Conversely, a significantly larger fraction (on average ~76%) of all effector predictions were based on our de novo method, i.e. could not have been identified without machine learning. The percentage of de novo predictions in Gram-negative bacteria was significantly larger than in Gram-positive ones (79±0.4% vs. 70±0.5%, respectively; Figure 3). Note, however, that 70% is still a drastically large fraction to appear in bacteria that seemingly have no use for them. Furthermore, the number of “new” effectors has grown over evolutionary time (Figure 5), suggesting functional innovation due to environmental pressures. The set of de novo-identified effectors found across bacteria is thus a good starting point for further investigation into effector origins.
Highest number of effectors in Gram-negative bacteria with full T3SS. The loss of type III secretion components in Gram-negative bacteria is accompanied by the loss of effectors, indicating the lack of necessity to further diversify in the absence of the complete machinery (Figure 4C). This type of correlation between the completeness of T3SS and the number of effectors in Gram-negative bacteria is not present for non-type III secreting Gram-positive bacteria (Figure 4B) or archaea (Figure 4A).
Further insight into evolution of bacterial T3SS. pEffect's high prediction accuracy raises an interesting question about its predictions of effectors in Gram-positive bacteria, which are not known to utilize T3SS. Roughly one fourth of their predicted effectors are of yet-unknown function. Bacterial proteins of annotated function are mostly transferases, hydrolases, ATP-binding proteins or kinases (Supplementary Table S6), all of which are necessary for flagellar motility. This finding is in line with evidence of shared ancestry between bacterial flagellar and type III secretion systems (McCann and Guttman, 2008). It is not known whether T3SS evolved from the flagellar apparatus or if the two systems evolved in parallel. However, gene genealogies (Gophna et al., 2003) and protein network analysis approaches (Medini et al., 2006) both suggest independent evolution from a common ancestor, which comprised a subset of proteins forming a membrane-bound complex. The fact that the flagellar system can also secrete proteins (Macnab, 2004) suggests that this ancestor may have played a secretory role (McCann and Guttman, 2008). The pervasiveness of the flagellar apparatus across the bacterial space suggests that the ancestral complex existed prior to the split of the cell-walled and double-membrane organisms, indicated by the differences in Gram staining. The common ancestor protein complex of T3SS and flagellar system would have then been encoded in an even earlier ancestral genome. Thus, it is not surprising that we find T3SS component homology in Gram-positive bacteria even in the absence of type III secretion functionality. Interestingly, our results show that the loss of the complete T3SS and, inherently, the associated loss in type III functionality has proceeded at a roughly similar rate in Gram-positive and Gram-negative bacteria (Figure 6A); i.e. once the T3SS is incomplete (4 components), and arguably nonfunctional, further loss of components consistently follows. A complete T3SS, however, is only visible in early Gram-positive bacteria, but preserved across time in Gram-negative bacteria (Figure 6B), further confirming the presence of the ancestral secretory complex in the last common bacterial ancestor.
Did T3SS exist before the archaea/bacteria split? pEffect also predicts a significant number of effectors in archaea. However, the presence of the beginnings of T3SS in the common ancestor of bacteria and archaea is neither directly supported nor negated by our results. Archaeal flagella have little or no structural similarities to bacterial flagella, but share homology with the type IV secretion system (Ng et al., 2006). Some of the type IV secretion system and T3SS components are homologous, e.g. VirB11-like ATPases (Wallden et al., 2010). However, despite this observed homology none of the archaea that we tested had the complete set of T3SS components (Figure 3). If the common ancestor of archaea and bacteria did encode the core ancestral complex, these observations would indicate a loss of functionality in archaea. Another possibility is that the T3SS in bacteria, like the flagellar apparatus (Liu and Ochman, 2007), may have been built over time from duplicated and diversified paralogous genes of the core complex after the archaea/bacteria split. In both of these scenarios, the prediction of type III effectors in archaea would then indicate re-purposing of the proteins secreted by the ancestral complex. In fact, 0.5% of an average archaeal genome is identified by homology (PSI-BLAST) to known effectors and another 0.9% de novo identified proteins are homologous (PSI-BLAST e-value ≤ 10⁻³) to predicted effectors of Gram-negative bacteria. These proteins must have been re-purposed in modern archaea; they are usually annotated as hydrolases, transferases, and metal-binding proteins (Supplementary Table S6). The use of an additional 0.5% of the archaeal proteome that is picked up by pEffect de novo and has no homologs in bacteria remains an enigma. While a certain level of similarity exists between archaeal proteins and bacterial type III effectors and machinery, the observed signal is insufficient to draw definitive conclusions regarding common ancestry. It is, however, significant for further exploration - if roughly one tenth of the identified effectors of Gram-negative bacteria and half of the machinery have homologs in archaea, could there have been a common ancestral secretion complex that has developed early on in evolutionary time and has given root to many systems observed today?
Availability
pEffect is accessible as an online web server (http://www.bromberglab.org/services/pEffect). Proteome scanning results, described in this manuscript, are also available for download. We expect our method framework to improve in the future as more experimental data and more sequences become available. However, pEffect's high levels of accuracy and its ability to easily handle large-scale data already place the method at the ideal starting point for annotating type III effector functionality of individual proteins, whole proteomes, or even translated metagenomes.
Acknowledgements
Thanks to Tim Karl, Guy Yachdav, Laszlo Kajan (all TUM) and Yannick Mahlich (Rutgers) for invaluable help with hardware and software; to Chengsheng Zhu (Rutgers) for helpful discussions; to Jessie Maguire (Rutgers), Marlena Drabik, Inga Weise and Lothar Richter (all TUM) for administrative support. Last, not least, thanks to all those who deposit their experimental data in public databases, and to those who maintain these databases. The work was supported by a grant from the Alexander von Humboldt foundation through the German Federal Ministry for Education and Research (BMBF). Additional funding was provided to Tatyana Goldberg through Ernst Ludwig Ehrlich Studienwerk (ELES).