Expected and observed genotype complexity in prokaryotes: correlation between 16S-rRNA phylogeny and protein domain content

Jasper J. Koehorst; Edoardo Saccenti; Vitor Martins dos Santos; Maria Suarez-Diez; Peter J. Schaap

doi:10.1101/494625

ABSTRACT

Background The omnipresent 16S ribosomal RNA gene (16S-rRNA) is commonly used to identify and classify bacteria, although it does not take into account the distinctive functional characteristics of taxa. We explored the functional domain landscapes of over 5700 complete bacterial genomes, representing a wide coverage of the bacterial tree of life, and investigated to what extent the observed protein domain diversity correlates with the expected evolutionary diversity, using 16S-rRNA as a metric for evolutionary distance.

Results Analysis of protein domains showed that 83% of the bacterial genes code for at least one of the 9722 domain classes identified. By comparing clade specific and global persistence scores, candidate horizontal gene transfer and signifying domains could be identified. 16S-rRNA and functional domain content distances were used to evaluate and compare species divergence and overall a sigmoid curve is observed. Already at close 16S-rRNA evolutionary distances, high levels of functional diversity can be observed. At a larger 16S-rRNA distance, functional differences accumulate at a relatively lower pace.

Conclusions Analysis of 16S-rRNA sequences in the same taxa suggests that, in many cases, additional means of classification are required to obtain reliable phylogenetic relationships. Whole genome protein domain class phylogenies correlate with, and complement, 16S-rRNA sequence-based phylogenies. Moreover, domain-based phylogenies can be constructed over large evolutionary distances, provide in-depth insight into the functional diversity within and among species, and enable large scale functional comparisons. The increased granularity obtained paves the way for new applications to better predict the relationships between genotype, physiology and ecology.

Introduction

The most commonly used method to classify bacteria and to identify new isolates is the direct comparison of the omnipresent 16S ribosomal RNA (16S-rRNA) gene sequence^1,2 with highly curated 16S-rRNA gene sequence databases^3–8.

Using only the 16S-rRNA gene for taxonomic characterisation presents limitations and disadvantages. First, arbitrary minimal sequence similarity thresholds are used as working boundaries for differentiating between taxonomic ranks. Although these thresholds prove to be very useful for classification purposes, they are subject to progressive insights and are limited as there is no biological meaning attached to them⁹. For instance, the minimal sequence similarity threshold for species delineation, proposed for the 16S-rRNA gene, has changed over time from 97% to 98.7%^10,11 and even at this updated stringency level, the resolution is too limited for a definite species classification of some phylogenetic groups¹². Second, a restriction to the analysis of sequence variations in a single gene does not take into account the distinctive functional characteristics of the different prokaryotic taxa, nor can it explain the genotypic, and consequently phenotypic, differentiation observed between strains due to events such as gene loss or acquisition.

Alternative, inter-genomic BlastN-based sequence similarity methods exist that take into account full genome sequences. Examples are Average Nucleotide Identity (ANI)^13,14, Genome Blast Distance Phylogeny (GBDP)¹⁵ or a combination of 16S-rRNA sequence similarity and ANI values¹⁶. These methods help to increase taxonomic coherence at the smaller evolutionary distances, but are less suitable to monitor the impact of mutation and (strain specific) gene loss and horizontal gene transfer (HGT).

To better understand the impact of gene loss and HGT and to improve the characterisation of functional diversity, the analysis needs to be performed beyond genome sequence similarity comparison by considering protein function. Protein encoding genes reveal a modular design, with domains forming distinct globular structural and functional units. Bacterial innovation is in part driven by gain, loss, duplication and rearrangement of these functional units, resulting in the emergence of proteins with new domain combinations^17,18. Thus, a direct comparison of protein domain content should be able to reconstruct bacterial phylogeny independent of gene sequence similarity¹⁹ and as such may serve as a better indicator of shared physiology and ecology^20,21.

In this study we present an exhaustive exploration of the functional landscape of over 5700 complete bacterial genomes representing a wide coverage of the bacterial tree of life, and investigate to what extent protein domain diversity correlates with taxonomic diversity, using the 16S-rRNA gene sequence as a metric for evolutionary distance.

Results

We analysed 5713 fully sequenced, publicly available bacterial genomes corresponding to a wide range of different bacterial lineages (57 classes, 243 families, 818 genera and multiple strains of 1330 species), providing a good representation of the bacterial diversity observed in nature (see supplementary file S1 for more information). Genome sizes varied from 0.1 Mbp up to 13 Mbp. To avoid technical bias due to the use of different annotation strategies, all genomes were de-novo re-annotated with SAPP²² (see Methods section for details). The total number of genes varied from 167 (Candidatus Tremblaya princeps) to 9968 (Streptomyces bingchenggensis BCW-1).

16S-rRNA variability within and between species

From the 5713 completely sequenced genomes, 25098 complete 16S-rRNA genes could be retrieved. On average the predicted length of the 16S-rRNA gene was 1531 ± 94 nt (see supplementary file S1) and 84% of the completed genomes (4772) contained between two and fifteen copies of the 16S-rRNA gene (Figure 1). The 16S-rRNA genes from phylogenetically cohesive groups of at least 50 strains were further analysed at family level. As can be seen in Figure 1B, copy number varies widely among families. As has been observed previously²³, in some families the 16S-rRNA gene is largely restricted to a single copy, while in others the copy number ranged from 1 to 15. Furthermore, 52% of the analysed genomes contained two or more non-identical copies of the 16S-rRNA gene. Intragenomic sequence variation reflected an overall sequence identity of 99.6 (+0.4 / −2)%, which is higher than the currently accepted 98.7% threshold for species delineation.

Figure 1. 16S-rRNA copy number variation.

A) 16S-rRNA gene copy number variation in the complete set B) Copy number variation at family level; families represented by more than 50 strains were analysed.

For the complete set of genomes, a species network based on pair-wise 16S-rRNA sequence similarity scores was built. In this network, nodes represent genomes and edges were drawn between nodes when the 16S-rRNA showed at least 98.7% identity. Network connectivity analysis identified 2025 connected components (subnetworks). For further study, 294 subnetworks linking ten or more nodes were selected. In thirty-two of these subnetworks, taxonomic inconsistencies were observed as they linked genomes assigned to two or more species. The majority (30) of these taxonomically inconsistent subnetworks linked species belonging to the same genus. However, two subnetworks were identified that linked species from different genera. The first contained species of the Escherichia and Shigella genera. The second subnetwork showed even more diversity and contained members of the Citrobacter, Enterobacter, Klebsiella, Kosakonia, Raoultella and Salmonella genera (Figure 2). All genera in both subnetworks belong to the Enterobacteriaceae family. Overall, network analysis suggested that in many cases the 98.7% identity threshold is not sufficient and additional means of classification are required to obtain reliable phylogenetic relationships.

Figure 2. Topology of the similarity subnetwork of Enterobacteriaceae.

Nodes represent genomes and edges are drawn if the 16S-rRNA identity >98.7%. A) Network topologies with colours indicating the different species groups. Left panel: unambiguous species assignment; strains A, B and C are directly connected to type strain T. Right panel: observed topology. Leaf node strain D is in the cluster but has no direct link with type strain T. 16S-rRNA sequences of strain E and strain F are below the set similarity threshold and form an unlinked subnetwork. Strain G of the blue species functions as an articulation point linking the pink and red species subnetworks. B) Subnetwork linking six different genera based on the 16S-rRNA gene sequences using a sequence similarity threshold >98.7%. The size of each node depends on its betweenness centrality. Enterobacter is the main component that connects the different genera, as no direct linkage between Salmonella and Klebsiella is observed. Three strains of Citrobacter have a direct connection to Salmonella and are disconnected from other Citrobacter strains. One Enterobacter (Enterobacter sp. R4-368) is isolated from the rest and is only connected to Kosakonia. Members of the Raoultella genus show close similarity to some of the Klebsiella strains. C) Topology of domain-class content subnetworks of the same strains using as threshold a binary distance ≤0.1. Distinct subnetworks are observed. Salmonella is now completely separated from the other genera; Enterobacter, Klebsiella and Citrobacter also form distinct clusters with a few members forming separate subnetworks.

Protein domain architectures

By breaking proteins into domains and using precomputed profile hidden Markov models (pHMM) to classify these domains, a semantically consistent classification of encoded protein functions can be obtained²⁰. As pHMMs give greater weight to matches at conserved sites, they are also better suited for remote homology detection than standard sequence similarity-based methods²⁴. To obtain such a protein classification, the 18949996 inferred protein sequences were scanned for the presence of Pfam domains²⁵. A total of 15747648 protein sequences were found to contain at least one domain instance (83.1%) and in total 9722 distinct protein domain classes were detected (see supplementary file S1 for more details). Two Pfam domains were discovered in 17.7% (3345544) and three or more domains in 6.4% (1205997) of these proteins (Table 1). Thus, the majority of the bacterial proteins appear to be single domain proteins (Figure 3A). Moreover, we observed that most multiple domain proteins appear to contain domain repetitions. Similar domain distributions were obtained when individual genomes were analysed, indicating that this is a general property of the architecture of bacterial genomes (Figure 3B).

Figure 3. Frequency distribution of 1,2,…,13 domain classes

A) In the full data set. B) In each genome

Table 1. Overview of the number of proteins and corresponding protein domain content.

The majority of the proteins (83.1%) contained at least a single domain and only a few (1.74%) contained more than 3 domains.

Genome distribution of protein domains

The distribution of the domain classes across the studied genomes is shown in Figure 4. Panel A shows that there is a direct correlation between the genome size and the total number of domains detected. A non-linear relationship is observed between the total number of protein domains and the total number of protein domain classes indicating that domain copy numbers, but not so much the number of domain classes, increase in the larger genomes (Figure 4 panel B). On average, we counted 2.02 domain copies per genome. This copy number, however, showed a large variability, ranging from 1.07 copies for Carsonella ruddii (strain PV)²⁶ to 4.58 copies for Streptomyces bingchenggensis (strain BCW-1)²⁷.

Figure 4. Distribution of the domain classes across bacterial genomes

A) Correlation between genome size and number of protein domains. B) Correlation between the total number of domains and total number of domain classes. A non-linear relation is observed, suggesting that in the larger genomes an increase in domain copy number is favoured over an increase in domain classes.

Domain persistence and analysis of the pan- and core-domainomes

In total, 9722 domain classes were detected. The overall persistence (the fraction of the genomes sharing a given domain class) is shown in Figure 5. Only 324 domain classes were ubiquitous in over 95% of the analysed genomes. Three domains, PF00009 (GTP-binding elongation factor family), PF01479 (S4 domain) and PF03144 (Elongation factor Tu domain 2), were shown to persist in all genomes. Additionally, a small number of domains were found to be present in over 99.9% of the studied genomes: PF00012 (Hsp70 protein), PF00318 (Ribosomal protein S2), PF00380 (Ribosomal protein S9/S16), PF00679 (Elongation factor G C-terminus), PF01926 (50S ribosome-binding GTPase), PF02811 (PHP domain), PF07733 (Bacterial DNA polymerase III alpha subunit) and PF14492 (Elongation Factor G, domain II). Among the studied genomes there are domain classes with a high copy number. The domain with the highest copy number is PF00005, representing the ATP-binding domain of ABC transporters, with on average 62.9 copies per genome, yet the domain is absent in twelve small-sized genomes.

Figure 5. Distribution of domain classes over 5713 genomes

Accurate measurements of the pan- and core-domainome sizes would entail knowledge of the functional content of every single organism in the corresponding group. We have estimated their respective sizes for the 18 families that contained more than 50 members each (Figure 6A). The largest observed pan-domainome was of Bacillaceae with 4783 protein domain classes. The largest core was observed for Yersiniaceae (1844 domain classes) (Figure 6B).

Figure 6. Persistence analysis of families with more than 50 members.

The estimated pan-domainome (Panel A) and estimated core (Panel B) show a large degree of variability, with conservation ranging from 78% for Chlamydiaceae to 7% for Enterobacteriaceae. The conservation ratio between core and pan (Panel C) shows that only in Chlamydiaceae is more than half of the protein domain content conserved. The family pan-genome is closed (Panel D) when α>1.

When analysing the genomes of the Chlamydiaceae family, 78% of the protein domain classes are conserved. In contrast, the core of Enterobacteriaceae covers only 7% of the 4444 domain classes in total (Figure 6C). This is mostly due to the size of the genomes from the Moranella, Riesia, Blochmannia and Ishikawaella²⁸ genera, as they are smaller than 1 Mbp, encoding as few as 444 genes, whereas the average genome size of Enterobacteriaceae is 4.8 Mbp, encoding on average 4510 genes. When excluding the small sized genomes, the core increases to 938 protein domains with a slightly smaller pan-domainome of 4441, yielding a 21% ratio between the core and pan-domainome. This shows the impact of including or excluding specific genomes in the analysis, as a single genome or a few genomes can reduce the core significantly, thereby possibly obscuring important information.
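The effect described above can be illustrated with a minimal Python sketch that computes the observed core and pan-domainome as a set intersection and union. This is not the estimation procedure used in the study (which relies on binomial mixture models to extrapolate to unsampled strains); the function name `core_and_pan`, the genome labels and the domain contents of the toy genomes are all hypothetical.

```python
def core_and_pan(genomes):
    """Return (core, pan) domain-class sets.

    genomes: dict mapping a genome label to the set of Pfam domain
    classes detected in that genome (toy input, assumed for illustration).
    """
    domain_sets = list(genomes.values())
    core = set.intersection(*domain_sets)  # classes present in every genome
    pan = set.union(*domain_sets)          # classes present in any genome
    return core, pan


# Hypothetical toy data: two larger genomes and one reduced genome.
genomes = {
    "large_A": {"PF00005", "PF00009", "PF00012", "PF01479", "PF02811"},
    "large_B": {"PF00005", "PF00009", "PF00012", "PF01479", "PF07733"},
    "reduced": {"PF00009", "PF01479"},
}

core_all, pan_all = core_and_pan(genomes)
ratio_all = len(core_all) / len(pan_all)

# Repeat the calculation without the reduced genome.
large_only = {name: d for name, d in genomes.items() if name != "reduced"}
core_large, pan_large = core_and_pan(large_only)
ratio_large = len(core_large) / len(pan_large)

print(f"all genomes:   core={len(core_all)} pan={len(pan_all)} ratio={ratio_all:.2f}")
print(f"large genomes: core={len(core_large)} pan={len(pan_large)} ratio={ratio_large:.2f}")
```

In this toy case a single reduced genome halves the core while leaving the pan-domainome unchanged, which mirrors the direction of the change reported for Enterobacteriaceae.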

Openness of the pan-domainome provides another indication of the relative impact of horizontal acquisition and vertical transmission in shaping the domainome. We estimated whether the pan-domainome of each of the largest families was open or closed by fitting the decay parameter, α, of a Heaps’ law function. The pan-domainome is closed when α > 1.0 and open when α < 1.0. The majority of the bacterial families considered here showed a closed pan-domainome (Figure 6D). For Enterobacteriaceae the Heaps’ law parameter dropped from α=1.21 to α=1.17 upon removal of the previously indicated smaller genomes.
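As a rough illustration of the open/closed criterion, the sketch below assumes the common power-law form in which the number of new domain classes contributed by the N-th genome decays as κ·N^(−α), and recovers α by ordinary least squares on log-transformed values. This is a simplification swapped in for readability, not the fitting routine of the micropan R package used in the study; the function names, the number of permutations and the synthetic data are assumptions.

```python
import math
import random


def new_classes_per_genome(domain_sets, n_perm=100, seed=0):
    """Average number of previously unseen domain classes contributed by
    the N-th genome (N = 2, 3, ...) over random genome orderings.

    domain_sets: list of sets of domain classes, one set per genome.
    """
    rng = random.Random(seed)
    totals = [0.0] * (len(domain_sets) - 1)
    for _ in range(n_perm):
        order = domain_sets[:]
        rng.shuffle(order)
        seen = set(order[0])
        for i, domains in enumerate(order[1:]):
            totals[i] += len(domains - seen)
            seen |= domains
    return [(i + 2, total / n_perm) for i, total in enumerate(totals)]


def fit_heaps_alpha(points):
    """Fit log(n) = log(kappa) - alpha * log(N) by least squares.

    points: iterable of (N, n) pairs; pairs with n <= 0 are skipped
    because their logarithm is undefined.
    """
    xs = [math.log(N) for N, n in points if n > 0]
    ys = [math.log(n) for N, n in points if n > 0]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    kappa = math.exp(mean_y - slope * mean_x)
    return kappa, -slope


# Synthetic decay curve with a known exponent (assumed values).
points = [(N, 300.0 * N ** -1.2) for N in range(2, 51)]
kappa, alpha = fit_heaps_alpha(points)
status = "closed" if alpha > 1.0 else "open"
print(f"kappa={kappa:.1f} alpha={alpha:.2f} -> {status} pan-domainome")
```

With real data, `new_classes_per_genome` would supply the `(N, n)` pairs from per-genome domain-class sets before the fit is applied.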

Signifying domains and horizontal domain transfer

Log persistence scores (log-P) were calculated for each of the domain classes present in the pan-domainomes of the five most abundant monophyletic species groups: Chlamydia trachomatis (74), Escherichia coli (105), Helicobacter pylori (65), Salmonella choleraesuis (350) and Staphylococcus aureus (74). As a null model we considered the persistence of the domain in the full set of 5713 genome sequences.

For a small set of domain classes, high log-P scores were obtained; these are likely signifying domain classes (Table 2, Figure 7 and Supplementary Table S3 logP). At the other end of this scale we find a large number of domain classes with negative log-P scores. These incidental domains have a low to very low intra-species persistence, which suggests that they may have been acquired by horizontal gene transfer. Unlike the high scoring domains, most of them have been assigned a molecular, often metabolic, function.

Figure 7. Persistence scores of Salmonella choleraesuis protein domain classes.

For each domain class present in the S. choleraesuis pan-domainome, persistence scores are compared with the pan-domainome persistence scores obtained from the complete set of 5713 genomes.

Table 2. Salmonella choleraesuis top 25 signifying and incidental domains

Co-evolution of bacterial 16S-rRNA and whole genome domain content

Protein domains provide a formal description of genome encoded functionalities, each contributing to bacterial genotypic complexity. The functional relatedness of an arbitrary pair of genomes can thus be determined by finding the fraction of encoded domain classes in common relative to the number of domain classes present in each of these genomes. Through inclusion of the 16S-rRNA data, the co-evolution of bacterial 16S-rRNA gene sequences with genotypic complexity can be studied (Figure 8). In panel A the distribution of domain based distances is plotted using a binary dissimilarity score. Likewise, in panel D the distribution of 16S-rRNA sequence distances is plotted. Panel C shows a pairwise comparison between 16S-rRNA distances and functional distances for the analysed genomes. Finally, panel B presents a schematic representation of the relationship between the two methods.

Figure 8. Distance comparison of the 16S-rRNA gene with the functional diversity.

A) Distribution of domain based distances. B) Schematic representation of the three stages of diversification. 1) A fast, short-term evolution, as evolutionary distances measured by 16S-rRNA remain small, while functional diversification has already taken place. 2) Long-term evolution, in which functional diversification occurs at a scale compatible with diversification by 16S-rRNA sequence evolution. 3) The distance of the 16S-rRNA remains behind the functional diversity, as the 16S-rRNA sequence can only diverge so far without loss of function. C) Comparison between pairwise 16S-rRNA distances and pairwise functional distances. Color indicates density of points; blue and red indicate lower and higher density respectively. D) Distribution of 16S-rRNA based distances.

Overall, a good agreement is found between both approaches to evaluate species divergence. Analysis of the 16S-rRNA distances shows a marked differentiation in the [0.3, 0.35] interval, which appears as a steep increase in the abundance of instances of these distance values (Figure 8D). This differentiation corresponds to lineage boundaries (specifically class and phylum differences). This increased density corresponds to the higher density in the center of the plot (Figure 8C), which reflects that most of the performed comparisons involve members distantly related on the evolutionary scale. This is also apparent in the higher number of instances of functional differences in the [0.6, 0.7] interval (Figure 8A); however, functional differences accumulate more gradually, and no steep increase is observed.

The relationship between the two methods to evaluate species differences can be approximated through a sigmoidal curve and three regimes can be distinguished (Figure 8B). Species at close evolutionary distances show a broad range of functional similarity (Figure 8B region 1). A high diversity is observed, so that genomes with high similarity regarding their 16S-rRNA can show high functional diversity. The second region shown in Figure 8B (region 2) corresponds to relatively large genetic differentiation (class differences), where functional differences accumulate at a relatively lower pace. Finally, the third region (region 3) corresponds to very distant species that, as expected, have a large degree of functional differentiation.

In addition to functional similarities between evolutionarily close strains, Figure 8C also indicates the presence of functionally very similar but evolutionarily distant genomes. These are to be found in the region with low domain content variation (<0.05) and a large 16S-rRNA distance (>0.4). Gluconacetobacter diazotrophicus PAl 5, Moraxella catarrhalis BBH18 and Pseudomonas aeruginosa 39016 are some examples. Similar results are obtained when the analysis is repeated considering all available genomes. The presence of more than one copy of the 16S-rRNA gene may introduce a larger variability; however, the overall agreement of 16S-rRNA classification remains the same.

Discussion

For several decades 16S-rRNA sequence similarity scores provided a good working metric for prokaryotic taxonomic classifications, but because of the ever-expanding sequence databases and the increased taxonomic complexity the limitations of this approach are emerging. Here, we have used a set of 5713 complete genomes to evaluate the predictive power of pair-wise 16S-rRNA sequence similarity scores on the diversity and taxonomic classification of these genomes.

We observed intragenomic variation of 16S-rRNA gene sequences, but further analysis showed that within the selection, this variation is limited and well above the currently advised species threshold of 98.7%, meaning that regardless of the selected copy, the same taxonomic classification should be obtained.

A network approach was subsequently used to study pair-wise 16S-rRNA sequence similarities between the 5713 sequenced strains (Figure 2). Using the currently accepted 98.7% minimal sequence similarity threshold, this approach should ideally lead to 1330 separate species networks, each containing all sequenced strains of a defined species, and each individual node within such a subnetwork should at least have a direct link to the node that represents the reference or type strain (Figure 2 panel A). However, many more subnetworks were obtained, and strains of the same species were found in separate subnetworks. Additionally, strains with intermediate 16S-rRNA sequences were present, functioning as articulation points merging what should have been independent species subnetworks (Figure 2 panel B and C). With the continuous addition of new 16S-rRNA sequences it is likely that species amalgamation will become more frequent. In the light of this, a more appropriate approach would be to consider the similarity threshold as a confidence level. In this way, there is a high probability that two sequences with a 16S-rRNA sequence identity below the selected threshold belong to different species. This provides a probabilistic interpretation of the threshold.

We used Pfam protein domain-class content to study strain diversity. Protein domains are considered to be distinct functional units and as such responsible for a particular function or interaction. The Pfam 30 protein family database consists of 16306 domain families or classes²⁹ of which 9722 were present in the studied dataset. Furthermore, we found that approximately 83% of the protein-encoding genes harbour at least one Pfam domain, suggesting that the encoded domain-class content may provide a good metric to study strain diversity.

The core-genome of a taxonomic group contains genes that are present in all members of that group whereas the pan-genome contains all the different genes that can be found in any member of the population³⁰. Here we extended the idea to protein domain classes, as has been previously reported^21,31. We observed that most domain classes have a low persistence overall (Figure 5), but as shown in Figure 6, by adding taxonomic information, distinct sets of domain classes accumulate in the core domainomes of the various clades, suggesting that these core sets are somehow contributing to the physiology and ecology of these clades.

At family level, the core to pan domainome ratio is observed to be on average below 0.4 (Figure 6), but at lower taxonomic ranks this ratio increases. For C. trachomatis this ratio was determined to be 0.96, for Escherichia coli 0.58, for Helicobacter pylori 0.83 and for Staphylococcus aureus 0.76. We assumed that species core domainomes would consist of signifying or even species-specific domain classes and domain-classes representing essential metabolic functions. We expected that signifying domain-classes are only highly persistent within a clade but that domain-classes representing metabolic functions would be widely spread. For each domain class present in the pan-domainome of five selected species we calculated the ratio between clade specific persistence and global persistence (log-P scores) using a null-model that assumes that domain-classes are evenly distributed over the strains. The analysed species contributed to 6.2% or less of the total number of strains.

Top log-P scoring domains mostly corresponded to domains of unknown function (DUF) or domains involved in signal transduction whereas, being omnipresent, metabolic functions were underrepresented. Of the 25 top scoring domains, 6 in Salmonella choleraesuis, 15 in Chlamydia trachomatis, 8 in Escherichia coli, 3 in Helicobacter pylori and 11 in Staphylococcus aureus corresponded to a DUF class. For the Mycoplasma species it has been established that many DUFs are essential for growth^32,33 and at least four of the DUFs in the present study, two specific for Escherichia coli (PF07041 and PF10897) and two for Helicobacter pylori (PF12033 and PF10398), have indeed been characterised as being essential³⁴. Between these five species, top scoring domains also show no significant overlap, suggesting that they are evolutionarily conserved and may have a prominent role in shaping the species. Protein domain classes with the lowest persistence ratios are likely HGT candidates. Functionally, most of them represent a metabolic function suggesting, as has been reported^35,36, that horizontal gene transfer is an important source of metabolic diversity.

The impact of the presence of these signifying domains in the core domainome is demonstrated in Figure 2C. Nodes from the Enterobacteriaceae subnetwork (Figure 2B) were re-analysed using pair-wise domain-class content distance analysis. A similarity threshold of 90% resulted in clade specific domain-class subnetworks for Salmonella, Enterobacter and to a lesser extent for Klebsiella. Note that by adopting a whole-genome domainome approach, the history of every domain-class present in the pan-domainome is taken into account. However, signifying domain classes are the main contributors and, similar to what has been observed in Ochman et al.³⁷, we observed that the many incidental HGT candidate domain classes appear to have little impact on whole-genome domainome based phylogenetic reconstructions.

The ratio between the core- and pan-domainome size of groups of organisms at different phylogenetic levels provided a good estimate for beta-diversity. A relatively low ratio between the core and pan-domainome reduces the functional assignments that can be inferred from the 16S-rRNA classification. Conversely, a high ratio gives more certainty that functionalities are present. Overall the majority of the analysed families showed a low ratio, indicating that only a reduced functional landscape can be extrapolated using 16S-rRNA analysis, and the ratio can differ significantly among families. For example, Chlamydiaceae shows a large ratio whereas Enterobacteriaceae has the lowest observed ratio. This indicates that the Chlamydia genus, which consists mostly of pathogenic bacteria that are obligate intracellular parasites, has evolved through simplification instead of complexification and is therefore less diverse³⁸. Enterobacteriaceae, in contrast, is a diverse family that includes members of the gut flora as well as a wide range of pathogenic species, and shows a more diverse functional landscape.

Combining the information from the functional landscape with 16S-rRNA sequences allowed us to relate the functional diversity with evolutionary distances (Figure 8). This analysis revealed that three stages of diversification can be defined³⁹. The first stage represents a fast, short-term evolution, as 16S-rRNA evolutionary distances remain small, though functional diversification has already taken place. This happens in closely related, near-identical strains where gene acquisition could play a significant role in functional diversity. The second stage represents a long-term evolution, in which functional diversification occurs at a scale compatible with evolutionary time, as reflected by 16S-rRNA evolution. In the third stage diversification of the functional landscape continues but, due to 16S-rRNA genetic constraints, does not align well with 16S-rRNA sequence distances.

Conclusions

16S-rRNA similarity scores can still be used as a metric for taxonomic classification, but we propose a more probabilistic interpretation as its performance will be better at higher taxonomic levels.

Whole genome protein domain phylogenies correlate with, and complement, 16S-rRNA sequence-based phylogenies. Moreover, domain-based phylogenies reveal rapid functional diversification, allow for large scale functional comparisons between clades and can be constructed over large evolutionary distances.

Protein domain persistence ratios highlight both signifying domain classes and HGT candidates. The increased granularity obtained will pave the way for new applications to better predict the relationships between genotype, physiology and ecology.

Methods

Genome annotation

A total of 5713 publicly available complete bacterial genomes were downloaded from the NCBI repository (November 2016)⁴⁰. To prevent technical bias due to the use of different annotation tools and pipelines and different thresholds for assessing the significance of the inferred genetic elements, genomes were consistently structurally and functionally de-novo annotated using SAPP²², an annotation platform implementing a strictly defined ontology⁴¹.

16S-rRNA prediction was performed using RNAmmer 1.2⁴². Genes were predicted using Prodigal (2.6.3)⁴³ and the identified proteins were functionally annotated using the Pfam library (version 30.0) within InterProScan (version 5.21-60.0)^25,44. Annotations were automatically converted into RDF according to the GBOL ontology⁴¹ and loaded into a semantic database for high-throughput annotation and analysis. For the retrieval of information, SPARQL was used (See supplementary file S5 for all queries used).

Quality analysis

Scaling laws have been identified in the genomic distribution of protein domains⁴⁵. These laws result in linear relationships between the number of domain classes with n copies and the total number of domain classes in a genome (see supplementary Figure S5). We have verified the linear relationships in the analysed genomes. These indicators have been used here to further verify the integrity of the assembled genomes⁴⁶. Overall, the previously reported scaling laws also hold true when a higher number of genomes is studied.

Estimation of pan- and core-domainome size

The numbers of domain classes expected in the pan- and core-genomes, if the sequences of every existing strain were to be included in the analysis, were estimated using binomial mixture models as implemented in the micropan R package⁴⁷, using default values for the parameters. Heaps’ law analysis as implemented in the micropan R package was used to estimate openness or closedness of the pan-genome, using 500 genome permutations and repeating the calculation 10 times.

Domain persistence

The following formulas were used to calculate persistence ratios

16S-rRNA distance calculations

From the de-novo annotation, 16S-rRNA sequences were obtained from the semantic database through a SPARQL query (see supplementary file S6 for all queries used). In total, 25098 16S-rRNA sequences were retrieved. Sequences that were of low quality (containing N’s) or whose length differed by more than the standard deviation were removed from the analysis. Duplicated 16S-rRNA sequences were merged into a single copy for the multiple alignment. For each 16S-rRNA, the orientation was validated using OrientationChecker⁴⁸. The complete gene was used to calculate pairwise alignment distances with the Clustal Omega suite for all possible 16S-rRNA pairs (Dataset 1 aligned). The resulting matrix was binarized using 98.7% sequence similarity as a cutoff. The binary matrix was then represented as a network using igraph⁴⁹ in R⁵⁰.
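The binarization and network steps can be sketched in a few lines. The authors used igraph in R; the plain-Python version below is only an illustration, with hypothetical function names and a toy similarity matrix. Treating a similarity equal to the cutoff as a link (>=) is an assumption, since the text does not state how ties are handled.

```python
from collections import deque


def binarize(similarity, cutoff=0.987):
    """Turn a square similarity matrix into a 0/1 adjacency matrix.

    Off-diagonal entries at or above the cutoff become 1 (assumed tie handling).
    """
    n = len(similarity)
    return [
        [1 if i != j and similarity[i][j] >= cutoff else 0 for j in range(n)]
        for i in range(n)
    ]


def connected_components(adjacency):
    """Group nodes of an undirected network into connected components (breadth-first search)."""
    n = len(adjacency)
    seen = [False] * n
    components = []
    for start in range(n):
        if seen[start]:
            continue
        seen[start] = True
        queue = deque([start])
        component = []
        while queue:
            node = queue.popleft()
            component.append(node)
            for neighbour in range(n):
                if adjacency[node][neighbour] and not seen[neighbour]:
                    seen[neighbour] = True
                    queue.append(neighbour)
        components.append(sorted(component))
    return components


# Toy pairwise similarities between four 16S-rRNA sequences (illustrative values).
toy_similarity = [
    [1.000, 0.995, 0.900, 0.910],
    [0.995, 1.000, 0.989, 0.920],
    [0.900, 0.989, 1.000, 0.930],
    [0.910, 0.920, 0.930, 1.000],
]
clusters = connected_components(binarize(toy_similarity))
```

In the toy matrix, sequences 0 and 2 are only 90% similar to each other yet end up in the same component because both are linked to sequence 1, which shows how a network view can join sequences that do not pass the cutoff pairwise.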

Domain based distance calculations

Genome distances based on protein domain class content were computed using the asymmetric binary method of the dist function in R, in which vectors are regarded as binary bits: non-zero elements are on and zero elements are off. The distance is the proportion of bits in which only one is on amongst those in which at least one is on. A similarity cutoff of ≤ 0.1 was used.
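As an illustration of the distance definition above, the following sketch re-implements the asymmetric binary distance in plain Python. The study itself used the dist function in R; the function name and the toy vectors here are hypothetical. Returning 0 when neither vector has any element on is a choice made for this sketch.

```python
def binary_distance(a, b):
    """Asymmetric binary distance between two equal-length domain count vectors.

    Non-zero entries are 'on'. The distance is the number of positions where
    exactly one vector is on, divided by the number where at least one is on.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    at_least_one = 0
    only_one = 0
    for x, y in zip(a, b):
        on_x, on_y = x != 0, y != 0
        if on_x or on_y:
            at_least_one += 1
            if on_x != on_y:
                only_one += 1
    # Two all-zero vectors share no information; 0.0 is returned by choice in this sketch.
    return only_one / at_least_one if at_least_one else 0.0


# Toy copy-number vectors over five domain classes (illustrative only).
genome_a = [2, 0, 1, 3, 0]
genome_b = [1, 1, 0, 4, 0]
distance = binary_distance(genome_a, genome_b)
```

In the toy example four domain classes are present in at least one genome and two of them are present in only one, so the distance is 2/4 = 0.5. Copy numbers play no role: only presence or absence of a domain class counts.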

Statistical software

Statistical analysis and visualisations were performed using R and the following packages: data.table⁵¹, reshape2⁵², plotly⁵³, Biostrings⁵⁴, devtools⁵⁵, micropan⁴⁷, gridExtra⁵⁶, hexbin⁵⁷ and RColorBrewer⁵⁸.

Authors’ contributions

JJK, PJS and MSD participated in the conception and design of the study. JJK was responsible for the analysis. JJK, ES, PJS and MSD wrote the manuscript. All authors critically revised the manuscript.

Acknowledgements

This work was carried out on the Dutch national e-infrastructure with the support of the SURF foundation. This work was partly supported by the European Union’s Horizon 2020 research and innovation programme (EmPowerPutida, Contract No. 635536, granted to Vitor A P Martins dos Santos) and by the UNLOCK project (NRGWI.obrug.2018.005), funded by the Netherlands Organisation for Scientific Research. It has also received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement 730976 (IBISBA 1.0).

Footnotes

* jasper.koehorst{at}wur.nl

References

1. Weisburg, W. G., Barns, S. M., Pelletier, D. A. & Lane, D. J. 16S ribosomal DNA amplification for phylogenetic study. J. Bacteriol. 173, 697–703, DOI: 10.1128/JB.173.2.697-703.1991 (1991).
2. Yarza, P. et al. Uniting the classification of cultured and uncultured bacteria and archaea using 16S rRNA gene sequences. Nat. Rev. Microbiol. 12, 635–645, DOI: 10.1038/nrmicro3330 (2014).
3. Yilmaz, P. et al. The SILVA and “all-species Living Tree Project (LTP)” taxonomic frameworks. Nucleic Acids Res. 42, DOI: 10.1093/nar/gkt1209 (2014).
4. Quast, C. et al. The SILVA ribosomal RNA gene database project: Improved data processing and web-based tools. Nucleic Acids Res. 41, DOI: 10.1093/nar/gks1219 (2013).
5. Yoon, S. H. et al. Introducing EzBioCloud: A taxonomically united database of 16S rRNA gene sequences and whole-genome assemblies. Int. J. Syst. Evol. Microbiol. 67, 1613–1617, DOI: 10.1099/ijsem.0.001755 (2017).
6. McDonald, D. et al. An improved Greengenes taxonomy with explicit ranks for ecological and evolutionary analyses of bacteria and archaea. The ISME J. 6, 610–618, DOI: 10.1038/ismej.2011.139 (2012).
7. Wang, Q., Garrity, G. M., Tiedje, J. M. & Cole, J. R. Naïve Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Appl. Environ. Microbiol. 73, 5261–5267, DOI: 10.1128/AEM.00062-07 (2007).
8. Hinchliff, C. E. et al. Synthesis of phylogeny and taxonomy into a comprehensive tree of life. Proc. Natl. Acad. Sci. 112, 12764–12769, DOI: 10.1073/pnas.1423041112 (2015). 1503.03877.
9. Gupta, R. S. Impact of genomics on the understanding of microbial evolution and classification: The importance of Darwin’s views on classification. FEMS Microbiol. Rev. DOI: 10.1093/femsre/fuw011 (2016).
10. Stackebrandt, E. & Goebel, B. M. Taxonomic Note: A Place for DNA-DNA Reassociation and 16S rRNA Sequence Analysis in the Present Species Definition in Bacteriology. Int. J. Syst. Evol. Microbiol. 44, 846–849, DOI: 10.1099/00207713-44-4-846 (1994).
11. Kim, M., Oh, H. S., Park, S. C. & Chun, J. Towards a taxonomic coherence between average nucleotide identity and 16S rRNA gene sequence similarity for species demarcation of prokaryotes. Int. J. Syst. Evol. Microbiol. DOI: 10.1099/ijs.0.059774-0 (2014).
12. Janda, J. M. & Abbott, S. L. 16S rRNA gene sequencing for bacterial identification in the diagnostic laboratory: Pluses, perils, and pitfalls, DOI: 10.1128/JCM.01228-07 (2007).
13. Konstantinidis, K. T. & Tiedje, J. M. Genomic insights that advance the species definition for prokaryotes. Proc. Natl. Acad. Sci. 102, 2567–2572, DOI: 10.1073/pnas.0409727102 (2005).
14. Jain, C., Rodriguez-R, L. M., Phillippy, A. M., Konstantinidis, K. T. & Aluru, S. High-throughput ANI Analysis of 90K Prokaryotic Genomes Reveals Clear Species Boundaries. bioRxiv DOI: 10.1101/225342 (2017).
15. Meier-Kolthoff, J. P., Auch, A. F., Klenk, H.-P. & Göker, M. Genome sequence-based species delimitation with confidence intervals and improved distance functions. BMC bioinformatics 14, 60 (2013).
16. Chun, J. et al. Proposed minimal standards for the use of genome data for the taxonomy of prokaryotes. Int. J. Syst. Evol. Microbiol. DOI: 10.1099/ijsem.0.002516 (2018).
17. Basu, M. K., Poliakov, E. & Rogozin, I. B. Domain mobility in proteins: Functional and evolutionary implications. Briefings Bioinforma. DOI: 10.1093/bib/bbn057 (2009).
18. Zmasek, C. M. & Godzik, A. This Déjà Vu Feeling-Analysis of Multidomain Protein Evolution in Eukaryotic Genomes. PLoS Comput. Biol. DOI: 10.1371/journal.pcbi.1002701 (2012).
19. Yang, S., Doolittle, R. F. & Bourne, P. E. Phylogeny determined by protein domain content. Proc. Natl. Acad. Sci. DOI: 10.1073/pnas.0408810102 (2005).
20. Koehorst, J. J., Saccenti, E., Schaap, P. J., dos Santos, V. A. M. & Suarez-Diez, M. Protein domain architectures provide a fast, efficient and scalable alternative to sequence-based methods for comparative functional genomics. F1000Research 5 (2016).
21. Snipen, L. & Ussery, D. A domain sequence approach to pangenomics: applications to escherichia coli [version 2; referees: 2 approved]. F1000Research 1, DOI: 10.12688/f1000research.1-19.v2 (2013).
22. Koehorst, J. J. et al. Sapp: functional genome annotation and analysis through a semantic framework using fair principles. Bioinformatics 1, 3 (2017).
23. Klappenbach, J. A. rrndb: the Ribosomal RNA Operon Copy Number Database. Nucleic Acids Res. DOI: 10.1093/nar/29.1.181 (2001).
24. Sonnhammer, E. L., Eddy, S. R., Birney, E., Bateman, A. & Durbin, R. Pfam: multiple sequence alignments and hmm-profiles of protein domains. Nucleic acids research 26, 320–322 (1998).
25. Finn, R. D. et al. The Pfam protein families database: Towards a more sustainable future. Nucleic Acids Res. 44, D279–D285, DOI: 10.1093/nar/gkv1344 (2016).
26. Nakabachi, A. et al. The 160-Kilobase Genome of the Bacterial Endosymbiont Carsonella. Science 314, 267–267, DOI: 10.1126/science.1134196 (2006).
27. Wang, X. J. et al. Genome sequence of the milbemycin-producing bacterium Streptomyces bingchenggensis, DOI: 10.1128/JB.00596-10 (2010).
28. Nikoh, N., Hosokawa, T., Oshima, K., Hattori, M. & Fukatsu, T. Reductive evolution of bacterial genome in insect gut environment. Genome Biol. Evol. 3, 702–714, DOI: 10.1093/gbe/evr064 (2011).
29. Finn, R. D. et al. Pfam: clans, web tools and services. Nucleic Acids Res. 34, D247–D251, DOI: 10.1093/nar/gkj149 (2006).
30. Snipen, L., Almøy, T. & Ussery, D. W. Microbial comparative pan-genomics using binomial mixture models. BMC Genomics 10, 385, DOI: 10.1186/1471-2164-10-385 (2009).
31. Koehorst, J. J. et al. Comparison of 432 pseudomonas strains through integration of genomic, functional, metabolic and expression data. Sci. reports 6, 38699 (2016).
32. Kamminga, T. et al. Persistence of functional protein domains in mycoplasma species and their role in host specificity and synthetic minimal life. Front. cellular infection microbiology 7, 31 (2017).
33. Hutchison, C. A. et al. Design and synthesis of a minimal bacterial genome. Science DOI: 10.1126/science.aad6253 (2016).
34. Goodacre, N. F., Gerloff, D. L. & Uetz, P. Protein domains of unknown function are essential in bacteria. mBio 5, DOI: 10.1128/mBio.00744-13 (2014). https://mbio.asm.org/content/5/1/e00744-13.full.pdf.
35. Ochman, H., Lawrence, J. G. & Groisman, E. A. Lateral gene transfer and the nature of bacterial innovation. Nature 405, 299 (2000).
36. Dutta, C. & Pan, A. Horizontal gene transfer and bacterial diversity. J. biosciences 27, 27–33 (2002).
37. Ochman, H., Lerat, E. & Daubin, V. Examining bacterial species under the specter of gene transfer and exchange. Proc. Natl. Acad. Sci. 102, 6595–6599 (2005).
38. Wolf, Y. I. & Koonin, E. V. Genome reduction as the dominant mode of evolution. BioEssays 35, 829–837, DOI: 10.1002/bies.201300037 (2013).
39. Plata, G., Henry, C. S. & Vitkup, D. Long-term phenotypic evolution of bacteria. Nature 517, 369–372 (2015).
40. Agarwala, R. et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 44, D7–D19, DOI: 10.1093/nar/gkv1290 (2016).
41. van Dam, J. C. J., Koehorst, J. J., Vik, J. O., Schaap, P. J. & Suarez-Diez, M. Interoperable genome annotation with gbol, an extendable infrastructure for functional data mining. bioRxiv 184747 (2017).
42. Lagesen, K. et al. RNAmmer: consistent and rapid annotation of ribosomal RNA genes. Nucleic Acids Res. 35, 3100–3108, DOI: 10.1093/nar/gkm160 (2007).
43. Hyatt, D. et al. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinforma. 11, 119, DOI: 10.1186/1471-2105-11-119 (2010). 1401.7457.
44. Finn, R. D. et al. InterPro in 2017-beyond protein family and domain annotations. Nucleic Acids Res. 45, D190–D199, DOI: 10.1093/nar/gkw1107 (2017).
45. De Lazzari, E., Grilli, J., Maslov, S. & Lagomarsino, M. C. Family-specific scaling laws in bacterial genomes. Nucleic Acids Res. 45, 7615–7622, DOI: 10.1093/nar/gkx510 (2017). 1703.09822.
46. Cosentino Lagomarsino, M., Sellerio, A. L., Heijning, P. D. & Bassetti, B. Universal features in the genome-level evolution of protein domains. Genome biology 10, R12, DOI: 10.1186/gb-2009-10-1-r12 (2009). 0807.1898.
47. Snipen, L. & Liland, K. H. micropan: Microbial Pan-Genome Analysis (2018). R package version 1.2.
48. Ashelford, K. E., Chuzhanova, N. A., Fry, J. C., Jones, A. J. & Weightman, A. J. New screening software shows that most recent large 16S rRNA gene clone libraries contain chimeras. Appl. Environ. Microbiol. 72, 5734–5741, DOI: 10.1128/AEM.00556-06 (2006).
49. Csardi, G. & Nepusz, T. The igraph software package for complex network research. InterJournal Complex Systems, 1695 (2006).
50. R Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria (2013).
51. Dowle, M. & Srinivasan, A. data.table: Extension of ‘data.frame’ (2018). R package version 1.11.4.
52. Wickham, H. Reshaping data with the reshape package. J. Stat. Softw. 21, 1–20 (2007).
53. Sievert, C. plotly for R (2018).
54. Pagès, H., Aboyoun, P., Gentleman, R. & DebRoy, S. Biostrings: Efficient manipulation of biological strings (2018). R package version 2.48.0.
55. Wickham, H., Hester, J. & Chang, W. devtools: Tools to Make Developing R Packages Easier (2018). R package version 1.13.6.
56. Auguie, B. gridExtra: Miscellaneous Functions for “Grid” Graphics (2017). R package version 2.3.
57. Carr, D., ported by Nicholas Lewin-Koh, Maechler, M. & contains copies of lattice functions written by Deepayan Sarkar. hexbin: Hexagonal Binning Routines (2018). R package version 1.27.2.
58. Neuwirth, E. RColorBrewer: ColorBrewer Palettes (2014). R package version 1.1–2.

Posted December 13, 2018.

Subject Area

Genomics

