Abstract
Coherent genomic groups are frequently used as a proxy for bacterial species delineation through computation of overall genome relatedness indices (OGRI). Average nucleotide identity (ANI) is the method of choice for estimating relatedness between genomic sequences. However, pairwise comparison of genome sequences based on ANI is relatively computationally intensive and therefore precludes analyses of large datasets composed of thousands of genome sequences.
In this work we evaluated an alternative OGRI based on k-mers counts to study prokaryotic species delimitation. A dataset containing more than 3,500 Pseudomonas genome sequences was successfully classified in a few hours with the same precision as ANI. A new visualization method based on zoomable circle packing was employed for assessing relationships between the 350 cliques generated. Amending databases with these Pseudomonas cliques greatly improves classification of metagenomic read sets with a k-mers-based classifier.
The developed workflow was integrated in the user-friendly KI-S tool that is available at the following address: https://iris.angers.inra.fr/galaxypub-cfbp.
Background
Species is the unit of biological diversity. Species delineation of Bacteria and Archaea historically relies on a polyphasic approach based on a range of genotypic, phenotypic and chemo-taxonomic (e.g. fatty acid profiles) data of cultured specimens. According to the List of Prokaryotic Names with Standing in Nomenclature (LPSN), approximately 15,500 bacterial species names have currently been validated within this theoretical framework [1]. Since the number of bacterial species inhabiting planet Earth is predicted to range between 10^7 and 10^12 according to different estimates [2,3], the genomics revolution provides an opportunity to accelerate the pace of species description.
Prokaryotic species are primarily described as cohesive genomic groups, and approaches based on similarity of whole genome sequences, also known as overall genome relatedness indices (OGRI), have been proposed for delineating species. Average nucleotide identity (ANI) is nowadays the most widely acknowledged OGRI for assessing relatedness between genomic sequences. Distinct ANI algorithms have been developed, such as ANI based on BLAST (ANIb [4]), ANI based on MUMmer (ANIm [5]) or ANI based on orthologous genes (OrthoANIb [6]; OrthoANIu [7]; gANI,AF [8]), which differ in their precision but more importantly in their calculation times [7]. Indeed, calculation time is an essential parameter for whole-genome comparison of large datasets. As of November 2018, the total number of prokaryotic genome sequences publicly available in the NCBI database is 170,728. Considering an average time of 1 second for calculating the ANI value of one pair of genome sequences, it would take approximately 1,000 years to obtain ANI values for all pairwise comparisons.
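As a rough check of this estimate, the arithmetic can be sketched as follows. The genome count and the 1 second per comparison are taken from the text; whether each pair is counted once or in both directions (query versus reference and the reverse) is an assumption, so both cases are shown. The function name is hypothetical.

```python
# Back-of-the-envelope estimate of all-versus-all ANI computing time.
N_GENOMES = 170_728                      # genome count quoted in the text
SECONDS_PER_PAIR = 1.0                   # average time per comparison quoted in the text
SECONDS_PER_YEAR = 365.25 * 24 * 3600    # Julian year


def years_needed(n_comparisons: int, seconds_per_pair: float = SECONDS_PER_PAIR) -> float:
    """Convert a number of sequential comparisons into years of computing time."""
    return n_comparisons * seconds_per_pair / SECONDS_PER_YEAR


# Assumption: ordered pairs, i.e. A vs B and B vs A are both computed.
ordered_pairs = N_GENOMES * (N_GENOMES - 1)
# Assumption: unordered pairs, i.e. each pair is computed only once.
unordered_pairs = ordered_pairs // 2

if __name__ == "__main__":
    print(f"ordered pairs:   {ordered_pairs:,} -> {years_needed(ordered_pairs):,.0f} years")
    print(f"unordered pairs: {unordered_pairs:,} -> {years_needed(unordered_pairs):,.0f} years")
```

Under these assumptions, counting ordered pairs gives a figure on the order of 900 years of sequential computation, and counting each pair once gives about half of that; either way the order of magnitude quoted above is unaffected.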
The number of words of length k (k-mers) shared between read sets [9] or genomic sequences [10] is an alignment-free alternative for assessing the (dis)similarities between entities. Methods based on k-mers counts, such as SIMKA [9], can quickly compute pairwise comparisons of multiple metagenome read sets with high accuracy. In addition, specific k-mers profiles are now routinely employed by multiple read classifiers for estimating the taxonomic structure of metagenome read sets [11–13]. While these k-mers based classifiers differ in terms of sensitivity and specificity [14], they rely on accurate genome databases for affiliating reads to a taxonomic rank.
The objective of the current work was to evaluate an alternative method based on k-mers counts to study species delimitation on extensive genome datasets. We therefore decided to employ k-mers counts for assessing similarity between genome sequences belonging to the Pseudomonas genus. Indeed, this genus contains an important diversity of species (n = 207), whose taxonomic affiliation is under constant evolution [15–21], and numerous genome sequences are available in public databases. We also propose an original visualization tool based on D3 Zoomable Circle Packing (https://gist.github.com/mbostock/7607535) for assessing relatedness of thousands of genome sequences. Finally, the benefit of taxonomic curation of the reference database on the taxonomic affiliation of metagenomic read sets was assessed. The developed workflow was integrated in the user-friendly KI-S tool, which is available in the Galaxy toolbox of CIRM-CFBP (https://iris.angers.inra.fr/galaxypub-cfbp).
Methods
Genomic dataset
All genome sequences (n = 3,623 as of April 2017) from the Pseudomonas genus were downloaded from the NCBI database (https://www.ncbi.nlm.nih.gov/genome/browse#!/overview/).
Calculation of Overall Genome Relatedness Indices
The percentage of shared k-mers between genome sequences was calculated with Simka version 1.4 [9] with the following parameters: abundance-min 1 and k-mers length ranging from 10 to 20. The percentage of shared k-mers was compared to ANIb values calculated with PYANI version 0.2.3 (https://github.com/widdowquinn/pyani). Due to the computing time required for ANIb calculation, only a subset of Pseudomonas genomic sequences (n = 934) was selected for this comparison. This subset was composed of genome sequences containing fewer than 150 scaffolds.
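To make this quantity concrete, the following minimal sketch computes a percentage of shared k-mers between two sequences. It is not Simka's implementation: the use of canonical k-mers, of presence/absence sets, and of a Jaccard-style ratio (shared k-mers over the union of both sets) are assumptions made for illustration, and the function names are hypothetical.

```python
from typing import Set

_COMPLEMENT = str.maketrans("ACGT", "TGCA")
_VALID_BASES = set("ACGT")


def canonical_kmers(seq: str, k: int) -> Set[str]:
    """Return the set of canonical k-mers of a sequence.

    A canonical k-mer is the lexicographically smaller of a k-mer and its
    reverse complement, so both strands yield the same set.
    """
    seq = seq.upper()
    kmers = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if not set(kmer) <= _VALID_BASES:
            continue  # skip k-mers containing ambiguous bases such as N
        reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
        kmers.add(min(kmer, reverse_complement))
    return kmers


def percent_shared_kmers(seq_a: str, seq_b: str, k: int = 15) -> float:
    """Percentage of distinct k-mers shared by two sequences (assumed Jaccard-style ratio)."""
    kmers_a = canonical_kmers(seq_a, k)
    kmers_b = canonical_kmers(seq_b, k)
    union = kmers_a | kmers_b
    if not union:
        return 0.0  # sequences shorter than k have no k-mers to compare
    return 100.0 * len(kmers_a & kmers_b) / len(union)


if __name__ == "__main__":
    print(percent_shared_kmers("ACGGTTCA", "TGAACCGT", k=4))  # a sequence and its reverse complement
```

Holding whole genomes as Python sets of strings is only practical for small examples; dedicated tools rely on compact k-mer encodings and disk-based counting to scale to thousands of genomes.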
Development of KI-S tool
An integrative tool named KI-S was developed. The number of shared k-mers between genome sequences is first calculated with Simka [9]. A custom R script is then employed to cluster the genome sequences according to their connected components at different selected thresholds (e.g. 50% of shared 15-mers). The clustering result is visualized as a Zoomable Circle Packing representation with the D3.js JavaScript library (https://gist.github.com/mbostock/7607535). The source code of the KI-S tool is available at the following address: https://sourcesup.renater.fr/projects/ki-s/. A wrapper for accessing KI-S as a user-friendly Galaxy tool is also available at the following address: https://iris.angers.inra.fr/galaxypub-cfbp.
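The clustering step can be illustrated with a short sketch. KI-S itself uses a custom R script whose details are not reproduced here; the Python code below is an independent illustration that groups genomes into connected components with a union-find structure, assuming that two genomes are linked when their percentage of shared k-mers is strictly above the threshold. The function name and the input layout (a dictionary keyed by genome pairs) are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def cluster_by_threshold(
    genomes: Iterable[str],
    similarities: Dict[Tuple[str, str], float],
    threshold: float = 50.0,
) -> List[List[str]]:
    """Group genomes into connected components of the graph whose edges are
    the pairs sharing more than `threshold` percent of their k-mers.

    Every genome named in `similarities` must also be listed in `genomes`.
    """
    parent = {genome: genome for genome in genomes}

    def find(x: str) -> str:
        # Follow parent links to the component root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), percent_shared in similarities.items():
        if percent_shared > threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a  # merge the two components

    components = defaultdict(list)
    for genome in parent:
        components[find(genome)].append(genome)
    # Largest groups first, members in alphabetical order.
    return sorted((sorted(m) for m in components.values()), key=lambda m: (-len(m), m))


if __name__ == "__main__":
    toy = {("g1", "g2"): 80.0, ("g2", "g3"): 55.0, ("g1", "g3"): 30.0, ("g3", "g4"): 12.0}
    print(cluster_by_threshold(["g1", "g2", "g3", "g4"], toy, threshold=50.0))
```

In the toy data, g1 and g3 share only 30% of their k-mers but end up in the same group because both are linked to g2. This is a property of connected components: group membership requires a chain of links above the threshold, not that every pair of members exceeds it.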
Taxonomic inference of metagenomic read sets
The taxonomic profiles of 9 metagenome read sets derived from seeds, germinating seeds and seedlings of common bean (Phaseolus vulgaris var. Flavert) were estimated with Clark version 1.2.4 [13]. These metagenome datasets were selected because of the high relative abundance of reads affiliated to Pseudomonas [22]. The following Clark default parameters were used for the taxonomic profiling: -k 31, -t <minFreqTarget> 0 and -o <minFreqObject> 0. Three distinct Clark databases were employed: (i) the original Clark database from NCBI/RefSeq at the species level; (ii) the original Clark database supplemented with the 3,623 Pseudomonas genome sequences and their original NCBI taxonomic affiliation; and (iii) the original Clark database supplemented with the 3,623 Pseudomonas genome sequences whose taxonomic affiliation was corrected according to the reclassification based on the number of shared k-mers. For this third database, genome sequences were clustered at >50% of shared 15-mers.
Results
Selection of optimal k-mers size and percentage of shared k-mers
Using the percentage of shared k-mers as an OGRI for species delineation first required determining the optimal k-mers size. This was performed by comparing the percentage of shared k-mers to a widely acknowledged OGRI, ANIb [4], between 934 Pseudomonas genome sequences. Since the species delineation threshold was initially proposed following the observation of a gap in the distribution of pairwise comparison values [23], the distribution profiles obtained with k-mers lengths ranging from 10 to 20 were compared to ANIb values. Short k-mers (k < 12) were evenly shared by most strains and therefore not discriminative (Fig. 1). As the size of the k-mers increased, a multimodal distribution with four peaks was observed (Fig. 1). The first peak is related to genome sequences that do not belong to the same species. Then, depending on k length, the second and third peaks (e.g. 50% and 80% for k = 15) corresponded to genome sequences associated with the same species and subspecies, respectively. The fourth peak at 100% of shared k-mers was related to identical genome sequences.
Fifty percent of shared 15-mers is close to an ANIb value of 0.95 (Fig. 2), a threshold commonly employed for delineating bacterial species [4]. More precisely, the median percentage of shared 15-mers is 49% [34%-66%] for ANIb values ranging from 0.94 to 0.96. In addition, 15-mers allow the investigation of inter- and infra-specific relationships at lower and higher percentages of shared 15-mers, respectively.
Computing the percentage of shared 15-mers for 934 genome sequences took 4 hours on a DELL PowerEdge R510 server, whereas obtaining all ANIb pairwise comparisons took approximately 3 months (an approximately 500-fold decrease in computing time).
Classification of Pseudomonas genomes
The percentage of shared 15-mers was then used to investigate relatedness between the 3,623 publicly available Pseudomonas genome sequences. At a threshold of 50% of shared 15-mers, we identified 350 cliques. The clique containing by far the largest number of genome sequences corresponded to the P. aeruginosa species (n = 2,341), followed by the phylogroups PG1 (n = 111), PG3 (n = 92) and PG2 (n = 74) of the P. syringae species complex ([16]; Table S1). At the clustering threshold employed, 185 cliques were composed of a single genome sequence, highlighting the high diversity of Pseudomonas strains. Moreover, according to the Chao1 index, Pseudomonas species richness is estimated at 629 cliques [± 57], which indicates that additional strain isolation and sequencing efforts are needed to cover the whole diversity of this bacterial genus. Graphical representation of hierarchical clustering by dendrogram is generally not optimal for a large dataset. Here we employed Zoomable Circle Packing as an alternative to the dendrogram for representing similarity between genome sequences (Fig. 3 and FigS1.html). The different clustering thresholds that can be superimposed on the same graphical representation allow the investigation of inter- and intra-group relationships (Fig. 3 and FigS1.html). This is useful for affiliating a specific clique to a group or subgroup of Pseudomonas species.
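The richness estimate mentioned above can be illustrated with a minimal sketch of the Chao1 estimator, where each clique is treated as a "species" and its abundance is the number of genome sequences it contains. Which Chao1 variant was used in this work is not stated, so the classic form S_obs + F1^2 / (2 F2), with the bias-corrected form as a fallback when no clique contains exactly two genomes, is an assumption. The clique sizes in the example are invented toy values, not the Pseudomonas data, and the function name is hypothetical.

```python
from collections import Counter
from typing import Iterable


def chao1(clique_sizes: Iterable[int]) -> float:
    """Chao1 richness estimate from the number of genomes in each observed clique.

    F1 is the number of cliques with exactly one genome (singletons) and
    F2 the number of cliques with exactly two genomes (doubletons).
    """
    sizes = [size for size in clique_sizes if size > 0]
    observed = len(sizes)
    counts = Counter(sizes)
    f1, f2 = counts[1], counts[2]
    if f2 > 0:
        # Classic Chao1 (assumed variant).
        return observed + (f1 * f1) / (2 * f2)
    # Bias-corrected form evaluated at F2 = 0, which avoids a division by zero.
    return observed + f1 * (f1 - 1) / 2


if __name__ == "__main__":
    toy_sizes = [10, 4, 2, 2, 1, 1, 1, 1]  # hypothetical clique sizes
    print(chao1(toy_sizes))
```

The estimator grows with the number of singleton cliques relative to doubleton cliques, which is why a dataset in which more than half of the cliques contain a single genome yields an estimated richness well above the number of cliques observed.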
Improvement of taxonomic affiliation of metagenomic read sets
The taxonomic composition of metagenome read sets is frequently estimated with k-mers based classifiers. While these k-mers based classifiers differ in terms of sensitivity and specificity, they all rely on accurate genome databases for affiliating reads to a taxonomic rank. Here, we investigated the impact of database content and curation on taxonomic affiliation. Using Clark [13] as a taxonomic profiler with the original Clark database, we classified metagenome read sets derived from bean seeds, germinating seeds and seedlings [22]. Adding the 3,623 Pseudomonas genomes with their original taxonomic affiliation from NCBI to the original Clark database did not increase the percentage of classified reads (Fig. 4). However, adding the same genome sequences reclassified in cliques according to their percentage of shared k-mers (k = 15; threshold = 50%) increased the number of classified reads 1.4-fold on average (Fig. 4).
Discussion
Classification of bacterial strains on the basis of their genome sequence similarities has emerged over the past decade as an alternative to cumbersome DNA-DNA hybridizations [24]. Although ANIb is the current gold-standard method for investigating genomic relatedness, its intensive computational time prohibits its use for comparing large genome datasets [7]. In contrast, investigating the percentage of shared k-mers is scalable for comparing thousands of genome sequences.
In a method based on k-mers counts, choosing the length of k is a compromise between accuracy and speed. The distribution of shared k-mers values between genome sequences is impacted by k length. For k = 15, four peaks were observed at 15%, 50%, 80% and 100% of shared k-mers. The second peak is close to an ANIb value of 0.95 and falls in the so-called grey or fuzzy zone [24] where taxonomists might decide to split or merge species. Hence, according to our working dataset, it seems that 50% of shared 15-mers is a good proxy for delineating Pseudomonas cliques. Despite the diverse range of habitats colonized by different Pseudomonas populations [19], it is likely that the percentage of shared k-mers has to be adapted when investigating other bacterial genera. Indeed, since population dynamics, lifestyle and location impact molecular evolution, it is somewhat illusory to define a fixed threshold for species delineation [25]. While 15-mers are a good starting point for investigating infra-specific to infra-generic relationships between genome sequences, the computational speed of KI-S offers the possibility to perform large-scale genomic comparisons at different k sizes to select the most appropriate threshold.
Genomic relatedness based on whole genome sequences is becoming a standard for bacterial strain identification and bacterial taxonomy [24,26]. This proposition is primarily motivated by fast and inexpensive sequencing of bacterial genomes together with the limited availability of cultured specimens for performing the classical polyphasic approach. Whether full genome sequences should represent the basis of taxonomic classification is an ongoing debate between systematists [27]. While this consideration is well beyond the objectives of this work, obtaining a classification of bacterial genome sequences into coherent groups is of general interest. Indeed, the number of misidentified genome sequences is exponentially growing in public databases. A number of initiatives such as the Digital Protologue Database (DPD [28]), the Microbial Genomes Atlas (MiGA [29]), the Life Identification Numbers database (LINbase [30]) or the Genome Taxonomy Database (GTDB [26]) have proposed services to classify and rename bacterial strains based on ANIb values or single-copy marker proteins. Using the percentage of shared k-mers between unknown bacterial genome sequences and reference genome sequences associated with these databases could provide a rapid complementary approach for bacterial classification. Moreover, the KI-S tool provides a friendly visualization interface that could help systematists to curate whole genome databases. Indeed, zoomable circle packing could be employed for highlighting (i) misidentified strains, (ii) bacterial taxa that possess representative type strains or (iii) bacterial taxa that contain few genome sequences.
Association between a taxonomic group and its distribution across a range of habitats is useful for inferring the role of this taxon on its host or environment. For instance, community profiling approaches based on molecular markers such as hypervariable regions of the 16S rRNA gene have been helpful for highlighting correlations between host fitness and microbiome composition. Finer-grained taxonomic resolution of microbiome composition could be achieved with metagenomics through k-mers based classification of reads. In this study we demonstrate that employing a database with a classification of strains reflecting their genomic relatedness greatly improves taxonomic assignments of reads. Therefore, investigating relationships between bacterial genome sequences not only benefits bacterial taxonomy but also serves microbial ecology.
Competing interests
The authors declare that they have neither competing interests nor conflicts of interest.
Funding
This research was supported in part by a grant awarded by the Region des Pays de la Loire (metaSEED, 2013 10080).
Acknowledgements
The authors wish to thank Claire Lemaitre and Guillaume Rizk for their assistance with the SIMKA software.
Footnotes
Authors email addresses: Martial Briand: martial.briand{at}inra.fr, Mariam Bouzid: mariam.bouzid{at}icloud.com, Gilles Hunault: gilles.hunault{at}univ-angers.fr, Marc Legeay: legeay.marc{at}free.fr, Marion Fischer-Le Saux: marion.le-saux{at}inra.fr, Matthieu Barret: matthieu.barret{at}inra.fr