Higher classification sensitivity of short metagenomic reads with CLARK-S

Rachid Ounit; Stefano Lonardi

doi:10.1101/053462

Abstract

The growing number of metagenomic studies in medicine and environmental sciences is creating increasing demands on the computational infrastructure designed to analyze these very large datasets. Often, the construction of ultra-fast and precise taxonomic classifiers can compromise on their sensitivity (i.e., the number of reads correctly classified). Here we introduce CLARK-S, a new software tool that can classify short reads with high precision, high sensitivity and high speed at the same time.

Introduction

One of the primary goals of metagenomic studies is to determine the taxonomical identity of bacteria and viruses in a heterogenous microbial sample (e.g., soil, water, urban environment, human microbiome). This analysis can reveal the presence of unexpected bacteria and viruses in a newly explored microbial habitat (e.g., the marine environment in [1]), or in the case of the human body, elucidate relationships between diseases and imbalances in the microbiome (see, e.g., [2]).

Arguably, the most effective and unbiased method to study these microbial samples is via high-throughput sequencing. The associated computational problem is to assign sequenced (short) reads to a taxonomic unit. While this problem has been studied extensively and several methods and software tools are available, faster and more accurate algorithms are needed to keep pace with the increasing throughput of modern sequencing instruments. In [3] we introduced CLARK, a taxonomy-dependent binning method whose classification speed is currently unmatched. A recent independent evaluation of fourteen taxonomic binning/profiling methods showed that the classification precision of CLARK is comparable (sometimes better) than the state-of-the-art classifiers ([4]). While CLARK’s speed and precision are very high, its classification sensitivity (i.e., the fraction of reads that it correctly classifies) can be significantly improved with the methods described next.

We recall that CLARK is an alignment-free method based on shared k-mers. Briefly, it assigns a read r to a reference genome G if r and G share more discriminative k-mers (i.e., k-mers that appear exclusively in one reference genome) than other genomes in the database. Here we show that the classification sensitivity can be increased by allowing mismatches between shared k-mers in a limited number of (carefully predetermined) positions, while maintaining the requirement for k-mers to be discriminative. The idea of allowing mismatches to improve the sensitivity of seed-and-extend alignment methods was pioneered in [5] with the notion of spaced seed. While spaced seeds have been used in some metagenomic binning/profiling methods (e.g., MEGAN [6]), the use of discriminative spaced k-mers is novel. Here we describe a major extension of the algorithmic infrastructure of CLARK based on spaced seed, called CLARK-S.

Methods

Given an integer k and m reference genomes {g₁, g₂, …, g_m}, the set of discriminative k-mers D_i for genome gi is the set of all k-mers in g_i that do not occur (exactly) in any other genome [3]. A spaced seed s of length k and weight w < k is a string over the alphabet {1,*} that contains w ‘1’ and (k-w) ‘*’. Matches are required at a ‘1’ positions, while mismatches are allowed at the ‘*’ locations. The set of discriminative spaced k-mers E_i,s is the set of all k-mers of D_i that do not occur in any other set D_j(j ≠ i) when mismatches are allowed at ‘*’ positions in s. It is well known that the design of spaced seed is critical to achieve the highest possible precision and sensitivity ([5,7]). Since CLARK is more precise for long contiguous k-mers (e.g., k = 31), but its highest sensitivity occurs for k in the range [19,22], we considered spaced seeds of length k=31 and weight w= 22. To determine the optimal positions for the allowed mismatches, we modeled (as it is done in [5]) the succession of ‘1’ and ‘*’ via a Bernoulli distribution with parameter p, which represents the similarity level between the read and the genome. We set p=0.95 to reflect the expected high similarity between sequences at the species rank. Through an exhaustive search for optimal spaced seeds (with parameters k = 31, w= 22, p= 0.95) using the dynamic programming method by [8] on a region of 100bp, we selected three spaced seeds with the highest hit probability, namely 1111*111*111**1*111**1*11*11111 (hit probability 0.99811), 11111*1**111*1*11*11**111*11111(0.998099), and 11111*1*111**1*11*111**11*11111 (0.998093).

In the preprocessing stage, CLARK-S computes and stores on disk, for each genome g_i and each spaced seed s, the set of discriminative spaced k-mers E_i,s. Compared to the CLARK’s classification phase, CLARK-S now requires three look-ups for each k-mer in a read (one look-up per spaced seed).

Experimental Setup

Database

We compared CLARK-S and CLARK on the same set of reference genomes, namely all microbial genomes in the default NCBI/RefSeq database (total of 5,747 species: 1,335 bacteria, 123 archaea and 4,289 viruses).

Synthetic reads

Evaluations were carried out on simulated datasets and real metagenomic data, as explained next. First, we created six synthetic datasets containing reads from dominant organisms found in the mouth, city parks/medians, gut, indoor and soil environments. A seventh dataset containing reads randomly chosen from 525 bacterial/archaeal species was added (see Supplementary Figures 1-7). These datasets are composed of short synthetic reads generated using ART [9] with default settings (see Supplementary Note 1).

However, observe that a short read r generated from genome g_i may appear in another genome for a given error rate or number of mismatches. As a consequence one cannot assume that the “ground truth” of read r is gi, because r might not be unique to gi. Ignoring this observation is likely to lead to incorrect conclusions on precision and sensitivity. In order to ensure an unbiased evaluation, we created additional datasets (called “unambiguous”) in which we removed any read that occurs in more than one species, for a given number of allowed mismatches (see Supplementary Note 1 and 2). These datasets only contain unambiguously mapped reads that can allow an unbiased evaluation. In total, we have fourteen datasets containing reads from 647 species (see Supplementary Table 1).

We also added three negative control samples containing short reads that do not exist in any genomes in the NCBI/RefSeq database (see Supplementary Note 1). We used the precision and sensitivity metrics defined [3] to evaluate the classification performance.

Real metagenomic reads

For experiments on real metagenomes, we chose a large dataset from a recent study on the microbial profile of the NY City subway system, the Gowanus canal and public parks ([10]). We selected twelve samples from various microbial habitat (e.g., bench, garbage can, kiosk, stairway rail, water, etc.), subway stations and riders usage (see Supplementary Table 3). While the ground truth for these data is unknown, the abundance of bacteria, eukaryotes and viruses present in these samples were provided in [10]. Thus, we trimmed raw reads as it was done in [10] (see Supplementary Table 3) and compared the results of CLARK/CLARK-S with the findings in [10] (see Supplementary Table 4 and 5).

Results

Synthetic reads

Observe in Supplemental Table 2 that the sensitivity achieved by CLARK-S on the fourteen simulated datasets is consistently the highest, while maintaining high precision. Note that the gap in sensitivity is even higher on the unambiguous datasets. On the negative control samples, CLARK-S did not classify any reads as expected. Supplemental Table 7 shows that CLARK-S classifies about 200 thousand short reads per minute (using one CPU), while CLARK classifies about 3.5 million short reads per minute. If one can take advantage of eight cores, CLARK-S classifies about one million short read per minute, which is sufficiently fast to process large metagenomic datasets in few minutes. CLARK-S requires more time to build the database than CLARK, but its RAM usage is comparable (see Supplementary Table 8).

Real metagenomic reads

Observe in Supplemental Table 6 that CLARK-S classifies more reads than CLARK. On average, CLARK-S classifies 27% more reads than CLARK. Supplementary Table 5 indicates the reads count assigned by each tool to each species listed in [10] and present in the database. In order to compare results from CLARK/CLARK-S against [10], we estimate the “agreement rate”. For example, in the sample GC01, there are 8 species reported by the study [10] that are present in the database used (i.e., default NCBI/RefSeq genomes of bacteria, archaea and viruses). However, CLARK detected 6 species out of the 8 species, so its agreement rate is 75%. We repeat this estimation for all samples, i.e., for each sample we identified all species detected by [10] that were also present in the database (cf. Supplementary Table 4) and calculate the proportion of species CLARK and CLARK-S detected out of the identified species (cf. Supplementary Table 5).

CLARK-S achieves consistently the highest “agreement rate” with [10] on all samples. For instance, in sample P00589 and P00720, CLARK-S detected the presence of the virus Enterobacter phage HK97 but CLARK did not; in sample P01136, CLARK-S detected Brucella ovis but CLARK did not. In general, CLARK-S identified more relevant organisms than the other tested tools, as observed by a recent study focusing on water samples [11].

Source code and data

CLARK-S is written in C++ and is freely available at http://clark.cs.ucr.edu. The synthetic datasets (default and unambiguous) are freely available at http://clark.cs.ucr.edu.

Competing interests

Authors declared they thave no competing interests.

Acknowledgements

This work was supported in part by the US NSF (IIS-1302134 and IIS-1526742).

References

[1].↵
Venter, J. C., Remington, K., Heidelberg, J. F., Halpern, A. L., Rusch, D., Eisen, J. A., Wu, D., Paulsen, I., Nelson, K. E., et al.(2004). Environmental genome shotgun sequencing of the Sargasso Sea. Science, 304(5667), 66–74.
[2].↵
Huttenhower, C., Gevers, D., Knight, R., Abubucker, S., Badger, J., and Chinwalla, A. e. a. (2012). Structure, function and diversity of the healthy human microbiome. Nature, 486(7402), 207–214.
OpenUrl CrossRef PubMed Web of Science
[3].↵
Ounit, R., Wanamaker, S., Close, T. J., Lonardi, S. (2015). CLARK: fast and accurate classification of metagenomic and genomic sequences using discriminative k-mers. BMC Genomics, 16(1), 236.
OpenUrl CrossRef PubMed
[4].↵
Lindgreen, S., Adair, K. L., et al. (2016). An evaluation of the accuracy and speed of metagenome analysis tools. Scientific reports, 6, 19233.
OpenUrl
[5].↵
Ma, B., Tromp, J., et al. (2002). PatternHunter: faster and more sensitive homology search. Bioinformatics, 18(3), 440–445.
OpenUrl CrossRef PubMed Web of Science
[6].↵
Huson, D. H., Auch, A. F., et al. (2007). MEGAN analysis of metagenomic data. Genome Research, 17(3), 377–386.
OpenUrl Abstract/FREE Full Text
[7].↵
Brown, D. G., Li, M., et al. (2004). A tutorial of recent developments in the seeding of local alignment. Journal of Bioinformatics and Computational Biology, 2(04), 819–842.
OpenUrl CrossRef PubMed
[8].↵
Ilie, L., Ilie, S., et al. (2011). SpEED: fast computation of sensitive spaced seeds. Bioinformatics, 27(17), 2433–2434.
OpenUrl CrossRef PubMed Web of Science
[9].↵
Huang, W., Li, L., et al. (2012). ART: a next-generation sequencing read simulator. Bioinformatics, 28(4), 593–594.
OpenUrl CrossRef PubMed Web of Science
[10].↵
Afshinnekoo, E., Meydan, C., et al. (2015). Geospatial resolution of human and bacterial diversity with city-scale metagenomics. Cell Systems, 1(1), 72–87.
OpenUrl
[11].↵
Thompson, L. R., Williams, G. J., et al. (2016). Metagenomic covariation along densely sampled environmental gradients in the red sea. The ISME journal.

References

[1].↵
Adams, R. I., Bateman, A. C., Bik, H. M., and Meadow, J. F. Microbiota of the indoor environment: a meta-analysis. Microbiome 3, 1 (2015), 1.
OpenUrl CrossRef PubMed
[2].↵
Afshinnekoo, E., Meydan, C., Chowdhury, S., Jaroudi, D., Boyer, C., Bernstein, N., Maritz, J. M., Reeves, D., Gandara, J., Chhangawala, S., et al. Geospatial resolution of human and bacterial diversity with city-scale metagenomics. Cell systems 1, 1 (2015), 72–87.
OpenUrl
[3].↵
Fierer, N., Leff, J. W., Adams, B. J., Nielsen, U. N., Bates, S. T., Lauber, C. L., Owens, S., Gilbert, J. A., Wall, D. H., and Caporaso, J. G. Cross-biome metagenomic analyses of soil microbial communities and their functional attributes. Proceedings of the National Academy of Sciences 109, 52 (2012), 21390–21395.
OpenUrl Abstract/FREE Full Text
[4].↵
Franzosa, E. A., Huang, K., Meadow, J. F., Gevers, D., Lemon, K. P., Bohannan, B. J., and Huttenhower, C. Identifying personal microbiomes using metagenomic codes. Proceedings of the National Academy of Sciences 112, 22 (2015), E2930–E2938.
OpenUrl Abstract/FREE Full Text
[5].↵
Huttenhower, C., Gevers, D., Knight, R., Abubucker, S., Badger, J., and Chinwalla, A. e. a. Structure, function and diversity of the healthy human microbiome. Nature 486, 7402 (2012), 207–214.
OpenUrl CrossRef PubMed Web of Science
[6].↵
Huang, W., Li, L., Myers, J. R., and Marth, G. T. Art: a next-generation sequencing read simulator. Bioinformatics 28, 4 (2012), 593–594.
OpenUrl CrossRef PubMed Web of Science
[7].↵
Kuleshov, V., Jiang, C., Zhou, W., Jahanbani, F., Batzoglou, S., and Snyder, M. Synthetic long-read sequencing reveals intraspecies diversity in the human microbiome. Nature biotechnology 34, 1 (2016), 64–69.
OpenUrl CrossRef PubMed
[8].↵
Ondov, B. D., Bergman, N. H., and Phillippy, A. M. (2011). Interactive metagenomic visualization in a web browser. BMC bioinformatics, 12(1), 1.
[9].↵
Reese, A. T., Savage, A., Youngsteadt, E., McGuire, K. L., Koling, A., Watkins, O., Frank, S. D., and Dunn, R. R. Urban stress is associated with variation in microbial species composition - but not richness - in manhattan. The ISME journal(2015).
[10].↵
Ruiz-Calderon, J. F., Cavallin, H., Song, S. J., Novoselac, A., Pericchi, L. R., Hernandez, J. N., Rios, R., Branch, O. H., Pereira, H., Paulino, L. C., et al. Walls talk: Microbial biogeography of homes spanning urbanization. Science advances 2, 2 (2016), e1501061.
OpenUrl FREE Full Text