Abstract
The growing number of metagenomic studies in medicine and environmental sciences is placing increasing demands on the computational infrastructure designed to analyze these very large datasets. Ultra-fast and precise taxonomic classifiers often achieve their speed and precision at the expense of sensitivity (i.e., the number of reads correctly classified). Here we introduce CLARK-S, a new software tool that can classify short reads with high precision, high sensitivity and high speed at the same time.
Introduction
One of the primary goals of metagenomic studies is to determine the taxonomic identity of bacteria and viruses in a heterogeneous microbial sample (e.g., soil, water, urban environment, human microbiome). This analysis can reveal the presence of unexpected bacteria and viruses in a newly explored microbial habitat (e.g., the marine environment in [1]), or in the case of the human body, elucidate relationships between diseases and imbalances in the microbiome (see, e.g., [2]).
Arguably, the most effective and unbiased method to study these microbial samples is via high-throughput sequencing. The associated computational problem is to assign sequenced (short) reads to a taxonomic unit. While this problem has been studied extensively and several methods and software tools are available, faster and more accurate algorithms are needed to keep pace with the increasing throughput of modern sequencing instruments. In [3] we introduced CLARK, a taxonomy-dependent binning method whose classification speed is currently unmatched. A recent independent evaluation of fourteen taxonomic binning/profiling methods showed that the classification precision of CLARK is comparable to, and sometimes better than, that of state-of-the-art classifiers ([4]). While CLARK’s speed and precision are very high, its classification sensitivity (i.e., the fraction of reads that it correctly classifies) can be significantly improved with the methods described next.
We recall that CLARK is an alignment-free method based on shared k-mers. Briefly, it assigns a read r to a reference genome G if r and G share more discriminative k-mers (i.e., k-mers that appear exclusively in one reference genome) than r shares with any other genome in the database. Here we show that the classification sensitivity can be increased by allowing mismatches between shared k-mers in a limited number of (carefully predetermined) positions, while maintaining the requirement for k-mers to be discriminative. The idea of allowing mismatches to improve the sensitivity of seed-and-extend alignment methods was pioneered in [5] with the notion of spaced seeds. While spaced seeds have been used in some metagenomic binning/profiling methods (e.g., MEGAN [6]), the use of discriminative spaced k-mers is novel. Here we describe a major extension of the algorithmic infrastructure of CLARK based on spaced seeds, called CLARK-S.
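The exact-match assignment rule recalled above can be illustrated with a minimal Python sketch. This is not CLARK's implementation (which is written in C++): the function names are hypothetical, only the forward strand is considered, and returning no assignment on a tie or when no discriminative k-mer is shared is an assumption made here for simplicity.

```python
from collections import Counter


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def discriminative_kmers(genomes, k):
    """Map each genome name to the k-mers that occur in no other genome."""
    per_genome = {name: kmers(seq, k) for name, seq in genomes.items()}
    # Each genome contributes a k-mer at most once, so a count of 1 means
    # that exactly one genome contains it.
    owners = Counter(km for kms in per_genome.values() for km in kms)
    return {name: {km for km in kms if owners[km] == 1}
            for name, kms in per_genome.items()}


def classify(read, disc, k):
    """Assign the read to the genome sharing the most discriminative k-mers.

    Returns None when no discriminative k-mer is shared or when the two
    best genomes tie (an assumption of this sketch).
    """
    index = {km: name for name, kms in disc.items() for km in kms}
    hits = Counter()
    for i in range(len(read) - k + 1):
        genome = index.get(read[i:i + k])
        if genome is not None:
            hits[genome] += 1
    ranked = hits.most_common(2)
    if not ranked:
        return None
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]
```

With two toy genomes that share the 4-mer ACGT, a read consisting only of ACGT is left unassigned, whereas a read covering k-mers unique to one genome is assigned to that genome.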
Methods
Given an integer k and m reference genomes {g1, g2, …, gm}, the set of discriminative k-mers Di for genome gi is the set of all k-mers in gi that do not occur (exactly) in any other genome [3]. A spaced seed s of length k and weight w < k is a string over the alphabet {1,*} that contains w ‘1’ symbols and (k-w) ‘*’ symbols. Matches are required at the ‘1’ positions, while mismatches are allowed at the ‘*’ positions. The set of discriminative spaced k-mers Ei,s is the set of all k-mers of Di that do not occur in any other set Dj (j ≠ i) when mismatches are allowed at the ‘*’ positions of s. It is well known that the design of spaced seeds is critical to achieving the highest possible precision and sensitivity ([5,7]). Since CLARK is more precise for long contiguous k-mers (e.g., k = 31), but its highest sensitivity occurs for k in the range [19,22], we considered spaced seeds of length k = 31 and weight w = 22. To determine the optimal positions for the allowed mismatches, we modeled (as is done in [5]) the succession of ‘1’ and ‘*’ via a Bernoulli distribution with parameter p, which represents the similarity level between the read and the genome. We set p = 0.95 to reflect the expected high similarity between sequences at the species rank. Through an exhaustive search for optimal spaced seeds (with parameters k = 31, w = 22, p = 0.95) using the dynamic programming method of [8] on a region of 100 bp, we selected the three spaced seeds with the highest hit probability, namely 1111*111*111**1*111**1*11*11111 (hit probability 0.99811), 11111*1**111*1*11*11**111*11111 (0.998099), and 11111*1*111**1*11*111**11*11111 (0.998093).
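The definitions of spaced seeds and of the sets Ei,s can be made concrete with a short Python sketch. The three seeds are copied from the text; everything else (function names, representing a masked k-mer by writing ‘*’ at the don't-care positions, and keeping the unmasked k-mers in the returned sets) is an assumption made for illustration and does not describe how CLARK-S stores its data.

```python
from collections import Counter

# The three spaced seeds selected in the text (k = 31, w = 22).
SEEDS = (
    "1111*111*111**1*111**1*11*11111",
    "11111*1**111*1*11*11**111*11111",
    "11111*1*111**1*11*111**11*11111",
)


def seed_weight(seed):
    """Number of positions at which a match is required."""
    return seed.count("1")


def apply_seed(kmer, seed):
    """Keep the bases at '1' positions and mask the bases at '*' positions."""
    if len(kmer) != len(seed):
        raise ValueError("k-mer and seed must have the same length")
    return "".join(b if s == "1" else "*" for b, s in zip(kmer, seed))


def discriminative_spaced_kmers(disc, seed):
    """Compute E_{i,s} from the discriminative sets D_i.

    disc maps a genome name to its set D_i. A k-mer of D_i is kept only if
    its masked form is not the masked form of any k-mer of another D_j.
    """
    masked = {name: {apply_seed(km, seed) for km in kms}
              for name, kms in disc.items()}
    # Masked forms are deduplicated per genome, so a count of 1 means the
    # masked form belongs to a single genome.
    owners = Counter(m for ms in masked.values() for m in ms)
    return {name: {km for km in kms if owners[apply_seed(km, seed)] == 1}
            for name, kms in disc.items()}
```

For example, with the toy seed 11*11, the k-mers ACGTA and ACCTA differ only at the masked position, so neither is discriminative once mismatches are allowed there, even if each is unique to its genome under exact matching.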
In the preprocessing stage, CLARK-S computes and stores on disk, for each genome gi and each spaced seed s, the set of discriminative spaced k-mers Ei,s. Compared to CLARK’s classification phase, CLARK-S requires three look-ups for each k-mer in a read (one look-up per spaced seed).
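The one-look-up-per-seed scheme can be sketched as follows. This is only an illustration under stated assumptions: the text does not specify how hits from the different seeds are combined, so this sketch simply sums them per genome and leaves ties unassigned; the dictionary-based index and all function names are hypothetical, and toy seeds of length 5 stand in for the three seeds of length 31.

```python
from collections import Counter


def apply_seed(kmer, seed):
    """Keep the bases at '1' positions and mask the bases at '*' positions."""
    return "".join(b if s == "1" else "*" for b, s in zip(kmer, seed))


def build_index(spaced_sets, seed):
    """Map each masked discriminative spaced k-mer to its genome.

    spaced_sets maps a genome name to its set E_{i,s} for this seed.
    """
    return {apply_seed(km, seed): name
            for name, kms in spaced_sets.items() for km in kms}


def classify_spaced(read, indexes):
    """Classify a read with one look-up per seed for each k-mer of the read.

    indexes maps each seed to the index built for it. Hits are summed over
    seeds (an assumption of this sketch); None is returned on no hit or tie.
    """
    hits = Counter()
    for seed, index in indexes.items():
        k = len(seed)
        for i in range(len(read) - k + 1):
            genome = index.get(apply_seed(read[i:i + k], seed))
            if genome is not None:
                hits[genome] += 1
    ranked = hits.most_common(2)
    if not ranked:
        return None
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]
```

In this toy setting, a read TTATT that differs from the reference k-mer TTTTT at the middle base is missed by exact matching but recovered by the seed 11*11, which is the effect the spaced seeds are meant to provide.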
Experimental Setup
Database
We compared CLARK-S and CLARK on the same set of reference genomes, namely all microbial genomes in the default NCBI/RefSeq database (total of 5,747 species: 1,335 bacteria, 123 archaea and 4,289 viruses).
Synthetic reads
Evaluations were carried out on simulated datasets and real metagenomic data, as explained next. First, we created six synthetic datasets containing reads from dominant organisms found in the mouth, city parks/medians, gut, indoor and soil environments. A seventh dataset containing reads randomly chosen from 525 bacterial/archaeal species was added (see Supplementary Figures 1-7). These datasets are composed of short synthetic reads generated using ART [9] with default settings (see Supplementary Note 1).
However, observe that a short read r generated from genome gi may also appear in another genome for a given error rate or number of mismatches. As a consequence, one cannot assume that the “ground truth” of read r is gi, because r might not be unique to gi. Ignoring this observation is likely to lead to incorrect conclusions on precision and sensitivity. To ensure an unbiased evaluation, we created additional datasets (called “unambiguous”) in which we removed any read that occurs in more than one species, for a given number of allowed mismatches (see Supplementary Notes 1 and 2). These datasets contain only unambiguously mapped reads and therefore allow an unbiased evaluation. In total, we have fourteen datasets containing reads from 647 species (see Supplementary Table 1).
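The filtering idea behind the “unambiguous” datasets can be sketched with a brute-force Python example. The actual procedure is described in the Supplementary Notes and is not reproduced here; this sketch assumes that a read “occurs” in a species when some same-length window of that species' genome lies within a given Hamming distance of the read, considers the forward strand only, and uses hypothetical function names.

```python
def within_mismatches(a, b, max_mm):
    """True if equal-length strings a and b differ at no more than max_mm positions."""
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches > max_mm:
                return False
    return True


def occurs_in(read, genome, max_mm):
    """True if some window of the genome matches the read within max_mm mismatches."""
    n = len(read)
    return any(within_mismatches(read, genome[i:i + n], max_mm)
               for i in range(len(genome) - n + 1))


def unambiguous_reads(reads, genomes, max_mm):
    """Drop every read that occurs in more than one species.

    genomes maps a species name to its sequence (one genome per species
    is assumed in this sketch).
    """
    kept = []
    for read in reads:
        n_species = sum(occurs_in(read, seq, max_mm) for seq in genomes.values())
        if n_species <= 1:
            kept.append(read)
    return kept
```

The quadratic scan is only practical for toy inputs; it is meant to show why the set of retained reads depends on the number of allowed mismatches.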
We also added three negative control samples containing short reads that do not exist in any genome in the NCBI/RefSeq database (see Supplementary Note 1). We used the precision and sensitivity metrics defined in [3] to evaluate the classification performance.
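As a reminder of how the two metrics behave, here is a minimal Python sketch. It assumes that precision is the fraction of classified reads that are correctly classified and that sensitivity is the fraction of all reads that are correctly classified (the latter matches the description given in the Introduction); the exact definitions are those of [3], and the function name and the use of None for unclassified reads are choices made for this illustration.

```python
def precision_sensitivity(predicted, truth):
    """Return (precision, sensitivity) for per-read labels.

    predicted[i] is the label assigned to read i, or None if the read was
    left unclassified; truth[i] is its true label.
    """
    if len(predicted) != len(truth):
        raise ValueError("predicted and truth must have the same length")
    total = len(truth)
    classified = sum(p is not None for p in predicted)
    correct = sum(p is not None and p == t for p, t in zip(predicted, truth))
    # Both ratios are undefined when their denominator is zero.
    precision = correct / classified if classified else float("nan")
    sensitivity = correct / total if total else float("nan")
    return precision, sensitivity
```

A classifier that leaves many reads unclassified can therefore be very precise while having low sensitivity, which is the trade-off this work addresses.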
Real metagenomic reads
For experiments on real metagenomes, we chose a large dataset from a recent study on the microbial profile of the NY City subway system, the Gowanus canal and public parks ([10]). We selected twelve samples covering various microbial habitats (e.g., bench, garbage can, kiosk, stairway rail, water, etc.), subway stations and levels of rider usage (see Supplementary Table 3). While the ground truth for these data is unknown, the abundances of bacteria, eukaryotes and viruses present in these samples were provided in [10]. Thus, we trimmed raw reads as was done in [10] (see Supplementary Table 3) and compared the results of CLARK/CLARK-S with the findings in [10] (see Supplementary Tables 4 and 5).
Results
Synthetic reads
Supplementary Table 2 shows that CLARK-S consistently achieves the highest sensitivity on the fourteen simulated datasets, while maintaining high precision. The gap in sensitivity is even larger on the unambiguous datasets. On the negative control samples, CLARK-S did not classify any reads, as expected. Supplementary Table 7 shows that CLARK-S classifies about 200 thousand short reads per minute (using one CPU), while CLARK classifies about 3.5 million short reads per minute. With eight cores, CLARK-S classifies about one million short reads per minute, which is sufficiently fast to process large metagenomic datasets in a few minutes. CLARK-S requires more time to build the database than CLARK, but its RAM usage is comparable (see Supplementary Table 8).
Real metagenomic reads
Supplementary Table 6 shows that CLARK-S classifies more reads than CLARK: on average, 27% more. Supplementary Table 5 lists the read counts assigned by each tool to each species listed in [10] and present in the database. To compare results from CLARK/CLARK-S against [10], we estimate an “agreement rate”, i.e., the proportion of species reported in [10] and present in the database that a tool also detects. For example, in sample GC01, 8 species reported in [10] are present in the database used (i.e., default NCBI/RefSeq genomes of bacteria, archaea and viruses). CLARK detected 6 of these 8 species, so its agreement rate is 75%. We repeat this estimation for all samples, i.e., for each sample we identify all species detected by [10] that are also present in the database (cf. Supplementary Table 4) and calculate the proportion of these species that CLARK and CLARK-S detected (cf. Supplementary Table 5).
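The agreement rate computation can be written as a few lines of Python. The function name and the species labels used to exercise it are hypothetical placeholders, not data from [10]; only the 6-out-of-8 arithmetic of the GC01 example comes from the text.

```python
def agreement_rate(reported, in_database, detected):
    """Fraction of comparable species that a tool detected.

    reported: species reported by the reference study for a sample.
    in_database: species present in the classifier's database.
    detected: species the tool detected in the sample.
    Only species both reported and present in the database are comparable.
    """
    comparable = set(reported) & set(in_database)
    if not comparable:
        raise ValueError("no reported species is present in the database")
    return len(comparable & set(detected)) / len(comparable)
```

With 8 comparable species of which 6 are detected, the function returns 0.75, matching the 75% agreement rate of the GC01 example. Species detected by the tool but not reported by the study do not affect the rate.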
CLARK-S consistently achieves the highest “agreement rate” with [10] on all samples. For instance, in samples P00589 and P00720, CLARK-S detected the presence of the virus Enterobacter phage HK97 but CLARK did not; in sample P01136, CLARK-S detected Brucella ovis but CLARK did not. In general, CLARK-S identified more relevant organisms than the other tested tools, as also observed in a recent study focusing on water samples [11].
Source code and data
CLARK-S is written in C++ and is freely available at http://clark.cs.ucr.edu. The synthetic datasets (default and unambiguous) are freely available at http://clark.cs.ucr.edu.
Competing interests
The authors declare that they have no competing interests.
Acknowledgements
This work was supported in part by the US NSF (IIS-1302134 and IIS-1526742).