Abstract
Motivation Advances in the sequencing of uncultured environmental samples, dubbed metagenomics, raise a growing need for accurate taxonomic assignment. Accurate identification of the organisms present within a community is essential to understanding even the most elementary ecosystems. However, current high-throughput sequencing technologies generate short reads that only partially cover full-length marker genes, which poses difficult bioinformatic challenges for taxonomic identification at high resolution.
Results We designed MATAM, a software tool dedicated to the fast and accurate targeted assembly of short reads sequenced from a genomic marker of interest. The method implements a stepwise process based on the construction and analysis of a read overlap graph. It is applied to the assembly of 16S rRNA markers and is validated on simulated, synthetic and genuine metagenomes. We show that MATAM outperforms other available methods in terms of low error rates and recovered genome fractions, and is suitable for providing improved assemblies for precise taxonomic assignments.
Availability https://github.com/bonsai-team/matam
Contact pierre.pericard{at}gmail.com, helene.touzet{at}univ-lille1.fr
1 Introduction
Shotgun metagenomic sequencing provides an unprecedented opportunity to study uncultured microbial samples, with multiple applications ranging from the human microbiome to soil or marine samples, for which the vast majority of microbial diversity remains unknown [13].
A major goal of metagenomic studies is to characterize microbial diversity and ecological structure. This is often achieved by focusing on one of several phylogenetic marker genes [12, 28], which are ubiquitous in the taxonomic range of interest and exhibit variable discriminative regions. For bacterial communities, the gold standard marker is the 16S ribosomal RNA (rRNA, ~1500bp avg. length), for which millions of sequences are available in curated reference databases, such as Silva [24], RDP [2] or GreenGenes [3]. Traditional approaches such as amplicon sequencing are limited to the analysis of small portions of the marker sequences. This strongly limits the identification of organisms at sufficiently precise taxonomic levels, typically beyond genus [23]. To assign marker sequences to species, or even strains, we need to be able to recover full length rRNA with less than a few errors per kilobase. Metagenomic assemblers are not suitable for this task, because they are optimized to deal with whole genomes, and struggle to differentiate between very similar sequences [27]. In this respect, marker-oriented methods such as EMIRGE [19] and REAGO [33] were recently developed in order to assemble metagenomic read subsets into full length 16S rRNA contigs, thus aiming to improve the taxonomic assignment accuracy of environmental samples. EMIRGE uses a Bayesian approach to iteratively reconstruct full length 16S rRNA sequences. REAGO identifies rRNA reads using Infernal [20], and then constructs an overlap graph by searching for exact overlaps between reads using a suffix/prefix array. However, such tools still show limitations in terms of error rates in the recovered sequences, as well as in dealing with low abundance species.
In this work, we present MATAM, a new approach based on the construction and exploitation of an overlap graph, carefully designed to minimize the error rate and the risk of chimera formation. MATAM was validated on both simulated and actual sequencing data. It is able to reconstruct nearly full length 16S rRNAs and is robust to variations in the sequencing depth as well as community complexity.
2 Methods
2.1 Overview of MATAM
The MATAM (Mapping-Assisted Targeted-Assembly for Metagenomics) pipeline takes as input a set of shotgun metagenomic short reads and a reference database containing the largest possible set of sequences from a given target marker gene. MATAM identifies reads originating from that marker, and assembles them into nearly full length sequences. It is composed of four major steps, illustrated in Figure 1. Although this method should work for any conserved and widely surveyed gene, we will focus on the 16S rRNA for the remainder of the article. Additional technical details and parameters are available in the Supplementary Methods.
2.2 Reference database construction
The availability of a reference database for the marker gene is an essential feature of the method, because it allows us to model the target sequences. For applications to 16S rRNA assembly, MATAM uses the Silva 128 SSU Ref NR database [24]. From this reference database, which we refer to as the complete database, we also build a clustered reference database that provides a coarse-grained representation of the taxonomic space. For that task, we use Sumaclust [17, 8] with a 95% identity threshold.
2.3 rRNA reads identification and mapping
In the first step, reads are mapped against the clustered reference database using SortMeRNA [10, 9]. This step allows us to quickly sort out 16S rRNA reads from the whole set of reads, while providing high quality alignments. For each read, we keep up to ten best alignments against the reference database. Moreover, this mapping step yields a broad classification of the 16S rRNA reads. Indeed, reads coming from distantly related species are aligned against their respective closest known references, which nest in distant lineages of the taxonomy, while reads from closely related species are aligned against closely related references.
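The "up to ten best alignments per read" rule can be illustrated with a short sketch. This is not the SortMeRNA or MATAM code: the function name `keep_best_alignments`, the `(read_id, ref_id, score)` tuple layout and the example scores are assumptions made for illustration only.

```python
import heapq
from collections import defaultdict


def keep_best_alignments(alignments, max_per_read=10):
    """Keep at most `max_per_read` highest-scoring alignments for each read.

    `alignments` is an iterable of (read_id, ref_id, score) tuples
    (a hypothetical layout, not the mapper's actual output format).
    Reads that never appear in `alignments` are simply absent from the
    result, which mirrors the sorting-out of non-16S reads.
    """
    per_read = defaultdict(list)
    for read_id, ref_id, score in alignments:
        per_read[read_id].append((score, ref_id))
    return {
        read_id: heapq.nlargest(max_per_read, hits)
        for read_id, hits in per_read.items()
    }


# Toy example with a limit of two alignments per read.
example = [
    ("r1", "refA", 50), ("r1", "refB", 70), ("r1", "refC", 60),
    ("r2", "refA", 40),
]
best = keep_best_alignments(example, max_per_read=2)
```

In this toy run, `r1` keeps its two highest-scoring hits and `r2` keeps its only hit.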
2.4 Construction of the overlap graph
The identified 16S rRNA reads are then organized into an overlap graph defined as follows: graph nodes are reads, and an undirected edge connects two nodes if the two reads overlap with a sufficient length and with a sufficient identity to assert that they originated from a common sampled taxon. The standard approach to building such an overlap graph requires comparing every read with every other read, which is time-consuming. Here, we use alignment information to sort through candidate read pairs in a very efficient manner. For each pairing, we consider only reads that share alignments with at least one common reference sequence and for which the alignments overlap over more than 50 nucleotides with 100% identity. This strict criterion allows us to reduce the risk of connecting reads from unrelated taxa, which would in turn produce chimeras. By doing so, we discard reads containing sequencing errors in their overlap, which is acceptable given the very low sequencing error rates of current short-read technologies.
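The pairing rule described above can be sketched as follows. This is a simplified illustration, not the MATAM C++ implementation: it assumes ungapped, forward-strand alignments so that a read position maps directly to a reference position, and the names `build_overlap_graph`, `reads` and `alignments` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

MIN_OVERLAP = 50  # overlaps must be longer than 50 nucleotides


def build_overlap_graph(reads, alignments, min_overlap=MIN_OVERLAP):
    """Return the set of undirected edges (sorted read-id pairs).

    reads: {read_id: sequence}
    alignments: iterable of (read_id, ref_id, ref_start)
    Assumption: alignments are ungapped and on the forward strand.
    """
    by_ref = defaultdict(list)
    for read_id, ref_id, ref_start in alignments:
        by_ref[ref_id].append((ref_start, read_id))

    edges = set()
    # Only reads sharing a reference are ever compared.
    for hits in by_ref.values():
        hits.sort()
        for (start_a, a), (start_b, b) in combinations(hits, 2):
            if a == b:
                continue
            end_a = start_a + len(reads[a])
            end_b = start_b + len(reads[b])
            lo, hi = start_b, min(end_a, end_b)  # start_a <= start_b
            if hi - lo <= min_overlap:
                continue
            seg_a = reads[a][lo - start_a:hi - start_a]
            seg_b = reads[b][lo - start_b:hi - start_b]
            if seg_a == seg_b:  # 100% identity over the overlap
                edges.add(tuple(sorted((a, b))))
    return edges


# Toy data: four 100bp reads placed on one 200bp reference.
ref = "ACGTTGCA" * 25
reads = {
    "r1": ref[0:100],
    "r2": ref[40:140],
    "r3": ref[60:160],
    "r4": ref[40:50] + "T" + ref[51:140],  # one substitution at ref position 50
}
alignments = [("r1", "ref", 0), ("r2", "ref", 40),
              ("r3", "ref", 60), ("r4", "ref", 40)]
edges = build_overlap_graph(reads, alignments)
```

In this toy example, `r4` is only connected to `r3`, because its substitution falls inside its overlaps with `r1` and `r2` but outside its overlap with `r3`; `r1` and `r3` are not connected because they overlap on only 40 positions.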
2.5 Extracting contigs from the overlap graph
Although the overlap graph appears very bushy, it also reveals some general trends. While it exhibits highly connected subgraphs, it also displays disjoint paths (see Figure 1 for an example). We simplify the graph by performing a breadth-first traversal starting from a random node to annotate the nodes with their depth. All nodes with equal depth that belong to a single connected component are collapsed into a single compressed node, and outgoing edges are merged into a compressed edge. Low support compressed nodes containing a single read, and compressed edges representing a single overlap, are removed. The resulting graph, called the compressed graph, is several orders of magnitude smaller than the initial overlap graph. We partition this graph into three categories of subgraphs: hubs, which are nodes with a degree strictly greater than two; specific paths, which are sequences of nodes of degree two or one; and singletons, which are non-connected nodes. Intuitively, hubs correspond to the highly connected subgraphs in the overlap graph, and are likely to contain mainly reads coming from conserved regions shared by many species, thus overlapping without error even for distantly related taxa. Specific paths tend to contain reads originating from variable regions of the 16S gene, which are specific to one or a few closely related species. For each subgraph in the compressed graph (hubs, specific paths, singletons), we extract the underlying sets of reads and build an individual assembly using the genomic assembler SGA [30]. Note that any other state-of-the-art genomic assembler could be used here. As a result, we obtain one or more contigs for each subgraph.
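The compression and partition steps above can be sketched on a toy graph. This is one possible reading of the description, not the MATAM implementation: the helper names are hypothetical, the start node is passed explicitly instead of being drawn at random, and only the connected component of the start node is processed.

```python
from collections import defaultdict, deque


def adjacency(edges):
    """Build an undirected adjacency list from (u, v) pairs."""
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    return adj


def bfs_depths(adj, start):
    """Annotate every node reachable from `start` with its BFS depth."""
    depth = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in depth:
                depth[v] = depth[u] + 1
                queue.append(v)
    return depth


def compress(adj, start):
    """Collapse connected same-depth nodes, then drop low-support elements."""
    depth = bfs_depths(adj, start)
    comp, next_id = {}, 0
    for node in depth:
        if node in comp:
            continue
        comp[node] = next_id
        stack = [node]
        while stack:  # flood-fill among neighbours of equal depth
            u = stack.pop()
            for v in adj[u]:
                if depth[v] == depth[u] and v not in comp:
                    comp[v] = next_id
                    stack.append(v)
        next_id += 1
    size = defaultdict(int)
    for cid in comp.values():
        size[cid] += 1
    weight = defaultdict(int)  # overlaps merged into each compressed edge
    for u in depth:
        for v in adj[u]:
            if comp[u] < comp[v]:
                weight[(comp[u], comp[v])] += 1
    nodes = {c for c, n in size.items() if n > 1}
    edges = {e for e, w in weight.items()
             if w > 1 and e[0] in nodes and e[1] in nodes}
    return comp, nodes, edges


def classify(nodes, edges):
    """Label compressed nodes as hub, path node or singleton by degree."""
    degree = {c: 0 for c in nodes}
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return {c: "hub" if d > 2 else "path" if d >= 1 else "singleton"
            for c, d in degree.items()}


toy = [("s", "x1"), ("s", "x2"), ("x1", "x2"),
       ("x1", "y1"), ("x2", "y2"), ("y1", "y2")]
comp, nodes, edges = compress(adjacency(toy), "s")
labels = classify(nodes, edges)
```

In this toy graph, the start node forms a single-read compressed node and is removed, while the two remaining layers each collapse into one compressed node, joined by a compressed edge that merges two overlaps.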
2.6 Contigs scaffolding
We use a greedy algorithm to scaffold the contigs obtained in the previous step. For that task, contigs are first mapped against the complete reference database, and all alignments within the 1% range of suboptimal scores are kept. We then select contigs by increasing number of matches and decreasing length. By doing so, a long contig with a unique alignment will be selected for scaffolding before a short contig exhibiting a large number of alignments. Such a long contig can be assigned unambiguously to a single species, while a short contig with multiple matches more likely corresponds to a conserved region of the marker and is used to fill in the blanks between the specific contigs. Contigs matching the same reference sequence are then merged into a single consensus scaffold. Redundant scaffolds included in larger ones are removed. Finally, only scaffolds larger than 500bp are retained. This yields the final MATAM output, which can be used for the purpose of taxonomic assignment.
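The selection order used by the greedy scaffolding can be sketched as a sort key. This is an illustration under stated assumptions, not the MATAM code: the `Contig` class and function names are hypothetical, and the "1% range" is interpreted here as keeping alignments whose score is at least 99% of the best score for that contig.

```python
from dataclasses import dataclass


@dataclass
class Contig:
    name: str
    length: int
    hits: list  # (reference_id, alignment_score) pairs, hypothetical layout


def near_best_hits(hits, tolerance=0.01):
    """Keep alignments scoring within `tolerance` of the best one (assumed rule)."""
    best = max(score for _, score in hits)
    return [(ref, score) for ref, score in hits
            if score >= best * (1 - tolerance)]


def scaffolding_order(contigs):
    """Sort by increasing number of retained matches, then decreasing length."""
    return sorted(contigs,
                  key=lambda c: (len(near_best_hits(c.hits)), -c.length))


contigs = [
    Contig("short_conserved", 300, [("refA", 300), ("refB", 299), ("refC", 298)]),
    Contig("mid_specific", 600, [("refC", 700)]),
    Contig("long_specific", 900, [("refA", 1000), ("refB", 900)]),
]
order = [c.name for c in scaffolding_order(contigs)]
```

Here the two contigs with a single retained match come first, longest first, and the short contig matching three references is placed last, where it can serve to fill gaps between the specific contigs.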
3 Implementation
MATAM was implemented in Python 3, except for the overlap graph building and compression steps that were written in C++11 using the SeqAn library [4], and is available via Docker and Conda. MATAM is distributed under the GNU Affero GPL v3.0 licence and the source code is freely available at the following URL: https://github.com/bonsai-team/matam. All MATAM runs presented in this article were performed using MATAM v0.9.9.
4 Results
MATAM performance was compared with that of two general-purpose metagenomic assemblers, SPAdes [1, 21] and MEGAHIT [11], as well as with two methods specialized in 16S rRNA assembly, EMIRGE [19] and REAGO [33]. The five tools were run on three different datasets, chosen for their complementarity and the possibility to validate the reconstructed candidate 16S rRNA sequences: a simulated dataset [15], a synthetic microbial community [29], and two environmental samples from human gut and mouth providing amplicon based taxonomic assignments [31]. SortMeRNA was used to extract 16S rRNA reads from these datasets before assembling them with SPAdes and MEGAHIT. Complete command-lines and parameters are available in the Supplementary Results.
In order to compare the five methods on a common ground, the same validation procedure was applied for all experiments. Only reconstructed sequences with lengths exceeding 500bp were considered, and chimeric sequences were filtered out by the UCHIME algorithm [5] implemented in VSEARCH [25], querying the Silva 128 SSU Ref Nr99 database. For each experiment, we indicate the proportion of chimeric contigs (% chimeras, which is the total size of all chimeric contigs divided by the assembly total size). All the following measures were then computed on the remaining assemblies. When the sequences present in the sample are actually known (see Sections 4.1 and 4.2), the assembly quality assessment was performed with MetaQuast [18] by aligning the contigs against the original sample sequences, and considering the following metrics: the number of contigs (#contigs), which is the total number of contigs of lengths greater than 500bp; the total length (TL), which is the total number of bases in the contigs; the total aligned length (TAL), which is the total number of aligned nucleotides in the contigs; the genome fraction (GF), which is the total number of nucleotides from the original sample sequences covered by contigs divided by the total size of the sample sequences; and the error rate (ER), which is the percentage of observed mismatches and indels with respect to the closest matched sequence in the original sample. Finally, taxonomic assignments were carried out with the RDP Classifier [32]. The assembly evaluation protocol, command-lines and parameters can be found in the Supplementary Results.
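The three ratio metrics defined above can be written out as a small sketch. This is not the MetaQuast implementation: the function names and input layouts are hypothetical, intervals are assumed to be half-open reference coordinates, and the ER denominator is assumed to be the aligned length, which the text does not state explicitly.

```python
def percent_chimeras(contig_lengths, is_chimeric):
    """Total size of chimeric contigs over the total assembly size, in percent."""
    chimeric = sum(n for n, flag in zip(contig_lengths, is_chimeric) if flag)
    return 100.0 * chimeric / sum(contig_lengths)


def genome_fraction(ref_lengths, alignments):
    """Percent of reference bases covered by at least one contig.

    ref_lengths: {ref_id: length}
    alignments: iterable of (ref_id, start, end), half-open intervals.
    """
    by_ref = {}
    for ref_id, start, end in alignments:
        by_ref.setdefault(ref_id, []).append((start, end))
    covered = 0
    for intervals in by_ref.values():
        intervals.sort()
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start <= cur_end:  # overlapping or adjacent: extend
                cur_end = max(cur_end, end)
            else:
                covered += cur_end - cur_start
                cur_start, cur_end = start, end
        covered += cur_end - cur_start
    return 100.0 * covered / sum(ref_lengths.values())


def error_rate(mismatches, indels, aligned_length):
    """Mismatches plus indels per aligned base, in percent (assumed denominator)."""
    return 100.0 * (mismatches + indels) / aligned_length


chim = percent_chimeras([600, 900, 1500], [False, True, False])
gf = genome_fraction({"A": 1500, "B": 1500},
                     [("A", 0, 1000), ("A", 500, 1500), ("B", 0, 750)])
er = error_rate(mismatches=3, indels=1, aligned_length=8000)
```

In the toy inputs, overlapping alignments on reference `A` are merged before counting, so doubly covered positions contribute only once to GF.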
4.1 Simulated metagenomic datasets with varying sequencing depth
In the first experiment, we evaluated the ability of the methods to correctly reconstruct the 16S rRNA sequences in the context of low sequencing depth. For that, we used a selection of 122 genomes providing a realistic taxonomic diversity [15, 22], which contains 287 distinct 16S rRNA copies. We generated five datasets with varying sequencing depths: 50x, 20x, 10x, 5x and 2x per genome. Illumina reads were simulated with the ART simulator [7], using the HiSeq2500 built-in error profile, 101bp read length, and 250bp fragment length with a 30bp standard deviation (SD). In this simulation, all species are equally distributed, which corresponds to the high complexity community introduced in [15]. Simulation command-line and parameters can be found in the Supplementary Material (section 4.3.1).
Table 1 shows the results averaged over the five datasets (mean metrics and their respective standard deviation, SD). More than 99% of the MATAM sequences were aligned by MetaQuast to one of the 287 16S rRNA sequences from the initial sample (mean TAL/TL), while among the other methods, this proportion reached at best 91%, with REAGO. Congruently, MATAM sequences obtained the lowest average error rate (ER=0.03%), which represents more than a ten-fold accuracy gain compared to the other assemblers, and a twenty-fold improvement over EMIRGE. Furthermore, EMIRGE sequences contained 0.5% of unknown nucleotides (Ns), bringing its effective ER above 1%. Additionally, MATAM recovered about thirty times fewer chimeras than REAGO and EMIRGE did.
For each of the five tools, we reported the recovered genome fraction (GF) with respect to increasing sequencing depth (Figure 2). MATAM recovered from 76% to 85% of the reference sequences for sequencing depths greater than 10x, while EMIRGE recovered less than 55% of the reference sequences, and the GF for the other methods was lower than 22%. MATAM also achieved the best performance at a low sequencing depth of 2x, reaching a GF of 33%, while GFs ranged between 5% and 10% with all other assemblers.
4.2 Synthetic archaeal and bacterial community
Inching toward more realistic applications, a second dataset provides Illumina reads extracted from a synthetic microbial community composed of 16 archaeal species from 12 genera, as well as 48 bacterial species from 36 genera (accession SRR606249; [29]). As emphasized by the authors, the selected organisms cover a wide range of environmental conditions and adaptation strategies. In contrast to the previous simulated dataset (Section 4.1), the proportion of each species in the sample is not uniform, which results in individual genome average sequencing depths varying from 9x to 318x. The number of 16S rRNA paralogs per genome is also highly diverse, ranging from 1 to 10 copies per genome. Altogether, this dataset represents a total of 106 distinct 16S rRNA sequences with pairwise sequence identities ranging from 59.64% to 99.93%.
The organisms were sequenced on an Illumina HiSeq 2000, providing 109 million 101bp paired-end reads with an average fragment size of 250bp. We quality cleaned the reads using Prinseq Lite [26], removed adapter sequences using Cutadapt [14], filtered out short reads (< 50bp), and obtained a total of 67.6 million reads, which were analyzed with MATAM and EMIRGE. The uncleaned raw dataset was provided to REAGO, since the method could not handle reads of varying lengths. Finally, for SPAdes and MEGAHIT, the 16S rRNA reads were extracted from the cleaned dataset using SortMeRNA, which provided 108,560 16S rRNA reads to assemble. Cleaning and pre-processing command-lines and parameters for the synthetic community can be found in Supplementary Material (section 4.3.2).
Results are shown in Table 2. Confirming the trends observed on the simulated dataset, MATAM is able to recover the highest number of sequences together with the highest GF (83%). Most importantly, with a lower ER than that achieved by the other tested methods, the MATAM assembly appears highly accurate. While EMIRGE is the second best approach in terms of recovered GF, it also yields the greatest ER and Ns of all the compared tools. Moreover, an RDP classification of MATAM and EMIRGE sequences indicates that while MATAM missed only one expected genus, EMIRGE missed 4 genera out of 48.
Inspection of the MetaQuast alignments of the assemblies against the original 16S rRNAs revealed that all methods accurately assembled the genes sharing less than 90% sequence identity with their closest relatives within the sample. However, performance dropped significantly when attempting to assemble the closely related genes in the dataset. This especially concerned the paralogous 16S rRNA copies sharing around 99% sequence identity. Supplementary Table 1 (Supplementary Material, section 4.3.2) provides pairwise distances between sequences from a representative subset of four related species possessing one to three such paralogous copies. Those 16S rRNAs and their corresponding assembled candidate sequences were selected for a phylogenetic tree reconstruction. The obtained tree (Figure 3) demonstrates that MATAM correctly assembled all the different paralogs with nearly no error, while EMIRGE and REAGO only managed to recover one candidate sequence per species. Thus, EMIRGE and REAGO merged reads originating from distinct paralogs into a single candidate sequence, resulting in erroneous assemblies with high ER and underestimated GF. Indeed, each of the sequences assembled with REAGO, as well as one of the four EMIRGE sequences, appears to cluster at a slight distance from its respective targeted paralogs. Those distances simply reflect the methods' reconstruction errors. Consistently, in two cases, the candidates assembled by EMIRGE and REAGO were identified as chimeras by VSEARCH.
4.3 Human Microbiome Project
Finally, we used two metagenomic samples from the Human Microbiome Project (gut: SRS011405, and mouth: SRS016002, [31]) in order to validate MATAM on real metagenomic datasets sequenced from genuine environments. The reads were already quality cleaned and trimmed, and no additional filtering was performed. Since the reads therefore have different lengths, we were not able to run REAGO on these datasets. Results obtained with SPAdes and MEGAHIT using the following protocol appeared highly inaccurate and are therefore not discussed further in this work. Thus, we only present the results obtained with EMIRGE and MATAM. Dataset availability and additional details on the evaluation protocol can be found in Supplementary Material (section 4.3.3).
For these two datasets, the exact ground truth is unknown. Thus we could not perform the same validation procedure as in the two previous examples and we had to resort to alternative strategies. First, we took advantage of the availability of OTU sequences inferred through a QIIME analysis of the V1-V3 hypervariable regions for the same biological samples (available from the SRS accession numbers). We compared the assignments obtained from the assemblies, calculated with RDP, with those of the amplicon OTUs (Table 3). For both samples, MATAM identified more classes and genera than EMIRGE did, and most of these taxa were validated by the amplicon OTUs. Interestingly, we observed that in the two samples, three genera were recovered both by MATAM and EMIRGE, but not by the amplicon approach: Odoribacter, Peptococcus, and Bergeyella. Since some species from these genera are known to be adapted to the human gut and mouth environments, it is plausible that they were missed by the amplicon approach while being accurately recovered by MATAM and EMIRGE from the metagenomic samples.
Moreover, we evaluated assembly quality by aligning MATAM and EMIRGE sequences against the complete Silva 128 SSU Ref NR database, using BLAST. The rationale for this experiment is that most of the species in these human gut and mouth samples are possibly already known, and therefore should be found in Silva. We observed that nearly all MATAM sequences matched a known 16S rRNA in Silva with more than 99% identity, among which a majority matched with 100% identity (Figures 4 and 5), which suggests that MATAM sequences could possibly be assigned at the species or even the strain level. On the other hand, EMIRGE sequences provided a discordant picture. In the case of the human mouth sample, most of the EMIRGE sequences obtained a match above 97% identity, but only a small proportion of them matched a known 16S rRNA with 100% identity (Figure 5). The observation is even more pronounced with the human gut sample, where only 43% of the EMIRGE sequences obtained a match above 97% identity against a Silva 16S rRNA sequence (Figure 4). Thus, in contrast to MATAM, EMIRGE sequences would suggest that only a small proportion of the human gut and mouth diversity has a known isolate registered in Silva. However, considering our previous conclusions on controlled datasets, we assume that part of this diversity inferred with EMIRGE might in fact correspond to reconstruction artifacts.
5 Discussion
Taxonomic assignment of environmental samples is a strikingly difficult task that suffers from inherent limitations of high-throughput sequencing technologies. In this respect, we designed MATAM as an alternative to existing software, to help better understand the taxonomic structure of shotgun metagenomic samples. Our experimental results show that MATAM outperforms other available tools providing phylogenetic marker assemblies. Reconstructing full length 16S rRNAs makes it possible to reach a higher precision of taxonomic assignment than individual read analysis or amplicon sequencing do, because the reconstructed sequences effectively contain a stronger phylogenetic signal. Moreover, metagenomic shotgun sequencing is naturally immune to the primer and amplification biases attached to the amplicon sequencing technology, and is therefore better suited to sequencing unknown species.
Our approach opens up several new perspectives. Although we have focused this work on the assembly of 16S rRNA genes, MATAM was designed to deal with any marker of taxonomic interest. Indeed, there is currently an emerging trend to consider a combination of universal (single-copy) marker families, such as those provided in the recently published database proGenomes [16]. Sequences from this database, or from any other customized one, could be used with MATAM to target a variety of markers, and thus provide improved taxonomic assignments. MATAM could also be used in combination with other types of sequencing data. Long read sequencing is able to produce fragments that cover large regions of the DNA molecules, up to several thousands of bases. When long reads are available, they could serve as a guide in the scaffolding step of MATAM and, concomitantly, MATAM low-error contigs could be used to correct them. Finally, targeted gene capture, which makes it possible to sequence captured DNA regions of interest from an environmental sample at high depth [6], could also prove to be an exciting application field for MATAM.