De novo fragment assembly with short mate-paired reads: Does the read length matter?

Mark J. Chaisson; Dumitru Brinza; Pavel A. Pevzner

doi:10.1101/gr.079053.108


¹ Bioinformatics Program, University of California San Diego, La Jolla, California 92093, USA;
² Department of Computer Science and Engineering, University of California San Diego, La Jolla, California 92093, USA

Abstract

Increasing read length is currently viewed as the crucial condition for fragment assembly with next-generation sequencing technologies. However, introducing mate-paired reads (separated by a gap of length GapLength) opens the possibility of transforming short mate-pairs into long mate-reads of length ≈ GapLength, and thus raises the question of whether the read length (as opposed to GapLength) even matters. We describe a new tool, EULER-USR, for assembling mate-paired short reads and use it to analyze the question of whether the read length matters. We further complement the ongoing experimental efforts to maximize read length with a new computational approach for increasing the effective read length. While the common practice is to trim the error-prone tails of the reads, we present an approach that substitutes trimming with error correction using repeat graphs. An important and counterintuitive implication of this result is that one may extend sequencing reactions that degrade with length “past their prime” to where the error rate grows above what is normally acceptable for fragment assembly.
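The idea of turning a mate-pair into a mate-read can be illustrated with a toy sketch. This is not EULER-USR's actual procedure: it assumes error-free reads, uses a plain k-mer (de Bruijn-style) index rather than a repeat graph, and the names `build_graph`, `mate_read`, `span`, and `tol` are hypothetical. A mate-pair is converted only when exactly one sequence of length within `span ± tol` starts with the left read and ends with the right read.

```python
from collections import defaultdict


def build_graph(reads, k):
    """Index every k-mer seen in the reads by its (k-1)-mer prefix."""
    succ = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            succ[kmer[:-1]].add(kmer)
    return succ


def mate_read(left, right, succ, k, span, tol):
    """Return the unique sequence that starts with `left`, ends with `right`,
    and has length in [span - tol, span + tol]; return None if there is no
    such sequence or more than one (assumes len(left) >= k - 1)."""
    found = set()
    stack = [left]  # depth-first search over single-base extensions
    while stack:
        seq = stack.pop()
        if span - tol <= len(seq) <= span + tol and seq.endswith(right):
            found.add(seq)
            if len(found) > 1:
                return None  # ambiguous: several paths fit the span
        if len(seq) < span + tol:
            # Extend by every k-mer that overlaps the last k-1 bases.
            for kmer in succ.get(seq[-(k - 1):], ()):
                stack.append(seq + kmer[-1])
    return found.pop() if found else None


# Usage: a 24-nt toy "genome" covered by error-free 8-nt reads.
genome = "ATGGCGTGCAATCCGATTACGGAC"
reads = [genome[i:i + 8] for i in range(len(genome) - 7)]
graph = build_graph(reads, k=5)
print(mate_read(genome[:6], genome[-6:], graph, k=5, span=24, tol=2))
```

The search is exponential in the worst case and is bounded only by `span + tol`, so it is meant as an illustration of the mate-pair to mate-read transformation rather than a practical assembler component.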

Footnotes

3 Corresponding author.

E-mail mchaisso{at}bioinf.ucsd.edu; fax (858) 534-7029.
[Supplemental material is available online at www.genome.org.]
Article published online before print. Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.079053.108.
4 We emphasize that the read-length barrier depends on the genome, the span of mate-pairs, coverage, error rates in reads, variability in gap length, etc. The read-length barrier of ≈35 nt for E. coli was computed under the assumption that the span is 300 ± 30 nt.
5 This procedure may also result in error corruption (Pevzner et al. 2001).
6 Error-corrected read suffixes only contribute to enlarging the assembled contigs and do not contribute to base calling.
7 This BAC has a repeat content representative of the rest of the human genome.
8 This analysis underestimates the base-calling errors by limiting them to long contigs and thus avoiding the most difficult repeated regions. Nevertheless, the base-calling accuracy appears to be comparable to or even better than the accuracy of high-coverage Sanger sequencing.
9 In some cases the statistics for EULER-USR are slightly better than those for OPTIMAL-ASSEMBLY due to subtle differences in contig reporting.
10 For the BAC50 data set, EULER-USR has 20 mismatches and two insertions, a higher error rate than for the ECOLI data set.
11 The theoretically optimal algorithms for assembling mate-paired reads remain unknown even for error-free reads and a fixed distance between mate-pairs (Medvedev et al. 2007).
12 We found that assemblies of mate-pairs with average span d ± σ may be sensitive to the parameter σ even for the same d. For example, simulated assemblies with error-free reads may have lower quality than the real assemblies with the same d but different σ.
- Received March 26, 2008.
- Accepted November 17, 2008.