TY  - JOUR
T1  - Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation
JF  - bioRxiv
DO  - 10.1101/071282
SP  - 071282
AU  - Sergey Koren
AU  - Brian P. Walenz
AU  - Konstantin Berlin
AU  - Jason R. Miller
AU  - Adam M. Phillippy
Y1  - 2016/01/01
UR  - http://biorxiv.org/content/early/2016/08/24/071282.abstract
N2  - Long-read single-molecule sequencing has revolutionized de novo genome assembly and enabled the automated reconstruction of reference-quality genomes. However, given the relatively high error rates of such technologies, efficient and accurate assembly of large repeats and closely related haplotypes remains challenging. We address these issues with Canu, a complete reworking of Celera Assembler that is specifically designed for noisy single-molecule sequences. Canu introduces support for nanopore sequencing, halves depth-of-coverage requirements, and improves assembly continuity while simultaneously reducing runtime by an order of magnitude on large genomes. These advances result from new overlapping and assembly algorithms, including an adaptive overlapping strategy based on tf-idf weighted MinHash and a sparse assembly graph construction that avoids collapsing diverged repeats and haplotypes. We demonstrate that Canu can reliably assemble complete microbial genomes and near-complete eukaryotic chromosomes using either PacBio or Oxford Nanopore technologies, and achieves a contig NG50 of greater than 21 Mbp on both human and Drosophila melanogaster PacBio datasets. For assembly structures that cannot be linearly represented, Canu provides graph-based assembly outputs for analysis or integration with complementary phasing and scaffolding techniques. Canu source code and pre-compiled binaries are freely available under a GPLv2 license from https://github.com/marbl/canu.
ER  -