Abstract
Motivation Despite recent progress in genome sequencing and assembly, many of the currently available assembled genomes come in a draft form. Such draft genomes consist of a large number of genomic fragments (scaffolds), whose positions and orientations along the genome are unknown. While there exist a number of methods for reconstructing the genome from its scaffolds, utilizing various computational and wet-lab techniques, they can often produce only partial, error-prone scaffold assemblies. It therefore becomes important to compare and merge scaffold assemblies produced by different methods, thus combining their advantages and highlighting conflicts between them for further investigation. These tasks may be labor intensive if performed manually.
Results We present CAMSA—a tool for comparative analysis and merging of two or more given scaffold assemblies. The tool (i) creates an extensive report with several comparative quality metrics; (ii) constructs the most confident merged scaffold assembly; and (iii) provides an interactive framework for a visual comparative analysis of the given assemblies. Among the CAMSA features, only scaffold merging can be evaluated in comparison to existing methods. Namely, it resembles the functionality of assembly reconciliation tools, although their primary targets are somewhat different. Our evaluations show that CAMSA produces merged assemblies of comparable or better quality than existing assembly reconciliation tools while being the fastest in terms of the total running time.
Availability CAMSA is distributed under the MIT license and is available at http://cblab.org/camsa/.
Footnotes
*This work is supported by the National Science Foundation under grant No. IIS-1462107.
1 We remark that contigs can be viewed as scaffolds with no gaps. Accordingly, we use the term scaffolds to refer to both contigs and scaffolds.
2 The blossom algorithm computes a maximum weight matching in a graph in O(V^3) time, where V is the number of vertices.
3 We also considered GARM [30], but were unable to run it on any GAGE dataset, facing issues similar to those reported in [46].
4 We remark that conversion, for example, from NCBI AGPv2 format (rather than FASTA) would be much faster.