Abstract
Summary This paper presents gencore, an efficient tool for eliminating errors and duplicates from next-generation sequencing (NGS) data. The tool clusters the mapped sequencing reads and merges each cluster into a single consensus read. If the data contain unique molecular identifiers (UMIs), gencore uses them to identify reads derived from the same original DNA fragment. Compared with the conventional tool Picard, gencore greatly reduces the mapping mismatches in its output, most of which are caused by errors. This error suppression makes gencore well suited to detecting ultra-low-frequency mutations in deep sequencing data. gencore is also about 3X faster than Picard and uses much less memory.
Availability and Implementation gencore is an open-source tool written in C++. It is hosted on GitHub: https://github.com/OpenGene/gencore
Contact chen{at}haplox.com
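Illustration The cluster-then-merge idea summarized above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not gencore's actual implementation: the read tuple layout (chrom, start, end, umi, seq), the function names, the grouping key, and the simple per-position majority vote with ties written as N are all hypothetical choices made for illustration, and every read in a cluster is assumed to have the same length.

```python
from collections import Counter, defaultdict


def cluster_reads(reads):
    """Group reads that share mapping coordinates and UMI.

    Each read is a hypothetical tuple (chrom, start, end, umi, seq).
    Reads with the same key are assumed to come from one original fragment.
    """
    clusters = defaultdict(list)
    for chrom, start, end, umi, seq in reads:
        clusters[(chrom, start, end, umi)].append(seq)
    return clusters


def consensus(seqs):
    """Merge equal-length sequences by per-position majority vote.

    A tie between the two most common bases is written as 'N'
    (an assumption of this sketch).
    """
    merged = []
    for column in zip(*seqs):
        counts = Counter(column).most_common()
        if len(counts) > 1 and counts[0][1] == counts[1][1]:
            merged.append("N")
        else:
            merged.append(counts[0][0])
    return "".join(merged)


def merge_clusters(reads):
    """Return one consensus sequence per cluster."""
    return {key: consensus(seqs) for key, seqs in cluster_reads(reads).items()}
```

In this sketch, an error that appears in only one of three duplicate reads is outvoted by the other two, which conveys the intuition behind merging duplicates instead of simply discarding all but one of them.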