RT Journal Article
SR Electronic
T1 Comparing Variant Call Files for Performance Benchmarking of Next-Generation Sequencing Variant Calling Pipelines
JF bioRxiv
FD Cold Spring Harbor Laboratory
SP 023754
DO 10.1101/023754
A1 John G. Cleary
A1 Ross Braithwaite
A1 Kurt Gaastra
A1 Brian S. Hilbush
A1 Stuart Inglis
A1 Sean A. Irvine
A1 Alan Jackson
A1 Richard Littin
A1 Mehul Rathod
A1 David Ware
A1 Justin M. Zook
A1 Len Trigg
A1 Francisco M. De La Vega
YR 2015
UL http://biorxiv.org/content/early/2015/08/03/023754.abstract
AB Summary: To evaluate and compare the performance of variant calling methods and their confidence scores, comparisons between a test call set and a “gold standard” need to be carried out. Unfortunately, these comparisons are not straightforward with the current Variant Call Files (VCF), which are the standard output of most variant calling algorithms for high-throughput sequencing data. Comparisons of VCFs are often confounded by the different representations of indels, MNPs, and combinations thereof with SNVs in complex regions of the genome, resulting in misleading results. A variant caller is inherently a classification method designed to score putative variants with confidence scores that could permit controlling the rate of false positives (FP) or false negatives (FN) for a given application. Receiver operating characteristic (ROC) curves and the area under the ROC curve (AUC) are efficient metrics to evaluate a test call set versus a gold standard. However, in the case of VCF data this also requires special accounting to deal with discrepant representations.
We developed a novel algorithm for comparing variant call sets that deals with complex call representation discrepancies through a dynamic programming method that minimizes false positives and false negatives globally across the entire call sets, for accurate performance evaluation of VCFs. Availability: RTG Tools is implemented as a multithreaded Java application, and source code is available under a BSD license at: https://github.com/RealTimeGenomics/rtg-tools Contact: len{at}realtimegenomics.com Supplementary information: Supplementary data are available at Bioinformatics online.
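
The representation problem described in the abstract can be illustrated with a small sketch. This is not the RTG Tools algorithm (which the abstract describes as a dynamic programming method over whole call sets); it is only a naive check, under assumed toy data, that two call sets with different records can still produce the same haplotype when replayed onto the reference. The function name `apply_variants`, the reference string, and the variant tuples are all hypothetical.

```python
def apply_variants(reference, variants):
    """Replay non-overlapping variants onto a reference string.

    Each variant is a (pos, ref, alt) tuple with a 1-based position,
    following the VCF convention for POS, REF and ALT.
    """
    pieces = []
    cursor = 0  # 0-based index of the next unconsumed reference base
    for pos, ref, alt in sorted(variants):
        start = pos - 1
        if start < cursor:
            raise ValueError("overlapping variants are not supported")
        if reference[start:start + len(ref)] != ref:
            raise ValueError("REF allele does not match the reference")
        pieces.append(reference[cursor:start])  # untouched reference bases
        pieces.append(alt)                      # substituted allele
        cursor = start + len(ref)
    pieces.append(reference[cursor:])
    return "".join(pieces)


# Hypothetical toy data: one MNP versus two adjacent SNVs.
reference = "GATTACAGG"
call_set_a = [(3, "TT", "CC")]
call_set_b = [(3, "T", "C"), (4, "T", "C")]

# A record-by-record comparison sees no shared calls...
same_position_and_alleles = set(call_set_a) == set(call_set_b)
# ...but both call sets describe the same underlying sequence.
same_haplotype = (
    apply_variants(reference, call_set_a) == apply_variants(reference, call_set_b)
)
print(same_position_and_alleles, same_haplotype)
```

A record-level comparison would count the calls above as both a false positive and a false negative, even though the haplotypes agree. The same effect occurs with indels in repeats, where a deletion can be anchored at more than one position.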