Comparing multi- and single-sample variant calls to improve variant call sets from deep coverage whole-genome sequencing data

Suyash S. Shringarpure; Rasika A. Mathias; Ryan D. Hernandez; Timothy D. O’Connor; Zachary A. Szpiech; Raul Torres; Francisco M. De La Vega; Carlos D. Bustamante; Kathleen C. Barnes; Margaret A. Taub; on behalf of the CAAPA consortium

doi:10.1101/078642

ABSTRACT

Motivation Variant calling from next-generation sequencing (NGS) data is susceptible to false positive calls due to sequencing, mapping and other errors. To better distinguish true from false positive calls, we present a method that uses genotype array data from the sequenced samples, rather than public data such as HapMap or dbSNP, to train an accurate classifier using Random Forests. We demonstrate our method on a set of variant calls obtained from 642 African-ancestry genomes from the Consortium on Asthma among African-ancestry Populations in the Americas (CAAPA), sequenced to high depth (30X).
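The abstract does not state how array genotypes are turned into training labels. The sketch below assumes one plausible scheme purely for illustration: a called site that the array also shows as polymorphic in the sequenced samples is labeled a true positive, a called site that the array shows as monomorphic is labeled a false positive, and sites absent from the array stay unlabeled. The function name `label_called_sites` and both input structures are hypothetical.

```python
from typing import Dict, Iterable, Tuple

Site = Tuple[str, int]  # (chromosome, position); hypothetical site key


def label_called_sites(
    called_sites: Iterable[Site],
    array_alt_counts: Dict[Site, int],
) -> Dict[Site, bool]:
    """Label sequencing calls using array genotypes from the same samples.

    array_alt_counts maps each array site to the number of alternate
    alleles observed across all genotyped samples (an assumed summary).
    """
    labels: Dict[Site, bool] = {}
    for site in called_sites:
        if site not in array_alt_counts:
            continue  # site not on the array: no label, excluded from training
        # Assumed rule: polymorphic on the array -> true positive call.
        labels[site] = array_alt_counts[site] > 0
    return labels


called = [("chr1", 100), ("chr1", 200), ("chr2", 50)]
array = {("chr1", 100): 37, ("chr1", 200): 0}
training_labels = label_called_sites(called, array)
```

Labels of this kind, paired with per-site quality features, would form the training set for a classifier such as a Random Forest.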

Results We have applied our classifier to compare call sets generated with different calling methods, including both single-sample and multi-sample callers. At a False Positive Rate of 5%, our method yields true positive rates of 97.5%, 95% and 99% on variant calls obtained using Illumina’s single-sample caller CASAVA, Real Time Genomics’ multi-sample variant caller, and the GATK Unified Genotyper, respectively. Since most NGS data are accompanied by genotype data for the same samples, our method can be trained on each dataset to provide a more accurate computational validation of site calls compared to generic methods. Moreover, our method allows for adjustment based on allele frequency (e.g., a different set of criteria to determine quality for rare vs. common variants) and thereby provides insight into sequencing characteristics that indicate data quality for variants of different frequencies.
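To make the phrase "true positive rate at a False Positive Rate of 5%" concrete, here is a minimal sketch of how such a figure can be read off classifier scores. It assumes each labeled site has a score where higher means more likely true; the function name `tpr_at_fpr` and the toy data are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, Tuple


def tpr_at_fpr(
    scores: Sequence[float],
    labels: Sequence[bool],
    max_fpr: float = 0.05,
) -> Tuple[float, float, float]:
    """Return (threshold, tpr, fpr) for the most permissive score
    threshold whose false positive rate does not exceed max_fpr."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one true and one false site")

    # Walk down the ranking, accepting sites from highest score to lowest.
    ranked: List[Tuple[float, bool]] = sorted(
        zip(scores, labels), key=lambda pair: pair[0], reverse=True
    )
    best = (float("inf"), 0.0, 0.0)  # accept nothing: TPR = FPR = 0
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Tied scores cannot be separated, so accept them together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        fpr = fp / n_neg
        if fpr > max_fpr:
            break
        best = (threshold, tp / n_pos, fpr)
    return best


toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
toy_labels = [True, True, False, True, False, False]
threshold, tpr, fpr = tpr_at_fpr(toy_scores, toy_labels, max_fpr=0.05)
```

Running the same function separately on rare and common variants would be one way to obtain the frequency-specific criteria mentioned above.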

Availability Code will be made available on GitHub prior to publication.

Footnotes

12 See Supplementary Materials for full listing of consortium contributors.

The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.