TY  - JOUR
T1  - False Negatives Are a Significant Feature of Next Generation Sequencing Callsets
JF  - bioRxiv
DO  - 10.1101/066043
SP  - 066043
AU  - Dean Bobo
AU  - Mikhail Lipatov
AU  - Juan L. Rodriguez-Flores
AU  - Adam Auton
AU  - Brenna M. Henn
Y1  - 2016/01/01
UR  - http://biorxiv.org/content/early/2016/07/26/066043.abstract
N2  - Short-read, next-generation sequencing (NGS) is now broadly used to identify rare or de novo mutations in population samples and disease cohorts. However, NGS data are known to be error-prone, and post-processing pipelines have primarily focused on the removal of spurious mutations or "false positives" in downstream genome datasets. Less attention has been paid to characterizing the fraction of missing mutations or "false negatives" (FN). We design a phylogeny-aware tool to determine false negatives [PhyloFaN] and describe how read coverage and reference bias affect the FN rate. Using thousand-fold coverage NGS data from both Illumina HiSeq and Complete Genomics platforms derived from the 1000 Genomes Project, we first characterize the false negative rate in human mtDNA genomes. The false negative rate for the publicly available callsets is 17-20%, even for extremely high coverage haploid data. We demonstrate that high FN rates are not limited to mtDNA by comparing autosomal data from 28 publicly available full genomes to intergenic Sanger sequenced regions for each individual. We examine both low-coverage Illumina and high-coverage Complete Genomics genomes. We show that the FN rate varies between ∼6% and 18% and that false-positive rates are considerably lower (<3%). The FN rate is strongly dependent on calling pipeline parameters, as well as read coverage. Our results demonstrate that missing mutations are a significant feature of genomic datasets and imply that additional fine-tuning of bioinformatics pipelines is needed. We provide a tool that can be used to quantify the FN rate for haploid genomic experiments without generating additional validation data. Data deposition: Data and software are freely available on the Henn Lab website: https://ecoevo.stonybrook.edu/hennlab/data-software/ Software: GitHub, via https://ecoevo.stonybrook.edu/hennlab/data-software/
ER  - 