TY  - JOUR
T1  - Reducing INDEL errors in whole-genome and exome sequencing
JF  - bioRxiv
DO  - 10.1101/006148
SP  - 006148
AU  - Han Fang
AU  - Giuseppe Narzisi
AU  - Jason A. O’Rawe
AU  - Yiyang Wu
AU  - Julie Rosenbaum
AU  - Michael Ronemus
AU  - Ivan Iossifov
AU  - Michael C. Schatz
AU  - Gholson J. Lyon
Y1  - 2014/01/01
UR  - http://biorxiv.org/content/early/2014/06/10/006148.abstract
N2  - Background: INDELs, especially those disrupting protein-coding regions of the genome, have been associated with human diseases. However, there are still many errors with INDEL variant calling, driven by library preparation, sequencing biases, and algorithm artifacts. We have recently developed a new INDEL-calling algorithm, Scalpel, with substantially improved accuracy. Results: We characterized whole genome sequencing (WGS), whole exome sequencing (WES), and PCR-free sequencing data from the same samples to investigate false-positive and false-negative INDEL errors. We developed a classification scheme utilizing validation data to define a class of low-quality INDELs with ∼2.7-fold higher error rates than high-quality INDELs. The mean concordance of INDEL detection between WGS and WES data was ∼52%, while WGS data uniquely identified ∼10.8-fold more high-quality INDELs. Concordance of INDEL detection between standard and PCR-free sequencing data was ∼71%, while PCR-free data uniquely yielded ∼6.3-fold fewer low-quality INDELs. We demonstrate that these INDEL errors are significantly reduced with a PCR-free library protocol, implying that these errors are introduced with PCR amplification. We calculated that 60X WGS data from the HiSeq 2000 platform are needed to recover ∼95% of INDELs, much higher than that for SNP detection. Accurate detection of heterozygous INDELs requires ∼1.2-fold higher coverage than that for homozygous INDELs. Conclusions: Homopolymer A/T INDELs are a major source of low-quality and/or uncertain INDEL calls, and these are highly enriched in the WES data. We recommend WGS for human genomes at 60X mean coverage with PCR-free protocols, which can substantially improve the quality of personal genomes.
ER  - 