Contaminant DNA in bacterial sequencing experiments is a major source of false genetic variability

Galo A Goig; Silvia Blanco; Alberto L. Garcia-Basteiro; Iñaki Comas

doi:10.1101/403824

Abstract

Contaminant DNA is a well-known confounding factor in molecular biology and in genomic repositories. Strikingly, analysis workflows for whole-genome sequencing (WGS) data usually neglect the errors introduced by potential contamination. We performed a comprehensive evaluation of the extent and impact of contaminant DNA in WGS by analyzing more than 4,000 bacterial samples from 20 different studies. We found that contamination is pervasive and can introduce large biases in variant analysis. We showed that these biases can translate into hundreds of false positive and false negative SNPs, even for slightly contaminated samples. Studies investigating complex biological traits from sequencing data can be completely biased if contamination is neglected during the bioinformatic analysis. We used both real and simulated data to evaluate and implement reliable, contamination-aware analysis pipelines. Our results call for the implementation of such pipelines as sequencing technologies consolidate as a precision tool in research and clinical contexts.

Footnotes

This version is a major update to the first manuscript, in which we analyzed 1,500 samples of M. tuberculosis. We have now extended our analysis to 13 other pathogenic bacterial species and reanalyzed all the samples with additional stringent filters.
https://gitlab.com/tbgenomicsunit/Publications_resources/blob/master/MTB_ancestor.fas

The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC 4.0 International license.