Abstract
Contaminant DNA is a well-known confounding factor in molecular biology and in genomic repositories. Strikingly, analysis workflows for whole-genome sequencing (WGS) data usually neglect the errors introduced by potential contaminations. We performed a comprehensive evaluation of the extent and impact of contaminant DNA in WGS by analyzing more than 4,000 bacterial samples from 20 different studies. We found that contaminations are pervasive and can introduce large biases in variant analysis. We showed that these biases can translate in hundreds of false positive and negative SNPs, even for samples with slight contaminations. Studies investigating complex biological traits from sequencing data can be completely biased if contaminations are neglected during the bioinformatic analysis. We used both real and simulated data to evaluate and implement reliable, contamination-aware analysis pipelines. Our results urge for the implementation of such pipelines as sequencing technologies consolidate as a precision tool in the research and clinical context.
Footnotes
This version is a major update to the first manuscript in which we analyzed 1,500 samples of M. tuberculosis. We have now extended our analysis to other 13 pathogenic bacterial species and reanalyzed all the samples with additional stringent filters.
https://gitlab.com/tbgenomicsunit/Publications_resources/blob/master/MTB_ancestor.fas