RT Journal Article
SR Electronic
T1 Sources of PCR-induced distortions in high-throughput sequencing datasets
JF bioRxiv
FD Cold Spring Harbor Laboratory
SP 008375
DO 10.1101/008375
A1 Justus M. Kebschull
A1 Anthony M. Zador
YR 2014
UL http://biorxiv.org/content/early/2014/08/23/008375.abstract
AB PCR allows the exponential and sequence-specific amplification of DNA, even from minute starting quantities. Today, PCR is at the core of the most successful DNA sequencing technologies and is a fundamental step in preparing DNA samples for high-throughput sequencing. Despite its importance, we have little comprehensive understanding of the biases and errors that PCR introduces into pools of DNA molecules. Understanding PCR's imperfections and their impact on the amplification of different sequences in a complex mixture is particularly important for a proper understanding of high-throughput sequencing data. We examined the effects of bias, stochasticity, template switches and polymerase errors introduced during PCR on sequence representation in next-generation sequencing libraries. Using Illumina sequencing results of a pool of diverse PCR amplicons with a defined structure, we searched for signatures of each process. We further developed quantitative models for each process and compared predictions of these models to our experimental data. We find that PCR stochasticity is the major force skewing sequence representation after amplification of a pool of unique DNA amplicons. PCR errors become very common in later cycles of PCR but have little impact on the overall sequence distribution as they are confined to small copy numbers. PCR template switches are rare and confined to low copy numbers.
Our results will have particular relevance to single-cell sequencing, in which sequences are represented by only one or a few molecules.
Author summary: High-throughput sequencing technologies are used both qualitatively, to determine the genomic sequence of an organism, and quantitatively, to measure the amount of specific DNA sequences present in complex mixtures. To prepare a sample for high-throughput sequencing, the input DNA needs to be amplified by PCR. Amplification can introduce skews, biases and errors into the DNA pool, leading to misrepresentation of the amounts of sequences in the sequencing results. Here we investigated four potential sources of such misrepresentation and found that, when molecule numbers are low early in PCR, the random amplification of some sequences and not others has a large impact on sequencing results.
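
The point about low molecule numbers can be illustrated with a toy simulation. The sketch below is not the authors' quantitative model; it assumes a simple branching process in which every molecule is copied independently with a fixed per-cycle probability. The names `amplify`, `simulate_pool`, and `coefficient_of_variation`, and the parameter `efficiency`, are hypothetical and chosen only for this illustration.

```python
import random
import statistics


def amplify(start_copies, cycles, efficiency, rng):
    """Amplify one sequence; each copy duplicates with probability `efficiency` per cycle.

    Assumption: molecules are copied independently (a simple branching process).
    """
    copies = start_copies
    for _ in range(cycles):
        # Count how many of the current copies are successfully duplicated this cycle.
        duplicated = sum(1 for _ in range(copies) if rng.random() < efficiency)
        copies += duplicated
    return copies


def simulate_pool(n_sequences, start_copies, cycles, efficiency, seed=0):
    """Amplify a pool of distinct sequences that all start at the same copy number."""
    rng = random.Random(seed)
    return [amplify(start_copies, cycles, efficiency, rng) for _ in range(n_sequences)]


def coefficient_of_variation(counts):
    """Relative spread of final copy numbers (population std. dev. divided by mean)."""
    return statistics.pstdev(counts) / statistics.mean(counts)


if __name__ == "__main__":
    # Hypothetical settings: 200 sequences, 6 cycles, 80% per-cycle efficiency.
    for start in (1, 50):
        counts = simulate_pool(n_sequences=200, start_copies=start, cycles=6, efficiency=0.8)
        print(start, round(coefficient_of_variation(counts), 3))
```

Under these assumptions, sequences that start from a single molecule should end up with a much wider relative spread of copy numbers than sequences that start from many molecules, even though every sequence is amplified with the same efficiency. The skew arises in the first few cycles, when a failed duplication removes a large fraction of a sequence's eventual yield.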