Abstract
BACKGROUND Targeted resequencing with high-throughput sequencing (HTS) platforms can be used to efficiently interrogate the genomes of large numbers of individuals. A critical challenge for research and applications that use HTS data, especially data from long-read platforms, is the errors that arise from technological limits and bioinformatic algorithms.
RESULTS A single-molecule real-time (SMRT) sequencing-error correction and assembly pipeline, C3S-LAA, was developed for libraries of pooled amplicons. By uniquely leveraging the structure of SMRT sequence data (composed of multiple low-quality subreads from which higher-quality circular consensus sequences are formed) to cluster raw reads, C3S-LAA produced accurate consensus sequences and assemblies of overlapping amplicons from single-sample and multiplexed libraries. In contrast, despite read depths in excess of 100X per amplicon, the standard long amplicon analysis module from Pacific Biosciences generated unexpected numbers of amplicon sequences with substantial inaccuracies in the consensus sequences. A bootstrap analysis showed that the C3S-LAA pipeline itself was effective at removing bioinformatic sources of error, but in rare cases a read depth of nearly 400X was not sufficient to overcome minor but systematic errors inherent to amplification or sequencing.
CONCLUSIONS C3S-LAA uses a novel processing algorithm for SMRT amplicon-sequence data that produces accurate consensus sequences and local sequence assemblies. The community-standard long amplicon analysis module from Pacific Biosciences is prone to substantial errors that raise concerns about findings based on this pipeline. The method developed here removed this confounding bioinformatic source of error, allowing the identification of limited instances of errors due to DNA amplification or sequencing.
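
The bootstrap analysis mentioned in the results can be illustrated with a toy sketch. This is not the C3S-LAA implementation: the function names, the column-wise majority-vote consensus, and the assumption of equal-length, pre-aligned reads are all hypothetical simplifications chosen for illustration. The sketch resamples reads with replacement at a chosen depth and reports how often the resampled consensus matches the consensus built from all reads.

```python
import random
from collections import Counter


def majority_consensus(reads):
    """Column-wise majority vote over equal-length, pre-aligned reads.

    Hypothetical stand-in for a real consensus caller.
    """
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*reads))


def bootstrap_consensus_support(reads, depth, n_boot=200, seed=0):
    """Fraction of bootstrap replicates whose consensus matches the full-data consensus.

    Each replicate draws `depth` reads with replacement from `reads`.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    reference = majority_consensus(reads)
    matches = 0
    for _ in range(n_boot):
        sample = [rng.choice(reads) for _ in range(depth)]
        if majority_consensus(sample) == reference:
            matches += 1
    return matches / n_boot


if __name__ == "__main__":
    # Toy data: 90 reads agree, 10 carry the same substitution at one position.
    reads = ["ACGTACGT"] * 90 + ["ACGTTCGT"] * 10
    for depth in (5, 25, 100):
        support = bootstrap_consensus_support(reads, depth)
        print(f"depth={depth:>3}  support={support:.2f}")
```

In this toy model, random errors are outvoted as depth increases. An error carried by most reads would instead be reproduced in the full-data consensus and in most replicates, so added depth cannot remove it, which is the kind of systematic error the abstract attributes to amplification or sequencing.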