TY - JOUR
T1 - Long-read, whole-genome shotgun sequence data for five model organisms
JF - bioRxiv
DO - 10.1101/008037
SP - 008037
AU - Kristi E. Kim
AU - Paul Peluso
AU - Primo Babayan
AU - P. Jane Yeadon
AU - Charles Yu
AU - William W. Fisher
AU - Chen-Shan Chin
AU - Nicole Rapicavoli
AU - David R. Rank
AU - Joachim Li
AU - David E. A. Catcheside
AU - Susan E. Celniker
AU - Adam M. Phillippy
AU - Casey M. Bergman
AU - Jane M. Landolin
Y1 - 2014/01/01
UR - http://biorxiv.org/content/early/2014/08/15/008037.2.abstract
N2 - Single-molecule, real-time (SMRT) sequencing from Pacific Biosciences is increasingly used in many areas of biological research, including de novo genome assembly, structural-variant identification, haplotype phasing, mRNA isoform discovery, and base-modification analyses. High-quality, public datasets of SMRT sequences can spur development of analytic tools that accommodate the unique characteristics of SMRT data (long read lengths, lack of GC or amplification bias, and a random error profile leading to high consensus accuracy). In this paper, we describe eight high-coverage SMRT sequence datasets from five organisms (Escherichia coli, Saccharomyces cerevisiae, Neurospora crassa, Arabidopsis thaliana, and Drosophila melanogaster) that have been publicly released to the general scientific community (NCBI Sequence Read Archive ID SRP040522). Data were generated using two sequencing chemistries (P4-C2 and P5-C3) on the PacBio RS II instrument. The datasets reported here can be used without restriction by the research community to generate whole-genome assemblies, test new algorithms, investigate genome structure and evolution, and identify base modifications in some of the most widely studied model systems in biological research.
Background and Summary
Single-molecule, real-time (SMRT®) DNA sequencing occurs by optically detecting a fluorescent signal when a nucleotide is being incorporated by a DNA polymerase [1–4].
This relatively new technology yields sequence data with unique characteristics, such as long read lengths, lack of GC bias, and random error profiles, and can produce highly accurate consensus sequences [5]. Kinetic information such as pulse width and interpulse duration is also recorded and can be used to detect base modifications [6-8].
Since its introduction, investigators have published on a range of applications using SMRT sequencing. For example, the developers of GATK (Genome Analysis Toolkit) demonstrated that single nucleotide polymorphisms (SNPs) could be detected using SMRT sequences [9, 10] due to their lack of context-specific bias and systematic error [5, 10]. Likewise, the developers of PBcR (PacBio error correction) [11, 12] showed that complete bacterial genome assemblies using SMRT sequence data had greater than Q60 base quality [12]. PBcR was later incorporated as the “pre-assembly” step in the HGAP (hierarchical genome assembly process) system [13], followed by consensus polishing using the Quiver algorithm [13] to produce a complete assembly pipeline for SMRT sequence data. In addition, other third-party tools now support long reads for various applications such as mapping [14, 15], scaffolding [16], structural-variation discovery [17], and genome assembly [11, 18]. Other applications such as 16S rRNA sequencing [19], characterization of entire transcriptomes [20, 21], genome-editing studies [22], base-modification studies [7, 8, 23-25], and validation of CRISPR targets [26] have also been published.
To encourage interest in further applications and tool development for SMRT sequence data, we report here the release of whole-genome shotgun-sequence datasets from five model organisms (E. coli, S. cerevisiae, N. crassa, A. thaliana, and D. melanogaster). These organisms have among the most complete and well-annotated reference genome sequences, due to continual refinement by dedicated teams of scientists.
Despite continued improvement of these genome sequences with new technologies, few are completely finished with fully contiguous assemblies of all chromosomes. The remaining gaps arise from complex structures such as transposable elements, repeats, segmental duplications, or other dynamic regions of the genome that cannot be easily assembled. Structural differences in these regions can account for variability in millions of nucleotides within every genome, and mounting evidence suggests that such mutations are important for human diversity and disease susceptibility in many complex traits including autism and schizophrenia [27-29]. SMRT sequencing data can therefore play an important role in the completion of these and other reference genomes, providing a platform for new insights into genome biology.
ER -