TY  - JOUR
T1  - De Novo PacBio long-read and phased avian genome assemblies correct and add to genes important in neuroscience research
JF  - bioRxiv
DO  - 10.1101/103911
SP  - 103911
AU  - Jonas Korlach
AU  - Gregory Gedman
AU  - Sarah B. Kingan
AU  - Chen-Shan Chin
AU  - Jason Howard
AU  - Lindsey Cantin
AU  - Erich D. Jarvis
Y1  - 2017/01/01
UR  - http://biorxiv.org/content/early/2017/01/28/103911.abstract
N2  - Reference quality genomes are expected to provide a resource for studying gene structure and function. However, often genes of interest are not completely or accurately assembled, leading to unknown errors in analyses or additional cloning efforts for the correct sequences. A promising solution to this problem is long-read sequencing. Here we tested PacBio-based long-read sequencing and diploid assembly for potential improvements to the Sanger-based intermediate-read zebra finch reference and Illumina-based short-read Anna’s hummingbird reference, two vocal learning avian species widely studied in neuroscience and genomics. With DNA of the same individuals used to generate the reference genomes, we generated diploid assemblies with the FALCON-Unzip assembler, resulting in contigs with no gaps in the megabase range (N50s of 5.4 and 7.7 Mb, respectively), and representing 150-fold and 200-fold improvements over the current zebra finch and hummingbird references, respectively. These long-read assemblies corrected and resolved what we discovered to be misassemblies, including due to erroneous sequences flanking gaps, complex repeat structure errors in the references, base call errors in difficult to sequence regions, and inaccurate resolution of allelic differences between the two haplotypes.
We analyzed protein-coding genes widely studied in neuroscience and specialized in vocal learning species, and found numerous assembly and sequence errors in the reference genes that the PacBio-based assemblies resolved completely, validated by single long genomic reads and transcriptome reads. These findings demonstrate, for the first time in non-human vocal learning species, the impact of higher quality, phased and gap-less assemblies for understanding gene structure and function. Abbreviations: A1-L4, primary auditory cortex – layer 4; Am, nucleus ambiguous; Area X, a vocal nucleus in the striatum; aSt, anterior striatum vocal region; aT, anterior thalamus speech area; Av, avalanche; aDLM, anterior dorsolateral nucleus of the thalamus; DM, dorsal medial nucleus of the midbrain; HVC, a vocal nucleus (no abbreviation); L2, auditory area similar to human cortex layer 4; LSC, laryngeal somatosensory cortex; LMC, laryngeal motor cortex; MAN, magnocellular nucleus of the anterior nidopallium; MO, oval nucleus of the anterior mesopallium; NIf, interfacial nucleus of the nidopallium; PAG, peri-aqueductal gray; RA, robust nucleus of the arcopallium; v, ventricle space.
ER  -