ABSTRACT
Background Diploid genome assembly is typically impeded by heterozygosity, because collapsing haplotypes into a consensus sequence introduces errors. Trio binning offers an innovative solution that exploits heterozygosity for assembly. Short parental reads are used to assign parental origin to long reads from their F1 offspring before assembly, enabling complete haplotype resolution. Trio binning could therefore provide an effective strategy for assembling highly heterozygous genomes that are traditionally problematic, such as insect genomes. This includes the wood tiger moth (Arctia plantaginis), which is an evolutionary study system for warning colour polymorphism.
Findings We produced a high-quality, haplotype-resolved assembly for Arctia plantaginis through trio binning. We sequenced a same-species family (F1 heterozygosity ∼1.9%) and used parental Illumina reads to bin 99.98% of offspring Pacific Biosciences reads by parental origin, before assembling each haplotype separately and scaffolding with 10X linked-reads. Both assemblies are highly contiguous (mean scaffold N50: 8.2 Mb) and complete (mean BUSCO completeness: 97.3%), with complete annotations and 31 chromosomes identified through karyotyping. We employed the assembly to analyse genome-wide population structure and relationships between 40 wild resequenced individuals from five populations across Europe, revealing the Georgian population as the most genetically differentiated with the lowest genetic diversity.
Conclusions We present the first invertebrate genome to be assembled via trio binning. This assembly is one of the highest quality genomes available for Lepidoptera, supporting trio binning as a potent strategy for assembling highly heterozygous genomes. Using this assembly, we provide genomic insights into geographic population structure of Arctia plantaginis.
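The binning step summarised above can be illustrated with a minimal sketch. It assumes the usual k-mer formulation of trio binning: k-mers seen in only one parent's short reads mark that parent's haplotype, and each offspring long read is assigned to whichever parent contributes more marker k-mers. The function names, the toy sequences, and the value of k are illustrative assumptions and are not taken from this study's pipeline; production tools also handle reverse complements, filter sequencing-error k-mers, and normalise hit counts.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every k-mer in seq (forward strand only, for brevity)."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def parent_specific_kmers(maternal_reads, paternal_reads, k):
    """Return (maternal-only, paternal-only) k-mer sets from short reads."""
    mat = {km for read in maternal_reads for km in kmers(read, k)}
    pat = {km for read in paternal_reads for km in kmers(read, k)}
    return mat - pat, pat - mat


def bin_read(read, mat_only, pat_only, k):
    """Assign a long read to 'maternal', 'paternal', or 'unassigned'.

    Simplification (assumption): raw hit counts are compared directly,
    without normalising by the size of each marker k-mer set.
    """
    hits = Counter()
    for km in kmers(read, k):
        if km in mat_only:
            hits["maternal"] += 1
        elif km in pat_only:
            hits["paternal"] += 1
    if hits["maternal"] > hits["paternal"]:
        return "maternal"
    if hits["paternal"] > hits["maternal"]:
        return "paternal"
    return "unassigned"


# Toy data: the two parents differ only at the end of the sequence.
K = 4
mat_only, pat_only = parent_specific_kmers(["ACGTACGTAA"], ["ACGTACGTCC"], K)
labels = [bin_read(r, mat_only, pat_only, K)
          for r in ("TTACGTAAGG", "ACGTCCAA", "ACGTACGT")]
```

In this toy example the third read carries no parent-specific k-mers and stays unassigned. Reads of this kind correspond to the small unbinned fraction in a real data set.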
List of abbreviations
- BAC: bacterial artificial chromosome
- bp: base pairs
- BUSCO: Benchmarking Universal Single-Copy Ortholog
- BWA: Burrows-Wheeler Aligner
- CLR: continuous long reads
- CTAB: hexadecyltrimethylammonium bromide
- Cy3-dUTP: cyanine 3-deoxyuridine triphosphate
- DABCO: 1,4-diazabicyclo[2.2.2]octane
- DAPI: 4′,6-diamidino-2-phenylindole
- ENA: European Nucleotide Archive
- FS: Fisher strand bias
- GATK: Genome Analysis Toolkit
- GISH: genomic in situ hybridization
- GTRGAMMA: generalised time-reversible substitution model and gamma model of rate heterogeneity
- KAT: Kmer Analysis Toolkit
- kbp: kilobase pairs
- LD: linkage disequilibrium
- Mbp: megabase pairs
- ml: millilitre
- ML: maximum likelihood
- MQ: root mean square mapping quality
- MQRankSum: mapping quality rank sum test
- PacBio: Pacific Biosciences
- PC: principal component
- PCA: principal component analysis
- QD: quality by depth
- RAxML: Randomized Axelerated Maximum Likelihood
- ReadPosRankSum: read position rank sum test
- SMRT: Single Molecule, Real-Time
- SNP: single nucleotide polymorphism
- SOR: strand odds ratio
- STAR: Spliced Transcripts Alignment to a Reference