Abstract
Current short-read methods have come to dominate genome sequencing because they are cost-effective, rapid, and accurate. However, short reads are most applicable when data can be aligned to a known reference. Two new methods for de novo assembly are linked-reads and restriction-site labeled optical maps. We combined commercial applications of these technologies for genome assembly of an endangered mammal, the Hawaiian Monk seal.
We show that the linked-reads produced with 10X Genomics Chromium chemistry and assembled with Supernova v1.1 software produced scaffolds with an N50 of 22.23 Mbp with the longest individual scaffold of 84.06 Mbp. When combined with Bionano Genomics optical maps using Bionano RefAligner, the scaffold N50 increased to 29.65 Mbp for a total of 170 hybrid scaffolds, the longest of which was 84.78 Mbp. These results were 161X and 215X, respectively, improved over DISCOVAR de novo assemblies. The quality of the scaffolds was assessed using conserved synteny analysis of both the DNA sequence and predicted seal proteins relative to the genomes of humans and other species. We found large blocks of conserved synteny suggesting that the hybrid scaffolds were high quality. An inversion in one scaffold complementary to human chromosome 6 was found and confirmed by optical maps.
The complementarity of linked-reads and optical maps is likely to make the production of high quality genomes more routine and economical and, by doing so, significantly improve our understanding of comparative genome biology.
INTRODUCTION
High quality non-human genomes, especially mammalian genomes, are needed for a variety of reasons including 1) better constraining the limits of allowable nucleotide variation (a useful tool for identifying potentially causative mutations in human disease), 2) the identification of conserved protein and regulatory regions that may explain the morphological or physiological characteristics of different species, 3) establishing the correct relationship of SNPs to genes for association studies, 4) improving our understanding of evolutionary relatedness and mechanisms, and 5) aiding efforts for species conservation and management. The need for fast, economical and high quality genome data for de novo genome assembly of rare and endangered species will become especially important as habitat loss, climate change and human impacts accelerate in the 21st century.
Although next generation sequencing has allowed fast and accurate sequencing, to date, the largest hurdle to genome assembly has been the difficulty and cost in obtaining long contiguous sequence scaffolds from short reads. The human genome and those of many model organisms were assembled using methods where large DNA molecules were cloned (e.g., BACs, cosmids, etc.), employed long DNA sequencing technologies (e.g., PacBio), or used augmented short-reads approaches such as mate-pair libraries. Recently, new single-molecule chemistries and analytic methods have been developed to extend these approaches. Here, we have evaluated two technologies for assembling a de novo mammalian genome of the Hawaiian monk seal (Neomonarchus schauinslandi), an endangered species endemic to the Hawaiian islands, using techniques that interrogate individual long DNA molecules and have compared these to the short-read based DISCOVAR method (Weisenfeld et al., 2014) which uses overlapping paired-end (PE) Illumina reads. The first method, the 10X Genomics (10XG) Chromium chemistry, incorporates unique molecular indexes (UMI) into single long DNAs and assembles these into gapped scaffolds, based on shared UMIs, using the Supernova software tool (Weisenfeld et al., 2016). The second method used optical maps from Bionano Genomics (Bionano) both as a quality assessment for the Chromium sequences and as a tool to merge the Supernova scaffolds into longer hybrid assemblies.
METHODS
Sample
Blood (∼ 10 ml) was collected in EDTA vacutainers during surgery of an adult male seal, shipped on ice packs and processed within two days of collection. The viability of the cells was assessed by trypan blue exclusion. 1 ml of whole blood was stored in LN2 while 9 ml (1.85x106 cells/ml) were used for lymphocyte separation and subsequent DNA isolation for optical mapping.
DISCOVAR libraries
1 ug of DNA was used to prepare PCR-free Illumina libraries of ∼ 450 bp mean insert size as described on the Broad Institute website (https://software.broadinstitute.org/software/discovar/blog/?page_id=375). We generated 703 million 250bp paired-end reads on an Illuina HiSeq2500 for the PCR free library, for a read depth of approximately 75x based on estimated genome size. Analysis was done using DISCOVAR de novo v52488 using default parameters (Weisenfeld et al., 2014).
10X Genomics Chromium
DNA was isolated using MagAttract (Qiagen) and the molecular weight assayed by pulsed field gel electrophoresis. HMW gDNA concentration was quantitated using a Qubit Fluorometer, diluted to 1.25 ng/ul in TE, and denatured following manufacturers recommendations. Denatured gDNA was added to the reaction master mix and combined with the Chromium bead-attached primer library and emulsification oil on a Chromium Genome Chip. Library preparation was completed following the manufacturer’s protocol (Chromium Genome v1, PN-120229). Sequencing-ready libraries were quantified by qPCR (KAPA) and their sizes assayed by Bioanalyzer (Agilent) electrophoresis.
The library was sequenced (151 X 9 X 151) using two HiSeq 2500 Rapid flow cells to generate 975 M reads with a mean read length of 139 bp after trimming. The read 2 Q30 was 87.93% and the weighted mean molecule size was calculated as 92.33 kb. Mean read depth was approximately 61X. The sequence was analyzed using Supernova software (10X Genomics; Weisenfeld et al., 2016) which demultiplexed the Chromium molecular indexes, converted the sequences to fastq files and built a graph-based assembly. The assemblies, which diverge at “megabubbles,” consist of two “pseudohaplotypes.” The sequence data were originally analyzed using Supernova 1.0 and then repeated using v1.1 which estimates gap sizes rather than introducing an arbitrary value of 100 Ns. As noted above, the Supernova scaffolds were used by the Bionano Hybrid Scaffold tool to create sequence assemblies.
BioNano Genomics
Optical mapping of large DNA (Xiao et al., 2007) incorporates fluorescent nucleotides at sequence specific sites, visualizes the labeled molecules and aligns these to each other and to a DNA scaffold (Shelton et al., 2015). Lymphocytes were processed following the IrysPrep Kit for human blood with minor modifications. Briefly, PBMCs were spun and resuspended in Cell Suspension Buffer and embedded in 0.6% agarose (plug lysis kit, BioRad). The agarose plugs were treated with Puregene Proteinase K (Qiagen) in lysis buffer (Bionano Genomics) overnight at 50° C and shipped for subsequent processing (S. Brown, KSU). High Molecular Weight (HMW) DNA was recovered by treating the plugs with Gelase (Epicenter), followed by drop dialysis to remove simple carbohydrates. HMW DNA was treated with Nt. BspQI nicking endonuclease (New England Biolabs) and fluorescent nucleotides incorporated by nick translation (IrysPrep Labeling-NLRS protocol, Bionano). Labeled DNA was imaged on the Irys platform (Bionano) and more than 234,000 Mb of image data were collected with a minimum molecule length of 150 kb.
Bionano Genomics Analysis
Haplotype aware de novo assembly was done using the Bionano Genomics RefAligner Assembler (version 5122) software based on the overlap-layout-consensus paradigm (Xiao et al., 2015; Xiao et al., 2007; Anantharaman et al., 2001; Valouev et al., 2006). To build the overlap-layout-consensus graph, first the single molecule optical maps underwent a pairwise alignment where each molecule was aligned to every other molecule. Pairwise alignments generated using the Bionano Genomics RefAligner were used as input to the layout and consensus assembly stage where a draft assembly was built. We used a P value threshold of 1e-10 for the initial assembly and a minimum molecule length of 150kb. Next we refined the draft assembly using a P value cut off of 1e-11. Refinement of the assembly corrected errors and trimmed and split contigs if errors were found. Our first set of refined contigs were further improved using five iterations of extension and merging during which all contigs were aligned to each other to check for overlap or redundancy; the P value threshold used was 1e-15. Map extension was done by aligning all input molecules to the refined maps. When a set number of molecules extended past the end of a contig, they were combined into a consensus and added to the end of the contig. This increased the size of the contigs. A final refinement step was performed on the maps after the iterative cycles of extension and merging to produce a more accurate and haplotype separated final consensus map. Haplotype separated maps were built by aligning molecules to the genome maps and clustering them into two alleles. When the reported difference between the two alleles was large enough, two haplotype separate maps were generated.
Hybrid scaffolding
Hybrid scaffolding was performed on the haplotype separated genome maps and the Supernova pseudohaplotype scaffolds. The first step involved in silico digesting the Supernova sequence assembly using the Nt.BspQI recognition motif to generate map coordinates. Next the sequence scaffolds were aligned with the genome maps to flag alignments that were concordant and were fed into the iterative merging stage. The P value threshold used for the initial alignment was 1e-10 while merge was set at 1e-11. After the iterative merge stage the hybrid scaffolds were generated and aligned to the original sequence scaffolds. Finally, we exported back from genome map coordinates to FASTA format along with an AGP file (https://www.ncbi.nlm.nih.gov/assembly/agp/AGP_Specification/)” www.ncbi.nlm.nih.gov/assembly/agp/AGP_Specification/) that tracked had been merged.
Scaffold Quality Assessment
QUAST (Gurevich et al., 2013) was used to generate N50 plots and comparative metrics for the assembly methods. BUSCO (Benchmarking Universal Single Copy Orthologs), a tool for assessing genome completeness (Simão et al., 2015), was run on the hybrid pseudo-reference using 3,023 vertebrate-specific single copy orthologs in genome assembly assessment mode.
Conserved Synteny
The quality of the Bionano/Supernova hybrid scaffolds was determined in various ways. First, we chose one pseudohaplotype of all 170 unique scaffolds generated from the Bionano Hybrid Scaffold tool as a pseudo-reference and performed TBLASTN (e.g., Jun et al., 2009) to translate the seal sequences for alignment to both human and dog protein databases. The position of matching proteins from known coding genes in the seal pseudo-reference was manually reviewed in IGV (Thorvaldsdottir et al., 2013). We compared the regions of conserved syntenic genes primarily to the human genome because it is best annotated and, in some instances, to dog, cat and other mammals for which longer scaffolds were available.
DNA alignment
Conserved syntenic gene order was confirmed using nucmer (Kurtz et al., 2004) to align the seal hybrid pseudo-reference to human (GRCh37/hg19), with a minimum cluster length of 20 and a minimum match length of 50. Plots were generated for all matches greater than 1kb.
Analysis software
Supplement 1 lists the specific software and commands used for analysis.
RESULTS
Both 10XG and Bionano software are haplotype aware. For this study we chose Supernova “pseudohaplotype” 1 for analysis. No major differences in length were noted between haplotypes, as might be expected for a species with a population size of about 1400 animals and with reported low heterozygosity (Schultz et al., 2009)
A summary of results is shown in Table 1. The 250bp PE DISCOVAR sequencing produced over 437,230 reads with a scaffold N50 of 137,851 bp. In contrast, the 10X Chromium sequencing assembled with Supernova v1.0 had an N50 of nearly 14.6 Mbp, an improvement of over 100 fold. When Supernova v1.1 was used the N50 increased to 22.23 Mbp. The Bionano optical maps significantly improved overall scaffold length and decreased the total number of scaffolds from 203 to 170.
Figure 1 shows a QUAST (Gurevich et al., 2013) plot of scaffold size distributions. The QUAST statistics estimated the total assembly length from 2.32-2.46 Gbp, which is similar to that of other carnivores. The total number of N’s increased (Table 1) between Supernova v1.0 and v1.1, as expected, due to the improved gap estimation algorithm. The change improved alignment to the optical maps, by more accurately spacing BspQI sites with respect to the sequence scaffolds. Figure 2 shows the corresponding improvement in Bionano confidence scores with Supernova v1.1. Figure 3a shows, in IrysView, an example of a long (∼ 450 kbp) region of high concordance between the optical maps and the Supernova v1.1 scaffold while Fig. 3b shows a region where Supernova may have added larger gaps than appear in the Bionano maps. Figure 4 is an example of where a Bionano map merged two Supernova sequence scaffolds. This example was further studied by conserved synteny analysis.
In order to identify possible false Bionano joins between Supernova scaffolds, we performed TBLASTN, using a locally installed version of BLAST, to compare the 170 Bionano/10XG hybrid scaffolds against a database of human protein-coding genes. We set the match criteria at low stringency (evalue 1e-05) so that we would maximize the likelihood of identifying homologs but with the understanding that there would be many false matches. We inspected matches by finding two or more contiguous genes that mapped to the same human chromosome and with similar spacing and order. In a few instances of gene families with highly similar paralogs the exactly matching ortholog was not always identified by TBLASTN and in these cases we assigned the specific gene that best fit the conserved synteny relationship.
As an example, Table 2 shows four non-overlapping scaffolds that share conserved synteny with the majority of human chromosome 6. Although we noted inversions in gene order within scaffolds when they were arranged with respect to the chromosome we saw excellent coverage with no large gaps of expected orthologs. When the same analysis was extended to the remainder of the genome (data not shown) we found similar coverage with relatively few gaps indicating that most of the protein-coding genes have been captured in the 170 hybrid scaffolds. The only regions conspicuously absent were the short arms of the acrocentric human chromosome which are largely ribosomal RNA repeats and would not have been identified by TBLASTN analysis. The seal scaffolds orthologous to human chromosome 6 from Fig. 4 are shown in Fig. 5 as a nucmer plot which compares DNA sequences between seal and human. The plot confirms that the sequence contiguity within the seal scaffolds agrees with the conserved synteny analysis.
We also performed BUSCO analysis of the conserved vertebrate gene set both for the 170 hybrid scaffolds as well as the unscaffolded sequences that did not align to Bionano maps. Of the 170 hybrid scaffolds BUSCO identified 2,696 complete single-copy genes (89.2% of the 3023 expected conserved vertebrate genes) and 16 duplicates (0.5%). These BUSCO assessments are nearly identical to those calculated for humans (Simão et al., 2015). An additional 171 genes were scored as fragmented (5.7%) and 140 were missing (4.6%). We manually searched our TBLASTN predicted protein list and accounted for all but 12 of the “missing.” Among the sequences not aligning to optical maps, 8 “complete” and 14 “fragmented” genes were found (fewer than 1% of the total screened).
DISCUSSION
Next generation sequencing has revolutionized the field of comparative genomics, especially at the level of individual gene sequences. But, creating long assemblies has been less successful because of the complexity of genomes which are frequently interrupted by repeats longer than typical NGS reads. This limitation has led to various innovations such as long-single molecule sequencing (e.g., PacBio, Oxford Nanopore) but throughput, accuracy and costs have limited their widespread adoption. We showed that the 10XG Chromium method, assembled with Supernova v1.1 assembly software, produced long scaffolds 160-fold longer than paired end sequencing assembled with DISCOVAR de novo. When combined with Bionano optical maps the scaffold N50 increased to greater than 29 Mbp, a 200-fold improvement. The technologies described here are a large step toward improving and simplifying de novo genome assembly in that they use orthogonal methods that complement each other. The strengths of the 10XG Chromium Linked-Reads method are that it uses ∼ 106 unique molecular indices to randomly tag long molecules captured in emulsified droplets and then sequences these by standard short-read technology. Because the total amount of DNA is so small (1.2 ng for a mammalian genome), the probability that more than one molecule from the same region of the genome is encapsulated and identically labeled is extremely small, and long molecules will produce linked reads that are likely to span repeats such as L1 and other retrotransposons.
Dong et al. (2013) used optical mapping to assist with the assembly of the goat genome and found that it improved super-scaffold lengths by 5X over those obtained with fosmid end sequencing. Mostovoy et al. (2016) used 10XG chemistry and Bionano Genomics maps to create a phased genome assembly of the widely sequenced human sample NA12878 and reported 170 hybrid scaffolds with an N50 of 33.5 Mbp. We have extended their approach by using newer versions of the 10XG chemistry with an increased number of UMIs and Supernova software for de novo assembly. Although optical maps do not produce sequence per se they do provide a valuable “truth set” of chromosomal structure to confirm or refute sequence assemblies. The Bionano optical maps were useful both for merging 10X scaffolds but also as a quality assessment measure of the overall sequence contiguity in any given scaffold. Although the N50 statistic can be misleading based on the choice of the lower contig cutoff size, the assembly that we produced by combining the 10XG and Bionano optical maps resulted in larger scaffolds than most mammalian genomes reported to date. The hybrid Bionano/10XG scaffolds also had conserved syntenic blocks that were consistent with the human genome and other mammals. The QUAST statistics (Table 1) indicated that 98.3% of the total estimated genome length (i.e., the predicted haploid genome equivalent; 2360/2400 Mbp) was accounted for by the 170 Bionano hybrid scaffolds. This is in agreement with the conserved synteny data which showed that most human chromosome orthologous regions were very well covered. As noted above, BUSCO identified 99% of the conserved gene set within the 170 scaffolds, again indicating that the hybrid scaffolds were of high quality.
The materials cost for this study was less than $15,000. We anticipate that these costs will become lower in the near term as optical mapping improves, short-read sequencing migrates to higher throughput instruments, or new long read technologies (e.g., nanopore sequencing) mature. Software improvements and more standardized pipelines for genome assembly are also expected to make genome assembly less burdensome.
CONCLUSION
As a field, comparative genomics is entering a new phase where we expect that the vast majority of living (or recently extinct) species will be sequenced. We know from human studies that many of the regulatory signals that control complex traits and development are in intergenic regions (e.g., Bhatia and Kleinjan, 2014; GWAS Catalog: http://www.ebi.ac.uk/gwas/). Likewise, both the sequences and the genomic architecture of regulatory elements within chromosomes are likely to explain the majority of the phenotypic differences between species. If that is true then a fuller understanding of comparative genomics will require knowing not only selected gene sequences but the context of the genome in which those genes are found.
Combining the two orthogonal methods of Chromium Linked-Reads with Bionano optical maps is likely to make the assembly of high quality genomes routine and significantly improve our understanding of comparative genome biology while reducing costs. Because the Chromium method labels single molecules only a few ng of DNA are needed. This advantage may be particularly useful for non-lethal sampling of organisms both in captivity and in the field. As more species become endangered the need to preserve their genomes for study and conservation will increase. The methods shown here represent a cost-effective and robust strategy to meet that goal.
Acknowledgements
We thank Drs. Charles Littnan and Michelle Barbieri for obtaining samples, Dr. Susan Brown for performing the optical mapping, Dr. Melissa Olson for lymphocyte separation and whole blood cryopreservation, Laura Kasch and Jill Barton for preparing agarose blocks for optical mapping and DNA isolation from blood, Drs. Jonathan Pevsner and Michael Schatz for comments on the manuscript.