Abstract
Despite the rapid development of sequencing technologies, assembly of mammalian-scale genomes into complete chromosomes remains one of the most challenging problems in bioinformatics. To help address this difficulty, we developed Ragout, a reference-assisted assembly tool that now works for large and complex genomes. Taking one or more target assemblies (generated by an NGS assembler) and one or more related reference genomes, Ragout infers the evolutionary relationships between the genomes and builds the final assemblies using a genome rearrangement approach. Using Ragout, we transformed NGS assemblies of 15 different Mus musculus and one Mus spretus genomes into sets of complete chromosomes, leaving less than 5% of sequence unlocalized per set. Various benchmarks, including PCR testing and realignment of long PacBio reads, suggest only a small number of structural errors in the final assemblies, comparable with direct assembly approaches. Additionally, we applied Ragout to the Mus caroli and Mus pahari genomes, which exhibit karyotype-scale variations compared to other genomes from the Muridae family. Chromosome color maps confirmed most of the large-scale rearrangements that Ragout detected.
Introduction
The year 2001 marked an important step in genome biology with the release of the first near-complete human genome [Lander et al., 2001, Venter et al., 2001]. Since then, numerous near-complete mammalian genome sequences have been made available [Pontius et al., 2007, Church et al., 2009, Scally et al., 2012]. These finished genomes, while expensive to produce, have greatly advanced the field of comparative genomics and provided many new insights into our understanding of mammalian evolution. The initial achievement was quickly followed by the era of high-throughput sequencing technologies – next generation sequencing (NGS). These cost-effective technologies are allowing many sequencing consortia to explore genomes from a large number of species [Jarvis et al., 2014].
Recently, new de novo assembly algorithms have been developed to combine high-throughput short read sequencing data with long reads (Pacific Biosciences) or jumping libraries to completely assemble bacterial genomes (one chromosome into one contig) [Koren and Phillippy, 2015]. By contrast, complete assembly of mammalian genomes using current short-read sequencing technologies remains a formidable problem, since such genomes are larger and have more complicated repeat structures. Most current mammalian assemblies produced by NGS-assemblers [Butler et al., 2008, Simpson et al., 2009] contain thousands to hundreds of thousands of contigs/scaffolds and provide limited value for comparative genomics, as the reconstructed syntenic regions are highly fragmented. Recently, some studies [Gordon et al., 2016, Chaisson et al., 2015] have applied long read technologies (Pacific Biosciences, 10x Genomics, Dovetail) to improve the assembly of larger genomes. However, the cost of generating high throughput long reads is still much higher than the cost of generating short read libraries.
Since many complete genomes are now available, an alternative approach is to use these genomes to guide the assembly of the target (assembled) genome, in a method called ‘reference-assisted assembly’ [Gnerre et al., 2009]. In such methods, the information from a closely-related reference genome is used by an NGS-assembler for resolving complicated genomic structures, such as repeats or low-coverage regions. This technique was implemented in a number of assemblers/scaffolders [Zerbino and Birney, 2008, Peng et al., 2010, Gnerre et al., 2011, Iqbal et al., 2012] and proved to be valuable when a close reference is available. Another common approach is to align pre-assembled contigs of the target genome against the reference, and order them according to their positions in the reference genome [Richter et al., 2007, Rissman et al., 2009]. However, for both approaches, simplistic modeling, in which each breakpoint is treated independently, still introduces many misassembly errors when structural variations between the reference and target genomes are present.
To improve over the single reference genome approach, Kim et al. [2013] introduced the RACA tool, which made an important step toward reliable reconstruction of the target genome by analyzing the structure of multiple outgroup genomes in addition to a single reference. This approach proved to be valuable, since consistent adjacency information across multiple outgroups proved a more powerful predictor of adjacencies in the target assembly than single genomes alone. Similar to other genome rearrangement approaches, RACA relies on the decomposition of the input sequences into a set of synteny blocks – long genomic regions conserved with respect to micro-rearrangements. However, given the lack of synteny block reconstruction tools for mammalian sequences (only tools for pairwise comparisons are available), RACA reconstructs synteny blocks by aligning all input sequences against a single reference genome. This approach is biased towards the reference genome and, in some cases, cannot detect synteny blocks [Pham and Pevzner, 2010]. Additionally, the proposed rearrangement model has limited support for genomic variations in the target genome that are not observed in the references, so such target-specific adjacencies are likely to be considered chimeric and consequently ignored. As a result, RACA, while improving over other tools that use a single reference, still generates misassemblies and gaps in its final scaffolds.
This work presents Ragout, an algorithm (and a software package) that attempts to address all of the above challenges in reference-assisted assembly for mammalian genomes. Ragout combines Progressive Cactus, a multiple whole genome aligner [Paten et al., 2011], with a new graph simplification algorithm to decompose the input sequences into multi-scale synteny blocks and an iterative algorithm for finding the missing adjacencies. Ragout utilizes multi-scale synteny blocks to separate large structural variations from smaller polymorphisms, and it also minimizes the number of gaps and mis-joins in the final scaffolds. Additionally, a 2-break rearrangement model [Alekseyev and Pevzner, 2009] is used to distinguish between target-specific rearrangements and chimeric adjacencies.
Results
Synteny blocks
Nucleotide-level alignments between diverged genomes contain millions of small variations. To analyze karyotype-level rearrangements, studies typically use lower-resolution alignments. Such alignments are described as a set of coarse synteny blocks [Kent et al., 2003, Pevzner et al., 2003], each such synteny block being a set of strand-oriented chromosome intervals in the set of genomes being compared that represents the homology relationship between large segments of the genomes. In this study, we also use synteny blocks to separate large structural variations from small polymorphisms. However, we take a hierarchical approach, with multiple sets of synteny blocks, each defined at a different resolution, from the coarsest, karyotype level all the way down to the fine-grained base level. To create the hierarchy, we use the principles developed for the Sibelia tool [Minkin et al., 2013], which creates such a hierarchy for bacterial genomes, adapting them with a graph simplification algorithm that constructs synteny blocks from a multiple genome alignment in HAL format [Hickey et al., 2013], produced by Progressive Cactus [Paten et al., 2011]. The algorithm starts from synteny blocks of the highest resolution (local sequence alignments) and then iteratively merges them into larger blocks if they are structurally concordant. Thus, each coarse synteny block has multiple predecessors, which defines the hierarchy (see Methods section for the details).
A rearrangement approach for genome assembly
At each level of resolution, the Ragout algorithm decomposes the input genomes into a set of signed strings of synteny blocks, such that when the set of strings is concatenated together it forms a signed permutation of blocks. While each contiguous reference chromosome is transformed into a single sequence of signed synteny blocks, each chromosome in the target assembly corresponds to multiple sequences of synteny blocks because of the assembly fragmentation. As a result, some information about the adjacencies between synteny blocks in the target genome is missing. Ragout infers a phylogenetic relationship between the genomes and constructs a breakpoint graph from the sets of synteny block strings, and then uses a rearrangement approach to infer these missing adjacencies. Additionally, a 2-break rearrangement model [Alekseyev and Pevzner, 2009] is used to identify chimeric adjacencies in the input contigs/scaffolds and distinguish them from real rearrangements. All these steps are performed iteratively, starting from the lowest to the highest synteny block resolution. In each iteration, scaffolds from the previous iteration are merged with scaffolds from the current one to refine the resulting assembly. Ragout further applies a “context-matching” algorithm to resolve repeats and put them back into the final scaffolds (see Figure 1 for an algorithm overview and the Methods section for details).
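As a toy illustration of this representation (hypothetical names, not Ragout's actual data structures), the sketch below encodes sequences as signed strings of synteny blocks and shows how assembly fragmentation leaves one breakpoint-graph adjacency of the target undetermined:

```python
# Toy sketch: genomes as signed strings of synteny blocks. Each block b
# contributes a tail vertex (b, "t") and a head vertex (b, "h"); an
# adjacency edge joins the outgoing end of one block to the incoming
# end of the next block on the same sequence.

def block_ends(signed_block):
    """Return (incoming_end, outgoing_end) vertices for a signed block."""
    name = abs(signed_block)
    if signed_block > 0:                 # +b: enter at tail, leave at head
        return (name, "t"), (name, "h")
    return (name, "h"), (name, "t")      # -b: traversed in reverse

def adjacencies(sequence):
    """Adjacency edges contributed by one sequence of signed blocks."""
    edges = []
    for left, right in zip(sequence, sequence[1:]):
        _, out_end = block_ends(left)
        in_end, _ = block_ends(right)
        edges.append(frozenset([out_end, in_end]))
    return edges

reference = [1, 2, 3]            # one complete reference chromosome
contigs = [[1, 2], [3]]          # fragmented target assembly
ref_edges = set(adjacencies(reference))
tgt_edges = set(e for c in contigs for e in adjacencies(c))
missing = ref_edges - tgt_edges  # the adjacency lost to fragmentation
```

Here the break between the two contigs removes exactly the (2-head, 3-tail) adjacency, which is the kind of missing edge the rearrangement step later infers.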
Benchmarking Ragout and RACA on simulated datasets
We simulated multiple datasets with extensive structural rearrangements to benchmark the Ragout algorithm and compared its performance against RACA. We took human chromosome 14 (GRCh37/hg19 version) as an ancestral genome and chose a set of breakpoints such that they divide the chromosome into intervals of exponentially distributed length [Pevzner et al., 2003], in approximate concordance with empirical data. Then, we modeled structural rearrangements (inversion, translocation and gene conversion) with breakpoints randomly drawn from the defined breakpoint set. These rearrangements were uniformly distributed on the branches of the phylogenetic tree. Next, we modeled the NGS assembly process by fragmenting the target genome. Since each breakpoint in mammalian genomes is close to repetitive elements [Pevzner et al., 2003, Brueckner et al., 2012], we marked half of the repeats located near the breakpoints as “unresolved” by the assembler and fragmented the genome at the corresponding positions. We further applied additional fragmentation at random positions to make the dataset more challenging. Finally, to mimic chimeric scaffolds generated in typical NGS studies [Kim et al., 2013], we randomly joined together 5% of the target fragments. Using this setting, we simulated three reference genomes and a target genome consisting of approximately 5000 fragments with a mean length of 18,000 bp. The corresponding phylogenetic tree is shown in Figure 2(a). As RACA requires paired-end reads as input, for each target genome we sampled 30 million reads of length 70 bp using the wgsim software [Li, 2013].
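The core of this simulation can be sketched as follows. This is a simplified illustration, not the exact pipeline used in the paper: only inversions are modeled, and the genome size, number of events and mean breakpoint spacing are toy values.

```python
import random

random.seed(0)

# Hypothetical simulation sketch: draw breakpoints with exponentially
# distributed spacing, apply random inversions between breakpoints,
# then fragment the result at a subset of breakpoints to mimic contigs.

def make_breakpoints(genome_len, mean_gap):
    pos, points = 0, []
    while True:
        pos += max(1, int(random.expovariate(1.0 / mean_gap)))
        if pos >= genome_len:
            return points
        points.append(pos)

def apply_inversion(seq, breakpoints):
    i, j = sorted(random.sample(breakpoints, 2))
    return seq[:i] + seq[i:j][::-1] + seq[j:]

genome = list(range(100_000))                 # toy "chromosome"
bps = make_breakpoints(len(genome), mean_gap=5_000)
for _ in range(10):                           # ten random inversions
    genome = apply_inversion(genome, bps)

# fragment at a random half of the breakpoints
cuts = sorted(random.sample(bps, len(bps) // 2))
fragments = [genome[a:b] for a, b in zip([0] + cuts, cuts + [len(genome)])]
```

Because inversions preserve content, the fragments always partition the rearranged genome exactly.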
We then benchmarked Ragout and RACA on datasets simulated as described above with different numbers of rearrangements (ranging from 50 to 500). All tools were run with their default parameters. To benchmark the chimera detection module, we performed extra Ragout runs with the following modifications. The first extra run, called permissive, had the chimera detection module turned off. The second extra run, denoted conservative, broke all unsupported target adjacencies (to mimic the common mapping approach in reference-guided assembly).
Given a resulting set of chromosomes, we call an adjacency (a pair of consecutive contigs) correct if these contigs have the same sign and their original positions in the target genome are adjacent (allowing jumps through unplaced fragments which were not included in the final scaffolds). Otherwise, the adjacency is called erroneous. For each run, we measured the error rate as the number of erroneous adjacencies divided by the total number of adjacencies. The computed error rates as well as the statistics of unplaced contigs are shown in Figure 2. As expected, the normal Ragout strategy, which keeps a subset of target-specific adjacencies (rather than all or none of them), produces fewer errors compared to the naive strategies. In the datasets with the fewest rearrangements, RACA produces slightly more misassemblies than Ragout, but as the number of rearrangements increases, the difference in accuracy between the default Ragout strategy and RACA widens. The normal and conservative strategies resulted in almost the same number of unplaced contigs. However, this number is higher for the permissive strategy, which is a consequence of including chimeric adjacencies in the final scaffolds. RACA produces a higher number of unplaced contigs compared to all Ragout strategies, which might be explained by the fact that RACA uses only a single synteny block resolution (block size 5,000, the default). Detailed assembly statistics for the first dataset with 50 rearrangements are given in Supplementary Table S1. We additionally benchmarked Ragout on incomplete sets of references, for which the results are given in Supplementary Materials S4.
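The adjacency error rate defined above can be sketched as follows (hypothetical data structures: fragments are (id, sign) pairs, and `placed` marks fragments included in the final scaffolds, so that jumps over unplaced fragments are allowed):

```python
# Sketch of the adjacency error rate: a join between two consecutive
# placed fragments is correct if they have the same sign and are
# adjacent in the true genome, allowing jumps over unplaced fragments.

def error_rate(scaffold, true_order, placed):
    """scaffold/true_order: lists of (fragment_id, sign); placed: set of ids."""
    true_pos = {fid: i for i, (fid, _) in enumerate(true_order)}
    errors = total = 0
    for (a, sa), (b, sb) in zip(scaffold, scaffold[1:]):
        total += 1
        between = true_order[true_pos[a] + 1 : true_pos[b]]
        ok = (sa == sb and true_pos[a] < true_pos[b]
              and all(fid not in placed for fid, _ in between))
        errors += 0 if ok else 1
    return errors / total if total else 0.0

truth = [("f1", +1), ("f2", +1), ("f3", +1), ("f4", +1)]
placed = {"f1", "f2", "f4"}                  # f3 was left unplaced
good = [("f1", +1), ("f2", +1), ("f4", +1)]  # jump over f3 is allowed
bad = [("f1", +1), ("f4", -1), ("f2", +1)]   # wrong sign and wrong order
```

With these toy inputs, `good` has error rate 0.0 and `bad` has error rate 1.0.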
Assembly of 15 Mus musculus and one Mus spretus genomes
We applied Ragout to 15 Mus musculus and one Mus spretus genomes, which included 12 laboratory strains (129S1/SvimJ, A/J, AKR/J, BALB/cJ, C3H/HeJ, C57BL/6NJ, CBA/J, DBA/2J, FVB/NJ, LP/J, NOD/ShiLtJ, NZO/H1LtJ) and four wild-derived strains (WSB/EiJ, PWK/PhJ, SPRET/EiJ, CAST/EiJ) [Keane et al., in preparation]. The initial NGS assembly was performed using SGA [Simpson et al., 2009]. Scaffolding was performed with SOAPdenovo2 [Luo et al., 2012] using multiple paired-read and mate-pair libraries with insert sizes 3, 6 and 10 kbp, 40 kbp fosmid ends for eight strains, and BAC ends for NOD/ShiLtJ. Additionally, for the three most divergent genomes (PWK/PhJ, SPRET/EiJ and CAST/EiJ), scaffolding based on a Dovetail protocol [Putnam et al., 2016] was applied. We used the C57BL/6J M. musculus strain as a single reference, as all target genomes have the same karyotype and show good structural similarity with the C57BL/6J reference: on average, 254 adjacent synteny block pairs longer than 10 kbp from the target NGS assemblies were not adjacent in the C57BL/6J reference (corresponding to large putative rearrangements). For each target strain, Ragout produced a complete set of chromosomes with the expected large-scale structure. Some assemblies also included short unlocalized fragments (homologous to the corresponding sequences in C57BL/6J) or MT chromosomes. The statistics of assembly results are given in Table 1. The unplaced sequence for each assembly comprised less than 5% of the total length. We also estimated the number of missing exons as less than 2% for each assembly (see below). On average, 49 adjacencies between synteny blocks larger than 10 kbp from the assembled chromosomes were not present in the C57BL/6J reference (excluding the three most divergent genomes).
We used multiple sets of PacBio reads that were available for the three most divergent genomes (PWK/PhJ, SPRET/EiJ and CAST/EiJ) to estimate the structural accuracy of the assembled chromosomes. The samples included whole genome sequence data at approximately 0.5-1x coverage as well as RNA sequencing data from liver and spleen for each of the three strains. We classified each adjacency (a pair of consecutive NGS fragments on a Ragout chromosome) as covered if both fragments are joined by one or more aligned reads, or uncovered otherwise (47%, 43% and 55% of adjacencies were covered for the PWK/PhJ, SPRET/EiJ and CAST/EiJ genomes, respectively). We call an adjacency validated if there is a read that has alignments longer than 500 bp on both parts of the adjacency with a correct orientation. We then calculated the validated adjacency ratio as the number of validated adjacencies divided by the number of covered adjacencies (Figure 3 shows the validated adjacency ratio as a function of the maximum gap size of an adjacency). As expected, longer gaps were harder to validate, as the chance of being covered by a single read decreases. Genomes that were closer to the C57BL/6J reference had more validated adjacencies, reflecting the lower structural divergence between the reference and the target assembly. Additionally, since the RNA sequencing data had lower coverage, fewer adjacencies could be validated using that dataset.
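The covered/validated classification can be sketched as below, assuming simplified alignment records of the form (fragment, alignment length, strand); the actual pipeline works from real read alignments, and the names here are hypothetical.

```python
# Sketch of the PacBio validation logic: an adjacency is "covered" if
# some read aligns to both flanking fragments, and "validated" if one
# read has >= 500 bp alignments on both sides in a consistent
# orientation.

MIN_ALN = 500

def classify(adjacency, read_alignments):
    """adjacency: (left_frag, right_frag); read_alignments: dict
    read_id -> list of (fragment, aln_len, strand)."""
    left, right = adjacency
    covered = validated = False
    for alns in read_alignments.values():
        hits = {frag: (ln, s) for frag, ln, s in alns}
        if left in hits and right in hits:
            covered = True
            (l_len, l_s), (r_len, r_s) = hits[left], hits[right]
            if l_len >= MIN_ALN and r_len >= MIN_ALN and l_s == r_s:
                validated = True
    return covered, validated

reads = {
    "r1": [("c1", 800, "+"), ("c2", 650, "+")],  # spans and validates
    "r2": [("c2", 300, "+"), ("c3", 900, "+")],  # spans, one side too short
}
```

With these toy reads, the adjacency (c1, c2) is covered and validated, while (c2, c3) is covered but not validated.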
To further investigate the structural accuracy of the assembly, we performed a transcript consistency analysis using the GENCODE comprehensive transcript set (GENCODE version M4 / Ensembl 78) [Flicek et al., 2012]. We aligned 356,151 protein-coding exons from the database against the combination of Ragout chromosomes and corresponding unplaced sequences using BWA [Li et al., 2009] (keeping only the best hit for each exon). We defined lost exons as exons that align to the unplaced sequence rather than to chromosomes (see Table 1). Then, for each of 78,378 transcripts we determined its primary chromosome and primary orientation based on the alignments of the majority of its exons. Exons that align to a non-primary chromosome of the corresponding transcript are called misplaced. Similarly, we counted exons with incorrect orientation among the exons that align to the primary chromosome. Exons that belong to multiple transcripts were counted once. The results are shown in Figure 3. As expected, NGS assemblies had significantly higher numbers of misplaced exons than Ragout chromosomes (15,519 and 2,116 on average, respectively). The average number of exons with wrong orientation over all genomes was 69 for NGS assemblies and 469 for Ragout (excluding PWK/PhJ, SPRET/EiJ and CAST/EiJ). To estimate the specificity of the described benchmark, we also ran it on the C57BL/6J reference (false positive calls may happen because of ambiguous alignments or annotation errors); the number of misplaced exons was 1,638, while the number of exons with wrong orientation was 517. Thus, both statistics for Ragout chromosomes are only slightly higher than or equal to those calculated for the reference, which confirms the high accuracy of the reconstruction of coding sequence.
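The primary-chromosome majority vote behind this benchmark can be sketched as follows (toy alignment records; the real analysis uses BWA best hits per exon):

```python
from collections import Counter

# Sketch of the transcript-consistency check: a transcript's primary
# chromosome and orientation are the majority vote over its exon
# alignments. Exons on another chromosome are "misplaced"; exons on the
# primary chromosome with the opposite strand have "wrong orientation".

def check_transcript(exon_hits):
    """exon_hits: list of (chromosome, strand) best alignments per exon."""
    primary_chrom, _ = Counter(c for c, _ in exon_hits).most_common(1)[0]
    on_primary = [(c, s) for c, s in exon_hits if c == primary_chrom]
    primary_strand, _ = Counter(s for _, s in on_primary).most_common(1)[0]
    misplaced = sum(1 for c, _ in exon_hits if c != primary_chrom)
    wrong_orient = sum(1 for _, s in on_primary if s != primary_strand)
    return misplaced, wrong_orient

hits = [("chr1", "+"), ("chr1", "+"), ("chr1", "-"), ("chr7", "+")]
```

For this toy transcript, chr1/+ wins the vote, so one exon is misplaced (on chr7) and one has wrong orientation (chr1 on the minus strand).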
Finally, we benchmarked the ability of Ragout to preserve target-specific rearrangements that are not observed in reference genomes. We used 688 PCR primer pairs available from previous studies [Yalcin et al., 2012] that surround structural variations in different M. musculus genomes [Keane et al., 2014]. For each genome, we extracted primer pairs that align to a single NGS scaffold with a variation in distance with respect to the C57BL/6J reference (on average, 496 primer pairs were chosen). For each such pair we compared the alignment distances between the NGS assembly and the Ragout chromosomes. Notably, while Ragout corrected many structural misassemblies (see Table 1), all these target-specific structural variations were preserved.
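The distance comparison behind this PCR check can be sketched as below (toy coordinates, hypothetical primer names): a preserved target-specific variation keeps the same reference-divergent inter-primer distance after scaffolding.

```python
# Sketch of the PCR-based check: for each primer pair, compare the
# inter-primer distance in the NGS scaffold against the Ragout
# chromosome; equal distances mean the variation was preserved.

def primer_distance(positions, fwd, rev):
    """positions: dict primer_id -> coordinate on one sequence."""
    if fwd not in positions or rev not in positions:
        return None
    return abs(positions[rev] - positions[fwd])

ngs_pos = {"p1F": 1_000, "p1R": 4_500}       # 3.5 kbp span in the scaffold
ragout_pos = {"p1F": 90_000, "p1R": 93_500}  # same span on the chromosome
d_ngs = primer_distance(ngs_pos, "p1F", "p1R")
d_ragout = primer_distance(ragout_pos, "p1F", "p1R")
```

Here both distances are 3,500 bp, so the target-specific variation survived scaffolding.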
Mus caroli and Mus pahari assembly
In order to see how our method performs when assembling genomes more distant from the references, we applied Ragout to the Mus caroli and Mus pahari genomes. These genomes exhibit 4% and 8% sequence divergence from M. musculus, equivalent, respectively, to the human-orangutan and human-marmoset divergence [Thybert et al., in preparation]. Since the targets are evolutionarily intermediate between M. musculus and Rattus norvegicus, we used both of these references for the assembly. Additionally, for both the M. pahari and M. caroli assemblies, we used optical maps from the other genome as a third “extra” reference. The phylogenetic tree of the described four genomes was reconstructed by Ragout.
The M. caroli genome was assembled using ALLPATHS-LG [Butler et al., 2008] from NGS short-range libraries, multiple mate-pair libraries with insert sizes up to 3 kbp, and optical maps. Ragout reconstructed 26 scaffolds (consisting of 28,486 NGS fragments), which correspond to 19 autosomes, chromosome X and six small, unlocalized fragments [Thybert et al., in preparation]. Detailed assembly statistics are given in Table 2. The assembled M. caroli chromosomes do not exhibit any large inter-chromosomal rearrangements with respect to the M. musculus genome, which was expected from physical maps as well as chromosome paintings [Thybert et al., in preparation]. However, we detected a large cluster of inversions, spanning approximately 5 Mbp, in chromosome 17. In addition to this karyotype-scale variation, we detected 12 synteny block adjacencies that do not appear in any of the three reference genomes, suggesting target-specific rearrangements larger than 10 kbp. In the assembled chromosomes, 28 connections between contigs longer than 10 kbp were supported only by the M. pahari and/or R. norvegicus genomes.
We also applied RACA to assemble M. caroli and compared it with the Ragout assembly (see Table 2). As RACA does not produce nucleotide sequence output, but instead outputs the order and orientation of the NGS scaffolds, we produced nucleotide sequences by manually joining the corresponding scaffold parts. The assembly was highly fragmented (327 scaffolds, N50 = 22,284 kbp), which might be a consequence of RACA's limited support for target-specific rearrangements. Additionally, RACA uses a greedy approach to join input scaffolds into chromosomes, which might not lead to an optimal result in the presence of many structural variations. A large portion of sequence (384 Mbp) was left unplaced by RACA, which could be explained by the fact that only a single fixed synteny block size is used (5,000 bp by default).
The same protocol was used to assemble the M. pahari genome (see the results in Table 2). Ragout reconstructed 23 scaffolds from 16,108 assembly fragments. In contrast to M. caroli, chromosome painting as well as physical mappings of M. pahari suggest extensive inter-chromosomal rearrangements [Thybert et al., in preparation]. Ragout detected five chromosome fusions, four of which are consistent with the mappings and one that is supported by the R. norvegicus reference and might have been missed in the optical maps due to its relatively small size (about 2 Mbp). The dot-plots of the detected large-scale rearrangements as well as their corresponding chromosome paintings are shown in Figure 4. Four expected chromosome fusions and 13 expected chromosome fissions were not detected by our algorithm. This was mainly caused by the missing signatures of the rearrangements in the input sequences; in particular, it is currently very hard to predict a chromosome fission if it is not supported by any of the references, since target fragments cannot provide positive evidence of such an event. Ragout resolves this issue by integrating physical mappings. Importantly, Ragout did not generate any large inter-chromosomal rearrangements which were not expected from physical mappings or references. We also detected 36 adjacencies that do not appear in any of the reference genomes, suggesting target-specific rearrangements of size more than 10 kbp. Twenty-one connections between contigs longer than 10 kbp in the final chromosomes were supported only by the M. caroli and/or R. norvegicus genomes. On the same dataset, RACA produced an assembly with a higher number of scaffolds and lower N50 (see Supplementary Table 2). RACA found four of the five target-specific breakpoints detected by Ragout, but it did not clearly predict chromosome-scale rearrangements due to the assembly fragmentation.
We also detected a cluster of inversions with the same structure as in the M. caroli genome, in the chromosome homologous to M. musculus chromosome 17. The breakpoints of the detected inversions in both genomes were supported by the corresponding optical maps. Interestingly, both the M. musculus and R. norvegicus references contain different structural variations within this region and share one inversion breakpoint (see Fig. 5). This might be a signature of a rearrangement hotspot [Pevzner et al., 2003].
Discussion
Despite recent advances in sequencing technologies and bioinformatics algorithms, de novo assembly of a mammalian-scale genome into a complete set of chromosomes remains a challenge. Many genome sequencing projects [Wang et al., 2014, Dobrynin et al., 2015, Vij et al., 2016] have used reference-guided assembly as a step in genome finishing, often followed by manual sequence curation. While multiple tools for reference-assisted assembly exist, their performance has proved limited when the reference genomes exhibit a significant number of structural variations relative to the target genome being assembled. In this manuscript we have presented Ragout, an algorithm for chromosome assembly of large and complex genomes using multiple references. Ragout joins input NGS contigs or scaffolds into larger sequences by analyzing genome rearrangements between multiple references and the target genome. In contrast to previous approaches, Ragout utilizes hierarchical synteny information, which helps reduce gaps in the resulting chromosomes. We used simulations to show that Ragout makes few errors, even in the presence of complicated rearrangements, and outperforms previous approaches in both accuracy and assembly completeness.
Using the existing Mus musculus reference, we applied Ragout to assemble 15 Mus musculus and one Mus spretus genomes (including 12 laboratory strains and four wild-derived strains). Through benchmarks that included validation with long PacBio reads, transcriptome analysis and PCR testing, we showed that Ragout produced highly accurate chromosome assemblies with less than 5% of sequence unplaced (2% of coding sequence). An analysis of transcript data showed a substantial improvement of the resulting assemblies from a gene structure perspective. Importantly, our algorithm is capable of preserving target-specific rearrangements, which are observed in the NGS assembly but not in the reference genomes, thus exhibiting less reference bias than simpler reference-guided approaches. Using both the M. musculus and R. norvegicus reference genomes, we used Ragout to assemble two more challenging genomes: M. pahari and M. caroli, which exhibit major karyotype-scale differences compared to these references. M. caroli was assembled into a complete set of chromosomes with the expected karyotype structure. M. pahari chromosomes assembled by Ragout contained five large inter-chromosomal rearrangements, four of which were confirmed by chromosome painting techniques. The fifth rearrangement was supported by the R. norvegicus reference and might have been missed in the chromosome maps due to its relatively small size. While Ragout did not generate any unexpected rearrangements (not supported by chromosome maps or references), some expected rearrangements remained undiscovered. Most of these rearrangements were chromosome fissions, which are difficult to predict using only NGS sequencing data. Ragout has a module to incorporate chromosome maps to guide the final assembly, and with it we successfully incorporated all expected rearrangements into the final chromosomes.
One limitation of Ragout is that the algorithm currently does not support diploid genomes, so only a single copy of each chromosome is reconstructed (possibly as a mixture of different alleles). Complete de novo diploid genome assembly remains a challenging task, even with recent improvements in sequencing technologies, which include long PacBio reads or long Hi-C interactions that can link heterozygous variations. Since these are largely orthogonal sources of information, reference genomes could be further coupled with direct sequencing approaches to improve de novo diploid genome assembly and phasing. Ragout does not currently use information from paired-end reads or long mate pair libraries; prior studies such as Kim et al. [2014] have reported that adding this information would slightly improve the resulting assemblies, and we leave this as future work.
We have shown that Ragout produces accurate and complete chromosome assemblies of mammalian-scale genomes even in the presence of extensive structural rearrangements. The assembled chromosomes were used as starting points for manual sequence curation in the mouse strains assembly project [Keane et al., in preparation, Thybert et al., in preparation].
Availability
The complete Ragout package is freely available at: http://fenderglass.github.io/Ragout/. The genomes of the 15 Mus musculus and Mus spretus strains, which have been scaffolded using Ragout with additional curation, are available in the NCBI database under BioProject ID PRJNA310854 [Keane et al., in preparation].
Methods
Construction of synteny blocks
Ragout constructs synteny blocks at multiple scales and provides a hierarchical relationship among these different scales of synteny blocks (for a complete discussion of multi-scale synteny blocks, see [Minkin et al., 2013, Ghiurcuta and Moret, 2014, Kolmogorov et al., 2014]). At the highest level of resolution, the alignment of multiple genomes is represented as a set of alignment blocks. Each alignment block is a set of oriented, non-overlapping, homologous sub-intervals of the input genomes. To derive a set of alignment blocks we use Progressive Cactus. Each genomic sequence st can be represented in the alphabet of alignment blocks: st = b1, …, bn, where bi is an alignment block of length |bi|. Given a set of k sequences S = {s1, …, sk} and a parameter minBlock, we construct an A-Bruijn graph G(S, minBlock) as follows: for each alignment block bi such that |bi| ≥ minBlock we create two nodes bhi and bti (representing the head and tail of the block) and connect them by a black edge. We connect heads and tails of adjacent alignment blocks in each genome st with an adjacency edge with color Ct. The length of an adjacency edge is defined as the distance between the two alignment blocks in the corresponding genome. We also include a special infinity node [Alekseyev and Pevzner, 2009] representing the ends of sequences (telomeres in complete genomes or fragment ends in draft genomes). A node is called a bifurcation if it is connected with more than one node (by colored edges) or is connected with an infinity node. A path without any bifurcation node except for its start and end is called non-branching. Figure 6b shows an example of an A-Bruijn graph constructed from the alignment blocks shown in Figure 6a.
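Under simplified inputs (each genome given as an ordered list of block occurrences with coordinates, rather than a HAL alignment), the construction can be sketched as below; names and data shapes are hypothetical.

```python
from collections import defaultdict

# Sketch of the A-Bruijn graph construction: blocks of length at least
# minBlock become head/tail node pairs joined by a black edge, and
# consecutive kept blocks in each genome are joined by a colored
# adjacency edge whose length is the gap between them on that genome.

def build_graph(genomes, min_block):
    """genomes: dict color -> ordered list of (block_id, start, end)."""
    black, adjacency = set(), defaultdict(list)
    for color, blocks in genomes.items():
        kept = [(b, s, e) for b, s, e in blocks if e - s >= min_block]
        for b, _, _ in kept:
            black.add(((b, "h"), (b, "t")))
        for (b1, _, e1), (b2, s2, _) in zip(kept, kept[1:]):
            gap = s2 - e1
            adjacency[((b1, "h"), (b2, "t"))].append((color, gap))
    return black, adjacency

genomes = {
    "A": [(1, 0, 2_000), (2, 2_100, 4_000), (3, 4_050, 9_000)],
    "B": [(1, 0, 2_000), (3, 2_300, 7_200)],  # block 2 absent in B
}
black, adj = build_graph(genomes, min_block=1_000)
```

In genome B, where block 2 is absent, blocks 1 and 3 become adjacent, so the node pair (1-head, 3-tail) carries a B-colored edge with gap 300.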
Intuitively, this construction is equivalent to: (1) representing each sequence in the alphabet of alignment blocks and (2) gluing together two alignment blocks if they are homologous and have size larger than minBlock (See [Medvedev et al., 2011] for a formal definition of gluing). Constructing this A-Bruijn graph is also equivalent to constructing the breakpoint graph from multiple genomes (See [Lin et al., 2014] for a proof of equivalence).
Small polymorphisms and micro-rearrangements correspond to certain cycle types or local structures in the A-Bruijn graph and break large homologous regions into many shorter ones (see [Pham and Pevzner, 2010] for a thorough classification of cycle types in an A-Bruijn graph). We use a progressive simplification algorithm to simplify bulges and parallel paths in the A-Bruijn graph. The algorithm consists of two sub-procedures, CompressPaths and CollapseBulges, and is parameterized with a value maxGap, which is the maximum length of the cycle/path to be simplified. CompressPaths is used to merge collinear alignment blocks into a larger synteny block. It starts from a random bifurcation node and traverses the graph in an arbitrary direction until it reaches another bifurcation node or an adjacency edge that is longer than maxGap. Then, it replaces the traversed path with two nodes connected by a black edge (which corresponds to a new synteny block; see Figure 6e for an example). A complementary procedure, CollapseBulges, finds all simple bulges having branch length shorter than maxGap in a similar manner and replaces each bulge with a single synteny block as described previously (see Figure 6b). These two procedures are applied one after another multiple times, until no more simplifications can be done. It is easily verified that the result is invariant to the order of such operations. After the simplification step, the sequences of synteny blocks are recovered by “threading” the genomes through the graph. If the blocks require additional simplification, a larger value of minBlock is used to construct a new A-Bruijn graph (see Figure 6d). See Supplementary Materials S1 for further discussion of the choice of maxGap and minBlock.
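A toy sequence-level analogue of CompressPaths (operating on block strings rather than the graph itself, and omitting the maxGap length check) can be sketched as follows; the function names are hypothetical.

```python
# Sketch of the CompressPaths idea on sequences of blocks: two blocks
# are merged into one larger synteny block when, in every genome, the
# first is always immediately followed by the second (a non-branching
# path). The real algorithm additionally bounds edge lengths by maxGap.

def compress_once(sequences):
    follows, precedes = {}, {}
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            follows[a] = b if follows.get(a, b) == b else None
            precedes[b] = a if precedes.get(b, a) == a else None
    for a, b in list(follows.items()):
        if b is not None and precedes.get(b) == a:
            merged = f"{a}+{b}"
            return [merge_pair(seq, a, b, merged) for seq in sequences], True
    return sequences, False

def merge_pair(seq, a, b, merged):
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
            out.append(merged); i += 2
        else:
            out.append(seq[i]); i += 1
    return out

seqs = [["x", "y", "z"], ["x", "y", "w"]]   # x is always followed by y
seqs, changed = compress_once(seqs)
```

Here "x" and "y" form a non-branching run in both sequences and are merged into one block, while "y" cannot be merged with its successor because that successor differs between genomes.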
Incomplete breakpoint graphs and missing adjacency inference
After running the synteny procedure, the input genome sequences can be represented in the alphabet of synteny blocks. For the sake of simplicity, let us assume that every synteny block is represented exactly once in each genome and all reference genomes are complete (the issue of repetitive synteny blocks is addressed later in the manuscript). Given an assembly A and k reference sequences P1, …, Pk in the alphabet of synteny blocks B, we construct the incomplete multi-color breakpoint graph BG(A, P1, …, Pk) = (V, E), where V = {b_i^h, b_i^t | b_i ∈ B} [Kolmogorov et al., 2014]. For each synteny block, there are two vertices in the graph, corresponding to the head and tail of the block. Edges are undirected and colored by k + 1 colors. An edge connects vertices that correspond to heads/tails of adjacent synteny blocks and is colored by the corresponding color of the genome/assembly (Figure 1). We use red, P1, …, Pk to refer to the colors of edges, where red edges represent the adjacencies of synteny blocks in the target assembly A, and Pi represents the adjacencies of synteny blocks in genome Pi. If the target genome were available, the set of all red edges would define a perfect matching in the graph. However, since the genome is fragmented into contigs, the adjacency information of the target genome at the vertices that correspond to the ends of contigs is missing. Ragout infers these missing red edges in the graph by solving the half-breakpoint parsimony problem, using adjacencies from the reference genomes as well as their evolutionary relationship in the form of a phylogenetic tree (see [Kolmogorov et al., 2014] for a complete description of the half-breakpoint parsimony problem). In contrast to bacterial assemblies, mammalian assemblies usually contain a higher fraction of chimeric contigs/scaffolds due to their increased size and complexity.
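Constructing the multi-color breakpoint graph from block permutations can be sketched as follows (a simplified illustration, not Ragout's code): each signed block b_i contributes vertices b_i^t and b_i^h, and each adjacency in a genome contributes an undirected edge carrying that genome's color.

```python
from collections import defaultdict

def block_ends(b):
    """Return the (left, right) end vertices of signed block b. A positive
    block is read tail -> head; a negative block is read head -> tail."""
    i = abs(b)
    return ((i, 't'), (i, 'h')) if b > 0 else ((i, 'h'), (i, 't'))

def breakpoint_graph(genomes):
    """genomes: dict color -> list of chromosomes/contigs, each a list of
    signed block ids. Returns a dict mapping each undirected edge
    (a frozenset of two vertices) to the set of colors that contain it."""
    edges = defaultdict(set)
    for color, chromosomes in genomes.items():
        for chrom in chromosomes:
            for a, b in zip(chrom, chrom[1:]):
                u = block_ends(a)[1]       # right end of the first block
                v = block_ends(b)[0]       # left end of the second block
                edges[frozenset([u, v])].add(color)
    return edges

genomes = {
    'red': [[1, 2], [3]],   # fragmented target assembly: two contigs
    'P1':  [[1, 2, 3]],     # one complete reference genome
}
g = breakpoint_graph(genomes)
# The edge (1^h, 2^t) is shared by both colors; (2^h, 3^t) exists only in
# the reference, hinting at the missing red adjacency across the contig break.
```

In this toy example the target's contig break between blocks 2 and 3 leaves the red matching incomplete, which is exactly the missing adjacency that half-breakpoint parsimony would infer.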
These problematic contigs generate many erroneous red connections in the breakpoint graph and need to be removed before inferring the missing adjacencies. Below, we describe a new algorithm that uses rearrangement signatures in the breakpoint graph to distinguish rearrangement red edges from erroneous red edges.
Detection of chimeric sequences
Mammalian assemblies usually contain high numbers of misassembled contigs. A large portion of misassembled contigs are chimeric: the NGS assembler has artificially joined regions of the assembled genome that are not truly adjacent. These false adjacencies correspond to erroneous red edges in the breakpoint graph. Erroneous red edges should be identified and removed before the missing adjacency inference step, since otherwise misassemblies from different contigs may be joined together, causing large structural errors. We call a red edge in the breakpoint graph supported if there is at least one parallel reference edge; otherwise the red edge is unsupported. We call an adjacency genomic if it is a true adjacency present in the target genome and artificial if it is the result of a chimeric contig. Supported red edges are unlikely to be artificial; however, an unsupported red edge could be either genomic (coming from a rearrangement specific to the target genome) or artificial (in the case of a misassembly). To classify each red edge as either genomic or artificial, we analyze the rearrangement structure between the target genome and the references. Since any rearrangement operation can be modeled as a k-break operation [Alekseyev and Pevzner, 2009], rearrangements generate alternating cycles in the breakpoint graph (see Figure 8 for an example). For each unsupported edge, if it does not belong to an alternating cycle, we classify it as artificial and remove it from the breakpoint graph.
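The alternating-cycle test can be sketched as a color-alternating breadth-first search (a simplified illustration with a single reference color, not Ragout's implementation): an unsupported red edge (u, v) lies on an alternating cycle exactly when some path from v back to u alternates reference and red edges, starting and ending with a reference edge.

```python
from collections import defaultdict, deque

def on_alternating_cycle(red_edges, ref_edges, edge):
    """Return True if the unsupported red `edge` = (u, v) lies on a cycle
    alternating between red (target) and reference edges -- the signature
    of a genuine rearrangement rather than a chimeric join."""
    adj = {'red': defaultdict(set), 'ref': defaultdict(set)}
    for a, b in red_edges:
        adj['red'][a].add(b); adj['red'][b].add(a)
    for a, b in ref_edges:
        adj['ref'][a].add(b); adj['ref'][b].add(a)

    u, v = edge
    # BFS over states (node, color of the next edge to traverse): a path
    # v -> u that starts and ends with a reference edge closes an
    # alternating cycle together with `edge` itself.
    seen = set()
    queue = deque([(v, 'ref')])
    while queue:
        node, color = queue.popleft()
        for nxt in adj[color][node]:
            if color == 'ref' and nxt == u:
                return True
            state = (nxt, 'red' if color == 'ref' else 'ref')
            if state not in seen:
                seen.add(state)
                queue.append(state)
    return False
```

For example, an inversion of block 2 (target 1, -2, 3 vs. reference 1, 2, 3) produces the alternating cycle 1^h - 2^h (red), 2^h - 3^t (ref), 3^t - 2^t (red), 2^t - 1^h (ref), so both unsupported red edges are kept as genomic, while a red edge whose endpoints have no reference context is flagged as artificial.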
Iterative assembly
We first perform multiple rounds of scaffold assembly with different synteny block resolutions. The scaffolds from the largest scale represent a “skeleton” for the final assembly. We then iteratively merge the existing skeleton with the set of scaffolds constructed from finer-scale synteny blocks, so that the new assembly is consistent with the skeleton structure (see Kolmogorov et al. [2014] for details). By default, Ragout is run in three iterations with synteny block sizes equal to (10000, 500, 100), which are similar to the default synteny scales in Sibelia, but with an increased block size in the first iteration to account for longer and more complex repeats in mammalian genomes. See additional details of the merging algorithm in Supplementary Materials S2.
Repeat resolution
The breakpoint graph analysis requires all synteny blocks to appear exactly once in each genome. A common approach is to filter out all repetitive blocks before analyzing the breakpoint graph. While this approach works for bacterial genomes, it is not applicable to mammalian assemblies, since the number of unresolved repeats in mammalian assemblies is much larger: filtering out all such sequences would leave a large number of gaps in the final assembly. Here we present an algorithm that addresses this issue by resolving repeat sequences based on information from the references. Intuitively, for each unresolved repeat, we want to create k different instances of that repeat, where k is the copy number of the repeat in the target genome. As some repeats are already resolved by the NGS scaffolder, the corresponding synteny blocks in contigs will be surrounded by other unique synteny blocks. We use this “context” information to map repeat instances in contigs to the corresponding repeats in reference genomes. See additional details about the repeat resolution algorithm in Supplementary Materials S3.
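The context-matching idea can be sketched as follows (a deliberately simplified illustration, not Ragout's algorithm: it uses only the nearest unique block on each side of a repeat instance and requires an exact two-sided match).

```python
def contexts(seq, repeat, unique):
    """All (left, right) unique-block contexts of `repeat` occurrences in
    seq, where left/right are the nearest surrounding unique blocks."""
    out = []
    for i, b in enumerate(seq):
        if b != repeat:
            continue
        left = next((x for x in reversed(seq[:i]) if x in unique), None)
        right = next((x for x in seq[i + 1:] if x in unique), None)
        out.append((left, right))
    return out

def resolve_repeat(contig, reference, repeat, unique):
    """Map each contig instance of `repeat` to the index of the reference
    instance sharing its unique context; unmatched instances stay None
    (i.e., remain ambiguous)."""
    ref_ctx = contexts(reference, repeat, unique)
    return [ref_ctx.index(ctx) if ctx in ref_ctx else None
            for ctx in contexts(contig, repeat, unique)]

# Block 9 is a two-copy repeat in the reference; unique context (1, 2)
# identifies the first copy, (3, 4) the second.
reference = [1, 9, 2, 3, 9, 4]
print(resolve_repeat([1, 9, 2], reference, 9, {1, 2, 3, 4}))  # → [0]
print(resolve_repeat([3, 9, 4], reference, 9, {1, 2, 3, 4}))  # → [1]
```

After such a mapping, each resolved repeat instance can be treated as a distinct synteny block, so the exactly-once assumption of the breakpoint graph analysis is restored.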
Phylogenetic tree reconstruction
When the phylogenetic tree of the input sequences is not available, it can be inferred from the adjacency information [Lin et al., 2011]. Given a set of genomes represented as permutations of synteny blocks, for each pair of genomes we first build a breakpoint graph. Let b1 be the set of all breakpoints (graph edges) from the first genome and b2 the set from the second genome. We denote by b1 ∩ b2 the set of breakpoints shared by both genomes. The distance between the two genomes is defined as min{|b1|, |b2|} − |b1 ∩ b2|. We then build a distance matrix and run the standard Neighbor-Joining algorithm [Saitou and Nei, 1987], which provides a good approximation of the real phylogenetic tree [Moret et al., 2001].
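The pairwise distance above is straightforward to compute from block permutations; a minimal sketch (not Ragout's code) follows, with genomes given as lists of chromosomes of signed block ids.

```python
def breakpoints(genome):
    """Set of undirected adjacencies (breakpoint-graph edges) of a genome
    given as a list of chromosomes, each a list of signed block ids."""
    bp = set()
    for chrom in genome:
        for a, b in zip(chrom, chrom[1:]):
            # Right end of block a meets left end of block b.
            bp.add(frozenset([(abs(a), 'h' if a > 0 else 't'),
                              (abs(b), 't' if b > 0 else 'h')]))
    return bp

def bp_distance(g1, g2):
    """min{|b1|, |b2|} - |b1 ∩ b2|, the breakpoint distance in the text."""
    b1, b2 = breakpoints(g1), breakpoints(g2)
    return min(len(b1), len(b2)) - len(b1 & b2)

def distance_matrix(genomes):
    """Pairwise distance matrix over a dict name -> genome, with rows and
    columns in sorted name order, ready for a Neighbor-Joining routine."""
    names = sorted(genomes)
    return [[bp_distance(genomes[a], genomes[b]) for b in names]
            for a in names]

# An inversion of block 2 destroys both shared adjacencies:
print(bp_distance([[1, 2, 3]], [[1, -2, 3]]))  # → 2
```

The resulting matrix can then be fed to any standard Neighbor-Joining implementation (e.g., Biopython's DistanceTreeConstructor) to recover the tree.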
Footnotes
Address correspondence to: Dr. Son Pham, sonpham@bioturing.com