Abstract
The vast majority of bacterial genome sequencing has been performed using Illumina short reads. Because of the inherent difficulty of resolving repeated regions with short reads alone, only ~10% of sequencing projects have resulted in a closed genome. The most common repeated regions are those coding for ribosomal operons (rDNAs), which occur in a bacterial genome between 1 and 15 times and are typically used as sequence markers to classify and identify bacteria. Here, we show that the genomic context in which rDNAs occur is conserved across taxa and that, by utilizing the conserved nature of rDNAs across taxa and the uniqueness of their flanking regions, it is possible to improve assembly of these regions relative to de novo sequencing. We describe a method which constructs targeted pseudocontigs generated by iteratively assembling reads that map to a reference genome’s rDNAs. These pseudocontigs are then used to more accurately assemble the newly-sequenced chromosome. We show that this method, implemented as riboSeed, correctly bridges across adjacent contigs in bacterial genome assembly and, when used in conjunction with other genome polishing tools, can result in closure of a genome.
Background
Sequencing bacterial genomes has become much more cost effective and convenient, but the number of complete, closed bacterial genomes remains a small fraction of the total number sequenced (Table 1). Even with the advent of new technologies for long-read sequencing and improvements to short read platforms, assemblies typically remain in draft status due to the computational bottleneck of genome closure [6,32]. Although draft genomes are often of very high quality and suited for many types of analysis, researchers must choose between working with these draft genomes (and the inherent potential loss of data), or spending time and resources polishing the genome with some combination of in silico tools, PCR, optical mapping, re-sequencing, or hybrid sequencing [32,45]. Many in silico genome finishing tools are available, and we summarise several of these in Table 2.
The Illumina entries in NCBI’s Sequence Read Archive (SRA) [21] outnumber all other technologies combined by about an order of magnitude (Table S2). Draft assemblies from these datasets have systematic problems common to short read datasets, including gaps in the scaffolds due to the difficulty of resolving assemblies of repeated regions [43,50]. By resolving repeated regions in assemblies, it may be possible to improve on existing assemblies, and therefore obtain additional sequence information from existing short read datasets in the SRA.
The most common repeated regions are those coding for ribosomal RNA operons (rDNAs), as ribosomes are essential for cell function. Sequencing of the 16S ribosomal region is widely used to identify bacteria and explore microbial community dynamics [7,8,49,51], as the region is conserved within taxa, yet retains enough variability to act as a bacterial “fingerprint” to separate clades informatively. However, the 16S, 23S, and 5S ribosomal subunit coding regions are often present multiple times in a single prokaryotic genome, and commonly exhibit polymorphism [9,24,30,47].
These long, inexactly repeated regions [2] are problematic for short-read genome assembly. As rDNAs are frequently used as a sequence marker for taxonomic classification, resolving their copy number and sequence diversity from short read collections where the assembled genome has collapsed several repeats into a single region could help improve reference databases, increasing the accuracy of community analysis. We present here an in silico method, riboSeed, that capitalizes on the genomic conservation of rDNA regions within a taxon to improve resolution of these difficult regions and provide a means to benefit from unexploited information in the SRA/ENA short read archives.
riboSeed is most similar in concept to GRabB, the method of Brankovics et al. [5] for assembling mitochondrial and rDNA regions in eukaryotes, as both use targeted assembly. However, GRabB does not make inferences about the number of rDNA clusters present in the genome, or take advantage of their genomic context. In riboSeed, genomic context is resolved by exploiting both the rDNA sequences and their flanking regions, harnessing unique characteristics of the broader rDNA region within a single genome to improve assembly.
The riboSeed algorithm proceeds from two observations: (1) although repeated rRNA coding sequences within a single genome are nearly identical, their flanking regions (that is, the neighboring locations within the genome) are distinct in that genome, and (2) the genomic contexts of equivalent rDNA sequences are also conserved within a taxonomic grouping. riboSeed uses only reads that map to rDNA regions from a reference genome, and is not affected by chromosomal rearrangements that occur outside the flanking regions immediately adjacent to each rRNA.
Briefly, riboSeed uses rDNA regions from a closely-related organism’s genome to generate rDNA cluster-specific “pseudocontigs” that are seeded into the raw short reads to generate a final assembly. We refer to this process in this work as de fere novo (meaning “starting from almost nothing”) assembly.
Implementation
We present riboSeed: a software suite that allows users to easily perform de fere novo assembly, given a reference genome sequence from a closely-related organism and single or paired-end short reads. The code is primarily written in Python3, with accessory shell and R scripts.
riboSeed relies on a closed reference genome assembly that is sufficiently closely-related to the isolate being assembled (which can be estimated using an alignment-free approach such as the KGCAK database [48]), in which rDNA regions are assembled and known to be in the correct context, as discussed below.
riboSeed proceeds in three stages: preprocessing, de fere novo assembly, and assessment/visualization.
1. Preprocessing
riboScan.py
riboScan.py uses Barrnap (https://github.com/tseemann/barrnap) to annotate rRNAs in the reference genome, and EMBOSS’s seqret [37] to create GenBank, FASTA, and GFF formatted versions of the reference genome. This preprocessing step unifies the annotation vocabulary for downstream processes.
riboSelect.py
riboSelect.py infers ribosomal operon structure from the genomic location of constituent 16S, 23S and 5S sequences. The Jenks natural breaks algorithm is employed to group rRNA annotations into likely operons on the basis of their genomic coordinates, using the number of 16S annotations to set the number of breaks. The output defines individual rDNA clusters and describes their component elements in a plain text file. This output can be manually adjusted before assembly if the clustering does not accurately reflect the known arrangement of operons based on visualization of the annotations in a genome browser.
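The grouping step described above can be illustrated with a minimal sketch of one-dimensional natural-breaks clustering. This is not riboSelect.py's own code: the function name `jenks_groups`, the use of annotation midpoints as input, and the example coordinates are all assumptions made for illustration. The sketch finds the contiguous partition that minimises the total within-group sum of squared deviations, which is the criterion optimised by Jenks natural breaks.

```python
def jenks_groups(values, n_groups):
    """Split 1-D values into n_groups contiguous groups that minimise
    the total within-group sum of squared deviations."""
    xs = sorted(values)
    n = len(xs)
    if not 1 <= n_groups <= n:
        raise ValueError("n_groups must be between 1 and len(values)")

    # Prefix sums give the squared deviation of any slice xs[i:j] in O(1).
    s1, s2 = [0.0], [0.0]
    for x in xs:
        s1.append(s1[-1] + x)
        s2.append(s2[-1] + x * x)

    def ssd(i, j):
        # Sum of squared deviations from the mean for xs[i:j], with j > i.
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    inf = float("inf")
    # cost[k][j]: lowest cost of splitting the first j values into k groups.
    cost = [[inf] * (n + 1) for _ in range(n_groups + 1)]
    back = [[0] * (n + 1) for _ in range(n_groups + 1)]
    cost[0][0] = 0.0
    for k in range(1, n_groups + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                c = cost[k - 1][i] + ssd(i, j)
                if c < cost[k][j]:
                    cost[k][j] = c
                    back[k][j] = i

    # Follow the back-pointers from the end to recover each group.
    groups = []
    j = n
    for k in range(n_groups, 0, -1):
        i = back[k][j]
        groups.append(xs[i:j])
        j = i
    return groups[::-1]


# Hypothetical annotation midpoints; two 16S annotations imply two clusters.
midpoints = [4500, 7200, 9100, 2004600, 2007300, 2009000]
clusters = jenks_groups(midpoints, n_groups=2)
```

Setting the number of groups from the 16S count, as the text describes, means that a missing or spurious 16S annotation changes the clustering, which is one reason the output may need manual adjustment.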
2. De Fere Novo Assembly
riboSeed.py
riboSeed.py implements the algorithm described in Figure 2 in the current release. Short reads for the sequenced isolate are mapped to the reference genome using BWA [22]. Reads that map to each annotated rDNA and its flanking regions (default size 1kbp) are extracted into subsets (one subset per cluster). Each subset is independently assembled into a representative pseudocontig with SPAdes [3], using the reference rDNA regions as a trusted contig. The resulting pseudocontigs are evaluated for inclusion in future mapping/subassembly iterations based on their length (as discussed below), and concatenated into a pseudogenome, in which pseudocontigs are separated by 5kb of Ns as a spacer. This process is repeated in each subsequent iteration, using the previous round’s pseudogenome as the reference.
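The construction of the pseudogenome from pseudocontigs can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in riboSeed.py: the function name `build_pseudogenome` and the returned offset list are hypothetical, and only the 5kb spacer of Ns is taken from the description above.

```python
SPACER_LENGTH = 5000  # 5kb of Ns between pseudocontigs, as described in the text


def build_pseudogenome(pseudocontigs, spacer_length=SPACER_LENGTH):
    """Join pseudocontig sequences with runs of N.

    Returns the concatenated sequence and the (start, end) half-open
    coordinates of each pseudocontig within it.
    """
    spacer = "N" * spacer_length
    offsets = []
    position = 0
    for seq in pseudocontigs:
        offsets.append((position, position + len(seq)))
        position += len(seq) + spacer_length
    return spacer.join(pseudocontigs), offsets
```

The spacer keeps reads from mapping across the junction between two unrelated pseudocontigs, so each region can be treated independently in the next mapping iteration.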
After a specified number of iterations (3 by default), SPAdes is used to assemble all short reads in a hybrid assembly that includes the pseudocontigs from the final iteration as “trusted contigs” (or as “untrusted contigs” if the mapping quality of reads to that pseudocontig falls below a threshold, defined below). As a control, the short reads are also de novo assembled without the pseudocontigs.
Although this implementation of riboSeed uses SPAdes to perform both the subassemblies and the final de fere novo assembly, the pseudocontigs can be submitted to any hybrid assembler that accepts short read libraries and contigs. After assembly, the de fere novo and de novo assemblies are assessed with QUAST [16].
3. Assessment and Visualization
riboScore.py
riboScore.py extracts the regions flanking the rDNAs in the reference and in the assemblies generated by riboSeed. The flanking regions from the assembly are matched with the reference flanking regions using BLAST and, depending on the ordering of the matches, each junction is called a correct, incorrect, or ambiguous join based on the criteria outlined below.
riboSnag.py
riboSnag.py is provided as a helper tool to produce useful diagnostics and visualisation concerning rDNA sequence in the reference genome. Using the clustering generated by riboSelect.py, sequences for the clusters can be extracted from the genome, aligned, and Shannon entropy [40] plotted with consensus depth for each position in the alignment.
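The per-position Shannon entropy mentioned above can be computed from an alignment as in the following minimal sketch. The function name `column_entropies` is hypothetical, and treating the gap character as an ordinary symbol is an assumption of this sketch; riboSnag.py may handle gaps differently.

```python
import math
from collections import Counter


def column_entropies(aligned_seqs):
    """Shannon entropy (in bits) of every column of a multiple alignment.

    Assumption: gap characters are counted like any other symbol.
    """
    if len({len(s) for s in aligned_seqs}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    entropies = []
    for column in zip(*aligned_seqs):
        total = len(column)
        counts = Counter(column)
        # H = -sum(p * log2(p)) over the symbols observed in this column.
        entropies.append(
            sum(-(c / total) * math.log2(c / total) for c in counts.values())
        )
    return entropies
```

A column in which every sequence carries the same base has entropy 0, while a column split evenly between two bases has entropy 1 bit, so peaks in the profile mark positions that differ between rDNA copies.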
riboSwap.py
In all cases, we recommend assessing the performance of the riboSeed pipeline visually using Mauve [10,11], Gingr [42], or a similar genome assembly visualizer to compare reference, de novo, and de fere novo assemblies in addition to riboScore.py. If contigs appear to be incorrectly joined, the offending de fere novo contig can be replaced with syntenic contigs from the de novo assembly using the riboSwap.py script.
riboStack.py
riboStack.py uses bedtools [36] and samtools [22] to compare the depths of coverage of reads aligning to the reference genome in the rDNA regions to randomly sampled regions elsewhere in the reference genome. riboStack.py takes output from riboScan.py, and a BAM file of reads that map to the reference. If the number of riboScan.py-annotated rDNAs matches the number of rDNAs in the sequenced isolate, the coverage depths within the rDNAs will be similar to the coverage in other locations in the genome. If the coverage of rDNA regions sufficiently exceeds the average coverage elsewhere in the genome, this may indicate that the reference strain has fewer rDNAs than the sequenced isolate. In this case, using an alternative reference genome may produce improved results.
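The reasoning behind this coverage comparison can be expressed as a short calculation. The sketch below is illustrative only: the function name `estimate_rdna_copies` is hypothetical, the depth values would in practice come from bedtools/samtools output, and the use of medians is an assumption, since the summary statistic used by riboStack.py is not specified here.

```python
from statistics import median


def estimate_rdna_copies(rdna_depths, background_depths, annotated_rdnas):
    """Compare rDNA depth with background depth.

    Returns the depth ratio and the implied rDNA copy number in the
    sequenced isolate (ratio multiplied by the annotated rDNA count).
    """
    background = median(background_depths)
    if background <= 0:
        raise ValueError("background depth must be positive")
    ratio = median(rdna_depths) / background
    return ratio, ratio * annotated_rdnas
```

A ratio close to 1 suggests that the reference and the sequenced isolate carry the same number of rDNAs; a ratio near 2 with four annotated rDNAs, for example, would suggest roughly eight copies in the isolate.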
Results
Characteristics of rDNA flanking regions
The use of rDNA flanking sequences to uniquely identify and place rDNAs in their genomic context requires their flanking sequences to be distinct within the genome for each region. This is expected to be the case for most, if not all, prokaryotic genomes. We determined that using 1kb flanking widths was sufficient to include differentiating sequence (Figure S1). To demonstrate this, rDNA and 1kb flanking regions were extracted from E. coli Sakai [17] (BA000007.2), in which the rDNAs have been well characterized [33]. These regions were aligned with MAFFT [20], and their consensus depth and Shannon entropy [40] calculated for each position in the alignment (Figure 3a).
Figure 3a (and Figure S3) shows that within a single genome the regions flanking rDNAs are variable between operons. This enables unique placement of reads at the edges of rDNA coding sequences in their genomic context (i.e. there is not likely to be confusion between the placements of rDNA edges within a single genome).
In E. coli MG1655, the first rDNA is located 363 bases downstream of gmhB (locus tag b0200). Homologous rDNA regions were extracted from 25 randomly selected complete E. coli chromosomes (Table S1). We identified the 20kb region surrounding gmhB in each of these genomes, then annotated and extracted the corresponding rDNA and flanking sequences. These sequences were aligned with MAFFT, and the Shannon entropies and consensus depth plotted (Figure 3b).
Figure 3b shows that equivalent E. coli rDNAs, plus their flanking regions, are well-conserved across several related genomes. Assuming that individual rDNAs are monophyletic within a taxonomic group, short reads that can be uniquely placed on a related genome’s rDNA as a reference template are also likely to be uniquely placed in the appropriate homologous rDNA of the genome to be assembled.
Taken together, when these two properties hold, they allow for unique placement of reads from homologous rDNA regions in the appropriate genomic context. These “anchor points” effectively reduce the number of branching possibilities in de Bruijn graph assembly for each individual rDNA, and thereby permit a complete balanced path through the full rDNA region.
Validating Assembly across rDNA regions
Settings used for analyses in this manuscript are the defaults as of riboSeed version 0.4.09 (except where otherwise noted).
To evaluate the performance of de fere novo assembly compared to de novo assembly methods, we used Mauve to visualize syntenic regions and contig breaks of each assembly in relation to the reference genome that was used to generate pseudocontigs. We categorized each rDNA in an assembly as either correct, skipped, or incorrect, as follows.
An rDNA assembly was classed as correct if two criteria were met: (i) the assembly merged two contigs across an rDNA region such that, based on the reference, the flanking regions of the de fere novo assembly were syntenous with those of the reference; and (ii) the assembled contig extended at least 90% of the flanking length. An assembled cluster was defined as skipped if the ends of one or more contigs aligned within the rDNA or flanking regions (signalling that extension across the rDNA region was not achieved). Finally, if two contigs assembled across an rDNA region in a manner that conflicted with the orientation indicated in the reference genome, the rDNA region was deemed to be incorrect.
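The classification criteria above can be summarised as a small decision function. This is a sketch of the stated logic only, not the scoring code in riboScore.py: the function name `classify_rdna` and its three inputs are hypothetical, and the handling of a bridged, syntenic region that covers less than 90% of the flank is not specified in the text, so the sketch labels that case "ambiguous" as an assumption.

```python
def classify_rdna(bridged, syntenic_with_reference, flank_fraction_covered,
                  min_flank_fraction=0.9):
    """Classify one rDNA region of an assembly.

    bridged: True if a single contig spans the rDNA (no contig ends
        inside the rDNA or its flanking regions).
    syntenic_with_reference: True if the flanks joined by the contig
        agree with the arrangement and orientation in the reference.
    flank_fraction_covered: fraction of the flanking length that the
        contig extends into.
    """
    if not bridged:
        return "skipped"
    if not syntenic_with_reference:
        return "incorrect"
    if flank_fraction_covered >= min_flank_fraction:
        return "correct"
    # Assumption: this case is not defined in the text.
    return "ambiguous"
```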
In all cases, SPAdes was used with the same parameters for both de fere novo assembly and de novo assembly, apart from the addition of pseudocontigs in the de fere novo assembly.
Simulated Reads with Artificial Genome
To create a small dataset for testing, we extracted all 7 distinct rDNA regions from the E. coli Sakai genome (BA000007.2), including 5kb upstream and downstream flanking sequence, using the tools riboScan.py, riboSelect.py and riboSnag.py. Those regions were concatenated to produce a ~100kb artificial test chromosome (see supplementary methods). pIRS [19] was used to generate simulated reads (100bp, 300bp inserts, stdev 10, 30-fold coverage, built-in error profile) from this test chromosome. These reads were assembled using riboSeed, using the E. coli MG1655 genome (NC_000913.3) as a reference. The simulation was run 8 times.
The de fere novo assembly bridged 4 of the 7 rDNA regions in the artificial genome, while the de novo assembly method failed to bridge any (Table S2). To explain how the choice of reference sequence determines the ability to assemble correctly through rDNA regions, we ran riboSeed with the same E. coli reads using pseudocontigs derived from the Klebsiella pneumoniae HS11286 (CP003200.1) reference genome [23]. The de fere novo assembly with pseudocontigs from K. pneumoniae bridged between 1 and 2 rDNAs, but also misassembled several rDNA gaps (Figure 4).
Effect of reference sequence identity on riboSeed performance
To investigate how riboSeed assembly is affected by choice of reference strain, we implemented a simple mutation model to generate reference sequence variants of the artificial chromosome described above, with a specified rate of substitution. A substitution rate applied uniformly across all bases does not address the disparity of conservation between rDNAs and their flanking regions, so a second model was also applied wherein substitutions were allowed only in the rDNA flanking regions. We then assembled the artificial genome’s reads using the mutated artificial genome as a reference under each of these models (Figure 5).
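The two substitution models can be illustrated with a minimal sketch. This is not the code used to produce Figure 5: the function name `mutate`, the interval representation of the flanking regions, and the choice of a uniformly random replacement base are assumptions made for illustration.

```python
import random


def mutate(sequence, rate, regions=None, seed=None):
    """Substitute each eligible base with probability `rate`.

    regions: optional list of (start, end) half-open intervals. When
        given, only positions inside these intervals may be substituted
        (the flanking-only model); otherwise every position is eligible.
    seed: seed for the random number generator, for repeatable runs.
    """
    rng = random.Random(seed)
    if regions is None:
        eligible = range(len(sequence))
    else:
        eligible = sorted({
            i
            for start, end in regions
            for i in range(max(0, start), min(end, len(sequence)))
        })
    bases = list(sequence)
    for i in eligible:
        if rng.random() < rate:
            # Always change the base; never substitute a base with itself.
            alternatives = [b for b in "ACGT" if b != bases[i].upper()]
            bases[i] = rng.choice(alternatives)
    return "".join(bases)
```

Calling `mutate(genome, 0.01)` corresponds to the uniform model at about 10 substitutions per kbp, while passing the flanking coordinates as `regions` leaves the rDNA coding sequences untouched.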
To obtain an estimate of substitution rate for the E. coli data used above, Parsnp [42] and Gingr [42] were used to identify SNPs in the 25 genomes used above (Figure 3), with respect to the same region in E. coli Sakai. An average substitution rate of 0.0062 was observed. Compared to the results from the simulated genomes, we could expect successful performance under the model of mutated flanking regions, and partial success under the model of substitutions throughout the region.
Figure 5 indicates that the more similar the reference sequence is to the genome being assembled, the greater the likelihood of correctly assembling through rDNA regions. When mutating only the flanking regions (Figure 5), which more closely resembles the relative substitution frequencies of the rDNA regions, the procedure correctly assembles rDNAs with tolerance to substitution frequencies up to approximately 30 substitutions per kbp. Using an average nucleotide identity species boundary of 95% [14], it could be concluded that riboSeed requires a reference within the same species for optimal performance, and that moderate success can be achieved even when using a more distant reference.
Simulated reads with E. coli Sakai and K. pneumoniae Genomes
To investigate the effect of short read length on riboSeed assembly, pIRS [19] was used to generate paired-end reads from the complete E. coli MG1655 and K. pneumoniae NTUH-K2044 genomes, simulating datasets at a range of read lengths most appropriate to the sequencing technology. In all cases, 300bp inserts with 10bp standard deviation and the built-in error profile were used. Coverage was simulated at 20x to emulate low coverage runs and at 50x to emulate coverage close to the optimized values determined by Miyamoto [29] and Desai [12]. De fere novo assembly was performed with riboSeed using E. coli Sakai and K. pneumoniae HS11286 as references, respectively, and the results were scored with riboScore.py (Figure 6).
At either 20x or 50x coverage, de novo assembly was unable to resolve any rDNAs with any of the simulated read sets. de fere novo assembly with riboSeed showed modest improvement to both the E. coli and K. pneumoniae assemblies. Increasing depth of coverage and read length improves rDNA assemblies.
Benchmarking against Hybrid Sequencing and Assembly
To establish whether riboSeed performs as well with short reads obtained by sequencing a complete prokaryotic chromosome as with simulated reads, we attempted to assemble short reads from a published hybrid Illumina/PacBio sequencing project. The hybrid assembly using long reads was able to resolve rDNAs directly, and provides a benchmark against which to assess riboSeed performance in terms of: (i) bridging sequence correctly across rDNAs, and (ii) assembling rDNA sequence accurately within each cluster.
Sanjar, et al. published the genome sequence of Pseudomonas aeruginosa BAMCPA07-48 (CP015377.1) [38], assembled from two libraries: ca. 270bp fragmented genomic DNA with 100bp paired-end reads sequenced on an Illumina HiSeq 4000 (SRR3500543), and long reads from PacBio RS II. The authors obtained a closed genome sequence by hybrid assembly. We ran the riboSeed pipeline on only the HiSeq dataset in order to compare de fere novo assembly to the hybrid assembly and de novo assembly of the same reads, using the related genome P. aeruginosa ATCC 15692 (NZ_CP017149.1) as a reference.
de fere novo assembly correctly assembled across all 4 rDNA regions, whereas de novo assembly failed to assemble any rDNA regions (Table 3).
Comparing the BAMCPA07-48 reference to the de fere novo assembly, we found a total of 9 SNPs in the rDNA flanking regions (Table 4). The same regions from the ATCC 15692 reference used in the de fere novo assembly showed 103 SNPs compared to BAMCPA07-48. This demonstrates that this subassembly scheme successfully recovers the correct sequence despite a large number of differences between the reference and the sequenced isolate.
Thus, we find that the de fere novo assembly using short reads performs better than de novo assembly using short reads alone. Comparison of the de fere novo assembly to the hybrid assembly allows assessment of de fere novo accuracy, and reveals that de fere novo assembly can recover rDNA sequences correctly placed in their genomic context, and with a low error rate.
Case Study: Closing the assembly of S. aureus UAMS-1
Staphylococcus aureus UAMS-1 is a well-characterized, USA200, methicillin-sensitive strain isolated from an osteomyelitis patient. The corresponding published genome was sequenced using Illumina MiSeq generating 300bp reads, and the assembly refined with GapFiller as part of the BugBuilder pipeline [1]. Currently, the genome assembly is represented by two scaffolds (JTJK00000000), with several repeated regions acknowledged in the annotations [39]. As the rDNA regions were not fully characterized in the annotations, we proposed that de fere novo assembly might resolve some of the problematic regions.
Using the same reference S. aureus MRSA252 [18] (BX571856.1) with riboSeed as was used in the original assembly, de fere novo assembly correctly bridged gaps corresponding to two of the five rDNAs in the reference genome (Table 5). Furthermore, de fere novo assembly bridged two contigs that were syntenic with the ends of the scaffolds in the published assembly, indicating that the regions resolved by riboSeed could allow closure of the genome. We modified the BugBuilder pipeline (https://github.com/nickp60/BugBuilder) used in the published assembly to incorporate pseudocontigs from riboSeed, resulting in a single scaffold of 7 contigs. In this case, riboSeed was able to assist in bringing an existing high-quality scaffold to completion.
Benchmarking against GAGE-B Datasets
We used the Genome Assembly Gold-standard Evaluation for Bacteria (GAGE-B) datasets [26] to assess the performance of riboSeed against a set of well-characterized assemblies. These datasets represent a broad range of challenges; low GC content and tandem rDNA repeats prove challenging to the riboSeed procedure. Mycobacterium abscessus has only a single rDNA operon and does not suffer from the issue of rDNA repeats, so it was excluded from this analysis.
When the reference used in the GAGE-B study came from the sequenced strain (as was the case for R. sphaeroides and B. cereus), we chose an alternate reference, as using the true reference sequence would provide an unfair advantage to riboSeed. The GAGE-B datasets include both raw and trimmed reads; in all cases, the trimmed reads were used. Results are shown in Table 6.
Compared to the de novo assembly, de fere novo assembly improved the majority of assemblies. In the case of the S. aureus and R. sphaeroides datasets, particular difficulty was encountered for all of the references tested. In the case of B. fragilis, the entropy plot (Figure S3g) shows that the variability on the 5’ end of the operon is much lower than the other strains, likely leading to the misassemblies.
Discussion
We show that the regions flanking equivalent rDNAs from related strains show a high degree of conservation in related organisms. This allows us to infer the location of rDNAs within a newly sequenced isolate, even in absence of the resolution that would be provided by long read sequencing. Comparing the regions flanking rDNAs within a single genome, we observed that when considering sufficiently large flanking regions, flanking sequences show enough variability to differentiate each instance of the rDNAs. Taken together, the cross-taxon homology allows inference of the location (i.e. the flanking regions) of rDNAs, and the variability of these flanking regions within a genome enables unique identification of reads likely belonging to each cluster.
The extent of sequence similarity between the sequenced isolate and the reference influences the resulting de fere novo assembly. To prevent spurious joining of contigs, if less than 80% of the reads map to the reference, the resulting pseudocontigs will be treated as “untrusted” contigs by SPAdes. Figure 5 shows that although one should use the closest complete reference available for optimal results, the subassembly method is robust against moderate discrepancies between the reference and sequenced isolate’s flanking regions.
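The 80% rule described above amounts to choosing how the pseudocontigs are passed to SPAdes. The following sketch shows that decision in isolation. The function name `contig_flag` is hypothetical and this is not riboSeed's own code; `--trusted-contigs` and `--untrusted-contigs` are the SPAdes command-line options for supplying trusted and untrusted contigs.

```python
def contig_flag(mapped_reads, total_reads, min_mapped_fraction=0.8):
    """Pick the SPAdes option under which pseudocontigs are supplied.

    Pseudocontigs are treated as untrusted when less than
    `min_mapped_fraction` of the reads map to the reference.
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    fraction = mapped_reads / total_reads
    if fraction < min_mapped_fraction:
        return "--untrusted-contigs"
    return "--trusted-contigs"
```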
The method of constructing pseudocontigs implemented by riboSeed relies on having a relevant reference sequence, whose rDNA regions act as “bait”, fishing for reads that likely map specifically to that region. Although we show this to be an effective way to partition the appropriate reads, perhaps a more robust and supervision-free method would be to use a probabilistic representation of equivalent rDNA regions for a particular taxon. By developing a database of sequence profiles (e.g. hidden Markov Models) from each of the rDNAs in a taxon, perhaps the step of choosing a single most appropriate reference could be circumvented. For datasets where the choice of reference determines riboSeed’s effectiveness, a probabilistic approach may improve performance.
Several checks are implemented after the subassembly to ensure that the resulting pseudocontig is fit for inclusion in the next mapping/subassembly iteration or the final de fere novo assembly. If a subassembly’s longest contig is greater than 3x the particular pseudocontig length or shorter than 6kb (a conservative minimum length of a 16S, 23S, and 5S operon), this is taken to be a sign of poor parameter choice so the user is warned, and by default no further seedings will occur to avoid spurious assembly. Such an outcome can be indicative of any of several factors: improper clustering of operons; insufficient or extraneous flanking sequence; sub-optimal mapping; inappropriate choice of k-mer length for subassembly; inappropriate reference; or other issues. If this occurs, we recommend testing the assembly with different k-mers, changing the flanking length, or trying alternative reference genomes. Mapping depth of the rDNA regions is also reported for each iteration; a marked decrease in mapping depth may also be indicative of problems.
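The length check described above can be written as a short function. This is a sketch of the stated thresholds only: the function name `pseudocontig_problem` and its return values are hypothetical, and the other checks performed by riboSeed are not shown.

```python
MIN_CONTIG_LENGTH = 6000  # conservative minimum length of a 16S, 23S, 5S operon
MAX_LENGTH_RATIO = 3      # longest contig may be at most 3x the pseudocontig


def pseudocontig_problem(longest_contig_length, pseudocontig_length):
    """Return a description of the length problem, or None if there is none.

    longest_contig_length: length of the longest contig in the subassembly.
    pseudocontig_length: length of the pseudocontig used for this round.
    """
    if longest_contig_length < MIN_CONTIG_LENGTH:
        return "too short"
    if longest_contig_length > MAX_LENGTH_RATIO * pseudocontig_length:
        return "too long"
    return None
```

A non-None result corresponds to the situation in which the user is warned and, by default, no further seeding is attempted for that cluster.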
Many published genome finishing tools and approaches offer improvements when applied to suitable datasets, but none (including the approach presented in this paper) is able in isolation to resolve all bacterial genome assembly issues. One constraint on the performance of riboSeed is the quality of rRNA annotations in reference strains. Although it is impossible to concretely confirm it is the case in silico, we (and others [28]) have found several reference genomes over the course of this study that we suspect have collapsed rDNA repeats. We recommend using a tool such as 16Stimator [34] or rrnDB [41] to estimate the number of 16S sequences (and therefore rDNAs) prior to assembly, or riboStack.py to assess mapping depths after running riboSeed.
As riboSeed relies on de Bruijn graph assembly, the results can be affected by assembler parameters. Care should be taken to find the most appropriate settings, particularly in regard to read trimming approach, range of k-mers, and error correction schemes.
One difficulty in determining the accuracy of rDNA counts in reference genomes occurs because genome sequences are often released without publishing the reads used to produce the genome. This practice is a major hindrance when attempting to perform coverage-based quality assessment, such as to infer the likelihood of collapsed rDNAs. While data transparency is expected for gene expression studies, that stance has not been universally adopted when publishing whole-genome sequencing results. To ensure the highest quality assemblies, it is imperative that researchers allow the scientific community to scrutinize the raw whole genome sequencing data with the same rigor that would be applied to any other type of high-throughput sequencing project.
Conclusions
Demonstration that rDNA flanking regions are conserved across taxa and that flanking regions of sufficient length are distinct within a genome allowed for the development of riboSeed, a de fere novo assembly method. riboSeed utilizes rDNA flanking regions to act as barcodes for repeated rDNAs, allowing the assembler to correctly place and orient the rDNA. de fere novo assembly can improve the assembly by bridging across ribosomal regions, and, in cases where rDNA repeats would otherwise result in incomplete scaffolding, can result in closure of a draft genome when used in conjunction with existing polishing tools. Although riboSeed is far from a silver bullet to provide perfect assemblies from short read technology, it shows the utility of using genomic reference data and mixed assembly approaches to overcome algorithmic obstacles. This approach to resolving rDNA repeats may allow further insight to be gained from large public repositories of short read sequencing data, such as SRA, and when used in conjunction with other genome finishing techniques, provides an avenue towards genome closure.
List of abbreviations
rDNA: DNA region coding for ribosomal RNA operon; rRNA: ribosomal RNA; SRA: Sequence Read Archive; ENA: European Nucleotide Archive; IG: intergenic; GAGE-B: Genome Assembly Gold-standard Evaluation for Bacteria
Availability of data and materials
The riboSeed pipeline and the datasets generated during the current study are available in the riboSeed GitHub repository, https://github.com/nickp60/riboSeed. The software is released under the MIT licence. Supplementary data can be found in the riboSeed repository under Waters_et_al_2017. The modified BugBuilder pipeline can be found at https://github.com/nickp60/BugBuilder. Reference strains used for this study can be found in Table S3.
Competing interests
The authors declare that they have no competing interests.
Funding
The work was funded through a joint studentship between The James Hutton Institute, Dundee, Scotland, and the National University of Ireland, Galway, Ireland.
Authors’ contributions
NRW wrote all the bugs.
Supplementary Data
Making the artificial test genome
The artificial genome used for testing was constructed using the makeToyGenome.sh script included in the GitHub repository under the scripts directory. Briefly, the 7 rDNA regions from the E. coli Sakai genome were extracted with 5kb flanking sequence upstream and downstream; these sequences were then concatenated end to end to form a single, ~100kb sequence containing the 7 rDNAs as well as their flanking context.
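The coordinate logic of this construction can be sketched in a few lines. This is an illustration only and does not reproduce makeToyGenome.sh, which relies on the riboSeed tools: the function names are hypothetical, cluster coordinates are assumed to be half-open (start, end) pairs on a linear sequence, and circularity of the chromosome is ignored.

```python
def extract_with_flanks(genome, clusters, flank=5000):
    """Return each rDNA cluster together with its flanking sequence.

    clusters: list of (start, end) half-open coordinates of rDNA clusters.
    Flanks are truncated at the ends of the sequence.
    """
    regions = []
    for start, end in sorted(clusters):
        left = max(0, start - flank)
        right = min(len(genome), end + flank)
        regions.append(genome[left:right])
    return regions


def make_toy_genome(genome, clusters, flank=5000):
    """Concatenate the flanked rDNA regions end to end."""
    return "".join(extract_with_flanks(genome, clusters, flank))
```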
Effect of reference sequence identity on riboSeed performance: Methods
A range of substitutions was introduced into an artificial genome using the runDegenerate.sh script (included in the GitHub repository under the scripts directory), which facilitates the following procedure. An artificial test genome is constructed (see above), and reads simulated using pIRS (100bp, 300bp inserts, stdev 10, 30-fold coverage, built-in error profile). Then, for each of a range of substitution frequencies, substitutions are introduced into the simulated genome, either just in the flanking regions or throughout. riboSeed is run on the reads using the mutated genome as the reference, and the results are evaluated with riboScore. This script was run 100 times, using a different random seed each time. As the algorithm used by Python’s pseudo random number generator may differ between operating systems, comparable but not identical results can be expected.
Archaeal Datasets
We assessed the effectiveness of riboSeed with assembling archaeal genomes. Most (~55%) archaeal genomes have only a single rDNA, and none has been observed to have more than four. As riboSeed requires a sequencing dataset and a reference genome, applicability was limited; of the 104 entries in rrnDB with multiple rDNAs, only 7 had multiple entries at the species level. Among those, only 2 had publicly available short read data. We used riboSeed to reassemble Methanosarcina barkeri Fusaro DSMZ804 (Ion Torrent PGM, 89bp single-end reads) and Methanobacterium formicicum st. BRM9 (Illumina HiSeq 2000, 100bp paired-end reads). Methanobacterium formicicum st. JCM10132 (DRR017790) and Methanosarcina barkeri Fusaro DSMZ804 (SRR2064286) were the only ones that were suitable for riboSeed, meaning that there was publicly available short read data and that there is a related genome at the species level which is complete.
M. formicicum st. JCM10132 was sequenced on an Ion Torrent PGM, generating 106.5Mbp of single-end data. M. formicicum BRM9 (CP006933.1) was used as a reference. De fere novo assembly bridged 1 of the 2 rDNA gaps. This represents the first application of riboSeed to Ion Torrent data.
Methanosarcina barkeri Fusaro DSMZ804 was sequenced using an Illumina HiSeq2000 with 101bp paired-end reads, with an average fragment length of 400bp. We downsampled to use 5% of the 19.4Gbp dataset. Methanosarcina barkeri str. Wiesmoor was used as a reference. The resulting riboSeed assembly showed correct assembly of 3 of 3 rDNAs, while de novo assembly failed to resolve any.
Taken together, we show that given appropriate datasets, archaeal datasets can be processed in the same manner used for bacteria.
Acknowledgements
We thank Anton Korobeynikov for his helpful tips on optimizing SPAdes. Yoann Augagneur, Shaun Brinsmade, and Mohamed Sassi graciously provided access to the S. aureus UAMS-1 genome sequencing data.