The landscape of chloroplast genome assembly tools

Jan A Freudenthal; Simon Pfaff; Niklas Terhoeven; Arthur Korte; Markus J Ankenbrand; Frank Förster

doi:10.1101/665869

Abstract

Chloroplasts are photosynthetic organelles in plant cells and contain their own genomic information. That genome can be utilized in different scientific fields like phylogenetics or biotechnology. Thus, different assemblers specialized in chloroplast assemblies have been developed. Those assemblers often use the output of whole genome sequencing experiments as input. Such sequencing data usually contain the complete chloroplast genome information, even if the sequencing aims for the core genome. However, the different assembly tools have never been systematically compared. Here we present a benchmark of seven chloroplast assembly tools, which together succeeded on more than 60% of the real data sets. Our results show significant differences between the tested assemblers in terms of generating whole chloroplast genome sequences and computational requirements. Moreover, we suggest further development to improve user experience and success rate. For reproducibility, we created docker images for each tested tool, which are available to the scientific community. Following the presented guidelines, users are able to analyze and screen data sets for chloroplast genomes using only standard computer infrastructure. Thus, large scale screening for chloroplasts as hidden treasures within genomic sequencing data is feasible.

Introduction

General introduction and motivation

Chloroplasts are essential organelles present in plant cells and the cells of some protists. Chloroplasts enable the conversion of light energy into chemical energy via photosynthesis. They harbor their own ribosomes and a circular DNA genome, usually with a size between 120 kbp and 160 kbp [1]. Because of this small size, the chloroplast genome has been an early target for sequencing. The first chloroplast genome sequences were obtained as early as 1986 [2, 3]. These early efforts elucidated the general genome organization and structure of the chloroplast DNA. Chloroplast genome content and structure are reviewed for example in [4, 5]. Chloroplast genomes are widely used for evolutionary analyses [6, 7], barcoding [8, 9, 10], and meta-barcoding [11, 12]. Interesting aspects of chloroplast genomes are their small size (120 kbp to 160 kbp [1]), caused by endosymbiotic gene transfer [13, 14], and the low number of 100 to 120 genes that are still encoded on the chloroplast genome [4]. Despite the overall high conservation of the genome sequence, there are striking differences in the gene content between different groups (e.g. the loss of the whole ndh gene family in Droseraceae [15]). Even more extreme evolutionary cases, in which chloroplasts show a very low GC content and a modified genetic code, have been described [16].

These differences call for comparative genomic approaches. Given the small size, it is much easier to decipher the complete chloroplast genome than the complete core genome. For example, the Arabidopsis thaliana core genome is approximately 125 Mbp in length [17, 18], while the A. thaliana chloroplast genome, at 154 kbp, is more than 800 times smaller [19].

Even if only a single chloroplast is located inside a plant cell, several hundred copies of the chloroplast genome exist in each cell [20, 21]. Therefore, many genome sequencing projects contain chloroplast reads as a by-product. In some cases the chloroplast data are even considered contamination, and experimental protocols for reducing their content have been developed [22]. An alternative approach to improve the assembly of the core genome would be to first resolve the chloroplast genome and afterwards use this information to remove those reads that map to the chloroplast genome.

Structurally, two inverted repeats (IR_A and IR_B) of 10 kbp to 76 kbp divide the chloroplast genome into a large (LSC) and a small single copy (SSC) region [1]. Those large inverted repeats complicate automated resolution with short read technologies [23]. Moreover, the existence of different chloroplasts within a single individual, and thus of multiple different chloroplast genomes, has been described for different plants [24, 25, 26]. Although the origin and evolutionary importance of this phenomenon, called heteroplasmy, are only poorly understood, it might hinder the assembly of whole chloroplast genomes.

Databases such as the Sequence Read Archive at NCBI [27] contain short read data for species for which no reference chloroplast sequence is publicly available. The availability of whole chloroplast genomes would enable large scale comparative studies [28]. Additionally, reconstructed full chloroplast genomes have been used as super-barcodes [29], and for biotechnology applications and genetic engineering [30].

Approaches to extracting chloroplasts from whole genome data

Different strategies have been developed to assemble chloroplast genomes [31]. In general, obtaining a chloroplast genome from WGS data requires two steps. First, the chloroplast reads have to be extracted from the mixed sequencing data. The second step is the assembly and resolution of the special circular structure including the inverted repeats. The extraction of the reads can be achieved by mapping the reads to a reference chloroplast [32]. A different approach, which does not perform alignments, relies on the higher coverage of chloroplast data in the whole genome sequencing data set [33]. Here, a k-mer analysis can be used to extract the most frequent reads. An example of this is implemented in chloroExtractor [34]. A third method combines both approaches by using a reference chloroplast as seed and simultaneously assembling the reads based on k-mers [35].
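The coverage-based idea can be illustrated with a small Python sketch. It is a toy example and not the algorithm implemented in chloroExtractor or any other tool discussed here; the function names, the k-mer size and the count threshold are assumptions chosen for illustration only.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Yield all overlapping k-mers of a sequence."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def select_high_coverage_reads(reads, k=5, min_count=3):
    """Keep reads whose median k-mer count reaches min_count.

    Toy sketch: k and min_count are illustrative assumptions, and
    reverse complements are ignored for brevity.
    """
    counts = Counter(km for read in reads for km in kmers(read, k))
    selected = []
    for read in reads:
        read_counts = [counts[km] for km in kmers(read, k)]
        if read_counts and median(read_counts) >= min_count:
            selected.append(read)
    return selected


# Four copies of an abundant ("chloroplast-like") read and two rare reads.
reads = ["ACGTACGGTC"] * 4 + ["TTGACCATGA", "GGCATTCAGT"]
abundant = select_high_coverage_reads(reads)
```

In this toy input only the four copies of the abundant read pass the filter. Real data would require a much larger k and a threshold derived from the observed k-mer count distribution.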

Purpose and scope of this study

The goal of this study is to compare the effectiveness and efficiency of existing open source command-line tools to de-novo assemble whole chloroplast genomes from raw genomic data sets with minimal configuration. This means no need for extensive data preparation, no need for a specific reference (apart from A. thaliana), no need to change default parameters, and no manual finishing. We further restricted our benchmark to paired end Illumina data sets, as these are routinely generated by modern sequencing platforms [36].

In our opinion this reflects the most common use cases: (1) a user trying a tool quickly without digging into options for fine tuning and (2) large scale automatic applications. Still, we acknowledge that the performance of the tools might be significantly improved by optimizing parameters (and references if applicable) for each data set specifically. However, an exhaustive comparison, including tuning of all possible parameters for each tool, was out of scope for this study.

Our results will enable the discovery of novel chloroplast genomes as well as an assembly of inter/intra-individual differences in the respective chloroplast genomes.

Results

Performance metrics

Time requirements

We observed massive differences in run time between the tools. Apart from tool-specific differences, the input data and the number of threads had a huge impact. The observed run times varied from a few minutes to several hours (figure 1).

Figure 1 Computation time depending on number of threads and size of input data

The boxplots show the differences in demand of CPU time for different number of threads and input data size for the seven different assemblers

Some assemblies failed to finish within the time limit we set (48 h). On average, IOGA and Fast-Plast took the longest time to generate the assemblies, followed by ORG.Asm and GetOrganelle. The most time-efficient tool was chloroExtractor, which on average was a little faster than NOVOPlasty and Chloroplast assembly protocol.

Not all the tools were able to benefit from having access to multiple threads. Both NOVOPlasty and ORG.Asm take about the same time regardless of whether they can use 1, 2, 4, or 8 threads. Chloroplast assembly protocol, chloroExtractor, GetOrganelle and Fast-Plast all profit from multi-threading (figures 1 and 2 and tables S3 to S5).

Figure 2 Performance metrics

Boxplots depicting the demand of CPU and RAM and disk space needed depending on the assembler, input data size and number of threads

Memory and CPU Usage

The peak and mean CPU usage, as well as peak memory and disk usage, were recorded for all assemblers based on the same input data set and number of threads (figure 2 and tables S3 to S5). The peak memory usage was mainly influenced by the size of the input data, with the exception of chloroExtractor and IOGA. Those two assemblers seem to have a memory usage pattern that is less influenced by the size of the data. The number of allowed threads had only a limited impact on the peak memory usage. Nevertheless, all programs profit from a higher number of threads when the size of the input data is increased. In contrast, the disk usage is independent of input size and number of threads for all assemblers.

Qualitative

The user experience of most tools was evaluated as mainly Good (table 1). However, a few points of criticism remained. Two minor dependencies were missing in the GetOrganelle installation instructions and there was no test data available. Additionally, an issue occurred when running it on an A. thaliana data set. We are currently in the process of resolving this with the authors.

Table 1 Overview of the results of the qualitative usability evaluation

Each tool could score Good, Okay or Bad in each of the categories.

The Fast-Plast installation instructions were missing some dependencies. Like GetOrganelle, Fast-Plast does not offer a test data set or a tutorial, except for some example commands.

The ORG.Asm installation instructions did not work. We found some issues, which are probably related to the requirement of Python 3.7. There is a tutorial where sample data is available. However, following the instructions resulted in a segmentation fault. We found a workaround for this bug and contacted the authors.

The main point of criticism for NOVOPlasty was the lack of a test data set with instructions. This was fixed by the authors after we contacted them. Additionally, NOVOPlasty uses a custom license, where an OSI approved license would be preferred.

The chloroExtractor does come with a test data set and a short tutorial. However, it is currently not possible to evaluate the results of the test run.

The IOGA installation instructions were missing many dependencies. Also, there was no test data or tutorial available and there is no license assigned to it. Since there was no update to the GitHub repository for the last three years, the project can be seen as inactive. After contacting the authors, they promised to resolve the mentioned issues.

As with many of the other tools, the installation instructions for the Chloroplast assembly protocol were missing some dependencies. The list was updated after we contacted the authors. This tool does come with a test data set; however, a note about the expected outcome is missing. A more extensive tutorial is provided. The description of the parameters is short, but sufficient.

Quantitative

Simulated data

The only assembler obtaining perfect results according to our score for the simulated data sets is GetOrganelle (figure 3 and table 2). IOGA and Chloroplast assembly protocol showed the worst performance, being unable to fully assemble a single chloroplast out of 14 runs. NOVOPlasty performed second best with scores above 80 for all data sets, only failing to resolve the contigs into one single circular chloroplast assembly. The overall performance is best when the input data consist purely of chloroplast reads. Only IOGA and Chloroplast assembly protocol failed to deliver any results under this scenario once. In general, no clear correlation can be observed between the performance of the different assemblers and either the length of the input reads or the ratio of core to chloroplast reads.

Figure 3 Score of assemblies on simulated data

Results of assemblies from simulated data sets. Color scale of the tiles represents the score

Table 2 Scores of assemblies of simulated data

Real data sets

On the real data sets, we observed considerable differences in the median score between the assemblers (figure 4). The highest scores were achieved by GetOrganelle with a median of 99.7 and 199 circular assemblies out of a total of 356 assemblies that resulted in an output (table 3). GetOrganelle is followed by Fast-Plast, NOVOPlasty, ORG.Asm, and chloroExtractor. Fast-Plast slightly outperforms NOVOPlasty and ORG.Asm in terms of score, and produced roughly twice as many perfectly assembled chloroplast genomes (114, compared with 66 circular genomes for NOVOPlasty and 55 for ORG.Asm). Neither IOGA nor Chloroplast assembly protocol was able to assemble a circular, single-contig genome (table 3), consequently resulting in the lowest mean and median scores (figure 5).

Figure 4 Results of scoring of the seven assemblers

The box- and swarmplots depict the results of the scoring algorithm we used for the different assemblers. The whiskers of the boxplots indicate the 1.5 × interquartile range.

Figure 5 Upset plot [67] comparing success of assemblers on the real data sets

The plot shows the intersection of success (score > 99) between assemblers. For 69 data sets only GetOrganelle was able to obtain a complete chloroplast, 43 were successful with both GetOrganelle and Fast-Plast, and so on.
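The intersections underlying such a plot can be derived from per-assembler scores as sketched below. Only the success threshold (score > 99) is taken from the caption; the function name, the data layout and the example scores are invented for illustration and are not results from the benchmark.

```python
from collections import Counter


def exclusive_intersections(scores, threshold=99.0):
    """Count data sets per exact combination of successful assemblers.

    scores maps data set id -> {assembler: score}; an assembler counts
    as successful on a data set if its score exceeds the threshold.
    """
    combos = Counter()
    for per_tool in scores.values():
        winners = frozenset(t for t, s in per_tool.items() if s > threshold)
        if winners:
            combos[winners] += 1
    return combos


# Invented example scores, not results from the benchmark.
example = {
    "set_a": {"GetOrganelle": 100.0, "Fast-Plast": 72.5},
    "set_b": {"GetOrganelle": 99.7, "Fast-Plast": 100.0},
    "set_c": {"GetOrganelle": 100.0, "Fast-Plast": 40.0},
    "set_d": {"GetOrganelle": 0.0, "Fast-Plast": 0.0},
}
combos = exclusive_intersections(example)
```

Each key of the resulting counter corresponds to one column of the plot, and data sets on which no assembler succeeded are left out.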

Table 3 Mean scores of chloroplast genome assemblers

Consistency

Consistency was tested by re-running assemblies and comparing the scores of the two assemblies (figure 6). Replicates that did not produce an output were manually scored as 0. GetOrganelle was the only tool that succeeded in obtaining similar scores for all assemblies, without producing any completely unsuccessful assemblies for this subset of data. Except for Fast-Plast, all the other tools had at least one assembly that was unsuccessful in one run but produced an output in the other. Notably, IOGA appears to have a tendency to perform differently in independent runs. Here, more than 10% of the assemblies failed in one run only.

Figure 6 Scores between two repeated runs for consistency testing

The scatter plots depict the scores of the first run (x-axis) versus the scores of the second run (y-axis) for the data sets that were selected for re-evaluation.

Both Fast-Plast and NOVOPlasty tend to show minor changes in the assembly when the overall performance is comparably good, leading to the arrow-shaped scatter plots. chloroExtractor and Chloroplast assembly protocol appear to be the most robust assemblers, showing only a few deviations between the two runs.

Discussion

We aimed to generate an overall performance score for the different chloroplast assemblers, but depending on the downstream application, the different criteria assessed in this work need to be weighted differently. For example, ease of installation and use might not be a big concern if the tool is installed once and integrated in an automated pipeline. On the other hand, this factor alone might prevent other users from being able to use the tool in the first place. Similarly, computational requirements or run time might be less relevant if the goal is to assemble a single chloroplast for further analysis, but they are essential if hundreds or thousands of samples should be processed in parallel for a large scale study. Ultimately, both ease of use and run time are irrelevant if the tool is not able to successfully accomplish its task. The scope of this study also needs to be considered when interpreting the guidelines below. In particular, we evaluated all tools under the assumption that they are used in the most basic form (default parameters, no hand selected reference, no pre-processing of the data or post-processing of the result, restricted run time). It is important to note that any tool might perform significantly differently if the above mentioned parameters are fine-tuned for a specific data set.

The overall best success rate, both on simulated and real data, was achieved by GetOrganelle followed by Fast-Plast. Both tools complement each other, as each is able to successfully reconstruct full chloroplasts in cases where the other tool fails. In rare cases NOVOPlasty or ORG.Asm is the only tool to succeed. The tools Fast-Plast, NOVOPlasty, and ORG.Asm produce the most variable results, thus rerunning the tool after a failed attempt might be successful. chloroExtractor yields only a few complete chloroplast assemblies, but also requires only few resources. It is easy to install and use and thus could be considered a good option for a quick first try. Both IOGA and Chloroplast assembly protocol have the worst performance of all tools tested and fail to return reliable chloroplast assemblies.

Additionally, we observed no phylogenetic pattern in the success rate of the assemblers (figure 7). This indicates that the tools are generally able to reconstruct chloroplasts across the plant kingdom, even without a reference or with A. thaliana as a fixed reference.

Figure 7 Success for chloroplast assembly shows no taxonomic bias

Success of assemblers on real data sets on tree derived from NCBI taxonomy [68]. Plot was prepared using [69]

Guidelines for the end-user

Given these results, our recommendation is to use GetOrganelle as the default option and, in case of failure, Fast-Plast as a backup solution. If both programs fail, it is sensible to re-run Fast-Plast and additionally try NOVOPlasty and ORG.Asm. This procedure maximizes the chance to effectively and efficiently recover the circular chloroplast genome from mixed genomic data. If none of these four assemblers produces sensible results, a reference guided approach and tweaking of the default parameters might be the solution. Here, it is not possible to provide general guidelines, as the procedure will differ for different data sets. For an automated approach, running GetOrganelle and Fast-Plast in parallel appears to be a good trade-off between success rate and use of resources.
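The sequential part of this recommendation can be written down as a small decision procedure. The following Python sketch only encodes the order of attempts; the runner callables and the success check are hypothetical hooks that a user would have to provide, for instance by calling the respective container and scoring its output.

```python
def assemble_with_fallback(dataset, runners, is_success):
    """Try assemblers in the recommended order until one succeeds.

    runners maps assembler name -> callable(dataset) returning a result;
    is_success(result) decides whether the assembly is acceptable.
    Both are hypothetical hooks supplied by the caller.
    """
    # Order from the guidelines: default, backup, then a Fast-Plast
    # re-run plus the two remaining candidates.
    order = ["GetOrganelle", "Fast-Plast", "Fast-Plast", "NOVOPlasty", "ORG.Asm"]
    attempts = []
    for name in order:
        result = runners[name](dataset)
        attempts.append(name)
        if is_success(result):
            return name, result, attempts
    return None, None, attempts


# Stub runners standing in for real tool invocations.
stub_runners = {
    "GetOrganelle": lambda d: "fragmented",
    "Fast-Plast": lambda d: "fragmented",
    "NOVOPlasty": lambda d: "circular",
    "ORG.Asm": lambda d: "circular",
}
tool, result, attempts = assemble_with_fallback(
    "sample", stub_runners, lambda r: r == "circular"
)
```

With these stubs the procedure stops at NOVOPlasty after four attempts. If every attempt fails, the function returns None for the tool, which corresponds to the case where a reference guided approach or parameter tuning is needed.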

Ideas for future development

For further experiments, combining components from different tools might be a promising approach. For example, read scaling from chloroExtractor followed by an assembly by GetOrganelle and finally the structural resolution with Fast-Plast could combine the respective strengths of the different tools.

Moreover, the installation issues need to be mitigated using modern software distribution methods. Therefore, either containerization (docker, singularity, etc.) or install workflows (e.g. Bioconda [37]) should be established by all software packages. Otherwise, the burden of the software installation might result in scientists ignoring good tools.

Another important feature of software is a comprehensive documentation, which needs to be up-to-date and maintained. Additionally, software authors could improve the usability based on suggestions from their users.

Finally, all tools should improve their integrated guessing of default parameters, as many users avoid fine tuning those, especially for larger screening approaches. In addition, as sequencing technology is developing fast (e.g. PacBio or nanopore), tools need to be updated to avoid becoming obsolete. We hope that with ongoing software development and improved sequencing technologies, the generation of whole chloroplast assemblies from any species will become a routine technique.

Conclusion

The main assumption behind our benchmark of chloroplast assembly tools is that whole genome sequencing data are also a promising source for chloroplast assemblies. Our benchmark shows that 60 % of the data sets without available chloroplast genome have been assembled by at least one of the tools we analysed. Still, even with simulated (aka “perfect”) data, not all tools succeeded in generating complete chloroplast assemblies. Therefore, we determined the strengths and weaknesses of the specific tools and provided guidelines for the users. However, it might be necessary to combine different methods or manually explore the parameter space to obtain reliable results if a single run is not sufficient. Ultimately, large scale studies reconstructing hundreds or thousands of chloroplast genomes are now feasible using the currently available tools.

Methods

Data availability

Source code for all methods used is available at [38] and archived in zenodo under [39]. All docker images are published on [40] and are named with a leading benchmark_ (table 4).

Table 4 Docker images used in our benchmark setup

SHA256 checksums are stated in table S2.

To enable a fair comparison of all tools, we generated simulated sequencing data. Those simulated data sets are stored at [41]. This study adheres to the guidelines for computational method benchmarking [42].

Tool Selection

We included tools designed for assembling chloroplasts from whole genome paired end Illumina sequencing data. As a requirement, all tools must be available as open source software and allow execution via a command line interface. As a graphical user interface is not suitable for automated comparisons, tools only providing a graphical interface have not been included. The following tools were determined to be within the scope of this study: ORG.Asm [29], chloroExtractor [34], Fast-Plast [43], IOGA [44], NOVOPlasty [35], GetOrganelle [45], and Chloroplast assembly protocol [46].

Other related tools for assembling chloroplasts that did not meet our criteria and are therefore outside the scope of this study are for example: Organelle PBA [47], sestaton/Chloro [48], Norgal [49], and MitoBim [50].

Organelle PBA is designed for PacBio data and does not work with paired Illumina data alone. sestaton/Chloro fits our criteria, but it is flagged as work in progress and development and support seem to have ended two years ago. Norgal is a tool to extract organellar DNA from whole genome data based on a k-mer frequency approach. However the final output is a set of contigs of mixed mitochondrial and plastid origin. The suggested approach to get a finished chloroplast genome is to run NOVOPlasty on the ten longest contigs. Therefore we only included NOVOPlasty with the default settings and excluded Norgal. MitoBim is specifically designed for mitochondrial genomes. Even though there is a claim by the author that it can be used for chloroplasts as well, there is no further description on how to do that [51].

Additionally, there is a protocol for the Geneious [52] software available [53]. However, Geneious is closed source and GUI based, which is not in the scope of this study. There is also another publication describing a method for assembling chloroplasts [54]. However, the link to the software is not active anymore.

Our Setup

We wanted to use a minimum of different parameter settings for all assembly programs to enable a fair comparison. Therefore, we specified that all programs have to work on two input files, representing a data set’s forward (forward.fq) and reverse (reverse.fq) sequence file in FASTQ format. Depending on the assembler, output files with different names and locations are generated. Those files are copied and renamed to ensure that each assembly approach produces the same output file (output.fa). Additionally, we set an environment variable for all programs to control the number of allowed threads. All three requirements (defined input file names, defined output file name, thread number control via environment variable) are ensured by a simple wrapper script (wrapper.sh). Finally, for maximum reproducibility, all programs were bundled into individual docker images based on a central base image which provides all the required software. Those docker images were used to record the consumption of computational resources during the performance benchmarking on a system with four Intel CPU-E7 8867 v3 processors and 1 TB of RAM. Furthermore, all our docker images were converted into singularity containers for the quantitative measurement on simulated and real data sets. The singularity containers were built from the docker images for use in an HPC environment with Singularity v.2.5.2 [55]. All singularity containers were run on Intel® Xeon® Gold 6140 processors using the Slurm workload manager version 17.11.8 [56]. Assemblies were run on 4 threads using 10 GiB RAM with a time limit of 48 h.
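The contract enforced by the wrapper can be sketched in Python as follows. This is not the wrapper.sh script used in the benchmark; the function name, the THREADS variable name and the tool-specific command and result file are assumptions made for this illustration.

```python
import os
import shutil
import subprocess
from pathlib import Path


def run_wrapped(command, produced_file, workdir=".", threads=4):
    """Run one assembler under the shared input/output contract.

    command and produced_file are tool specific and hypothetical here;
    THREADS is an assumed name for the environment variable.
    """
    workdir = Path(workdir)
    # Requirement 1: fixed input file names.
    for name in ("forward.fq", "reverse.fq"):
        if not (workdir / name).is_file():
            raise FileNotFoundError(f"missing input file: {name}")
    # Requirement 3: thread count passed through the environment.
    env = dict(os.environ, THREADS=str(threads))
    subprocess.run(command, cwd=workdir, env=env, check=True)
    # Requirement 2: copy the tool-specific result to the common name.
    shutil.copyfile(workdir / produced_file, workdir / "output.fa")
    return workdir / "output.fa"
```

A caller would pass the assembler invocation as an argument list together with the path of the file that this assembler writes, and would always find the result in output.fa afterwards.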

Data

Simulated

To avoid the effects of sequencing errors and biological variation, we simulated perfect reads based on the A. thaliana (TAIR10) chloroplast assembly [57]. We used a sliding window approach with seqkit [58]. The exact commands are documented in 03_representative_datasets.md in [41]. For the final simulated data sets, reads based on mixtures of the A. thaliana (TAIR10) core and chloroplast genome were generated with different ratios (0:1, 1:10, 1:100, and 1:1000). Additionally, we generated data with different read lengths (150 bp and 250 bp). All simulated data sets contain exactly 2 million read pairs.
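The principle of generating error-free read pairs with a sliding window is sketched below in Python. The actual data sets were produced with seqkit using the commands referenced above; the fragment length, the step size and the linear treatment of the sequence in this sketch are illustrative assumptions.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def sliding_window_pairs(genome, read_len=150, fragment_len=500, step=1):
    """Yield error-free read pairs from windows sliding along a sequence.

    fragment_len and step are illustrative assumptions; the sequence is
    treated as linear, so windows do not wrap around the end.
    """
    for start in range(0, len(genome) - fragment_len + 1, step):
        fragment = genome[start:start + fragment_len]
        forward = fragment[:read_len]
        reverse = reverse_complement(fragment[-read_len:])
        yield forward, reverse


toy_genome = "ACGTTGCAAGGCTTAACCGG"
pairs = list(sliding_window_pairs(toy_genome, read_len=4, fragment_len=10, step=5))
```

For the 20 bp toy sequence, three windows fit, and the first pair consists of the forward read ACGT and the reverse read CTTG.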

Real

We selected real data deposited at SRA [27]. We searched for all data that matched the query (((((((“green plants”[orgn]) AND “wgs”[Strategy]) AND “illumina”[Platform]) AND “biomol dna”[Properties]) AND “paired”[Layout]) AND “random”[Selection])) AND “public”[Access] [59]. For each species with a reference chloroplast in Cp-Base [60], we selected one of those data sets. In total, this amounted to 369 data sets (table S1) representing a broad spectrum of the green plants (figure 7).

Evaluation Criteria

Computational Resources

We recorded the mean and the peak CPU usage, the peak memory consumption, and the size of the assembly folder for each program. As input data, we used different data sets comprising 25 000, 250 000 and 2 500 000 read pairs sampled from our simulated reads. We used our docker image setup (table 4) to run all assembly programs three times for each parameter setting. The different settings combined different input data and different number of threads to use (1, 2, 4 and 8).

Some programs use more CPU threads than specified. Therefore, the number of available CPUs was fixed using the CPU option of the docker run command. For each assembly setting, we recorded the peak memory consumption, the CPU usage (mean and peak) and the size of the folder where the assembly was calculated. The values of CPU and memory usage were obtained from docker. The disk usage was estimated using the GNU tool du. We used GNU parallel for queuing the different settings [61].
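The benchmark grid and the CPU restriction can be sketched in Python as follows. The input sizes, thread counts and number of replicates are taken from the text; the use of the --cpus flag of docker run, the image name, the mount point and the THREADS variable are assumptions, since the exact invocation is not given here.

```python
from itertools import product

READ_PAIRS = (25_000, 250_000, 2_500_000)
THREADS = (1, 2, 4, 8)
REPLICATES = 3


def docker_command(image, threads, data_dir):
    """Build a docker run argument list that caps the available CPUs.

    --cpus is one way to limit CPUs and is an assumption here; the image
    name, mount point and THREADS variable are hypothetical as well.
    """
    return [
        "docker", "run", "--rm",
        f"--cpus={threads}",
        "-e", f"THREADS={threads}",
        "-v", f"{data_dir}:/data",
        image,
    ]


# Every combination of input size, thread count and replicate number.
settings = list(product(READ_PAIRS, THREADS, range(1, REPLICATES + 1)))
example = docker_command("benchmark_example", 4, "/tmp/run_250000")
```

The grid contains 36 runs per assembler, which corresponds to 252 runs for the seven assemblers.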