Abstract
Chloroplasts are photosynthetic organelles in plant cells and contain their own genomic information. That genome can be utilized in different scientific fields like phylogenetics or biotechnology. Thus, different assemblers specialized in chloroplast assemblies have been developed. Those assemblers often use the output of whole genome sequencing experiments as input. Such sequencing data usually contain the complete chloroplast genome information, even if the sequencing aims for the core genome. The different assembly tools have never been systematically compared. Here we present a benchmark of seven chloroplast assembly tools, capable of succeeding in more than 60% of real data sets. Our results show significant differences between the tested assemblers in terms of generating whole chloroplast genome sequences and computational requirements. Moreover, we suggest further development to improve user experience and success rate. For reproducibility, we created docker images for each tested tool, which are available to the scientific community. Following the presented guidelines, users are able to analyze and screen data sets for chloroplast genomes using only standard computer infrastructure. Thus, large scale screening for chloroplasts as hidden treasures within genomic sequencing data is feasible.
Introduction
General introduction and motivation
Chloroplasts are essential organelles present in plant cells and the cells of some protists. Chloroplasts enable the conversion of light energy into chemical energy via photosynthesis. They harbor their own ribosomes and a circular DNA genome, usually with a size between 120 kbp and 160 kbp [1]. Because of this small size, the chloroplast genome has been an early target for sequencing. The first chloroplast genome sequences were obtained as early as 1986 [2, 3]. These early efforts elucidated the general genome organization and structure of the chloroplast DNA. Chloroplast genome content and structure are reviewed for example in [4, 5]. Chloroplast genomes are widely used for evolutionary analyses [6, 7], barcoding [8, 9, 10], and meta-barcoding [11, 12]. Interesting aspects of chloroplast genomes are their small size (120 kbp to 160 kbp [1]), caused by endosymbiotic gene transfer [13, 14], and the low number of 100 to 120 genes that are still encoded on the chloroplast genome [4]. Despite the overall high conservation of the genome sequence, there are striking differences in the gene content between different groups (e.g. the loss of the whole ndh gene family in Droseraceae [15]). Even more extreme evolutionary cases, where chloroplasts show a very low GC content and a modified genetic code, have been described [16].
These differences call for comparative genomic approaches. Given the small size, it is much easier to decipher the complete chloroplast genome than the complete core genome. For example the Arabidopsis thaliana core genome is approximately 125 Mbp in length [17, 18] while the size of the A. thaliana chloroplast genome with 154 kbp is more than 800 x smaller [19].
Even if only a single chloroplast is located inside a plant cell, several hundred copies of the chloroplast genome exist in each cell [20, 21]. Therefore, many genome sequencing projects contain chloroplast reads as a by-product. In some cases the chloroplast data is even considered contamination, and experimental protocols for reducing its content have been developed [22]. An alternative approach to improve the assembly of the core genome would be to first resolve the chloroplast genome and afterwards use this information to remove those reads that map to the chloroplast genome.
Structurally, two inverted repeats (IRA and IRB) of 10 kbp to 76 kbp divide the chloroplast genome into a large (LSC) and a small single copy (SSC) region [1]. Those large inverted repeats complicate automated resolution with short read technologies [23]. Moreover, the existence of different chloroplasts within a single individual, and thus multiple different chloroplast genomes, has been described for different plants [24, 25, 26]. Although the origin and evolutionary importance of this phenomenon, called heteroplasmy, are only poorly understood, it might hinder the assembly of whole chloroplast genomes.
Databases exist containing short read data for species where no reference chloroplast sequence is publicly available, e.g. the Sequence Read Archive at NCBI [27]. The availability of whole chloroplast genomes would enable large scale comparative studies [28]. Additionally, reconstructed full chloroplast genomes have been used as super-barcodes [29], for biotechnology applications and genetic engineering [30].
Approaches to extracting chloroplasts from whole genome data
Different strategies have been developed to assemble chloroplast genomes [31]. In general, obtaining a chloroplast genome from WGS data requires two steps. First, the chloroplast reads have to be extracted from the mixed sequencing data. The second step is the assembly and resolution of the special circular structure including the inverted repeats. The extraction of the reads can be achieved by mapping the reads to a reference chloroplast [32]. A different approach, which does not perform alignments, relies on the higher coverage of chloroplast data in the whole genome sequencing data set [33]. Here, a k-mer analysis can be used to extract the most frequent reads. An example of this is implemented in chloroExtractor [34]. A third method combines both approaches by using a reference chloroplast as seed and simultaneously assembling the reads based on k-mers [35].
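The coverage-based extraction idea can be illustrated with a minimal Python sketch. It is not the chloroExtractor implementation: the function names, the k-mer size, and the threshold are assumptions chosen for illustration, and reverse complements are ignored for brevity.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Yield all overlapping k-mers of a sequence."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def count_kmers(reads, k):
    """Count every k-mer over all reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def select_high_coverage_reads(reads, k=5, min_median_count=3):
    """Keep reads whose median k-mer count reaches the threshold.

    Reads from a high-copy genome (such as the chloroplast) consist of
    k-mers that occur more often than those of the core genome.
    The defaults are illustrative assumptions, not recommended values.
    """
    counts = count_kmers(reads, k)
    selected = []
    for read in reads:
        read_counts = [counts[km] for km in kmers(read, k)]
        if read_counts and median(read_counts) >= min_median_count:
            selected.append(read)
    return selected
```

With four copies of one toy read and a single unrelated read, only the four high-coverage copies would be kept. Real data would need a much larger k and a threshold derived from the observed k-mer distribution.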
Purpose and scope of this study
The goal of this study is to compare the effectiveness and efficiency of existing open source command-line tools to de-novo assemble whole chloroplast genomes from raw genomic data sets with minimal configuration. This includes no need for extensive data preparation, no need for a specific reference (apart from A. thaliana), no need to change default parameters, no manual finishing. We further restricted our benchmark to paired end Illumina data sets as these are routinely generated by modern sequencing platforms [36].
In our opinion this reflects the most common use cases: (1) a user trying a tool quickly without digging into options for fine tuning and (2) large scale automatic applications. Still, we acknowledge that the performance of the tools might be significantly improved by optimizing parameters (and references if applicable) for each data set specifically. However, an exhaustive comparison, including tuning of all possible parameters for each tool, was out of scope for this study.
Our results will enable the discovery of novel chloroplast genomes as well as an assembly of inter/intra-individual differences in the respective chloroplast genomes.
Results
Performance metrics
Time requirements
In terms of run time, massive differences between the different tools have been observed. Apart from tool-specific differences, input data and number of threads had huge impact. The observed run times varied from a few minutes to several hours (figure 1).
Some assemblies failed to finish within the time limit we set (48 h). On average, IOGA and Fast-Plast took the longest to generate assemblies, followed by ORG.Asm and GetOrganelle. The most time-efficient tool was chloroExtractor, which on average is a little faster than NOVOPlasty and Chloroplast assembly protocol.
Not all the tools were able to benefit from having access to multiple threads. Both NOVOPlasty and ORG.Asm take about the same time independent of being able to utilize 1, 2, 4, or 8 threads. Chloroplast assembly protocol, chloroExtractor, GetOrganelle and Fast-Plast all profit from multi-threading (figures 1 and 2 and tables S3 to S5).
Memory and CPU Usage
The peak and mean CPU usage, as well as peak memory and disk usage, have been recorded for all assemblers based on the same input data set and number of threads (figure 2 and tables S3 to S5). The peak memory usage was mainly influenced by the size of the input data, with the exception of chloroExtractor and IOGA. Those two assemblers seem to have a memory usage pattern that is less influenced by the size of the data. The number of allowed threads had only a limited impact on the peak memory usage. Nevertheless, all programs profit from a higher number of threads if the size of the input data is increased. In contrast, the disk usage is independent of input size and number of threads for all assemblers.
Qualitative
The user experience of most tools was evaluated as mainly Good (table 1). However, a few points of critique remained. Two minor dependencies were missing in the GetOrganelle installation instructions and there was no test data available. Additionally, an issue occurred when running it on an A. thaliana data set. We are currently in the process of resolving this with the authors.
The Fast-Plast installation instructions were missing some dependencies. Like GetOrganelle, Fast-Plast does not offer a test data set or a tutorial, except for some example commands.
The ORG.Asm installation instructions did not work. We found some issues, which are probably related to the requirement of Python 3.7. There is a tutorial where sample data is available. However, following the instructions resulted in a segmentation fault. We found a workaround for this bug and contacted the authors.
The main critique point of NOVOPlasty was the lack of a test data set with instructions. This was fixed by the authors after we contacted them. Additionally, NOVOPlasty uses a custom license, where an OSI approved license would be preferred.
The chloroExtractor does come with a test data set and a short tutorial. However, it is currently not possible to evaluate the results of the test run.
The IOGA installation instructions were missing many dependencies. Also, there was no test data or tutorial available and there is no license assigned to it. Since there was no update to the GitHub repository for the last three years, the project can be seen as inactive. After contacting the authors, they promised to resolve the mentioned issues.
As with many of the other tools, the installation instructions for the Chloroplast assembly protocol were missing some dependencies. The list was updated after we contacted the authors. This tool does come with a test data set; however, a note about the expected outcome is missing. A more extensive tutorial is provided. The description of the parameters is short, but sufficient.
Quantitative
Simulated data
The only assembler obtaining perfect results according to our score for the simulated data sets is GetOrganelle (figure 3 and table 2). IOGA and Chloroplast assembly protocol showed the worst performance, being unable to fully assemble a single chloroplast out of 14 runs. NOVOPlasty performed second best with scores above 80 for all data sets, only failing to resolve the contigs into one single circular chloroplast assembly. The overall performance is best, when the input data consists purely of chloroplast reads. Only IOGA and Chloroplast assembly protocol failed to deliver any results under this scenario once. In general, no clear correlation between either length of the input reads or the ratio of core vs chloroplast reads and the performance of the different assemblers can be observed.
Real data sets
Concerning the performance of the assemblers on the real data sets, we observed considerable differences in the median score (figure 4). The highest scores were achieved by GetOrganelle with a median of 99.7 and 199 circular assemblies out of a total of 356 assemblies that resulted in an output (table 3). The performance of GetOrganelle is followed by Fast-Plast, NOVOPlasty, ORG.Asm, and chloroExtractor. Fast-Plast slightly outperforms NOVOPlasty and ORG.Asm in terms of score and, with 114, produced about twice as many perfectly assembled chloroplast genomes (NOVOPlasty produced 66 and ORG.Asm 55 circular genomes). IOGA and Chloroplast assembly protocol were both unable to assemble a circular, single-contig genome (table 3), consequently resulting in the lowest mean and median scores (figure 5).
Consistency
Consistency was tested by re-running assemblies and comparing the scores of the two runs (figure 6). Replicates that did not produce an output were manually scored as 0. GetOrganelle was the only tool that succeeded in obtaining similar scores for all assemblies, without producing any completely unsuccessful assemblies for this subset of data. Except for Fast-Plast, all the other tools had at least one assembly that was unsuccessful in one run, but produced an output in the other. Notably, IOGA appears to have a tendency to perform differently in independent runs. Here, more than 10% of the assemblies failed in one run only.
Both Fast-Plast and NOVOPlasty tend to show minor changes in the assembly when the overall performance is comparably good, leading to the arrow-shaped scatter plots. chloroExtractor and Chloroplast assembly protocol appear to be the most robust assemblers, having only few deviations between the two runs.
Discussion
We aimed to generate an overall performance score for the different chloroplast assemblers, but depending on distinct downstream applications, the different criteria assessed in this work need to be weighted differently. For example, ease of installation and use might not be a big concern if the tool is installed once and integrated in an automated pipeline. On the other hand, this factor alone might prevent other users from being able to use the tool in the first place. Similarly, computational requirements or run time might be less relevant if the goal is to assemble a single chloroplast for further analysis, but they are essential if hundreds or thousands of samples should be processed in parallel for a large scale study. Eventually, both ease of use and run time are irrelevant if the tool is not able to successfully accomplish its task. Also, the scope of this study needs to be considered when interpreting the guidelines below. In particular, we evaluated all tools under the assumption that they are used in the most basic form (default parameters, no hand selected reference, no pre-processing of the data or post-processing of the result, restricted run time). It is important to note that any tool might perform significantly differently if the above mentioned parameters are fine-tuned for a specific data set.
The overall best success rate, both on simulated and real data, was achieved by GetOrganelle followed by Fast-Plast. Both tools complement each other, as each is able to successfully reconstruct a full chloroplast in cases where the other tool fails. In rare cases NOVOPlasty or ORG.Asm is the only tool to succeed. The tools Fast-Plast, NOVOPlasty, and ORG.Asm produce the most variable results, thus rerunning the tool after a failed attempt might be successful. chloroExtractor yields only few complete chloroplast assemblies, but also requires only few resources. It is easy to install and use and thus could be considered a good option for a quick first try. Both IOGA and Chloroplast assembly protocol have the worst performance of all tools tested and fail to return reliable chloroplast assemblies.
Additionally, we observed no phylogenetic pattern in the success rate of the assemblers (figure 7). This indicates that the tools are generally able to reconstruct chloroplasts across the plant kingdom, even without a reference or with a fixed A. thaliana reference.
Guidelines for the end-user
Given these results, our recommendation is to use GetOrganelle as the default option and, in case of failure, Fast-Plast as a backup solution. If both programs fail, it is sensible to re-run Fast-Plast and additionally try NOVOPlasty and ORG.Asm. This procedure maximizes the chance to effectively and efficiently recover the circular chloroplast genome from mixed genomic data. If none of these four assemblers produces sensible results, a reference guided approach and tweaking of the default parameters might be the solution. Here, it is not possible to provide general guidelines, as the procedure will differ for different data sets. For an automated approach, running GetOrganelle and Fast-Plast in parallel appears to be a good trade-off between success rate and use of resources.
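The recommended order of attempts can be written down as a short Python sketch. The callable `run_assembler` is hypothetical: it stands for whatever mechanism starts a tool (for example one of the containers) and returns an assembly on success or None on failure.

```python
def recover_chloroplast(run_assembler, dataset):
    """Try assemblers in the recommended order; return (tool, assembly) or None.

    `run_assembler(tool, dataset)` is a hypothetical callable that returns
    an assembly (any truthy object) on success and None on failure.
    """
    # First the default tool, then the backup.
    for tool in ("GetOrganelle", "Fast-Plast"):
        result = run_assembler(tool, dataset)
        if result:
            return tool, result
    # Both failed: re-run Fast-Plast and additionally try two more tools.
    for tool in ("Fast-Plast", "NOVOPlasty", "ORG.Asm"):
        result = run_assembler(tool, dataset)
        if result:
            return tool, result
    # Nothing worked: a reference guided approach or parameter tuning is needed.
    return None
```

The second Fast-Plast attempt is deliberate, since its results varied between runs. For automated screening, the first loop could instead start both tools in parallel.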
Ideas for future development
For further experiments, combining components from different tools might be worthwhile. For example, read scaling from chloroExtractor followed by an assembly by GetOrganelle and finally the structural resolution with Fast-Plast could be a promising approach, combining the respective strengths of the different tools.
Moreover, the installation issues need to be mitigated by modern software distribution. Therefore, either containerization (docker, singularity, etc.) or install workflows (e.g. bio-conda [37]) should be established by all software packages. Otherwise, the burden of the software installation might result in scientists ignoring good tools.
Another important feature of software is a comprehensive documentation, which needs to be up-to-date and maintained. Additionally, software authors could improve the usability based on suggestions from their users.
Finally, all tools should improve their integrated guessing of default parameters, as many users avoid fine tuning those, especially for larger screening approaches. Last, as sequencing technology is developing fast (e.g. PacBio or nanopore), tools need to be updated so that they do not become obsolete. The hope is that, with ongoing software development and improved sequencing technologies, the generation of whole chloroplast assemblies from any species will become a routine technique.
Conclusion
The main assumption for our study to benchmark different chloroplast assembly tools is that whole genome sequencing data are also a promising source for chloroplast assemblies. Our benchmark shows that 60 % of the data sets without an available chloroplast genome have been assembled by at least one of the tools we analysed. Still, even with simulated (aka “perfect”) data, not all tools succeeded in generating complete chloroplast assemblies. Therefore, we determined the strengths and weaknesses of the specific tools and provided guidelines for the users. However, it might be necessary to combine different methods or manually explore the parameter space to obtain reliable results if a single run is not sufficient. Ultimately, large scale studies reconstructing hundreds or thousands of chloroplast genomes are now feasible using the currently available tools.
Methods
Data availability
Source code for all methods used is available at [38] and archived in zenodo under [39]. All docker images are published on [40] and are named with a leading benchmark_ (table 4).
To enable a fair comparison of all tools, we generated simulated sequencing data. Those simulated data sets are stored at [41]. This study adheres to the guidelines for computational method benchmarking [42].
Tool Selection
We included tools designed for assembling chloroplasts from whole genome paired end Illumina sequencing data. As a requirement, all tools must be available as open source software and allow execution via a command line interface. As a graphical user interface is not suitable for automated comparisons, tools only providing a graphical interface have not been included. The following tools were determined to be within the scope of this study: ORG.Asm [29], chloroExtractor [34], Fast-Plast [43], IOGA [44], NOVOPlasty [35], GetOrganelle [45], and Chloroplast assembly protocol [46].
Other related tools for assembling chloroplasts that did not meet our criteria and are therefore outside the scope of this study are for example: Organelle PBA [47], sestaton/Chloro [48], Norgal [49], and MitoBim [50].
Organelle PBA is designed for PacBio data and does not work with paired Illumina data alone. sestaton/Chloro fits our criteria, but it is flagged as work in progress and development and support seem to have ended two years ago. Norgal is a tool to extract organellar DNA from whole genome data based on a k-mer frequency approach. However the final output is a set of contigs of mixed mitochondrial and plastid origin. The suggested approach to get a finished chloroplast genome is to run NOVOPlasty on the ten longest contigs. Therefore we only included NOVOPlasty with the default settings and excluded Norgal. MitoBim is specifically designed for mitochondrial genomes. Even though there is a claim by the author that it can be used for chloroplasts as well, there is no further description on how to do that [51].
Additionally, there is a protocol for the Geneious [52] software available [53]. However, Geneious is closed source and GUI based, which is not in the scope of this study. There is also another publication describing a method for assembling chloroplasts [54]. However, the link to the software is not active anymore.
Our Setup
We wanted to use a minimum of different parameter settings for all assembly programs to enable a fair comparison. Therefore, we decided to specify that all programs have to work based on two input files, representing a data set’s forward (forward.fq) and reverse (reverse.fq) sequence file in FASTQ format. Depending on the assembler, output files with different names and locations are generated. Those different files are copied and renamed to ensure that each assembly approach produces the same output file (output.fa). Additionally, we set an environment variable for all programs to control the number of allowed threads. All three requirements (defined input file names, defined output file name, thread number control via environment variable) are ensured by a simple wrapper script (wrapper.sh). Finally, for maximum reproducibility, all programs have been bundled into individual docker images based on a central base image which provides all the required software. Those docker images were used to record the consumption of computational resources; this performance benchmarking was run on a system with four Intel CPU-E7 8867 v3 processors and 1 TB of RAM. Furthermore, all our docker images have been converted into singularity containers for the quantitative measurement on simulated and real data sets. Singularity containers were built from the docker images for usage in an HPC environment using Singularity v.2.5.2 [55]. All singularity containers were run on Intel® Xeon® Gold 6140 Processors using a Slurm workload manager version 17.11.8 [56]. Assemblies were run on 4 threads using 10 GiB RAM with a time limit of 48 h.
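The contract enforced by the wrapper can be sketched in Python as follows. This is not the content of wrapper.sh: the environment variable name THREADS and the tool-specific file name passed to the function are placeholders, since neither is specified in this section.

```python
import os
import shutil
from pathlib import Path


def thread_count(env=None, variable="THREADS", default=1):
    """Read the allowed thread count from an environment variable.

    The variable name THREADS is a placeholder (assumption).
    """
    env = os.environ if env is None else env
    return int(env.get(variable, default))


def standardize_output(workdir, tool_output, target="output.fa"):
    """Copy a tool-specific result file to the common output name.

    Returns the path of the copy, or None if the tool produced no result.
    """
    workdir = Path(workdir)
    source = workdir / tool_output
    if not source.is_file():
        return None
    destination = workdir / target
    shutil.copyfile(source, destination)
    return destination
```

A wrapper built this way would start the assembler on forward.fq and reverse.fq with `thread_count()` threads and then call `standardize_output` with the file name that the respective tool writes.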
Data
Simulated
To avoid effects of sequencing errors and biological variance, we simulated perfect reads based on the A. thaliana (TAIR10) chloroplast assembly [57]. We used a sliding window approach with seqkit [58]. The exact commands are documented in 03_representative_datasets.md in [41]. For the final simulated data sets, reads based on mixtures of the A. thaliana (TAIR10) core and chloroplast genome were generated with different ratios (0:1, 1:10, 1:100, and 1:1000). Additionally, we generated data with different read lengths (150 bp and 250 bp). All simulated data sets contain exactly 2 million read pairs.
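The sliding window idea behind the perfect reads can be illustrated in plain Python. This sketch does not reproduce the seqkit commands from 03_representative_datasets.md; the fragment length and step size are assumed parameters, and the genome is treated as linear for simplicity.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def sliding_read_pairs(genome, read_length, fragment_length, step):
    """Yield error-free (forward, reverse) read pairs from sliding fragments.

    Each window of `fragment_length` bases is one fragment. The forward read
    is its start; the reverse read is the reverse complement of its end.
    """
    for start in range(0, len(genome) - fragment_length + 1, step):
        fragment = genome[start:start + fragment_length]
        forward = fragment[:read_length]
        reverse = reverse_complement(fragment[-read_length:])
        yield forward, reverse
```

For example, `list(sliding_read_pairs("AACCGGTTAC", 3, 6, 2))` yields three read pairs, one per window start (0, 2, and 4).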
Real
We selected real data deposited at SRA [27]. We searched all data that matched (((((((“green plants”[orgn]) AND “wgs”[Strategy]) AND “illumina”[Platform]) AND “biomol dna”[Properties]) AND “paired”[Layout]) AND “random”[Selection])) AND “public”[Access] [59]. For each species with a reference chloroplast in Cp-Base [60], we selected one data set of those. In total, this accumulated to 369 data sets (table S1) representing a broad spectrum of the green plants (figure 7).
Evaluation Criteria
Computational Resources
We recorded the mean and the peak CPU usage, the peak memory consumption, and the size of the assembly folder for each program. As input data, we used different data sets comprising 25 000, 250 000 and 2 500 000 read pairs sampled from our simulated reads. We used our docker image setup (table 4) to run all assembly programs three times for each parameter setting. The different settings combined different input data and different number of threads to use (1, 2, 4 and 8).
Some programs will use more CPU threads than specified; therefore, the number of available CPUs has been fixed using the CPU option of the docker run command. For each assembly setting, we recorded the peak memory consumption, the CPU usage (mean and peak CPU usage) and the size of the folder where the assembly was calculated. The values of CPU and memory usage have been obtained from docker. The disk usage was estimated using the GNU tool du. We used GNU parallel for queuing of the different settings [61].
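The resulting grid of benchmark runs can be enumerated with a few lines of Python. The read pair counts, thread counts, and the three repetitions are taken from the description above; the dictionary layout is only an illustrative assumption and not the format used by the benchmark scripts.

```python
from itertools import product

READ_PAIRS = (25_000, 250_000, 2_500_000)
THREADS = (1, 2, 4, 8)
REPLICATES = 3


def benchmark_settings(tools):
    """Return one dict per run: tool x input size x thread count x replicate."""
    return [
        {"tool": tool, "read_pairs": pairs, "threads": threads, "replicate": rep}
        for tool, pairs, threads, rep in product(
            tools, READ_PAIRS, THREADS, range(1, REPLICATES + 1)
        )
    ]
```

Three input sizes, four thread counts, and three repetitions give 36 runs per tool.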
Qualitative
The qualitative evaluation is mainly based on the reviewer guidelines for the Journal of Open Source Software (JOSS) [62]. To create a standard environment, all tools were tested in a fresh default installation of Ubuntu 18.04.2 running in a virtual machine (VirtualBox Version 5.2.18_Ubuntu r123745). We chose this setup instead of the docker container, because it resembles a typical user environment better than the minimal docker installation. The tools were installed according to their installation instructions and the provided tutorial or example usage was executed. During the evaluation, the following questions were asked:
Is the tool easy to install?
Is there a way to test the installation or a tutorial on how to use the tool?
Is there a good documentation on the parameter settings?
Is the tool maintained (issues answered, implementation of new features)?
Is the tool Open Source?
These questions were answered with Good, Okay or Bad, depending on the quality of the result. For example, a Good installation utilizes an automated package or dependency management like apt, CRAN, docker, etc. An Okay installation procedure provides a custom script to install everything or at least lists all dependencies. A Bad installation procedure fails to list important dependencies or produces errors, that prevent a successful installation without exhaustive debugging.
After an initial evaluation, we contacted all authors via their GitHub or GitLab issue tracking to communicate potential flaws we found.
Quantitative
For each data set and assembler, the generated chloroplast genome was compared to the respective reference genome using a pairwise alignment obtained with minimap2 v2.16 [63]. Based on these alignments, a score is calculated as shown in equation (1). The assemblies were scored on a scale from 0 to 100, with 100 being the best and 0 the worst possible score. Four different metrics were incorporated, each contributing to the total score: completeness, correctness, repeat resolution and continuity. These metrics are similar in concept to those used in the Assemblathon 2 project: coverage, validity, multiplicity, and parsimony [64].
The completeness is estimated as the coverage of the assembled chloroplast genome versus the reference genome (covref). It represents how many bases of the query genome can be mapped to its respective reference genome. Secondly, we mapped the reference genome against the query. The coverage of the reference genome (covqry) is used as a measurement for the correctness of the assembly. The repeat resolution is estimated from the size difference of the assembly and the reference genome, leading to values between 0 and 1. The fourth metric used is the continuity, represented by the number of contigs. A perfect score is achieved if one circular chromosome was assembled, while the score gets worse with an increasing number of contigs.
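To make the structure of such a score concrete, the following Python sketch combines the four metrics with equal weights. It is an assumed form and not necessarily identical to equation (1): the equal weighting, the length ratio used for repeat resolution, and the reciprocal contig count used for continuity are illustrative choices.

```python
def assembly_score(cov_ref, cov_qry, assembly_length, reference_length, n_contigs):
    """Combine four metrics, each between 0 and 1, into a score from 0 to 100.

    Assumed form for illustration only; all weights and component
    definitions are assumptions, not the published formula.
    """
    if n_contigs < 1 or assembly_length <= 0 or reference_length <= 0:
        return 0.0  # no usable assembly
    completeness = cov_ref
    correctness = cov_qry
    # Ratio of the shorter to the longer sequence: 1 for equal sizes.
    repeat_resolution = min(
        assembly_length / reference_length, reference_length / assembly_length
    )
    # 1 for a single contig, smaller for fragmented assemblies.
    continuity = 1 / n_contigs
    total = completeness + correctness + repeat_resolution + continuity
    return 100 * total / 4
```

Under these assumptions, a complete single-contig assembly of the correct length scores 100, while the same sequence split into two contigs scores 87.5.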
Consistency
To ensure consistency of the obtained results, we randomly chose 100 data sets that in the previous runs resulted in outputs for most of the assemblers and ran them again with the same parameters as before. The resulting assemblies were scored again as described, and the scores of the first and the second run were compared to each other. This information is important to assess the robustness of the different programs.
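The comparison of two runs can be sketched in Python as follows. The function and the data set names in the example are hypothetical; runs without output are scored as 0, as described in the results.

```python
def compare_runs(first, second):
    """Compare per-data-set scores of two runs of the same assembler.

    `first` and `second` map data set names to scores (0 to 100).
    A data set missing from a run is treated as a failed run (score 0).
    Returns the score differences and the data sets that failed in
    exactly one of the two runs.
    """
    differences = {}
    failed_in_one_run = []
    for name in sorted(set(first) | set(second)):
        a = first.get(name, 0.0)
        b = second.get(name, 0.0)
        differences[name] = b - a
        if (a == 0) != (b == 0):
            failed_in_one_run.append(name)
    return differences, failed_in_one_run
```

For example, with scores `{"A": 100.0, "B": 75.0, "C": 0.0}` in the first run and `{"A": 100.0, "B": 50.0, "C": 87.5}` in the second, only data set C is reported as having failed in a single run.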
Competing interests
The authors disclose that SP, NT, FF, and MJA are developers of chloroExtractor, one of the tools benchmarked in this article. The authors exercised caution not to let this fact unfairly influence their judgment and recommendations. JF, NT, and MJA are affiliated with the for-profit organization AnaLife Data Science.
Authors’ contributions
MJA and FF conceived of the presented idea and supervised the findings of this manuscript. SP and FF created the docker images. NT performed the qualitative analysis for all assemblers. MJA prepared the simulated and real data sets. JAF assembled the real data sets. FF ran the performance assemblies on the simulated data sets. All authors developed the score model. JAF and MJA implemented the score model and prepared the figures. All authors discussed the results and contributed to the final manuscript.
Availability of data and materials
The supplemental material is available from Zenodo [65]. The simulated data set is available from Zenodo [41]. All program code is available via Zenodo [39] or from GitHub [38]. The input data sets can be generated using the raw reads from NCBI SRA (links for each data set in table S1). The resulting assemblies are available from Zenodo [66].