Abstract
Background: Analysis of mixed microbial communities using metagenomic sequencing experiments requires multiple preprocessing and analytical steps to interpret the microbial and genetic composition of samples. Analytical steps include quality control, adapter trimming, host decontamination, metagenomic classification, read assembly, and alignment to reference genomes.
Results: We present a modular and user-extensible pipeline called Sunbeam that performs these steps in a consistent and reproducible fashion. It can be installed in a single step, does not require administrative access to the host computer system, and can work with most cluster computing frameworks. We also introduce Komplexity, a software tool to eliminate potentially problematic, low-complexity nucleotide sequences from metagenomic data. A unique component of the Sunbeam pipeline is an easy-to-use extension framework that enables users to add custom processing or analysis steps directly to the workflow. The pipeline and its extension framework are well documented, in routine use, and regularly updated.
Conclusions: Sunbeam provides a foundation to build more in-depth analyses and to enable comparisons in metagenomic sequencing experiments by removing problematic, low-complexity reads and standardizing post-processing and analytical steps. Sunbeam is written in Python using the Snakemake workflow management software and is freely available at github.com/sunbeam-labs/sunbeam under the GPLv3.
Background
Metagenomic shotgun sequencing involves isolating DNA from a mixed microbial community of interest, then sequencing deeply into DNAs drawn randomly from the mixture. This is in contrast to marker gene sequencing (e.g., the 16S rRNA gene of bacteria), where a specific target gene region is amplified and sequenced. Metagenomic sequencing has enabled critical insights in microbiology [1-9], especially in the study of virus and bacteriophage communities [10-15], and is beginning to be used in clinical diagnosis [16-19]. However, an ongoing challenge is analyzing and interpreting the resulting large datasets in a standard and reliable fashion [20-27].
A common practice to investigate microbial metagenomes is to use Illumina sequencing technology to obtain a large number of short (100-250 base pair) reads from fragmented DNA isolated from a sample of interest. After sequence acquisition, several post-processing steps must be carried out before the sequences can be used to gain insight into the underlying biology [25, 28]. Some steps are common to many sequencing experiments, like quality control and sequencing adapter trimming, while others are unique to shotgun metagenomic sequencing, such as attributing reads to gene ontologies.
Researchers have many tools at their disposal for accomplishing each post-processing step and will frequently encounter multiple parameters in each tool that can change the resulting output and downstream analysis, sometimes radically. Varying parameters or tools between analyses makes it challenging to compare the results of different metagenomic sequencing experiments. Conversely, employing a consistent workflow across studies ensures that experiments are comparable and that the downstream analysis is reproducible, as emphasized in ref [25]. Documentation of software, databases and parameters used is an essential element of this practice; otherwise, the benefits of consistent and reproducible workflows are lost to history.
A metagenomic post-processing workflow should have the following qualities to maximize its utility and flexibility: it should be deployable on a wide range of computers; it should allow simple configuration of software parameters and reference databases; it should provide robust error handling and the ability to resume after interruptions; it should be modular, so that unnecessary steps can be skipped; and it should allow users to add new procedures. The ability to deploy the workflow on a wide range of computer systems ensures that all processing steps can be repeated in different labs with different computing setups, and gives researchers the flexibility to choose between institutional and cloud computing resources. Similarly, recording running parameters in configuration files allows the use of experiment-specific software parameters and serves as documentation for future reference.
Several features contribute to efficient data analysis. Errors or interruptions in the workflow should not require restarting from the beginning: sequencing experiments produce large amounts of data, so repeating processing steps is time-consuming and potentially expensive. In addition, not all steps in a workflow are necessary for all experiments, and some experiments require custom processing. To handle experiments appropriately, the workflow should provide an easy way to skip unnecessary steps but run them later if needed. To make the framework widely useful, users must be able to add new steps into the workflow straightforwardly and share them with others. Several pipelines achieve many of these goals [17, 29-31], but they did not meet our needs for greater flexibility in processing metagenomic datasets and long-term reproducibility of analyses.
Here, we introduce Sunbeam, an easily deployable and configurable pipeline that produces a consistent set of post-processed files from metagenomic sequencing experiments. Sunbeam is self-contained and installable on any modern Linux computer without pre-existing dependencies or administrator privileges. It features robust error handling, task resumption, and parallel computing capabilities, resulting from its implementation in the Snakemake workflow language [32]. Nearly all steps are configurable, with reasonable pre-specified defaults, allowing rapid deployment without extensive parameter tuning. Sunbeam is extensible via a simple mechanism that allows new steps to be added at any point in the workflow.
In addition, Sunbeam features custom software that allows it to robustly process data from challenging sample types, including samples with abundant low-quality or host-derived sequences. These include custom-tuned host-read removal steps for any number of host or contaminant genomes, and Komplexity, a novel sequence complexity analysis program that rapidly and accurately removes problematic low-complexity reads before downstream analysis. Reads with low sequence complexity are common in vertebrate-derived samples with low microbial biomass. Microsatellite DNA sequences make up a significant proportion of the human genome and are highly variable between individuals [33-35], compounding the difficulty of removing them by alignment against a single reference genome. We developed Komplexity because existing tools for analyzing nucleotide sequence complexity [36-38] did not meet our needs for speed, removal of spurious hits, and native processing of FASTQ files.
Sunbeam is mostly implemented in Python and Rust and is licensed under the GPLv3. It is freely available at https://github.com/sunbeam-labs/sunbeam. Documentation is available at http://sunbeam.readthedocs.io.
Implementation
Installation
Sunbeam manages and installs all of its own software dependencies and only requires Linux to run. Installation is performed by downloading the software from its repository and running “install.sh”. Installation does not require administrator privileges. Software dependencies are automatically installed in an isolated environment to avoid conflicts with existing software outside the pipeline.
Sunbeam architecture
Sunbeam comprises a set of discrete steps that take specific files as inputs and produce other files as outputs. Because Sunbeam is implemented in the Snakemake workflow framework, the dependencies between steps are determined at runtime. This allows steps that do not rely on each other to operate independently on separate processors or compute nodes. It also enables robust error handling: steps that fail or are interrupted do not cause independent steps to fail, and interrupted steps can be resumed without starting from scratch as long as the required input files exist.
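To illustrate how Snakemake determines this ordering, consider a minimal two-rule sketch (the rule names, paths, and commands below are illustrative, not taken from Sunbeam's rule files). Because the second rule consumes the first rule's output, requesting reports/<sample>.txt causes the trimming rule to run first; if the trimmed file already exists from an interrupted run, it is not regenerated.

    # 'summarize' depends on 'trim' only through its input file; Snakemake
    # infers the order at runtime and skips steps whose outputs already exist.
    rule trim:
        input: "raw/{sample}.fastq"
        output: "trimmed/{sample}.fastq"
        shell: "cutadapt -q 20 -o {output} {input}"

    rule summarize:
        input: "trimmed/{sample}.fastq"
        output: "reports/{sample}.txt"
        shell: "wc -l {input} > {output}"

Rules for different samples share no files, so Snakemake is free to schedule them concurrently across processors or compute nodes.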
Sunbeam groups output files conceptually into separate folders, providing a logical output structure. Users can request specific outputs separately or as a group, and the pipeline runs only the steps required to produce the desired files. This allows the user to skip or re-run any part of the pipeline in a modular fashion.
By default, Sunbeam performs the following preliminary operations, in order, on raw, demultiplexed Illumina sequencing reads:
1. Quality control: Adapter sequences are removed and bases are quality filtered using the Trimmomatic [39] and Cutadapt [40] software. Read pairs surviving quality filtering are kept. Read quality is assessed using FastQC [41] and summarized in separate reports.
2. Low-complexity masking: Sequence complexity in each read is assessed using Komplexity, a novel k-mer-based complexity algorithm described below. Reads that fall below a user-customizable threshold are removed. Logs of the number of reads removed are written for later inspection.
3. Host read decontamination: Reads are mapped against a user-specified set of host or contaminant sequences using bwa [42]. Reads that map to any of these sequences within certain identity and length thresholds are removed. The numbers of reads removed are logged for later inspection.
After this initial quality-control process, multiple optional downstream steps can be performed independently. In the classify step, the decontaminated and quality-controlled reads are classified taxonomically using Kraken [43]. In the assembly step, reads from each sample are assembled into contigs using MEGAHIT [44]. Contigs above a pre-specified length are annotated for circularity, and open reading frames (ORFs) are predicted using Prodigal [45]. The contigs are then searched against any number of user-specified nucleotide or protein BLAST [46] databases, using both the entire contig and the putative ORFs, and the results are summarized into reports for each sample. Finally, in the mapping step, quality-controlled reads are mapped using bwa to any number of user-specified reference genomes or gene sets, and the resulting BAM files are sorted and indexed using samtools [47].
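As a sketch of how such a mapping step can be expressed as a Snakemake rule (the rule name and file paths below are hypothetical, not Sunbeam's actual rule; the reference is assumed to have been indexed beforehand with bwa index):

    # Align quality-controlled read pairs to a reference with bwa mem,
    # sort the alignments with samtools, and index the sorted BAM file.
    rule map_reads:
        input:
            ref="refs/{genome}.fasta",
            r1="qc/decontam/{sample}_1.fastq.gz",
            r2="qc/decontam/{sample}_2.fastq.gz"
        output:
            bam="mapping/{genome}/{sample}.sorted.bam",
            bai="mapping/{genome}/{sample}.sorted.bam.bai"
        threads: 4
        shell:
            "bwa mem -t {threads} {input.ref} {input.r1} {input.r2} "
            "| samtools sort -o {output.bam} - "
            "&& samtools index {output.bam}"

Expressing the step this way lets Snakemake parallelize alignments across samples and reference genomes while tracking each output file.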
Standard outputs from Sunbeam include reads from each step of the quality-control process, taxonomic assignments for each read, contigs built from each sample, gene predictions, and alignment files of all reads to any number of reference genomes. Most rules produce logs of their operation for later inspection and summary.
Installation and versioning
We designed Sunbeam to be as simple as possible to install. As noted above, installation requires only copying the software repository and running the installation script, which handles all dependencies and creates the isolated Conda environment, installing Conda itself if necessary. At no point are administrative rights required on the host computer. The only requirements are internet connectivity, Linux, and the Bash shell.
We have also incorporated a robust upgrade and semantic versioning system into Sunbeam. Specifically, the set of output files and configuration file options are treated as fixed between major versions of the pipeline to maintain compatibility. Any changes that would alter the format or structure of the output folder, or would break compatibility with previous configuration files, occur only in a major version increase (e.g., from version 1.0.0 to 2.0.0). Minor changes, optimizations, or bug fixes that do not alter the output structure or configuration file increase the minor or patch version number (e.g., from v1.0.0 to v1.1.0). To prevent unexpected errors, the software checks the version of the configuration file before running and stops if the file is from a previous major version. To facilitate upgrading between versions of Sunbeam, the same installation script can install new versions of the pipeline in place, and we provide a utility to upgrade configuration files between major versions.
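The version guard can be summarized in a few lines of Python (a minimal sketch of the behavior described above, not Sunbeam's actual implementation):

    # Stop before running if the configuration file's major version does
    # not match the pipeline's, since output formats and config options
    # are only guaranteed stable within a major version.
    def check_config_version(config_version: str, pipeline_version: str) -> None:
        config_major = int(config_version.split(".")[0])
        pipeline_major = int(pipeline_version.split(".")[0])
        if config_major != pipeline_major:
            raise SystemExit(
                f"Config file is from v{config_version}, but this is Sunbeam "
                f"v{pipeline_version}; run the config upgrade utility first.")

    check_config_version("1.2.1", "1.4.0")  # same major version: proceeds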
To ensure the stability of the output files and expected behavior of the pipeline, we built a robust integration testing procedure into Sunbeam’s development workflow. This integration test checks that Sunbeam is installable on a system, produces the expected set of output files, and correctly handles various configurations and inputs. The test is run through a continuous integration system that is triggered upon any commit to the Sunbeam software repository, and only changes that pass the integration tests are merged into the ‘stable’ branch used by end-users.
Extensions
The Sunbeam pipeline can be extended by users. Extensions take the form of supplementary rules written in the Snakemake format and define the additional steps to be executed. Optionally, two other files may be provided: one listing additional software requirements, and another giving additional configuration options. Extensions can optionally run in a separate software environment, which enables the use of tools with requirements that conflict with Sunbeam’s. To integrate these extensions, the user copies the files into Sunbeam’s extensions directory, where they are automatically integrated into the workflow during runtime. The extension platform is tested as part of our integration test suite.
User extensions can be as simple or complex as desired and have minimal boilerplate: an extension with no additional dependencies can be as short as six lines of code. Because extensions are integrated directly into the main Sunbeam environment, they have access to the same environment variables and resources as the primary pipeline and gain the same error-handling benefits. To make it easy for users to create their own extensions, we provide an extension template on our GitHub page (https://github.com/sunbeam-labs/sbx_template) as well as a number of useful prebuilt extensions at https://github.com/sunbeam-labs. We created extensions that allow users to run alternate metagenomic read classifiers like Kaiju [48] or MetaPhlAn2 [49], visualize read mappings to reference genomes with IGV [50], and format Sunbeam outputs for use with downstream analysis pipelines like Anvi’o [51].
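To illustrate the scale involved, a hypothetical single-rule extension might look like the following (the rule, paths, and command are invented for illustration and are not one of the published extensions):

    # sbx_demo: count reads in each decontaminated FASTQ file. awk prints
    # the header line of each four-line FASTQ record; wc -l then counts them.
    rule count_decontam_reads:
        input: "qc/decontam/{sample}_1.fastq.gz"
        output: "qc/counts/{sample}.txt"
        shell: "zcat {input} | awk 'NR % 4 == 1' | wc -l > {output}"

Copying a file like this into the extensions directory is all that is required for its outputs to become requestable targets.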
Komplexity
We regularly encounter low-complexity sequences comprised of short nucleotide repeats that pose problems for downstream taxonomic assignment and assembly [12, 52], for example by generating spurious alignments to unrelated repeated sequences in database genomes. To avoid these potential artifacts, we created a novel, fast read filter called Komplexity. Komplexity is a stand-alone program implemented in the Rust programming language that is designed to mask or remove problematic low-complexity nucleotide sequences. It scores sequence complexity by calculating the number of unique k-mers divided by the sequence length. Komplexity can either return this complexity score for the entire sequence or mask regions that fall below a score threshold. The k-mer length, window length, and complexity score cutoff are modifiable by the user, though default values are provided. Komplexity accepts FASTA and FASTQ files as input and outputs either complexity scores or masked sequences in the input format. As integrated in the Sunbeam workflow, Komplexity assesses the total read complexity and removes reads that fall below the default threshold. Komplexity is also available as a separate open-source program at https://github.com/eclarke/komplexity.
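The scoring scheme can be expressed in a few lines of Python (an illustrative re-implementation of the score as described above, not the Rust source; the k-mer length of 4 here is arbitrary, and the tool's actual defaults are given in its documentation):

    # Complexity score: number of distinct k-mers divided by sequence
    # length. Homopolymers score near zero; diverse sequences near one.
    def komplexity_score(seq: str, k: int = 4) -> float:
        if len(seq) < k:
            return 0.0
        kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        return len(kmers) / len(seq)

    print(komplexity_score("AAAAAAAAAAAAAAAAAAAA"))  # homopolymer:  1/20 = 0.05
    print(komplexity_score("ACGTACGTACGTACGTACGT"))  # 4-mer repeat: 4/20 = 0.20
    print(komplexity_score("GATTACAGGCTCAGTTCGAC"))  # mixed:       17/20 = 0.85

Reads whose score falls below the threshold are removed, or, in masking mode, the offending regions are masked.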
Results and Discussion
Sunbeam implements a core set of commonly required tasks supplemented by user-built extensions. Even so, the capabilities of Sunbeam compare favorably with existing pipelines such as SURPI (Sequence-based Ultra-Rapid Pathogen Identification) [17], EDGE (Empowering the Development of Genomics Expertise) [29], ATLAS (Automatic Tool for Local Assembly Structures) [30], and KneadData [31] (Table S1). Sunbeam’s primary advances are its ease of deployment, its extension framework, and its novel algorithmic solutions to low-complexity and host-derived sequence filtering.
To demonstrate the use of Sunbeam on real-world data, we applied it to a subset of a previously published dataset from healthy humans and individuals with Crohn’s disease [53], replacing potentially identifiable human reads with pIRS-simulated human genomic reads [54]. This dataset is available for download at https://zenodo.org/record/1287807. The output of Sunbeam consists primarily of sequence and text files for downstream use, so to demonstrate the potential of the extension system, we created a “report” extension that collects the outputs and visualizes them in a series of plots. These plots, shown in Figure 2, describe metrics such as the average sequence quality at each position in the read (Figure 2A), the total number of contaminant and low-complexity sequences removed from each sample (Figure 2B), and a high-level overview of the taxa found in each sample (Figure 2C). The report extension (https://github.com/sunbeam-labs/sbx_report) demonstrates the ease of building downstream analysis steps into Sunbeam: the report is generated from an R Markdown document, and excluding that document, the extension is only 12 lines of code.
Sunbeam’s extension framework promotes reproducible analyses and greatly simplifies performing the same type of analysis on multiple datasets. Extension templates, as well as a number of pre-built extensions for metagenomic analysis and visualization software like Anvi’o [51], MetaPhlAn [49], and Pavian [55], are available on our GitHub page (https://github.com/sunbeam-labs).
Comparison of low-complexity filtering programs
Low-complexity reads often cross-align between genomes and commonly elude standard filters, so Sunbeam implements a new filter. A number of tools already exist for filtering low-complexity nucleotide sequences. The gold standard, RepeatMasker [36], uses multiple approaches to identify and mask repetitive or low-complexity DNA sequences, including querying a database of repetitive DNA elements (either Repbase [56] or Dfam [57]). DUST [37] scores and masks nucleotide sequence windows that exceed a complexity score threshold (lower-complexity sequences are assigned higher scores) such that no subsequence within the masked region has a higher complexity score than the masked region as a whole. BBMask, developed by the Joint Genome Institute, masks sequences that fall below a threshold of k-mer Shannon diversity [38].
Many of these tools were not optimal for our use with shotgun metagenomic datasets. RepeatMasker uses databases of known repeat sequences to mask repetitive nucleotide sequences, but runs too slowly to be feasible for processing large datasets. Neither DUST nor RepeatMasker accept files in FASTQ format as input, requiring conversion to FASTA before processing. An option to filter reads falling below a certain complexity threshold is not available in DUST, RepeatMasker or BBMask (although filtering is available in the BBMask companion tool BBDuk). Finally, the memory footprint of BBMask scales with dataset size, requiring considerable resources to process large shotgun sequencing studies. Therefore, we designed Komplexity to mask or filter metagenomic reads as a rapid, scalable addition to the Sunbeam workflow that can also be installed and run separately. It accepts FASTA/Q files as input, can mask or remove reads below a specified threshold, and operates with a constant memory footprint.
To compare the performance of all the low-complexity-filtering tools discussed above, we used pIRS [54] to simulate Illumina reads from the human conserved coding sequence dataset [58] as well as human microsatellite records from the NCBI nucleotide database [59] with the following parameters: average insert length of 170 nucleotides with a 5% standard deviation, read length of 100 nucleotides, and 5x coverage. To ensure compatibility with all programs, we converted the resulting files to FASTA format, then selected equal numbers of reads from both datasets for a total of approximately 1.1 million bases in the simulated dataset (both available at https://zenodo.org/record/1287807). We processed the reads using Komplexity, RepeatMasker, DUST and BBMask and used GNU Time [60] to measure peak memory usage and execution time for six replicates (Table 1). Komplexity and RepeatMasker mask a similar proportion of microsatellite nucleotides, while none of the four tools masks a large proportion of coding nucleotides. Komplexity runs faster and has a smaller memory footprint than other low-complexity filtering programs. The memory footprint of Komplexity and DUST are also relatively constant across datasets of different sizes (data not shown).
To understand the extent to which different tools might synergize to mask a larger proportion of overall nucleotides, we visualized nucleotides from the microsatellite dataset masked by each tool or combinations of multiple tools using UpSetR [61] (Figure 3). Komplexity masks 78% of the nucleotides masked by any tool, and 96% excluding nucleotides masked by only RepeatMasker. This suggests that there would only be a marginal benefit to running other tools in series with Komplexity. Komplexity in combination with Sunbeam’s standard host removal system resulted in the removal of over 99% of the total simulated microsatellite reads.
Conclusions
Here we introduce Sunbeam, a Snakemake-based pipeline for analyzing shotgun metagenomic data with a focus on reproducible analysis and ease of deployment and use. We compare Sunbeam with other pipelines for metagenomic analysis and note several favorable features. We also present Komplexity, a tool for rapidly filtering and masking low-complexity sequences from metagenomic sequence data, and show its superior performance against other tools for masking human microsatellite repeat sequences. Sunbeam’s scalability, customizability, and ease of deployment simplify the processing of shotgun metagenomic sequence data, while its extension framework and thorough quality control enable robust and reproducible analyses. We have already used Sunbeam in multiple published [23, 52, 62, 86] and ongoing studies. As a case study of its robust deployability, we featured Sunbeam at a metagenomics workshop at the University of Pennsylvania in the summer of 2017, where all participants successfully installed and ran the full pipeline on sample datasets.
Availability and requirements
Project name: Sunbeam
Project home page: https://github.com/sunbeam-labs/sunbeam
Operating system: Linux
Programming languages: Python, Rust and Snakemake
License: GPLv3
Restrictions to use by non-academics: No
List of abbreviations
ATLAS: Automatic Tool for Local Assembly Structures; BAM: binary alignment map; BLAST: Basic Local Alignment Search Tool; EDGE: Empowering the Development of Genomics Expertise; GPL: (GNU) General Public License; ORF(s): open reading frame(s); SDUST: Symmetric DUST; SURPI: Sequence-based Ultra-Rapid Pathogen Identification
Declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Availability of data and material
Sunbeam is available at https://github.com/sunbeam-labs/sunbeam. Komplexity is available at https://github.com/eclarke/komplexity. Pre-built extensions referenced can be found at https://github.com/sunbeam-labs. The dataset of simulated microsatellite and conserved coding sequence reads, as well as the example dataset analyzed in Figure 2, are archived in Zenodo at https://zenodo.org/record/1287807.
Competing interests
The authors declare that they have no competing interests.
Funding
This work was supported by the NIH grants U01HL112712 (Site-Specific Genomic Research in Alpha-1 Antitrypsin Deficiency and Sarcoidosis (GRADS) Study), R01HL113252, and R61HL137063, and received assistance from the Penn Center for AIDS Research (P30AI045008), T32 Training Grant (T32AI007324, LJT), and the PennCHOP Microbiome Program (Tobacco Formula grant under the Commonwealth Universal Research Enhancement (C.U.R.E) program with the grant number SAP # 4100068710).
Authors’ contributions
ELC, CZ, FDB, and KB conceived and designed Sunbeam. ELC, CZ, AC, and KB developed Sunbeam. ELC and LJT conceived and developed Komplexity. LJT performed the low-complexity sequence masking analysis. ELC, LJT, and KB wrote the manuscript. All authors read, improved, and approved the final manuscript.
Supplementary table legends
Table S1 (Additional File 1):
Feature comparison for metagenomic pipelines. Tools used by each pipeline: Trimmomatic [39]; Cutadapt [40]; Tadpole [63]; FastQC [41]; FaQCs [64]; BBDuk2 [65]; DUST [37]; TRF [66]; bwa [42]; Bowtie2 [67]; BBMap [68]; Kraken [43]; SNAP [69]; MUMmer [70]; JBrowse [71]; GOTTCHA [72]; MetaPhlAn [49]; DIAMOND [73]; FastTree [74]; MEGAHIT [44]; SPAdes [75]; Minimo [76]; Prodigal [45]; BLASTp [46]; Prokka [77]; BLASTn [46]; eggNOG [78]; ENZYME [79]; dbCAN [80]; Primer3 [81]; RAPSearch [82]; RAxML [83]; PhaME [84]; Conda [85]; Snakemake [32]; samtools [47].
Acknowledgements
Thanks to members of the Bushman lab, Penn-CHOP Microbiome Center, and Penn Bioinformatics Code Review communities for helpful suggestions, discussions, and beta testing.
References