Abstract
The Tara Oceans Expedition has provided large, publicly-accessible microbial metagenomic datasets from a circumnavigation of the globe. Utilizing several size fractions from the samples originating in the Mediterranean Sea, we have used current assembly and binning techniques to reconstruct 290 putative high-quality metagenome-assembled bacterial and archaeal genomes, with an estimated completion of ≥50%, and an additional 2,786 bins, with estimated completion of 0-50%. We have submitted our results, including initial taxonomic and phylogenetic assignments for the putative high-quality genomes, to open-access repositories (iMicrobe and FigShare) for the scientific community to use in ongoing research.
Introduction
Microorganisms are a major constituent of the biology within the world’s oceans and act as important linchpins in all major global biogeochemical cycles1. Marine microbiology is among the disciplines at the forefront of advancing our understanding of how microorganisms respond to and impact their local and large-scale environments. An estimated 10^29 Bacteria and Archaea2 reside in the oceans and represent an immense amount of poorly constrained, ever-evolving genetic diversity.
The Tara Oceans Expedition (2003-2010) encompassed a major endeavor to add to the body of knowledge collected during previous global ocean surveys to sample the genetic potential of microorganisms3. To accomplish this goal, members of Tara Oceans sampled planktonic organisms (viruses to fish larvae) at two major depths, the surface ocean and the mesopelagic. The amount of data collected was expansive and included 35,000 samples from 210 ecosystems3. The Tara Oceans Expedition generated and publicly released 7.2 Tbp of metagenomic data from 243 ocean samples from throughout the global ocean, specifically targeting the smallest members of the ocean biosphere: the viruses, Bacteria and Archaea, and picoeukaryotes4. Initial work on these fractions produced a large protein database, totaling >40 million nonredundant protein sequences, and identified >35,000 microbial operational taxonomic units (OTUs)4.
Leveraging the publicly available metagenomic sequences from the “girus” (giant virus; 0.22-1.6 μm), “bacteria” (0.22-1.6 μm), and “protist” (0.8-5 μm) size fractions, we have performed a new joint assembly of these samples using current sequence assemblers (Megahit5) and methods (combining assemblies from multiple sites using Minimus26). These metagenomic assemblies were binned using a strictly coverage-based binning algorithm7 into 290 high-quality (low contamination) microbial genomes, ranging from 50-100% estimated completion.
Environmentally derived genomes representing the most abundant microorganisms are imperative for a number of downstream applications, including comparative genomics, metatranscriptomics, and metaproteomics. This series of genomic data can allow for the recruitment of environmental “-omic” data and provide linkages between functions and phylogenies. This method was initially performed on the six sites from the Mediterranean Sea containing microbial metagenomic samples (TARA007, -009, -018, -023, -025 and -030), but will continue through the various Longhurst provinces8,9 sampled during the Tara Oceans project (Figure 1). All of the assembly data is publicly available, including the initial Megahit assemblies for each site from the various size fractions and depths and putative (minimal quality control) genomes within iMicrobe (http://imicrobe.us).
Materials and Methods
A generalized version of the following workflow is presented in Figure 2.
Sequence Retrieval and Assembly
All sequences for the forward and reverse reads from each sampled site and depth within the Mediterranean Sea were accessed from the European Molecular Biology Laboratory (EMBL) utilizing their FTP service (Table 1). Paired-end reads from different filter sizes from each site and depth (e.g., TARA007, girus filter fraction, sampled at the deep chlorophyll maximum) were assembled using Megahit5 (v1.0.3; parameters: --preset, meta-sensitive). To remain consistent with TARA sample nomenclature, “bacteria” or “BACT” will be used to encompass the size fraction 0.22-1.6 μm. All of the Megahit assemblies were pooled into two tranches based on assembly size, ≤1,999bp and ≥2,000bp. Longer assemblies (≥2kb) with ≥99% semi-global identity were combined using CD-HIT-EST (v4.6; -T 90 -M 500000 -c 0.99 -n 10). The reduced set of contiguous DNA fragments (contigs) was then cross-assembled using Minimus26 (AMOS v3.1.0; parameters: -D OVERLAP=100 MINID=95).
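The length-based pooling step can be illustrated with a minimal sketch. The function names (`read_fasta`, `split_by_length`) and the toy input are hypothetical and are not part of the published workflow; the only value taken from the text is the 2,000 bp boundary between the two tranches.

```python
def read_fasta(lines):
    """Parse FASTA-formatted lines into a {name: sequence} dict."""
    contigs, name = {}, None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the name
            name = line[1:].split()[0]
            contigs[name] = ""
        elif name is not None:
            contigs[name] += line
    return contigs


def split_by_length(contigs, threshold=2000):
    """Split contigs into (shorter than threshold, threshold or longer)."""
    short, long_ = {}, {}
    for name, seq in contigs.items():
        (long_ if len(seq) >= threshold else short)[name] = seq
    return short, long_


# Toy input (illustrative only): one 1,999 bp contig and two contigs >= 2,000 bp
fasta = [">c1 demo", "A" * 1999, ">c2", "A" * 1000, "A" * 1000, ">c3", "A" * 7500]
short, long_ = split_by_length(read_fasta(fasta))
```

In this sketch only the `long_` tranche would be passed on to the redundancy-reduction and cross-assembly steps.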
Metagenome-assembled Genomes
Sequence reads were recruited against a subset of contigs (≥7.5kb) constructed during the secondary assembly (Megahit + Minimus2) for each of the Tara samples using Bowtie210 (v4.1.2; default parameters). Utilizing the SAM file output, read counts for each contig were determined using featureCounts11 (v1.5.0; default parameters). Coverage was determined for all contigs by dividing the number of recruited reads by the length of the contig (reads/bp). Due to the low coverage nature of the samples, in order to effectively delineate between contig coverage patterns, the coverage values were transformed by multiplying by five (determined through manual tuning). Transformed coverage values were then utilized to cluster contigs into bins utilizing BinSanity (parameters: -p -3, -m 4000, -v 400, -d 0.9) 7. Bins were assessed for the presence of putative microbial genomes using CheckM12 (v1.0.3; parameters: lineage_wf). Bins were split into three categories: (1) putative high quality genomes (≥50% complete and ≤10% cumulative redundancy [% contamination – (% redundancy × % strain heterogeneity ÷ 100)]); (2) bins with “high” contamination (≥50% complete and >10% cumulative redundancy); and (3) low completion bins (<50% complete). The high contamination group was additionally binned using the BinSanity refinement method (refine-contaminated-log.py; parameters: -p ‘variable’, -m 2000, -v 200, -d 0.9), which utilizes affinity propagation13 to cluster contigs within a bin based on tetranucleotide frequencies and %G+C.
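The three-way split of bins can be sketched as follows. This is an illustration under stated assumptions, not the code used for the analysis: the function names are hypothetical, and the “% redundancy” term in the bracketed formula is assumed here to be the CheckM contamination estimate.

```python
def cumulative_redundancy(contamination, strain_heterogeneity):
    """Contamination (%) minus the share attributed to strain heterogeneity (%).

    Assumption: the '% redundancy' term of the formula is taken to be the
    CheckM contamination estimate.
    """
    return contamination - (contamination * strain_heterogeneity / 100.0)


def classify_bin(completeness, contamination, strain_heterogeneity):
    """Assign a bin to one of the three categories described in the text."""
    if completeness < 50:
        return "low_completion"
    if cumulative_redundancy(contamination, strain_heterogeneity) <= 10:
        return "high_quality"
    return "high_contamination"


# Illustrative values only (not taken from the dataset)
examples = [
    classify_bin(72.0, 12.0, 50.0),  # 12% contamination, half from strain heterogeneity
    classify_bin(80.0, 25.0, 0.0),
    classify_bin(30.0, 0.0, 0.0),
]
```

Under this reading, a bin with 12% contamination and 50% strain heterogeneity has a cumulative redundancy of 6% and would still be treated as a putative high quality genome.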
To determine the preference values needed to successfully bin the high contamination bins, 15 bins were assessed manually using the number of marker occurrences determined by CheckM. Bins containing approximately two genomes, three genomes, and bins with more genomes used a preference of -1000 (-p -1000), -500 (-p -500), and -100 (-p -100), respectively. The 15 manually assessed bins were used to train a decision tree within scikit-learn14 (default parameters, DecisionTreeClassifier) to assign parameters to the other bins. The resulting bins were added to one of the three categories: putative high quality genomes, high contamination bins, and low completion bins. The high contamination bins were processed for a third time with the BinSanity refinement step utilizing a preference of -100 (-p -100). These bins were given final assignments to either the putative high quality genomes (some putative genomes had >10% cumulative contamination, and have been noted as such) or low completion bins. Bins determined to be low completion bins were reserved for an additional round of binning (see below).
After this initial round of binning, all contigs not assigned to putative high-quality genomes were re-binned with BinSanity using raw coverage values. Two additional rounds of refinement were performed (as above), with the first round of refinement using the decision tree to determine preference and the second round using a set preference of -10 (-p -10). Following this binning phase, contigs were assigned to high quality bins (e.g., Tara Mediterranean genome 1, referred to as TMED1, etc.), low completion bins with at least five contigs (0-50% complete; TMEDlc1, etc.; lc, low completion), or were not placed in a bin (Supplemental Table 1 & 2).
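The mapping from the approximate number of genomes in a bin to a preference value can be written down directly. The sketch below is a hand-written stand-in for illustration: it does not reproduce the trained scikit-learn decision tree, and the heuristic for estimating the number of genomes from marker copy numbers (`estimate_genome_count`) is a hypothetical simplification of the manual assessment.

```python
from collections import Counter


def estimate_genome_count(marker_copies):
    """Hypothetical heuristic: the most common non-zero marker copy number.

    Ties are resolved toward the smaller copy number.
    """
    counts = Counter(c for c in marker_copies if c > 0)
    if not counts:
        return 0
    return min(counts, key=lambda c: (-counts[c], c))


def preference_for(genome_count):
    """Preference values from the text: about two genomes -> -1000,
    three -> -500, more than three -> -100."""
    if genome_count <= 2:
        return -1000
    if genome_count == 3:
        return -500
    return -100


# Illustrative marker copy numbers for one contaminated bin
copies = [2, 2, 2, 1, 3, 2, 0]
estimated = estimate_genome_count(copies)
preference = preference_for(estimated)
```

A trained classifier would replace `estimate_genome_count` and `preference_for` with a single prediction step; the fixed rule is shown only to make the preference values concrete.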
Taxonomic and Phylogenetic Assignment of High Quality Genomes
The bins representing the high quality genomes were assessed for taxonomy and phylogeny using multiple methods to provide a quick reference for selecting genomes of interest. Taxonomy was assigned using the putative placement provided via CheckM during the pplacer15 step of the analysis to the lowest taxonomic placement (parameters: tree_qa -o 2). This step was also performed for all low completion bins. A second taxonomic assignment was determined using a method modified from Albertsen, et al. (2013)16, wherein putative coding DNA sequences (CDSs) were determined using Prodigal17 (v2.6.3; parameters: -m -o -p meta -q). The putative CDSs were searched against the NCBI non-redundant (NR) database (accessed March 2016) using DIAMOND18 (v0.8.11.73; parameters: -f xml -k 5 --sensitive -e 1e-10) and the output was processed using MEGAN19 (v4; parameters: recompute toppercent = 5, recompute minsupport = 1, collapse rank = species, select nodes = all) to determine the last common ancestor for the top five matches. Using a script from the Multi-Metagenome package (hmm.majority.vote.pl; https://github.com/MadsAlbertsen/multimetagenome; parameters: -n -l 4 [-l 5, -l 6, or -l 7]), each contig was assigned a consensus taxonomic identification at approximately the Phylum, Class, Order, and Family levels. A consensus for all contigs at each taxonomic level was determined. If at any level a tie was achieved between possible assignments, it has been denoted with a “T” in the genome table.
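The consensus step with tie detection can be sketched as a simple majority vote. This is an illustration of the idea, not a re-implementation of hmm.majority.vote.pl; the function name, the tuple return value, and the alphabetical choice among tied labels are assumptions.

```python
from collections import Counter


def consensus(labels):
    """Return (label, tied) for the most frequent non-empty label.

    `tied` is True when the highest count is shared by several labels,
    in which case the alphabetically first label is reported.
    """
    counts = Counter(label for label in labels if label)
    if not counts:
        return None, False
    top = max(counts.values())
    winners = sorted(name for name, n in counts.items() if n == top)
    return winners[0], len(winners) > 1


# Illustrative per-contig assignments at one taxonomic level
clear = consensus(["Proteobacteria", "Proteobacteria", "Bacteroidetes"])
tie = consensus(["Proteobacteria", "Bacteroidetes"])
```

A caller could mark any result with `tied == True` with a “T”, mirroring the notation described for the genome table.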
Two separate attempts were made to assign the high quality genomes a phylogenetic assignment. High quality genomes were searched for the presence of the full-length 16S rRNA gene sequence using RNAmmer20 (v1.2; parameters: -S bac -m ssu). All full-length sequences were aligned to the SILVA SSU reference database (Ref123) using the SINA web portal aligner21 (https://www.arb-silva.de/aligner/). These alignments were loaded into ARB22 (v6.0.3), manually assessed, and added to the non-redundant 16S rRNA gene database (SSURef123 NR99) using the ARB Parsimony (Quick) tool (parameters: default). The nearest neighbors to the Tara genome sequences were selected and used to construct a 16S rRNA phylogenetic tree. Genome-identified 16S rRNA sequences and SILVA reference sequences were aligned using MUSCLE23 (v3.8.31; parameters: -maxiters 8) and processed by the automated trimming program trimAL24 (v1.2rev59; parameters: -automated1). Automated trimming results were assessed manually in Geneious25 (v6.1.8) and trimmed where necessary (positions with >50% gaps) and re-aligned with MUSCLE (parameters: -maxiters 8). An approximate maximum likelihood (ML) tree with pseudo-bootstrapping was constructed using FastTree26 (v2.1.3; parameters: -nt -gtr -gamma; Figure 3).
High-quality genomes were assessed for the presence of the 16 ribosomal marker genes used in Hug, et al. (2016)27. Putative CDSs were determined using Prodigal (v2.6.3; parameters: -m -p meta) and were searched using HMMs for each marker using HMMER28 (v3.1b2; parameters: hmmsearch --cut_tc --notextw). If a genome had multiple copies of any single marker gene, none of the copies was considered, and only genomes with ≥8 markers were used to construct a phylogenetic tree. Markers identified from the high quality genomes were combined with markers from 1,729 reference genomes that represent the major bacterial phylogenetic groups (as presented by IMG29). Archaeal reference sequences were not included; however, none of the putative archaeal environmental genomes had a sufficient number of markers for inclusion on the tree. Each marker gene was aligned using MUSCLE (parameters: -maxiters 8) and automatically trimmed using trimAL (parameters: -automated1). Automated trimming results were assessed (as above) and re-aligned with MUSCLE, as necessary. Final alignments were concatenated and used to construct an approximate ML tree with pseudo-bootstrapping with FastTree (parameters: -gtr -gamma; Figure 4).
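The marker selection rule (discard any marker found in multiple copies, then require at least eight remaining markers) can be sketched as below. The data layout (a dict of marker name to a list of matching protein identifiers) and the function name are hypothetical; parsing of the hmmsearch output is not shown.

```python
def usable_markers(hits, min_markers=8):
    """Keep markers with exactly one hit; return None if too few remain.

    hits: {marker_name: [protein_id, ...]} for a single genome.
    """
    single_copy = {m: ids[0] for m, ids in hits.items() if len(ids) == 1}
    if len(single_copy) < min_markers:
        return None
    return single_copy


# Illustrative genome with nine detected markers, one of them duplicated
hits = {"marker_%d" % i: ["protein_%d" % i] for i in range(9)}
hits["marker_0"] = ["protein_0", "protein_0b"]
kept = usable_markers(hits)

# A second duplicated marker drops the genome below the threshold
hits["marker_1"] = ["protein_1", "protein_1b"]
rejected = usable_markers(hits)
```

Genomes for which the function returns `None` would be left off the concatenated marker tree.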
Relative Abundance of High Quality Genomes
To set up a baseline that could approximate the “microbial” community (Bacteria, Archaea and viruses) present in the various Tara metagenomes, which included filter sizes specifically targeting both protists and giruses, reads were recruited against all contigs generated from the Minimus2 and Megahit assemblies ≥2kb using Bowtie2 (default parameters). The assumption was made that contigs <2kb would include low abundance bacteria and archaea, bacteria and archaea with high degrees of repeats/assembly poor regions, fragmented picoeukaryotic genomes, and problematic read sequences (low quality, sequencing artefacts, etc.). All relative abundance measures are relative to the number of reads recruited to the assemblies ≥2kb. Read counts were determined using featureCounts (as above). Length-normalized relative abundance values were determined for each high quality genome for each sample.
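The exact expression used for the length-normalized values is not reproduced here. As a rough sketch of one common formulation, reads per base pair can be computed for each entity (a genome, or any other group of contigs) and expressed as a percentage of the summed reads per base pair across all entities in a sample. The choice of denominator and the function name are assumptions, not the authors' stated formula.

```python
def length_normalized_abundance(reads, lengths):
    """Percent share of each entity after dividing read counts by length.

    reads:   {entity: reads recruited in one sample}
    lengths: {entity: total length in bp}
    Assumption: values are normalized so that all entities sum to 100%.
    """
    rates = {name: reads[name] / lengths[name] for name in reads}
    total = sum(rates.values())
    if total == 0:
        return {name: 0.0 for name in rates}
    return {name: 100.0 * rate / total for name, rate in rates.items()}


# Illustrative values: equal read counts, but g2 is three times longer than g1
abundance = length_normalized_abundance(
    {"g1": 100, "g2": 100},
    {"g1": 1000, "g2": 3000},
)
```

With equal read counts, the shorter genome receives the larger share (about 75% versus 25% here), which is the effect length normalization is meant to capture.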
Available Through iMicrobe
In keeping with the open-access nature of the Tara Oceans project, all of the data generated for this analysis is publicly available through iMicrobe (http://data.imicrobe.us/project/view/261), including: all contigs generated using Megahit from each sample; all contigs from Minimus2 + Megahit output used for binning and community assessment, ≥2kb and ≥7.5kb; a table that details statistics, taxonomy, and phylogeny for the high quality genomes; the putative genome contigs and Prodigal-predicted nucleotide and protein putative CDS FASTA files. Additional files, such as the ribosomal marker HMM profiles, reference genome markers, high quality genome markers, final concatenated MUSCLE alignment, FastTree Newick file, contig read count data, relative abundance matrix for genomes from all samples, low completion bins, and contigs without a bin, as well as additional data files, have been provided and are available through FigShare (https://dx.doi.org/10.6084/m9.figshare.3545330). Digital locations of data files and contents can be found in Supplemental Table 3.
Results
Assembly
The initial Megahit assembly was performed on the publicly available reads for Tara stations 007, 009, 018, 023, 025, and 030. Starting with 147-744 million reads per sample, the Megahit assembly process generated 1.2-4.6 million assemblies with a mean N50 and longest contig of 785bp and 537kb, respectively (Table 1). In general, the assemblies generated from the Tara samples targeting the protist size fraction (0.8-5 μm) had a shorter N50 value than the bacteria size fractions (mean: 554bp vs 892bp, respectively). Assemblies from the Megahit assembly process were pooled and separated by length. Of the 42.6 million assemblies generated during the first assembly, 1.5 million were ≥2kb in length (Table 2). Several attempts were made to assemble the shorter contigs, but publicly available overlap-consensus assemblers (Newbler [454 Life Sciences], cap330, and MIRA31) failed on multiple attempts. Processing the ≥2kb assemblies from all of the samples through CD-HIT-EST reduced the total to 1.1 million contigs ≥2kb. This group of contigs was subjected to the secondary assembly through Minimus2, generating 158,414 new contigs (all ≥2kb). The secondary contigs were combined with the Megahit contigs that were not assembled by Minimus2. This provided a contig dataset consisting of 660,937 contigs, all ≥2kb in length (Table 2; further referred to as data-rich-contigs).
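For readers unfamiliar with the N50 statistic reported above, a minimal sketch of the standard definition follows: the length of the contig at which the running total of lengths, taken from longest to shortest, first reaches half of the total assembly size. The function name and example lengths are illustrative only.

```python
def n50(lengths):
    """Contig length at which the cumulative sum, longest first,
    first reaches half of the total assembly length."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # empty input


# Illustrative contig lengths (total = 32, half = 16)
example_n50 = n50([2, 2, 2, 3, 3, 4, 8, 8])
```

Because many short contigs contribute little to the running total, assemblies dominated by short fragments yield low N50 values even when a few very long contigs are present.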
Binning
The set of data-rich-contigs was used to recruit the metagenomic reads from each sample using Bowtie2. The data-rich-contigs recruited 15-81% of the reads depending on the sample. In general, the protist size fraction recruited substantially fewer reads than the girus and bacteria size fractions (mean: 19.8% vs 75.0%, respectively) (Table 1). For the protist size fraction, the “missing” data for these recruitments likely results from the poor assembly of more complex and larger eukaryotic genomes. The fraction of the reads that do not recruit in the girus and bacterial size fraction samples could be accounted for by the large number of low quality assemblies (200-500bp) and reads that could not be assembled due to low abundance or high complexity (Table 2). Coverage was determined as total reads per base pair, based on the number of reads recruited to each contig.
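The coverage calculation (reads recruited divided by contig length) and the constant multiplier described in the Methods can be sketched as follows. The function name and the example counts are hypothetical; the factor of five is the transformation stated in the Methods.

```python
def coverage_profile(read_counts, contig_length, scale=1.0):
    """Reads per bp for one contig across samples, optionally scaled.

    read_counts:   reads recruited to the contig, one value per sample
    contig_length: contig length in bp
    scale:         constant multiplier (1.0 gives raw coverage)
    """
    if contig_length <= 0:
        raise ValueError("contig_length must be positive")
    return [scale * n / contig_length for n in read_counts]


# Illustrative 7.5 kb contig with read counts from three samples
raw = coverage_profile([1500, 0, 750], 7500)
transformed = coverage_profile([1500, 0, 750], 7500, scale=5)
```

Each contig's list of per-sample values forms one row of the coverage matrix that a coverage-based binning tool would cluster.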
Unsupervised binning was performed using both transformed and raw coverage values for a subset of 95,506 contigs from the data-rich-contigs that were ≥7.5kb (referred to further as binned-contigs) utilizing the tool BinSanity. An iterative process was performed that first used coverage to generate putative bins, and then, after removing putative high-quality genomes (≥50% complete and <10% redundancy) based on estimates of redundancy through CheckM, used two passes through the BinSanity refinement process, utilizing sequence composition. Binning using the transformed coverage data generated 237 putative high-quality genomes (12 putative genomes are of slightly lower quality with >10% redundancy and have been noted) containing 15,032 contigs. Contigs not in putative genomes were re-binned through the iterative use of BinSanity based on raw coverage values, generating 53 additional putative high-quality genomes encompassing 3,348 contigs. In total, 290 putative high-quality genomes were generated with 50-100% completion (mean: 69%) with a mean length and number of putative CDS of 1.7Mbp and 1,699, respectively (iMicrobe; Supplemental Table 1). All other contigs were grouped into bins with at least five contigs, but with estimated completion of 0-50% (2,786 low completion bins; 74,358 contigs; Supplemental Table 2) or did not bin (2,732 contigs). Nearly a quarter of the low completion bins (24.7%) have an estimated completion of 0%.
Taxonomy, Phylogeny, & Potential Organisms of Interest
Each of the 290 putative high-quality genomes had a taxonomy assigned via CheckM during the pplacer step. All of the genomes, except for 20, had an assignment to at least the Phylum level, and 83% of the genomes had an assignment to at least the Class level. Additionally, all of the genomes were assigned putative taxonomies using a consensus method of the taxonomies assigned to the putative CDS on a contig. The genomes were assigned four levels of taxonomic information, roughly equivalent to Phylum, Class, Order, and Family. Due to the nature of this method, especially at lower taxonomic levels, it is possible for a small number of assignments to greatly influence the results. Because all of these methods have inherent biases, consistency across several results should be viewed as reinforcing support for the accuracy of the genome, while inconsistent results should not be used as evidence of an incorrectly binned genome.
Attempts were made to provide phylogenetic information for as many genomes as possible. Genomes were assessed for the presence of full-length 16S rRNA genes. In total, 37 16S rRNA genes were detected in 35 genomes (mean 16S rRNA gene copy number, 1.05). 16S rRNA genes can prove to be problematic during the assembly steps due to the high level of conservation that can break contigs32 (Figure 3). Additionally, the conserved regions of the 16S rRNA, depending on the situation, can over- or under-recruit reads, resulting in coverage variations that can misplace contigs into the incorrect genome. As such, several of the 16S rRNA phylogenetic placements support the taxonomic assignments, while some are contradictory. Further analysis should allow for the determination of the most parsimonious result.
Beyond the 16S rRNA gene, genomes were searched for 16 conserved, syntenic ribosomal markers. Sufficient markers (≥8) were identified in 193 of the genomes (67%) and placed on a tree with 1,729 reference sequences (Figure 4). Phylogenies were then assigned to the lowest taxonomic level that could be confidently determined.
The taxonomic and phylogenetic assignments are provided to give downstream users a guide for determining which genomes prove to be most interesting for further analysis. Highest confidence should be given to genomes with multiple lines of evidence supporting an assignment and additional confirmation should be gathered for those with multiple conflicting results. These putative results reveal a number of genomes were generated that represent multiple clades for which environmental genomic information remains limited, including: Planctomycetes, Verrucomicrobia, Marinimicrobia, Cyanobacteria, and uncultured groups within the Alpha- and Gammaproteobacteria.
Relative Abundance
Based on the assembly and recruitment results, the assumption was made that the data-rich-contigs and their corresponding reads represent the dominant portion of the microbial (bacterial, archaeal, and viral) community and that reads that did not recruit represent eukaryotes, low quality assemblies, and/or less dominant portions of the microbial community. A length-normalized relative abundance value was determined for each genome in each sample based on the number of reads recruited to the data-rich-contigs. The relative abundance for the individual genomes was determined based on this portion of the read dataset.
In general, the genomes and their underlying contigs had low coverage (<1X coverage) and low relative abundance (maximum relative abundance = 1.9% for TMED155, a putative member of the Cyanobacteria, in TARA023-PROT-SRF; Supplemental Table 1). The high-quality genomes accounted for 1.57-25.16% of the approximate microbial community as determined by the data-rich-contigs (mean = 13.69%), with the ten most abundant genomes representing 0.61-10.31% (Table 1).
Almost all of the contigs in the binned-contigs were low coverage; only a small subset of 6,350 contigs (6.6%) had >1X coverage in at least one sample. Of these contigs, 1,962 were assigned to putative high-quality genomes, while the other contigs were placed in the low completion bins. Further, an additional 22,470 contigs (Total bp = 79,422,500bp, mean = 3,535bp, and longest contig = 7,498bp) within data-rich-contigs (1.3%) had >1X coverage, but were not included in the binning protocol.
Concluding Statement
The goal of this project was to provide preliminary putative genomes from the Tara Oceans microbial metagenomic datasets. The 290 putative high-quality genomes and 2,786 low completion bins were created using the 20 samples and six stations from the Mediterranean Sea. We will continue to generate putative high-quality genomes from additional Tara Oceans datasets, starting with the Red Sea and Arabian Sea in the near future.
We would like to take some time to highlight interesting results created within this dataset. For new genomes from environmental organisms, this project created approximately 14 new Cyanobacteria genomes within the genera Prochlorococcus and Synechococcus and 33 new SAR11 genomes. Three genomes are unconfirmed members related to the Candidate Phyla Radiation (CPR), as determined by placement on the concatenated ribosomal marker tree at an internal node between the Parcubacteria and Microgenomates (with long-branch characteristics; TMED88) and at a node basal to the CPR genomes, potentially related to the Wirthbacteria (TMED70 and TMED22). Additionally, there are putative genomes from the marine Euryarchaeota (n = 11), Verrucomicrobia (n = 17), Planctomycetes (n = 14), and Marinimicrobia (n ≈ 5).
Some additional perplexing results include TMED58, a putative member of the Deltaproteobacteria with taxonomic assignments from the NCBI NR database to the Myoviridae. This result occurs due to the presence of a few large contigs assigned to the bacteria and many small contigs assigned to the virus. However, whether these two entities should be binned together remains unresolved. Lastly, the low completion bins may house distinct viral genomes. Of particular interest may be the 40 bins with 0% completion (based on single-copy marker genes) that contain >500kb of genetic material (including 3 bins with >1Mb). These large bins lacking markers may be good candidates for research into the marine “giant viruses” and episomal DNA sources (plasmids, etc.).
Researchers using this dataset should be aware that all of the genomes generated from these samples (and from additional stations, for which work is ongoing) should be used as a resource with some skepticism towards the results being absolute. Like all metagenome-assembled genomes, these genomes represent a best-guess approximation of a taxon from the environment33. Researchers are encouraged to confirm all claims through various genomic analyses; accuracy may require the removal of conflicting sequences.
Author Contributions
BJT conceived of the project, performed all of the methods and analyses, and wrote the manuscript. RS provided the origins of the workflow and invaluable feedback during the execution of the methods and analyses. EDG provided feedback and troubleshooting using the pre-release version of BinSanity. JFH provided funding. RS and JFH contributed to manuscript editing and polishing. All authors have read the submitted draft of the manuscript.
Legends
Supplemental Table 1. Statistics and taxonomic and phylogenetic assignments for the putative high-quality genomes
Supplemental Table 2. Statistics and CheckM taxonomy for low completion bins
Supplemental Table 3. Data file names, descriptions, and digital locations
Acknowledgements
We would like to thank iMicrobe and FigShare for hosting data for this research. We are indebted to the Tara Oceans project and team for their commitment to open-access data that allows data aficionados to indulge in the data and attempt to add to the body of science contained within. And we thank the Center for Dark Energy Biosphere Investigations (C-DEBI) for providing funding to BJT and JFH (OCE-0939654).