1 Abstract
We applied simulation-based approaches to characterize how sequencing depth influences the properties of genomes identified in metagenomes assembled from short read sequences. An initial analysis evaluated the quantity, completeness, and contamination of metagenome-assembled genomes (MAGs) as a function of sequencing depth in four preexisting sequence read datasets taken from four environments: a maize soil, an estuarine sediment, the surface ocean, and the human gut. These were subsampled to varying degrees to simulate the effect of sequencing depth on MAG binning. MAG quantity as a function of sequencing depth fit the Gompertz equation, which has been used to describe microbial growth curves. A second analysis explored the relationship between sequencing depth and the proportion of available metagenomic DNA sequenced during a sequencing experiment as a function of community richness, evenness, and genome size. Typical sequencing depths in published experiments (1 to 10 Gb) reached the point of diminishing returns for MAG creation. Simulations from the second analysis demonstrated that both community richness and evenness influence the amount of sequencing required to sequence a metagenome to a target fraction of exhaustion. The most abundant genomes required comparable quantities of sequenced bases regardless of community evenness, while more uneven communities required considerably more sequencing to fully capture their rarer members. Future whole-genome shotgun sequencing studies can use an approach comparable to the one described here to estimate the quantity of sequences required to achieve their scientific objectives.
Importance Short read sequencing with Illumina technology provides an accurate, high-throughput method for characterizing the metabolic potential of microbial communities. Short read sequences are assembled into metagenome-assembled genomes (MAGs), which allow metabolic processes influencing health, agriculture, and biogeochemical cycles to be assigned to microbial clades. At present, no reliable guidelines exist for selecting sequencing depth as a function of experimental goals in MAG creation projects. The work presented here provides a framework for obtaining a constrained estimate of the number of short read sequences needed for sequencing microbial communities. Results suggested that both microbial community richness and evenness influence the required amount of sequencing in a predictable manner.
Introduction
The assembly of high-accuracy short read sequences into metagenome-assembled genomes (MAGs) is a recent approach to characterize microbial metabolisms within complex communities (1). The recent creation of ~8,000 MAGs from largely uncultured organisms across the tree of life (2), the spatial characterization of microbial metabolisms and ecology across Earth’s oceans (3), and the characterization of the potential impact that fermentation-based microbial metabolisms have on biogeochemical cycling in subsurface sediment environments (4) provide a few examples of how MAGs helped constrain the relationships between microbial ecology, microbial metabolisms, and biogeochemistry. At present, there is little information to guide how much sequencing is appropriate for metagenomic shotgun sequencing experiments (5). Estimates compiled by Quince et al. (5) for 2017 suggest that metagenomic shotgun sequencing experiments typically sequence between 1 Gb and 10 Gb of DNA. Nonetheless, more guidance is needed for selecting a metagenomic shotgun sequencing depth appropriate to one’s experimental question, balancing the maximization of information against the minimization of cost.
Illumina sequencing technology is currently the most popular platform for generating metagenomic shotgun sequences (5). Here we present two distinct analyses which constrain the relationship between the quantity of Illumina metagenomic shotgun sequences and the quantity and quality of retrieved MAGs. First, we performed in silico experiments simulating how sequencing depth impacted the properties of MAGs retrieved from existing Illumina sequence read datasets. Second, we applied a theoretical model and numerical simulations to estimate the minimum sequencing depth needed to sequence a metagenome to a target fraction of exhaustion. The work presented here illustrates how community evenness and richness control the sequencing depth necessary to sequence a metagenome to a target fraction of exhaustion. These patterns can be used to guide sequencing depth decisions for future sequencing efforts in which MAG creation is a primary goal.
Results
MAG assembly as a function of sequencing depth in existing metagenomic datasets
The number of “effective MAGs” (equivalents to 100%-complete MAGs, as defined in the Methods section) as a function of high-quality bases empirically fit the Gompertz equation (equation 1; Fig 1B; parameters in Table 1). For each environment, the data fit the Gompertz equation better than a linear least-squares fit based on the Akaike Information Criterion (AIC) (6). This equation was formulated for applications with microbial growth curves, such that the parameters A, μ, and λ correspond to maximum cell density, growth rate, and lag time (Fig 1A). Here, A, μ, and λ correspond to the maximum number of effective MAGs assembled with the pipeline, the maximum rate at which effective MAGs accumulate with additional sequencing, and the “lag bases,” or the bases which must be sequenced prior to rapid retrieval of effective MAGs. For the estuary, maize, and human gut datasets, MAG yield began to asymptote at higher sequencing depths, which indicates that further sequencing would yield diminishing returns with our pipeline. The Tara Ocean dataset followed a similar pattern at <25 Gb. However, when the number of sequenced bases was >25 Gb, the number of effective MAGs decreased and became insensitive to sequencing depth. Since we have expressed MAG creation in terms of effective MAGs, the actual number of MAGs created in each example was considerably higher.
Mean MAG completeness also increased towards an asymptote with increasing sequencing depth (Fig 1C). Completeness was highest for the human gut dataset, with a maximum of 23.9%, and increased continuously as sequencing depth increased. The mean MAG completeness reached an asymptote of ~10-15% for the other three datasets with sequenced bases >10 Gb. Note that when >10 Gb were sequenced, the number of effective MAGs created still increased as new sequences were added. For all datasets, mean MAG contamination was <2% (Fig 1D) and did not depend strongly on sequencing depth.
Simulation experiments
Using equation 7, we calculated the number of k-length sequence reads required to sequence all unique DNA sequences of length k (k-mers) in four hypothetical metagenomes. Three of the community structures are ecologically unrealistic but represent communities in which taxa are distributed perfectly evenly, highly unevenly, and at an intermediate level of evenness (Fig 2A–C). The fourth community structure, which is lognormally distributed, is ecologically realistic (Fig 2D; (7, 8)). The expected value of the log-transformed number of sequences required to fully sequence the metagenomes of these hypothetical communities was linear with respect to the log-transformed size of the metagenome (i.e., the number of unique k-mers in the population, approximately the number of unique base pairs in a metagenome); this suggests a power-law relationship between metagenome size and the expected number of sequence reads required to sequence the metagenome to exhaustion (Fig 2E). For all community structures, the slope of log-transformed expected sequence reads versus log-transformed number of unique k-mers was within 1% of 1.06. The structure of the population strongly influenced the number of reads required: more even community structures required far fewer reads than less even structures.
As equation 7 only estimates the number of reads required to sequence a metagenome to exhaustion, we used a numerical simulation to estimate the number of k-sized reads required to sequence a metagenome to a target fraction of exhaustion. The numerical simulation predicted the same number of sequence reads to sequence 100% of a given metagenome as the numerically integrated expectation from equation 7 (Fig 3), supporting the use of the simulation. The log-transformed total number of unique k-sized reads (|KMG|) and log-transformed number of sequenced reads showed a linear relationship for all target fractions and all community structures. The amount of sequencing required to achieve a given target fraction of |KMG| varied among the communities shown in Fig 2A–D. For instance, the lognormally-distributed community required the most sequencing to sequence a metagenome to exhaustion but required a similar amount of sequencing as the other communities to reach a target fraction of 50%.
We applied the simulation to semi-quantitatively demonstrate the effect that community evenness has on the number of reads required to sequence a community to a target fraction of completion. These communities ranged from perfectly even (a=0, eq. 9) to more uneven (a=0.02, Fig 4A). Evenness was quantified using the Pielou evenness index, which expresses Shannon diversity relative to the diversity of a perfectly even community (9). Computational limits precluded simulating communities with Pielou evenness less than 0.977 given the richness and size of genomes within the communities. The number of sequence reads required to sequence genomes to a target fraction of completion depended strongly on both the evenness and the target fraction of completion (Fig 4B). Again, less even communities required more sequence reads than more even communities. The strength of this relationship also depended on the target fraction of completion. A community with Pielou evenness of 0.977 required 3 orders of magnitude more sequence reads to sequence a metagenome to exhaustion than a perfectly even community, while the same community required only about 42% more reads to sequence 50% of the metagenome.
The minimum number of sequence reads required to sequence a microbial genome for a given combination of target fraction, genome size, and fraction of the metagenome community was modeled with a generalized additive model. The smooth dimensions for target fraction, genome size, and fraction of the metagenome community were 7, 3, and 9, respectively, chosen to achieve a normal distribution of residuals. To normalize for different sequence read lengths, sequence reads were converted to bases, which ranged from 1 × 10^7 to 1 × 10^13. More bases were required to sequence a microorganism when 1) its genome was rarer in the community, 2) a larger target fraction of its genome was to be covered, and 3) its genome was larger.
Discussion
We sought to establish evidence-based guidelines for selecting a sequencing depth in shotgun metagenomic sequencing experiments whose goal is to create MAGs of a given quantity and quality. Random subsamples of existing short read datasets, each individually assembled and binned, simulated the effect of creating MAGs from datasets of different sizes and environments. The datasets analyzed here are argued to be representative of both the typical order of magnitude of sequencing depth (1 to 10 Gb) (5) and the types of target environments microbial ecologists often investigate (10). A variety of software is available for all steps of MAG creation pipelines, and the quantity and quality of MAGs will depend on software selection, software configuration, and the sequenced environment (5). Furthermore, it is best practice to manually curate algorithmically-created MAG bins (11). We do not argue that the pipeline used here is objectively optimal for generating “true” MAGs (i.e., MAGs that represent true genomes). Thus, MAG quantity was not reported directly but was expressed as effective MAGs. The metric, effective MAGs, represents the integrated completeness (12) divided by 100 for MAGs retrieved with a taxonomic rank of at least phylum. In effect, effective MAGs represent phylogenetic signal, as defined by the presence of marker genes in assembled contigs (necessary for constructing MAGs). Thus, increases in effective MAGs should scale proportionally with increases in the quantity of true MAGs.
As sequencing depth increased, there was at first a “lag time” (more precisely a lag depth, or number of bases sequenced before effective MAGs began to increase), followed by a rapid increase in effective MAG quantity and then diminishing returns at higher sequencing depths. Previous investigators modeled the response of 16S rRNA genes (13–15), Hill number diversity (16), taxon-resolved abundance (17), and gene abundance (17) as a function of sequencing depth using rarefaction curves, or collector's curves. The effective number of MAGs created did not match a traditional collector's curve, which contains no initial lag. The Gompertz function, conversely, fit the data well, suggesting that MAG construction as a function of sequencing depth behaves similarly to microbial growth in a constrained medium, in concept if not in precise mechanism. The Gompertz function is defined in terms of three parameters: A, μ, and λ. These correspond to the maximum effective MAGs at infinite sequencing depth (A), the maximum rate at which effective MAGs increase with sequencing depth (μ), and a minimum threshold of sequencing necessary prior to rapid effective MAG retrieval (λ) (Fig 1A). The Gompertz equation achieves the same asymptotic behavior as conventional rarefaction models while also modeling the apparent lag (λ) in effective MAGs observed during this work (Fig 1B).
The four environments analyzed demonstrated different responses to increases in sequencing depth. Specifically, the predicted maximum effective MAGs (A) varied from ~17 to ~97, the predicted maximum rate of effective MAG increase (μ) varied from ~1.4 to ~5.8, and the minimum threshold of sequencing necessary prior to retrieving effective MAGs (λ) varied from ~0.6 to ~6.7. The Tara Ocean dataset, where effective MAGs decreased at sequencing depths >20 Gb, was an exception. We speculate that our choice of pipeline, and specifically the fact that we discarded contigs <3 kb, caused poor performance at higher sequencing depths for the Tara Ocean dataset.
Although mean MAG completeness converged to an asymptote considerably less than 100% (Fig 1C), MAG yields (Table 1) were close to 100%. This suggests that the maximum effective MAGs (A) likely represents sequence reads associated with abundant MAGs. Thus, we asked how much sequencing is necessary to sequence a community to exhaustion. The expected number of sequence reads required to sequence an entire metagenome was estimated using equation 7 for four hypothetical communities (Fig 2A–D). The total number of unique k-sized reads (i.e., richness) and the community structure influenced how much sequencing is necessary to sequence an entire metagenome (Fig 2E). For a given community structure, increases in community richness led to linear increases in the sequencing depth necessary to exhaust the metagenome. All regressions had similar slopes, indicating that community structure did not exert a major influence on that relationship. Interestingly, the sequencing depth necessary to sequence an entire metagenome depended strongly on the structure of the target microbial community (Fig 2E). As sequencing depth was log-transformed in Fig 2E, the differences in model intercepts indicate orders-of-magnitude differences in the necessary sequencing depth. The primary implication of Fig 2 is that sequencing depth increased in a predictable trend in response to richness, regardless of the community structure.
One limitation of equation 7 is that it only provides an estimate of the sequencing depth required to sequence a metagenome to exhaustion. For practical applications, a continuous increase in sequencing depth eventually leads to diminishing returns in identifying unique sequence reads while also leading to a disproportionate increase in the monetary resources needed to find these unique sequence reads (18). Thus, it is desirable to weigh the fraction of unique sequence reads (e.g., 50%, 70%, 90%, etc.) sequenced from a metagenome against the monetary investment necessary to achieve that fraction. Simulations show that as target metagenome completeness increases, the required sequencing depth increases dramatically (Fig 3). Simulation results were validated by comparing the sequencing depth necessary to sequence 100% of a metagenome with predictions from equation 7. While the numerical approach successfully reproduced and extended equation 7, communities with large values of richness (|KMG| > 1 × 10^8) became computationally burdensome. Nonetheless, when the target fraction and community structure were held constant, the linear increase in sequencing depth as a function of increased richness suggests linear regression may be sufficient to estimate sequencing depth for communities with large values of richness.
One observation from the numerical simulations was the impact that community structure had on the required depth of sequencing (Fig 2E and 3). Even communities required less sequencing to achieve a given fraction of |KMG|. Conceptually this makes sense, as abundant taxa (i.e., large n values in equation 8) should be sequenced more deeply than rarer taxa. To further explore the influence of community evenness on required sequencing depth, communities with similar, more realistic lognormal structures (7, 16) at different levels of evenness were compared to one another (Fig 4A). Decreasing evenness (increasing a; equation 9) led to increases in the sequencing depth required to sequence a given target fraction of |KMG| (Fig 4B). For communities with more uneven species distributions, rarer community members required more sequencing. While only semi-quantitative, this analysis demonstrates that community evenness can have a significant impact on the sequencing depth necessary to characterize an entire community.
In practice, information about a target community’s structure may not be available for estimating sequencing depth. The spline model built here gives the minimum number of sequences necessary to sequence a given fraction of a target genome, given the genome size and the proportion that the genome’s content represents in the community metagenome (GMG) (Fig 5). This is useful for placing the observed MAG properties from one’s bioinformatic pipeline (e.g., Fig 1B–D) in the context of what proportion of a given microbe’s metagenome (gMG; equation 4) has been sequenced to exhaustion. For example, taking the 5 Gb human gut dataset analyzed here (Table 2), if a microbe with a genome size of ~5 Mbp existed in this environment, then Fig 5C suggests that a 5 Mbp genome representing >10% of the whole metagenome (GMG; equation 5) would be sequenced to a minimum of 50% of exhaustion. Moreover, one then has a constrained perspective on how a given genome may be represented in the retrieved MAGs. Although sequencing a genome does not necessarily translate into the production of more MAGs, one can safely say that additional sequencing of that 5 Mbp genome representing >10% of the community will not lead to additional MAGs; rather, the bioinformatic pipeline (as opposed to sequencing) would act as the limiting step in the production of MAGs.
Materials and Methods
Sequence data sources
All sequence data were downloaded from NCBI’s Sequence Read Archive (SRA) using the SRA Toolkit (fastq-dump --split-files) (19). Exact duplicate reads among both forward and reverse reads were removed using PRINSEQ (-derep 1; v0.20.4) (20). All sequencing datasets were limited to Illumina shotgun metagenomic paired-end reads. Four datasets were analyzed. The first was from oceanic surface water collected at 5 m depth in the Caribbean Sea as part of the Tara Oceans expedition (21). The second was sediment collected from 8-10 cm below the surface (sulfate-rich zone) at the White Oak River Estuary, Station H, North Carolina, USA (4). The third was collected from maize soil (22). The last was collected from human fecal samples and represents a human gut microbiome (23). All datasets analyzed in this study are summarized in Table 1.
MAG Assembly Pipeline
The pipeline developed here followed similar pipelines described by other authors (3, 24). All sequence datasets were analyzed as follows. Trimmomatic (v0.36) (25) removed adapters and trimmed low-quality bases from the ends of individual reads. Leading and trailing base quality scores were required to be >3. The sliding window was set to 4 base pairs, and windows with a mean quality score <15 were filtered. Quality-controlled reads were assembled into contigs using MEGAHIT (v1.1.2; --presets meta-large) (26). Due to RAM limitations, assembled contigs <3000 bp in length were excluded from the analysis. Redundant contigs were removed using CD-HIT (v4.6.8; cd-hit-est -c 0.99 -n 10) (27). Similarity among the remaining contigs was further evaluated via inter-contig sequence alignments using Minimus2 (-D OVERLAP=100 MINID=95). The quality-controlled reads (i.e., after Trimmomatic) were then mapped to the remaining contigs using Bowtie 2 (v2.3.3) (28) to generate a coverage score for individual contigs.
Resultant contigs were iteratively clustered into MAGs using the unsupervised clustering algorithm Binsanity (v0.2.6) (24). Similar to Tully et al. (3), six initial clustering iterations were performed with the preference parameter (-p) set to −10 (iteration 1), −5 (iteration 2), and −3 (iterations 3–6). Between iterations, a refinement step (Binsanity-refine) was performed on the putative MAGs with a constant preference (-p) of −25. The refined putative MAGs were evaluated for contamination and completeness using CheckM (v1.0.6) (12), which uses HMMER (v3.1) and Prodigal (v2.6.3) (29). Putative MAGs meeting one of the following criteria were treated as high-quality: 1) completeness > 90% and contamination < 10%, 2) completeness > 80% and contamination < 5%, or 3) completeness > 50% and contamination < 5%. All other MAGs were considered low-quality. MAGs defined as high-quality were not modified further, and their contigs were not used in the subsequent reclustering and refinement steps. The contigs associated with low-quality MAGs were pooled together and reclustered during the next iteration of Binsanity clustering. After the sixth iteration, the remaining MAGs which did not fall into one of the three high-quality categories underwent additional refinement using Binsanity-refine. During this step, MAGs were iteratively refined with preference set to −10 (iteration 1), −3 (iteration 2), and −1 (iteration 3). Between each refinement step, contamination and completeness were evaluated using CheckM. Again, MAGs which met one of the high-quality criteria described above were not modified further, and their contigs were not used in subsequent refinement steps. After the last iteration of refinement, all MAGs were reevaluated for completeness and contamination and assigned a final taxonomic rank using CheckM.
Completeness and contamination values for MAGs with a resolved taxonomic rank of at least phylum were integrated. The integrated completeness was then divided by 100 to produce the effective number of MAGs.
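As a minimal sketch, the effective-MAGs metric reduces to summing CheckM completeness percentages over qualifying MAGs and dividing by 100 (the function name and example values below are illustrative, not from the published pipeline):

```python
def effective_mags(completeness_percent):
    """Sum CheckM completeness values (0-100%) over MAGs resolved to at
    least phylum, then divide by 100 to express the total as equivalents
    of 100%-complete MAGs."""
    return sum(completeness_percent) / 100.0

# Four hypothetical MAGs at 90%, 75%, 50%, and 35% completeness
# together count as 2.5 effective MAGs.
print(effective_mags([90, 75, 50, 35]))  # 2.5
```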
Subsampling Sequence Read Datasets
The effect of decreased sequencing depth was simulated by subsampling the initial sequence read datasets described above. Downloaded sequence read datasets were randomly sampled at set fractions of 1%, 10%, 20%, 40%, 60%, 80%, 90%, 95%, and 100%. To account for variability in the reads sampled at a given fraction, each fraction was resampled, assembled, and binned in triplicate. All triplicates were analyzed using the MAG assembly pipeline described above.
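A minimal sketch of the fraction-based subsampling, assuming read pairs are tracked by a shared index so mates stay together (the function and seeds are illustrative; the study's own code is in the linked repository):

```python
import random

def subsample_pairs(pair_indices, fraction, seed=0):
    """Randomly sample a fixed fraction of read-pair indices without
    replacement; forward/reverse mates stay together because the pair
    index, not the individual read, is sampled."""
    rng = random.Random(seed)
    n_keep = round(len(pair_indices) * fraction)
    return sorted(rng.sample(pair_indices, n_keep))

pairs = list(range(1000))            # stand-in for 1000 read pairs
for fraction in (0.01, 0.10, 0.20):  # a few of the study's fractions
    # triplicate resampling = three different seeds per fraction
    sizes = [len(subsample_pairs(pairs, fraction, seed=s)) for s in (1, 2, 3)]
    print(fraction, sizes)
```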
Modeling MAG Response to Sequencing Depth
Effective MAGs as a function of sequencing depth was modeled for the environmental sequence datasets using the Gompertz equation, as reformulated by Zwietering et al. (30) for use with microbial growth curves:

MAGs_effective = A·exp(−exp((μ·e/A)·(λ − b) + 1))    (equation 1)

where A, μ, and λ are fit coefficients, b is high-quality bases, and e is Euler's number. To assess the validity of this function, AIC (6) was calculated for all Gompertz equation fits and compared to AIC values for linear regression models fit to the same datasets.
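A sketch of this model comparison using the Zwietering form of the Gompertz function on synthetic data (the parameter values and noise are invented for illustration, and scipy's curve_fit stands in for whatever fitting routine was actually used):

```python
import numpy as np
from scipy.optimize import curve_fit

def gompertz(b, A, mu, lam):
    """Zwietering reformulation: A = asymptote (max effective MAGs),
    mu = maximum rate, lam = lag (bases before rapid MAG retrieval)."""
    return A * np.exp(-np.exp(mu * np.e / A * (lam - b) + 1.0))

def aic(y, yhat, n_params):
    """AIC for least-squares fits: n*ln(RSS/n) + 2k."""
    rss = np.sum((y - yhat) ** 2)
    return len(y) * np.log(rss / len(y)) + 2 * n_params

b = np.linspace(0.0, 30.0, 25)            # Gb of high-quality bases (synthetic)
y_true = gompertz(b, 40.0, 5.0, 3.0)      # hypothetical "true" response
y_obs = y_true + np.random.default_rng(1).normal(0.0, 0.5, b.size)

popt, _ = curve_fit(gompertz, b, y_obs, p0=[30, 3, 1], maxfev=10000)
lin = np.polyfit(b, y_obs, 1)

aic_gomp = aic(y_obs, gompertz(b, *popt), 3)
aic_line = aic(y_obs, np.polyval(lin, b), 2)
print(aic_gomp < aic_line)  # Gompertz should win on sigmoidal data
```

Lower AIC favors the Gompertz fit on lag-then-asymptote data, mirroring the comparison reported for the environmental datasets.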
Defining the Microbial Metagenome and Sequencing Probability
Here we draw on set theory to provide a theoretical grounding for the in silico simulations described below. The application of probability theory for predicting the expected number of sequences required to sequence a metagenome was founded on defining a metagenome as the set of available metagenomic DNA that can be sequenced in a sequencing experiment. Fig 6A–E provides a cartoon example illustrating the application of this set theory to a hypothetical microbial population, G. G is a community of genomes (g) with finite abundances (n). As the definition of microbial species is somewhat contentious (31), g is taken as the average genome for all individual genomes meeting some criteria defining a taxonomic rank. Thus, the richness (s) of G, or the total number of g, depends on the definition of g. In the example G (Fig 6A–E), s=6 and the total n=13. Thus, G can be represented as (Fig 6A):

G = {g1, g2, …, gs}    (equation 2)

where s is the total number of unique species within the community (richness). When characterizing G via shotgun metagenomics, the ith genome, gi, can be sequenced at K unique sections given a characteristic read length, k, and average genome size, l, in number of base pairs (Fig 6B). Thus, the number of unique k-sized reads, K, associated with the ith genome, gi, within G is equal to:

K = l − k + 1    (equation 3)
From equation 3, the metagenome, gMG, for gi is defined as the set of all unique possible k-sized reads (Fig 6C), or:

gMG,i = {gi,1→1+k, gi,2→2+k, …, gi,K→K+k}    (equation 4)

where the subscripts for gi represent a given k-sized read spanning from an arbitrary starting base pair to the arbitrary starting base pair plus k. By substituting gMG,i for all g in equation 2 (Fig 6D), the metagenome for a microbial community, GMG, is derived to be:

GMG = {gMG,1, gMG,2, …, gMG,s}    (equation 5)

while the population of unique k-sized reads in the metagenome GMG, denoted KMG (Fig 6E), is represented as:

KMG = gMG,1 ∪ gMG,2 ∪ … ∪ gMG,s    (equation 6)
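These set definitions can be made concrete for a toy community; below, each genome's metagenome gMG is built as a Python set of k-mers and KMG as their union (the 10-bp "genomes" are invented for illustration):

```python
def kmer_set(genome, k):
    """g_MG for one genome: the set of unique k-sized reads; its size is
    at most K = l - k + 1, with repeated k-mers collapsing the set."""
    return {genome[i:i + k] for i in range(len(genome) - k + 1)}

k = 4
genomes = {"g1": "ACGTACGGTT", "g2": "TTGCACGTAC"}   # toy 10-bp genomes
g_mg = {name: kmer_set(seq, k) for name, seq in genomes.items()}

# K_MG: the union of the genome-level k-mer sets (shared k-mers counted once)
K_MG = set().union(*g_mg.values())
print({name: len(s) for name, s in g_mg.items()})    # each <= 10 - 4 + 1 = 7
print(len(K_MG))                                     # shared k-mers collapse
```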
From equation 4, one can determine the cardinality, or total number, of unique k-sized reads associated with GMG (expressed as |KMG|). When attempting to fully sequence GMG using shotgun metagenomics, we assume that sampling events (sequence reads) are independent and are sampled with replacement. In fact, Illumina sequencing technology sequences reads in parallel, with individual DNA fragments binding to individual clusters. Furthermore, a given DNA fragment cannot be sequenced twice, as the sequencing process is destructive (32). Nonetheless, the mass of DNA extracted from a target environment represents a negligible fraction of the total DNA which exists in that environment. As the relative abundance of the k-sized reads in KMG does not change when DNA is extracted from an environment, sampling events can be treated as independent, and thus DNA sampling reduces to sampling with replacement. If the proportion of DNA mass extracted had a significant impact on the remaining mass of DNA in the environment, then it would be preferable to sequence all of the DNA rather than a smaller proportion of it. The sequencer should have no impact on sampling, assuming no sequencing errors due to misreading or spatial sampling issues (i.e., clonal density issues). These issues obviously do exist, but for the sake of a first-order, general approximation, they can be ignored.
By making the above assumptions, the probability of sequencing all elements in GMG reduces to a coupon collector's problem (33). Using the general functional form for calculating the expected number of samples required to sample all unique elements in a set (equation 13b in 8), one can predict the number of sequences necessary to sequence all elements in KMG, such that the expected number of sequences, E(GMG), is:

E(GMG) = ∫₀^∞ [1 − ∏j (1 − e^(−pj·t))] dt    (equation 7)

where j is a given element within KMG, t is the number of sampling events, and pj is equal to the proportion of the jth k-sized read within a given population of k-sized reads. pj can be expressed as follows:

pj = ni / Σj′=1..|KMG| ni(j′)    (equation 8)

where ni is the respective abundance of the species whose metagenome contains the jth k-sized read, ni(j′) is the corresponding abundance for the j′th k-sized read, and |KMG| is the cardinality of GMG, or the total number of unique k-sized reads in the metagenome, GMG.
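The coupon-collector expectation can be evaluated numerically. The sketch below assumes the standard integral form for unequal sampling probabilities, E = ∫₀^∞ [1 − ∏j (1 − e^(−pj·t))] dt, which for a perfectly even pool of n elements reduces to the classic n·(1 + 1/2 + … + 1/n):

```python
import numpy as np
from scipy.integrate import quad

def expected_reads(p):
    """Expected number of reads needed to observe every k-sized read at
    least once, given sampling probabilities p (summing to 1), via the
    coupon-collector integral E = int_0^inf (1 - prod_j(1 - e^{-p_j t})) dt."""
    p = np.asarray(p, dtype=float)
    integrand = lambda t: 1.0 - np.prod(1.0 - np.exp(-p * t))
    value, _ = quad(integrand, 0.0, np.inf, limit=500)
    return value

even = np.full(10, 0.1)                    # 10 k-mers, perfectly even
uneven = np.array([0.55] + [0.05] * 9)     # one dominant k-mer, rare tail
print(round(expected_reads(even), 2))      # 10 * H_10, about 29.29
print(expected_reads(uneven) > expected_reads(even))  # unevenness costs reads
```

Even at this tiny pool size, the uneven community needs markedly more reads, foreshadowing the community-structure effect in the simulations.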
Modeling Expected Sequences
Equation 7 provides an estimate for the total number of sequences required to sequence all of KMG. The influence of increasing species richness (i.e., s in equation 2) on the expected number of sequences was tested for four hypothetical communities. The first community had an even structure such that the metagenomic DNA segments were equally distributed across all of KMG. In the second community, 90% of the metagenomic DNA segments were equally distributed in 50% of KMG, and the remaining 10% of the metagenomic DNA segments were distributed equally across the remaining 50% of KMG. This represented a community with relatively moderate species evenness. In the third community, 90% of the metagenomic DNA segments were equally distributed across 10% of KMG, and the remaining 10% of the metagenomic DNA segments were distributed equally across the remaining 90% of KMG. This represented a community with relatively low species evenness. The last community had 10 equally-sized groups, or octaves (i.e., s was the same in all groups). The abundance of the metagenomic DNA segments in each group followed a lognormal distribution, which has been observed in true microbial populations (e.g., (7, 16)). The functional form for modeling abundances was based on the functional form of a lognormal community (34):

S(R) = S0·e^(−(a·R)²)    (equation 9)

where S0 was treated as the maximum relative abundance (S0 = 1), a was the inverse width of the distribution, R was the positive octave range spanning 0 to 9, and S(R) represented the abundance for a given octave. For the lognormal abundance distribution in Fig 2D, a was set to 0.2. Each hypothetical community started with a number of unique k-sized reads |KMG| = 1 × 10^2. |KMG| was then incrementally increased in 10 equally-spaced, linear steps to a maximum of |KMG| = 1 × 10^6. As |KMG| increased, all community structures remained constant.
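These community structures can be encoded as weight vectors over the k-mer pool; the sketch below builds the lognormal (octave) case from equation 9 (pool sizes are shrunk from the paper's |KMG| values for speed, and the helper names are ours):

```python
import numpy as np

def lognormal_octaves(a, n_octaves=10, S0=1.0):
    """Octave abundances S(R) = S0 * exp(-(a*R)^2) for R = 0..n_octaves-1."""
    R = np.arange(n_octaves)
    return S0 * np.exp(-(a * R) ** 2)

def community_weights(a, n_kmers=1000, n_octaves=10):
    """Relative abundance of each unique k-sized read: the pool is split
    into equally-sized octaves, each octave sharing one S(R) value."""
    per_octave = n_kmers // n_octaves
    w = np.repeat(lognormal_octaves(a, n_octaves), per_octave)
    return w / w.sum()

w_even = community_weights(a=0.0)   # a = 0 reduces to a perfectly even pool
w_skew = community_weights(a=0.2)   # a = 0.2, as for the Fig 2D structure
print(bool(np.allclose(w_even, w_even[0])))
print(w_skew[0] / w_skew[-1])       # most/least abundant octave ratio
```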
Graphical representation of rank abundance in Fig 2A was normalized by a given |KMG| to reflect that populations retained the same structure even as population size varied. We defined a normalized rank abundance, rn, such that:

rn = r / s

where r and s are the untransformed rank abundance and richness, respectively. Thus, the most abundant k-mer in a metagenome population has a normalized rank abundance of 1/s and the least abundant has a normalized rank abundance of 1. For each community, at each step, the expected number of sequences was calculated using equation 7. The expected number of sequences as a function of |KMG| was modeled with linear regressions.
Equation 7 gives the expected number of sequences required to sequence any sized community to exhaustion. Numerical sequencing simulations were performed to determine the number of sequences necessary to sequence a subset of all unique DNA (KMG). These numerical sequencing simulations were applied to the four hypothetical community structures described above. Numerical simulations were performed such that |KMG| = 3 × 10^7, 4 × 10^7, 5 × 10^7, 7 × 10^7, 9 × 10^7, and 1 × 10^8. During each of these simulations, the read length (k) and average genome size (l) were set to 100 and 1 × 10^6, respectively, for all g. Random elements from KMG were selected with replacement to simulate a sequencing event. Numerical simulations were performed until the fraction of |KMG| sequenced was 50%, 70%, 90%, 95%, 99%, or 100%. A weight distribution was applied to elements in a given KMG; the weight distribution biased sequencing to reflect the relative abundances of the four hypothetical communities described above. The fraction of |KMG| sequenced was evaluated every 1 × 10^7 sequences. Numerical simulations were performed in triplicate for all |KMG| and all target fractions of |KMG|.
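A scaled-down sketch of this simulation: draw weighted reads with replacement and count draws until a target fraction of the unique pool has been seen (pool size, batch granularity, and the toy weight vector are shrunk or invented for speed; the study's real code is in the linked repository):

```python
import numpy as np

def reads_to_target(weights, target_fraction, batch=1000, seed=0):
    """Simulate weighted sampling with replacement from the unique k-mer
    pool; return the number of reads drawn once at least target_fraction
    of the pool has been seen (coverage checked every `batch` reads)."""
    rng = np.random.default_rng(seed)
    n = len(weights)
    seen = np.zeros(n, dtype=bool)
    reads = 0
    while seen.sum() < target_fraction * n:
        seen[rng.choice(n, size=batch, p=weights)] = True
        reads += batch
    return reads

n = 2000
even = np.full(n, 1.0 / n)               # perfectly even toy pool
raw = np.linspace(1.0, 20.0, n)          # mildly uneven toy structure
uneven = raw / raw.sum()
for frac in (0.5, 0.9, 0.99):
    print(frac, reads_to_target(even, frac), reads_to_target(uneven, frac))
```

As in the full-scale simulations, higher target fractions and less even weights both inflate the read count, with the rare tail dominating the cost near exhaustion.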
We explored the influence of community evenness on required sequencing depth by performing numerical sequencing simulations on 6 different lognormally-distributed communities. The numerical sequencing simulations were similar to the simulations described above. The 6 lognormal communities were modeled such that each community had S0 = 1, 10 equally-sized octaves, and |KMG| = 1 × 10^7. The 6 lognormal distributions differed in a, where a=0, a=0.005, a=0.008, a=0.01, a=0.015, and a=0.02. Evenness was represented using the Pielou evenness index (9), which is the ratio of the Shannon diversity index (35) for a given community to that of an even community of the same richness. Shannon diversity was calculated in the context of a metagenome such that:

H′MG = −Σj=1..|KMG| pj·ln(pj)

where pj is the proportion that the jth k-sized read represents among all unique DNA sequences in the metagenome. The Pielou evenness index (9) was then calculated such that:

J = H′MG / H′MG,max

where J is the Pielou evenness index, H′MG is the metagenome Shannon diversity index, and H′MG,max represents the metagenome Shannon diversity index when all pj are equal (i.e., a=0).
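The evenness calculation can be sketched directly from these definitions (note that the a values below are exaggerated relative to the paper's 0 to 0.02 range so the decline in J is visible at this small pool size):

```python
import numpy as np

def shannon(p):
    """Metagenome Shannon diversity H' = -sum_j p_j ln p_j."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def pielou(p):
    """Pielou evenness J = H'/H'_max, where H'_max = ln(|K_MG|) is the
    diversity of a perfectly even pool of the same richness."""
    return shannon(p) / np.log(len(p))

R = np.repeat(np.arange(10), 100)      # 10 octaves, 100 k-mers each
for a in (0.0, 0.1, 0.2):              # exaggerated inverse widths
    w = np.exp(-(a * R) ** 2)          # octave weights per equation 9
    print(a, round(pielou(w / w.sum()), 4))
```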
Lastly, numerical simulations were performed to determine the sequencing depth necessary to achieve a target fraction for an individual metagenome (gMG). Target fractions were increased from 0.5 to 1 in 100 linearly-spaced intervals. The fraction of the metagenome community (GMG) that gMG represented varied from 1% to 100% in 30 logarithmically-spaced intervals. The target genome sizes (l) varied such that l = 0.5 × 10^6, 1 × 10^6, 2 × 10^6, 3 × 10^6, 5 × 10^6, 10 × 10^6, 15 × 10^6, and 20 × 10^6. The sequencing depth for a given combination of target fraction, genome size, and fraction of the metagenome community was modeled using the gam function (mgcv R package; (36)). For modeling purposes, target fraction was raised to the 12th power and both genome size and sequences were log-transformed. The number of smooth dimensions for fraction of community, genome size, and target fraction were heuristically varied until the resulting fit demonstrated residuals with a normal distribution. Note that the objective here was not to build a predictive model but simply a first-order approximation for the simulations performed here.
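As a sketch of the design grid and predictor transforms described above (this reproduces the inputs only; the GAM itself was fit with the mgcv R package, and the log-spaced community fractions are an assumption about the sampling scheme):

```python
import numpy as np

# Simulation design grid (values taken from the Methods text)
target_fraction = np.linspace(0.5, 1.0, 100)             # 100 linear steps
community_fraction = np.logspace(-2, 0, 30)              # 1% to 100%, log-spaced
genome_size_bp = np.array([0.5, 1, 2, 3, 5, 10, 15, 20]) * 1e6

# Predictor transforms applied before fitting the GAM
tf_transformed = target_fraction ** 12                   # raised to 12th power
log_genome_size = np.log10(genome_size_bp)               # log-transformed sizes
print(len(target_fraction), len(community_fraction), len(genome_size_bp))
print(tf_transformed[-1])   # 1.0 at a target fraction of 1
```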
Data Availability
All simulations and codes used for modeling sequencing depth are freely available on Github at: https://github.com/taylorroyalty/sequence_simulation_code.
Acknowledgements
This research was supported by the National Science Foundation and a C-DEBI subaward (contribution number to be determined).