Abstract
Shotgun short-read sequencing methods facilitate study of the genomic content and strain-level architecture of complex microbial communities. However, existing methodologies do not capture structural differences between closely related co-occurring strains such as those arising from horizontal gene transfer and insertion sequence mobilization. Recent techniques that partition large DNA molecules, then barcode short fragments derived from them, produce short-read sequences containing long-range information. Here, we present a novel application of these short-read barcoding techniques to metagenomic samples, as well as Athena, an assembler that uses these barcodes to produce improved metagenomic assemblies. We apply our approach to longitudinal samples from the gut microbiome of a patient with a hematological malignancy. This patient underwent an intensive regimen of multiple antibiotics, chemotherapeutics and immunosuppressants, resulting in profound disruption of the microbial gut community and eventual domination by Bacteroides caccae. We significantly improve draft completeness over conventional techniques, uncover strains of B. caccae differing in the positions of transposon integration, and find the abundance of individual strains to fluctuate widely over the course of treatment. In addition, we perform RNA sequencing to investigate relative transcription of genes in B. caccae, and find overexpression of antibiotic resistance genes in our de novo assembled draft genome of B. caccae coinciding with both antibiotic administration and the appearance of proximal transposons harboring a putative bacterial promoter region. Our approach produces overall improvements in contiguity of metagenomic assembly and enables assembly of whole classes of genomic elements inaccessible to existing short-read approaches.
Introduction
Short-read metagenomic sequencing and assembly have played an instrumental role in advancing the study of bacterial genomes beyond the minority of culturable organisms1. This has greatly expanded our understanding of the genomic structure and dynamics of the human microbiome, which has emerged as an important element in many aspects of health and disease2–4. However, the precise genetic makeup of organisms within these complex systems remains poorly understood.
Duplicated and conserved sequences within a metagenome complicate the recovery of strain-level architecture with existing short-read methods. These sequences arise from several mechanisms, including horizontal gene transfer and transposon mobilization, each with a well-described capacity to induce significant changes in phenotype. Horizontal gene transfer (HGT) results in the acquisition and dissemination of functional elements that can include antibiotic resistance genes, virulence factors, or metabolic capabilities5,6. Mobile elements can affect gene function and regulation by disrupting coding sequences7, or by introducing new promoter sequences8,9, and have been observed to mobilize in response to antibiotic stress10. In addition, gene upregulation mediated by mobile sequences has been shown to increase antibiotic resistance8,11. HGT and mobile sequence element duplication represent two mechanisms by which bacterial genomes acquire repeated sequences, which pose challenges for metagenomic sequence analysis.
Specialized computational techniques have been developed to recover draft genomes for individual organisms within metagenomic samples. These techniques include dedicated metagenomic assemblers12–14 and contig binning approaches based on sequence similarity15–18 and coverage depth covariance19. Binning techniques can group assembled contigs into significantly more comprehensive drafts, but do not improve the contiguity of the input assembly. Sequence fragment sizes from existing high throughput platforms are too short to span duplicated sequences, and as a result, these regions currently remain unassembled. This represents an inherent limitation of short reads, necessitating a complementary molecular approach to assemble these classes of genomic sequences.
In principle, long-read approaches can be used to address these issues. Single molecule platforms such as Pacific Biosciences’ Single Molecule Real Time sequencing approach have been successfully applied to close genomes of cultured isolates20–22 and dominant organisms within more complex mixtures23, as well as to assemble mobile and duplicated sequences within cultured isolates24,25. Synthetic long reads have also been used to improve metagenomic assemblies26. However, limited throughput and relatively high input DNA mass requirements bar these approaches from application to biological samples where high molecular weight DNA is limited. In addition, single molecule approaches have comparatively low nucleotide accuracy, which may impede resolution of closely related strains.
Several existing platforms address these shortcomings by partitioning long input DNA fragments, and barcoding shorter fragments derived from them, to tag short reads with long-range information. The recent 10X Genomics platform streamlines this barcoding process with more than 100,000 droplet partitions to yield uniquely barcoded short-read fragments from one or a few long molecules trapped in each droplet partition27. Sequencing of 10X libraries yields shallow coverage depth groups of barcode-sharing reads, which we will refer to as read clouds28 (also referred to as linked reads27). This platform offers an attractive combination of high nucleotide accuracy, low input mass requirements and long-range information. Applications of this platform and similar ones predating it have focused on reference-based human haplotype phasing26,27,29–31, and their potential for de novo metagenomic sequence assembly has yet to be explored.
We present, to our knowledge, the first approach that leverages read clouds from the 10X Genomics Gemcode platform to directly assemble complex metagenomic mixtures from a single sample. We developed an assembler, Athena, that uses the barcode information to assemble sequences that cannot be placed in correct genomic context using short reads alone. We first tested our approach on a mock mixture of DNA from 10 known bacterial species and used Athena to accurately assemble and place multiple copies of the ribosomal RNA (rRNA) operon within the draft assembly of each species.
We then applied our technique to a clinical gut microbiome time series from a patient undergoing stem cell transplantation for a hematological malignancy. This patient underwent an extensive series of antibiotic, antiviral, antifungal, chemotherapeutic and immunosuppressive treatments, imparting profound selective pressure on the gut microbiome, resulting in domination by Bacteroides caccae. Our barcoded assembly approach reveals the presence of a number of nearly identical B. caccae strains differing in the position of integration of transposons and a large transferred region (genomic island), which we validated using long-range PCR and Sanger sequencing. Our improved drafts allowed us to quantify gene expression in regions that were fragmented or unresolved by short-read assembly, revealing potential transcriptional effects of insertion sequences and larger mobile elements not otherwise accessible by conventional short-read assembly approaches.
Results
Read cloud sequencing and Athena Assembly
We developed the Athena assembler to use long-range information encoded within barcoded short-read sequences. In our approach, we first extract high molecular weight DNA and use the 10X Genomics Gemcode platform to obtain barcoded short reads for our samples (Figure 1a). The barcoded short reads are first jointly assembled with a conventional short-read assembler (see Methods) to obtain an initial covering of the metagenome in the form of assembled sequence contigs. These seed contigs are then provided to the Athena assembler for further metagenome sequence assembly (Figure 1b). The same barcoded short reads are mapped back to the seed contigs, and read pairs that span contigs are used to form edges in a scaffold graph. Branches in this scaffold graph correspond to ambiguities encountered by the short-read assembler. At each edge, Athena examines the short-read mappings together with the attached barcodes to propose a simpler subassembly problem: a pooled subset of barcoded reads that can potentially assemble through branches in the scaffold graph (see Supplementary Methods). Selecting this read subset removes the majority of reads considered during the initial assembly while retaining reads that cover the local target sequence, isolating the local subassembly problem from the broader metagenome. These much smaller, independent subassembly problems are solved for every edge in the scaffold graph to yield longer, overlapping subassembled contigs that resolve branches in the scaffold graph. The initial seed contigs and intermediary subassembled contigs are then passed to an overlap-layout-consensus assembler, Canu32 (formerly the Celera Assembler), which determines how to assemble the target genome from these much longer contigs. The resulting metagenome assembly consists of more complete sequence contigs with resolved repeats that are too difficult to assemble with short-read techniques alone.
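The scaffold graph and read pooling steps described above can be illustrated with a minimal Python sketch. It is not Athena's implementation: the `Alignment` record, the function names, the `min_pairs` threshold, and the rule of pooling every read whose barcode is seen on both contigs of an edge are assumptions made for illustration. The actual selection procedure is given in the Supplementary Methods.

```python
from collections import defaultdict
from typing import NamedTuple


class Alignment(NamedTuple):
    """Hypothetical record for one mapped read (not Athena's data model)."""
    read_pair: str  # identifier shared by both mates of a pair
    mate: int       # 1 or 2
    contig: str     # seed contig the read maps to
    barcode: str    # barcode attached to the read


def scaffold_edges(alignments, min_pairs=2):
    """Count read pairs whose mates map to different seed contigs.

    Returns {(contig_a, contig_b): n_pairs}, keeping edges supported by
    at least min_pairs read pairs (an assumed, illustrative threshold).
    """
    by_pair = defaultdict(dict)
    for aln in alignments:
        by_pair[aln.read_pair][aln.mate] = aln.contig
    counts = defaultdict(int)
    for mates in by_pair.values():
        if len(mates) != 2:
            continue  # skip pairs with an unmapped mate
        contig_a, contig_b = mates.values()
        if contig_a != contig_b:
            counts[tuple(sorted((contig_a, contig_b)))] += 1
    return {edge: n for edge, n in counts.items() if n >= min_pairs}


def pooled_reads(alignments, edge):
    """Pool all read pairs carrying a barcode observed on both contigs of an edge."""
    barcodes = defaultdict(set)
    for aln in alignments:
        barcodes[aln.contig].add(aln.barcode)
    shared = barcodes[edge[0]] & barcodes[edge[1]]
    return sorted({aln.read_pair for aln in alignments if aln.barcode in shared})
```

Under these assumptions, each edge yields a small read pool that could be handed to a short-read assembler as an independent subassembly problem.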
Assembly of highly conserved elements in a synthetic metagenomic community
As a first validation of our approach, we performed read cloud sequencing on a mock mixture of DNA from 10 known bacterial species (see Methods) and tested Athena’s ability to accurately assemble the conserved 16S and 23S ribosomal RNA operon subunits. This repeat unit, which varies in size from 5kb to 7kb depending on the size of the spacer between the 16S and 23S subunits, is known to occur in multiple nearly identical copies throughout individual bacterial genomes33,34. These subunits are highly conserved across all bacterial species, yet diverge sufficiently between species to serve as an informative marker for phylogenetic characterization of microbial communities35. These interspecies repeat units, which also occur in multiple copies within each individual genome, serve as a useful model of the assembly issues created by duplicated and conserved sequences.
Our validation process entailed Athena assembly, 16S/23S operon identification, and molecular validation of assembled ribosomal RNA loci. We produced both standard short-read and read cloud libraries for our ten species mixture, and applied conventional assembly and Athena on each of these libraries, respectively. To obtain drafts for each of the ten organisms for each approach, we classified resultant contigs, and grouped contigs sharing species-level classifications (see Methods). We used RNAmmer36 to find instances of rRNA subunits within the short-read and Athena drafts. Conventional short-read assembly produced a single disconnected instance of the rRNA operon most closely resembling that of Bacteroides ovatus. In contrast, Athena read cloud assembly assembled 28 copies of the complete rRNA operon (Supplementary Table 1). In addition, Athena assembled 2 copies of 16S and 2 copies of 23S outside the operon context. Of the 32 total instances of assembled rRNA, we chose 23 with at least 3kb of assembled sequence on left and right flanks for long-range PCR validation, allowing flanking PCR primer design for amplification across the entire repeat (Supplementary Figure 1). All 23 PCR experiments produced specific amplicons of the anticipated length. To further validate rRNA assemblies, we obtained Sanger reads confirming the left and right junctions between the operon and genomic flanking sequence (Supplementary File 1) for those instances of the rRNA operon occurring within the Klebsiella genome. Sanger primer sequences targeted regions of high nucleotide identity internal to the operon, and initiated Sanger read extension outward into genomic flanks (Supplementary Figures 2,3). In summary, we demonstrated validated assembly of multiple copies of highly conserved rRNA subunits in a mock mixture of bacterial DNA, and were able to show markedly improved capability over conventional short-read assembly to resolve the genomic contexts of these sequences.
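The flank requirement used to choose loci for long-range PCR amounts to a simple coordinate check. The sketch below is illustrative only: the function name and the 0-based, end-exclusive coordinate convention are assumptions, not part of the published pipeline.

```python
def has_pcr_flanks(contig_length, start, end, min_flank=3000):
    """Return True if a feature has at least min_flank bases assembled on both sides.

    start and end are assumed to be 0-based, end-exclusive coordinates of the
    rRNA instance on its contig (a convention chosen for this sketch).
    """
    left_flank = start
    right_flank = contig_length - end
    return left_flank >= min_flank and right_flank >= min_flank
```

For example, a 6kb operon at positions 3,000 to 9,000 of a 12kb contig passes the check, while the same operon starting one base earlier does not.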
Assembly of a clinical gut microbiome time series
To test the generalizability of this approach to natural biological samples, we next applied Athena to longitudinal clinical gut microbiome samples obtained from a patient receiving treatment for hematological malignancy at the Stanford Hospital Blood and Marrow Transplant Unit. The patient underwent hematopoietic stem cell transplantation (HCT) for myelodysplastic syndrome and myelofibrosis, which was refractory to treatment with azacitidine. The patient received multiple medications during the period of observation, including antibiotics, antivirals, antifungals, chemotherapeutics and immunosuppressives. The patient was diagnosed with gastrointestinal (GI) graft-versus-host disease (GVHD) (clinical grade 3, histological grade 1 in duodenal biopsy). Fiber, fermented foods, probiotics, and prebiotics were restricted from the patient’s diet over the entire time series. During the course of treatment, the patient’s gut microbiome underwent profound simplification, rapidly becoming dominated by Bacteroides caccae, a rare opportunistic pathogen37 with mucin-degradation capability38 (Figure 2; for genus-level classifications see Supplementary Figure 4; for classification data see Supplementary File 4).
To study the trajectory of this patient’s gut microbiome throughout this treatment, we selected the following four time points for sequencing: A (day 0) pre-chemotherapy and pre-HCT; B (day 13) post-chemotherapy and post-HCT; C (day 43) post broad-spectrum antibiotic exposure and onset of GI GVHD; D (day 56) long-term follow up (Figure 2). We prepared both Illumina Truseq and 10X Gemcode libraries for stool samples from these four time points, equalized read counts between the two library types, and assembled each library with a short-read assembler and Athena, respectively (see Methods). We classified the resulting contigs from each metagenomic assembly, and grouped those contigs sharing species-level classifications to obtain drafts of each constituent organism for each approach.
We observed significant improvements in both contiguity and completeness for several of the drafts produced by the read cloud approach as compared to standard short-read techniques (Supplementary Table 2). Our approach yielded drafts with improved contiguity in 15 of 16 organisms with >30x coverage depth in both libraries (average 4.8-fold increase in N50). These belonged to the genera Bacteroides, Parabacteroides, and Haemophilus. These Athena drafts were as complete as those obtained by short-read assembly (average completeness percentages within 2%), ascertained by core gene detection (see Methods, Supplementary File 2). However, improved contiguity in Athena drafts allowed an additional 7.6Mbp of assembled sequence to be assigned to these 16 organisms.
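For readers unfamiliar with the contiguity metric, N50 is the contig length at which contigs of that length or longer contain at least half of the assembled bases. The following is a minimal sketch of the standard definition and of a fold change between two drafts. The function names are illustrative and this is not the code used to produce Supplementary Table 2.

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L hold at least half of all bases."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length
    return 0  # empty assembly


def n50_fold_change(athena_lengths, short_read_lengths):
    """Ratio of Athena N50 to short-read N50 for one organism's drafts."""
    return n50(athena_lengths) / n50(short_read_lengths)
```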
We found that per-species read coverage differed substantially between standard and read cloud libraries, with five organisms receiving >30x coverage in only one of the two libraries (Supplementary Table 2). These discrepancies in coverage are likely due in part to biases introduced by either fragment size selection or sampling effects during the microfluidic molecular partitioning process. We observed an approximate 10-fold decrease in DNA mass during size selection prior to read cloud library preparation, potentially causing depletion of more fragmented organism genomes. In order to separate effects on assembly performance arising from upstream molecular steps from those intrinsic to the Athena assembly approach itself, we examined the 16 organisms that were sufficiently covered in both libraries (>30x short read coverage).
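The coverage filter can be sketched as follows, assuming mean coverage is approximated as total sequenced bases divided by genome size. The function names and the dictionary inputs are hypothetical, and the per-species coverage values in Supplementary Table 2 may have been derived differently.

```python
def mean_coverage(n_reads, read_length, genome_size):
    """Approximate mean depth: total sequenced bases divided by genome size."""
    return n_reads * read_length / genome_size


def covered_in_both(short_read_cov, read_cloud_cov, min_cov=30.0):
    """Organisms exceeding min_cov in both libraries (inputs map organism -> coverage)."""
    return sorted(
        org for org, cov in short_read_cov.items()
        if cov > min_cov and read_cloud_cov.get(org, 0.0) > min_cov
    )
```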
For the predominant Bacteroides caccae, Athena was able to consistently yield more contiguous (time points A, B, C, D) and more complete (time points B, C, D) drafts than conventional techniques (Figure 3). Our best draft in time point D had an N50 of 390kb and total size of 5.5Mb, as compared to an N50 of 61kb and total size of 4.5Mb yielded by short-read techniques. Given these much-improved drafts, we next sought to locate duplicated sequences resolved by Athena. We posited that comparative analysis of the B. caccae drafts across time points may yield insight into either selection or potential genomic remodeling that this organism may have undergone as it grew to eventually dominate this host’s gut microbiome.
Read clouds recover nearly identical strains in clinical samples
In order to locate duplicated sequences that were assembled by Athena, we identified short k-mers that were overrepresented in the Athena assemblies relative to the short-read assemblies, and annotated the most highly overrepresented sets with BLAST (nr/nt database)39. Of the elements thus identified, we focus on mobile element IS612, a conserved Bacteroides insertion sequence (IS), due to its high prevalence in our time series and pronounced fluctuation in abundance across the time series. This sequence is present in the short-read assemblies, but only appears in a single copy with extreme sequence coverage depth (up to 17,345x coverage in time point C) that is detached from genomic context, highlighting a major limitation of standard short-read assembly.
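The k-mer comparison can be illustrated in pure Python. This is a sketch under stated assumptions, not the analysis itself: the study used Jellyfish for counting, whereas the ratio with a pseudocount of 1, the `min_ratio` threshold, the default `k`, and the lack of reverse-complement canonicalization are all simplifications introduced here.

```python
from collections import Counter


def count_kmers(sequences, k):
    """Count every k-mer (forward strand only) across a list of contig sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def enriched_kmers(athena_contigs, short_read_contigs, k=21, min_ratio=2.0):
    """Rank k-mers by (Athena count) / (short-read count + 1).

    The +1 pseudocount (an assumption) keeps the ratio defined for k-mers
    that are absent from the short-read assembly.
    """
    athena = count_kmers(athena_contigs, k)
    short = count_kmers(short_read_contigs, k)
    ratios = ((kmer, n / (short[kmer] + 1)) for kmer, n in athena.items())
    hits = [(kmer, ratio) for kmer, ratio in ratios if ratio >= min_ratio]
    return sorted(hits, key=lambda item: (-item[1], item[0]))
```

A repeat that is collapsed into one copy by the short-read assembler but placed in many genomic contexts by Athena would rank highly under this kind of score.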
From our read cloud assemblies we selected 44 distinct instances of the IS from all timepoints. Selected assemblies had at least 3kb assembled on both flanks, which served as primer design sites for validation by long-range PCR and Sanger sequencing. We were able to obtain Sanger sequencing data validating the specific genomic placement of all but one of the 44 instances. Of the 43 validated IS instances, 20 occurred in contigs classified as B. caccae (Figure 3). The other validated instances were also within contigs classified as belonging to Bacteroides: six in B. vulgatus, two each in B. thetaiotaomicron, B. ovatus, and B. dorei, and a single instance each in B. uniformis and B. xylanisolvens. The remaining 10 did not receive species-level classifications.
At many insertion sites in B. caccae, short-read alignments to the Athena assembly confirmed the co-occurrence of both strains harboring the IS and strains with the pre-insertion ancestral sequence. From these alignments, we obtained an estimate of the relative abundance of ancestral and insertion-containing strains for each site (see Methods). These estimated abundances agreed with observed PCR band intensities at the ancestral and IS-containing band sizes (Figure 4). We observed large shifts in the ancestral abundance at several insertion loci (Supplementary Table 3), with 18 large (>50%) shifts occurring between consecutive time points.
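One way to turn junction-spanning read counts into an abundance estimate is sketched below. The estimator is an assumption made for illustration (ancestral-spanning reads divided by ancestral reads plus the mean of the two IS junction counts) and may differ from the procedure in the Methods. The function names and the 0 to 1 fraction scale are likewise hypothetical.

```python
def ancestral_fraction(ancestral_reads, left_junction_reads, right_junction_reads):
    """Estimate the fraction of strains carrying the pre-insertion sequence.

    ancestral_reads: reads spanning the intact (uninserted) site.
    left/right_junction_reads: reads spanning each IS-flank junction; their
    mean is used as the insertion support (an assumption of this sketch).
    Returns None when the site has no informative reads.
    """
    insertion_support = (left_junction_reads + right_junction_reads) / 2
    total = ancestral_reads + insertion_support
    if total == 0:
        return None
    return ancestral_reads / total


def large_shifts(fractions, threshold=0.5):
    """Indices i where the estimate changes by more than threshold between
    consecutive time points i and i + 1."""
    shifts = []
    for i in range(len(fractions) - 1):
        before, after = fractions[i], fractions[i + 1]
        if before is not None and after is not None and abs(after - before) > threshold:
            shifts.append(i)
    return shifts
```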
In addition to these small-scale structural divergences, we observed larger-scale structural strain variations with similarly pronounced shifts in abundance in B. caccae. Inspection of the global alignments revealed one instance of the IS to be immediately adjacent to a large 60kb region present in only some strains in the last time point. Raw short-read alignments from C and D to the Athena draft from D confirmed this rearrangement and showed the 60kb sequence to be present at much lower abundance relative to flanking genomic sequence during time point C (Figure 5). Annotation of this 60kb island revealed the presence of xerC and xerD tyrosine recombinases, which can mediate genomic integration of mobile elements40,41. In addition, the region contains flagellar motor protein motB, a phage integrase family protein, and an operon encoding four genes mediating streptomycin biosynthesis. We searched for xerCD recognition motifs previously described in Escherichia coli40 within our draft from time point D and found a single 11 base pair site directly adjacent to the island and overlapping with an IS. PCR validation confirmed the integration of the 60kb region, as well as the pre-integration strain containing only the adjacent IS (Supplementary Figure 5). We have demonstrated the validated assembly of numerous instances of a small mobile element as well as a large-scale sequence integration with extensive functional potential, and observed these assembled sequences to fluctuate widely in relative abundance over time.
Identification of insertion-mediated transcriptional upregulation
To explore Athena’s effects on metatranscriptomic data analysis in genomic regions where short-read assemblies would be otherwise too fragmented, we used our drafts as references in RNA sequence data alignment. We performed RNA sequencing on time points B, C, and D of the same samples (RNA yield from sample A was very poor, despite multiple attempts). We compared use of both short-read and Athena drafts as references, and found our more complete B. caccae drafts allowed a significantly larger fraction of all RNA sequencing reads to be assigned to this organism. Specifically, an additional 11%, 22%, and 10% of the RNA sequencing reads from time points B, C, and D respectively were aligned to the Athena drafts of B. caccae over the corresponding short-read drafts. Our more complete drafts allowed for a much larger fraction of the coding potential of this organism to be evaluated.
We next used our Athena drafts together with the RNA sequencing reads to investigate the potential transcriptional effects of the structural changes we detected, focusing on IS612. This IS contains a putative outward-facing promoter near its 5’ end oriented antisense to its transposase coding sequence8. Determining the transcriptional effect of this IS is difficult in a complex metagenomic setting, as RNA-seq reads may originate from co-occurring strains with or without a given insertion. In light of this difficulty, we restricted our attention to integration sites that were dominated first by ancestral strains and then by IS-harboring strains in consecutive time points, with at least 30% change in estimated ancestral abundance. In these sites, a corresponding increase in transcription downstream of the promoter versus upstream transcription is more likely attributable to the additional promoter provided by the IS. We located three such genomic loci of “transcriptional asymmetry”, all demonstrating more than 10-fold higher transcription of the downstream neighboring gene relative to upstream (Figure 6, Supplementary Figures 6 and 7, Supplementary Table 4).
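The site selection and asymmetry criteria can be written compactly. In this sketch, "dominated" is interpreted as an estimated abundance above 50%, and a pseudocount is added to avoid division by zero. Both choices, along with the function names, are assumptions for illustration and are not taken from the Methods.

```python
def is_candidate_site(ancestral_before, ancestral_after, min_change=0.30):
    """Site dominated by ancestral strains, then by IS-harboring strains.

    Arguments are estimated ancestral fractions (0 to 1) at two consecutive
    time points; dominance is assumed to mean a fraction above 0.5.
    """
    return (
        ancestral_before > 0.5
        and ancestral_after < 0.5
        and ancestral_before - ancestral_after >= min_change
    )


def transcriptional_asymmetry(downstream_expr, upstream_expr, pseudocount=1.0):
    """Ratio of downstream to upstream expression around the IS promoter."""
    return (downstream_expr + pseudocount) / (upstream_expr + pseudocount)
```

A locus would be reported under these assumptions when `is_candidate_site` holds and the asymmetry ratio exceeds 10.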
The highest degree of transcriptional asymmetry coincided with an IS612 insertion whose putative promoter is positioned to upregulate norM, a multidrug resistance transporter (Figure 6a). NorM is a multidrug efflux protein found to confer resistance to ciprofloxacin42, which was administered for the first 30 days of treatment through time points A and B (Figure 2). Short-read alignments to this insertion site showed this integration to be undetectable in time point A, present in roughly a third of strains in B, and then in the majority in time points C and D, consistent with visible band patterns in our targeted PCR results (Figure 6b, Supplementary Table 3). Other transcriptional asymmetries were observed at susC, an outer membrane protein involved in starch utilization43, and resA, an oxidoreductase involved in cytochrome c synthesis44 (Supplementary Figures 6, 7). Short-read alignments showed these insertions to be absent at these sites in time point A, and present in roughly a third of the strains in B, the majority of strains in time point C, and half of strains in time point D. The abrupt changes observed in the abundance of these insertions suggest strong selective pressures.
We found the most highly expressed gene in time points C and D to be the extended-spectrum beta-lactamase gene per1, known to confer resistance to meropenem, a major component of the patient’s antibiotic regimen. This gene was expressed nearly 60% more than the second most expressed gene in both time points C and D (Supplementary Table 4). Its rise in expression coincided with a two-week course of meropenem beginning 10 days after time point B and ending 6 days before time point C. Per1 continues to exhibit high expression in time point D, 19 days after withdrawal of meropenem, despite the absence of any further beta-lactam antibiotic administration. Our drafts located an IS adjacent to this gene oriented correctly for IS-mediated transcription to occur. While estimating relative abundance of this insertion, we were unable to detect reads from the ancestral strain, and determined this insertion to be fixed within the population over the course of treatment.
Though several insertion sequences became undetectable in DNA sequence data between time points C and D, strains with insertions adjacent to norM, susC and per1 continued to dominate through the end of our time course. By time point D, ciprofloxacin and meropenem had been withdrawn for 26 and 19 days, respectively, yet expression of resistance factors norM and per1 remained increased compared to levels prior to antibiotic exposure.
Discussion
We present a novel approach for applying barcoded short reads to metagenomics. This method improves contiguity in assembled drafts and enables assembly of whole classes of duplicated elements inaccessible to conventional short-read approaches. The improved draft assemblies generated by Athena serve as a useful basis for RNA sequence data alignment, allowing us to investigate transcriptional effects of newly assembled sequences. In clinical samples, we show that the powerful combination of genome assembly with gene expression analysis yields evidence for antibiotic resistance mechanisms that may evolve quickly via insertion sequence mobilization. We observe apparent transcriptional upregulation of antibiotic resistance and starch metabolism genes by adjacent mobile sequence elements, highlighting the clinical importance of regulatory changes induced by these duplicated sequences.
After confirming the accuracy and evaluating the performance characteristics of this approach using a defined bacterial mock community, we proceeded to apply this approach to human microbiome samples. The clinical subject we examine is an individual who underwent extensive treatment with several classes of medication in a short time period while undergoing hematopoietic stem cell transplantation. Immune suppression and extensive antibiotic treatment rendered the patient studied in this investigation highly susceptible to destabilization and taxonomic simplification of the gut microbiome, which has been associated with increased overall mortality, GI GVHD and other complications4,45–47. Past work shows that antibiotic use and a restricted diet can lead to intestinal domination by one or few microorganisms38,48, but the mechanisms by which any specific microorganism achieves dominance over the larger community remain poorly understood.
Barcode assembly of our clinical gut microbiome time series reveals that populations of microbes with apparently stable taxonomic composition can be composed of many closely related strains undergoing wide fluctuations in abundance. Though our later two time points C and D have nearly identical species compositions, they each harbor numerous strains differing in mobile sequence integration sites. In addition, the most dominant strains in time point D carry a 60kb island containing 72 predicted coding sequences including a four-gene operon encoding streptomycin biosynthesis. The capability of strains to acquire novel genetic material or alter regulation of existing genes generates strain diversity, potentially increasing the likelihood that some highly fit strain evolves to dominate the destabilized gut microbiome of immunocompromised hosts.
IS mobilization in particular provides a means by which organisms can upregulate individual or sets of genes past levels of endogenous expression. In vitro experiments have confirmed that insertion sequences similar to IS612 are mobilized in response to antibiotic stress10 and during strain competition49. These elements can upregulate adjacent genes, potentially resulting in increased antibiotic resistance8. In our clinical time series, insertions expected to affect genes related to antibiotic resistance and starch utilization reached the highest abundances within a timespan of days, consistent with antibiotic use and potentially related to dietary interventions. Our results confirm the occurrence of this conserved mechanism in the gut microbiome, and suggest that it functions as an adaptive response mechanism underlying clinically significant alterations in bacterial behavior.
Our approach to de novo assemble sample-specific strain diversity complements existing approaches to analyze strain-level variation. Previous methods use compiled reference sequence collections, and characterize nucleotide divergence within defined gene sets or gene presence within an organism-specific pan-genome50–52. Although these methods differ substantially, their shared reliance on the reference sequence collection restricts sensitivity in the context of poorly sequenced or unknown species, comprising a large fraction of microbial diversity53, or previously unobserved structural variants of known species. Our results provide a means to obtain significantly improved individual drafts from metagenomes that could enable existing tools to study strain-level variation that is currently underrepresented in existing reference sequence collections.
We anticipate that the approach presented here, which enables short-read assembly to capture new classes of duplicated sequences, will benefit from future improvements in molecular barcoding platforms. In our clinical metagenomes, Athena was unable to produce improved drafts for some organisms due to disproportionately reduced read representation in read cloud library preparation compared to standard methods. We anticipate that improvements in high molecular weight DNA extraction, which preserve relative abundance between species, will allow Athena to provide improved drafts for these organisms as well. We also anticipate that improvements in both coverage uniformity and long fragment partitioning during read cloud library preparation will enable our approach to eventually produce near reference grade microbial genomes from individual gut microbiome samples alone.
Methods
Mock community DNA mixture preparation
For the mock community, purified bacterial isolate DNA was obtained from BEI Resources (Manassas, VA) for each of ten species belonging to distinct genera (Supplementary Table 1). DNA samples were diluted to 5ng/uL and combined in equal volumes. The mixture was concentrated by ethanol precipitation and resuspended in 1/4x volume, and quantified with the Thermo Qubit 3.0 fluorometer (Thermo Fisher Scientific, Waltham, MA) using the Qubit dsDNA HS assay. The mixture was size selected with the Sage Science BluePippin instrument (Sage Science, Beverly, MA) with the BLF7510 agarose gel cassette kit and Marker S1 targeting a 5kb minimum fragment length, then quantified again with Qubit prior to library preparation.
Clinical participant recruitment
The patient from whom the longitudinal stool samples were obtained was recruited at the Stanford Hospital Blood and Marrow Transplant Unit under an IRB-approved protocol (PIs: Dr. David Miklos, Dr. Ami Bhatt). Informed consent was obtained. A comprehensive chart review was carried out to identify clinical features of the patient, demographic information, duration and exposure to medications, and diet.
Clinical DNA and RNA preparation
Stool samples were obtained from the study subject on an approximately weekly basis, when available. Clinical stool samples were placed at 4C immediately upon collection, and processed for storage at −80C the same day. Stool samples were aliquoted into 2mL cryovial tubes with either no preservative or 700uL of RNAlater and homogenized by brief vortexing. Samples were stored at −80C until extraction. DNA was extracted from stool samples with the Qiagen QiAMP Stool Mini Kit modified with the addition of 7 cycles of 30s bead beating alternating with 30s cooling on ice (for full details, see Supplementary Methods). DNA concentration estimations were performed using Qubit fluorometric quantitation. Extracted DNA was size selected with the BluePippin instrument with a 5kb minimum size cutoff as described above. DNA for the synthetic mixture and clinical samples was prepared for sequencing with the Gemcode instrument (10X Genomics, Pleasanton, CA) with revision C of the standard protocol. Library fragment size was quantified with the 2100 Bioanalyzer instrument (Agilent Technologies, Santa Clara, CA) using the High Sensitivity DNA chip and reagent kit.
RNA was extracted with the RNeasy Mini kit (Qiagen, Germantown, MD) from samples stored in RNAlater at −80°C. Total RNA concentration was assayed with the Qubit RNA HS kit and Qubit fluorometer. RNA was ethanol precipitated and resuspended in nuclease-free water to concentrate, then quantified again using both Qubit RNA HS and Qubit DNA HS kits to determine the degree of DNA contamination. Contaminating DNA was removed using the Baseline-ZERO DNase protocol (Epicentre, Madison, WI) with a 30-minute incubation followed by a second ethanol precipitation. Ribosomal RNA was depleted with the Epicentre Ribo-Zero rRNA removal kit (Bacteria). The rRNA-depleted RNA was quantified with the 2100 Bioanalyzer using the Agilent RNA6000 Pico kit. The cDNA sequencing library was prepared with the Illumina (San Diego, CA) Truseq Stranded mRNA kit following the Truseq Stranded mRNA LS protocol.
In addition, conventional short-read sequencing libraries were prepared for DNA extracted from clinical stool samples, the mock mixture and unmixed bacterial isolates. Clinical samples were prepared for sequencing with the Illumina Truseq Nano DNA library preparation kit with a target insert size of 500bp. Mixed and unmixed bacterial isolate DNA samples were prepared for sequencing with the Illumina Nextera XT library preparation kit.
Sequencing and assembly
Bacterial isolate libraries and 10× Gemcode libraries were subjected to 2×148bp sequencing with the Illumina Nextseq 500. The mock mixture, clinical DNA and clinical RNA libraries were subjected to 2×100bp sequencing with the Illumina HiSeq 4000. Raw reads from conventional sequencing were trimmed with Trim Galore v0.4.1 (ref. 54) using a minimum length of 60bp, a minimum terminal base score of 20, and the Illumina adapter sequences. In addition, forward reads were trimmed by 2bp at the 5’ end and reverse reads by 6bp at the 3’ end to remove low-quality bases, and reads were deduplicated with SuperDeduper v2.0 (ref. 55). All short-read libraries were downsampled to equal the total read count of the corresponding read cloud library. Data were assembled using SPAdes v3.6.1 (ref. 56) with default parameters for paired-end input. SPAdes seed assemblies of Gemcode libraries were then reassembled with Athena. Assemblies were visualized with IGV57, R58 and Python using the ggplot2 (ref. 59), circlize60 and matplotlib61 libraries. K-mers enriched in Athena assemblies compared to conventional assemblies were determined using Jellyfish62 for k-mer counting.
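The downsampling step can be illustrated with a minimal Python sketch. It assumes read pairs are held in memory as a list and that pairs are drawn uniformly without replacement; the function name, the seed parameter and the in-memory representation are illustrative assumptions, not a description of the tooling actually used.

```python
import random


def downsample_pairs(pairs, target_count, seed=0):
    """Draw `target_count` read pairs uniformly without replacement.

    `pairs` is a list of (forward, reverse) records. The original order of
    the retained pairs is preserved. All names here are hypothetical.
    """
    if target_count >= len(pairs):
        # Nothing to do: the library is already at or below the target size.
        return list(pairs)
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    keep = sorted(rng.sample(range(len(pairs)), target_count))
    return [pairs[i] for i in keep]
```

Sampling whole pairs, rather than individual reads, keeps mates together so that the downsampled library remains valid paired-end input.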
Assembly annotation
Contigs were assigned taxonomic classifications using Kraken63 with a custom database constructed from the Refseq and Genbank64,65 bacterial genome collections. All species represented in <1% of read classifications were discarded. Sequences were functionally annotated using Prokka v1.11 (ref. 66).
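The 1% abundance filter can be sketched as follows. This is a minimal illustration that assumes one species label per classified read and takes the fraction over all labels supplied; whether unclassified reads belong in the denominator is not stated in the text, so that choice, like the function name, is an assumption.

```python
from collections import Counter


def abundant_species(read_classifications, min_fraction=0.01):
    """Return species that account for at least `min_fraction` of labels.

    `read_classifications` is an iterable of species names, one per read
    (hypothetical input format). Species below the threshold are dropped.
    """
    counts = Counter(read_classifications)
    total = sum(counts.values())
    if total == 0:
        return set()
    return {species for species, n in counts.items() if n / total >= min_fraction}
```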
Insertion abundance estimation
Illumina Truseq short-read data were aligned with BWA67 to the validated insertion sequence assemblies obtained from Athena assemblies of clinical microbiome data. Reads recruited to each insertion locus were realigned with STAR68 in order to obtain gapped alignments spanning the insertion sequence. Gapped alignments, representing the ancestral strain, were counted for each insertion. To obtain relative abundance, ancestral counts were divided by coverage sampled at a location two kilobases adjacent to the left flank of the insertion.
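The normalization described above amounts to a single ratio, sketched here in Python. The sketch assumes per-base coverage is available as a list indexed by position on the assembled contig, and it reads "two kilobases adjacent to the left flank" as 2,000 bp upstream of the left insertion junction; both the data layout and that reading are assumptions.

```python
def ancestral_fraction(gapped_count, coverage, insertion_start, offset=2000):
    """Relative abundance of the ancestral (insertion-free) strain.

    gapped_count    -- number of gapped alignments spanning the insertion
    coverage        -- per-base read depth along the contig (hypothetical layout)
    insertion_start -- 0-based coordinate of the left insertion junction
    offset          -- distance from the junction at which depth is sampled
    """
    position = insertion_start - offset
    if position < 0:
        raise ValueError("left flank is shorter than the sampling offset")
    depth = coverage[position]
    if depth == 0:
        raise ValueError("no coverage at the sampled flank position")
    return gapped_count / depth
```

For example, 10 gapped alignments against a flank depth of 50 gives an ancestral fraction of 0.2, so the strain carrying the insertion would account for the remainder under this simple model.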
PCR amplification
PCR was performed to establish molecular contiguity between sequences assembled by Athena, and to generate template materials for Sanger sequencing. PCR reactions contained Phusion High-Fidelity DNA Polymerase (New England BioLabs, Ipswich, MA) with Phusion HF Buffer and NEB Deoxynucleotide Solution Mix. Primers were obtained from Elim Biopharm (Hayward, CA) with target melting temperature of 60°C. Protocols for rRNA and IS amplifications were as follows:
Per 30uL reaction:
6uL 5x HF buffer
0.6uL 10mM dNTP
0.3uL Phusion
2uM 10mM forward primer
2uM 10mM reverse primer
1uL template DNA
PCR clean water to 30uL
Thermocycling:
Denature: 98°C for 30s
35 cycles
Denature: 98°C for 5s
Anneal: 65°C for 10s
Extend: 72°C for 30s/kb
Final extension: 72°C for 5min
Hold: 4°C indefinitely
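Because the extension step scales with amplicon length, the per-amplicon timing can be worked out with a short Python sketch. It uses only the step durations listed above; rounding the extension time up to a whole second and ignoring ramp times are assumptions, and the function names are illustrative.

```python
import math


def extension_seconds(amplicon_bp, seconds_per_kb=30):
    """Extension time at 72°C for an amplicon, rounded up to whole seconds."""
    return math.ceil(amplicon_bp * seconds_per_kb / 1000)


def program_seconds(amplicon_bp, cycles=35):
    """Total programmed time, excluding ramping and the final 4°C hold."""
    per_cycle = 5 + 10 + extension_seconds(amplicon_bp)  # denature + anneal + extend
    return 30 + cycles * per_cycle + 5 * 60  # initial denature + cycles + final extension
```

Under these assumptions, a 1.5kb amplicon needs a 45s extension per cycle.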
Sanger sequence assembly validation
In order to validate the Athena assembly at locations likely to be misassembled, we targeted duplicated sequences for orthogonal molecular validation. In the mock mixture, we identified occurrences of the 16S/23S operon within the Klebsiella genome, and in the clinical samples, we isolated all assembled regions containing insertion sequence IS612. For each of these regions, we designed PCR primers adjacent to the left and right junctions between the duplicated sequence and genomic flank. We amplified each genomic segment, performed gel extraction to isolate the amplicon corresponding to the inserted variant of the genomic segment when necessary, and performed Sanger sequencing traversing the left and right junctions. For very low abundance insertions, nested PCR using primers surrounding the left and right junctions of the insertion assembly was performed in order to amplify sufficient material for sequencing. Primer sequences and Sanger read sequences can be found in Supplementary File 1. Primer design and Sanger sequencing data visualization, quality control and alignment were performed with Geneious v7.1.4 (ref. 69).
Code availability
The Athena assembler together with a demonstration dataset can be found at https://github.com/abishara/athena_meta. This example closes several gaps within an initial draft assembly from SPAdes, including assembling two instances of IS612 inside a single contig of B. caccae.
Data availability
The datasets generated during the current study are available in the NCBI Sequence Read Archive under Bioproject accession PRJNA380276.
Author contributions
E.L.M., A.B., A.S.B. and S.B. conceived of the study. E.L.M., C.H., C.W. and H.J. prepared read cloud libraries. E.L.M., E.T., J.K., and T.A. collected samples, extracted DNA and RNA and prepared short-read sequencing libraries. E.L.M. designed and selected samples, and performed read cloud sequencing, PCR validation, and Sanger sequencing. A.B. and S.B. conceived of the assembly approach. A.B. implemented the Athena assembler. E.L.M. and A.B. carried out all analyses, wrote the manuscript, and generated figures. All authors commented on the manuscript.
Competing financial interests
The authors declare no competing financial interests.
10X Gemcode Contamination
Athena assembly yielded a draft for Ralstonia pickettii in addition to the 10 intended species for the read cloud library. This organism was present in all read cloud libraries and absent from all conventional short-read libraries prepared from clinical samples, mixed isolate and unmixed isolate DNA. Thus, we attributed these contigs to DNA contamination introduced during 10X Genomics Gemcode library preparation and discarded them.
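The reasoning used to flag the contaminant can be expressed as a set operation, sketched below in Python. The sketch assumes each library has already been reduced to the set of taxa detected in it; that representation and the function name are assumptions made for illustration.

```python
def likely_library_contaminants(read_cloud_libs, conventional_libs):
    """Taxa seen in every read cloud library and in no conventional library.

    Each argument is a list of sets of taxon names, one set per library
    (hypothetical representation).
    """
    if not read_cloud_libs:
        return set()
    in_all_read_cloud = set.intersection(*(set(lib) for lib in read_cloud_libs))
    in_any_conventional = set().union(*(set(lib) for lib in conventional_libs))
    return in_all_read_cloud - in_any_conventional
```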
Athena Assembly
We developed Athena to use barcoded short-read sequences derived from partitioned long input DNA fragments, which we refer to as read clouds. We apply Athena to a read cloud dataset generated with the 10X Genomics Gemcode instrument. In principle, the long fragments that are used as input to these platforms allow resolution of repeats contained within these fragments. However, the barcode-specific coverage of each long fragment is too sparse to allow de novo assembly of each in isolation. Furthermore, the long range information encoded within the raw output of each barcode in the form of unordered and unoriented short-read sequences does not fit well into existing sequence assembly algorithms. Athena uses the barcode information to propose a series of simplified assembly tasks that can be performed using existing assemblers as black box subroutines.
Athena first uses an existing short-read assembler (SPAdes) to obtain an initial sequence covering of the underlying metagenome in the form of (possibly short) sequence contigs. A scaffold graph is then constructed using the paired-end information from short-read alignments to these contigs. This scaffold graph contains branches that can be attributed to nearly identical repeats, small divergent sequences between otherwise identical strains, or conserved sequences. Mappings of the barcoded short reads to this scaffold graph allow the selection of input read subsets for a smaller assembly problem (subassembly), such that the resulting contigs yield unambiguous paths through the scaffold graph. The resulting contigs are then passed as reads to an overlap layout consensus assembler, Canu (formerly Celera), for further assembly of these much larger sequences.
The steps for Athena assembly are as follows:
1) A conventional short-read assembler (SPAdes56) is used to assemble the raw reads to obtain an initial covering of the target metagenome in the form of short sequence contigs. We refer to the contigs as seeds.
2) Raw reads are mapped back to these seed contigs, and paired-end mappings that span two seed contigs are considered for edge creation in a scaffold graph. However, we observed a significant fraction of read pairs mapping with an intervening distance that exceeds the expected library fragment size. We believe these pairs to be mostly due to chimeric fragments arising during Gemcode library preparation. In order to prevent these from introducing spurious connections in the scaffold graph, we perform the following steps:
2a) For any two seed contigs that are still connected by at least three spanning read pairs, the mapped positions of these spanning read pairs on each seed contig are clustered together into 500bp neighborhoods, corresponding to the average library fragment size.
2b) All clusters are examined and if any single cluster on each seed contig contains more than 50% of these spanning read pairs, then an edge is added in the scaffold graph. Otherwise, the candidate edge is assumed to be spurious and discarded. This filtering process greatly reduces the number of proposed subassemblies to perform.
3) For each remaining edge between two seed contigs within the scaffold graph, subassembly of the linked seed contigs is performed with the following steps:
3a) Barcodes containing at least one read mapping to both seeds are selected. We refer to these as subassembly barcodes. Pooled reads from the subassembly barcodes potentially contain contiguous sequences that bridge together the two seed contigs.
3b) Pooled reads that map to these two seeds are used to estimate short-read coverage of the target sequence within the subassembly. If the short-read coverage is estimated to be low (<10x), then this subassembly is skipped as the local target is unlikely to assemble at low depths. If the short-read coverage is estimated to be high (>200x), then the subassembly barcodes are first downsampled to accelerate subassembly.
3c) The remaining pooled reads are then assembled with IDBA-UD to yield subassembled contigs. These subassembled contigs are likely to disambiguate other branches in the scaffold graph because the pooling of all reads within the chosen barcodes also draws in reads from flanking regions, due to the long input DNA fragments. This pooling will also draw reads from other input DNA fragments that do not cover the local target sequence, which we refer to as off-target fragments. These off-target fragments will have a low probability of collision with the local target. Nonetheless, to prevent incorrect subassemblies due to off-target reads arising from repeats, we determine a local threshold (based on estimated coverage of the subassembly target) on the minimum coverage depth required to assemble through a sequence contig. We used the existing short-read assembler IDBA-UD to assemble the pooled reads because it was designed for use with highly uneven short-read coverages and also allowed us to specify the minimum support each k-mer should have to assemble through a sequence contig. The updated 10X Genomics Chromium platform uses more than an order of magnitude more droplet partitions than the Gemcode, and should, theoretically, eliminate the need for this additional threshold.
4) The subassembled contigs, which contain large overlaps, together with the initial seed contigs, are passed as reads to the overlap layout consensus assembler Canu to perform further assembly. Following the methodology used with previous synthetic long read metagenomic assembly approaches71, we specify a small read error rate to facilitate overlap assembly even in the presence of strain microdiversity. The resulting draft metagenome assembly contains more complete sequence contigs that do not have gaps and that resolve shared sequences too difficult to assemble with short-read techniques alone. Repeats that cannot be unambiguously spanned by subassembled contigs remain unresolved by the overlap assembler.
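The decision rules in steps 2 and 3 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions and is not the Athena implementation: the 500bp neighborhoods are approximated by a sliding window over sorted mapping positions, step 3a is read as selecting barcodes that have at least one read on each of the two seeds, and all function names and data structures are hypothetical.

```python
def largest_cluster_fraction(positions, width=500):
    """Fraction of positions that fall in the fullest window of `width` bp.

    A sliding window over sorted positions stands in for the clustering
    procedure, which the text does not specify in detail.
    """
    pos = sorted(positions)
    best, left = 0, 0
    for right in range(len(pos)):
        while pos[right] - pos[left] > width:
            left += 1
        best = max(best, right - left + 1)
    return best / len(pos)


def keep_edge(spanning_pairs, min_pairs=3, width=500, min_fraction=0.5):
    """Steps 2a and 2b: decide whether two seeds receive a scaffold edge.

    `spanning_pairs` is a list of (position_on_seed_a, position_on_seed_b).
    """
    if len(spanning_pairs) < min_pairs:
        return False
    frac_a = largest_cluster_fraction([a for a, _ in spanning_pairs], width)
    frac_b = largest_cluster_fraction([b for _, b in spanning_pairs], width)
    # A single cluster on each seed must hold more than half of the pairs.
    return frac_a > min_fraction and frac_b > min_fraction


def subassembly_barcodes(barcodes_by_seed, seed_a, seed_b):
    """Step 3a: barcodes with reads mapping to both linked seeds."""
    return barcodes_by_seed[seed_a] & barcodes_by_seed[seed_b]


def subassembly_action(estimated_coverage, low=10, high=200):
    """Step 3b: skip, downsample first, or assemble directly."""
    if estimated_coverage < low:
        return "skip"
    if estimated_coverage > high:
        return "downsample_then_assemble"
    return "assemble"
```

Read pairs produced by chimeric fragments tend to map to scattered positions, so they rarely concentrate in one neighborhood on both seeds and the corresponding candidate edge is dropped.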
Acknowledgements
The authors would like to thank Alexandra Sockell for assistance operating the NextSeq 500, and Arend Sidow for valuable feedback on the manuscript. This work was supported by NCI K08 CA184420, the Amy Strelzer Manasevit Award from the National Marrow Donor Program, and a Damon Runyon Clinical Investigator Award to A.S.B. E.L.M. was supported by National Science Foundation Graduate Research Fellowship DGE-114747. A.B. was supported by the Stanford Genome Training Program (SGTP; NIH/NHGRI) and the Training Grant of the Joint Initiative for Metrology in Biology (JIMB; NIST). Access to shared compute resources was supported in part by NIH P30 CA124435 using the Stanford Cancer Institute Shared Resource Genetics Bioinformatics Service Center.