Abstract
High-throughput amplicon sequencing of large genomic regions represents a challenge for existing short-read technologies. Long-read technologies can in theory sequence large genomic regions, but they currently suffer from high error rates. Here, we report a high-throughput amplicon sequencing approach that combines unique molecular identifiers (UMIs) with Oxford Nanopore sequencing to generate single-molecule consensus sequences of large genomic regions. We demonstrate the approach by generating nearly 10,000 full-length ribosomal RNA (rRNA) operons of roughly 4,400 bp in length from a mock microbial community consisting of eight bacterial species using a single Oxford Nanopore MinION flowcell. The mean error rate of the consensus sequences was 0.03%, with no detectable chimeras due to a rigorous UMI-barcode filtering strategy. The simplicity and accessibility of this method paves the way for widespread use of high-accuracy amplicon sequencing in a variety of genomic applications.
Introduction
High throughput amplicon sequencing is a powerful method for analysing variation in defined genetic regions when sample amounts are limited, insights into low abundant subpopulations are important, or samples need to be analysed in an economical manner. The method is therefore ideal for studying genetic populations with low abundant variants or high heterogeneity such as cancer driver genes1–3, virus populations4–6 and microbial communities7.
For years, short-read Illumina sequencing has dominated amplicon-related research due to its unprecedented throughput and low native error rate of 0.1%, but with a limitation in maximum amplicon size of ∼500 bp (merging of 2×300 bp PE reads)8. To enable a lower error rate and sequencing of longer amplicons, unique molecular identifiers (UMIs) have been applied extensively. Each template nucleotide sequence molecule in a sample is tagged with a UMI sequence consisting of 10-20 random bases. All derived products throughout processing and sequencing will contain the UMI tag, which can subsequently be used to sort and analyse reads based on their original template molecule. This concept has many applications in high-throughput sequencing, such as absolute quantification9, generating molecule-level consensus sequences with a low error rate10, and assembly of synthetic long reads11. These applications have enabled key advances across diverse fields of research, such as absolute counting of transcripts in single cells12, detecting low-frequency cancer mutations in plasma cell-free DNA13, and generating full-length microbial SSU ribosomal RNA (rRNA) sequences in a high-throughput manner14, to mention a few. The lowest possible error rate of Illumina-based consensus sequencing is impressive (< 10^-7 %), but the upper limit of target length for UMI synthetic long reads remains approximately 2000 bp due to inefficient cluster generation of longer DNA fragments on the flowcells15. UMI-based protocols exist that can generate longer consensus sequences from short reads16, but they are not widely adopted due to complicated laboratory protocols. Partitioning-based methods such as 10x Genomics and TruSeq Synthetic Long-Reads struggle to resolve complex amplicon populations, as there is a high risk of >1 amplicon ending up in the same partition, which will result in a chimeric assembly8.
Lastly, as synthetic long reads depend on de novo assembly of the short-reads, this approach will never be able to resolve internal molecule repeats larger than the read length.
In order to analyse amplicons larger than 2000 bp in high throughput, the only feasible approach would be to use long-read sequencing technologies such as Oxford Nanopore Technologies (ONT) and Pacific Biosciences (PacBio). However, these methods are limited by a relatively high raw error rate of 9 and 13%, respectively17. For PacBio, the circular consensus sequencing (CCS) approach, where a template molecule is circularized and read multiple times by the polymerase, can reduce mean error rates to as low as 0.04%18. Strategies also exist for reducing the error rate of amplicon sequencing on the ONT platform with template circularization and rolling circle amplification before sequencing to generate single-molecule consensus sequences, but these methods suffer from insufficient molecule coverage to effectively reduce mean error rates below 2%19. In principle, lower error rates can be achieved with different clustering strategies19, but at the cost of missing variants, which are critical in many applications20.
In principle, UMIs can be used in long-read amplicon sequencing to reduce the sequencing error rate and eliminate PCR artefacts (e.g. chimeras), which are present irrespective of polymerase21 and can make up > 20% of the amplicons22. This is also true for PacBio CCS, where errors are introduced before sequencing during amplicon PCR amplification. Despite the benefits, the combination of UMIs with long-read sequencing is relatively unexplored, and only recently has it been applied with PacBio sequencing, but without profiling the error of the generated consensus sequences23, 24. For ONT sequencing, the raw error rate of 5-25%25 has, until now, made it difficult to efficiently extract UMI sequences and confidently determine the true UMI sequences necessary for read binning.
Here, we created a UMI design containing recognizable internal patterns, which together with UMI length filtering now makes it possible to robustly determine true UMI sequences in raw nanopore data. We incorporated this patterned-UMI design into a simple, generally applicable laboratory and bioinformatics protocol that combines UMIs and ONT sequencing of long amplicons (>4500 bp) from low template amounts with high accuracy. As a proof of concept, we apply the method to sequence full-length ribosomal RNA (rRNA) operons in a mock microbial community of eight bacterial species (ZymoBIOMICS Microbial Community DNA Standard) and generate consensus sequences with a mean error rate of 0.03% and no detectable chimeras.
Results and discussion
The method is simple and comprises two PCR amplifications, Nanopore library preparation, Nanopore sequencing and custom data processing (Figure 1). First, the DNA template is diluted according to the desired number of output sequences. The final yield is impacted by the initial dilution, as well as the amplicon length and PCR efficiency; thus, the dilution should be calibrated empirically for an amplicon target of interest. For rRNA operon sequencing, we found that 5 ng of template produced ∼10,000 consensus sequences, and is a good general starting point for further optimization. The genetic region of interest is targeted using 2 cycles of PCR with a custom set of tailed primers, which include a target-specific primer, a UMI sequence and a synthetic priming site used for downstream amplification (Figure 1A, step 1). Here we used the 27F (16S)26 and the 2490R (23S)27 primers to target the bacterial rRNA operon. The result from the initial PCR is a dsDNA amplicon copy of the genetic target with UMIs and synthetic primers in both ends. This template is subsequently amplified by PCR (Figure 1A, step 2) and prepared for long-read sequencing, in this case using the ONT 1D ligation kit and ONT MinION (Figure 1A, step 3), followed by base-calling. After sequencing, the data is trimmed and filtered, and reads are binned according to both terminal UMIs (Figure 1B, steps 1 and 2). To overcome the obstacle of binning UMIs in raw nanopore data with a mean error rate of ∼9.5%, we designed ‘patterned’ UMIs, with the structure “NNNYRNNNYRNNNYRNNN”. The YR [C/T][A/G] patterns limit the length of homopolymers in the UMIs to 4 bases, which mitigates the higher homopolymer error rate present in ONT sequencing8. UMI sequences that have a high probability of being correct are detected based on the presence of the above pattern, as well as an expected UMI length of 18 bp.
The two terminal UMIs in the amplicons make up a combined UMI pair of 36 bases with a theoretical complexity of 1.2×10^18 combinations, which means it is extremely unlikely that two molecules contain the same UMI pairs if aiming for 10,000 – 1,000,000 molecules. Chimeric amplicons will form during the later cycles of the PCR amplification step, especially if proof-reading polymerases are used21. UMI pairs from these chimeric sequences are de novo filtered by removing reads with UMI pairs in which either UMI has been observed before in a more abundant UMI pair (Figure 1B, step 2)28. The filtered, high-quality UMI pair sequences are used as a reference for binning of the raw dataset according to UMIs (Figure 1B, step 3).
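The de novo chimera filter described above can be illustrated with a minimal Python sketch. It is not the pipeline code: the function name and the input layout (a dictionary from UMI pair to read count) are assumptions made for illustration, and ties in abundance are broken arbitrarily by sort order.

```python
def filter_chimeric_umi_pairs(pair_counts):
    """Keep UMI pairs in which neither UMI was seen in a more abundant pair.

    pair_counts: dict mapping (umi1, umi2) -> number of supporting reads
    (hypothetical input layout, chosen for this sketch).
    """
    seen = set()   # single UMIs observed in any pair visited so far
    kept = []
    # Visit pairs from most to least abundant; ties are ordered by the pair itself.
    ordered = sorted(pair_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    for (umi1, umi2), _count in ordered:
        if umi1 not in seen and umi2 not in seen:
            kept.append((umi1, umi2))
        # Record both UMIs whether or not the pair was kept, so that any
        # later, less abundant pair sharing a UMI is treated as a chimera.
        seen.update((umi1, umi2))
    return kept


example = {("A", "B"): 10, ("C", "D"): 8, ("A", "D"): 2, ("E", "F"): 1}
trusted = filter_chimeric_umi_pairs(example)
```

In this toy example the pair ("A", "D") is discarded because both of its UMIs already occur in more abundant pairs, which is the signature expected from a PCR chimera of two true templates.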
Sequencing of the mock community rRNA operon library resulted in 7.4 Gbp of base-called raw data, of which 3.3 Gbp was binned based on UMIs. The mean read coverage per UMI bin was 67x. The consensus sequence for each UMI bin was generated by initially finding the centroid sequence in the bin, and polishing this centroid with all the data in the UMI bin using 5 rounds of racon29 followed by 2 rounds of Medaka (Figure 1B, step 4).
Initially, we observed error-rates that were highly correlated with the individual rRNA operons in the Zymo mock (Supplementary Figure 4), which indicated errors in the available reference genomes, as was also reported by others18. The reference genomes were generated using the Unicycler assembler with both Illumina and Nanopore reads and polished with pilon (personal communication with Zymo Research). As Unicycler uses a short-read assembly as starting point30 and short-read polishing has been used for final curation, repeat regions are bound to contain errors resulting from ambiguous assembly and mapping31. To generate improved rRNA operon references, we first used a long-read assembly approach, in which publicly available ONT sequence data of the Zymo mock community32 was assembled into individual reference genomes with miniasm33 followed by racon and Medaka polishing. rRNA operons were then extracted from the high-quality long-read assemblies, and SNPs with no Illumina short-read support were manually curated, which were mainly indel errors in homopolymers. In total, we found 49 bacterial rRNA operons with 4-10 copies/species, where 44 operons were unique and had 1-379 intra-species difference (Supplementary Figure 2). The mean difference between the original references and our curated sequences was 0.063% (∼2.8 SNP/operon), with a range of 0 – 0.47% (0 – 21 SNP/operon) (Supplementary Figure 3).
A total of 9759 amplicon UMI consensus sequences with an average length of 4372 bp were generated with a read coverage of ≥ 30x, a mean error rate of 0.03% and no detected chimeras (Figure 2C). Of these sequences, 2570 were perfect with no errors. The error rate is markedly different in non-homopolymer regions compared to homopolymer regions (Figure 2B). The non-homopolymer error rate stabilizes above a coverage of 10x for all error types (deletions, inserts and mismatches), with mismatches contributing to a majority of the remaining error (Figure 2D). Within homopolymer regions, the error rate is higher and continues to drop beyond 100x coverage, which is primarily due to the indel errors (Figure 2B). The mismatch error rate is similar between non-homopolymer and homopolymer regions over all coverage values. This demonstrates that the major obstacles for achieving a lower error rate are generally mismatch errors, as well as indel errors specifically in the homopolymer regions. The mismatch error rate of 0.012-0.016% is most likely derived from the 2 cycles of initial PCR performed to target the rRNA operon. For this PCR, Platinum Taq DNA high-fidelity polymerase (Thermo Fisher) was used, which should have an error rate in the range of 0.003 - 0.005% (6x lower than Taq) 34, 35 per duplication, which theoretically would result in a cumulative error rate over 2 PCR cycles of up to 0.01%. Other high-fidelity polymerases with lower error rates were tested, but we were unable to consistently produce amplicons, which might be due to unwanted intra- or inter-molecule annealing. The homopolymer indel error rate is a consequence of the nanopore read-head structure in the CsgG pore used in the current R9.4 chemistry8. Generally, the homopolymer indel rate depends on homopolymer length and specific nucleotide (Supplementary Table 2), i.e. A-homopolymers have markedly lower errors than G-homopolymers.
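The cumulative polymerase error quoted above follows from a first-order approximation. As a sketch, assuming errors are independent between duplications, and writing e for the per-duplication error rate and n for the number of duplications (both symbols are introduced here only for illustration):

```latex
\[
e_{\mathrm{cum}} = 1 - (1 - e)^{n} \approx n\,e,
\qquad
n = 2,\; e = 0.005\% \;\Rightarrow\; e_{\mathrm{cum}} \approx 0.01\%
\]
```

With the lower bound of e = 0.003% the same approximation gives roughly 0.006%, so the expected range is slightly below the observed mismatch error rate of 0.012-0.016%.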
Yet, a closer inspection of the homopolymer error rates reveals a more complicated picture. For example, some positions of 3xC homopolymers contained more frequent insertions than longer C-homopolymers (Supplementary Figure 1). This problem is likely rooted in the calibration of the neural networks of the base-caller and consensus algorithms36; it is bound to change significantly in the future and will probably be reduced with the introduction of the R10 pores. Despite residual systematic errors, the error rate presented here is the lowest documented for long-read amplicons yet (Supplementary Table 4). We did not identify any chimeras in the generated long-read amplicon data.
An important application of high-accuracy amplicon sequencing is the ability to confidently call variants, even if they are present in low relative abundance. To test our method, we performed naive variant calling based on the consensus sequences. Consensus sequences were initially grouped via clustering, and SNPs within each cluster were phased and called as a variant if present at ≥2x coverage. Subsequently, the consensus reads were binned according to variants, and variant consensus sequences were generated. To reduce the impact of systematic homopolymer errors, the homopolymers were masked before phasing and variant calling, and reintroduced before final consensus calling. Of 44 unique rRNA operons, 40 variant consensus sequences were found with no errors, and 4 with 1 error in homopolymer regions (Figure 3B, Supplementary Table 3). An additional 26 spurious variants were detected with a mean error count of 1.4 (0.03% error rate) and a maximum of 3 (0.07% error rate). These spurious variants are supported by 1.6% of the total data, and seem to occur due to systematic errors at specific positions outside homopolymer regions.
The relative rRNA operon abundances within each species were very similar, as was expected (Figure 3C). For some species the internal coverage variance was small (E. coli percent sd=4.9) and for others it was higher (L. fermentum sd=12.8) (Supplementary Figure 6 and Supplementary Table 5). By investigating the read coverage of the mock community genomes within the publicly available metagenomic nanopore data32, we found evidence of heavy coverage skew across the genome in some species, likely due to different growth rates of the cultures at the time of sampling (Figure 3D, Supplementary Table 7). This skew can impact the relative template abundances of the operons by up to +/− 50% (Supplementary Table 5), depending on their distance to the origin of replication, and could to some degree explain the variance we see among inter-species operon abundances. The observed relative abundance between species did not match the theoretical abundance for all species reported by the vendor (Supplementary Table 6). Possible explanations are erroneous mixing of the mock community, species-dependent DNA fragment size, PCR primer mismatch, operon/genome GC content, and different amplification efficiencies. To our surprise, none of these potential causes could alone explain the observed discrepancy in relative abundance (Supplementary Tables 6-7 and Figures 7-9). However, it is evident that multiple factors have to be considered when interpreting this kind of data, especially the impact of template DNA size distribution on template availability (Supplementary Figure 7), growth-dependent coverage bias (Supplementary Figure 6) and template amplification efficiency (Supplementary Figure 9).
The data presented here was generated in 48 hrs (6 hrs lab work, 24 hrs sequencing, 6 hrs data processing) at a reagent cost of 1100 USD, which is ∼0.1 USD/consensus sequence. Using this method on the PacBio Sequel system with the SMRT Cell 1m chips, we anticipate the output would be around 100,000 UMI consensus sequences at a cost of ∼0.02 USD/consensus sequence with a marginally better error rate, as the PacBio errors seem more random and therefore better suited for consensus calling37. The throughput will likely change by a factor of 10x with the introduction of Sequel II and the SMRT Cell 8m chips. The turnaround for PacBio sequencing is theoretically < 24 hrs, but as most users would need sequencing out-of-house, this is more likely > 7 days. We predict that the ease of use, fast turn-around time and accessibility will favour sequencing of high-accuracy amplicons on the ONT platform.
Over the past several decades, the amplification and sequencing of ribosomal RNA (rRNA) genes, primarily 16S and 18S, has become an integral method used to study the diversity and taxonomic composition of microbial communities in a variety of environments38. With our method, it is now possible to effortlessly improve upon high-throughput sequencing of environmental samples with databases based on the full rRNA operon (SSU-ITS-LSU), which has not previously been feasible due to the length of the operon (≈ 5 kbp) and the aforementioned method limitations. A database of full operon rRNA sequences will help improve upon rRNA phylogeny, allow higher phylogenetic resolution39–42, especially critical if the method is applicable to eukaryotes43, 44, and will present a wider range of target regions for designing short-read amplicon sequencing assays and fluorescent in situ hybridization probes45, 46.
High-accuracy amplicon sequencing of long targets has many applications, and the ease and accessibility of this method now makes it possible for the wider scientific community to develop new solutions – all one needs is a modified version of their favourite primers, a few generic molecular laboratory instruments, and a MinION starter kit from Oxford Nanopore Technologies. While the residual error rate in the Nanopore consensus data is negligible, the remaining systematic indel errors could still be an issue in some contexts, such as sensitive assays where low abundant variants are important, or if shifts in reading frames cannot be tolerated. These systematic indel errors will hopefully be solved soon, and until then, this method can be applied with the PacBio platform for the specific purposes above. By exchanging the initial PCR for a ligation step, high-accuracy amplicon sequencing could also be applied to fragmented DNA with tight size distributions (5-15 kbp) to produce long reads with low error rate, which holds great promise for human genome sequencing47 and resolving strain-diversity in metagenomes48.
Methods
Sources of DNA
The ZymoBIOMICS Microbial Community DNA Standard (D6305, lot no. ZRC190633) was obtained from Zymo Research (Irvine, California). The mock community DNA contained genomic material from 10 species (8 bacteria and 2 yeasts): Bacillus subtilis, Cryptococcus neoformans, Enterococcus faecalis, Escherichia coli, Lactobacillus fermentum, Listeria monocytogenes, Pseudomonas aeruginosa, Saccharomyces cerevisiae, Salmonella enterica, Staphylococcus aureus. Note that the 2 yeast species were not targeted by PCR amplification of rRNA operons. The concentration of mock DNA was measured using a Qubit 3.0 fluorometer with the Qubit dsDNA HS assay kit (Thermo Fisher Scientific), and the quality of the mock DNA was assessed by gel electrophoresis on an Agilent 2200 Tapestation using Genomic screentapes (Agilent Technologies).
DNA Sequence Library Preparation
Target gene and add UMIs
PCR was used to target the bacterial 16S-23S rRNA operon and simultaneously tag each template molecule with terminal unique molecular identifiers (UMIs).
The following primers were used for the PCR. Forward primer (ncec_16S_8F_v7): 5’- CAAGCAGAAGACGGCATACGAGAT NNNYRNNNYRNNNYRNNN AGRGTTYGATYMTGGCTCAG. Reverse primer (ncec_23S_2490R_v7): 5’- AATGATACGGCGACCACCGAGATC NNNYRNNNYRNNNYRNNN CGACATCGAGGTGCCAAAC. The first section of both primers is a synthetic priming site used for downstream amplification. The second section is the ‘patterned’ UMI consisting of a total of 12 random nucleotides (N) and 6 degenerate nucleotides (Y or R), which results in a total of 1.2×10^18 possible UMI combinations if the UMIs in both ends of a molecule are concatenated (4^(12×2) × 2^(6×2) = 1.2×10^18). The last section of the primers consists of the rRNA operon specific primer site for 27f1 and 2490r2, respectively.
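The complexity figure above can be checked with a few lines of Python. This is only an arithmetic sketch; the variable and function names are chosen for illustration.

```python
# Number of bases each IUPAC code in the UMI design can represent.
IUPAC_DEGENERACY = {"N": 4, "Y": 2, "R": 2}
UMI_PATTERN = "NNNYRNNNYRNNNYRNNN"  # 12 x N and 6 x Y/R per UMI


def umi_complexity(pattern):
    """Return the number of distinct sequences a degenerate pattern can encode."""
    total = 1
    for code in pattern:
        total *= IUPAC_DEGENERACY[code]
    return total


single_umi = umi_complexity(UMI_PATTERN)  # 4**12 * 2**6
umi_pair = single_umi ** 2                # both terminal UMIs concatenated
print(f"{single_umi:.3g} per UMI, {umi_pair:.3g} per UMI pair")
```

The pair complexity evaluates to 2^60, about 1.15×10^18, which rounds to the 1.2×10^18 stated in the text.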
The PCR reaction contained 10 ng of ZymoBIOMICS Microbial Community DNA Standard, 0.5 U Platinum Taq DNA Polymerase High Fidelity (Thermo Fisher Scientific, USA), and a final concentration of 1x High Fidelity PCR buffer, 100 mM of each dNTP, 1.5 mM MgSO4, 500 nM of each of the ncec_16S_8F_v7/ncec_23S_2490R_v7 primers in 50 µL. The PCR program consisted of initial denaturation (3 minutes at 95 °C) and 2 cycles of denaturation (30 seconds at 95 °C), annealing (30 seconds at 55 °C) and extension (6 minutes at 72 °C). The PCR product was purified using CleanPCR (CleanNA, Netherlands) following the manufacturer’s instructions (CleanPCR, manual revision v1.02) with the exception of an EtOH concentration of 80%, post-wash dry time of < 3 minutes and 0.6x bead solution/sample ratio.
Amplification of UMI tagged amplicons
A second PCR was used to amplify the UMI-tagged template molecules. All of the UMI-tagged template molecules were added to the reaction along with a final concentration of 1x High Fidelity PCR buffer, 100 mM of each dNTP, 1.5 mM MgSO4, 500 nM of each of the ncec_pcr_fw_v7 (5’- CAAGCAGAAGACGGCATACGAGAT)1 and ncec_pcr_rv_v7 (5’- AATGATACGGCGACCACCGAGATC)1 primers and 0.5 U Platinum Taq DNA Polymerase High Fidelity (Thermo Fisher Scientific, USA) in 100 µL. The PCR program consisted of initial denaturation (3 minutes at 95 °C) and then 25 cycles of denaturation (15 seconds at 95 °C), annealing (30 seconds at 60 °C) and extension (6 minutes at 72 °C) followed by final extension (5 minutes at 72 °C). The PCR product was purified using a custom bead purification protocol “SPRI size selection protocol for >1.5-2 kb DNA fragments” (Oxford Nanopore, England) based on: dx.doi.org/10.17504/protocols.io.idmca46. CleanPCR (CleanNA, Netherlands) bead solution was used for preparing the custom buffer. The purification was performed according to the custom protocol with the exception of an EtOH concentration of 80% and 0.9x bead solution/sample ratio. The concentration and quality of the PCR amplicons were measured as described before.
To obtain sufficient PCR product for Oxford Nanopore sequencing, a third PCR was performed using amplicons from the second PCR and the same procedure as before, but with 4 x 100 µl reactions and 10 cycles of amplification. The final amount of amplicon generated was 10 µg in 55 µL.
DNA Sequencing
2000 ng of the purified amplicon from the third PCR was used as template for library preparation using the protocol “1D amplicon/cDNA by ligation (SQK-LSK109)” (Oxford Nanopore, England) with omission of the AMPure purification after the end-prep step. A R9.4.1 FLO-MIN106 flowcell was used for sequencing on a MinION and MinKNOW v18.12.9 (Oxford Nanopore, England). Basecalling was performed with Guppy v3.0.3 in GPU mode and the dna_r9.4.1_450bps_hac.cfg model (Oxford Nanopore, England).
Data generation workflow
Trimming and filtering of raw data
Raw fastq sequence data was adaptor trimmed using porechop with the commands: --min_split_read_size 3500 --adaptor_threshold 80 --extra_end_trim 0 --extra_middle_trim_good_side 0 --extra_middle_trim_bad_side 0 --middle_threshold 80 --check_reads 1000 (v0.2.4 https://github.com/rrwick/Porechop). Additionally, the adaptors.py file in porechop was modified to include possible end-to-end ligation combinations of the custom primers (ncec_pcr_fw_v7/ncec_pcr_rv_v7: 5’-GTCTTCTGCTTGAATGATACGGCG; ncec_pcr_fw_v7/ncec_pcr_fw_v7: 5’-GTCTTCTGCTTGCAAGCAGAAGAC; ncec_pcr_rv_v7/ncec_pcr_rv_v7: 5’-CGCCGTATCATTAATGATACGGCG). The custom settings and modifications to adaptors.py were necessary to correctly split amplicons concatenated in the ligation step of the library preparation, which made up a substantial amount of the data. The adaptor trimmed data was filtered using filtlong --min_length 3500 --min_mean_q 70 (v0.2.0 https://github.com/rrwick/Filtlong) and cutadapt3 (v2.1) -m 3500 -M 6000. The final result from these pre-processing steps was trimmed and filtered raw read data.
Extraction of UMI reference sequences
To efficiently bin reads according to UMIs, it was critical to extract and validate true UMI sequences that could be used as references. UMI sequences of the correct length (18 bp) were extracted from the reads by locating the flanking sequences within the custom primers. The first 200 bp from each terminal end of all reads were extracted using awk, and saved into individual files. UMI sequences were extracted from each file with cutadapt3 (v2.1) in paired-end input mode, using the commands: -e 0.2 -O 11 -m 18 -M 18 --discard-untrimmed -g CAAGCAGAAGACGGCATACGAGAT...AGRGTTYGATYMTGGCTCAG -g AATGATACGGCGACCACCGAGATC...CGACATCGAGGTGCCAAAC -G GTTTGGCACCTCGATGTCG...GATCTCGGTGGTCGCCGTATCATT -G CTGAGCCAKRATCRAACYCT...ATCTCGTATGCCGTCTTCTGCTTG. This step ensured that only reads with UMIs of the correct length in both ends were extracted. UMI pairs were then concatenated and filtered to remove UMI pairs that did not follow the expected pattern (NNNYRNNNYRNNNYRNNNNNNYRNNNYRNNNYRNNN). Filtered UMI pairs were clustered using usearch4 (v11.0.667) with the commands: -fastx_uniques -minuniquesize 2 -strand both and usearch -cluster_fast -id 0.85 -centroids -sizein -sizeout -strand both. Potential chimeras were removed by filtering all UMI pairs containing a single UMI that was observed in another UMI pair with a higher abundance. The final result from these steps was a list of trusted UMI pairs that could be used as references for binning reads.
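The pattern filter applied to the concatenated UMI pairs can be expressed as a regular expression. The following Python sketch is an illustration under stated assumptions, not the pipeline code: the function names are hypothetical and the input is assumed to be plain 36-base strings.

```python
import re

# IUPAC codes used in the UMI design, expanded to regex character classes.
IUPAC_CLASS = {"N": "[ACGT]", "Y": "[CT]", "R": "[AG]"}
PAIR_PATTERN = "NNNYRNNNYRNNNYRNNN" * 2  # two 18 bp UMIs concatenated

PAIR_REGEX = re.compile("".join(IUPAC_CLASS[code] for code in PAIR_PATTERN))


def follows_pattern(umi_pair):
    """True if the sequence is exactly 36 bases and matches the patterned design."""
    return PAIR_REGEX.fullmatch(umi_pair.upper()) is not None


def filter_umi_pairs(umi_pairs):
    """Keep only the UMI pairs that follow the expected pattern."""
    return [pair for pair in umi_pairs if follows_pattern(pair)]


good = "ACGTAACGCGTTTTGAAA" * 2   # respects every Y and R position
bad = "A" * 36                    # violates the Y positions
kept = filter_umi_pairs([good, bad, good[:-1]])
```

One convenient property of this design is that the reverse complement of NNNYRNNNYRNNNYRNNN has the same pattern, so the same expression applies regardless of read orientation.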
Binning reads according to UMI
The first 55-65 bp of each terminal of the trimmed and filtered reads were extracted with awk and saved into individual files. The UMI pair reference sequences were split into their corresponding single UMIs and mapped to the read terminals using bwa5 (v0.7.17-r1198-dirty) with the commands: index, aln -n 3 -N, and samse -n 10000000. The mapping results were then filtered using samtools6 (v1.9) with the command view -F 20. Mapping results from each end of the reads were merged, and a read was assigned to a specific UMI pair reference if two conditions were met: A) the UMI was the best hit; B) the mapping difference between the query read and each sub UMI was ≤ 3 bp. Based on these designations, the trimmed and filtered reads were divided into UMI bins.
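The assignment rule above can be sketched in Python as follows. This is one possible reading of the two conditions, not the pipeline code: the function name and input layout are hypothetical, and "best hit" is interpreted here as the UMI pair with the lowest summed mapping difference over both read ends.

```python
def assign_read(hits_end1, hits_end2, max_diff=3):
    """Assign a read to a UMI pair, or return None if no pair qualifies.

    hits_end1 / hits_end2: dicts mapping a UMI pair ID to the mapping
    difference (in bases) of its sub UMI at that read end (assumed layout).
    """
    # Only UMI pairs whose sub UMIs map to both ends of the read are candidates.
    candidates = set(hits_end1) & set(hits_end2)
    if not candidates:
        return None
    # Condition A: take the best hit; ties are resolved by pair ID.
    best = min(candidates, key=lambda p: (hits_end1[p] + hits_end2[p], p))
    # Condition B: each sub UMI may differ from the read by at most max_diff.
    if hits_end1[best] > max_diff or hits_end2[best] > max_diff:
        return None
    return best
```

For example, a read whose ends hit pair "p1" with differences 1 and 0 and pair "p2" with differences 3 and 2 would be placed in the "p1" bin, while a read whose ends hit different pairs would be left unassigned.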
Generation of UMI consensus sequences
For each individual UMI bin, a consensus sequence was initially generated using usearch (v11.0.667) with the commands -cluster_fast -id 0.75 -strand both -centroids, and picking the most abundant centroid. The centroid sequence was used as template for five rounds of polishing using all the UMI bin reads with minimap27 (v2.16-r922) with the command -x ava-ont and racon8 (v1.3.1) with the command -m 8 -x -6 -g -8 -w 800. The racon-polished consensus sequence was further polished using all of the reads in that UMI bin using two rounds of Medaka (v0.7.0) with the commands -m r941_min_high_model.hdf5 (https://github.com/nanoporetech/medaka). The polished consensus sequences from all UMI bins were then pooled, trimmed and filtered using cutadapt with the commands -m 3000 -M 6000 -g AGRGTTYGATYMTGGCTCAG...GTTTGGCACCTCGATGTCG. Consensus sequences not containing both primers were discarded.
Phasing of consensus sequences
Consensus sequences were phased and used to call variants using a custom workflow. The homopolymers were masked in the consensus sequences by converting homopolymers of length ≥3 into length 2 to prevent them from impacting the phasing. The masked consensus sequences were clustered using two rounds of usearch with the commands -cluster_fast -id 0.995 -strand both -consout -clusters -sort length -sizeout, and removing clusters of size < 3. The reads belonging to each cluster were mapped back to the consensus sequence of the cluster using minimap2 with the command -ax asm5. Genotype likelihoods were estimated from the mappings with bcftools9 (v1.9) with the command mpileup -Ov -d 1000000 -L 1000000 -a "FORMAT/AD,FORMAT/DP", and the results were filtered to show positions of SNPs present in ≥2x coverage using bcftools view -i 'AD[0:1-]>2' for each cluster. The list of SNP positions was used to phase the reads within a cluster, and a variant was called if ≥3 reads supported a combination of SNPs. Consensus reads were then grouped according to called variants, and consensus sequences were generated for each variant group. First, the homopolymers were unmasked in the consensus reads and a crude variant consensus was generated using usearch with commands -cluster_fast -id 0.99 -strand both -consout -sizeout. The crude variant consensus was polished with a workflow using minimap2 with commands -ax map-ont, bcftools mpileup -Ov -d 1000000 -L 1000000 -a "FORMAT/AD,FORMAT/DP", bcftools norm -Ov, bcftools view -i 'AD[0:1]/FORMAT/DP>0.5' -Oz and bcftools consensus.
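The homopolymer masking step described above (runs of length ≥3 collapsed to length 2) can be illustrated with a short Python sketch. The function name is hypothetical and the sketch assumes upper-case A/C/G/T input; it is not taken from the pipeline.

```python
import re

# A base followed by two or more copies of itself, i.e. a run of length >= 3.
HOMOPOLYMER = re.compile(r"([ACGT])\1{2,}")


def mask_homopolymers(sequence):
    """Collapse every homopolymer of length >= 3 to exactly two bases."""
    return HOMOPOLYMER.sub(r"\1\1", sequence)


masked = mask_homopolymers("AAAACGGGT")
```

Runs of one or two identical bases are left untouched, so sequences that differ only in the length of a long homopolymer become identical after masking and no longer split clusters during phasing.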
Pipeline parallelization
Many steps in the pipeline have been parallelized using GNU parallel10.
Generation of Reference Sequences for Mock Community
We obtained raw fast5 files from a previously reported sequencing effort of the ZymoBIOMICS Microbial Community DNA Standard using Oxford Nanopore Technologies GridION flowcells (available from: https://github.com/LomanLab/mockcommunity). The fast5 data was basecalled using the GPU-basecaller guppy v. 2.2.3 with “flipflop” mode. The basecalled reads were mapped to the existing reference sequences using minimap2 (v.2.12) with default settings. The mapped reads were assembled separately for each reference using minimap2 (v.2.12) to create overlaps and miniasm (v.0.3) to perform the assembly using default settings. The reads were mapped to the assembled genomes using minimap2 (v.2.12) with default settings, and racon (v.1.3.1) was used to retrieve corrected consensus sequences using default settings. The corrected sequences were subsequently polished with medaka (v.0.6.0, https://github.com/nanoporetech/medaka) with the “r941_flip_model” model. Ribosomal RNA operons were extracted from the draft reference genome assemblies using in silico PCR with our forward and reverse primers using the ipcress command from the package exonerate (v.2.2), and were verified with genome coordinates for rRNA operons predicted by barrnap (v.0.9) (available from: https://github.com/tseemann/barrnap).
To further remove any residual errors from the rRNA operon reference sequences after assembly and polishing, high-quality short reads generated from Illumina sequencing were downloaded from NCBI for each bacterial strain in the mock community (accessions: ERR2935851, ERR2935850, ERR2935852, ERR2935857, ERR2935854, ERR2935853, ERR2935848, ERR2935849) and used for final polishing. The Illumina reads were randomly subsampled to an expected average coverage of 100 for each bacterial strain using the sample command in seqtk (v.1.0) (available from: https://github.com/lh3/seqtk). The subsampled Illumina reads were mapped to the draft rRNA operon sequences using minimap2 with the settings: -ax sr. The BAM files were sorted and indexed by samtools. We performed variant calling using bcftools (v1.9) with the commands mpileup and call using the settings: ploidy =1. Variant calls were filtered using bcftools filter with the settings: quality > 200. Variant calls were manually inspected and corrected, if needed, by visualizing mapping profiles in CLC Workbench. Polished consensus sequences were generated with bcftools consensus to generate high-quality references for use in benchmarking error rates in this study.
Data analysis
Chimera detection
Chimeras in the consensus sequences were detected by usearch12 with the commands -uchime2_ref -strand plus -mode sensitive, using our curated rRNA operon reference sequences from the ZymoBIOMICS Microbial Community DNA Standard (see above).
Error profiling
Errors were detected by mapping the sequence data (raw reads, consensus sequences, variant consensus sequences) to our curated rRNA operon reference sequences from the ZymoBIOMICS Microbial Community DNA Standard (see above). Mapping was performed with minimap2 -ax map-ont --cs, and the alignments were filtered using samtools view -F 2308, which excludes unmapped reads and secondary and supplementary alignments. The references and mappings were imported into the R software environment13 (v3.5.1), where errors in the sequences were profiled using mainly the tidyverse (v1.2.1 https://www.tidyverse.org/) and Biostrings14 (v2.48.0) R-packages and custom scripts (see Code availability). In brief, errors and their type (mismatch, deletion, insertion) were detected from the SAM cs tags. The positions of the errors were determined relative to the reference and used to categorize each error as a homopolymer error (hp+) or a non-homopolymer error (hp-). The error information was combined with metadata (UMI bin sizes, most similar reference etc.) and used to explore and visualize error as a function of different parameters.
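To make the error detection step concrete, the following Python sketch parses a short-form cs tag as produced by minimap2 --cs and labels each error as hp+ or hp-. It is an illustration only and not the R code used in this study: the function names are hypothetical, and the homopolymer definition (a run of at least min_run = 3 identical reference bases at the error position, or flanking an insertion) is an assumption, since the exact criterion is not specified here.

```python
import re

# Short-form cs operations: ":N" match, "*xy" substitution (ref x, query y),
# "+seq" insertion in the query, "-seq" deletion from the reference.
CS_OP = re.compile(r"(:\d+|\*[a-z]{2}|\+[a-z]+|-[a-z]+)")


def cs_errors(cs, ref_start=0):
    """Yield (error_type, ref_pos, length) with 0-based reference positions."""
    pos = ref_start
    for op in CS_OP.findall(cs):
        kind, body = op[0], op[1:]
        if kind == ":":
            pos += int(body)
        elif kind == "*":
            yield ("mismatch", pos, 1)
            pos += 1
        elif kind == "-":
            yield ("deletion", pos, len(body))
            pos += len(body)
        else:  # "+": insertion sits before reference base `pos`, consumes no reference
            yield ("insertion", pos, len(body))


def run_length(ref, pos):
    """Length of the run of identical bases that contains ref[pos]."""
    base = ref[pos]
    left = pos
    while left > 0 and ref[left - 1] == base:
        left -= 1
    right = pos
    while right + 1 < len(ref) and ref[right + 1] == base:
        right += 1
    return right - left + 1


def classify_errors(ref, cs, min_run=3):
    """Label each error hp+ or hp- (min_run is an assumed threshold)."""
    labelled = []
    for etype, pos, _length in cs_errors(cs):
        if etype == "insertion":
            # Check both reference bases flanking the insertion point.
            flanks = [p for p in (pos - 1, pos) if 0 <= p < len(ref)]
        else:
            flanks = [pos]
        in_hp = any(run_length(ref, p) >= min_run for p in flanks)
        labelled.append((etype, pos, "hp+" if in_hp else "hp-"))
    return labelled


# Toy reference and cs string (hypothetical, not data from this study).
toy_ref = "ACGTAAAAGCTT"
toy_result = classify_errors(toy_ref, ":2*ga:3-a:5")
```

In this toy example the mismatch at position 2 falls outside any homopolymer, whereas the deletion at position 6 falls inside the run of four A bases.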
Exploration of relative abundance inconsistencies
We observed a difference between the relative abundance estimated with our UMI consensus data and the theoretical abundance for the rRNA operons of the mock community. We investigated several different potential causes of this discrepancy by importing relevant data and metadata into the R software environment13 (v3.5.1), using mainly the tidyverse (v1.2.1 https://www.tidyverse.org/) and Biostrings14 (v2.48.0) R-packages and custom scripts (see Code availability).
Validate content of ZymoBIOMICS mock
Oxford Nanopore data from the ZymoBIOMICS Microbial Community DNA Standard described above were used for the analysis. The data were divided per species and imported into R. Based on read lengths, the total bp count was estimated for each species and used together with the theoretical genome sizes and rRNA operon copy numbers to estimate the theoretical relative abundance of 16S rRNA genes (equal to that of rRNA operons). The read length data were also used to estimate the amount of DNA theoretically available for rRNA operon PCR. A DNA fragment has to be equal to or larger than the rRNA operon to be a valid PCR template. Furthermore, DNA fragments are generated randomly, and a break point introduced within the operon also renders the fragment useless as a PCR template. Hence, all fragments shorter than 4500 bp were discarded, and 4500 bp was subtracted from the length of each remaining fragment to take broken operons into account. Based on the adjusted read lengths, we estimated an adjusted theoretical relative abundance of 16S rRNA genes.
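The adjustment described above can be sketched as follows. This Python example is illustrative and is not the R code used in the study; the function names are hypothetical, and the way the adjusted bp count is combined with genome size and operon copy number (adjusted bp divided by genome size, multiplied by copy number, then normalized across species) is an assumption.

```python
def usable_template_bp(read_lengths, operon_bp=4500):
    """Sum of read lengths after discarding reads shorter than the operon
    and subtracting the operon length from each remaining read."""
    return sum(length - operon_bp for length in read_lengths if length >= operon_bp)


def relative_operon_abundance(species):
    """Relative rRNA operon abundance per species.

    species maps a name to (read_lengths, genome_size_bp, operon_copies).
    Assumed weighting: usable bp / genome size * operon copies per genome.
    """
    weights = {
        name: usable_template_bp(lengths) / genome_size * copies
        for name, (lengths, genome_size, copies) in species.items()
    }
    total = sum(weights.values())
    if total == 0:
        raise ValueError("no usable template DNA in any species")
    return {name: weight / total for name, weight in weights.items()}


# Hypothetical toy input (not data from this study).
toy_species = {
    "species_A": ([3000, 6500, 14500], 4_000_000, 4),
    "species_B": ([5500, 9500], 2_000_000, 2),
}
toy_abundance = relative_operon_abundance(toy_species)
```

For the toy input, the 3000 bp read of species_A is discarded and the remaining reads contribute 2000 bp and 10000 bp of usable template.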
Investigate impact of GC and operon length
The possible impact of GC content (genome/rRNA operon) and operon read lengths was investigated by plotting the relative difference between observed and theoretical abundances.
Investigate PCR primer match
A bias in relative abundance can be introduced in the first PCR, where the rRNA operon is targeted with region-specific primers. If there are mismatches between primers and template, we would expect a lower annealing/amplification efficiency. Primer/template mismatches were estimated using ipcress as described above.
Investigate PCR amplification bias
A bias in relative abundance can also be introduced in the second PCR, where the UMI-tagged amplicons are amplified with > 25 cycles of PCR. If a specific template has a relatively poor amplification efficiency, we would expect this to affect the general bin size of this template. To investigate this, we imported UMI bin size statistics and UMI classifications into R and plotted bin sizes as a function of species, operon and operon size.
Analysis of genomic coverage skew due to growth
A bias in relative abundance could also occur due to the mock species being in different growth phases at the time of sampling. To investigate the potential contribution of growth to coverage bias, we used the previously generated genomes of the mock community species. Nanopore data were mapped to each species genome using minimap2 -ax map-ont, and per-position genome depth was calculated using samtools. Ribosomal RNA operon genome coordinates were predicted by barrnap as described before. The data were imported into R and used to create read coverage plots.
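As an illustration of how per-position depth can be condensed before plotting, the Python sketch below averages depth in fixed-size windows. It assumes tab-separated input with the columns reference name, 1-based position and depth, as written by samtools depth; the function name and the window size are hypothetical and do not describe the R code used for the coverage plots.

```python
from collections import defaultdict


def windowed_mean_depth(depth_lines, window=10_000):
    """Mean depth per (reference, window index) from 'ref<TAB>pos<TAB>depth' lines.

    Only positions present in the input are averaged. If zero-depth positions
    were omitted when the depth table was written, they are not counted here.
    """
    sums = defaultdict(int)
    counts = defaultdict(int)
    for line in depth_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        ref, pos, depth = line.split("\t")[:3]
        key = (ref, (int(pos) - 1) // window)  # positions are 1-based
        sums[key] += int(depth)
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sorted(sums)}


# Hypothetical toy input with a window of 10 bp.
toy_lines = ["chr\t1\t10\n", "chr\t2\t20\n", "chr\t11\t4\n"]
toy_windows = windowed_mean_depth(toy_lines, window=10)
```

Plotting the windowed means along each genome, with the predicted rRNA operon coordinates marked, gives a compact view of coverage skew between the origin and terminus of replication.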
Code Availability
Source code and analysis scripts are freely available at https://github.com/SorenKarst/longread-UMI-pipeline.
Data Availability
Raw and assembled sequencing data are available at the European Nucleotide Archive (https://www.ebi.ac.uk/ena) under the project number PRJEB32674, and a complete data overview can be found in Supplementary Table 8.
Footnotes
1 Oligonucleotide sequences © 2007-2018 Illumina, Inc. All rights reserved. Illumina adaptor sequences were used as synthetic priming sites, as they are proven to work robustly for PCR amplification and to allow the option to validate libraries with Illumina sequencing, which is useful during troubleshooting.