Abstract
Despite the massive developments within culture-independent methods for detection and quantification of microorganisms during the last decade, culture-based methods remain a cornerstone in microbiology. We have developed a new method for bacterial DNA enrichment and tagmentation allowing fast (< 24h) and cost-effective species level identification and strain level differentiation using the MinION portable sequencing platform (ON-rep-seq). DNA library preparation takes less than 5h and ensures highly reproducible distribution of reads that can be used to generate strain level specific read length counts profiles (LCp). We have developed a pipeline that by correcting the random error of reads within peaks of LCp generates a set (∼10 contigs per sample; 300bp - 3Kb) of high quality (>99%) consensus reads. Whereas, the information from high quality reads is used to retrieve species level taxonomy, comparison of LCp allows for strain level differentiation. With benchmarked 288 isolates identified on a single flow cell and a theoretical throughput to evaluate over 1000 isolates, our method allows for detailed bacterial identification for less than 2$ per sample at very high speed.
Introduction
Culture dependent methods remain indispensable in detailed identification of bacteria. Yet, successful typing of bacteria down to species/strain level remains not fully resolved 1. Several promising technologies and methodologies for solving the problem have been proposed but with a variable success. Generally, fast and cost-effective methods are not accurate enough, while those that are more accurate are also more laborious and/or expensive. Methods based on 16S rRNA gene sequencing are amongst the most universal, yet species level resolution cannot always be reached 2. More complex molecular tools that are able to reach strain level resolution such as PFGE, Rep-PCR, MLST or MALDI-TOF MS are hampered by one or several drawbacks that include low speed/throughput, limited databases, no taxonomic information, laborious procedure or high equipment cost 3,4,5.
The present gold standard for strain level bacterial identification is full genome sequencing. Optimally this approach combines information from high-throughput, short, good quality reads with lower throughput, poor quality but long reads 6. However, this approach is far from being cost effective, and the data analysis and interpretation is far from trivial 7,8,9.
The portable DNA sequencing platform MinION by Oxford Nanopore Technologies (ONT) offers an attractive tool with a potential to tackle the task of species/strain level identification 10. Unfortunately, ONT still deals with two critical problems: relatively high error rate at the base level and lower throughput compared to technologies offered by e.g. Illumina 10. We propose a DNA enrichment method that to a large extent have solved both these pitfalls by combining an optimized version of repetitive extragenic palindromic PCR (Rep-PCR) with a consecutive dual-stage Rep-PCR-2 step during which sample specific barcodes are incorporated.
Repetitive extragenic palindromic sequences in bacterial genomes were first described in the genomes of E. coli and Salmonella in 1984 by M.J. Stern 11. A decade later J. Versalovic used interspersed repetitive sequences as a binding site for primers developing Rep-PCR 12. Amplicons varying in length (from few dozens base pairs (bp) to few kilo base pairs (Kbp)) separated with electrophoresis create a genomic fingerprint that has been proven many times to have species and in some set-ups also strain level discriminative resolution of bacteria 13. Only five years later Rep-PCR was described as one of the most reproducible and commonly used method for species and strain level discernment 14, and numerous applications of the method have been reported in many fields including food processing, food safety, environmental microbiology, and medicine 15,16,17,18,19. Despite the immense progress in DNA sequencing technologies Rep-PCR is still a commonly used technique in many research groups mainly due to the low cost of the analysis and basic laboratory equipment needed 20. However, the low running costs comes with a price of highly laborious and time-consuming procedures involving 3-5h PCR, 3-5h electrophoresis, and complicated, tedious and potentially error prone fingerprint data analysis. Additionally, Rep-PCR only allows for bacterial discrimination but not direct identification 21.
We are presenting a new bacterial DNA enrichment method for Oxford Nanopore sequencing called ON-rep-seq. The method exploits an optimized version of Rep-PCR for reproducible amplification followed by a dual stage Rep-PCR-2 step allowing tagmentation of up to 96 samples in one reaction. Furthermore, we have developed a pipeline utilizing the information from the generated sequences at three levels: i) generation and comparison of isolate specific read length counts profiles (LCp) ii) detection of peaks in each LCp followed by within-peak correction of the random single base error iii) species-level taxonomy assignment using corrected consensus reads (Figure 1). The method has been tested on 38 different bacterial species and three strain level groups successfully identifying all bacteria down to the species level and discriminating strains with a sensitivity that is at least similar to a Whole Genome Sequencing (WGS) based approach.
Results
Sequencing of the Rep-PCR enriched library with MinION generates highly reproducible LCp
Similar to Rep-PCR gel based fingerprints, sequenced Rep-PCR products can be transformed into read length counts profiles (LCp) being a function of reads length and abundance. The shape and position of peaks is highly reproducible in all technical replicates across first two sequencing runs (Figure 2, Supplementary Figures 1 and 2) indicating that the barcode sequences do not affect the shape or the position of the peaks during Rep-PCR-2. Yet, as explained below, we observed a minor run effect in the third consecutive run resulting in shifted distribution of short/long reads.
Reads correction within individual peaks provides a set of high quality consensus sequences per isolate allowing detailed identification
A single band on a gel (or peak in LCp) of a Rep-PCR profile will contain mainly representatives of the same amplicon what would allow for base accuracy correction using tools such as e.g. Canu 22. With that assumption, we have developed a pipeline operating in three steps: strain specific LCp generation and comparison, within peak reads correction, and peak’s consensus sequence annotation. The pipeline generated on average 10 high quality consensus reads for each isolate (max = 26, min = 3, SD = 4) with mean length of 1 Kbp (max = 3.6 Kbp, min = 0.3 Kbp, SD = 0.6 Kbp). The number of reads used for correction within a peak (cluster size) varied from 50 to 2400 (mean = 254, SD = 246).
Subjecting the set of corrected reads for each sample to centrifuge classifier allowed for unambiguous annotation of all bacteria down to the species and subspecies level (Table 1). The average sequence similarity of corrected reads from strain validated with Illumina sequencing (S. enterica serovar Typhimurium C5) reached 99,4% (BLAST; min = 98.3%, max = 100%, SD = 0.5%). Among the isolates tested are for example Lactobacillus casei and Lactobacillus paracasei subsp. paracasei known to be indistinguishable based on 16S rRNA gene sequence comparison or Lactococcus lactis subsp. cremoris that cannot be distinguished from Lactococcus lactis subsp. lactis. All these strains were unambiguously discriminated using ON-rep-seq. Two bacterial species: Bacteroides thetaiotaomicron and Lactococcus lactis subsp. cremoris were tested in pairs from different culture collections resulting in all cases in highly reproducible LCp (Supplementary Figure 1).
Paired comparison of LCp can be used for the strain level differentiation
Five Listeria monocytogenes, four Salmonella enterica (three serovar Typhimurium and one serovar Oranienburg) and two Bacillus cereus strains have been used to evaluate the method for strain level discrimination. We have developed an algorithm (LCpCluster.R) estimating the level of similarity between the pairs of LCp generated by the ON-rep-seq. Among five L. monocytogenes strains four unique profiles were identified (Figures 3 and 4). Strains EGDe and LO28 generated identical profiles (Figures 3 A and 4).
No SNPs variants could be detected when comparing consensus sequences of corresponding peaks of all technical replicates (data not shown). Whole genome sequencing (WGS) data have been used to estimate the genetic similarity between EGDe and LO28 strains. The average nucleotide identity (OrthoANI) index between these two genomes reached 99.9%, while L. monocytogenes MLST schemes mapped against EGDe and LO28 found only one differing locus (dapE) out of seven tested (Supplementary Table 1).
This implies a high level of genetical similarity between the two strains that requires a specific approach to ensure differentiation.
Among four S. enterica strains LCpCluster.R recognized three unique profiles (Figure 3 B and 4). Serovar Typhimurium strains u292 and C5 showed the same ON-rep-seq LCp with no SNPs variants in corresponding peaks (data not shown). WGS comparison of these two strains revealed high OrthoANI reaching 99.9%. Salmonella enterica MLST schemes mapped against genomes of u292 and C5 also showed the same alleles profiles (Supplementary Table 1). This implies that S. enterica strains u292 and C5 could not be straightforwardly distinguished based on their genome using both methods.
Interestingly, serovar Typhimurium strain 4/74 that presented similar LCp to u292 and C5, yet with a clear additional peak in the position ∼1370bp (Figure 3 B and 4) reached OrthoANI above 99.9% and had the same MLST profile compared to u292 and C5. In this particular example ON-rep-seq presented higher discrimination power over OrthoANI and MLST analysis based on WGS data. Further investigation of the peak at position ∼1370bp disclosed that the consensus sequence presented high similarity (blast identity 1372/1384bp; 99.1%) to SopEΦ prophage. Moreover, this sequence could only be found in the draft genome of the 4/74 strain (blast identity 1371/1384bp; 99.1%) but not in any of the remaining S. enterica strains.
Finally, the two B. cereus strains generated clearly distinctive LCp and were classified as different strains. (Figure 3 C and 4). The LCp. Cluster results showing grouping according to “Ward.D2” hierarchical clustering on D_KLsym distance of all 48 isolates in four technical replicates from first two runs are given in Supplementary Figure 2.
Benchmarking of the R9.4.1 flow cell indicates the theoretical throughput of ON-rep-seq over 1000 bacterial isolates
To validate the method, two R9.4.1 flow cells were benchmarked for the maximum possible output generated. The first benchmarked flow cell generated in total over 2.6 M reads (after quality control and demultiplexing). See Supplementary Table 2 for details. In the first four consecutive runs, each lasting 4h, intertwined with the flow cell washing steps and storage for minimum 24h enough data was generated to successfully demultiplex and identify 4×96 bacterial profiles on a single flow cell. The last run generated 0.22 M reads which was enough to detect and correct sequences of 94 out of 96 samples.
The second flow cell generated in total 2.49 M reads respectively 1 M 0.56 M and 0.87 M for first (4h) second (4h) and third (12h) run. See Table Supplementary Table 2. All three runs of the second flow cell generated enough data to successfully analyze 96 bacterial profiles. To verify the minimum number of reads necessary to analyze all samples the data have been iteratively subsampled and subjected to the analysis with a receiver operating characteristic (ROC) curves to quantify tradeoff between pairwise “same/not-the-same” strain discrimination dependent on clustering cutoff. Throughout the analysis it was noticed that within-strain variance was larger than between-strain variance in cases of small differing features in the latter, and the disproportion of short reads vs long reads in the former case (the observation verified by sample mean read length regression vs sample read count; Supplementary Figure 3). This this disproportion was attributed to the third sequencing run on the reused flow cell, hence the latest repC run was omitted from the cluster analysis and most of ROC curves analysis that follows.
Clustering on different data sets were compared: all, “wo.rep*C” (without the third consecutive run: repC), 2%, 10%, 20%, 50% subsamples. “wo.rep*C” performed best most of the time, though random fluctuations in 50%, 20% and 10% subsamples overperformed occasionally at single data points. Subsampling to 50% and 20% (avg. #reads/sample 4326 and 1730) performed very similarly to full samples (avg. #reads/s 8652), while 10% subsamples performed worse, though still reasonably good, while 2% subsamples (avg. #reads/s 173) performed much worse, though relevant information is still present and retrievable even with such a small reads lengths sample (Supplementary Figure 4 C-H).
The flow cell benchmarking results showed that 20% of generated reads (avg. #reads/sample 1730) were already sufficient to analyze all samples. Notably, the number of isolates that could be analyzed simultaneously on a single flow cell will ultimately depend on the number and position of peaks in LCp (for strain level comparison). Nonetheless, our data demonstrate that the theoretical throughput of the R9.4.1 flow cell ranges between 960 and 1440 isolates depending on the sequencing run performance (∼1.5 M to ∼2.5 M reads respectively).
Discussion
The process of fast and accurate bacterial identification, subtyping and strain level differentiation is of high importance in epidemiology, to recognize infection outbreaks, determine its source or follow highly virulent nosocomial pathogens. It is also desired in the food industry to validate quality and safety and to investigate microbiologically complex communities like many fermented foods. For the past three decades the most commonly used and standardized methods became molecular techniques based on DNA analysis 14. Since first described in 1994 Rep-PCR targeting REP and/or repetitive intergenic consensus (ERIC) regions became a widely used methods of DNA typing 12,17. Its discriminatory power has been shown multiple times to be superior to many other typing methods including ribotyping 12,23, multilocus enzyme electrophoresis 24,25, but also biochemical characterization 26. Rep-PCR was often shown to have similar or slightly lower discriminatory power than pulse-field gel electrophoresis (PFGE) but was always considered a less laborious and cheaper solution 27,28,29. Among several Rep-PCR options (GTG)5-PCR have reported to be the most robust 30. Despite well-documented strain level discrimination power, the main pitfall of Rep-PCR is without a doubt its inability for taxonomic identification, without additional analysis such as 16S rRNA gene sequencing, which requires extra laboratory work and significantly increases time and cost of the analysis. Moreover, such strategy relays entirely in the discriminatory power of 16S rRNA gene that does not always allow for species level identification.
The massive leap in DNA sequencing methods made within the last decade heralded the inevitable decline of many “old-fashioned” DNA finger-print based typing methods. A single HiSeq X instrument (IIlumina) has a capacity to sequence about 35,000 average size bacterial genomes with 100 times coverage in a single run (Illumina.com). Yet, notwithstanding the immense potential, this technology is still not meant for fast, routine and cost-effective typing of bacteria. It is mainly due to the high equipment cost, low flexibility requiring collection of multiple samples (from dozens to thousands depending on the platform), relatively long runtime and complex data analysis. The portable, USB powered MinION offered by Oxford Nanopore Technologies is so far the cheapest (∼1000$) sequencing platform on the market. Its main advantage besides the price is the possibility to generate ultra-long reads with the longest ones crossing 1Mb. Nonetheless, there are two main reasons why ONT have not yet become the first choice of a sequencing strategy in many laboratories. First is the relatively high base calling error rate of a single DNA molecule and second, a relatively low throughput compared to many other platforms 10. Our method have largely solved these two hindrances, allowing ONT to be exploited for accurate, large scale and detailed identification of bacteria. Highly reproducible amplification of regions flanked by REP elements not only solved the main problem with sequencing redundancy one needs to deal during WGS, but also enabled single base error correction owing it to its random nature. ON-rep-seq offers the well documented discrimination power of the DNA fingerprint analysis, but also for the first time full access to the “hidden information” within each band DNA sequence in quality crossing 99% accuracy. Since each isolate composes on average of 10 corrected consensus reads with an average length of 1Kb this information can be used for highly accurate taxonomic identification. Even if one of the reads would not find a hit in a database, there are still several others to ensure classification. As shown here, all 48 isolates have been accurately assigned to the species and sub-species level. Also, for the first time users will be able to easily determine contaminations in case one or several peaks would turn out to belong to another organism. Lastly, by reducing the number of samples and increasing the coverage one could achieve even higher accuracy of the consensus sequence what could be used to assess presence of SNPs in profiles of closely related strains. This might be an additional source of previously unknown discrimination repository of Rep-PCR in some unique cases.
We have demonstrated here that ON-rep-seq successfully identified and differentiated between Listeria monocytogenes and Salmonella enterica serovar Typhimurium strains. Both species are of special epidemiological importance and model organisms for host-pathogen infection 31 32. Rep-PCR was previously recommended method for subtyping of L. monocytogenes and S. enterica with a similar discrimination power of PFGE or RAPD 29,33. The only two undistinguished pairs of strains were L. monocytogenes EGDe from LO28 and Salmonella enterica C5 from u292. Paired comparison of their genomes revealed high level of similarity (OrthoANI > 99.9%) further confirmed with the MLST indicating that genetic diversity between these two strains could be allocated in SNPs. Regrettably, none of these SNPs were found by comparing sequence within the peaks. Although ON-rep-seq cannot discriminate between strains that differs solely with SNPs it can be used for fast and cost-effective screening of multiple isolates to select those of identical profiles that should be subjected for deep sequencing saving resources, money and time.
Interestingly ON-rep-seq was shown to be superior to traditional WGS analysis in distinguishing between S. enterica u292 and 4/74. The OrthoANI between these two strains reached > 99.9% with identical MLST profiles. This makes it very challenging to differentiate between u292 and 4/74 at the strain level using WGS 34,35. However, comparison of ON-rep-seq based LCp allowed for clear and unambiguous differentiation between the two strains. The peak allowing this distinction was shown to be a mobile element with high similarity to a prophage. It was previously demonstrated that large fractions of genetic variation in in Salmonella strains is allocated in variable genomic regions and islands that encompass phage insertions 36.
Presented in this work barcodes enable accurate tagmentation of 96 isolates, but our data demonstrate that even about 1000 barcodes could be used on a single R9 flow cell. We have benchmarked two R9.4.1 flow cells to estimate the maximum possible output and cost per isolate. Since the flow cell price ranges between 475$-900$ (depending on the bundle offer) the sole cost of sequencing assuming highest output would range between 0.40 – 0.75 $. It is important to mention that the maximum data output will vary depending on the flow cell viability that may be affected by multiple washing steps. Furthermore, it was demonstrated herein that consecutive usage of the flow cell may be a source of an increased run effect. Therefore, the best performance of ON-rep-seq could be achieved if for example 96×10 barcodes were used in a single run lasting for maximum time (48h). It seems however that the new gadget offered by ONT called Flongle, promised to be released this year, could be the most optimal and user-friendly solution for ON-rep-seq (nanoporetech.com). Flongle is an adapter for MinION with one-quarter throughput of a R9 flow cell but price not crossing 100$. This means that up to 3×96 isolates could be analyzed in a single run for about 0.35$ per sample. Naturally, the user could then choose to sequence less isolates but ensure even better coverage.
In summary, we present here the DNA enrichment and barcoding method called ON-rep-seq (from: Oxford Nanopore based Rep-PCR based sequencing) that in combination with ONT sequencing platforms allows for highly cost effective, bulk screening of bacterial isolates with species and strain level resolution. We believe that ON-rep-seq has a potential to become a modern standard molecular based method with multiple applications in research, industry and medicine. By sharing it to other users we are looking forward for thorough validation of many more bacterial species, optimization of sequencing protocols and pipelines. We hope that conjoined effort of multiple users will also allow for the development of ON-rep-seq consensus reads database facilitating in the even faster and simplified identification.
Methods
Wet laboratory Rep-PCR-1
Bacterial genomic DNA was extracted using GenElute™ Bacterial Genomic Kit (Sigma Life Science, Darmstadt, Germany) according to the manufacturer’s instructions. In total 48 isolates represented by 38 different bacterial species were subjected for the analysis in duplicates in each of the three runs giving six technical replicates per isolate. The barcodes order was shifted during preparation of each library to ensure that every technical replicate is tagged with different barcode sequence. Three strains of Salmonella enterica, five Listeria monocytogenes and two Bacillus cereus strains have been used to evaluate the ability of the method for strain level differentiation. The detailed list of bacteria used for the analysis is given in Table1 while ON-rep-seq LCp in Supplementary Figure 1. The Rep-PCR reaction mix contained 5 μl PCRBIO HiFi buffer (5x), 0.25 μl of PCRBIO HiFi Polymerase (PCR Biosystems Ltd, London, United Kingdom), 4 μl of (GTG)5 primers (5 μM), 1 μl of DNA (∼20ng/μl) and nuclease-free water to a total volume of 25 μl. The Rep-PCR thermal conditions were optimized as follows: Denaturation at 95°C for 5 min; 30 cycles of 95°C for 30 s, 45°C for 1 min and 62°C for 4 min; followed by final elongation at 72°C for 5 min. It is important to note that several polymerases have been tested in order to shorten the elongation time without compromising the longest amplicons. With current settings, the PCR takes less than 3h on SureCycler 8800 (Agilent, CA, USA).
Barcoding by dual-stage Rep-PCR-2
We have designed 96 ONT compatible barcodes (Supplementary Table 3) with 10bp spacer separating ONT motor protein adapter from the barcode sequence and (GTG)5 pairing region. The spacer was added to ensure higher tolerance for the low-quality at the beginning of the sequence entering the pore and thus higher recovery of barcode sequence. At the same time the spacer sequence was designed to prevent creations of stem-loops in relatively long primers during low temperature annealing step. The Rep-PCR reaction mix contained 12 μl of PCRBIO UltraMix (PCR Biosystems Ltd, London, United Kingdom) 2 μl of corresponding repBC primer (10 μM), 1 μl of PCR product from Rep-PCR-1 and nuclease-free water to a total volume of 25 μl. Incorporation of ONT compatible adapters (Supplementary Table 3) was performed using dual-stage PCR where first 3 cycles provide optimal annealing of (GTG)5 regions, while next 10 cycles allow for best hybridization of full adapters in consecutive cycles: Denaturation 5 min; 3 cycles of 95°C for 30 s, 45°C for 1 min and 62°C for 4 min; followed by 10 cycles of 95°C for 30 s, 65°C for 1 min and 72°C for 4 min and final elongation at 72°C for 5 min.
Library preparation and ONT-sequencing
After Rep-PCR-2 samples were pooled using 10 μl of each sample. Note that samples were not pooled in equimolar concentration due to expected differences in length of amplified regions between the samples. However, it is advisable to verify the DNA concentration with Qubit® dsDNA BR Assay Kit (Life Technologies, CA, USA) for the quality control of the Rep-PCR-2 step. The measurement was performed with Varioskan Flash Multimode Reader (Thermo Fischer Scientific, MA, USA). Fluorescence was measured at 485/530 nm.
The pooled library was cleaned with AMPure XP beads (Beckman Coulter Genomic, CA, USA) in volumes 100:50 μl respectively. The bead pellet was washed with 80% ethanol and re-suspended in 100 μl of nuclease-free water. The bead washing step was added to shift the proportion of short to long reads that is multi-template PCR specific feature and to remove primer-dimers. The pooled and bead-purified library was measured with Qubit® dsDNA HS Assay Kit (Life Technologies, CA, USA) and 66 ng of library was used as an input to the End-prep step in 1D amplicon by ligation protocol (ADE_9003_v108_ revT_18Oct2016) with one adjustment: 80% ethanol instead of 70% was used for all washing steps.
To validate our method, we have benchmarked two R9.4.1 flow cells for the maximum possible output generated. First flow cell was used 10 days after the delivery in five consecutive runs, each lasting 4h, intertwined with the flow cell washing steps and storage for minimum 24h (QC 1347 active pores). Second flow cell was used 44 days after the delivery in three consecutive runs lasting respectively 4 h, 4h and 12h (to collect maximum amount of data from declining flow cell; QC 1105 active pores). After each run the flow cell was washed according to the manufacturer’s instructions and a new library was prepared and loaded. In order to evaluate the possibility of barcode specific amplification during Rep-PCR-2 step, all samples received different barcode in consecutive sequencing runs. The data from the first benchmarked flow cell were used solely to test the optimal concentration of DNA needed and viability of the flow cell while data from the second flow cell are presented herein and can be downloaded from SRA NCBI repository (#SUB4333515).
Data analysis
Data collection, base calling, demultiplexing and trimming
Data were collected using MinKnow 1.10.23. The amount of data collected in both R.9.4 flow cells is listed in Supplementary Table 2. Albacore v2.1.3 pipeline was used to base call raw fast5 to fastq. Porechop v0.2.2 was used for adapters trimming and samples demultiplexing. Porechop settings together with the list of custom adapters (adapters.py) compatible with oligos given in Supplementary Table 3 are available at (bitbucket.org/modelscat/on-rep-seq) The script allows for demultiplexing up to 96 barcodes and trimming of: ONT adapters, custom spacers and tandem repeats of (GTG)n.
Correction and base location of peaks
Peaks are identified in LCp expressed as sequencing length (x-axis) by number of reads (y-axis) by fitting local third order polynomials in a sliding window of size 1/50 of the x-span across the x-axis, followed by calculation of the first- and second order derivatives. The position of a peak is identified at the x-axis where the first derivative is zero and the second derivative is negative. Only peaks with intensity higher than baseline, defined as a moving boxcar (zero order polynomial) in a broad window (4 times the size of the window used for calculation of the derivative) are used for further analysis. The identified peaks are ordered based on the height, and a representative fragment are used for data base matching.
Reads correction within a peak
Sequences containing quality scores (fastq files) resolved within each peak were retrieved using Cutadapt v1.15 37, and corrected with Canu v1.6 22 using the following parameters: genomeSize=5k, minimumReadLength=200, correctedErrorRate=0.05, corOutCoverage=5000, corMinCoverage=2 and minOverlapLength=50. The corrected reads were sorted-by-length and clustered with UCLUST (cluster_fast) from USEARCH v10.0 38, using the following options: -id of 0.9, -minsl of 0.8, -sizeout, and min_cons_pct of 20. Subsequently, consensus sequences were sorted-by-size (coverage) and those with a minimum coverage-size of 50× were kept for downstream analyses.
Classification
Centrifuge 1.0.3 microbial classification engine was used for labelling of corrected reads.
Comparison of LCp
The identification of a good distance measure on read length counts profiles (LCp) was approached by considering them as approximating samples of their underlying sampling distributions. Ideally one would like to understand processes involved in signal peaks and noise formation, thus a priori distributions could be postulated, and later optimized for profiles posteriors. Primarily, empirical discrete length distributions were smoothed with window moving average (ma). Selection of ma window size was done by computing the average jitter of all profiles: an average number of times when profile’s discrete derivative changes sign (change to 0 was counted as 0.5). From mean jitter plot “ma” window size was selected to 20, the point of the lowest second derivative, after which second derivative stabilized closely around 0, meaning the information (jitter) loss due to increasing of window size became relatively low and more constant (Supplementary Figure 4 A, B). Next with each LCp assigned was a “ma” smoothed and probability-normalized “distribution profile” Dp.
Stability problems around #reads(i)=Dp(i)=0 are avoided by considering a mixture of “ma” smoothed Dp and the uniform read lengths distribution (in a considered range 150 to 3000bp) with proportions (0.99, 0.01). The distance between two samples, reads lengths-based, was defined as a function of LCp_1, Dp_1, LCp_2, Dp_2. One natural approach was to consider the probability of sampling LCp_1 from Dp_2, however for the distance to be comparable between samples of different read counts it needed to be normalized by total read count. Resulting is the following logarithmic formula:
The above formula is however not centralized because the distance of a sample to itself is not 0 but it is rather equal to sample’s smoothed entropy. Centralization of this distance yields distance very similar to Kullback-Leiber divergence of probabilities, which is proposed for the distance between LCp, as follows:
In the following clustering analysis, we use the symmetrized version:
Analysis of D_KLsym distance between peaks profiles performance on bacterial LCp generated with ON-rep-seq
Validation of KL based distance on LCp by hierarchical clustering was performed on sequencing results where clusters were compared with down-to-strain sample labels. To promote clusters with low variance around centroids “Ward.D2” clustering method was selected and performed with modified “heatmap3” R library. Figure 4 shows clusters recovered with cutoff=0.09 where all L. monocytogenes, B. cereus and S. enterica strains with clearly visible feature peaks were properly clustered.
Whole genome sequencing data analysis
Complete or draft genomes L. monocytogenes EGDe (#NC_003210.1) and LO28 (#AARY02000001.1-2001127.1); S. enterica serovar Typhimurium ST4/74 (#CP002487.1) and u292 (#ERR277220) were downloaded from public databases and compared using OrthoANI 8.
Salmonella enterica MLST schemes (internal fragments and their alleles) hosted at PubMLST.org were mapped against genomes of U292 (#ERR277220) and 4/74 ((#CP002487.1), as well as the assemble contigs of C5 strain included in this study.
For strain C5, DNA was subjected to library preparation (Nextera XT kit, following manufacturer procedures) and sequencing on Illumina NextSeq platform. High-quality reads (>95% quality and minimum size of 50nt using Trimmomatic v0.35 39 were de-convoluted from phiX174 controls reads (-id: 0.97, -query_cov: 0.97) and dereplicated using USEARCH v10 40. Subsequently, reads were assembled into contigs using Spades v3.5.0 41. Contigs with a minimum size of 10,000 bp generated for C5 strain, and in addition to the publicly available U292 and 4/74 putative genomes, were subjected to MLST analysis on the CLC Genomics Workbench v11.1 using a minimum alignment length of 400bp and high level of alignment stringency.
Authors contributions
L.K conceived the presented idea, participated in the data analysis, developed the wet-lab protocol, participated in pipeline development, designed figures, wrote the manuscript. J.L.C.M participated in pipeline development, data analysis and interpretation, tables generation, contributed in manuscript writing. D.N.M participated in optimization of the wet-lab protocol, data analysis and interpretation. M.B.M participated in optimization of the wet-lab protocol. M.A.R participated in the data analysis and interpretation, pipeline development, contributed in manuscript writing. M.S participated in the data analysis, interpretation, pipeline development, designed several figures, contributed in manuscript writing. D.S.N supervised the work, participated in formulation of the initial idea, contributed in manuscript writing. All authors edited and approved the final version of the manuscript.
Corresponding authors
Lukasz Krych or Josué L. Castro-Mejia or Dennis S. Nielsen
Software and code
ON-rep-seq pipeline is available at: bitbucket.org/modelscat/on-rep-seq
Data
Fastq files can be downloaded from SRA NCBI repository (#SUB4333515)
Acknowledgements
To Pernille Johansen for inspirational conversation. To Witold Kot and Aarhus University for generating WGS data. To Basheer Yousef Aideh for providing access to the broad bacterial collection. To Henrik Siegumfeldt for help in editing.