ABSTRACT
Identifying combinations of taxa distinctive for microbiome-associated diseases is considered key to the establishment of diagnosis and therapy options in precision medicine and imposes high demands on the accuracy of microbiome analysis techniques. We propose subsequence-based 16S rRNA data analysis as a new paradigm for microbiome phenotype classification and biomarker detection. This method and software, called DiTaxa, substitutes standard OTU clustering or sequence-level analysis by segmenting 16S rRNA reads into the most frequent variable-length subsequences. These subsequences are then used as the data representation for downstream phenotype prediction, biomarker detection and taxonomic analysis. Our proposed sequence segmentation, called nucleotide-pair encoding (NPE), is an unsupervised data-driven segmentation inspired by Byte-pair encoding, a data compression algorithm. The identified subsequences represent commonly occurring sequence portions, which we found to be distinctive for taxa at varying evolutionary distances and highly informative for predicting host phenotypes. We compared the performance of DiTaxa to state-of-the-art methods in disease phenotype prediction and biomarker detection, using human-associated 16S rRNA samples for periodontal disease, rheumatoid arthritis and inflammatory bowel diseases, as well as a synthetic benchmark dataset. DiTaxa identified 17 out of 29 taxa with confirmed links to periodontitis (recall = 0.59), relative to 3 out of 29 taxa (recall = 0.10) by the state-of-the-art method. On synthetic benchmark data, DiTaxa obtained full precision and recall in biomarker detection, compared to 0.91 and 0.90, respectively, for the state-of-the-art method. In addition, machine-learning classifiers trained to predict host disease phenotypes based on the NPE representation performed competitively with the state of the art using OTUs or k-mers. For the rheumatoid arthritis dataset, DiTaxa substantially outperformed OTU features with a macro-F1 score of 0.76 compared to 0.65.
Owing to its alignment- and reference-free nature, DiTaxa runs efficiently on large datasets. The full analysis of a large 16S rRNA dataset of 1359 samples required ≈1.5 hours on 20 cores, while the standard pipeline needed ≈6.5 hours in the same setting.
Availability An implementation of our method called DiTaxa is available under the Apache 2 licence at http://llp.berkeley.edu/ditaxa.
1 Introduction
Microbial communities vary widely in their taxonomic structures and compositions [1, 2, 3]. The human microbiota fulfills important functions in supporting, regulating, and causing adverse conditions in its environment, motivating methods for inferring relationships between microbial taxa or functions and certain host phenotypes. Owing to its low cost, 16S rRNA amplicon data is a popular data type in microbiome studies. The 16S rRNA gene includes both variable and conserved regions and is universally present in archaeal and bacterial microorganisms [4, 5, 6]. Particular regions of the 16S rRNA gene are amplified with degenerate primers and sequenced. After sequencing, reads are typically clustered based on their sequence similarity to each other, and the resulting clusters are referred to as operational taxonomic units (OTUs). Three main strategies for creating OTUs have been developed. In the de novo OTU clustering scheme, input sequences are aligned against one another and OTU clusters are created based on a user-specified percent identity cutoff (in practice mostly 97%) without comparisons to reference databases. The de novo strategy is difficult to parallelize and is therefore limited to small-scale datasets. Variations of this method, such as sub-sampled open-reference OTU picking [7] or centroid-based greedy clustering approaches [8], accelerate this process and enable its application to larger datasets. Alternatively, in closed-reference OTU clustering, input reads are aligned to a set of cluster centroids defined in a reference database (containing clusters of previously identified OTUs) and are reported as an OTU if they align at a given threshold. However, this strategy does not report OTUs for novel taxa that are not part of the reference database.
An advantage of the closed-reference strategy is the generally high quality of the taxonomic annotations in the reference database, which can be used for taxonomic assignment of the OTUs from the community of interest. Finally, the open-reference OTU clustering scheme combines de novo and closed-reference picking: input sequences are aligned against a reference database (such as Greengenes [9]), and sequences that fail to match the reference are subsequently clustered de novo in a serial process [7]. Individual algorithms for OTU clustering and for pre- and post-processing have been combined into pipelines such as mothur [10], QIIME [11, 12], USEARCH [13] and LotuS [14].
Although OTU clustering has simplified 16S rRNA processing by substituting the analysis of millions of reads by the analysis of only thousands of OTUs, it still has several disadvantages. OTUs do not necessarily represent meaningful taxonomic units, such as species, and sequencing errors may inflate diversity estimates by orders of magnitude [15]. To prevent diversity overestimates, OTU-based approaches require highly stringent quality control and relaxed clustering at < 97% similarity. While this approach limits the inflation of OTUs by potential sequencing errors, it comes at the expense of taxonomic resolution and may combine organisms with distinct biological properties and capabilities into a single OTU. A further disadvantage is that OTU calling requires extensive sequence alignment. All of the above-mentioned OTU-picking strategies involve sequence alignments either to reference genomes or to the sample sequences, which is computationally expensive and cannot easily be extended to further samples. It was shown that OTUs were generally ecologically consistent across habitats, but the observed OTU content can differ substantially between clustering methods [16]. Since the number of obtained OTUs and their content depend on the pipeline and the parameter settings, reproducing the same analysis is difficult [17]. An alternative solution is the analysis of individual 16S rRNA gene sequences [18, 19, 20], which is computationally challenging, as each 16S rRNA sample may contain tens of thousands of sequences.
Popular machine learning tasks over 16S rRNA gene sequencing data are taxonomic classification, host phenotype prediction, and biomarker detection. Although k-mer features and some other non-OTU features have also been used [18, 21], the most common representation of 16S rRNA gene sequences is based on OTUs. Random Forest was reported as the most effective classification approach for several diseases [2, 3, 21, 22, 23]. Recently, we have shown that using k-mer representations of shallow sub-samples is computationally inexpensive (being reference- and alignment-free) and marginally outperforms OTUs in host phenotype and environment classification tasks [21]. However, a disadvantage of k-mer features is that short k-mers cannot easily be mapped to a taxonomy to obtain taxonomic biomarkers. Microbiome studies often aim to identify OTUs, taxa, or clades that differ in their abundance across two or more subsets of the input samples (e.g. between diseased and healthy states), here referred to as biomarker discovery [24, 25]. Identifying these biologically informative taxa that are enriched in only a subset of phenotypes (e.g. diseased subjects or patients that respond better to a certain treatment) is a challenging task, in particular for metagenomic samples, because of their high dimensionality, sequencing errors, and other systematic biases, such as the presence of chimeric sequences [15]. One prominent biomarker example is the over-representation of the Firmicutes phylum in obese individuals compared to lean controls [26, 27, 28]. If such over-representations are causal for the aetiology of a disease, detecting these biomarkers might have therapeutic implications, provided that disease progression can be reversed by targeting the over-represented causative species using emerging technologies such as CRISPR/Cas9 or phage-based targeting [29, 30, 31].
This is also true for biomarkers that are inversely related to disease progression, such as the over-representation of Akkermansia muciniphila in individuals with a healthier metabolic status and better clinical outcome after caloric restriction [32]. For many other diseases where a microbiological component is expected, such biomarkers (or combinations of biomarkers) have yet to be found. Even when the biomarkers are not causal, they may enable prediction of the disease state or disease sub-types, or suggest suitable therapies in personalized medical interventions.
Different methods have been developed to identify OTU-based biomarkers [33]. The most widely used method is linear discriminant analysis effect size (LEfSe), which has a particular focus on high-dimensional class comparison for metagenomic analysis and determines the features (such as taxa, OTUs, genes or clades) most likely to explain differences between two or more classes from relative OTU abundances [34]. This method uses the non-parametric factorial Kruskal-Wallis (KW) sum rank test [35]. Several other tools with similar functionality exist that apply different statistical tests to sample profiles based on OTU features, such as STAMP [36], MetaStats [34] and MetagenomeSeq [33].
In this paper, we propose DiTaxa, an alignment- and reference-free, subsequence-based paradigm for processing 16S rRNA microbiome data for phenotype and biomarker detection. DiTaxa substitutes standard OTU clustering by segmenting 16S rRNA sequences into variable-length subsequences. The obtained subsequences are then used as the data representation for downstream phenotype and biomarker detection. We show that DiTaxa outperforms the state-of-the-art approach in biomarker detection on synthetic data and on a number of disease-related datasets. In addition, DiTaxa performs competitively with the k-mer based state-of-the-art approach, outperforming OTU features, in phenotype prediction.
2 Material and Methods
2.1 Datasets
Inflammatory Bowel Diseases
We use the largest pediatric Crohn’s disease dataset available to date, described in [37], which covers different types of Inflammatory Bowel Diseases (IBD). This is a dataset of 1359 labeled 16S rRNA samples: 731 from pediatric (≤ 17 years old) patients diagnosed with Crohn’s disease (CD), 219 from patients with ulcerative colitis (UC), 73 from patients with indeterminate colitis (IC), and 336 samples verified as healthy. Sequencing was targeted towards the V4 hypervariable region of the 16S rRNA gene. We downloaded OTU representations of the samples, obtained using the QIIME pipeline [7], from the Qiita repository.
Rheumatoid arthritis
We downloaded read data (454 platform) of the 16S rRNA gene sequences of V1 and V2 rRNA for 114 fecal DNA samples of a rheumatoid arthritis (RA) study [38] from SRA (ID: SRP023463). OTU clustering was performed on the filtered reads (365.7k, 23.0%), of which 140,382 were unique and 119,217 were singletons, and resulted in 949 OTUs at 97% identity. The OTU clustering pipeline is detailed in §2.3.
Periodontal disease
We use the data provided by Jorth et al. [39] to differentiate between healthy and diseased periodontal microbiota. This dataset consists of microbial samples collected from the subgingival plaques of 10 healthy individuals and 10 patients diagnosed with periodontitis. Sequencing was targeted towards the V4-V5 hypervariable region of the 16S rRNA gene. As for the RA dataset, we obtain the OTU features using the clustering pipeline detailed in §2.3.
Synthetic dataset
To evaluate DiTaxa in a known setting, we generated a dataset of synthetic samples using Grinder v. 0.5.3 [40] based on 1000 V4 regions of different genera from Greengenes (GG) sequences [9]. V4 regions were extracted from the Greengenes 13.8 database using the forward and reverse primer sequences GTGCCAGC[AC]GCCGCGGTAA and ATTAGA[AT]ACCC[CGT][AGT]GTAGTCC. To generate 16S rRNA datasets, the lengthbias parameter was set to zero and the unidirectional parameter was set to one. To cover the full V4 region, the amplicon read length distribution was set to 300 and the fold coverage of the input reference sequences was set to 30. We set the percentage of reads in the amplicon libraries that are chimeric to 10%. We used default parameters for the chimera distribution, resulting in 89% bimeras, 11% trimeras and 0.3% quadmeras. Sequencing errors were introduced into the reads at positions that follow a uniform model, using the default ratio of substitutions to indels (4 substitutions for each indel). Two sets of samples were created, denoted as case and control samples, with an average number of sequences of 29,204 reads in both groups. While in the control set all 1000 genera were set to the same abundance (mean abundance set to 0.1%), 500 randomly selected GG V4 sequences (corresponding to unique genera) were enriched at equal levels in the case dataset (μ of non-selected genera set to 0.05%; μ of selected genera set to 0.15%). For both the control and the case settings, 100 samples were generated, each with variations under the normal distribution (σ = 0.02). We processed the synthetic dataset using a standard pipeline consisting of USEARCH and UPARSE, detailed in §2.3, and generated 1,041 OTUs at 97% sequence identity.
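The primer-based extraction of V4 regions described above can be illustrated with a short Python sketch. It is not the script used to build the benchmark: it assumes that both primer patterns are matched on the same strand as written, that the bracketed classes are read as regular-expression character classes for degenerate bases, and that the primer sites themselves are excluded from the returned region. The function name `extract_v4` is hypothetical.

```python
import re

# Primer patterns as given in the text; bracketed classes stand for degenerate bases.
FORWARD = "GTGCCAGC[AC]GCCGCGGTAA"
REVERSE = "ATTAGA[AT]ACCC[CGT][AGT]GTAGTCC"

# Non-greedy match from the forward primer site to the first downstream reverse primer site.
V4_PATTERN = re.compile(f"{FORWARD}([ACGT]*?){REVERSE}")


def extract_v4(sequence):
    """Return the region between the two primer sites, or None if a site is missing.

    Assumption: primers are excluded from the returned region and RNA-style
    'U' characters are converted to 'T' before matching.
    """
    match = V4_PATTERN.search(sequence.upper().replace("U", "T"))
    return match.group(1) if match else None


# Toy sequence (not a real 16S gene): flank + forward site + insert + reverse site + flank.
toy = "AAAA" + "GTGCCAGCAGCCGCGGTAA" + "ACGTACGT" + "ATTAGATACCCTGGTAGTCC" + "TTTT"
v4_region = extract_v4(toy)
```

Sequences in which either primer site is absent yield `None`, so they can be skipped when assembling the reference set.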
2.2 Nucleotide-pair Encoding
The idea of Nucleotide-pair Encoding (NPE) is inspired by the Byte Pair Encoding (BPE) algorithm, a simple universal text compression scheme [41, 42], which has also been used for compressed pattern matching in genomics [43]. Although BPE had long lost its popularity in compression, it recently became popular again for a different reason, namely word segmentation for machine translation in natural language processing (NLP). BPE became a common approach for data-driven unsupervised segmentation of words into their frequent subwords, which facilitates open-vocabulary neural machine translation and improves translation quality by reducing the vocabulary size [44, 45]. In this work, we adapt the BPE algorithm to split biological sequences into frequent variable-length subsequences and call the resulting method Nucleotide-pair Encoding (NPE). We propose NPE as a general-purpose segmentation for biological sequences (DNA, RNA, and proteins). In contrast to the use of BPE in NLP for vocabulary size reduction, we use this method to increase the number of symbols from 4 nucleotides to a large set of variable-length biomarkers.
The input to NPE is a set of sequences. We treat each sequence as a list of characters (nucleotides in the case of 16S rRNA gene sequences). The algorithm finds the most frequently occurring pair of adjacent symbols in the sequences. Next, we replace all instances of the selected pair with a new subsequence (the merged pair as a new symbol). The algorithm repeats this process until a certain vocabulary size is reached or no more frequently occurring pairs of symbols are available. The obtained merging operations can be inferred once from a large set of sequences in an offline manner and then applied to an unseen set of sequences. A simple pseudo-code of NPE is provided in Algorithm 1.
Algorithm 1: Adapted Byte-pair Encoding (BPE) algorithm for segmentation of biological sequences (NPE)
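The merge procedure described in this section can be sketched in a few lines of Python. This is a minimal illustration written from the prose description, not the DiTaxa implementation: the function names (`train_npe`, `merge_pair`, `segment`), the stopping rule (stop when the best pair occurs fewer than twice), and the tie-breaking (first pair encountered) are assumptions. Pair counting here also counts overlapping occurrences, which is a simplification.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every non-overlapping occurrence of `pair` by the merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def train_npe(sequences, num_merges):
    """Learn up to `num_merges` merge operations from a list of sequences."""
    corpus = [list(seq) for seq in sequences]  # start from single characters
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols in corpus:
            pair_counts.update(zip(symbols, symbols[1:]))
        if not pair_counts:
            break
        best_pair, count = pair_counts.most_common(1)[0]
        if count < 2:  # assumed stopping rule: no pair occurs repeatedly
            break
        merges.append(best_pair)
        corpus = [merge_pair(symbols, best_pair) for symbols in corpus]
    return merges


def segment(sequence, merges):
    """Apply learned merges, in training order, to a possibly unseen sequence."""
    symbols = list(sequence)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


merges = train_npe(["ACGTACGT", "ACGTTT"], num_merges=3)
units = segment("ACGTACGA", merges)
# A sample can then be represented by the counts of its units.
unit_counts = Counter(units)
```

Because segmentation only concatenates adjacent symbols, joining the units always reproduces the input sequence, so the representation is lossless at the sequence level.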
2.3 Standard 16S rRNA gene processing workflow
To evaluate the performance of DiTaxa against the state-of-the-art, we used a standard 16S rRNA gene processing workflow employed in previous studies on 16S rRNA data [46, 47, 48]. Note that throughout this paper “the standard pipeline (STDP)” refers to this workflow:
The obtained 16S rRNA gene sequencing reads are quality controlled and clustered using the USEARCH 8.1 software package, where quality filtering is done with fastq_filter (-fastq_maxee 1).
The OTU clusters and representative sequences are determined using the UPARSE algorithm (derep_fulllength: -minuniquesize 2; cluster_otus: -otu_radius_pct 3) [49].
The next step is taxonomy assignment using the EZtaxon database [50] as the reference database, with the assignment made by the RDP Classifier [51].
The OTU absolute abundance table and mapping file are used for statistical analyses in LDA Effect Size (LEfSe) [34].
2.4 DiTaxa computational workflow
The DiTaxa computational workflow has three main components: (i) NPE representation creation, (ii) phenotype prediction, and (iii) biomarker detection and taxonomic analysis (shown in Figure 1). In this section, these components are described in detail.
NPE representation
The first component of DiTaxa is the creation of the NPE representation. The 16S rRNA gene sequences aggregated from all samples of all phenotypes are passed to the NPE algorithm (Algorithm 1) to train the segmentation operations. The trained segmentation is then applied to the sequences to split them into variable-length subsequences. We pick a vocabulary size large enough to obtain discriminative 16S rRNA subsequences, which are considered as biomarkers. Each sample is represented as a count distribution of its subsequences. We propose a bootstrapping scheme to investigate whether shallow sub-samples are sufficient to produce a proper representation.
In a previous study, using a bootstrapping framework, we showed that shallow sub-samples of 16S rRNA gene sequences are sufficient to produce a proper k-mer representation of the data for phenotype prediction [21]. Similarly, here we use bootstrapping to investigate the sufficiency and consistency of the NPE representation when only a small portion of the sequences is used. This has two important implications: first, sub-sampling reduces the preprocessing run-time; second, it shows that even shallow 16S rRNA sequencing is enough for phenotype prediction. We use a resampling framework to find a proper sampling size. Let $\theta_{\#npe}(X_i)$ be the normalized NPE distribution (with vocabulary size $\#npe$) of $X_i$, the set of sequences in the $i$th 16S rRNA sample. We investigate whether only a portion of $X_i$, which we represent as $\tilde{x}_{ij}^{(N)}$, i.e. the $j$th resample of $X_i$ with sample size $N$, would be sufficient for producing a proper representation of $X_i$. To find a sufficient sample size for $X_i$ quantitatively, we propose the following formulation in a resampling scheme. (i) Self-consistency: resamples of a given size $N$ from $X_i$ produce consistent $\theta_{\#npe}(\tilde{x}_{ij}^{(N)})$, i.e. resamples should have similar representations. (ii) Representativeness: resamples of a given size $N$ from $X_i$ produce $\theta_{\#npe}(\tilde{x}_{ij}^{(N)})$ similar to $\theta_{\#npe}(X_i)$, i.e. similar to the case where all sequences are used. As presented in [21], we measure the self-inconsistency $D_{S_i}$ of the resamples’ representations by calculating the average Kullback-Leibler divergence among the normalized NPE distributions of $N_R$ resamples (here $N_R = 10$) of size $N$ from the $i$th 16S rRNA sample:

$$D_{S_i}(N, \#npe) = \frac{1}{N_R (N_R - 1)} \sum_{p \neq q} D_{KL}\left(\theta_{\#npe}(\tilde{x}_{ip}^{(N)}) \,\|\, \theta_{\#npe}(\tilde{x}_{iq}^{(N)})\right),$$

where $p, q \in \{1, \dots, N_R\}$ index the resamples and each $\theta_{\#npe}(\cdot)$ is normalized to sum to one. We calculate the average of the values of $D_{S_i}(N, \#npe)$ over the $M$ different 16S rRNA samples:

$$\bar{D}_S(N, \#npe) = \frac{1}{M} \sum_{i=1}^{M} D_{S_i}(N, \#npe).$$

We measure the unrepresentativeness $D_{R_i}$ of the resamples by calculating the average Kullback-Leibler divergence between the normalized NPE distributions of $N_R$ resamples ($N_R = 10$) of size $N$ and the distribution obtained using all the sequences in $X_i$ for the $i$th 16S rRNA sample:

$$D_{R_i}(N, \#npe) = \frac{1}{N_R} \sum_{l=1}^{N_R} D_{KL}\left(\theta_{\#npe}(\tilde{x}_{il}^{(N)}) \,\|\, \theta_{\#npe}(X_i)\right),$$

where $l \in \{1, \dots, N_R\}$ indexes the resamples. We calculate the average of $D_{R_i}(N, \#npe)$ over the $M$ 16S rRNA samples:

$$\bar{D}_R(N, \#npe) = \frac{1}{M} \sum_{i=1}^{M} D_{R_i}(N, \#npe).$$
For the experiments on the datasets presented in §2.1, we measure self-inconsistency and unrepresentativeness for NR = 10 and M = 10 for #npe ∈ {10000, 20000, 50000} with sampling sizes ranging from 20 to 20000.
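The two resampling measures can be sketched in Python for a single sample as follows. This is an illustrative sketch under stated assumptions rather than the DiTaxa code: sequences are resampled with replacement, a small pseudocount keeps the Kullback-Leibler divergence finite when a unit is missing from a resample, and all function and parameter names are hypothetical.

```python
import math
import random
from collections import Counter


def normalized_distribution(segmented_sequences, vocabulary, pseudocount=1e-6):
    """Normalized unit distribution of a set of segmented sequences.

    Assumption: a small pseudocount is added so that no probability is zero.
    """
    counts = Counter(unit for seq in segmented_sequences for unit in seq)
    total = sum(counts[v] + pseudocount for v in vocabulary)
    return [(counts[v] + pseudocount) / total for v in vocabulary]


def kl_divergence(p, q):
    """Kullback-Leibler divergence D_KL(p || q) for strictly positive p and q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def resampling_scores(segmented_sequences, vocabulary, sample_size,
                      num_resamples=10, seed=0):
    """Return (self_inconsistency, unrepresentativeness) for one 16S sample."""
    rng = random.Random(seed)
    full = normalized_distribution(segmented_sequences, vocabulary)
    resamples = [
        normalized_distribution(
            rng.choices(segmented_sequences, k=sample_size), vocabulary)
        for _ in range(num_resamples)
    ]
    # Average divergence over all ordered pairs of distinct resamples.
    pairs = [(p, q) for a, p in enumerate(resamples)
             for b, q in enumerate(resamples) if a != b]
    self_inconsistency = sum(kl_divergence(p, q) for p, q in pairs) / len(pairs)
    # Average divergence between each resample and the full sample.
    unrepresentativeness = sum(
        kl_divergence(p, full) for p in resamples) / num_resamples
    return self_inconsistency, unrepresentativeness


# Toy sample: each inner list is one read already segmented into units.
toy_sample = [["ACGT", "AC"], ["ACGT", "T"], ["GG", "AC"], ["ACGT", "AC"]] * 25
toy_vocab = ["ACGT", "AC", "T", "GG"]
d_s, d_r = resampling_scores(toy_sample, toy_vocab, sample_size=50)
```

Averaging these two values over M samples and repeating for a range of sample sizes gives curves from which a stable sampling size can be read off.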
As shown in Figure 1, the NPE representation obtained in the first component is then used for two main use cases, i.e. phenotype prediction and biomarker detection.
Phenotype prediction
We used Random Forest (RF) classifiers [52], which have shown superior performance over deep neural network (deep multi-layer perceptron) and support vector machine (SVM) classifiers in phenotype classification for datasets of the size we use here [21, 22]. However, the implementation also provides deep learning and SVM classifiers. For disease phenotype prediction, the Random Forest classifiers were tuned for (i) the number of decision trees in the ensemble, (ii) the number of features considered when computing the best node split, and (iii) the function used to measure the quality of a split. We evaluate and tune the model parameters using stratified 10-fold cross-validation and optimize the classifiers for the harmonic mean of precision and recall, i.e. the F1-score, as a trade-off between precision and recall. We provide both micro- and macro-F1 metrics, which are averaged over instances and over categories, respectively.
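To make the distinction between the two averaging schemes concrete, the following Python sketch computes micro- and macro-F1 from true and predicted labels. It is a plain re-implementation of the standard definitions for illustration only; the function name `f1_scores` and the toy labels are hypothetical and do not come from the evaluated datasets.

```python
def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label classification."""
    labels = sorted(set(y_true) | set(y_pred))
    per_class_f1 = []
    tp_sum = fp_sum = fn_sum = 0
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denominator = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        per_class_f1.append(2 * tp / denominator if denominator else 0.0)
        tp_sum, fp_sum, fn_sum = tp_sum + tp, fp_sum + fp, fn_sum + fn
    macro_f1 = sum(per_class_f1) / len(per_class_f1)      # average over categories
    micro_f1 = 2 * tp_sum / (2 * tp_sum + fp_sum + fn_sum)  # pooled over instances
    return micro_f1, macro_f1


# Toy imbalanced example: one of two minority-class samples is missed.
y_true = ["RA", "RA"] + ["control"] * 8
y_pred = ["RA", "control"] + ["control"] * 8
micro_f1, macro_f1 = f1_scores(y_true, y_pred)
```

In this toy case the micro-F1 is 0.9 while the macro-F1 is about 0.80, which shows why macro-F1 is the more conservative summary when classes are imbalanced. For single-label problems the micro-F1 coincides with accuracy.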
We performed phenotype classification for a synthetic dataset (binary classification of 100 case samples versus 100 control samples), a Crohn’s disease dataset (binary classification of 731 Crohn’s disease samples versus 628 samples from controls or other diseases), and a Rheumatoid Arthritis (RA) dataset (44 RA disease subjects versus 70 control/treated/Psoriatic arthritis subjects). In order to evaluate the performance of the NPE representation, we compare the classification performance of RFs over NPE features with that over OTU and k-mer features, which are considered state-of-the-art approaches for disease phenotype prediction [21, 22].
Biomarker detection and taxonomic analysis
The steps designed in DiTaxa for the detection of differentially expressed markers in the phenotype of interest are shown on the light purple background in Figure 1:
The first step is finding discriminative markers between two phenotype states using a false discovery rate corrected two-sided χ2 test over the median-adjusted presence of markers in the samples. That is, if a marker occurs within a sample at least as frequently as the median frequency across samples, we consider it as present, otherwise as absent. We discard insignificant markers using a p-value threshold of < 0.05. For the multi-phenotype case, a one-versus-all policy is used. In addition, markers shorter than a certain threshold (< 50 bp) are discarded to ensure that the markers are specific enough for downstream taxonomic assignment.
The filtered markers are aligned with a local BLAST [53] against the EzBioCloud database as a local reference dataset [50], covering 62,362 quality-controlled reference sequences. We assign to each marker the taxon corresponding to the Lowest Common Ancestor (LCA) of the taxa annotated for its best hits in the reference taxonomy. Markers that cannot be aligned to the references are marked as ‘Novel’ markers.
In the third step, we remove redundant markers based on their co-occurrence information using the symmetric Kullback-Leibler divergence [54]:

$$D_{KL_{sym}}(P_m \| P_n) = \sum_{i} P_m(i) \log \frac{P_m(i)}{P_n(i)} + \sum_{i} P_n(i) \log \frac{P_n(i)}{P_m(i)},$$

where $P_m$ and $P_n$ are the normalized frequency distributions of the $m$th and $n$th markers across all samples, respectively, and $i$ runs over the samples. Using $D_{KL_{sym}} = 0$ to find identical markers, we split the set of markers into equivalence classes. Subsequently, from each class we pick only one representative marker, namely the one with the most specific taxonomy level whose taxonomy information is confirmed by the majority of markers within the class. The markers selected at this step are our final set of biomarkers.
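The first of the steps above (median-adjusted presence, χ2 test, false discovery rate correction) can be sketched in Python as follows. This is a hedged illustration, not the DiTaxa implementation: it assumes a Pearson χ2 test on a 2×2 presence-by-phenotype table without continuity correction, and it assumes the Benjamini-Hochberg procedure for the FDR correction, which the text does not specify. All function names are hypothetical.

```python
import math
from statistics import median


def chi2_pvalue_2x2(a, b, c, d):
    """Pearson chi-square p-value (1 degree of freedom) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        return 1.0  # degenerate table: no evidence of association
    statistic = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # Survival function of the chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(statistic / 2))


def marker_pvalue(frequencies, is_case):
    """Test one marker: present if its frequency reaches the median across samples."""
    threshold = median(frequencies)
    present = [f >= threshold for f in frequencies]
    a = sum(p and case for p, case in zip(present, is_case))            # present, case
    b = sum(p and not case for p, case in zip(present, is_case))        # present, control
    c = sum(not p and case for p, case in zip(present, is_case))        # absent, case
    d = sum(not p and not case for p, case in zip(present, is_case))    # absent, control
    return chi2_pvalue_2x2(a, b, c, d)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values (assumed FDR procedure)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Toy marker that occurs only in the 10 case samples.
toy_frequencies = [5] * 10 + [0] * 10
toy_is_case = [True] * 10 + [False] * 10
toy_p = marker_pvalue(toy_frequencies, toy_is_case)
```

Markers whose adjusted p-value is below 0.05 and whose length is at least 50 bp would then proceed to the alignment step.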
Our approach has three main outputs. The first is a taxonomic tree of the significant discriminative biomarkers, in which the taxa identified for the positive and negative classes are colored according to their phenotype (red for the positive class and blue for the negative class). The DiTaxa implementation for taxonomic tree generation uses a Phylophlan-based backend [55]. The second is a heatmap of the occurrences of the top biomarkers in the samples, where the rows denote markers and the columns denote samples. Such a heatmap allows biologists to obtain a detailed overview of the markers’ occurrences across samples (e.g., Figure 10 and Figure 11). The heatmap shows the number of distinct sequences hit by each biomarker in different samples, and stars in the heatmap denote markers hitting unique sequences, which cannot be analyzed by OTU clustering approaches. The third output is a list of novel markers for further analysis of potentially novel organisms.
To compare the performance of our approach with a standard workflow (defined in §2.3) on real datasets, we used the scientific literature as the ground truth. We extracted a list of organisms that have been experimentally identified by previous studies to be associated with or to cause periodontal disease. We then evaluate the recall of DiTaxa and of the standard workflow in detecting the confirmed organisms.
To quantify the performance of DiTaxa in a known synthetic setting versus the standard pipeline, we generated two high-dimensional synthetic datasets denoted as ‘case’ and ‘control’, as described in §2.1. The standard pipeline, described in §2.3, generates a list of significantly differentially expressed OTUs for both phenotypes. We then compare the significantly enriched OTU sequences (FDR-corrected P values < 0.05, LEfSe) and the significant (FDR < 0.05) subsequences determined by DiTaxa as biomarkers with the ground-truth 16S V4 regions using global nucleotide alignment with blastn v. 2.7.1+ with parameters “-perc_identity 100 -ungapped”. We quantified the number of false positives (FP), true positives (TP), false negatives (FN) and true negatives (TN) based on the presence of significant alignments of potential biomarker sequences (OTU or DiTaxa markers) to each of the 500 differentially expressed GG 16S regions. TPs were calculated as the number of GG sequences (n = 500) with at least one marker hit from the positive marker list. FNs were calculated as the number of case GG sequences (n = 500) without any marker hit from the marker collection found to be significantly enriched in the case set. TNs are the number of control GG sequences (n = 500) with at least one marker hit from the marker list that is significant in the control set. FPs were quantified as the number of low-abundance GG sequences (n = 500) without any marker hit from the set of markers found in the control sample. Recall was calculated as TP/(TP + FN), while precision was calculated as TP/(TP + FP).
3 Results
3.1 Phenotype prediction
Bootstrapping for sample size selection
We picked a stable sample size for each NPE vocabulary size based on the output of bootstrapping in phenotype prediction. Each point in Figure 2 represents the average of 100 (M × NR) resamples belonging to M randomly selected 16S rRNA samples, each of which is resampled NR = 10 times. As shown in Figure 2, a larger vocabulary size requires higher sampling rates to produce self-consistent and representative representations. As the structure of the curve does not vary much from dataset to dataset, to avoid redundancy, we only show the bootstrapping results for the rheumatoid arthritis dataset.
The classification results for different NPE vocabulary sizes on the synthetic, Crohn’s disease, and rheumatoid arthritis datasets using RF classifiers are presented in Table 1. All methods predicted the affected cases in the synthetic dataset without any error. For the Crohn’s disease dataset, k-mers with the MicroPheno approach [21] achieved a slightly better prediction performance, while NPE and OTU features achieved the same macro-F1 of 0.74. In rheumatoid arthritis prediction, NPE and k-mers achieved a macro-F1 of 0.76, outperforming OTU features by 11 percentage points. Changes in sample size did not substantially affect the prediction performance, suggesting that shallow sub-samples are sufficient for phenotype prediction using the NPE representation (Table 1).
3.2 Biomarker detection and taxonomic analysis
Marker detection results for synthetic data
In the biomarker detection for the synthetic dataset, DiTaxa did not report erroneous (neither FN nor FP) biomarkers, which resulted in a recall and precision value of 1. In comparison, the biomarkers detected with a standard pipeline, using OTU clustering and LEfSe, included 51 false negative and 47 false positive instances, resulting in a recall and precision value of 0.898 and 0.905, respectively (Table 2). This evaluation demonstrated the superiority of DiTaxa for biomarker discovery compared to OTU-based approaches in both recall and precision.
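The reported values for the standard pipeline follow directly from the definitions recall = TP/(TP + FN) and precision = TP/(TP + FP). The short Python check below reproduces them; the only assumption is that TP is obtained as the 500 enriched sequences minus the 51 false negatives, since TP and FN are both counted over the same 500 case sequences.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    return tp / (tp + fp), tp / (tp + fn)


enriched_sequences = 500           # differentially abundant GG sequences in the case set
false_negatives, false_positives = 51, 47
# Assumption: every enriched sequence is either a TP or an FN.
true_positives = enriched_sequences - false_negatives

precision, recall = precision_recall(true_positives, false_positives, false_negatives)
print(round(recall, 3), round(precision, 3))  # prints: 0.898 0.905
```

With no false negatives and no false positives, the same function returns 1.0 for both measures, which corresponds to the DiTaxa result on this benchmark.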
Biomarker detection results for periodontal disease
We next assessed the performance of DiTaxa and a standard pipeline (STDP; §2.3) in detecting taxa with otherwise confirmed links to periodontal disease (Table 4). DiTaxa performed better in the detection of relevant taxa (Table 3). Of 29 taxa identified as relevant in other studies, 17 were detected by DiTaxa, while the standard approach detected only 3 from the same dataset. Notably, experimentally verified taxa shown to alter the disease phenotype in mouse models, Fusobacterium nucleatum [56] and Porphyromonas gingivalis [56, 57, 58], were only detected by DiTaxa. Since periodontitis is a polymicrobial disease and oral biofilms are extremely diverse [59], detecting all relevant taxa confirmed by the literature from a single dataset is not feasible. For instance, A. actinomycetemcomitans is specifically associated with juvenile aggressive periodontitis in the Moroccan population and can hardly be found in the population from Turkey [39]. However, a relative comparison of the recall of different methods on the same dataset is still meaningful. The higher recall of DiTaxa in confirming the literature links, in comparison with a standard pipeline, shows that DiTaxa can be more accurate in detecting disease-specific biomarkers. A detailed comparison of the taxa predicted by the different methods and the taxa with links confirmed by the literature is also shown in Figure 3. Red shows the disease-associated taxa found by DiTaxa and blue indicates the taxa up-regulated in the healthy samples. The up-regulated organisms found by the standard pipeline are colored in yellow and the down-regulated organisms are colored in green. The intersection of DiTaxa and the standard approach is colored in orange for up-regulation, and cyan denotes the consensus of the methods in down-regulation.
Taxonomy of discriminative biomarkers for rheumatoid arthritis
A comparative taxonomic visualization of the differentially expressed markers detected by DiTaxa and by a common workflow is shown in Figure 4 for samples from patients with untreated rheumatoid arthritis (new onset RA) versus healthy individuals. Among the taxa predicted by DiTaxa for this comparison, Prevotella copri was the most significantly ranked, which was confirmed based on shotgun metagenome analysis and in mouse experiments [71], while the standard workflow only predicted the genus of this taxon as relevant [71]. DiTaxa also predicted Prevotella stercorea as implicated in new onset RA.
The DiTaxa results for patient samples from several other diseases versus healthy individuals are provided in Figure 5 (for CD versus healthy), Figure 7 (for indeterminate colitis versus healthy), Figure 6 (for ulcerative colitis versus healthy), Figure 9 (for treated rheumatoid arthritis versus healthy), and Figure 8 (for psoriatic arthritis versus healthy).
Biomarker heatmaps
Visualization of the occurrence pattern of the identified biomarkers is another output of DiTaxa. Examples of such a visualization for rheumatoid arthritis (Figure 10) and periodontal disease (Figure 11) are provided. The rows represent inferred biomarker sequences and are sorted based on the taxonomic marker assignments. The columns represent patient samples and are sorted firstly based on their phenotype and secondly based on their pattern similarity. ‘Novel’ organisms are shown in the top rows, denoting the markers that could not be aligned to any reference sequence and are therefore potentially novel taxa. The cell colors on the heatmap show the percentage of distinct marker sequences matching a biomarker per sample on a log scale. Markers targeting a single 16S sequence only are marked by a star.
The plots clearly show the varying “generality” of the inferred marker sequences, with some matching only unique 16S sequences and others found across larger numbers of sequences, indicating that the markers represent different levels of evolutionary relatedness among the targeted organisms. For instance, among the inferred markers assigned to Prevotella copri, some match multiple distinct 16S genes across patient samples, indicating strain-level diversity within this species that is evident from 16S, while others predominantly target single 16S copies across patient samples, indicating disease-associated subspecies diversity that can be discovered with this technique.
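The layout of the biomarker heatmaps described in this subsection can be illustrated with a small Python sketch. It is not the DiTaxa plotting code: the toy data, the variable names, the use of log10(1 + x) as the log scale, and the presence/absence profile used as a crude stand-in for pattern similarity are all assumptions made for illustration.

```python
import math

# Hypothetical toy input: marker -> (taxonomic assignment, percentage per sample).
markers = {
    "m1": ("Prevotella copri", [12.0, 0.0, 8.0, 0.5]),
    "m2": ("Novel", [0.0, 3.0, 0.0, 4.0]),
    "m3": ("Fusobacterium nucleatum", [1.0, 0.0, 2.0, 0.0]),
}
samples = ["s1", "s2", "s3", "s4"]
phenotype = {"s1": "disease", "s2": "healthy", "s3": "disease", "s4": "healthy"}


def row_key(item):
    """Put 'Novel' markers first, then sort by taxonomic assignment."""
    name, (taxon, _) = item
    return (taxon != "Novel", taxon, name)


row_order = [name for name, _ in sorted(markers.items(), key=row_key)]


def presence_profile(sample_idx):
    """Presence/absence pattern of all markers in one sample (assumed similarity proxy)."""
    return tuple(markers[m][1][sample_idx] > 0 for m in row_order)


# Columns: first by phenotype, then by the presence/absence pattern.
col_order = sorted(
    range(len(samples)),
    key=lambda j: (phenotype[samples[j]], presence_profile(j)),
)

# Cell values on a log scale; log10(1 + x) keeps zero percentages at zero.
heatmap = [
    [math.log10(1.0 + markers[m][1][j]) for j in col_order]
    for m in row_order
]

print(row_order)
print([samples[j] for j in col_order])
```

A real implementation would more likely order the columns within each phenotype by hierarchical clustering; the lexicographic ordering here only serves to make the two-level sort explicit.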
3.3 Runtime analysis
To assess computational efficiency, we compared the runtimes of DiTaxa and the standard workflow (Table 5). For both DiTaxa and the standard workflow, 20 cores were used in the computations. Workflow parts that could be parallelized in both pipelines are denoted with “||” in Table 5. The bottleneck of the DiTaxa computation is the segmentation training, which cannot be parallelized. However, the segmentation needs to be trained only once per dataset, after which any combination of phenotype analyses can reuse the trained segmentation and the resulting representation. Although the standard pipeline was a few minutes faster than DiTaxa on datasets of fewer than 200 samples, DiTaxa ran faster on the dataset of 1359 samples (93.93 min in total), while the standard pipeline took 385.66 min using the same computational setting.
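The runtime figures for the largest dataset can be put on a common scale with a short calculation. The sketch below assumes that the reported totals are 93.93 and 385.66 minutes (reading the commas in the reported values as decimal separators); the variable names are ours.

```python
# Total runtimes on the 1359-sample dataset, in minutes (values from the text,
# with the comma read as a decimal separator, which is an assumption).
ditaxa_minutes = 93.93
standard_minutes = 385.66

# Convert to hours and compute how many times faster DiTaxa ran.
ditaxa_hours = ditaxa_minutes / 60
standard_hours = standard_minutes / 60
speedup = standard_minutes / ditaxa_minutes

print(f"DiTaxa:   {ditaxa_hours:.2f} h")
print(f"Standard: {standard_hours:.2f} h")
print(f"Speedup:  {speedup:.1f}x")
```

Under this reading, DiTaxa needs roughly one and a half hours where the standard pipeline needs roughly six and a half, a speedup of about fourfold on the same 20 cores.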
4 Discussion and conclusions
We describe DiTaxa, a method implementing a new paradigm for host disease status prediction and biomarker detection from 16S rRNA amplicon data. The main distinction of this approach from existing methods is that it substitutes standard OTU-clustering [49] or sequence-level analysis [18] by segmenting 16S rRNA reads into the most frequent variable-length subsequences of a dataset. The proposed sequence segmentation, called Nucleotide-pair Encoding (NPE), is an unsupervised approach inspired by Byte-pair Encoding, a data compression algorithm that recently became popular in deep natural language processing. The identified subsequences represent commonly occurring sequence portions, which we found to be distinctive for taxa at varying evolutionary distances and highly informative for predicting host disease phenotypes. We compared the performance of DiTaxa to the state-of-the-art in disease phenotype prediction and biomarker detection, using human 16S datasets from metagenomic samples of periodontal, rheumatoid arthritis, and inflammatory bowel diseases, as well as a synthetic benchmark dataset. DiTaxa identified 17 of 29 taxa with confirmed links to periodontitis (recall = 0.59), while the OTU-based approach could only detect 3 of 29 organisms (recall = 0.10). In addition, we show that for the rheumatoid arthritis dataset, machine-learning classifiers trained to predict host disease phenotypes based on the NPE representation substantially outperformed OTU features (macro-F1 = 0.76 compared to 0.65) and performed competitively for the Crohn’s disease and synthetic datasets. For samples from patients with untreated rheumatoid arthritis (new onset RA) versus healthy individuals, DiTaxa ranked Prevotella copri as the most significant taxon, an association that was confirmed by shotgun metagenomic analysis and in mouse experiments [71], while the standard workflow only predicted the genus of this taxon as relevant [71].
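To make the idea of a Byte-pair-Encoding-inspired segmentation of nucleotide sequences concrete, the following Python sketch learns merge operations from a few toy reads and then segments a new read into variable-length subsequences. It is a generic illustration of the technique under our own assumptions, not the DiTaxa implementation: the function names (`learn_merges`, `apply_merge`, `segment`), the lexicographic tie-breaking, and the toy reads are all hypothetical.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Replace every non-overlapping occurrence of `pair` by its concatenation."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def learn_merges(sequences, num_merges):
    """Learn up to `num_merges` merge operations from nucleotide strings."""
    # Start with every sequence split into single-nucleotide symbols.
    corpus = [list(seq) for seq in sequences]
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols in corpus:
            pair_counts.update(zip(symbols, symbols[1:]))
        if not pair_counts:
            break
        # Most frequent adjacent pair; ties broken lexicographically (our assumption).
        best = min(pair_counts, key=lambda p: (-pair_counts[p], p))
        merges.append(best)
        corpus = [apply_merge(symbols, best) for symbols in corpus]
    return merges


def segment(sequence, merges):
    """Segment a sequence by replaying the learned merges in order."""
    symbols = list(sequence)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return symbols


# Toy reads, purely illustrative.
reads = ["ACGTACGT", "ACGTTT", "GGACGT"]
merges = learn_merges(reads, num_merges=3)
segments = segment("TTACGTACG", merges)

print(merges)
print(segments)
```

In this toy setting the frequent portion `ACGT` becomes a single unit, while rarer stretches stay split into shorter pieces. The number of merges controls the size of the subsequence vocabulary; how this parameter is chosen in practice is outside the scope of the sketch.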
Due to its alignment- and reference-free nature, DiTaxa can run efficiently on large datasets. The full analysis of a large 16S rRNA dataset of 1359 samples required ≈1.5 hours, whereas the standard pipeline took ≈6.5 hours with the same number of cores (20 cores). Although the conventional workflow was faster than DiTaxa on smaller datasets, the run-time difference of less than 30 minutes in those settings is worth the performance gain in phenotype prediction and biomarker detection. The applications of the NPE representation are not limited to 16S rRNA data; it can also be applied to shotgun metagenomics or any other biological sequences to infer intrinsic features from the data, instead of using parameter-dependent representations. Taken together, DiTaxa seems to provide a better solution for biomarker and phenotype detection than OTU-based methods. It thus could contribute to a better understanding of the microbial organisms associated with microbiome-related diseases and to the development of personalized diagnostics and therapy procedures.
Acknowledgements
Fruitful discussions with Curtis Huttenhower, Hinrich Schütze, Szymon Szafranski, and Benjamin Roth are gratefully acknowledged. P.C.M. received funding from the German Research Foundation (315980449).
Footnotes
1 Available at: https://www.ncbi.nlm.nih.gov/bioproject/PRJEB13679
2 Available at: https://qiita.ucsd.edu/study/description/1939
3 Downloaded from: http://datadryad.org/resource/doi:10.5061/dryad.d41v4