ENIGMA: An Enterotype-Like Unigram Mixture Model for Microbial Association Analysis

Ko Abe; Masaaki Hirayama; Kinji Ohno; Teppei Shimamura

doi:10.1101/397091

Abstract

Background One of the major challenges in microbial studies is to discover associations between microbial communities and a specific disease. A specialized feature of microbiome count data is that intestinal bacterial communities have clusters reffered as enterotype characterized by differences in specific bacterial taxa, which makes it difficult to analyze these data under health and disease conditions. Traditional probabilistic modeling cannot distinguish dysbiosis of interest with the individual differences.

Results We propose a new probabilistic model, called ENIGMA (Enterotype-like uNIGram mixture model for Microbial Association analysis), to address these problems. ENIGMA enables us to simultaneously estimate enterotype-like clusters characterized by the abundances of signature bacterial genera and environmental effects associated with the disease.

Conclusion We illustrate the performance of the proposed method both through the simulation and clinical data analysis. ENIGMA is implemented with R and is available from GitHub (https://github.com/abikoushi/enigma).

Introduction

More than 100 trillion microbes live on and within human beings and consists of complex microbial communities (microbiota). The majority of microbes cannot be cultured in laboratories, which makes it difficult to understand which individual microorganisms mediate vital microbiome-host interactions under health and disease conditions. However, recent important advances in high-throughput sequencing technology have allowed us to observe the composition of these intestinal microbes. That is, for each sample drawn from an ecosystem, the number of occurrences of each operational taxonomic units (OTUs) is measured and the resulting OTU abun-dance are summarized at any level of the bacterial phylogeny. Discovering recurrent microbial compositional patterns that are related with a specific disease is a significant challenge since individuals with the same disease typically harbor different microbial community structures.

The recent large-scale sequencing surveys of the human intestinal microbiome, such as the US NIH Human Microbiome Project (HMP) and the European Metage-nomics of the Human Intestinal Tract project (MetaHIT), have shown considerable variations in microbiota composition among individuals [1, 2]. In particular, the presence of community clusters characterized by differences in the abundance of signature taxa, referred to as enterotypes, have been first reported in humans [3]. Later, other studies found enterotype-like clusters which might reflect features of host-microbial physiology and homeostasis in different species [4, 5] or across human body sites [6–9]. These observed microbial stratification has motivated the development of methods to examine unknown clusters of microbial communities.

Probabilistic modeling of microbial metagenomics data often provides a powerful framework to characterize the microbial community structures [10–12]. For example, Knights et al. [10] applied a Dirichlet prior to a single-level hierarchy and proposed a Bayesian approach to estimate the proportion of microbial communities. Holmes et al. [11] extended the Dirichlet prior to Dirichlet multinomial mixtures to facilitate clustering of microbiome samples. Shafiei et al. [12] proposed a hierarchical model for Bayesian inference of microbial communities (BioMiCo) to identify clusters of OTUs related with environmental factors of interest.

However, such models are not suitable for identification of enterotype-like clusters of microbial communities doe to the following two reasons. First, the frameworks of Knights et al. [10] and Holmes et al. [11] do not explicitly address the association between the microbial compositional patterns and environmental factors of interest. Second, the framework of Shafiei et al. [12] models the structure of each sample by a hierarchical mixture of multinomial distributions that are dependent to factors of interest. It is known that individual host properties such as body mass index, age, or gender cannot explain the observed enterotypes [3]. Thus, such enterotype-like clusters that describes interindividual variability among humans do not always to directly affect host probabilities such as diseases ranging from localized gastroen-terologic disorders to neurologic, respiratory, metabolic hepatic, and cardiovascular illnesses.

Here, we introduce a novel probabilistic model of a microbial community structures, called ENIGMA (Enterotype-like uNIGram mixture model for Microbial Association analysis), to address these problems. ENIGMA includes the following contributions:

ENIGMA takes OTU abundances as input and models each sample by underlying unigram mixture whose parameters are represented by unknown group effects and known effects of interest. The group effects are represented by the baseline parameters which change with a latent group of microbial communities. One of the most important features for our model is that the group effects are independent of the effects of interest. This enables to separate interindividual variability and fixed effects of the host properties related with disease risk.
ENIGMA is regarded as a Bayesian learning for the association between community structure and factors of interest. Our model can be used to simultaneously learn how enterotype-like clusters of OTUs contributes to microbial structure and how microbial compositional patterns might be related to the known features of the sample.
We provide an efficient learning procedure for ENIGMA by using a Laplace approximation to integrate out the latent variables and estimate the evidence of the complete model and the credible intervals of the parameters. The software package that implements ENIGMA in the R environment is available from https://github.com/abikoushi/enigma.

We describe our proposed framework and algorithm in section named “Methods”. We evaluate the performance of ENIGMA on simulated data in terms of its accuracy to estimate parameters and identify clusters in Section named “Simulation Study”. We apply ENIGMA to clinical metagenomics data and demonstrate how ENIGMA simultaneously identifies enterotype-like clusters and gut microbiota related with Parkinson’s disease (PD) in Section named “Results on Clinical Data”.

Methods

Suppose that we observe microbiome count data of K taxa for N samples with M individual host properties, (y_nk, x_nm) (n = 1,…, n; k = 1,…, K; m = 1,…, M) where represents the abundance of the k-th taxa in the n-th sample and x_nm represents a binary variable such that x_nm = 1 if the n-th sample has the m-th host property and x_nm = 0 otherwise. Here the word “taxa” could be at any level of the bactgerial phylogeny, e.g., species, genes, family, order, etc.

Model

Figure 1 illustrates the plate diagram of the proposed model for metagenome sequencing, where y_n is the read count vector of the n-th sample, x_n is the vector of the host properties of the n-th sample and z_n ∈ {1,…, L} is a latent class of the n-th sample. Our model is a simple extension of unigram mixture model. We assume that each sample is generated from a multinomial distribution with the parameter vector p_n = (p_n₁,…, p_nK)^T. The elements of p_n, p_nk(k = 1,…, K) are probabilities of the occurrence of the K taxa for the n-th sample. We also assume that p_nk can be influenced independently by the environmental factor on the taxa that is common to all latent classes and the interindividual factor on the latent enterotype-like classes. More specifically, the generative process of ENIGMA is defined by:where γ_l is baseline parameter (K-dimensional vector) which change with the latent class, M × K matrix B = (β_mk) is effect of a environmental factor common the all enterotypes, β_m is a m-th row-vector of B, π = (π₁,…, π_L) is a mixing ratio of components, O_K is K-dimensional zero matrix and I_K is K-dimensional identity matrix. Here softmax function is defined by softmax for a vector x = (x₁,…, x_K)^T using element-wise exponential function and the probability function of categorical distribution is parameterized as Pr(z = l|π) = π_l, l ∈ {1,…, L}. In a Bayesian approach we need to define prior distributions for π, β, and γ_l. We set a prior based on the Dirichlet distribution for π, and flat priors to the hyperparameters σ and τ for β and γ, respectively. For the convenience of later section, let p′_l = softmax(γ_l) be probabilities of the occurrence of bacteria in the latent classes l.

Figure 1 Plate diagram of the model for ENIGMA.

yn is affected from environmental factors xn and latent variables zn.

Parameter estimation

Let us denote observed matrix by Y = (y_nk), X = (x_nm), the unknown parameters by θ = (α, B, γ₁,…, γ_L, σ, τ) and their prior by φ(θ). The posterior distribution is represented as follows:First, latent variable z_n must be marginalized. The likelihood belongs toThe posterior distribution is proportional to product of the likelihood and prior density:Let be the MAP estimator of θ, found by maximizing log p(θ, Y, X).

We use a Laplace approximation [15] for parameter estimation. A Taylor expansion around giveswhere and H() is Hessian of log p(θ|Y, X) evaluated at . Eq.4 giveswhere C is normalizing constant. This relation shows that p(θ|Y, X) can be approximated by normal distribution N (, H⁻¹()). Credible intervals can be calculated from this multivariate normal distribution.

We used stochastic programming language Stan (http://mc-stan.org/) for its implementation. The MAP estimators were obtained by L-BFGS method. Credible intervals were computed from the using a Stan function to compute the Hessian at the MAP estimates.

After fitting the model, we are left with the task of classify the enterotype of each samples. The conditional probability of z_n = l isThis is the probability which n-th sample belong enterotype l. Then, n-th sample is then classified into the l-th enterotype that maximizizes the conditional probability gven by Eq.5.

Model Selection

We are also interested in whether or not the whole set rather than individual bacteria is related to the environmental factors of interest. We consider the comparison between the two models when B /= 0 and B = 0. We can use the log marginal likelihood as the goodness of fit for model comparison. The marginal likelihood is given byFrom Eq.4, we haveSo, log marginal likelihood is approximated by following formula:where D is the number of free parameters. In model comparison, we choose the model with the larger log marginal likelihood.

Simulation Study

To show the performance of ENIGMA, we conducted several experiments by simulation. The synthetic data can be naturally produced via our generative process given by Eq.1. We set M = 1, L = 3, π_l = 1/3, and α = (1, 1, 1)^T. We first generated B and γ_l from the standard normal distribution. The variables x_n, z_n, and y_n are then sampled from the Bernoulli distribution with probability of 0.5, the categorical distribution, and the multinomial distribution, respectively. For the above parameter setting, we randomly generate a count dataset of 100 taxa for 100 samples for evaluation.

Coverage probability (CP): The coverage probability is the proportion of the time that the interval contains the true value. A discrepancy between the coverage probability and the nominal coverage probability frequently occurs. When the actual coverage is greater than the nominal coverage, the interval is called conservative. If the interval is conservative, there is no inconsistency in interpretation.
Bias: The bias of B is defined by difference between true value and estimated value E[] − B.
Standard error (SE): The standard error is the standard deviation of the estimate. The smaller standard error indicates the higher accuracy of estimation.
Root mean squared error (RMSE): The RMSE is defined by . The smaller RMSE indicates the higher accuracy of estimation.
Accuracy: The accuracy is the percentage of samples correctly classified into original group.

For calculating these metrics, we note that we calculated the sample means and standard deviations of and ( − B)² from the 10,000 synthetic datasets.

Figure 2 shows the comparison of true B and the mean and standard deviation of estimates through the 10,000 simulations. We observed that the points are arranged diagonally, which implies the estimator of ENIGMA is unbiased. We also calculated the proportion of the time that the 95% credible interval contains the true value of B. We found that this proportion is greater than nominal value 0.95 for all B in Figure 3. Table 2 shows the coverage probability (CP), bias, standard error (SE), and RMSE of , respectively. We observed that the bias and standard error decrease when β_mk is large (i.e. the corresponding abundance is large). We also found that the accuracy of classification given by Eq.5 is exactly 100%. Thus, these results indicate that ENIGMA can produce reasonable estimates.

Figure 2 Simulation result of B.

The comparison true B and the mean of Bˆ. The error bars indicates SE.

Figure 3 Coverage probability of B.

The histogram of coverage probability of B.

Results on Clinical Data

To validate the performance of ENIGMA on discovering clusters of micribial communities and associations between microbes and a specific disease, we applied ENIGMA to the real metagenomic sequencing data from Scheperjans et al. [16], Hill-Burns et al. [17], Heintz-Buschart et al. [18] and Hopfner et al. [19]. The data is analized by sequencing the bacterial 16S ribosomal RNA genes sampled from patients of Parkinson’s disease (PD) and control in Finland, USA, and Germany. Table 1 shows the summary statistics of the data. The OTUs are mapped to the SILVA taxonomic reference, version 132 (https://www.arb-silva.de/) and the abundances of family-level taxa are calculated. Following the evidence of Arumugam et al. [3], the number of latent classes in ENIGMA is chosen to be L = 3. We set the hyperparameters of Dirichlet prior α = (1, 1, 1)^T, which is equivalent to a non-informative prior.

View this table:

Table 1

The data summary

View this table:

Table 2

Coverage probability (CP), bias, standard error (SW) and RMSE of Bˆ

We evaluated whether the model where bacteria have the associations to the PD patients is better than the model without the associations in terms of marginal likelihood. We note that the marginal likelihood represents the model evidence which expresses the preference of the data for different models. Let M₁ be the model which is described Eq. 1 and M₀ be the model setting all β_mk = 0 in Eq. 1. Table 3 shows that the marginal likelihood of M₁ is greater than M₀. It is better to explain the data by considering the association between the microbiota and PD.

View this table:

Table 3

The comparison marginal likelihood.

Figure 4 shows the estimated probabilities of the occurrences of bacteria for the three latent classes, p′_l, (l = 1, 2, 3). Bacteria detected in less than three countries were removed. Arumugam et al. [3] showed that enterotype is characterized by the differences in the abundance of Bacteroides, Prevotella, and Ruminococcus. The result of ENIGMA shows the same tendency as previous survey. According to the results of ENIGMA, the abundance of Enterobacteriaceae and Lachnospiraceae also differ greatly among clusters. Bacterial abundance differs between countries. In USA there is a large abundance of Verrucomicrobiaceae, but in Finland there are few. Conversely, in Finland there is more Prevotellaceae, but in USA it is less.

Figure 4 Heatmap showing 1 (

This quantities corresponds to the probabilities of the occurrences of bacteria for the three latent classes.

Table 5 shows the coefficients whose 95% credible intervals do not contain zero in more than two countries. The microbes with these coefficients indicates that the corresponding microbial composition patterns are significantly related to PD. We found that, in family levels, Clostridiaceae, Comamonadaceae, Pasteurellacea, Prevotellaceae, Actinomycetaceae, Bifidobacteriaceae, Enterococcaceae, Lactobacil-laceae, Synergistaceae, Verrucomicrobiaceae and Victivallaceae, the signs of these coefficients matched in all countries. These results are consistent with previous studies. Hill-Burns et al. [17] reported PD patients contained high levels of Bifidobacte-riaceae and Verrucomicrobiaceae and low levels of Pasteurellaceae. Scheperjans et al. [16] reported PD patients contained high levels of Lactobacillaceae, Verrucomi-crobiaceae and low levels of Prevotellaceae. Hopfner et al reported PD patients have high levels of Lactobacillaceae and Enterococcaceae.

We compared ENIGMA to the Wilcoxon rank sum test, one of the classical methods for identifying bacteria related with a environmental factor of interest [18]. Table 4 shows bacteria significantly related with the PD patients with p-value < 0.05 in more than two countries. We observed that the bacteria detected by the Wilcoxon test were almost included in those of ENIGMA (Table 5). We note that all of the corrected p-values in Table 4 are larger than 0.05. This result shows that ENIGMA is superior to the Wilcoxon rank sum test in terms of identifying more associations between microbiota and the PD patients.

View this table:

Table 4

p-value of Wilcoxon test Abe et al.

View this table:

Table 5

The bacteria which significant associated with PD in more than two countries. The “-” notation indicates the bacteria undetected in that country.

The analyses with real data thus show that ENIGMA can identify enterotype-like clusters and the associations between the gut microbiota and the PD patients, and some of the results are strongly supported by the previous researches.

Conclusion

We proposed a novel hierarchical Bayesian model, ENIGMA, to discover the underlying microbial community structures and associations between microbiota and their environmental factors from microbial metagenome data. ENIGMA is based on a probabilistic model of a microbial community structures and supplied with labels for one or more environmental factors of interest for each sample. The structures of each sample is modeled by a multinomial distribution whose parameters are represented independently by group and environmental effects of each sample, which prevent mixing of individual differences and effects of interest. This framework enables the model to learn (i) how microbes contribute to an underlying community structures (cluster) and (ii) how microbial compositional patterns are explained environmental factors of interest, simultaneously. The effectiveness of ENIGMA was evaluated on the bases of experiments involving both synthetic and read datasets. We believe that these newly discovered clusters and associations estimated from ENIGMA would provide more insight in the the mechanisms of a microbial community.

There is one major limitation of ENIGMA is its scalability and efficiency, since the number of the parameters in the model grow proportional to the number of taxa when the number of environmental factors of interest is large. Further works should focus on developing a scalable probabilistic model of microbial compositions to analyze underlying microbial structures with a large number of these effects by using sparse parameter estimation [20]. We are also interested in developing a dynamic probabilistic model similar to reproted by Blei and Lafferty [21] to analyze time-varying bacteria compositions during the progression of a disease.

Competing interests

The authors declare that they have no competing interests.

Author’s contributions

KA and TS designed the proposed algorithm; KO and MH designed the experiments.

Acknowledgements

This work was supported by Grants-in-Aid from the Ministry of Education, Culture, Sports, Science and Technology of Japan (MEXT); Ministry of Health, Labour and Welfare of Japan (MHLW); Japan Agency for Medical Research and Development (AMED), and the Hori Sciences and Arts Foundation.

References

1.↵
Huttenhower C, et al. Structure, function and diversity of the healthy human microbiome. Nature. 2012; 486:207–214.
OpenUrl CrossRef PubMed Web of Science
2.↵
Le Chatelier E, et al. Richness of human gut microbiome correlates with metabolic markers. Nature. 2013; 500:541–546.
OpenUrl CrossRef PubMed Web of Science
3.↵
Arumugam M, Raes J, Pelletier E, Le Paslier D, Yamada T, Mende DR, … Bertalan. Enterotypes of the human gut microbiome. nature, 2011, 473.7346: 174.
OpenUrl CrossRef PubMed Web of Science
4.↵
Moeller AH, Degnan PH, Pusey AE, Wilson ML, Hahn BH and Ochman H, Chimpanzees and humans harbour compositionally similar gut enterotypes. Nature Communications. 3:1179.
5.↵
Hildebrand F, Nguyen TL, Brinkman B, Yunta RG, Cauwe B, Vandenabeele P, Liston A, Raes J. Inflammation-associated enterotypes, host genotype, cage and inter-individual effects drive gut microbiota variation in common laboratory mice. Genome biology, 2013, 14.1: R4.
OpenUrl CrossRef PubMed
6.↵
Ravel J. Vaginal microbiome of reproductive-age women. Proceedings of the National Academy of Sciences, 2011, 108.Supplement 1: 4680–4687.
OpenUrl Abstract/FREE Full Text
7.
Koren, Omry, et al. A guide to enterotypes across the human body: meta-analysis of microbial community structures in human microbiome datasets. PLoS computational biology, 2013, 9.1: e1002863.
OpenUrl
8.
Ding T and Schloss PD. Dynamics and associations of microbial community types across the human body. Nature, 2014, 509:357–360.
OpenUrl CrossRef PubMed Web of Science
9.↵
Zhou Y, et al. Exploration of bacterial community classes in major human habitats. Genome Biology. 2014 15:R66.
OpenUrl CrossRef PubMed
10.↵
Knights D1, Costello EK, Knight R. Supervised classification of human microbiota. FEMS microbiology reviews, 2011, 35.2: 343–359.
OpenUrl CrossRef PubMed Web of Science
11.↵
Holmws I, Harris K, Quince C. Dirichlet multinomial mixtures: generative models for microbial metagenomics. PloS one, 2012, 7.2: e30126.
OpenUrl CrossRef PubMed
12.↵
Shafiei M, Dunn KA, Boon E, MacDonald SM, Walsh DA, Gu H, and Bielawski JP. BioMiCo: a supervised Bayesian model for inference of microbial community structure. Microbiome, 2015, 3.1: 8.
OpenUrl CrossRef
13.
Yang Y, Chen N, Chen T. mLDM: a new hierarchical Bayesian statistical model for sparse microbioal association discovery. bioRxiv, 2016, 042630.
14.
Nigam K, McCallum, AK, Thrun S, Mitchell T. Text classification from labeled and unlabeled documents using EM. Machine learning, 2000, 39.2–3: 103–134.
OpenUrl
15.↵
Bishop C, Pattern recognition and machine learning. Springer-Verlag New York. 2006.
16.↵
Scheperjans F, Aho V, Pereira PA, Koskinen K, Paulin L, Pekkonen E, … Kinnunen E. Gut microbiota are related to Parkinson’s disease and clinical phenotype. Movement Disorders, 2015; 30 (3): 350–358.
OpenUrl CrossRef PubMed
17.↵
Hill-Burns EM, Debelius JW, Morton JT, Wissemann WT, Lewis MR, Wallen ZD, … Knight R. Parkinson’s disease and Parkinson’s disease medications have distinct signatures of the gut microbiome. Movement Disorders, 2017; 32 (5): 739–749.
OpenUrl CrossRef PubMed
18.↵
Heintz-Buschart A, Pandey U, Wicke T, Sixel-Döring F, Janzen A, Sittig-Wiegand E, …, Wilmes P. The nasal and gut microbiome in Parkinson’s disease and idiopathic rapid eye movement sleep behavior disorder, Movement Disorders, 2018; 33 (1): 88–98.
OpenUrl
19.↵
Hopfner F, Künstner A, Müller SH, Künzel S, Zeuner KE, Margraf NG, …, Kuhlenbäumer G. Gut microbiota in Parkinson disease in a northern German cohort. Brain research, 2017; 1667, 41–45.
OpenUrl
20.↵
Yang Y, Chen N, and Chen T. mLDM: a new hierarchical Bayesian statistical model for sparse microbioal association discovery. 2016. bioRxiv, 042630.
21.↵
Blei DM and Lafferty, JD. Dynamic topic models. In Proceedings of the 23rd international conference on Machine learning. ACM, 2006. 113–120.