Abstract
Molecular ecology regularly requires the analysis of count data that reflect the relative abundance of features of a composition (e.g., taxa in a community, gene transcripts in a tissue). The sampling process that generates these data can be modeled using the multinomial distribution. Replicate multinomial samples inform the relative abundances of features in an underlying Dirichlet distribution. These distributions together form a hierarchical model for relative abundances among replicates and sampling groups. This type of Dirichlet-multinomial modelling (DMM) has been described previously, but its benefits and limitations are largely untested. With simulated data, we quantified the ability of DMM to detect differences in proportions between treatment and control groups, and compared the efficacy of three computational methods to implement DMM—Hamiltonian Monte Carlo (HMC), variational inference (VI), and Gibbs Markov chain Monte Carlo. We report that DMM was better able to detect shifts in relative abundances than analogous analytical tools, while identifying an acceptably low number of false positives. Among methods for implementing DMM, HMC provided the most accurate estimates of relative abundances, and VI was the most computationally efficient. The sensitivity of DMM was exemplified through analysis of previously published data describing lung microbiomes. We report that DMM identified several potentially pathogenic, bacterial taxa as more abundant in the lungs of children who aspirated foreign material during swallowing; these differences went undetected with different statistical approaches. Our results suggest that DMM has strong potential as a statistical method to guide inference in molecular ecology.
Introduction
In many scientific disciplines, data from both manipulative experiments and surveys of natural variation are often counts of observations that are assigned to categories. Given some total level of observational effort, the counts of the different features in the sample (e.g., taxa or transcripts) reflect the underlying proportions of those features in the sampled composition (e.g., an assemblage of organisms or collection of molecules). In molecular ecology, such sampling can take the form of detecting and counting taxa based on observed DNA sequences (e.g., in molecular barcoding or microbial ecology) or counting the reads assigned to specific transcripts in studies of gene expression (Fernandes et al. 2014, Gloor et al. 2017, Tsilimigras and Fodor 2016). For these applications, sampling effort corresponds to the total number of sequence reads, and the count of reads assigned to a taxon or gene supports inference of their true proportion in the composition. Moreover, the total number of reads that can be obtained is constrained by the sequencing instrument, with reads ascribed to samples and features within each sample. Due to this constant sum constraint, compositional data have the important quality that as the relative abundance of one feature in the composition increases, other features must decrease.
Molecular ecologists often rely on compositional count data to define differences between sampling groups. As an example, we may wish to know how the foliar and root microbiomes of a particular plant taxon differ. To answer this question, an understanding of how each feature shifts in relative abundance among sampling groups is required. In our view, if even a single feature shifts in relative abundance among groups, then this demonstrates an effect of sampling group that could be biologically interesting, albeit subtle. Such effects will go unnoticed if analyses rely on techniques such as ordination and PERMANOVA, which can provide insight into overall differences between sampling groups (McKnight et al. 2019), but provide no statistical model to identify those features that may differ in relative abundance among groups. Accordingly, a variety of methods have been developed to perform the seemingly simple task of determining treatment-induced shifts in relative abundance, which is often referred to as “differential relative abundance testing” or “differential expression” testing (the latter phrase arises because the roots of many of these methods lie within the field of functional genomics; Bullard et al. 2010, Dillies et al. 2013, Paulson et al. 2013, Thorsen et al. 2016, Weiss et al. 2017).
Methods for detecting shifts in relative abundance vary tremendously—and the benefits and drawbacks of various methods are the subjects of an ongoing dialogue (e.g., Bullard et al. 2010, McMurdie and Holmes 2014, Weiss et al. 2017). Early approaches typically relied on repeated frequentist tests after transforming count data to account for differences in sampling effort among replicates or sampling groups, typically via rarefaction, conversion to proportions, or, for transcriptomic data, reads per kilobase per million mapped reads (Bullard et al. 2010). More recently, rarefaction has been criticized because it can amplify the variation present within replicates and thus reduce statistical power (McMurdie and Holmes 2014; but see McKnight et al. 2019 and Weiss et al. 2017 for counterarguments). Numerous statistical modelling approaches have arisen to account for the challenges imposed by compositional data, while avoiding rarefaction. These methods often model feature relative abundance and typically involve some form of normalization followed by repeated frequentist testing. Methods most often differ in the choice of distribution(s) utilized for modelling and normalization method employed. For example, the software DESeq2 (Love et al. 2014) and edgeR (Robinson et al. 2010) are widely-used for analysis of gene expression data and, more recently, for microbiome analysis (Weiss et al. 2017). These tools model feature relative abundances using a negative binomial distribution (a reparameterization of the Poisson distribution to allow for overdispersion), which is scaled to account for variation in sequencing depth among samples (each tool uses different normalization methods). Next a generalized linear model is used to determine if features differ in relative abundance between sampling groups. 
By comparison, the popular ANCOM software applies a centered log ratio transformation (Aitchison 1982) to the data followed by repeated parametric or non-parametric testing (depending on the data) with multiple comparison correction. These few examples serve to illustrate the variety of approaches available for performing differential expression testing. However, we are unaware of any popular method that allows estimates of feature relative abundance to be easily extracted while preserving the uncertainty in those estimates for propagation to downstream analyses. This perceived need led us to consider modelling feature relative abundances using the Dirichlet and multinomial distributions (Box 1) in a Bayesian framework.
The multinomial and Dirichlet probability distributions are the relevant models of the aforementioned sampling process that commonly leads to compositional data. Statistical modelling using these distributions has proven successful in a number of biological studies. For instance, Fordyce et al. (2011) rely on Dirichlet-multinomial modelling (DMM) to analyze ecological count data, such as counts of behavioural and dietary choices of animals (also see Coblentz et al. 2017). Similar models have been applied to large counts of DNA sequences—for instance, Fernandes et al. (ALDEx2, 2014), Nowicka and Robinson (DRIM-Seq, 2016), and Rosa et al. (HMP, 2012) use DMM to estimate and compare feature-specific relative abundances in transcriptomes and microbiomes. Additionally, DMM has been used to model mixtures of compositions, a situation that could arise in a laboratory-derived microbial assemblage occurring as a contaminant within samples, or in mixtures of different communities in nature (MicrobeDMM, Holmes et al. 2012; SourceTracker, Knights et al. 2011; BioMiCo, Shafiei et al. 2015; FEAST, Shenhav et al. 2019; ecostructure, White et al. 2019). Likewise, DMM has been used to estimate association networks among microbial taxa (SparCC, Friedman and Alm 2012; mLDM, Yang et al. 2017).
These models represent important advances and demonstrate the utility of DMM, but it remains unclear how data attributes, such as rank-abundance profiles and dimensionality, affect the accuracy and precision of parameter estimates. Moreover, compared to models that rely on other distributions or are based on different statistical methods (likelihood and frequentist methods), Bayesian DMM can be computationally demanding. Recent advances in computational statistics such as Hamiltonian Monte Carlo (HMC) sampling and variational inference (VI, see Methods; Blei et al. 2017, Monnahan et al. 2017) may improve model runtime, but the accuracy and performance of these new methods remains to be evaluated in different modelling contexts.
Consequently, we conducted a simulation experiment to learn the limits and benefits of DMM through the analysis of data that encompass much of the variety in attributes encountered across scientific domains (e.g. replication, number of observations, and so on; Fig. 1). Notably, included in simulated data, were those emulating the results of high-throughput sequencing of microbial assemblages, as these are analytically challenging due to their dimensionality, high among-replicate variation, and extreme rank-abundance skew—often several microbial taxa are orders of magnitude more abundant than the numerous marginal taxa that typically compose the bulk of biodiversity within a sample (e.g., see Lynch and Neufeld 2015, Sachdeva et al. 2019). Our primary analytical goal was to measure the sensitivity and accuracy of DMM for comparing feature relative abundance between compositions and to compare the performance of DMM with competing approaches. Also, we provide a primer on the requisite algorithmic methods (e.g., VI and HMC) for Bayesian implementation of DMM and explore how different algorithms affect model accuracy and computational expense. Finally, we analyzed a data set published by Duvallet et al. (2019) that describes the lung microbiomes of children experiencing aspiration of foreign material and evaluated to what extent DMM recapitulated the published analyses or detected additional differences among microbiomes.
Box 1. A brief explanation of the multinomial and Dirichlet distributions
The multinomial distribution is the multivariate generalization of the binomial distribution. The binomial distribution can be used to describe counts of binary outcomes, with respective probabilities p and 1 − p. For instance, with a finite sample of observations, the binomial distribution would be useful for estimating the frequency of females (p) in a dioecious population. The multinomial distribution extends this concept to encompass more than two unique outcomes. For instance, a composition comprising three equally abundant features would have the following multinomial parameter vector: $\vec{p} = (1/3, 1/3, 1/3)$. As an example, consider data from a sequencing machine. The counts of sequences that fall into each category (e.g., transcripts or taxa) are multinomially distributed, and the probability of sampling each category corresponds to its relative abundance. For three equally abundant features (e.g., microbial taxa), there would be an equal chance of sampling a sequence from each of the features and on average we would expect to obtain the same number of sequences from each (for this example, we assume no bias imposed by laboratory technique).
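The sampling process described above can be illustrated with a short simulation. The following is a minimal sketch using only the Python standard library; the function name `sample_multinomial`, the seed, and the choice of 3000 reads are assumptions made for illustration and are not part of the analyses reported here.

```python
import random
from collections import Counter

def sample_multinomial(probs, n_draws, rng):
    """Draw one multinomial deviate: the count of each feature among n_draws reads."""
    features = range(len(probs))
    # Each read is assigned to a feature with probability equal to its relative abundance.
    draws = rng.choices(features, weights=probs, k=n_draws)
    tally = Counter(draws)
    return [tally[j] for j in features]

rng = random.Random(1)          # fixed seed so the sketch is reproducible
p = [1 / 3, 1 / 3, 1 / 3]       # three equally abundant features
counts = sample_multinomial(p, 3000, rng)
print(counts)                   # three counts that sum to 3000, each near 1000
```

With equal relative abundances, each count deviates from 1000 only through sampling noise, which is the variation that the multinomial likelihood describes.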
To share information among samples in the same sampling group (e.g. treatment group, host population, or sampling location) and recover group-level estimates of the proportion of each feature in a composition, the Dirichlet distribution can be appropriately parameterized. The Dirichlet distribution is the multivariate generalization of the beta distribution. Deviates from a standard beta distribution fall in the range of [0, 1], and the distribution can be parameterized with expectation π (the expected frequency of the reference category, with 1 − π for the alternative category) and a parameter, θ, that affects the variation among deviates. Likewise, the Dirichlet distribution can be parameterized by a vector of expected frequencies of each feature, $\vec{\pi}$, and an intensity parameter, θ. When drawing deviates from the Dirichlet distribution, the intensity parameter influences the amount of among-deviate variation in the frequencies observed: for a given $\vec{\pi}$, larger intensity parameters induce less among-deviate variation. This parameterization of the Dirichlet thus allows modelling of the variation among experimental replicates (the “noise” within the data).
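The effect of the intensity parameter can be seen by drawing Dirichlet deviates at two values of θ. The sketch below assumes the standard construction of a Dirichlet deviate from normalized gamma variates; the function names, the expectation vector, and the two θ values are hypothetical choices for illustration only.

```python
import random
import statistics

def sample_dirichlet(pi, theta, rng):
    """Draw one deviate from Dirichlet(theta * pi) by normalizing gamma variates."""
    gammas = [rng.gammavariate(theta * p, 1.0) for p in pi]
    total = sum(gammas)
    return [g / total for g in gammas]

def sd_of_first_feature(pi, theta, n_deviates=2000, seed=1):
    """Among-deviate standard deviation of the first feature's frequency."""
    rng = random.Random(seed)
    return statistics.stdev(
        sample_dirichlet(pi, theta, rng)[0] for _ in range(n_deviates)
    )

pi = [0.5, 0.3, 0.2]                                   # expected frequencies
sd_low_theta = sd_of_first_feature(pi, theta=5.0)      # low intensity: noisy replicates
sd_high_theta = sd_of_first_feature(pi, theta=500.0)   # high intensity: tight replicates
print(sd_low_theta, sd_high_theta)
```

Both sets of deviates share the same expectation; only the spread around that expectation changes with θ.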
Information about the frequencies of features within replicates is shared to estimate frequencies for each feature within that sampling group ($\vec{\pi}$), forming a hierarchical model (Fig. 1) that is analogous to how replicates can be used in an analysis of variance to learn about marginal, grand means associated with treatments. Estimates of frequencies of compositional features at the sampling group level are the basis of inferences about which features differ among sampling groups (e.g., treatment versus control) and by how much (on an absolute or normalized scale).
Methods
Dirichlet multinomial modelling approach
Our specification of the Dirichlet-multinomial model generally follows that of Fordyce et al. (2011, implemented in the bayespref software) and takes as input a matrix of counts (X). The rows of this matrix correspond to different replicates ($\vec{x}_i$; the superscripted arrow denotes a vector) and the columns correspond to features of the composition (the format of an OTU or transcript table). Each count $x_{ij}$ in this matrix corresponds to the jth feature (of n features in total) in the composition observed in the ith replicate sample. Replicates are grouped into k groups, corresponding to treatment conditions, sampling locations, or some other stratification that specifies which replicates share information (parameters shared among replicates for the group). Counts in each row of the matrix are multinomially distributed:
$$\vec{x}_i \sim \text{multinomial}(\vec{p}_i, N_i)$$
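Each value $p_{ij}$ in $\vec{p}_i$ is the probability of observing a particular feature j in sample i, and $\vec{N}$ is a vector of the total counts in each sample (with element $N_i$ for replicate i). The product across i replicates of the i multinomial distributions forms the likelihood in the model and can be written:
$$P(X \mid \vec{p}) \propto \prod_{i} \prod_{j=1}^{n} p_{ij}^{x_{ij}}$$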
The prior probability for the vector of feature proportions $\vec{p}_i$ is a Dirichlet distribution, with parameters that are specific to the kth group of replicates and that are learned from the data:
$$\vec{p}_i \sim \text{Dirichlet}(\theta \vec{\pi}_k)$$
In this parameterization of the Dirichlet distribution for $\vec{p}_i$, the parameters $\vec{\pi}_k$ correspond to the expected proportions of each of the n features (e.g., a particular transcript or taxon) in group k, and θ is an intensity parameter that is shared among all features (see Box 1). For a given $\vec{\pi}_k$, larger θ means less variation among deviates from the Dirichlet expectation $\vec{\pi}_k$. The probability density function of this distribution, across i replicates within the kth group, is given by
$$P(\vec{p} \mid \theta \vec{\pi}_k) = \prod_{i} \frac{1}{B(\theta \vec{\pi}_k)} \prod_{j=1}^{n} p_{ij}^{\theta \pi_{jk} - 1}$$
where $B(\theta \vec{\pi}_k)$ is a normalizing function that ensures the Dirichlet distribution integrates to one. The hyperprior for the parameters at the “topmost”, or most inclusive, level of the model hierarchy is another Dirichlet distribution with equal prior probability for each feature within the composition. For this Dirichlet distribution we use $\alpha_{1 \ldots n} = 10^{-7}$ as a prior that will contribute little information, gives an expected value of 1/n for each feature, and has a high variance on the expectation:
$$\vec{\pi}_k \sim \text{Dirichlet}(\vec{\alpha})$$
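The overall model for the posterior distribution for parameters of a sampling group is:
$$P(\vec{p}, \vec{\pi}_k, \theta \mid X) \propto P(X \mid \vec{p}) \, P(\vec{p} \mid \theta \vec{\pi}_k) \, P(\vec{\pi}_k) \, P(\theta)$$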
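To make the structure of the hierarchical model concrete, the pieces described above (multinomial likelihood, Dirichlet prior, and Dirichlet hyperprior) can be combined into an unnormalized log posterior for a single sampling group. This is a minimal sketch rather than the implementation used in our analyses: the function names are hypothetical, and the prior on θ is treated as flat and therefore omitted, which is an assumption of the sketch.

```python
import math

def log_multinomial_kernel(x, p):
    """sum_j x_j * log(p_j): the multinomial log-likelihood up to an additive constant."""
    return sum(xj * math.log(pj) for xj, pj in zip(x, p) if xj > 0)

def log_dirichlet_pdf(p, alpha):
    """Log density of Dirichlet(alpha) evaluated at the proportion vector p."""
    log_norm = sum(math.lgamma(a) for a in alpha) - math.lgamma(sum(alpha))
    return sum((a - 1.0) * math.log(pj) for a, pj in zip(alpha, p)) - log_norm

def log_unnormalized_posterior(X, P, pi_k, theta, hyper_alpha):
    """Log of (likelihood x Dirichlet prior x Dirichlet hyperprior) for one group.

    X: replicate-by-feature counts; P: replicate-level proportions (one row per
    replicate); pi_k: group-level expected proportions; theta: intensity.
    Assumption of this sketch: the prior on theta is flat and is left out.
    """
    prior_alpha = [theta * pj for pj in pi_k]
    total = log_dirichlet_pdf(pi_k, [hyper_alpha] * len(pi_k))
    for x_i, p_i in zip(X, P):
        total += log_multinomial_kernel(x_i, p_i)
        total += log_dirichlet_pdf(p_i, prior_alpha)
    return total

X = [[90, 5, 5]]
pi_k = [1 / 3, 1 / 3, 1 / 3]
good = log_unnormalized_posterior(X, [[0.90, 0.05, 0.05]], pi_k, 3.0, 1e-7)
bad = log_unnormalized_posterior(X, [[0.05, 0.05, 0.90]], pi_k, 3.0, 1e-7)
print(good > bad)  # proportions that match the counts receive more support
```

Samplers such as those described below only require a function of this kind, because they work with the posterior up to a normalizing constant.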
To quantify differences in proportions of features between two sampling groups (often referred to as “differential relative abundance testing”; Thorsen et al. 2016, Weiss et al. 2017), posterior probability distributions (PPDs) for $\pi_{j,k=1} - \pi_{j,k=2}$ (Fig. 2d) can be obtained. Consistent with convention, if at least 95% of the samples of this PPD of differences fall on one side of zero (either greater than or less than zero), then there is a high certainty of a non-zero effect of sampling group on feature relative abundance. One can also observe where zero occurs in the PPD of differences to quantify the probability of no effect of sampling group on feature relative abundance.
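The decision rule described above is straightforward to apply to posterior samples. The sketch below is illustrative: the function name is hypothetical, and the Gaussian draws merely stand in for samples of a feature's relative abundance from a fitted model.

```python
import random

def summarize_difference(pi_group1, pi_group2, certainty=0.95):
    """Summarize the PPD of differences (group 1 minus group 2) for one feature."""
    diffs = [a - b for a, b in zip(pi_group1, pi_group2)]
    prob_positive = sum(d > 0 for d in diffs) / len(diffs)
    prob_negative = sum(d < 0 for d in diffs) / len(diffs)
    return {
        "prob_positive": prob_positive,
        "prob_negative": prob_negative,
        # A feature is flagged when enough of the PPD lies on one side of zero.
        "differs": max(prob_positive, prob_negative) >= certainty,
    }

rng = random.Random(1)
# Stand-ins for posterior samples; in practice these come from the fitted model.
group1 = [rng.gauss(0.12, 0.01) for _ in range(1000)]
group2 = [rng.gauss(0.08, 0.01) for _ in range(1000)]
result = summarize_difference(group1, group2)
print(result)
```

Because the full PPD of differences is retained, the same samples can be passed to downstream analyses without discarding uncertainty.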
If a sampling scheme was used that induces dependence among replicates via a more nested hierarchical structure than the model described above, then the model hierarchy could be extended to include inference of the Dirichlet distributions describing the relative abundances of features within each additional stratum of the sampling scheme. For example, consider a study design where subjects are provided one of several diets and gut microbiome samples are taken from both sexes. In this case, one would want to account for non-independence among the data due to both sex and diet treatment. This can be accomplished through incorporation of additional Dirichlet distributions into the model: the relative abundances of features within each sex, nested within each diet treatment (m), are modelled as deviates from a Dirichlet distribution whose expectation describes the relative abundances of features within that diet treatment and whose intensity parameter is τ. In this way, the model can be extended to encompass as many hierarchical layers as desired, given suitable sampling and replication (Coblentz et al. 2017).
A primer of the algorithms to perform DMM
One goal of statistical modelling is to estimate values for parameters that could correspond with directly observable variables (i.e. the data) or with latent, unobservable, variables (i.e. those that are inferred from observable variables). Bayesian modelling attempts to estimate parameters of interest, while explicitly quantifying the uncertainty in those estimates and allowing for the influence of prior knowledge on estimates. Much of Bayesian statistical modelling relies on Markov chain Monte Carlo (MCMC) sampling (Gelman et al. 2013). A Markov chain is a series of states where each state depends upon the immediately preceding state. Monte Carlo refers to repeated, random sampling. MCMC is a process by which values are suggested randomly from a probability distribution and substituted into the functions that define the model. Over MCMC iterations, sampling converges on the most supported parameter space (the PPDs for model parameters) and samples in the chain occur with probability defined by the PPD.
There are several MCMC algorithms and they primarily differ in how they choose or propose new values and their criteria for inclusion of those values in the chain (Gelman et al. 2013). A standard MCMC tool is the Metropolis algorithm (Gelman et al. 2013, pg. 289). To perform Metropolis sampling, a value ($x_t$) is proposed from some distribution $Q(x_t \mid x_{t-1})$, where t is the iteration (a suitable initial value, $x_0$, is required). Once $x_t$ is chosen, a ratio of $\alpha = f(x_t)/f(x_{t-1})$ is calculated, where f(x) is a function that is proportional to the probability density to be estimated. The new value $x_t$ is accepted into the chain with probability min(1, α); otherwise $x_t = x_{t-1}$. The Metropolis algorithm relies on a symmetric proposal distribution, such that $Q(x_t \mid x_{t-1}) = Q(x_{t-1} \mid x_t)$. The Metropolis-Hastings (MH) algorithm extends this concept by relaxing the assumption of symmetry regarding the proposal probability distribution.
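The Metropolis algorithm can be sketched in a few lines. The example below is a generic illustration rather than the sampler used by JAGS: it assumes a Gaussian random-walk proposal (which is symmetric) and a one-dimensional target proportional to a normal density with mean 3 and unit variance; all names and tuning values are hypothetical.

```python
import math
import random
import statistics

def metropolis(log_f, x0, n_iter, step, rng):
    """Random-walk Metropolis sampler for a density proportional to exp(log_f(x))."""
    chain = [x0]
    for _ in range(n_iter):
        current = chain[-1]
        proposal = current + rng.gauss(0.0, step)       # symmetric proposal Q
        log_alpha = log_f(proposal) - log_f(current)    # log of f(x_t) / f(x_{t-1})
        # Accept with probability min(1, alpha); otherwise repeat the current value.
        if rng.random() < math.exp(min(0.0, log_alpha)):
            chain.append(proposal)
        else:
            chain.append(current)
    return chain

def log_target(x):
    """Unnormalized log density of a normal distribution with mean 3 and sd 1."""
    return -0.5 * (x - 3.0) ** 2

rng = random.Random(1)
chain = metropolis(log_target, x0=0.0, n_iter=20000, step=1.0, rng=rng)
kept = chain[2000:]                         # discard early iterations as burn in
print(statistics.mean(kept), statistics.variance(kept))
```

Working on the log scale avoids numerical underflow when densities are very small, which is common in high-dimensional models.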
Gibbs sampling (Geman and Geman 1987, Kruschke 2015) is a special case of the MH algorithm (because the proposal acceptance criterion is always met; see pg. 289 in Gelman et al. 2013) and is suited for cases when the distributions used within the model are conditionally conjugate, such as when the prior and likelihood distributions are conjugate and, consequently, their product has a well defined form. At each iteration of Gibbs sampling (t), each parameter is sampled from the conditional distribution defined by the other parameters in the model, which are held constant at values chosen at iteration t − 1. Parameters are typically updated one at a time, in a predefined order.
The probabilistic programming language JAGS (Plummer 2003) implements Gibbs and Metropolis-Hastings MCMC as required to obtain samples from the distributions in our DMM. Henceforth, we refer to parameter estimation via Gibbs, Metropolis, and Metropolis-Hastings sampling as MCMC. These algorithms can be slow to converge for complex models; indeed, in our experience with a JAGS implementation, convergence may not be observed for the majority of parameters over a week of runtime for DMM with high-dimensional data (such as transcriptomic data), even with sensible chain initialization values (a bespoke software implementation of MCMC tuned to the data and model would likely be faster, but would require greater care in programming and use).
Hamiltonian Monte Carlo (HMC) seeks to improve upon the efficiency of MCMC through the use of a physics-inspired algorithm (for an excellent description of HMC see Monnahan et al. 2017). The sampling method can be envisioned by considering a ball dropped into a bowl and allowing the ball to roll about the curvature of the bowl. The bowl is the PPD and is frictionless, so the ball will roll back and forth in the bowl forever. After repeated drops of the ball into the bowl, from different angles and with different potential energies, the shape of the PPD is determined from the combined paths the ball took across all iterations. The benefit of this approach is that samples from nearly anywhere in the PPD can be generated at each iteration (HMC generates proposals by simulating these trajectories rather than by a random walk, and relies on a Metropolis ratio to determine acceptability of updates), whereas MCMC typically chooses values close to the previous state and thus cannot quickly move throughout the PPD, which can slow chain mixing and time to convergence. The probabilistic programming language and software Stan allows the use of an improved version of HMC called the “no U-turn” sampler that avoids redundant sampling of parameter space (Hoffman and Gelman 2014). To continue the previous analogy, when the ball starts to make a U-turn due to the curvature of the bowl, the sampler is stopped and the ball is dropped again, thus avoiding spending sampler time in previously explored parameter space.
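The rolling-ball analogy corresponds to simulating Hamiltonian dynamics with a leapfrog integrator. The sketch below implements basic HMC with a fixed number of leapfrog steps for a one-dimensional normal target; it is not the no U-turn sampler used by Stan, and the step size, trajectory length, and function names are assumptions chosen for illustration.

```python
import math
import random
import statistics

def hmc_step(x, log_f, grad_log_f, step_size, n_leapfrog, rng):
    """One HMC update: draw a momentum, simulate a trajectory, accept or reject."""
    p0 = rng.gauss(0.0, 1.0)                 # the random "push" given to the ball
    x_new, p_new = x, p0
    p_new += 0.5 * step_size * grad_log_f(x_new)        # initial half step
    for i in range(n_leapfrog):
        x_new += step_size * p_new                      # full position step
        if i < n_leapfrog - 1:
            p_new += step_size * grad_log_f(x_new)      # full momentum step
    p_new += 0.5 * step_size * grad_log_f(x_new)        # final half step
    # The Metropolis ratio compares total energy at the start and end of the path.
    log_accept = (log_f(x_new) - 0.5 * p_new ** 2) - (log_f(x) - 0.5 * p0 ** 2)
    if rng.random() < math.exp(min(0.0, log_accept)):
        return x_new
    return x

def log_target(x):
    """Unnormalized log density of a normal distribution with mean 3 and sd 1."""
    return -0.5 * (x - 3.0) ** 2

def grad_log_target(x):
    return -(x - 3.0)

rng = random.Random(1)
samples = [0.0]
for _ in range(3000):
    samples.append(hmc_step(samples[-1], log_target, grad_log_target, 0.2, 10, rng))
kept = samples[500:]
print(statistics.mean(kept), statistics.variance(kept))
```

Because each update follows the gradient of the log density, successive samples can be far apart, which is the source of the improved mixing relative to random-walk proposals.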
HMC often improves model runtime (Monnahan et al. 2017) over MCMC, but can still be quite time consuming. Variational inference is a class of optimization methods from the machine learning literature that can rapidly approximate PPDs (Blei et al. 2017), and thus holds great promise for statistical modelling of complex data where the speed of MCMC or HMC is insufficient. Variational inference (VI) has yet to be widely applied by biologists, but it has been used to estimate population genetic structure (e.g. Raj et al. 2014, Scordato et al. 2017), genotype-phenotype associations (Carbonetto and Stephens 2012, Logsdon et al. 2010), phylogenetic relationships (Jojic et al. 2004), and in a generalized latent linear modelling context (Niku et al. 2019).
The idea behind VI is that the exact PPD need not be estimated, but can be approximated through optimization of parameters of more tractable distributions. Briefly, a density is chosen from a family of distributions and optimized so that the Kullback-Leibler (KL) divergence between that density and the PPD is minimized. KL divergence relies on the definition of entropy. Entropy is a measure of the information present within a distribution and can be expressed (for a discrete probability distribution):
$$H = -\sum_{i} p(x_i) \log p(x_i)$$
where p(x) is a function that outputs a probability contingent upon an input value x, which is indexed by i. It is perhaps easiest to intuit entropy using log2, in which case H is the minimum average number of bits needed to encode an observation. KL divergence extends this idea to quantify the amount of information necessary to explain the divergence (||) between two probability distributions p and q, which, in this example, are discrete:
$$\mathrm{KL}(p \,\|\, q) = \sum_{i} p(x_i) \log \frac{p(x_i)}{q(x_i)}$$
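Entropy and KL divergence for discrete distributions can be computed directly from their definitions. The following sketch uses base-2 logarithms so that results are in bits; the function names and example distributions are hypothetical.

```python
import math

def entropy_bits(p):
    """Shannon entropy (in bits) of a discrete probability distribution p."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def kl_divergence_bits(p, q):
    """KL divergence KL(p || q), in bits, between discrete distributions p and q."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

fair_coin = [0.5, 0.5]
biased_coin = [0.25, 0.75]
print(entropy_bits(fair_coin))                     # 1.0 bit per toss
print(kl_divergence_bits(fair_coin, fair_coin))    # 0.0: identical distributions
print(kl_divergence_bits(fair_coin, biased_coin))  # positive: distributions differ
```

Note that KL divergence is not symmetric: KL(p || q) generally differs from KL(q || p), so the order of the two distributions matters.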
Because this measure of divergence is based on the quantification of entropy, when p and q differ greatly, then more information is required to explain how they differ and KL divergence increases. For VI we wish to minimize the KL divergence between the probability distribution $p(z \mid x)$ (the PPD of latent variables z given data x) and some density $q(z)$ chosen from a family of distributions Q. To avoid computation of $\log p(x)$ (see Blei et al. 2017, for more), minimizing the KL divergence can be solved by maximizing the “evidence lower bound” (ELBO; the 𝔼 used below refers to expectation):
$$\mathrm{ELBO}(q) = \mathbb{E}[\log p(z, x)] - \mathbb{E}[\log q(z)]$$
The ELBO is the negative of KL divergence after adding the constant log p(x). Thus maximizing the ELBO is equivalent to minimizing the KL divergence, up to the added constant. This also means that:
$$\log p(x) = \mathrm{KL}\big(q(z) \,\|\, p(z \mid x)\big) + \mathrm{ELBO}(q)$$
The ELBO describes the lower bound of the evidence, because when the ELBO is subtracted from the evidence (log p(x)) the result must be ≥ 0, because KL must be ≥ 0. Because maximizing the ELBO does not require computing log p(x), it is easier than minimizing KL divergence. Maximization techniques can then be used to find the density $q(z)$ that best approximates $p(z \mid x)$.
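The relationship between the evidence, the ELBO, and KL divergence follows from a few lines of algebra. The derivation below is a sketch in the notation of Blei et al. (2017), in which q(z) denotes the approximating density, p(z | x) denotes the PPD of latent variables z given data x, and all expectations are taken with respect to q(z); these symbols are assumed here for illustration.

```latex
\begin{align}
\mathrm{KL}\big(q(z)\,\|\,p(z \mid x)\big)
  &= \mathbb{E}[\log q(z)] - \mathbb{E}[\log p(z \mid x)] \\
  &= \mathbb{E}[\log q(z)] - \mathbb{E}[\log p(z, x)] + \log p(x) \\
  &= \log p(x) - \mathrm{ELBO}(q)
\end{align}
```

The second line uses log p(z | x) = log p(z, x) − log p(x), and the fact that log p(x) does not depend on z and so passes through the expectation unchanged. Because the left-hand side cannot be negative, the ELBO can never exceed log p(x).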
Choosing Q such that the family of densities includes a $q(z)$ that provides a good approximation, while being easily optimized, is the challenge of VI. Stan solves this problem through a method called “automatic differentiation variational inference” (Kucukelbir et al. 2015) by first transforming the data that are the support of the latent variables to lie within the real numbers (ℝ) and then suggesting a Gaussian distribution, which can be optimized to fit the data, and which induces a non-Gaussian approximation to the untransformed data. Stan’s default approach uses the “mean-field” algorithm, which treats latent variables ($z_j$) as independent and assigns a unique density, $q_j(z_j)$, to each of these j variables. Since Stan transforms the data such that latent variables have support on ℝ and then fits Gaussian distributions to those data, this statement becomes the product of many Gaussian distributions, each of which is optimized to maximize the ELBO. Following the notation of Blei et al. (2017), this can be written:
$$q(z) = \prod_{j} q_j(z_j)$$
VI is an attractive technique because it can be many orders of magnitude faster than MCMC (e.g. Raj et al. 2014). However, it is unclear how well VI works across analytical tasks and model specifications (Blei et al. 2017).
Model implementation
We performed DMM in the R statistical computing environment (R Core Team 2019) using models specified for the JAGS and Stan (Carpenter et al. 2017) software programs, and used the models through the rjags (Plummer 2015) and rstan (Stan Development Team 2018) R packages, respectively. JAGS uses MCMC (Gibbs and MH), whereas Stan implements HMC (no U-turn sampling) and VI. Model specification for use in Stan was slightly modified from that described above in that we used an exponential distribution as the form of the prior for θk:
This change in model specification followed the recommendation to avoid uniform priors provided in the Stan documentation.
For HMC and MCMC implementations of DMM, we used two chains to explore parameter space. Initial values for $\vec{p}_i$ in each chain were the vector of proportions observed from the data in replicate i, and values for $\vec{\pi}_k$ were initialized using the vector of observed proportions for each feature across replicates within k (i.e., the maximum likelihood estimates for $\vec{p}_i$ and $\vec{\pi}_k$). θ was left to be initialized internally by rjags and rstan. In rjags, the model was subjected to an adaptation period long enough for the sampler to approach optimal efficiency as determined via internal heuristics, or for 20,000 iterations, whichever came first. Models were updated (“burned in”) for 300,000 steps for rjags and 1000 steps for rstan (with a maximum tree depth of 10). This discrepancy in burn in time was needed because in preliminary work we observed much quicker convergence with HMC than MCMC sampling. We obtained 1000 samples from PPDs by saving every second sample for HMC, and 2000 samples from PPDs for MCMC by saving every fourth sample.
Preliminary inspections of samples showed higher auto-correlation of parameter estimates for MCMC sampling, hence we discarded more samples (higher thinning rate) from the MCMC-derived chains. MCMC convergence was evaluated via the Gelman-Rubin and Geweke statistics (Geweke 1991, Gelman and Rubin 1992). We note that the runtime of MCMC could likely be improved by optimizing adaptation, burn in, and sampling steps within JAGS, or by implementing a custom MCMC procedure in the C (or an equivalent) programming language. Data with different dimensions and variance among samples would likely require different optimizations, so we have not further pursued optimization of the MCMC herein. To perform variational inference we used the functionality included within Stan (the “vb” function; Kucukelbir et al. 2015) and collected 1000 samples from the estimated posterior distributions.
The ability of models to recover true simulation parameters was estimated via root mean square error (RMSE) and the percentage of times the true simulation parameters were within the 95% highest density intervals (HDIs) of PPDs (as per Kruschke 2015, pg. 727). For unimodal, symmetric PPDs, the HDI and equal-tailed probability interval should be identical (Gelman et al. 2013, pg. 38). We measured model bias as the average difference between estimated parameters and the truth and we measured model precision as TP / (TP + FP), where TP refers to true positives and FP to false positives. False positive rate was calculated as FP / (FP + TN), where TN is true negatives. Additionally, we calculated the Matthews correlation coefficient (MCC; Matthews 1975), which provides a measure of classifier performance in terms of both true and false positives and negatives. MCC is the correlation between actual and predicted classifications and varies from one (perfect classification) to negative one (completely incorrect classification). An MCC value of zero denotes a classifier that performs no better than expected from random guessing.
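The performance measures described above can be computed from the counts of true and false positives and negatives. The sketch below uses the standard definitions of these quantities; the function names are hypothetical, and returning zero when a denominator is zero is a convention adopted for this illustration only.

```python
import math

def rmse(estimates, truth):
    """Root mean square error between estimated and true parameter values."""
    return math.sqrt(sum((e - t) ** 2 for e, t in zip(estimates, truth)) / len(truth))

def precision(tp, fp):
    """Proportion of features called as different that truly differ."""
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0

def false_positive_rate(fp, tn):
    """Proportion of truly unchanged features that were called as different."""
    return fp / (fp + tn) if (fp + tn) > 0 else 0.0

def matthews_cc(tp, fp, tn, fn):
    """Matthews correlation coefficient between actual and predicted classes."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention for this sketch: an undefined MCC is reported as zero.
    return (tp * tn - fp * fn) / denom if denom > 0 else 0.0

print(precision(tp=8, fp=2))                        # 0.8
print(false_positive_rate(fp=2, tn=98))             # 0.02
print(matthews_cc(tp=10, fp=0, tn=90, fn=0))        # 1.0 for perfect classification
```

Unlike precision alone, MCC uses all four cells of the confusion matrix, which matters when features that truly differ are a small minority of all features.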
Data simulation
To evaluate the performance of DMM implementations and alternative statistical methods (see below), we simulated and analyzed data with two sampling categories (k), corresponding to treatment and control groups, or some other blocking factor of interest (Fig. 2). We simulated data that possessed three different rank abundance profiles that were meant to correspond to the variety of data encountered by practitioners (Fig. 2). We considered simulations in which all features were equally abundant (each with relative abundance 1/n), and two sets of simulations in which features were sampled from Pareto distributions with differing shape parameters. The Pareto distribution describes data with few abundant features and many rarer features (Krishnamoorthy 2006). The skew towards low abundance in this distribution is controlled by the shape parameter, with smaller parameters increasing skew (Fig. 2); the location parameter defines the minimum value of the distribution. For each simulation, we sampled one of these distributions to populate a vector, $\vec{\pi}_1$, of length corresponding to the approximate desired number of features (n) within the simulated data.
This vector ($\vec{\pi}_1$) was duplicated to make a second vector, $\vec{\pi}_2$. Selected features within these vectors were multiplied by an effect size (either 1.1, 1.5, or 2, to simulate 10%, 50%, or 100% shifts in feature relative abundance), such that those elements differed between $\vec{\pi}_1$ and $\vec{\pi}_2$. Features that varied between vectors were chosen randomly from within each of three broad abundance classes (abundant, rare, and intermediate; see Electronic Supplementary Material) present within $\vec{\pi}_1$ and $\vec{\pi}_2$. Only features of intermediate abundance were available when constraining all relative abundances to be equal. Effect sizes were applied so that . These two vectors were multiplied by a specified intensity parameter S and used as the parameters for two Dirichlet distributions that were sampled to create parameter vectors for multinomial distributions corresponding with each replicate. In this way, we simulated a replicate by feature matrix where replicates were split into two treatment groups and known features differed between treatment groups. Simulated data sets often had fewer features than the originally specified value for n, because when drawing deviates from multinomial distributions with many rare features, not every feature was observed in each deviate (for a visual depiction of the simulation approach see Fig. 2).
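The simulation procedure described above can be outlined in code. The following is a simplified sketch and not the script used to generate our data: features that differ are chosen uniformly at random rather than within abundance classes, the second vector is renormalized after the effect size is applied, and all function names and parameter values are assumptions made for illustration.

```python
import random

def normalize(values):
    total = sum(values)
    return [v / total for v in values]

def simulate_counts(n_features, n_reps, n_reads, shape, effect, n_changed,
                    intensity, rng):
    """Simulate a replicate-by-feature count matrix for two sampling groups."""
    # Group 1 expected proportions: normalized Pareto deviates (skewed abundances).
    pi1 = normalize([rng.paretovariate(shape) for _ in range(n_features)])
    # Group 2 starts as a copy; selected features are multiplied by the effect size.
    changed = sorted(rng.sample(range(n_features), n_changed))
    pi2 = list(pi1)
    for j in changed:
        pi2[j] *= effect
    pi2 = normalize(pi2)  # assumption of this sketch: renormalize to sum to one
    counts, groups = [], []
    for group, pi in ((1, pi1), (2, pi2)):
        for _ in range(n_reps):
            # Replicate-level proportions: one Dirichlet(intensity * pi) deviate.
            p = normalize([rng.gammavariate(intensity * x, 1.0) for x in pi])
            # Observed counts: one multinomial deviate with n_reads draws.
            row = [0] * n_features
            for j in rng.choices(range(n_features), weights=p, k=n_reads):
                row[j] += 1
            counts.append(row)
            groups.append(group)
    return counts, groups, pi1, pi2, changed

rng = random.Random(1)
counts, groups, pi1, pi2, changed = simulate_counts(
    n_features=50, n_reps=4, n_reads=2000, shape=1.5, effect=2.0,
    n_changed=5, intensity=3.0, rng=rng)
print(len(counts), len(counts[0]), changed)
```

Features with very small expected proportions are often absent from individual replicates in the output, which mirrors the loss of rare features noted above.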
Using this approach, we simulated data from each sampling distribution that varied in dimensionality (number of features, ∈ {500, 2000}), number of replicates (∈ {10, 50}), the total number of observations per replicate (e.g., the number of reads per sample for sequencing data; ∈ {10000, 50000}), the variation (noise) among replicates (∈ {0.5, 3}; the intensity parameter in notation provided above), and the effect size applied to features that differed between sampling groups (∈ {1.1, 1.5, 2}; to apply the effect size transformation, these values were multiplied by the original proportion). In total, we created and analyzed 144 data sets (3 rank abundance profiles × 2 × 2 × 2 × 2 × 3 attribute levels). Because the same number of observations was used for each replicate, transformation of the data to account for unequal sampling effort was not required. After simulating data matrices, we added one to every datum, and thereby avoided numerical errors in JAGS that arise with Dirichlet parameters approaching zero.
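As an illustrative sketch of the simulation procedure described above (not the code used in this study), the following Python example draws a replicate-by-feature count matrix for two sampling groups. The function name, its defaults, and the uniform random choice of shifted features are assumptions made for illustration; the study instead chose shifted features from within abundance classes. The Dirichlet deviate is obtained by normalizing independent gamma deviates, and the multinomial deviate by tallying categorical draws.

```python
import random
from itertools import product


def simulate_counts(n_features=50, n_reps=10, reads=10000, shape=4.0,
                    effect=2.0, n_shifted=5, intensity=3.0, seed=1):
    """Return (control, treatment, shifted): two replicate-by-feature count
    matrices and the sorted indices of the features that were shifted."""
    rng = random.Random(seed)
    if shape is None:
        # Equal-abundance profile: every feature gets the same value.
        base = [1.0] * n_features
    else:
        # Pareto deviates with location (minimum) 1; smaller shape = more skew.
        base = [rng.paretovariate(shape) for _ in range(n_features)]
    # Duplicate the vector and multiply selected features by the effect size.
    # Assumption: features are picked uniformly at random here.
    shifted = set(rng.sample(range(n_features), n_shifted))
    treated = [v * effect if i in shifted else v for i, v in enumerate(base)]

    def draw_group(weights):
        # Multiply the vector by the intensity parameter to get Dirichlet parameters.
        alpha = [intensity * w for w in weights]
        rows = []
        for _ in range(n_reps):
            # Dirichlet deviate: normalize independent gamma deviates.
            gammas = [rng.gammavariate(a, 1.0) for a in alpha]
            total = sum(gammas)
            probs = [g / total for g in gammas]
            # Multinomial deviate: tally `reads` categorical draws.
            counts = [0] * n_features
            for j in rng.choices(range(n_features), weights=probs, k=reads):
                counts[j] += 1
            # Add one to every datum, as described in the text.
            rows.append([c + 1 for c in counts])
        return rows

    return draw_group(base), draw_group(treated), sorted(shifted)


# Full factorial grid of simulation settings listed in the text.
grid = list(product(["equal", "pareto_4", "pareto_0.7"], [500, 2000], [10, 50],
                    [10000, 50000], [0.5, 3], [1.1, 1.5, 2]))
```

The length of `grid` reproduces the count of data sets given in the text, and each simulated row sums to the read depth plus the number of features because of the added pseudocount.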
For our main simulation, we did not vary read counts among replicates for the sake of simplicity; however, to ensure that this did not bias our results, we also simulated data where replicates differed by up to two orders of magnitude in total observations (read count). To accomplish this, multinomial deviates were obtained as described above, but the total number of draws from the multinomial distribution was randomly selected from {1000, 10000, 100000}. Data used for this additional analysis were simulated using a representative subset of the aforementioned attributes. Additionally, to better understand the false positive rate of DMM, we simulated and analyzed data where no features were expected to differ between treatment groups, again using a representative subset of the attributes presented above to simulate data.
We compared our implementations of DMM with ALDEx2 v1.14.1 (Fernandes et al. 2014), ANCOM v2.0 (Mandal et al. 2015), DESeq2 v1.18.1 (Love et al. 2014), edgeR v3.20.9 (Robinson et al. 2010), mvabund v4.0.1 (Wang et al. 2019) and a frequentist approach using repeated Wilcoxon rank sum tests with a Benjamini-Hochberg false discovery rate (FDR) correction (Weiss et al. 2017). We used multiple comparison correction and typical settings for all software (see the Supplemental Material). Of the aforementioned methods, only ALDEx2 relies upon DMM. ALDEx2 estimates posterior probability distributions of Dirichlet parameters, which are subsequently transformed via the centered log ratio (Aitchison 1982). Transformed MCMC samples are subjected to a frequentist test of differential relative abundance between sampling groups, p values are calculated, and the distribution of p values across MCMC samples is obtained (with multiple comparison correction applied as desired by the user). The mean of this distribution is used as a point estimate of the significance of treatment. mvabund relies on a generalized linear model, in our case using a negative binomial distribution, to determine differential relative abundance. Each feature in the simulated data was a response variable and treatment group was the categorical predictor variable in the model. If the effect of the predictor was significant, then the feature differed between treatment groups in relative abundance. mvabund is thus quite similar to edgeR and DESeq2; however, those methods use different normalization strategies.
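For readers unfamiliar with the Benjamini-Hochberg correction applied to the repeated frequentist tests above, a minimal sketch of the adjustment follows. It is an illustration written for this text (the function name is hypothetical), not the routine called by any of the packages listed; in practice a vetted implementation such as R's `p.adjust` would normally be used.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p values in the original order."""
    m = len(pvalues)
    # Indices of the p values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

A feature would then be called significant when its adjusted p value is at or below the chosen threshold (0.05 in the analyses described here).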
Our implementation of DMM differs from these methods in several important ways: 1) most competing methods do not rely on the Dirichlet and multinomial distributions, which explicitly model compositions (except ALDEx2); 2) we use a more complex hierarchical structure than the other methods tested to share information among replicates and sampling groups; 3) we do not perform repeated frequentist tests to determine differences in feature relative abundance, but instead directly subtract posterior probability distributions for parameters of interest and observe the location of zero in the resulting distribution of differences.
For all methods, we evaluated how data attributes (e.g. number of replicates, features, etc.) influenced model performance via multiple regression, with either the proportion of true positives recovered or false positive rate as the response variable.
Analyses on empirical data
To understand how DMM could affect inferences made using previously published, empirical data, we analyzed data from Duvallet et al. (2019) describing the lung microbiomes of children with and without oropharyngeal dysphagia (swallowing difficulties) induced aspiration (when a foreign substance enters the lungs). These authors characterized the bacterial assemblages in the lungs (obtained via bronchoalveolar lavage; BAL), gastric fluid, and oropharyngeal region (OR) of each subject via sequencing of the 16S locus. Aspiration is linked to pneumonia in both adults and children (Holas et al. 1994, Marik 2001, Thomson et al. 2016), but the provenance of aspirated microbes is poorly understood. Duvallet et al. (2019) showed that the lung microbiome of patients with difficulty swallowing is more similar to the microbiome of the oropharyngeal region than that of gastric fluid. These authors performed differential relative abundance testing using Kruskal-Wallis tests with a multiple comparison correction to determine whether certain bacterial taxa shifted in relative abundance between aspirating and non-aspirating patients. The authors did not find any taxa that differed in relative abundance, regardless of substrate examined (BAL, gastric fluid, or OR), though they did detect shifts in prevalence (presence across subjects within a sampling group) with phenotype, and suggested that microbial exchange between the lungs and oropharyngeal region is greater than between the lungs and stomach. Using DMM (both VI and HMC; implemented as described above) and all aforementioned competing analyses, we reanalyzed the publicly available BAL data from aspirators and non-aspirators. The data we analyzed were obtained from 66 patients (33 aspirators, and 33 non-aspirators) and included 4006 OTUs (for details of sequence processing see Duvallet et al. 2019).
Results
Dirichlet-multinomial modelling (DMM) provided a good compromise between true positive recovery and false positive generation (Fig. 3 & S2), as shown through analysis of data simulated in the context of a treatment-control experimental design. DMM consistently detected many more true positives than competing methods (Fig. 4) and this sensitivity facilitated detection of subtle shifts in relative abundance between sampling groups. For instance, when analyzing data with a skewed rank abundance profile, DMM detected approximately 15–20% of features that were shifted by treatment by just 10% of their relative abundance. None of the other methods that we employed were able to reliably detect these subtle effects (Fig. 3). When effect sizes were larger, DMM recovered more than 80% of true positives on average, which was 20–40% more true positives than were recovered by DESeq2, the next best model in terms of sensitivity.
The sensitivity of DMM came at the cost of a slightly higher false positive rate and a loss of precision compared to other methods (Figs. 3 & S1). Precision was generally high for uniformly distributed data and when the effect size that described the shift in relative abundance of a feature was large; however, for data with skewed rank abundance profiles, the precision of DMM was lower than that of competing methods. When considering the Matthews correlation coefficient (MCC), DMM typically performed as well as or better than the competing approaches examined (Fig. S2). MCC is a more holistic index of classifier performance than precision because it encompasses true and false positives and negatives. mvabund, ANCOM, and, for some data sets, Wilcoxon tests also performed quite well by this metric.
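To make the performance metrics concrete, the sketch below computes precision, sensitivity (recall), false positive rate, and the Matthews correlation coefficient from the four cells of a confusion matrix using their standard definitions. The function names are hypothetical, and returning zero when a denominator is empty is a common convention assumed here, not necessarily the one used in our analyses.

```python
import math


def precision(tp, fp):
    """Fraction of features called different that truly differed."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def sensitivity(tp, fn):
    """Fraction of truly differing features that were recovered (recall)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def false_positive_rate(fp, tn):
    """Fraction of unchanged features that were wrongly called different."""
    return fp / (fp + tn) if (fp + tn) else 0.0


def matthews_cc(tp, fp, tn, fn):
    """Matthews correlation coefficient; uses all four confusion-matrix cells."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

Because MCC includes true negatives, a method that calls almost nothing significant can have perfect precision yet a modest MCC.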
We observed that the false positive rate (FPR) was adversely affected by the rank abundance skew within the data. Analysis of data that were simulated such that no features were expected to differ among treatment groups revealed that, for data simulated from a uniform distribution, FPR was negligible (0%, Fig. S3). However, FPR for HMC increased to 5.4% on average for data simulated such that they had a highly skewed rank abundance profile (Pareto shape parameter of 0.7). When data were of intermediate skew (Pareto shape of 4), FPR increased to 8.2%. We also found that high among-replicate variation in sampling depth tended to increase FPR by a few percentage points (Fig. S4). On average, FPR of VI was only slightly higher than that of HMC. By comparison, FPR was often much higher when DMM was implemented via MCMC. Indeed, in many cases, MCMC generated an unacceptably high FPR of over 20%. This high FPR is at least partially due to the lack of convergence we observed for many parameters when using MCMC, even when we employed lengthy run times. We observed broadly comparable results from our primary simulation experiment, which spanned data with a broader variety of attributes and for which features differed in relative abundance among sampling groups (Fig. 3).
Of the analytical tools examined, DESeq2 and edgeR were the next most sensitive behind DMM. DESeq2 maintained a lower false positive rate than DMM. ANCOM, ALDEx2, and Wilcoxon tests all exhibited negligible false positive rates, but were only able to identify a small fraction of the features that shifted in relative abundance between sampling groups. All methods, including DMM, performed poorly when confronted with data where all features were equally abundant (denoted as “uniform” in figures). This was unsurprising, because, for these data, the expectation of π was approximately one divided by the number of features present and large, marginal shifts in relative abundance between sampling groups (such as doubling) still resulted in very small differences in proportions (e.g. versus ), which were difficult to estimate.
We used multiple regression to test how data attributes influenced true positive detection and false positive rate (Tables S2, S3). For all methods competed, the degree of rank abundance skew within the data had, by far, the largest effect on model performance. Surprisingly, all methods were quite insensitive to variation in other data attributes. Data dimensionality (number of features), number of replicates, number of observations, and among-replicate variation had very minor influences on true positive detection and false positive rate for most methods tested (Tables S2, S3).
While our primary goal was ascertaining the relative merits of DMM for detecting differences in feature abundance, we also asked how well DMM could recover the relative abundances ( and ) that were used to simulate data. We report very low average root mean square error (RMSE) for estimates of simulated relative abundances ( and ) obtained through DMM (Fig. 5). As a complementary test of model performance, we determined how often the parameters used to simulate data fell within the highest density interval (HDI) of posterior probability distributions (PPDs). When feature relative abundances were equal, or modestly skewed (“Equal” or “Pareto, shape = 4”), the HDI of PPDs encompassed the value used to simulate data for nearly all parameters of interest, regardless of estimation method employed (MCMC, VI, or HMC; Fig. S5). Parameter estimation was much more difficult for highly skewed data: when using MCMC or VI, the true values for the parameters did not lie within the estimated HDIs in some cases. By comparison, HMC did better when confronting these challenging data; on average ∼90% of simulation parameters fell within the HDI, though there was wide variation in model performance depending upon data set (Fig. 3). We observed that the width of credible intervals for π parameters was not associated with relative abundance regardless of implementation method or dataset (Figs. S16–S18). Bias of DMM differed among implementations, with HMC having negligible bias (Figs. S7, S8, S9) and VI and MCMC exhibiting comparatively more bias. We observed that, for all implementations, bias, when present, was typically limited to the most abundant and rarest features within the dataset. Specifically, π parameters were occasionally slightly underestimated for abundant features and overestimated for rare features. This pattern was more noticeable for highly skewed data and can be explained given the prior we used for π parameters, which corresponded to , where n was the number of features.
For skewed data with high among-replicate variation, the strength of the prior was not overcome by the likelihood, thus leading to slight overestimation of rare features and underestimation of abundant features. If among-replicate variation was reduced, then DMM was able to accurately recover true parameters even for highly skewed data. The prior we chose was agnostic to rank-abundance curves and thus suitable for a wide range of applications, but could be replaced with a prior that encodes a specific rank-abundance profile if desired by the user.
Inferences on empirical data
Reanalysis of data provided by Duvallet et al. (2019) demonstrated the sensitivity of DMM. Using HMC, we found that 53 taxa within the lung microbiome (samples were obtained via bronchoalveolar lavage) shifted in relative abundance between aspirating and non-aspirating children (Fig. S19). This contrasts dramatically with the results we obtained from repeated Wilcoxon tests with a Benjamini-Hochberg false discovery rate correction, mvabund, and ALDEx2, which suggested no taxa significantly shifted in relative abundance between sampling groups. By comparison, DESeq2 suggested 17 taxa differed, edgeR suggested ten taxa, and ANCOM four taxa.
Analysis of lung microbiome data using VI and HMC based implementations of DMM provided largely similar results; however, VI did report five fewer taxa shifted in relative abundance than did HMC. The majority of taxa identified by HMC were also identified by VI; the two methods did not agree regarding true positive status for only nine taxa. Of the 53 taxa that we found shifted between sampling groups, the most dramatic change was in a Streptococcus taxon, which was much more abundant in aspirating children (Fig. S19). An increase in this taxon has previously been reported in adult humans with pneumonia by Akata et al. (2016). We also found an increase in Haemophilus (Jacobs and Harris 1979), Moraxella (Claesson and Leinonen 1994), Neisseria (Johnson et al. 1981), and Prevotella (El-Solh et al. 2003), all of which have previously been associated with pneumonia (see citations for examples), but may be present in healthy lung tissue as well (Beck et al. 2012). We also observed an increase in Enterobacter, Lactococcus, Leuconostoc, and Acinetobacter taxa in the lungs of non-aspirating subjects.
Discussion
Over the past decade, there has been considerable discussion regarding how molecular ecologists should process and analyze compositional data, particularly those generated by high-throughput sequencing instruments (e.g., see Knight et al. 2018, Thorsen et al. 2016, Weiss et al. 2017). This dialogue has been motivated by the constraints of modern laboratory equipment (e.g., the constant sum constraint of sequencers) coupled with a pressing need for consensus involving appropriate, sensitive tools to analyze data generated by such instruments. Through analysis of simulated data spanning the variation in attributes expected across many scientific domains, we report that new computational statistical techniques have made Dirichlet-multinomial modelling (DMM) an approach that can be applied efficiently in many settings. Specifically, we report that DMM is much more sensitive than the competing approaches we examined, making DMM particularly well suited to identification of subtle shifts in relative abundance among features, such as what might be required in the study of rare, but consequential, microbes or metabolites (Lynch and Neufeld 2015, Sachdeva et al. 2019). Indeed, for some data, DMM identified many times more true positives than certain competing methods (up to approximately eight times more in extreme cases; Fig. 3). The sensitivity of DMM does, however, come at the cost of an increase in false positive rate (FPR) and a loss of precision compared to competing methods, particularly for data with skewed rank abundance profiles and large variation in sampling depth among replicates. For such challenging data, FPR increased to 5.5–10% (Fig. S3), which we suggest may be acceptable for practitioners who are tasked with analyzing challenging data and who wish to avoid missing features that truly differ among compositions.
The tradeoff between sensitivity (also referred to as “recall”) and precision is well known (Buckland and Gey 1994) and we suggest that the suitability of DMM will depend on the particular needs of the practitioner. If practitioners are interested primarily in sensitivity, then our results suggest DMM is an appropriate method to choose. If, on the other hand, practitioners wish to avoid false positives, even at the expense of considerable loss of sensitivity, then other methods may be more suitable.
Aside from sensitivity, DMM provides several important ancillary benefits including the estimation of parameters that describe the data under consideration and the ability to propagate uncertainty in those estimates to downstream analyses. Propagation of uncertainty allows for a precise statement regarding the credibility of an inference and is a particular benefit of Bayesian techniques over frequentist methods. For example, to determine the extent to which specific features shifted from one simulated sampling group to another, we obtained the difference between PPDs of Dirichlet parameters from each group (Fig. 2d). A PPD is a distribution that explicitly describes the probability of certain values for a particular model parameter; thus, in the model described here, the mean of the PPD for a specific π parameter is a sensible point estimate for that feature’s relative abundance and the variation around that mean describes the certainty in that estimate. By subtracting PPDs for π parameters obtained from different sampling groups for a focal taxon, we obtain a PPD of differences, thus propagating uncertainty in relative abundance estimates through to differential relative abundance testing (Fig. 1). This provides a great deal of flexibility to practitioners, because the location of zero in this distribution of differences quantifies the probability that the two original PPDs differed; in other words, that the feature differed in relative abundance between sampling groups. We assumed that, for some feature i present in two sampling groups k, if 95% of the PPDs for πik did not overlap, then that feature differed in relative abundance between groups (see methods). If a more conservative analysis is desired, then a stricter criterion could be employed to determine if PPDs of focal features are sufficiently divergent, for instance 98% or 99%. Similarly, a less strict criterion could be used (e.g., 90%) for exploratory analyses.
Moreover, because we precisely quantify uncertainty in parameter estimates derived from a single model, multiple comparison testing is unneeded for our implementation of DMM. A final benefit of quantifying uncertainty for each feature of interest is that, with some creativity, this uncertainty can be propagated to other downstream analyses, including those using derived parameters of interest such as diversity entropies (see Supplemental Material and Marion et al. 2018). The benefits provided by uncertainty propagation are primary differences between DMM as we describe it here and the competing approaches we tested that rely on some form of frequentist testing.
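The differencing of posterior samples described above can be sketched in a few lines. The example assumes that posterior samples of a feature's relative abundance in each group are already available as equal-length lists (for instance, extracted from Stan or JAGS output); the function name is hypothetical, and reading the 95% criterion as "at least 95% of the sampled differences fall on one side of zero" is our interpretation for illustration.

```python
def difference_certainty(samples_a, samples_b, level=0.95):
    """Subtract paired posterior samples and locate zero in the result.

    Returns (certainty, differs): the larger fraction of differences lying
    on one side of zero, and whether that fraction reaches `level`.
    """
    if len(samples_a) != len(samples_b) or not samples_a:
        raise ValueError("need two non-empty sample lists of equal length")
    diffs = [a - b for a, b in zip(samples_a, samples_b)]
    frac_pos = sum(d > 0 for d in diffs) / len(diffs)
    frac_neg = sum(d < 0 for d in diffs) / len(diffs)
    certainty = max(frac_pos, frac_neg)
    return certainty, certainty >= level
```

Raising `level` to 0.98 or 0.99 gives the more conservative analysis mentioned in the text, and lowering it to 0.90 gives the exploratory one.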
Another important benefit of the approach to DMM we describe is the hierarchical sharing of information among replicates from sampling groups (also see Fordyce et al. 2011). Hierarchical models make thorough use of the information present within the data, which can improve parameter estimates and propagate uncertainty, particularly when sampling effort is inconsistent among replicates and sampling groups (Coblentz et al. 2017). As described in the methods, hierarchical modelling can be used in a way analogous to frequentist, mixed effects modelling to account for non-independence among replicates through the use of a random effect (Bates et al. 2015, Björk et al. 2018). Hierarchical modelling also allows for novel inferential opportunities, given sufficient data, because parameter estimates can be extracted from any level in the model hierarchy.
Additional considerations pertaining to Dirichlet-multinomial modelling
A downside to Bayesian modelling is its computational expense. While JAGS (Plummer 2003), BUGS (Lunn et al. 2012), Stan (Carpenter et al. 2017), and PyMC3 (Salvatier et al. 2016) have greatly simplified Bayesian model specification and implementation, Bayesian analysis can require much more computational time than frequentist methods. Users should be aware that as the number of parameters to estimate increases, so too does modelling time. For data sets of low to moderate dimensionality (i.e. less than a thousand features), the model described herein can be run on a desktop computer within several hours using any of the three PPD estimation methods (VI may take only a few seconds to run for such small data). However, for larger data sets of many thousands of features, convergence when using MCMC or HMC may require multiple days. For larger data, MCMC sampling should probably be avoided because HMC, as implemented in Stan, is much faster and results in convergence for more parameters and, thus, a lower false positive rate (Fig. S6). For extremely large data, VI may be the only viable option for efficient parameter estimation. Unfortunately, we observed heightened variation in the performance of VI compared to MCMC or HMC when confronting data with a dramatic rank abundance skew; in some cases VI did as well as HMC, but in other cases it was unable to recover a high proportion of the true positives present (Fig. 3). Computational implementations of VI are a topic of current research and will undoubtedly improve over coming years (Blei et al. 2017). For most users, we suggest performing an initial analysis using both HMC and VI. If parameter estimates are largely congruent between techniques (as we generally observed), then VI could be used for subsequent analyses using similar data, thus taking advantage of VI’s efficiency.
For HMC or MCMC sampling, time to convergence can be improved through initializing the chains at sensible values for all parameters. We initialize chains for multinomial and Dirichlet parameters at their maximum likelihood values (, the proportion of each feature within a sampling group). Additional performance gains can be achieved by combining features that are consistently infrequent across replicates to form a composite feature. This composite feature should be included in modelling, otherwise proportion estimates will be distorted and incorrect. This approach could be particularly appropriate for analysis of high-throughput sequencing of microbiomes and transcriptomes, which often rely on data sets characterized by many features of extremely low relative abundance. Estimates of the relative abundance of very infrequent features will be imprecise, thus precluding effective comparison of relative abundances among sampling groups. Therefore, for some questions, combining these features will not lessen inferential opportunity and can greatly reduce computation time.
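The two suggestions above, pooling consistently infrequent features and initializing at observed proportions, can be sketched as follows. The threshold, the function names, and the rule that a feature is "consistently infrequent" when its count is below the threshold in every replicate are assumptions chosen for illustration.

```python
def pool_rare_features(counts, max_count=5):
    """Pool features whose count is below `max_count` in every replicate.

    counts: list of replicate rows (lists of integer counts).
    Returns (pooled, kept): the new matrix, whose last column is the
    composite feature, and the original indices of the retained features.
    """
    n_features = len(counts[0])
    rare = [j for j in range(n_features)
            if all(row[j] < max_count for row in counts)]
    rare_set = set(rare)
    kept = [j for j in range(n_features) if j not in rare_set]
    # Keep the composite column so that each row total is unchanged.
    pooled = [[row[j] for j in kept] + [sum(row[j] for j in rare)]
              for row in counts]
    return pooled, kept


def observed_proportions(counts):
    """Proportion of each feature within a sampling group (summed over
    replicates), usable as initial values for the chains."""
    totals = [sum(col) for col in zip(*counts)]
    grand_total = sum(totals)
    return [t / grand_total for t in totals]
```

Retaining the composite column is what keeps the remaining proportions undistorted: every replicate keeps its original total.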
Some authors have suggested that the expected negative covariance of feature proportions in a Dirichlet distribution is a drawback that makes this distribution undesirable (Grantham et al. 2017, Mandal et al. 2015, Weiss et al. 2016). Specifically, the elements of $\vec{p}$ in a deviate from a Dirichlet distribution are expected to negatively covary (Mosimann 1962) according to: $\mathrm{Cov}(p_i, p_j) = -\frac{\alpha_i \alpha_j}{\alpha_0^2 (\alpha_0 + 1)}$ for $i \neq j$, where $\vec{p}$ is the vector of expected proportions for features in the composition and $\vec{\alpha}$ represents the Dirichlet parameter vector. Indexing of $\vec{p}$ and $\vec{\alpha}$ across features is achieved via i and j, and $\alpha_0 = \sum_{i=1}^{n} \alpha_i$, where n is the number of features. For even modest values of α0, the expected negative covariance between elements in $\vec{p}$ is small and diminishes rapidly with increasing α0, approaching zero in the limit of large α0. The negative covariance structure is a fundamental limitation of compositional data: as one or more features increase, other features must decline to maintain a constant sum. Thus, the Dirichlet distribution assumes a reality that mirrors the data.
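The claim that the negative covariance shrinks as α0 grows can be checked numerically with the standard Dirichlet covariance, Cov(p_i, p_j) = −α_i α_j / (α0² (α0 + 1)) for i ≠ j, where α0 is the sum of the Dirichlet parameters. The sketch below is illustrative only; the function name and the example proportions are assumptions.

```python
def dirichlet_cov(alpha, i, j):
    """Covariance between elements i and j of a Dirichlet(alpha) deviate."""
    a0 = sum(alpha)
    if i == j:
        # Variance of a single element.
        return alpha[i] * (a0 - alpha[i]) / (a0 ** 2 * (a0 + 1))
    return -alpha[i] * alpha[j] / (a0 ** 2 * (a0 + 1))


# Hold the expected proportions fixed and increase alpha_0.
proportions = [0.5, 0.3, 0.2]
covariances = {a0: dirichlet_cov([a0 * p for p in proportions], 0, 1)
               for a0 in (1, 10, 100, 1000)}
```

With the expected proportions held fixed, the covariance equals −p_i p_j / (α0 + 1), so each tenfold increase in α0 reduces its magnitude by roughly a factor of ten.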
There are many problems associated with the analysis of compositional data that cannot be handled by DMM alone (see Aitchison and Egozcue 2005, Gloor and Reid 2016, Quinn et al. 2017, Tsilimigras and Fodor 2016, van den Boogaart and Tolosana-Delgado 2013). The most intuitive challenge posed by compositional data is that spurious correlations among features can arise because of the data’s inherent covariance structure (Pearson 1897). For instance, shifts in the relative abundance of a dominant microbial taxon along an abiotic gradient cause shifts in the relative abundance of co-occurring taxa, even if the actual abundances of those taxa are invariant across the gradient (Fig. 1). In such a scenario, compositionality could induce associations between the relative abundances of certain taxa and the gradient that are not biologically supported. Other issues that can arise when analyzing compositional data include “sub-compositional incoherence”, which means that omission of features from the composition necessarily changes the relative abundances of the remaining features after they are renormalized to their constant sum (e.g., one for proportions; Pawlowsky-Glahn and Egozcue 2006).
The technique most relied upon to address these problems is log ratio transformation: $\log\left(p_i / f(\vec{p})\right)$, where pi is the ith feature within $\vec{p}$, which is composed of either counts or proportions, and $f$ is a function. When $f$ is the geometric mean of all feature abundances, this transformation is called the “centered log ratio” (CLR; Aitchison 1982). Division by the geometric mean places all replicates on the same scale and, therefore, is useful when variation in sampling effort exists among replicates. Alternatively, $f$ can be an indexing function and output the value of a feature, pj, that has a constant absolute abundance among replicates. This approach is called the “additive log ratio” (ALR) transformation (Aitchison 1982) and can be useful when an internal standard can be added to samples prior to data generation (e.g. during library preparation for next-generation sequencing; Jiang et al. 2011, Munro et al. 2014, Tourlousse et al. 2017, Tkacz et al. 2018) or when certain features are expected to be invariant among replicates (e.g. “housekeeping genes”; Eisenberg and Levanon 2013). By converting information from each feature into a ratio, both ALR and CLR avoid the sub-compositional incoherence problem (Morton et al. 2019). To understand this, consider conducting the ALR transformation on replicates that each include a feature with identical absolute abundance that is used as the denominator in the transformation (it does not matter whether we consider counts or proportions for this example). The ratio between any specific feature within a replicate and the denominator will not be affected by removing other features from the composition (i.e., if the ratio is 2:1 it will remain so after omitting features from the composition and re-normalizing to maintain a constant sum). Either the CLR or ALR transformation can be applied to each MCMC sample of parameters of interest to obtain transformed PPDs for analysis (see Fernandes et al. 2014, for an example).
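The two transformations, and the ALR example in which a 2:1 ratio survives removal of another feature, can be sketched as follows. The function names and the example composition are assumptions, and both functions require strictly positive inputs (for example, after adding a pseudocount).

```python
import math


def clr(p):
    """Centered log ratio: log of each element over the geometric mean."""
    logs = [math.log(x) for x in p]
    mean_log = sum(logs) / len(logs)  # log of the geometric mean
    return [value - mean_log for value in logs]


def alr(p, ref):
    """Additive log ratio: log of each element over the reference element
    p[ref]; the reference itself is dropped from the output."""
    return [math.log(x / p[ref]) for k, x in enumerate(p) if k != ref]


# Feature 0 is twice as abundant as the reference feature (index 1).
full = [0.4, 0.2, 0.3, 0.1]
# Omit feature 2 and renormalize the remaining features to sum to one.
kept = [full[0], full[1], full[3]]
sub = [x / sum(kept) for x in kept]
ratio_full = alr(full, ref=1)[0]
ratio_sub = alr(sub, ref=1)[0]
```

Both `ratio_full` and `ratio_sub` equal log(2) up to floating point error, which is the invariance described in the text. To transform a PPD, either function would be applied to each posterior sample in turn.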
Conclusions
The challenges posed by many modern molecular ecology data sets (extreme dimensionality, compositionality, and, often, stark differences in the abundance of features) have motivated the rapid development of new analytical tools and techniques. Indeed, new methods and software are published on a near monthly basis and practitioners are left to wonder which tool is best suited for the job at hand. While we do not claim DMM addresses all the challenges associated with compositional data, we do report that it is a sensitive, flexible technique that facilitates feature-specific analyses and should be added to ecologists’ toolkits (Fordyce et al. 2011). It is likely to be broadly useful and sensitive for analyses of microbiomes, other DNA barcoding, gene expression, metabolomics, and other applications in molecular ecology (Table S1). To facilitate use of DMM, we have provided an expository vignette in the Electronic Supplemental Material that provides an example of how to perform DMM using both Stan and JAGS in the R environment.
The success of DMM for relative abundance estimation, as demonstrated herein, coupled with the aforementioned benefits of hierarchical Bayesian modelling, justifies extension of the DMM to determine the effects of covariates on relative abundances and to characterize mixtures of compositions (sensu Chen and Li 2013, Holmes et al. 2012, Knights et al. 2011, Shafiei et al. 2015, Tang and Chen 2018). We look forward to continued method development along these lines.
Data Accessibility
All scripts and processed data used for this manuscript are available at https://github.com/JHarrisonEcoEvo/DMM (Harrison et al. 2019), and a snapshot corresponding to the status at publication is archived at Zenodo (DOI: 10.5281/zenodo.3558682). Data from Duvallet et al. (2019) can be downloaded from Zenodo (DOI: 10.5281/zenodo.2678108).
Author contributions
All authors contributed to model development and manuscript preparation.
Supplementary Material
Supplemental Methods
Determination of feature abundance class
Deviates used to simulate data were divided into abundance classes to ensure that features of each abundance class were made to differ between sampling groups. All features were assigned to the intermediate abundance class when Dirichlet parameters were assigned a constant value. When the Pareto distribution with shape parameter of four was used, deviates greater than or equal to five were assigned to the abundant class, deviates in the intermediate class were between two and five, and deviates within the rare class were less than two. When the Pareto distribution with shape parameter of 0.7 was used, deviates greater than 1000 were assigned to the abundant class, deviates between 1000 and 100 to the intermediate class, and deviates less than 100 to the rare class. These thresholds were chosen through visual examination of frequency distributions of deviates from distributions. Recall that the location parameter (minimum value) of the Pareto distributions was set to one.
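The thresholds above can be written as a small lookup. This is a sketch for illustration (the names are hypothetical); boundary handling follows the wording of the text, so a deviate of exactly five is abundant under shape four, whereas a deviate of exactly 1000 is intermediate under shape 0.7.

```python
# shape -> (rare below, abundant threshold, threshold value counts as abundant)
THRESHOLDS = {
    4.0: (2.0, 5.0, True),
    0.7: (100.0, 1000.0, False),
}


def abundance_class(deviate, shape=None):
    """Assign a deviate to 'rare', 'intermediate', or 'abundant'.

    shape=None denotes the equal-abundance simulations, in which every
    feature belongs to the intermediate class.
    """
    if shape is None:
        return "intermediate"
    rare_below, abundant_at, inclusive = THRESHOLDS[shape]
    if deviate > abundant_at or (inclusive and deviate == abundant_at):
        return "abundant"
    if deviate < rare_below:
        return "rare"
    return "intermediate"
```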
Implementation of competing software
For analyses conducted using ALDEx2 v1.14.1 (Fernandes et al. 2014) we drew 1000 MCMC samples, which were transformed using the CLR (denom = all). Welch’s t tests and general linear models were used to determine differential relative abundance. We used a p value threshold of 0.05 to determine significance after applying a Benjamini-Hochberg FDR correction.
Options used for ANCOM v2.0 (Mandal et al. 2015) included a significance value of 0.05, a “less stringent” multiple comparison correction (multcorr = 2), “prev.cut” was set to 0.99 (meaning features that were not observed in 99% or more of samples were omitted), and “repeated” was set to “False”. During analysis we uncovered an apparent error in the ANCOM v1.1-3 software. On occasion, ANCOM would suggest that all features within a data set differed significantly between groups. This error was not stable, though errors did seem to only occur when data were generated using the Pareto distribution. Upon further research, we found others have reported this error on the QIIME forums (Caporaso et al. 2010). To work around this problem, during the very rare cases when ANCOM reported ≥ 90% of features were significant, we identified significantly differing features as those with non-zero w parameters (the test statistic used by ANCOM). This resulted in very similar results among replicate analyses of data simulated using the same parameters, but that did not trigger the aforementioned error. Subsequently, we shifted analyses to rely on ANCOM v2.0, but we left this solution in place in the event that v2.0 suffered from the same error we observed in v1.1-3.
We used default options for DESeq2 v1.18.1 (Love et al. 2014). The “nbinomWaldTest” function was used to determine differential relative abundance. Significant differences were defined at p ≤ 0.05 after a multiple comparison correction that was calculated by DESeq2.
Default options were used for edgeR v3.20.9 (Robinson et al. 2010). After dispersion estimates were calculated using the “estimateDisp” function, the “glmQLFit” and “glmQLFTest” functions were used to determine differential relative abundance. Features differing in relative abundance were determined using the “topTags” function with a Benjamini-Hochberg FDR correction and p ≤ 0.05 threshold.
Default options were used for mvabund v4.0.1 (Wang et al. 2019). Simulated data were converted into an mvabund object using the “mvabund” function. The “manyglm” function was used to implement a non-hierarchical linear model where each taxon was the response and treatment group was a categorical predictor variable. A negative binomial distribution was used for the GLM and the parameter “cor.type” was set to “shrink” to account for correlation among response variables. Results from the GLM were determined using the “anova” function with a Wald test and a multiple comparison correction using a step-down resampling algorithm described in Wang et al. (2012) and Westfall and Young (1993).
R was used to implement all software.
Examples of possible derived parameters
Derived parameters can be calculated from the output of Dirichlet-multinomial modelling while preserving the uncertainty quantified by the model. For example, many microbial and community ecologists wish to compare diversity indices among sampling groups (Jost 2007, Marion et al. 2015). Diversity indices can be calculated for each sample of the Dirichlet’s parameter vector, thus generating a PPD of diversity statistics for each sampling group. PPDs of diversity could then be compared between sampling groups through subtraction (see Harrison et al. 2019 for an example). This conceptual approach was first described by Marion et al. (2018), though the model in that study relied upon a multivariate normal prior with softmax transformation, instead of the Dirichlet prior we use here.
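A minimal sketch of this idea follows, assuming Shannon diversity as the index of interest. The posterior samples are simulated stand-ins (in practice they would be extracted from a fitted model), and `rdirichlet_sketch` and `shannon` are hypothetical helpers written in base R so that the example is self-contained.

```r
# Hypothetical helper: n Dirichlet draws from normalized gamma deviates.
rdirichlet_sketch <- function(n, alpha) {
  g <- matrix(rgamma(n * length(alpha), shape = alpha), nrow = n, byrow = TRUE)
  g / rowSums(g)
}

# Shannon diversity of one vector of proportions.
shannon <- function(p) -sum(p[p > 0] * log(p[p > 0]))

set.seed(7)
# Stand-ins for posterior samples of the Dirichlet parameters
# (rows = posterior samples, columns = features).
pi_group1 <- rdirichlet_sketch(500, c(20, 20, 20, 20, 20))  # even assemblage
pi_group2 <- rdirichlet_sketch(500, c(80, 5, 5, 5, 5))      # uneven assemblage

# Diversity is computed for every posterior sample, then differenced.
div_diff <- apply(pi_group1, 1, shannon) - apply(pi_group2, 1, shannon)

# Share of the distribution of differences that lies above zero.
mean(div_diff > 0)
```

Because the index is computed sample by sample, the uncertainty in the proportions carries through to the distribution of differences in diversity.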
When a finite number of observations can be ascribed to categories (e.g., observations of taxa or transcripts), the counts of observations of each category can be appropriately modeled using the multinomial distribution. Multinomial parameters define the probability that a given observation belongs to a particular category and these probabilities correspond to the relative abundance of that category in the population that was sampled. Because it accounts for the probability of all categories, the sum of the multinomial parameter vector is one. For instance, if of the birds one observed on a long hike were American robins then the maximum likelihood estimate of the multinomial parameter for robins would be and other parameters would correspond to the relative abundance of the other bird taxa observed. Here we describe how to model multinomial data using a hierarchical Bayesian approach that shares information among replicates via the Dirichlet distribution. The parameters of the Dirichlet distribution allow inference regarding the relative abundance of each category, or feature, within the sampling group.
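The maximum likelihood estimate of each multinomial parameter is the count for that category divided by the total count. The short sketch below uses hypothetical bird counts chosen purely for illustration; they are not values from the text.

```r
# Hypothetical counts from a single hike (illustrative values only).
bird_counts <- c(robin = 12, chickadee = 5, junco = 3)

# Maximum likelihood estimate of the multinomial parameters: count / total.
p_hat <- bird_counts / sum(bird_counts)
p_hat

# The estimated parameter vector sums to one.
sum(p_hat)
```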
The goal of the analysis demonstrated here is to identify features (e.g., taxa, transcripts, behavioral preferences) that differ in relative abundance across treatment groups. However, once estimates for feature relative abundance are obtained, these estimates can be passed to additional analyses. We implement modeling using three frameworks (variational inference and Hamiltonian Monte Carlo in Stan, and MCMC [Gibbs and Metropolis-Hastings] sampling in JAGS) to demonstrate the differences and similarities of each.
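For orientation, the hierarchy implemented by all three frameworks can be written compactly. The notation is an assumption made for this summary: $\mathbf{y}_j$ is the count vector for replicate $j$, $n_j$ its total count, and $i$ indexes the sampling group containing replicate $j$. The priors are read from the Stan and JAGS specifications given later in this document.

```latex
\begin{align}
\mathbf{y}_j \mid \mathbf{p}_j &\sim \mathrm{Multinomial}(n_j, \mathbf{p}_j) \\
\mathbf{p}_j \mid \theta_i, \boldsymbol{\pi}_i &\sim \mathrm{Dirichlet}(\theta_i \boldsymbol{\pi}_i) \\
\boldsymbol{\pi}_i &\sim \mathrm{Dirichlet}(\boldsymbol{\alpha}), \qquad \alpha_k = 10^{-7} \\
\theta_i &\sim \mathrm{Exponential}(0.001) \ \text{(Stan)} \quad \text{or} \quad \theta_i \sim \mathrm{Uniform}(0, 4000) \ \text{(JAGS)}
\end{align}
```

Here $\boldsymbol{\pi}_i$ holds the expected relative abundances for sampling group $i$, and $\theta_i$ is the intensity parameter that controls variation among replicates.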
Be advised that modeling large data sets is computationally expensive; we therefore use a simple, simulated data set for this example. For smaller datasets, say of a few hundred to a thousand features, the model shown here can be run on a desktop system. For larger datasets, computation can take several days, so one may wish to run the model remotely. One trick that can be used to reduce computational expense is to sum uncommon features into a single, composite feature. The counts of this composite feature should be included during modeling; otherwise proportional estimates will be incorrect. Run time can also be reduced by initializing sampling at values that are likely to be close to the true values of the parameters to be estimated (e.g., π parameters in the Dirichlet could be set to the maximum likelihood estimate of the frequency of that feature across replicates). See the documentation for Stan or rjags for information on how to initialize chains.
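Both suggestions can be sketched in a few lines of base R. The helper `collapse_rare` and its `min_total` threshold are hypothetical; what counts as an uncommon feature is a judgment call that depends on the data. The pooled frequencies at the end are one possible choice of starting values for π and are not passed to Stan or JAGS here.

```r
# Sum features whose total count falls below a threshold into one composite
# column. The threshold (min_total) is a hypothetical tuning choice.
collapse_rare <- function(counts, min_total) {
  rare <- colSums(counts) < min_total
  if (!any(rare)) return(counts)
  cbind(counts[, !rare, drop = FALSE],
        composite = rowSums(counts[, rare, drop = FALSE]))
}

set.seed(3)
# Toy data: 10 replicates x 4 features; the last two features are uncommon.
toy <- matrix(rpois(40, lambda = rep(c(100, 50, 1, 1), each = 10)), nrow = 10)
collapsed <- collapse_rare(toy, min_total = 50)

# Row totals are unchanged, so proportional estimates remain correct.
stopifnot(all(rowSums(collapsed) == rowSums(toy)))

# Candidate starting values for pi: pooled observed frequencies
# (one is added to every count so that no starting value is zero).
pi_init <- colSums(collapsed + 1) / sum(collapsed + 1)
```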
Simulation
We start this example by simulating some data. Note that the intensity parameter of the Dirichlet distribution controls the degree of among-replicate variation within the data. Higher values for this parameter lead to less variation among replicates. Also, we add a one to every datum so that there are no zero values within the data. This is necessary because zeros can cause infinite density errors in JAGS, due to their contribution to the Dirichlet probability density function. If zeros exist in one’s data, then add a one to every count.
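The effect of the intensity parameter can be seen in isolation with a small base R sketch. `rdirichlet_sketch` is a hypothetical helper used so the example does not depend on any package, and the expected proportions are illustrative values.

```r
# Hypothetical helper: n Dirichlet draws from normalized gamma deviates.
rdirichlet_sketch <- function(n, alpha) {
  g <- matrix(rgamma(n * length(alpha), shape = alpha), nrow = n, byrow = TRUE)
  g / rowSums(g)
}

set.seed(11)
expected <- c(0.5, 0.3, 0.2)  # illustrative expected proportions

# Replicate-level proportions drawn at low and high intensity.
low_intensity  <- rdirichlet_sketch(2000, expected * 1)
high_intensity <- rdirichlet_sketch(2000, expected * 100)

# Among-replicate standard deviation of the first feature shrinks
# as intensity increases.
sd(low_intensity[, 1])
sd(high_intensity[, 1])
```

The full simulation used in this example follows.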
# library(gtools)
# library(rstan)
# library(rjags)
# library(shinystan)
# library(VGAM)

notus <- 50
nsamples <- 5000
nreps <- 100
intensity <- 1
comprop <- matrix(0, ncol = notus, nrow = 2)
indprop <- matrix(0, ncol = notus, nrow = nreps)

#Assemblage 1
comprop[1, ] <- rdirichlet(1, c(rep(15, 5), rep(1, notus - 5)))
#Assemblage 2
comprop[2, ] <- rdirichlet(1, c(rep(1, notus - 5), rep(15, 5)))

#Construct data matrix
com <- matrix(0, ncol = notus, nrow = nreps)
for (i in 1:(nreps / 2)) {
  indprop[i, ] <- rdirichlet(1, comprop[1, ] * intensity)
  com[i, ] <- rmultinom(1, nsamples, prob = indprop[i, ])
}
for (i in (1 + nreps / 2):nreps) {
  indprop[i, ] <- rdirichlet(1, comprop[2, ] * intensity)
  com[i, ] <- rmultinom(1, nsamples, prob = indprop[i, ])
}
com <- com + 1
nsamples <- nsamples + 50

Stan model specification
Now we run the model. See the main text for model exposition. First, we load the Stan specification of the model, which, in this case, is in a text file located within the working directory. This can take a few seconds.
DM <- stan_model("DM.stan", model_name = "DM")
#This file has the following model within it:
# // Model specification for Dirichlet-Multinomial
# data {
#   int<lower=1> N;
#   int<lower=1> nreps;
#   int<lower=1> notus;
#
#   int<lower=1> start[N];
#   int<lower=1> end[N];
#
#   int datamatrix[nreps, notus];
# }
#
# parameters {
#   real<lower=0> theta[N];
#   simplex[notus] pi[N];
#   simplex[notus] p[nreps];
# }
#
# model {
#   for(i in 1:N){
#     target += exponential_lpdf(theta[i] | 0.001);
#     target += dirichlet_lpdf(pi[i] | rep_vector(0.0000001, notus));
#     for(j in start[i]:end[i]){
#       target += dirichlet_lpdf(p[j] | theta[i]*pi[i]);
#       target += multinomial_lpmf(datamatrix[j,] | p[j]);
#     }
#   }
# }

Variational inference in Stan
Now we implement variational inference (VI) to learn parameters of interest. Note how the data are passed in as a named list, the algorithm specified, and the number of samples to be extracted from the estimated posterior specified (“output samples”). For more, see the Stan documentation.
ptm <- proc.time()
fitstan_VI <- vb(DM,
                 data = list("datamatrix" = com,
                             "nreps" = nrow(com),
                             "notus" = ncol(com),
                             "N" = 2,
                             #Rows 1 to nreps/2 were simulated from assemblage 1,
                             #rows (nreps/2)+1 to nreps from assemblage 2
                             "start" = c(1, (nreps / 2) + 1),
                             "end" = c(nreps / 2, nreps)),
                 algorithm = "meanfield",
                 output_samples = 500,
                 check_data = TRUE,
                 seed = 123,
                 pars = "pi")
viTime <- c(proc.time() - ptm)[3]

Variational inference took 5.959 seconds.
Hamiltonian Monte Carlo sampling in Stan
Now we implement Hamiltonian Monte Carlo (HMC) using the No-U-Turn Sampler (NUTS). Note that the number of chains and cores can be specified (use one core per chain). “warmup” controls model burn in (and should probably be increased for larger data sets). “iter” controls total iterations per chain, including warmup, so the difference between iter and warmup specifies how many post-warmup iterations are run. “thin” specifies the interval at which samples are saved (if thin=2 then every other sample will be saved), so the number of saved samples per chain is this difference divided by thin. For more, see the Stan documentation.
ptm <- proc.time()
fitstan_HMC <- sampling(DM,
                        data = list("datamatrix" = com,
                                    "nreps" = nrow(com),
                                    "notus" = ncol(com),
                                    "N" = 2,
                                    #Rows 1 to nreps/2 were simulated from assemblage 1,
                                    #rows (nreps/2)+1 to nreps from assemblage 2
                                    "start" = c(1, (nreps / 2) + 1),
                                    "end" = c(nreps / 2, nreps)),
                        chains = 2,
                        warmup = 500,
                        iter = 1000,
                        thin = 2,
                        algorithm = "NUTS",
                        cores = 1,
                        pars = "pi",
                        verbose = TRUE)
hmcTime <- c(proc.time() - ptm)[3]

Hamiltonian Monte Carlo took 278.297 seconds. Note that this time could be reduced by optimizing “warmup” and “iter”, running each chain on a different core, and providing sensible initialization values. When optimizing run time, be sure to check model convergence statistics to ensure that convergence upon a stable posterior probability distribution has been achieved.
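The number of posterior samples retained under these settings follows from simple arithmetic, sketched below. The calculation assumes that the difference between iter and warmup is evenly divisible by thin.

```r
# Sampler settings used in the HMC example.
chains <- 2
warmup <- 500
iter   <- 1000
thin   <- 2

# Post-warmup samples saved per chain, and in total across chains.
draws_per_chain <- (iter - warmup) / thin
total_draws <- chains * draws_per_chain
total_draws  # 500
```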
Stan estimation diagnostics
Checking model convergence can be done easily for HMC, but at the time of writing there was no simple way to test effectiveness of VI.
For HMC, the number of effective samples and the potential scale reduction factor (R̂) can be checked using the following code.
summary(fitstan_HMC, pars = "pi", probs = c(0.025, 0.975))$summary

The shinystan application is an excellent interface to dig deeper into model performance. See https://mc-stan.org/users/interfaces/shinystan
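When there are many parameters, scanning the summary by eye becomes impractical. The sketch below flags parameters from a summary matrix that has “n_eff” and “Rhat” columns, as the rstan summary does. The function name, the mock matrix, and the cutoffs (Rhat above 1.1, fewer than 100 effective samples) are assumptions for illustration and should be adjusted to the analysis at hand.

```r
# Return the names of parameters that fail either (assumed) cutoff.
flag_poor_mixing <- function(summ, rhat_max = 1.1, neff_min = 100) {
  bad <- summ[, "Rhat"] > rhat_max | summ[, "n_eff"] < neff_min
  rownames(summ)[which(bad)]
}

# Mock summary matrix standing in for summary(fit)$summary.
mock_summary <- matrix(c(250, 1.00,
                          40, 1.01,
                         230, 1.30),
                       ncol = 2, byrow = TRUE,
                       dimnames = list(c("pi[1,1]", "pi[1,2]", "pi[1,3]"),
                                       c("n_eff", "Rhat")))

# The second parameter has too few effective samples and the third has a
# high Rhat, so both are flagged.
flag_poor_mixing(mock_summary)
```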
Model specification and MCMC samples in JAGS
Now we use a very similar specification of the model for the JAGS software to estimate π parameters of the Dirichlet.
Model specification is as follows:
community.model.level <- "model{
  for(i in 1:N){
    for(j in start[i]:end[i]){
      datamatrix[j,] ~ dmulti(p[j,], nreads[j])
      p[j,1:notus] ~ ddirch(pi[i,]*theta[i])
    }
    pi[i,1:notus] ~ ddirch(alpha)
    theta[i] ~ dunif(0, 4000)
  }
  for(k in 1:notus){
    alpha[k] <- 0.0000001
  }
}"

Compile and run the model.
ptm <- proc.time()
sim.mod.jags <- jags.model(
  textConnection(community.model.level),
  data = list(
    datamatrix = com,
    notus = dim(com)[2],
    nreads = rowSums(com),
    N = 2,
    #Rows 1 to nreps/2 were simulated from assemblage 1,
    #rows (nreps/2)+1 to nreps from assemblage 2
    start = c(1, (nreps / 2) + 1),
    end = c(nreps / 2, nreps)
  ),
  n.chains = 2,
  n.adapt = 0
)

#Adapt model
iter_needed <- 0
y <- FALSE
while(y == FALSE){
  y <- adapt(sim.mod.jags, n.iter = 1000, end.adaptation = FALSE)
  iter_needed <- 1000 + iter_needed
  if(iter_needed > 4000){break}
}

#Burn in
update(sim.mod.jags, n.iter = 3000)

#Extract samples
sim.mod.sam <- jags.samples(model = sim.mod.jags,
                            variable.names = "pi",
                            n.iter = 4000,
                            thin = 4)
jagsTime <- c(proc.time() - ptm)[3]

JAGS took 77.208 seconds. This time could possibly be reduced by optimizing burn in and adaptation and providing sensible initialization values.
To test for MCMC convergence, one can use the functions within the coda R package. Be advised that statistics should be calculated parameter-wise when there are many parameters; otherwise memory requirements become burdensome. The following function can be used to accomplish this task.
#Compute the Gelman-Rubin and Geweke statistics
mcmcdiag <- function(x, nparams) {
  #x is an mcmc object
  #nparams is number of params in the object
  Gr <- vector(length = nparams)
  GK <- vector(length = nparams)
  k <- 1
  a <- character(0)
  while (k <= nparams) {
    m <- x[1:length(x)][, k]
    gr <- gelman.diag(m)
    print(paste("Feature", k, sep = " "))
    print("Gelman-Rubin")
    print(gr)
    if (gr[[1]][1] <= 2) {
      Gr[k] <- "passed"
    } else{
      Gr[k] <- "failed"
    }
    gk <- geweke.diag(m, frac1 = 0.1, frac2 = 0.5)
    suspectGK <- names(which(2 * pnorm(-abs(gk[[1]]$z)) < 0.08))
    if (identical(a, suspectGK)) {
      GK[k] <- "passed"
    } else if (suspectGK == "var1") {
      GK[k] <- "failed"
    }
    k <- k + 1
  }
  return(list(Gr, GK))
}

diagout <- mcmcdiag(as.mcmc.list(sim.mod.sam$pi), dim(com)[2])

We have noticed that for large datasets (many thousands of parameters), JAGS can require many days to achieve convergence. By comparison, HMC is much faster. To avoid impractically long run times, VI may be the only viable option for extremely large data sets.
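For readers who want to see what the Gelman-Rubin statistic measures, the sketch below computes a basic version for a single parameter in base R. It is a simplified illustration: `rhat_sketch` is a hypothetical helper, the chains are simulated, and the value returned by `gelman.diag` in coda includes an additional correction, so the two will differ slightly.

```r
# Basic potential scale reduction factor for ONE parameter.
# chains: matrix with iterations in rows and chains in columns.
rhat_sketch <- function(chains) {
  n <- nrow(chains)
  W <- mean(apply(chains, 2, var))   # mean within-chain variance
  B <- n * var(colMeans(chains))     # between-chain variance
  sqrt(((n - 1) / n * W + B / n) / W)
}

set.seed(5)
# Two chains sampling the same distribution: statistic is close to one.
mixed <- cbind(rnorm(1000), rnorm(1000))
# Two chains stuck in different regions: statistic is well above one.
stuck <- cbind(rnorm(1000, mean = 0), rnorm(1000, mean = 3))

rhat_sketch(mixed)
rhat_sketch(stuck)
```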
Use of parameter estimates
Now we extract π parameters from each sampling group and subtract them. The location of zero within this distribution quantifies the probability of no effect of sampling group. If desired, the mean of this distribution of differences can be extracted and used as a point estimate for the effect of sampling group, though we advocate for using samples characterizing the entire distribution for analyses whenever possible, thus utilizing our measures of uncertainty. We present a simple function to determine if 95% or more of the distribution of differences lies on either side of zero. If so, then we suggest this is high certainty of an effect of sampling group on the relative abundance of that feature.
calc_certain_diffs <- function(mcmc_of_diffs, dimension){
  positives <- vector()
  negatives <- vector()
  for(i in 1:dim(mcmc_of_diffs)[dimension]){
    if(dimension == 2){
      perc <- length(which(mcmc_of_diffs[,i] > 0)) / length(mcmc_of_diffs[,i])
    }else{
      perc <- length(which(mcmc_of_diffs[i,] > 0)) / length(mcmc_of_diffs[i,])
    }
    if(perc >= 0.95 | perc <= 0.05){
      positives <- c(positives, i)
    }else{
      negatives <- c(negatives, i)
    }
  }
  return(list(positives = positives, negatives = negatives))
}

est.pi <- extract(fitstan_HMC, "pi")
diffs_HMC <- est.pi$pi[,1,] - est.pi$pi[,2,]
outHMC <- calc_certain_diffs(diffs_HMC, 2)

est.pi <- extract(fitstan_VI, "pi")
diffs_VI <- est.pi$pi[,1,] - est.pi$pi[,2,]
outVI <- calc_certain_diffs(diffs_VI, 2)

diffs_jags <- sim.mod.sam$pi[1,,,1:2] - sim.mod.sam$pi[2,,,1:2]
outJAGS <- calc_certain_diffs(cbind(diffs_jags[,,1], diffs_jags[,,2]), 1)

Next we make a plot to determine which features shifted in relative abundance. Points correspond to estimated differences in feature relative abundance between sampling groups. The blue dots correspond to those features that we expected to shift. Lines extending from each point denote 95% highest density intervals, and are colored purple for those features suggested to differ.
#Code from Kruschke's Doing Bayesian Data Analysis book (cited in main text).
HDIofMCMC = function(sampleVec, credMass=0.95) {
  # Computes highest density interval from a sample of representative values,
  #   estimated as shortest credible interval.
  # Arguments:
  #   sampleVec
  #     is a vector of representative values from a probability distribution.
  #   credMass
  #     is a scalar between 0 and 1, indicating the mass within the credible
  #     interval that is to be estimated.
  # Value:
  #   HDIlim is a vector containing the limits of the HDI
  sortedPts = sort(sampleVec)
  ciIdxInc = ceiling(credMass * length(sortedPts))
  nCIs = length(sortedPts) - ciIdxInc
  ciWidth = rep(0, nCIs)
  for(i in 1:nCIs) {
    ciWidth[i] = sortedPts[i + ciIdxInc] - sortedPts[i]
  }
  HDImin = sortedPts[which.min(ciWidth)]
  HDImax = sortedPts[which.min(ciWidth) + ciIdxInc]
  HDIlim = c(HDImin, HDImax)
  return(HDIlim)
}

notus <- dim(com)[2]
colorPoints <- rep("black", notus)
colorPoints[c(1:5, (notus - 4):notus)] <- "blue"

#Plot differences in pis
plotr <- function(x, y, z, whatitis){
  plot(1:notus,
       apply(x, y, mean),
       cex = 1.5,
       ylim = c(-0.06, 0.06),
       ylab = "Difference in rel. abund.",
       xlab = "Feature",
       main = whatitis,
       pch = 16,
       col = colorPoints,
       las = 2)
  abline(h = 0, col = "red")
  segs <- apply(x, y, HDIofMCMC)
  colorLines <- rep("black", notus)
  colorLines[z$positives] <- "purple"
  segments(1:notus, segs[1,], 1:notus, segs[2,], col = colorLines)
}

par(mfrow = c(1, 3))
plotr(x = diffs_VI, y = 2, z = outVI, whatitis = "VI")
plotr(x = diffs_HMC, y = 2, z = outHMC, whatitis = "HMC")
plotr(x = diffs_jags, y = 1, z = outJAGS, whatitis = "JAGS")

DMM can be extended easily to encompass more than two sampling groups. Simply order the data (in a matrix or dataframe format) so that replicates from the same sampling group occupy neighboring rows. For instance, say one was analyzing measurements from eight sampling groups denoted numerically. One should order the associated data for these sampling groups so that they look like this:
exampleData <- round(runif(16,1,1000))
groups <- c(rep("group1",2), rep("group2",2), rep("group3",2), rep("group4",2),
            rep("group5",2), rep("group6",2), rep("group7",2), rep("group8",2))
cbind(exampleData, groups)
##       exampleData groups
##  [1,] "766"       "group1"
##  [2,] "347"       "group1"
##  [3,] "134"       "group2"
##  [4,] "686"       "group2"
##  [5,] "90"        "group3"
##  [6,] "539"       "group3"
##  [7,] "244"       "group4"
##  [8,] "799"       "group4"
##  [9,] "232"       "group5"
## [10,] "349"       "group5"
## [11,] "751"       "group6"
## [12,] "936"       "group6"
## [13,] "586"       "group7"
## [14,] "201"       "group7"
## [15,] "940"       "group8"
## [16,] "865"       "group8"

Then one can simply pass in the indices that describe which rows bound which group to the “start” and “end” portions of the function. For our toy example, the start indices would be:
c(1,3,5,7,9,11,13,15)
## [1]  1  3  5  7  9 11 13 15

and the end indices would be:
c(2,4,6,8,10,12,14,16)
## [1]  2  4  6  8 10 12 14 16

These values would then be substituted into the model and the “N” parameter changed to reflect the number of sampling groups used (in this case N = 8). Note that you cannot pass in the grouping column if it is included in your data. See above for another example of how data should be formatted.
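Rather than typing the indices by hand, they can be derived from the grouping vector once the rows have been ordered. The sketch below uses `rle` from base R; the variable names are illustrative and the approach assumes that all rows of a group are contiguous, which the code checks.

```r
# Grouping labels for rows that are already ordered by sampling group.
groups <- rep(paste0("group", 1:8), each = 2)

# Run-length encoding gives one run per group when rows are contiguous.
runs <- rle(groups)
stopifnot(!anyDuplicated(runs$values))  # fails if a group is split across runs

end <- cumsum(runs$lengths)
start <- end - runs$lengths + 1
N <- length(runs$values)

start
end
```

The call below uses the same indices written out by hand.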
fitstan_HMC <- sampling(DM,
                        data = list("datamatrix" = as.matrix(exampleData),
                                    "nreps" = 16,
                                    "notus" = 1,
                                    "N" = 8,
                                    "start" = c(1,3,5,7,9,11,13,15),
                                    "end" = c(2,4,6,8,10,12,14,16)),
                        chains = 2,
                        warmup = 500,
                        iter = 1000,
                        thin = 2,
                        algorithm = "NUTS",
                        cores = 1,
                        pars = "pi",
                        verbose = TRUE)

Acknowledgments
We wish to thank Claire Duvallet and the co-authors of Duvallet et al. (2019) for making their well-curated data available to the public. We specifically thank Dr. Duvallet for helpful interpretation regarding our reanalysis of her and her co-authors’ data. We also thank James Fordyce and two anonymous reviewers for helpful comments. This research was supported by the Microbial Ecology Collaborative at the University of Wyoming with funding from NSF award #EPS-1655726. Computing was performed in the Teton Computing Environment at the Advanced Research Computing Center, University of Wyoming, Laramie (https://doi.org/10.15786/M2FY47).