Abstract
Microbiota contribute to many dimensions of host phenotype, including disease. To link specific microbes to specific phenotypes, microbiome-wide association studies compare microbial abundances between two groups of samples. Abundance differences, however, reflect not only direct associations with the phenotype, but also indirect effects due to microbial interactions. We found that microbial interactions could easily generate a large number of spurious associations that provide no mechanistic insight. Using techniques from statistical physics, we developed a method to remove indirect associations and applied it to the largest dataset on pediatric inflammatory bowel disease. Our method corrected the inflation of p-values in standard association tests and showed that only a small subset of associations is directly linked to the disease. Direct associations had a much higher accuracy in separating cases from controls and pointed to immunomodulation, butyrate production, and the brain-gut axis as important factors in the inflammatory bowel disease.
Introduction
Microbes are essential to any ecosystem be it the ocean or the human gut. The sheer impact of microbial processes has however been underappreciated until the advent of culture-independent methods to assess entire communities in situ. Metagenomics and 16S rRNA sequencing identified significant differences in microbiota among hosts, and experimental manipulations established that microbes could dramatically alter host phenotype [1–8]. Indeed, anxiety, obesity, colitis, and other phenotypes can be transmitted between hosts simply by transplanting their intestinal flora [9–13].
New tools and greater awareness of microbiota triggered a wave of association studies between microbiomes and host phenotypes. Microbiome wide association studies (MWAS) have been carried out for diabetes, arthritis, cancer, autism and many other disorders [14–23]. MWAS clearly established that each disease is associated with a distinct state of intestinal dysbiosis, but they of-ten produced conflicting results and identified a very large number of associations both within and across studies [14, 19, 21, 23–26]. For example, a recent study on inflammatory bowel disease (IBD) reported close to 100 taxa associated with IBD [25], a number that is fairly typical [14]. Such long lists of associations defy simple interpretation and complicate mechanistic follow-up studies because one needs to examine the role of almost every species in the microbiota. In fact, one can argue that MWAS are most useful when they can identify a small network of taxa driving the disease.
Although extensive dysbiosis might reflect the multifactorial nature of the disease, it is also possible that MWAS detect spurious associations because their statistical methods fail to account for some important aspects of microbiome dynamics. One such aspect is the pervasive nature of microbial interactions: species compete for similar resources, rely on cross-feeding for survival, and even produce their own antibiotics [27–37]. Hence, microbial abundances must be correlated with each other, and even a simple change in host phenotype could manifest as collective responses by the microbiota. Traditional MWAS, however, completely neglect this possibility because they treat each change in abundance as an independent manifestation of altered host phenotype. As a result, MWAS cannot distinguish taxa directly linked to disease from taxa that are affected only through their interactions with other species.
The main conclusion of this paper is that realistic microbial interactions produce a large number of spurious associations. Many of these indirect associations can be removed by a simple procedure based on maximum entropy models from statistical physics, which can separate host effects from the microbial interactions. We dubbed this approach Direct Association Analysis, or DAA for short.
When applied to the largest MWAS on IBD, DAA shows that many of the previously reported associations could be explained by interspecific interactions rather than the disease. At the genus and species level, the direct associations include only Roseburia, Faecalibacterium prausnitzii, Bifidobacterium adolescentis, Blautia producta, Turicibacter, Oscillospira, Eubacterium dolichum, Aggregatibacter segnis, and Sutterella. Some of these associations are well-known, while others have received little attention in IBD research. The phenotypes of the taxa directly to disease suggest that immunomodulation, butyrate production, and the brain-gut interactions play important role in the etiology of IBD.
Compared to traditional MWAS, DAA corrected the inflation of p-values responsible for the large number of spurious associations and identified taxa most informative of the diagnosis. We found that directly associated taxa are much better at discriminating between cases from controls than an equally-sized subset of indirect associations. In fact, direct associations have the same potential to discriminate between health and disease as the entire set of almost a hundred associations detected by a conventional method.
Results
Traditional MWAS detect species with significantly different abundances between case and control groups. Some changes in the abundances are directly associated with the disease while others are due to microbial interactions. The emergence of indirect changes in abundance is illustrated in Fig. 1A for a hypothetical network of five species. Only two species A and D are directly linked to the disease. However, strong interactions make the abundances of all five species differ between control and disease groups. For example, the mutualistic interaction between A and B helps B grow to a higher density following the increase in the abundance of A. The expansion of B in turn inhibits the growth of C and reduces its abundance in disease. Strong mutualistic, competitive, commensal, and parasitic interactions have been demonstrated in microbiota [27–37], and Fig. 1B shows that almost every species present in the human gut participates in a strong interaction. Thus, the propagation of abundance changes from directly-linked to other species could pose a significant challenge for MWAS. To test this hypothesis, we turned to a minimal mathematical model of microbiota composition.
Maximum entropy model of microbiota composition
A quantitative description of interspecific interactions and their effect on MWAS requires a statistical model of host-associated microbial communities. Ideally, such a model would describe the probability to observe any microbial composition, but the amount of data even in large studies is only sufficient to determine the means and covariances of microbial abundances. This situation is common in the analysis of biological data and has been successfully managed with the use of maximum entropy distributions [38]. These distributions are chosen to be as random as possible under the constraints imposed by the first and second moments. Maximum entropy models introduce the least amount of bias and reflect the tendency of natural systems to maximize their entropy [39]. In other contexts, these models have successfully described the dynamics of neurons, forests, flocks, and even predicted protein structure and function [40–44]. In the context of microbiomes, a recent work derived a maximum entropy distribution for microbial abundances using the principle of maximum diversity [45].
We show in the Supplemental information (SI) that the maximum entropy distribution of microbial abundances P ({li}) takes the following form where li is the log-transformed abundance of species i, hi represents the direct effect of the host phenotype on species i, and Jij describes the interaction between species i and j; the factor of 1/Z is the normalization constant. The log-transformation of relative abundances alleviates two common difficulties with the analysis of the microbiome data. The first difficulty is the large subject-to-subject variation, which is much better captured by a log-normal rather than a Gaussian distribution; see Fig. S1, SI, and Ref. [25]. The second difficulty arises from the fact that the relative abundances must add up to one. This constraint is commonly known as the compositional bias because it leads to artifacts in the statistical analysis. The log-transformation is an essential step in most methods that account for the compositional bias [46–48], and, in the SI, we show that all of our conclusions are robust to the variation in the strength of the compositional bias.
Testing for spurious associations in synthetic data
We obtained realistic model parameters from one of the largest case-control studies previously reported in Ref. [21]. The samples were obtained from mucosal biopsies of 275 newly diagnosed, treatment-naive children with Crohn’s disease (a subtype of IBD) and 189 matched controls. Microbiota composition was determined by 16S rRNA sequencing with about 30,000 reads per sample. From this data, we inferred the interaction matrix J and the typical changes in microbial abundances associated with the disease for 47 most prevalent genera (Methods and SI). Even though the number of data points significantly exceeds the number of free parameters in the model, overfitting could still be a potential concern. Overfitting, however, is unlikely to affect our main conclusions because they depend only on the overall statistical properties of J rather than on the precise knowledge of every interaction. In fact, none of our results changed when we analyzed only about half of the data set (Fig. 2). To improve the quality and robustness of the inference procedure, we also used the spectral decomposition of J to remove any interaction patterns that were not strongly supported by the data; see Methods and SI for further details.
To determine the effect of microbial interactions on conventional MWAS analysis, we generated synthetic data with a known number of direct associations. The data for the control group was used without modification from Ref. [21]. The disease group was generated using Eq. (5) with the same values of h and J as in the control group, except we modified the values of h for 6 representative genera (Tab. S2). We also generated two other synthetic data sets with smaller and larger effect sizes (Tab. S2). The results for all three data sets were very similar (SI).
The synthetic data was further subsampled to several sample sizes in order to simulate variation in statistical power between different studies. For an ideal method, the number of detected associations should increase with the cohort size, but eventually saturate once all 6 directly associated genera are discovered. In contrast to this expectation, the number of associations detected by the conventional approach increased rapidly with the sample size until almost all genera were found to be statistically associated with the disease in our synthetic data. At this point, traditional MWAS completely lost the power to identify the link between the phenotype and microbiota. Unbounded growth in the number of detections was also observed for the real data (Fig. 2C) suggesting that many previously reported associations between microbiota and IBD could be indirect.
Are spurious associations simply an artifact of our ability to detect even minute differences between cases and controls? Fig. 2B and 2D show that this was not the case. The median effect size declined only moderately with the number of associations, and most associations corresponded to about a factor of two difference in the taxon abundance. Thus, spurious associations are not weak and could not be discarded based on their effect size.
Direct association analysis (DAA)
Fortunately, the maximum entropy model provides a straightforward way to separate direct from indirect associations. Since direct effects are encoded in h, MWAS should be performed on h rather than on l. This simple change in the statistical analysis correctly recovered 4 out of 6 directly associated taxa in the synthetic data and yielded no indirect associations even for large cohorts (Fig. 2A and S5). Similarly good performance was found for the two other synthetic data sets (Fig. S7). For the IBD data, DAA also identified a much smaller number of associations compared to traditional MWAS analysis and showed clear saturation at large sample sizes (Fig. 2B). Direct associations with IBD are summarized in Fig. 3 at the genus and species levels, and the entire phylogenetic tree of direct associations is shown in Fig. S2 and Tabs. S3 and S4.
To demonstrate that DAA isolates direct effects from collective changes in the microbiota, we examined the p-value distribution in this method. The distribution of p-values is commonly used as a diagnostic tool to test whether a statistical method is appropriate for the data. In the absence of any associations, p-values must follow a uniform distribution because the null hypothesis is true [54]. A few strong deviations from the uniform distribution signal true associations [55]. In contrast, large departures from the uniform distribution typically indicate that the statistical method does not account for some properties of the data, for example, population stratification in the context of genome wide association studies [56, 57]. Figure 4A compares the distribution of p-values for DAA and a conventional method in MWAS. Consistent with our hypothesis that interspecific interactions cannot be neglected, conventional analysis generates an excess of low p-values and, as a result, a large number of potentially indirect associations. In contrast, the distribution of p-values from DAA matches the expected uniform distribution and, thus, provides strong support for our method.
Finally, we show that indirect association excluded by DAA do not reduce the predictive power of microbiome data. Supervised machine learning such as random forest [58, 59], support vector machine [60], and sparse logistic regression [61–63] were used to classify samples as cases or controls based on their microbiota profile. We found good and identical performance of the classifiers trained either on all taxa detected by conventional MWAS or on a much smaller subset of direct associations detected by DAA (Fig. 4B). Moreover, the DAA-based classifier showed significantly better performance compared to a classifier trained on an equal number of randomly-selected indirect associations (Fig. 4B). Thus, DAA reduces the number of associations without losing any information on the disease status and selects taxa with the greatest potential to distinguish health from disease.
Discussion
The primary goal of MWAS is to guide the study of disease etiology by detecting microbes that have a direct effect on the host. These direct effects could be very diverse and include secretion of toxins, production of nutrients, stimulation of the immune system, and changes in mucus and bile [64, 65]. In addition to the host-microbe interactions, the composition of microbiota is also influenced by the interspecific interactions among the microbes such as competition for resources, cross-feeding, and production of antibiotics. In the context of MWAS, microbial interactions contribute to indirect changes in microbial abundances, which are less informative of the disease mechanism and are less likely to be valuable for follow-up studies or in interventions. Here, we estimated the relative contribution of indirect associations to MWAS and showed how to isolate direct from indirect assoctiations.
Our main result is that interspecific interactions are sufficiently strong to generate detectable changes in the abundance of many microbes that are not directly linked to host phenotype. As a result, conventional approaches to MWAS detect a large number of spurious associations and produce inflated p-values that do not match their expected distribution (Fig. 4A). These challenges are resolved by Direct Association Analysis (DAA), which uses maximum entropy models to explicitly account for interspecific interactions. We applied DAA to a large data set of pediatric Crohn’s disease and found that it restores the distribution of p-values and substantially simplifies the pattern of dysbiosis while retaining full classification power of a conventional MWAS.
The relatively simple dysbiosis identified by DAA in IBD has strong support in the literature and offers interesting insights into disease etiology. Four of the taxa identified by our method have a well-established role in IBD: B. adolescentis, F. prausnitzii, B. producta, and Roseburia. They have been repeatedly found to have lower abundance in both Crohn’s disease and ulcerative colitis [66–73], and several studies have demonstrated their ability to suppress inflammation and alleviate colitis [69, 74–78]. Bifidobacterium species occupy a low trophic level in the gut and ferment complex polysaccharides such as fiber [79, 80]. Fermentation products include lactic acid, which promotes barrier function, and maintains a healthy, slightly acidic environment in the colon [81]. Due to these properties Bifidobacterium species are commonly used as probiotics [79]. F. prausnitzii, Blautia producta and Roseburia occupy a higher trophic level and ferment the byproducts of polysaccharide digestion into short-chain fatty acids (SCFA), which are an important energy source for the host [68,69, 82, 83].
The ability of DAA to detect taxa strongly associated with IBD is reassuring, but not surprising. What is surprising is that many strong associations are classified as indirect by our method. For example, Roseburia and Blautia are the only genera of Lachnospiraceae that DAA finds to be directly linked to the disease. In sharp contrast, traditional MWAS report seven genera in this family that are strongly associated with IBD [25]. All seven genera are involved in SCFA metabolism, but their specializations differ. Species in Blautia genus are major producers of acetate, a SCFA that is commonly involved in microbial crossfeeding [84, 85]. In particular, many species extract energy from acetate by converting it into butyrate, another SCFA that plays a major role in gut health by nourishing colonocytes and regulating the immune function [82, 85]. Roseburia genus specializes almost exclusively in the production of butyrate and acts as a major source of butyrate for the host [82, 86]. Thus, our findings suggest that butyrate production plays an important role in IBD etiology and that the dysregulation of this process is directly linked to the depletion of Roseburia and possibly Blautia.
The important role of butyrate is further supported by our detection of E. dolichum and Oscillospira, which are known to produce butyrate [87–89]. The latter taxon has not been detected in three independent analyses of this IBD data set [21, 25, 90] presumably because its involvement was masked by indirect associations and interactions with other microbes. Indeed, several other studies found that Oscillospira is suppressed in IBD [91],[92]. Oscillospira was also found to be positively associated with leanness and negatively associated with the inflammatory liver disease [93–95]. The interactions between Oscillospira and the host appears to be quite complex and involve the consumption of host-derived glycoproteins including mucin, production of SCFA, and modulation of bile-acid metabolism [89, 96, 97]. The latter interaction was suggested to be a major factor in the protective role of Oscillospira against infections with Clostridium difficile [96, 98, 99].
The final taxon that was suppressed in IBD is Turicibacter. This genus is not very well characterized, and few MWAS studies point to its involvement in IBD [21, 25, 100]. Two studies in animal models, however, directly looked into the connection between IBD and Turicibacter [101, 102]. The first study found that iron limitation eliminates colitis in mice while at the same time restoring the abundance of Turicibacter, Bifidobacterium, and four other genera [101]. The second study study identified Turicibacter as the only genus that is fully correlated with immunological differences between mice resistant and susceptible to colitis: high abundance of Turicibacter in the colon predicted high levels of MZ B and iNK T cells, which are potent regulators of the immune response [102]. Moreover, Turicibacter was the only genus positively affected by the reduction in CD8+ T cells. Thus, our method identified a taxon that is potentially directly linked to IBD via the modulation of the immune system.
Perhaps the most unexpected finding was our detection of Aggregatibacter and Sutterella as the only genera increased in disease compared to 26 positive associations detected by the previous analysis [25]. All other associations were classified as indirect even though they often corresponded to much more significant changes in abundance between IBD and control groups. Thus, our results indicate that expansion of many taxa including opportunistic pathogens is driven by their interactions with the core IBD network shown in Fig. 3. One possibility is that the dysbiosis of the symbiotic microbiota makes it less competitive against other bacteria and opens up niches that can be colonized by opportunistic pathogens. The other, less explored possibility, is that commensal microbiota can not only protect from pathogens, but also facilitate their invasion, a phenomenon that has been recently demonstrated in bees [103].
Little is known about the specific roles that Aggregatibacter and Sutterella play in IBD, and more generally in gut health. Aggregatibacter is a common member of the oral microbiota that thrives in local infections such as periodontal disease and bacterial vaginosis [104–106]. The high abundance of Aggregatibacter is also associated with an increased risk of IBD recurrence [107]. Sutterella, on the other hand lacks overt pathogenicity, and MWAS produced inconsistent findings [108–114] on its involvement in IBD. Some studies reported that Sutterella is increased in patients with good outcomes [21, 111] while other studies found positive or no association between Sutterella and IBD [25, 109, 112–114]. Experimental investigations showed that Sutterella lacks many pathogenic properties; in particular, it does not induce a strong immune-response and has only moderate ability to adhere to mucus [113, 114]. Further, Sutterella strains from IBD and control patients showed no phenotypic differences in metabolomic, proteomic, and immune response assays [114]. Nevertheless, Sutterella is strongly associated with worse behavioral scores in children with autism spectrum disorder and Down syndrome [19, 20, 115]. Therefore, the direct link between Sutterella and IBD could involve the gut-brain axis.
In summary, we found a small number of taxa can explain extensive dysbiosis in IBD and accurately predict disease status. Directly associated taxa include strains with dramatically different abilities to trigger colitis and are specifically targeted by the immune system of patients and animals with IBD [12]. Previous studies of these taxa point to facilitated colonization by pathogens, butyrate production, immunomodulation, bile metabolism, and the gut-brain axis as the primary factors in the etiology of IBD.
Many disorders are accompanied by substantial changes in host microbiota, but our work shows that only a small subset of these changes could be directly related to the disease. Similarly, only a handful of taxa could drive the dynamics of ecosystem-level changes in the environment. To untangle the complexity of such dysbioses, it is important to account for microbial interactions using mechanistic or statistical methods. Direct association analysis is a simple statistical approach based on the principle of maximum entropy. It can be applied to any microbiome data set that is sufficiently large to infer interspecific interactions.
Methods
The data used in this study was obtained from Ref. [21], which reported changes in the microbiome of newly-diagnosed, treatment-naive children with IBD compared to controls. This data was recently analyzed in Ref. [25], and we followed all the statistical procedures adopted in that study to enable direct comparison of the results. Specifically, we used a permutation test on mean log-transformed abundances to determine the statistical significance of an association.
All computation was carried out in Python environment. We used scikit-learn 0.15.2 [116] for hierarchical clustering and to build the supervised classifiers used in Fig. 4B of the main text and Fig. S3. The variance in the accuracy of classification was evaluated through 5-fold stratified cross-validation with 100 random partitions of the data into the training and validation sets. For all findings, statistical significance was evaluated with Fisher’s exact test (permutation test) with 106 permutations. False discovery rate was controlled to be below 5% following Benjamini-Hochberg method [54].
To fit the maximum entropy model to the data, we first computed the mean log-abundance for each genus mi and the covariance in the log-transformed abundances Cij. The interaction matrix was computed as J = C-1 by performing singular value decomposition [117] and removing all singular values that were comparable to the amount of noise present in the data. The host effects were computed as h = J m. See Supplementary Methods for further details.
Acknowledgements
This work was supported by a grant from the Simons Foundation (#409704, Kirill S. Korolev) and by the startup fund from Boston University to Kirill S. Korolev. Vivek Ramanan was also supported by NSF grant DBI-1559829 through BRITE Bioinformatics REU Program at Boston University. Rajita Menon was partly supported by the Graduate Fellowship from the Rafik B. Hariri Institute for Computing and Computational Sciences & Engineering. Simulations were carried out on Shared Computing Cluster at Boston University.
References
- [1].↵
- [2].
- [3].
- [4].
- [5].
- [6].
- [7].
- [8].↵
- [9].↵
- [10].
- [11].
- [12].↵
- [13].↵
- [14].↵
- [15].
- [16].
- [17].
- [18].
- [19].↵
- [20].↵
- [21].↵
- [22].
- [23].↵
- [24].
- [25].↵
- [26].↵
- [27].↵
- [28].
- [29].
- [30].↵
- [31].
- [32].
- [33].
- [34].
- [35].
- [36].
- [37].↵
- [38].↵
- [39].↵
- [40].↵
- [41].↵
- [42].↵
- [43].↵
- [44].↵
- [45].↵
- [46].↵
- [47].
- [48].↵
- [49].↵
- [50].
- [51].
- [52].
- [53].↵
- [54].↵
- [55].↵
- [56].↵
- [57].↵
- [58].↵
- [59].↵
- [60].↵
- [61].↵
- [62].
- [63].↵
- [64].↵
- [65].↵
- [66].↵
- [67].
- [68].↵
- [69].↵
- [70].
- [71].
- [72].
- [73].↵
- [74].↵
- [75].
- [76].
- [77].
- [78].↵
- [79].↵
- [80].↵
- [81].↵
- [82].↵
- [83].↵
- [84].↵
- [85].↵
- [86].↵
- [87].↵
- [88].
- [89].↵
- [90].↵
- [91].↵
- [92].↵
- [93].↵
- [94].
- [95].↵
- [96].↵
- [97].↵
- [98].↵
- [99].↵
- [100].↵
- [101].↵
- [102].↵
- [103].↵
- [104].↵
- [105].
- [106].↵
- [107].↵
- [108].↵
- [109].↵
- [110].
- [111].↵
- [112].↵
- [113].↵
- [114].↵
- [115].↵
- [116].↵
- [117].↵
- [118].↵
- [119].↵