Abstract
The role of microbial interactions in shaping the properties of microbiota is a topic of key interest in microbial ecology. Microbiota contain hundreds to thousands of operational taxonomic units (OTUs), most of which are rare. This feature of community structure can lead to methodological difficulties: simulations have shown that methods for detecting pairwise associations between OTUs (which presumably reflect interactions) yield problematic results. In particular, the performance of association detection tools is impaired when the OTU table contains a high proportion of zeros. Here, we explored the statistical testability of such associations given occurrence and read abundance data. The goal was to understand the impact of OTU rarity on the testability of correlation coefficients. We found that a large proportion of pairwise associations, especially negative associations, cannot be reliably tested. This constraint could hamper the identification of candidate biological agents that could be used to control rare pathogens. Consequently, identifying testable associations could serve as an objective method for trimming datasets (in lieu of current empirical approaches). This trimming strategy could significantly reduce computation time and improve the inference of association networks. When OTU prevalence is low, association measures for occurrence and read abundance data are correlated, raising questions about the information actually being captured.
Introduction
Microbiota play key roles in ecosystem processes, from eukaryote physiology [1] to global biogeochemical cycles [2]. Research often focuses on comparing microbiota found in similar environments to identify the major forces shaping their structure [3] and function [4]. Microbial interactions are probably one such force [5, 6].
The most common technique for describing microbiota is 16S rRNA sequencing [7]. Association network analysis is then often employed to characterize potential microbial interactions [8]. Association networks require identifying pairwise associations between the occurrence or abundance of bacterial operational taxonomic units (OTUs) [9].
However, microbiota frequently contain hundreds to thousands of OTUs, most of which are rare [10, 11]. A typical matrix describing the abundance of OTUs among similar microbiota thus includes a high proportion of zeros. Simulations have illustrated that an excess of zeros impairs the efficiency of association network analysis [12, 13]. To avoid this, rare OTUs are filtered out before such analysis. Current trimming procedures are empirical in nature and restrictive. They may rely on OTU prevalence [12, 14], mean abundance [15], or diversity [16]. Moreover, simulations suggested that association network analyses more efficiently detect negative relationships (amensal, competitive) than positive ones (i.e., mutualistic, commensal) [12]. It is not yet clear whether this is due to the distribution of OTU prevalence.
Precisely defining the conditions under which positive and negative associations can be reliably tested can help improve current studies of microbial interactions. It can guide the design of studies with sufficient statistical power, reveal avenues for improving data analysis, and inform the interpretation of association networks in light of the limitations of these approaches.
Below, we analyze the effect of the low OTU prevalences observed in microbiota on association measures based on occurrence data and read abundance data. More specifically, we computed the extrema of correlation coefficients as a function of prevalence for both data types. These extrema were used to define which associations between OTUs could be reliably tested. Finally, we compared the results obtained from both data types. This allowed us (i) to define to what extent prevalence and sample size affect association studies in microbiota, (ii) to demonstrate that negative interactions cannot be captured in most cases, and (iii) to show that the added value of analyzing abundance data rather than occurrence data is limited. The results are discussed in light of current analysis procedures and tools, to identify potential solutions to the issues we highlight.
Materials and methods
The distributions of association statistics are affected by prevalence, which creates problems when testing correlation coefficients. For instance, the statistic’s minimum and/or maximum can fall within the expected confidence interval obtained from the classical distributions used to approximate expected values. This issue can arise with both occurrence and abundance data.
Model for occurrence data
First, we explored how to define testability when occurrence data are used. For tests based on the expected distribution of an association statistic, we employed the Phi coefficient ϕ [17], a measure of association between two binary variables XA and XB:
$$\phi = \frac{P_{11} - P_A P_B}{\sqrt{P_A P_B (1 - P_A)(1 - P_B)}}$$
where PA and PB are the prevalence values of the two OTUs, XA and XB, and P11 is the prevalence of their co-occurrence. The extrema of Phi [18] depend exclusively on PA and PB (S1 Fig).
Under the null hypothesis (H0) that the occurrences of XA and XB are independent, the distribution of Phi can be approximated using Pearson’s chi-squared test:
$$N \phi^2 \sim \chi^2$$
where N is the total number of samples and χ2 is a chi-squared distribution with 1 degree of freedom [19]. This distribution is thus used to build a confidence interval to test departure from the independence hypothesis. Furthermore, we can describe cases where associations cannot be tested reliably based on this confidence interval, because the true minimum and/or maximum of ϕ fall within it.
For occurrence data, limitations on testability can also be studied with exact tests based on the possible combinations of associations (such as Fisher’s exact test, see Part 2.7 in S1 Appendix). For fixed prevalences, the probability of observing the minimum or maximum number of co-occurrences may be higher than the alpha level (conventionally 5%) [20, 21]. In such a case, a negative or positive association, respectively, cannot be significantly detected.
Model for read abundance data
Second, we explored how to define testability when read abundance data are used. We employed the Pearson correlation coefficient [22], a measure of association between two continuous variables, XA and XB.
We demonstrated that the minimum of the Pearson correlation coefficient depends only on OTU prevalence (see proof in Part 3 in S1 Appendix and illustration in S2 Fig).
We can then set up a confidence interval from the following assumption. When XA and XB follow two uncorrelated normal distributions,
$$t = r \sqrt{\frac{N - 2}{1 - r^2}}$$
where t has a Student’s t-distribution with N − 2 degrees of freedom.
We provide a detailed description of how prevalence affects the testability of associations for both data types in the supplementary material (S1 Appendix).
To estimate the proportion of unreliable tests, we considered two distributions for OTU prevalence: (i) a uniform law, to study the influence of sample size N and of the prevalences PA and PB; and (ii) a truncated power law, to take into account the real structure of microbiota data. We also compared the testability limits for the two types of data and highlighted a correlation between the two association measures.
Results
Testability given a uniform distribution of prevalence
When occurrence data were used, four inequations (Eqs (7–10) in S1 Appendix) defined reliable tests based on the chi-squared distribution depending on OTU prevalences (Fig 1A). The proportion of non-testable associations (i.e., neither positive nor negative correlations could ever be significant) rapidly fell as N increased (Fig 1B). The proportion of associations with partial testability (i.e., only positive or only negative correlations could ever be significant) never exceeded 0.25 (Fig 1B). When N = 300, the proportion of fully testable associations (both positive and negative correlations could be significant) exceeded 0.80 (Fig 1B). Simulations showed that the proportion of Fisher’s exact tests affected by prevalence was similar to the analytical results presented above (S3 Fig).
When read abundance data were used, some negative correlations were not testable based on the Student’s distribution (Eq (33) in S1 Appendix and Fig 1C). This problem became less pronounced as N increased, and the proportion of testable associations reached 0.95 at N = 300 (Fig 1D).
Testability given observed community structure
Prevalence distributions are highly unbalanced in microbiota because of the large number of rare OTUs (Fig 2A). Accordingly, we modelled OTU prevalence using a truncated power law distribution, which reflects observed community structure (Part 5 in S1 Appendix and S4 Fig). The exponent k ranged from −2 to −1: the smaller k, the higher the proportion of rare species. Such a distribution implies that, for most pairs of OTUs, both OTUs have a low prevalence (Fig 2B).
For the occurrence data, there was thus a large proportion of associations for which negative correlations could never be significant (> 0.50 for k = −1, > 0.90 for k = −2); this proportion increased as N increased (Fig 2C). This counter-intuitive result is due to the accumulation of rare OTUs as N increases under the power law assumption. Fewer than 10% of associations were non-testable when N was greater than 50 (Fig 2D).
For the read abundance data, when N = 100, the proportion of untestable negative correlations was large when k = −1 (0.60) and extremely large when k = −2 (0.95) (Fig 2D).
Comparison between the two data types
We finally compared the association statistics for both data types under conditions of low OTU prevalence such as those observed in actual microbiota data (Part 4 in S1 Appendix). A formal decomposition of variance and covariance illustrates the structural relationship between the correlation coefficients calculated from the occurrence and read abundance data (Eq (2), Part 1 in S1 Appendix). The observed values of the Phi coefficient ϕ and the Pearson correlation r among OTU pairs in microbial datasets (Fig 3A) illustrate that the minimum of the statistics is particularly affected, as explained above. Furthermore, a correlation is observed between the two measures on real microbial datasets (cor = 0.78 and R2 = 0.62 on honeybee microbiota data, Fig 3A). Simulations allowed us to investigate more precisely the expected correlation between the two measures. The association tests that can be performed using occurrence versus read abundance data tend to be similar, and prevalence influences association testability in the same way. More specifically, association measures for the two data types become correlated as prevalence decreases (Fig 3B).
Discussion
We showed that it is impossible to reliably test a large proportion of the pairwise associations between OTUs in microbiota. Indeed, assuming realistic community structure (i.e., most OTUs are rare), negative correlations could not be tested for most associations. This finding clarifies previous modelling results [12] and underscores a major analytical challenge in this domain. From a practical perspective, for example, this constraint could hamper the identification of candidate biological agents that could be used to control rare pathogens.
This result is amplified by the dependencies within the correlation matrix. There are naturally more positive coefficients than negative coefficients in the correlation matrix. If A and B as well as B and C are negatively correlated, A and C will be more likely to be positively correlated. It is not clear that partial correlations could cope with this issue, especially since a partial negative correlation would be difficult to estimate.
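The dependency between coefficients can be made explicit for three variables. The bound below is a standard consequence of the positive semidefiniteness of a correlation matrix; it is given here as a short illustration and is not a result taken from our analysis.

```latex
% A 3x3 correlation matrix must have a non-negative determinant:
% 1 - r_{AB}^2 - r_{BC}^2 - r_{AC}^2 + 2 r_{AB} r_{BC} r_{AC} \ge 0,
% which is equivalent to
r_{AB}\, r_{BC} - \sqrt{(1 - r_{AB}^2)(1 - r_{BC}^2)}
  \;\le\; r_{AC} \;\le\;
r_{AB}\, r_{BC} + \sqrt{(1 - r_{AB}^2)(1 - r_{BC}^2)}
% Example: r_{AB} = r_{BC} = -0.8 gives r_{AC} \ge 0.64 - 0.36 = 0.28 > 0.
```

Two strong negative correlations therefore force the third coefficient to be positive, which is consistent with the excess of positive coefficients described above.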
Applying more stringent standards (i.e., analysing only fully testable associations) could drastically reduce the number of tests required to infer an association network. Associations with partial testability (i.e., positive or negative association cannot be tested) could be included, but significance thresholds would need to be adjusted to avoid spurious correlations. We thus propose that identifying testable associations could serve as an alternative to current, empirical strategies for trimming microbiota datasets. By limiting test number, this approach could help control the overall risk of statistical errors and speed up computation time. It could be implemented by forcibly introducing zeros in the correlation matrix during the inference of an association network.
We found that association testability tends to be similar for occurrence and read abundance data. More specifically, association measures calculated using the two data types become correlated as prevalence decreases. This raises questions about the information actually being captured by current methods for quantifying OTU associations. These questions have both computational implications, questioning the ability of current models to make the most of abundance data, and biological implications, because different biological processes of interactions could be revealed by the two data types. Fitting abundances to zero-inflated distributions [23, 24], which aim to decouple occurrence and abundance, appears to be a promising way to improve the inference of microbial associations.
The low prevalence of OTUs in metagenomic datasets greatly limits their overall analysis. In light of the results obtained, we believe that advances in the discovery of microbial associations should be made through the systematic integration of the available information in the models. The first attempts to develop statistical models incorporating prior information in the context of metagenomic analysis showed promising results [25]. From a statistical point of view, this approach can be improved, for instance by using a Bayesian framework. From a biological point of view, this approach would benefit from the development of a database dedicated to microbial interactions. Open and shared microbiota datasets, like those present in the Qiita collaborative platform (qiita.ucsd.edu), could be used to benchmark statistical models and to feed such a database to improve knowledge on microbiota.
Rarity of microbial species: In search of reliable associations
Arnaud Cougoul, Xavier Bailly, Gwenaël Vourc’h, Patrick Gasqui
Acknowledgments
The work was funded by two INRA metaprogrammes: Meta-omics of microbial ecosystems (MEM) and Integrated management of animal health (GISA). We thank Ioana Molnar for the mathematical advice and Jessica Pearce-Duvet for proofreading the manuscript.
S1 Appendix.
1. Notation and decomposition of variance and covariance
We consider two OTUs whose abundances are modelled by two random variables, XA and XB (Table 1). Our threshold is based on presence or absence of OTU, so we created a contingency table whose categories are defined by variable presence or absence.
N is the number of microbiota samples; N00 is the number of co-absences of XA and XB; N11 is the number of co-occurrences of XA and XB; and P11 = N11/N is the proportion of co-occurrences of the two OTUs. PA = NA/N and PB = NB/N are the marginal probabilities of XA and XB, respectively (i.e., individual OTU prevalence). Since the OTUs are observed at least once, PA, PB ∈ [1/N, 1].
We can calculate the mean and estimated variance of XA and XB using the non-zero values of XA or XB. Consequently, , and .
The estimated variances of XA and XB can be calculated as follows:
The estimated covariance of XA and XB can be decomposed based on whether or not XA and XB co-occur (i.e., XA and XB are non-null or not). If , then
“Exclusively quantitative” covariance
When the data are reduced into binary variables, because {XA| XA, XB ≠ 0} and {XB| XA, XB ≠ 0} are constants. Then is part of the covariance of XA and XB only because of the quantitative aspect of data.
“Qualitative” covariance
The second part of the covariance is the difference between the mean product for the whole population and the mean product for the co-occurring elements only. Consequently, it can be explained by OTU co-occurrences (qualitative in nature).
When the data are reduced into binary variables (based on equations (1) and (2)):
Therefore, the correlation of XA and XB, , will depend only on P11, PA, and PB.
2. Threshold method for binary data
Our method is based on the properties of discrete statistics. As binary data are discrete data, statistical tests have discrete distributions, as do p-values. Moreover, the minimum observable p-value for fixed marginal values can be higher than the alpha level (usually set to 5%), which means the test yields useless results [1,2]. In other words, for two OTUs with fixed prevalence, if all the possible values of an association index fall within the expected confidence interval, the association is simply not testable. Below, we will illustrate how OTU prevalence can thus shape potential correlations.
In this section, we detail how we developed our threshold method for binary data (i.e., OTU occurrence). First, we describe the association index used and show that it is bounded. Second, we present how we defined its testability. Third, we examine the consequences of our threshold method for network inference. Fourth, we present the testability limits on Fisher’s exact test as a function of prevalence.
2.1. Measure of associations for binary data
The combinatorics that ensue from the hypergeometric law provide only simulated solutions for determining the testability of associations. In contrast, the Phi coefficient [6] can be used to establish equations for exploring association testability and give an analytical solution. The Phi coefficient is mathematically related to the common chi-square test. Since Fisher’s exact test and Pearson’s chi-square test are asymptotically equivalent, we used the Phi coefficient as the basis for our threshold method. Moreover, we showed that the testability results were equivalent for both tests (see section 2.7 and S1 Fig 3). Phi is also equivalent to the Pearson correlation coefficient in situations with binary data (coded by 0 and 1), a property that was helpful when extending our threshold method to quantitative situations (see sections 3 and 4).
Consider two random binary variables, XA and XB, which represent the presence or absence of two OTUs. Working from Table 1, the Phi coefficient for the association between XA and XB is calculated as follows:
$$\phi = \frac{P_{11} - P_A P_B}{\sqrt{P_A P_B (1 - P_A)(1 - P_B)}} \tag{3}$$
2.2. Bounds of the Phi coefficient as a function of prevalence
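Based on the Boole–Fréchet inequality for logical conjunction, for the marginal probabilities PA, PB ∈ ]0, 1[, it follows that
$$\max(0, P_A + P_B - 1) \le P_{11} \le \min(P_A, P_B) \tag{4}$$
Given equations (3) and (4) and because ϕ is a continuous and monotonic function of P11:
$$\phi_{min} \le \phi \le \phi_{max}$$
where [7]
$$\phi_{min} = \frac{\max(0, P_A + P_B - 1) - P_A P_B}{\sqrt{P_A P_B (1 - P_A)(1 - P_B)}} \tag{5a}$$
$$\phi_{max} = \frac{\min(P_A, P_B) - P_A P_B}{\sqrt{P_A P_B (1 - P_A)(1 - P_B)}} \tag{5b}$$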
Therefore, ϕ is bounded and ϕmin and ϕmax depend exclusively on PA and PB.
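As an illustration of how the bounds behave, the following minimal Python sketch evaluates the Phi coefficient at the two Boole–Fréchet limits of P11. It is not part of the original analysis; the function names `phi` and `phi_bounds` and the example prevalences are illustrative assumptions.

```python
import math

def phi(p11, pa, pb):
    """Phi coefficient from the co-occurrence proportion p11 and prevalences pa, pb."""
    return (p11 - pa * pb) / math.sqrt(pa * pb * (1 - pa) * (1 - pb))

def phi_bounds(pa, pb):
    """Extrema of Phi, reached at the Boole-Frechet limits of p11 (0 < pa, pb < 1)."""
    p11_min = max(0.0, pa + pb - 1.0)
    p11_max = min(pa, pb)
    return phi(p11_min, pa, pb), phi(p11_max, pa, pb)

# Illustrative prevalences (hypothetical values, not taken from the datasets).
for pa, pb in [(0.5, 0.5), (0.1, 0.1), (0.05, 0.2)]:
    lo, hi = phi_bounds(pa, pb)
    print(f"PA={pa}, PB={pb}: phi in [{lo:.3f}, {hi:.3f}]")
```

For two rare OTUs the lower bound is close to zero, which is the situation that makes negative associations hard to test.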
2.3. Distribution of the Phi coefficient under the null hypothesis of independence
Under the null hypothesis (H0) that the occurrences of two OTUs, XA and XB, are independent, the distribution of ϕ can be derived from Pearson’s chi-squared test: ϕ2 = χ2/N, where N is the total number of observations and χ2 is the chi-squared statistic for a 2×2 contingency table, which follows a chi-squared distribution with 1 degree of freedom [8].
Since we know the distribution of ϕ, we can obtain the confidence interval at an alpha level of α. The confidence interval of a χ2 distribution with 1 degree of freedom is [0, b], where b is defined by P(χ2 ≤ b) = 1 − α (e.g., for α = 5%, b ≈ 1.96² ≈ 3.84).
The confidence interval of ϕ at an alpha level of α can be calculated as follows:
$$\left[-\sqrt{b/N},\ \sqrt{b/N}\right] \tag{6}$$
2.4. Determining the testability of occurrence-based associations
We now examine the testability of the Phi coefficients calculated from pairs of OTU prevalence values. We do so by determining if the extrema of Phi occur within the confidence interval. There are two ways in which we may have trouble detecting significant associations:
A) If $\phi_{min} \ge -\sqrt{b/N}$, then we will not be able to detect a significant negative association.
B) If $\phi_{max} \le \sqrt{b/N}$, then we will not be able to detect a significant positive association.
As ϕmin and ϕmax depend exclusively on PA and PB, we now consider the conditions under which PA and PB adopt problematic values.
We can split the first case (A) in two subcases because ϕmin can have two different values depending on the specific values of PA and PB:
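A1) If PA + PB < 1, then max(0, PA + PB − 1) = 0. Based on equations (3), (4), and (5a),
$$\phi_{min} = -\sqrt{\frac{P_A P_B}{(1 - P_A)(1 - P_B)}}$$
A2) If PA + PB ≥ 1, max(0, PA + PB − 1) = PA + PB − 1. Based on equations (3), (4), (5a),
$$\phi_{min} = -\sqrt{\frac{(1 - P_A)(1 - P_B)}{P_A P_B}}$$
We can then resolve the inequation $\phi_{min} \ge -\sqrt{b/N}$.
A1) For PA + PB < 1,
$$\frac{P_A P_B}{(1 - P_A)(1 - P_B)} \le \frac{b}{N} \iff P_B \le \frac{b (1 - P_A)}{N P_A + b (1 - P_A)} \tag{7}$$
(all variables are positive)
A2) For PA + PB ≥ 1,
$$\frac{(1 - P_A)(1 - P_B)}{P_A P_B} \le \frac{b}{N} \iff P_B \ge \frac{N (1 - P_A)}{N (1 - P_A) + b P_A} \tag{8}$$
If inequations (7) or (8) are true, a negative association cannot be detected.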
The second case (B) can be similarly split up because ϕmax can also have two values:
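B1) If PA ≤ PB, then min(PA, PB) = PA. Based on equations (3), (4), and (5b),
$$\phi_{max} = \sqrt{\frac{P_A (1 - P_B)}{P_B (1 - P_A)}}$$
B2) If PA ≥ PB, then min(PA, PB) = PB. Based on equations (3), (4), and (5b),
$$\phi_{max} = \sqrt{\frac{P_B (1 - P_A)}{P_A (1 - P_B)}}$$
We can now solve the inequation $\phi_{max} \le \sqrt{b/N}$.
B1) If PA ≤ PB,
$$\frac{P_A (1 - P_B)}{P_B (1 - P_A)} \le \frac{b}{N} \iff P_B \ge \frac{N P_A}{N P_A + b (1 - P_A)} \tag{9}$$
B2) If PA ≥ PB,
$$\frac{P_B (1 - P_A)}{P_A (1 - P_B)} \le \frac{b}{N} \iff P_B \le \frac{b P_A}{N (1 - P_A) + b P_A} \tag{10}$$
If inequations (9) or (10) are true, a positive association cannot be detected.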
Using the four inequations (7), (8), (9), and (10), we can delimit zones within which there is full, partial, or no testability. The characteristics of the tests in these zones will be detailed in the introduction to the next section.
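The zones can also be obtained by comparing the extrema of Phi directly with the confidence interval, which is equivalent to checking the four inequations. The sketch below is a minimal Python illustration under that reading; `classify_pair`, the zone labels it returns, and the example prevalences are assumptions made for the example, and the critical value 3.841459 is the standard 95% quantile of the chi-squared distribution with 1 degree of freedom.

```python
import math

# 95% quantile of the chi-squared distribution with 1 degree of freedom
# (standard table value; pass another b for a different alpha level).
CHI2_95_1DF = 3.841459

def classify_pair(pa, pb, n, b=CHI2_95_1DF):
    """Classify the testability of an OTU pair from prevalences pa, pb and sample size n."""
    denom = math.sqrt(pa * pb * (1 - pa) * (1 - pb))
    phi_min = (max(0.0, pa + pb - 1.0) - pa * pb) / denom
    phi_max = (min(pa, pb) - pa * pb) / denom
    half_width = math.sqrt(b / n)          # interval is [-half_width, +half_width]
    negative_ok = phi_min < -half_width    # the minimum can leave the interval
    positive_ok = phi_max > half_width     # the maximum can leave the interval
    if negative_ok and positive_ok:
        return "full"
    if positive_ok:
        return "positive only"
    if negative_ok:
        return "negative only"
    return "none"

# Hypothetical prevalence pairs with N = 50 samples.
for pa, pb in [(0.5, 0.5), (0.1, 0.1), (0.02, 0.9), (0.02, 0.5)]:
    print(pa, pb, classify_pair(pa, pb, n=50))
```

Strict inequalities are used here for "can be significant"; whether the boundary itself counts as testable is a convention and does not change the zone areas.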
For the two OTUs, PA and PB form a [1/N, 1]2 square (Figure 1 below); 1/N is the smallest observable value. The testability zones in this square can be defined using four border functions that result from the inequations:
Emerging from these border functions are four graph intersections that are defined by:
2.5. Proportion of associations in each testability zone
The zones defined by the border functions (11) contain different proportions of associations that can be categorised as fully testable, partially testable, or non-testable using our threshold method. The first zone, Abilateral, contains associations for which both positive and negative correlations can be reliably tested. The second zone, Aunilateral, contains associations for which only positive correlations can be reliably tested (subzone Apositive) and for which only negative correlations can be reliably tested (subzone Anegative). Finally, the third zone, Airrelevant, contains associations that cannot be reliably tested at all.
The distribution of prevalence values is treated as identical for all OTUs. Therefore, PA and PB have the same distribution and play symmetrical roles. However, these distribution patterns are not necessarily uniform. We examined two types of distributions—the uniform distribution and the truncated power law distribution; the latter fit the prevalence patterns of OTUs in real microbiota (see section 5).
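For the uniform distribution of prevalence, the probability density function is
$$f(p) = \frac{1}{1 - 1/N} = \frac{N}{N - 1}, \quad p \in [1/N, 1]$$
For the truncated power law distribution of prevalence, the probability density function is $f(p) = C\,p^{k}$ for $p \in [1/N, 1]$ and, following normalization ($\int_{1/N}^{1} f(p)\,dp = 1$), we arrive at $C = \frac{k + 1}{1 - N^{-(k+1)}}$ for k ≠ −1 (for k = −1, C = 1/ln N), so
$$f(p) = \frac{(k + 1)\, p^{k}}{1 - N^{-(k+1)}} \tag{14}$$
When k = 0, we have a uniform distribution on the interval [1/N, 1].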
To computationally define the different zones, analytical formulas can be used in the case of the uniform distribution but not in the case of the power law distribution. Consequently, in the latter situation, we chose to proceed by numerical integration. Since the current form of the R function integrate (in the stats package) does not deal well with the power law, we used a Monte Carlo approach. This consisted of generating random prevalence values in accordance with the observed prevalence distribution (see section 2.6) and counting how many fell within each of the zones.
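The counting step of this Monte Carlo approach can be sketched as follows for the uniform case. This is a minimal Python illustration rather than the R code used for the analysis; the function name `zone_proportions`, the number of draws, and the seed are arbitrary assumptions, and the zone keys follow the names Abilateral, Apositive, Anegative, and Airrelevant used in this section.

```python
import math
import random

def zone_proportions(n, draws=20000, b=3.841459, seed=1):
    """Monte Carlo estimate of the share of OTU pairs in each testability zone,
    with prevalences drawn uniformly on [1/n, 1] (b: chi-squared critical value)."""
    rng = random.Random(seed)
    half_width = math.sqrt(b / n)
    counts = {"bilateral": 0, "positive": 0, "negative": 0, "irrelevant": 0}
    done = 0
    while done < draws:
        pa = rng.uniform(1.0 / n, 1.0)
        pb = rng.uniform(1.0 / n, 1.0)
        denom = math.sqrt(pa * pb * (1 - pa) * (1 - pb))
        if denom == 0.0:            # prevalence of exactly 1: Phi undefined, redraw
            continue
        phi_min = (max(0.0, pa + pb - 1.0) - pa * pb) / denom
        phi_max = (min(pa, pb) - pa * pb) / denom
        negative_ok = phi_min < -half_width
        positive_ok = phi_max > half_width
        if negative_ok and positive_ok:
            counts["bilateral"] += 1
        elif positive_ok:
            counts["positive"] += 1
        elif negative_ok:
            counts["negative"] += 1
        else:
            counts["irrelevant"] += 1
        done += 1
    return {zone: c / draws for zone, c in counts.items()}

print(zone_proportions(300))
```

Replacing the two `rng.uniform` draws with draws from the truncated power law gives the corresponding estimate for realistic community structures.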
To simplify the zone-defining equations below, we have used the following notation: and the same notation applies in the cases of F2, F3 and F4.
∧ denotes the logical conjunctions.
From the four inequations (7, 8, 9, 10) and the border function (11), the proportions of associations that fall within each zone are determined as follows:
2.6. Defining testability zones using a Monte Carlo method
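To compute Monte Carlo integrations, it is necessary to generate random prevalence values using the observed distribution of prevalence. For the uniform distribution, many pseudorandom number generators exist. However, for the truncated power law distribution, we had to employ an inverse transformation method that is rooted in the following property: if U follows a uniform distribution on [0, 1] and F is a continuous, strictly increasing cumulative distribution function, then F⁻¹(U) has F as its cumulative distribution function.
We therefore needed to define the inverse cumulative distribution function. Let F be the cumulative distribution function of the truncated power law distribution as defined in (14). For k ≠ −1,
$$F(p) = \frac{p^{k+1} - N^{-(k+1)}}{1 - N^{-(k+1)}}, \quad p \in [1/N, 1]$$
We can then generate a power law distribution from a uniform distribution using the following equation:
$$P = F^{-1}(U) = \left[N^{-(k+1)} + U \left(1 - N^{-(k+1)}\right)\right]^{1/(k+1)}$$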
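The inverse transformation can be sketched in a few lines of Python. This is an illustration under the assumption that the density is proportional to p^k on [1/N, 1]; the function name `sample_prevalence` is hypothetical, and the case k = −1 is handled separately because the general formula is undefined there.

```python
import random

def sample_prevalence(k, n, rng=random):
    """Draw one prevalence in [1/n, 1] with density proportional to p**k,
    by applying the inverse cumulative distribution function to a uniform draw."""
    a = 1.0 / n
    u = rng.random()
    if k == -1:
        # Limit case: F(p) = ln(p / a) / ln(1 / a), hence p = a ** (1 - u).
        return a ** (1.0 - u)
    e = k + 1.0
    return (a ** e + u * (1.0 - a ** e)) ** (1.0 / e)

rng = random.Random(42)
draws = [sample_prevalence(-2.0, 100, rng) for _ in range(10000)]
rare = sum(p < 0.1 for p in draws) / len(draws)
print(f"share of prevalences below 0.1 for k = -2, N = 100: {rare:.2f}")
```

With k = 0 the same function reduces to a uniform draw on [1/N, 1], which provides a simple consistency check.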
2.7. Testability limits on Fisher’s exact test
Co-occurrence networks are commonly reconstructed using the hypergeometric law that underlies Fisher’s exact test [3–5].
From an observed 2×2 contingency table (Table 1), Fisher showed that the probability P of obtaining such a set was given by the hypergeometric distribution: where is the binomial coefficient and ! indicates the factorial.
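This equation can be written according to NA, NB, N and N11:
$$P(N_{11}) = \frac{\binom{N_A}{N_{11}} \binom{N - N_A}{N_B - N_{11}}}{\binom{N}{N_B}}$$
Based on the Boole–Fréchet inequality for logical conjunction, for the marginal counts NA, NB ∈ ]0, N[, it follows that
$$\max(0, N_A + N_B - N) \le N_{11} \le \min(N_A, N_B)$$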
We have two extreme situations:
a) Observe the minimum number of co-occurrences, N11 = min(N11) = max(0, NA + NB − N)
b) Observe the maximum number of co-occurrences, N11 = max(N11) = min(NA, NB)
We can calculate the probability P associated with these two situations a) and b). A bilateral test can also be performed. As in the fisher.test function of R, the p-value is computed by summing the probabilities of all tables whose probability is less than or equal to that of the observed table.
For two given OTUs with prevalence PA and PB, there are four possibilities regarding the testability limits on Fisher’s exact test:
1) If the p-values associated with both configurations a) and b) are lower than the alpha level (5%), the two extreme situations a) and b) correspond to significant associations. There is no limit on the test.
2) If only the p-value associated with configuration a) is greater than the alpha level, then we will not be able to detect a significant negative association.
3) If only the p-value associated with configuration b) is greater than the alpha level, then we will not be able to detect a significant positive association.
4) If the p-values associated with both configurations a) and b) are greater than the alpha level, then we will not be able to detect a significant positive or negative association.
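The limits above can be checked numerically for given marginal counts. The following Python sketch is a minimal illustration, not the code used for S3 Fig; the function names are hypothetical, and the small relative tolerance in the two-sided sum is an assumption meant to mirror the summation rule described for fisher.test.

```python
from math import comb

def table_prob(n11, na, nb, n):
    """Hypergeometric probability of n11 co-occurrences given fixed margins na, nb, n."""
    return comb(na, n11) * comb(n - na, nb - n11) / comb(n, nb)

def two_sided_p(n11, na, nb, n):
    """Sum the probabilities of all tables no more probable than the observed one."""
    lo, hi = max(0, na + nb - n), min(na, nb)
    p_obs = table_prob(n11, na, nb, n)
    probs = [table_prob(k, na, nb, n) for k in range(lo, hi + 1)]
    return sum(p for p in probs if p <= p_obs * (1 + 1e-7))  # tolerance for rounding

def fisher_limits(na, nb, n, alpha=0.05):
    """Can the extreme tables a) and b) ever be significant at level alpha?"""
    lo, hi = max(0, na + nb - n), min(na, nb)
    return {
        "negative_testable": two_sided_p(lo, na, nb, n) < alpha,
        "positive_testable": two_sided_p(hi, na, nb, n) < alpha,
    }

# Two OTUs each present in 5 of 50 samples (hypothetical counts).
print(fisher_limits(5, 5, 50))
```

In this example the most extreme negative configuration (no co-occurrence) is also the most likely table under independence, so a negative association cannot reach significance.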
3. Threshold method for quantitative data
In this section, we detail how we developed our threshold method for quantitative data (i.e., OTU read abundance). First, we introduce our system of notation and the primary elements of our proof. Second, we present the situation, in which correlations are bounded by an excess of zeroes, and describe the minimum correlation value. Third, we show how we defined association testability. Finally, we examine the consequences of our threshold method for network inference.
3.1. Introduction
In this section, XA and XB are two random variables that represent quantitative data. The Pearson correlation coefficient [9] is used to characterise the pairwise associations in OTU read abundance. We were specifically interested in understanding how the number of zeroes in the data could influence the correlation coefficient.
We use same notations as in section 1. and represent the number of zeros associated with XA and XB, respectively. N00 is the number of co-absences of XA and XB, and N11 is the number of co-occurrences.
Based on Table 1 and the Boole–Fréchet inequalities, we can deduce the following:
For pairs of and we distinguish two cases:
The number of zeros is sufficiently low such that there are no raw restrictions on possible correlations. Indeed, it is simple to build two non-restricted correlations that approach infimum −1 and supremum +1:
In this case, the correlation coefficient is r = −1.
The correlation tends toward the supremum, or .
Based on equations (19) and (21), and N11 ≥ 0. Consequently, N11 can equal zero, meaning that there are enough zeros associated with XA and XB that XA and XB may not co-occur. In this situation, information on quantitative correlations is degraded. We can prove that r, the Pearson correlation coefficient, has a minimum, rmin, that is different from −1:
3.2. Determining the lower bound of the Pearson correlation coefficient
Given and we wished to determine the minimum possible correlation between XA and XB. We highlight that a lower bound of the Pearson correlation exists between two positive variables and prove that it can be reached under certain conditions.
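For the association between XA and XB, the Pearson correlation coefficient is calculated as follows:
$$r = \frac{\mu(X_A X_B) - \mu(X_A)\,\mu(X_B)}{\sigma(X_A)\,\sigma(X_B)}$$
If XA, XB ≥ 0, then μ(XAXB) ≥ 0 and
$$r \ge -\frac{\mu(X_A)\,\mu(X_B)}{\sigma(X_A)\,\sigma(X_B)}$$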
Consequently, the mean of XAXB is null if and only if there are no co-occurrences. In other words,
If μ(XAXB) = 0, then ∑ XAXB = 0. Each element of the sum is non-negative, so ∑ XAXB = 0 implies that all elements are null and there are no co-occurrences (i.e., N11 = 0).
If there are no co-occurrences, then XAXB = 0 for every sample and μ(XAXB) = 0.
From equations (22) and (23), we can conclude that
Moreover, if , then, from equation (21), we know that N11 ≠ 0. Therefore,
We now want to control $-\frac{\mu(X_A)\,\mu(X_B)}{\sigma(X_A)\,\sigma(X_B)}$ and find its minimum. We therefore maximise μ(XA)/σ(XA) and μ(XB)/σ(XB) separately. μ/σ corresponds to the inverse coefficient of variation.
3.3. Maximising the inverse coefficient of variation
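Below, we illustrate how to maximise the inverse coefficient of variation for XA. Writing NA for the number of non-null values of XA, we will show that
$$\frac{\mu(X_A)}{\sigma(X_A)} \le \sqrt{\frac{N_A}{N - N_A}}$$
We can express variance using the König–Huygens formula:
$$\sigma^2(X_A) = \mu(X_A^2) - \mu(X_A)^2$$
If μ(XA) ≠ 0, then
$$\frac{\sigma^2(X_A)}{\mu(X_A)^2} = \frac{\mu(X_A^2)}{\mu(X_A)^2} - 1$$
We are now interested in $\mu(X_A^2) / \mu(X_A)^2$, and we will show that $\mu(X_A^2) / \mu(X_A)^2 \ge N / N_A$.
Let V, W be two vectors of ℝN. As per the Cauchy–Schwarz inequality,
$$\langle V, W \rangle^2 \le \|V\|^2 \, \|W\|^2$$
Let V = Y be the vector of non-null elements of XA (for Y, vector size is equal to NA); W = 1NA, a constant vector of size NA. In this case, the Cauchy–Schwarz inequality becomes the following:
$$\Big(\sum_{i=1}^{N_A} Y_i\Big)^2 \le N_A \sum_{i=1}^{N_A} Y_i^2$$
where equality holds if and only if Y = λ × 1NA, where λ > 0 (i.e., Y is a constant vector).
As $\sum_i Y_i = N \mu(X_A)$ and $\sum_i Y_i^2 = N \mu(X_A^2)$, then
$$\frac{\mu(X_A^2)}{\mu(X_A)^2} \ge \frac{N}{N_A}$$
Based on equations (26) and (27), we now observe that
$$\frac{\sigma^2(X_A)}{\mu(X_A)^2} \ge \frac{N}{N_A} - 1 = \frac{N - N_A}{N_A}$$
Finally,
$$\frac{\mu(X_A)}{\sigma(X_A)} \le \sqrt{\frac{N_A}{N - N_A}}$$
The maximum occurs where the non-null values of XA are constant; it is $\sqrt{N_A / (N - N_A)}$.
The approach is equivalent for XB, so we can conclude that
$$\frac{\mu(X_B)}{\sigma(X_B)} \le \sqrt{\frac{N_B}{N - N_B}}$$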
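The bound on the inverse coefficient of variation can be checked numerically. The sketch below is a minimal Python illustration that writes the bound as sqrt(NA / (N − NA)), where NA is the number of non-null values; this way of writing it, the function names, and the example vectors are assumptions made for the example.

```python
import math
import random

def inv_cv(x):
    """Inverse coefficient of variation mu/sigma, using the Koenig-Huygens variance."""
    n = len(x)
    mu = sum(x) / n
    var = sum(v * v for v in x) / n - mu * mu
    return mu / math.sqrt(var)

def inv_cv_bound(x):
    """Upper bound sqrt(NA / (N - NA)), where NA counts the non-null values of x."""
    n = len(x)
    na = sum(1 for v in x if v != 0)
    return math.sqrt(na / (n - na))

# Equality case: the non-null values are constant.
constant = [0.0] * 8 + [5.0] * 2
print(inv_cv(constant), inv_cv_bound(constant))

# Generic case: variable non-null values stay below the bound.
rng = random.Random(0)
variable = [0.0] * 70 + [rng.uniform(1.0, 100.0) for _ in range(30)]
print(inv_cv(variable), inv_cv_bound(variable))
```

Both functions require at least one zero and at least two distinct values in the vector, otherwise the bound or the variance is degenerate.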
3.4. Determining the minimum Pearson correlation coefficient when there are many zeros
Based on equations (24), (28), and (29),
It therefore stands to reason that
Therefore, and .
Finally, based on equations (30) and (31), when ,
3.5. Constraints on the testability of the Pearson correlation coefficient
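When XA and XB follow two uncorrelated normal distributions, $t = r \sqrt{\frac{N - 2}{1 - r^2}}$, where t is a Student’s t statistic with N − 2 degrees of freedom. We can then determine a confidence interval for r: [−K, K], where K depends on α and N (inverting the relation above gives $K = t_{1-\alpha/2,\,N-2} / \sqrt{N - 2 + t_{1-\alpha/2,\,N-2}^2}$).
Returning to our measures of OTU prevalence, if PA = NA/N and PB = NB/N (with PA + PB ≤ 1), then $r_{min} = -\sqrt{\frac{P_A P_B}{(1 - P_A)(1 - P_B)}}$. The constraint is the same as in the case of binary data.
If rmin falls within the confidence interval, we can conclude that negative associations cannot be detected:
$$-\sqrt{\frac{P_A P_B}{(1 - P_A)(1 - P_B)}} \ge -K \tag{33}$$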
Accordingly, if inequation (33) is true, then negative associations are not testable.
The border function that defines the testability zones in the square formed by PA × PB is as follows:
3.6. Proportion of associations in each testability zone
Using the border function (34), we observed that two zones existed. The first zone, Abilateral, contains associations for which both positive and negative correlations can be reliably tested. The second zone, Apositive, contains associations for which only positive correlations can be reliably tested. As for the binary data (sections 2.5 and 2.6), we explored the testability of abundance-based associations using the uniform distribution and the truncated power law distribution. In the latter case, we again employed a Monte Carlo approach.
Based on the border function (34), the proportions of associations that fall within each zone can be determined as follows: (Same notation as in section 2.5)
4. Similarity of the Phi and Pearson correlation coefficients
In this section, we show that testability constraints tend to be similar with both occurrence and abundance data. We also examine the degree of correlation between the correlation coefficients calculated using the two data types.
4.1. Testability constraints on occurrence and abundance data
The distribution of the correlation coefficient for two normally distributed independent variables is .
As (i.e., there is distribution convergence) and , then . Since the distribution of the square of the Phi coefficient is under the null hypothesis of independence, the Pearson correlation coefficient will asymptotically attain the same confidence interval as the Phi coefficient: their lower bounds converge upon (sections 2.3 and 3.5).
We now underscore that the Phi and Pearson correlation coefficients have the same lower bound when the two OTUs have low levels of prevalence:
When N is large enough, the testability of positive associations will be the same for binary data and quantitative data. This pattern will be all the more pronounced given that, in real microbiota, OTU prevalence is greatly skewed to the right: positive associations represent the majority of associations to be tested.
4.2. Correlation between Phi and Pearson coefficients
In section 1, we showed that covariance can be decomposed into a quantitative part and a qualitative part (equation (2)). Here, we use the results of a simulation to explore how the strength of the correlation between the values of the Phi coefficient and the Pearson coefficient is related to OTU prevalence. We are most interested in what happens when prevalence is low.
OTU abundances XA and XB are modelled by a zero-inflated Poisson (ZIP) distribution using the following probability mass function:
$$P(X = 0) = p_0 + (1 - p_0)\, e^{-\lambda}, \qquad P(X = x) = (1 - p_0)\, \frac{\lambda^{x} e^{-\lambda}}{x!} \ \text{ for } x \ge 1$$
where the probability of structural zeros, p0, is the result of a Bernoulli process and λ is the mean of the Poisson portion of the distribution (i.e., the Poisson parameter). In the simulation, XA and XB had the same values for p0 and λ.
The probability of structural zeros p0 represents the complementary probability of prevalence P, i.e. p0 = 1 − P. As p0 increases (i.e., prevalence decreases), the correlation between the Phi coefficient and the Pearson coefficient increases (Figure 3A in the article). The correlation also strengthens as λ increases. When prevalence is below 0.25, the correlation is greater than 0.75 for all values of λ.
If OTU abundances follow a ZIP distribution, we can conclude that the values of the Phi coefficient and the Pearson coefficient will be correlated, especially when OTU prevalence is low.
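A simulation of this kind can be sketched with the Python standard library alone. The code below is an illustration and does not reproduce the settings of the original simulation: independent pairs of OTUs, 100 samples per pair, 200 pairs, λ = 10, and the function names are all assumptions. Phi is computed as the Pearson coefficient of the binarised data, as noted in section 2.1.

```python
import math
import random

def rpois(lam, rng):
    """Poisson draw by Knuth's multiplication method (adequate for small lam)."""
    limit, k, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def rzip(p0, lam, rng):
    """Zero-inflated Poisson draw: structural zero with probability p0."""
    return 0 if rng.random() < p0 else rpois(lam, rng)

def pearson(x, y):
    """Pearson correlation; returns None if one variable is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    return sxy / math.sqrt(sxx * syy)

def phi_vs_pearson(p0, lam, n_samples=100, n_pairs=200, seed=0):
    """Correlation, across simulated OTU pairs, between Phi and Pearson's r."""
    rng = random.Random(seed)
    phis, rs = [], []
    for _ in range(n_pairs):
        xa = [rzip(p0, lam, rng) for _ in range(n_samples)]
        xb = [rzip(p0, lam, rng) for _ in range(n_samples)]
        r = pearson(xa, xb)
        phi = pearson([int(v > 0) for v in xa], [int(v > 0) for v in xb])
        if r is not None and phi is not None:   # skip degenerate pairs
            rs.append(r)
            phis.append(phi)
    return pearson(phis, rs)

print("low prevalence  (p0 = 0.9):", phi_vs_pearson(0.9, 10.0))
print("high prevalence (p0 = 0.1):", phi_vs_pearson(0.1, 10.0))
```

Under these assumptions the two coefficients should track each other much more closely when structural zeros dominate, in line with the pattern described above.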
5. Distribution of OTU prevalence in real microbiota
To characterise actual OTU distribution patterns, we employed data from the QIITA database (qiita.ucsd.edu) and the TARA Ocean Project (ocean-microbiome.embl.de) [10]. The biom files were processed using the R package biomformat. We deliberately chose different kinds of microbiota so as to represent as wide a diversity of microbial communities as possible (Table 2). We used OTU rather than species tables.
The prevalence values were fitted to a truncated power law distribution as described by equation (14), and the power law coefficient k was estimated by maximizing the log-likelihood [11].
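The estimation step can be illustrated with a short Python sketch. It assumes the density (k + 1) p^k / (1 − N^−(k+1)) on [1/N, 1], treats prevalences as continuous, and maximises the log-likelihood with a ternary search, which is valid here because the log-likelihood of this one-parameter family is concave in k. The function names, the search interval, and the synthetic data are assumptions; this is not the fitting code used for Table 2.

```python
import math
import random

def log_norm_const(k, n):
    """log of C(k), where C(k) * p**k integrates to 1 on [1/n, 1]."""
    e = k + 1.0
    if abs(e) < 1e-9:
        return -math.log(math.log(n))             # limit k -> -1: C = 1 / ln(n)
    return math.log(e / -math.expm1(-e * math.log(n)))

def fit_k(prevalences, n, lo=-5.0, hi=3.0, iters=100):
    """Maximum-likelihood estimate of k by ternary search on [lo, hi]."""
    count = len(prevalences)
    sum_log = sum(math.log(p) for p in prevalences)

    def loglik(k):
        return count * log_norm_const(k, n) + k * sum_log

    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if loglik(m1) < loglik(m2):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2.0

# Synthetic prevalences drawn with k = -1.5 and N = 100 by inverse transformation.
rng = random.Random(123)
n, k_true = 100, -1.5
e = k_true + 1.0
a_e = (1.0 / n) ** e
data = [(a_e + rng.random() * (1.0 - a_e)) ** (1.0 / e) for _ in range(2000)]
print("estimated k:", round(fit_k(data, n), 2))
```

Observed prevalences are multiples of 1/N, so applying a continuous likelihood to real data is an approximation that becomes coarser for small N.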
Footnotes
* arnaud.cougoul{at}inra.fr