Abstract
Microbes are found in high abundances in the environment and in human-associated microbiomes, often exceeding one million per milliliter. Viruses of microbes are estimated to turn over 10 to 40 percent of microbes daily and, consequently, are important in shaping microbial communities. Given the relative specificity of viral infection and lysis, it is essential to identify the functional linkages between viruses and their microbial hosts. Multiple time-series analysis methods, including correlation-based approaches, have been proposed to infer infection networks in situ. In this work, we evaluate the effectiveness of correlation-based inference using an in silico approach. In doing so, we compare actual networks to predicted networks as a means to assess the self-consistency of correlation-based inference. Contrary to common use, we find that correlation is a poor indicator of interactions that arise from antagonistic virus-host infections that culminate in lysis. In closing, we discuss alternative inference methods, particularly model-based methods, as a means to predict interactions in complex virus-microbe communities.
Competing interests: The authors have declared that no competing interests exist.
Funding: This work was supported by the Simons Foundation (SCOPE award ID 329108, J.S.W.).
1 Introduction
Microbes and viruses are ubiquitous and highly diverse in marine, soil, and human-associated environments. Microbes play important roles in biogeochemical cycling, and viruses of microbes can transfer genes between microbial hosts [1, 2], alter host physiology (e.g. via auxiliary metabolic genes [3, 4]), and redirect the flow of organic matter in food webs through cell lysis [5, 6]. Viruses therefore are a significant part of microbial communities, and characterizing virus-microbe interactions is necessary for understanding how cellular-level interactions influence community structure and ecosystem function.
Viruses are known to be relatively specific but not exclusive in their microbial host range. Individual viruses may infect multiple strains of an isolated bacterium, or they may infect across genera, e.g. cyanophage can infect both Prochlorococcus and Synechococcus [7]. Analyses of dozens of culture-based studies have revealed structure in virus-microbe interaction networks [8, 9, 10]. Interaction networks are nested at the strain and genus level, which is indicative of rapid coevolution [11], and are hypothesized to be modular at larger phylogenetic scales. Importantly though, these results come from culture-based methods, raising the question of how broadly applicable they are in situ [1]. Partially culture-independent methods, such as viral tagging [12, 13] and digital PCR [14], overcome some of the hurdles associated with culturability and isolation but do not yet represent a community-based approach to inferring interactions amongst non-targeted viruses and microbes. Community-based inference methods are needed to fully characterize virus-microbe interactions.
Viral metagenomics (viromics) has made it possible to characterize phylogenetic and functional diversity of viruses in situ, bypassing culturing altogether [15, 16, 17]. Recent methods for inferring interactions leverage information from time-series obtained via metagenomic sampling. In these methods, population abundances (or marker-based proxies) are estimated directly from viral and cellular metagenomes. Many such inference methods for time-series exist (see reviews [18, 19, 20, 21, 22]) and broadly fall into two categories: model-based and model-free. Examples of model-free methods include the direct use of correlations, time-lagged correlations (e.g. local similarity analysis or LSA [23, 24, 25]), and correlations between log-transformed abundance ratios (e.g. sparse correlations for compositional data or SparCC [26]). Other model-free methods which are not correlation-based include cross-convergent mapping (CCM) [27, 28, 29], pairwise asymmetric inference (PAI) [30], and the sparse S-map method (SSM) [31]. Many model-based methods exist as well (see Discussion). Out of these, correlation and correlation-based inference methods are widely used with experimental time-series data. Local similarity analysis (LSA) in particular has been widely applied to infer interaction networks in communities of marine bacteria [32, 33]; bacteria and phytoplankton [34, 35]; bacteria and viruses [36]; and bacteria, viruses, and protists [37, 38].
Correlations in time-series are difficult to interpret, despite their widespread use. Positive correlations could indicate “common preferred conditions or perhaps cooperative activities such as crossfeeding” [22], while negative correlations could indicate “opposite seasonality, competition for limited resources or perhaps active negative interactions such as targeted allelopathy or predator-prey relationships” [22]. Correlations are often interpreted as direct interactions, e.g. predation, but may instead be indicative of a wide variety of indirect interactions (see [39, 40]). Predicted interactions can vary drastically depending on the particular correlation metric used [19], which further compounds the problem of ecological interpretation. Compositional data, which is common in metagenomic time-series, introduces additional complications [26]. Time-series may be strongly and significantly correlated without having any underlying physical or ecological relationship at all, a well-known adage (“correlation does not imply causation”) that is often disregarded. Multiple studies have shown that correlations in time-series do not predict interactions in in silico microbial communities simulated with a discrete-time Lotka-Volterra model [41] and a parametric statistical model [42].
Despite these challenges, correlation-based inference is rarely verified in advance of its application. Inferred networks are difficult to verify in general, in part because there is no existing “gold standard” interaction network, and as previously mentioned culture-based methods are not widely applicable. Hence, in this paper, we take an in silico approach to assessing the efficacy of correlation-based inference. We use a mechanistic model for a virus-microbe community to generate synthetic time-series in which the interaction network is known a priori. Then we apply correlation-based inference to the resulting time-series and compare predicted interactions to the original interaction network. As we show, correlation-based inference fails to recapitulate virus-microbe interaction networks, raising substantive concerns over its use in natural systems.
2 Methods
2.1 Modeling the virus-microbe community
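We model the dynamics of a virus-microbe community with a system of nonlinear differential equations:

$$\frac{dH_i}{dt} = r_i H_i \left(1 - \frac{1}{K}\sum_{i'=1}^{N_H} a_{ii'} H_{i'}\right) - H_i \sum_{j=1}^{N_V} M_{ij}\,\phi_{ij}\,V_j \tag{1}$$

$$\frac{dV_j}{dt} = V_j \sum_{i=1}^{N_H} M_{ij}\,\phi_{ij}\,\beta_{ij}\,H_i - m_j V_j \tag{2}$$

where Hi and Vj refer to the population density of microbial host i and virus j respectively. There are NH different host types and NV different virus types. For our purposes, a “type” is a group of microbes or viruses with identical life history traits, i.e. microbes or viruses that occupy the same functional niche.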
In the absence of viruses, the hosts undergo logistic growth with growth rates ri. The hosts have a community-wide carrying capacity K, and they compete with each other for resources both inter and intra-specifically with competition strength aii′. Each host can be infected and lysed by a subset of viruses determined by the interaction terms Mij. If host i can be infected by virus j, Mij is one; otherwise it is zero. The collection of all the interaction terms is the interaction network represented by matrix M of size NH by NV. The adsorption rates ϕij denote how frequently host i is infected by virus j.
Each virus j’s population grows from infecting and lysing hosts. The rate of virus j’s growth is determined by its host-specific adsorption rate ϕij and host-specific burst size βij, which is the number of new virions per infected host cell. The quantity $\tilde{M}_{ij} = M_{ij}\phi_{ij}\beta_{ij}$ is the interaction strength between virus j and host i, and the collection of all the interaction strengths is the weighted interaction network $\tilde{M}$. Finally, the viruses decay at rates mj.
2.2 Interaction network topology
Virus-microbe interaction networks are represented as bipartite networks or matrices of size NH by NV. Here, we generate in silico interaction networks given variation in nestedness and modularity. For a given network size NH by NV, we first generate the perfectly nested (Fig 1A) and perfectly modular (Fig 1B) networks using the BiMat MATLAB package [43]. For the modular network, we choose a small number of modules relative to the network dimensions NH and NV, e.g. 2 modules for a 10 by 10 network. In general, the perfectly nested network and the perfectly modular network will have a different number of interactions. In the following rewiring procedure, we treat the two networks separately so that the number of interactions is conserved.
We rewire the perfect network by randomly selecting an interacting host-virus pair (Mij = 1) and a non-interacting host-virus pair (Mij = 0) and exchanging their interaction values. We do not allow exchanges that would result in an all-zero row or column but do not restrict the exchanges in any other way. We continue the random selection without replacement until all host-virus pairs have been selected no more than once, recording the new interaction network after each exchange. We repeat this procedure several times to generate an ensemble of interaction networks with varying nestedness and modularity.
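The rewiring procedure can be sketched in a few lines. The Python below is only an illustration and is not the BiMat-based MATLAB workflow used here: the triangular starting matrix stands in for a perfectly nested network, `rewire` is a hypothetical helper, and discarding both pairs after a rejected exchange is an assumption, since the text does not say how rejections are handled.

```python
import random

def rewire(M, rng):
    """Exchange interacting and non-interacting pairs, using each pair at most once.

    Returns the list of networks recorded after each accepted exchange.
    """
    M = [row[:] for row in M]  # work on a copy so the input is left untouched
    n_h, n_v = len(M), len(M[0])
    ones = [(i, j) for i in range(n_h) for j in range(n_v) if M[i][j] == 1]
    zeros = [(i, j) for i in range(n_h) for j in range(n_v) if M[i][j] == 0]
    rng.shuffle(ones)   # shuffling gives random selection without replacement
    rng.shuffle(zeros)
    ensemble = []
    for (i1, j1), (i0, j0) in zip(ones, zeros):
        M[i1][j1], M[i0][j0] = 0, 1  # exchange the two interaction values
        empty_row = any(sum(row) == 0 for row in M)
        empty_col = any(sum(col) == 0 for col in zip(*M))
        if empty_row or empty_col:
            M[i1][j1], M[i0][j0] = 1, 0  # undo: exchange is not allowed
            continue
        ensemble.append([row[:] for row in M])
    return ensemble

# Illustrative 4-by-4 perfectly nested network (upper-left triangle of ones).
n = 4
nested = [[1 if j < n - i else 0 for j in range(n)] for i in range(n)]
ensemble = rewire(nested, random.Random(0))
```

Because every exchange swaps one interaction for one non-interaction, each network in `ensemble` has the same number of interactions as the starting network.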
We measure the nestedness and the modularity of each network in the ensemble using the default algorithms in the BiMat MATLAB package. The nestedness metric used is NODF, and the modularity is normalized [44]. We rearrange the networks in their most nested or most modular form as determined by the BiMat MATLAB package (see Fig 1 for examples) [43].
2.3 Generating life history traits
The life history traits for a given interaction network are chosen to ensure that all host and virus types can coexist in the long term [45], as summarized here.
First, we randomly sample target steady-state densities $H_i^*$ and $V_j^*$ for each host and virus. We also sample some of the life history traits, in particular the host carrying capacity K, adsorption rates ϕij, and burst sizes βij. All of these values are chosen by independent random sampling from ranges as specified in Table 1.
Next, we sample the host competition terms aii′. To begin, we set all intraspecific competition to one (aii = 1) and all interspecific competition terms to zero (aii′ = 0 for i′ ≠ i). To ensure coexistence among the hosts in the absence of viruses, we randomly sample target virus-free steady-state densities, denoted here $H_i^0$, from the range specified in Table 1. Coexistence is satisfied when

$$\sum_{i'=1}^{N_H} a_{ii'} H_{i'}^{0} = K \tag{3}$$

for each host i. Thus, for each host i, some interspecific competition terms must be made non-zero. We randomly choose an index k ≠ i and randomly sample aik between zero and one. If the new sum $\sum_{i'} a_{ii'} H_{i'}^{0}$ does not exceed the carrying capacity K, we repeat for a new index k. Once the carrying capacity is exceeded, we adjust the most recent aik so that Eqn 3 is satisfied exactly.
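As a rough illustration of this sampling loop, the Python sketch below builds one row of competition terms per host (it is not the code used for the study). It assumes the coexistence condition is that the competition-weighted sum of virus-free densities equals K for each host. The function name `sample_competition`, the densities `H0`, and the retry when all indices are used before K is exceeded are assumptions of this sketch; the text does not describe that last case.

```python
import random

def sample_competition(H0, K, rng, max_tries=1000):
    """Return a matrix a with sum_k a[i][k] * H0[k] equal to K for every host i."""
    n = len(H0)
    a = []
    for i in range(n):
        if H0[i] >= K:
            raise ValueError("virus-free density must be below the carrying capacity")
        for _ in range(max_tries):
            row = [1.0 if k == i else 0.0 for k in range(n)]  # a_ii = 1, others 0
            total = H0[i]
            reached = False
            others = [k for k in range(n) if k != i]
            rng.shuffle(others)  # random order of candidate indices k != i
            for k in others:
                row[k] = rng.random()
                total += row[k] * H0[k]
                if total > K:
                    # Shrink the most recent term so the sum equals K exactly.
                    row[k] -= (total - K) / H0[k]
                    reached = True
                    break
            if reached:
                break  # this host is done; otherwise retry (assumed behavior)
        else:
            raise ValueError("could not reach the carrying capacity")
        a.append(row)
    return a

# Illustrative values only; they are not taken from Table 1.
H0 = [6e4, 5e4, 7e4]
K = 1e5
a = sample_competition(H0, K, random.Random(0))
```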
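The remaining life history traits, the viral decay rates mj and the host growth rates ri, are solved for using the steady-state versions of the virus-host differential equations (Eqns 1 and 2):

$$m_j = \sum_{i=1}^{N_H} M_{ij}\,\phi_{ij}\,\beta_{ij}\,H_i^{*} \tag{4}$$

$$r_i = \frac{\sum_{j=1}^{N_V} M_{ij}\,\phi_{ij}\,V_j^{*}}{1 - \frac{1}{K}\sum_{i'=1}^{N_H} a_{ii'} H_{i'}^{*}} \tag{5}$$

where $H_i^*$ and $V_j^*$ are the target steady-state densities.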
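A short Python sketch of this last step is given below. It assumes the model has the usual form of logistic growth with competition, lysis proportional to Mij ϕij Vj, and viral production proportional to Mij ϕij βij Hi, and it sets both time derivatives to zero at the target densities. The function name `solve_rates` and all numerical values are illustrative assumptions, not values from Table 1.

```python
def solve_rates(H_star, V_star, K, a, M, phi, beta):
    """Solve the steady-state equations for host growth rates r and viral decay rates m."""
    n_h, n_v = len(H_star), len(V_star)
    # dV_j/dt = 0  =>  m_j = sum_i M_ij * phi_ij * beta_ij * H_i*
    m = [sum(M[i][j] * phi[i][j] * beta[i][j] * H_star[i] for i in range(n_h))
         for j in range(n_v)]
    r = []
    for i in range(n_h):
        lysis = sum(M[i][j] * phi[i][j] * V_star[j] for j in range(n_v))
        crowding = sum(a[i][k] * H_star[k] for k in range(n_h)) / K
        if crowding >= 1.0:
            raise ValueError("hosts exceed the carrying capacity; no positive growth rate")
        # dH_i/dt = 0  =>  r_i * (1 - crowding) = lysis
        r.append(lysis / (1.0 - crowding))
    return r, m

# Two hosts and two viruses; virus 0 infects host 0 only, virus 1 infects both.
H_star, V_star, K = [1e4, 2e4], [1e5, 1e5], 1e5
a = [[1.0, 0.5], [0.5, 1.0]]
M = [[1, 1], [0, 1]]
phi = [[1e-7, 1e-7], [1e-7, 1e-7]]
beta = [[10.0, 10.0], [10.0, 10.0]]
r, m = solve_rates(H_star, V_star, K, a, M, phi, beta)
```

With these inputs the decay rates come out near 0.01 and 0.03 and the growth rates near 0.025 and 0.0133, in the inverse of whatever time unit the adsorption rates use.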
2.4 Simulating community dynamics
We use MATLAB’s ode45 to numerically simulate the virus-host dynamical system (Eqns 1 and 2) with in silico interaction network and life history traits generated as described in §2.2 and §2.3. We use a relative error tolerance of $10^{-8}$ and specify regularly spaced time-points on the order of . Initial conditions are chosen by perturbing the target steady-state densities $H_i^*$ and $V_j^*$ by a multiplicative factor δ = ±0.3 where the sign of δ is chosen randomly for each host and each virus. After generating the time-series, we sample from the transient dynamics using a fixed sample frequency.
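The simulation step can be illustrated with the self-contained Python sketch below. It is not the MATLAB code used here. A fixed-step fourth-order Runge-Kutta integrator stands in for the adaptive ode45 solver, and the model is assumed to combine logistic growth with competition, lysis, and viral decay. The perturbation is read as an initial density of (1 + δ) times the steady-state density, which is one possible interpretation of the multiplicative factor. The parameter values, the 15 minute step, and the 8 hour sample interval are illustrative assumptions.

```python
import random

def rhs(y, p):
    """Time derivatives of the state y = hosts followed by viruses."""
    n_h, n_v = len(p["r"]), len(p["m"])
    H, V = y[:n_h], y[n_h:]
    dH = []
    for i in range(n_h):
        crowding = sum(p["a"][i][k] * H[k] for k in range(n_h)) / p["K"]
        lysis = sum(p["M"][i][j] * p["phi"][i][j] * V[j] for j in range(n_v))
        dH.append(p["r"][i] * H[i] * (1.0 - crowding) - H[i] * lysis)
    dV = []
    for j in range(n_v):
        gain = sum(p["M"][i][j] * p["phi"][i][j] * p["beta"][i][j] * H[i]
                   for i in range(n_h))
        dV.append(V[j] * gain - p["m"][j] * V[j])
    return dH + dV

def rk4_step(y, p, dt):
    """Advance the state by one classical Runge-Kutta step of size dt."""
    def shifted(k, s):
        return [yi + s * ki for yi, ki in zip(y, k)]
    k1 = rhs(y, p)
    k2 = rhs(shifted(k1, dt / 2), p)
    k3 = rhs(shifted(k2, dt / 2), p)
    k4 = rhs(shifted(k3, dt), p)
    return [yi + dt / 6 * (a + 2 * b + 2 * c + d)
            for yi, a, b, c, d in zip(y, k1, k2, k3, k4)]

def simulate(y0, p, dt, n_samples, steps_per_sample):
    """Integrate and record the state once every steps_per_sample steps."""
    y, samples = list(y0), []
    for _ in range(n_samples):
        for _ in range(steps_per_sample):
            y = rk4_step(y, p, dt)
        samples.append(y)
    return samples

# One host and one virus with steady state (H*, V*) = (1e4, 1e5); time in hours.
p = {"r": [0.01 / 0.9], "K": 1e5, "a": [[1.0]], "M": [[1]],
     "phi": [[1e-7]], "beta": [[10.0]], "m": [0.01]}
y_star = [1e4, 1e5]
rng = random.Random(1)
y0 = [x * (1 + rng.choice([-0.3, 0.3])) for x in y_star]  # delta = +/-0.3
series = simulate(y0, p, dt=0.25, n_samples=100, steps_per_sample=32)
steady = simulate(y_star, p, dt=0.25, n_samples=100, steps_per_sample=32)
```

Starting exactly at the steady state (`steady`) is a quick sanity check that the life history traits and the right-hand side agree: the densities should not move.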
2.5 Calculating correlation networks
Let t = {t1,…, tN} be the collection of sample times. Let Hi(tk) and Vj(tk) be the sampled time-series of host i and virus j at the single time-point tk ∈ t. We denote the log-transformations of the sample points as hi(tk) = log10 Hi(tk) and vj(tk) = log10 Vj(tk). The Pearson correlation coefficient between host i and virus j is defined as

$$R_{ij} = \frac{\sum_{k=1}^{N}\left(h_i(t_k) - \bar{h}_i\right)\left(v_j(t_k) - \bar{v}_j\right)}{\sqrt{\sum_{k=1}^{N}\left(h_i(t_k) - \bar{h}_i\right)^2}\,\sqrt{\sum_{k=1}^{N}\left(v_j(t_k) - \bar{v}_j\right)^2}} \tag{6}$$

where N is the number of sampled time points and $\bar{h}_i$ and $\bar{v}_j$ are the sample means. The correlation network R is the collection of Pearson correlation coefficients for all possible host-virus pairs, that is, a matrix of size NH by NV.
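For concreteness, the calculation of the correlation network can be written as the following Python sketch, which uses only the standard library and is not the code used for the study. The helper names `pearson` and `correlation_network` and the toy densities are illustrative assumptions.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long samples."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return sxy / math.sqrt(sxx * syy)

def correlation_network(H_series, V_series):
    """Matrix of correlations between log10 host and log10 virus time-series."""
    h = [[math.log10(x) for x in series] for series in H_series]
    v = [[math.log10(x) for x in series] for series in V_series]
    return [[pearson(hi, vj) for vj in v] for hi in h]

# One host and two viruses sampled at three time-points (toy densities).
H_series = [[10.0, 100.0, 1000.0]]
V_series = [[1.0, 10.0, 100.0], [100.0, 10.0, 1.0]]
R = correlation_network(H_series, V_series)
```

In this toy case the first virus rises with the host and the second falls, so the single row of `R` holds a correlation of +1 followed by −1.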
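We also consider correlations given a time-delay τ. The time-delayed Pearson correlation coefficient between host i and virus j is defined as

$$R^{\tau}_{ij} = \frac{\sum_{k=1}^{N}\left(h_i(t_k + \tau) - \bar{h}^{\tau}_i\right)\left(v_j(t_k) - \bar{v}_j\right)}{\sqrt{\sum_{k=1}^{N}\left(h_i(t_k + \tau) - \bar{h}^{\tau}_i\right)^2}\,\sqrt{\sum_{k=1}^{N}\left(v_j(t_k) - \bar{v}_j\right)^2}} \tag{7}$$

where N is the number of sampled time-points and $\bar{h}^{\tau}_i$ is the mean of the time-delayed host sample. The time-delay is applied to the host time-series so that it is sampled later in time, whereas the virus time-series sample is unchanged. The same number of sampled time-points N is used for both hosts and viruses.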
We consider two different approaches for implementing a time-delay τ. If identical time-delays are used for each host-virus pair, τ is a single community-wide time-delay. On the other hand, if time-delays are unique for each host-virus pair, τ = [τij] is a matrix of size NH by NV of unique pairwise time-delays. For both cases, the correlation network Rτ is the collection of time-delayed Pearson correlation coefficients.
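The two implementations can be sketched in Python as follows. This is a minimal illustration, not the code used for the study, under two assumptions: delays are expressed as a whole number of samples, and each host series is long enough to supply N samples after the largest delay. The series are taken to be already log-transformed, and all function names are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long samples."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def delayed_correlation(h, v, lag, n):
    """Correlate host samples k + lag with virus samples k, for k = 0..n-1."""
    return pearson(h[lag:lag + n], v[:n])

def community_delay_networks(h_all, v_all, lags, n):
    """One correlation network per community-wide delay."""
    return {lag: [[delayed_correlation(h, v, lag, n) for v in v_all] for h in h_all]
            for lag in lags}

def pairwise_delay_network(h_all, v_all, lags, n):
    """For each pair, keep the maximum correlation and the delay that attains it."""
    R, tau = [], []
    for h in h_all:
        best = [max((delayed_correlation(h, v, lag, n), lag) for lag in lags)
                for v in v_all]
        R.append([corr for corr, _ in best])
        tau.append([lag for _, lag in best])
    return R, tau

# Toy series: the host repeats the virus signal three samples later.
n, lags = 20, range(11)
v = [5.0 + math.sin(0.3 * k) for k in range(n)]
h = [5.0 + math.sin(0.3 * (k - 3)) for k in range(n + max(lags))]
by_lag = community_delay_networks([h], [v], lags, n)
R_tau, tau = pairwise_delay_network([h], [v], lags, n)
```

In the toy example the pairwise search recovers the three-sample delay with a correlation of essentially one. The pairwise implementation always selects the largest correlation available for each pair, whether or not the pair interacts.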
2.6 Evaluating correlation network efficacy
To evaluate how well the correlation network R predicts interactions in the weighted interaction network $\tilde{M}$, we binarize the two networks so that they may be compared directly. For the weighted interaction network $\tilde{M}$, non-zero values are categorized as interactions while zeros are categorized as non-interactions, which yields the binarized weighted interaction network. For the correlation network R, we categorize values according to a threshold c. Correlations greater than or equal to the threshold are categorized as interactions, while those that are less are non-interactions. The binarized correlation network for a threshold c is Rc.
To compare the binarized correlation network Rc with the binarized weighted interaction network, we count the number of interactions which Rc predicts correctly, the true positives, as well as the number of non-interactions which Rc predicts incorrectly, the false positives. We normalize the true and false positive counts by the number of actual interactions and non-interactions, respectively, in the binarized weighted interaction network. These are the true positive rate (TPR) and false positive rate (FPR).
We repeat the binarization for many thresholds between -1 and +1, the minimum and maximum possible values of Pearson’s correlation coefficient. The entire process results in a tradeoff, or receiver operating characteristic (ROC), curve for the correlation network R. We quantify the overall performance of the correlation network R as the maximum difference between TPR and FPR, which is known as Youden’s J-statistic or simply J [46]. The maximum difference J occurs at a particular threshold, the optimal threshold c*.
We note that J = 1 is a “perfect” recovery, that is, there exists a threshold c for which the thresholded correlation network Rc perfectly matches the binarized weighted interaction network. On the other hand, J = 0 means that the thresholded correlation networks Rc have a true positive rate (TPR) less than or equal to their false positive rate (FPR) across all thresholds.
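The scoring procedure is short enough to write out. The Python sketch below is an illustration, not the code used for the study; it scans an evenly spaced grid of thresholds and returns the ROC points, J, and c*. The function name `roc_and_youden`, the grid of 201 thresholds, and the two toy networks are assumptions of this sketch.

```python
def roc_and_youden(R, M_bin, n_thresholds=201):
    """Return the ROC points, Youden's J, and the threshold at which J is attained."""
    pairs = [(corr, m) for r_row, m_row in zip(R, M_bin)
             for corr, m in zip(r_row, m_row)]
    n_pos = sum(1 for _, m in pairs if m == 1)
    n_neg = len(pairs) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one interaction and one non-interaction")
    roc, best_j, best_c = [], float("-inf"), None
    for k in range(n_thresholds):
        c = -1.0 + 2.0 * k / (n_thresholds - 1)  # thresholds from -1 to +1
        tpr = sum(1 for corr, m in pairs if m == 1 and corr >= c) / n_pos
        fpr = sum(1 for corr, m in pairs if m == 0 and corr >= c) / n_neg
        roc.append((fpr, tpr))
        if tpr - fpr > best_j:  # keep the first threshold reaching the maximum
            best_j, best_c = tpr - fpr, c
    return roc, best_j, best_c

M_bin = [[1, 0], [0, 1]]

# Interacting pairs have the highest correlations: perfect recovery.
R_good = [[0.9, -0.2], [0.1, 0.8]]
roc_good, j_good, c_good = roc_and_youden(R_good, M_bin)

# Interacting pairs have the lowest correlations: no threshold helps.
R_bad = [[-0.5, 0.5], [0.5, -0.5]]
roc_bad, j_bad, c_bad = roc_and_youden(R_bad, M_bin)
```

The first toy network gives J = 1 at a threshold between 0.1 and 0.8. The second gives J = 0, attained at c = −1, where every pair is predicted to interact and TPR equals FPR.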
3 Results
3.1 Simple correlation networks
We computed simple Pearson correlation networks (Eqn 6) for two different in silico virus-host communities. The two interaction networks each have 10 hosts and 10 viruses. One interaction network is highly nested (Fig 2-A3) and the other is highly modular (Fig 2-B3). The life history traits for the two networks were generated as described in §2.3, with parameter ranges as specified in Table 1. The time-series were generated according to §2.4 with a timestep of minutes and a relative error tolerance of $10^{-8}$ (Fig 2-A1 and -B1). The time-series were sampled at a fixed frequency of 8 hours for 100 time-points per host and per virus type, resulting in a sample period of 800 hours ≈ 1 month (Fig 2-A2 and -B2).
We evaluated the efficacy of the two correlation networks with the procedure described in §2.6 (Fig 3). For each correlation network, we report the maximum difference between TPR and FPR, that is, Youden’s J-statistic or J, as well as the optimal threshold c*. To compute the p-value for J, we randomly shuffle the identities of the hosts and viruses in the original time-series, and calculate correlation networks for these shuffled time-series. We used N = 1000 random permutations and calculated J for each correlation network.
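One way to implement this permutation test is sketched below in Python (an illustration, not the code used for the study). Shuffling the identities of the hosts and of the viruses in the time-series is equivalent to permuting the rows and columns of the correlation network, so the sketch permutes R directly. The p-value formula (hits + 1)/(n + 1) is an assumption, since the text does not state how the p-value is computed from the permuted J values, and the function names and the toy network are hypothetical.

```python
import random

def youden_j(R, M_bin):
    """Maximum of TPR - FPR over thresholds in [-1, 1]."""
    pairs = [(corr, m) for r_row, m_row in zip(R, M_bin)
             for corr, m in zip(r_row, m_row)]
    n_pos = sum(1 for _, m in pairs if m == 1)
    n_neg = len(pairs) - n_pos
    # TPR and FPR change only when the threshold crosses an observed correlation,
    # so scanning -1 and the observed values covers every distinct prediction.
    candidates = {-1.0} | {corr for corr, _ in pairs}
    best = 0.0
    for c in candidates:
        tpr = sum(1 for corr, m in pairs if m == 1 and corr >= c) / n_pos
        fpr = sum(1 for corr, m in pairs if m == 0 and corr >= c) / n_neg
        best = max(best, tpr - fpr)
    return best

def permutation_p_value(R, M_bin, n_perm, rng):
    """Observed J and the fraction of label shuffles scoring at least as high."""
    j_obs = youden_j(R, M_bin)
    n_h, n_v = len(R), len(R[0])
    hits = 0
    for _ in range(n_perm):
        rows = rng.sample(range(n_h), n_h)  # shuffled host identities
        cols = rng.sample(range(n_v), n_v)  # shuffled virus identities
        R_perm = [[R[i][j] for j in cols] for i in rows]
        if youden_j(R_perm, M_bin) >= j_obs:
            hits += 1
    return j_obs, (hits + 1) / (n_perm + 1)

# Toy 5-by-5 example in which each virus infects exactly one host.
n = 5
M_bin = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
R = [[0.9 if i == j else -0.5 for j in range(n)] for i in range(n)]
j_obs, p_value = permutation_p_value(R, M_bin, n_perm=1000, rng=random.Random(0))
```

In the toy example only shuffles that map every host-virus pair back onto an interacting pair reach J = 1, so the resulting p-value is small.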
For the two in silico communities, simple correlation networks do not predict the interaction network. For the nested community we found J = 0.09 (p = 0.6; Fig 3-A3), and for the modular community we found J = 0.16 (p = 0.1; Fig 3-B3). The low J values mean that the correlation networks have high FPR compared to TPR across all thresholds, whereas a correlation network which successfully recovered the interaction network would have a high TPR and low FPR for at least one threshold, resulting in J ≈ 1. The large p-values mean that the reported J values are likely to occur by chance (see Supp Fig 1-A1 and -B1 for distributions).
3.2 Community-wide time-delayed correlation networks
We computed time-delayed Pearson correlation networks (Eqn 7) with a community-wide time-delay for the same two in silico communities, using the same time-series (Fig 2-A1 and -B1), a sample frequency of 8 hours, and 100 sampled time-points. We considered community-wide time-delays that were multiples of the sample frequency up to the sample period, so that there was always some overlap between the sample times. The time-delay was applied to the host time-series, that is, all host time-series were identically sampled later in time, while the virus time-series were sampled as before. We computed correlation networks for all considered community-wide time-delays, that is, we computed 100 correlation networks where τ = 8 hours, 16 hours,…, 800 hours.
The correlation networks vary with the community-wide time-delay, as can be seen from a few representative networks (Fig 4-A2 and -B2). For the nested community, none of the 100 time-delayed correlation networks successfully predict the interaction network (Fig 4-A3). The best score is J = 0.22 (p = 0.08) which occurs at a time-delay of τ = 768 hours. The low J value means that the correlation network has a high FPR compared to TPR across all thresholds, and the high p-value means that the reported J value is likely to occur by chance.
For the modular community, the best score is J = 0.46 (p = 0.001) which occurs at τ = 464 hours. While the measured J value is significant (p < 0.05), it is still low, with FPR = 0.2 and TPR = 0.66 (see Supp Fig 2). Thus, it is not evident that this is a “successful” correlation network for predicting interactions. Furthermore, without knowing the interaction network a priori – as is the case with in situ communities – we have no way of identifying the particular community-wide time-delay τ = 464 hours as the best choice.
3.3 Pairwise time-delayed correlation networks
We computed time-delayed Pearson correlation networks (Eqn 7) for the same two in silico communities, now with unique time-delays for each host-virus pair. We used the same time-series (Fig 2-A1 and -B1), a sample frequency of 8 hours, and 100 sampled time-points. We considered time-delays that were multiples of the sample frequency up to the sample period as before. The time-delays were applied to the host time-series, with each host potentially having a different time-delay, while the virus time-series were sampled as before. For host-virus pair (i, j), we computed correlations for all considered time-delays and recorded the maximum correlation coefficient and its associated time-delay τij.
The resulting correlation networks for the two in silico communities consist of positive correlations which are almost uniformly high (Fig 5-A3 and -B3). Neither interaction network is successfully recovered. For the nested community, J = 0.17 (p = 0.2), and for the modular community, J = 0.04 (p = 0.9).
3.4 Effects of sampling frequency
We repeated the three procedures for calculating correlation networks with and without time-delays (§3.1, 3.2, and 3.3) for varying sample frequencies. We used the same two in silico communities and time-series as before (Fig 2-A1 and -B1). We examined sample frequencies between 15 minutes and 48 hours. For each sample frequency considered, we sampled for 100 time-points. Since the number of sampled time-points was fixed, the sample periods varied between 25 hours and 4800 hours (≈ 6 months).
For the simple correlation networks without time-delays, we calculated the networks in the same way as described in §3.1. For the time-delayed correlation networks, we used time-delays as described in §3.2, that is, we considered time-delays that were multiples of the sample frequency up to the sample period. For the community-wide implementation, we report the maximum J across all considered time-delays for each sample frequency. For the pairwise implementation, we determined time-delays for each host-virus pair such that correlation was maximized as described in §3.3 for each sample frequency.
For both the nested and modular communities, moderate sample frequencies (1-12 hours) performed slightly better than very high sample frequencies (15 minutes) or very low sample frequencies (24 − 48 hours) across the three implementations (Fig 6). For the nested community, the highest score was J = 0.23 (p = 0.1) for a sample frequency of 4 hours, using a community-wide time-delay. The simple and pairwise time-delay implementations had only slightly lower scores. For the modular community, the highest score was J = 0.6 (p = 0.001) for a sample frequency of 12 hours, using a community-wide time-delay, which outperformed both the simple and pairwise time-delay implementations.
4 Discussion
Using in silico virus-host communities, we calculated correlation networks amongst viral and microbial population time-series using both simple correlations and time-delayed correlations. In the case of time-delayed correlations, we considered two implementations: a single community-wide time-delay and unique pairwise time-delays. The correlation networks for all three implementations failed to effectively recover the original interaction networks, as quantified by the efficacy score J, the maximum difference between true positive and false positive rates. There was a single test scenario involving a modular network where inference was found to be statistically significant (Supp Fig 2). Although significant, the efficacy score was still low. Furthermore, without knowing the interaction network a priori – such as when considering in situ communities – it is not clear which particular community-wide time-delay and threshold should be used, nor that they would be robust properties of the community (e.g. to different initial conditions or measurement noise). Because we observed low efficacy scores which were non-significant, we conclude that the correlation networks do not meaningfully predict interactions given this mechanistic model of virus-microbe interactions.
In this work, we examined virus-microbe interaction networks which were highly nested as well as networks which were highly modular. These structures are characteristic of complex networks where each host and virus population interacts with many others, as observed in virus-microbe communities using culture-based assays [8, 9, 10]. Complex interaction networks can drive weak correlations amongst interacting taxa and strong correlations amongst non-interacting taxa, due in part to mutual interactions which act as confounding variables (e.g. taxa A and B both interact with taxon C but not with each other) [41, 42]. It is also worth noting that confounding effects from nonlinearities and feedbacks in the underlying dynamical system might cause correlation networks to fail even for simple interaction networks.
Our results came from generating time-series via a particular dynamical model and applying a correlation-based inference method. There is some evidence that correlation performs poorly regardless of the underlying model (see [41, 42]). This has important implications for using correlation-based methods with experimental time-series: if correlation performs poorly for a wide range of in silico models, similar or even worse performance should be expected for in situ communities. At the very least, the particular correlation-based method of interest should be benchmarked in silico before its use with experimental data, including examining the effects of measurement noise and internal system stochasticity.
Despite the difficulties and ambiguities associated with correlation-based inference methods, they are still widely used within in situ studies of virus-microbe interactions and microbe-microbe interactions more generally. In light of the poor performance of correlation-based approaches, we advocate for increased studies of model-based inference. In essence, such model-based approaches ask the question: which interaction network is compatible with the observed changes in populations arising from an underlying dynamical model? Thus model-based approaches avoid the assumption that correlations provide direct information on interactions. Given favorable results of in silico benchmarking of model-based methods [41, 42, 47, 48, 49, 50], it will be important to take the next step: to investigate the efficacy of model-based inference of virus-microbe interaction networks in situ.