Abstract
High- (HNA) and low-nucleic acid (LNA) bacteria are two separated flow cytometry (FCM) groups that are ubiquitous across aquatic systems. HNA cell density often correlates strongly with heterotrophic production. However, the taxonomic composition of bacterial taxa within HNA and LNA groups remains mostly unresolved. Here, we associated freshwater bacterial taxa with HNA and LNA groups by integrating FCM and 16S rRNA gene sequencing using a machine learning-based variable selection approach. There was a strong association between bacterial heterotrophic production and HNA cell abundances (R2 = 0.65), but not with more abundant LNA cells, suggesting that the smaller pool of HNA bacteria may play a disproportionately large role in the freshwater carbon flux. Variables selected by the models were able to predict HNA and LNA cell abundances at all taxonomic levels, with highest accuracy at the OTU level. There was high system specificity as the selected OTUs were mostly unique to each lake ecosystem and some OTUs were selected for both groups or were rare. Our approach allows for the association of OTUs with FCM functional groups and thus the identification of putative indicators of heterotrophic activity in aquatic systems, an approach that can be generalized to other ecosystems and functioning of interest.
Introduction
A key goal in the field of microbial ecology is to understand the relationship between microbial diversity and ecosystem functioning. However, it is challenging to associate bacterial taxa with specific ecosystem processes. Marker gene surveys have shown that natural bacterial communities are extremely diverse; however, the presence of a taxon does not imply that it is active. Taxa present in these surveys may have low metabolic potential, be dormant, or have recently died [1, 2]. Therefore, new methodologies that integrate different data types are needed to associate bacterial taxa with ecosystem functions in order to ultimately model and predict them [3].
One such advance is the use of flow cytometry (FCM), which has been used extensively to study aquatic microbial communities [4-6]. This single-cell technology partitions individual microbial cells into phenotypic groups based on their observable optical characteristics. Most commonly, cells are stained with a nucleic acid stain (e.g. SYBR Green I) and upon analysis assigned to either a low nucleic acid (LNA) or a high nucleic acid (HNA) group [7-10]. HNA cells differ from LNA cells in both a considerable increase in fluorescence due to cellular nucleic acid content and scatter intensity due to cell morphology. The HNA group is thought to correspond to the ‘active’ fraction, whereas the LNA population has been considered as the ‘dormant’ or ‘inactive’ group of a microbial community [4, 11–13]. This is based on positive linear relationships between HNA abundance and (a) bacterial heterotrophic production (BP) [8, 12, 15], (b) bacterial activity measured using the dye 5-cyano-2,3-ditolyl tetrazolium chloride [16, 17], and (c) phytoplankton abundance [18]. Additionally, growth rates are higher for HNA than LNA cells [11-14, 19] and HNA cells accrue cell damage significantly faster than the LNA cells under temperature [20] and chemical oxidant stress [21].
One main research question that still remains is whether HNA and LNA groups are composed of unique taxa or if they are different physiological states of the same taxa. Bouvier et al. [9] proposed four possible scenarios: (1) bacteria start their life cycle in the HNA group and move to the LNA group upon death or inactivity; (2) cells in the HNA group originate from LNA cells undergoing cell division; (3) HNA and LNA consist of different non-overlapping taxa; (4) bacteria switch between groups from time to time in addition to having part of the community that is unique to each fraction. The view that HNA cells are more active is in line with scenario 1 and 2. On the other hand, several studies have found distinct groups with little taxonomic overlap and proposed scenario 3 [22, 23] or 3 and 4 [24]. In this case, HNA and LNA groups have been associated with different life strategies in bacterioplankton communities, such as large cell size (HNA) versus small cell size (LNA) [13, 23], genome size [15] and ploidy [22]. By combining FCM with taxonomic identification of bacterial communities, one can associate individual taxa with population dynamics and functioning.
In this study, we developed a novel approach to associate the dynamics of individual taxa with those of the LNA and HNA groups in freshwater lakes by using a machine learning variable selection strategy. We applied two variable selection methods, the Randomized Lasso [25] and the Boruta algorithm [26] to associate individual taxa with HNA and LNA cell abundances. This approach allowed us to associate specific taxa to FCM functional groups, and via the observed HNA-productivity relationship, to functioning. In addition, this approach enabled us to test the influence of rare taxa on these two groups as recent research has found that rare taxa may have a strong impact on community structure and functioning [27, 28]. To validate the RL-based association with the HNA and/or LNA group, we correlated taxon abundances with specific regions in the FCM fingerprint without prior knowledge of the HNA/LNA group. Furthermore, we tested for phylogenetic conservation of HNA and LNA functional groups and for the association between the selected taxa and productivity. The combination of FCM and 16S rRNA gene sequencing allows for the inference and assessment of the taxonomic structure of HNA and LNA groups, therefore advancing our ability to link bacterial taxa to their functionality in nature. This knowledge will help identify the taxa that drive carbon fluxes in freshwater ecosystems, which are disproportionately large relative to the global freshwater surface area [29].
Results
In this study, we developed a machine learning variable selection strategy to integrate FCM and 16S rRNA gene sequencing with the aim of inferring the bacterial drivers of functional groups in freshwater lake systems. We studied a set of oligo- to eutrophic small inland lakes, a short residence time mesotrophic freshwater estuary lake (Muskegon Lake), and a large oligotrophic Great Lake (Lake Michigan), all located in Michigan, USA. We showed that abundance variation of these FCM functional groups is predicted by a small subset of all taxa that are present in the environment. Selected taxa were mostly specific to an FCM group and a lake system, and across systems, association with HNA or LNA was not phylogenetically conserved. The relationship between selected taxa and productivity measurements was assessed for one of the lake systems (Muskegon Lake), which showed that HNA cells (and their putative bacterial taxa) likely turn over faster and contribute disproportionately to the freshwater carbon flux.
Study lakes are dominated by LNA cells
The inland lakes (6.3 × 10^6 cells/mL) and Muskegon Lake (6.0 × 10^6 cells/mL) had significantly higher total cell abundances than Lake Michigan (1.7 × 10^6 cells/mL; p = 2.7 × 10^-14). Across all lakes, the mean proportion of HNA cell counts (HNAcc) to total cell counts was much lower (29-33%) than the mean proportion of LNA cell counts (LNAcc; 67-71%). Ordinary least squares regression showed a strong correlation between HNAcc and LNAcc across all data (R2 = 0.45, P = 2 × 10^-24; Figure 1A); however, only Lake Michigan (R2 = 0.59, P = 5 × 10^-11) and Muskegon Lake (R2 = 0.44, P = 2 × 10^-9) had significant correlations when the three ecosystems were considered separately.
HNA cell counts and heterotrophic bacterial production are strongly correlated
For mesotrophic Muskegon Lake, there was a strong correlation between total bacterial heterotrophic production and HNAcc (R2 = 0.65, p = 1e-05; Figure 1B), no correlation between BP and LNAcc (R2 = 0.005, p = 0.31; Figure 1C), and a weak correlation between heterotrophic production and total cell counts (R2 = 0.18, p = 0.03; Figure 1D). There was a positive (HNA) and negative (LNA) correlation between the fraction of HNA or LNA to total cells and productivity, however, the relationship was weak and not significant (R2 = 0.14, p = 0.057).
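The R2 values reported above come from ordinary least squares regression. A minimal sketch of that calculation is given below, assuming a single predictor and using NumPy; the function name `ols_r2` and the toy values are hypothetical and are not data from this study.

```python
import numpy as np


def ols_r2(x, y):
    """Fit y = intercept + slope * x by least squares; return (slope, intercept, R2)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, deg=1)
    residuals = y - (intercept + slope * x)
    ss_res = np.sum(residuals ** 2)          # unexplained variation
    ss_tot = np.sum((y - y.mean()) ** 2)     # total variation around the mean
    return slope, intercept, 1.0 - ss_res / ss_tot


# Hypothetical toy values standing in for HNAcc and heterotrophic production.
hnacc = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
production = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
slope, intercept, r2 = ols_r2(hnacc, production)
```

The same function can be applied to LNAcc or total cell counts as the predictor to compare how much of the variation in production each one explains.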
Association of OTUs to functional groups by Randomized Lasso regression
The relevance of specific OTUs for predicting freshwater FCM functional group abundance was assessed using the Randomized Lasso (RL) approach, which assigns each taxon a score between 0 (unimportant) and 1 (highly important) with respect to the target variable: HNAcc or LNAcc. This score can be interpreted as the probability that an OTU will be included in the Lasso model to predict HNA or LNA cell abundances. Variations of HNAcc and LNAcc were modelled as a function of relative changes of OTUs. To address the negative correlation bias intrinsic to compositional data, compositions were first transformed using a centered log-ratio (CLR) transformation.
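The CLR transformation and the idea behind the RL score can be sketched as follows. This is an illustrative sketch rather than the exact implementation used in this study: the pseudocount for zero counts, the half-sample resampling, the feature down-weighting factor (`weakness`), the regularization strength (`alpha`) and the function names are all assumptions. The score returned for each taxon is the fraction of randomized Lasso fits in which its coefficient is non-zero.

```python
import numpy as np
from sklearn.linear_model import Lasso


def clr(counts, pseudocount=1.0):
    """Centered log-ratio transform of a samples x taxa count matrix.

    The pseudocount (an assumption) avoids taking the log of zero counts.
    """
    logged = np.log(np.asarray(counts, dtype=float) + pseudocount)
    return logged - logged.mean(axis=1, keepdims=True)


def randomized_lasso_scores(X, y, alpha=0.1, weakness=0.5, n_resamples=200, seed=0):
    """Selection frequency of each feature over randomized Lasso fits."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n_samples, n_features = X.shape
    X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)  # standardize features
    times_selected = np.zeros(n_features)
    for _ in range(n_resamples):
        # Draw a random half of the samples without replacement.
        rows = rng.choice(n_samples, size=n_samples // 2, replace=False)
        # Randomly down-weight about half of the features for this fit.
        scale = np.where(rng.random(n_features) < 0.5, weakness, 1.0)
        model = Lasso(alpha=alpha, max_iter=10000)
        model.fit(X[rows] * scale, y[rows])
        times_selected += model.coef_ != 0
    return times_selected / n_resamples
```

In use, `clr` would be applied to the OTU count table first, and the transformed table passed to `randomized_lasso_scores` together with HNAcc or LNAcc as the target.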
The RL score was used to implement a recursive variable elimination scheme. Specifically, OTUs were ranked by RL score from high to low, the lowest-ranked OTUs were removed iteratively, and at each step the Lasso was fitted to predict HNAcc and LNAcc from the remaining subset of OTUs. Performance was expressed as the cross-validated R2, i.e. the R2 between predicted and true values of HNAcc and LNAcc for samples that were held out using a leave-one-group-out cross-validation scheme, in which samples were grouped according to year and location of measurement. If the cross-validated R2 equals 1, predictions are equal to the true values; a value of 0 is equivalent to random guessing.
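The elimination scheme described above can be sketched with scikit-learn as follows. The regularization strength, the step size of one OTU per iteration and the function name `recursive_elimination` are assumptions made for illustration; `groups` stands for the year and location labels of the samples.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import r2_score
from sklearn.model_selection import LeaveOneGroupOut, cross_val_predict


def recursive_elimination(X, y, scores, groups, alpha=0.1, step=1):
    """Remove the lowest-scored features step by step.

    Returns a list of (number of features kept, cross-validated R2) pairs,
    starting with all features and ending with the single top-ranked one.
    """
    order = np.argsort(scores)[::-1]  # feature indices, highest score first
    results = []
    n_keep = len(order)
    while n_keep >= 1:
        keep = order[:n_keep]
        # Each sample is predicted by a model that never saw its group.
        predicted = cross_val_predict(
            Lasso(alpha=alpha, max_iter=10000),
            X[:, keep], y, groups=groups, cv=LeaveOneGroupOut(),
        )
        results.append((n_keep, r2_score(y, predicted)))
        n_keep -= step
    return results


def best_subset_size(results):
    """Number of features that gave the highest cross-validated R2."""
    return max(results, key=lambda pair: pair[1])[0]
```

Pooling the held-out predictions before computing a single R2, as done here, is one of several reasonable choices; averaging a per-group R2 would be an alternative.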
There was taxonomic dependency for both HNAcc and LNAcc across lake systems (Figure 2). The cross-validated R2 increased when lower-ranked OTUs were removed (moving from right to left on Figure 2); this increase was gradual for the inland lakes (Figure 2A) and Muskegon Lake (Figure 2C) but abrupt for Lake Michigan (Figure 2B). The set of taxa that resulted in the highest cross-validated R2 contained at most a quarter of the total number of taxa present (see solid (HNA) and dotted (LNA) lines in Figure 2), being 10.2% HNA and 15.3% LNA for the inland lakes, 4.0% HNA and 3.0% LNA for Lake Michigan, and 25.0% for both HNA and LNA in Muskegon Lake. This behavior was consistent for each lake system and FCM population. The Lake Michigan results differed the most from other lake systems, having the lowest cross-validated R2, a sharp rather than gradual increase in cross-validated R2, and a considerably lower minimal number of OTUs (13 for HNAcc, 10 for LNAcc). No relationship could be established between rankings of variable selection methods and the relative abundance of individual OTUs (Figure S1). Multiple taxa with low average abundance were included in the minimal set of predictive variables, whereas few highly abundant OTUs were included. HNAcc and LNAcc could be predicted with equivalent performance to relative HNA and LNA proportions, yet the increase between initial and optimal performance was bigger (Figure S2). The final predictive performance was lower when compositional data were not transformed using the CLR transformation (Figure S3).
Identification on different taxonomic levels: OTUs outperform all other taxonomic levels
To assess whether HNA and LNA groups were taxonomically conserved, compositional data were analyzed on all possible taxonomic levels for Muskegon Lake (Figure 3), using the same strategy as outlined in the previous section. The resulting cross-validated R2 values were considerably higher than zero on all taxonomic levels, meaning that at all levels individual taxonomic changes can be related to changes in HNAcc and LNAcc. Even though the OTU level resulted in the best prediction of HNAcc and LNAcc (Figure 3), each individual OTU had a lower RL score compared to other taxonomic levels, which on average became lower as the taxonomic level decreased (Figure S4). The fraction of variables (taxa) that could be removed to reach the maximum cross-validated R2 decreased as the taxonomic level became less resolved.
Validation of OTU selection results with the Boruta algorithm
The OTU results were validated with an additional variable selection strategy, called the Boruta algorithm. This approach allowed the further generalization of the findings presented above. In addition, it connects with Random Forest results from other studies, which have been described recently in microbiome studies of other systems (see [30] and [31]). The Boruta algorithm selects relevant variables based on statistical hypothesis testing between the variable importance of an original variable and the importance of the most important permuted variable (see materials and methods), as retrieved from multiple Random Forest models. Selected variables are ranked as ‘1’, tentative variables as ‘2’, and all other variables get lower ranks, depending on the stage in which they were eliminated. The Boruta algorithm was applied for all three lake systems at the OTU level; selected OTUs are visualized in Figure S5. The fraction of selected OTUs was always smaller than 1% across lake systems and functional groups (Figure S6). The top-scored OTU according to the RL was also selected according to the Boruta algorithm for HNAcc for all lake systems; for LNAcc both methods only agreed for Lake Michigan (Table 1). OTU060 (Proteobacteria;Sphingomonadales;alfIV_unclassified) was the only OTU selected as a function of LNAcc across all lake systems, whereas no OTUs were selected across lake systems for HNAcc. As Random Forest regressions are the base method of the Boruta algorithm, we compared the predictive power of Boruta-selected OTUs to that of all OTUs using Random Forest regression. For all lake systems and functional groups, performance increased when only selected OTUs were included in the model (Table S1). Lasso predictions, in which OTUs were selected according to the RL, were better than Random Forest predictions in which OTUs were selected according to the Boruta algorithm (Figure S7). 
The fraction of selected OTUs according to the Boruta algorithm was lower than the optimal amount of OTUs according to the RL.
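The core idea of the Boruta algorithm, comparing each original variable against the most important permuted ("shadow") variable over multiple Random Forest models, can be sketched as below. This is a simplified, single-pass illustration and not the full algorithm: it uses the impurity-based importances of scikit-learn's `RandomForestRegressor`, omits the iterative elimination of rejected variables and the multiple-testing correction, and uses a plain one-sided binomial test. All function names and parameter values are assumptions.

```python
import math

import numpy as np
from sklearn.ensemble import RandomForestRegressor


def shadow_feature_hits(X, y, n_rounds=20, n_trees=50, seed=0):
    """Count how often each feature beats the best permuted copy."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    hits = np.zeros(n_features, dtype=int)
    for round_index in range(n_rounds):
        # Shadow features: each column shuffled independently, which breaks
        # any relationship with the target while keeping its distribution.
        shadows = np.column_stack(
            [rng.permutation(X[:, j]) for j in range(n_features)]
        )
        forest = RandomForestRegressor(n_estimators=n_trees, random_state=round_index)
        forest.fit(np.hstack([X, shadows]), y)
        importance = forest.feature_importances_
        hits += importance[:n_features] > importance[n_features:].max()
    return hits


def binomial_upper_tail(k, n, prob=0.5):
    """P(at least k successes in n trials) under a binomial(n, prob) null."""
    return sum(
        math.comb(n, i) * prob ** i * (1 - prob) ** (n - i) for i in range(k, n + 1)
    )


def select_features(hits, n_rounds, level=0.01):
    """Flag features whose hit count is unlikely under the null of no relevance."""
    pvalues = np.array([binomial_upper_tail(int(h), n_rounds) for h in hits])
    return pvalues < level
```

Because only features that beat the best shadow far more often than chance are kept, this kind of test tends to be conservative, which is consistent with the small fractions of OTUs selected above.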
In this way, a number of findings could be generalized independent of a specific method: 1) selected OTUs were mostly lake system specific, 2) a small fraction of OTUs was needed to predict changes in community composition, 3) selected OTUs were often rare and their selection showed no relationship with abundance, and 4) top RL-ranked HNA OTUs were also selected according to the Boruta algorithm, which suggests that the phylogeny of these taxa merits closer inspection.
HNA- and LNA-associated OTUs differed across lake systems
Selected OTUs were mostly assigned to either the HNA or LNA groups and there was limited correspondence across lake systems between the selected OTUs (Figure 4). In Muskegon Lake, OTU173 (Bacteroidetes;Flavobacteriales;bacII-A) was selected as the major HNA-associated taxon while OTU29 (Bacteroidetes;Cytophagales;bacIII-B) had the highest RL score for LNA OTUs. In Lake Michigan, OTU25 (Bacteroidetes;Cytophagales;bacIII-A) was selected as the major HNA-associated taxon while OTU168 (Alphaproteobacteria;Rhizobiales;alfVII) was selected as a major LNA-associated taxon. For the inland lakes, OTU369 (Alphaproteobacteria;Rhodospirillales;alfVIII) was the major HNA-associated OTU while OTU555 (Deltaproteobacteria;Bdellovibrionaceae;OM27) was the major LNA-associated taxon. Many more OTUs were selected in Muskegon Lake (197 OTUs, compared to 134 OTUs from the inland lakes and 21 OTUs from Lake Michigan) and these OTUs were often associated with both HNA and LNA groups.
RL scores were correlated for HNAcc and LNAcc within each lake system (Inland r = 0.25, P < 0.001; Michigan r = 0.59, P < 0.001; Muskegon r = 0.59, P < 0.001). Only OTUs that were present in all three freshwater environments were considered to calculate correlations between lake systems (190 in total, Figure S8). RL scores were lake ecosystem specific, with a significant similarity only between the inland lakes and Muskegon Lake using the RL for HNAcc (r = 0.21, P = 0.0042). Note that the correlation within a lake system therefore differs from previously reported values (as not all OTUs were considered), yet differences were small and results were comparable. The Boruta algorithm selected mostly OTUs that were unique both for the lake system and functional population (Figure S5).
Selected HNA and LNA OTUs do not have a phylogenetic signal
While many of the 258 OTUs selected by the RL were one of a few members of their phylum (e.g. Firmicutes; Epsilonproteobacteria; OTU717 in Lentisphaerae; OTU267 in Omnitrophica; etc), the Bacteroidetes (60 OTUs), Betaproteobacteria (36 OTUs), Alphaproteobacteria (22 OTUs), and Verrucomicrobia (21 OTUs) were a total of 54% of the selected OTUs (Figure 5). Of these top four phyla, the majority of their membership were within the LNA group (41-52% of selected OTUs), with the minority of OTUs within the HNA group (14-30% of selected OTUs), and a quarter to a third of the OTUs were selected as members of both the LNA and HNA groups (23-36% of selected OTUs).
To evaluate how much phylogenetic history explains whether a selected taxon was associated with the HNA and/or LNA group(s), we calculated the phylogenetic signal, which is a measure of the dependence among species’ trait values on their phylogenetic history [32]. If the phylogenetic signal is very strong, taxa belonging to similar phylogenetic groups (e.g. a Phylum) will share the same trait (i.e. association with HNAcc or LNAcc). Alternatively, if the phylogenetic signal is weak, taxa within a similar phylogenetic group will have different traits. For the most part, Pagel’s lambda was used [33] to test for phylogenetic signal where lambda varies between 0 and 1. A lambda value of 1 indicates complete phylogenetic patterning whereas a lambda of 0 indicates no phylogenetic patterning and leads to a tree collapsing into a single polytomy. There was no phylogenetic signal with FCM functional group used as a discrete character (i.e. HNA, LNA, or Both). As a continuous character using the RL scores for HNA (Figure S9), there was also no phylogenetic signal (lambda = 0.16; P = 1). There was a significant LNA signal (p = 0.003), however, the lambda value was 0.66, suggesting weak phylogenetic structuring in the LNA group. However, this significant result in the LNA was not replicated with other measures of phylogenetic signal (Blomberg’s K (HNA: p = 0.63; LNA: p = 0.54), and Moran’s I (HNA: p = 0.88; LNA: p = 0.12)) indicating that there is likely no phylogenetic signal in the taxa that drive the dynamics in either the HNA or the LNA group.
Flow cytometry fingerprints confirm associated taxa and reveal complex relationships between taxonomy and flow cytometric fingerprints
To confirm the association of the final selected OTUs with the HNA and LNA groups, we calculated the correlation between the density of individual regions (i.e. “bins”) in the flow cytometry data and the relative abundances of the OTUs. The Kendall rank correlation coefficient between OTU abundances and counts in the flow cytometry fingerprint was calculated for each of the top HNA OTUs selected by the RL within each of the three systems. The correlation coefficient was visualized for each bin in the flow cytometry fingerprint (Figure 6). As these values denote correlations, they do not indicate actual presence. OTU25 correlated with almost the entire HNA region, whereas OTU173 was limited to the lower part of the HNA region. In contrast, OTU369 was positively correlated with both the LNA and HNA regions of the cytometric fingerprint, consistent with the results from Figure 4, where OTU369 was selected as a function of both HNA and LNA. The threshold that was used to define HNAcc and LNAcc lies very close to the actual corresponding regions.
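The per-bin correlation described above can be sketched as follows, assuming the fingerprint is available as a samples x bins matrix of cell densities and using `scipy.stats.kendalltau`. The function name `bin_correlations` and the matrix layout are assumptions; bins without any variation across samples are left as NaN, because a rank correlation is undefined there.

```python
import numpy as np
from scipy.stats import kendalltau


def bin_correlations(otu_abundance, fingerprint):
    """Kendall tau between one OTU's abundance and the density in each bin.

    otu_abundance: one value per sample.
    fingerprint: samples x bins matrix of densities.
    """
    otu_abundance = np.asarray(otu_abundance, dtype=float)
    fingerprint = np.asarray(fingerprint, dtype=float)
    taus = np.full(fingerprint.shape[1], np.nan)
    for b in range(fingerprint.shape[1]):
        column = fingerprint[:, b]
        if column.max() > column.min():  # skip bins that are constant
            tau, _pvalue = kendalltau(otu_abundance, column)
            taus[b] = tau
    return taus
```

Reshaping the returned vector to the two-dimensional grid of the fingerprint would give a correlation map of the kind shown in Figure 6.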
Proteobacteria and rare taxa correlate with productivity measurements
The Kendall rank correlation coefficient was calculated between CLR-transformed abundances of individual OTUs and productivity measurements. OTU481 was significantly correlated after correction for multiple hypothesis testing using the Benjamini-Hochberg procedure (P < 0.001, P_adj = 0.016). This OTU had however a low RL score (0.022) and was not selected according to the Boruta algorithm. Of the top 10 OTUs according to the RL, three still had significant P-values (OTU614: P = 0.0064; OTU412, P = 0.044; OTU487, P = 0.014). Some OTUs that had a high RL score also had a positive response to productivity measurements (Figure S10). At the phylum level, only Proteobacteria were significantly correlated to productivity measurements after Benjamini-Hochberg correction (P < 0.001, P_adj = 0.010).
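The Benjamini-Hochberg adjustment used above can be written in a few lines. The sketch below is a standard formulation of the procedure in NumPy; the function name and the example P-values are hypothetical and are not values from this study.

```python
import numpy as np


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P-values in the original order."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Scale the i-th smallest P-value by m / i.
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working down from the largest P-value.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(scaled, 0.0, 1.0)
    return adjusted


# Hypothetical P-values for four taxa.
example = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

A taxon is then called significant when its adjusted P-value falls below the chosen false discovery rate.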
Discussion
Our study introduces a novel computational workflow to investigate relationships between microbial diversity and ecosystem functioning. Specifically, we aimed to study the ecology of flow cytometric functional groups (i.e. HNA and LNA) by associating their dynamics with those of bacterial taxa (i.e. OTUs). We simultaneously collected flow cytometry and 16S rRNA gene sequencing data from three types of freshwater lake systems in the Great Lakes region, and bacterial heterotrophic productivity from one lake ecosystem, and used a machine learning-based variable selection strategy, known as the Randomized Lasso, to associate one with another. Our results showed that (1) there was a strong correlation between bacterial heterotrophic productivity and HNA cell abundances, (2) HNA and LNA cell abundances were best predicted by a small subset of OTUs that were unique to each lake type, (3) some OTUs were included in the best model for both HNA and LNA abundance, (4) there was no phylogenetic conservation of HNA and LNA group association and (5) freshwater FCM fingerprints display more complex patterns related to OTUs and productivity than the traditional dichotomy of HNA and LNA suggests. While HNA and LNA groups are universal across aquatic ecosystems, our data suggest that some bacterial taxa contribute to both HNA and LNA groups and that the taxa driving HNA and LNA abundance are unique to each lake system.
Although high-nucleic acid cell counts (HNAcc) and low-nucleic acid cell counts (LNAcc) were correlated with each other, only the association between bacterial heterotrophic production (BP) and HNAcc was strong and significant. This correlation between BP and HNA is higher than previously reported values, though previous reports have focused on the proportion of HNA rather than absolute cell abundances with the majority of data collected from marine systems. For example, Bouvier et al. [9] found a correlation between the fraction of HNA cells and BP within a large dataset of 640 samples across various freshwater to marine samples (r = 0.49), whereas a study off the coast of the Antarctic Peninsula found a moderate correlation (R2 = 0.36; [15]). Another study in the Bay of Biscay also found this association (R2 = 0.16; [13]), however, the authors attributed this difference to be related to cell size and not due to the activity of HNA. Notably, these studies were predominantly testing the association of marine HNA and the reason for the stronger correlation in our study may be due to the nature of the freshwater samples. As such, future studies in freshwater environments should test this hypothesis, which is especially important for understanding the broader influence that HNA bacteria may have in the context of the disproportionately large role that freshwater systems play as hotspots in the global carbon cycle [29]. Finally, as our correlations with proportional HNA abundance also indicated less strong correlations than with absolute HNAcc, we suggest absolute HNAcc should be used to best predict heterotrophic bacterial production with FCM data.
The use of machine learning methods, such as the Lasso and Random Forest, is becoming more common in the microbiome literature, as these approaches are able to deal with multi-dimensional data and test the predictive power of a combined set of variables [34-36]. Although the Lasso already uses an intrinsic variable selection strategy, it has been noted that the Lasso method is not suited for compositional data because the regression coefficients have an unclear interpretation, and single variables may be selected when correlated to other variables [37]. When performing variable selection with Random Forests, traditional variable importance measures such as the mean decrease in accuracy can be biased towards correlated variables [38]. Our approach included algorithms that extend these traditional machine learning algorithms, i.e. the Randomized Lasso and the Boruta algorithm [25, 26]. These methods make use of resampling and randomization, which makes it possible to either assign a probability of selection (RL) or statistically decide which OTUs to select (Boruta). Both the RL and Boruta algorithm have been applied to microbiome studies before. Examples for RL include the selection of genera in the gut microbiome in relation to BMI [34] or the selection of OTUs from the oral microbiome as a function of salivary pH and lysozyme activity [39]. The Boruta algorithm has been applied to select relevant genera, for example in the gut microbiome in relation to multiple sclerosis [31] or as a function of different diets during pregnancy of primates [30]. Moreover, the Boruta algorithm has been recently proposed as one of the top-performing variable selection methods that make use of Random Forests [40]. The ability of our approach to identify unique sets of OTUs predictive of HNAcc and LNAcc despite the correlation between HNAcc and LNAcc (Figure 1A) illustrates the power of machine learning-based variable selection methods. 
However, there is still room for improvement when attempting to integrate these different types of data. For example, 16S rRNA gene sequencing still faces the hurdles of DNA extraction [41] and 16S copy number bias [42]. Moreover, detection limits are different for FCM (expressed in the number of cells) and 16S rRNA gene sequencing (expressed in the number of gene counts or relative abundance), which create data that may be different in resolution. Future work may focus on developing ways around these shortcomings to further improve the integration of FCM with 16S rRNA gene sequencing.
In our study, only a minority of OTUs was needed to predict specific flow cytometric group abundances. While each OTU individually had low predictive power, the selected group of OTUs was generally a strong predictor of HNAcc and LNAcc. In addition, the selected OTUs were often rare and thus no relationship could be established between the RL score and the abundance of an OTU (Figure S3). These results are in line with recent findings of Herren & McMahon [28], who reported that a minority of low abundance taxa explained temporal compositional changes of microbial communities. The selection of different sets of HNA and LNA OTUs across the three freshwater systems indicates that different taxa underlie the universally observed HNA and LNA functional groups across aquatic systems. This is in line with strong species sorting in lake systems [43, 44], shaping community composition through diverging environmental conditions between the lake systems presented here [45]. This high system specificity also explains the low RL scores for individual OTUs, as the spatial dynamics of an OTU diverged strongly across systems. (For example, an OTU that has an RL score of 0.5 implies that on average it will only be chosen one out of two times in a Lasso model).
Based on the high correlation between BP and HNAcc and the low correlation between BP and LNAcc, the high proportion of LNA cells across all lake systems might indicate that the majority of cells in the bacterial community are dormant or have very low activity. This agrees with previous research showing that up to 40% [46] or even 64-95% [47] of cells in freshwater systems are inactive or dormant. In fact, up to 60-80% of the OTUs in freshwater lakes have been reported to be dormant [48]. Based on variable environmental conditions sampled across our dataset, some of the taxa that are predominantly dormant in one sample may contribute to activity in another sample. If this differing contribution to activity also covaries with a taxon’s abundance, these taxa may be considered to be ‘conditionally rare taxa’ [49], and previously 1-2% of freshwater lake OTUs have been reported to be conditionally rare [27]. It has also been shown that marine heterotrophic bacteria can survive for at least 8 months (maximum tested length) in a starved state [50]. These factors may explain why some OTUs were included in both the HNAcc and LNAcc models, which is in line with scenario 1 from Bouvier et al. [9] (i.e. the transitioning of cells from active growth to death or inactivity). Alternatively, the same OTU may occur in both HNA and LNA groups due to phenotypic plasticity. Phenotypic plasticity has been shown for bacterial morphology and size, for example during predation and carbon starvation [51]. The fact that HNA and LNA groups have been suggested to correspond to cells of differing size, with HNA harboring larger cell sizes [10, 23], is in line with this hypothesis. Finally, the OTU level grouping of bacterial taxa can disguise genomic and phenotypic heterogeneity [52-55], which may be an explanation for inconsistent associations between OTUs and FCM functional groups.
While all taxonomic levels resulted in a model with predictive power, the best model was at the most resolved taxonomy (i.e. OTU), indicating that it is unlikely that OTUs within the HNA and LNA groups are phylogenetically conserved. Indeed, when analyzing the data at an OTU level, very little phylogenetic conservation was found between selected OTUs for HNA and LNA groups. This is in contrast to a recent study that found a clear signal at the phylum level [23]. Proctor et al. [23] showed separate bacterial clusters between HNA and LNA groups across different aquatic systems. However, this was not the case for lake water samples. It is notable that Proctor et al. [23] separated HNA and LNA cells based on cell size (where HNA cells were >0.4 μm and LNA cells were 0.2-0.4 μm, based on 50-90% removal of HNA cells after filtering), while our study separated these FCM functional groups on the basis of fluorescence intensity alone. Moreover, our study assessed associations between OTUs and population dynamics, while Proctor et al. [23] assessed actual presence.
The Boruta algorithm and RL scores agreed on the top-ranked HNA OTU for all lake systems, which motivates further investigation of the ecology of these OTUs. While little information exists on the identities of HNA and LNA freshwater lake bacterial taxa, several studies have identified Bacteroidetes among the most prominent HNA taxa, which is in line with our findings. Vila-Costa et al. [24] found that the HNA group was dominated by Bacteroidetes in summer samples from the Mediterranean Sea, Read et al. [17] showed that HNA abundances correlated with Bacteroidetes, and Schattenhofer et al. [22] reported that the Bacteroidetes accounted for the majority of HNA cells in the North Atlantic Ocean. In Muskegon Lake, OTU173 was the dominant HNA taxon and is a member of the Order Flavobacteriales (bacII-A). The bacII group is a very abundant freshwater bacterial group and has been associated with senescence and decline of an intense algal bloom [56]. BacII-A has also made up ~10% of the total microbial community during cyanobacterial blooms, reaching its maximum density immediately following the bloom [57]. In Lake Michigan, OTU25, a member of the Bacteroidetes Order Cytophagales known as bacIII-A, was the top HNA OTU. Much less is known about this specific group of Bacteroidetes. However, the bacII-A/bacIII-A group has been strongly associated with more heterotrophically productive headwater sites (compared to higher order streams) from the River Thames, showing a negative correlation in rivers with dendritic distance from the headwaters, indicating that these taxa may contribute more to productivity [17]. In the inland lakes, OTU369 was the major HNA taxon and is associated with the Alphaproteobacteria Order Rhodospirillales (alfVIII), which to our knowledge is a group with very little information available in the literature. 
In contrast to our findings of Bacteroidetes and Alphaproteobacterial HNA selected OTUs, Tada & Suzuki [58] found that the major HNA taxon from an oceanic algal culture was from the Betaproteobacteria whereas LNA OTUs were within the Actinobacteria phylum.
Conclusions
Our results indicate that there are taxonomic differences between HNA and LNA groups in freshwater lake systems, though these are lake system specific. This result may be due to taxa switching between these groups, potentially due to genomic or phenotypic plasticity. The selected taxa differed more between lake systems than between HNA and LNA groups, and the two groups were not conserved phylogenetically. Thus, our results correspond most closely with research presented by Vila-Costa et al. [24], in which a taxonomic division was found between HNA and LNA groups, yet this division was not rigid and followed seasonal trends. Overall, our results support scenario 4 proposed by Bouvier et al. [9], where HNA and LNA exhibit a different taxonomy, but this taxonomy changes over time and space and may overlap. With this study, we show that different types of microbial ecological data can be integrated with machine learning to learn about the composition and functioning of bacterial populations in aquatic systems. Future studies on HNA and LNA bacterial groups should use genome-resolved metagenomics, metatranscriptomics, or single-cell genomics to decipher whether the traits that underpin the association of a taxon with an FCM group are related to genomic or phenotypic plasticity.
Materials and Methods
Data collection and DNA extraction, sequencing and processing
In this study, we used a total of 173 samples collected from three types of lake systems described previously [45], including: (1) 49 samples from Lake Michigan (2013 & 2015), (2) 62 samples from Muskegon Lake (2013-2015; one of Lake Michigan’s estuaries), and (3) 62 samples from twelve inland lakes in Southeastern Michigan (2014-2015). For more details on sampling, please see Figure 1 and the Field Sampling, DNA extraction, and DNA sequencing and processing sections within Chiang et al. [45]. In all cases, water for microbial biomass samples was collected and poured through 210 μm and 20 μm bleach-sterilized nitex mesh. Sequential in-line filtration was then performed using 47 mm polycarbonate in-line filter holders (Pall Corporation, Ann Arbor, MI, USA) and an E/S portable peristaltic pump with an easy-load L/S pump head (Masterflex®, Cole Parmer Instrument Company, Vernon Hills, IL, USA) to filter first through a 3 μm isopore polycarbonate filter (TSTP, 47 mm diameter, Millipore, Billerica, MA, USA) and second through a 0.22 μm Express Plus polyethersulfone membrane filter (47 mm diameter, Millipore, MA, USA). The current study only utilized the 3-0.22 μm fraction for analyses.
DNA extractions and sequencing were performed as described in Chiang et al. [45]. Fastq files were submitted to the NCBI sequence read archive under BioProject accession numbers PRJNA412984 and PRJNA414423. We analyzed the sequence data using MOTHUR V.1.38.0 (seed = 777; [59]) based on the MiSeq standard operating procedure; the workflow is available at the following link: https://github.com/rprops/Mothur_oligo_batch. A combination of the Silva Database (release 123; [60]) and the freshwater TaxAss 16S rRNA database and pipeline [61] was used for classification of operational taxonomic units (OTUs).
For the taxonomic analysis, each of the three lake datasets was analyzed separately and filtered with an OTU abundance threshold of at least 5 sequences in 10% of the samples in the dataset (similar strategy to [62]). For comparison of taxonomic abundances across samples, each of the three datasets was then rarefied to an even sequencing depth, which was 4,491 sequences for the Muskegon Lake samples, 5,724 sequences for the Lake Michigan samples, and 9,037 sequences for the inland lake samples. Next, the relative abundance at the OTU level was calculated using the transform_sample_counts() function in the phyloseq R package [63] by dividing each count value by the sequencing depth of the sample. For all other taxonomic levels, the taxonomy was merged at the corresponding taxonomic rank using the tax_glom() function in phyloseq [63] and the relative abundance was re-calculated.
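The filtering and normalization steps above can be illustrated with a minimal Python sketch. It is not the phyloseq workflow used in the study: the function name `filter_and_normalize` and the toy count table are hypothetical, and the rarefaction step is left out for brevity.

```python
import numpy as np

def filter_and_normalize(counts, min_reads=5, min_prevalence=0.10):
    """Keep OTUs with >= min_reads in >= min_prevalence of samples, then
    convert the remaining counts to per-sample relative abundances."""
    counts = np.asarray(counts, dtype=float)          # samples x OTUs
    prevalence = (counts >= min_reads).mean(axis=0)   # fraction of samples per OTU
    keep = prevalence >= min_prevalence
    kept = counts[:, keep]
    depth = kept.sum(axis=1, keepdims=True)           # sequencing depth per sample
    return kept / depth, keep

# Hypothetical toy table: 4 samples (rows) x 3 OTUs (columns).
toy_counts = np.array([
    [50, 0, 50],
    [30, 1, 69],
    [10, 0, 90],
    [60, 2, 38],
])
rel_abund, kept_otus = filter_and_normalize(toy_counts)
```

In this toy table the second OTU never reaches 5 reads in any sample, so it is dropped and the remaining two columns are rescaled to sum to one per sample.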
Heterotrophic bacterial production measurements
Muskegon Lake samples from 2014 and 2015 were processed for heterotrophic bacterial production by measuring [3H]-leucine incorporation into bacterial protein in the dark [64, 65]. At the end of the incubation with [3H]-leucine, cold trichloroacetic acid-extracted samples were filtered onto 0.2 μm filters, and the material retained on the filters represented the leucine incorporated by the bacterial community. Measured leucine incorporation during the incubation was converted to a bacterial carbon production rate using a standard theoretical conversion factor of 2.3 kg C per mole of leucine [65].
Flow cytometry, measuring HNA and LNA
In the field, 1 mL of 20 μm filtered lake water was fixed with 5 μL of glutaraldehyde (20% vol/vol stock), incubated for 10 minutes on the bench (covered with aluminum foil to protect from light degradation), and then flash frozen in liquid nitrogen and stored in a −80°C freezer until processing with a flow cytometer. Flow cytometry procedures followed the protocol laid out in Props et al. [66], which also uses the samples presented in the current study. Samples were stained with SYBR Green I and measured in triplicate. The lowest number of cells collected after denoising was 2342. HNA and LNA groups were selected using the fixed gates introduced in Prest et al. [67] and plotted in Figure S11. Cell counts were determined per HNA and LNA group and averaged over the three replicates (giving rise to HNAcc and LNAcc).
Data analysis
Processed data and analysis code for the following analyses can be found on the GitHub page for this project at https://deneflab.github.io/HNALNAproductivity/.
HNA-LNA and HNA-Productivity Statistics and Regressions
We tested the difference in the absolute number of cells within HNA and LNA functional groups across lake systems by running an analysis of variance with a post-hoc Tukey HSD test (aov() and TukeyHSD(); stats R package; [68]). In addition, we tested the association of HNA and LNA with each other and with productivity by running ordinary least squares regression with the lm() function (stats R package; [68]).
Ranking correlation
Ranking correlation between variables was calculated using the Kendall rank correlation coefficient, using the kendalltau() function in Scipy (v1.0.0) or cor() in R (v3.2). The ‘tau-b’ implementation was used, which is able to deal with ties. Values range from −1 (strong disagreement) to 1 (strong agreement). The same statistic was used to assess the similarity between rankings of variable selection methods.
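As an illustration of the statistic, the sketch below computes Kendall's tau-b with SciPy on hypothetical scores for five taxa from two selection methods. The variable names and values are made up for the example.

```python
from scipy.stats import kendalltau

# Hypothetical scores given to five taxa by two selection methods.
scores_method_a = [0.9, 0.7, 0.7, 0.2, 0.1]
scores_method_b = [0.8, 0.6, 0.5, 0.3, 0.3]

# scipy's default variant is tau-b, which corrects for tied values.
tau, p_value = kendalltau(scores_method_a, scores_method_b)

# Reversing one ranking gives perfect disagreement.
tau_reversed, _ = kendalltau([1, 2, 3, 4], [4, 3, 2, 1])
```

Each input contains one tied pair, so eight of the ten pairs are concordant and none are discordant; the tie correction in tau-b then keeps the value just below 1.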
Centered-log ratio transform
First, following guidelines from Paliy & Shankar, Gloor et al., and Quinn et al. [69-71], relative abundances of OTUs were transformed using a centered log-ratio (CLR) transformation before variable selection was applied. This means that the relative abundance $x_i$ of a taxon was transformed according to the geometric mean of that sample, in which there are $p$ taxa present:

$$\mathrm{clr}(x_i) = \log\left(\frac{x_i}{\left(\prod_{j=1}^{p} x_j\right)^{1/p}}\right)$$

Zero values were replaced by $\delta = 1/p^2$. This was done using the scikit-bio package (www.scikit-bio.org, v0.4.1).
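The transformation can be sketched in plain NumPy as follows. This is a stand-in for the scikit-bio routines, not the code used in the study; the function name `clr_transform` is hypothetical, and the zero handling assumes a multiplicative replacement in which non-zero entries are rescaled so that each sample still sums to one.

```python
import numpy as np

def clr_transform(rel_abund):
    """Centered log-ratio transform of a samples x taxa relative abundance table."""
    x = np.asarray(rel_abund, dtype=float)
    p = x.shape[1]
    delta = 1.0 / p**2
    zeros = x == 0
    # Multiplicative replacement: zeros become delta, non-zeros are rescaled
    # so that every sample still sums to one.
    x = np.where(zeros, delta, x * (1.0 - zeros.sum(axis=1, keepdims=True) * delta))
    log_x = np.log(x)
    # Subtracting the mean log is the same as dividing by the geometric mean.
    return log_x - log_x.mean(axis=1, keepdims=True)

toy = np.array([[0.5, 0.3, 0.2, 0.0],
                [0.25, 0.25, 0.25, 0.25]])
clr_values = clr_transform(toy)
```

CLR values of a sample always sum to zero, and a perfectly even sample maps to a vector of zeros, which is a convenient sanity check.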
Lasso & stability selection
Scores were assigned to taxa based on an extension of the Lasso estimator, which is called stability selection [25]. In the case of $n$ samples, the Lasso estimator fits the following regression model:

$$\hat{\beta}^{\lambda} = \underset{\beta \in \mathbb{R}^{p}}{\arg\min} \; \lVert y - X\beta \rVert_2^2 + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert$$

in which $X$ denotes the abundance table, $y$ the target to predict, which is either HNA cell abundances (HNAcc) or LNA cell abundances (LNAcc), and $\lambda$ is a regularization parameter which controls the complexity of the model and prevents overfitting. The Lasso performs an intrinsic form of variable selection, as the weights of certain variables will be set to zero.
Stability selection, when applied to the Lasso, is in essence an extension of the Lasso regression. It implements two types of randomization to assign a score to the variables, and is therefore also called the Randomized Lasso (RL). The resulting RL score can be seen as the probability that a certain variable will be included in a Lasso regression model (i.e., its weight will be non-zero when fitted). When performing stability selection, the Lasso is fitted to $B$ different subsamples of the data of size $n/2$, denoted as $X'$ with corresponding $y'$. A second randomization is added by introducing a weakness parameter $\alpha$. In each model, the penalty $\lambda$ changes to a randomly chosen value in the set $\{\lambda, \lambda/\alpha\}$, which means that a higher penalty will be assigned to a random subset of the variables. The Randomized Lasso therefore becomes:

$$\hat{\beta}^{\lambda, w} = \underset{\beta \in \mathbb{R}^{p}}{\arg\min} \; \lVert y' - X'\beta \rVert_2^2 + \lambda \sum_{j=1}^{p} \frac{\lvert \beta_j \rvert}{w_j}$$

where $w_j$ is a random variable which is either $\alpha$ or 1. Next, the Randomized Lasso score (RL score) is determined by counting the number of times the weight of a variable was non-zero across the $B$ models and dividing this count by $B$. Meinshausen and Bühlmann show that, under stringent conditions, the number of falsely selected variables is controlled for the Randomized Lasso when the RL score is higher than 0.5. If $\lambda$ is varied, one can determine the stability path, which is the relationship between the selection probability $\pi$ (the RL score) and $\lambda$ for every variable. For our implementation, $B = 500$, $\alpha = 0.5$, and the highest score was selected in the stability path for which $\lambda$ ranged from $10^{-3}$ to $10^{3}$, logarithmically divided in 100 intervals. The RandomizedLasso() function from the scikit-learn machine learning library was used ([72]; v0.19.1).
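The two randomizations can be illustrated with a minimal sketch built on scikit-learn's `Lasso`. This is not the `RandomizedLasso()` implementation used in the study: the function name `randomized_lasso_scores`, the synthetic data, and the single fixed penalty are assumptions for illustration, and scikit-learn's `alpha` argument corresponds to λ only up to a scaling of the loss term. The sketch relies on the fact that penalizing |β_j|/w_j is equivalent to fitting an ordinary Lasso on columns multiplied by w_j.

```python
import numpy as np
from sklearn.linear_model import Lasso

def randomized_lasso_scores(X, y, lam=0.1, weakness=0.5, n_resamples=100, seed=0):
    """Fraction of resampled Lasso fits in which each variable has a non-zero weight."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    selected = np.zeros(p)
    for _ in range(n_resamples):
        rows = rng.choice(n, size=n // 2, replace=False)   # subsample of size n/2
        w = rng.choice([weakness, 1.0], size=p)             # w_j is alpha or 1
        # Penalising |beta_j| / w_j equals a plain Lasso on columns scaled by w_j.
        model = Lasso(alpha=lam, max_iter=10000).fit(X[rows] * w, y[rows])
        selected += model.coef_ != 0
    return selected / n_resamples

# Synthetic example: only the first two of ten variables drive the response.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)
rl_scores = randomized_lasso_scores(X, y)
```

Repeating the call over a grid of `lam` values and taking the maximum score per variable would approximate the stability path described above.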
Random Forests & Boruta
The Boruta algorithm is a wrapper algorithm that makes use of Random Forests as a base classification or regression method in order to select all variables that are relevant with respect to a response variable [26]. Similar to stability selection, the method uses an additional form of randomness in order to perform variable selection. Random Forests are fitted to the data multiple times. In each iteration, every variable receives a so-called shadow variable, which is a permuted copy of the original variable; the permutation removes any correlation with the response variable. Next, the Random Forest algorithm is run with the extended set of variables, after which variable importances are calculated for both original and shadow variables. The shadow variable that has the highest importance score is used as reference, and every variable with significantly lower importance, as determined by a Bonferroni-corrected t-test, is removed. Likewise, variables with an importance score that is significantly higher are included in the final list of selected variables. This procedure can be repeated until all original variables are either discarded or included in the final set; variables that remain undecided get the label ‘tentative’ (i.e., after all repetitions it is still not possible to either select or discard a certain variable). We used the boruta_py package to implement the Boruta algorithm (https://github.com/scikit-learn-contrib/boruta_py). Random Forests were implemented using the RandomForestRegressor() function from scikit-learn ([72]; v0.19.1). Random Forests were run with 200 trees, the number of variables considered at every split of a decision tree was p/3, and the minimal number of samples per leaf was set to five. The latter two settings were based on default values for Random Forests in a regression setting [73]. The Boruta algorithm was run for 300 iterations; variables were selected or discarded at P < 0.05 after Bonferroni correction.
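The shadow-variable idea can be sketched as below. This is a simplified illustration and not the boruta_py implementation: it only counts how often each original variable is more important than the best shadow variable and omits the statistical test and the iterative removal of variables. The function name `shadow_hits`, the synthetic data, and the reduced number of trees and iterations are assumptions chosen to keep the example fast.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def shadow_hits(X, y, n_iter=10, seed=0):
    """Count how often each variable beats the most important shadow variable."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    hits = np.zeros(p, dtype=int)
    for i in range(n_iter):
        # Shadow variables: each column permuted independently, which breaks
        # any link with the response while keeping the marginal distribution.
        shadows = np.column_stack([rng.permutation(X[:, j]) for j in range(p)])
        forest = RandomForestRegressor(
            n_estimators=50, max_features=1 / 3, min_samples_leaf=5, random_state=i
        ).fit(np.hstack([X, shadows]), y)
        importances = forest.feature_importances_
        hits += importances[:p] > importances[p:].max()
    return hits

# Synthetic example: only the first of six variables carries signal.
rng = np.random.default_rng(2)
X = rng.normal(size=(120, 6))
y = 4.0 * X[:, 0] + rng.normal(scale=0.5, size=120)
hits = shadow_hits(X, y)
```

In the full algorithm these hit counts feed a significance test that decides whether a variable is selected, discarded, or left tentative.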
Recursive variable elimination
Scores of the Randomized Lasso were evaluated using a recursive variable elimination strategy [74]. Variables were ranked according to the RL score. Next, the lowest-ranked variables were eliminated from the dataset, after which the Lasso was applied to predict HNAcc and LNAcc, respectively. This process was repeated until only the highest-scored taxa remained. In this way, performance of the Randomized Lasso was assessed from a minimal-optimal evaluation perspective [75]. In other words, the smallest number of variables that resulted in the highest predictive performance was determined.
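A minimal sketch of this elimination loop is given below. It assumes one variable is removed per step and uses plain 5-fold cross-validation with a fixed penalty rather than the blocked scheme of the study; the function name `recursive_elimination`, the synthetic data, and the score vector are hypothetical.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

def recursive_elimination(X, y, scores, lam=0.05, cv=5):
    """Drop the lowest-scored variable one at a time and track cross-validated R2."""
    order = np.argsort(scores)[::-1]          # best-scored variable first
    results = []
    for k in range(len(order), 0, -1):        # keep the top k variables
        cols = order[:k]
        r2 = cross_val_score(Lasso(alpha=lam, max_iter=10000), X[:, cols], y,
                             cv=cv, scoring="r2").mean()
        results.append((k, r2))
    return results

# Synthetic example with hypothetical RL scores for eight variables.
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 8))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.2, size=100)
scores = np.array([0.95, 0.90, 0.30, 0.20, 0.10, 0.10, 0.05, 0.00])

results = recursive_elimination(X, y, scores)
best_k, best_r2 = max(results, key=lambda item: item[1])
```

Here performance drops sharply once the second informative variable is removed, which is the kind of threshold the minimal-optimal evaluation looks for.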
Performance evaluation
In order to account for the spatiotemporal structure of the data, a blocked cross-validation scheme was implemented [76]. Samples were grouped according to the site and year in which they were collected. This resulted in 5, 10 and 16 distinct groups for the Michigan, Muskegon and Inland lake systems, respectively. Predictive models were optimized with respect to the R2 between predicted and true values of held-out groups using a leave-one-group-out cross-validation scheme with the LeaveOneGroupOut() function. This results in a cross-validated R2 value. For the Lasso, λ was determined using the LassoCV() function, with setting eps=10 1 and n_alphas=400. The Random Forest object was optimized using a grid search where max_features was chosen in the interval (all variables) or [1, …, p] (Boruta selected variables) and min_samples_leaf in the interval [Is ¾ using the GridSearchCV() function. The number of decision trees (n_estimators in scikit-learn) was set to 200. All functions are part of scikit-learn ([72]; v0.19.1).
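The blocked scheme can be sketched with scikit-learn as follows. The data, the group labels, and the penalty grid passed to `LassoCV` are assumptions for illustration and do not reproduce the settings of the study; the inner 3-fold search for the penalty is also a simplification.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.metrics import r2_score
from sklearn.model_selection import LeaveOneGroupOut, cross_val_predict

rng = np.random.default_rng(0)
n_groups, per_group, p = 5, 12, 6
X = rng.normal(size=(n_groups * per_group, p))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.2, size=len(X))
groups = np.repeat(np.arange(n_groups), per_group)   # hypothetical site-year labels

logo = LeaveOneGroupOut()
# Assumed penalty grid; the inner 3-fold search picks the penalty per training set.
model = LassoCV(alphas=np.logspace(-3, 1, 50), cv=3, max_iter=10000)

# Every sample is predicted by a model that never saw its own group.
y_pred = cross_val_predict(model, X, y, groups=groups, cv=logo)
r2_cv = r2_score(y, y_pred)
```

Holding out whole groups keeps samples from the same site and year out of the training set, so the cross-validated R2 is not inflated by within-group similarity.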
Stability of the Randomized Lasso
Similarity of RL scores between lake systems and functional groups was quantified using the Pearson correlation. This was done using the pearsonr() function in Scipy (v1.0.0).
Patterns of HNA and LNA OTUs across ecosystems and phylogeny
To visualize patterns of selected HNA and LNA OTUs across the three ecosystems, a heatmap was created with the RL scores of each OTU from the Randomized Lasso regression that were higher than specified threshold values. The heatmap was created with the heatmap.2() function (gplots R package) using the euclidean distances of the RL scores and a complete linkage hierarchical clustering algorithm (Figure 4).
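The clustering behind such a heatmap can be sketched in Python with SciPy, as a stand-in for the ordering that heatmap.2() computes in R. The score matrix is hypothetical, and cutting the tree into two clusters is only done here to make the grouping explicit.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, leaves_list, linkage

# Hypothetical RL scores: rows are OTUs, columns are lake system x FCM group.
rl_scores = np.array([
    [0.90, 0.85, 0.10],
    [0.88, 0.80, 0.15],
    [0.05, 0.10, 0.95],
    [0.10, 0.05, 0.90],
])

# Complete-linkage hierarchical clustering on Euclidean distances between rows.
tree = linkage(rl_scores, method="complete", metric="euclidean")
row_order = leaves_list(tree)                          # OTU order along the heatmap rows
clusters = fcluster(tree, t=2, criterion="maxclust")   # cut the tree into two groups
```

OTUs with similar score profiles end up adjacent in `row_order`, which is what places them next to each other in the heatmap.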
Correlations between taxa and productivity measurements
Kendall tau ranking correlations between productivity measurements and individual abundances were calculated at the phylum and OTU level using the kendalltau() function from Scipy (v1.0.0). P-values were corrected using the Benjamini-Hochberg correction and are reported as P_adj. This was done using the multipletests() function from the multitest module of the Python package Statsmodels ([77]; v0.5.0).
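For reference, the Benjamini-Hochberg adjustment itself is short enough to write out. The sketch below is a plain NumPy version for illustration, not the Statsmodels routine; the function name `benjamini_hochberg` and the example P-values are hypothetical.

```python
import numpy as np

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted P-values (controls the false discovery rate)."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)        # p_(i) * m / i
    # Enforce monotonicity from the largest P-value downwards, cap at 1.
    adjusted_sorted = np.minimum(np.minimum.accumulate(scaled[::-1])[::-1], 1.0)
    adjusted = np.empty(m)
    adjusted[order] = adjusted_sorted
    return adjusted

p_adj = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

Each sorted P-value is multiplied by m/i, where i is its rank, and the running minimum from the largest value downwards keeps the adjusted values ordered like the raw ones.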
Phylogenetic tree construction and signal calculation
We calculated the best performing maximum likelihood tree using the GTR-CAT model (-gtr -fastest) of nucleotide substitution with FastTree (version 2.1.9 No SSE3; [78]). The Newick tree from FastTree was then used to model phylogenetic signal for both discrete traits (i.e., HNA, LNA, or both) and continuous traits (i.e., the RL score) using Pagel’s lambda (discrete trait: fitDiscrete() from the geiger R package [79]; continuous trait: phylosig() from the phytools R package [80]), Blomberg’s K (phylosig() function from the phytools R package [80]), and Moran’s I (abouheif.moran() function from the adephylo R package [81]).
Acknowledgements
PR was supported by Ghent University (BOFSTA2015000501) and MLS was supported by the National Science Foundation Graduate Research Fellowship Program (Grant No. DGE 1256260). Part of the computational resources (Stevin Supercomputer Infrastructure) and services used in this work were provided by the VSC (Flemish Supercomputer Center), funded by Ghent University, the Hercules Foundation and the Flemish Government department EWI. Flow cytometry analysis was supported through a Geconcerteerde Onderzoeksactie (GOA) from Ghent University (BOF15/GOA/006).