Abstract
Populations of Streptococcus pneumoniae are typically structured into groups of closely related organisms or lineages. Here, we employ a machine learning technique to try and tease out whether these lineages are maintained by selection or by neutral processes. Our results indicate that lineages of S. pneumoniae evolved through selection on the groESL operon, an essential component of its survival machinery. This operon contains genes which encode chaperone proteins that enable a very large range of proteins to fold correctly within the physical environment of the nasopharynx and therefore will be in strong epistasis with several other genes. These features of groESL would explain why lineage structure is so stable within S. pneumoniae despite high levels of horizontal genetic transfer. S. pneumoniae is also antigenically diverse, exhibiting a variety of distinct capsular serotypes. We show that associations may arise between lineage and capsular serotype due to immune selection and direct resource competition but these can be more easily perturbed in the presence of external pressures such as vaccination. Overall, our analyses indicate that the evolution of S. pneumoniae can be conceptualized as the rearrangement of modular functional units occurring on several different timescales under different selection pressures: some patterns have locked in early (such as the epistatic interactions between groESL and a constellation of other genes) and preserve the differentiation of lineages, while others (such as the associations between capsular serotype and lineage) remain in continuous flux.
Many bacterial pathogen populations contain a number of co-circulating lineages bearing unique signatures of alleles at selected housekeeping loci [1] and also at a whole genome level [2–4]. The maintenance of these discrete lineages is hard to ascribe to purely neutral processes, given the high rate of genetic exchange in these pathogen populations [5]. We have previously proposed that extensive co-adaptation between loci may give rise to these patterns, as even small fitness differences among different combinations of alleles can lead to the loss of less ʼfitʼ lineages under intense competition for resources [6]. Bacterial populations may also segregate into a set of successful ʼmetabolic typesʼ which are able to co-circulate by virtue of exploiting separate metabolic niches and thereby avoiding direct resource competition and immune pressures [7], As an example, specific differences in the ability to absorb particular carbohydrate resources have been observed in functional genomics studies of Streptococcus pneumoniae [8], and these may reflect specialization upon different resources within the same environment as a means of avoiding competition.
Several bacterial pathogens are also antigenically diverse: S. pneumoniae, for example, can exist in over 90 different serologically distinguishable states or ʼserotypesʼ [9]. Many bacterial populations - including S. pneumoniae - exhibit strong associations between antigenic type and lineage, at least at the level of MLST [7, 10], Such associations may have arisen through neutral processes; alternatively, as we have previously demonstrated, they may represent the outcome of a combination of immune selection acting upon antigen genes and direct resource competition acting upon metabolic genes and virulence factors [6, 10]. Distinguishing between these two hypotheses is complicated by the high levels of linkage disequilibrium observed across the whole genome [10, 11], However, the alternative hypotheses make very different predictions about how the system would respond to perturbation by vaccination, particularly when only a subset of antigenic types are included in the vaccine, as is the case for S. pneumoniae. Under these circumstances, antigenic types that are not included in the vaccine may be expected to increase in frequency but, under the neutral model, it would be highly unlikely that they would do so in association with the genotypes previously associated with vaccine serotypes. By contrast, if the associations were primarily generated by selection, one would expect non-vaccine serotypes to become associated with genotypes that were previously commonly associated with vaccine serotypes [10], Evidence for this phenomenon of Vaccine Induced Metabolic Shift (VIMS) has been found among pre- and post-vaccine isolates collected in the USA [2, 4, 12–14] and South Korea [15],
Significance Statement
Populations of Streptococcus pneumoniae (the pneumococcus) appear to form stable clusters of closely related organisms despite the fact that they frequently exchange genetic material. In this paper we show that, rather than emerging by chance, these clusters have evolved to maintain their differences so that they may avoid competing with each other. Our work suggests that these clusters are fundamentally determined by variation within a set of genes which encode “chaperones” that help other pneumococcal proteins fold correctly under changes in the physical environment they would encounter while trying to infect their vertebrate host. These chaperone proteins are also targets of immunity and therefore may have originally diverged to minimise immunological interference between pneumococci, thereby necessitating changes across the whole genome.
Establishing the contribution of co-adaptation and metabolic competition in the maintenance of lineage structure is thus important as the outcome of certain interventions, such as vaccination, depends crucially on these underlying determinants of population structure. Here, we assess the potential to achieve this by machine learning techniques, which work by attempting to identify relevant features based on information supplied on a set of potential predictor variables for each individual genetic sample. Random forests (RF) are one of such methods currently witnessing a surge of attention, owing to its unique advantages in dealing with large datasets of both numerical and categorical data, as well as having low computational overhead, a nonparametric nature and a well defined probabilistic output [16], A random forest algorithm (RFA) is an ensemble method that combines the information of multiple, regression or classification trees built around predictor variables towards a response variable. The output of an RFA is composed both of the classification success rates of the response variable and a ranking of the predictor variables (scores) quantifying their relative role in the classification process. RFA-based methods are widely applied in genome-wide association studies of cancer and chronic disease risk [17], drug resistance [18], species classification [19], and in the analysis of microarray data [20], In the context of host-pathogen systems, machine learning techniques have been shown to be able to successfully ascertain host tropism, for instance by identifying the key sites that determine host specificity of zoonotic viruses [21], by analyzing the probability of Escherichia coli cattle strains more likely to be virulent to humans [22], and by selecting the clear genetic distinctions in both avian and human proteins of Influenza viruses [23, 24],
In this paper, we undertake a feature selection analysis of a dataset containing 616 whole genomes of S. pneumoniae collected in Massachusetts (USA), including 133 samples from 2001, 203 from 2004 and 280 from 2007 [3, 25], thus representing the bacterial population at the point of PCV7 introduction in year 2000, and any changes that may have followed. These data have been used in numerous studies, including analysis of post-vaccine epidemiological and genetic changes [3, 10, 26], maintenance of population structure [2], frefa-lactani resistance [27], determinants of colonization [28] and constraints on serotype switching [29]. Each isolate in this dataset contains information on its capsular serotype (determined by serological means), and had also been assigned to one of a number of monophyletic Sequence Clusters (SC) using a phylogenetic and clustering analysis on a core genome built from all putative protein-coding sequences that were present in a single copy in all genomes [3]. Using a machine learning technique and a previous allelic annotation of 2135 genes among these isolates (using ATCC 700669 serotype 23F as reference [10], table SI), we attempt to identify the relative contribution of each gene in maintaining the observed population structure in terms of (i) capsular serotype and (ii) Sequence Cluster (SC). We find a clear distinction between the sets and functions of genes highly informative for serotype versus SC, suggesting that different selective processes have led to the emergence and maintenance of S. pneumoniae's population structure.
Results
Genes which predict serotype do not perform well in predicting Sequence Cluster
We first assessed the success of the combined variation in 2135 genes of known and unknown function in identifying the Sequence Cluster (SC) to which isolates belonged, this being a measure of shared ancestry (as per [3]). Classification of SC by RFA was accurate (Fig. SIB) with all SC types being predicted with success close to 100%. By contrast, the success rate in identifying the capsular serotypes of the 616 whole genomes, although also very high, was not perfect. None of housekeeping genes included in MLST classification performed better than average in predicting serotype or SC (Fig. 1).
As might be expected, genes within the capsular locus (defined as being within but not including the genes dexB and a.UA) achieved high scores in predicting serotype but these did not score above average in predicting SC (Fig. 1). We noted that many of these genes contained what appeared to be a high proportion of deletions but, in fact, had simply eluded allelic notation on account of their high diversity at the level of the population. For certain genes, such as those encoding the polysaccharide polymerase Wzy and the flippase Wzx, the allelic notation process failed at least 50% of the time for over 90%) of the isolates, essentially working only for 23F (the reference genome) and the closely related 23A and 23B serotypes. In general, the degree of success in allelic notation of each gene was closely linked to the potential for alignment with its counterpart in the 23F reference genome (Fig. S4). Nonetheless, the same shift towards lower RFA scores of capsule associated genes in predicting SC rather than serotype was observed upon performing these classification exercises after excluding all genes which contain > 50% (Fig. S2) or > 10% (Fig. S3) of gene mismatches/deletions. When imposing an exclusion criterion of > 10% we retained only the genes wze, wzg and wzh (in addition to two pseudogenes), and these could also clearly be seen to shift from above the upper 97.5% limit into the neutral expectation of RFA scores when predicting SC (Fig. S3).
Finally, we performed the same analysis excluding all genes which showed mismatches or deletions above a threshold of 1%. This eliminated all of the genes considered above as belonging within the capsular locus, although many flanking genes were retained and a number of these achieved the top 2.5% of RFA scores in predicting serotype (Fig. 2A, Table 1): 38%) of the top genes occurred within 10 genes downstream and upstream of the capsular locus, and 66% were situated within 60 genes (a distance amounting to 2.8% of the genome). None of the genes achieving the top 2.5%) of RFA scores in predicting serotype (shown in red in Fig. 2) remained in the top 2.5%) category when asked to predict SC. Similarly, all genes which achieved top scores in predicting SC (Table 2) were only of average importance in elucidating serotype (shown in green in Fig. 2). Interestingly, the MLST gene spi gained a place among the top-scoring genes for SC (Table 2) under this stringent cutoff.
Top-scoring genes for serotype classification mediate competitive interactions
We found that a large proportion of non-capsular genes which were associated with serotype to be involved in metabolic functions linked to resource competition, at least in related streptococcal species. The top-scoring gene t.rpF, for example, is known to be essential for the biosynthesis of tryptophan for S. pneumoniae [30], and more generally of the biosynthesis of aromatic amino acids in at least 9 species of bacteria [31]. Another top-scoring gene, fa.bG, encodes the /3-ketoacyl-ACP reductase, the only known keto-acid reductase in bacterial fatty acid biosynthesis [32]. The gene lysC codes for an aspartokinase involved in lysine production and aminoethyl cysteine resistance in Coryneba.cteri.um. glutamicum [33]. We also found two genes, m.vaD and rnvaKʼ2, of the mevalonate pathway scoring highly for serotype prediction. This pathway, also known as the HMG-CoA reductase pathway, can be found in bacteria, eukaryotes and archaea [34]. One of its main products, the isopentenyl pyrophosphate (IPP), is used to make isoprenoids, a diverse class of over 30,000 biomolecules. In bacteria, the principal products of IPP include the lipid carrier undecaprenol (involved in wall biosynthesis), plus a range of menaquinones and ubiquinones both involved in electron transport, and the latter also in aerobic cellular respiration [34–36]. In S. pneumoniae, these two genes are essential for growth and are proposed to be part of a single operon [35].
ATP-binding cassette (ABC) transporter genes, which are critical for intake and metabolism, were found 5 times more frequently in the top genes classifying serotype compared to those determining SC (Table 1). As part of one such transporter, the SpuA protein is involved in a-glucan metabolism, whose main substrate is glycogen (polysaccharide of glucose), an abundant resource in human lung epithelial cells [37, 38]. The gene patB also encodes part of an ABC efflux pump in S. pneumoniae, responsible for resistance to fluoroquinolenes [39–41]. Among other ABC transporters, two other genes were located within the pit. operon which is involved in iron uptake. In line with our findings, the pit. operon has previously been shown to exhibit strain-specific variation [42]. In contrast, our approach did not select 2 other operons involved in iron uptake (piu and pi.a), which are conserved between S. pneumoniae strains [42] and therefore unlikely to be predictors of serotype. Another important regulator of iron transport among the top- scoring genes is gnd [43]. The latter is transcriptionally linked to another top gene, rit.R, which is orthologous to the streptococcal global regulator covR, for which there is conclusive evidence from S. pyogenes, S. suis and S. agalact.iae of regulatory functions on capsular biosynthesis [44–46]. Another ABC transporter known as Ecs is represented in the top list by one of its two genes, ecs A. The substrate of Ecs is so far unknown, but obligatory anaerobes or microaerophilic bacteria do not carry the Ecs transporter, and its function is therefore argued to be related to respiration [47]. Finally, transport can be achieved by a multitude of systems alternative to ABC transporters, such as ʼpassiveʼ channels like the top-scoring sodium symporter GlyP [48]. Sodium is one of the main electrolytes in human saliva, existing there at a higher concentration than in blood plasma, and differentiation in sodium transport, similarly to iron or glucose transport, could potentially be under selection for niche specialization.
High RFA scores for serotype were also found among a number of genes flanking the capsular locus which are involved in the cell wall peptidoglycan biosynthesis pathway [9]. These include the penicillin-binding protein genes pbpX and pbplA, the 16S rRNA cytosine-methyltransfera.se gene m.raW and the phospho-N-acetylmuramoyl-pentapeptide-transferase gene m.raY. Mutations in these genes can lead to penicillin resistance, and single-nucleotide positions in all three genes have been shown to associate strongly with S. pneumoniae /3- lactam resistance in genome-wide association studies (GWAS) performed on the dataset used in this study [3], in a Thai study containing 3,085 isolates [49], and in a Canadian study on 11,083 isolates [50]. It is of relevance to note that in S. Pneumoniae, pbplA is also involved in the formation of the septum during cell division [51] and is associated in a two-gene operon with the top-scoring gene recU, coding for the Holliday junction resolvase, required for homologous DNA recombination, repair and chromosome segregation [52, 53]. Finally, resistance to various classes of cell wall-inhibitory antibiotics (ex. methicillin, vancomycin, daptomycin) in S. Aureus is regulated via the vra operon, by up or downregulation of a set of genes commonly designated as the cell wall stimulon [54]. We find this operon represented by two entries, the vraT and vraT genes.
In addition to genes clearly related to critical resource functions, transport and antibiotic resistance, we also found some of the top-scoring entries to be involved in functions associated with direct inter- and intra-species competition, either through factors related to immune escape or warfare. For instance, blpH is part of the BlpABCSRH pathway [55], which regulates production of class II bacteriocins and related immunity proteins [56, 57]. In related species, the aminotransferase GlmS is also known to upregulate the production of ammonia thereby increasing acid tolerance and survival [58]. The capsular flanking gene luxS is also a good example, as it is part of a Staphylococcus epiderrnidis quorum-sensing system in biofilm formation, and linked to pneumolysin expression, a key player in interference with the host immune response [59, 60]. Finally, the top-scoring lytC gene encodes a lysozyme (or glycoside hydrolase) which can be found in a number of secretions, such as tears, saliva and mucus, with the potential to damage (interspecies) bacterial cell walls by catalyzing hydrolysis of linkages and residues in peptidoglycans and chitodextrins [61, 62].
Several top-scoring genes for SC classification are also key determinants of phenotype
A number of top scoring genes (ex. sodA, groEL, groES, Imb) in predicting SC have previously been demonstrated to be powerful discriminators of genealogy in a range of bacterial species. For instance, sodA, encoding for the manganese superoxide dismutase, critical against oxidative stress and linked to both survival and virulence, has been highlighted in numerous studies for its relevance in identification of rare clones of pneumococci [63, 64] and Streptococci at the species level [65, 66]. Also, the Imb gene encodes for an extracellular protein with a key role in physiology and pathogenicity [67, 68], and homologs of this protein have been documented to be present and discriminatory of at least 25 groups of the Streptococcus genus with possible similar functions [69, 70].
Certain top-scoring genes were strongly associated with phenotype such as cell-shape, virulence or invasiveness. For instance, glycolytic enzymes (GE) such as the one encoded by the top-scoring gene pdhB have long been regarded as virulence factors [71] and are involved in cytosol-located metabolic processes. When transported to the surface, the PdhB protein- complex is known to interact with host factors such as the extracellular matrix and fibrinolysis system [72]. Critically, Mycoplasma pneumoniae's pdhB is involved in the degradation of human fibrinogen and is also able to bind human fibronectin [72, 73], Fibronectin is commonly found in human saliva, presenting a vast set of functions, from prevention to colonization of the oral cavity and pharynx, to involvement in adhesion and wound healing [74], Another top gene, pclA, encodes for the pneumococcal collagen-like protein A, a top candidate for human collagen mimicry [75], involved in host-cell adherence and invasion [76]. Binding to fibronectin and collagen are common strategies employed by various invading bacterial pathogens to colonize or disseminate within the host [77, 78], In ovococcus bacteria such as S. pneumoniae the function of the top-scoring protein MreD (the Rod shape-determining protein) is unknown. It is therefore down to speculation on why this protein is a good predictor of SC, but since the depletion of MreD protein can cause cells to stop growing, become spherical, form chains and lyse, its selection hints on the possibility that variation in this gene may dictate specific lineage differences in cell-shape phenotype [79], We also find the genes designated as SPN23F11320 and SPN23F09460 to be relevant for SC classification, which in our dataset represent about 13% of all non-putative GCN5-related, N-aeetyltransferases of the (GNAT) family. These are key proteins involved in acetylation, and there is growing evidence in the literature of their role in regulation of central carbon metabolism and phenotype through epigenetics [80, 81],
Overall, the characteristics of these top-scoring genes differed significantly from those which were successful in predicting serotype and, contrary to expectations from a population structured mainly by neutral evolution, we found the top-scoring genes for SC (ancestry) not to be uniformly distributed across the genome. Most strikingly, 25% of the top scoring genes for SC were contiguous and contained the groESL operon, which includes the GroEL and GroES chaperon proteins (Table 2). Other studies have reported the power of the groESL operon and its proteins to ascertain phylogeny and classification within the Streptococcus genus [82] and between species of the Viridans and Mutans Streptococci groups [83, 84]. We also noted the top-scoring gene recX is in close proximity to the groESL operon, which encodes a regulatory protein that inhibits the RecA recombinase in multiple species of bacteria [85–88].
Discussion
We have presented a novel technique for attempting to distinguish the effects of selection from neutral processes giving rise to population structure by applying a machine learning algorithm to genomic data. Our strategy involves applying a Random Forest Algorithm (RFA) to predict particular features (serotype or Sequence Cluster) of each isolate from information on the allelic composition of all isolates. By comparing the contribution of different genes as reflected in their RFA scores in predicting serotype or Sequence Cluster, inferences can be made concerning the evolutionary processes underlying their formation, relationship and maintenance at the population level. We performed this analysis on a dataset containing 616 whole genomes of S. pneumoniae collected in Massachusetts (USA) [3], for each of which we had obtained allelic profiles of 2135 genes[10].
Classification success of Sequence Cluster (SC) to which each isolate belonged was achieved almost perfectly by the RFA. This is a reflection of the strong correspondence between taxonomy and classification trees based on genetic information, as explored in recent studies [19], and demonstrated by Austerlitz and colleagues when comparing the success of RFA, neighbour-joining and maximum-likelihood (PhyML) methodologies on simulated and empirical genetic data [89]. Classification of serotype by the RFA was more variable and, most importantly, there was no overlap between the genes which appeared to be most important in determining serotype and those which scored highly in identifying SC. As might be expected, genes of the capsular locus (cps) and many of those flanking it achieved high RFA scores in predicting serotype but did not perform better than average in predicting SC. Interestingly, none of the genes among the MLST loci showed a consistently strong association with SC across sensitivity experiments and all performed no better at predicting SC than serotype.
We encountered difficulties in using the entire dataset due to the large number of putative deletions recorded. Some of these proved to be a result of the extreme diversity of genes (such as wzx and wzy within the capsular locus) which interfered with their alignment to the reference serotype 23F genome (ATCC 700669). In the entire dataset, around 7% of the genes had over > 80% deletions/mismatches recorded, 10% had > 50% deletions/mismatches recorded, and just over 25% of the data had to be discarded if we rejected all genes with an excess of > 1%> of deletions/mismatches. Given these limitations, we repeated the RFA analysis under various cutoffs for percentage of gene deletions/mismatches in a series of sensitivity exercises. While this did not affect the trend of genes within and flanking the cps locus to shift to lower RFA scores when comparing prediction of serotype against prediction of SC, it thwarted efforts to ascertain whether any specific associations with serotype existed among other highly variable surface proteins of interest: PspA, choline binding protein CpbA/PspC, the IgA proteases or the histidine triad proteins. Future work of this methodology will rely on the development of more robust methods of allele classification for this category of genes, an area, still lacking adequate approaches.
By eliminating all genes with > 1% of deletions/mismatches, we were left with 1581 genes which likely corresponds to the 1500 ʼcoreʼ cluster of orthologous genes (COGs) identified by Croucher et al [2] in their recent analysis of the same dataset. Within this more restricted set, we also observed a clear disjunction between genes that score highly in predicting serotype and the top-scoring genes for predicting SC. Not surprisingly, a significant proportion of genes that were good markers for serotype were found to flank the capsular locus (shown in bold in Table 1), although there were a number which were distal to it. A high proportion of genes scoring highly for serotype prediction were associated with key functions in metabolism and very likely defined unique ʼmetabolic typesʼ, but since most were in proximity to the cps locus, it was not possible to determine whether these had become segregated through resource competition [10] or by physical and/or functional associations with this locus. The presence of several co-functional, co-transcribed or co-localizing sets of genes (eg. the gnd and rit.R genes, the pit,3 m.va and vra operons, and the penicillin- binding genes) on this list (Table 1) argues, however, that the evolution of these serotype-associated traits may best be understood within a modular framework in which different serotypes are characterized by particular combinations of these units.
Genes that were highly informative for SC classification were also not uniformly distributed across the genome, with around a quarter of them co-localizing within and around the groESL operon (shown in bold in Table 2), encoding a chap- eronin system which remains a paradigm of inacromolecular machinery for protein folding [90]. Apart from assisting protein folding by preventing inappropriate interactions between non-native polypeptides [90], this system may also buffer deleterious effects of mutations on protein foldability and stability [91], with important consequences for protein evolution. The protein GroEL is also highly immunogenic for different bacterial species and has been shown to provide strain-specific protection in vaccine studies [92–94]. This raises the radically alternative possibility that sequence clustering may have arisen from immune selection operating on these genes in conjunction with epistatic interactions between the relevant heat shock proteins and the loci encoding the proteins they are chaperoning. The associations between serotype and SC may thus be primarily driven by immune selection operating on multiple immunogenic loci (in this case, cps and groESL) causing them to be organized into non-overlapping combinations, as predicted by strain theory of host-pathogen systems [95, 96]. It has previously been proposed that immune selection acting jointly on capsular and sub-capsular antigens could account for the maintenance of these associations [29]. Immunological selection of unique combinations of cps and groESL, however, has the additional advantage of consolidating the link with a range of other genes across the genome through essential epistatic and highly specific (chaperoning) interactions with GroEL and GroES [90],
Our results are in broad agreement with the framework proposed by Croucher and colleagues [2], based on their analysis of the same dataset, in which lineage structure is maintained by infrequent transfer of modular elements (“macroevolution”) and provides a stable backdrop for more frequent, and often transient, “microevolutionary” changes (see Figure 3). The differentiation of the groESL operon is potentially a striking example of “macroevolution”, being specific not only to S. pneumoniae sequence clusters but also serving to genealogically distinguish closely related bacterial species [82–84], We propose that this is the evolutionary outcome of a combination of immune selection and epistasis operating on specific modules, such as groESL, rather than neutral processes. Selection would also operate at a “microevolutionary” level in creating (more transient) associations between SC and serotype as means of avoiding immunological and direct resource competition [6, 10, 29]. We note that genes belonging to the Rec family are positioned in close proximity to both the contiguous clusters of top-scoring genes for SC and serotype (Tables 1 and 2) and would argue that these endorse the role of restriction- modification systems (RMS) in protecting the modularity of the genome [2], and that population structure arises through selection favouring particular combinations of variants of these modules. Our analyses support the hypothesis that lineage structure in maintained by co-adaptation and competition [6, 10] and show, unexpectedly, that these selection pressures converge upon the same locus, namely the groESL operon, and strongly endorse the development of vaccines targeting the associated chaperone protein GroEL to avoid vaccine induced changes in the population structure such as Vaccine Induced Serotype Replacement (VISR, [26, 29]) or Vaccine Induced Metabolic Shift (VIMS, [10]) which have the potential of greatly reducing the benefits of capsular serotype targeted interventions.
Materials and Methods
Sequence Data and Allelic Annotation
We used a dataset sequenced by Croucher et al, comprising 616 carriage S. pneumona.ie genomes isolated in 2001, 2004 and 2007 from Massachusetts (USA). The data includes 133, 203, 280 samples from 2001, 2004, 2007, respectively; and is stratified into 16 samples of serotype 10A, 50 of 11A, 7 of 14, 24 of 15A, 60 of 15BG, 8 of 16F, 5 ofʼl7F, 6 of 18G, 73 of 19A, 33 of 19F, 1 of 21, 21 of 22F, 33 of 23A, 23 of 23B, 17 of 23F, 11 of 3, 4 of 31, 5 of 33F, 6 of 34, 49 of 35B, 18 of 35F, 2 of 37, 9 of 38, 47 of 6A, 17 of 6B, 33 of 6G, 3 of 7G, 11 of 7F, 4 of 9N, 6 of 9V and 14 of NT (see [3] for collection details). In summary, allelic notation was carried out using the BIGSdb software with an automated BLAST process [97], and the genomes were analysed using the Genome Comparator tool (with ATGG 700669, serotype 23F, accession number FM211187, as the reference). Alleles identical to the reference were classified as ʼ1ʼ, with subsequent sequences, differing at least by one base, labelled in increasing order. Genes were further classified as allele ʼXʼ when genetic data present had no match to the genome of interest, or were found to be truncated or non-coding (see SI Dataset of [10] for a visual representation of allele annotation and diversity). The allelic matrix as obtained by this approach and used in the RFA analysis is herein made available in supplementary Table SI, which also includes the Accession Numbers, gene name, gene product, gene position in reference genome, and year of collection, Sequence Cluster and serotype of each sample.
Random Forest Approach
We implement a machine learning approach based on a Random Forest Algorithm (RFA) to predict particular features (serotype or Sequence Cluster) of each pneumococci isolate from information on the allelic composition of 2135 genes [16], In summary, the RFA process takes the following pseudosteps: (I) the response variable and predictor variables are chosen by the user; (II) a predefined number of independent bootstrap samples are drawn from the dataset with replacement, and a classification tree is fit to each sample containing roughly 2/3 of the data, for which predictor variable selection on each node split in the tree is conducted using only a small random subset of predictor variables; (III) the complete set of trees, one for each bootstrap sample, composes the random forest, from which the status (classification) of the response variable is predicted as an average (majority vote) of the predictions of all trees. Compared to single classification tress, RFs increase prediction accuracy, since the ensemble of slight different classification results adjusts for the instability of the individual trees and avoids data overfitting [98].
Here we use randomForest: Breiman and Cutler's Random Forests for Classification and Regression, a software package for the R-statistical environment [99], Predictor variables are set to be each gene in our genome samples and the response variable is set to the serotype or Sequence Cluster classification of each genome (as per [3]). We use the Mean Decrease Accuracy (MDA), or Breiman-Cutler importance, as a measure of predictor variable importance, for which classification accuracy after data permutation of a predictor variable is subtracted from the accuracy without permutation, and averaged over all trees in the RF to give an importance value [98], For the results presented in the main text, we assume the predictor variables to be numerical (as opposed to categorical). This assumption is known to introduce RF biases, as classification is effectively made by regression and artificial correlations between allele numbering and the features being selected (serotype and Sequence Cluster) may be present. The assumption is herein necessary since the RFA R-based implementation (version 3.6.12) has an upper limit of 53 categories per predictor variable and we find some genes to present up to 6 times this limit in allele diversity. The categorical constraint is a common feature of RFA implementations, as predictor variables with N categories imply 2N possible (binary) combinations for an internal node split, making the RFA method computationally impractical. Given this inherent RFA limitation, we implemented an input shuffling strategy to minimize potential bias. For this, M random permutations of each gene's allelic numbering in the original dataset is performed, effectively creating M independent input matrices. The RFA is run over the input matrices and in the main results we present each gene's average MDA score. A sensitivity analysis was performed by comparing RFA results between two independent sets of M = 50 input matrices (effectively comparing 100 independent runs) (Fig. S5). Results suggest that the existing biases in independent runs of the RFA due to the assumption of numerical predictors are virtually mitigated with our shuffling approach, specially for experiments classifying serotype.
ACKNOWLEDGMENTS
The authors acknowledge the sequence data and constructive comments by Angela Brueggemann and An- dries van Tonder.
Footnotes
S.G., J.L. and E.W. designed the study; S.G. and J.L. conducted the study. All authors were involved in analysis and interpretation of results and available data and in writing the paper.
We declare we have no competing interests.
References
- 1.↵
- 2.↵
- 3.↵
- 4.↵
- 5.↵
- 6.↵
- 7.↵
- 8.↵
- 9.↵
- 10.↵
- 11.↵
- 12.↵
- 13.
- 14.↵
- 15.↵
- 16.↵
- 17.↵
- 18.↵
- 19.↵
- 20.↵
- 21.↵
- 22.↵
- 23.↵
- 24.↵
- 25.↵
- 26.↵
- 27.↵
- 28.↵
- 29.↵
- 30.↵
- 31.↵
- 32.↵
- 33.↵
- 34.↵
- 35.↵
- 36.↵
- 37.↵
- 38.↵
- 39.↵
- 40.
- 41.↵
- 42.↵
- 43.↵
- 44.↵
- 45.
- 46.↵
- 47.↵
- 48.↵
- 49.↵
- 50.↵
- 51.↵
- 52.↵
- 53.↵
- 54.↵
- 55.↵
- 56.↵
- 57.↵
- 58.↵
- 59.↵
- 60.↵
- 61.↵
- 62.↵
- 63.↵
- 64.↵
- 65.↵
- 66.↵
- 67.↵
- 68.↵
- 69.↵
- 70.↵
- 71.↵
- 72.↵
- 73.↵
- 74.↵
- 75.↵
- 76.↵
- 77.↵
- 78.↵
- 79.↵
- 80.↵
- 81.↵
- 82.↵
- 83.↵
- 84.↵
- 85.↵
- 86.
- 87.
- 88.↵
- 89.↵
- 90.↵
- 91.↵
- 92.↵
- 93.
- 94.↵
- 95.↵
- 96.↵
- 97.↵
- 98.↵
- 99.↵