Abstract
Background: Bacillus cereus sensu lato (s. l.) is an ecologically diverse bacterial group of medical and agricultural significance. In this study, I used publicly available genomes to characterize the B. cereus s. l. pan-genome and performed the largest phylogenetic analyses of this group to date in terms of the number of genes and taxa included. With these fundamental data in hand, it became possible to identify genes associated with particular phenotypic traits (i.e., “pan-GWAS” analysis), and to quantify the degree to which taxa sharing common attributes were phylogenetically clustered.
Methods: A rapid k-mer-based approach (Mash) was used to create reduced representations of selected Bacillus genomes, and a fast distance-based phylogenetic analysis of these data (FastME) was performed to determine which species should be included in B. cereus s. l. The complete genomes of eight B. cereus s. l. species were annotated de novo with Prokka, and these annotations were used by Roary to produce the B. cereus s. l. pan-genome. Scoary was used to associate gene presence and absence patterns with various phenotypes. The orthologous protein sequence clusters produced by Roary were filtered and used to build HaMStR databases of gene models, which were used in turn to construct phylogenetic data matrices. Phylogenetic analyses used RAxML, DendroPy, ClonalFrameML, Gubbins, PAUP*, and SplitsTree. The genealogical sorting index was used to assess the tree-based clustering of taxa sharing common attributes.
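The k-mer-based distance underlying the first step can be illustrated with a minimal sketch. It assumes the published Mash distance formula, D = -(1/k) ln(2j / (1 + j)), where j is the Jaccard index of two k-mer sets. Unlike Mash itself, which estimates j from MinHash sketches, this toy version computes j exactly from full k-mer sets; the function names and the k value used here are illustrative assumptions, not settings taken from this study.

```python
import math


def kmer_set(seq, k):
    """Return the set of all k-mers in a sequence (toy, exact version)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard index of two k-mer sets; 0.0 if both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mash_distance(j, k):
    """Mash distance from a Jaccard index j and k-mer size k.

    Assumes D = -(1/k) * ln(2j / (1 + j)), clamped to [0, 1];
    a Jaccard index of 0 is mapped to the maximum distance of 1.
    """
    if j <= 0.0:
        return 1.0
    d = -math.log(2.0 * j / (1.0 + j)) / k
    return min(1.0, max(0.0, d))


# Hypothetical sequences, for illustration only.
a = kmer_set("ACGTACGTTGCAACGT", 4)
b = kmer_set("ACGTACGTTGCATCGT", 4)
print(mash_distance(jaccard(a, b), 4))
```

A matrix of such pairwise distances is the kind of input that a distance-based method such as FastME takes to build a tree.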
Results: The B. cereus s. l. pan-genome currently consists of ≈60,000 genes, ≈600 of which are “core” (common to at least 99% of taxa sampled). Pan-GWAS analysis revealed genes that were associated with phenotypes such as isolation source, oxygen requirement, and ability to cause diseases such as anthrax or food poisoning. Extensive phylogenetic analyses using an unprecedented amount of data produced phylogenies that were largely concordant with each other and with previous studies. Phylogenetic support as measured by bootstrap probabilities increased markedly when all suitable pan-genome data were included in phylogenetic analyses, as opposed to when only core genes were used. B. cereus s. l. taxa sharing common traits and species designations exhibited varying degrees of phylogenetic clustering.
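The “core” definition used above (a gene present in at least 99% of sampled taxa) can be made concrete with a short sketch. It assumes a simple gene-to-taxa presence mapping; the function name, data layout, and example genes are hypothetical and do not reproduce Roary's actual output format.

```python
def core_genes(presence, n_taxa, threshold=0.99):
    """Return the genes present in at least `threshold` of all taxa.

    presence: dict mapping a gene name to the set of taxa carrying it
              (hypothetical layout, for illustration only).
    n_taxa:   total number of taxa sampled.
    """
    if n_taxa <= 0:
        raise ValueError("n_taxa must be positive")
    return {
        gene
        for gene, taxa in presence.items()
        if len(taxa) / n_taxa >= threshold
    }


# Toy pan-genome over 200 taxa: one universal gene, one near-universal
# gene, and one accessory gene found in a quarter of the taxa.
toy = {
    "geneA": set(range(200)),
    "geneB": set(range(199)),
    "geneC": set(range(50)),
}
print(sorted(core_genes(toy, 200)))
```

Under this definition the remainder of the pan-genome (here, geneC) is accessory, which is the portion that a presence/absence association analysis can exploit.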
List of abbreviations
- ACLAME: A CLAssification of Mobile genetic Elements
- AFLP: amplified fragment length polymorphism
- BCSL: Bacillus cereus sensu lato
- BLAST: Basic Local Alignment Search Tool
- BP: bootstrap probability
- GB: gigabytes
- GWAS: genome-wide association study
- HMM: hidden Markov model
- HaMStR: Hidden Markov Model based Search for Orthologs using Reciprocity
- LCB: locally collinear block
- MAFFT: Multiple Alignment using Fast Fourier Transform
- MGE: mobile genetic element
- ML: maximum likelihood
- MLST: multilocus sequence typing
- MP: maximum parsimony
- NCBI: National Center for Biotechnology Information
- PANTHER: Protein ANalysis THrough Evolutionary Relationships
- PATRIC: Pathosystems Resource Integration Center
- PAUP*: Phylogenetic Analysis Using Parsimony *and other methods
- PHYLIP: Phylogeny Inference Package
- PRANK: Probabilistic Alignment Kit
- RAM: random access memory
- RAxML: Randomized Axelerated Maximum Likelihood
- RF: Robinson-Foulds
- RefSeq: Reference Sequence database
- SNP: single nucleotide polymorphism
- dDDH: digital DNA-DNA hybridization
- gsi: genealogical sorting index
- nt: nucleotides