Abstract
The European honey bee (Apis mellifera) plays a major role in pollination and food production, but is under threat from emerging pathogens and agro-environmental insults. As with other organisms, honey bee health is a complex product of environment, host genetics and associated microbes (commensal, opportunistic and pathogenic). Improved understanding of bee genetics and their molecular ecology can help manage modern challenges to bee health and production. Sampling bee and cobiont genomes, we characterised the metagenome of 19 honey bee colonies across Britain. Low heterozygosity was observed in bees from many Scottish colonies, sharing high similarity to the native dark bee, A. mellifera mellifera. Apiaries exhibited high diversity in the composition and relative abundance of individual microbiome taxa. Most non-bee sequences derived from known honey bee commensal bacteria or known pathogens, e.g. Lotmaria passim (Trypanosomatidae), and Nosema spp. (Microsporidia). However, DNA was also detected from numerous additional bacterial, plant (food source), protozoan and metazoan organisms. To classify sequences from cobionts lacking genomic information, we developed a novel network analysis approach clustering orphan contigs, allowing the identification of a pathogenic gregarine. Our analyses demonstrate the power of high-throughput, directed metagenomics in agroecosystems identifying potential threats to honey bees present in their microbiota.
The European honey bee, Apis mellifera Linnaeus, has a global distribution and a major role in pollination and food production (1). Like other pollinators, honey bee populations face multiple threats. The UN’s food and agriculture organisation estimate that 75% of all pollinators are in decline globally and their numbers have dropped by about one third in the past decade (2). Whilst flowering crops benefit greatly from a diversity of insect pollinators (3), managed honey bees are a major global contributor, providing nearly half of the service to all insect-pollinated crops on Earth (4, 5). Despite the recent increase in non-commercial beekeeping, the number of managed honey bee colonies is growing more slowly than agricultural demand for pollination (6). The decline in pollinators is not thought to be caused by a single factor but may be driven by a combination of habitat fragmentation, agricultural intensification, pesticide residue accumulation, new honey bee pests and diseases, and suboptimal beekeeping practices (7–9). Trade in honey bees from different regions of the globe have unquestionably contributed to a rise in infectious disease and there may be transmission between honey bees and wild pollinators (10, 11).
In the UK the genetic structure of honey bee populations has undergone large changes over the last 100 years. The native M-lineage subspecies, A. m. mellifera, had predominated in the UK, but the population was decimated in the early 20th century by a combination of poor weather and chronic bee paralysis virus, thought to have been caused by Isle of Wight disease (12). Following this, the practice of bee importation increased dramatically. In the UK today there is a growing industry that imports bees from mainland Europe, particularly the Italian honey bee (A. m. ligustica) and Carniolan honey bee (A. m. carnica), both C-lineage subspecies. Importation of queens has for a long time been used as a means to compensate for the loss of colonies and the Southern European strains are often viewed as a means to improve honey production. It had been assumed that the native UK bee was extinct, but new molecular studies have shown that colonies robustly assigned to A. m. mellifera still exist in Northern Europe (13). In the UK the genetic diversity of honey bee populations is poorly understood. The genetic makeup of bee populations not only influences production traits and the ability to survive under less favourable conditions, but also plays a vital role in disease resistance (14). However, the movement of honey bees across the globe has unquestionably contributed to the spread of infectious diseases, which may also be transmitted to wild pollinators (10, 11).
In the UK, the health of honey bees is under threat from a range of native and non-native bacterial, fungal and viral pathogens. While known ‘notifiable diseases’ can be risk assessed and regulated by law, emergent diseases such as Nosema ceranae (15) may be spread globally before they have been properly identified and risk assessed. Nosemosis is one of the most prevalent honey bee diseases and is caused by two species of microsporidia, Nosema apis and Nosema ceranae, that parasitize the ventriculum (midgut). Although infected bees often show no clear symptoms, heavy infections can result in a broad range of detrimental effects (16–21). N. ceranae, a native parasite of the Asiatic honey bee (Apis cerana), has been detected in Apis mellifera samples from Uruguay predating 1990 but is now present in Apis mellifera worldwide (15). Notifiable diseases, American foulbrood (AFB) and European foulbrood (EFB), are caused by the non-native bacteria Paenibacillus larvae and native Melissococcus plutonius, respectively (22, 23). Acarine disease is caused by a mite found throughout Britain which infests the trachea of honey bees (24). Protozoans such as gregarines and the emergent trypanosomatid Lotmaria passim, also infect honey bees. The most devastating of all pathogenic species in recent years is the hemophagous mite Varroa destructor, which shifted hosts from A. cerana to A. mellifera sometime in the first half of the 20th century (25). Varroa mites feed on the haemolymph of both larval and adult stages of the honey bee. More importantly, V. destructor transmits several bee viruses, generating epidemics that kill colonies within two to three years unless the Varroa population is kept under control. Among the most important and lethal viruses in this regard are deformed wing virus (DWV) (26), acute bee paralysis virus complex (ABPV), Kashmir bee virus (KBV), and Israeli acute paralysis virus (IAPV) (27). Sacbrood virus (SBV) can also be transmitted but without major epidemic consequences and is primarily indirectly affected by Varroa (25, 28, 29).
The core commensal microbiome can mediate disease susceptibility and the internal ecology of the host can greatly affect disease outcome (30). In addition to immunological health and essential nutrient provision, microbial metabolism affects the growth, behaviour and hormonal signalling of honey bees (31). Unlike most host species, the core microbiota of the honey bee has relatively little diversity (32–38). Snodgrasella alvi (Betaproteobacteria), Gilliamella apicola (Gammaproteobacteria), two Lactobacillus taxa (Firm-4 and Firm-5) (34, 35), and Bifidobacterium asteroides are common and abundant (39, 40). There are at least four less common species: Frischella perrara (41), Bartonella apis (42), Parasaccharibacter apium (37) and Gluconobacter-related species group Alpha2.1 (35). Metagenomic analyses have revealed high between-isolate genetic diversity in honey bee microbiotal taxa, suggesting they comprise clusters of related taxa (43). These bacteria maintain gut physiochemical conditions and aid their host in the digestion and metabolism of nutrients, neutralisation of toxins, and resistance to parasites (44–46). Gilliamella species digest pectin from pollen, and the Lactobacillus species inhibit the growth of foulbrood bacteria (47). However, F. perrara may cause a widespread scab phenotype in the gut (48). A negative correlation was found between the presence of Snodgrasella alvi and pathogenic Crithidia in bees (49), but pre-treatment of honey bees with S. alvi prior to challenge with Lotmaria passim (an A. mellifera pathogen closely related to Crithidia) resulted in greater levels of L. passim compared to bees which were not pre-treated (50). Thus, commensal microbiome species can have beneficial, mutual or parasitic relationships with their hosts, and in particular, different combinations of species – different microbiota communities – may be associated with variations in honey bee health.
With recent significant reductions in the cost of high throughput sequencing, metagenomics could be a useful tool for analysing genetic lineage, gut health and pathogen load as part of routine testing and/or monitoring imports for novel pathogens. Here, to establish baseline figures and test the suitability of this approach, we applied a novel network analysis framework together with deep sequencing of the honey bee metagenome, examining the genomes of honey bees and their symbiotic and pathogenic cobionts in UK apiaries.
Results
Metagenome sequencing of honey bees and their cobionts
We performed full metagenomic sequencing of 19 samples of UK honey bees (Supplementary Table 1). Samples were obtained from hives located across Scotland and England (Fig. 1a), each sample comprising of 16 workers collected from a single colony. Duplicates of samples 1–4 were analysed at a lower sequencing coverage to assess cobiont and genomic variant discovery. While the sample size was limited, the colonies sequenced were selected as representative of the phenotypic diversity of honey bees currently managed by UK apiarists. Notably, representatives of the Buckfast bee and the Colonsay “native” black bee lines were included in the sampling. The entire thorax and abdomen was processed for genome sequencing, thus including gut microorganisms, organisms attached to the outside of the bees, and haemolymph/tissue parasites. Between 4.5 and 12.5 million 125 base paired-end reads were generated per sample on the Illumina HiSeq 2500, equivalent to between 17- and 50-fold coverage of the honey bee genome (Amel 4.5).
Genomic diversity of sampled honey bees
DNA sequence data were mapped onto the honey bee reference genome (version Amel 4.5 (51)) and variants identified. Overall 3,940,467 sites were called as polymorphic, ranging from 962,775 to 2,586,224 single nucleotide variants (SNVs) per sample (Fig. 1b). A correlation graph derived from a matrix of identity-by-state (IBS) at each variant position for all samples was used to define related groups of samples (Fig. 1c). Group 1, which includes the native black bee sample from Colonsay (samples 2 and 9), was less heterozygous than Group 2 (Fig. 1d). ADMIXTURE (62) analyses were used to explore population subdivision in the data following removal of SNVs in linkage disequilibrium. ADMIXTURE cross-validation (CV) error values increased as the number of populations (K) assumed to be contributing to the variation were increased (K=1, CV= 0.562; K=2, CV= 0.601; K=3, CV= 0.712; K=4, CV= 0.853; K=5, CV= 1.007). At K=2 the Buckfast (samples 3 and 10) and Carniolan (samples 4 and 11) C lineage samples were distinguished from the M lineage A. m. mellifera samples, while K=3 further discerns the “native” A. m. mellifera sampled from Colonsay (samples 2 and 9), the Buckfast sample at K = 4 and the A. m. mellifera breeding project (samples 1 and 8) at K = 5 (Fig. 1e).
ADMIXTURE was originally designed to estimate ancestry in unrelated individuals rather than pooled DNA from several individuals, as analysed here. To address this, genotypes were simulated for 10 individuals per pooled DNA sample, using allele sequence depth to estimate allele frequency under an assumption of Hardy-Weinberg equilibrium and analysed using ADMIXTURE. The CV error values decreased as K was increased (K=1, CV= 0.980; K=2, CV= 0.835; K=3, CV= 0.795; K=4, CV= 0.763; K=5, CV= 0.736). At K≤3 the simulated data results were consistent with those from the actual pooled genotypes, while K=4 distinguished samples from the A. m. mellifera breeding project (samples 1 and 8), and K=5 assigned a distinct genetic background to bees sampled from Wigtownshire (sample 15) (Fig. 1e). k-nearest neighbour (kNN) network analysis of the pooled genotype data (63) also identified 2 clusters, separating C and M lineage samples in the same manner as the ADMIXTURE analyses. Together, these results support a model of two genetic backgrounds in the UK bee populations sampled, most likely representing the C and M lineages, with evidence of a distinct A. m. mellifera background in bees originating from Colonsay and other areas of Scotland, and differentiation of Buckfast and Carniolan bees (Fig. 1f).
The microbiome of honey bees
The majority of the data (~90% of reads) from each sample mapped to the honey bee reference genome. Reads that did not map to the honey bee reference were collated and used for a metagenomic assembly. This resulted in over 35,000 contigs greater than 1 kb in length. Contigs were assigned to a taxonomic group by comparison to a series of curated databases in a defined order (Fig. 2a) using BlobTools (52). First, contigs were compared to the bee cobiont sequence data in the HoloBee Database (v2016.1) (53), followed by genomes and proteomes of species identified as being bee-associated (54, 55), and finally by comparison of contigs against the NCBI Nucleotide and UniProt Reference Proteome databases. Patterns of coverage, GC% and taxonomic annotation of contigs were explored to identify likely genomic compartments present (Fig. 2b,c). We discarded contigs with read coverage lower than 1, as these were likely an artefact of pooling reads, yielding a final set 31,386 metagenome contigs, spanning 140 Mb. Taxon assignments are summarised in Supplementary Table 2. Correlation graphs were constructed based on the coverage or contribution to each contig from each sample. Clustering samples in this manner did not recapitulate their clustering by honey bee genome SNVs (Fig. 2d). A correlation graph was also constructed where nodes represented individual contigs (Fig. 3). A high correlation threshold (r = 0.99) was used, which meant that 35% of the contigs were unconnected. The highly structured multi-component graph was subdivided using the MCL algorithm (56) into clusters of contigs whose abundance across the samples was very similar. Many of these clusters were made up of contigs derived from the same species or in a number of cases from strongly co-occurring species.
Rarefaction analysis of ribosomal RNA sequences present in the assembled data was used to estimate the species richness discovered as a function of sequencing depth (Supplementary Fig. 1). While there was variation between samples in terms of species richness at all sequencing depths, even the lowest coverage achieved (17x reference genome coverage) was likely to be sufficient to capture most A. mellifera cobionts present.
We examined clusters on the graph further. One (Fig. 4a) contained 1.33 Mb of sequence, most of which had no match in public databases, but contained some contigs that had significant similarity to sequences from other Apis species (Fig. 4b). The number of reads mapping to these contigs was proportional to the depth of honey bee genome sequencing (Fig. 4c) and we infer that they likely represent reads from true A. mellifera genome fragments not present in the honey bee reference genome (Fig. 4d). Others in this cluster, spanning 0.01 Mb, matched sequences from Ascophaera apis (chalkbrood), an endemic fungal associate of honey bees (57).
Most of the other groups of contigs could be assigned to cobiont organisms. The contribution of non-A mellifera reads varied between samples, a pattern that may be partly explained by the presence in some samples of eukaryotic pathogens such as Nosema microsporidians and the trypanosomatid L. passim, which have larger genomes. The most abundant non-pathogenic bacterial cobionts identified were Gilliamella apicola, Bartonella apis, Frischella perrara, Snodgrassella alvi, “Firm-4” firmicutes (58) (Lactobacillusmellis and Lactobacillusmellifer), “Firm-5” firmicutes (58) (Lactobacillus melliventris, Lactobacillus kimbladii, Lactobacillus kullabergensis, Lactobacillus sp. wkB8, Lactobacillus helsingborgensis and Lactobacillus sp. wkB10), Lactobacillus kunkeei and Bifidobacterium asteroides (Supplementary Table 2). Each species varied in its abundance across the samples. In some nominal species, contig read coverage clustering suggested the presence of multiple distinct genotypes of cobionts. Contigs ascribed to Bartonella apis together had a total span of 11.7 Mb, almost five times longer than the reference B. apis genome, and formed a connected module (Fig. 5a). The three largest B. apis clusters had distinct distribution across the samples, which we suggest reflects the presence of distinct genotypes of B. apis with varying abundance across the samples. Similarly, contigs ascribed to Gilliamella apicola, the most abundant species identified in the bee microbiome, were distributed across three closely related clusters (Fig. 5b). Groups containing contigs from several closely related but distinct Lactobacillus species were identified: Firm-4 lactobacilli (clusters 25 and 40) or Firm-5 lactobacilli (clusters 16, 20, 21 and 24) (Fig. 5c). These Lactobacillus groups may represent distinct cobiont communities. The exception was cluster 21, which contained contigs assigned to a mix of Firm-5 species: this may represent a core genome component conserved between species. Cluster 29 comprised contigs assigned to Lactobacillus kunkeei that formed an unconnected graph component. L. kunkeei is thought to be an environmental rather than a gut microbiome organism. Some connected components were more complex. Cluster 32 contained contigs assigned to several prevalent honey bee cobionts, including G. apicola, F. perrara, B. asteroides, S. alvi, B. apis, S. floricola and P. apium. The co-clustering of genomic segments from multiple species is likely to reflect a strongly interacting community of organisms where the relative abundance of each is regulated homeostatically (43, 55, 59).
Some clusters had very restricted presence in the sample set. For example, cluster 3 was largely restricted to sample 4 (Supplementary Fig. 2e). These are likely to derive either from rare members of the honey bee cobiont community or opportunistic infections. Several clusters had little to no annotation (Supplementary Fig. 2f). The coverage of these contigs was also usually derived from individual samples. They may represent novel species, or divergent or novel genomic regions of known species.
Honey bee pathogens
Known honey bee pathogens were detected in many samples. One of the largest components of clustered contigs was assigned to the trypanosomatid parasite Lotmaria passim, with a combined span of 16.3 Mb (Fig. 6a). While sequences were detected from notifiable pathogens Melisococcus plutonious and Paenibacillus larvae (European and American foulbrood), no distinct cluster was identified and the <1Mb total combined span of matched sequences was relatively minor (Supplementary Table 2).
Both Nosema species N. apis (Fig. 6b) and N. ceranae (Fig. 6c) were identified. N. ceranae was more prevalent (4/19 colonies vs. 2/19 colonies). Contigs matching the pathogen causing “chalk brood” (Ascophaera apis) were found in cluster 2 and were derived almost exclusively from sample 23 (Fig. 6d). In close proximity in the network graph was cluster 47, containing contigs assigned to the parasitic mite V. destructor and contigs assigned to Apis mellifera filamentous virus (AmFV), found in 6/19 colonies (Fig. 6c). The largest source of reads mapping to these contigs was sample 23, which also had a high prevalence of chalkbrood. Blobplots describing the taxonomy and cumulative span for each panel in Fig. 6 are available in Supplementary Fig. 2d-j.
To validate the metagenomic hits, we employed PCR to screen our samples for B. apis, Nosema ceranae and L. passim. All samples in which we identified sequences deriving from organisms were positive by their respective PCR. However, we also identified the presence of species in additional samples not scored as positive by sequencing, suggesting that the PCR assays are more sensitive than bulk sequencing (Supplementary Fig. 3a-c). We also identified a small cluster containing only one contig matching to a recorded genome sequence, Apicystis bombi, a gregarine known to parasatise honey bees (60). To identify the exact species present, we sequenced the PCR results of custom primers against the largest contig in this cluster, in conjunction with primers encompassing the 18 S and ITS2 rDNA regions, as used by Dias et al. for the characterisation of novel gregarine species (61) (Supplementary Fig. 3d). The contig sequence matched with various gregarine species, while the ribosomal DNA sequence confirmed the species present to be Apicystis bombi (Supplementary Fig. 3e).
Discussion
A healthy population of honey bees is crucial for the security of the ecosystem service of pollination. With the continued and sometimes unregulated global transport of A. mellifera, the introduction of invasive pests and parasites is a continuing threat, as is the genetic dilution or extinction of locally adapted subspecies. Here we used metagenomic analyses of nineteen honey bee colonies from around the UK to compare host genetics, examine the complexity and connectedness of the bee microbiome, and quantify disease burden.
Using the reference honey bee genome and sequence data from sixteen worker bees from each colony, we defined over five million SNVs with a relatively even distribution across all sixteen chromosomes (Fig. 1). We also identified likely honey bee-derived sequences not represented in the reference genome, likely because the reference is incomplete or because of variation in genomic content between honey bee populations. The island of Colonsay in Scotland is a reserve for the northern European bee, A. m. melifera. Given the level of bee imports into Scotland, it was therefore reassuring – and perhaps surprising – to observe that the genotypes of other colonies from around Scotland were close to that of the Colonsay sample, although distinct from samples from A. m. mellifera breeding programmes in England. The low heterozygosity of Scottish A. m. mellifera and continued survival in face of imports may reflect natural selection for A. m. mellifera genotypes in the colder climates and shorter foraging season of northern Europe.
The whole organism-derived sequence data was also used to explore the composition of the communities of organisms living in or on honey bees. Non-A. Mellifera-mapping reads were assembled to generate 160 Mb of genomic sequence from honey bee cobionts. These cobionts were biologically identified using read depth coverage, patterns of coverage across samples and best taxonomic assignment based on comparisons to known organisms. A correlation network based on per-sample read coverages of these contigs (Fig. 2d) did not fully match the relatedness of the source bees (Fig. 1c), suggesting that both environmental and host genetic components drive microbiome composition. Our limited sampling (only nineteen colonies) is not sufficient to unpick these interdependent drivers, but we note that samples from the Scottish coast, the central belt of Scotland and from England were grouped separately. These data are congruent with previous analyses of the roles of climate and forage in determining microbiome structure of honey bees (62, 63).
In many animals, the gut microbiota form quasi-stable communities, with individual hosts harbouring somewhat predictable communities of different bacterial taxa. These different microbiome types have been associated with different gross physiological performance. In addition, changes in microbiota composition (dysbiosis) have been associated with the promotion of disease states in humans and other mammals (64, 65). Dysbiosis in honey bees may be an important correlate of bee and colony health (48, 66–68).
In the honey bee gut, bacterial numbers are highest in the rectum, followed by the ileum, mid-gut and crop (66). Lactobacilli are mainly found close to the rectum and, together with bifidobacteria, greatly outnumber other species (66). We identified several contig clusters that likely represented single Lactobacillus species as well as a mixed-origin cluster (Fig. 4). Most of these were interlinked, revealing patterns of co-occurence of individual taxa. In contrast, L. kunkeei, an environmental cobiont reportedly indicative of poor health (66), formed a distinct, unlinked cluster. Samples 2 and 9 were technical replicates, and both had reduced diversity, containing only G. apicola and Lactobacillus species. The reason for this is unclear, but there was no evidence of pathogenic disruption of the sampled bees.
Nosema infection has been linked to immune suppression and oxidative stress of bee hosts (69). Similarly L. kunkeei and P. apium, which are adapted to fluctuating oxygen levels predicted for the gut (70), have been associated with disease states in social bees, and negatively correlated with the amount of core commensal bacteria present (66). The microbiome from sample 23 was had a preponderance of reads mapping to the L. kunkeei cluster (Supplementary Fig. 2c), evidence of P. apium presence, much reduced representation of other Lactobacillus species, and the highest read coverage of contigs associated with the pathogens V. destructor, AmFV and A. apis. Sample 23 may be an example of pathogen-induced dysbiosis, or of invasion by pathogens of a resident microbiome disturbed by other drivers. There was a high level of co-occurrence of different pathogens across samples, implying that colonies infected with one pathogen may be more susceptible to others. A meta-stable community may exist in the case of Varria destructor and AmFV (Fig. 6c). However, we note a recent study reported identifying 0.5 Mb of sequence from Varroa reference genome to be of AmFV origin (71). It is therefore possible that several of the contigs in our study matched with Varroa destructor are in fact of AmFV origin.
Several distinct contig clusters were assigned to G. apicola and B. apis suggesting the existence of genetically distinct subtypes of these highly prevalent bacteria. (Fig. 5a,b). G. apicola has a high diversity of accessory genes, associated with adaptation to different A. mellifera ecological niches (72, 73). Increased relative abundance of G. apicola has been associated with dysbiosis and host deficiencies (66). Similarly, extreme displacement of S. alvi by F. perrara and G. apicola (and to a lesser extent by the opportunists P. apium and L. kunkeei) has been strongly associated with reduced bacterial biofilm function and host tissue disruption by scab-inducing F. perrara (48, 68), leading to poor host development and early mortality. Blooms of B. apis have also been associated with poor health. This species exploits stressed, young, and old bees, showing sporadic abundance in whole guts of newly emerged workers (58) and occurring uniformly across putatively dysbiotic foragers (56). In support of this theory, samples from our study with the highest coverage of G. apicola and B. apis contigs also contained reads from pathogens such as L. passim or Nosema species. Significant positive correlation has been reported between infection levels of these parasites (74).
Our novel use of correlation networks (Fig. 3) to organise contigs based on their relative abundance across samples partitioned 65% of them into clusters of sequences corresponding to individual species and distinct micro-communities. Some sample-specific clusters, such as clusters 3 and 32, contained several core microbiome taxa. This may be a reflection of substrate specialisation based on host foraging (75). However, several sample specific clusters contained contigs that had no informative taxonomic annotation, potentially revealing uncharacterised species. We identified a cluster of unclassified contigs derived from a gregarine, with closest match to Apicystis bombi. The accuracy of our metagenomic analyses was confirmed by PCR and ribosomal DNA primers verified the species as Apicystis bombi. This is further evidence that managed honey bees can act as a reservoir for wild pollinator pathogens (60); through increased understanding of honey bee molecular ecology and preventing disease transmission, we can indirectly improve wild pollinator health (76). To our knowledge Lotmaria passim had not been previously identified in the UK. Its presence was confirmed for the first time in our study using the primers designed by Stevanovic et al. (77), further validating our sequencing inference.
A whole-organism metagenomics approach has allowed us to describe the complexity of host-microbiome biology of UK honey bees. Using pooled samples we have demonstrated the power of this approach in dual characterisation of the genotypic diversity of the honey bee and the genomic diversity of its cobionts. Correlation networks are powerful analytic tools that allowed us to cluster the sequence data to reveal interacting networks of bacterial and eukaryotic microbiota, in addition to classifying novel genomic sequences. As with the human and other animal microbiome projects, the precision of these analyses improves with additional data, permitting definition (and ultimately whole genome assembly) of novel genotypes of cobionts. To this end, the raw data from this project can be accessed through the Bee Microbiome Database, established and managed by the Bee Microbiome Consortium, a non-profit organization of bee scientists for collecting, curating and analysing bee microbiome data (55). Complementation of cheap short read data with low-coverage long-read data from isolated gut contents enhances the contiguity of assemblies and the functional inferences that can be derived them. This study highlights the potential to use this approach in routine screening, breeding programmes and horizon scanning for emerging pathogens.
Methods
Samples
Nineteen samples of honey bee (each comprising sixteen workers collected from a single colony) were obtained from beekeepers in Scotland and England, with the help of Science and Advice for Scottish Agriculture (SASA) and Fera Science Ltd. Wings, legs and heads were removed before homogenizing the remainder of the bees (thorax and abdomen) in 2% CTAB buffer (100 mM Tris-HCl pH 8.0, 1.4 M NaCl, 20 mM EDTA pH 8.0, 2% hexadecyltrimethylammonium bromide, 0.2% 2-mercaptoethanol). Samples were incubated at 60ΰC with proteinase K (54 ng/μl) for 16 h before incubating with RNaseA (2.7 ng/μl) at 37ΰC for 1 h. After two chloroform:isoamyl alcohol (24:1) extractions, samples were ethanol precipitated, washed three times in 70% ethanol and resuspended in 0.1 TE. All genomic DNA samples were analysed for quantity (Qubit dsDNA HS Assay Kit, Thermo Fisher Scientific, Waltham, MA, USA), purity (Nanodrop, Thermo Fisher Scientific, Waltham, MA, USA) and quality (TapeStation, Agilent Technologies, Santa Clara, CA, USA).
Sequencing
All sequencing was performed by Edinburgh Genomics. DNAs were prepared for whole genome sequencing using the TruSeq DNA PCR-free gel free library kit (Illumina, Cambridge, UK) and, for eight samples, using the TruSeq DNA Nano gel free library kits (Illumina). For comparison, both types of libraries were prepared for four samples. 125 base paired-end sequencing was performed on an Illumina HiSeq 2500. Four samples were sequenced at 50X coverage, eight at 25X (including repeat sequencing of the four 50X samples) and twelve at 17X coverage. Data were screened for quality using FastQC v0.11.2 (Available online at: http://www.bioinformatics.babraham.ac.uk/projects/fastqc), and trimmed of low quality regions and adapters using Trimmomatic v0.35 (78) with parameters ‘TRAILING:20 SLIDINGWINDOW:4:20 MINLEN:100.’ These parameters remove bases from the end of a read if they are below a Phred score of 20, clip the read if the average Phred score within a 4 base sliding window advanced from the 5’ end falls below 20, and specify a minimum read length of 100 bases (the parameters used for all informatics analyses are also detailed in Supplementary Table 3).
Variant calling on honey bee
Reads were aligned to the reference A. mellifera genome, Amel_4.5 (INSDC assembly GCA_000002195.1) using BWA-MEM v0.7.8 (79) with parameters -R and -M. Output files were merged and duplicates marked using Picard Tools v2.1.1 to create one BAM file per sample. This was filtered using SAMtools view v1.3 (80) to retain only the highest confidence alignments using the parameters -q 20 (to remove alignments with a Phred score <20) and -F 12 (to remove all reads that are not mapped and whose mate is not mapped).
Variants were called using GATK v3.5 in accordance with GATK best practice recommendations (81, 82). Local realignments were performed and base quality scores recalibrated using bee SNVs from dbSNP (83) build ID 140 (ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/bee_7460/VCF/, downloaded 1st January 2016). GATK HaplotypeCaller was used with parameters emitRefConfidence, - GVCF variant index type – LINEAR, variant index parameter -128000, stand emit conf – 30, stand call conf - 30. The resulting VCFs, one per sample, were merged to create a single gVCF file using GATK GenotypeGVCFs to allow variants to be called on all samples simultaneously. Variant quality score recalibration was performed on this file using GATK VariantRecalibrator with parameters badLodCutoff – 3, -an QD, -an MQ, -an MQRankSum, -an ReadPosRankSum, -an FS, -an DP (specifying the above dbSNP data as both the truth set [prior=15.0] and training set [prior=12.0]). To identify any effect these variants may have upon protein-coding genes in the reference annotation, we used SNPeff v4.2 (84). A total of 5,302,201 variants were identified across the 19 samples.
An Identity By State (IBS) analysis was performed using the R/Bioconductor package, SNPRelate (85). Briefly, the minor allele frequency and missing rate for each SNV was calculated over all the samples. The values of the resultant IBS matrix ranged from zero to one. Using this matrix, we constructed a network correlation graph for all of the samples, using the network analysis tool Graphia Professional (Kajeka Ltd., Edinburgh, UK), where each node represented a sample, and edges between nodes represented a correlation above the defined threshold between those samples.
De novo assembly and analysis of non-honey bee data
De novo assembly was performed on all of the reads which did not map to the Apis mellifera reference genome using SPAdes v3.8.1 (86). The resulting contigs were filtered by length (> 1kb) and coverage (> 2). BWA-MEM (79) was used to identify and remove reads mapping to these contigs and de novo assembly was performed on the remaining reads. This process was repeated for a total of five iterations. Input reads from each sample were mapped to each contig using BWA-MEM and base coverage/contig was calculated. Contigs with a cumulative base coverage from all samples less than half the SPAdes overall coverage were discarded. Using BLAST (87), contigs were compared to a set of custom databases: 1. HB_Bar_v2016.1 (53); 1. HB_Mop_v2016.1 (53); 3. nucleotide sequences of core microbiome species identified from literature (43, 44, 55, 73); 4. protein sequences of these species (43, 44, 55, 73); 5. NCBI nt (88); 6. UniProt Reference Proteomes (89) using BLAST (87) and Diamond (90). Files of all six sequence similarity searches were provided as input to BlobTools in the listed order under the tax-rule ‘bestsumorder’, i.e. a contig is assigned the NCBI taxid of the taxon providing the best scoring hits within a given file, as long as it has not been allocated a NCBI taxid in a previous file.. BlobTools was used to visualise the coverage, GC% and best BLAST similarity match of the assembly, and to build a table of base coverage of contigs in each sample together with their taxonomic annotation. A network graph was constructed using r value of 0.99 comparing samples to each other based on correlations between their overall microbiome content as well as contig coverage across the dataset. This follows the approach used to compare gene expression values in transcriptomics data (91).
Population genetics analyses
SNVs were filtered using Plink v1.9; (92) to remove those not mapped to the autosomes, those having low genotyping call rate (<0.9), those with low minor allele frequency (<0.1), and those with pairwise linkage disequilibrium r2 > 0.1 (for SNVs in 50 kb windows with a 10 kb step). The resulting 58,354 SNVs were submitted to unsupervised analyses in ADMIXTURE (93) for 1 ≤ K ≤ 5 genetic backgrounds. To explore consequence of analysing genotypes from pooled DNA, individual genotypes simulated for 10 individuals per sampling location for each SNV were subjected to ADMIXTURE analysis. Briefly, for each SNV the allele frequency observed in a pooled sample was calculated from the read counts for each allele, and used to simulate ten genotypes assuming Hardy-Weinberg equilibrium. The efficacy of this process was tested using data from Harpur et al. (94), details of which are provided in the supplementary information (Supplementary Table 3). A distance matrix from the pooled DNA genotypes used in ADMIXTURE analyses was generated with Plink and analysed using the R package netview (95) (https://github.com/esteinig/netview), which analyses genetic structure using mutual k-nearest neighbour (kNN) graphs. Graphs were created assuming 2 ≤ k ≤ 20 nearest neighbours.
Primer design for identification of cobionts using PCR
Custom primers were designed against the longest contigs we generated matching Bartonella apis (Bartonella_Fw 5’-CAGCAGCGCTTATTCCGTTC-3’, Bartonella_Rv 5’-AGTCACGAGCAACAATCGGT-3’) and the Gregarine species (Gregarine_F 5’-GACCACCGTCCTGCTGTTTA-3’, Gregarine_R 5’-GAGGTATCGGGTGCCATGA-3’). Primers were run through NCBI BLAST to confirm specificity (87). Apicystis bombi specific primers were used as described in Dias et al. (61). Specific primers against Nosema ceranae were used as described by Chen et al. (96) and Lotmaria passim specific primers were used as described by Stevanovic et al. (77).
Rarefaction analysis of microbiome sampling
“Mean species richness” was calculated using the R package ‘vegan’ (97) for each sample at each of the sequencing depths used. Assembled contigs were analysed against the SILVA rDNA (16S and 18S) databases (98) instead of the NCBI nt database to assess species composition. Each contig identified as being from a unique species was counted as one “count” or incidence of discovering that species in the sample.
References
- 1.↵
- 2.↵
- 3.↵
- 4.↵
- 5.↵
- 6.↵
- 7.↵
- 8.
- 9.↵
- 10.↵
- 11.↵
- 12.↵
- 13.↵
- 14.↵
- 15.↵
- 16.↵
- 17.↵
- 18.
- 19.
- 20.
- 21.↵
- 22.↵
- 23.↵
- 24.↵
- 25.↵
- 26.↵
- 27.↵
- 28.↵
- 29.↵
- 30.↵
- 31.↵
- 32.↵
- 33.
- 34.↵
- 35.↵
- 36.
- 37.↵
- 38.↵
- 39.↵
- 40.↵
- 41.↵
- 42.↵
- 43.↵
- 44.↵
- 45.
- 46.↵
- 47.↵
- 48.↵
- 49.↵
- 50.↵
- 51.↵
- 52.↵
- 53.↵
- 54.↵
- 55.↵
- 56.↵
- 57.↵
- 58.↵
- 59.↵
- 60.↵
- 61.↵
- 62.↵
- 63.↵
- 64.↵
- 65.↵
- 66.↵
- 67.
- 68.↵
- 69.↵
- 70.↵
- 71.↵
- 72.↵
- 73.↵
- 74.↵
- 75.↵
- 76.↵
- 77.↵
- 78.↵
- 79.↵
- 80.↵
- 81.↵
- 82.↵
- 83.↵
- 84.↵
- 85.↵
- 86.↵
- 87.↵
- 88.↵
- 89.↵
- 90.↵
- 91.↵
- 92.↵
- 93.↵
- 94.↵
- 95.↵
- 96.↵
- 97.↵
- 98.↵