Abstract
The great gerbil (Rhombomys opimus) is a social rodent living in permanent, complex burrow systems distributed throughout Central Asia, where it serves as the main host of several important vector-borne infectious diseases and is defined as a key reservoir species for plague (Yersinia pestis). Studies from the wild have shown that the great gerbil is largely resistant to plague, but the genetic basis for this resistance is yet to be determined. Here, we present a highly contiguous annotated genome assembly of the great gerbil, covering over 96 % of the estimated 2.47 Gb genome. Comparative genomic analyses focusing on the immune gene repertoire reveal shared gene losses within TLR gene families (i.e. TLR8, TLR10 and all members of the TLR11-subfamily) for the Gerbillinae lineage, accompanied by signs of diversifying selection of TLR7 and TLR9. Most notably, we find a great gerbil-specific duplication of the MHCII DRB locus. In silico analyses suggest that the duplicated gene provides high peptide binding affinity for Yersiniae epitopes. The great gerbil genome provides new insights into the genomic landscape that confers immunological resistance towards plague. The high affinity for Yersinia epitopes could be key to understanding the high resistance in great gerbils, putatively conferring a faster initiation of the adaptive immune response leading to survival of the infection. Our study demonstrates the power of studying zoonoses in natural hosts through the generation of a genome resource for further comparative and experimental work on plague survival and evolution of host-pathogen interactions.
Main
The great gerbil (Rhombomys opimus) is a key plague reservoir species of Central Asia [1] whose habitat stretches from Iran through Kazakhstan to northeastern China. This diurnal, fossorial rodent lives in arid and semi-arid deserts, and forms small family groups that reside in extensive and complex burrow systems with a large surface diameter and multiple entrances, food storage and nesting chambers [2]. Where great gerbil communities coincide with human settlements and agriculture, they are often viewed as pests because they destroy crops and carry vector-borne diseases [3-5]. The great gerbil is a dominant plague host species in nearly a third of the plague reservoirs located in the vast territories of Russia, Kazakhstan and China [6].
Plague, caused by the gram-negative bacterium Yersinia pestis, is a common disease in wild rodents living in semi-arid deserts and montane steppes, as well as in tropical regions [7,8]. It is predominantly transmitted between rodents by fleas living on rodents or in rodent nests [9] and regularly spills over into human populations [10], leading to individual cases and sometimes localized plague outbreaks [11]. Historically, spillover has resulted in three major human pandemics and continues to cause annual outbreaks of human plague cases in Madagascar [12-14]. Humans have played an important role in spreading the disease globally [15]. However, they are generally dead-end hosts, and the long-term persistence of plague depends on plague reservoirs, which are areas where the biotic and abiotic conditions favor the bacterium’s survival [5].
Most commonly, plague enters the body through the bite of an infected flea, with the bacteria being deposited in the dermal tissue of the skin [9,16]. Once the primary physical barriers of the mammalian immune defense have been breached, the pathogen encounters a diverse community of innate immune cells and proteins evolved to recognize and destroy invasive pathogens. Here, Toll-like receptors (TLRs) and other pattern recognition receptors (PRRs) are at the forefront and have a vital role in the recognition of pathogens and the initiation of immune responses. Stimulation of adaptive immunity is in turn governed by the major histocompatibility complex (MHC). MHC class I (MHCI) and class II (MHCII) proteins present antigens to CD8+ and CD4+ T lymphocytes, respectively. In particular, the CD4+ T lymphocyte is a master activator and regulator of adaptive immune responses [17,18].
In host-pathogen interactions, both sides evolve mechanisms to overpower the other, engaging in an evolutionary arms race that shapes the genetic diversity on both sides [19,20]. Y. pestis mounts a specialized and complex attack to evade detection and destruction by the mammalian immune system and establish infection [21]. Upon entering a mammalian host, the change in temperature to 37°C initiates a change in bacterial gene expression, switching on a wealth of virulence genes whose combined action enables Y. pestis to evade both extracellular and intracellular immune defenses [22] at the site of infection, in the lymph node and finally in the colonized blood-rich organs [16,23-26]. The host, in addition to standard immune responses, will have to establish countermeasures to overcome the Y. pestis strategy of suppressing and delaying the innate immune responses [27,28]. These include recognizing the pathogen, resisting the bacterial signals that induce apoptosis of antigen presenting cells (APCs) and successfully producing an inflammatory response that can overpower the infection while avoiding hyperactivation.
Like all main plague reservoir hosts, great gerbils cope remarkably well with plague infections, with only a minor increase in mortality compared to natural mortality (see [10,29] for details). In a laboratory setting, a very large dose of Y. pestis is required to reach the lethal dose at which half of the injected animals die (LD50) [30]. Variation in plague resistance does exist between individual great gerbils [30]; however, the genetic basis of plague resistance and of the differences in survival is still unclear. The adaptive immune system requires several days to respond to an infection, and Y. pestis progresses so rapidly that it can kill susceptible hosts within days. Consequently, the genetic background of the innate immune system could potentially play a pivotal part in plague survival and also contribute to the observed heterogeneity in plague resistance [31]. For a successful response, the innate immune system would have to keep the infection in check whilst properly activating the adaptive immune system [18], which can then mount an appropriate immune response leading to a more efficient and complete clearance of the pathogen. Previous studies investigating plague resistance have indeed implicated components of both innate [32-38] and adaptive immunity [39,40]. However, none of these studies have involved wild reservoir hosts in combination with whole-genome sequencing, an approach with increased resolution that can be used in a comparative genomic setting to investigate adaptation, evolution and disease.
The importance of studying the genetics and genomics of zoonoses in their natural hosts is increasingly recognized [41], and advances in sequencing technology have made it possible and affordable to perform whole-genome sequencing of non-model species for individual and comparative analysis of hosts facing a broad range of zoonoses [41].
In this paper, we present a de novo whole-genome sequence assembly of the major plague host, the great gerbil. We use this new resource to investigate the genomic landscape of innate and adaptive immunity, with a focus on candidate genes relevant for plague resistance such as TLRs and MHC, through comparative genomic analyses with the closely related plague hosts Mongolian gerbil (Meriones unguiculatus) and sand rat (Psammomys obesus) as well as other mammals.
Results
Genome assembly and annotation
We sequenced the genome of a wild-caught male great gerbil, sampled from the Xinjiang Province in China, using the Illumina HiSeq 2000/2500 platform (Additional file 2: Table S1 and S2). The genome was assembled de novo using ALLPATHS-LG resulting in an assembly consisting of 6,390 scaffolds with an N50 of 3.6 Mb and a total size of 2.376 Gb (Table 1), thus covering 96.4 % of the estimated genome size of 2.47 Gb. Assembly assessment with CEGMA and BUSCO, which investigates the presence and completeness of conserved eukaryotic and vertebrate genes, reported 85.88 % and 87.5 % gene completeness, respectively (Table 1). We were also able to locate all 39 HOX genes conserved in four clusters on four separate scaffolds through gene mining (Additional file 1: Fig S1). Further genome assessment with Blobology, characterizing possible contaminations, demonstrated a low degree of contamination, reporting that more than 98.5 % of the reads/bases had top hits of Rodentia. Thus, no scaffolds were filtered from our assembly. Annotation was performed using the MAKER2 pipeline and resulted in 70 974 predicted gene models of which 22 393 protein coding genes were retained based on default filtering on Annotation Edit Distance score (AED<1).
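The N50 statistic quoted above can be illustrated with a minimal sketch. It assumes the common definition of N50 (the length of the scaffold at which the cumulative length, counted from the longest scaffold downwards, first reaches half of the total assembly size). The function name and the toy lengths are hypothetical and are not taken from the great gerbil assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold lengths.

    Assumed definition: sort scaffolds from longest to shortest and
    return the length of the scaffold at which the running total first
    reaches at least half of the summed length.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no scaffold lengths given")


# Toy example with invented lengths (not real assembly data).
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_lengths))
```

In practice the lengths would be read from the assembly FASTA file, and the same function could be applied to contigs rather than scaffolds.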
Reduced TLR repertoire in great gerbil and Gerbillinae
We characterized the entire TLR genetic repertoire in the great gerbil genome and found 13 TLRs: TLR1-13 (Fig. 1). Of these, TLR1-7 and TLR9 were complete with signal peptide, ecto-domain, transmembrane domain, linker and Toll/interleukin 1 receptor (TIR) domain, and clustered phylogenetically well within each respective subfamily (Table 2 and Fig. 2). For the remaining five TLRs, we were only able to retrieve fragments of the TLR8 and TLR10 genes, and although the sequences of TLR11-13 were near full length, all three members of the TLR11 subfamily are putatively non-functional pseudogenes as they contain numerous point mutations that create premature stop codons as well as frameshift-causing indels. In addition, TLR12 contains a large deletion of 78 residues (Additional file 1: Figure S2). For TLR8, the recovered sequence almost exclusively covers the conserved TIR domain. The relative synteny of TLR7 and TLR8 on chromosome X is largely conserved in both human and published rodent genomes, as well as in the great gerbil, with the fragments of TLR8 being located upstream of the full-length sequence of TLR7 on scaffold00186 (Additional file 1: Figure S3). The great gerbil TLR10 fragments are located on the same scaffold as full-length TLR1 and TLR6 (scaffold00357), in a syntenic structure comparable to other mammals (Additional file 1: Figure S3). In addition to being far from full-length sequences, the pieces of TLR8 and TLR10 in the great gerbil genome have point mutations that create premature stop codons as well as frameshift-causing indels (Additional file 1: Figure S2). The same TLR repertoire is seen in the great gerbil's closest relatives, Mongolian gerbil and sand rat, with near full-length sequences of TLR12 and TLR13 and shorter fragments of TLR8 and TLR10. Interestingly, for TLR11 only shorter fragments were located in these two species, in contrast to the near full-length sequence identified in great gerbil. Moreover, premature stop codons and frameshift-causing indels were also present in both the near full-length and the fragmented genes of these two species (Fig. 1 and Additional file 1: Figure S2).
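As an illustration of how the disruptive mutations described above can be flagged, the following is a minimal sketch that translates a coding sequence with the standard genetic code and reports internal (premature) stop codons and whether the length is a multiple of three. It is not the procedure used in this study; the function names and the toy sequences are hypothetical.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds):
    """Translate a coding sequence; codons with unknown bases become 'X'."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # drop trailing partial codon, if any
    codons = (cds[i:i + 3] for i in range(0, usable, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def premature_stops(cds):
    """Return 0-based codon indices of stop codons before the final codon."""
    protein = translate(cds)
    return [i for i, aa in enumerate(protein[:-1]) if aa == "*"]


def frame_intact(cds):
    """True if the sequence length is a multiple of three."""
    return len(cds) % 3 == 0


# Toy sequences (invented for illustration, not gerbil TLR sequences).
intact = "ATGAAAGGGTGA"        # M K G stop
disrupted = "ATGAAATAAGGGTGA"  # M K stop G stop
print(premature_stops(intact), premature_stops(disrupted))
print(frame_intact(intact), frame_intact("ATGAAAGGTGA"))
```

A length check alone cannot locate an indel; in a real analysis, frameshifts are usually identified from an alignment against an intact orthologue.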
Diversifying selection of TLRs
To explore possible variations in selective pressure across the species in our analysis, we ran the adaptive branch-site random effects model (aBSREL) on all full-length TLRs. Evidence of episodic positive selection was demonstrated for the Gerbillinae lineage for TLR7 and TLR9 and for the Mongolian gerbil TLR7 specifically (Additional file 1: Figure S5 and S6). Additionally, all full-length great gerbil TLRs were analyzed for sites under selection using phylogeny guided mixed effects model of evolution (MEME), from the classic datamonkey and datamonkey version 2.0 websites. Reported sites common between both analyses for all full-length TLRs at p-value 0.05 and their distribution among each domain of the proteins are listed in Additional file 2: Table S3. Overall, the sites under selection were almost exclusively located in the ecto-domains with a few sites located in the signal peptide (TLR3, TLR6 and TLR9) and in the Linker and TIR domains (TLR1, TLR2, TLR4 and TLR5). The 3D protein structure of TLR4, TLR7 and TLR9 modelled onto the human TLR5 structure further demonstrated that the sites are predominantly located in loops interspersed between the leucine-rich repeats (Additional file 1: Figures S7-9).
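The step of keeping only sites reported by both MEME runs can be sketched as a simple set intersection. This is an illustration only: it assumes each run has been reduced to a mapping from alignment site to p-value, and that a site counts as reported when its p-value is at or below 0.05. The function name and the toy values are hypothetical.

```python
def common_sites(run_a, run_b, alpha=0.05):
    """Return the sorted sites with p-value <= alpha in both runs.

    run_a and run_b map site positions (int) to p-values (float).
    """
    significant_a = {site for site, p in run_a.items() if p <= alpha}
    significant_b = {site for site, p in run_b.items() if p <= alpha}
    return sorted(significant_a & significant_b)


# Toy p-values (invented for illustration, not results from this study).
classic_run = {10: 0.01, 20: 0.20, 30: 0.04}
version2_run = {10: 0.03, 30: 0.06, 40: 0.01}
print(common_sites(classic_run, version2_run))
```

Requiring agreement between the two runs is the conservative choice: a site is dropped if either analysis fails to support it.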
Scrutiny of the TLR4 amino acid sequence alignment revealed drastic differences in the properties of the residues at two positions reported to be important for maintaining signaling of hypoacetylated lipopolysaccharide (LPS). In rat (Rattus norvegicus) and all mouse species used in this study, the residues at position 367 and 434 are basic and positively charged while for the remaining species in the alignment including all Gerbillinae, the residues are acidic and negatively charged.
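The comparison of residue properties at a given alignment position can be sketched as follows. The grouping of amino acids is the textbook one (lysine, arginine and histidine as basic; aspartate and glutamate as acidic); the function names, species labels and toy sequences are hypothetical and do not reproduce the TLR4 alignment used here.

```python
BASIC = set("KRH")   # lysine, arginine, histidine (histidine is only weakly basic)
ACIDIC = set("DE")   # aspartate, glutamate


def charge_class(residue):
    """Classify a one-letter amino acid code by side-chain charge."""
    residue = residue.upper()
    if residue == "-":
        return "gap"
    if residue in BASIC:
        return "basic"
    if residue in ACIDIC:
        return "acidic"
    return "neutral"


def column_classes(alignment, column):
    """Classify every sequence at a 1-based alignment column.

    alignment maps a sequence name to an aligned sequence of equal length.
    """
    return {name: charge_class(seq[column - 1]) for name, seq in alignment.items()}


# Toy alignment (invented sequences, for illustration only).
toy_alignment = {"species_a": "MKLE", "species_b": "MELD", "species_c": "M-LQ"}
print(column_classes(toy_alignment, 2))
```

Note that positions such as 367 and 434 refer to a reference sequence, so they would first have to be mapped to alignment columns when gaps are present.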
Characterization of the great gerbil class I MHC region
The overall synteny of the MHCI region is well conserved in great gerbil, displaying the same translocation of some MHCI genes upstream of the MHCII region as demonstrated in mouse and rat i.e. with a distinct separation of the MHCI region into two clusters (Fig. 3). Some of the great gerbil copies were not included in the phylogeny due to missing data, which hindered their annotation. Additionally, the annotation was obstructed either by the copies being located on scaffolds not containing framework genes or due to variation in the micro-synteny of those particular loci of MHCIa and MHCIb between mouse, rat and great gerbil (Fig. 3). From the synteny it appears that MHCI genes are missing in the region between framework genes Trim39 and Trim26 and possibly between Bat1 and Pou5f1 in the great gerbil. For full gene names for these and other framework genes mentioned below, see Additional file 2: Table S4.
We were able to identify six scaffolds containing MHCI genes (Fig. 3 and Additional file 2: Table S5). Four of the scaffolds contained framework genes that enabled us to orient them. In total, we located 16 MHCI copies, of which we were able to obtain all three α domains for 10 of the copies. Three copies contain two out of three domains, while for the last three copies we were only able to locate the α3 domain. In one instance, the missing α domain was due to an assembly gap. Reciprocal BLAST confirmed the hits as MHCI genes. Due to the high similarity between different MHCI lineages, annotation of the identified sequences was done through phylogenetic analyses and synteny. Our phylogeny reveals both inter- and intraspecific clustering of the great gerbil MHCI genes with other rodent genes, with decent statistical support (i.e. bootstrap and/or posterior probabilities) for the internal branches (Fig. 4). Five great gerbil MHCI genes (Rhop-A1-5) cluster together in a main monophyletic clade, while the remaining copies cluster with mouse and rat MHCIb genes. Two of the copies (Rhop-A3Ψ and Rhop-M6Ψ) appear to be pseudogenes, as indicated by the presence of point mutations and frameshift-causing indels. Additionally, our phylogeny displays a monophyletic clustering of human MHCI genes (Fig. 4). The clade containing five of the great gerbil MHCI genes (Rhop-A1-5) possibly includes a combination of both classical (MHCIa) and non-classical (MHCIb) genes, as is the case for mouse and rat, where certain MHCIb genes cluster closely with MHCIa genes (Fig. 3 and Fig. 4). Also, due to the high degree of sequence similarity of rodent MHCI genes, the phylogenetic relationship between clades containing non-classical M and T MHCI genes could not be resolved with sufficient statistical support.
Fig. 3 The overall synteny of the MHCI and II regions is very well conserved in great gerbil, displaying the same translocation of MHCI genes upstream of MHCII as seen in mouse and rat, which results in the separation of the MHCI region into two. Most notably, for MHCII there is a duplication of a β gene of the DR locus in great gerbil (highlighted by a red asterisk), whose orientation has changed and which is located downstream of the framework (FW) gene Btnl2 that normally represents the end of the MHCII region.
Characterization of the great gerbil class II MHC region
A single scaffold (scaffold00896) of 471 076 bp was identified to contain all genes of the MHCII region, flanked by the reference framework genes Col11a2 and Btnl2. We were able to obtain orthologues of the α and β genes of the classical MHCII molecules DP, DQ and DR as well as of the ‘non-classical’ DM and DO molecules (Table 3). The antigen-processing genes for the class I presentation pathway, Psmb9, TAP1, Psmb8 and TAP2, also map to scaffold00896 (Fig. 3). Synteny of the MHCII region was largely conserved in great gerbil when compared to the mouse, rat and human regions, except for a single duplicated copy of Rhop-DRB (Rhop-DRB3) that was located distal to the Btnl2 framework gene, which represents the border between class II and III of the MHC region (Fig. 3). The duplicated copy of the Rhop-DRB gene has an antisense orientation, in contrast to the other copies of the Rhop-DRB genes in great gerbil. In rodents, the DR locus contains a duplication of the β gene and the two copies are termed β1 and β2, with the β2 gene being less polymorphic than the highly polymorphic β1 gene. The relative orientation of the β and α genes of the DR locus is conserved in most eutherian mammals studied to date, with the genes facing each other, as is the case for Rhop-DRB1, Rhop-DRB2 and Rhop-DRA (Fig. 3). Sequence alignment and a maximum likelihood (ML) phylogeny establish Rhop-DRB3 to be a duplication of Rhop-DRB1 (Fig. 5). Rhop-DRB1 and Rhop-DRB3 are separated by around 80 kb containing Rhop-DRB2, Rhop-DRA and five assembly gaps (Table 3).
No similar duplication of the DRB1 gene is seen in either of the two close relatives within the Gerbillinae subfamily used in our comparative analyses. BLAST searches of the sand rat genome returned a single full-length copy of the β1 gene and a near full-length copy of the β2 gene (Fig. 5 and Additional file 2: Table S6). According to the annotations of the Mongolian gerbil genome provided by NCBI, this species contains two copies of the DR locus β genes. A manual tBLASTn search of the genome assembly, using the protein sequences of the Mongolian gerbil DRB genes as queries, did not yield additional hits of β genes in this locus that could have been missed in the automatic annotation process. The phylogeny confirms the copies found in Mongolian gerbil to be β1 and β2 genes (Fig. 5).
MHCII DRB promoters
MHCII genes each contain a proximal promoter with conserved elements (S-X-Y motifs) that are crucial for the efficient expression of the gene. We aligned the proximal promoters of the β genes of the DR locus in great gerbil and the other investigated species to establish whether the integrity of the promoter was conserved, and to examine similarities and potential dissimilarities that could cause the previously reported differences in transcription and expression of β1 and β2 genes in rodents [42,44]. The alignment of the promoter region reveals the conserved structure and similarities within β1 and β2 genes as well as characteristic differences (Fig. 6 and Table 4). Clear similarities are seen between the proximal promoter regions of Rhop-DRB1 and Rhop-DRB3 and the other rodent and human β1 promoters, as illustrated by high sequence similarity and the presence of a CCAAT box just downstream of the Y motif in all investigated rodent β1 promoters. Notably, the CCAAT box is missing in β2 promoters. The crucial distance between the S and X motifs is conserved in all β genes, and the integrity of the S-X-Y motifs is observable for the Rhop-DRB1 and Rhop-DRB3 promoters. However, both the S and X box of Rhop-DRB2 are compromised by deletions in great gerbil. The deletion in the X box severely disrupts the motif and reduces its size by half. An identical deletion in the X box is seen in Mongolian gerbil, while the sand rat X box sequence covering the deleted parts is highly divergent from the conserved sequence found in the rest of the promoters (Fig. 6). Furthermore, for the β2 genes, two deletions downstream of the motifs are shared among all rodents in the alignment, and an additional insertion is observed in Gerbillinae members.
Peptide binding affinity predictions and expression of Rhop-DR MHCII molecules
Mouse and rat β2 molecules have been shown to have an extraordinary capacity to present the Y. pseudotuberculosis superantigen mitogen (YPm) [42]. We therefore investigated the peptide binding affinities of the Rhop-DR molecules by running translations of Rhop-DRA in combination with each of the three Rhop-DRB genes through the NetMHCIIpan 3.2 server [47], along with peptide/protein sequences of YPm, the Y. pestis F1 ‘capsular’ antigen and the LcrV antigen. Universally, Rhop-DRB3 shows an affinity profile identical to that of Rhop-DRB2, displaying high affinity towards both Y. pseudotuberculosis and Y. pestis epitopes, while Rhop-DRB1 does not (Fig. 7 and Additional file 3). The translated great gerbil MHCII molecules from the DP and DQ loci were also tested for peptide binding affinity, but only Rhop-DP displayed affinity to one of the epitopes tested. Furthermore, analyses of the translated amino acid sequences of sand rat DR (Psob-DR) molecules, as well as published protein sequences of Mongolian gerbil DR (Meun-DR) molecules and the mouse ortholog H2-E, confirmed the high affinity of β2 molecules to Y. pseudotuberculosis and Y. pestis (Additional file 1: Figure S10 and Additional file 3). The equal capacity of Rhop-DRB2 and Rhop-DRB3 to putatively present Yersiniae epitopes, combined with the proximal promoter investigations, led us to question the expression of DRB genes in great gerbil. Searching a set of raw counts of great gerbil expressed genes reveals that Rhop-DRB1 and Rhop-DRB3 are expressed at similar levels, while Rhop-DRB2 is either not expressed or expressed at undetectable levels (3936 and 2279 vs 14).
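Binding predictors for MHCII generally score short peptides cut from an antigen rather than the whole protein. The following minimal sketch shows how overlapping peptides can be enumerated from a protein sequence before scoring. The window length of 15 residues is an assumption (a common choice for MHCII predictions); the length used in this study is not stated here. The function name and the toy sequence are hypothetical, and the sketch does not call or reproduce NetMHCIIpan.

```python
def sliding_peptides(protein, length=15):
    """Return all overlapping peptides of the given length, in order.

    A protein shorter than the window yields an empty list.
    """
    if length <= 0:
        raise ValueError("length must be positive")
    return [protein[i:i + length] for i in range(len(protein) - length + 1)]


# Toy sequence (invented for illustration, not a Yersinia antigen).
toy_protein = "MKKISSVIAIALFGTIATAN"
peptides = sliding_peptides(toy_protein)
print(len(peptides), peptides[0], peptides[-1])
```

Each peptide would then be submitted to the predictor together with the MHCII α and β chain sequences of interest, and the per-peptide scores compared between molecules.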
Discussion
Here we present a highly contiguous de novo genome assembly of the great gerbil covering over 96 % of the estimated genome size and almost 88 % of the gene space, which is equivalent to the genic completeness reported for the recently published genome of the closely related sand rat [48] (Additional file 2: Table S7). Through comparative genomic analyses that include genome data from its close relatives within the Gerbillinae, we provide novel insight into the innate and adaptive immunological genomic landscape of this key plague host species.
The TLR repertoire in the great gerbil and Gerbillinae
TLRs are essential components of PRRs and the innate immune system as they alert the adaptive immune system of the presence of invading pathogens [49]. The detailed characterization of TLRs did not uncover any species-specific features for the great gerbil. However, it revealed a shared TLR gene repertoire for the Gerbillinae lineage (i.e. the great gerbil, sand rat and Mongolian gerbil), with gene losses of TLR8, TLR10 and all members of the TLR11-subfamily. This finding could indicate quite similar selective pressures on these species, at least with regard to the function of their TLRs, all being desert dwelling, burrowing rodents living in arid or semi-arid ecosystems and being capable of carrying plague. Thus, it is possible that the members of this clade have reduced the TLR repertoire in a cost-benefit response to environmental constraints or due to an altered repertoire of pathogen exposure [50]. These results are in line with the fairly conserved TLR gene repertoire reported within the vertebrate lineage [51], although the repertoire of TLR genes present within vertebrate groups can show major differences [51-53], presumably in response to the presence or lack of certain pathogen or environmental pressures [54,55]. Outside of Gerbillinae, the presence of the TLR11-subfamily appears to be universal in Rodentia, whereas it is functionally lost from the human repertoire [51,53]. The TLR11-subfamily recognizes parasites and bacteria through profilin, flagellin and 23S ribosomal RNA [56,57], and it is possible that cross-recognition of these patterns by other TLR members or other PRRs might have made the TLR11-subfamily redundant in Gerbillinae and humans [50]. The varying degree of point mutations, frameshift-causing indels and in some cases almost complete elimination of sequence in TLR8, TLR10 and TLR11-13 in Gerbillinae suggests successive losses of these receptors, where a shared pseudogenization of TLR12-13 across all species investigated was recorded. For TLR11, however, the pseudogenization seems to have occurred in multiple steps, i.e. with a more recent event in the great gerbil, where a near full-length sequence was identified compared to the shorter fragments identified for Mongolian gerbil and sand rat (Additional file 1: Figure S2C). Furthermore, the high degree of shared disruptive mutations among all three species of Gerbillinae indicates that the initiation of pseudogenization predates the speciation estimated to have occurred about 5.5 Mya [58].
In the context of plague susceptibility, the branch-specific diversifying selection reported here for TLR7 and TLR9 in Gerbillinae is intriguing, as both receptors have been implicated in affecting the outcome of plague infection in mice and humans [59-61]. For instance, the study by Dhariwala et al. (2017) showed, in a murine model, that TLR7 recognizes intracellular Y. pestis and is important for defense against disease in the lungs but is detrimental in septicemic plague [59]. Moreover, recognition of Y. pestis by TLR9 was demonstrated by Saikh et al. (2009) in human monocytes [61]. All but one of the residues under site-specific selection seen in TLR7 and TLR9 were located in the ectodomain, which may suggest possible alterations in ligand recognition driven by selection pressure from Y. pestis or other shared pathogens. Stimulation of TLR7 and TLR9 has also been reported to regulate antigen presentation by MHCII in murine macrophages [62]. These data could therefore indicate a possible connection between the selection in TLR7 and TLR9 and the great gerbil duplication in MHCII. For TLR4, the selection tests and sequence alignment analysis did not reveal any branch-specific selection for either great gerbil or Gerbillinae, whereas we did detect signs of site-specific selection in the ectodomain that occasionally was driven by great gerbil or Gerbillinae substitutions. TLR4 is the prototypical PRR for detection of lipopolysaccharides (LPS) found in the outer membrane of gram-negative bacteria like Y. pestis. As part of the arms race, however, it is well known that gram-negative bacteria, including Y. pestis, alter the conformation of their LPS in order to avoid recognition and strong stimulation of the TLR4-MD2-CD14 receptor complex [63-65]. Despite this, in mice at least, some inflammatory signaling still occurs through this receptor complex, but it requires particular residues in TLR4 that were not found to be conserved in the Gerbillinae lineage. Whether other mutations in TLR4 in Gerbillinae have a similar functionality as the residues that allow mice to respond to Y. pestis LPS is not known. However, if such functionality is missing in Gerbillinae, the loss of responsiveness to the hypoacetylated LPS [19] could perhaps confer some protection from pathologies caused by excessive initiation of inflammatory responses [66], and thus TLR4 is not likely to be directly involved in the resistance to plague in great gerbils.
Cumulatively, our investigations of the great gerbil innate immune system, focusing on the TLR gene repertoire, reveal shared gene losses within TLR gene families for the Gerbillinae lineage, all being desert dwelling species capable of carrying plague. The evolutionary analyses conducted did not uncover any great gerbil-specific features that could explain their resistance to Y. pestis, indicating that other PRRs (not investigated here) could be more directly involved during the innate immune response to plague infection in the great gerbil [36].
Great gerbil MHC repertoires
MHCI and II proteins are crucial links between the innate and adaptive immune system, continuously presenting peptides on the cell surface for recognition by CD8+ and CD4+ T cells, respectively, and MHC genes readily undergo duplications, deletions and pseudogenization [67]. For MHCI, the discovery of 16 copies in great gerbil is broadly in agreement with what has earlier been reported in rodents, where the MHCI region is found to have undergone extensive duplication followed by sub- and neofunctionalization, with several genes involved in non-immune functions [68,69]. However, it should be noted that our copy number estimate is most likely an underestimate, due to assembly collapse in almost all MHCI-containing regions identified. Furthermore, not all copies could be confidently placed in the gene maps, as some scaffolds lacked colocalizing framework genes. These two factors are the probable reason why the great gerbil appears to be lacking some MHCI genes compared to mouse and rat.
For MHCII, we discovered a great gerbil-specific duplication that is not present in the other closely related plague hosts or in the other rodents investigated. The phylogeny established the relationship of the duplicated copy (Rhop-DRB3) to Rhop-DRB1 and other mammalian β1 genes and reflects the orthology of mammalian MHCII genes [70]. The localization of Rhop-DRB3 outside of the generally conserved framework of the MHCII region, and not in tandem with the other β genes of the DR locus, is unusual and is not generally seen in eutherian mammals. For instance, major duplication events with altered organization and orientation of DR and DQ genes have been reported for the MHCII region in horse (Equus caballus); however, all genes are found within the framework genes [71]. Duplications tend to disperse in the genome as they age [72]; thus, the reversed orientation and translocation of the great gerbil copy might indicate that the duplication event is ancient, occurring sometime after the species split approximately 5 Mya. However, it must also be noted that there are several assembly gaps located between Rhop-DRB1 and Rhop-DRB3, which leaves open the possibility that the translocation is the result of an assembly error.
Predictions of the affinity of the β1, β2 and β3 MHCII molecules to Y. pestis and Y. pseudotuberculosis antigens matched the reported high affinity of rodent β2 molecules for Yersiniae epitopes [42]. Rhop-DRB3 had an equally high affinity and a largely identical affinity profile to Rhop-DRB2. A high affinity for Y. pestis epitopes is important in the immune response against plague, as the initiation of a T cell response is more efficient and requires fewer APCs and T cells when high-affinity peptides are presented by MHCII molecules [73]. In the early stages of an infection, when the amount of antigen is low, there will be fewer MHCII molecules presenting peptides, and affinity for those peptides is paramount to a fast initiation of the immune response against the pathogen. Individuals presenting MHCII molecules with high affinity for pathogen epitopes are able to raise an immune defense more quickly and have a better chance of fighting off the rapidly progressing infection than individuals that are fractionally slower. This fractional advantage could mean the difference between death and survival.
We find comparable expression levels for Rhop-DRB1 and Rhop-DRB3 but no detectable expression of Rhop-DRB2. These similarities and differences are likely explained by the variations discovered in the proximal promoters of the genes. Integrity of the conserved motifs and of the spacing between them is necessary for assembly of the enhanceosome complex of transcription factors and subsequent binding of the Class II Major Histocompatibility Complex transactivator (CIITA), and is essential for efficient expression of MHCII genes. The conservation of the proximal promoter of Rhop-DRB3, along with the overall sequence similarity with other β1 genes, is indicative of a similar expression pattern. In contrast, the deletion in the X box of Rhop-DRB2, which reduces the motif to half its size, will likely affect the ability of the transcription factors to bind and could explain the lack of expression. Similar disruptions were found in the β2 genes of the other Gerbillinae, along with a major deletion further downstream in all β2 genes that perhaps explains the previously reported low and unusual pattern of transcription for rodent β2 genes [42,44]. The equal affinity profile but different expression levels of Rhop-DRB2 and Rhop-DRB3 could mean that Rhop-DRB3 has taken over the immune function lost through the lack of expression of Rhop-DRB2. The selective pressure might have come from Yersinia or from pathogens similar to Yersiniae. A nonclassical function of MHCII molecules has also been reported, in which intracellular MHCII interacted with components of the TLR signaling pathway in a way that suggested MHCII molecules are required for full activation of the TLR-triggered innate immune response [74]. Moreover, in vertebrates the MHCII DRB genes are identified as highly polymorphic, and specific allele variants have frequently been linked to increased susceptibility to diseases in humans [75]. Intriguingly, in a recent study by Cobble et al. (2016) it was suggested that allelic variation of the DRB1 locus could be linked to plague survival in Gunnison’s prairie dog colonies [40]. Thus, investigating how the genetic variation of the DRB1 and DRB3 loci in great gerbil manifests at the population level, and the affinity of these allelic variants to Yersiniae epitopes, would be the next step in furthering our understanding of this plague-resistant key host species in Central Asia.
From the analyses conducted on the genomic landscape of the adaptive immune system of the great gerbil, more specifically MHCI and MHCII, the most interesting finding is the duplication of an MHCII gene. In silico analyses of Rhop-DRB3 indicate a high predicted affinity for Y. pestis epitopes, which may result in faster initiation of the adaptive immune response in great gerbils when exposed to the pathogen, and thus could explain the high degree of plague resistance in this species.
Conclusion
Plague has historically had a vast impact on human society through major pandemics; however, it mainly circulates in rodent communities. A key issue is to understand host-pathogen interactions in these rodent hosts. From the pathogen perspective, research has studied how Y. pestis has evolved to evade both detection and destruction by the mammalian immune system to establish infection. In this study, we have demonstrated the power of using whole genome sequencing of a wild plague reservoir species to gain new insight into the genomic landscape of its resistance by immuno-comparative analyses with closely related plague hosts and other mammals. We reveal the duplication of an MHCII gene in great gerbils with a computed peptide binding profile that putatively would cause a faster initiation of the adaptive immune response when exposed to Yersiniae epitopes. We also find signs of positive selection in TLR7 and TLR9, which have been shown to regulate antigen presentation and impact the outcome of a plague infection. Investigations into how the genetic variation of the MHCII locus manifests at the population level are necessary to further understand the role of the gene duplication in the resistance to plague in great gerbils. Comprehending the genetic basis for plague resistance is crucial to understand the persistence of plague in large regions of the world and the great gerbil de novo genome assembly is a valuable anchor for such work, as well as a resource for future comparative work in host-pathogen interactions, evolution (of resistance) and adaptation.
Methods
Sampling and sequencing
A male great gerbil weighing 180 g was captured in the Midong District outside Urumqi in Xinjiang Province, China in October 2013. The animal was humanely euthanized and tissue samples of liver were conserved in ethanol prior to DNA extraction. Blood samples from the individual were screened for F1 ‘capsular’ antigen (Caf1) and anti-F1 as described in [30,76] to confirm plague negative status. The DNA used in the library construction was extracted from liver tissue using Gentra Puregene Tissue Kit (Qiagen Inc. USA). Use of great gerbil tissue was approved by the Committee for Animal Welfares of Xinjiang CDC, China.
The sequence strategy was tailored towards the ALLPATHS-LG assembly software (Broad Institute, Cambridge, MA) following their recommendations for platform choice and fragment size resulting in the combination of one short paired-end (PE) library with an average insert size of 220 bp (150 bp read length) and two mate-pair (MP) libraries of 3 kbp and 10 kbp insert size (100 bp read length). See Additional file 2: Table S1 for a list of libraries and sequence yields. Sequencing for the de novo assembly of the great gerbil reference genome was performed on the Illumina platform using HiSeq2500 instruments at the Norwegian Sequencing Centre at the University of Oslo for the PE library (https://www.sequencing.uio.no) and using HiSeq2000 instruments at Génome Québec at McGill University for the MP libraries (http://gqinnovationcenter.com/index.aspx?l=e).
Genome assembly and Maker annotation
The Illumina sequences were quality checked using FastQC v0.11.2 and SGA-preqc (downloaded 25th June 2014) with default parameters. Both MP libraries were trimmed for adapter sequences using cutadapt v1.5 with option -b and a list of adapters used in MP library prep [77], and the trimmed reads were used alongside the PE short reads as input for ALLPATHS-LG v48639 to generate a de novo assembly. This combination of short-read sequencing technology and the ALLPATHS-LG assembly algorithm is documented to perform well in birds and mammals [78-80]. File preparations were conducted according to the developers’ recommendations and the option TARGETS=submission was added to the run to obtain a submission prepared assembly version.
Assembly completeness was assessed by analysing the extent of conserved eukaryotic genes present using CEGMA v2.4.010312 and BUSCO v1.1.b [81-83]. Gene mining for the highly conserved Homeobox (HOX) genes was also conducted as an additional assessment of assembly completeness (see Additional file 1: Note S1 and Figure S10). All reads were mapped back to the assembly using BWA-MEM v 0.7.5a and the resulting bam files were used alongside the assembly in REAPR v 1.0.17 to evaluate potential scaffolding errors as well as in Blobology to inspect the assembly for possible contaminants, creating Taxon-Annotated-GC-Coverage (TAGC) plots of the results from BLAST searches of the NCBI database [84].
The genome assembly was annotated using the MAKER2 pipeline v2.31 run iteratively in a two-pass fashion (as described in https://github.com/sujaikumar/assemblage/blob/master/README-annotation.md) [85]. Multiple steps are required prior to the first pass through MAKER2, including creating a repeat library for repeat masking and training three different ab initio gene predictors. Firstly, construction of the repeat library was conducted as described in [86]. In brief, a de novo repeat library was created for the assembly by running RepeatModeler v1.0.8 with default parameters, and sequences matching known proteins of repetitive nature were removed from the repeat library through BLASTx against the UniProt database. Next, GeneMark-ES v2.3e was trained on the genome assembly using default parameters with the exception of reducing the --min-contig parameter to 10,000 [87]. SNAP v20131129 and AUGUSTUS v3.0.2 were trained on the genes found by CEGMA and BUSCO, respectively. The generated gene predictors and the repeat library were used in the first pass alongside proteins from UniProt/SwissProt (downloaded 16th February 2016) as protein homology evidence and Mus musculus cDNA as alternative EST evidence (GRCm38 downloaded from Ensembl). For the second pass, SNAP and AUGUSTUS were retrained with the generated MAKER2 predictions and the run was otherwise performed with the same setup. The resulting gene predictions had domain annotations and putative functions added using InterProScan v5.4.47 and BLASTp against the UniProt database with an e-value of 1e-5 (same methodology as [86,88]). Finally, the output was filtered using the MAKER2 default filtering approach, only retaining predictions with AED <1.
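The final AED filter can be pictured as a single pass over the annotation GFF3. The following is a minimal sketch rather than the filtering code used in this study: it assumes that mRNA features carry an `_AED=` attribute, as in standard MAKER GFF3 output, and the function names `parse_attributes` and `keep_mrna_ids` are hypothetical.

```python
from typing import Dict, Iterable, Set


def parse_attributes(column: str) -> Dict[str, str]:
    """Split a GFF3 attribute column (key=value;key=value) into a dict."""
    return dict(f.split("=", 1) for f in column.strip().split(";") if "=" in f)


def keep_mrna_ids(gff_lines: Iterable[str], max_aed: float = 1.0) -> Set[str]:
    """Return IDs of mRNA features whose AED is strictly below max_aed."""
    kept = set()
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments, directives and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "mRNA":
            continue  # AED is assumed to be recorded on mRNA features only
        attrs = parse_attributes(cols[8])
        aed = attrs.get("_AED")
        if aed is not None and "ID" in attrs and float(aed) < max_aed:
            kept.add(attrs["ID"])
    return kept
```

An AED of 1 means that no evidence supports the prediction, so a strict `< 1` cut-off retains every model with at least some evidence overlap.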
Genome mining and gene alignments
We searched for TLR genes, associated receptors and adaptor molecules as well as genes of the MHC region (complete list of genes can be found in Additional file 2: Table S4) collected from UniProt and Ensembl. Throughout, we performed tBLASTn searches, manual assembly exon by exon in MEGA7 and verified annotations through reciprocal BLASTx against the NCBI database and phylogenetic analysis including orthologues from human (Homo sapiens), mouse (Mus musculus), rat (Rattus norvegicus) and all three members of the Gerbillinae subfamily. For details on the phylogenetic analyses we refer to descriptions in sections below. In the TLR analyses Algerian mouse (M. spretus), Ryukyu mouse (M. caroli), Chinese hamster (Cricetulus griseus) and Chinese tree shrew (Tupaia belangeri chinensis) were also included.
Sand rat and Mongolian gerbil genome assemblies were downloaded from NCBI (September 12th 2017). The genome assemblies of the great gerbil, sand rat and Mongolian gerbil were made into searchable databases for gene mining using the makeblastdb command of the blast+ v2.6.0 program. Local tBLASTn searches, using protein sequences of mouse and occasionally rat, human and Mongolian gerbil as queries, were executed with default parameters including an e-value cut-off of 1e+1. This permissive cut-off was used to capture more divergent sequence homologs. Hits were extracted from assemblies using bedtools v2.26.0 and aligned with orthologs in MEGA v7.0.26 using MUSCLE with default parameters. In cases where annotations for some of the TLRs for a species were missing in Ensembl and could not be located in either the NCBI nucleotide database or in UniProt, the Ensembl BLAST Tool (tBLASTn) was used with default parameters to find the genomic region of interest using queries from mouse.
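Moving from tBLASTn hits to sequence extraction requires converting hit coordinates into intervals that bedtools can read. The sketch below illustrates one way to do this and is not the authors' script: it assumes the searches were written in BLAST tabular format (`-outfmt 6`, which the text does not state), and `blast6_to_bed` is a hypothetical helper name.

```python
def blast6_to_bed(line: str) -> str:
    """Convert one BLAST tabular (outfmt 6) hit into a 6-column BED record.

    BLAST subject coordinates are 1-based and inclusive, and sstart > send
    marks a hit on the minus strand. BED is 0-based and half-open.
    """
    cols = line.rstrip("\n").split("\t")
    query, subject = cols[0], cols[1]
    sstart, send = int(cols[8]), int(cols[9])
    start = min(sstart, send) - 1          # shift to 0-based start
    end = max(sstart, send)                # half-open end equals 1-based end
    strand = "+" if sstart <= send else "-"
    # The score column is set to 0 as a placeholder value.
    return "\t".join([subject, str(start), str(end), query, "0", strand])


hit = "TLR4_MOUSE\tscaffold_12\t78.5\t200\t43\t0\t1\t200\t5600\t5001\t1e-90\t310"
print(blast6_to_bed(hit))
```

The scaffold and query names in the example hit are invented. Records of this form can be passed to `bedtools getfasta` with strand awareness enabled so that minus-strand hits are reverse-complemented.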
Synteny analyses of MHC regions
A combination of the Ensembl genome browser v92, comparisons presented in [89] and tBLASTn searches, as described above, was used in synteny analyses of the MHCI and II regions of human, rat and mouse with great gerbil. Synteny of MHCII genes of sand rat and Mongolian gerbil was also investigated but, for simplicity and visualization purposes, is not included in the figure (Fig. 3).
Alignment and phylogenetic reconstruction of TLR and MHC
Sequences were aligned with MAFFT [90] using default parameters; for both nucleotide and amino acid alignments the E-INS-i strategy was used. The resulting alignments were edited manually using Mesquite v3.4 [91]. See Additional file 2: Tables S8-10 for accession numbers.
Ambiguously aligned characters were removed from each alignment using Gblocks [92] with the least stringent parameters for codons and proteins.
Maximum likelihood (ML) phylogenetic analyses were performed using the “AUTO” parameter in RAxML v8.0.26 [93] to establish the evolutionary model with the best fit. The general time reversible (GTR) model was the preferred model for the nucleotide alignments, and JTT for the amino acid alignments. The topology with the highest likelihood score of 100 heuristic searches was chosen. Bootstrap values were calculated from 500 pseudoreplicates. Taxa with unstable phylogenetic affinities were pre-filtered using RogueNaRok [94] based on evaluation of a 50 % majority rule (MR) consensus tree, in addition to exclusion of taxa with >50 % gaps in the alignment.
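The gap-based exclusion criterion is straightforward to express in code. The following sketch is illustrative only: it treats `-` as the sole gap character, which is an assumption, and the function names are hypothetical.

```python
from typing import Dict


def gap_fraction(seq: str, gap: str = "-") -> float:
    """Fraction of alignment columns in which this sequence has a gap."""
    if not seq:
        return 1.0  # treat an empty row as entirely gaps
    return seq.count(gap) / len(seq)


def drop_gappy_taxa(alignment: Dict[str, str], max_gap: float = 0.5) -> Dict[str, str]:
    """Keep taxa whose gap fraction does not exceed max_gap (excludes >50 % by default)."""
    return {name: seq for name, seq in alignment.items()
            if gap_fraction(seq) <= max_gap}


toy = {"taxon_a": "ACGT----", "taxon_b": "AC------", "taxon_c": "ACGTACGT"}
print(sorted(drop_gappy_taxa(toy)))
```

In the toy alignment, `taxon_a` sits exactly at 50 % gaps and is retained, because the criterion excludes only taxa with more than 50 % gaps.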
Bayesian inference (BI) was performed using a modified version of MrBayes v3.2 [95] (https://github.com/astanabe/mrbayes5d). The dataset was executed under a separate gamma distribution. Two independent runs, each with three heated and one cold Markov Chain Monte Carlo (MCMC) chain, were started from a random starting tree. The MCMC chains were run for 20,000,000 generations with trees sampled every 1,000th generation. The posterior probabilities and mean marginal likelihood values of the trees were calculated after the burn-in phase (25 %), which was determined from the marginal likelihood scores of the initially sampled trees. The average split frequencies of the two runs were < 0.01, indicating the convergence of the MCMC chains.
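As a quick check on what these settings imply, the number of trees contributing to the posterior can be worked out directly. This is a back-of-the-envelope sketch with a hypothetical function name. It ignores the starting state, which MrBayes also records, so real sample counts may differ by one per run.

```python
def retained_samples(generations: int, sample_every: int,
                     burnin_fraction: float, runs: int = 1) -> int:
    """Number of post-burn-in samples across all runs (starting state ignored)."""
    per_run = generations // sample_every
    discarded = int(per_run * burnin_fraction)
    return (per_run - discarded) * runs


# 20,000,000 generations sampled every 1,000th, 25 % burn-in, two runs
print(retained_samples(20_000_000, 1_000, 0.25, runs=2))
```

With the values given in the text, each run yields 20,000 sampled trees, of which 5,000 are discarded as burn-in, leaving 15,000 per run and 30,000 in total.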
Selection analyses
All full-length TLRs located in the genomes of great gerbil, sand rat and Mongolian gerbil along with other mammalian TLRs (Additional file 2: Table S8) were analysed in both classic Datamonkey and Datamonkey 2.0 (datamonkey.org), testing for signs of selection with a phylogeny-guided approach [96,97]. For each TLR gene alignment a model test was first run prior to the selection test and the proposed best model was used in the analyses. The mixed effects model of evolution (MEME) and adaptive branch-site random effects model (aBSREL) were used to test for site-based and branch-level episodic selection, respectively [98-100]. aBSREL was run three times per gene alignment: first as an exploratory analysis in which all branches were tested for positive selection, and subsequently in a hypothesis mode in which first the Gerbillinae clade and then the great gerbil was selected as “foreground” branches to test for positive selection. All TLR alignments are available in the GitHub repository (https://github.com/uio-cels/Nilsson_innate_and_adaptive).
TLR protein structure prediction
Translated full-length great gerbil TLR sequences were submitted to the Phyre2 structure prediction server for modelling [101]. All sequences were modelled against human TLR5 (c3j0aA) and the resulting structures were colored for visualization purposes using Jmol (Jmol: an open-source Java viewer for chemical structures in 3D. http://www.jmol.org/). Colors were used to differentiate between helices, sheets and loops as well as the transmembrane domain, linker and TIR domain. Sites found in the MEME selection analysis were indicated in pink and further highlighted with arrows (Additional file 1: Figures S7-9). All great gerbil PDB files are available in the GitHub repository (https://github.com/uio-cels/Nilsson_innate_and_adaptive).
As TLR4 is the prototypical PRR for lipopolysaccharide (LPS), which is found in all Gram-negative bacteria including Y. pestis, we subjected the sequence alignment to additional investigation of certain residues indicated in the literature to have an impact on signaling [19]. These were the residues at positions 367 and 434, which in mouse are both basic and positively charged, enabling the mouse TLR4 to maintain some signaling even for hypoacetylated LPS [19]. Hypoacetylated LPS is a common strategy for Gram-negative bacteria to avoid recognition and strong stimulation of the TLR4-MD2-CD14 receptor complex [63-65].
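Inspecting specific residues in an alignment requires mapping ungapped protein positions onto alignment columns, since gaps shift the column index. The sketch below shows that mapping under stated assumptions: positions are taken to follow the numbering of a chosen reference sequence in the alignment (here labelled `mouse`), `-` is the gap character, and the toy sequences and function names are invented for illustration.

```python
from typing import Dict, Iterable


def column_for_position(ref_aligned: str, position: int, gap: str = "-") -> int:
    """Return the 0-based alignment column holding the 1-based ungapped
    residue `position` of the reference sequence."""
    seen = 0
    for col, char in enumerate(ref_aligned):
        if char != gap:
            seen += 1
            if seen == position:
                return col
    raise ValueError(f"reference has fewer than {position} residues")


def residues_at(alignment: Dict[str, str], ref_name: str,
                positions: Iterable[int]) -> Dict[str, Dict[int, str]]:
    """For every sequence, report the residue aligned to each reference position."""
    cols = {p: column_for_position(alignment[ref_name], p) for p in positions}
    return {name: {p: seq[c] for p, c in cols.items()}
            for name, seq in alignment.items()}


# Toy alignment: reference position 3 falls in column 3 because of the gap.
toy = {"mouse": "MK-RA", "other": "MQARA"}
print(residues_at(toy, "mouse", [2, 3]))
```

For the analysis described above, the positions of interest would be 367 and 434, and the returned residues could then be checked for lysine or arginine to assess whether the positive charge is conserved.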
MHCII promoter investigation
The regions 400 bp upstream of the human HLA-DRB, mouse H2-Eb and rat RT-Db genes were retrieved from Ensembl (GRCh38.p12, GRCm38.p6 and Rnor_6.0). Similarly, the regions 400 bp upstream of the start codon of DRB genes in the three Gerbillinae were retrieved using bedtools v2.26.0. Putative promoter S-X-Y motifs, as presented for mouse in [102], were manually identified for each gene in MEGA7 and all sequences were subsequently aligned using MUSCLE with default parameters [102].
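Retrieving an upstream region depends on strand: for a gene on the minus strand, "upstream" lies at higher coordinates. The following sketch computes the interval in BED coordinates (0-based, half-open). It is an illustration under assumptions, not the procedure used here: the input interval is taken to be the coding region with the start codon at its 5' end, and `upstream_interval` is a hypothetical name.

```python
from typing import Tuple


def upstream_interval(cds_start: int, cds_end: int, strand: str,
                      scaffold_length: int, flank: int = 400) -> Tuple[int, int]:
    """Return the BED interval covering `flank` bp upstream of the start codon.

    cds_start/cds_end are 0-based, half-open coordinates of the coding region.
    The result is clipped at the scaffold boundaries.
    """
    if strand == "+":
        return max(0, cds_start - flank), cds_start
    if strand == "-":
        return cds_end, min(scaffold_length, cds_end + flank)
    raise ValueError("strand must be '+' or '-'")


print(upstream_interval(1000, 2000, "+", 10_000))
print(upstream_interval(1000, 2000, "-", 10_000))
```

Sequence extracted for a minus-strand interval must be reverse-complemented before motif searching, for example by extracting in a strand-aware manner.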
Peptide binding affinity
The functionality of MHCII genes is defined by the degree of expression of the MHC genes themselves, and the protein's ability to bind disease-specific peptides to present to the immune system. The ability of an MHCII protein to bind particular peptides can with some degree of confidence be estimated by MHC prediction algorithms, even for unknown MHCII molecules, as long as the alpha and beta-chain protein sequences are available [47]. We here use the NetMHCIIpan predictor v3.2 [47] to estimate the peptide binding affinities of the novel Rhop-DRB3 MHCII molecule and compare it to various other MHCII molecules from great gerbil, mouse, sand rat and Mongolian gerbil. The program was run with default settings and provided with the relevant protein sequences of alpha and beta chains. We compared the predicted binding affinity of these MHCII molecules for 17 known Y. pestis epitopes derived from positive ligand assays of Y. pestis (https://www.iedb.org/). Specifically, we tested against 16 ligands derived from the F1 capsule antigen of Y. pestis, and 1 ligand from the virulence-associated Low calcium V antigen (LcrV) of Y. pestis. In addition, we compared the binding affinity of these MHCII molecules against the superantigen Y. pseudotuberculosis derived mitogen precursor (YPm) [42]. The threshold for binders was set to <500 nM [47].
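Applying the binder threshold amounts to counting, for each MHCII molecule, the peptides whose predicted affinity falls below 500 nM. The sketch below assumes the predictions have already been parsed into a nested dictionary. The molecule names, peptide names and affinity values are invented placeholders and do not represent results from this study.

```python
from typing import Dict

BINDER_THRESHOLD_NM = 500.0  # threshold stated in the text: binders have affinity < 500 nM


def count_binders(predictions: Dict[str, Dict[str, float]],
                  threshold: float = BINDER_THRESHOLD_NM) -> Dict[str, int]:
    """Map each molecule to the number of peptides predicted to bind.

    `predictions` maps molecule name -> {peptide: predicted affinity in nM}.
    Lower values mean stronger binding, so the comparison is strict.
    """
    return {molecule: sum(1 for nm in peptides.values() if nm < threshold)
            for molecule, peptides in predictions.items()}


# Placeholder values for illustration only.
toy = {
    "molecule_A": {"pep1": 35.0, "pep2": 500.0, "pep3": 1200.0},
    "molecule_B": {"pep1": 80.0, "pep2": 410.0, "pep3": 499.9},
}
print(count_binders(toy))
```

Note that a peptide at exactly 500 nM is not counted as a binder under the strict inequality.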
RNA sampling and sequencing
Two additional great gerbils were captured in the Midong District outside Urumqi in Xinjiang Province, China, in September 2014. The animals were held in captivity for 35 days before being humanely euthanized, and liver tissue samples were conserved in RNAlater™ at −20 °C prior to RNA extraction. RNA was extracted using a standard chloroform procedure [103]. Library prep and sequencing were conducted at the Beijing Genomics Institute (BGI, https://www.bgi.com/us/sequencing-services/dna-sequencing/) using Illumina TruSeq RNA Sample Prep Kit and PE sequencing on the HiSeq4000 instrument (150 bp read length).
The reads were trimmed using trimmomatic v0.36 and mapped to the genome assembly using hisat2 v2.0.5 with default parameters. A raw count matrix was created by using htseq v0.7.2 with default parameters to extract the raw counts from the mapped files.
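Building the raw count matrix means merging the per-sample count tables into one table keyed by gene. The sketch below assumes the usual htseq-count output layout (tab-separated gene identifier and count, followed by summary rows whose names begin with `__`). The sample and gene names are placeholders and the function names are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def parse_htseq(lines: Iterable[str]) -> Dict[str, int]:
    """Parse htseq-count output lines, skipping the '__' summary rows."""
    counts = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("__"):
            continue
        gene, value = line.split("\t")
        counts[gene] = int(value)
    return counts


def build_matrix(samples: Dict[str, Dict[str, int]]) -> Tuple[List[str], List[Tuple[str, List[int]]]]:
    """Return (sample names, rows); each row is (gene, counts in sample order).

    Genes missing from a sample are given a count of 0.
    """
    names = sorted(samples)
    genes = sorted(set().union(*(samples[n] for n in names)))
    rows = [(g, [samples[n].get(g, 0) for n in names]) for g in genes]
    return names, rows


s1 = parse_htseq(["geneA\t10", "geneB\t0", "__no_feature\t55"])
s2 = parse_htseq(["geneA\t7", "geneB\t3", "__ambiguous\t2"])
print(build_matrix({"sample_1": s1, "sample_2": s2}))
```

Keeping the matrix as raw integer counts leaves normalization to whichever downstream tool is used to compare expression levels.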
Funding
This project was funded by University of Oslo Molecular Life Science (MLS, allocation #152950), the Research Council of Norway (RCN grant #179569), the European Research Council (ERC-2012-AdG No. 324249-MedPlag), the National Natural Science Foundation of China (No. 31430006) and National Key Research & Development Program of China (2016YFC1200100).
Availability of data and materials
The genome assembly has been deposited at DDBJ/ENA/GenBank under the accession REGO00000000. The version described in this paper is version REGO01000000.
The genome assembly and annotation are also available from FigShare. Files of immune gene alignments, PDB files and more are available in the following GitHub repository: https://github.com/uio-cels/Nilsson_innate_and_adaptive
Authors’ contributions
PN created the genome assembly and annotated it, performed all BLAST-based, TLR-based and promoter analyses and wrote the first draft of the manuscript. MHS conducted the protein model analyses of TLRs and assisted in the BLAST-based and TLR analyses. BVS performed the MHCII affinity analyses. RJSO performed phylogenetic analysis of TLR, MHCI and MHCII genes. YZ sampled, acclimatised and tested individual great gerbils for plague. RL, YC and YS extracted DNA and RNA for sequencing. PN, WRE, BVS, SJ and KSJ designed the sequencing strategy. WRE, BVS, SJ, KSJ, NCS and RY oversaw the project. All authors read and approved the final manuscript.
Ethics approval
Use of great gerbil tissue was approved by the Committee for Animal Welfares of Xinjiang Centre for Disease Control and Prevention, China. Sampling was performed prior to China's signature of the Nagoya Protocol (date of accession September 6th 2016). The sampled species has a “least concern” status in the IUCN Red List of Threatened Species.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional files
Additional file 1: Additional figures and one Note detailing the HOX gene mining (DOCX 19.2Mb)
Additional file 2: Additional tables (DOCX 66Kb)
Additional file 3: Peptide binding affinity predictions for all MHCII molecules run in NetMHCIIpan predictor v3.2 (XLSX)
Acknowledgements
All computational work was performed on the Abel Supercomputing Cluster (Norwegian metacenter for High Performance Computing (NOTUR) and the University of Oslo) operated by the Research Computing Services group at USIT, the University of Oslo IT-department and the Cod nodes of CEES. Sequencing library creation and high throughput sequencing was carried out at the Norwegian Sequencing Centre (NSC), University of Oslo, Norway, and McGill University and Genome Quebec Innovation Centre, Canada.
We would like to thank Morten Skage for assistance in sequence library construction and Ole K. Tørresen, Srinidhi Varadharajan, Tore O. Elgvin and Cassandra N. Trier for helpful advice and support during the assembly and annotation steps of the genome, Helle T. Baalsrud for advice during genome mining and Tone F. Gregers for helpful discussions regarding MHCII. For early access to the sand rat genome assembly we thank John F. Mulley.