Summary
Despite the huge international endeavor to understand the genomic basis of malaria biology, there remains a lack of information about two human-infective species: Plasmodium malariae and P. ovale. The former is prevalent across all malaria endemic regions and able to recrudesce decades after the initial infection. The latter is a dormant stage hypnozoite-forming species, similar to P. vivax. Here we present the newly assembled reference genomes of both species, thereby completing the set of all human-infective Plasmodium species. We show that the P. malariae genome is markedly different to other Plasmodium genomes and relate this to its unique biology. Using additional draft genome assemblies, we confirm that P. ovale consists of two cryptic species that may have diverged millions of years ago. These genome sequences provide a useful resource to study the genetic basis of human-infectivity in Plasmodium species.
Introduction
All of the human malaria species were described in the early 20th Century, with Plasmodium malariae and P. ovale being recognized as distinct species from P. falciparum, P. vivax, and P. knowlesi1. Reference genomes have now been published for the latter three2−4, with the extent of human infections caused by P. knowlesi having only been recognized decades after initial discovery5. Analysis of these reference genomes has revealed the basis of key biological processes, including virulence6, invasion7, and antigenic variation8. Despite the huge international endeavor to understand the genomic basis of malaria biology, almost nothing is known about the genetics of P. malariae and P. ovale.
Infections with these two organisms are frequently asymptomatic9 and have parasitaemia levels often below the level of detection of light microscopy10, thus making them difficult to study in human populations and potentially thwarting efforts to eliminate them and declare any region as ‘malaria free’11. This lack of knowledge is especially worrying because the two species are distributed widely across all malaria-endemic areas of the world12,13 (Figure 1a). Both species are frequent co-infections with the two common human pathogens, P. falciparum and P. vivax, and can be present in up to 5% of all clinical malaria cases9. This equates to roughly 30 million annual clinical cases. P. malariae infections can lead to lethal renal complications14 and can recrudesce decades later15, further increasing their socioeconomic costs.
Unraveling the mechanisms that enable P. malariae to persist in the host for decades is critical for a more general understanding of chronicity in malaria. The genome sequence of P. ovale, the other hypnozoite-forming species, will facilitate the search for conserved hypnozoite genes and will conclusively show whether P. ovale consists of two cryptic subspecies, as recently suggested16. Finally, the genetic basis of human-infectivity in malaria parasites can only be fully understood by having access to the genome sequences of all human-infective species.
Here we present the genome sequences of both these species, including the two recently described16 subspecies of P. ovale (P. o. curtisi and P. o. wallikeri). We update the phylogeny of the Plasmodium genus using whole genome information, and describe novel genetic adaptations underlying their unique biology. Using whole genome sequencing of additional P. malariae (including two obtained from chimpanzees, referred to as P. malariae-like) and P. ovale samples, we describe the genetic variation present, as well as identify genes that are under selection. The data presented here provide the community with an essential foundation for further research efforts into these neglected species and into understanding the evolution of the Plasmodium genus as a whole.
Results
Plasmodium Co-Infections
Obtaining P. malariae and P. ovale DNA has historically been difficult due to the low level of parasitaemia in natural human infections. Using a novel method based on mitochondrial SNPs (Methods), we found P. malariae and P. ovale in approximately 2% of all P. falciparum clinical infections from the globally sampled Pf3K project (www.malariagen.net) (Figure 1a) (Supplementary Table 1), compared to 4% being co-infections with P. vivax. We also found a number of infections containing three species. These P. malariae and P. ovale co-infections are in addition to the larger number of mono-infections that they cause, which are frequently confounded by difficulties in confirming a species diagnosis. We used the two P. ovale co-infections with the highest number of sequencing reads to perform de novo genome assemblies (Supplementary Table 2).
Genome Assemblies
A 33.6 megabase (Mb) reference genome of P. malariae was produced from clinically isolated parasites and sequenced using Pacific Biosciences long-read sequencing technology. The assembled sequence comprises 14 super-contigs representing the 14 chromosomes, with 6 chromosome ends extending into telomeres, and a further 47 unassigned subtelomeric contigs containing an additional 11 telomeric sequences (Table 1). Using existing Illumina sequence data from two patients primarily infected with P. falciparum, reads were extracted (Methods) and assembled into 33.5 Mb genomes for both P. o. curtisi and P. o. wallikeri, each assembly comprising fewer than 800 scaffolds. The genomes are significantly larger than those of previously sequenced Plasmodium species, and have isochore structures similar to those in P. vivax, with a higher AT content in the subtelomeres. In addition, a P. malariae-like genome was produced using Illumina sequencing from parasites isolated from a chimpanzee co-infected with P. reichenowi. The P. malariae-like genome was more fragmented than the other assemblies and its 23.7 Mb sequence misses most subtelomeric regions due to whole genome amplification prior to sequencing.
Most of the P. malariae genome is collinear with P. vivax; however, we see two instances of large recombination breakpoints. The chromosomes syntenic to P. vivax chromosomes 6 and 10 have recombined (Supplementary Figure 1a) and a large internal inversion has occurred on chromosome 5 (Supplementary Figure 1b), confirmed by mapping additional P. malariae samples back to the reference assembly.
Across the four genomes, between 4,430 and 7,165 genes were identified using a combination of ab initio gene prediction and projection of genes from existing Plasmodium genome sequences. Manual curation was used to correct 2,516 and 2,424 genes in the P. malariae and P. o. curtisi reference genomes respectively. A maximum likelihood tree was constructed using 1,000 conserved core genes that are present as single copies in 12 selected Plasmodium species (Figure 1b). The four newly assembled genomes do not cluster with any other Plasmodium species, but form two distinct and novel clades. Similar to a recent study using apicoplast data17, the two P. ovale species form a sister clade with the rodent malaria species, the latter being an ingroup to the ‘superfamily’ of primate-infective species in this tree. We also see that P. malariae-like has a longer branch length than P. malariae, which may be a reflection of the higher levels of diversity in P. malariae-like (Supplementary Figure 2a). This lack of diversity in P. malariae compared to the chimpanzee species mirrors the situation of P. falciparum with P. reichenowi18.
We estimated the time of divergence for the four species using a Bayesian inference tool, G-PhoCS19. Absolute divergence time estimates are inherently uncertain due to mutation rate and generation time assumptions, and we therefore scaled these parameters to date the P. falciparum and P. reichenowi split using G-PhoCS to 4 million years ago (MYA), as previously published (3.0 − 5.5MYA)20. Assuming that the mutation rates and generation times are similar for P. ovale and P. falciparum, we find that the relative split of the two P. ovale species is about 5-times earlier than the split of P. falciparum and P. reichenowi. Using the same parameters as for the Laverania split, we thereby date the divergence of the two P. ovale subspecies to approximately 22.8MYA. This strongly supports the classification of P. o. curtisi and P. o. wallikeri as separate species rather than subspecies of P. ovale.
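The scaling described above can be illustrated with a minimal sketch. It assumes that, once the mutation rate and generation time are fixed, absolute split times are proportional to the relative divergence estimates. The function name and the relative divergence values below are hypothetical placeholders rather than G-PhoCS output, chosen only so that the 4 MYA calibration yields the 22.8 MYA figure reported here.

```python
def calibrated_divergence_time(tau_target, tau_calibration, calibration_time_mya):
    """Rescale a relative divergence estimate to absolute time (MYA).

    Assumes the target and calibration splits share the same mutation rate
    and generation time, so time is proportional to relative divergence.
    """
    if tau_calibration <= 0:
        raise ValueError("calibration divergence must be positive")
    return calibration_time_mya * tau_target / tau_calibration


# Hypothetical relative divergences (arbitrary units), not values from the paper.
tau_laverania = 1.0   # P. falciparum / P. reichenowi split (calibration point)
tau_ovale = 5.7       # P. o. curtisi / P. o. wallikeri split

t_ovale = calibrated_divergence_time(tau_ovale, tau_laverania, 4.0)
print(f"Scaled P. ovale split: {t_ovale:.1f} MYA")
```

A different generation time, as used for P. malariae, would enter as an additional multiplicative factor on the result.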
Using the same mutation rate and a longer generation time to account for the longer intra-erythrocytic cycle, we date the split of P. malariae from P. malariae-like to ~3.9MYA. This is similar to the estimated divergence of P. falciparum and P. reichenowi, suggesting a significant evolutionary event that promoted speciation in Plasmodium at that time. It has been suggested that a new world primate infective species termed P. brasilianum is identical to P. malariae21. To investigate this further using the new genome assemblies, we aligned the P. brasilianum merozoite surface protein 1 (MSP1)22 and ribosomal RNA21 genes to both the P. malariae and P. malariae-like orthologous genes, showing that P. brasilianum is identical to P. malariae, but that P. malariae-like is indeed very different (Supplementary Figure 2b).
Gene Changes
The greater number of genes in both P. malariae and P. ovale compared to existing Plasmodium genomes is mostly due to gene family expansions in the subtelomeres, such as Plasmodium interspersed repeat (pir) and STP1 genes (Table 1). In addition, a large expansion of gamete antigen 25/27 (Pfg27) was identified in P. malariae with 22 tandemly duplicated copies including two pseudogenes on chromosome 14 (Supplementary Figure 3a). P. vivax and P. falciparum only have one and two copies respectively. Pfg27 is expressed highly during early gametocytogenesis23, and is essential for correct gametocyte development24. This gametocyte gene duplication may be an adaptation by this species to ensure sexual reproduction in a setting of low level parasitaemia during infection.
In the P. ovale species, certain genes are also tandemly duplicated. Nine homologs (including two pseudogenes) of PVP01_1270800 are present in P. o. curtisi and 7 homologs are present in P. o. wallikeri (Supplementary Figure 3b). The P. vivax homolog is most highly expressed in sporozoites but has no known function25. The 3D structure of this gene, as predicted by I-TASSER26, appears to be similar to a nuclear pore complex (TM-Score > 0.4), suggesting a role in transport. This sporozoite change may be indicative of differences in liver-stage invasion or possibly hypnozoite formation.
Multiple genes have become pseudogenized in the two reference genomes compared to other human-infective Plasmodium species (Supplementary Table 3), including homologs of a multidrug efflux pump (PF3D7_0212800), which may suggest a higher susceptibility of these species to drug targeting. A phosphofructokinase, central to glycolysis27, is pseudogenized in both P. ovale species, suggesting novel energy metabolism in these species. We also see genes that are pseudogenized in P. o. wallikeri but not P. o. curtisi, such as a serine-threonine protein kinase and a reticulocyte binding protein 1b (RBP1b), which is also pseudogenized in P. malariae as discussed below. One gene of specific interest that is pseudogenized in P. o. wallikeri but not in P. o. curtisi is a homolog of a cyclin in P. falciparum (PF3D7_1227500), an observation that may explain the different relapse times of the two P. ovale species28. The highest number of pseudogenes is seen in the P. malariae subtelomeres, where ~40% of the genes are pseudogenized in this species, indicating reduced selection pressure to cleanse the genome of these remnant genes.
P. malariae has a significantly longer intra-erythrocytic lifecycle compared to other human-infective Plasmodium species. All three Plasmodium cyclins29 are highly conserved in P. malariae, suggesting that the genetic cause may be elsewhere. A WD repeat-containing protein (WRAP73) is deleted in P. malariae but conserved across all other Plasmodium species. It is part of a large gene family known to be involved in a number of cellular processes, including cell cycle progression30. Knocking this gene out in other species may elucidate its importance in Plasmodium cell cycle progression.
Both P. ovale species are able to form hypnozoites, similar to P. vivax3 and the simian-infective P. cynomolgi31. In searching for genes shared exclusively by these species, we identified 64 genes, of which two are of interest (Supplementary Table 4), as they do not belong to subtelomeric gene families. These include two conserved Plasmodium proteins, one of which has a low-level similarity to the P. falciparum ring-exported protein 4 gene. Both genes contain transmembrane domains and are expressed in P. vivax sporozoites32, making them interesting candidates to study experimentally.
Subtelomeric Gene Families
The Plasmodium genus is characterized by species-specific subtelomeric gene family expansions, such as var genes in P. falciparum33 and pir genes in P. yoelii34. In P. malariae and P. ovale, where approximately 40% of the total genome size is subtelomeric, we also see large expansions of gene families that are species-specific (Figure 2a) (Table 1). The three largest gene clusters that we identified were in P. malariae. Of these, one cluster is composed of STP1 and surface associated interspersed genes (surfins). P. malariae and P. ovale are the only human-infective species other than P. falciparum35 to contain surfins (Table 1), raising the possibility of studying this gene family using comparative genomics.
The other two large P. malariae clusters consist of two novel gene families, here termed fam-l and fam-m, consisting of 373 and 416 two-exon ~250 amino acid long genes respectively. We find two fam-m genes in P. malariae-like, which, despite the assembly lacking the majority of the subtelomeres, suggests that P. malariae-like also contains at least one of these novel families. The first exon of each fam-l and fam-m gene contains a signal peptide and a PEXEL motif, the signature in P. falciparum for export from the parasite into host erythrocytes36. In addition, the second exon contains two transmembrane domains flanking a hypervariable region. The remainder of the gene sequence is conserved between members of the same family and differentiates the two families from each other. These characteristics support the notion that the proteins coded for by these genes are exported from the parasite and may be targeted to the infected red blood cell surface and play a role in host-parasite interactions.
Ninety-three percent of fam-l and fam-m genes are on the same strand facing the telomeres (Figure 2b). This pattern, similar to pir genes in P. yoelii34, may be an adaptation to facilitate recombination between these genes. Uniquely, ~60% of these new genes are found in doublets of a fam-l and a fam-m (Figure 2b). Mirror tree analysis suggests that the pairs may be co-evolving over short periods of time (Supplementary Figure 4a), likely through being duplicated together, but that pairing may be disrupted by recombination over longer periods. We do not see any evidence of co-evolution between pir genes in close proximity of fam-l or fam-m genes (Supplementary Figure 4b), supporting the conclusion that this is not an artifact of their subtelomeric location. This suggests that fam-l and fam-m genes may encode proteins that dimerize when they are exported, a feature not previously seen among subtelomeric gene families.
Finally, we performed 3D structure prediction of both a fam-l and a fam-m gene using I-TASSER26. For both genes we obtained similar high-confidence (TM score > 0.5) 3D structures. These structures overlap the crystal structure of the P. falciparum RH5 protein very well (TM score > 0.8), with 100% of the RH5 structure covered even though they only have 10% sequence similarity (Figure 2c). RH5 is a prime vaccine target in P. falciparum due to its essential binding to basigin during invasion37. The RH5 kite-shaped fold is known to be present in RBP2a in P. vivax38, and may be a conserved structure necessary for the binding capabilities of all RH and RBP genes. This suggests that fam-l and fam-m genes may be involved in binding host receptors.
While neither P. ovale species has fam-l or fam-m genes, they both have large expansions of the pir gene family with 1,930 and 1,335 pir genes in P. o. curtisi and P. o. wallikeri respectively, while P. malariae only has 247 pir genes. These are the largest numbers of pir genes in any sequenced Plasmodium genome to date, explaining the large subtelomeres of these species. These pir genes form large species-specific clusters suggesting recent expansions (Figure 2a), but most closely resemble those in P. vivax. Many subfamilies of pir genes in P. malariae and P. ovale are shared with P. vivax, while almost none are shared with the rodent-infecting species (Supplementary Figure 5a). This suggests that pir genes are relatively well conserved between non-falciparum species infecting humans. Interestingly, all hypnozoite-forming species (both P. ovale species, P. vivax, and P. cynomolgi) contain over 1,000 pir genes each, significantly more than non-hypnozoite-forming Plasmodium species. Using additional draft genome assemblies for both P. o. curtisi and P. o. wallikeri, we show that the two species of P. ovale share significantly fewer pir genes inter-specifically than they do intra-specifically or intra-genomically (99% identity over 150 amino acids), further suggesting that the two species are not recombining with each other (Supplementary Figure 5b).
Reticulocyte and Duffy Binding Proteins
RBP genes encode a merozoite surface protein family present across all Plasmodium species and known to be involved in red blood cell invasion and host specificity39. Compared to P. vivax, P. malariae has lost multiple RBPs including nearly all RBP2 genes and RBP1b, though it does have a functional RBP3. On the other hand, the two P. ovale species each have multiple full-length RBP2 genes (seven in P. o. curtisi and four in P. o. wallikeri) compared to three copies in P. vivax (Figure 3a). The two P. ovale species have very similar RBP2s, such as PocGH01_00019400 and PowCR01_00048600, and a number of RBP2 pseudogenes in one genome match a functional copy in the other genome (Supplementary Figure 6a). The RBP1b in P. o. wallikeri is less pseudogenized than the RBP1b in P. malariae and in P. malariae-like, where we have a short fragment of the gene (Figure 3b). The specific mutation introducing a stop codon is conserved across the two P. o. wallikeri samples (Supplementary Figure 6b), indicating that RBP1b has become pseudogenized recently in P. o. wallikeri, or that the shortened form may be functional and has therefore been maintained under selection. It is interesting to note that the positioning of RBP1b and RBP1a is conserved across all these species, but not with the rodent malaria species.
RBP genes are thought to be involved specifically in reticulocyte invasion, which explains the gene loss in P. malariae, a species that preferentially invades normocytes13 (Figure 3c). Both P. ovale species exclusively invade reticulocytes12 and may have developed novel invasion pathways through the RBP2 expansion, similar to P. vivax. This supports a role for RBP2 gene expansions specifically in reticulocyte invasion. RBP3 genes seem to be pseudogenized in all reticulocyte-infective species, while they are fully functional in normocyte-infective species, suggesting a role in normocyte-invasion for RBP3.
Duffy binding proteins (DBPs) are also important for erythrocyte invasion39. P. malariae has one functional and one recently pseudogenized DBP, while both P. ovale have two functional copies. It is believed that P. vivax is incapable of infecting duffy-negative humans due to relying on its DBP binding the Duffy antigen, with recent studies showing duffy-negative infectivity in P. vivax strains containing a DBP duplication40. The fact that P. malariae and P. ovale are found throughout Africa (Figure 1a) suggests that they are capable of infecting duffy-negative individuals. It is therefore surprising that P. malariae only has one functional copy, implying that one copy is sufficient for duffy-negative infectivity in this species.
Differential Selection Pressures
Using four additional P. malariae samples, two additional P. o. curtisi samples and two P. malariae-like and P. o. wallikeri samples each (Supplementary Table 2), we investigated differences in selection pressures between two species that diverged based on host differences (P. malariae and P. malariae-like), and two species that supposedly diverged within the same host (P. o. curtisi and P. o. wallikeri). Using GATK UnifiedGenotyper41, we called a total of 981,486 raw SNPs in P. malariae and 2,458,473 raw SNPs in P. ovale. Excluding subtelomeric regions, the pairwise nucleotide diversity between the different P. malariae samples is 4.7 × 10−4 and for the P. o. curtisi samples it is 3.8 × 10−4, which is significantly lower than similar estimates for P. falciparum42 and P. vivax43. Following SNP filtering (Methods), we retained 230,881 SNPs in P. malariae with an average of 8,295 SNPs between the reference genome and the different P. malariae samples and with 150,832 SNPs on average with P. malariae-like (Supplementary Table 5). In P. ovale we retained 1,462,486 SNPs, of which 37,897 SNPs were different on average between P. o. curtisi samples and 1,412,799 were different on average between P. o. curtisi and P. o. wallikeri (Supplementary Table 6).
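Pairwise nucleotide diversity is the mean proportion of sites that differ between two samples, averaged over all sample pairs. The following is a minimal sketch of that definition for already-aligned sequences; it is not the pipeline used here, which worked from SNP calls with subtelomeric regions excluded, and the function name and toy sequences are hypothetical.

```python
from itertools import combinations


def pairwise_nucleotide_diversity(sequences):
    """Mean per-site difference over all pairs of equal-length aligned sequences.

    Sites where either sequence has a non-ACGT character (gap, N) are skipped.
    """
    if len(sequences) < 2:
        raise ValueError("need at least two sequences")
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("sequences must be aligned to the same length")

    per_pair = []
    for a, b in combinations(sequences, 2):
        compared = 0
        diffs = 0
        for x, y in zip(a.upper(), b.upper()):
            if x in "ACGT" and y in "ACGT":
                compared += 1
                if x != y:
                    diffs += 1
        if compared:
            per_pair.append(diffs / compared)
    if not per_pair:
        raise ValueError("no comparable sites between any pair")
    return sum(per_pair) / len(per_pair)


# Toy alignment (hypothetical): one variable site among ten.
toy = ["ACGTACGTAC", "ACGTACGTAT", "ACGTACGTAC"]
print(pairwise_nucleotide_diversity(toy))
```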
We calculated a number of selection measures for every core gene with 5 or more nucleotide substitutions (2,192 genes in P. malariae, 4,579 genes in P. o. curtisi), including the Hudson-Kreitman-Aguade ratio (HKAr)44, which is the ratio of interspecific nucleotide divergence to intraspecific polymorphisms (i.e. diversifying selection), Ka/Ks45, to look for an enriched number of nonsynonymous differences compared to synonymous differences (i.e. positive selection), and the McDonald-Kreitman (MK) Skew46, a measure of maintained polymorphisms (i.e. balancing selection). We find high levels of HKAr (HKAr > 0.15) in a large proportion of genes in P. malariae (127/2,192, 5.8%), but not in P. o. curtisi (36/4,579, 0.8%) (2-sample test for equality of proportions, p < 0.001) (Figure 4) (Supplementary Table 7). We see more genes under significant balancing selection in P. malariae (9/2,192, 0.4%) than in P. o. curtisi (5/4,579, 0.1%) (p < 0.05). More genes are under positive selection in P. malariae (104/2,192, 4.7%) than in P. o. curtisi (24/4,579, 0.5%) (p < 0.001). This suggests that P. malariae may be under more widespread or stronger selective pressure than P. o. curtisi.
Looking at specific genes under selection, we see similar genes in the P. malariae/P. malariae-like test as in an earlier P. falciparum/P. reichenowi study47. This includes a large number of invasion genes with high HKAr values, such as MSP8 and MSP7, as well as significant MK skews for MSP1 and apical membrane antigen 1. For P. malariae, genes with high HKAr values besides invasion genes are associated with stages throughout the parasite's life cycle (Figure 4). However, in P. o. curtisi, they are predominantly invasion and gametocyte genes, including among others a gametocyte associated protein and a mago nashi homolog protein (Figure 4), the latter potentially being involved in sex determination48. For P. o. curtisi, we also find a large number of genes with high Ka/Ks values that are gametocyte-associated, such as a sexual stage antigen s16. We therefore find that invasion genes tend to always be under strong selective pressure in Plasmodium, but that P. malariae and P. o. curtisi differ in terms of the other life cycle stages that are under selective pressure.
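The gene counts above can be compared with a two-sample test for equality of proportions. The sketch below is one way to do this, assuming a pooled two-sided z-test without continuity correction; the exact variant used for the reported p-values is not stated, so this is an illustration rather than a reproduction, and the function name is hypothetical.

```python
import math


def two_proportion_z_test(x1, n1, x2, n2):
    """Pooled two-sided z-test for equality of two proportions.

    Returns (z, p_value). No continuity correction is applied (an assumption).
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Counts taken from the text: P. malariae (of 2,192 genes) vs P. o. curtisi (of 4,579).
comparisons = {
    "high HKAr": (127, 2192, 36, 4579),
    "balancing selection": (9, 2192, 5, 4579),
    "positive selection": (104, 2192, 24, 4579),
}
for label, counts in comparisons.items():
    z, p = two_proportion_z_test(*counts)
    print(f"{label}: z = {z:.2f}, p = {p:.3g}")
```

Under these assumptions, the three comparisons fall below the significance thresholds quoted in the text.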
One of the genes with the highest Ka/Ks in the P. malariae/P. malariae-like comparison is RBP1a, which has 37 nonsynonymous fixed differences between the two species and only 6 synonymous fixed differences. The other two intact RBPs are much more highly conserved. Knowing that P. malariae also infects new world monkeys (where it is known as P. brasilianum)21, we might suppose that the receptor for RBP1a may be conserved between humans and new world monkeys, but not with chimpanzees. We identified 19 human genes coding for transmembrane-containing proteins that may act as potential RBP1a receptors (Methods), which includes a mucin-22 precursor and an aquaporin 12b precursor (Supplementary Table 8).
Discussion
The high-quality genome sequences of P. malariae and P. ovale and their annotation presented here provide a rich new resource for comparative Plasmodium genomics. They provide a foundation for further studies into the biology of these two neglected malaria species, as well as new tools to explore genus level similarities and differences in infection. The genome sequences have revealed a number of genomic adaptations and possible consequences related to the success of these species sustaining low parasitaemia infections, including gametocyte gene expansions and an increase in genome size. The genome sequences suggest that the rodent-infective malaria species may be the result of an ancestral host switch from a primate-infective species and also conclusively show that P. ovale is a species complex, consisting of two highly diverged species, P. o. curtisi and P. o. wallikeri. The genome sequences reveal a novel type of subtelomeric gene family in P. malariae occurring in doublets and potentially having an RH5-like fold. Having access to a larger number of genome sequences also allows us to identify features such as the RBP2 gene expansion in reticulocyte invading Plasmodium species. Multi-sample analysis of the two species highlights differences in selection pressures between host-switching and within-host speciation, as well as the omnipresent selective pressure during red blood cell invasion. These genome sequences will now enable more comprehensive studies of human-infectivity in Plasmodium species.
Methods
Co-infection Mining
We aligned the P. malariae (AB354570) and P. ovale (AB354571) mitochondrial genome sequences against those of P. falciparum2, P. vivax3, and P. knowlesi4 using MUSCLE49. For each species, we identified three 15bp stretches within the Cox1 gene that contained two or more species-specific SNPs. We searched for these 15bp species-specific barcodes within the sequencing reads of all 2,512 samples from the Pf3K global collection (www.malariagen.net). Samples that contained at least two sequencing reads matching one or more of the 15bp barcodes for a specific species were considered to be positive for that species (Supplementary Table 1). We found good correspondence between the three different barcodes for each species, with over 80% of positive samples being positive for all three barcodes. We generated pseudo-barcodes by changing two randomly selected nucleotide bases at a time in 10 randomly selected 15bp regions of the P. vivax3 mitochondrial genome. We did not detect any positive hits using these pseudo-barcodes. As an additional negative control, we searched for P. knowlesi co-infections, but did not find any samples positive for this species. Two samples (PocGH01, PocGH02) had high numbers for all three P. ovale barcodes and were used for reference genome assembly and SNP calling respectively.
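The barcode search can be sketched as an exact substring match against each read, followed by the two-read threshold. The example below is a minimal illustration: the barcode and reads are made-up sequences, the function names are hypothetical, and matching on the reverse strand is an assumption not stated in the text.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Reverse complement of an unambiguous DNA string (A, C, G, T only)."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def count_matching_reads(reads, barcodes):
    """Number of reads containing at least one barcode on either strand."""
    # Searching both strands is an assumption made for this sketch.
    patterns = set()
    for barcode in barcodes:
        patterns.add(barcode.upper())
        patterns.add(reverse_complement(barcode))
    return sum(1 for read in reads if any(p in read.upper() for p in patterns))


def is_positive(reads, barcodes, min_reads=2):
    """Apply the threshold from the text: at least two matching reads."""
    return count_matching_reads(reads, barcodes) >= min_reads


# Made-up 15bp barcode and reads, for illustration only.
species_barcodes = ["ACGTTGCAAGGCTTA"]
sample_reads = [
    "GGGACGTTGCAAGGCTTACCC",                     # forward-strand match
    reverse_complement("TTACGTTGCAAGGCTTAGG"),   # reverse-strand match
    "GGGGGGGGGGGGGGGGGGGG",                      # no match
]
print(count_matching_reads(sample_reads, species_barcodes))
print(is_positive(sample_reads, species_barcodes))
```

Exact matching keeps the search specific, since each barcode was chosen to carry two or more species-specific SNPs.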
Parasite Material
All P. ovale samples were obtained from symptomatic patients diagnosed with a P. falciparum infection. The two P. o. curtisi samples (PocGH01, PocGH02) identified through co-infection mining (see above) were from two patients testing positive on a CareStart® (HRP2 based) rapid malaria diagnostic test (RDT) kit at the Navrongo War Memorial hospital, Ghana. After consent was obtained, about 2-5mL of venous blood was collected and then diluted with one volume of PBS. This was passed through CF11 cellulose powder columns to remove leucocytes prior to parasite DNA extraction.
The two P. malariae-like samples, PmlGA01 and PmlGA02, were extracted from chimpanzee blood obtained during routine sanitary controls of animals living in a Gabonese sanctuary (Park of La Lékédi, Gabon). Blood collection was performed following international rules for animal health. Within six hours after collection, host white blood cell depletion was performed on fresh blood samples using the CF11 method50. After DNA extraction using the Qiagen Blood and Tissue Kit and detection of P. malariae infections by Cytb PCR and sequencing51, the samples went through a whole genome amplification step52.
One P. malariae sample, PmGN01, was collected from a patient with uncomplicated malaria in Faladje, Mali. Venous blood (2-5mL) was depleted of leukocytes within 6 hours of collection as previously described53. The study protocol was approved by the Ethics Committee of Faculty of Medicine and Odontostomatology and Faculty of Pharmacy, Bamako, Mali.
Four samples of P. malariae were obtained from travellers returning to Australia with malaria. PmUG01 and PmID01 were sourced from patients returning from Uganda and Papua Indonesia respectively, who presented at the Royal Darwin Hospital, Darwin, with microscopy-positive P. malariae infection. PmMY01 was sourced from a patient presenting at the Queen Elizabeth Hospital, Sabah, Malaysia, with microscopy-positive P. malariae infection. Patient sample PmGN02 was collected from a patient who presented to Royal Brisbane and Women's Hospital in 2013 on return from Guinea.
Venous blood samples were subject to leukodepletion within 6 hours of collection. PmUG01 was leukodepleted using a commercial Plasmodipur filter (EuroProxima, The Netherlands); home-made cellulose-based filters were used for PmID01 and PmMY01, while PmGN02 was leukodepleted using an inline leukodepletion filter present in the venesection pack (Pall Leukotrap; WBT436CEA). DNA extraction was undertaken on filtered blood using commercial kits (QIAamp DNA Blood Midi kit, Qiagen Australia).
For samples PmUG01, PmID01 and PmMY01, ethical approval for the sample collection was obtained from the Human Research Ethics Committee of NT Department of Health and Families and Menzies School of Health Research (HREC-2010-1396 and HREC-2010-1431) and the Medical Research Ethics Committee, Ministry of Health Malaysia (NMRR-10-754-6684). For sample PmGN02, ethical approval was obtained from the Royal Brisbane and Women's Hospital Human Research Ethics Committee (HREC/10/QRBW/379) and the Human Research Ethics Committee of the Queensland Institute of Medical Research (p1478).
Sample Preparation and Sequencing
One P. malariae sample, PmUG01, was selected for long read sequencing, using Pacific Biosciences (PacBio), due to its low host contamination and abundant DNA. By passing it through a 25mm blunt-ended needle, 6µg of DNA was sheared to 20-25kb. SMRTbell template libraries were generated using the PacBio-issued protocol (20kb Template Preparation using the BluePippin™ Size-Selection System). After a greater than 7kb size-selection using the BluePippin™ Size-Selection System (Sage Science, Beverly, MA), the library was sequenced using P6 polymerase and chemistry version 4 (P6/C4) in 20 SMRT cells (Supplementary Table 2).
For the remaining isolates, Illumina standard libraries of 200-300bp fragments and amplification-free libraries of 400-600bp fragments were prepared54 and sequenced on the Illumina HiSeq 2000 v3 or v4 and the MiSeq v2 according to the manufacturer's standard protocol (Supplementary Table 2). Raw sequence data was deposited in the European Nucleotide Archive (Supplementary Table 2).
Genome Assembly
The PacBio sequenced P. malariae sample, PmUG01, was assembled using HGAP55 with an estimated genome size of 100Mb to account for the host contamination (~85% Human). The resulting assembly was corrected initially using Quiver55, followed by iCORN56. PmUG01 consisted of two haplotypes, with the majority haplotype being used for the iCORN56, and a coverage analysis was performed to remove duplicate contigs. Additional duplicated contigs were identified using a BLASTN57 search, with the shorter contigs being removed if they were fully contained within the longer contigs or merged with the longer contig if their contig ends overlapped. Host contamination was removed by manually filtering on GC, coverage, and BLASTN hits to the non-redundant nucleotide database57.
The Illumina-based genome assemblies for P. o. curtisi, P. o. wallikeri, and P. malariae-like were performed using MaSURCA58 for samples PocGH01, PowCR01, and PmlGA01 respectively. To confirm that the assemblies were indeed P. ovale, we mapped existing P. ovale capillary reads to the assemblies (www.ncbi.nlm.nih.gov/Traces/trace.cgi?view=search). Prior to applying MaSURCA58, the samples were mapped to the P. falciparum 3D7 reference genome2 to remove contaminating reads. The draft assemblies were further improved by iterative uses of SSPACE59, GapFiller60 and IMAGE61. The resulting scaffolds were ordered using ABACAS62 against the P. vivax PVP01 (http://www.genedb.org/Homepage/PvivaxP01) assembly (both P. ovale) or against the P. malariae PacBio assembly (P. malariae-like). The assemblies were manually filtered on GC, coverage, and BLASTN hits to the non-redundant nucleotide database57. iCORN56 was used to correct frameshifts. Finally, contigs shorter than 1 kilobase (kb) were removed.
Using two more samples, PocGH02 and PowCR02, additional draft assemblies of both P. ovale species were produced using MaSURCA58 followed by RATT63 to transfer the gene models from the high-quality assemblies.
The genome sequences and annotation for both P. malariae and P. ovale can now be found on GeneDB at http://www.genedb.org/Homepage/Pmalariae and at http://www.genedb.org/Homepage/Povale.
Gene Annotation
RATT63 was used to transfer gene models based on synteny conserved with other sequenced Plasmodium species (P. falciparum2, P. vivax3, P. berghei34, and P. gallinaceum (unpublished)). In addition, genes were predicted ab initio using AUGUSTUS64, trained on a geneset consisting of manually curated P. malariae and P. ovale genes respectively. Non-coding RNAs and tRNAs were identified using Rfam 12.065. Gene models were then manually curated for both the P. malariae and P. o. curtisi reference genomes, using Artemis66 and the Artemis Comparison Tool (ACT)67. These tools were also used to manually identify deleted and disrupted genes (Supplementary Table 3).
Phylogenetics
Following ortholog assignment using BLASTP57 and OrthoMCL68, amino acid sequences of 1000 core genes from 12 Plasmodium species (P. gallinaceum (unpublished), P. falciparum2, P. reichenowi47, P. knowlesi4, P. vivax3, P. cynomolgi31, P. chabaudi34, P. berghei34, and the four assemblies produced in this study) were aligned using MUSCLE49. The alignments were cleaned using GBlocks69 with default parameters to remove non-informative and gapped sites. The cleaned non-zero length alignments were then concatenated. This resulted in an alignment of 421,988 amino acid sites per species. The optimal substitution model for each gene partition was determined by running RAxML70 for each gene separately using all implemented substitution models. The substitution models that generated the tree with the highest likelihood were used for each gene partition. A maximum likelihood phylogenetic tree was constructed using RAxML70 with 100 bootstraps71 (Figure 1b). To confirm this tree, we utilized different phylogenetic tools including PhyloBayes72 and PhyML73, a number of different substitution models within RAxML, starting the tree search from the commonly accepted phylogenetic tree, and removing sites in the alignment which supported significantly different trees. All approaches yielded the final tree found in Figure 1b with highest likelihood. Figtree was used to colour the tree (http://tree.bio.ed.ac.uk/software/figtree/).
A phylogenetic tree of four P. malariae (PmID01, PmGN01, PmGN02, PmMY01) and all P. malariae-like samples (PmlGA01, PmlGA02) was generated using PhyML73 based on all P. malariae genes. For each sample, the raw SNPs, as called using the SNP pipeline (see below), were mapped onto all genes to morph them into sample-specific gene copies using BCFtools74. Amino acids for all genes were concatenated and cleaned using GBlocks69.
Divergence Dating
Species divergence times were estimated using the Bayesian inference tool G-PhoCS19, a software which uses thousands of unlinked neutrally evolving loci and a given phylogeny to estimate demographic parameters. One additional sample per assembly (PmGN01 for P. malariae, PocGH02 for P. o. curtisi, PowCR02 for P. o. wallikeri, and PmlGA02 for P. malariae-like) was used to morph the respective assembly using iCORN56. Regions in the genomes without mapping were masked, as iCORN56 would not have morphed them. Unassigned contigs and subtelomeric regions were removed for this analysis due to the difficulty of alignment. Repetitive regions in the chromosomes of the four assemblies and the four morphed samples were masked using Dustmasker75 and then the chromosomes were aligned using FSA76. The P. o. wallikeri and the P. o. curtisi chromosomes were aligned against each other, as were the P. malariae and P. malariae-like chromosomes. The alignments were split into 1kb loci, removing those that contained gaps, masked regions, and coding regions to conform with the neutral loci assumption of G-PhoCS19. G-PhoCS19 was run for one million MCMC-iterations with a sample-skip of 1,000 and a burn-in of 10,000 for each of the two species pairs. Follow-up analyses using Tracer (http://beast.bio.ed.ac.uk/Tracer) confirmed that this was sufficient for convergence of the MCMC chain in all cases. In the model, we assumed a variable mutation rate across loci and allowed for on-going gene flow between the populations. The tau values obtained from this were 0.0049 for P. malariae and 0.0434 for P. ovale.
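The locus extraction step described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each pairwise alignment is held as two equal-length strings, that masked bases are written as N, and that coding positions are supplied as a set of alignment column indices. The function name neutral_loci and these representations are hypothetical and are not taken from the pipeline used in the study.

```python
from typing import List, Set, Tuple

LOCUS_LEN = 1000  # 1kb loci, as stated in the text


def neutral_loci(seq_a: str, seq_b: str, coding: Set[int],
                 locus_len: int = LOCUS_LEN) -> List[Tuple[int, str, str]]:
    """Split a pairwise alignment into fixed-size loci and keep only those
    without gaps ('-'), masked bases ('N', an assumed encoding) or coding columns."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    kept = []
    # Step through non-overlapping windows; a trailing partial window is dropped.
    for start in range(0, len(seq_a) - locus_len + 1, locus_len):
        a = seq_a[start:start + locus_len]
        b = seq_b[start:start + locus_len]
        has_gap_or_mask = any(c in "-Nn" for c in a + b)
        has_coding = any(i in coding for i in range(start, start + locus_len))
        if not has_gap_or_mask and not has_coding:
            kept.append((start, a, b))
    return kept
```

Calling neutral_loci on each aligned chromosome pair would yield the candidate loci; how loci are subsequently thinned to ensure they are unlinked is not covered by this sketch.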
The tau values were used to calculate the date of the split, using the formula (tau × G)/mu, where G is the generation time in years and mu is the mutation rate. Following optimization of the P. falciparum/P. reichenowi split to 4 million years ago (unpublished), as estimated previously20, we assumed a mutation rate of 3.8 × 10^-10 SNPs/site/lifecycle77 and a generation time of 65 days78. For P. malariae, a generation time of 100 days was used due to the longer intra-erythrocytic cycle.
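As a worked illustration of the formula above, the following sketch converts the reported tau values into split dates. The conversion of 365 days per year is an assumption made here, and the function name split_time_years is hypothetical.

```python
DAYS_PER_YEAR = 365.0  # assumed conversion; not stated in the text
MUTATION_RATE = 3.8e-10  # SNPs/site/lifecycle, as given in the text


def split_time_years(tau: float, generation_days: float,
                     mu: float = MUTATION_RATE) -> float:
    """Apply (tau x G) / mu with G expressed in years."""
    generation_years = generation_days / DAYS_PER_YEAR
    return tau * generation_years / mu


# tau values and generation times reported in the text
malariae_split = split_time_years(0.0049, generation_days=100)
ovale_split = split_time_years(0.0434, generation_days=65)
print(f"P. malariae / P. malariae-like: {malariae_split / 1e6:.1f} million years")
print(f"P. o. curtisi / P. o. wallikeri: {ovale_split / 1e6:.1f} million years")
```

Under these assumptions the two splits come out at roughly 3.5 and 20.3 million years respectively.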
3D Structure Prediction
The I-TASSER26 Version 4.4 online web server79 (zhanglab.ccmb.med.umich.edu/I-TASSER) was used for 3D protein structure prediction. Predicted structures with a TM-score of over 0.5 were considered reliable as suggested in the I-TASSER user guidelines80. TM-align81, as implemented in I-TASSER79, was used to overlay the predicted protein structure with existing published protein structures.
Hypnozoite Gene Search
Using the OrthoMCL68 clustering between all sequenced Plasmodium species used for the phylogenetic analysis (see above), we examined clusters containing only P. vivax P01 genes, P. cynomolgi31 genes and genes of both of the P. ovale species.
Gene Family Analysis
All P. malariae, P. ovale, and P. vivax P01 genes were compared pairwise using BLASTP57, with genes having a minimum local BLAST hit of 50% identity over 150 amino acids or more being considered connected. These gene connections were visualized in Gephi82 using a Fruchterman-Reingold83 layout and with unconnected genes removed.
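The connection criterion above can be sketched as a filter over BLASTP hits. This example assumes tab-separated output whose first four columns are query ID, subject ID, percent identity and alignment length (the layout of BLAST+ tabular output); the format actually used in the study is not stated, and the function name connected_pairs is hypothetical.

```python
from typing import Iterable, Set, Tuple


def connected_pairs(lines: Iterable[str], min_identity: float = 50.0,
                    min_length: int = 150) -> Set[Tuple[str, str]]:
    """Return undirected gene pairs with a hit of at least min_identity
    percent identity over at least min_length aligned amino acids."""
    edges = set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank and comment lines
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        identity, length = float(fields[2]), int(fields[3])
        if query == subject:
            continue  # ignore self hits
        if identity >= min_identity and length >= min_length:
            # Sort the pair so that A-B and B-A collapse into one edge.
            edges.add(tuple(sorted((query, subject))))
    return edges
```

The resulting edge set could be exported as an edge list for a graph viewer; genes that appear in no edge are the unconnected genes.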
P. malariae, P. o. curtisi and P. o. wallikeri protein sequences for Plasmodium interspersed repeat (pir) genes, excluding pseudogenes, were combined with those from P. vivax P01, P. knowlesi4, P. chabaudi AS v3 (genedb.org/Homepage/Pchabaudi), P. yoelii 17X v234, and P. berghei v3 (genedb.org/Homepage/Pberghei). Sequences were clustered using tribeMCL84 with a BLAST E-value of 0.01 and an inflation of 2. This resulted in 152 subfamilies. We then excluded clusters with one member. The number of genes per species in each subfamily was plotted in a heatmap using the heatmap.2 function in gplots in R-3.1.2.
The pir genes from two P. o. curtisi and two P. o. wallikeri assemblies (two high-quality and two draft assemblies) were compared pairwise using BLASTP57 with a 99% identity over a minimum of 150 amino acids cutoff. The gene-gene connections were visualized in Gephi82 using a Fruchterman-Reingold83 layout after removing unconnected genes.
Mirror Tree Analysis
Using Artemis66, 79 fam-m and fam-l doublets that were confidently predicted as being paired-up were manually selected based on their dispersal throughout the subtelomeres of different chromosomes. The Mirrortree85 web server (http://csbg.cnb.csic.es/mtserver/) was used to construct mirror trees for these 79 doublets. 35 doublets with recent branching from another doublet were manually selected to enrich for genes under recent selection. To control for chance signals of co-evolution based on their subtelomeric location, the same methodology was repeated by choosing 79 pir genes in close proximity of fam-m genes as pseudo-doublets and paired up in the Mirrortree85 web server.
Reticulocyte Binding Protein (RBP) Phylogenetic Plot
Full-length RBP genes were manually inspected using ACT67 and verified to either be functional or pseudogenized by looking for sequencing reads in other samples that confirm mutations inducing premature stop codons or frameshifts. All functional RBPs were aligned using MUSCLE49 and cleaned using GBlocks69. PhyML73 was used to construct a phylogenetic tree of the different RBPs. Figtree was used to colour the tree (http://tree.bio.ed.ac.uk/software/figtree/).
SNP Calling
Additional P. malariae (PmMY01, PmID01, PmGN01, PmSL01) and P. o. curtisi (PocGH01, PocGH02, PocCR01) samples were mapped back against the reference genomes using SMALT (-y 0.8, -i 300) (Supplementary Table 2). As outgroups, P. malariae-like (PmlGA01, PmlGA02) and P. o. wallikeri (PowCR01, PowCR02) were also mapped against the P. malariae and P. o. curtisi genomes respectively. The resulting bam files were merged for each of the two genomes, and the GATK41 UnifiedGenotyper was used to call SNPs from the merged bam files (Supplementary Tables 5 and 6). Per GATK41 best practices, SNPs were filtered by quality by depth (QD > 2), depth of coverage (DP > 10), mapping quality (MQ > 20), and strand bias (FS < 60). Additionally, all sites with missing data for any of the samples or with heterozygous calls were filtered out. Finally, we filtered out sites that were masked using Dustmasker75 to remove repetitive and difficult-to-map regions.
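The site filters listed above can be expressed as a single predicate. This is a minimal sketch that assumes the annotations of a site are available as a dictionary and that genotypes are written as VCF-style strings such as 0/0 or 1/1; the function name passes_site_filters and this data layout are hypothetical, and the Dustmasker step is not included.

```python
from typing import Dict, Sequence


def passes_site_filters(info: Dict[str, float], genotypes: Sequence[str]) -> bool:
    """Apply the thresholds from the text: QD > 2, DP > 10, MQ > 20, FS < 60,
    then reject sites with any missing or heterozygous genotype."""
    if not (info["QD"] > 2 and info["DP"] > 10
            and info["MQ"] > 20 and info["FS"] < 60):
        return False
    for genotype in genotypes:
        alleles = genotype.replace("|", "/").split("/")
        if "." in alleles:
            return False  # missing call in at least one sample
        if len(set(alleles)) > 1:
            return False  # heterozygous call
    return True
```

A site is retained only if every sample has a homozygous call, which matches the requirement that sites with missing or heterozygous data be removed.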
Molecular Evolution Analysis
To calculate the genome-wide nucleotide diversity, we extracted all raw SNPs in the genomes excluding the subtelomeres. We then divided the resulting genome size by the number of raw SNPs specific to the core of the genome. This number was averaged for the four P. malariae samples and for the two P. ovale samples.
The filtered SNPs were used to morph the reference genomes using BCFtools74 for each sample, from which sample-specific gene models were obtained. Nucleotide alignments of each gene were then generated. Codons with alignment positions that were masked using Dustmasker75 were excluded. For each alignment (i.e. gene), we calculated HKA44, MK46, and Ka/Ks45 values (see below). Subtelomeric gene families and pseudogenes were excluded from the analysis. The results were analysed and plotted in RStudio (http://www.rstudio.com/).
For the HKA44, we counted the proportion of pairwise nucleotide differences intra-specifically (i.e. within P. malariae and within P. o. curtisi) and inter-specifically (i.e. between P. malariae and P. malariae-like, between P. o. curtisi and P. o. wallikeri). The intraspecific comparisons were averaged to get the gene's nucleotide diversity pi and these were divided by the average interspecific comparisons, the nucleotide divergence, to get the HKA ratio (HKAr) for each gene.
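The HKAr calculation described above can be sketched for a single gene as follows. The example assumes that the aligned, sample-specific sequences of one gene are equal-length strings grouped into an ingroup (for example P. malariae samples) and an outgroup (for example P. malariae-like samples); the function names are hypothetical.

```python
from itertools import combinations, product
from statistics import mean
from typing import Optional, Sequence


def proportion_different(a: str, b: str) -> float:
    """Proportion of aligned positions at which two sequences differ."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def hka_ratio(ingroup: Sequence[str], outgroup: Sequence[str]) -> Optional[float]:
    """Nucleotide diversity (mean intraspecific difference) divided by
    nucleotide divergence (mean interspecific difference)."""
    pi = mean(proportion_different(a, b) for a, b in combinations(ingroup, 2))
    divergence = mean(proportion_different(a, b)
                      for a, b in product(ingroup, outgroup))
    if divergence == 0:
        return None  # ratio undefined when the species do not differ
    return pi / divergence
```

At least two ingroup sequences are needed, since the diversity term is an average over pairs within the species.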
The MK test46 was performed for each gene by obtaining the number of fixed and polymorphic changes, as well as a p-value, as previously described86 and then calculating the skew as log2(((Npoly+1)/(Spoly+1))/((Nfix+1)/(Sfix+1))) where Npoly and Nfix are polymorphic and fixed non-synonymous substitutions respectively, while Spoly and Sfix refer to the synonymous substitutions.
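The skew formula above translates directly into code. This sketch takes the four counts for one gene as inputs; the function name mk_skew is hypothetical, and the p-value calculation referenced in the text is not reproduced here.

```python
import math


def mk_skew(n_poly: int, s_poly: int, n_fix: int, s_fix: int) -> float:
    """log2(((Npoly+1)/(Spoly+1)) / ((Nfix+1)/(Sfix+1))).
    The +1 pseudocounts keep the ratio defined when a count is zero."""
    polymorphic_ratio = (n_poly + 1) / (s_poly + 1)
    fixed_ratio = (n_fix + 1) / (s_fix + 1)
    return math.log2(polymorphic_ratio / fixed_ratio)


# Example with made-up counts: relatively more non-synonymous polymorphisms
# than non-synonymous fixed differences gives a positive skew.
example_skew = mk_skew(n_poly=3, s_poly=1, n_fix=1, s_fix=1)
```

A positive value indicates an excess of non-synonymous polymorphism relative to fixed differences, and a negative value the reverse.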
To calculate the average Ka/Ks ratio45, we took the cleaned alignments of the MK test, extracting the pairwise sequences of P. malariae and P. malariae-like (and of P. o. curtisi and P. o. wallikeri). The Bio::Align::DNAStatistics module was used to calculate the Ka/Ks values for each pair87, averaging across samples within a species.
Using existing RNA-Seq data from seven different life-cycle stages in P. falciparum25, reads were mapped against spliced gene sequences (exons, but not UTRs) from the P. falciparum 3D7 reference genome2 using Bowtie288 v2.1.0 (-a -X 800 -x). Read counts per transcript were estimated using eXpress v1.3.089. Genes with an effective length cutoff below 10 in any sample were removed. Summing over transcripts generated read counts per gene. Each gene in P. malariae and P. ovale was classified by its P. falciparum ortholog's maximum expression stage.
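The final classification step can be sketched as a lookup of the stage with the highest read count for each ortholog. The data layout (a dictionary of per-stage read counts keyed by P. falciparum gene ID, and a gene-to-ortholog mapping), the gene IDs, the stage names and the function name are all hypothetical placeholders.

```python
from typing import Dict


def classify_by_peak_stage(expression: Dict[str, Dict[str, float]],
                           orthologs: Dict[str, str]) -> Dict[str, str]:
    """Assign each gene the life-cycle stage at which its P. falciparum
    ortholog has the highest read count; genes without expression data
    for their ortholog are left unclassified."""
    classification = {}
    for gene, pf_gene in orthologs.items():
        stage_counts = expression.get(pf_gene)
        if not stage_counts:
            continue
        classification[gene] = max(stage_counts, key=stage_counts.get)
    return classification


# Hypothetical IDs and counts, for illustration only
expression = {"PF_A": {"ring": 5, "trophozoite": 20, "schizont": 3}}
orthologs = {"PM_1": "PF_A", "PM_2": "PF_missing"}
peak_stages = classify_by_peak_stage(expression, orthologs)
```

Ties between stages are resolved by dictionary order in this sketch; the text does not say how ties were handled.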
RBP1a Receptor Search
To find the putative RBP1a receptor, we performed an OrthoMCL68 clustering between Human, Chimpanzee90, and common marmoset91 genes. The common marmoset has been found infected with P. brasilianum (P. malariae) in the wild92. Genes without a transmembrane domain as well as those annotated as ‘predicted’ were removed. To remove false positives, all remaining genes were searched against the Chimpanzee genes using BLASTP57 with a threshold of 1e-10.
Author Contributions
G.G.R. carried out the sequence assembly, genome annotation and all the data analysis; U.C.B. performed manual gene curation; M.S. coordinated sequencing; A.J.R., M. M., and F.P. performed data analysis; G.G.R., T.O.A., L.AE., J.W.B., D.P.K., C.I.N., M.B., and T.D.O. designed the P. ovale project; G.G.R., F.R., B.O., F.P., C.I.N., M.B., and T.D.O. designed the P. malariae-like project; G.G.R., A.A.D., O.M.A, N.M.A., S.A., R.N.P., J.S.M., C.I.N., M.B., and T.D.O. designed the P. malariae project; G.G.R., C.I.N., M.B., T.D.O wrote the manuscript; All authors read and critically revised the manuscript; and C.I.N., M.B., T.D.O. directed the overall study.
The first column shows the country of origin for the different samples, with the second column showing the total number of samples collected in that country. The following five columns show the number of these samples that are positive for the different Plasmodium species. All samples are positive for P. falciparum, which is expected because all the samples were initially identified as P. falciparum. We do not see any samples positive for P. knowlesi, because it has a very limited geographic range and is not found in any of the sampled countries.
The first column shows the gene ID of the P. vivax P01 homolog of the gene pseudogenized/deleted in one or more of the three human malaria parasite assemblies. The second column is the P. vivax P01 annotation of that gene. The following three columns show whether the gene is functional (blank), pseudogenized (Pseudo), deleted (Deleted), or missing due to a sequencing gap (Gap).
These are the two orthoMCL gene clusters that contain exclusively all hypnozoite-forming Plasmodium species and are not part of subtelomeric gene families.
SNP calling results as per mapping all P. malariae and P. malariae-like samples against the PmUG01 PacBio reference genome assembly. The raw SNPs are the total number of SNPs that we call using GATK default parameters in the different samples. Of these raw SNPs, some are exclusive to a certain sample (Private), are identical to the reference genome (Ref), or there is no coverage and therefore no SNP call could be made (Missing). The same information is also shown for the filtered SNPs, which were filtered according to a number of different parameters (Methods).
SNP calling results as per mapping all P. o. curtisi and P. o. wallikeri samples against the PocGH01 Illumina reference genome assembly. The raw SNPs are the total number of SNPs that we call using GATK default parameters in the different samples. Of these raw SNPs, some are exclusive to a certain sample (Private), are identical to the reference genome (Ref), or there is no coverage and therefore no SNP call could be made (Missing). The same information is also shown for the filtered SNPs, which were filtered according to a number of different parameters (Methods).
For the three population genetics measures (HKAr, Ka/Ks, and Skew), the table shows the genes that have a significant value in two or more of these measures. These genes therefore represent genes under significant selection pressures.
The first column shows the 19 transmembrane-containing human genes that are shared between humans and the common marmoset, but not with chimpanzees. As RBP1a is the only RBP with large differences between P. malariae and P. malariae-like, these genes may represent interesting RBP1a receptor targets.
Acknowledgements
This work was supported by the Medical Research Council [MR/J004111/1] [MR/L008661/1] and the Wellcome Trust [098051]. S.A. and R.N.P. are funded by the Wellcome Trust (Senior Fellowship in Clinical Science awarded to RNP, 091625). F.R., B.O., and F.P. are financed by the ANR JCJC 2012 ORIGIN, the LMI Zofac, as well as by CNRS, IRD, and CIRMF. C.I.N. is funded by the Wellcome Trust [104792]. A.A.D. is funded as a Sanger International Fellow. The authors thank Eric Willaume from the Park of La Lékédi, and the different people involved in the sanitary controls of the chimpanzees. J.S.M. and N.M.A. are supported by NHMRC Practitioner Fellowships (#104072).