Abstract
Plasmodium vivax is the most prevalent malarial species in South America and exerts a substantial burden on the populations it affects. Its control and eventual elimination are a global health priority. Genomic research contributes to this objective by improving our understanding of the biology of P. vivax and through the development of new genetic markers that can be used to monitor efforts to reduce malaria transmission.
Here we analyze whole genome data from eight field samples from a region in Córdoba, Colombia where malaria is endemic. We find considerable genetic diversity within this population, a result that contrasts with earlier studies suggesting that P. vivax had limited diversity in the Americas. We also identify a selective sweep around a substitution known to confer resistance to sulphadoxine-pyrimethamine (SP). This is the first observation of a selective sweep for SP resistance in this parasite. These results indicate that P. vivax has been exposed to SP pressure even when the drug is not in use as a first-line treatment for patients afflicted by this parasite. We identify multiple non-synonymous substitutions in three other genes known to be involved with drug resistance in Plasmodium species. Finally, we found extensive microsatellite polymorphisms. Using this information we developed 18 microsatellite loci that are polymorphic and easy to score and can thus be used in epidemiological investigations in South America.
Author Summary
Although P. vivax is not as deadly as the more widely studied P. falciparum, it remains a pressing global health problem. Here we report the results of a whole genome study of P. vivax from Córdoba, Colombia, in South America. This parasite is the most prevalent in this region. We show that the parasite population is genetically diverse, contrary to the expectations from earlier studies from the Americas. We also find molecular evidence that resistance to an anti-malarial drug has arisen recently in this region. This selective sweep indicates that the parasite has been exposed to a drug that is not used as first-line treatment for this malaria parasite. In addition to extensive SNP and microsatellite polymorphism, we report 18 new genetic loci that might be helpful for fine-scale studies of this species in the Americas.
Introduction
Despite significant advancements toward malaria control and elimination, about 40% of the world’s population remains at risk of infection by one of the four protozoan species that commonly cause the disease [1]. Among the human malarias, P. vivax is the parasite with most morbidity outside Africa [1]. P. vivax differs from the more widely studied P. falciparum in aspects of its life cycle, disease severity, geographic distribution, ecology, and evolutionary history [2–7], raising concerns that gaps in our knowledge about its basic biology may compromise its control [8].
Genomic approaches provide important tools to study hard-to-culture parasites such as P. vivax. For example, genome-wide scans performed on samples from a natural parasite population can identify regions of the genome subject to strong selection. Studies using this approach in P. falciparum have contributed to the study of drug resistance and adaptation to the host immune system in that species [9–12]. However, this approach has not yet been widely applied in P. vivax.
Previous population genetic studies on genes encoding antigens have found that P. vivax populations in many regions of the Americas are less diverse than those from Asia or Oceania [8, 13–16]. Similarly, a recent whole genome study found limited genetic diversity in a population from the Amazon basin of Peru [17]. It remains unclear whether these results are reflective of populations in the New World generally, the geographical sampling of those particular studies, or the loci sampled in earlier studies. Two recent studies have suggested that P. vivax populations in the Americas may harbor more genetic diversity than previously thought. First, a genome-wide comparison revealed substantial genetic divergence among three parasite lineages isolated in the Americas and maintained in non-human primates [18]. Second, recent population studies on the mitochondrial genome have shown high levels of divergence and limited gene flow among populations in the region [19]. This pattern indicates that vivax populations in the Americas likely have a complex history, with divergent populations harboring differing levels of genetic diversity.
Genomic studies can also support efforts to control and eliminate malaria from a given region by identifying genetic markers that will be informative for fine-scale population genetic studies in that region. Molecular epidemiological investigations rely on multilocus genotyping of SNPs or microsatellites [20] to investigate patterns of population structure and gene flow. Although the high mutation rates of microsatellite loci make them ideal markers for such studies, only a few loci are currently in use [20]. These existing loci were developed using data from a small number of populations. As a result, some of these loci fail to amplify in samples from other localities [21]. Whole genome studies with multiple samples offer the opportunity to identify new loci, and to focus on those that are known to be polymorphic in a given region and have the sort of simple repeat motifs that lead to reliable scoring of genotypes. In addition to identifying patterns of transmission within a region, these markers can be used to distinguish local cases (the result of remaining malaria transmission) from those that are introduced from another region. Ascertaining the source of the parasites detected in a given case is critical in order to evaluate the success of interventions during an elimination program.
Here we take a whole-genome approach to characterize the genetic variation of field isolates with single-lineage P. vivax infections from Northern Colombia (specifically, Tierralta, Department of Córdoba), an area in South America with seasonal transmission [22]. In addition to assessing the genetic diversity within this population, we examine patterns of diversity across the P. vivax genome and find evidence for a recent selective sweep likely associated with resistance to a drug that is not prescribed for treating P. vivax malaria. We also develop 18 new microsatellite loci for fine-scale studies in Colombia.
Methods
Ethics statement
A passive surveillance study was conducted between 2011 and 2013 in outpatient clinics located in Tierralta [22]. The study protocol was approved by the Institutional Review Board (IRB) affiliated to the Malaria Vaccine and Drug Development Center (MVDC, Cali-Colombia). Patients with malaria infection as determined by microscopic examination of Giemsa-stained thick blood smears received oral and written explanations about the study and, after expressing their willingness to participate, were requested to sign an informed consent (IC) form previously approved by the IRB affiliated to the MVDC. IC was obtained from each adult individual or from the parents or guardians of children under 18 years of age. Individuals between 7 and 17 years old were asked to sign an additional informed assent. A trained physician of the study staff completed a standard clinical evaluation and a physical examination in all symptomatic malaria subjects. The local health provider treated individuals as soon as the blood sample had been drawn, using the national antimalarial therapy protocol of the Colombian Ministry of Health and Social Protection [22]. Specifically, patients infected with P. vivax were treated orally with chloroquine (25 mg per kg provided in three doses) and primaquine (0.25 mg per kg daily for 14 days).
Sample collection
We collected 10 mL of blood from each patient and stored each blood sample in EDTA. In order to eliminate as much human DNA as possible, each sample was diluted in one volume of PBS and filtered with a CF11 column (≈3g) that had previously been rinsed with PBS. A new column was used for each 5 mL of sample. Each filtered sample was centrifuged at 1,000 g for 10 minutes. The supernatant was discarded and the red blood cells (RBC) were kept at -20°C and sent to our laboratory in Cali for processing. The RBCs were suspended in one volume of PBS and aliquoted into 200 μL fractions. DNA was extracted from each aliquot by using a PureLink Genomic DNA kit (Invitrogen, USA) following specifications provided by the manufacturer.
Sequencing and alignment to reference
Depending on availability and quality of the sample, we used between 300 ng and 1 mg of DNA to construct sequencing libraries. These libraries were constructed using a Kapa Biosystems DNA Library Preparation Kit (Kapa Biosystems, USA). The resulting fragments were amplified in ten rounds of PCR, using a Kapa HiFi Library Amplification Kit (Kapa Biosystems, USA). Denaturation and clustering were performed using an Illumina cBot. Once the samples were clustered, the flow cell was loaded onto a HiSeq 2000. The run used paired-end 2x100 reads. All sequencing and library preparation stages were performed by the DNASU sequencing core at Arizona State University.
We used Bowtie version 2.1.0 [23] to map reads from each sample to a reference genome containing sequences derived from the Salvador I (SalI) strain’s nuclear [24] (build ASM241v1) and apicoplast [25] genomes. The resulting alignments were processed using a modified version of the GATK project’s best practice guidelines [26]. Specifically, we identified and marked potential PCR duplicates using the MarkDuplicates tool from Picard version 1.106 (http://picard.sourceforge.net) and performed local realignment around possible indels using GATK 2.8 [27, 28]. Finally, we adjusted the raw base quality scores by running GATK’s BaseRecalibrator tool, treating the set of putative Single Nucleotide Variants (SNVs) identified by SAMtools (0.1.18) [29] as known variants.
To investigate the possibility that our patient samples contain multiple distinct P. vivax strains [30, 31], we repeated this process using sequencing reads produced from a known single infection. We retrieved reads from the monkey-adapted SalI strain [30] from the NCBI Sequence Read Archive (accession SRS365051). We recorded the frequency of bases matching the reference genome at each site for the patient-derived and single-lineage alignments using a custom C++ program (http://dx.doi.org/10.5281/zenodo.18190) that makes use of the BamTools [32] library. We also calculated, for each sample, the overall proportion of all sequenced bases that produced a minority allele when mapped to the reference.
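The two per-sample summaries described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the custom C++ program cited in the text: the input format (one dictionary of base counts per aligned site) and the function names `reference_frequency` and `minority_proportion` are hypothetical.

```python
from typing import Dict, List

# Hypothetical input format: read counts per base at one aligned site,
# e.g. {"A": 30, "C": 1, "G": 0, "T": 0}.
BaseCounts = Dict[str, int]


def reference_frequency(counts: BaseCounts, ref_base: str) -> float:
    """Fraction of reads at one site whose base matches the reference."""
    depth = sum(counts.values())
    if depth == 0:
        raise ValueError("site has no coverage")
    return counts.get(ref_base, 0) / depth


def minority_proportion(sites: List[BaseCounts]) -> float:
    """Proportion of all sequenced bases that are not the majority base
    among the reads aligned to the same site."""
    total = 0
    minority = 0
    for counts in sites:
        depth = sum(counts.values())
        if depth == 0:
            continue  # uncovered sites contribute nothing
        total += depth
        minority += depth - max(counts.values())
    if total == 0:
        raise ValueError("no covered sites")
    return minority / total
```

In a single-lineage infection the second value should stay close to the sequencing error rate, whereas a mixed infection would be expected to inflate it.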
Variant discovery
We called putative SNVs and small indels from the Colombian samples using the GATK UnifiedGenotyper [27]. Because we were able to establish that each patient was infected by a single lineage, we treated samples as haploid. Artifacts produced during the sequencing and mapping of reads to the reference can lead to false positive variant calls [33]. Such false positive variant calls may be particularly likely in P. vivax due to the number of multi-copy gene families and paralogs in this species [34]. In order to account for these potential artifacts, we took a conservative approach to variant calling and removed apparently variant sites that may have resulted from mis-mapped reads. After performing an exploratory analysis comparing properties of our putative variants to a random sample of 100,000 non-variant sites, and using guidelines described by the GATK developers [28], we established the following set of criteria to identify likely false positive variants:
Site within 50 kb of a chromosome end (which are dominated by repeats)
Average mapping quality phred score <35
More than one sample has multiple nucleotides called at the site, such that there are more than two reads that contain minor nucleotides
P-value for a Fisher’s exact test of strand bias <0.001
Absolute z-score for Mann-Whitney U-test of mapping quality difference between variant and reference allele containing reads >5
Absolute z-score for Mann-Whitney U-test of read-position difference between variant and reference allele containing reads >5
Total depth at site ≥ 165x (95th percentile across all sites)
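The criteria listed above amount to a single "remove if any condition holds" predicate. The sketch below is a hedged illustration of that logic only: the `SiteStats` container and its field names are hypothetical, and the authors applied their filters to VCF records with PyVCF rather than to a structure like this one.

```python
from dataclasses import dataclass


@dataclass
class SiteStats:
    """Per-site summary values (hypothetical field names)."""
    pos: int                       # 1-based position on the chromosome
    chrom_length: int              # length of the chromosome in bases
    mean_mapq: float               # average phred-scaled mapping quality
    samples_with_minor_reads: int  # samples with >2 reads carrying minor nucleotides
    strand_bias_p: float           # Fisher's exact test p-value for strand bias
    mapq_z: float                  # Mann-Whitney z-score, mapping quality
    read_pos_z: float              # Mann-Whitney z-score, read position
    depth: int                     # total depth across samples


def is_likely_false_positive(s: SiteStats) -> bool:
    """True if the site matches at least one of the listed criteria."""
    near_end = s.pos <= 50_000 or s.pos > s.chrom_length - 50_000
    return (
        near_end
        or s.mean_mapq < 35
        or s.samples_with_minor_reads > 1
        or s.strand_bias_p < 0.001
        or abs(s.mapq_z) > 5
        or abs(s.read_pos_z) > 5
        or s.depth >= 165
    )
```

Because the conditions are joined with `or`, a site is retained only when it passes every criterion, which matches the conservative intent described in the text.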
To identify microsatellite loci that are segregating within Colombia and have relatively simple evolutionary histories, we filtered the indels labeled as STRs by UnifiedGenotyper by removing any that matched the following criteria:
Locus has non-perfect repeats
Repeat motif >8bp
Locus is monomorphic among Colombian samples
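The three microsatellite criteria can be expressed as a short keep/discard check. This is a sketch under assumptions: a "perfect" repeat is taken here to mean a tract made up solely of exact tandem copies of the motif, and the inputs (`tract`, `motif`, and a list of per-sample repeat counts) are hypothetical stand-ins for values that would be read from the STR annotations in the variant calls.

```python
from typing import Sequence


def is_perfect_repeat(tract: str, motif: str) -> bool:
    """True if the tract is an exact tandem array of the motif."""
    if not motif or len(tract) % len(motif) != 0:
        return False
    return tract == motif * (len(tract) // len(motif))


def keep_microsatellite(tract: str, motif: str,
                        repeat_counts: Sequence[int],
                        max_motif_len: int = 8) -> bool:
    """Keep a locus only if it fails none of the three criteria.

    repeat_counts holds one value per genotyped sample
    (samples with missing genotypes are simply left out).
    """
    perfect = is_perfect_repeat(tract, motif)
    short_motif = len(motif) <= max_motif_len
    polymorphic = len(set(repeat_counts)) > 1
    return perfect and short_motif and polymorphic
```

For example, `keep_microsatellite("ATATATAT", "AT", [4, 5, 4])` keeps the locus, while an interrupted tract such as `"ATATGTAT"` is discarded.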
We produced final variant sets for SNVs and microsatellites by removing all sites that matched at least one of the criteria listed above using PyVCF (http://pyvcf.readthedocs.org/). We produced a functional annotation for each polymorphic SNV with snpeff [35] using Ensembl functional annotation of the SalI reference (build ASM241v1.23) as input.
Validation of variants
We validated our SNV calling procedure by running the steps described above on a genome alignment generated from reads previously produced from the SalI strain [30]. Because these reads represent an independent sequencing of the same strain that was used to produce the reference genome, we expect very few non-reference alleles. We also tested the effect of including low-coverage samples in our variant calling pipeline by repeating this procedure with the SalI alignment down-sampled to 2x coverage. We further tested the validity of filtered variant sites by searching for apparently singleton SNVs (those found only once in our population sample) in previously reported SNVs from other studies [18, 30].
Oligonucleotide primers for polymorphic microsatellite DNA markers
We designed PCR primers for 18 of the microsatellite loci identified using the above criteria. We first chose a subset of these putative microsatellites that were distributed across different P. vivax chromosomes, then developed markers using two strategies. First, nine loci were identified on alignments of conserved regions between P. vivax (SalI and the P. vivax genome data available in NCBI) and P. cynomolgi genomes. Second, nine loci were identified on conserved regions between the SalI strain and Colombian samples from Tierralta.
All the alignments were made using ClustalX v2.0.12 and Muscle as implemented in SeaView v4.3.5. The dyes Hex and 6-FAM were used for labelling the forward primers. A complete characterization of the 18 microsatellite loci and the primers we used to amplify them is provided in S1 Text and S1 Table.
Population genetics
We calculated nucleotide diversity and Watterson’s estimator of the population mutation rate for the whole genome, distinct genomic features (i.e. sequences falling in exons, introns, untranslated regions and intergenic regions), and in 10 kb windows across each chromosome. We accounted for the varying levels of sequence coverage among our samples by using missing-data estimators for these measures [36]. Rather than setting an arbitrary coverage level at which a sample should be considered missing for a given site, we used our data and variant calling approach to identify the number of samples from which we could call a variant if it was present at each site in the genome. We first created new “reference genome” sequences by switching each unambiguous base in the SalI reference following Table 1. We then called variants against these “shifted” genomes using the same procedure described above (including filtering steps). At each site, a sample was considered missing if no variant could be called for that sample using the true reference genome or any of the shifted references.
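To make the idea of site-specific sample sizes concrete, the sketch below computes per-site nucleotide diversity and Watterson's estimator while letting the number of callable samples vary from site to site. It is a simplified illustration for biallelic sites and is not claimed to be the exact estimator of [36]; the input format, a `(n_called, n_alt)` pair for every callable site, is an assumption.

```python
from typing import Iterable, Tuple


def harmonic(n: int) -> float:
    """a_n = sum_{i=1}^{n-1} 1/i, the denominator of Watterson's estimator."""
    return sum(1.0 / i for i in range(1, n))


def site_pi(n: int, k: int) -> float:
    """Mean pairwise difference at a biallelic site where k of the n
    called samples carry the non-reference allele."""
    return 2.0 * k * (n - k) / (n * (n - 1))


def diversity(sites: Iterable[Tuple[int, int]]) -> Tuple[float, float]:
    """Return (pi, theta_W) per site.

    sites yields (n_called, n_alt) for every callable site, including
    invariant ones. Sites with fewer than two called samples are skipped.
    """
    pi_sum = 0.0
    theta_sum = 0.0
    n_sites = 0
    for n, k in sites:
        if n < 2:
            continue
        n_sites += 1
        pi_sum += site_pi(n, k)
        if 0 < k < n:  # segregating among the called samples
            theta_sum += 1.0 / harmonic(n)
    if n_sites == 0:
        raise ValueError("no sites with at least two called samples")
    return pi_sum / n_sites, theta_sum / n_sites
```

Scaling each segregating site by the harmonic number for its own sample size keeps poorly covered sites from being treated as if all eight samples had been observed.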
We used PyBedtools [37, 38] to generate genomic windows and to extract polymorphisms from various sequence classes. We used diversity statistics calculated from genomic windows to identify genomic regions with unusually high or low genetic diversity. Specifically, we identified windows with values in the 1st or 99th percentile of either measure, having first removed windows for which less than 85% of sites were callable. We discovered a region of particularly low diversity surrounding the dhps gene, so we focused on this gene by calculating each statistic for a set of overlapping windows, each 10 kb wide and 500 bases apart from each other.
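The window scan described above can be sketched with the Python standard library. This is an illustration under assumptions rather than the PyBedtools workflow used in the study: windows are half-open `(start, end)` intervals, "500 bases apart" is read as a 500-base step between window starts, and the function names are hypothetical.

```python
import statistics
from typing import List, Sequence, Tuple


def sliding_windows(chrom_length: int, width: int = 10_000,
                    step: int = 500) -> List[Tuple[int, int]]:
    """Overlapping half-open windows covering one chromosome."""
    last_start = max(chrom_length - width, 0)
    return [(s, min(s + width, chrom_length))
            for s in range(0, last_start + 1, step)]


def flag_outlier_windows(diversity: Sequence[float],
                         callable_fraction: Sequence[float],
                         min_callable: float = 0.85
                         ) -> Tuple[List[int], List[int]]:
    """Indices of windows at or below the 1st percentile and at or above
    the 99th percentile, after dropping poorly callable windows."""
    kept = [i for i, f in enumerate(callable_fraction) if f >= min_callable]
    if len(kept) < 2:
        raise ValueError("need at least two callable windows")
    cuts = statistics.quantiles([diversity[i] for i in kept], n=100)
    low_cut, high_cut = cuts[0], cuts[-1]
    low = [i for i in kept if diversity[i] <= low_cut]
    high = [i for i in kept if diversity[i] >= high_cut]
    return low, high
```

Filtering on the callable fraction before computing the percentiles matters: otherwise windows with little usable data would dominate the low-diversity tail.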
Results
Genome sequencing
We generated between 18 and 36 million paired-end reads from each sample (Table 2). Obtaining genomic sequences from clinical isolates of P. vivax is complicated by the presence of human DNA in parasite-containing blood samples. Although we took steps to remove leukocytes from each of our samples, the proportion of reads that could be mapped to the SalI reference genome differed markedly among samples, ranging from <1% to 28%. These differences were reflected in the mean sequencing coverage achieved for each sample, which varied from less than one read per base in sample 500 to greater than 40 reads per base for sample 499.
Despite the presence of some poorly covered samples, our mapped reads allow comparison between multiple individuals for the vast majority of the P. vivax genome (Fig. 1). Because low-coverage samples still provide valuable data for variant calling at some sites, and can often be reliably genotyped for those sites known to contain segregating variants, we included all of our patient samples in subsequent analyses.
All sampled patients were infected by only one P. vivax strain
The presence of multiple distinct parasite lineages within a single host has presented a barrier to population genetic analysis in previous whole-genome studies of P. vivax. Because Plasmodium merozoites are haploid in the vertebrate host, the presence of such multiple infections in a patient can be inferred by the presence of multiple alleles at different loci. In the context of high-throughput sequencing, these additional alleles manifest as an excess of bases with intermediate frequencies at sites in a genome alignment. This result contrasts with the distribution expected from singly infected patients, where only rare sequencing errors will produce minority bases [30]. We tested our samples for multiple infections by comparing the distribution of minor base frequencies in our alignments with the same distribution in an alignment produced from a known single infection (Fig. 2). In both the known single infection strain and our samples, minority bases were present only at low frequencies. This pattern contrasts distinctly with the high proportion of intermediate-frequency bases expected from mixed infections [30]. Because the shape of the distribution of minor base frequencies will be less informative for samples with low coverage, we also examined the proportion of all sequenced bases that were in the minority relative to other reads aligned to the same site. Again, these results are similar to those from SalI (Table 2). The proportion of minority bases in reads produced from a known single infection was 7.3 × 10−4, while for our data this proportion was between 7.3 × 10−4 and 1.11 × 10−3. The fact that no samples had the base frequency distributions characteristic of mixed infections, and that the low-coverage samples did not produce more minor bases than other samples, suggests that all of our samples can be considered single infections.
Variant discovery
We validated our SNV calling procedure by applying it to a set of sequencing reads independently produced from the same strain used to assemble the reference genome [30]. Using this procedure, we called a total of 34 SNVs from the >22 million base pair reference genome, none of which were called as polymorphic alleles in our patient-derived data. Repeating this procedure with a lower coverage (2x) dataset generated fewer SNVs (18) including only one new putative variant. The small number of variants called from the reference data confirms the conservative nature of our variant calling procedure.
In total, we identified 33,855 non-reference SNV alleles among the Colombian samples, 30,261 of which were polymorphic (Table 3). The total number of SNVs we detect is comparable to numbers found in studies of field isolates from other regions; samples from Madagascan and Cambodian populations contain 41,630 and 45,417 SNVs respectively in the genomic regions included in our study (Fig. 3) [30]. A recent study using isolates from the Amazon basin in Peru [17] identified 10,989 SNVs in the same regions. Approximately two-thirds of the Peruvian SNVs (7,232) are also present in our Colombian samples. However, the Madagascan and Cambodian samples share many alleles that are absent in either South American population (11,618).
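Comparisons of this kind reduce to set operations once each SNV is given a key. The sketch below is illustrative only: representing an SNV as a `(chromosome, position, alternate base)` tuple and the helper names are assumptions, and real comparisons would also need the shared masking of genomic regions mentioned in the text.

```python
from typing import Set, Tuple

# Hypothetical SNV key: (chromosome, 1-based position, alternate base).
Snv = Tuple[str, int, str]


def shared_fraction(query: Set[Snv], other: Set[Snv]) -> float:
    """Fraction of SNVs in `query` that are also present in `other`."""
    if not query:
        raise ValueError("query set is empty")
    return len(query & other) / len(query)


def shared_but_absent(a: Set[Snv], b: Set[Snv], *excluded: Set[Snv]) -> Set[Snv]:
    """SNVs found in both `a` and `b` but in none of the `excluded` sets."""
    shared = a & b
    for other in excluded:
        shared = shared - other
    return shared


# Using the counts reported above: 7,232 of the 10,989 Peruvian SNVs
# are also present in the Colombian samples.
peru_shared_with_colombia = 7232 / 10989  # roughly 0.66
```

The same two helpers cover both statements in the paragraph: the share of Peruvian SNVs seen in Colombia, and the alleles common to Madagascar and Cambodia but absent from both South American samples.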
SNVs are relatively less common in exonic sites (1.2 SNVs per kb) than untranslated, intronic or intergenic sites (1.8 – 2.4 SNVs per kb) (Table 3). The ratio of non-synonymous to synonymous SNVs in exons is 1:1.51, substantially lower than the ≈ 1:4 ratio predicted for the P. vivax genome under neutrality [30]. This ratio, and the relative densities of SNVs in different sequence classes are very similar to those previously reported from Madagascan and Cambodian populations [30].
Among polymorphic SNVs, 12,913 (42.7%) were recorded in only one sample (Table 4). However, many of these apparent singletons have been recorded in other studies. When we compare our SNVs to a catalogue of previously reported P. vivax SNVs [17, 18, 30], only 6,854 (20.2%) are unique to this study. Our low-coverage samples did not produce more singletons per callable site than those with higher sequencing coverage (Table 4), suggesting the remaining singletons are not simply a result of artifacts introduced by including these samples.
We identified 789 putative microsatellite loci that met our filtering criteria. To demonstrate the ability of whole genome studies to develop new markers, we designed PCR primers for 18 of these loci (choosing markers that are well spread among the 14 P. vivax chromosomes). We were able to generate PCR amplicons for each locus, and 16 were shown to be polymorphic within the validation panel, with between 2 and 4 alleles per locus segregating in this population (Table 5). The alleles of each locus could be determined easily, as demonstrated by the electropherograms shown in S1 Fig.
Population genetics
Because each of our patient samples represents a single parasite lineage, we can use standard population genetic analyses to investigate the evolutionary and demographic processes shaping P. vivax genomes in Colombia.
We calculated two measures of genetic diversity, nucleotide diversity (π) and Watterson’s estimator of the population mutation rate (θW) [39, 40]. Across the whole genome, we estimate θW to be 7.0 × 10−4. The estimate for π is slightly lower at 6.8 × 10−4. For both diversity measures, genetic diversity is lowest in exonic regions, then increasingly higher in 3′ and 5′ untranslated regions of transcripts, intergenic regions and introns (Table 3).
A selective sweep around dhps
We identified regions of the genome with unusually high or low genetic diversity in this population (S2 Table). The most striking result of this analysis is an extended region of homozygosity on chromosome 14, which includes a 10 kb window with no polymorphic SNVs despite having a mean of 6.35 samples contributing data. When we narrowed our focus to this region by calculating diversity statistics in overlapping windows, we found that the dihydropteroate synthetase gene (dhps) was at the centre of this region of low diversity (Fig. 4). Although there are no polymorphisms within this region, all eight of our samples contain a non-reference allele resulting from a G to C substitution in the second exon of dhps. The substitution is non-synonymous, and leads to the A383G amino acid substitution that has been associated with sulphadoxine resistance in numerous previous studies of P. vivax [41]. This pattern of low diversity surrounding a fixed substitution is the classic sign of a hard selective sweep [42, 43], in which a single mutant or migrant allele is rapidly fixed by selection. No other non-reference alleles were present in the dhps gene.
Most of the remaining genes that overlap with other high- or low-diversity windows encode proteins for which there is no functional annotation. Nevertheless, the high-diversity windows include antigen and surface protein genes such as msp7 and vir family proteins, which are known to be under balancing selection in P. vivax [18]. Because our conservative variant calling approach removed difficult-to-align multi-copy gene families, it is likely we excluded other genes under balancing selection.
We also examined other genes thought to be involved in drug resistance in P. vivax (Table 6). Alleles of the dihydrofolate reductase (dhfr) gene are known to confer resistance to pyrimethamine, a drug administered together with sulphadoxine (SP). We did not observe evidence of a selective sweep at the dhfr locus. However, all samples for which a genotype could be called reliably (six of eight) have non-synonymous SNVs leading to both S58R and S117N amino acid substitutions. Both of these substitutions have been associated with SP resistance in P. vivax [41]. There are two distinct nucleotide variants encoding the S58R substitution in our population samples, with three parasites having an AGC>AGA mutation in the 58th codon, and three others having an AGC>CGC mutation. We found a total of 13 additional non-synonymous variants in the ATP-binding cassette proteins pvmdr1 and PVX_124085 (a homolog of P. falciparum mrp proteins). There were no such variants in GTP cyclohydrolase, the chloroquine resistance transporter ortholog, or Kelch 13, which are all considered possible drug resistance genes [17].
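The observation that two different nucleotide changes in codon 58 produce the same amino acid substitution can be checked with a short translation sketch. It assumes the standard genetic code; the helper name `describe_substitution` is hypothetical.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W"
               "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR"
               "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def describe_substitution(ref_codon: str, alt_codon: str, codon_number: int) -> str:
    """Return a label such as 'S58R' for the change between two codons."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    return f"{ref_aa}{codon_number}{alt_aa}"


# Both variants observed at dhfr codon 58 encode serine -> arginine.
print(describe_substitution("AGC", "AGA", 58))  # S58R
print(describe_substitution("AGC", "CGC", 58))  # S58R
```

Because the two variants alter different positions of the codon (the third and the first base, respectively), they represent distinct mutational events that converge on the same protein change.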
Discussion
Genetic diversity in Colombia
The genetic diversity estimated from our Colombian P. vivax population is comparable to, though slightly lower than, estimates derived from a global sample of P. falciparum (where diversity has been estimated to be 1.03 × 10−3 using isolates from Africa, America, Asia and Oceania [44]). Considering that our samples come from a single department in Colombia, this result demonstrates that the relatively high genetic diversity reported for P. vivax can extend to small spatial scales.
This finding differs substantially from the perception that South American P. vivax populations in general have relatively low diversity due to a simple evolutionary history [18, 19]. Some populations, including the Peruvian population that was the focus of a recent whole genome study [17], do indeed have low genetic diversity, which may be the result of recent local introductions or expansions from a few founders following malaria control programs [45]. On the other hand, the Colombian population studied here has substantial genetic diversity. It has been suggested that such diversity could be the result of complex demographic processes involving multiple introductions and recombination among lineages at broad temporal and spatial scales [19]. A comprehensive population genomic study of South America would be required to understand the extent of such genetic polymorphism and the processes involved in its maintenance. Nevertheless, these results establish that malaria control programs can face a genetically diverse parasite population even at relatively small spatial scales in South America. Such variation should be considered wherever control programs are evaluating the use of molecular markers in the context of surveillance.
It is difficult to compare the genetic diversity of our sample to that of other P. vivax populations. Most studies that report estimates of genetic diversity for this species are focused on clinically important epitopes or a few markers. On the other hand, whole genome studies usually cannot report diversity statistics due to multiplicity of infection in their samples. However, we can make a crude comparison between our results and those from other whole genome studies [17, 30] by comparing the number of non-reference alleles found in each population. Despite our relatively small number of well-covered samples, restricted geographic range and conservative approach to variant calling, we detected 33,855 SNVs. When we apply the same masking criteria used in this study to the variants reported from Cambodia and Madagascar, we arrive at comparable numbers of SNVs (41,630 and 45,417 respectively). Our study discovered considerably more variants than a recent study of P. vivax isolates from the Amazon basin in Peru (10,989 variants). It is likely that this difference reflects the low genetic diversity of the Peruvian population. Comparing these studies also highlights the degree of allele-sharing among populations (Fig. 3). Approximately two-thirds of the SNVs detected in a P. vivax population in Peru were also detected in our Colombian samples, and 11,618 alleles are shared by the Cambodian and Madagascan populations but absent from both South American samples. This pattern may represent genetic differentiation between New World and Old World populations, although it is important to note that differences between these studies may also reflect different variant-calling procedures. A complete understanding of the global structure of P. vivax populations will require many more population samples and a consistent approach to variant calling.
We can also compare our genetic diversity estimates to those produced from one large-scale population genetic study. A study using P. vivax isolates from across India [46] reported values ranging from 1.3 × 10−3 to 3 × 10−3 for 5.6 kb of non-coding DNA. These values are somewhat higher than our diversity estimates from introns or intergenic regions but within the range of values we calculate from 10 kb windows.
Signatures of natural selection
We identified genomic windows with exceptionally high or low genetic diversity. High genetic diversity may be maintained by balancing selection, while regions of low-diversity may be subject to strong purifying selection or be the product of recent selective sweeps. The majority of genes contained within these windows encode proteins for which little is known. However, some of the high-diversity windows overlap with antigen and surface-protein genes that are thought to be subject to balancing selection in P. vivax globally [18]. It is possible that genes in other windows of exceptional diversity have likewise been subject to natural selection; this could be confirmed from a larger population sample and thus a more statistically powerful genome scan.
We also looked at patterns of diversity more broadly, comparing different genomic features. All measures of diversity were lowest in exon sequences, followed by untranslated regions, intergenic spaces, and introns. This pattern, along with the relative lack of non-synonymous SNVs, is consistent with earlier studies demonstrating that purifying selection has a strong effect on protein-coding genes in P. vivax [47]. It is interesting to note that our estimates of genetic diversity were higher for introns than for intergenic regions. This pattern has been reported in P. falciparum [18] and may reflect the presence of conserved but unannotated genes in what are currently considered intergenic regions [44].
Drug resistance alleles in Colombia
Our analysis of low-diversity regions revealed a selective sweep associated with the A383G allele of dhps, which has previously been associated with resistance to sulphadoxine [41]. Although resistance to SP treatment, and indeed resistance mediated by this particular mutation, is a well-known phenomenon, this result is interesting for two reasons. First, by identifying a selective sweep around this mutation we are able to demonstrate that SP resistance has arisen locally within Colombia via the rapid fixation of a single allele. This finding, combined with the fact that another dhps allele (A385G) is most commonly associated with SP resistance in Madagascar [48], French Guiana [48], India [49], Iran [50], Pakistan [51], Thailand [52] and China [53], suggests SP resistance has come about through multiple independent origins. Similar repeated evolution of dhps resistance mutations has been reported in P. falciparum [54–57].
Second, the evolution of SP resistance is interesting in an operational context because it demonstrates that this P. vivax population has been subject to drug pressure from SP, a drug that has not been part of the approved treatment for uncomplicated P. vivax malaria in Colombia (where the treatment of choice is still chloroquine-primaquine combination therapy). SP has been used to treat P. falciparum infections in Colombia, so it is possible that this selective pressure has arisen from misdiagnosis of P. vivax infections or from the use of SP to treat mixed P. vivax-P. falciparum infections. It is also possible that poor compliance with national drug policies, including self-medication by some patients with access to antifolates, or the long half-life of antifolate drugs, could lead to P. vivax infections coming into contact with SP.
The region surrounding dhfr does not show the pattern of decreased genetic diversity associated with a hard selective sweep. In this case, all genotyped samples have two SP-resistance alleles (S58R and S117N), with the first allele encoded by two distinct SNVs. Thus, the SP-resistance alleles in each gene have somewhat different histories, with dhps A383G entering the population once and being rapidly driven toward fixation, but dhfr resistance arising from two separate alleles that have been maintained in the population. This pattern differs from the one found in P. falciparum, where dhfr mutations associated with drug resistance are fixed as the result of a selective sweep, whereas dhps mutations are still segregating with sensitive alleles in the population [55, 57].
We detected non-synonymous variants in two other genes thought to be involved with drug resistance in Plasmodium species. There are eight non-synonymous variants in PVX_124085. The phenotypic effects of these variants are not known, but changes to the P. falciparum ortholog of this gene have been associated with decreased sensitivity to primaquine [58] and antifolate drugs [59]. This study is the third time that an excess of non-synonymous mutations in this gene has been recorded from a population in South America [17, 60], and the clinical significance of this repeated finding should be investigated.
We identified five non-synonymous mutations in pvmdr1, three of which are known from populations in Asia and Madagascar [48, 61]. Mutations in the P. falciparum ortholog of this gene are associated with decreased sensitivity to chloroquine, and pvmdr1 variants are thus considered putative chloroquine-resistance alleles. Among the variants we report, only Y976F has been associated with decreased sensitivity to chloroquine, and then only in vitro and with a modest effect size [62]. There is little evidence for drug failure with chloroquine in South America at present. Nevertheless, the presence of these alleles in a South American population warrants further investigation and presents an opportunity to test for an association between pvmdr1 alleles and sensitivity to the drug in a clinical setting.
We also compared the variants we report from putative drug resistance genes with those reported from another South American population in the Amazon basin of Peru (Table 6). These populations share many alleles, including both SP-resistance alleles in dhfr and five amino acid substitutions in PVX_124085. In contrast, four of the five variants we report from pvmdr1 are not present in the Peruvian population, and there are no non-synonymous substitutions in Peruvian dhps sequences. These results demonstrate the importance of local information in designing control programs, as each population contains distinct drug resistance alleles that may generate distinct responses to different treatments. In addition, the fact that two populations separated by a considerable geographical distance and by the Andes share multiple alleles that are identical by state at the nucleotide level suggests that drug resistance alleles can spread by gene flow between distant populations in South America.
Microsatellite loci for P. vivax studies in the Americas
Although population genomic studies offer a unique view into the biology of P. vivax, smaller-scale studies that use genotypes from only a few loci will remain important in malaria research. Indeed, one important result from this study is a set of new microsatellite loci that can be used in fine-scale population genetic and molecular epidemiological studies in Colombia. Microsatellite loci are particularly useful for such studies, as their relatively high mutation rates can generate highly polymorphic loci. As a result, population genetic signals in microsatellite loci can reflect demographic events occurring at short time scales, including epidemiological events [45, 63]. These markers can also be used for population assignment and for testing for multiple infections [20, 21, 64, 65].
Thus far, 160 microsatellites have been found in the genome of P. vivax [21]; however, many of these loci fail to amplify in some populations [20, 21]. It is not surprising that loci developed in one region are not necessarily informative in others: microsatellites have complex evolutionary histories [66] and high potential for homoplasy [67]. Thus, the widespread application of microsatellite loci to epidemiological problems will require the development of new markers known to amplify and be polymorphic within specific populations. We found 789 putatively polymorphic microsatellite loci from our whole genome sequencing, demonstrating that the 160 markers currently used in P. vivax represent only a small proportion of the loci available in this species. Moreover, we demonstrated that loci detected in our whole genome sequencing can be developed into useful markers. The 18 markers we developed yield patterns of repeats that are easy to score in populations on the Pacific Coast of Colombia and show high levels of polymorphism. Whether these markers will be useful at broader geographic scales remains to be seen, but the specific markers we developed will be useful for fine-scale studies in this region, where malaria elimination is currently being considered.
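One common way to quantify how polymorphic a microsatellite locus is combines the number of distinct alleles with the expected heterozygosity (gene diversity). The sketch below is only an illustration of that calculation, assuming one allele call (PCR product size in bp) per single-clone isolate. The fragment sizes are hypothetical and the choice of Nei's unbiased estimator is ours rather than a description of the scoring procedure used for the 18 markers.

```python
from collections import Counter


def expected_heterozygosity(allele_sizes):
    """Nei's unbiased gene diversity: n / (n - 1) * (1 - sum of p_i squared)."""
    n = len(allele_sizes)
    if n < 2:
        raise ValueError("need at least two allele calls")
    counts = Counter(allele_sizes)
    sum_sq_freq = sum((count / n) ** 2 for count in counts.values())
    return n / (n - 1) * (1.0 - sum_sq_freq)


def summarize_locus(allele_sizes):
    """Summarize one locus by its allele count, size range and gene diversity."""
    return {
        "n_alleles": len(set(allele_sizes)),
        "size_range": (min(allele_sizes), max(allele_sizes)),
        "He": expected_heterozygosity(allele_sizes),
    }


# Hypothetical fragment sizes (bp) for one locus scored in six isolates.
example_calls = [120, 120, 124, 128, 128, 132]
print(summarize_locus(example_calls))
```

With only a handful of isolates, such estimates carry wide sampling error, so they are better read as a screen for informative loci than as population parameters.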
Conclusion
Our results add to growing evidence that P. vivax populations in the Americas are genetically more diverse than was previously proposed. Our study supports the view that the demographic history of malarial parasite populations in South America is far more complex than previously believed. Indeed, even at a small spatial scale, P. vivax populations can harbor extraordinary genetic diversity. Our study also demonstrates that genomic studies of natural populations of P. vivax can provide insights into how parasite populations react to control strategies. Specifically, we identified a selective sweep associated with resistance to SP, a drug that is not used to treat P. vivax in Colombia. This resistance indicates a spillover effect of a drug that was primarily used to treat P. falciparum. The operational consequences of these results require additional investigation. Future studies with more samples may detect additional regions under selection, and thus contribute to the identification of vaccine targets [68] or other clinically relevant phenotypes. Finally, we used our genomic data to develop a set of microsatellite markers that are both easy to genotype and known to be polymorphic within this population. These markers will aid future epidemiological studies and improve our understanding of malaria transmission and demography in Colombia.
Supporting Information Legends
S1 Text
Methods used for microsatellite development
S1 Table
Characterization of 18 polymorphic P. vivax microsatellite loci. Size ranges of PCR products (in base pairs) are given for the small set of Colombian P. vivax isolates analyzed (n = 6). SalI was used as a positive control. Fluorescent dyes (Hex and 6-FAM) were used to label forward primers only. ML: motif length; No.A: number of alleles.
S2 Table
Extreme diversity windows. 10 kb windows with unusually high or low genetic diversity. Values for θw and π are × 10⁻³. The values in the “Start” column are the start positions of the genomic windows in kb.
S1 Fig
Electropherograms showing peak profiles for 18 polymorphic microsatellite loci. The y-axis corresponds to fluorescence intensity (arbitrary units) and the x-axis is the PCR product length in base pairs (bp). The amplitude of each peak in base pairs (bp) is shown in boxes underneath the peaks. The allele size range for this small data set is also given for each locus.