Abstract
Colombia was the second most affected country during the American Zika virus (ZIKV) epidemic, with over 109,000 reported cases. Despite the scale of the outbreak, limited genomic sequence data were available from Colombia. We sequenced ZIKV genomes from Colombian clinical diagnostic samples and infected Aedes aegypti samples across the temporal and geographic breadth of the epidemic. Phylogeographic analysis of these genomes, along with other publicly-available ZIKV genomes from the Americas, indicates at least two separate introductions of ZIKV to Colombia, one of which was previously unrecognized. We estimate the timing of each introduction to Colombia, finding that ZIKV was introduced and circulated cryptically for 4 to 6 months prior to ZIKV confirmation in September 2015. These findings underscore the utility of genomic epidemiological studies for understanding epidemiological dynamics, especially when many infections are asymptomatic.
Author summary
Understanding Zika virus epidemiology using standard surveillance methods has been challenging because many cases are asymptomatic and therefore are not observed. In such cases, we can use evolutionary analysis of pathogen genomes to explore patterns of disease transmission, including estimating when a disease arrived in a country and how it spread after it was introduced. In this paper, the authors sequence and analyze Zika viruses sampled from Colombia, the second most affected country after Brazil. The authors estimated that Zika arrived in Colombia around 4 to 6 months before laboratory confirmation of Zika presence. Zika circulated in Colombia, and also spread from Colombia into bordering countries, including Peru, Ecuador, Panama, and Venezuela. These findings help epidemiologists define when the Colombian population was at risk for Zika infection, which is important for monitoring the frequency of Zika-related outcomes, such as microcephaly and Guillain-Barré syndrome. Additionally, understanding spatial patterns of spread is an important step in understanding how Zika spread throughout the Americas.
Introduction
In recent years, countries across the Americas have experienced the emergence and endemic circulation of various mosquito-borne viruses, making this a critical area for public health surveillance and epidemiologic research. Zika virus (ZIKV) caused a particularly widespread epidemic, with over 800,000 suspected or confirmed cases reported [1]. Given estimated seroprevalence rates between 36% and 76% [2–5], the true number of ZIKV infections in the Americas is likely much higher. With neither a vaccine nor ZIKV-specific treatments available, understanding the epidemiology of ZIKV is our primary tool for controlling disease spread [6]. However, because many infections are asymptomatic [2], the analysis of surveillance data alone yields inaccurate estimates of when ZIKV arrived in a country [7, 8]. In such cases, introduction timing and transmission dynamics post-introduction are better inferred from genomic epidemiological studies, which use joint analysis of viral genome sequences and epidemiologic case data. Indeed, such studies have defined our understanding of when ZIKV arrived in Brazil [7], described general patterns of spread from Brazil into other countries in the Americas [7, 9, 10], and been used to investigate the extent of endemic transmission occurring post-introduction [8]. Genomic epidemiological studies of the spread of ZIKV in the Americas have aided our understanding of the epidemic [7–19], but generally, ZIKV pathogen sequencing has remained a challenge for the public health community [20].
Colombia has a population of approximately 48 million people. In addition, Colombia has Aedes aegypti and Ae. albopictus mosquitoes, which are commonly found at elevations below 2,000 m above sea level [21]. Public health surveillance for arboviral diseases, along with other notifiable conditions, is performed by the Instituto Nacional de Salud de Colombia (INS) [21]. While suspected cases from other municipalities were reported earlier [22], the INS first confirmed ZIKV circulation in mid-September 2015, in the Turbaco municipality on the Caribbean coast. ZIKV spread throughout the country, appearing in areas infested with Ae. aegypti that experience endemic dengue transmission and ongoing circulation of chikungunya virus [23]. Over the entire epidemic, Colombia reported 109,265 cases of Zika virus disease [24], making it the second most ZIKV-affected country in the Americas after Brazil. The extent of the epidemic led the INS to start active surveillance for congenital Zika syndrome [25] as well as neurological syndromes associated with ZIKV infection [26]. While the INS determined that epidemic ZIKV transmission ended in July 2016, they continue to perform surveillance for endemic transmission.
Despite numerous reported cases, only 12 whole ZIKV genomes from Colombian clinical samples were publicly available. These sequences included 1 sample from Barranquilla, Atlántico department, 4 samples from Santander department, and 7 sequences for which departmental or municipal information was unspecified. We sequenced an additional 8 samples from ZIKV-positive human clinical and Ae. aegypti specimens, sampled from previously unrepresented Colombian departments. We describe here the first detailed phylogeographic analysis of Colombian ZIKV to estimate when, and how frequently, ZIKV was introduced into Colombia.
Methods
In Colombia, most ZIKV diagnostic testing occurred at the INS. However, Colombian academics were also involved in ZIKV surveillance, sampling Ae. aegypti and limited human cases. Our study includes samples collected by both the INS and the Universidad del Rosario (UR).
INS sample selection and processing
The INS National Virological Surveillance Program collected diagnostic specimens from over 32,000 suspected ZIKV cases over the course of the epidemic. Of these, roughly 800 serum specimens were PCR-positive for ZIKV and had RT-PCR cycle threshold (Ct) values less than 30. From this set we selected 176 serum specimens that were ZIKV-positive and negative for dengue and chikungunya viruses, as per results from the Trioplex RT-PCR assay [27]. Specimens were selected such that each Colombian department was represented over the entire time period in which specimens were submitted. We extracted RNA using the MagNA Pure 96 system (Roche Molecular Diagnostics, Pleasanton, CA, USA) according to the manufacturer's instructions. An extraction negative control was included on each plate; positive controls were omitted given the risk of cross-contaminating low-titer clinical samples [20]. We attempted reverse transcription and PCR amplification of ZIKV using the two-step multiplex PCR protocol developed by Quick and colleagues [20]. Briefly, cDNA was generated using random hexamer priming and the Protoscript II First Strand cDNA Synthesis Kit (New England Biolabs, Ipswich, MA, USA). We amplified cDNA using the ZikaAsian V1 ZIKV-specific primer scheme [20], which amplifies overlapping 400 bp amplicons across the ZIKV genome, over 35 cycles of PCR. Amplicons were purified using 1x AMPure XP beads (Beckman Coulter, Brea, CA, USA) and quantified using the Qubit dsDNA High Sensitivity assay on the Qubit 3.0 instrument (Life Technologies, Carlsbad, CA, USA). Of the 176 processed samples, 15 amplified sufficiently to perform sequencing.
UR sample selection and processing
UR collected and performed diagnostic testing on 23 human clinical samples from different geographic regions, and 38 Ae. aegypti samples from the Córdoba department of Colombia. RNA was extracted using the RNeasy kit (Qiagen, Hilden, Germany) and a single TaqMan assay (Applied Biosystems, Foster City, CA, USA) directed to ZIKV was employed [28] to confirm ZIKV presence. Approximately 60% of samples (14 clinical samples and 23 Ae. aegypti samples) were found to be ZIKV-positive by RT-PCR. From these, we attempted amplification on 8 samples. Amplification, purification, and quantification of ZIKV amplicons from UR samples were performed as described above. Four samples amplified sufficiently to conduct sequencing; three samples were from human clinical specimens and one was from an Ae. aegypti sample.
Sequencing protocol
We sequenced amplicons from 4 UR and 15 INS samples using the Oxford Nanopore MinION (Oxford Nanopore Technologies, Oxford, UK) according to the protocol described in Quick et al [20]. Amplicons were barcoded using the Native Barcoding Kit EXP-NBD103 (Oxford Nanopore Technologies, Oxford, UK) and pooled in equimolar fashion. Sequencing libraries were prepared using the 1D Genomic DNA Sequencing kit SQK-LSK108 (Oxford Nanopore Technologies, Oxford, UK). We used AMPure XP beads (Beckman Coulter, Brea, CA, USA) for all purification steps performed as part of library preparation. Prepared libraries were sequenced on R9.4 flowcells (Oxford Nanopore Technologies, Oxford, UK) at the INS in Bogotá and at the Fred Hutchinson Cancer Research Center in Seattle.
Bioinformatic processing
Raw signal-level data from the MinION were basecalled using Albacore version 2.0.2 (Oxford Nanopore Technologies, Oxford, UK) and demultiplexed using Porechop version 0.2.3_seqan2.1.1 (github.com/rrwick/Porechop). Primer binding sites were trimmed from reads using custom scripts, and trimmed reads were mapped to Zika reference strain H/PF/2013 (GenBank Accession KJ776791) using BWA v0.7.17 [29]. We used Nanopolish version 0.9.0 (github.com/jts/nanopolish) to determine single nucleotide variants from the event-level data, and used custom scripts to extract consensus genomes given the variant calls and the reference sequence. Coverage depth of at least 20x was required to call a SNP; sites with insufficient coverage were masked with N, denoting that the exact nucleotide at that site is unknown. After bioinformatic assembly, 8 samples produced sufficiently complete genomes to be informative for phylogenetic analysis.
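The masking rule described above can be illustrated with a minimal Python sketch. This is not the custom script used in the study (those scripts are in the zika-seq repository); the function names, the 0-based variant dictionary, and the toy reference, depths, and variants are all assumptions made for illustration. Only the 20x depth threshold and the use of N for low-coverage sites come from the text.

```python
from typing import Dict, List

MIN_DEPTH = 20  # coverage threshold stated in the text


def build_consensus(reference: str, variants: Dict[int, str],
                    depth: List[int], min_depth: int = MIN_DEPTH) -> str:
    """Apply SNVs to the reference and mask low-coverage sites with N.

    `variants` maps 0-based reference positions to the alternate base
    (a hypothetical input format). `depth` holds per-site read depth and
    must have the same length as `reference`.
    """
    if len(depth) != len(reference):
        raise ValueError("depth must have one entry per reference site")
    consensus = []
    for pos, ref_base in enumerate(reference):
        if depth[pos] < min_depth:
            # Insufficient coverage: the base is unknown, even if a
            # variant was proposed at this site.
            consensus.append("N")
        else:
            consensus.append(variants.get(pos, ref_base))
    return "".join(consensus)


def unambiguous_fraction(seq: str) -> float:
    """Fraction of sites carrying an unambiguous A/C/G/T call."""
    return sum(base in "ACGT" for base in seq.upper()) / len(seq)


# Toy data, not real ZIKV sequence.
reference = "ACGTACGTAC"
variants = {2: "A", 7: "C"}  # the variant at site 7 sits in a low-depth region
depth = [50, 50, 50, 50, 25, 20, 19, 5, 0, 40]

consensus = build_consensus(reference, variants, depth)
print(consensus, unambiguous_fraction(consensus))
```

In this toy case the variant at a well-covered site is applied, while the variant at the poorly covered site is discarded in favor of N. The second helper computes the kind of genome-completeness fraction that the 50% coverage criterion in the Results refers to.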
Dataset curation
All publicly available Asian lineage ZIKV genomes and their associated metadata were downloaded from ViPR [30] and NCBI GenBank. The full download contained both published and unpublished sequences; we sought written permission from submitting authors to include sequences that had not previously been described in a publication. Any sequences for which we did not receive approval were removed. Additionally, we excluded sequences from the analysis if any of the following conditions were met: the sequence had ambiguous base calls at half or more sites in the alignment, the sequence was from a cultured clone for which a sequence from the original isolate was available, the sequence was sampled from countries outside the Americas or Oceania, or geographical sampling information was unknown. Finally, we also excluded viruses whose estimated clock rate was more than 4 times the interquartile distance of clock rates across all sequences in the analysis. Sequences that deviate this greatly from the average evolutionary rate have either far more or far fewer mutations than expected given the date they were sampled. This deviation usually occurs if the given sampling date is incorrect, or if the sequence has been affected by contamination, lab adaptation, or sequencing error. After curation, the final dataset consisted of 360 sequences: 352 publicly available ZIKV full genomes from the Americas (including Colombia) and Oceania, and the 8 Colombian genomes from the present study.
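One common way to implement the clock-based exclusion described above is to regress root-to-tip divergence on sampling date and flag tips whose residual exceeds a multiple of the interquartile distance of all residuals. The sketch below assumes that approach; it is an illustration rather than the pipeline's own code, and the function names, the quartile interpolation scheme, and the toy dates and divergences are all hypothetical.

```python
from typing import List, Tuple


def linear_fit(x: List[float], y: List[float]) -> Tuple[float, float]:
    """Ordinary least-squares slope and intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def quartiles(values: List[float]) -> Tuple[float, float]:
    """First and third quartiles by linear interpolation."""
    ordered = sorted(values)

    def at(q: float) -> float:
        idx = q * (len(ordered) - 1)
        lo = int(idx)
        hi = min(lo + 1, len(ordered) - 1)
        return ordered[lo] + (ordered[hi] - ordered[lo]) * (idx - lo)

    return at(0.25), at(0.75)


def clock_outliers(dates: List[float], divergences: List[float],
                   n_iqd: float = 4.0) -> List[int]:
    """Indices of tips whose regression residual exceeds n_iqd * IQD."""
    slope, intercept = linear_fit(dates, divergences)
    residuals = [d - (slope * t + intercept)
                 for t, d in zip(dates, divergences)]
    q1, q3 = quartiles(residuals)
    limit = n_iqd * (q3 - q1)
    return [i for i, r in enumerate(residuals) if abs(r) > limit]


# Toy data: nine tips sampled quarterly, diverging at 1e-3 subs/site/year
# from a root placed at 2013.9, with small noise. Tip 4 carries far more
# divergence than its date predicts, as a mislabeled or contaminated
# sample might.
dates = [2015.0 + 0.25 * i for i in range(9)]
noise = [2e-4, -2e-4, -2e-4, 2e-4, 4.5e-3, 2e-4, -2e-4, -2e-4, 2e-4]
divergences = [1.0e-3 * (t - 2013.9) + eps for t, eps in zip(dates, noise)]

flagged = clock_outliers(dates, divergences)
print(flagged)
```

Because the outlier is included in the fit, it pulls the regression line toward itself, so a multiplier as large as 4 is deliberately conservative: only tips that are grossly inconsistent with their sampling date are removed.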
Phylogeographic analysis
Data were cleaned and canonicalized using Nextstrain Fauna (github.com/nextstrain/fauna), a databasing tool that enforces a schema for organizing sequence data and sample metadata, thereby creating datasets compatible with the Nextstrain Augur analytic pipeline (github.com/nextstrain/augur) and the Nextstrain Auspice visualization platform (github.com/nextstrain/auspice). A full description of the Nextstrain pipelines can be found in Hadfield et al [31]. Briefly, Nextstrain Augur performs a multiple sequence alignment with MAFFT [32], which is then trimmed to the reference sequence. A maximum likelihood phylogeny is inferred using IQ-TREE [33]. Augur then uses TreeTime [34] to estimate a molecular clock; rates inferred by TreeTime are comparable to BEAST [34], a program that infers temporally-resolved phylogenies in a Bayesian framework [35]. Given the inferred molecular clock, TreeTime then creates a temporally-resolved phylogeny, infers sequence states at internal nodes, and estimates the geographic migration history across the tree. These data are exported as JSON files that can be interactively visualized on the web using Nextstrain Auspice.
Data and code availability
All priming schemes, kit specifications, laboratory protocols, bioinformatic pipelines, and sequence validation information are openly available at github.com/blab/zika-seq. Nextstrain analytic and visualization code is available at github.com/nextstrain. Code and builds specific to the analysis presented here, as well as all genomes sequenced for this study, are openly available at github.com/blab/zika-colombia. Consensus sequences are also available on NCBI GenBank (accessions MK049245 through MK049252). The phylogeny and all inferences reported here can be interactively explored at nextstrain.org/community/blab/zika-colombia.
Results
Sequencing and sampling characteristics of reported ZIKV genomes
In total, we attempted to amplify ZIKV nucleic acid from 184 samples collected by the Instituto Nacional de Salud de Colombia (INS) and Universidad del Rosario (UR). Given the low viral titers associated with most ZIKV infections, as well as long storage times, most samples did not amplify well. We attempted sequencing on 19 samples that amplified sufficiently to generate sequencing libraries (Table 1). Sequencing efforts yielded eight ZIKV sequences with at least 50% coverage across the genome with unambiguous base calls (Table 1). Seven of these viruses came from humans; one virus came from Ae. aegypti. Three sequences came from samples collected from infected individuals in Cali, department of Valle del Cauca, two sequences came from Montería, department of Córdoba, and one sequence each came from Ibagué, department of Tolima, Belén de Umbría, department of Risaralda, and Pitalito, department of Huila (Figure 1). Colombian viruses are sampled across the period of peak ZIKV incidence in Colombia (Figure 2).
General patterns of ZIKV transmission in the Americas
We conducted a maximum likelihood phylogeographic analysis of 360 Asian lineage ZIKV genomes; 18 sequences are sampled from Oceania, and 342 sequences are sampled from the Americas. We estimate the ZIKV evolutionary rate to be 1.02 × 10^-3 substitutions per site per year, in agreement with other estimates of the molecular clock [7, 10, 11, 36] made with BEAST [35], a Bayesian method for estimating rates of evolution and inferring temporally-resolved phylogenies. Like others, we find that ZIKV moved from Oceania to the Americas, and that the American epidemic descends from a single introduction into Brazil (Figure 3A). We estimate that this introduction occurred in late November 2013 (95% CI: October 2013 to January 2014), in line with Faria et al’s [11] initial estimate of introduction to Brazil between May and December 2013, and updated estimate of introduction between October 2013 and April 2014 [7]. We also confirm findings from previous studies [7, 10] that ZIKV circulated in Brazil for approximately one year before moving into other South American countries to the north, including Colombia, Venezuela, Suriname, and French Guiana. Movement of ZIKV into Central America occurs around late 2014, while movement into the Caribbean occurs slightly later, around mid-2015.
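To give a sense of scale for the clock rate quoted above, the following short calculation converts it to expected substitutions per genome. The genome length of roughly 10,800 nucleotides is an approximate figure assumed here for illustration and is not taken from the text.

```python
RATE = 1.02e-3          # substitutions per site per year (estimate from the text)
GENOME_LENGTH = 10_800  # assumed approximate ZIKV genome length, in nucleotides

subs_per_genome_per_year = RATE * GENOME_LENGTH
subs_per_genome_per_month = subs_per_genome_per_year / 12

print(f"{subs_per_genome_per_year:.1f} substitutions per genome per year")
print(f"{subs_per_genome_per_month:.2f} substitutions per genome per month")
```

Under that assumption, a lineage accumulates on the order of eleven substitutions per year, or a little under one per month. This is why introduction dates inferred from the tree carry uncertainty of a few months rather than days.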
Multiple introductions of ZIKV to Colombia
We infer patterns of ZIKV introduction and spread in Colombia from the phylogenetic placement of 20 Colombian sequences. Previously, 12 of these sequences were publicly available. We add an additional 8 Colombian sequences sampled over a broad geographic and temporal range. Colombian ZIKV sequences clustered into two distinct clades (Figure 3B). Both clades are descended from viruses inferred to be from Brazil (Figure 3A). Viruses immediately ancestral to both Colombian clades are estimated by the phylogeographic model to have 99% model support for a Brazilian origin. However, lack of genomic sampling from many ZIKV-affected countries in the Americas may limit our ability to infer direct introduction from Brazil or transmission through unsampled countries prior to arrival in Colombia.
Clade 1 is comprised of 28 viruses, 17 of which are from Colombia, and is characterized by nucleotide mutations T738C, C858T, G864T, C3442T, C5991T and A10147G. This clade contains all previously reported Colombian genomes, as well as five of the genomes generated during this study (Figures 3 and 4). The phylogeographic model places the root of this clade in Colombia with 99% model support. The most parsimonious reading is that this clade of viruses resulted from a single introduction event from Brazil into Colombia.
Clade 2 contains 5 viruses, 3 of which are from Colombia (Figures 3 and 5). All three were sequenced during this study, and thus this clade was not previously recognized in Colombia. This clade is characterized by mutations T1858C, A3780G, G4971T, C5532T, G5751A, A6873G, T8553C, and C10098T. The phylogeographic model places the root of this clade in Colombia with 98% model support and also suggests a Brazil to Colombia transmission route that is likely, but not certainly, direct. Two additional genomes with less than 50% genomic coverage also group within these clades: COL/FH14/2016 within clade 1 and COL/FH15/2016 within clade 2.
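The clade-defining mutations listed above can be used as a quick check on whether a given genome carries a clade's signature. The sketch below is illustrative only: it assumes the mutation labels use 1-based positions in the coordinates of an alignment to the reference, that the first letter is the ancestral base and the last letter the derived base, and that a masked site (N) counts as not carrying the mutation. The helper names and toy sequences are hypothetical.

```python
import re
from typing import List, Tuple

# Mutation lists as reported in the text.
CLADE_1 = ["T738C", "C858T", "G864T", "C3442T", "C5991T", "A10147G"]
CLADE_2 = ["T1858C", "A3780G", "G4971T", "C5532T",
           "G5751A", "A6873G", "T8553C", "C10098T"]


def parse_mutation(label: str) -> Tuple[str, int, str]:
    """Split a label such as 'T738C' into (ancestral, position, derived)."""
    match = re.fullmatch(r"([ACGT])(\d+)([ACGT])", label)
    if match is None:
        raise ValueError(f"unrecognized mutation label: {label!r}")
    ancestral, position, derived = match.groups()
    return ancestral, int(position), derived


def carries_all(aligned_seq: str, mutations: List[str]) -> bool:
    """True if the sequence has the derived base at every listed site.

    Positions are assumed to be 1-based alignment coordinates. Sites that
    are masked, or that fall beyond the end of the sequence, count as
    not carrying the mutation.
    """
    for label in mutations:
        _, position, derived = parse_mutation(label)
        if position > len(aligned_seq):
            return False
        if aligned_seq[position - 1].upper() != derived:
            return False
    return True


# Toy example with hypothetical mutations on a six-base sequence.
toy_mutations = ["A3G", "C5T"]
print(carries_all("ACGTTA", toy_mutations))  # derived base at both sites
print(carries_all("ACATCA", toy_mutations))  # ancestral base at both sites
```

For partial genomes such as the two low-coverage samples mentioned above, a stricter all-sites check like this one would often fail because of masked positions; placing such genomes on the tree, as done in the study, uses whatever sites are available.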
Cryptic transmission within Colombia
We estimate that clade 1 was introduced to Colombia around early April of 2015 (95% CI: March 2015 to May 2015) (Figures 3B and 4), and that clade 2 was introduced in early November of 2015 (95% CI: September 2015 to December 2015) (Figures 3B and 5). Our estimate of clade 1 introduction timing supports between four and six months of cryptic ZIKV transmission within Colombia prior to initial case detection in September 2015, a finding that is consistent with other genomic epidemiological studies of ZIKV [7, 8, 10].
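The four-to-six-month figure follows directly from the dates above, as the short calculation below shows. It works at month resolution only; the day component of each date is a placeholder, and the helper name is hypothetical.

```python
from datetime import date


def months_between(start: date, end: date) -> int:
    """Whole calendar months from `start` to `end`, ignoring the day."""
    return (end.year - start.year) * 12 + (end.month - start.month)


# First laboratory confirmation of ZIKV in Colombia (September 2015).
confirmation = date(2015, 9, 1)

# Clade 1 introduction: point estimate and 95% CI bounds from the text.
clade1_estimate = date(2015, 4, 1)
clade1_ci = (date(2015, 3, 1), date(2015, 5, 1))

longest = months_between(clade1_ci[0], confirmation)
shortest = months_between(clade1_ci[1], confirmation)
central = months_between(clade1_estimate, confirmation)

print(f"Cryptic circulation: {shortest} to {longest} months "
      f"(point estimate {central})")
```

The same helper applied to clade 2 (introduced around November 2015) gives a negative interval, reflecting that this second introduction occurred after ZIKV had already been confirmed in the country.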
Transmission from Colombia to other countries
We find evidence for onward transmission from Colombia into other countries in the Americas. Clade 1 shows movement of viruses into countries that share a border with Colombia, namely Panama, Venezuela, and Peru, as well as into the Dominican Republic and Martinique (Figure 4). Clade 2 indicates movement of Colombian ZIKV into neighboring Ecuador (Figure 5). Transmission from Colombia into bordering countries seems reasonable, and these patterns agree with previously documented trends of ZIKV expansion in the Americas [7, 9, 10], but provide more detail due to the greater amounts of sequence data now available. For instance, analysis by Metsky et al [10] also supports movement of ZIKV from Colombia to Martinique and the Dominican Republic. However, without sequence data from Panama, Peru, or Venezuela, they were unable to capture spread from Colombia into these countries.
Discussion
Despite the scale of the Colombian epidemic, publicly available sequence data were limited, and no detailed genomic epidemiological analysis of ZIKV dynamics had been performed. We sought to improve genomic sampling for Colombia, and to perform a detailed genomic analysis of the Colombian epidemic. Only 12 Colombian genomes were available prior to this study. To these data we added 8 new sequences sampled broadly across Colombia, and performed a phylogeographic analysis of American ZIKV. We describe general transmission patterns across the Americas and present estimates of ZIKV introduction timing and frequency specific to Colombia. We find evidence of at least two introductions of ZIKV to Colombia, yet remarkably the majority of Colombian viruses cluster within a single clade, indicating that a single introduction event caused the majority of ZIKV cases in Colombia. ZIKV dispersal out of Colombia also appears widespread, with movement to bordering countries (Panama, Venezuela, Ecuador, and Peru) as well as more distal countries in the Caribbean.
While it may be tempting to read the inferred phylogeographic migration history as a complete record of transmission between countries, we caution against doing this for analyses of ZIKV. In contrast to other large outbreaks, such as the Ebola epidemic in West Africa, genomic sampling of the American ZIKV epidemic is sparse. Many ZIKV-affected countries have minimal genomic sampling; others have none at all. Thus while the phylogeographic model will correctly infer the geographic location of internal nodes given the dataset at hand, adding sequences from previously unsampled countries may alter migration histories such that apparent direct transmission from country A to country C instead becomes transmission from country A to country B to country C.
We imagine that further genomic sampling of ZIKV from Colombia, and from other unsampled countries, would likely reveal a greater number of introductions. Our finding of two separate introductions should be considered a lower bound. Consistent with other studies, our estimates of when these introductions occurred support cryptic ZIKV transmission prior to initial case confirmation. Perhaps more surprisingly, our estimate of the age of clade 1 indicates that ZIKV likely spread to Colombia even before official confirmation of ZIKV circulation in Brazil [37]. These findings underscore the utility of genomic epidemiology to retrospectively date introduction and transmission patterns that are difficult to detect using traditional surveillance methods, providing more accurate definitions of the population at risk.
Acknowledgements
We thank R. Shabman, B. Pickett, P. Rahal, L. Karan, R. Delgado, A. Enfissi, N. Grubaugh, and R. Lanciotti for giving us permission to include unpublished genomes available on GenBank in our analysis. We thank Adam Geballe for generously loaning space in his laboratory at the Fred Hutch. A.B. is supported by the National Science Foundation Graduate Research Fellowship Program under Grant No. DGE-1256082. L.H.M., D.P.R., M.E.H., and I.M.L. are supported by NIH U54 GM111274. This work was partially supported by Grupo de Virología, Dirección de Redes en Salud Pública, Instituto Nacional de Salud, Bogotá D.C., Colombia. C.T. and J.D.R. are funded by Gobernación de Córdoba, Sistema General de Regalías (SGR), Colombia, Grant No. 754/2013 and by Dirección de Investigación e Innovación from Universidad del Rosario. T.B. is a Pew Biomedical Scholar and is supported by NIH R35 GM119774-01.