ABSTRACT
Background In developed countries, human cytomegalovirus, HCMV, is a major pathogen in congenitally infected people and immunocompromised individuals, where multiple-strain infections link with disease severity. The situation is less understood in developing countries. In Zambia, breast milk, a key route for transmission, carries higher HCMV loads in HIV-positive than HIV-negative women. We investigated whether strain diversity was also higher.
Methods Strain diversity in breast milk obtained 4-16 weeks post-partum from 15 HIV-positive and 7 HIV-negative women was analysed by high-throughput sequencing, comparison to 100+ reference genomes and genotyping of hypervariable genes.
Results Multiple-strain infections were detected in 100% of HIV-positive and 43% of HIV-negative women, showing a raised strain-diversity burden with HIV (p=0.005) and up to 6 strains, present in serial samples. There were 95 genotypes in 12 hypervariable genes, combined in 30 strains; these were conserved within individuals, but gave potential for billions of recombinants. Genetic linkage was maintained strongly for adjacent genes UL73/UL74 (encoding entry/exit glycoproteins gN/gO), and RL12/RL13/UL1 (encoding immunomodulatory glycoproteins), but not for other nonadjacent genes.
Conclusions Breast milk is infected with multiple strains of HCMV in HIV-infected women in Zambia. The complexity provides capacity for generating large numbers of recombinant strains and a major source for transmitting diversity.
INTRODUCTION
Human cytomegalovirus (HCMV) is a major coinfection in HIV-positive people, in whom as in other immunocompromised individuals such as transplant recipients, it contributes to morbidity and mortality. HCMV is also the most frequent congenital infection, where it causes hearing loss and adverse neurodevelopment, including microcephaly and neonatal morbidity. Postnatal infection, generally by milk in breastfeeding populations, is usually asymptomatic. However it has been linked to morbidity, especially in preterm or underweight infants, and in recent population studies, to adverse developmental effects, especially with HIV exposure in developing countries [1–4]. The most severe HCMV infections, whether due to primary infection or reactivations from latency, can result in end-organ disease, including retinitis, pneumonitis, hepatitis or enterocolitis [5]. Most studies of the diversity, transmission and epidemiology of HCMV have focused on developed countries. Less is known regarding HCMV in developing countries, including those with a high burden of endemic HIV.
HCMV (species Human betaherpesvirus 5) has a double-stranded DNA genome of 236 kbp containing at least 170 protein-coding genes [6]. Like other human herpesviruses, HCMV exhibits overall low diversity among strains, except for a number of genes that are hypervariable but exist as distinct, stable genotypes. These genes encode proteins vulnerable to immune selection, including the virus entry complex, other glycoproteins and secreted proteins. The recombinant nature of HCMV strains was first identified in serological surveys and then in genomic studies, and is a key consideration for vaccine development [7–16]. Understanding the roles of HCMV diversity is at an early stage [17–19], and is limited by the fact that most analyses have focused on only a few hypervariable genes characterised by polymerase chain reaction (PCR)-based genotyping [7, 12, 20], which is relatively insensitive to multiple-strain infections resulting from de novo infection or reactivation from latency. High-throughput sequencing studies have facilitated broader analyses [10, 11, 14, 16, 21], but most have utilised strains isolated in cell culture, which are prone to strain loss and mutations in surviving strains, or have depended on direct sequencing of PCR amplicons generated from clinical specimens [11, 14, 16, 22]. Recent studies of whole-genome diversity have avoided these limitations by using target enrichment to enable direct sequencing of HCMV strains present in clinical specimens from subjects with various health problems (mainly congenital or transplantation-associated) in developed countries [21, 23]. We have applied this approach to examine HCMV diversity in breast milk from women in Zambia, an HIV-endemic population in Africa in which we have previously shown the negative developmental effects of early infant infection with HCMV, particularly with HIV exposure [1, 3].
METHODS
Patients and Specimens
Anonymised breast milk samples were collected with informed consent as a substudy of the Breast Feeding and Postpartum Health (BFPH) study conducted at the University Teaching Hospital (UTH), Lusaka, Zambia, as approved by the ethical committees of UTH and the London School of Hygiene and Tropical Medicine. This sub-study included 29 HCMV-positive breast milk samples available from 22 women (15 HIV-positive and 7 HIV-negative) at 4-16 weeks post-partum for which HCMV loads were determined [3].
Nucleic Acid Extraction and Viral Load Quantitation
DNA was extracted from 200 μl breast-milk using the QIAamp® DNA Mini kit (QIAGEN, Manchester, UK), as described [3]. HCMV load quantification used a HCMV gB TaqMan assay run on the Applied Biosystems® 7500 Fast Real-Time PCR System (Applied Biosystems, Foster City, CA, USA) as described [3].
Next-Generation Sequencing
The SureSelectXT v. 1.7 target enrichment system was used with biotinylated cRNA probe baits (Agilent, Stockport, UK) to prepare Illumina sequencing libraries, as described previously [21]. The libraries were sequenced using an Illumina MiSeq with v. 3 chemistry (Illumina, San Diego, CA, USA) to generate paired-end raw FASTQ reads of 300 nucleotides (nt). Raw FASTQ NGS reads were quality checked using FASTQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) and trimmed to 100 nt minimum using Trimmomatic [24] or Trim Galore v 0.4.0 (http://www.bioinformatics.babraham.ac.uk/projects/trim_galore/).
Variant Detection
Trimmed reads were optimised for de novo assembly using VelvetOptimiser for reference-independent assembly with Velvet [25] and ABACAS [26], and resulting contigs verified by reference mapping using BWA, SAMtools / BCFtools [27], then GATK [28] for alignment, indexing, mapping, and variant calling. Artemis was used for visualisation [29]. Variants from the consensus were defined as >50% prevalence [21, 30]. Mixed infections with HCMV strains were identified first by mapping reads against the genotype-specific sequences of UL73 and UL74 hypervariable genes in eight genotype specific reference genomes (Supplementary Table 1), followed by ten further hypervariable genes (see below) as defined from patient cohorts with samples from blood, urine or saliva [7–10, 12, 17]. Variant nucleotides were defined as follows: prevalence <50%, overall read depth ≥50, average nucleotide quality ≥30, variant frequency ≥1% for read depths >1000 and >10% for read depths >100, and minimum SNP depth ≥10 [21, 30]. A major genotype switch in serial samples was defined by prevalence exceeding 50% [21].
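The variant-nucleotide cut-offs listed above can be read as a simple filter. The Python sketch below is illustrative only and is not the pipeline used in the study; the function name `passes_variant_filter` and its arguments are hypothetical, and applying the >10% frequency rule to every read depth of 1000 or less is an assumption, since the text does not state the rule for depths between 50 and 100.

```python
def passes_variant_filter(depth, variant_reads, mean_quality):
    """Return True if a minority variant meets the stated cut-offs.

    depth         -- overall read depth at the position
    variant_reads -- number of reads supporting the variant nucleotide
    mean_quality  -- average nucleotide quality at the position
    """
    # Overall read depth >= 50 and average nucleotide quality >= 30.
    if depth < 50 or mean_quality < 30:
        return False
    # Minimum SNP depth >= 10 supporting reads.
    if variant_reads < 10:
        return False
    # Prevalence must be below 50%; otherwise it is the consensus base.
    if variant_reads * 2 >= depth:
        return False
    # Frequency >= 1% for read depths > 1000.
    if depth > 1000:
        return variant_reads * 100 >= depth
    # Frequency > 10% otherwise (assumption for depths of 1000 or less).
    return variant_reads * 10 > depth
```

Integer arithmetic is used for the frequency comparisons so that values sitting exactly on a threshold are not affected by floating-point rounding.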
Complete genome sequences for samples with single or dominant genomes were assembled as described [21] and deposited in GenBank under accessions MK290742-MK290744.
Strain quantification using motifs
Short motifs
FASTQ reads were quality checked using FASTQC and trimmed to 100 bases minimum using Trimmomatic [24]. Variants were verified by cut-offs: overall read depth ≥50, average basecall quality ≥30, variant frequency ≥1% for read depths over 1000 and >10% for read depths over 100, giving a minimum SNP depth of 10 as shown [21, 30]. To identify major genotype switches, a cut-off of 50% prevalence was used, as previously validated [21]. Genotype-specific nucleotide sequences using short motifs were developed guided by amino acid alignments, phylogenetic analyses and polymorphism plots in DnaSP, as demonstrated for hypervariable genes UL73 and UL74 (Supplementary Figure 1, Supplementary Tables 1–3) [7, 10, 12, 18]. ‘Short’ motifs of nucleotide sequences of 12-14 bases at three positions were identified and validated against 163 available complete genomes and over 400 single genes (GenBank Release 211, NCBI, 2015) (Supplementary Table 1). Custom Perl scripts (www.cpan.org) were used to interrogate GenBank FASTA sequences and sample FASTQ reads, allowing a maximum of 1-2 bp mismatches, to quantify mixed infections by enumerating the proportions of individual motifs in relation to read depths. The results were plotted with Microsoft Excel 2013.
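The motif-counting step can be sketched as follows. This Python example is a minimal illustration and not the Perl code used in the study: the function names, the motif sequences and the genotype labels `gA` and `gB` are hypothetical placeholders, only the forward strand is searched, and each read is counted at most once per genotype.

```python
def within_mismatches(window, motif, max_mismatches):
    """True if window and motif differ at no more than max_mismatches positions."""
    mismatches = 0
    for x, y in zip(window, motif):
        if x != y:
            mismatches += 1
            if mismatches > max_mismatches:
                return False
    return True


def read_contains(read, motif, max_mismatches=1):
    """Slide the motif along the read and test every window of the same length."""
    k = len(motif)
    return any(
        within_mismatches(read[i:i + k], motif, max_mismatches)
        for i in range(len(read) - k + 1)
    )


def count_genotype_reads(reads, motifs, max_mismatches=1):
    """Count reads carrying each genotype-specific motif."""
    counts = {genotype: 0 for genotype in motifs}
    for read in reads:
        for genotype, motif in motifs.items():
            if read_contains(read, motif, max_mismatches):
                counts[genotype] += 1
    return counts


def genotype_proportions(counts):
    """Convert read counts to proportions of all motif-bearing reads."""
    total = sum(counts.values())
    return {g: (n / total if total else 0.0) for g, n in counts.items()}


# Toy data: placeholder 14 nt motifs and made-up reads.
motifs = {"gA": "ACGTACGTACGTAC", "gB": "TTTTGGGGCCCCAA"}
reads = [
    "NNACGTACGTACGTACNN",    # exact gA motif
    "GGACGTACGTACGAACGG",    # gA motif with one mismatch
    "CCTTTTGGGGCCCCAACC",    # exact gB motif
    "AAAAAAAAAAAAAAAAAAAA",  # no motif
]
counts = count_genotype_reads(reads, motifs)
proportions = genotype_proportions(counts)
```

A production version would also search the reverse complement of each read and would need motifs chosen so that no two genotypes fall within the allowed mismatch distance of each other.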
Long motifs
The proportions of genotypes of UL73 and UL74, with a further ten hypervariable genes (RL5A, RL6, RL12, RL13, UL1, UL9, UL11, UL120, UL146 and UL139), were estimated by counting perfect matches to ‘long’ motifs (24 nt with one exception at 20 nt) specific to each recognised genotype [7–10, 12, 17]. These motifs were developed from DNA sequence alignments (Suarez et al, submitted accompanying). A genotype detected in >10 reads and >2% of the total number of reads detected for all genotypes was scored as being present, and the number of strains in the sample corresponded to the greatest number of genotypes detected for at least two genes. Pie charts were created for all hypervariable genotypes using a Perl script filtered using sensitivity cut-offs, with the FASTQ reads for each genotype then calculated as a percentage and plotted.
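The scoring rule above can be sketched in Python as shown below. This is a hedged illustration rather than the study's script: the function names and read counts are invented, and "the greatest number of genotypes detected for at least two genes" is interpreted here as the largest n for which at least two genes each show n or more genotypes, which is an assumption about the intended rule.

```python
def detected_genotypes(read_counts, min_reads=10, min_fraction=0.02):
    """Genotypes with >min_reads reads and >min_fraction of all genotype reads."""
    total = sum(read_counts.values())
    return [
        genotype
        for genotype, n in read_counts.items()
        if n > min_reads and n > min_fraction * total
    ]


def estimate_strain_number(per_gene_counts):
    """Largest genotype count supported by at least two genes (0 if < 2 genes)."""
    numbers = sorted(
        (len(detected_genotypes(counts)) for counts in per_gene_counts.values()),
        reverse=True,
    )
    # The second-largest value is the largest n reached by at least two genes.
    return numbers[1] if len(numbers) >= 2 else 0


# Made-up read counts per long motif; genotype labels are placeholders.
sample = {
    "UL73": {"gN1": 900, "gN4c": 80, "gN3a": 5},
    "UL74": {"gO1a": 850, "gO1c": 120, "gO3": 15},
    "RL12": {"a": 500, "b": 300, "c": 200},
    "UL146": {"g1": 1000},
}
n_strains = estimate_strain_number(sample)
```

In this toy sample RL12 shows three genotypes, but only one gene supports three, so the estimate is two strains.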
Statistical Analysis
GraphPad Prism (v. 6.05; GraphPad Software, La Jolla, CA, USA) and Microsoft Excel 2013 were used for data analysis. Fisher’s Exact test was used to analyse contingency tables, Student’s t test to analyse sample means and Mann-Whitney U test for medians, with significance determined at p-values <0.05.
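As an illustration of the contingency-table analysis, the Python sketch below implements a two-sided Fisher's exact test using only the standard library. It is not the GraphPad Prism procedure used in the study. The example table is inferred from the percentages given in the abstract (15 of 15 HIV-positive and 3 of 7 HIV-negative women with multiple-strain infections); that these exact counts underlie the reported p-value is an assumption.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more probable than the observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Probability of x counts in the top-left cell given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    return sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)  # tolerance for float ties
    )


# Rows: HIV-positive, HIV-negative; columns: multiple-strain, single-strain.
p_value = fisher_exact_two_sided(15, 0, 3, 4)
print(round(p_value, 4))  # 0.0048
```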
RESULTS
Genotypic structure of the UL73-UL74 locus
Our previous analyses using Sanger sequencing at multiple time points post-partum pointed to the presence of multiple-strain HCMV infections in breast-milk [3]. We explored this further by using the differences in sequence between the genotypes of hypervariable genes to estimate the number and proportions of strains present in each sample, taking two approaches. The first focused on adjacent genes UL73 and UL74, as our previous studies had shown that these genes are markedly hypervariable, are almost always genetically linked, and exist as eight genotypes each in breast milk samples. Short motifs capable of identifying individual genotypes were developed, based on one motif in UL73 and three motifs in UL74. The second approach extended the analysis to ten additional hypervariable genes, and involved the use of a single long motif for each gene.
To establish a foundation, the amino acid sequences of UL73 and UL74 were extracted from a large set of GenBank sequences and analysed phylogenetically at the amino acid sequence level (Figure 1). Three recombinant sequences were omitted (highlighted in Supplementary Tables 2–4). This analysis confirmed the existence of eight genotypes for each gene, indicated the high levels of inter-genotypic diversity (25-55% amino acid identity) and low levels of intra-genotypic diversity (<3%) [7, 12], and demonstrated the strong genotypic linkage. It also supported the inference of ancestral recombination giving rise to the gN4c and gO1c linkage as shown in the comparison of the mirrored phylogenetic trees (Figure 1).
Strain identification using sequence motifs
Having established a comprehensive view of UL73 and UL74 genotypes, we developed short sequence motifs capable of identifying them individually. A 14 nt motif was identified in a region near the 5’-end of each UL73 genotype, and 12-13 nt motifs in three regions of each UL74 genotype (Supplementary Table 1). These motifs were used to analyse the datasets derived from the 29 samples. The relative frequencies of genotypes in the datasets were similar to those in the GenBank sequences, which lacked contributions from African samples and from breast milk (Figure 1). Use of short motifs also allowed the proportions of individual genotypes in a multiple-strain infection to be estimated, and showed similar results for samples from the right and left breasts (Figure 2).
The analysis was then extended to a further ten hypervariable genes in addition to UL73 and UL74, using a single long sequence motif from each gene (Supplementary Table 5). As with the short motifs, this allowed the number of strains in a sample to be estimated, but reduced the possibility that different strains might by chance have the same UL73 and UL74 genotypes. Nonetheless, there was general agreement between the two approaches in estimates of the number of strains present (Table 1 and Supplementary Figure 2, comparing short motifs and long motifs methods in inner and outer circles). In the analysis, the major strain was identified as being present in >50% of the reads, a minor variant in 10-50%, and microvariants in 1-10% at detection limits.
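The prevalence classes used in this analysis can be written as a small classifier. The Python sketch below is illustrative; the function name is hypothetical, and the handling of values that fall exactly on a boundary (50%, 10% or 1%) is an assumption, since the text gives only the ranges.

```python
def classify_variant(fraction):
    """Classify a genotype by its fraction of reads (0.0 to 1.0)."""
    if fraction > 0.50:
        return "major"
    if fraction >= 0.10:
        return "minor"
    if fraction >= 0.01:
        return "microvariant"
    return "below detection"
```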
Motif analyses show diversity raised with HIV
HIV-positive and HIV-negative samples were then compared for HCMV loads and strain diversity. This showed that HIV-positive mothers had higher median HCMV loads at week 16, as shown previously [3], together with significantly raised levels of mixed-strain infections shown here (Tables 1 and 2), with twice the average number of strains detected in mixed infections. Up to six strains were detected per sample (Tables 1–3). Given the complexity of the mixed infections (Table 3), only single infections or predominant infections could be compiled using de novo assembly. These were primarily from HIV-negative mothers, demonstrating the first HCMV genomes from a normal donor and from Africa. Genome organisation was similar to HCMV from elsewhere, except for diversity in the 12 hypervariable genes.
Analyses of frequencies of genotypes, their linkages and recombination
Previous genomic analyses have shown evidence for recombination throughout the HCMV genome, while most regions are conserved. Genetic linkage has rarely been observed, but evidence has been shown for adjacent genes UL73/gN and UL74/gO [12, 17, 31], as confirmed above. In addition, the UL11 gene family shows some degenerate linkage through conservation of the IgG binding domain in this gene group [9, 10], as confirmed through linkage disequilibrium analyses [11]. We reasoned that any further genetic linkage in this cohort could be tested by demonstrating similar frequencies of genotypes across the samples. Individual strains were identified from genotypes showing the same relative prevalence in each donor. A haplotype-virotype, composed of the genotypes of the 12 hypervariable genes, could therefore be constructed for each strain. Both major and minor strain variants in individuals were assembled. Microvariants, at cut-offs between 1% and 10%, were not included: although they were detected and indicate the burden of mixed-strain infection in an individual, they could not be used to construct a haplotype because at this coverage they would not be captured equally at each locus. In Table 4, major and minor variant strains are tabulated, and the individual frequencies of each genotype for the 12 hypervariable genes are indicated with a prevalence heat map. Of 110 possible genotypes, 95 were detected in the 30 strains identified. The results show that UL73/gN and UL74/gO are most closely linked, followed by RL12-RL13-UL1, with no evidence for any further genetic linkage in this cohort. Each strain identified within and between individuals represented a unique haplotype-virotype. Most of the genotypes were identified in this cohort, and unexpectedly each strain mixed rare and common haplotypes.
Assuming free recombination, multiplying the extant numbers of genotypes per hypervariable gene potentially generates 2.5 × 10^11 different ‘haplotype-virotype’ strains. However, linkage at UL73/gN-UL74/gO and RL12-RL13-UL1 could reduce this by at least 2 logs. For example, in this cohort alone there were 30 out of a potential 2 billion strains.
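The combinatorial estimate can be sketched as a product over genes. The Python example below is a toy illustration only: the per-gene genotype numbers for all 12 genes are not listed in this text, so the example uses the eight genotypes each of UL73 and UL74 stated in the Results plus a hypothetical placeholder gene, `GENE_X`, with an invented count. Treating a perfectly linked group as contributing as many haplotypes as its most diverse member is also an assumption.

```python
from math import prod


def potential_haplotypes(genotype_counts, linked_groups=()):
    """Number of genotype combinations, optionally collapsing linked genes.

    genotype_counts -- dict mapping gene name to number of genotypes
    linked_groups   -- iterable of tuples of gene names inherited together
    """
    linked = {gene for group in linked_groups for gene in group}
    # Unlinked genes recombine freely, so their genotype numbers multiply.
    free = prod(n for gene, n in genotype_counts.items() if gene not in linked)
    # Assumption: a linked group behaves like a single locus.
    grouped = prod(
        max(genotype_counts[gene] for gene in group) for group in linked_groups
    )
    return free * grouped


# UL73 and UL74 counts are from the text; GENE_X is a made-up placeholder.
toy_counts = {"UL73": 8, "UL74": 8, "GENE_X": 5}
free_total = potential_haplotypes(toy_counts)
linked_total = potential_haplotypes(toy_counts, linked_groups=[("UL73", "UL74")])
```

In this toy case, linkage of UL73 and UL74 reduces the number of combinations from 320 to 40, which shows how linked loci shrink the space of possible strains.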
Conservation of strain haplotypes within individuals
Having established the diversity of strains between individuals, we examined diversity within individuals and its maintenance. Donors with samples taken at both week 4 and week 16, or sampled from both left and right breasts, were examined further using the same heat map for analysing genotype prevalence. Table 5 shows that, despite the diversity of genotypes in unique strains between individuals, both single strains and mixtures of major and minor variant strains are maintained within individuals, both between compartments and longitudinally to 16 weeks post-partum. In donor 243 there is complete conservation of the strains between 4 and 16 weeks post-partum and between samples from either breast. In the other donors, there is little evidence for recombination of major variant with minor variant genotypes. While HIV-positive donors 174 (at the RL5A locus in strains in the left breast sample) and 259 (at UL120 in strains from the right breast sample) provide some evidence of recombination, these are microvariants at detection limits. Further, in donor 278 the apparent recombination between the linked RL12-RL13-UL1 loci, in the linked ‘1’ and ‘9’ genotypes at RL13, shown at both 4 and 16 weeks post-partum and in both breast samples, is due to a single SNP. Further analyses in comparative pie charts confirm conservation of strains within individuals but diversity between individuals (Supplementary Figure 3). Microvariants were detected but, at this coverage, could be reactivations or reinfections post-partum. All the major and minor variant genotypes were detected by 4 weeks post-partum, with no evidence for new infections by 16 weeks.
DISCUSSION
Genomic analyses of HCMV directly in clinical specimens are required in order to characterise natural populations and avoid the mutations prevalent in tissue culture-adapted virus. Target enrichment approaches have been successful [21, 23], but accurate genome assembly can be confounded by mixed-strain infections in clinical samples, particularly in immunosuppressed groups where multiple virus reactivations or new infections have been demonstrated [17, 21]. Here we examined breast milk samples as a main source for HCMV transmission, comparing HIV-immunosuppressed mothers with uninfected, normal mothers. Previously we showed that HIV-positive mothers had higher HCMV loads in breast milk than HIV-negative mothers, which was associated with adverse infant development in sub-Saharan Africa [1, 3]. Despite the raised virus burden, there are only limited genome analyses from milk or from Africa. Here we studied genome diversity in these samples, demonstrating raised levels of mixed-strain infections in addition to high viral load.
We used modular analyses of 12 hypervariable genes and identified 95/110 (86%) of the possible genotypes in the 29 samples from the 22 donors of this cohort alone. Up to six strains were identified in an individual using these analyses. Strains consisted of distinct genotypes in the 12 hypervariable genes, forming virus haplotype-like ‘virotypes’, giving 30 African strains that were distinct from 17 European strains we identified in related studies (Suarez et al, submitted). We demonstrated significantly increased mixed-strain infections in Zambian HIV-positive women in addition to high viral load. This was present from the earliest milk sample, at 4 weeks post-partum, and was similar in milk samples from both left and right breasts. This indicates that these variants were present in the donor prior to virus reactivation in the breast tissue during lactation. Leukocyte infiltrates have been characterised at this time [32] and may be the source of the reactivated virus.
Although there were high levels of mixed-strain infections, there was little evidence for de novo infection during the time period analysed to 16 weeks post-partum. Major and minor strains, comprising the genotype mixtures at the 12 loci, were conserved within individuals, but highly diverse between individuals. Intra-genotypic recombination was rare, in agreement with previous analyses on low-passage samples, with decreased opportunity for homologous recombination at these highly variable genes [11, 16]. However, there was an ancestral intragenic recombination event modelled previously in the UL74 C-terminal domain, giving rise to UL74 gO1c, which may drive the positive selection observed [12]. The adjacent UL73/gN and UL74/gO show high gene diversity, to 55% [7, 12, 21], and the strongest linkage in this cohort and Genbank (NCBI), followed only by the gene family RL12-RL13-UL1. All other variable genotypes were unlinked. In addition to restraints on homologous recombination due to their diversity, there may be functional restraints against recombinants, as both UL73/gN and UL74/gO have roles in virus exocytosis, cellular tropism and modulating antibody neutralisation and RL12-RL13-UL1 in immune evasion via functional IgG binding domains [9, 13, 33–36]. In addition, mutations in RL13 may affect UL74 influence on HCMV growth [37].
Although variant data could be compiled for all samples, assembly of complete genomes was confounded by the multiple-strain infections. Assembly was possible with single or dominant variants in three donors. To our knowledge this includes the first complete HCMV genome from Africa and a ‘normal’ donor from breast milk, a route of infant virus transmission.
There was a notable absence of tissue culture-adapted or known antiviral-resistance mutations [14, 38]. Previous studies of interstrain diversity did not include analyses of the UL73 and UL74 glycoproteins [11, 16], possibly because their diversity initially confounded assembly, which has now been corrected in GenBank. Recent studies on transplantation patients used UL73/UL74 as diversity references in deep and whole-genome sequencing [17, 21].
The diversity of the strains indicates ancestral modular recombination, but there was little contemporary evidence for this using the current methods and time-span. Further, there was no evidence for strain replacement, as shown over longer periods in transplantation patients [17, 21]. Microvariant genotypes (<1-10%) present at the latest time point, 16 weeks post-partum, support de novo infection, but coverage depth was not sufficient to exclude reactivation from the 4 week sample. The mothers in this cohort all had young children at home who may be sources of HCMV excretion, and all mothers were HCMV positive; there were therefore opportunities for home and hospital infection. A study in Uganda has shown that mothers can be infected with strains from other children [39].
The varied genotypes in the strains are in virus glycoproteins and immunomodulatory genes, where differences in genotype could provide a growth advantage leading to higher viral loads and pathology. For example, different genotypes of UL74/gO show differences in growth properties [40], and those of the UL146/vCXCL1 chemokine have different efficiencies in neutrophil chemotaxis [41]. Similarly, variation in the host, for example in immunoglobulin variants GM3/17, affects antibody-dependent cellular cytotoxicity against HCMV via low and higher Fc-gamma affinity; of note, RL13 binding to GM17 would be higher in this cohort and correlates with increased susceptibility [38, 42–45]. Additional variable viral glycoproteins could participate in antigenic variation. The modular diversity shown here provides a new model for persistent virus infection, beyond previous SNP-based antibody-escape mutant analyses for acute infections such as influenza or poliovirus. It is notable that the potential combinatorial diversity between genotypes at the 12 loci, at 10^12 combinations assuming equal recombination rates, exceeds some estimates for immune diversity.
Previous studies on transplantation patients have also shown that multiple infections are associated with increased viral loads, as well as with pathology such as HCMV disease, graft rejection and other coinfections, and may delay virus clearance [18, 20]. In contrast, analyses of HCMV infections in urine from newborns show primarily single-strain genomes, suggesting selection at transmission across the placenta or postnatally by breast milk or saliva [46]. For example, estimates of saliva transmission indicate that only a few virions establish infection [47]. Infant transmission via breast milk increases with viral load and length of exposure [3, 48], and raised mixed-strain infections in HIV-positive women could facilitate their selection.
We showed a high burden of infection by identifying mixed strains together with raised viral load, and both could modulate HCMV pathology. Previously, we demonstrated that high viral load and extended breastfeeding were linked with increased transmission, and that this was associated with poor infant development. Determining whether these are direct or indirect immune-modulation effects would require further analyses of genome diversity directly from both clinical material and normal asymptomatic donors, as shown here. Analyses of effects on the functions of these key hypervariable genes, which encode essential glycoproteins and immunomodulatory factors, are required to understand variation and disease, as well as effects on new interventions such as vaccines.
Acknowledgements
We thank the participants and clinical staff for facilitating the BFPH study and also Dr L Kasonka, University Teaching Hospital, Lusaka, Zambia and Professor S Filteau, London School of Hygiene and Tropical Medicine, for enabling the follow up analyses. We also thank Dr S. Camiolo, MRC-University of Glasgow Centre for Virus Research, for assisting with data deposition and Drs Taane Clark and Jody Phelan, LSHTM, for facilitating UNIX cluster access and Perl support.
Footnotes
* Present affiliation: Public Health England, Porton Down, Wiltshire, United Kingdom
Financial support. This work was supported by the Commonwealth Scholarship Commission, a Bloomsbury Studentship Award and the Medical Research Council (programme grant MC_UU_12014/3).
Potential conflicts of interest. The authors report no conflicts of interest in this study.