ABSTRACT
The genomic characteristics of human cytomegalovirus (HCMV) strains sequenced directly from clinical material were investigated, focusing on variation, multiple-strain infection, recombination and natural mutation. A total of 207 datasets generated in this and other studies using target enrichment and high-throughput sequencing were analysed, in the process facilitating the determination of genome sequences for 91 strains. Key findings were that (i) it is vital to monitor sequence data quality, especially when analysing intrahost diversity, (ii) intrahost diversity in single-strain infections is much less than that in multiple-strain infections, (iii) many recombinant strains have been generated and transmitted during HCMV evolution, and some have survived for thousands of years without further recombination, and (iv) mutants lacking gene functions have been circulating and recombining for long periods and can cause congenital infection and resulting clinical sequelae. Future studies in general populations are likely to continue illuminating the evolution, epidemiology and pathogenesis of HCMV.
BACKGROUND
Human cytomegalovirus (HCMV) poses a risk to people with immature or compromised immune systems, and can have serious outcomes in unborn children, transplant recipients and people with HIV/AIDS. Prior to the advent of high-throughput technologies, studies of HCMV genomes in natural infections were limited to Sanger sequencing of PCR products, often focusing on a small number of polymorphic (hypervariable) genes [1]. This not only ignored most of the genome, but also made it difficult to identify and characterise multiple-strain infections, which may have more serious outcomes than single-strain infections.
The first complete HCMV genome to be sequenced was that of the high-passage strain AD169, by Sanger sequencing of a set of plasmids [2]. It was over a decade before additional genomes were sequenced, also by Sanger technology, in the form of bacterial artificial chromosomes [3-5], virion DNA preparations [6] and PCR amplicons [7, 8]. These sequences were complemented by many others, most determined by high-throughput methods [7, 9-13].
With three exceptions [7, 11], all of these sequences were derived from strains that had been isolated in cell culture. Mounting data on the existence of multiple-strain infections and the propensity of HCMV to mutate during isolation [6, 7, 8, 14, 15] added impetus to sequencing genomes directly from clinical material. One strategy involves sequencing overlapping PCR amplicons [7, 16]. An alternative involves generating random DNA fragments from clinical material, amplifying them by PCR, and hybridising them to an oligonucleotide bait library representing known HCMV diversity. This target enrichment technology originated in commercial kits for cellular exome sequencing, and has been applied to various pathogens [17, 18], including HCMV [19-21]. We have used it since 2012, and have released many genome sequences via GenBank that have been pivotal in other studies [11, 12, 19-21].
High-throughput sequencing has highlighted several features of the HCMV genome that had been discovered earlier, including variation and hypervariation [22, 25], multiple-strain infection [23], recombination [24, 25], and gene loss by natural mutation [26]. HCMV is the most variable of the human herpesviruses [12], and hypervariation of a subset of genes exists in the form of constrained genotypes that may be used to explore genome form and function. We sought to extend this process on a whole-genome basis to HCMV strains as they exist in clinical situations rather than in the laboratory.
METHODS
Samples
For convenience, samples were analysed as three collections, which are detailed in Supplementary Tables 1-3 and summarised in Table 1. Collection 1 originated from congenital infections from Pavia, Jerusalem and Prague. Collection 2 originated from Hannover and Pavia, and most came from transplant recipients. Collection 3 comprises samples obtained by others in previous studies from people with various conditions; these were sequenced in those studies using the approach employed here, although with a different oligonucleotide bait library. The preliminary features of the samples and datasets are in Supplementary Tables 1-3 rows 3-7, and the clinical outcomes of congenital infections are in Supplementary Table 1 row 207.
DNA sequencing
Target capture and library preparation were performed using the SureSelectXT v. 1.7 target enrichment system for Illumina paired-end sequencing libraries with biotinylated cRNA probe baits (Agilent, Stockport, UK) [21]. Custom bait libraries representing known HCMV diversity were designed in February 2012 and April 2014 from 31 and 64 complete genome sequences, respectively. Access to the latter library is available from the corresponding author. Data on viral loads and library construction are in Supplementary Tables 1-3 rows 9-12.
Datasets of 300 or 150 nt paired-end reads were generated using a MiSeq instrument (Illumina, San Diego, CA, USA), and prepared for analysis using Trim Galore! v. 0.4.0 (http://www.bioinformatics.babraham.ac.uk/projects/trim_galore/; length=21, quality=10 and stringency=3). The numbers of trimmed reads are in Supplementary Tables 1-3 row 15.
Library diversity
For each dataset, the number of reads derived from unique HCMV fragments was estimated by using Bowtie2 v. 2.2.6 [29] to align the reads against the strain Merlin sequence (GenBank accession AY446894), and also, where it could be determined, the consensus genome sequence derived from the dataset (see Table 1 for details). The relevant data are in Supplementary Tables 1-3 rows 17-19 and 24-27. Reads containing insertions or deletions were removed from the SAM alignment file, and duplicate read pairs sharing both end coordinates, or duplicate unpaired reads sharing one end coordinate, were removed, producing an alignment file for unique reads derived from unique HCMV fragments (https://centre-for-virus-research.github.io/VATK/AssemblyPostProcessing). This file was viewed using Tablet v. 1.14.11.7 [30]. The final and initial coverage depth values, and their ratio (expressed as a percentage), are in Supplementary Tables 1-3 rows 20-22 and 28-30. The ratio ranged from 0 to 100, with higher values indicating more diverse libraries derived from greater numbers of unique HCMV fragments.
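The deduplication and coverage-ratio steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the VATK implementation: the record layout (a tuple of start and end reference coordinates per aligned fragment), the function names, and the simplification that each fragment covers its whole span are all hypothetical choices made for the example.

```python
from typing import Iterable, List, Tuple

# A fragment is represented by the (start, end) reference coordinates of an
# aligned read pair. This layout is hypothetical; a real pipeline would parse
# SAM records and would first discard reads containing insertions or deletions.
Fragment = Tuple[int, int]


def unique_fragments(fragments: Iterable[Fragment]) -> List[Fragment]:
    """Keep one representative for each distinct pair of end coordinates."""
    seen = set()
    kept = []
    for frag in fragments:
        if frag not in seen:
            seen.add(frag)
            kept.append(frag)
    return kept


def mean_depth(fragments: Iterable[Fragment], genome_length: int) -> float:
    """Mean coverage depth: aligned bases divided by reference length."""
    aligned = sum(end - start + 1 for start, end in fragments)
    return aligned / genome_length


def coverage_ratio(fragments: List[Fragment], genome_length: int) -> float:
    """Final (deduplicated) depth as a percentage of the initial depth."""
    initial = mean_depth(fragments, genome_length)
    if initial == 0:
        return 0.0
    final = mean_depth(unique_fragments(fragments), genome_length)
    return 100.0 * final / initial


# Toy data: four fragments, two of which duplicate the first.
toy = [(1, 100), (1, 100), (1, 100), (51, 150)]
ratio = coverage_ratio(toy, genome_length=1000)
```

In this toy case half of the aligned bases survive deduplication, so the ratio is 50%. A library built from few unique fragments would give a ratio close to 0, and one with no duplicates would give 100.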
Strain enumeration
The numbers and proportions of strains represented in each dataset were estimated by two strategies: genotype read-matching and motif read-matching (https://centre-for-virus-research.github.io/VATK/HCMV_pipeline). Both utilised datasets concatenated from the paired-end datasets. The genotype designations used were either based on reported phylogenies [6, 25, 31-33], amending or extending them as appropriate, or were constructed afresh using Clustal Omega v. 1.2.4 [34] and MEGA v. 6.0.6 [35] with data for the genomes listed in Supplementary Table 4 and individual genes for which additional sequences were available in GenBank. Alignments are in Supplementary Figure 1.
For genotype read-matching, Bowtie2 was used to align the reads to sequences representing the genotypes of two hypervariable genes, UL146 and RL13 [6, 12, 36]. The sequences are in Supplementary Tables 1-3 rows 36-60, and represent the entire coding region of UL146 and the central coding region of RL13. In contrast to the UL146 genotypes, the RL13 genotypes cross-matched to some extent in four groups (G1, G2, G3; G4A, G4B; G6, G10; and G7, G8). In these instances, the genotype within the group with most matching reads was scored. The numbers of reads aligned to each genotype are also in Supplementary Tables 1-3 rows 36-60. A genotype was scored if the number of reads was >10 and also represented >2% of the total number detected for all genotypes of that gene. For 14 samples in collection 1 that had been sequenced prior to the availability of ultrapure (TruGrade) oligonucleotides, these values were set at >25 and >5%, respectively. The number of strains in a sample was scored as the greater of the numbers of genotypes detected for the two target genes, and is in Supplementary Tables 1-3 row 13.
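The scoring rule for genotype read-matching can be illustrated with a short sketch. The function names and the read counts below are hypothetical and are not taken from the study; the handling of cross-matching RL13 genotype groups is omitted for brevity.

```python
from typing import Dict, List


def score_genotypes(read_counts: Dict[str, int],
                    min_reads: int = 10,
                    min_percent: float = 2.0) -> List[str]:
    """Return genotypes with > min_reads reads and > min_percent of the gene total."""
    total = sum(read_counts.values())
    if total == 0:
        return []
    return sorted(
        genotype for genotype, n in read_counts.items()
        if n > min_reads and 100.0 * n / total > min_percent
    )


def strains_by_genotype_matching(per_gene_counts: Dict[str, Dict[str, int]],
                                 **thresholds) -> int:
    """Number of strains: the greater of the genotype counts across the genes."""
    return max(
        (len(score_genotypes(counts, **thresholds))
         for counts in per_gene_counts.values()),
        default=0,
    )


# Hypothetical read counts for one dataset (illustrative only).
example = {
    "UL146": {"G1": 950, "G7": 40, "G9": 8},
    "RL13": {"G2": 1200, "G5": 15},
}
n_strains = strains_by_genotype_matching(example)
n_strains_strict = strains_by_genotype_matching(example, min_reads=25, min_percent=5.0)
```

With the default thresholds, two UL146 genotypes pass and only one RL13 genotype passes, so the sample is scored as containing two strains. Under the stricter thresholds applied to the earlier samples (>25 reads and >5%), the minor UL146 genotype no longer qualifies and the same counts give one strain.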
For motif read-matching, conserved genotype-specific motifs (20-31 nt) were identified manually for 12 hypervariable genes [6, 12, 19, 33]. Additional motifs were included in order to identify certain recombinants or mutants. The motif sequences and number of reads containing perfect matches to the motif or its reverse complement are in Supplementary Tables 1-3 rows 62-180. Genotypes were scored as described above. The number of strains in a sample was scored as the greatest of the numbers of genotypes detected among the target genes, with a requirement that at least this number should have been detected for at least two genes, and is in Supplementary Tables 1-3 row 14.
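A minimal sketch of motif read-matching is given below. It is not the VATK code: the function names are hypothetical, the toy motif is far shorter than the 20-31 nt motifs used in the study, and the strain-counting rule reflects one reading of the requirement above, namely the largest genotype count that is reached by at least two genes.

```python
from typing import Dict, Iterable, List

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence (A, C, G and T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def count_motif_reads(reads: Iterable[str], motifs: Dict[str, str]) -> Dict[str, int]:
    """Count reads containing a perfect match to each motif or its reverse complement."""
    targets = {g: (m.upper(), reverse_complement(m.upper())) for g, m in motifs.items()}
    counts = {g: 0 for g in motifs}
    for read in reads:
        read = read.upper()
        for genotype, (fwd, rev) in targets.items():
            if fwd in read or rev in read:
                counts[genotype] += 1
    return counts


def strains_by_motif_matching(genotypes_per_gene: List[int]) -> int:
    """Largest genotype count that is reached by at least two genes."""
    for n in sorted(set(genotypes_per_gene), reverse=True):
        if sum(1 for k in genotypes_per_gene if k >= n) >= 2:
            return n
    return 0


# Toy reads and a toy motif (illustrative only).
reads = ["ACGTTTGACCA", "TGGTCAAACGT", "GGGGGGGG"]
counts = count_motif_reads(reads, {"G1": "TTTGACC"})
n_strains = strains_by_motif_matching([1, 1, 3, 2, 2, 1])
```

Here the first read matches the motif directly and the second matches its reverse complement. In the strain-counting example, three genotypes are detected for only one gene, so the sample is scored as containing two strains, the highest number supported by at least two genes.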
Data deposition
The original read datasets (purged of human data) were deposited in ENA (project no. PRJEB29585), and the consensus genome sequences were deposited in GenBank, under the accession numbers in Supplementary Tables 1-3 rows 8 and 31, respectively. Updated genome sequences in collection 3 were deposited by the original submitters in GenBank [19] or by us as third-party annotations in ENA (project no. PRJEB29374) [20]. Features of the sequences are in Supplementary Tables 1-3 rows 32-34. Data on mutants in collection 1 are in Supplementary Table 1 rows 182-205.
Intrahost variation
Variation was examined in datasets for which a consensus genome sequence had been determined. Original datasets were prepared for analysis using Trim Galore! (length=100, quality=30 and stringency=1), and trimmed reads were mapped using Bowtie2. Alignment files in SAM format were converted into BAM format, sorted using SAMtools v. 1.3 [37], and analysed using LoFreq v. 2.1.2 [38] and V-Phaser 2 [39] under default parameters.
RESULTS
Operational limitations
The analysis involved a total of 207 datasets from 199 samples and 102 individuals (Table 1 and Supplementary Tables 1-3). The percentage of HCMV reads (target enrichment efficiency) and the coverage ratio (library diversity) tended to depend on the sample type (proportion of host DNA) and the number of genome copies used to make the library. In general, >1000 copies per library were needed to obtain data of sufficient quality to determine a complete genome sequence. However, data quality was influenced by many factors, including logistical errors, low coverage depth, low library diversity, and the apparent presence of additional strains at levels below the threshold, as a result of low-level multiple-strain infection or cross-contamination. In addition, an inability to obtain data from the entire genome, despite excellent coverage ratios, precluded determination of complete sequences from most datasets in collection 3.
Genome sequences
A total of 91 complete or almost complete HCMV genome sequences were determined (Table 1). We reported five of these previously [21], and 16 are improvements on published sequences [19]. Most originated from single-strain infections or multiple-strain infections in which one strain was predominant, and some originated from different strains that predominated in a patient at different times. Defining a strain as a virus present in an individual, these 91 sequences, plus an additional 49 deposited by our group and 104 by other groups, brought the number of strains sequenced to 244 (Supplementary Table 4). Of these, 91 were sequenced directly from clinical material, and all but one of these were determined by us. The mean size of the HCMV genome, based on the 219 complete sequences lacking sizeable deletions, is 235,514 bp.
Multiple-strain infections
Genotypic differences in hypervariable genes (Figure 1 and Supplementary Figure 1) were exploited to detect multiple-strain infections by genotype read-matching and motif read-matching, the latter proving to be the more versatile method. Single strains were present in the great majority of congenitally infected patients (n=43/50 in collections 1 and 2), whereas they were significantly less common in transplant recipients (n=9/23 in collections 2 and 3; Pearson’s chi-squared test, χ2=14.678, p<0.05).
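The test statistic can be checked from the counts quoted above. The sketch below assumes that the remaining patients in each group (7 of 50 and 14 of 23) had multiple-strain infections and that Yates' continuity correction was applied; the text does not state the latter, but with the correction the calculation reproduces the reported value of 14.678.

```python
import math


def chi2_2x2_yates(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-squared statistic for a 2x2 table with Yates' continuity correction."""
    n = a + b + c + d
    numerator = n * max(abs(a * d - b * c) - n / 2, 0.0) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator


def chi2_pvalue_1df(stat: float) -> float:
    """Upper-tail p-value of a chi-squared distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))


# Rows: congenital infections, transplant recipients.
# Columns: single-strain, multiple-strain (the latter inferred from the totals).
stat = chi2_2x2_yates(43, 7, 9, 14)
p = chi2_pvalue_1df(stat)
print(f"chi2 = {stat:.3f}, p = {p:.2g}")
```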
Recombination
The 244 genome sequences were genotyped in the 12 hypervariable genes used for motif read-matching and then in five more (Figure 1 and Supplementary Table 4).
Hypervariation in UL55, which encodes glycoprotein B (gB), is located in two regions (UL55N, near the N terminus, and UL55X, encompassing the proteolytic cleavage site) [23, 40]. Five genotypes (G1-G5) have been assigned to each region [23, 40-42], which are separated by 927 bp that are 80% identical in all strains. All genomes had one of the previously reported UL55X genotypes (Supplementary Table 5). As reported previously [40], UL55N G2 and G3 could not be distinguished reliably from each other, and two additional genotypes (G6-G7) were detected that may have arisen from ancient recombination events within UL55N (Supplementary Tables 4 and 5, and Supplementary Figure 1). There was evidence for recombination in the region between UL55N and UL55X in only eight genomes. This low proportion of recombination (3.3%) in a small region contrasts with higher levels proposed previously [40, 43], which may have been influenced by PCR-based artefacts arising from the presence of multiple strains. UL73 and UL74, which encode glycoproteins N and O, respectively, are adjacent hypervariable genes that exist as eight genotypes each [25, 32, 44]. There was evidence for recombination between them in only seven genomes (2.9%), in accordance with the low levels (4%) detected in PCR-based studies [25, 32]. In the region containing adjacent hypervariable genes RL12, RL13 and UL1, recombinants were rare (1.2%) in RL12 and absent from RL13 and UL1. In contrast, hypervariable genes UL146 and UL139, which encode a CXC chemokine and a membrane glycoprotein, respectively, are separated by a well-conserved region of over 5 kbp. The number (66) of the 126 possible genotype combinations represented in the 244 genomes is too large to allow any underlying genotypic linkage to be discerned, consistent with previous conclusions based on PCR [45]. No recombinants were noted within UL146.
In principle, strains in multiple-strain infections have the opportunity to recombine. In our previous analysis of RTR1 in collection 2, we noted that one strain predominated at earlier times and another at later times [21]. From the low frequency of variants across a large part of the genome, we concluded that the second strain had arisen either by recombination involving the first strain or by reinfection with, or reactivation of, a second strain fortuitously similar to the first. Here, recombination was strongly supported by a comparison of the two genome sequences (RTR1A and RTR1B), which showed that approximately two-thirds of the genome is almost identical (differing by three substitutions in noncoding regions), whereas the remaining third of the genome is highly dissimilar.
To investigate whether strains have been transmitted without recombination occurring, identical genotypic constellations were identified among the 244 genomes (Supplementary Table 6). This revealed the existence of 12 haplotype groups within which multiple strains exhibit no signs of having recombined since diverging from their last common ancestor; these are termed nonrecombinant strains below. These results suggest that nonrecombinant strains have been circulating, some for periods sufficient to allow the accumulation of >100 substitutions. Application of an evolutionary rate estimated for herpesviruses (3.5 × 10⁻⁸ substitutions/nt/year) [46] implies that these periods may have extended to many thousands of years. The distribution of substitutions across the genome in highly divergent groups 9 and 10 was examined in further detail. Group 9 (three strains) exhibited 135 differences, with the 50 that would affect protein coding distributed among 38 genes, and group 10 (two strains) exhibited 138 differences, with the 38 that would affect protein coding distributed among 27 genes. No obvious bias was observed towards greater diversity in any particular gene or group of genes, including those in the hypervariable category. As suggested by the lack of diversity within genotypes in comparison with the marked diversity among them (Supplementary Figure 1), these results fit with the view that intense diversification of hypervariable genes occurred early in human or pre-human history [30, 45], and has long since ceased.
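The order of magnitude of these periods can be checked with a back-of-envelope calculation. The sketch below is an illustration only: it assumes a strict molecular clock, a pair of genomes whose differences accumulated equally along both lineages (hence the factor of 2), and the mean genome size of 235,514 bp reported earlier; the function name is hypothetical.

```python
RATE = 3.5e-8            # substitutions per nucleotide per year (from the text)
GENOME_LENGTH = 235_514  # mean HCMV genome size in bp (from the text)


def divergence_years(differences: int,
                     rate: float = RATE,
                     genome_length: int = GENOME_LENGTH) -> float:
    """Years since two genomes diverged, assuming a strict molecular clock.

    Differences accumulate along both lineages, hence the factor of 2.
    """
    return differences / (2.0 * rate * genome_length)


# Two genomes differing by 100 substitutions (the threshold mentioned above).
years = divergence_years(100)
print(f"{years:.0f} years")
```

Under these assumptions, 100 substitutions between two genomes correspond to roughly 6,000 years since their last common ancestor, in keeping with the statement that the periods may extend to many thousands of years.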
Mutations
Mutations that cause premature translational termination, and therefore potentially affect gene function, have been catalogued previously in HCMV genomes [7, 11, 12, 26]. They may have resulted from substitutions that introduce in-frame stop codons or affect the conserved dinucleotide at the beginning or end of an intron, or from insertions, deletions or inversions that cause frameshifting or loss of protein-coding regions. The underlying data have been derived mostly from strains isolated in cell culture, and their interpretation has assumed that these mutations occurred naturally. For example, in a major study reporting that 75% of strains are mutated [12], 157 mutations were identified in 101 strains (100 passaged in vitro), but only 35 were confirmed in the clinical material. Nonetheless, the distribution of mutations among the 91 strains sequenced directly from clinical material appears similar to that among passaged strains (Table 2 and Supplementary Table 4).
Among the strains sequenced from clinical material, 77% are mutated in at least one gene (compared with 79% among all sequenced strains), and one is mutated in as many as six genes (Pat_D in collection 3). In clinical strains, the most frequently mutated genes are UL9, RL5A, UL1 and RL6 (members of the RL11 family), US7 and US9 (members of the US6 gene family), and UL111A (encoding viral interleukin-10) (Supplementary Figure 1). The likelihood that many of these mutations were not generated in the patients concerned but are ancient is supported by the finding that all were detected at levels very close to 100% in collection 1, and by previous observations that the same mutation is present in different strains [7, 12]. In addition, there was evidence from the PAV6 datasets for maternal transmission of a US7 mutant (Supplementary Table 1), and from PCR data (not shown) of a UL111A mutant to PAV16. Moreover, nine of the groups of nonrecombinant strains contained mutants, and some of the mutations were common to group members (Supplementary Table 6) and even to additional strains among the 244, indicating that they had been transferred by recombination. These observations again indicate the longevity of many mutations and their propagation by recombination. Focusing on the most common mutations, strains in which UL9, RL5A, UL1, US9, US7 and UL111A were affected (singly or in combination) were, like nonmutated strains, associated with congenital infection and, in some cases, defects in neurological development (Supplementary Table 1).
Intrahost diversity
Use of LoFreq and V-Phaser showed that single-strain infections contained markedly fewer variants (median values of 60 and 140, respectively) than multiple-strain infections (median values of 2444 and 2955, respectively; Figure 2). The differences between the values for single- and multiple-strain infections were assessed as being significant by the Kruskal-Wallis rank-sum test (LoFreq, Kruskal-Wallis χ2=67.918, p<2.2e-16; V-Phaser, Kruskal-Wallis χ2=63.536, p=1.6e-15). Seven outliers in single-strain infections were common to both analyses (in order of decreasing number of variants, RTR6B, CMV-37, RTR2, CMV-35, CMV-38, ERR1279054 and PAV6), one was reported by LoFreq only (PAV21), and four were reported by V-Phaser only (CMV-19, CMV-31, PRA6A and SCRT12).
DISCUSSION
Advances in high-throughput sequencing technology have made it possible to generate a wealth of viral genome information directly from clinical material. However, operational factors should be taken into account in assessing the data. These include sample characteristics (source, viral content and presence of multiple strains), confounding events (logistical errors and contamination), adequate design of the bait library and the sequencing protocol (ability to enrich all variants and acquire data evenly across the genome), and quality and extent of sequencing data (library diversity and coverage depth). Since perceived levels of intrahost variation are particularly sensitive to these factors, and may be greatly over- or underestimated as a result, we proceeded cautiously. However, as indicated in our earlier study [21], it is clear that the number of variants in single-strain infections was markedly less than that in multiple-strain infections. Moreover, it was also far less than that reported by others in samples from congenital infections [16]. The factors listed above may have been responsible for the outliers in single-strain infections, and may also have generally influenced the derived median values. In our view, accurate estimates of intrahost variation in single-strain infections are not yet available, and will require sequencing and bioinformatic approaches that are demonstrably robust, reliable and reproducible [47, 48].
Whole-genome analyses have confirmed the significant role of recombination during HCMV evolution reported in numerous earlier studies [12, 19]. Recombination has occurred over a very long period and remains limited in extent, with surviving recombination events being more common in long regions, less common in short regions, and rare in hypervariable regions, as would be consistent with homologous recombination. It is possible that recombination frequency is restricted in some circumstances by functional interdependence of regions in the same protein (e.g. gB) or separate proteins (e.g. gN and gO [30, 44]), but it is not known whether differential recombination due to sequence relatedness is of general biological significance. Also, strains have circulated that seem not to have recombined over many thousands of years. The extent to which recombinants arise and survive in individuals with multiple-strain infections is a different question, particularly in immunosuppressed transplantation patients. Except where populations fluctuate significantly and are sampled serially (e.g. RTR1 in collection 2), this is difficult to approach using short-read data, as these are based on PCR methodologies prone to generating recombinational artefacts. Long- or single-read technologies and new bioinformatic tools should contribute significantly to this area. Also, conclusions drawn from transplant recipients, who are immunosuppressed and in whom HCMV populations may be diversified by transplantation from HCMV-positive donors or selected with antiviral drugs, are unlikely to represent natural situations. More relevant are maternal transmission routes, including those involving breast milk (Suárez et al., manuscript in preparation).
The frequent identification of mutants, and their apparently long history, reveals an interesting aspect of HCMV microevolution. The implication that some mutants have a selective advantage in certain circumstances may be extended to their evident ability nonetheless to cause congenital infections and associated neurological sequelae, probably in combination with specific host factors. Mutated genes tend to be involved, or are suspected to be involved, in immune modulation. These genes include UL111A, which encodes viral interleukin-10 [49], and UL40, which is involved in protecting infected cells against NK cell lysis [50] via its signal peptide, the region in which the mutations in this gene occur. By analogy with other members of the RL11 family, the most frequently mutated gene (UL9) is also likely to be involved in an aspect of immune modulation.
Modern approaches offer a powerful means for analysing HCMV genomes directly from clinical material, with the important proviso that the data should be monitored for quality and interpreted in the context of the known evolutionary and biological characteristics of the virus. The sequence data promise to become very extensive and will help shed further light on the epidemiology, pathogenesis and evolution of HCMV in clinical and natural settings, thus allowing investigation of virulence determinants and the development of new interventions.
ACKNOWLEDGMENTS
We are very grateful to Florent Lasalle, Daniel Depledge and Judith Breuer (University College London) for kindly providing unpublished collection 3 datasets and for updating the associated genome sequences in GenBank. We also thank Jenny Witthuhn (Hannover Medical School) for excellent technical assistance.
Footnotes
Conflict of interest statement Dr. Davison reports grants from the Medical Research Council and the Wellcome Trust. Dr. Ganzenmueller reports grants from the Deutsche Forschungsgemeinschaft Collaborative Research Centre 900 and from Niedersächsische Ministerium für Wissenschaft und Kultur. Dr. Hubáček reports a grant from the Ministry of Health of the Czech Republic for the conceptual development of University Hospital, Motol, Prague, Czech Republic, personal fees and non-financial support from MSD and from Chimaerix, and personal fees from Dynex that are outside the scope of the submitted work. Dr. Lilieri reports a grant from the Fondazione Regionale per la Ricerca Biomedica, Regione Lombardia. Dr. Schulz reports grants from the Deutsche Forschungsgemeinschaft Collaborative Research Centre 900 and from the German Federal Ministry of Education and Research. Dr. Wilkinson reports a grant from the Wellcome Trust. Dr. Wilkie reports that his part in the submitted work was completed prior to his employment by Illumina.
Funding statement This work was funded by the Medical Research Council (MC_UU_12014/3 and MC_UU_12014/12), the Wellcome Trust (204870/Z/16/Z and WT090323MA), the Ministry of Health of the Czech Republic for conceptual development of research organization 00064203 (University Hospital, Motol, Prague, Czech Republic), the Fondazione Regionale per la Ricerca Biomedica, Regione Lombardia (FRRB 2015-043), the Deutsche Forschungsgemeinschaft Collaborative Research Centre 900 (core project Z1, grant SFB-9001), the German Center of Infection Research TTU Infections of the Immunocompromised Host, and the Niedersächsische Ministerium für Wissenschaft und Kultur (grant COALITION – Communities Allied in Infection). A. Dhingra and E. Hage were supported by the Infection Biology graduate program of Hannover Biomedical Research School.
Presentation at conferences Parts of this work have been presented on multiple occasions, most recently at the German Society for Virology (14-17 March 2018) and the UK Microbiology Society (10-13 April 2018).