Abstract
Porcine endogenous retroviruses (PERVs) are potential infectious agents in xenotransplantation, as they are able to infect human cells and can be endogenized. To trace the origin of PERVs, we performed large-scale genomic mining of 142 mammal and 14 pig genomes and investigated the genomic dynamics and evolution of PERV-related viral “fossils”. Large-scale genetic alterations, including many indels, were found in most PERVs, indicating the ancient origin of these viruses. Remarkably, two non-porcine species, lesser Egyptian jerboa (Jaculus jaculus) and rock hyrax (Procavia capensis), harbor endogenous retroviruses (ERVs), named eJJRV and ePCRV, which are closely related to PERVs. Molecular dating and phylogenetic analyses suggest that the ancestral PERV originated from recombination of JJRV and PCRV in ancient pigs, which likely occurred in the late Miocene in Africa. Furthermore, we have discovered evidence of genomic rearrangement via PERVs during porcine evolution. Taken together, we decipher a complex evolutionary history for the modern PERVs.
Introduction
Xenotransplantation, the transplantation of tissues and organs from one species to another, may alleviate shortages of human donor organs (1, 2). Porcine organs are suitable for xenotransplantation due to the similar size and function of porcine and human organs, and the fact that pigs can be bred in large numbers (3). However, the potential risk of cross-species transmission of porcine microorganisms, specifically porcine endogenous retroviruses (PERVs), limits the xenotransplantation of porcine organs into humans (4). As retroviruses, PERVs could potentially cause immunodeficiency and tumorigenesis (3, 5, 6).
PERVs are endogenous gammaretroviruses, and exist in the genomes of all pig strains (3, 7). The envelope (env) genes of three PERV classes (PERV-A, -B and -C) differ, especially with respect to the receptor-binding domain (RBD) (8). Although there is no evidence of PERV transmission in patients receiving encapsulated pig islets (9–11), PERV-A and -B have been observed to infect both human cells and pig cells while PERV-C infects only pig cells (12). PERVs may also integrate into the human genome in vitro (13, 14). In pig cells, PERV-C can recombine with the env of PERV-A to produce A/C recombinants, which can infect human cells more efficiently (12). This increases the inherent risk in xenotransplantation and xenogeneic cell therapies.
While several studies have examined the evolutionary relationships between PERVs and other viruses, the origin of PERVs remains unknown (15, 16). At least two species that belong to the same order as pigs, Tayassu pecari (of Eocene origin) and Babyrousa babyrussa (of Miocene origin), lack PERVs (17). However, the common warthog (Phacochoerus africanus) carries PERVs, suggesting that an ancestral porcine species from the Miocene period (3.5 to 7.5 MYA) carried PERVs (17). PERVs have two different types of long terminal repeats (LTRs), one with a 39-bp repeat structure in the U3 region, and the other without this repeat structure (18, 19). The 39-bp repeats carried by PERV-A and -B confer strong promoter activity and thus increase transcription (18, 19). However, the 39-bp repeat structure is absent in some PERV-A and all PERV-C. These PERVs thus have low transcriptional activity (18, 19). BLAST search analysis confirmed that the R and U5 regions of the PERV LTRs are highly conserved in the pig and mouse genomes (74–87% identity) (20). Indeed, the LTRs of PERV-A, -B and LTR-IS (an LTR family found solely in the mouse genome) have similar structures (20). The conserved LTR sequences across pigs and mice might have originated from a common exogenous viral element, but have evolved independently (20). Thus, although little is known about the evolutionary history of PERVs, their history is increasingly traceable as the number of available mammalian genomes grows. Using genome mining, we find that PERVs are ancient (dating from the late Miocene period), and for the first time, we reveal that the PERV ancestor likely originated from co-infection and recombination of non-porcine endogenous retroviruses (ERVs).
Results
Characterization of putative full-length PERVs
Using previously reported PERV sequences as queries, we mined 14 pig genomes (Table S1) available in GenBank, and determined the detailed genome-wide distribution of full-length PERVs with two flanking LTRs. We initially compiled a PERV dataset that included 84 previously classified (30 PERV-A, 39 PERV-B and 15 PERV-C) and 18 unclassified PERVs (i.e., lacking an env gene) (Table S2). The number of classified PERVs ranged from 38 copies in the Duroc pig to 1 in the Tibetan pig. We identified 2–10 PERVs in each of 12 other pig breeds, including Meishan, Goettingen, and Large White. We removed 19 previously classified PERV sequences that were low-quality fragments (>200 “N” bases). The final dataset consisted of 65 high-quality classified PERVs (27 PERV-A, 29 PERV-B, and 9 PERV-C), and the genomic structures of the 65 PERVs are summarized in Fig. 1. PERVs had large-scale genetic alterations induced by indels and stop codons (Fig. 1), indicating a relatively long evolutionary history. PERV LTRs were classified by the presence (LTR B) or absence (LTR A) of the 18-bp and 21-bp repeat structure reported previously (8, 18, 21). Three different type B LTRs were identified in the PERVs, distinguished by the number of 18-bp and 21-bp repeat sequences: LTR B1 (two 18-bp and one 21-bp repeats), LTR B2 (three 18-bp and two 21-bp repeats), and LTR B3 (four 18-bp and three 21-bp repeats). Of the 65 high-quality PERVs we analyzed, we assigned a single LTR type to 57, of which 32 (>55%) carried LTR A, 10 carried LTR B1, 13 carried LTR B2, and 2 carried LTR B3. LTR A was identified in PERV-A and -C, and LTR B1 was identified in PERV-A and -B. LTR B2 and LTR B3 were only identified in PERV-B. The remaining eight PERVs contained different types of 5’- and 3’-LTR, which may reflect PERV recombination over evolutionary time.
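The LTR typing rule described above can be sketched as a small lookup. This is only an illustration: it assumes that the 18-bp and 21-bp repeats in the U3 region have already been counted by some upstream step, that "absence" of the repeat structure means zero copies of both repeats, and the function names are hypothetical rather than taken from the study.

```python
# Minimal sketch of LTR typing by repeat counts (function names are hypothetical).
# Keys are (number of 18-bp repeats, number of 21-bp repeats) in the U3 region.
LTR_TYPES = {
    (0, 0): "LTR A",   # assumption: no repeat structure means zero copies of both repeats
    (2, 1): "LTR B1",
    (3, 2): "LTR B2",
    (4, 3): "LTR B3",
}


def classify_ltr(n_18bp: int, n_21bp: int) -> str:
    """Return the LTR type for a pair of repeat counts, or 'unassigned'."""
    return LTR_TYPES.get((n_18bp, n_21bp), "unassigned")


def provirus_ltr_type(five_prime: tuple, three_prime: tuple) -> str:
    """Compare the 5' and 3' LTR types of one provirus.

    Proviruses whose two LTRs differ in type are flagged as 'mixed',
    corresponding to the candidates for recombination in the text.
    """
    left = classify_ltr(*five_prime)
    right = classify_ltr(*three_prime)
    return left if left == right else f"mixed ({left} / {right})"


if __name__ == "__main__":
    print(provirus_ltr_type((3, 2), (3, 2)))
    print(provirus_ltr_type((0, 0), (2, 1)))
```

Keeping the per-LTR classification separate from the per-provirus comparison makes it easy to tabulate both the type counts and the mixed-LTR proviruses from the same input.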
Recombination
To identify recombined PERVs, we constructed a neighbor-joining tree representing the 5’- and 3’-LTR sequences of full-length PERVs across 14 genomes. The resulting phylogenetic tree was divided into three large clusters (Fig. S1), suggesting that the ages of individual PERVs varied and that three large integration events had occurred. Retrovirus integration creates a short duplication, called a target site duplication (TSD), flanking the LTRs (22, 23). Here, 4-bp TSDs flanked the proviruses. Remarkably, 11 PERVs did not share the same TSD at both ends (Table 1, Table S3), likely due to chromosomal rearrangement through homologous recombination between distant PERVs, as described in a previous study of primate ERVs (24).
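The TSD comparison can be illustrated with a short check. This is a sketch under stated assumptions: the host flanking sequences immediately upstream of the 5’ LTR and immediately downstream of the 3’ LTR are taken as already extracted, and the function names are hypothetical.

```python
# Minimal sketch of a target site duplication (TSD) check (hypothetical helper names).
def tsd_pair(upstream_flank: str, downstream_flank: str, k: int = 4) -> tuple:
    """Return the k bases directly before the 5' LTR and directly after the 3' LTR."""
    return upstream_flank[-k:].upper(), downstream_flank[:k].upper()


def has_matching_tsd(upstream_flank: str, downstream_flank: str, k: int = 4) -> bool:
    """True when both flanks supply k bases and the two k-mers are identical.

    A mismatch is only a hint of rearrangement; sequencing errors or
    post-integration mutations in the TSD would produce the same result.
    """
    left, right = tsd_pair(upstream_flank, downstream_flank, k)
    return len(left) == k and len(right) == k and left == right


if __name__ == "__main__":
    print(has_matching_tsd("ACGTTGCA", "tgcaGGAT"))  # intact provirus
    print(has_matching_tsd("ACGTTGCA", "AACCGGTT"))  # candidate rearrangement
```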
Detection of PERV-related sequences in mammalian genomes
After screening 142 mammalian genomes (Table S4) in GenBank using tBLASTn with the three major proteins (Gag, Pol and Env) of PERVs as queries, we identified a sequence (accession number: NW_004504334.1) in the genome of lesser Egyptian jerboa (Jaculus jaculus) that showed highly significant similarity to PERVs (for gag and pol: >75% nucleotide identity over 95% region; for env: >75% nucleotide identity over 55% region). Using this PERV-related sequence as a query, three other possible PERV-related sequences were identified in J. jaculus (accession numbers: NW_004504375.1, NW_004504378.1, and NW_004504445.1) with >85% nucleotide identity over 80% of the query sequence. The four PERV-related sequences identified in J. jaculus were designated eJJRVs. These sequences were located in large scaffolds >5 Mb long, which indicated that the eJJRV sequences were relatively reliable. One full-length eJJRV (accession number: NW_004504334.1) is annotated in Fig. S2.
To demonstrate the similarity between PERVs and eJJRVs, we generated pairwise alignments of eJJRV and PERV nucleotides using the most closely related full-length ERVs, and performed a sliding window analysis of these pairwise alignments (25, 26). For comparison, we determined the similarity of the HIV-1 provirus sequence to that of its closest relative (chimpanzee SIVcpz) (27, 28). We found that gag and pol were more similar between eJJRV and PERV-A, -B and -C than between HIV-1 and SIVcpz (Fig. 2D). However, the RBD and the proline-rich region (PRR) of the surface subunit (SU) of env were dissimilar between eJJRV and PERV-A, -B and -C, and this pattern was also found between HIV-1 and SIVcpz. In PERVs, the RBD and PRR determine the host range (29–32), suggesting that, although gag and pol were similar between eJJRVs and PERVs, the two groups have distinct host ranges. To characterize the relationship between eJJRVs and PERVs, we constructed phylogenetic trees of Gag, Pol and Env, first removing the divergent RBD. The trees showed that eJJRVs clustered with PERVs (Fig. 2A-C), which suggested that PERVs and eJJRVs might share a common ancestor. Fig. 2C shows a sub-branch close to PERV-A and PERV-C; the PERVs in this branch, which were present in all 14 pig genomes, were named PERV-IMs. The Env proteins of PERV-IMs showed relatively low similarity to those of PERV-A, -B, and -C, and alignment of the RBD region suggested that PERV-IM was distinct from PERV-A, -B and -C (Fig. 3). PERV-IM could therefore be a new class of PERVs.
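A sliding window analysis of a pairwise alignment can be sketched as follows. The window and step sizes used in the study are not stated, so the defaults here are placeholders, and the choice to skip alignment columns containing a gap is an assumption of this sketch rather than the authors' procedure.

```python
# Minimal sketch of sliding-window identity over a pairwise alignment.
# Window/step defaults are placeholders, not values from the study.
def sliding_identity(aligned_a: str, aligned_b: str, window: int = 200, step: int = 50) -> list:
    """Return (window_start, identity) pairs along two aligned sequences.

    Columns with a gap ('-') in either sequence are ignored (an assumption);
    identity is None for windows that contain only gapped columns.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")
    results = []
    for start in range(0, len(aligned_a) - window + 1, step):
        a = aligned_a[start:start + window].upper()
        b = aligned_b[start:start + window].upper()
        columns = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
        if not columns:
            results.append((start, None))
            continue
        matches = sum(x == y for x, y in columns)
        results.append((start, matches / len(columns)))
    return results


if __name__ == "__main__":
    for start, identity in sliding_identity("ACGTACGT", "ACGTACGA", window=4, step=4):
        print(start, identity)
```

Plotting the identity values against the window start positions gives the kind of similarity profile in which conserved regions (such as gag and pol) and divergent regions (such as the RBD) can be compared side by side.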
Because the quality of the other three eJJRVs was poor, we were able to identify only one pair of eJJRV LTRs. The 3’-LTR of this eJJRV is 674 bp long, while the 5’-LTR is 932 bp long and contains a 258-bp insertion. We aligned the eJJRV LTRs with PERV LTRs. The start of the U3 region and the end of the U5 region are distinct and were not included in the alignment (Fig. S6). The alignment of the R region supported a close relationship between the eJJRV and the PERVs (Fig. 2E). The eJJRV LTRs included a repeat structure (three 18-bp and two 21-bp repeat sequences) in the U3 region, identical to that of the PERV LTR B2. Alignment analysis of the repeat structure revealed a closer relationship between the LTRs of the eJJRV and LTR B2 of PERVs (Fig. S6). Furthermore, the 3’-LTR of eJJRV had high identity with LTR B2 of PERV (~73%). Therefore, our results indicated that eJJRVs and PERVs were homologous.
The RBDs of eJJRVs and PERVs were distinct, so we used the RBD amino acid sequences from PERV-A, -B and -C as queries to screen for homologous viral elements. Eight significant hits (>60% amino acid identity over 80% region) were obtained in rock hyrax (Procavia capensis) of the Procaviidae, and all eight hits were located in large scaffolds >0.3 Mb long (accession numbers: KN678690.1, KN676491.1, KN678005.1, KN677924.1, KN676905.1, KN676182.1, KN680906.1, and KN676638.1). We examined the genes flanking the eight hits (especially pol), and found that the ERVs containing these hits were endogenous gammaretroviruses. These ERVs were therefore designated ePCRVs. We aligned the RBDs of PERVs and ePCRVs, and found that ePCRVs were highly similar to PERVs (Fig. 3). To quantify the homology between ePCRVs and PERVs, we made pairwise comparisons. Our comparisons suggested that ePCRV_1 and ePCRV_2 had a high identity with PERV-B (63%) but a low identity with PERV-A, -C and PERV-IM (40–43%). Therefore, the PERV RBD might be derived from ePCRVs, and the divergence of the RBD of PERVs might have occurred after the recombination of PCRVs and JJRVs.
Molecular dating analysis
To better understand the integration time of PERVs, we used an LTR-divergence method to estimate when PERVs and eJJRVs invaded the host genome. This estimation method is based on the divergence between the 5’- and 3’-LTRs of ERVs. Because the nucleotide substitution rates of S. scrofa, J. jaculus and P. capensis were unknown, we used an average mammal neutral substitution rate (2.2 × 10⁻⁹ per site per year) (33) for these three species. Our results indicated that PERV-A first invaded the Suidae ~6.6 MYA, while PERV-B first invaded ~6.4 MYA. In contrast, the invasions of PERV-C and PERV-IM were relatively recent (~3.4 MYA and ~4.4 MYA, respectively) (Fig. 4, Table S2). Thus, the oldest PERV-A and PERV-B invaded the host just after the Suidae split from the ancestral group (~7.3 MYA) (34). PERV-A, -B and -C have continued to integrate into pig genomes, resulting in increasing numbers of insertions. Because the LTRs of three eJJRVs were incomplete, our eJJRV results were based on only one provirus. This eJJRV was estimated to have integrated ~17.2 MYA, which is well before J. jaculus speciated (~11.1 MYA), but later than the speciation of Dipodidae (~42.7 MYA) (35). The ePCRV integration time was calculated based on two full-length ePCRVs. The ePCRV insertions were estimated to be much older than those of PERVs (~10.7 MYA and ~8.4 MYA).
Evolutionary history of PERVs
Taken together, the evolutionary history of PERVs could be divided into four stages. First, JJRVs, the most closely related ERVs to PERVs, integrated into Dipodidae ~17.2 MYA. However, the SU subunits of env were dissimilar between eJJRVs and PERVs, indicating that this subunit may be derived from other ERVs (Fig. 5). Second, ePCRVs emerged ~10.7 MYA, and their SU subunit of env, especially the RBD, was highly similar to that of PERVs, which suggested that PCRVs may also have been a donor to the ancestral PERV. Third, the ancestral PERV emerged. The oldest modern PERV, PERV-A, integrated into Suidae ~6.6 MYA, just after the emergence of Suidae (~7.3 MYA). It is possible that the ancestral PERV originated from the co-infection and recombination of JJRVs and PCRVs, and originally appeared around the late Miocene (~6.6–7.3 MYA) after the emergence of Suidae. Finally, after rapid adaptation in Suidae, PERV-A and -B diverged from the ancestral PERV ~6.6 MYA and ~6.4 MYA, respectively. The integration of PERV-IM occurred ~4.4 MYA, between the integration of the earliest PERV-A (~6.6 MYA) and PERV-C (~3.4 MYA). The homologies between PERV-A and -C are reported to be ~85%, while those between PERV-B and both PERV-A and PERV-C barely exceed 70% (17). Moreover, PERV-C harbors only one type of LTR, LTR A, which is also present in PERV-A, but not PERV-B. PERV-A and -B can infect cells from several species including humans, while PERV-C infects only pig cells (7, 12, 36). It is possible that PERV-C descended from PERV-A, but lost the ability to infect other species in order to increase its adaptability.
Discussion
Using systematic large-scale genome mining, we analyzed the origin and evolution of PERVs. eJJRV, the most closely related ERV to PERVs, can be traced back to ~17.2 MYA, which is well before J. jaculus speciated (~11.1 MYA), but later than the speciation of Dipodidae (~42.7 MYA). Unexpectedly, LTRs homologous to those of PERVs (~73% identity) were also found in 8 Muroidea species (Mus caroli, M. pahari, M. musculus, M. spretus, Apodemus speciosus, A. sylvaticus, Rattus norvegicus and Phodopus sungorus). The coding genes (gag, pol, and env) near these homologous LTRs were identified. In addition, a previous study found ERVs in 2 Muroidea species (M. musculus and R. norvegicus) (37). However, phylogenetic analysis suggested that these coding genes were only distantly related to those of PERVs and eJJRVs (Fig. S7). The homologous LTRs in Muroidea and Dipodoidea (especially eJJRVs) indicated that PERV-related LTRs had integrated into rodents before the divergence of Muroidea and Dipodoidea from a common ancestor (~53.0 MYA). The LTR-related ERVs in Muroidea and Dipodoidea then evolved separately. Dipodoidea became the most closely related ancestor of PERVs until eJJRVs emerged (~17.2 MYA).
PERVs (~6.6 MYA), JJRVs (~17.2 MYA), and PCRVs (~10.7 MYA) integrated into Suidae, Dipodidae, and Procaviidae, respectively. The fossil records of Suidae, Dipodidae, and Procaviidae also support this hypothesis of PERV evolution. Miocene (23–5.33 MYA) Suidae fossils have been found in East Africa, Europe and Asia (http://fossilworks.org/?a=taxonInfo&taxon_no=42381). Miocene Dipodidae fossils have been found in North Africa, Europe and Asia (http://fossilworks.org/?a=taxonInfo&taxon_no=41695); Pliocene (5.3–2.59 MYA) Dipodidae fossils have been identified in East Africa, suggesting that the Dipodidae may have spread to East Africa during the Miocene. Miocene Procaviidae fossils have been found in the south of Africa and in East Africa (http://fossilworks.org/?a=taxonInfo&taxon_no=43293). According to the current fossil records, the only shared region for Miocene Dipodidae, Procaviidae and Suidae fossils is East Africa. Co-infection and recombination may therefore have occurred between retroviruses carried by Dipodidae and Procaviidae in East Africa during the Miocene, and the recombinants may then have invaded the Suidae, producing the ancestral PERV.
In summary, for the first time, we decipher a complex evolutionary history for the PERVs. The ancestral PERV might have derived from co-infection and recombination of JJRVs and PCRVs from Dipodidae and Procaviidae. The ancestral PERV then split into two classes (PERV-A and PERV-B). Finally, PERV-C diverged from PERV-A. We also suggest that pig genomes have been shaped by PERV invasions, as specifically reflected by the PERV-associated genomic rearrangements that have occurred during porcine evolution. In short, modern PERVs have a complex evolutionary history prior to their appearance in pigs.
Materials and Methods
In silico identification of PERV and PERV-related proviruses
To identify PERV proviruses in Sus scrofa, tBLASTn (38) was used with the amino acid sequences of Gag, Pol and Env of 20 representative PERV proviruses (accession numbers: HQ536016.1, HQ536015.1, HQ536013.1, KC116220.1, AY570980.1, HQ540592.1, HQ536007.1, AX546209.1, AF435967.1, AY953542.1, HQ540591.1, AY099323.1, AJ133817.1, EU523109.1, EF133960.1, AY056035.1, AY099324.1, A66553.1, HQ536011.1, and HQ536009.1) as queries to search the 14 pig genomes available in GenBank that were released before November 2017. A threshold of 50% identity over 50% region was used to filter significant hits. It has been shown that PERVs harbor two LTR structures, one with and one without a repeat structure in the U3 region (8, 18). Using two typical LTRs as queries, we extended the flanking sequences of the coding domains of PERVs to identify LTRs with BLASTn, and TSDs were used to define the boundaries of each PERV. LTR lengths were defined as 100–1,000 bp. PERVs with at least one LTR and one coding gene were retained for subsequent analysis.
To identify PERV-related proviruses in mammals, tBLASTn was used with the Gag, Pol and Env queries from the 20 representative PERV proviruses mentioned above to search the 142 mammal genomes available in GenBank that were released before November 2017. A threshold of 50% identity over 80% region was used to filter significant hits. LTRs were identified using LTR finder (39), LTRharvest (40) and BLASTn. LTR lengths were also defined as 100–1,000 bp.
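The hit-filtering step can be sketched as a simple threshold filter. This is an illustration only: it assumes that "region" refers to the fraction of the query covered by the alignment, that the thresholds are inclusive, and that hits have already been parsed into records; the `Hit` class and function names are hypothetical.

```python
# Minimal sketch of filtering BLAST-style hits by identity and query coverage.
# The Hit record and function names are hypothetical illustrations.
from dataclasses import dataclass


@dataclass
class Hit:
    subject: str          # scaffold or contig identifier
    pct_identity: float   # percent identity reported for the alignment
    query_start: int      # 1-based start of the aligned region on the query
    query_end: int        # 1-based end of the aligned region on the query
    query_length: int     # full length of the query sequence


def query_coverage(hit: Hit) -> float:
    """Percent of the query covered by the alignment (assumed meaning of 'region')."""
    aligned = abs(hit.query_end - hit.query_start) + 1
    return 100.0 * aligned / hit.query_length


def filter_hits(hits, min_identity: float = 50.0, min_coverage: float = 80.0) -> list:
    """Keep hits that meet both thresholds (treated as inclusive in this sketch)."""
    return [h for h in hits
            if h.pct_identity >= min_identity and query_coverage(h) >= min_coverage]


if __name__ == "__main__":
    hits = [
        Hit("scaffold_1", 76.0, 1, 480, 500),
        Hit("scaffold_2", 62.0, 1, 300, 500),
        Hit("scaffold_3", 45.0, 1, 500, 500),
    ]
    print([h.subject for h in filter_hits(hits)])                      # mammal screen (80%)
    print([h.subject for h in filter_hits(hits, min_coverage=50.0)])   # pig screen (50%)
```

Passing `min_coverage=50.0` reproduces the looser threshold used for the pig genomes, while the default corresponds to the stricter mammal-wide screen.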
Detection of recombination mediated by PERVs
To search for proviruses involved in recombination and chromosomal rearrangements, we constructed a neighbor-joining tree of 5’- and 3’- LTR of full-length PERVs using MEGA7 (41) with Kimura 2-parameter distance estimates. LTRs less than 250 bp were not considered. Alignment was carried out with MAFFT 7.222 (42).
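For reference, the Kimura 2-parameter distance between two aligned LTR sequences can be computed directly from the proportions of transitions (P) and transversions (Q) as d = −½ ln[(1 − 2P − Q)√(1 − 2Q)]. The sketch below is a standalone illustration, not the MEGA7 implementation; skipping columns with gaps or ambiguous bases is an assumption of this sketch.

```python
# Minimal sketch of the Kimura 2-parameter (K2P) distance for two aligned sequences.
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def k2p_distance(aligned_a: str, aligned_b: str) -> float:
    """K2P distance; columns with gaps or ambiguous bases are skipped (an assumption).

    math.log raises ValueError if the sequences are too divergent for the
    correction to be defined (saturation).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")
    sites = transitions = transversions = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x not in BASES or y not in BASES:
            continue
        sites += 1
        if x == y:
            continue
        if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            transitions += 1      # A<->G or C<->T
        else:
            transversions += 1    # purine <-> pyrimidine
    if sites == 0:
        raise ValueError("no comparable sites")
    p = transitions / sites
    q = transversions / sites
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


if __name__ == "__main__":
    print(k2p_distance("AAAAAAAAAA", "GAAAAAAAAA"))
```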
Phylogenetic analyses
To determine the evolutionary relationships among PERVs, eJJRVs and representative gammaretroviruses (Table S5), phylogenetic trees were inferred from amino acid sequences. Full-length PERVs and PERVs with one LTR and at least one coding gene were used to construct phylogenetic trees. All Gag, Pol and Env protein sequences were aligned in MAFFT 7.222 and confirmed manually in MEGA7. The evolutionary history of these gammaretroviruses was then determined using the maximum-likelihood (ML) phylogenetic method available in PhyML 3.1 (43), incorporating 100 bootstrap replicates to determine the robustness. The best-fit amino acid substitution models, selected using ProtTest 3.4.2 (44), were JTT+Γ for Gag and Pol and JTT+Γ+I for Env. All alignments can be found in Dataset S1.
Molecular dating of PERV, eJJRV and ePCRV
The 5’ and 3’ LTRs of ERVs are identical at the point of integration, and then diverge and evolve independently (45). The ERV integration time can therefore be calculated using the following relation: T = (D/R)/2, in which T is the time since invasion (in years; reported here in million years, MY), D is the number of nucleotide differences per site between the two LTRs, and R is the genomic substitution rate (nucleotide substitutions per site, per year). We used the previously estimated average mammal substitution rate (2.2 × 10⁻⁹ per site per year) (33), as no substitution rate (R) has yet been estimated for S. scrofa, J. jaculus or P. capensis.
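The dating relation can be written as a short calculation. The function names are hypothetical, and the divergence value in the usage example is back-calculated from an integration time of 6.6 MY purely for illustration; it is not a measured value from the study.

```python
# Minimal sketch of LTR-divergence dating, T = (D / R) / 2.
# Function names are hypothetical; the rate is the average mammal rate cited in the text.
MAMMAL_RATE = 2.2e-9  # substitutions per site per year


def integration_time_years(divergence: float, rate: float = MAMMAL_RATE) -> float:
    """Time since integration in years.

    The factor of 2 reflects that both LTRs accumulate substitutions
    independently after integration.
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    return divergence / rate / 2.0


def integration_time_my(divergence: float, rate: float = MAMMAL_RATE) -> float:
    """Time since integration in million years."""
    return integration_time_years(divergence, rate) / 1e6


if __name__ == "__main__":
    # Illustrative divergence only: 2 * 2.2e-9 * 6.6e6 = 0.02904 differences per site.
    print(round(integration_time_my(0.02904), 2))
```

Because T scales inversely with R, any uncertainty in the assumed substitution rate carries over proportionally to the estimated integration times.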