Abstract
Background Analysing variant antigen gene families on a population scale is a difficult challenge for conventional methods of read mapping and variant calling due to the great variability in sequence, copy number and genomic loci. In African trypanosomes, hemoparasites of humans and animals, this is complicated by variant antigen repertoires containing hundreds of genes subject to various degrees of sequence recombination.
Findings We introduce Variant Antigen Profiler (VAPPER), a tool that allows automated analysis of variant antigen repertoires of African trypanosomes. VAPPER produces variant antigen profiles for any isolate of the veterinary pathogens Trypanosoma congolense and Trypanosoma vivax from genomic and transcriptomic sequencing data and delivers publication-ready figures that show how the queried isolate compares with a database of existing strains. VAPPER is implemented in Python. It can be installed to a local Galaxy instance from the ToolShed (https://toolshed.g2.bx.psu.edu/) or locally on a Linux platform via the command line (https://github.com/PGB-LIV/VAPPER). The documentation, requirements, examples, and test data are provided in the GitHub repository.
Conclusion Our approach is the first to allow large-scale analysis of trypanosome variant antigens and establishes two different methodologies that may be applicable to other multi-copy gene families that are otherwise refractory to high-throughput analysis.
Background
Advances in next-generation sequencing have enabled researchers to produce high-throughput genomic data for diverse pathogens. However, analysing multi-copy, contingency gene families remains challenging due to their abundance, high mutation and recombination rates, and unstable gene loci [1]. Yet, these gene families are often involved in many processes of pathogenesis, including antigenic variation, virulence, host use, and immune modulation in a multitude of pathogens [2–4]. A prime example of a crucial gene family lacking the necessary analytic tools for high-throughput analysis is the Variant Surface Glycoprotein (VSG) superfamily in African trypanosomes [5].
African trypanosomes are extracellular hemoparasites that cause human sleeping sickness and animal African trypanosomiasis (AAT). Their genomes contain up to 2500 VSG genes [6] dispersed through specialized, hemizygous chromosomal regions called subtelomeres, smaller chromosomes, and less frequently in the core of megabase-sized diploid chromosomes. The VSG genes encode variant surface glycoproteins, GPI-anchored proteins that coat the entire surface of the parasite in the bloodstream of the mammalian host and function mostly in antigenic variation and immune modulation [7]. Sporadically, specific VSG genes have been shown to evolve other functions, not related to antigenic variation, such as conferring human infectivity to T. brucei gambiense (TgsGP gene) [8, 9] and T. brucei rhodesiense (SRA gene) [10, 11], resistance to the drug suramin (VSGsur gene) [12], and mediating the transport of transferrin (TfR genes) [13, 14].
As they are key players in host-trypanosome interaction, understanding VSG diversity and its impact in pathology, disease phenotype and virulence is of foremost importance in trypanosome research [4]. However, the VSG repertoire cannot be accurately analysed using conventional approaches of read mapping and variant calling. Attempts to bypass this challenge have resulted in alternative approaches using manually-curated VSG gene databases for specific T. brucei strains [6, 15–17], but to the best of our knowledge there is no automated tool for the systematic analysis of VSG from any trypanosome genome. Thus, we have developed Variant Antigen Profiler (VAPPER), a tool that examines VSG repertoires in DNA/RNA sequence data of the main livestock trypanosomes, Trypanosoma congolense and T. vivax, and quantifies antigenic diversity. This results in a variant antigen profile (VAP) that can be compared between isolates, locations, and experimental conditions [18]. In this paper we briefly present how VAPPER can be used to further our knowledge of antigenic diversity and variation.
Findings
The service
VAPPER is primarily intended for producing and comparing VAPs of livestock trypanosomes, without the need for complex bioinformatic processes. It is available online through the Galaxy ToolShed [19] for a local Galaxy server [20], and as a Linux package for local installation. The program has three pipelines, specific for each organism (T. congolense or T. vivax) and input data type (genome or transcriptome). VAPPER requires quality-filtered, trimmed, paired sequencing reads in FASTQ format [21] or assembled contigs in FASTA format [22]. Results are presented in tables of frequencies, heatmaps, and Principal Component Analysis (PCA) plots, visualized as HTML files or exported to PDF or PNG format. A typical workflow is shown in Fig. 1.
For T. congolense genomic VAPs (gVAP), VAPPER starts with genome assembly of raw, short reads using Velvet 1.2.10 [23]. Assembled contigs are screened for pre-defined protein motifs described by a hidden Markov model using HMMER 3.1b2 [24] after 6-frame translation. A detailed description of the universal protein motifs and their biological significance is given in a recent manuscript [18]; in summary, each protein motif or motif combination is diagnostic of a specific phylotype [18], so phylotype frequency can be calculated from the HMMER output. The proportions of each phylotype represent the gVAP and are recorded in a table of frequencies. The gVAP produced is also placed in the context of a T. congolense genome database supplied with VAPPER (N=97, [18, 25]), which is regularly updated. This is achieved through a Euclidean distance-based clustering analysis. Results are presented as two heatmaps with corresponding dendrograms, one showing phylotype frequency, and the other showing frequency deviation from the population mean. They are also shown as a PCA plot and a table of frequencies.
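As an illustration only, the frequency and distance steps described above can be sketched as follows. The sketch assumes that the HMMER output has already been parsed into a list of phylotype labels, one per detected motif hit; the function names, the three-phylotype example and the toy database are hypothetical and are not taken from the VAPPER code. It computes the pairwise Euclidean distances that a clustering analysis would take as input, not the clustering itself.

```python
import math
from collections import Counter

def phylotype_proportions(hits, phylotypes):
    """Return the proportion of hits assigned to each phylotype, in a fixed order."""
    counts = Counter(hits)
    total = sum(counts[p] for p in phylotypes)
    if total == 0:
        raise ValueError("no phylotype hits found")
    return [counts[p] / total for p in phylotypes]

def euclidean(a, b):
    """Euclidean distance between two profiles of equal length."""
    if len(a) != len(b):
        raise ValueError("profiles must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical input: one phylotype label per motif hit (5 x P1, 3 x P2, 2 x P3).
phylotypes = ["P1", "P2", "P3"]
hits = ["P1", "P1", "P2", "P3", "P1", "P2", "P1", "P3", "P1", "P2"]
gvap = phylotype_proportions(hits, phylotypes)

# Hypothetical database of existing profiles, in the same phylotype order.
database = {"strainA": [0.5, 0.3, 0.2], "strainB": [0.2, 0.3, 0.5]}
distances = {name: euclidean(gvap, profile) for name, profile in database.items()}
nearest = min(distances, key=distances.get)
```

In this toy case the query profile is identical to that of `strainA`, so `nearest` is `strainA` with a distance of zero.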
For T. congolense transcriptomic analyses (tVAP), VAPPER performs read mapping using Bowtie 2 2.2.6 [26], reference-based transcript assembly and abundance calculation using Cufflinks 2.2.1 [27], and VSG transcript screening and phylotype assignment as described for gVAP. The proportions of each phylotype are then adjusted for transcript abundance based on the Cufflinks output (Fig. 1). The tVAP is presented as a weighted bar chart and compared to the gVAP of the reference (Fig. 2c). Ideally, the user would provide their own reference genome for the mapping step. As that is not always possible, especially for field isolate analysis, we provide two reference genomes, the IL3000 Kenyan isolate [14, 28], and the Tc1/148 Nigerian isolate [29, 30]. Choosing the most appropriate reference for the sample being analysed may improve the VAPPER results by increasing mapping sensitivity. However, we have previously shown that closely related T. congolense strains (i.e. with short genetic distances) do not always have equally related VSG repertoires [18].
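One plausible reading of the abundance adjustment is sketched below. The exact weighting is not specified here, so the sketch assumes that each VSG transcript carries a phylotype label and a non-negative abundance value (for example an FPKM-like estimate), and that abundances are summed per phylotype and normalised to the total. The function name and the input values are hypothetical.

```python
from collections import defaultdict

def weighted_phylotype_proportions(transcripts):
    """Sum transcript abundance per phylotype and normalise to proportions.

    transcripts: iterable of (phylotype, abundance) pairs.
    """
    totals = defaultdict(float)
    for phylotype, abundance in transcripts:
        if abundance < 0:
            raise ValueError("abundance must be non-negative")
        totals[phylotype] += abundance
    grand_total = sum(totals.values())
    if grand_total == 0:
        raise ValueError("total abundance is zero")
    return {p: value / grand_total for p, value in totals.items()}

# Hypothetical VSG transcripts: (assigned phylotype, abundance estimate).
transcripts = [("P8", 60.0), ("P8", 20.0), ("P1", 15.0), ("P15", 5.0)]
tvap = weighted_phylotype_proportions(transcripts)
```

With these toy values, two of four transcripts belong to P8 (an unweighted proportion of 0.5), but P8 accounts for 0.8 of the abundance-weighted profile, which is the kind of difference the weighting is meant to capture.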
For T. vivax, the gVAP is based on presence or absence of pre-defined VSG genes, rather than phylotype frequencies as described for T. congolense. The T. vivax VSG repertoire is composed of distantly related lineages with no evidence for recombination [14]. Therefore, unlike T. congolense and T. brucei, VSG genes are often conserved across multiple strains, allowing us to build a VSG database for the entire species. No T. vivax tVAP is currently offered because too little transcriptomic data is available for benchmarking, but work is ongoing to add this function. VSG-containing contigs are identified using BLAST 2.7.1 to detect sequence homology with a T. vivax VSG database. This information is added to a regularly updated presence/absence binary matrix of T. vivax genomes (N=29) and applied to a Euclidean distance-based clustering analysis. The results are presented as a heatmap and dendrogram, putting the sample in the context of the remaining T. vivax genomes and their known countries of origin (Fig. 3).
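The presence/absence representation can be illustrated with a short sketch. It assumes that the BLAST step has already produced the set of database VSG identifiers detected in the query sample; the identifiers, strain names and matrix rows below are hypothetical. Only the distance calculation is shown, not the clustering or plotting.

```python
import math

def presence_vector(detected, vsg_ids):
    """Encode detected VSGs as a 0/1 vector over the ordered database identifiers."""
    return [1 if vsg in detected else 0 for vsg in vsg_ids]

def euclidean(a, b):
    """Euclidean distance between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical database: four VSGs scored in two existing strains.
vsg_ids = ["VSG_a", "VSG_b", "VSG_c", "VSG_d"]
matrix = {"strain1": [1, 1, 0, 0], "strain2": [0, 0, 1, 1]}

# Hypothetical query sample in which three of the four VSGs were detected.
sample = presence_vector({"VSG_a", "VSG_b", "VSG_d"}, vsg_ids)
dist = {name: euclidean(sample, row) for name, row in matrix.items()}
```

For binary vectors the squared Euclidean distance equals the number of VSGs whose presence differs between two strains, so in this toy example the query differs from `strain1` at one VSG and from `strain2` at three.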
In its Linux version, VAPPER can process multiple samples concurrently, providing that the input files are compiled in a single directory. Results are shown for all samples simultaneously, allowing direct comparison of variant antigen profiles across multiple isolates, conditions, or replicates. The tabular output can be incorporated in downstream statistical analysis, whilst the graphical outputs provide figures for the visualization of antigen repertoire variability.
Linux Package Installation
To facilitate usage, the installation of VAPPER and its dependencies is automated. Upon first download of the software, a single script will ensure the system has all the required dependencies and install them in a local directory if necessary. In naïve environments and for users without administrator rights to install the necessary libraries, a Python virtual environment can be set up for each new session.
The Galaxy Tool
VAPPER is available for installation in local Galaxy servers from the Galaxy ToolShed (https://toolshed.g2.bx.psu.edu/repository?repository_id=08b5616f1d3df20c). Incorporating VAPPER in Galaxy provides a simple front-end for inexperienced users (Fig. 2). Results can be visualised directly in Galaxy, or can be downloaded as a compressed folder containing an HTML file with combined results, individual PNG and PDF files of the heatmaps, PCA plots, and bar charts produced, and the CSV files containing the raw values of phylotype proportions and deviation from the mean.
Benchmarking
The performance of the T. congolense gVAP pipeline was compared to the manually annotated VAP of the IL3000 reference genome (Fig. 3A) and to the BLAST-based VAPs of 41 isolates (Fig. 3B) [18]. There is a very good correlation between profiles produced by VAPPER and the known IL3000 VAP (R² = 0.88, t(13) = 9.7321, p < 0.001) and a good correlation with the BLAST-based method (R² = 0.67, Pearson’s product moment correlation, t(566) = 34.4, p < 0.001). Minor differences were further investigated and found to be due to BLAST’s difficulty in either analysing small contigs or quantifying multiple VSGs in the same contig sequence. Therefore, in general, more VSGs were recovered with VAPPER than with BLAST (mean ± σ = 721 ± 277 vs. 669 ± 292, paired t-test, p = 0.005). A further strength of VAPPER is the ability to deal with poor, fragmented genome assemblies. As described in our previous paper [18], when a single VSG gene is located in two distinct contig fragments, BLAST counts them incorrectly as separate genes, whereas VAPPER will not because the diagnostic motif is only present once. Therefore, we can now accurately calculate antigen profiles from incomplete genome assemblies (up to 30%), and with a VSG fragmentation level up to 40% of the original gene length (223 nucleotides) (Fig. 3C).
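For readers who wish to run a comparable check on their own profiles, the relationship between a Pearson correlation coefficient r and its t statistic, t = r·sqrt((n − 2)/(1 − r²)), can be sketched as below. This is a generic textbook calculation and not the benchmarking code used here; the function names and the toy profiles are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need two sequences of equal length, n >= 3")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def t_from_r(r, n):
    """t statistic (n - 2 degrees of freedom) for a Pearson correlation r."""
    if abs(r) >= 1:
        raise ValueError("t is undefined for |r| = 1")
    return r * math.sqrt((n - 2) / (1 - r * r))

# Toy profiles (hypothetical values), e.g. two methods scored on five phylotypes.
method_a = [1, 2, 3, 4, 5]
method_b = [2, 1, 4, 3, 5]
r = pearson_r(method_a, method_b)
t = t_from_r(r, len(method_a))

# Consistency check against the rounded values reported in the text:
# a coefficient of determination of 0.88 with 13 degrees of freedom (n = 15).
t_check = t_from_r(math.sqrt(0.88), 15)
```

Here `t_check` evaluates to approximately 9.76, in line with the reported t(13) of 9.73 once rounding of the coefficient of determination is taken into account.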
Validation by example
T. congolense gVAP
We have used VAPPER to analyse the genomic repertoire of 98 T. congolense samples of savannah and forest-subtypes, collected from 12 countries across Africa, and previously described by us [18] and others [25]. In Fig. 4, two heatmaps and corresponding dendrograms show how the VSG repertoires of each strain relate to each other. On the left, the heatmap represents phylotype proportion, i.e. how many genes a specific phylotype contains in the context of the complete VSG repertoire for a given strain (Fig. 4A). This heatmap shows that P4, 8, 9, 10, and 14 have few genes in all strains, whereas other phylotypes (e.g. P1, 2, 15) are more variable, being quite abundant in some strains and rare in others. The heatmap on the right shows phylotype deviation from the mean (Fig. 4B), which is calculated as the difference between the phylotype proportion shown in panel A and the arithmetic mean of phylotype proportions. The latter is calculated from the current database, so it will change as new samples are added.
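The deviation from the mean described above amounts to subtracting, for each phylotype, the arithmetic mean across all strains in the database from the proportion observed in each strain. A minimal sketch is given below; the strain names and proportions are hypothetical and the function is illustrative rather than the VAPPER implementation.

```python
def deviation_from_mean(profiles):
    """Return per-phylotype means and each strain's deviation from them.

    profiles: dict mapping strain name to a list of phylotype proportions,
    all in the same phylotype order.
    """
    names = list(profiles)
    if not names:
        raise ValueError("no profiles supplied")
    n_phylotypes = len(profiles[names[0]])
    means = [
        sum(profiles[s][i] for s in names) / len(names)
        for i in range(n_phylotypes)
    ]
    deviations = {
        s: [profiles[s][i] - means[i] for i in range(n_phylotypes)]
        for s in names
    }
    return means, deviations

# Hypothetical database of three strains and three phylotypes.
profiles = {
    "strainA": [0.5, 0.3, 0.2],
    "strainB": [0.3, 0.3, 0.4],
    "strainC": [0.1, 0.3, 0.6],
}
means, deviations = deviation_from_mean(profiles)
```

In this example the means are 0.3, 0.3 and 0.4, so `strainA` deviates by +0.2, 0 and −0.2. Because the means are recomputed from whatever profiles are supplied, adding a strain to `profiles` shifts every deviation, which is why the normalised heatmap changes as the database grows.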
The phylotype proportion variation patterns are perhaps better detected in the normalised heatmap (Fig. 4B). For example, it is possible to detect a signature of underrepresented P15 characteristic of all forest-subtype samples (denoted by “a”), abundant P15 in all Kenyan isolates (in purple), as well as a distinct pattern characteristic of strains IL3578 to IL2326, characterised by the combination of low P1 to 3 and high P7 (denoted by “b”). The latter does not seem to be related to geography, as it encompasses isolates from Kenya, Uganda, and Burkina Faso. The PCA plot further indicates that VSG repertoires and geography are only weakly correlated (Fig. 4C), which agrees with our previous observation that T. congolense VSG repertoires do not mimic either population structure or geography [18].
T. congolense tVAPs
We have used VAPPER to analyse the expressed VSG repertoire of the metacyclic (infective) life stage of T. congolense. For that, we have produced a tVAP for the strain TC13, whose transcriptome was published by Awuoche et al. (2018) [31]. We have compared its metacyclic tVAP to the metacyclic tVAP of the 1/148 strain (MBOI/NG/60/1-148) that we have previously described [29]. Furthermore, we have compared them to the genomic VSG repertoires of the same strain, or a related one (Fig. 5). As we do not have a genome sequence for the TC13 isolate, we compared it to IL3000, which was isolated in the same region (Transmara, Kenya) [32].
When we compare the gVAPs of 1/148 and IL3000, we see that they are distinct, and so are the tVAPs (e.g. P4 is more represented in TC13, whereas P10 is more represented in 1/148 than in TC13). However, P8 is overrepresented in both isolates compared to the genomic repertoires (Fig. 5). This agrees with our previous observation that the pattern of metacyclic VSG expression is significantly different from the genome repertoires, and that the metacyclic VSG repertoire is particularly enriched for P8 genes [18]. With the analysis of the TC13 transcriptome, we can now add that this enrichment does not seem to be strain-specific, but rather equally applicable to T. congolense strains of distinct backgrounds.
T. vivax gVAP
The T. vivax gVAP shows the VAPs in the context of the sample cohort (N=29), which currently includes samples from Nigeria, Uganda, Gambia, Ivory Coast, Brazil, Burkina Faso, and Togo. The dendrogram represents the relationships between the multiple strains, whereas the heatmap shows whether VSG genes are present or absent in each strain (Fig. 6A). The VAP relationship shows a separation between Nigerian (in dark blue) and the remaining samples, as well as a clear difference between Brazilian and Ugandan isolates. The geographical signature is diminished slightly in the non-Nigerian West African strains, although this may reflect the smaller number of samples per country and perhaps the geographical closeness between Togo, Burkina Faso, and Ivory Coast. Despite the lack of a transcriptomic pipeline for T. vivax, we can use the gVAP to understand the geographical distribution of expressed VSGs. As an example, we took the two most abundant VSGs in the transcriptomes of three strains (i.e. LIEM-176 from Venezuela [33], IL1392 from Nigeria [34], and Lins from Brazil [35]) and compared them to the VAP database (Fig. 6B). We observe that there are five different VSGs, which represent three different geographical patterns (Fig. 6C). Specifically, the first LIEM-176 VSG transcript has been found in strains from Venezuela, Nigeria and Gambia, but not in Brazil, Uganda, or Ivory Coast (map 1 in Fig. 6C). The second LIEM-176 VSG is present in Brazil, Venezuela, Nigeria, and Uganda, yet not in Ivory Coast, a pattern that is shared with the top two most abundant VSGs in Lins (map 2 in Fig. 6C). Finally, the top two most abundant VSGs in IL1392 have been found in strains from Brazil, Venezuela, Gambia, and Nigeria, but not in Uganda or Ivory Coast (map 3 in Fig. 6C). It is possible that strain or location-specific VSGs might be epidemiologically relevant, perhaps contributing to the considerable phenotypic variation observed in T. vivax AAT.
Conclusion
VAPPER is the first tool for the systematic analysis of VSG gene and expression diversity across strains and during infections. It establishes a practical approach for measuring antigenic diversity in these important pathogens based on universal protein motifs and/or gene mapping. Despite being often seen as a veterinary extension of HAT, AAT is a spectrum of diseases, dependent on the multiple species and strains of African trypanosomes and their multiple mammal hosts [36]. This predicament results in large variability in pathogenesis, epidemiology, and clinical outcome that remains poorly understood. For example, in East Africa, T. vivax usually causes mild, chronic disease, but has occasionally resulted in acute haemorrhagic syndromes [37] without apparent reason. Likewise, in Brazil, related strains of T. vivax can cause both chronic disease of low parasitaemia and localized epidemics of up to 70% mortality rates, even in the same host species (although perhaps not the same genetic background) [38–40]. VAPPER allows us to identify and characterise differences in antigenic repertoires between strains, hosts, and conditions, which may be the starting point to build a real understanding of the association between disease genotypes and phenotypes. Importantly, with time this approach may be extended to the analysis of similar multi-copy, contingency gene families, particularly those involved in antigenic variation, in diverse pathogens.
Availability and requirements
Project name: VAPPER – High-throughput Variant Antigen Profiling in African trypanosomes
Project home page: https://github.com/PGB-LIV/VAPPER
Operating System: Platform independent
Programming language: Python
Installation Requirements: Velvet 1.2.10; HMMER 3.1b2; Bowtie 2 2.2.6; SAMtools 1.6; Cufflinks 2.2.1; BLAST 2.7.1; EMBOSS
License: Apache v.2.0
Declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Funding
This work was supported by a Grand Challenges (Round 11) award from the Bill and Melinda Gates Foundation, a BBSRC New Investigator Award (BB/M022811/1), and the Technology Directorate of the University of Liverpool to APJ.
Authors’ contributions
SSP wrote the original code in Perl and tested the software. JH and ARJ wrote the final code in Python. SSP and APJ conceptualized the software and wrote the manuscript. All authors contributed to and approved the final manuscript.