Abstract
Dual RNA-Seq is the simultaneous analysis of host and parasite transcriptomes. This approach can identify host-parasite interactions by correlated gene expression. Co-expression might highlight interlinked signaling, metabolic or gene regulatory pathways in addition to potentially physically interacting proteins. Numerous studies have used gene expression data to investigate Plasmodium infection causing malaria. Usually such studies focus on one organism – either the host or the parasite – and the other is considered “contaminant”. Dual RNA-Seq, in contrast, follows the rationale that cross-species interactions determine not only virulence of the parasite but also tolerance, resistance or susceptibility of the host.
Here we propose a meta-analysis approach for dual RNA-Seq. We screened malaria transcriptome experiments for studies providing gene expression data from both Plasmodium and its host. Out of 105 malaria studies in Homo sapiens, Macaca mulatta and Mus musculus, we identified 56 studies with the potential to provide host and parasite data. While 15 of these 56 studies (1935 samples in total) explicitly aimed to generate dual RNA-Seq data, 41 (1129 samples) had an original focus on either the host or the parasite. We show that a total of up to 2530 samples are suitable for dual RNA-Seq analysis, providing an unexplored potential for meta-analysis.
We argue that the multitude of variations in experimental conditions found in the selected studies should help narrow down a conserved core of cross-species interactions. Different hosts used as laboratory models for human malaria infection are infected by evolutionarily diverse species of the genus Plasmodium. We propose that a conserved core of interacting pathways and co-regulated genes might be identified by overlaying interaction networks of different host-parasite species pairs based on orthologous genes. Our approach might also provide the opportunity to gauge the applicability of model systems for different pathways in malaria studies.
Introduction
Transcriptomes are often analysed in a first attempt to understand cellular and organismic events, because a comprehensive profile of RNA expression can be obtained at reasonable cost and with high technical accuracy [1]. Microarrays dominated transcriptomics for over ten years since 1995 [2–4]. Microarrays quantify gene expression based on hybridisation of a target sequence to an immobilised probe of known sequence. Technical difficulties associated with microarrays lie in probe selection, cross-hybridization, and design cost of custom chips [5]. RNA sequencing (RNA-Seq) eliminates these difficulties and provides deep and accurate expression estimates for all RNAs in a sample. RNA-Seq has thus replaced microarrays as the predominant tool for transcriptomics [1,6]. RNA-Seq assesses host and parasite transcriptomes simultaneously, if RNA of both organisms is contained in a sample. Virulence of infectious disease is often a result of interlinked processes of both host and pathogen (“host-pathogen interactions”) and it has been proposed to analyse transcriptomes of both organisms involved in an infection to obtain a more complete understanding of disease [5–7]. This approach is called dual RNA-Seq.
In the case of malaria, unlike in bacterial infections, both the pathogen and the host are eukaryotic organisms with similar transcriptomes. Host and parasite mRNA is selected simultaneously when poly-dT priming is used to amplify polyadenylated transcripts [5,6]. This makes most malaria transcriptome datasets potentially suitable for dual RNA-Seq analysis. Malaria research, especially transcriptomics, is traditionally designed to target one organism, either the host or the parasite. Expression of mRNA, for example, can be compared between different time points in the life cycle of Plasmodium or between different drug treatment conditions. In the mammalian intermediate host, Plasmodium invades first liver cells and then red blood cells (RBCs) for development and asexual expansion. While the nuclear machinery of cells from both host and parasite produces mRNA in the liver, RBCs are enucleated and transcriptionally inactive in the mammalian host. In blood infections leukocytes are thus the source of host mRNA. Researchers conducting a targeted experiment might regard transcripts from the non-target organism as “contamination”. Nevertheless, expression of those transcripts potentially responds to stimuli during the investigation. Additionally, some recent studies on malaria make intentional use of dual RNA-Seq. Malaria is the most thoroughly investigated disease caused by a eukaryotic organism, and the accumulation of these two kinds of studies, RNA-Seq with “contaminants” and intentional dual RNA-Seq, provides a rich resource for meta-analysis.
Such a meta-analysis can use co-regulated gene expression to infer host-parasite interactions. Correlation of mRNA expression can be indicative of different kinds of biological “interactions”: On the one hand, protein products could be directly involved in the formation of complexes and might therefore be produced at quantities varying similarly under altered conditions. On the other hand, involvement in the same biological pathways can result in co-regulated gene expression without physical interaction. This broad concept of interaction has long been exploited in single organisms (e.g. [?, 8–10]). We (and others before [11]) propose to extrapolate this to interactions between the host and pathogen. It can be expected that a stimulus presented by the parasite to a host causes a host immune response and that the parasite in turn tries to evade this response, creating a cascade of genes co-regulated at different time points or under different conditions.
In this paper we explore first steps in a comparative meta-analysis of dual RNA-Seq transcriptomes. Existing raw read datasets collectively present an unexplored potential to answer questions that have not been investigated by individual studies. Meta-analysis increases the number of observations and statistical power and helps eliminate false positives and false negatives which may otherwise conceal important biological inferences [12–14]. Since mouse and macaque malaria are often used as laboratory models for human malaria, we analyse the availability and suitability of mRNA sequencing data from three evolutionarily close hosts (Homo sapiens, Macaca mulatta and Mus musculus) and their associated Plasmodium parasites. We summarize available data, challenges and approaches to obtain host-parasite interactions and discuss orthology across different host-parasite systems as a means to enrich information.
Data review and curation of potentially suitable studies
Sequence data generated in biological experiments is submitted to one of the three mirroring databases of the International Nucleotide Sequence Database Collaboration (INSDC): NCBI Sequence Read Archive (SRA), EBI Sequence Read Archive (ERA) and DDBJ Sequence Read Archive (DRA). Comprehensive query tools exist to access these databases via web interfaces and programmatically from scripting languages (for example, SRAdb, ENAbrowseR). In these databases, all experiments submitted under a single accession are given a single “study accession number” and are collectively referred to as a “study” from here onwards.
We used SRAdb [15], a Bioconductor/R package [16,17], to query SRA [18, 19] for malaria RNA-Seq studies with the potential to provide host and Plasmodium reads for our meta-analysis. We first selected studies with “library strategy” given as “RNA-Seq” and “Plasmodium” in study title, abstract or sample attributes using the “dbGetQuery” function. Then we used the “getSRA” function with the query “(malaria OR Plasmodium) AND RNA-Seq”. This function searches all fields. We manually curated the combined results and added studies based on a literature review using the terms described for the “getSRA” function in SRA, PubMed and Google Scholar. During this search, we disregarded 91 studies, all of which provide data from vectors and non-target hosts (e.g. avian malaria). A further 49 studies were excluded because their gene expression data were derived from Plasmodium spp. cultures in erythrocytes, blood or RPMI and thus can be expected to be devoid of host mRNA. We then used the SRAdb Bioconductor/R package, and the prefetch and fastq-dump functions from the SRA Toolkit, to download all replicate samples (called “runs” in the databases) of the selected studies. The curation of studies and the download were performed on 21 January 2019.
In total we found 56 potentially suitable studies in this database and literature review. The host organism was Homo sapiens for 22 studies, Mus musculus for 24 and Macaca mulatta for 10. The corresponding infecting parasites were P. falciparum, P. vivax and P. berghei in human studies (including four artificial infections of human liver cell culture with P. berghei), P. yoelii, P. chabaudi and P. berghei in mouse studies and P. cynomolgi and P. coatneyi in macaque studies (table 1).
We note that 20 of the 56 studies depleted (or enriched, respectively) specific classes of cells from their samples. Some studies, for example, targeted the parasite using vaccines derived from sporozoites during liver infection [20–22]. Such infection is physiologically asymptomatic and a low number of parasite cells [23] makes it difficult to study Plasmodium transcriptomes in this stage. To reduce overwhelming host RNA levels, 3 out of 10 liver studies sorted infected hepatoma cells from uninfected cells. Similarly, 17 other studies have depleted or enriched host WBCs (leukocytes) to focus expression analysis on Plasmodium or the host immune system, respectively. In all these scenarios, we suspect depletion to be imperfect and thus the samples to potentially include mRNA of both organisms. We note, however, that host gene expression for WBC-depleted samples might be problematic, as incomplete depletion might affect different types of WBCs differentially and hence bias the detectable host mRNA expression in the direction of less-depleted cell types. For similar reasons, parasite depletion might be challenging to control for.
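The first selection step can be pictured as a simple filter over study metadata. The following Python sketch is illustrative only and does not call SRAdb: the record fields (`study_accession`, `library_strategy`, `title`, `abstract`, `sample_attribute`) and the example accessions are hypothetical stand-ins for the columns such a query might return.

```python
def mentions_plasmodium(record):
    """True if 'Plasmodium' occurs in title, abstract or sample attributes."""
    fields = ("title", "abstract", "sample_attribute")
    # Missing or empty fields are treated as empty strings.
    return any("plasmodium" in (record.get(f) or "").lower() for f in fields)


def select_candidate_studies(records):
    """Sorted study accessions of RNA-Seq records that mention Plasmodium."""
    return sorted({
        r["study_accession"]
        for r in records
        if r.get("library_strategy") == "RNA-Seq" and mentions_plasmodium(r)
    })


# Hypothetical metadata records (not real accessions).
records = [
    {"study_accession": "STUDY_A", "library_strategy": "RNA-Seq",
     "title": "Plasmodium berghei liver stage", "abstract": "",
     "sample_attribute": ""},
    {"study_accession": "STUDY_B", "library_strategy": "WGS",
     "title": "Plasmodium falciparum genomes", "abstract": "",
     "sample_attribute": ""},
    {"study_accession": "STUDY_C", "library_strategy": "RNA-Seq",
     "title": "Mouse spleen", "abstract": None,
     "sample_attribute": "infected with plasmodium yoelii"},
]
candidates = select_candidate_studies(records)
```

Results of such an automated filter would still need the manual curation described above, for example to remove vector studies or pure parasite cultures.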
For 15 out of the 56 studies the authors state that they intended to simultaneously study host and parasite transcriptomes (“dual RNA-Seq”). This includes 8 studies from MaHPIC (Malaria Host-Pathogen Interaction Center), based at Emory University, that made extensive omics measurements in macaque malaria. The original focus of the remaining 41 studies was on the parasite in 20 and on the host in 21 cases.
Plasmodium parasites sequester in bone marrow, adipose tissue, lung, spleen and brain (the latter causing cerebral malaria) [24,25]. To study a comprehensive spectrum of host-parasite interactions it would be optimal to have data from these different tissues. Our collection of studies represents data derived from blood and liver for all three host organisms. In addition, we have seven spleen studies ([26–32]) and two studies of cerebral malaria ([33,34]) from mice. MaHPIC offers a collection of blood and bone marrow studies in macaques.
Experiments performed on mouse blood focus on the parasite instead of the host (11 vs. 0). Studies on human blood infection focus more often on the host immune response than on the parasite (9 vs. 5). Liver and spleen studies focus on host and parasite almost equally often, with sources for host tissue in this case being either mice (in vivo) or hepatoma cell cultures (in vitro). We argue here that small clusters of genes co-expressed across several such diverse conditions might help to point towards potentially novel core host-parasite interactions.
Dual RNA-Seq suitability analysis
A sample (experimental replicate or “run” in the jargon of sequencing databases) suitable for dual RNA-Seq analysis must provide “sufficient” gene expression from both host and parasite. To assess the proportion of host and parasite RNA sequencing reads in each study and sample, we mapped sequencing reads onto concatenated host and parasite reference genomes using STAR [35,36]. Simultaneous mapping against both genomes should avoid non-specific mapping of reads in regions conserved between host and parasites. We quantified the sequencing reads mapped to exons using the “countOverlaps” function of the GenomicRanges package [37] and calculated the proportion of reads mapping to host and parasite genes.
The proportions of host and parasite reads for each run do not always reflect the original focus of a study (fig. 1a). Studies using no depletion or enrichment give us an idea of how skewed overall RNA expression is towards one organism under native conditions: in studies on blood stage infections, the original focus is mostly on immune gene expression from leukocytes. In the respective samples the number of host reads is often overwhelming unless parasitemia is very high, as in studies originally designed to use a dual RNA-Seq approach on blood stages. Samples with lower parasitemia are mostly not suitable for dual RNA-Seq analysis (table 1).
Many studies using depletion or enrichment prior to RNA sequencing (“enriched/depleted” in fig. 1a) show considerable expression of the non-target organism. Studies on liver infection, for example [38] and [39], comprise several runs with balanced proportions of host and parasite reads. This is a result of infected liver cells being sorted from uninfected cells in culture. While the parasite has been the original target organism in most of these studies, they provide data suitable for dual RNA-Seq. Studies depleting whole blood of leukocytes to focus on parasite transcriptomes still show considerable host gene expression and provide principally suitable runs for the analysis of blood infection at lower intensities. The latter comes with the caveat that host expression might be biased by unequal depletion of particular cell types.
To establish suitability thresholds for inclusion of individual samples (runs) in further analysis we plotted the number of host and parasite reads against the number of host and parasite genes expressed (fig. 1b and fig. 1c). For runs with high sequencing depth the total number of expressed genes of the host and parasite approaches the number of annotated genes: around 30000 for the mammalian host and around 4500 for Plasmodium. When sequencing depth is lower, the number of genes detected as expressed is lower and a decrease in sensitivity can be expected to prevent analysis of lowly expressed genes. We propose four parameters for suitability thresholds in dual RNA-Seq analysis: the number of reads mapping to (1) host and (2) parasite genes and the number of genes these reads map to (expressed genes) in (3) host and (4) parasite. In table 1, we give the number of runs considered suitable for three different combinations of thresholds. Without claiming any particular threshold to be ideal, we propose to use thresholds to exclude uninformative runs from further processing and thereby reduce the computational burden of co-expression analysis.
Suitable runs at the thresholds chosen here are identified from human-P. falciparum, monkey-P. cynomolgi, human-P. berghei and mouse-P. berghei systems. Unfortunately, with current thresholds and currently available data, we highly under-represent human-P. vivax and human-P. berghei systems, the two liver in vitro models. This outcome is understandable owing to the low parasitemia in liver cultures [40]. We note that the thresholds could further be made lenient enough to include more runs for these systems at the cost of analysing only the most highly expressed parasite genes. An alternative approach relies on depleted/enriched samples for these systems. For further analysis, however, it could prove challenging to include depleted/enriched samples, as discussed before. Analysis approaches such as multilayer networks (see below) might help to gauge problems with such runs for the inference of co-expression in further steps of analysis.
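To make the last step concrete, the following Python sketch computes host and parasite read proportions for one run from per-gene counts. It assumes that counting against the concatenated reference has already been done and that parasite genes can be recognised by their identifiers; the gene names and counts are hypothetical.

```python
def read_proportions(gene_counts, parasite_genes):
    """Return (host, parasite) proportions of reads mapped to genes.

    gene_counts: dict mapping gene identifier -> number of mapped reads.
    parasite_genes: set of identifiers belonging to the parasite; every
    other identifier is treated as a host gene (an assumption of this sketch).
    """
    parasite = sum(c for g, c in gene_counts.items() if g in parasite_genes)
    host = sum(c for g, c in gene_counts.items() if g not in parasite_genes)
    total = host + parasite
    if total == 0:
        # A run without any reads on annotated genes has no defined proportions.
        return 0.0, 0.0
    return host / total, parasite / total


# Hypothetical counts for a single run.
counts = {"host_gene_1": 600, "host_gene_2": 300, "parasite_gene_1": 100}
host_prop, parasite_prop = read_proportions(counts, {"parasite_gene_1"})
```

In this toy run nine tenths of the gene-mapped reads come from the host.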
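The four-parameter filter can be sketched as follows. This is a minimal Python illustration, not the procedure used for table 1: the class names, the run identifiers and all threshold values are hypothetical and chosen only to exercise the filter.

```python
from dataclasses import dataclass


@dataclass
class RunSummary:
    run_id: str
    host_reads: int      # (1) reads mapping to host genes
    parasite_reads: int  # (2) reads mapping to parasite genes
    host_genes: int      # (3) host genes counted as expressed
    parasite_genes: int  # (4) parasite genes counted as expressed


@dataclass
class Thresholds:
    host_reads: int
    parasite_reads: int
    host_genes: int
    parasite_genes: int


def is_suitable(run, t):
    """A run passes only if all four parameters reach their threshold."""
    return (run.host_reads >= t.host_reads
            and run.parasite_reads >= t.parasite_reads
            and run.host_genes >= t.host_genes
            and run.parasite_genes >= t.parasite_genes)


# Hypothetical threshold values and runs.
thresholds = Thresholds(host_reads=1_000_000, parasite_reads=100_000,
                        host_genes=10_000, parasite_genes=1_000)
runs = [
    RunSummary("run_balanced", 5_000_000, 2_000_000, 18_000, 4_000),
    RunSummary("run_host_only", 9_000_000, 500, 20_000, 40),
    RunSummary("run_shallow", 200_000, 150_000, 6_000, 2_500),
]
suitable = [r.run_id for r in runs if is_suitable(r, thresholds)]
```

Keeping the thresholds in one object makes it easy to rerun the filter with stricter or more lenient combinations.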
Identification of co-expressed genes via correlation techniques
Some genes are likely to show almost uniform expression under different experimental conditions (“housekeeping genes”). Naive assessments of correlation could, however, identify pairs of such genes as highly correlated. An analysis of co-expression can deal with this challenge in two different ways:
Firstly, the most variable genes within and across studies can be selected and other genes discarded. While requiring little computational time and resources, exclusion of genes with too little variance in expression from downstream analysis should be performed with caution, as seemingly small variations might result in a suitable signal over a large set of runs. To select only variable genes, one option is to compute their variance across all samples (in one or multiple studies). Genes with variance below a threshold may then be excluded from further analysis. As variance increases with the mean for gene expression data, the Biological Coefficient of Variation (BCV) [41,42] may provide a more robust threshold.
Secondly, one can compute empirical correlation indices, similar to p-values, for any gene-pair. Empirical p-values are a robust way to estimate whether gene-pairs are correlated because of specific events (treatment condition, time point) and not by chance (e.g., housekeeping genes) [43,44]. These methods construct a null distribution using permutations of the given data instead of assuming a null distribution in advance. Since host and parasite genomes total nearly 30,000 genes, the number of permutations has to be around 1.6 × 10^9 to be suitable for corrections for multiple testing. Alternatively, as computational costs for these permutations can be expected to be too high for datasets with thousands of samples, non-corrected “p-values” may be considered a ranking for host-parasite gene correlation, following the suggestion of Reid and Berriman [11]. Nevertheless, reliance on empirical computation of p-values without prior variance/BCV filtering might become impracticable for very large datasets in the proposed meta-analysis.
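The permutation idea behind empirical p-values can be illustrated for a single host-parasite gene pair. The Python sketch below assumes Pearson correlation and a null distribution obtained by shuffling the run order of one gene; both choices, the function names and the expression values are assumptions made for illustration and are not necessarily those of the cited methods.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def empirical_p_value(x, y, n_permutations=1000, seed=0):
    """Fraction of permutations with |r| at least as large as observed."""
    rng = random.Random(seed)
    observed = abs(pearson(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # breaks the pairing of runs between genes
        if abs(pearson(x, shuffled)) >= observed:
            hits += 1
    # Add-one correction so that the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)


# Hypothetical expression of one host and one parasite gene over eight runs.
host_gene = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
parasite_gene = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]
p = empirical_p_value(host_gene, parasite_gene)

# A uniformly expressed gene cannot stand out from its permutations.
p_flat = empirical_p_value([1.0, 1.0, 1.0, 1.0], [1.0, 2.0, 3.0, 4.0], 50)
```

The smallest value this estimator can return is 1 / (n_permutations + 1), which is why the number of permutations must grow with the number of gene pairs if the values are to survive multiple-testing correction.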
We consider partial correlation as an additional approach that could be combined with the above methods. Partial correlation can control pairwise correlations for the influence of other genes [45]. In transcriptomic applications full-conditioned partial correlation is computationally very expensive. Some studies therefore resort to second-order partial correlation (relationship between two genes independent of two other genes) [46–48]. A suitable pipeline might first use (zero-order partial, that is “regular”) correlation with empirical p-values to remove constitutively expressed gene-pairs. For all correlations with an empirical “p-value” below a certain threshold, one could compute e.g. first-order partial correlations reducing the number of computations. Iterations of such an approach with higher-order partial correlations are then possible.
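For reference, the first-order partial correlation of two genes x and y given a third gene z can be computed from the three pairwise (zero-order) correlations with the standard formula r_xy.z = (r_xy - r_xz r_yz) / sqrt((1 - r_xz^2)(1 - r_yz^2)). The Python sketch below implements only this formula; the function name and the input values are hypothetical.

```python
from math import sqrt


def first_order_partial(r_xy, r_xz, r_yz):
    """Correlation of x and y after removing the linear effect of z."""
    denominator = sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
    if denominator == 0:
        # x or y is perfectly correlated with z; the value is undefined.
        raise ValueError("partial correlation undefined when |r_xz| or |r_yz| is 1")
    return (r_xy - r_xz * r_yz) / denominator


# If z is unrelated to both genes, the correlation is unchanged.
unchanged = first_order_partial(0.5, 0.0, 0.0)

# If the correlation of x and y is fully explained by z, it vanishes.
explained = first_order_partial(0.64, 0.8, 0.8)
```

The second case shows the intended use: a host-parasite pair whose correlation is entirely driven by a third gene drops out once that gene is controlled for.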
Across different studies; across different host-parasite systems
Gene × gene matrices obtained from correlation analysis can be visualised and analysed as interaction networks. We have identified different but interlinked workflows to reconstruct a consensus network of expression correlation (fig. 2). A first approach (fig. 2(a)) integrates data from different studies of one host-parasite system by simply appending expression profiles of their runs.
Knowledge of 1:1 orthologs [49] between different host and different parasite species can be used in the next steps to integrate across different host-parasite systems. Humans and macaques share 18179 1:1 orthologous genes, humans and mice share 17089 orthologous genes and 14776 genes are 1:1:1 orthologs among all three species. Similarly, 7760 groups of orthologous genes exist among the Plasmodium species. A simple approach to combine data across host-parasite systems could again append those orthologs in the original datasets before correlations of gene expression.
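Appending runs across systems by orthology can be sketched as follows. The Python example assumes expression profiles stored as a mapping from gene identifier to a list of per-run values and a precomputed table of 1:1 ortholog groups; all identifiers (`h1`, `m1`, `OG1`, ...) and values are hypothetical.

```python
def to_ortholog_space(expression, ortholog_group):
    """Rename genes to their ortholog group; drop genes without a group."""
    return {ortholog_group[gene]: values
            for gene, values in expression.items()
            if gene in ortholog_group}


def append_systems(first, second):
    """Concatenate runs of two systems for ortholog groups present in both."""
    shared = sorted(set(first) & set(second))
    return {group: list(first[group]) + list(second[group]) for group in shared}


# Hypothetical expression values: two human runs and three mouse runs.
human = {"h1": [5.0, 7.0], "h2": [1.0, 0.0], "h3": [2.0, 2.0]}
mouse = {"m1": [4.0, 6.0, 8.0], "m2": [0.0, 1.0, 1.0]}

# Hypothetical 1:1 ortholog table; h3 has no mouse ortholog.
human_groups = {"h1": "OG1", "h2": "OG2"}
mouse_groups = {"m1": "OG1", "m2": "OG2"}

combined = append_systems(to_ortholog_space(human, human_groups),
                          to_ortholog_space(mouse, mouse_groups))
```

Genes without a 1:1 ortholog are lost in this scheme, which is the price of a shared gene space. The same renaming would apply to the parasite genes of each system.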
Alternatively, to construct a consensus network involving all hosts and parasites, a multi-layer network analysis could align networks by orthologous genes. This approach can offer more control when looking for similar correlation in different layers representing different host-parasite systems. Similarly, more insight could be possible when correlations from different types of tissues are combined as multilayer networks. This would only require the construction of networks for a single host-parasite system and multi-layered network analysis on networks from single studies of the same host-parasite system.
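As a deliberately simplified stand-in for a multilayer analysis, the Python sketch below treats each host-parasite system as one layer of host-parasite edges expressed in ortholog groups and keeps edges supported by a minimum number of layers. Counting edge support is an assumption made here for illustration; dedicated multilayer network methods are considerably richer. Layer names and ortholog identifiers are hypothetical.

```python
from collections import Counter


def consensus_edges(layers, min_layers):
    """Edges (host group, parasite group) found in at least min_layers layers."""
    support = Counter()
    for edges in layers.values():
        # Each layer contributes at most once to the support of an edge.
        support.update(set(edges))
    return {edge for edge, n in support.items() if n >= min_layers}


# Hypothetical co-expression edges per host-parasite system.
layers = {
    "human-P. falciparum": {("OG_host_1", "OG_par_1"), ("OG_host_2", "OG_par_3")},
    "mouse-P. berghei": {("OG_host_1", "OG_par_1"), ("OG_host_4", "OG_par_2")},
    "macaque-P. cynomolgi": {("OG_host_1", "OG_par_1"), ("OG_host_2", "OG_par_3")},
}
core = consensus_edges(layers, min_layers=3)
partial = consensus_edges(layers, min_layers=2)
```

Varying `min_layers` gives a rough view of which correlations are shared by all systems and which are restricted to a subset of them.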
We expect correlation between host and parasite transcript expression to highlight host-parasite interactions that merit scrutiny in further focussed research. As a second goal, meta-analysis involving different host-parasite systems could give insights into how easily results obtained in malaria models can be translated to human malaria. If, for example, certain groups of pathways show lower evolutionary conservation in host-parasite co-expression networks, one could expect results on those to be harder to translate between systems. Finally, one can ask whether expression correlation between host and parasite species is more or less evolutionarily conserved than within host species [50–52].
References
- [1].
- [2].
- [3].
- [4].
- [5].
- [6].
- [7].
- [8].
- [9].
- [10].
- [11].
- [12].
- [13].
- [14].
- [15].
- [16].
- [17].
- [18].
- [20].
- [21].
- [22].
- [23].
- [24].
- [25].
- [26].
- [27].
- [28].
- [29].
- [30].
- [31].
- [32].
- [33].
- [34].
- [35].
- [36].
- [37].
- [38].
- [39].
- [40].
- [41].
- [42].
- [43].
- [44].
- [45].
- [46].
- [47].
- [48].
- [49].
- [50].
- [51].
- [52].
- [53].
- [54].
- [55].
- [56].
- [57].
- [58].
- [59].
- [60].
- [61].
- [62].
- [63].
- [64].
- [65].
- [66].
- [67].
- [68].
- [69].
- [70].
- [71].
- [72].
- [73].
- [74].
- [75].
- [76].
- [77].
- [78].
- [79].
- [80].
- [81].
- [82].
- [83].
- [84].
- [85].
- [86].
- [87].
- [88].
- [89].
- [90].
- [91].
- [92].
- [93].
- [94].
- [95].
- [96].
- [97].
- [98].
- [99].
- [100].
- [101].
- [102].
- [103].
- [104].
- [105].
- [106].
- [107].
- [108].
- [109].
- [110].
- [111].
- [112].
- [113].
- [114].
- [115].
- [116].
- [117].
- [118].
- [119].
- [120].
- [121].
- [122].
- [123].
- [124].
- [125].
- [126].
- [127].
- [128].
- [129].
- [130].
- [131].
- [132].
- [133].
- [134].
- [135].
- [136].
- [137].
- [138].
- [139].
- [140].
- [141].
- [142].
- [143].
- [144].
- [145].
- [146].
- [147].
- [148].
- [149].
- [150].
- [151].
- [152].
- [153].
- [154].
- [155].
- [156].
- [157].
- [158].
- [159].
- [160].
- [161].
- [162].
- [163].
- [164].
- [165].
- [166].
- [167].
- [168].
- [169].
- [170].
- [171].
- [172].
- [173].
- [174].
- [175].
- [176].
- [177].
- [178].
- [179].
- [180].
- [181].
- [182].
- [183].
- [184].
- [185].
- [186].
- [187].
- [188].
- [189].
- [190].
- [191].
- [192].
- [193].
- [194].
- [195].
- [196].
- [197].