Abstract
Background The last decade has seen a major increase in the availability of genomic data. This includes expert-curated databases that describe the biological activity of genes, as well as high-throughput assays that measure the gene expression of bulk tissue and single cells. Integrating these heterogeneous data sources can generate new hypotheses about biological systems. Our primary objective is to combine population-level drug-response data with patient-level single-cell expression data to predict how any gene will respond to any drug for any patient.
Methods We use a “dual-channel” random walk with restart (RWR) algorithm to perform 3 analyses. First, we use glioblastoma single cells from 5 individual patients to discover genes whose functions differ between cancers. Second, we use drug screening data from the Library of Integrated Network-Based Cellular Signatures (LINCS) to show how a cell-specific drug-response signature can be accurately predicted from a baseline (drug-free) gene co-expression network. Finally, we combine both data streams to show how the RWR algorithm can predict how any gene will respond to any drug for each of the 5 glioblastoma patients.
Conclusions Our manuscript introduces two innovations to the integration of heterogeneous biological data. First, we use a “dual-channel” RWR method to predict up-regulation and down-regulation separately. Second, we use individualized single-cell gene co-expression networks to make personalized predictions. These innovations let us predict gene function and drug response for individual patients. When applied to real data, we identify a number of genes that exhibit a patient-specific drug response, including the pan-cancer oncogene EGFR.
1 Introduction
Advances in high-throughput RNA-sequencing (RNA-Seq) have made it possible to measure the relative amount of RNA in any biological sample [28]. The resultant gene expression signature can serve as a biomarker for disease prediction [14, 1, 47] and surveillance [29, 37]. Over the last few years, single-cell RNA-Seq has risen in popularity [13]. Compared with conventional bulk RNA-Seq, which measures the average gene expression for an individual sample, single-cell RNA-Seq (scRNA-Seq) measures the gene expression for an individual cell. This new mode of data collection makes it possible to explore tissue heterogeneity, notably tumor heterogeneity [23].
RNA-Seq and scRNA-Seq both measure the relative abundance of tens of thousands of genes, making the data high-dimensional. Although per-gene differential expression analyses are popular, genes are often understood to work in cooperative modules, making the analysis of gene co-expression networks an attractive option. However, RNA-Seq and scRNA-Seq data are especially difficult to study. Beyond requiring several pre-processing steps, the summarized data arise from a sampling process that introduces between-sample biases in which the total number of counts, called the sequencing depth, depends on technical factors, not on the amount of input material [12, 41, 36]. Analysts often attempt to remove this bias with an effective library size normalization, or with normalization to a spike-in or house-keeping transcript [26] (though all normalizations have limitations [35]). Instead, one could build normalization-free gene co-expression networks using proportionality [25]. Although this does not offer a perfect solution [11], studying gene-gene proportionality has a strong theoretical justification [25] and empirically outperforms other metrics of association for scRNA-Seq [40].
For bulk RNA-Seq, a gene co-expression network describes how genes co-occur for a population of individual samples. As such, the network characterizes a sample cohort. On the other hand, a scRNA-Seq network describes gene co-expression for a population of single cells. When these cells belong to an individual patient, the scRNA-Seq network is a kind of personalized network that one could use for precision medicine tasks.
Whether using bulk RNA-Seq or scRNA-Seq, analysts often want to interpret gene co-expression networks to draw biologically meaningful conclusions. Most commonly, this is done by integrating outside information from an annotation database, a curated relational database that associates molecular functions with gene labels (e.g., Gene Ontology [2]). The analysis then seeks to combine the general knowledge (in the form of a relational database) with some specific knowledge about a sample (in the form of a co-expression network). Weighted gene co-expression network analysis is one popular method used to functionally characterize parts of the network, or the network as a whole [21, 22]. Although these coarse descriptions are useful, one could also combine general and specific knowledge to make finer-level predictions about the behavior of individual genes. By representing each modality as a graph, multiple data streams can be combined into a heterogeneous information network, and then analyzed under a unified framework based on the principle of “guilt-by-association” [45] (e.g., if “a” is connected to “b” and “b” is connected to “c”, then “a” is probably connected to “c”). When the general knowledge is gene-annotation associations, we can (a) impute the function for genes with no known role or (b) select the most important known function. When the general knowledge is gene-drug response, we can predict the response of any gene to any drug. Since these inferences are tailored to the co-expression network used, they can be made personalized by using the single-cell network of an individual patient.
Random walk (RW) is a popular method that offers a general solution to the analysis of heterogeneous information networks [32, 45]. One could conceptualize RW as a measure of how a blindfolded person would randomly “walk” along a graph. There are many variants to RW, including random walk with restart (RWR). For RWR, each step has a probability of restarting from the starting node (or a neighbor of the starting node) [44]. RW and RWR are often used in recommendation systems [3, 8, 18], but can also perform other machine learning tasks like image segmentation [15, 17], image captioning [30], or community detection [34, 20]. One advantage of RW is that it can handle missing data [16], making it a good choice for processing sparse gene annotation databases and zero-laden single-cell data. RW and RWR have both found use in the analysis of biological data, often to find associations between genes and another data modality. For example, the “InfAcrOnt” method used an RW-based method to infer similarities between ontology terms by integrating annotations with a gene-gene interaction network [6]. Similarly, the “RWLPAP” method used RW to find lncRNA-protein associations [50], while others have used RW to predict gene-disease associations [51]. Meanwhile, RWR has been used to identify epigenetic factors within the genome [24], key genes involved in colorectal cancer [9], novel microRNA-disease associations [43], infection-related genes [52], disease-related genes [46], and functional similarities between genes [33]. Bi-random walk, another random walk variant, has been used to rank disease genes from a protein-protein interaction network [48].
In contrast to the previous work, which made use of population-level graphs, we apply RWR to patient-level graphs, allowing us to make predictions about gene behavior that are personalized to each patient. In this manuscript, we perform 3 key analyses, along with 2 forms of in silico validation. First, we use glioblastoma single cells from 5 individual patients to discover genes whose functions differ between cancers. Second, we use drug screening data from the Library of Integrated Network-Based Cellular Signatures (LINCS) to show how a cell-specific drug-response signature can be accurately predicted from a baseline (drug-free) gene co-expression network. Finally, we combine both data streams to show how the RWR algorithm can predict how any gene will respond to any drug for each of the 5 glioblastoma patients. Our analysis reveals a number of genes that exhibit a patient-specific drug response, including the pan-cancer oncogene EGFR. To the best of our knowledge, this is the first application of RWR on personalized single-cell networks to predict the response of any gene to any drug for any patient.
2 Methods
2.1 Overview
In the medical domain, gene expression can be used as a biomarker to measure the functional state of a cell. One way in which drugs mediate their therapeutic or toxic effects is by altering gene expression. However, the assays needed to test how gene expression changes in response to a drug are expensive and time consuming. Imputation has the potential to accelerate research by “recommending” novel gene-drug relationships for follow-up validation, but can be complicated by having heterogeneous and sparse data. Random walk methods can combine sparse heterogeneous graphs based on the principle of “guilt-by-association” [45]. Figure 1 provides an abstracted schematic of the proposed framework. Figure 2 provides a visualization of the input and output for the random walk with restart (RWR) method. Figure 3 presents a bird’s-eye view of the data collection, integration, and analysis steps performed in this study.
2.2 Data acquisition
The gene expression data come from two primary sources. First, we acquired single-cell RNA-Seq (scRNA-Seq) expression data for 5 glioblastoma multiforme tumors [31] using the recount2 package for the R programming language [7] (ID: SRP042161). Since scRNA-Seq data are incredibly sparse, and since the random walk with restart algorithm is computationally expensive, we elected to remove genes that had zero values in more than 25% of cells. This resulted in 3022 genes. We then randomly split the cells into 5 folds per patient so that we could estimate the variability of our downstream analyses. Second, we acquired gene expression data from the Library of Integrated Network-Based Cellular Signatures (LINCS) [19] using the Gene Expression Omnibus (GEO) [10] (ID: GSE70138). We split these LINCS data into smaller data sets based on the cell line ID under study. We included the A375, HA1E, HT29, MCF7, and PC3 cell lines because they were treated with the largest number of drugs.
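The filtering and fold-splitting steps described above can be sketched as follows. This is a minimal illustration in plain Python, not the code used in the study: the function names, the dictionary layout of the count data, and the fixed random seed are all assumptions made for the example.

```python
import random

def filter_sparse_genes(counts, max_zero_fraction=0.25):
    """Keep genes whose fraction of zero-valued cells is at most max_zero_fraction.

    counts: dict mapping gene name -> list of per-cell counts (hypothetical layout).
    """
    kept = {}
    for gene, values in counts.items():
        zero_fraction = sum(1 for v in values if v == 0) / len(values)
        if zero_fraction <= max_zero_fraction:
            kept[gene] = values
    return kept

def split_into_folds(cell_ids, n_folds=5, seed=0):
    """Randomly partition the cells of one patient into n_folds disjoint folds."""
    shuffled = list(cell_ids)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::n_folds] for i in range(n_folds)]

# Toy data: GENE_A is zero in 25% of cells (kept), GENE_B in 62.5% (removed).
counts = {
    "GENE_A": [0, 3, 5, 2, 1, 4, 0, 6],
    "GENE_B": [0, 0, 0, 2, 1, 0, 0, 6],
}
filtered = filter_sparse_genes(counts)
folds = split_into_folds(range(8), n_folds=5)
```

In this sketch the fold split would be applied once per patient, so that each patient contributes 5 disjoint sets of cells.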
2.3 Defining the gene co-expression network graphs
Although correlation is a popular choice for measuring gene co-expression, correlations can yield spurious results for next-generation sequencing data [25]. Instead, we calculate the proportionality between genes using the ϕs metric from the propr package for the R programming language [38]. This metric describes the dissimilarity between any two genes, and takes values in [0, ∞), where 0 indicates a perfect association. We converted this to a similarity measure ϕi that takes values in [0, 1] by max-scaling, ϕi = (max(ϕs) − ϕs)/max(ϕs), such that ϕi = 1 when ϕs = 0. A gene-gene matrix of ϕi scores is analogous to a gene-gene matrix of correlation coefficients, and constitutes our gene co-expression network. We calculated the ϕi co-expression network for the entire scRNA-Seq data set (1 network), for each of the 5 folds per patient (25 networks total), and for each baseline (drug-free) cell line (5 networks total). All co-expression networks are available from https://zenodo.org/record/3522494.
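As a small worked example of the max-scaling step, the sketch below converts a toy ϕs dissimilarity matrix into ϕi similarities. It assumes the ϕs values have already been computed elsewhere (for instance with propr); the function name and the list-of-lists matrix layout are illustrative choices only.

```python
def phi_to_similarity(phi_s):
    """Max-scale a square dissimilarity matrix (list of lists) into [0, 1].

    Implements phi_i = (max(phi_s) - phi_s) / max(phi_s), so that a
    dissimilarity of 0 maps to a similarity of 1.
    """
    max_phi = max(max(row) for row in phi_s)
    if max_phi == 0:
        # Degenerate case (assumption): every pair is perfectly proportional.
        return [[1.0 for _ in row] for row in phi_s]
    return [[(max_phi - value) / max_phi for value in row] for row in phi_s]

# Toy 3-gene dissimilarity matrix.
phi_s = [
    [0.0, 0.5, 2.0],
    [0.5, 0.0, 1.0],
    [2.0, 1.0, 0.0],
]
phi_i = phi_to_similarity(phi_s)
```

Note that the most dissimilar pair in a network always receives ϕi = 0, so ϕi values are relative to the network in which they were computed.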
2.4 Defining the bipartite graphs
We constructed two types of bipartite graphs: the gene-annotation graph and the gene-drug graph. First, we made the gene-annotation graph from the Gene Ontology Biological Process database [2] via the AnnotationDbi and org.Hs.eg.db Bioconductor packages. An edge exists whenever a gene is associated with an annotation. Second, we made the gene-drug graphs using the LINCS data. For each cell line, we computed a gene-drug graph by calculating the log-fold change between the median of the drug-treated cell’s expression and the median of the drug-naive cell’s expression. This results in a fully-connected and weighted bipartite graph, where a large positive value means that the drug causes the gene to up-regulate, and a large negative value means that the drug causes the gene to down-regulate. All bipartite graphs are available from https://zenodo.org/record/3522494.
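The gene-drug edge weight can be illustrated with a short sketch. The text does not state the logarithm base or how zero medians are handled, so the base-2 logarithm and the `pseudocount` parameter below are assumptions made for this example, as is the function name.

```python
import math
import statistics

def log_fold_change(treated, control, pseudocount=1.0):
    """Log2 ratio of median treated expression to median drug-naive expression.

    The pseudocount (an assumption, not from the source) avoids log(0).
    """
    median_treated = statistics.median(treated) + pseudocount
    median_control = statistics.median(control) + pseudocount
    return math.log2(median_treated / median_control)

# Toy replicate measurements of one gene under one drug.
up_weight = log_fold_change(treated=[7, 15, 31], control=[3, 3, 3])
down_weight = log_fold_change(treated=[0, 1, 2], control=[7, 7, 7])
```

A positive weight (here `up_weight`) would feed the positive channel of the random walk, and a negative weight (here `down_weight`) the negative channel.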
2.5 The combined co-expression and bipartite graph
Consider a graph G with vertices V = {1, …, N}, positive edges E+, and negative edges E−. The graphs used for our analyses are composed of two parts: a (general knowledge) bipartite graph and a (specific knowledge) fully-connected gene co-expression graph. For a bipartite graph, the vertex set V can be separated into two distinct sets, V1 and V2, such that no edges exist within either set. For a fully-connected (or complete) graph, there exists an edge between every pair of vertices within one set. For our graph G, the bipartite and fully-connected graphs are joined via the common vertex set V1 that contains genes. The vertex set V2 contains annotations or drugs.
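One way to picture the combined graph is as a single block adjacency matrix, with genes (V1) listed first and annotations or drugs (V2) listed after them. The sketch below assembles such a matrix. It assumes an undirected graph and drops gene self-loops; neither choice is stated in the text, and the function name is hypothetical.

```python
def combine_graphs(coexpression, bipartite):
    """Join a gene-gene matrix and a gene-target matrix into one adjacency matrix.

    coexpression: n_genes x n_genes similarity matrix (V1 x V1).
    bipartite:    n_genes x n_targets weight matrix (V1 x V2).
    Target-target entries stay 0 because no edges exist within V2.
    """
    n_genes = len(coexpression)
    n_targets = len(bipartite[0])
    size = n_genes + n_targets
    adjacency = [[0.0] * size for _ in range(size)]
    for i in range(n_genes):
        for j in range(n_genes):
            if i != j:  # assumption: self-loops are dropped
                adjacency[i][j] = coexpression[i][j]
        for t in range(n_targets):
            adjacency[i][n_genes + t] = bipartite[i][t]
            adjacency[n_genes + t][i] = bipartite[i][t]
    return adjacency

# Two genes, two drugs: gene 0 is up-regulated by drug 0, gene 1 is
# down-regulated by drug 1.
coexpression = [[1.0, 0.8], [0.8, 1.0]]
bipartite = [[1.5, 0.0], [0.0, -2.0]]
adjacency = combine_graphs(coexpression, bipartite)
```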
2.6 Dual-channel random walk with restart (RWR)
Traditional RWR methods can only perform a random walk on graphs with positive edge weights [32]. Since the response of a gene to a drug is directional (up-regulated or down-regulated), we chose to use a modified RWR method, proposed by [5], that handles graphs with both positive and negative edge weights. Random walk requires transition probability matrices to decide the next step in the walk. The Chen et al. transition probability matrices can be computed based on the following equations: when eij ≥ 0, and when eij < 0. For all equations, eij is the edge weight between nodes xi and xj, and N (xi) is the set of neighbors for node xi. These equations separate out the positive (and negative) transitions, and are used to calculate the total positive (and negative) information flow for each node. They are fixed for all steps.
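To make the construction of the two transition matrices concrete, the following sketch splits a signed adjacency matrix into a positive and a negative transition matrix. The normalization used here, dividing each edge weight by the sum of absolute edge weights over the neighbors of xi, is an assumption chosen so that the two channels together form a probability distribution for each node; it should be checked against the definitions in [5]. All names are illustrative.

```python
def signed_transition_matrices(adjacency):
    """Split a signed adjacency matrix into positive and negative transition matrices.

    Assumed form: P+[i][j] = e_ij / sum_k |e_ik| when e_ij >= 0,
                  P-[i][j] = -e_ij / sum_k |e_ik| when e_ij < 0.
    """
    n = len(adjacency)
    p_pos = [[0.0] * n for _ in range(n)]
    p_neg = [[0.0] * n for _ in range(n)]
    for i, row in enumerate(adjacency):
        total = sum(abs(weight) for weight in row)
        if total == 0:
            continue  # isolated node: no outgoing transitions
        for j, weight in enumerate(row):
            if weight >= 0:
                p_pos[i][j] = weight / total
            else:
                p_neg[i][j] = -weight / total
    return p_pos, p_neg

# Node 0 has one positive and one negative neighbor of equal strength.
adjacency = [
    [0.0, 2.0, -2.0],
    [2.0, 0.0, 0.0],
    [-2.0, 0.0, 0.0],
]
p_pos, p_neg = signed_transition_matrices(adjacency)
```

Under this assumed normalization, the positive and negative transition probabilities leaving any connected node sum to 1.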
Though the transition probabilities are computed separately, the information accumulated in a node depends on both the positive and negative information which flows through the node. For example, the positive information in a node depends on the negative information of any neighboring node connected by a negative edge weight (think: negative times negative is positive). Likewise, negative information in a node depends on the positive information in a neighboring node connected by a negative edge weight, and vice versa (think: negative times positive is negative). Figure 4 illustrates the information flow to a node xj from two neighbors.
The flow of information between the positive “plane” of the graph to the negative “plane” of the graph can be formulated with the equations: where the probability is updated at each step k = 2…10000.
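The verbal description of the information flow suggests update rules of the following form. This is a reconstruction from the prose, not a transcription of the original equations: the symbols $\mathrm{Pr}^{+}_{k}(x_j)$ and $\mathrm{Pr}^{-}_{k}(x_j)$ (the positive and negative information at node $x_j$ after step $k$) and $P^{+}_{ij}$, $P^{-}_{ij}$ (the positive and negative transition probabilities) are notation assumed for this sketch.

```latex
\begin{aligned}
\mathrm{Pr}^{+}_{k}(x_j) &= \sum_{x_i \in N(x_j)}
  \left[ P^{+}_{ij}\,\mathrm{Pr}^{+}_{k-1}(x_i) + P^{-}_{ij}\,\mathrm{Pr}^{-}_{k-1}(x_i) \right], \\
\mathrm{Pr}^{-}_{k}(x_j) &= \sum_{x_i \in N(x_j)}
  \left[ P^{+}_{ij}\,\mathrm{Pr}^{-}_{k-1}(x_i) + P^{-}_{ij}\,\mathrm{Pr}^{+}_{k-1}(x_i) \right].
\end{aligned}
```

In the first line, negative information crossing a negative edge contributes to the positive channel; in the second, positive information crossing a negative edge contributes to the negative channel, matching the "negative times negative is positive" intuition.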
RWR always considers a probability α to return back to the original nearest neighboring nodes at each step in the random walk. This is used to weigh the importance of node-specific information with respect to the whole graph, including for long walks: where the restart probability is updated at each step k = 2…10000, and is the probability after the first update. These equations find the positive and negative restart information with respect to the node xj. Each is a vector of probabilities that together sum to 1. This probability has two parts: the global information and the local information. The local information is the initial probability with respect to the nearest neighbors of node xj, and is denoted by [or ] (i.e., the probability after the first update). The restart probability α is chosen from the range [0, 1], where a higher value weighs the local information more than the global information. We chose α = 0.1 to place a larger emphasis on the global information. Simulations with a toy data set verified this choice.
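The full procedure can be sketched end to end in a few lines. The code below is a minimal illustration under stated assumptions, not the implementation used in the study: it assumes edge weights are normalized by the sum of absolute weights per node, that the walk starts with all positive information on a single node, and that the restart term mixes in the probabilities obtained after the first update with weight α. The early-stopping tolerance `tol` is an added convenience that is not part of the description above.

```python
def dual_channel_rwr(adjacency, start, alpha=0.1, n_steps=10000, tol=1e-12):
    """Dual-channel random walk with restart on a signed graph (illustrative sketch)."""
    n = len(adjacency)
    row_totals = [sum(abs(w) for w in row) for row in adjacency]

    def step(pos, neg):
        # Propagate both channels across every edge once.
        new_pos, new_neg = [0.0] * n, [0.0] * n
        for i in range(n):
            if row_totals[i] == 0:
                continue
            for j in range(n):
                w = adjacency[i][j] / row_totals[i]
                if w > 0:    # positive edge: the sign of the information is kept
                    new_pos[j] += w * pos[i]
                    new_neg[j] += w * neg[i]
                elif w < 0:  # negative edge: the sign of the information flips
                    new_pos[j] += -w * neg[i]
                    new_neg[j] += -w * pos[i]
        return new_pos, new_neg

    pos0 = [1.0 if i == start else 0.0 for i in range(n)]
    neg0 = [0.0] * n
    # "Local" information: the probabilities after the first update.
    local_pos, local_neg = step(pos0, neg0)
    pos, neg = local_pos, local_neg
    for _ in range(2, n_steps + 1):
        flow_pos, flow_neg = step(pos, neg)
        new_pos = [(1 - alpha) * f + alpha * l for f, l in zip(flow_pos, local_pos)]
        new_neg = [(1 - alpha) * f + alpha * l for f, l in zip(flow_neg, local_neg)]
        change = max(abs(a - b) for a, b in zip(new_pos + new_neg, pos + neg))
        pos, neg = new_pos, new_neg
        if change < tol:  # early stop (assumption, not in the source)
            break
    return pos, neg

# Nodes: gene 0, gene 1, drug A (index 2), drug B (index 3).
# Gene 0 is up-regulated by drug A; gene 1 is down-regulated by drug B.
adjacency = [
    [0.0, 0.8, 1.5, 0.0],
    [0.8, 0.0, 0.0, -2.0],
    [1.5, 0.0, 0.0, 0.0],
    [0.0, -2.0, 0.0, 0.0],
]
pos, neg = dual_channel_rwr(adjacency, start=0)
```

In this toy graph, the walk from gene 0 reaches drug A only through the positive channel and reaches drug B only through the negative channel (via the co-expressed gene 1), which is the kind of indirect, signed inference the method is designed to make.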
2.7 Analysis of random walk with restart (RWR) scores
For each gene, the RWR algorithm returns a vector of probabilities that together sum to 1. We interpret these probabilities to indicate the strength of the connection between the reference gene and each target. Since we are only interested in gene-annotation and gene-drug relationships, we exclude all gene-gene probabilities. Then, we perform a centered log-ratio transformation of the probability vector. This transform enables an analysis of proportional data, and is appropriate when working with a subset of a compositional vector [4]. We define the RWR score (or ) for each gene-annotation connection as the transform of its RWR probability: for a bipartite graph describing g = 1…G genes and a = 1…A annotations (or A drugs), where (i.e., from the final step). These transformed RWR scores can be used for univariate statistical analyses, such as an analysis of variance (ANOVA) (e.g., as commonly done for other kinds of compositional data [12, 27]).
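The scoring step can be illustrated with a short sketch of the centered log-ratio (clr) transformation, which takes the logarithm of each probability and subtracts the mean of those logarithms. The sketch assumes that genes occupy the first `n_genes` positions of the probability vector and that all retained probabilities are strictly positive; how zeros are handled is not specified in the text. Function names are illustrative.

```python
import math

def clr(values):
    """Centered log-ratio: log of each value minus the mean of the logs."""
    logs = [math.log(v) for v in values]  # raises ValueError if any value is 0
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

def rwr_scores(probabilities, n_genes):
    """Drop the gene-gene probabilities, then clr-transform the remainder."""
    return clr(probabilities[n_genes:])

# Final-step probabilities for one reference gene: 2 genes, then 3 annotations.
probabilities = [0.5, 0.1, 0.2, 0.1, 0.1]
scores = rwr_scores(probabilities, n_genes=2)
```

Because the clr scores are log-ratios relative to the geometric mean, the difference between two scores equals the log-ratio of the underlying probabilities, regardless of which other nodes were excluded.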
2.8 Benchmark validation
We take 2 approaches to benchmarking RWR for data integration. First, we evaluate how well it can predict known gene functions from single-cell gene co-expression networks. Second, we evaluate how well it can predict known drug responses from individual cell networks. These benchmarks support our use of RWR to predict drug responses for individualized single-cell networks in the absence of experimental validation.
2.8.1 Validation of gene-annotation prediction
Our strategy to validate RWR for gene-annotation prediction involves “hiding” known functional associations and seeing whether the RWR algorithm can re-discover them. This is done by turning 1s into 0s in the bipartite graph, a process we call “sparsification”. Our sparsification procedure works in 4 steps. First, we combine the original GO BP (or MF) bipartite graph with the master single-cell co-expression graph. Second, we subset the graph to include 25% of the gene annotations and 25% of the genes (this is done to reduce the computational overhead). Third, we randomly hide 10, 25, or 50 percent of the gene-annotation connections from the bipartite sub-graph. Since this random selection could cause a feature to lose all connections, we use a constrained sampling strategy: the subsampled graph must contain at least one non-zero entry for each feature. Fourth, we apply the RWR algorithm to the sparsified and non-sparsified graphs, separately. We repeat this process 25 times, using a different random graph each time. By comparing the RWR scores between the hidden and unknown connections, we can determine whether our method rediscovers hidden connections.
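The constrained sampling in the third step can be sketched as below. The sketch interprets “each feature” as every gene (row) and every annotation (column) of the bipartite matrix, which is an assumption. It uses a simple greedy pass over shuffled edges, so it may hide fewer edges than requested when the constraint is tight. The function name and seed are hypothetical.

```python
import random

def sparsify(bipartite, hide_fraction, seed=0):
    """Hide a fraction of 1-entries while keeping every row and column non-empty."""
    rng = random.Random(seed)
    edges = [(g, a) for g, row in enumerate(bipartite)
             for a, value in enumerate(row) if value == 1]
    n_to_hide = int(len(edges) * hide_fraction)
    row_degree = [sum(row) for row in bipartite]
    col_degree = [sum(col) for col in zip(*bipartite)]
    sparse = [list(row) for row in bipartite]
    hidden = []
    rng.shuffle(edges)
    for g, a in edges:
        if len(hidden) == n_to_hide:
            break
        # Only hide an edge if its gene and its annotation keep another edge.
        if row_degree[g] > 1 and col_degree[a] > 1:
            sparse[g][a] = 0
            row_degree[g] -= 1
            col_degree[a] -= 1
            hidden.append((g, a))
    return sparse, hidden

# Toy gene-annotation matrix: 4 genes x 3 annotations, 9 known links.
bipartite = [
    [1, 1, 0],
    [1, 1, 1],
    [0, 1, 1],
    [1, 0, 1],
]
sparse, hidden = sparsify(bipartite, hide_fraction=0.25)
```

The returned `hidden` list records which links were removed, so their RWR scores can later be compared with the scores of links that were never present.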
2.8.2 Validation of gene-drug prediction
We use a different strategy to validate RWR for drug-response prediction. Since we have the gene-drug and gene-gene interaction data for 5 cell lines (A375, HA1E, HT29, MCF7 and PC3), we can set aside the known gene-drug responses for 1 cell line (PC3) as a “ground truth” test set. Then, we can use a composite of the remaining 4 gene-drug graphs to predict the gene-drug responses for the withheld cell line.
This is done in two steps. First, we use the averaged gene-drug data for 4 cell lines (a general drug graph) and the gene-gene data for PC3 (a specific gene graph) to impute the gene-drug response for PC3 (a specific drug graph). In the second step, we use the gene-drug data for PC3 (a specific drug graph) and its corresponding gene-gene data (a specific gene graph) to calculate the “ground truth” RWR scores for PC3 (a specific drug graph). The “ground truth” is the RWR scores when all PC3 drug-response experiments have been performed. With these two outputs, we can calculate the agreement between the imputed and “ground truth” RWR scores (using Spearman’s correlation and accuracy).
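The two agreement measures can be sketched as follows. Spearman’s correlation is computed as the Pearson correlation of average ranks. The accuracy measure is interpreted here as the fraction of the top-scoring imputed gene-drug pairs that also appear among the top-scoring “ground truth” pairs; that interpretation, and the function names, are assumptions for this example.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

def top_overlap(predicted, truth, fraction):
    """Share of the top-`fraction` predicted items also in the top of the truth."""
    k = max(1, int(len(predicted) * fraction))
    def top(scores):
        return set(sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k])
    return len(top(predicted) & top(truth)) / k

# Toy RWR scores for four gene-drug pairs.
imputed = [0.9, 0.1, 0.5, 0.3]
ground_truth = [1.2, 0.2, 0.4, 0.6]
rho = spearman(imputed, ground_truth)
overlap = top_overlap(imputed, ground_truth, fraction=0.5)
```

Both measures would be computed separately for the positive and negative channels.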
2.9 Personalized gene-drug prediction
Having demonstrated that RWR can perform well for single-cell co-expression networks, and can make meaningful drug-response predictions from composite LINCS data, we combine these heterogeneous data sources to make personalized drug-response predictions for individual single-cell networks. This requires some data munging. First, we transform the ENSG features used by the single-cell data into the HGNC features used by LINCS (only including genes with a 1-to-1 mapping, resulting in 181 genes). Second, we build an HGNC co-expression network with ϕi (for 5 folds of 5 patients, yielding 25 networks total). Third, we combine the composite LINCS gene-drug bipartite graph with each of the 25 HGNC single-cell networks. Fourth, we use our RWR algorithm to predict how 181 genes would respond to 1732 drugs for each patient fold. As above, we perform an analysis of variance (ANOVA) to detect inter-patient differences.
3 Results and Discussion
3.1 Gene co-expression is a patient-specific signature
In this study, we analyze a previously published single-cell data set that measured the gene expression for 5 glioblastoma patients. A principal components analysis of these data shows that the major axes of variance tend to group the cells according to the patient-of-origin. Indeed, an ANOVA of gene expression with respect to patient ID reveals that 2204 of the 3022 genes have significantly different expression in at least one patient (FDR-adjusted p < .05). This suggests that the single-cell gene expression signature is unique to each patient.
3.2 Random walk can re-discover “hidden” gene functions
The Gene Ontology (GO) project has curated a database which relates genes to biological processes (BP) and molecular functions (MF) (called annotations). The GO database has widespread use in bioinformatics for assigning “functional” relevance to sets of gene biomarkers [42]. Although GO organizes the semantic relationships between annotations as a directed acyclic graph, we could more simply represent the relationships as a bipartite graph. By combining a (fully-connected) gene co-expression graph with a (sparsely-connected) gene-annotation bipartite graph, the random walk with restart (RWR) algorithm can predict new gene-annotation connections.
To test whether the RWR predictions are meaningful, we constructed a “master” gene co-expression network using all cells from all patients. We then “hid” a percentage of known gene-annotation links (by turning 1s into 0s in the bipartite graph), and compared the RWR scores for the hidden gene-annotation links with those for the unknown links (see Methods for a definition of the RWR score). Figure 5 shows that the RWR scores for hidden connections are appreciably larger than for the unknown connections, confirming that RWR can discover real gene-annotation relationships from a single-cell gene co-expression network.
3.3 Random walk can predict patient-specific gene functions
Since single-cell RNA-Seq assays measure RNA for multiple cells per patient, we can use these data to build a personalized graph that describes the gene-gene relationships for an individual patient. In order to estimate the variation in these personalized graphs, we divided the cells from each sample into 5 folds (giving us 5 networks per patient). Above, we show that RWR can discover real gene-annotation relationships. By combining the personalized graph (a kind of specific knowledge) with a gene-annotation bipartite graph (a kind of general knowledge), the RWR algorithm will score the gene-annotation connections for a given patient. From this, we can identify genes that have a different functional importance in one cancer versus the others.
Taking a subset of the 50 genes with the largest inter-patient differences, we use RWR to compute personalized RWR scores. This results in 25 matrices (for 5 folds of 5 patients), each with 50 rows (for genes) and 369 columns (for BP annotations). Performing an ANOVA on each gene-annotation connection results in a matrix of 50×369 p-values. Figure 6 shows a heatmap of the significant gene-annotation connections (dark red indicates a gene-wise FDR-adjusted p < .05). Figure 7 plots the per-patient RWR scores for 4 annotations of the BCL-6 gene that significantly differ between patients. BCL-6 is an important biomarker whose increased expression is associated with worse outcomes in glioblastoma [49]. This figure suggests that BCL-6 may have a larger role in inflammation for patients 3 and 5, but a larger role in cartilage development and translational elongation in patient 1. Of course, this hypothesis requires experimental validation.
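For a single gene-annotation connection, the per-patient comparison amounts to a one-way ANOVA with patients as groups and folds as replicates. The sketch below computes only the F statistic in plain Python. Converting it to a p-value requires the F distribution, which is not in the Python standard library, and the FDR adjustment is likewise omitted. The function name and toy values are illustrative.

```python
def one_way_anova_f(groups):
    """F statistic for a one-way ANOVA over a list of groups of observations."""
    all_values = [v for group in groups for v in group]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(group) / len(group) for group in groups]
    # Between-group and within-group sums of squares.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    ss_within = sum((v - m) ** 2
                    for g, m in zip(groups, group_means) for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Toy RWR scores for one connection: 3 patients with 3 folds each.
scores_by_patient = [
    [1.0, 2.0, 3.0],
    [2.0, 3.0, 4.0],
    [5.0, 6.0, 7.0],
]
f_statistic = one_way_anova_f(scores_by_patient)
```

In the analysis described above, there would be 5 groups (patients) of 5 folds each, and the test would be repeated for each of the 50×369 gene-annotation connections.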
3.4 Random walk can predict cell line drug responses
The NIH LINCS program has generated a large amount of data on how the gene expression signatures of cell lines change in response to a drug. By conceptualizing the baseline (drug-free) gene co-expression network as a complete graph of specific knowledge, and by re-factoring the average gene-drug response as a (weighted) bipartite graph of general knowledge, we can apply the same RWR algorithm to predict a cell’s gene expression response to any drug. Since the modified RWR algorithm contains two channels, a positive channel and a negative channel, we can predict up-regulation and down-regulation events separately.
To test whether RWR can make accurate predictions about how a gene in a cell would respond to a drug, we ran the RWR algorithm on the baseline (drug-free) gene co-expression graph of the PC3 cell line using a composite gene-drug graph of 4 different cell lines. We then compared these RWR scores with a “ground truth” (i.e., the RWR scores for when all PC3 drug-response experiments have been performed). The agreement between the composite gene-drug RWR scores and the “ground truth” gene-drug RWR scores tells us how well the composite gene-drug map generalizes to new cell types. Table 1 reports the overall agreement (Spearman’s correlation) and the accuracy of the overlap (for the top 5%, 10%, 25%, and 50% predicted scores), as calculated separately for the positive and negative channels. Overall, agreement is high, especially for the top up-regulation and down-regulation events. This confirms that our composite gene-drug graph is useful for drug-response prediction.
3.5 Random walk can predict patient-specific drug responses
The RWR algorithm can combine specific knowledge and general knowledge from disparate sources to make personalized recommendations. This makes RWR a valuable tool for precision medicine. To this end, we combine the personalized gene co-expression networks with the composite gene-drug graph from LINCS. By running the RWR algorithm on these two data streams, the RWR scores now suggest how the expression of any gene might change in response to any drug for each of the 5 glioblastoma patients. Using an ANOVA, we identify hundreds of gene-drug connections with RWR scores that differ significantly between patients (gene-wise FDR-adjusted p < .05).
Figure 7 shows an example of drugs that have different (negative channel) RWR scores for EGFR. It suggests that the anti-inflammatory drug valdecoxib and the anti-neoplastic drug salirasib may cause a stronger down-regulation of EGFR (a pan-cancer oncogene [39]) in patients 1 and 4 versus the others. The Supplementary Information includes a complete table of the unadjusted ANOVA p-values for the gene-drug inter-patient differences. Although RWR can recommend many hypotheses, experimental validation is needed to determine whether these predictions are true.
4 Summary
In this manuscript, we show how random walk with restart (RWR) can be used to make personalized predictions about gene function and drug response. We demonstrate the application of RWR in 3 contexts: to predict the likely function of a gene for an individual patient, to predict a gene’s response to a drug for an individual cell line, and to predict a gene’s response to a drug for an individual patient. In the absence of experimental validation, we support our analyses using 2 forms of in silico validation, which together demonstrate that RWR can integrate sparse heterogeneous data to discover real biological activity. Importantly, our approach makes use of a generic framework, and so can be applied to combine many kinds of data. We believe that the targeted analysis of personalized single-cell networks is promising, and could offer a new direction for precision medicine research.
We conclude with some perspectives on what the future of personalized network analysis may hold. Though RWR can handle sparse heterogeneous data, the positive and negative information obtained for each node can be infinitesimally small. One might address this by transforming the RWR probabilities into another space for greater reliability. Otherwise, we note that RWR is computationally expensive, making the analysis of high-dimensional data prohibitively slow. One might address this by pre-training a deep neural network to provide an approximate RWR solution. These improvements could help scale personalized predictions to larger graphs.
Declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Availability of data and material
The raw data are publicly available from the resources described in the Methods. All gene co-expression and bipartite graphs used in these analyses are available from https://zenodo.org/record/3522494.
Competing interests
No authors have competing interests.
Authors’ contributions
HH implemented the RWR algorithm and applied it to the graphical data. TPQ prepared the graph data and performed the analysis of the resultant RWR scores. HH and TPQ reviewed the literature, designed the experiments, and drafted the manuscript. All authors helped conceptualize the project and revise the manuscript.
Acknowledgements
Not applicable.
List of Abbreviations
- RW: random walk
- RWR: random walk with restart
- RNA-Seq: RNA sequencing
- scRNA-Seq: single-cell RNA sequencing
- LINCS: Library of Integrated Network-Based Cellular Signatures
- GO: Gene Ontology
- BP: Biological Process
- MF: Molecular Function
- ANOVA: analysis of variance
References
- [1]
- [2]
- [3]
- [4]
- [5]
- [6]
- [7]
- [8]
- [9]
- [10]
- [11]
- [12]
- [13]
- [14]
- [15]
- [16]
- [17]
- [18]
- [19]
- [20]
- [21]
- [22]
- [23]
- [24]
- [25]
- [26]
- [27]
- [28]
- [29]
- [30]
- [31]
- [32]
- [33]
- [34]
- [35]
- [36]
- [37]
- [38]
- [39]
- [40]
- [41]
- [42]
- [43]
- [44]
- [45]
- [46]
- [47]
- [48]
- [49]
- [50]
- [51]
- [52]