Abstract
Massively accumulated pharmacogenomics, chemogenomics, and side effect datasets offer an unprecedented opportunity for drug response prediction, drug target identification and drug side effect prediction. Existing computational approaches limit their scope to only one of these three tasks, inevitably overlooking the rich connection among them. Here, we propose DrugOrchestra, a deep multi-task learning framework that jointly predicts drug response, targets and side effects. DrugOrchestra leverages pre-trained molecular structure-based drug representation to bridge these three tasks. Instead of directly fine-tuning on an individual task, DrugOrchestra uses deep multi-task learning to obtain a phenotype-based drug representation by simultaneously fine-tuning on drug response, target and side effect prediction. By coupling these three tasks together, DrugOrchestra is able to make predictions for unseen drugs by only knowing their molecular structures. We constructed a heterogeneous drug discovery dataset of over 21k drugs by integrating 8 datasets across three tasks. Our method obtained significant improvement in comparison to methods that were trained on a single task or a single dataset. We further revealed the transferability across 8 datasets and 3 tasks, providing novel insights for understanding drug mechanisms.
Availability https://github.com/jiangdada1221/DrugOrchestra
1. Introduction
Large-scale pharmacogenomics studies[1–3], which investigate how genes affect an individual’s response to drugs, pave the path towards precision medicine[4,5]. Likewise, massively generated chemogenomics screens and patient side effect records have been used to identify drug-target interactions (DTIs) and adverse drug reactions (ADRs) in bulk. To date, more than 15 million DTIs and 0.6 million ADRs have been collected in public resources[6–10], presenting an unprecedented opportunity for drug discovery. In pursuit of this vision, computational approaches have been developed to predict drug response, targets, and side effects, which further prioritize experiments for biologists and support clinical decision making for doctors[10–17]. For example, Torng and Altman proposed a two-step graph convolutional neural network framework to learn high-quality protein pocket representations for drug-target interaction prediction[15]. Kim et al. formulated drug response prediction as a linear integer program and identified drug sensitivity subnetworks[16]. Luo et al. proposed a network-based approach that integrated a variety of biological features for drug target identification[17].
However, despite the encouraging performance of these existing methods on each individual task, none of them has considered jointly modeling all three tasks, inevitably overlooking connections among them. From a biological perspective, molecular and cellular targets serve as the basis of uncovering drug mechanisms[18,19], which are further used to explain systematic drug response variations[3,20]. Interactions between a drug and unintended receptors that are crucial in normal cellular functions can result in severe adverse drug reactions[21]. From a computational perspective, integrating and jointly analyzing datasets from all three tasks might boost the prediction performance by utilizing common patterns to alleviate overfitting. More importantly, molecular structure-based drug representation such as SMILES strings[22] have shown to be useful in all these tasks and can thus be used to transfer knowledge across them. Therefore, investigating these three tasks collectively might better decipher drug mechanisms and shed light on future drug development.
Inspired by the strong connection among these three tasks, we propose to jointly learn all three tasks using multi-task learning (MTL). The key idea of MTL is to solve multiple prediction tasks at the same time while automatically exploiting similarities and differences across tasks. Empowered by the advance of deep learning, MTL has achieved promising performance in computer vision[23], speech recognition[24] and natural language processing[25], in comparison to single task learning, which optimized each task separately. In biomedicine, MTL has also recently demonstrated exciting progress in drug discovery. Bharath and Steven et al. used MTL to model large-scale experiments on 40 million measurements of over 200 targets[26]. Qiao et al. applied an MTL model to a heterogeneous dataset integrated from 12 different individual cancer drug response datasets[27]. Kamran et al. utilized MTL to improve large-scale gene expression inference[28]. However, since these methods treated each drug as a single task when exploiting MTL, they required a substantial number of training samples for each drug and thus cannot be applied to new drugs.
In this paper, we proposed DrugOrchestra, a deep multi-task learning method that can simultaneously predict drug response, drug targets, and drug side effects. DrugOrchestra first used drug molecular structure-based drug representation pre-trained from millions of compounds[29] as shared features to bridge these three tasks. It then used a hard parameter sharing deep learning structure to jointly optimize all three tasks. Such structure enables DrugOrchestra to not only gain performance improvement on each single task but also make predictions for unseen drugs, which have never been seen in the training samples in any of these three tasks. This ability to predict for unseen drugs would play a key role in extending pharmacogenomics studies from thousands of tested compounds to millions of underexplored small molecules. We applied our method to 8 datasets across three tasks, including Repurposing Hub, Drugbank, STITCH, PDX, GDSC, CCLE, SIDER, and OFFSIDES, which in total cover more than 21k compounds. We observed significant improvement of our method on seven of these datasets. Our method further reveals the similarity and transferability across these datasets and across these tasks, providing novel insights for understanding drug mechanisms.
2. A heterogeneous drug discovery dataset
We first introduce the new heterogeneous drug discovery dataset we collected for drug response, targets and side effects prediction, including 21,032 compounds, 2,057 cell lines, 400 PDX models, 17,563 genes and 1,005 side effects.
Drug target dataset
We obtained DTIs from three sources: Repurposing Hub[6], Drugbank[7] and STITCH[8]. Repurposing Hub contains 13,563 DTIs from 6,798 drugs and 2,183 gene targets. Drugbank contains 25,569 drug-target DTIs from 13580 drugs and 4673 gene targets. STITCH consists of 15,473,939 DTIs from 787,039 drugs and 19,195 gene targets. For each dataset, we excluded drugs whose SMILES strings were not in PubChem[30] to enable that each compound had a corresponding SMILES string representation. For the STITCH database, we excluded drug-target interactions that have a confidence score of less than 0.9. The number of drugs within each dataset and the overlap of drugs across datasets are shown in Fig. 1a. On average, there are 2.7, 3.6, and 3.1 gene targets associated with each drug and 5.7, 6.9, and 15.3 drugs associated with each target in Repurposing Hub, Drugbank, and STITCH, respectively.
To construct the vector representations of drugs, we utilized the simplified molecular-input line-entry system (SMILES)[18], which uses a line notation to represent the structure of small molecules. In our paper, we used the pre-trained chemical molecular embedding model from Hu and Liu et al.[29], which takes SIMILES strings as inputs to obtain the drug representation. This pre-trained model was trained on millions of unlabeled chemical structure data, which enables the model to capture the domain-specific knowledge in the 2D molecular graph described by a SMILES string. We then obtained the low-dimensional drug features for each drug in the 8 datasets by inputting the corresponding SMILES string to the pre-trained model.
We obtained the target gene features from the pre-trained representations of genes in humans and several model organisms computed by the state-of-the-art network embedding method Mashup[31]. In brief, Mashup integrated multiple individual networks into a low-dimensional space according to the topological relationship between nodes. We mapped each gene in drug-target datasets to the gene vector in the pre-trained gene vector file provided by Mashup. We then excluded target genes that were not included in the gene list of Mashup. After the gene feature extraction, there are 2,183 target genes, 3,157 target genes, 17,288 target genes remained in Repurposing Hub, Drugbank, and STITCH respectively.
Drug response dataset
We built a drug response collection by integrating a Patient-Derived Xenograft (PDX) dataset[32], the Genomics of Drug Sensitivity in Cancer (GDSC)[2] and the Cancer Therapeutics Response Portal (CCLE)[3]. The PDX dataset has 400 PDX model-derived tumor xenograft models with a diverse set of driver mutations. It consists of 37 drugs across 400 PDX model samples. The GDSC dataset has 255 drugs across 1018 cell lines. The CCLE dataset profiles 545 drugs across 1039 cell lines. We used gene expression profiles as the feature for drug response prediction. We used IC50 values as the indicator of the drug response data in GDSC and CCLE and Best Tumor Response value as the indicator of the drug response data in PDX. We also excluded drugs without the corresponding SMILES string in PubChem. For the response values, we took the z-score of the original response data for each dataset. After filtering, 1,634, 190,853, 322,045 drug response data points (i.e., cell line drug pairs) remained in PDX, GDSC, CCLE respectively. The number of drugs within each dataset and the overlap of drugs among three drug response datasets are shown in Fig. 1b.
We formed the cell line (PDX model) features based on the gene expression matrix which illustrates relationships between cell lines and genes. In the gene expression matrices of drug-response datasets, we only keep common genes that appear in all the three drug-response datasets. We then applied the PCA (Principal Component Analysis)[33] technique to the filtered gene expression matrices and obtained low-dimensional feature matrices. We took the z-score across rows to normalize the feature matrices. Each column in the feature matrices represents a feature vector for the corresponding cell line.
Drug side effect dataset
We collected drug side effects from SIDER[9] and OFFSIDES[10]. The SIDER dataset contains 2,523,626 drug side-effect associations over 1,430 drugs and 5,868 side effects obtained by mining ADR events from drug label text. The OFFSIDES dataset consists of 0.32 million interactions between 2,730 drugs and 14,544 side effects collected from adverse event reporting systems. Likewise, we excluded drugs without the corresponding SMILES string in Pubchem. For OFFSIDES, we further excluded drug side-effect associations whose proportional reporting ratio errors are greater than 0.25. Fig. 1c shows the number of drugs within each dataset and the overlap of drugs between these two drug side effect datasets. On average, there are 73 and 108 diseases associated with each drug and 97 and 238 drugs associated with each disease in SIDER and OFFSIDES respectively.
We formulated the feature for a given disease by utilizing associations between diseases and genes pulled from DisGeNET[34]. The initial feature vector of each disease was an one hot vector, where 1 represented the existence of an association between the disease and the gene and 0 otherwise. Side effects that were not included in the DisGeNET as diseases were excluded. Following the same procedure of extracting cell line features, we used PCA to reduce the disease feature matrix (gene-by-disease) to a dimension of 300 and then applied z-score normalization to the low-dimensional vectors. 1,005 side effects remained after the processing and each column of the reduced feature matrix represented a feature vector for the corresponding disease. The overlap of drugs among three tasks and the statistics of each dataset are shown in Fig. 1d and Table 1.
3. Methods
DrugOrchestra architecture
The structure of DrugOrchestra is shown in Fig. 2. Each task used a neural network as the base classifier. Each neural network consists of two components, shared layers across all tasks (three tasks in this paper) and task-specific layers. We used hard parameter sharing for parameters of the shared layers. Input features for shared layers are pre-trained molecular graph-based drug features. Input features for task-specific layers are different for each task. The output of the task-specific layers and the shared layers are concatenated and used to train the remaining layers of the task-specific neural network for each task.
In particular, let d1, d2…. dn denote input features for shared layers of n tasks. f1, f2….fn denote input features for task-specific layers of n tasks. For shared layers, the activations of task i is calculated as: where the denotes the k-th shared hidden layer in the i-th task, σ is the activation function. is the input drug feature di. and represent the weight and bias of the k-th hidden layer, which are shared among different tasks. Drug features from different datasets are processed by the same hidden layers to capture common patterns across tasks. We used the ReLU function as the activation function σ for hidden layers. We used the sigmoid function as the activation σ for the output layer of classification tasks (e.g., drug target prediction and drug side effect prediction) and the linear activation function for the activation σ of the output layer of the regression task (e.g., drug response prediction).
For the task-specific hidden layers that use task-specific features, the activations of each group is defined as: where and denote the weight and bias of the k-th task-specific hidden layer for the i-th task. is the input task-specific feature fi. Different tasks can learn distinct feature representations which preserve the knowledge of each separate task. We used two hidden layers for both drug hidden layers and task-specific hidden layers. After we obtained the processed low-dimensional drug features and task-specific features, we concatenated them to get the new feature vector ci = [qi, ai] for task i. ci is then fed into a task-specific neural network with ReLU activation function. Let oi be the output value of the remaining task-specific neural network. oi = sigmoid(ci) for the classification task and oi = ci for the regression task.
Finally, we used the cross entropy loss for the classification task: where m is the number of training sample in i-th task and is the label of the k-th training sample in the i-th task. We used the mean squared error loss for the regression task:
The overall objective for our multi-task learning model is to minimize the weighted combination of losses from each task, which is defined as:
Here, C denotes a set of classification tasks, R denotes a set of regression tasks, λi is the weight of task i. The DrugOrchestra framework is flexible to include more tasks that involve drug features. Through the multi-task learning framework, DrugOrchestra implicitly augments the training data and leverages other tasks to regularize a specific task, thus potentially alleviating overfitting.
Dynamic Weight Adjustment
Training multiple tasks simultaneously could be challenging using multi-task learning since they need to be assigned proper importance weights. Extensively tuning hyperparameters for the weights of each task is time consuming. Therefore, we used a schema that can automatically adjust the weight of each task. In this paper, we compared two weight adjusting strategies introduced by Liu et al.[23] and Liu et al.[48] and observed that the dynamic weight adjustment strategy from Liu et al.[23] is empirically better. The dynamic weight adjustment (DWA) strategy first calculates the relative descending rate ω from two previous epochs:
Here, t denotes the t-th epoch, k denotes the k-th task, and denotes the loss of the k-th task in the t-th epoch. Then the importance weight of task k in the t-th epoch is calculated as: T controls the effect of task weighting. A smaller T makes the optimization process aggressively focused on important tasks determined by ω. This weight is used to weigh the loss function of each specific and DWA allocates higher weights to tasks with lower descending rates to boost the reduction of losses of those tasks. We set T=2 in our experiments as suggested by previous work[23].
4. Experimental setup
We compared our multi-task learning framework against three comparison approaches. 1) Linear model: We applied the Support Vector Machine[36] (SVM) with linear kernel to each single dataset. The learning rate schedule is set to be ‘optimal’ and the training stops until convergence. 2) Ensemble model: We applied the Random Forest[38] to each single dataset. The model was implemented using the sklearn library[37] with trees and depth set to 100 and 10 respectively. 3) Single-task Learning: We used a single-task deep neural network (STL-NN) model, which exploited the same architecture as DrugOrchestra but only trained on a single task. Although all these three comparison approaches are trained on a single task, they used all available training samples from the corresponding datasets. Moreover, all comparison approaches also used the same pre-trained molecular features, input features and data splits as DrugOrchestra. We used the same learning rate and number of epochs to train DrugOrchestra and STL-NN.
Drug target prediction and drug side effect prediction are classification tasks. Therefore, we used the area under curve for Receiver Operating Characteristic (AUROC) and the area under Precision Recall Curve (AUPRC) to evaluate the performance of them. Large AUROC and AUPRC indicate a better performance. Drug response prediction is a regression task. Consequently, we used the Spearman’s Correlation Coefficient (Spearman) and the Mean Squared Error (MSE) to evaluate the performance. A larger SCC value indicates a better performance whereas a smaller MSE indicates a better performance.
We defined the transferability between the source dataset (task) and the target dataset (task) using the performance gain:
Here, rt is the performance of the target dataset t. rs−>t represents the performance of training DrugOrchestra on the full source dataset s and the training set of dataset t jointly and then testing on the test set of dataset t. The Tanimoto score between two drugs are computed using the RDKit package[39]. A large Tanimoto score indicates that two drugs are similar in terms of the SMILES string.
We used 3-fold cross validation to train and evaluate our models. In particular, each time, we randomly selected 2/3 drugs for training and the remaining 1/3 drugs for testing. Since there is no overlap between test drugs and training drugs, this evaluation setting is able to examine whether our method can be used to predict for unseen drugs. We repeated this process for 10 times and used the average results as the final performance. The STL-NN model and DrugOrchestra were trained with ADAM optimizer[40] using a learning rate of 10-3, with a batch size of 256 for 20 epochs. The parameters of all layers were initialized by Xavier approach[49]. We used the binary cross-entropy (BCE) loss for drug-target prediction tasks and drug side effect prediction tasks, and the mean squared error (MSE) loss for drug response prediction tasks. The DrugOrchestra and the STL-NN model are implemented in Python using the Pytorch library[41].
5. Experimental results
DrugOrchestra substantially improves performance on all three tasks
We first studied whether DrugOrchestra can improve prediction performance on these three tasks by comparing it with comparison approaches on 8 datasets of these 3 tasks. Results are summarized in Table 2. We found that 7 out of 8 datasets obtained significant improvement (t-test p-value<0.05). For example, in GDSC, DrugOrchestra achieved 0.375 SCC which is much higher than 0.315 SCC of STL-NN, 0.235 SCC of SVM and 0.149 SCC of Random Forest. Other six datasets also outperformed comparison methods significantly. The only exception is PDX where DrugOrchestra is better than RF but worse than SVM. We attributed the superior performance of linear SVM on PDX to the relatively smaller size of PDX, which could result in overfitting by using more expressive models. The overall improved performance of MTL demonstrates the effectiveness of using multi-task learning to transfer knowledge across these three tasks. Furthermore, we observed that DrugOrchestra had the most prominent improvement on drug response prediction, which might suggest a stronger connection between drug response and the other two tasks. Unlike the STL-NN model, SVM and Random Forest, DrugOrchestra models multiple datasets simultaneously so that it can take advantage of the shared latent features and obtain an enhanced drug representation.
DrugOrchestra reveals transferability across datasets
The larger improvement on drug response prediction in comparison to the other two tasks motivates us to further investigate the transferability across datasets and tasks. To this end, we used performance gain to calculate the transferability across datasets and summarized the result in Table 3. In general, datasets belonging to the same task have better transferability compared to datasets from different tasks. For example, the average performance gain within drug response, targets and side effects prediction was 1.91%, 4.71% and 2.75% respectively, which is higher than the average performance gain 0.274%, 1.85% and 0.544% between datasets of different tasks. Datasets from the same task have closer distribution and the drug representation can thus be better shared across them. Moreover, we observed that the transferability can partially reflect the improvement of DrugOrchestra in Table 2. For instance, the GDSC dataset received the highest 18% improvement from CCLE dataset, and it also exhibited the highest improvement of 19% over STL. Interestingly, we also observed negative transferability in Table 3, which were mostly between datasets that were not from the same task. One exception was the transferability from PDX to CCLE, which again is consistent with the worse performance of DrugOrchestra on PDX in Table 2. We found that the remaining negative transferability was mostly between a drug side effect dataset and a drug target dataset, which might suggest the inherent difference between these two tasks.
DrugOrchestra reveals transferability across tasks
Since the negative transferability is mostly observed between a drug side effect dataset and a drug target dataset, we then sought to examine the transferability at the task level. We performed the same analysis by aggregation datasets from the same task. Table 4 shows the result of task level performance gain. We observed 4 out of 6 task pairs had positive performance gain by using DrugOrchestra. The improvement is most prominent when using drug targets to help the prediction of drug response. Drug target is known to be one of the most important features in a variety of drug response prediction approaches[42–44], partly due to its central role in understanding drug mechanisms. Interestingly, drug response is much less helpful for drug target prediction. The seemingly contradictory transferability might suggest the underlying causality relationship during apoptosis. Finally, we also observed substantial transferability between drug response and drug side effects, which are both drug phenotypes and have been shown to be associated[51]. The observed transferability at task levels raises our confidence that DrugOrchestra can be integrated with other datasets, which could further enhance the prediction performance on these three tasks.
DrugOrchestra enables prediction of unseen drugs
One key advantage of DrugOrchestra in comparison to other multi-task learning approaches[26–28] is its ability to make predictions for unseen drugs. Such improvement has been observed in Table 2 where the test drug set has no overlap with the training drug set. However, compounds with similar molecular structures might still exhibit in the test and training set, which might cause potential information leakage and obscure the usage of our method in predicting unseen drugs. To investigate the applicability domain[50], we excluded a test set drug if it has a high Tanimoto similarity to any drug in the training set. By using different stringent Tanimoto similarity cutoffs, we still observed substantial improvement of DrugOrchestra in comparison to STL (Table 5). For example,when using a cutoff of 0.6, our method achieved 0.785 AUROC and 0.371 SCC in STITCH and GDSC respectively, which are much higher than 0.745 AUROC and 0.318 SCC of STL-NN. All the other 7 datasets including PDX obtained improved performance when using our method. Notably, the improvement of our method against STL was larger when using a more stringent cutoff, indicating the enhanced prediction performance of our method in making predictions for unseen drugs. By coupling these three tasks together and transferring knowledge across them, the new drug representation obtained by our method is more robust and can be generalized to unseen drugs.
6. Discussion
In this paper, we have presented DrugOrchestra, a novel deep multi-task learning model for jointly training drug response, targets, and side effects. DrugOrchestra exploits the task similarity among drug response, targets and side effects prediction. We collected a large-scale drug discovery dataset including 21,032 drugs, 17,563 genes, 2,057 cell lines and 1,005 side effects. Our method obtains significant improvement on 7 out of 8 datasets when evaluated on this dataset and thus enables more accurate predictions of response, side effects and targets for a given compound. We further showed the transferability across datasets and tasks, which were well aligned with our observation of the improvement and underlying drug mechanisms.
Our method is inspired by recent advances in the field of machine learning such as BERT[45] and GPT[46], which pre-trained on large-scale unlabeled data and then fine-tined on the specific task. Such two-stage approaches have been shown to be effective on a variety of tasks. A key conceptual advance of our method is to perform a multi-task learning step between the pre-training and fine-tuning step. In this paper, the multi-task learning is the joint training of drug response, side effects and targets. Based on the ablation study that compares to STL, we found that this multi-task learning step is crucial to the improvement, which suggests that knowledge of all three tasks can transfer to each other and enables us to get better molecular graph-based drug representation.
There are several directions for future research. From a biological perspective, we plan to integrate other tasks in the multi-task learning steps, such as drug perturbation prediction, drug solubility prediction and drug synergistic effect prediction. From a computational perspective, we would like to design interpretable methods to help us understand the multi-task learning model. In addition, we are also interested in experimentally validating the prioritized predictions by our model for drug repurposing and development.