Abstract
Personalised medicine has predominantly focused on genetically-altered cancer genes that stratify drug responses, but there is a need to objectively evaluate differential pharmacology patterns at a subpopulation level. Here, we introduce an approach based on unsupervised machine learning to compare the pharmacological response relationships between 344 pairs of cancer therapies. This approach integrated multiple measures of response to identify subpopulations that react differently to inhibitors of the same or different targets to understand mechanisms of resistance and pathway cross-talk. MEK, BRAF, and PI3K inhibitors were shown to be effective as combination therapies for particular BRAF mutant subpopulations. A systematic analysis of the preclinical data for a failed phase III trial of selumetinib combined with docetaxel in lung cancer suggests potential indications in urogenital and colorectal cancers with KRAS mutation. This data-driven study exemplifies a method for stratified medicine to identify novel cancer subpopulations, their genetic biomarkers, and effective drug combinations.
Introduction
Drug developers face a conundrum in predicting the efficacy of their investigational compound compared to existing drugs used as the standard of care treatment. Systematic screening of drug compounds across a variety of genomic backgrounds in cancer cell lines has improved clinical trial design and personalized treatments 1. Following the pioneering NCI-60 screen comprised of 59 unique cell lines 2, modern high-throughput screens such as the Genomics of Drug Sensitivity in Cancer (GDSC) 3,4, the Cancer Cell Line Encyclopedia (CCLE) 5 and the Cancer Therapeutics Response Portal (CTRP) 6–8 have characterised >1,000 cancer cell lines with the goal of establishing the genetic landscape of cancer. The deep molecular characterisation of these large cell line panels is complemented with high-throughput drug screens, which enables the discovery of drug response biomarkers. For example, analysis of the generic BRAF inhibitors PLX4720, SB590885 and CI-1040 reproduced drug sensitivity association with the BRAF mutation in melanoma, or afatinib sensitivity with ERBB2 amplifications in breast cancer 3,4,9. These associations between genetic variants and treatment response have helped identify specific patient subpopulations who are most likely to benefit from treatment. In Phase III clinical trials, however, new drugs must demonstrate a significant improvement over the existing standard of care to be successful. Accurately defining in which subpopulations a new drug demonstrates improved differential efficacy over other drugs targeting the same disease could lead to both better clinical outcomes and new targeted therapies.
While several methods have been proposed to identify drug response biomarkers in cell lines for precision medicine and drug repositioning 4,5,10,11, there is a need for more objective and unsupervised approaches for identifying subpopulations with differences in drug response (differential drug response), and consequently systematically gain mechanistic insights from biomarkers. Most approaches capable of comparing multiple drugs measure the overall similarity (or correlation) based on a single response summary metric 7,12, which permits drug repositioning based on subpopulations with similar behavior, but neglects ones that behave differently (Figure S1A). Here, we used a technique based on unsupervised machine learning, which identifies differentially sensitive or resistant subpopulations and may be applied generally to evaluate any pair (or n-tuple) of targets using any number of drug response summary metrics (e.g. IC50 or AUC) to stratify the pharmacology response. Segmentation of the overall population occurs top-down and along globally-optimal contours that are derived explicitly and maximize the differences between the two resulting subpopulations. The segmentation continues recursively and is modulated by multiple user-defined criteria such as the size or separability of the resulting subpopulations. Higher threshold values for both result in less granular subpopulations but increase certainty that the subpopulations and the quantities estimated from them are both distinct and accurate.
We present results from our platform, SEABED (SEgmentation And Biomarker Enrichment of Differential treatment response), to demonstrate how unsupervised machine learning can discover intrinsic partitions in the drug response measurements of two or more drugs that directly correspond to distinct pharmacological patterns of response with therapeutic biomarkers. Addressing the challenges in comparing the response of two drugs, SEABED initially assesses two gold standards with established clinical biomarkers, namely the differential response of a BRAF inhibitor and MEK inhibitor with anticipated BRAF and KRAS mutations 13–16, and an EGFR inhibitor and MEK inhibitor with expected biomarkers of EGFR, ERBB2 and KRAS mutations 17–20. Next, we systematically compare how different drugs targeting the MAPK and PI3K-AKT pathway yield different patterns of response within subpopulations. We show how differential drug response may indicate benefit for drug combinations explained through independent action rather than probable synergy by examining subpopulations uniquely sensitive to a single drug 21, which may be precisely targeted by identified biomarkers. Finally, we demonstrate how the analysis of differential response can guide the design of clinical trials by revealing specific indications where an investigational therapy may be more effective than the standard treatment.
Results
We applied our technique to discover subpopulations of cell lines in which two or more compounds, possibly addressing the same disease state or even targeting the same genetic alteration, have a common pharmacological pattern of response. By further associating enriched genetic alterations in subpopulations with specific patterns of response, we shed light on the molecular mechanisms responsible for patient subpopulations that respond differently to two drugs.
Identifying subpopulations of differential drug response
We first considered the specific circumstance in which two drugs engage different targets within the same signalling pathway, namely agents targeting MAPK signaling. SEABED used nearly 1,000 cancer cell lines derived from the GDSC database, and we evaluated two established drug response measures: the drug concentration required to reduce cell viability by half (IC50) and the area under the dose-response curve (AUC; Figure 1A). SEABED employed a multivariate similarity measure to compare the vector patterns of response for each distinct pair of cell lines without requiring a priori assumptions on the number or distribution of the subpopulations. The result is a diverse cell line population segmented into distinct subpopulations having homogeneous patterns of drug response (Figure 1B). As an example, we show that the drug response of 802 cell lines treated with either SB590885 (BRAF inhibitor) or CI-1040 (MEK inhibitor) could be segmented into 7 distinct subpopulations with a median size of 40 cell lines by integrating the two metrics of drug response, AUC and IC50 (Figure 1C; see Figures S1B and S1C for individual cell lines segmented by IC50 and AUC respectively). The subpopulation sensitive to both inhibitors was significantly enriched for BRAF mutants (P=3.87e-14, hypergeometric test), while another subpopulation was exclusively sensitive to the MEK inhibitor and significantly enriched for KRAS mutations (P=0.00589, hypergeometric test).
In another example we examined a case where one inhibitor might overcome resistance to another inhibitor targeting the same pathway: AZD6244/ARRY-142886 selumetinib (MEK inhibitor) with afatinib (EGFR and ERBB2 dual inhibitor) across 812 cell lines (Figure 1D). Strong markers of sensitivity for selumetinib are subpopulations carrying known associated KRAS, NRAS and BRAF mutations (Figures 1D and 1E). A less anticipated association is APC loss-of-function sensitivity to selumetinib, although this was also found with trametinib (another MEK inhibitor) in APC deficient mice 22. We reproduced the well-established associations of afatinib with either EGFR or ERBB2 amplifications 4,23, and surprisingly our unsupervised segmentation returned two subpopulations enriched for EGFR amplifications. The more sensitive subpopulation is solely enriched for EGFR amplifications, whilst the less sensitive subpopulation additionally includes activating PIK3CA mutations. In concordance with recent literature, PI3K-AKT signaling drives acquired drug resistance to EGFR inhibitors in lung cancer 24.
Drug response segmentation resulted in 14 subpopulations with a median size of 38 (Figure 1D). The subpopulation enriched for EGFR, ERBB2 and PIK3CA variants has an average log(IC50) of 0.9486µM for selumetinib and −0.596µM for afatinib. In contrast, the BRAF mutation was enriched in a subpopulation where the average log(IC50) was −1.061µM for selumetinib and 0.593µM for afatinib. The difference in response between afatinib and selumetinib was significantly greater (t-test P<0.01) between the subpopulations identified and the total population of PIK3CA or BRAF mutant cell lines (Figures 1F and 1G).
Cross-comparison of multiple drugs redefines best-in-class drugs for specific subpopulations
Although there is a larger portfolio of clinical drugs with identical putative targets, their responses may differ substantially in subpopulations as a consequence of multiple factors, for example mode-of-action, different off-target effects and binding properties. The ability to discover cell line subpopulations with distinct pharmacological patterns of response characterised by genetic mutations re-defines best-in-class drugs by their differential response to other drugs in a specific subpopulation, rather than their absolute response across an entire population.
In order to demonstrate this approach for drug discovery, we applied SEABED to 745 cell lines across cancer types to evaluate the differential response in those cell lines to five inhibitors (CI-1040, PD0325901, RDEA119, selumetinib, and trametinib) which all target the MEK protein (Figure 2A). The segmentation of cell lines revealed 13 subpopulations with different patterns of response and three having enriched biomarkers (Figure S2A). Two subpopulations were sensitive to all MEK inhibitors, with trametinib achieving the greatest sensitivity. In one subpopulation the KRAS mutation was enriched (Fisher exact p-value = 1.12e-4 and 40.8% of the cell lines) while another had the BRAF mutation enriched (Fisher exact p-value = 1.39e-7 and 50% of the cell lines). In contrast, another subpopulation was enriched with the RB1 mutation (Fisher exact p-value = 3.84e-2 and 21.6% of cell lines), within which the cell lines were almost uniformly resistant to all MEK inhibitors.
Distribution of subpopulations highlights distinct pharmacological relationships between PI3K-AKT and MAPK signaling
Next, we used SEABED to investigate the cross-talk between two frequently active cancer pathways, MAPK and PI3K-AKT signalling, by systematically comparing pairs of drugs targeting different genes of each pathway (Figure 2A, B). In total, SEABED performed 342 pairwise comparisons of 18 PI3K-AKT and 19 MAPK pathway inhibitors. Each drug pair was classified into five categories based on the distribution of subpopulation drug responses: (i) no differential response, (ii) sensitive to both MAPK and PI3K-AKT pathway inhibitors (i.e. correlated response) (Figure S2B), (iii) preferential MAPK pathway sensitivity (Figure S2C), (iv) preferential PI3K-AKT pathway sensitivity (Figure S2D), (v) sensitive to either a MAPK pathway or a PI3K-AKT pathway inhibitor, i.e. divergent response (Figure S2E).
We found 28 drug pairs with a higher than expected number of subpopulations with sensitivity to both PI3K-AKT and MAPK pathway inhibition. This association between subpopulation size and sensitive response was significant when comparing a CRAF inhibitor (TL-2-105) to PI3K-AKT signaling inhibitors (P=1.832e-5). The same trend was observed for inhibiting ERK (FR-180204) or RSK (FMK) compared to inhibiting any PI3K-AKT signaling gene (P=0.000197 and P=7.231e-8, respectively), but interestingly there was no mutual sensitivity when comparing to either BRAF or MEK inhibitors.
There were 68 drug pairs with a significantly high proportion of subpopulations (P < 0.05) exhibiting preferential sensitivity to MAPK pathway inhibition. This phenotype is strongly pronounced in pairs with BRAF, ERK (FR-180204) and RSK (FMK) inhibitors (P=0.000195, P=0.0133 and P=0.00315, respectively; hypergeometric test). In contrast, 29 drug pairs were found with significantly high proportions of subpopulations exhibiting preferential sensitivity to PI3K-AKT pathway inhibition, with an enrichment of 19 MEK inhibitors (hypergeometric test P=0.00102). MEK inhibitors were particularly enriched when paired with PI3K or PDK1 inhibitors (hypergeometric test P=0.00529).
In 54 cases, we observed drug pairs with sensitivity to either a MAPK pathway or a PI3K-AKT pathway inhibitor, i.e. divergent response. This response type was enriched for pairs of any PI3K-AKT pathway inhibitors and EGFR (erlotinib), BRAF (PLX4720-1 and PLX4720-2), or MEK inhibitors (P=7.826e-6, P=0.000308 and P=0.0437, respectively; hypergeometric test), while even more significant for AKT inhibitors in comparison with either the EGFR, BRAF, or MEK inhibitors (P=0.0133, P=0.000262 and P=0.000311; hypergeometric test). Response patterns for all drug pairs can be explored in our portal (Website S1; https://szen95.github.io/SEABED).
Subpopulations of differential response identify drug combination efficacy
Previous studies have hypothesised that the efficacy of many approved drug combinations can be explained by the independent action of single agents on different patient subpopulations with cancers driven by multiple pathways 21. We hypothesised that SEABED comparisons of drug pairs would highlight subpopulations of differential response that would exhibit synergistic or independent action effects when the drugs are tested in combination. Systematic comparison of responses between two drugs highlighted subpopulations of cell lines in which there was sensitivity to either drug but not both (divergent response). We observed this phenomenon in 50 drug pair comparisons, including a MEK inhibitor (RDEA119-2) which showed divergent responses to four PI3K inhibitors (PI-103, GSK2126458, ZSTK474, and PIK-93; Figures S3A-D). Drug pairs with divergent response were also observed in cell lines treated with PLX4720-1 (BRAF inhibitor) and three PI3K inhibitors (PI-103, GSK2126458, and ZSTK474; Figure 3A; Figures S3E-G). Two subpopulations with a high proportion of a BRAF mutation were identified with greater sensitivity to the PI3K inhibitor (Figure 3B).
We next examined the drug pairs as combination therapies in cell lines 25 and patient-derived tumor xenograft models (PDXs) 26 to investigate whether the drug pairs with divergent response and subpopulations with preferential sensitivity to one drug would be associated with efficacy of their combination treatment (Figure 3C). SEABED first compared the single drug responses of BRAF, MEK and PI3K inhibitors as before to identify BRAF mutant subpopulations with differential response. When the drugs were tested as combinations in BRAF mutant cell lines, the MEK/PI3K inhibitor combination had a similar level of synergy as BRAF/MEK combinations, which were recently approved clinically 27,28. These two combinations had a significantly higher synergistic effect when used on BRAF mutant cell lines compared to all cell lines (t-test P=0.0204), and compared to all drug combinations tested (t-test P=1.46e-5; Figure 3D; Figure S3H). In terms of overall efficacy in PDXs, we observed a similar level of tumour volume inhibition for the BRAF/PI3K inhibitor combination on BRAF mutant cells when compared to the clinically approved BRAF/MEK combination, and a significantly greater (t-test P=0.0418) inhibition of tumour growth compared to all combinations (Figure 3E; Figure S3I).
Lack of subpopulations of differential response may explain clinical failure
Despite strong preclinical evidence, some drugs do not succeed in clinical trials 29. One such trial was SELECT-1 (Table S1), which compared the efficacy of combining selumetinib and docetaxel to docetaxel alone in patients with advanced KRAS-mutant non-small cell lung cancer (NSCLC) 30. Although there were KRAS mutant cell lines sensitive to selumetinib in preclinical testing 31, we re-examined the pharmacological data with SEABED to assess whether there were distinct subpopulations that justified the patient selection criteria for KRAS mutation.
In this analysis, instead of only inspecting the subpopulations identified by SEABED when the segmentation algorithm terminated, we thoroughly examined all possible subpopulations. SEABED identified a total of 61 possible subpopulations from 840 cell lines across tissue types tested with selumetinib and docetaxel (Figure 4A). Twelve subpopulations were more sensitive to selumetinib than docetaxel (Figure 4B), and 5 of those subpopulations were enriched for KRAS mutation. However, those subpopulations enriched for NSCLC KRAS mutants were small in size and mostly exhibited less sensitivity to selumetinib compared to docetaxel (Figures S4A and S4B). The distribution of different KRAS mutations (p.G12C vs p.G12V) was also no different in selumetinib sensitive subpopulations compared to resistant subpopulations (Figures S4C and S4D). Independent of mutation status, only 8.7% of NSCLC cell lines were found in selumetinib sensitive subpopulations, whereas 25.4% of cell lines originating from aerodigestive cancer types (e.g. esophageal) were found in these subpopulations (Figures S4E and S4F).
Next, we focused on subpopulation_60, which had the greatest difference in sensitivity (IC50 and AUC) to selumetinib compared to docetaxel (Figure 4C). This subpopulation of 122 cell lines was enriched in KRAS mutations (28.8%, P=3.061e-4) found across multiple tissue types. NSCLC cell lines accounted for only 8% of this subpopulation, with 50% of those cell lines being KRAS mutants. Colorectal and pancreatic cell lines accounted for 15% and 8% respectively of the subpopulation, and they both had a higher proportion of KRAS mutations (56% and 100% respectively; Figure 4D).
Discussion
The ability to identify distinct subpopulations based on multiple measures of drug response (e.g. IC50 and AUC) and extract their biomarkers is the basis for personalised therapeutics, which may ultimately increase the likelihood of successful clinical trials 32,33. Using a network-based segmentation algorithm coupled with biomarker detection (SEABED), we investigated well-established pharmacological targets and clinical biomarkers by comparing the response patterns for BRAF (SB590885) and MEK (CI-1040) inhibition, which, as expected, reproduced subpopulations sensitive to both and enriched for BRAF mutants 34–36. In another example, SEABED compared EGFR/ERBB2 (afatinib) and MEK (selumetinib) inhibition to reveal expected biomarkers such as BRAF, KRAS and NRAS mutations for selumetinib 13–16, and afatinib associated with EGFR and ERBB2 amplifications 37,38. Interestingly, the more afatinib-resistant subpopulation was enriched for PIK3CA-activating mutations, which may cause acquired resistance 24. When we systematically compared inhibitors of the MAPK and PI3K-AKT signaling pathways, we observed subpopulations sensitive to both CRAF, ERK or RSK targeted drugs and other drugs targeting the PI3K-AKT pathway; however, there were few instances of these subpopulations for inhibitors targeting other genes in MAPK signalling 39. We found many more subpopulations that were more sensitive to BRAF inhibitors than other PI3K-AKT inhibitors, and as expected, many contained BRAF mutations 34. In contrast, there were not significantly more subpopulations sensitive to MEK inhibition compared to inhibition of PI3K-AKT signalling targets, but BRAF mutant subpopulations may have greater differential response 14. Divergent response was observed when comparing EGFR, BRAF and MEK inhibitors to drugs targeting the PI3K-AKT pathway. Our results comparing the MAPK and PI3K-AKT pathways based on drug response profiles highlight how intertwined those two pathways are in pharmacology space 39.
Arguably, the divergent response type is the most exciting for personalised treatment, since it may identify cases where independent drug action and synergy may guide effective drug combinations 21. As an example, we showed that PI3K inhibitors combined with either BRAF or MEK inhibitors increase in vitro synergy and reduce tumour volume in in vivo models. Furthermore, we were able to show that synergistic and overall effect can be further enhanced by the correct biomarker indication, in this instance, BRAF mutant subpopulations 40,41. The BRAF mutant subpopulation with high efficacy for the BRAF inhibitor and not the other inhibitor could represent cases where independent drug action explains drug combination efficacy, whereas the subpopulation with lower efficacy for single treatments of either drug may represent cases for synergistic effects when the drugs are combined.
In examining the preclinical evidence for trial testing combination treatment of NSCLC in which the KRAS mutation was the biomarker 42, SEABED revealed a high proportion of NSCLC subpopulations having the KRAS mutation that are resistant to both selumetinib and docetaxel, suggesting a smaller likelihood of efficacy for the drug combination. Alternately, we identified a subpopulation with differential response to selumetinib for a small proportion of KRAS NSCLC cell lines, but this subpopulation contained a higher proportion of colorectal and pancreatic cancer cells with KRAS mutations. Previous studies have shown the plausibility in treating colorectal cancer using MEK inhibitor combinations 43,44. With consideration of KRAS mutations in subpopulations having greater sensitivity to selumetinib, SEABED suggests that while the correct biomarker was used for the clinical trial, there may be other potential indications for selumetinib. Although response in cell lines may not always correspond to response clinically, the use of data-driven approaches to examine large populations of cells may reveal clinically relevant drug response patterns. Future studies may need to account for differences between in vitro and in vivo responses.
SEABED depends on a segmentation framework that builds upon previous work using network models in biomedical contexts 45,46 that partition a population of cell lines described by multiple variables into distinct subpopulations using a “top-down” approach of recursively identifying optimal cuts for graph bisection. Traditional approaches to segmentation, such as agglomerative, “bottom-up” hierarchical clustering and iterative K-Means clustering, are greedy algorithms that are inherently sub-optimal in constructing clusters and consequently may not identify the most distinct subpopulations. Moreover, these approaches frequently require a priori estimates of the number of subpopulations; many heuristics exist for this, but in practice the number is commonly estimated by trial and error. Hierarchical clustering has been utilized routinely to attribute molecular markers to differences in subpopulation drug response and outcomes 47,48. Because of their success in other industries 49,50 and their natural amenability to matrix decomposition techniques, network-based approaches have emerged as viable alternatives for discovering distinct subpopulations 45,46,51. Similarly, while our segmentation capitalizes on past progress made in spectral clustering 52,53, our effort distinguishes itself from past attempts by integrating all variables into a single network model using a multivariate similarity measure that utilizes local and global network statistics. Deeper interpretations of matrix subspaces in network models may provide further insight into the linkage between subpopulations of cancer cell lines and drugs.
As a whole, this study demonstrates several important insights about the pharmacological pattern of response for different cancer drugs by applying an unsupervised machine learning platform to segment a large pan-cancer in vitro pharmacology data set. By organizing cell lines along similar pharmacological patterns of response, we identified distinct, intrinsic subpopulations sensitive to one drug but resistant to others, and in some cases identified genetic alterations that can be used as biomarkers for those subpopulations. In the context of analytical frameworks for increasing drug R&D productivity by sharpening the focus of drugs 54, our work demonstrates the value of advanced analytical approaches in translational medicine to enable decision making that is more data-driven and less ambiguous. Moreover, by analyzing different pharmacological responses and interpreting its outputs in the context of the underlying genetics and molecular pathways, we have created a multi-faceted landscape for developing and assessing new drug therapies.
Methods
CONTACT FOR REAGENT AND RESOURCE SHARING
All code for the pipeline is open source and available at: https://github.com/szen95/SEABED. Further information and requests should be directed to and will be fulfilled by the Lead Contact, Dennis Wang (dennis.wang{at}sheffield.ac.uk).
METHOD DETAILS
Pharmacology data
The discovery pharmacology dataset was extracted from the Genomics of Drug Sensitivity in Cancer (GDSC) database 3,4, while leads from the analysis were validated with the Cancer Cell Line Encyclopedia (CCLE) 5 and the Cancer Therapeutics Response Portal (CTRP) 6–8. Furthermore, suggested drug combinations were validated with cell line responses from the AstraZeneca-DREAM challenge dataset 25 and patient derived xenograft (PDX) models from Gao et al. 26.
For a given cell line in GDSC, the drug response was fitted with a sigmoid curve 55 and subsequently quantified as the area under the curve (AUC) or the concentration required to reduce cell viability by half (IC50). GDSC contains 265 compounds tested in 990 cell lines; we focus on a subset of 38 drugs targeting either PI3K-AKT or MAPK signalling, which leads to 344 experiments considered for evaluation.
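To make the two summary metrics concrete, the following is a minimal sketch of how an IC50 and a normalised AUC can be read off a fitted sigmoid. It assumes a simple two-parameter logistic curve on a natural-log concentration axis; the function names, the parameterisation and the example values are illustrative assumptions and do not reproduce the GDSC curve-fitting model 55.

```python
import numpy as np

def sigmoid_viability(log_conc, log_ic50, slope):
    """Two-parameter sigmoid: viability falls from 1 towards 0 as dose rises.

    ASSUMPTION: this simple logistic form is for illustration only.
    """
    return 1.0 / (1.0 + np.exp((log_conc - log_ic50) / slope))

def response_metrics(log_ic50, slope, log_min, log_max, n_points=201):
    """Return (IC50, normalised AUC) for a fitted curve over the screened range."""
    x = np.linspace(log_min, log_max, n_points)
    y = sigmoid_viability(x, log_ic50, slope)
    # Trapezoidal rule written out explicitly.
    area = np.sum((y[1:] + y[:-1]) / 2.0 * np.diff(x))
    auc = area / (log_max - log_min)  # 1 = no effect, 0 = complete response
    return float(np.exp(log_ic50)), float(auc)

# Hypothetical fitted parameters for one cell line and one drug.
ic50, auc = response_metrics(log_ic50=0.0, slope=0.5, log_min=-4.0, log_max=4.0)
print(f"IC50 = {ic50:.3f}, AUC = {auc:.3f}")
```

A lower IC50 and a lower AUC both indicate greater sensitivity, which is why the two values can disagree for curves with different slopes and why integrating both is informative.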
Deep molecular characterisation of the cancer cell lines
The GDSC resource provides the characterisation of >1,000 cell lines including whole exome sequencing and SNP6.0 arrays, which enabled quantification of gene-level mutational and copy number variation status. Additionally, 10 key fusion genes were included in this analysis, which is summarized in the binary event matrix (BEM) from Iorio et al. 4.
Processing drug response measures (AUC/IC50 values)
We build network models for a set of N cell lines, C = {C1, …, CN}, that are separately exposed to two distinct drugs, D1 and D2, which results in two sets of M measurement variables, Xi = [x1, …, xM], i = 1, 2, describing the response to each compound:
We use a network model that is an undirected graph, G, consisting of N vertices, Vi, i = 1,…,N, (one for each cell line in C) with weighted edges, Wi,j(Vi,Vj), i,j = 1,…,N, i ≠ j, between every distinct pair of vertices. Our approach uses a single multivariate similarity measure (Equation 2) to construct one network model, with the advantage that the subspace properties of the resulting adjacency and Laplacian matrices are fully embedded with the complete characteristics of C. The weight is the similarity, wi,j, between the i-th and j-th composite 2M × 1 dose response profiles (DRPs), Xi = [X1,i,X2,i], for Ci and Cj.
We characterize drug response by two important continuous-valued measurements extrapolated from the cell line pharmacology screens: the IC50 value and the AUC of the dose-response curve (Table S2) observed when one compound is applied in vitro to a single cell line sample at successively greater concentrations. Since every cell line possesses a length-4 DRP for a given pair of drugs, the similarity, w, between any two cell lines resides on (0,1] and is calculated by a multivariate quasi-Gaussian comparison that differences the elements of the DRPs but also weighs the differences by a combination of local and global network statistics. Similarity between the response vectors, Xi and Xj, is given by:
The similarity between two cell lines equals one when both have identical covariate values, and approaches zero as their covariates increasingly differ. Additionally, w(Xi, Xj) = w(Xj, Xi). ΔXi,j is a 4 × 1 vector whose entries are the differences of the DRP values in Xi and Xj, and β modulates the similarity between two cell lines. We selected β = 0.5 for our experiments based on experimentation and the observations of previous efforts.
Σi,j is a 4 × 4 covariance-like matrix that is estimated for every distinct (i, j)-pair and captures the variability of individual variables as well as their inter-relationships. While Σi,j is an explicit function of the two cell lines being compared, it also captures network-wide characteristics. For diagonal elements, Σi,j(a,a), a = 1,…,4, the entries are: where Ngbd(ΔXi(a)) corresponds to all edges neighboring vertex i, and the overbar is the averaging operator. The off-diagonal elements, Σi,j(a,b), a,b = 1,…,4, a ≠ b, are:
The Moore-Penrose pseudo-inverse was used to avoid problems with low rank when inverting Σi,j in the similarity calculation. The symmetric, positive semi-definite, N × N weighted adjacency matrix, W, holds the pairwise similarities.
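The construction of W can be sketched as follows. This is a simplified illustration and not the SEABED similarity itself: it assumes an exponential quadratic form for the quasi-Gaussian comparison and replaces the pair-specific matrix Σi,j with a single covariance matrix estimated across all cell lines. The function name and the random example profiles are hypothetical.

```python
import numpy as np

def similarity_matrix(drp, beta=0.5):
    """Pairwise quasi-Gaussian similarities for an (N, 4) array of DRPs.

    ASSUMPTIONS: one global covariance matrix stands in for the
    pair-specific Sigma_{i,j} of the text, and the exponent
    -beta * dX^T pinv(Sigma) dX is an illustrative form.
    """
    drp = np.asarray(drp, dtype=float)
    sigma = np.atleast_2d(np.cov(drp, rowvar=False))
    sigma_pinv = np.linalg.pinv(sigma)          # pseudo-inverse tolerates low rank
    delta = drp[:, None, :] - drp[None, :, :]   # (N, N, 4) pairwise differences
    dist = np.einsum("ija,ab,ijb->ij", delta, sigma_pinv, delta)
    return np.exp(-beta * np.maximum(dist, 0.0))

rng = np.random.default_rng(0)
# Hypothetical rows of [IC50_D1, AUC_D1, IC50_D2, AUC_D2] for six cell lines.
profiles = rng.normal(size=(6, 4))
W = similarity_matrix(profiles)
print(W.shape)
```

Under these assumptions the result has the properties stated in the text: identical profiles give a similarity of one, the matrix is symmetric, and similarity decays towards zero as profiles diverge.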
Segmentation
The set of cell lines, C, is segmented recursively into distinct subpopulations using the Fiedler eigenvector derived from the eigendecomposition of W 56. Each subpopulation of cell lines is successively segmented into two subpopulations until the size of either subpopulation falls below a user-defined threshold, or when previously observed significant enrichment of genetic biomarkers (see below) is no longer observed in all current subpopulations. In our experiments, we required both resulting subpopulations to have 20 or more members in order to be retained. Criteria and thresholds can be modified and adapted to emphasize relevant factors in a particular problem. The whole process yields K mutually exclusive subpopulations Pk, k = 1,…,K, where C = P1 ∪ … ∪ PK. Successive segmentation results in subpopulations with increasingly similar DRPs.
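The recursion can be illustrated with a short sketch of spectral bisection. It assumes the Fiedler vector is taken from the unnormalised graph Laplacian of W and splits by sign, and it applies only the minimum-size criterion; the biomarker-enrichment stopping rule is omitted. The function names and the toy one-dimensional data are hypothetical.

```python
import numpy as np

def fiedler_split(W):
    """Bisect a weighted graph by the sign of its Fiedler vector.

    ASSUMPTION: the unnormalised Laplacian L = D - W is used here.
    """
    laplacian = np.diag(W.sum(axis=1)) - W
    _, vecs = np.linalg.eigh(laplacian)   # eigenvalues returned in ascending order
    return vecs[:, 1] >= 0                # eigenvector of the second-smallest eigenvalue

def segment(W, members=None, min_size=20):
    """Recursively bisect; keep a split only if both halves reach min_size."""
    if members is None:
        members = np.arange(W.shape[0])
    if members.size < 2 * min_size:
        return [members]
    mask = fiedler_split(W[np.ix_(members, members)])
    left, right = members[mask], members[~mask]
    if min(left.size, right.size) < min_size:
        return [members]
    return segment(W, left, min_size) + segment(W, right, min_size)

# Toy data: two well-separated groups of 30 "cell lines" on one response axis.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 0.2, 30), rng.normal(4.0, 0.2, 30)])
W = np.exp(-(x[:, None] - x[None, :]) ** 2)
parts = segment(W, min_size=20)
print([len(p) for p in parts])
```

Raising `min_size` yields fewer, larger subpopulations, which mirrors the trade-off between granularity and certainty described in the Introduction.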
QUANTIFICATION AND STATISTICAL ANALYSIS
Enrichment of features to nominate biomarkers
Because the genetic alterations in each cell line are known, each subpopulation can be evaluated by non-parametric statistical tests to identify enriched alterations that may be attributed to patterns of sensitivity or resistance in the DRP across both drugs. For each subpopulation, we measured the number of cell lines in the subpopulation with a particular gene mutation, and the number of cell lines outside of the subpopulation with the mutation. A 2×2 contingency table was generated from the cell line counts of with/without mutation and inside/outside of subpopulation. Significance of the observed enrichment of mutations within subpopulations was calculated using Fisher’s exact test. The resulting p-values were corrected for multiple testing using the Benjamini and Hochberg (BH) procedure (Table S3).
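The enrichment test can be sketched with the standard library alone. The example below assumes a one-sided test for over-representation (the text does not state the sidedness), for which Fisher's exact p-value equals the upper tail of the hypergeometric distribution. The function names and the counts are hypothetical.

```python
from math import comb

def enrichment_pvalue(in_mut, in_wt, out_mut, out_wt):
    """One-sided Fisher's exact test for enrichment inside a subpopulation.

    The 2x2 table is [[in_mut, in_wt], [out_mut, out_wt]]. The p-value is the
    hypergeometric probability of at least in_mut mutants inside.
    ASSUMPTION: one-sided alternative (over-representation).
    """
    total = in_mut + in_wt + out_mut + out_wt
    mutants = in_mut + out_mut
    size = in_mut + in_wt
    upper = min(mutants, size)
    tail = sum(comb(mutants, k) * comb(total - mutants, size - k)
               for k in range(in_mut, upper + 1))
    return tail / comb(total, size)

def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):          # walk from the largest p-value down
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Hypothetical counts: 20 of 40 lines mutated inside, 30 of 760 outside.
p_enriched = enrichment_pvalue(20, 20, 30, 730)
print(p_enriched < 0.05, benjamini_hochberg([0.01, 0.04, 0.03]))
```

In practice a library routine such as `scipy.stats.fisher_exact` would normally be used; the explicit sum is shown only to make the link between the contingency table and the hypergeometric tail visible.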
Classification of pair-wise drug responses
We made 342 pairwise comparisons of drugs targeting the MAPK and PI3K-AKT pathways. Based on the distribution of log(IC50) values across all cell lines tested with both drugs, we determined the 20th percentile of log(IC50) values for each drug. The 20th percentile cutoffs P20 for drugs A and B were used to categorise the average log(IC50) y of each subpopulation i into four categories:
yi,A < P20,A and yi,B < P20,B: sensitive to drugs A and B
yi,A < P20,A and yi,B ≥ P20,B: more sensitive to drug A
yi,A ≥ P20,A and yi,B < P20,B: more sensitive to drug B
yi,A ≥ P20,A and yi,B ≥ P20,B: resistant to drugs A and B
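The four-way rule above can be written as a short function. This is a sketch under stated assumptions: the linear-interpolation percentile is one common convention and may differ from the one used in the pipeline, and the names `percentile` and `classify` are hypothetical.

```python
def percentile(values, q):
    """Percentile with linear interpolation between order statistics (q in [0, 100])."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)

def classify(mean_a, mean_b, p20_a, p20_b):
    """Assign a subpopulation to one of the four response categories."""
    sens_a = mean_a < p20_a               # below the drug A cutoff means sensitive to A
    sens_b = mean_b < p20_b
    if sens_a and sens_b:
        return "sensitive to drugs A and B"
    if sens_a:
        return "more sensitive to drug A"
    if sens_b:
        return "more sensitive to drug B"
    return "resistant to drugs A and B"
```

Here `mean_a` and `mean_b` are the subpopulation's average log(IC50) for each drug, and the cutoffs would come from `percentile(all_log_ic50, 20)` computed per drug over all cell lines tested with both drugs.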
The number of subpopulations in each category was recorded in a 2×2 contingency matrix and normalised by the number of cell lines in each subpopulation. For each drug pair, a binomial test was performed to test whether the number of subpopulations in each category was greater than would be expected.
After classification of pair-wise drug responses, we assessed whether a drug was significantly enriched for one category in comparisons with all other drugs. Testing was carried out using the hypergeometric test (phyper function in R).
2-D visualization of drug response profiles
To visualize DRPs across cell lines and drug comparisons, we calculated the average log(IC50) value for each drug in the subpopulations generated from the response to each tested drug pair (Table S3). We then plotted the mean log(IC50) values as circles on a 2-D scatter plot using the Matplotlib Python library. Dashed lines indicating the 20th percentile of log(IC50) values for each drug were also plotted. The radius of each circle is proportional to the subpopulation size.
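A plot of this kind could be produced along the following lines. This is an illustrative sketch, not the published plotting code: the function name `plot_drp`, the `scale` parameter, the colours and the axis labels are assumptions. Because Matplotlib's `scatter` takes marker area rather than radius, the subpopulation size is squared so that the radius scales linearly with size.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the script runs without a display
import matplotlib.pyplot as plt

def plot_drp(mean_a, mean_b, sizes, p20_a, p20_b, scale=1.0):
    """Scatter subpopulation means for two drugs with dashed 20th-percentile cutoffs."""
    fig, ax = plt.subplots()
    # `s` is marker area in points^2; squaring makes the radius proportional to size
    areas = [(scale * n) ** 2 for n in sizes]
    ax.scatter(mean_a, mean_b, s=areas, alpha=0.5)
    ax.axvline(p20_a, linestyle="--", color="grey")   # drug A cutoff
    ax.axhline(p20_b, linestyle="--", color="grey")   # drug B cutoff
    ax.set_xlabel("mean log(IC50), drug A")
    ax.set_ylabel("mean log(IC50), drug B")
    return fig, ax
```

The returned figure can be written to disk with `fig.savefig("drp.png")`.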
Tree visualization of subpopulations
We used tree diagrams to visualize how the cancer cell lines are segmented into different subpopulations, based on whether they are sensitive or resistant to the drugs being tested. The tree diagrams were generated using the open-source Graphviz library for Python. The style of each component of the tree diagram, including the colours, shapes, and fonts of the edges and nodes, was first initialized through a class. A method to create tree diagrams was developed that accepts the number of vertices and leaves, the labels for the leaves, and the tree diagram filename. Calling this method generates and saves the tree diagram.
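To illustrate the idea without depending on a particular Graphviz binding, the sketch below emits DOT source for a segmentation tree using only the standard library. The function name `tree_to_dot`, the nested-tuple input format, and the style string are hypothetical choices for this example and do not reflect the interface of the published code.

```python
def tree_to_dot(tree, node_style="shape=box, style=filled, fillcolor=lightgrey, fontname=Helvetica"):
    """Build DOT source from a nested tree.

    An internal node is a tuple (label, left, right); a leaf is a plain string label.
    Labels are assumed not to contain double quotes.
    """
    lines = ["digraph segmentation {", f"  node [{node_style}];"]
    counter = 0

    def visit(node):
        nonlocal counter
        node_id = f"n{counter}"
        counter += 1
        if isinstance(node, str):
            # leaves (final subpopulations) are drawn as ellipses
            lines.append(f'  {node_id} [label="{node}", shape=ellipse];')
        else:
            label, left, right = node
            lines.append(f'  {node_id} [label="{label}"];')
            for child in (left, right):
                child_id = visit(child)
                lines.append(f"  {node_id} -> {child_id};")
        return node_id

    visit(tree)
    lines.append("}")
    return "\n".join(lines)
```

The resulting text can be saved to a file and rendered with the Graphviz command-line tool, for example `dot -Tpng tree.dot -o tree.png`.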
DATA AND SOFTWARE AVAILABILITY
All code for the pipeline is open source and available at: https://github.com/szen95/SEABED. All data used in the paper have been published previously and are publicly available from the GDSC, CCLE, and CTRP databases. Datasets used are listed in Table S2, Table S3, and the Key Resources Table.
ADDITIONAL RESOURCES
Response patterns of 342 pair-wise comparisons of 18 PI3K-AKT and 19 MAPK pathway inhibitors: https://szen95.github.io/SEABED/
Supplemental Information
Supplemental Text and Figures.pdf
Document S1. Figures S1–S4 and Table S1.
Table S2.xlsx
Table S2. Input Data for Segmentation, Related to Figures 1-4, S1-S4, and Website S1
Input data: log(IC50) and AUC values for 1074 cancer cell lines treated with 265 anti-cancer drugs.
Table S3.xlsx
Table S3. Output Data and Enriched Biomarkers After Segmentation, Related to Figures 1-4, S2-S4, and Website S1
Output data: IC50 20% cutoff, minimum and maximum IC50 concentration, and average IC50 response of each subpopulation towards each drug (344 cancer drug pairs) tested on the cancer cell lines. The subpopulation number and the number of cell lines in each subpopulation are recorded. Each individual cell line in every subpopulation together with individual cell line tissue types are also shown. Enriched biomarkers: Biomarkers found within subpopulations (adjusted p-value and/or p-value < 0.05), together with the subpopulation number, the number of cell lines in each subpopulation, percentage of the biomarkers found within each subpopulation, the number of cell lines in the subpopulation with the biomarker, the p-value, and adjusted p-value.
Author contributions
NK, TST, MM, and DW contributed to the conceptualization of the project. NK, TST, and DW were responsible for data curation. Formal analysis was performed by NK, TST, MM, and DW. DW was solely responsible for funding acquisition. The methodology was developed by NK, TST, MM, and DW. TST and DW were responsible for administration of the project. Resources for the experiments were prepared by NK. Software for the project was developed and implemented by NK, TST, and DW. MM and DW supervised the project. NK, TST, MM, and DW wrote the manuscript.
Declaration of Interests
N.K. is an employee of Constellation Analytics, LLC.
Acknowledgements
We would like to thank Ben Sidders (AstraZeneca plc.), Jonathan Dry (AstraZeneca plc.), Francesco Iorio (Sanger Institute), Michael Schubert (EMBL-EBI), Mi Yang (RWTH Aachen) and Winston Hide (Harvard University) for useful discussions. This work is supported by funding from the NIHR Sheffield Biomedical Research Centre.