Data and preprocessing
Cancer Cell Line Encyclopedia (CCLE) [23] and Genomics of Drug Sensitivity in Cancer (GDSC) project [24] are two most important resources of publicly available data for investigating anticancer drug response. They are benchmark compilations of gene expression, gene copy number and massively parallel sequencing data. We selected 491 cancer cell lines from CCLE, downloaded the chemical structure files of 23 drugs from PubChem Compound, and then obtained a cell line-drug response matrix consisting of 11,293 entries, of which 423 (3.75%) are missing values. We also selected 655 cancer cell lines from GDSC and 129 drugs in the PubChem database. The resulting drug response matrix has 84,495 entries, out of which 15,763 (18.66%) are missing. The given drug responses were measured by activity area for CCLE and IC50 for GDSC. Higher Activity area or lower IC50 value indicates a better sensitivity of the cell line to a given drug. To eliminate the differences in susceptibility of different drugs, we normalized the drug response data such that all cell line susceptibility data have the same baseline and the same range (see Fig. 1 as an example).
Generalized observation
For the first question, we want to know whether available drug-cell line response values have the statistical power to predict the response of ‘new drug to new cell line’. Motivated by [20, 22], we first examined the response correlations between genetically similar cell lines and structurally similar drugs.
Cell line similarities are measured by Pearson correlation coefficients between their corresponding gene expression profiles. The correlations of most cell line pairs (around 92% for CCLE, 70% for GDSC) are larger than 0.8. We divided all possible cell line pairs with correlation coefficients higher than 0.9 into high similar group ‘Hc’, and other pairs into low similar group ‘Lc’.
Next, we used Open Babel to obtain molecular fingerprints of selected drugs [25]. Fingerprint-based Tanimoto coefficient is often used as a molecular similarity indicator in cheminformatics literature [22, 26, 27]. Define the distance between two drugs as d(Di, Dj) = 1 − T(Di, Dj), where T(Di, Dj) is the Tanimoto coefficient between drugs Di and Dj. Based on the drug distance matrix (see Additional file 1: Table S1 and Additional file 2: Table S2), we clustered all drugs using “complete” method in R. Drugs with high distances tend to be in different clusters, while drugs with similar structure are expected to be clustered together (see Fig. 2a and c). For CCLE dataset, we extracted such drug pairs from Fig. 2a with Tanimoto coefficient greater than 0.5 and distance less than 0.49 into high similar group ‘Hd’: {17-AAG, Paclitaxel, AZD6244, PD-0325901, Nilotinib, PD-0332991, AEW541, PF2341066, Erlotinib, ZD-6474, AZD0530, TAE684, Lapatinib, PLX4720, PHA-665752, Irinotecan, Topotecan}. Other drug pairs were divided into low similar group ‘Ld’. For GDSC dataset, we extracted such drug pairs from Fig. 2c with Tanimoto coefficient greater than 0.5 and distance less than 0.45 into high similar group ‘Hd’: {Tipifarnib, PLX4720, Dasatinib, Sunitinib, PHA-665752, AZ628, Imatinib, AMG-706, BMS-754807, PF-02341066, Bosutinib, A-770041, PD-173074, AZD6244, CI-1040, PD-0325901, Erlotinib, AZD-0530, Gefitinib, BIBW2992, NVP-TAE684, WH-4023}. Other drug pairs were divided into low similar group ‘Ld’. From Fig. 2b and d we found that more similar Cell lines always show higher response correlations to more similar drugs, it holds for both CCLE and GDSC data sets.
Construction of cell line-drug complex network model
We use Ω to represent the set of all possible cell line-drug pairs. Denote ρ(C, Ci) as the Pearson correlation coefficient between cell lines C and Ci, T(D, Dj) as the Tanimoto coefficient between drugs D and Dj. Meanwhile, we use R(C, D) to represent the observed response value of the pair (C, D) ∈ Ω. Define Ci and Cj as adjacent if ρ(Ci, Cj) ≠ 0, and the weight of this edge as ρ(Ci, Cj). Similarly, Di and Dj are called adjacent if their weight T(Di, Dj) > 0. Define Ci and Dj as adjacent if R(Ci, Dj) is available. Obviously, the resulting network involves cell line similarity and drug similarity information, as well as cell line-drug response situations, so we call it the cell line-drug complex network (CDCN). In fact, this network is the dual-layer integrated cell line-drug network in [20]. Figure 3b showed a CDCN corresponding to the cell line-drug response matrix described in Fig. 3a.
Define \( w\left(\mathrm{C},{\mathrm{C}}_i\right)={e}^{-\frac{{\left(1-\rho \left(\mathrm{C},{\mathrm{C}}_i\right)\right)}^2}{2{\alpha}^2}} \) as a weight function of cell lines. It increases with respect to ρ(C, Ci), where the parameter α measures the decay rate with the decrease of ρ(C, Ci). Similarly, define a weight function of drugs \( w\left(\mathrm{D},{\mathrm{D}}_j\right)={e}^{-\frac{{\left(1-T\left(\mathrm{D},{\mathrm{D}}_j\right)\right)}^2}{2{\tau}^2}} \) with decay parameter τ.
For a given pair (C, D), let Ω\{(C, D)} be the set of all other pairs (Ci, Dj) besides (C, D). Based on the generalized observation we are able to make a prediction by dealing with all possible observed response values R(Ci, Dj) as the following,
$$ \hat{R}\left(\mathrm{C},\mathrm{D}\right)=\frac{\sum_{\left({\mathrm{C}}_i,{\mathrm{D}}_j\right)\in \Omega \setminus \left\{\left(\mathrm{C},\mathrm{D}\right)\right\}}w\left(\mathrm{C},{\mathrm{C}}_i\right)w\left(\mathrm{D},{\mathrm{D}}_j\right)R\left({\mathrm{C}}_i,{\mathrm{D}}_j\right)}{\sum_{\left({\mathrm{C}}_i,{\mathrm{D}}_j\right)\in \Omega \setminus \left\{\left(\mathrm{C},\mathrm{D}\right)\right\}}w\left(\mathrm{C},{\mathrm{C}}_i\right)w\left(\mathrm{D},{\mathrm{D}}_j\right)} $$
(1)
where \( \widehat{R}\left(\mathrm{C},\mathrm{D}\right) \) is the predicted response value for the pair (C, D). The product w(C, Ci)w(D, Dj) reflects the contribution of R(Ci, Dj) to \( \widehat{R}\left(\mathrm{C},\mathrm{D}\right) \).
It is worth mentioning that formula (1) is applicable to all types of pairs (C, D). Even if C and D are both new (it means that R(C, Dj) and R(Ci, D) are not known for any existing drug Dj and any existing cell line Ci). In this circumstance, the cell line-drug response matrix and the corresponding cell line-drug complex network showed in Fig. 3 would be changed into ones depicted in Fig. 4. Formula (1) also has a ‘little variation’ in the assignment of the pair (Ci, Dj), that is
$$ \hat{R}\left(\mathrm{C},\mathrm{D}\right)=\frac{\sum_{\begin{array}{c}\left({\mathrm{C}}_i,{\mathrm{D}}_j\right)\in \Omega \\ {}{\mathrm{C}}_i\ne \mathrm{Cand}{\mathrm{D}}_j\ne \mathrm{D}\end{array}}w\left(\mathrm{C},{\mathrm{C}}_i\right)w\left(\mathrm{D},{\mathrm{D}}_j\right)R\left({\mathrm{C}}_i,{\mathrm{D}}_j\right)}{\sum_{\begin{array}{c}\left({\mathrm{C}}_i,{\mathrm{D}}_j\right)\in \Omega \\ {}{\mathrm{C}}_i\ne \mathrm{Cand}{\mathrm{D}}_j\ne \mathrm{D}\end{array}}w\left(\mathrm{C},{\mathrm{C}}_i\right)w\left(\mathrm{D},{\mathrm{D}}_j\right)} $$
(2)
The ‘little variation’ is crucial for accomplishing the response prediction of ‘new drug to new cell line’. To highlight the difference between two formulas, we called formula (1) as CDCN model I and formula (2) as CDCN model II.
The decay parameter pairs (α, τ) could be optimized by minimizing the following overall error function
$$ \left(\widehat{\alpha},\widehat{\tau}\right)={\mathrm{argmin}}_{\left(\alpha, \tau \right)}{\sum}_{\left(\mathrm{C},\mathrm{D}\right)\in \Omega}{\left(\widehat{R}\left(\mathrm{C},\mathrm{D}\right)-R\left(\mathrm{C},\mathrm{D}\right)\right)}^2 $$
(3)
where α and τ are ranged from 0 to 1 with increment 0.01, respectively, and the pair (α, τ) takes all possible combinations.
We conducted leave-one-out cross-validation by singling out each cell line-drug pair as the test dataset, and used Pearson correlation coefficients between predicted and observed response values to evaluate the predictive power of the proposed model. Root mean square error (RMSE) and normalized root mean square error (NRMSE) of each drug D were also calculated to assess the model.
$$ \mathrm{RMSE}\left(\mathrm{D}\right)=\sqrt{\frac{\sum_{\mathrm{C}}{\left(\widehat{R}\left(\mathrm{C},\mathrm{D}\right)-R\left(\mathrm{C},\mathrm{D}\right)\right)}^2}{n}} $$
(4)
$$ \mathrm{NRMSE}\left(\mathrm{D}\right)=\frac{\mathrm{RMSE}\left(\mathrm{D}\right)}{\max_{\mathrm{C}}R\left(\mathrm{C},\mathrm{D}\right)-{\min}_{\mathrm{C}}R\left(\mathrm{C},\mathrm{D}\right)} $$
(5)
Where C ranges over all cell lines for which R(C, D) are known, and n is the number of such cell lines.