Application of transfer learning for cancer drug sensitivity prediction
BMC Bioinformatics volume 19, Article number: 497 (2018)
Abstract
Background
In precision medicine, the scarcity of suitable biological data often hinders the design of an appropriate predictive model. Large-scale pharmacogenomics studies such as CCLE and GDSC hold the promise of mitigating this issue. However, data from multiple sources cannot be pooled directly because of the distribution shift between them. One way to address this problem is to use transfer learning methodologies tailored to this specific context.
Results
In this paper, we present two novel approaches for incorporating information from a secondary database to improve prediction in a target database. The first approach is based on latent variable cost optimization and the second considers a polynomial mapping between the two databases. Using the CCLE and GDSC databases, we illustrate that the proposed approaches achieve better prediction of drug sensitivities in different scenarios than existing approaches.
Conclusion
We have compared the performance of the proposed predictive models with database-specific individual models as well as existing transfer learning approaches. Our proposed approaches exhibit superior performance compared to these alternative techniques for predicting sensitivity to different anticancer compounds. In particular, the nonlinear mapping model shows the best overall performance.
Background
A consistent challenge in precision medicine is to design models that predict the sensitivity of a tumor to an anticancer compound with high accuracy. In this respect, large-scale pharmacogenomic studies of cancer genomes have provided unprecedented insights for studying anticancer therapeutics and predicting drug sensitivity. The Genomics of Drug Sensitivity in Cancer (GDSC) [1] of the Cancer Genome Project and the Cancer Cell Line Encyclopedia (CCLE) [2] from the Broad Institute are two such studies, in which drug sensitivity profiles and genomic information across hundreds of compounds and cancer cell lines have been systematically gathered. There is significant overlap between the two databases, which can be exploited to design more accurate sensitivity prediction models. Biological data for designing suitable predictive models are frequently scarce, and therefore the availability of a secondary dataset often holds promise for better model development. However, the majority of machine learning approaches used in drug sensitivity prediction assume that the training and test data lie in the same feature space and follow the same distribution. When training and test data share a feature space but exhibit different distributions, one needs to take the distribution shift into account. This is where transfer learning (TL) methodologies come into play [3].
Often in a TL setting, the source and target domains can be considered as linked subspaces of a high-level common domain space [4]. We therefore need to assume that there is some consistency between the datasets used in TL. Haibe-Kains et al. [5] first pointed out that, although the gene expression data from the CCLE and GDSC databases are well correlated, the measured pharmacological drug responses, summarized by common estimators such as IC_{50} and the area under the curve (AUC), are unexpectedly highly discordant. In response, the CCLE and GDSC investigators performed their own analysis [6] and presented results opposing the conclusions in [5]. They pointed out that, for the majority of drugs, the AUC and IC_{50} distributions are dominated by drug-insensitive lines with a much smaller number of outliers, and postulated that differences in cell line biology between the studies have resulted in the poor correlation. Taking these facts into account, they demonstrated a significant improvement in correlation for most of the drugs. In any event, the fact that both databases provide information about the same biological process makes them suitable candidates for applying transfer learning methodologies.
When training and test sets are inconsistent and follow different distributions, various TL approaches [3] have been attempted to handle the dataset shift. Unsupervised methods such as INSPIRE (INferring Shared modules from multiPle gene expREssion datasets) [7] focus primarily on expression datasets, extracting a low-dimensional representation and predicting tumor phenotypes using regularized regression approaches. Inductive transfer learning (ITL) approaches, as in [8], tackle prediction for scarce primary data using a secondary dataset through importance sampling, i.e., reweighting the secondary distribution to the primary. While the primary data size is assumed to be significantly smaller than the secondary data size, for a large number of unlabeled data one has to adapt to covariate shift along with ITL. Boosting-based approaches such as Dynamic-TrAdaBoost [9] apply ensemble methods to both source and target instances and then employ an update mechanism that incorporates only the source instances useful for the target task, with an additional dynamic correction factor. Kernel-based ITL methods [10, 11] focus on finding an appropriate kernel for the newly available data, modeling the difference from the existing data as a problem of finding a suitable bias.
The previous approaches for transfer learning work well under the assumption that the datasets are closely related (such as the 9 ovarian cancer datasets in INSPIRE) and that the number of samples is significantly larger than the number of features (n>p). However, the scenario is frequently reversed for genomic (or proteomic) data, i.e., we usually have tens of thousands of genes and a small number of cell lines. Additionally, the previous TL methods often remove the distribution shift via weighting, without any explicit domain transfer. In our work, we propose two different TL approaches that map the data from two different databases either to a common space or to each other's domain, inherently taking care of the n≪p problem. The inherent assumptions for each pair of similar datasets from CCLE and GDSC are: (i) the datasets change monotonically in the same direction, and (ii) there exists a functional relationship between them. To build an appropriate prediction model, we use gene expression as the predictors and drug sensitivity (specifically AUC) as the output. Considering the application of TL to these datasets, the approaches proposed in this paper can be classified into two categories, as illustrated in Fig. 1.

Cost-optimization-based approach, where we employ latent variable models to extract the underlying variables shared between different datasets. In this case, TL can be applied to the output only (Fig. 1(a)), as in the parameter transfer approach [12, 13], or to both model input and output (Fig. 1(b)), as in [14, 15].

Domain transfer approach, where we design maps between databases to transfer data from the primary domain to the secondary domain and use the secondary data to improve the prediction model. Here, TL is applied to both input and output (Fig. 1(c)), as in the instance transfer approach [14, 15].
To summarize, the key contribution of the paper is the implementation of two TL-based approaches, in which the target (primary) data is transferred either to a common latent variable space along with the source (secondary) data, or to the source domain through nonlinear mapping, in order to improve prediction for limited primary data using the available secondary data.
Results
To evaluate the performance of our transfer learning algorithms, we first retrieved the data common to both CCLE and GDSC. From GDSC (v6.0) and CCLE, there are 15,664 common genes available in 623 common cell lines along with 15 common drugs. We performed a drug-wise analysis and found that the number of cell lines decreases from 623 once the available drug sensitivity values are taken into account, resulting in datasets with between 91 and 310 cell lines along with 15,664 genes and the corresponding sensitivity measures. For analyses involving gene expression, we used ReliefF [16] to select the top 200 genes from each dataset and took the intersection as the final feature set. For the drug sensitivity measure, we used the AUC values, as they show more concordance between databases (median ρ_{s}=0.34) than IC_{50} (median ρ_{s}=0.28) [5]. Note that, in spite of our discussion of inconsistencies between databases, the main goal here is to consider the scenario where only a small portion of database 1 (i.e., GDSC) is available while the entire database 2 (i.e., CCLE) is available, and we would like to use database 2 to improve prediction performance for the rest of database 1. Thus, for evaluation, we use the GDSC experimental AUCs as the gold standard and compare them with the predicted AUCs.
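The feature-set construction described above (top-ranked genes per dataset, then the intersection) can be sketched as follows. This is a minimal illustration that assumes per-gene relevance scores have already been computed (for example by ReliefF); the function names, gene labels and scores are hypothetical and are not taken from the paper.

```python
def top_k(scores, k):
    """Return the set of the k feature names with the highest scores."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])


def shared_features(scores_a, scores_b, k=200):
    """Intersect the top-k features of two datasets (sorted for determinism)."""
    return sorted(top_k(scores_a, k) & top_k(scores_b, k))


# Hypothetical relevance scores for four genes in each database.
gdsc_scores = {"G1": 0.9, "G2": 0.8, "G3": 0.1, "G4": 0.7}
ccle_scores = {"G1": 0.6, "G2": 0.2, "G3": 0.9, "G4": 0.8}

features = shared_features(gdsc_scores, ccle_scores, k=3)
```

Because the intersection is taken after ranking, the final feature set can contain fewer than k genes.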
Latent variable cost optimization approach
We have performed drug sensitivity prediction using the three latent variable cost optimization based approaches, namely Latent Regression Prediction (LRP), Latent-Latent Prediction (LLP) and Combined Latent Prediction (CLP) (described in the “Methods” section), for 7 common drugs with sufficient cell lines (n>200). For each method, subsets of 50 randomly chosen GDSC cell lines (X_{11} & y_{11} in Figs. 2 & 3) are used for the cost optimization in training, and the rest (y_{12}) are predicted with the help of the known CCLE data (X_{2} & y_{2} in Figs. 2 & 3). Table 1 compares the prediction performance of all three methods with Direct Prediction (DP) under K-fold cross-validation, where DP is defined as training on the 50 available cell lines and predicting for the rest. Here, the number of folds is \(K = \frac{n}{50}\), where 1 fold (containing ∼50 samples) is used for training and the remaining (K−1) folds are used for testing.
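The cross-validation scheme above inverts the usual roles of the folds: one fold trains and the other K−1 folds test. A minimal sketch of such a split is given below. The integer division used to obtain K, the random seed and the function name are assumptions made for illustration; the paper does not state how n/50 is rounded.

```python
import random


def reversed_kfold(n_samples, fold_size=50, seed=0):
    """Yield (train_idx, test_idx) pairs: one fold trains, K-1 folds test."""
    k = max(2, n_samples // fold_size)  # K = n / 50 (rounding is an assumption)
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # K nearly equal folds
    for i, train in enumerate(folds):
        test = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield sorted(train), sorted(test)


# Example: a drug with 230 cell lines gives K = 4 folds of 57 or 58 samples.
splits = list(reversed_kfold(230))
```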
Domain transfer approach
We have applied the Mapped Prediction (MP) approach (described in the “Methods” section) to predict GDSC sensitivities for 7 common drugs with sufficient cell lines (n>200) and different levels of database consistency. Figure 4 demonstrates the effect of first-order polynomial mapping for a representative gene expression set, while Fig. 5 illustrates the effect of second-order polynomial mapping for a representative drug sensitivity vector. Again, we used random subsets of 50 cell lines (G_{11},d_{11} & G_{21},d_{21} in Fig. 6) to retrieve the mapping functions, and the sensitivities for the rest (d_{12}) are predicted using the known CCLE data (G_{22},d_{22}). Table 2 compares the prediction performance of the MP approach for all 7 drugs with two other methods, Direct Prediction (DP) and CCLE model prediction (CP), under K-fold cross-validation as defined above (i.e., \(K = \frac{n}{50}\), with 1 fold used for training and (K−1) folds for testing). For the CP approach, the model is built using the available CCLE data directly and prediction is performed using the GDSC expression data. For prediction of AUC values from gene expression data, we have used a Bias-corrected Random Forest (BCRF) [17–19] model.
Discussion
From Table 1, it is evident that the CLP method yields the best performance. Additionally, even though the LLP method often yields better results than DP, it frequently underperforms LRP. Overall, 6 drugs out of 7 yield the best performance with the CLP method, while only Nilotinib performs best with LRP. The prediction performance is similar in the reverse direction (i.e., CCLE as the primary set and GDSC as the secondary set), where 5 out of 7 drugs show the best performance with CLP.
For the Domain Transfer approach, it is evident from Table 2 that the MP approach performs significantly better than both CP and DP. Furthermore, the performance of the CP approach is much worse than that of either MP or DP, which can be attributed to the existing distribution shift between CCLE and GDSC data in general. Note that among the 7 drugs, 17-AAG and PD0325901 have moderate concordance (0.5≤ρ_{s}<0.6), while AZD6244, Nutlin-3 and PD0332991 have poor concordance (ρ_{s}<0.4) between databases. For PLX4720 and Nilotinib, there is moderate to high consistency in terms of Pearson correlation (ρ=0.57 and ρ=0.88, respectively), although the rank correlation is low (ρ_{s}=0.29 and ρ_{s}≈0.1, respectively). We have also implemented a model that uses the ensemble of available CCLE and GDSC data directly for training and predicts for the unlabeled GDSC expression data, referred to as the Combined Model Prediction. An additional section provides a detailed description and comparative analysis of this model with the MP approach [see Additional file 1].
Comparison with inductive transfer learning
We have compared the results from the Mapped Prediction approach with an existing transfer learning approach, namely the Importance-weighted Direct Inductive Transfer Learning (DITL) proposed by Garcke et al. [8]. In DITL, the primary and secondary datasets are assumed to be related such that, in some parts of the domain, the two distributions can be similar, and therefore one can employ the secondary dataset together with the primary via importance sampling (i.e., reweighting the secondary distribution to the primary so that the secondary data points with a positive effect on the primary data receive greater weights). For prediction, DITL uses weighted Kernel Ridge Regression (KRR) with Gaussian kernels, and the whole approach is dubbed DITL-KRR [8]. Table 3 compares the prediction performance of the DITL-KRR approach with the MP and DP approaches for 4 representative drugs. Unlike the MP approach, DITL follows the n>p assumption of machine learning, and therefore we used the intersection of the top 50 genes from both datasets as the feature set while 50 cell lines were used for training. From Table 3, we can conclude that MP performs better than the other approaches even when the number of features (and therefore the information) is reduced to < 50.
Conclusions
In precision medicine, data from multiple large pharmacological studies can be used to design better predictive models. In this regard, transfer learning is employed to eliminate the distribution shift between the primary and secondary datasets. In this paper, we have proposed two different TL approaches to incorporate data from two large studies, i.e., CCLE and GDSC, for designing a better predictive model. In the first approach, we have used a latent variable model and optimized the appropriate cost functions to obtain a pertinent prediction model. The second method uses a nonlinear mapping for both genomic and sensitivity data to transfer the primary data to the secondary domain space and performs prediction using the secondary datasets. Both methods show marked improvement in drug sensitivity prediction compared to direct prediction and existing TL approaches, while the mapping approach shows the best overall performance.
We faced a couple of issues during implementation. The LRP approach uses the underlying latent variable between the sensitivity datasets and generates the latent variable corresponding to the unknown primary sensitivity data. However, to do so, it uses the available secondary data, implying that prediction can only be performed for matched pairs of datasets. Although the LLP approach overcomes this limitation, it often underperforms LRP. In Table 4, we present the applicability of the sensitivity prediction approaches discussed in this paper for matched vs. unmatched pairs of datasets.
Furthermore, in Mapped Prediction, the polynomial mapping of drug sensitivity between databases is drug-dependent and thus vulnerable to user error. One potential next step is to make the map robust against outliers. Another is to investigate the effect of model stacking using the proposed approaches.
Methods
Latent variable cost optimization approach
In this section, our goal is to analyze the transfer learning approach from the viewpoint of cost function optimization. Here, the assumption is that if there is a way to transfer data from both CCLE and GDSC to a common space, then the information available in both databases can be incorporated together to yield better overall performance [3]. It can therefore be inferred that, in a suitable common space, the individual concordance between the common set (i.e., the underlying latent variable) and each dataset will be maximized and the reconstruction errors from the common set will be minimized. This is the rationale behind the cost function optimization approach.
Drug sensitivity prediction via cost optimization of sensitivity data
In this section, we apply cost function optimization to the CCLE and GDSC sensitivity data and use the underlying latent vector to improve the prediction of sensitivity to an anticancer drug. The hypothesis is that if both the CCLE and GDSC sensitivity vectors can be represented as functions of a common latent variable, then this variable can be used along with a known set of CCLE sensitivity values to predict the unknown GDSC sensitivity, or vice versa. This approach is referred to as Latent Regression Prediction (LRP), as the final prediction is performed using a regression model on the latent vector. For this method, only the drug sensitivity values (namely AUC) from the two databases are employed, without any use of genomic characterization data. Figure 2 illustrates the use of the LRP method for drug sensitivity prediction. Assume that only a small portion, \((y_{11})_{n_{1} \times 1}\), of the GDSC AUC set, (y_{1})_{n×1}, is known, where n_{1}<n. Then, the corresponding AUC set, \((y_{21})_{n_{1} \times 1}\), in CCLE can be used with y_{11} to perform a cost optimization to retrieve the optimum weight vector c for the latent variable, \((w_{1})_{n_{1} \times 1}\), as follows (An additional section provides the detailed development of the cost function [see Additional file 1])
where \(W_{1} = \left[\begin{array}{ll} \vec{1} & w_{1} \end{array}\right]\), \(c = \left[\begin{array}{lll} c_{0} & c_{1} & c_{2} \end{array}\right]^{T}\) and \(\vec{1}\) denotes a vector of ones. Here, w_{1} is the latent vector corresponding to y_{11} & y_{21} and, assuming linear relationships, c_{1} & c_{2} are the weights of y_{11} & y_{21} in w_{1} (while c_{0} is the offset), defined as \(w_{1} = \left[\begin{array}{lll} \vec{1} & y_{11} & y_{21} \end{array}\right] c\).
Now, a_{1} & a_{2} are the regression coefficients for reconstruction of y_{11} & y_{21} from w_{1} and can be obtained from the Least Squares (LS) minimizations of the reconstruction errors (ε).
Solving (1), the weight vector, c, and, in turn, a_{1},a_{2} are found. From (3), it can be inferred that w_{1} is also expressed as a linear function of y_{11} or y_{21} alone, i.e.
We assume that both CCLE and GDSC sensitivity vectors maintain individual functional relationships with the latent variable, and therefore, the coefficients a_{1},a_{2},b_{1},b_{2} will remain the same for the whole response sets (y_{1} & y_{2} in Fig. 2). Using w_{1} and the known CCLE AUC set, y_{21}, the coefficient b_{2} in (4) can be retrieved using LS minimization.
where (·)^{+} denotes the Moore-Penrose pseudoinverse. Using the rest of the known CCLE AUC set, \((y_{22})_{n_{2} \times 1}\), the underlying latent vector, \((w_{2})_{n_{2} \times 1}\), can be retrieved following (4)
Finally, utilizing the coefficient a_{1} found initially from solving (1), the unknown GDSC AUC values can be predicted following (3), as
If only a part of CCLE drug sensitivity response is known along with a bigger portion of GDSC sensitivity set, then this whole process can be utilized for the prediction of CCLE responses by interchanging the GDSC and CCLE values.
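The prediction steps that follow the cost optimization can be sketched as below. This is only an illustration under stated assumptions: the weight vector c is taken as given (in the paper it is the solution of the cost optimization in (1), which is developed in Additional file 1 and is not reproduced here), and both regressions are assumed to be ordinary least-squares lines with an intercept. The function names and the toy numbers are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares for y ~ b0 + b1 * x; returns (b0, b1)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return my - b1 * mx, b1


def lrp_predict(y11, y21, y22, c):
    """Predict unknown GDSC AUCs (y12) from known CCLE AUCs (y22)."""
    c0, c1, c2 = c  # assumed to come from the cost optimization
    # Latent vector for the matched training cell lines.
    w1 = [c0 + c1 * a + c2 * b for a, b in zip(y11, y21)]
    a10, a11 = fit_line(w1, y11)  # a_1: reconstruct GDSC AUC from the latent
    b20, b21 = fit_line(y21, w1)  # b_2: latent as a function of CCLE AUC alone
    w2 = [b20 + b21 * v for v in y22]  # latent for the remaining cell lines
    return [a10 + a11 * w for w in w2]  # predicted y12


# Toy data in which GDSC AUC = 2 * CCLE AUC + 0.05 exactly.
y21 = [0.1, 0.2, 0.3, 0.4]
y11 = [0.25, 0.45, 0.65, 0.85]
y12_hat = lrp_predict(y11, y21, y22=[0.5, 0.6], c=(0.0, 0.5, 0.5))
```

On this noise-free toy data the predictions recover the underlying linear relationship; with real AUC values the quality of the result depends on how well the optimized c captures the shared latent variable.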
We have also implemented a kNN regression based transfer learning approach for sensitivity prediction [see Additional file 1], which is computationally inexpensive to implement but often underperforms the LRP approach. We then applied an iterative update scheme to improve the performance of kNN approach and combined the updated kNN model with the LRP model [see Additional file 1]. The combined model shows similar performance to LRP model.
Drug sensitivity prediction via cost optimization of genomic and sensitivity data
In this section, we use both gene expression and AUC data in the cost optimization to improve drug sensitivity prediction. Here, the goal is to establish a relationship between the two underlying latent variables corresponding to the gene expression and AUC datasets, respectively, and then to exploit this relationship for the prediction of unknown AUC values. This method is referred to as Latent-Latent Prediction (LLP), since it involves the prediction of one latent variable from another. Figure 3 illustrates the use of the LLP method for drug sensitivity prediction. Again, we assume that only a small portion, y_{11}, of the GDSC AUC set, y_{1}, is known. Then, the corresponding CCLE AUC set, y_{21}, is used with y_{11} to perform the cost optimization in (1) to generate the latent vector w_{1} and the regression coefficients a_{1},a_{2}.
Similar to the AUC optimization, the latent vector, (v_{k})_{n×1}, corresponding to the expression vectors, \((x_{1k})_{n_{1} \times 1}\) & \((x_{2k})_{n_{1} \times 1}\), of gene k in GDSC & CCLE (where k=1,2,⋯,p) can be found as follows (An additional section provides the detailed development of the cost function [see Additional file 1])
where \(V_{k} = \left [\begin {array}{ll} \vec {1} & v_{k} \end {array}\right ]\) and \(v_{k} = \left [\begin {array}{lll} \vec {1} & x_{1k} & x_{2k} \end {array}\right ] \lambda _{k}\).
Again, assuming linear relationships, \(\lambda_{k} = \left[\begin{array}{lll} \lambda_{k0} & \lambda_{k1} & \lambda_{k2} \end{array}\right]^{T}\) is the weight vector of the latent v_{k} corresponding to the expression vectors x_{1k} & x_{2k}, the kth columns of the matrices (X_{1})_{n×p} & (X_{2})_{n×p}, respectively, and the α’s are the corresponding regression coefficients. The complete latent matrix, V_{n×p}, is found by performing this optimization for all p genes and concatenating the individual latent vectors, i.e.
For training, the latent matrix \((V_{1})_{n_{1} \times p}\) corresponding to X_{11} and X_{21} is used as the model input and w_{1} as the corresponding output. The remaining latent matrix, \((V_{2})_{n_{2} \times p}\), is used for prediction of the latent vector, \((w_{2})_{n_{2} \times 1}\). The unknown AUC values \((y_{12})_{n_{2} \times 1}\) are predicted using (7) again.
We have used Random Forest (RF) [18, 20] as our prediction model here. If only a part of CCLE drug sensitivity response is known along with a bigger portion of GDSC sensitivity set, then this whole process can be utilized for the prediction of CCLE responses by interchanging the GDSC and CCLE values.
Combined latent drug sensitivity prediction
To improve the predictive performance of the LLP model and utilize the available CCLE data more effectively, we have incorporated the two latent variable based models together. Here, we combine the predicted latent variables from the two models i.e., \(w_{2}^{LRP}\) from (6) and \(w_{2}^{LLP}\) from (10) via simple averaging and generate the final prediction as before.
The whole process is referred to as Combined Latent Prediction (CLP). Comparisons among the three optimization-based approaches show that the combined method performs best, while the LLP approach often underperforms LRP.
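The combination step can be illustrated with a short sketch: the two predicted latent vectors are averaged element-wise and the GDSC AUC is then reconstructed from the averaged latent. The linear reconstruction with an offset and a slope, the function names and the numbers are assumptions made for illustration.

```python
def combine_latents(w_lrp, w_llp):
    """Element-wise simple average of the two predicted latent vectors."""
    return [(a + b) / 2.0 for a, b in zip(w_lrp, w_llp)]


def reconstruct_auc(w, a1):
    """Map a latent vector back to AUC with coefficients a1 = (offset, slope)."""
    a10, a11 = a1
    return [a10 + a11 * wi for wi in w]


# Hypothetical latent predictions from the two models for two cell lines.
w2_clp = combine_latents([0.2, 0.4], [0.4, 0.8])
y12_hat = reconstruct_auc(w2_clp, a1=(0.1, 2.0))
```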
Domain transfer approach
In this section, our goal is to analyze whether the dependency structure between CCLE and GDSC can be modeled using a common mapping across different cell lines. The hypothesis is that if there is a common mapping that shifts the data from one domain to the other, then the additional information available in the second database can be transferred to the first to produce better overall performance [3]. For the analysis, we have considered a polynomial regression mapping [21] and selected the polynomial order using the Spearman rank correlation (ρ_{s}) between each pair of datasets from the two databases. This correlation indicates a high concordance for gene expression data between databases but poor consistency for drug sensitivity measures such as AUC or IC_{50} [5].
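A minimal sketch of the rank-correlation computation is shown below, using average ranks for ties. The order-selection rule and its threshold of 0.5 are assumptions made for illustration only; the text states that high concordance leads to a first-order map and poor concordance to a second-order map, but it does not give an explicit cutoff.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for m in range(i, j + 1):
            ranks[order[m]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def polynomial_order(rho_s, threshold=0.5):
    """Hypothetical rule: first order if concordance is high, else second."""
    return 1 if rho_s >= threshold else 2


rho = spearman([1.0, 2.0, 3.0, 4.0], [1.0, 4.0, 9.0, 16.0])
```

A monotone but nonlinear relationship, as in the example, still gives a rank correlation of 1, which is why ρ_{s} measures concordance rather than linearity.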
Gene expression mapping
Between GDSC and CCLE, there exist 15,664 common genes in 623 cell lines. Since the rank correlation between CCLE and GDSC gene expression is high (median ρ_{s}=0.86), we have applied a gene-wise first-order polynomial regression mapping. Assume that (g_{1i})_{n×1} and (g_{2i})_{n×1} denote the expressions of the i-th gene in GDSC and CCLE, respectively (where i=1,2,⋯,p). Then, for each individual gene, the expression mapping from GDSC space to CCLE space
where \(\hat {g}_{2i}\) denotes the mapped gene expression for ith gene and α’s are the regression coefficients quantifying the strength of the association. For the total n×p gene expression matrices, the equation becomes
where (A_{0})_{p×p} and (A_{1})_{p×p} are two diagonal matrices containing the regression coefficients and \(\mathcal{E}_{n_{1} \times p}\) is the mapping error. Here, \(\stackrel{\leftrightarrow}{1}\) denotes a matrix of ones.
We have performed a drug-wise analysis so that only data corresponding to a single drug is available at a time. Therefore, only a subset of the common 623×15664 gene expression matrix is used for each drug, corresponding to the available cell line responses. We used ReliefF [16] to select the top 200 genes from both the CCLE and GDSC datasets for each drug and took the intersection as the final feature set. Figure 4 illustrates the effect of the mapping for a single gene, "DBNDD1". For the analysis, we randomly selected a small subset (i.e., 50 cell lines) of the available GDSC samples to obtain the mapping from the corresponding CCLE data and evaluated the performance on the remaining cell lines. Table 5 shows the correlation between the mapped GDSC expression set and the corresponding CCLE set, compared with the correlation between the actual GDSC and CCLE sets, for two common drugs, together with the mean square errors (MSE) for reconstruction. From the correlation and MSE values, it can be inferred that the mapping function successfully captures the interrelationship between the CCLE and GDSC gene expression sets.
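The gene-wise first-order mapping can be sketched as follows: one intercept and one slope are fitted per gene on the matched training cell lines and then applied to the remaining GDSC cell lines. This is a minimal illustration using ordinary least squares; the function names and the toy matrices are hypothetical, and each gene is assumed to have non-constant expression in the training subset.

```python
def fit_line(x, y):
    """Ordinary least squares for y ~ a0 + a1 * x; returns (a0, a1)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    a1 = sxy / sxx
    return my - a1 * mx, a1


def fit_gene_maps(G11, G21):
    """Fit one (alpha0, alpha1) pair per gene column, mapping GDSC to CCLE."""
    p = len(G11[0])
    return [
        fit_line([row[i] for row in G11], [row[i] for row in G21])
        for i in range(p)
    ]


def apply_gene_maps(G, maps):
    """Map a GDSC expression matrix (rows = cell lines) into CCLE space."""
    return [[a0 + a1 * v for v, (a0, a1) in zip(row, maps)] for row in G]


# Toy training data: 3 matched cell lines, 2 genes.
G11 = [[1.0, 2.0], [2.0, 4.0], [3.0, 5.0]]  # GDSC expression
G21 = [[3.0, 1.0], [5.0, 2.0], [7.0, 2.5]]  # CCLE expression (2x+1 and 0.5x)
maps = fit_gene_maps(G11, G21)
G22_hat = apply_gene_maps([[4.0, 6.0]], maps)  # one held-out GDSC cell line
```

Fitting each gene independently corresponds to the diagonal structure of the coefficient matrices A_{0} and A_{1} described above.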
Drug sensitivity mapping
For the drug sensitivity measure, we used the AUC values again. The overall concordance for AUC between databases is poor (median ρ_{s}=0.34), and therefore we have considered a drug-wise second-order polynomial regression mapping. Assume that (d_{1j})_{n×1} and (d_{2j})_{n×1} denote the AUC vectors for the j-th drug in GDSC and CCLE, respectively. Then, for each drug, the drug sensitivity mapping from CCLE space to GDSC space
where \(\hat {d}_{1j}\) denotes the mapped drug sensitivity dataset for jth drug, \(D_{2j} = \left [\begin {array}{lll} \vec {1} & d_{2j} & d_{2j}^{2} \end {array}\right ]\) is the design matrix, β contains the regression coefficients quantifying the strength of the association and ε_{n×1} is the mapping error.
Note that, out of the 15 common drugs, 3 of the drugs have moderate consistency (0.5≤ρ_{s}<0.6) between databases, 3 have fair consistency (0.4≤ρ_{s}<0.5) and the rest have poor consistency (ρ_{s}<0.4). Figure 5 illustrates the effect of the mapping of AUC values from CCLE to GDSC space for the drug AZD6244 with poor consistency between databases (ρ_{s}=0.26).
For analysis, again we have randomly selected 50 cell lines to get the mapping and evaluated the performance on the rest. Table 6 shows the correlation between the mapped GDSC AUC set with corresponding CCLE set compared to the correlation between the actual GDSC and CCLE sets for two common drugs and MSE for reconstruction. From the correlation and MSE values, it can be inferred that the mapping function captures the interrelationship between CCLE and GDSC drug sensitivity sets satisfactorily.
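A second-order polynomial map of this kind can be fitted by least squares on the design matrix with columns 1, d and d². The sketch below solves the 3×3 normal equations directly in plain Python. It is an illustration under the assumption of an ordinary least-squares fit; the function names and the toy AUC values are hypothetical.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def fit_quadratic(d2, d1):
    """Least-squares beta for d1 ~ beta0 + beta1 * d2 + beta2 * d2**2."""
    D = [[1.0, v, v * v] for v in d2]  # design matrix [1, d, d^2]
    DtD = [[sum(r[i] * r[j] for r in D) for j in range(3)] for i in range(3)]
    Dtd = [sum(r[i] * y for r, y in zip(D, d1)) for i in range(3)]
    return solve(DtD, Dtd)


def apply_quadratic(beta, d2):
    """Map CCLE AUC values into GDSC space with the fitted polynomial."""
    return [beta[0] + beta[1] * v + beta[2] * v * v for v in d2]


# Toy matched AUCs generated from d1 = 0.1 + 0.5 * d2 + 0.2 * d2**2.
d21 = [0.0, 0.5, 1.0, 1.5, 2.0]  # CCLE
d11 = [0.1, 0.4, 0.8, 1.3, 1.9]  # GDSC
beta = fit_quadratic(d21, d11)
d12_hat = apply_quadratic(beta, [3.0])
```

Because the fit is an unweighted least-squares problem, a few extreme AUC values can dominate the estimated coefficients.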
Drug sensitivity prediction using nonlinear mapping
In this section, we have exploited the interrelationships between CCLE and GDSC through the mapping functions discussed in the previous sections. By using the mapping, we have integrated data from both databases to improve drug sensitivity prediction. Figure 6 illustrates the drug sensitivity prediction procedure using nonlinear mapping. We have performed a drug-wise analysis so that data is available for a single drug at a time. Assume that the GDSC and CCLE gene expression data are expressed as two n×p matrices, G_{1} and G_{2}, respectively. Furthermore, only a small portion of G_{1}, i.e., \((G_{11})_{n_{1} \times p}\), is available with the corresponding AUC values, \((d_{11})_{n_{1} \times 1}\), where n_{1}<n, while the whole G_{2} matrix is available with the AUC response, (d_{2})_{n×1}. The goal is to predict the unknown AUC values, \((d_{12})_{n_{2} \times 1}\), for the larger GDSC subset, \((G_{12})_{n_{2} \times p}\). The CCLE datasets, G_{21} & d_{21}, corresponding to G_{11} & d_{11}, can be used in this regard to acquire the individual mapping functions h & f, generated from the polynomial mappings in (15) & (17), respectively.
where A_{0},A_{1} are defined from (16).
To predict the AUC for G_{12}, we map it to CCLE space using the mapping h as \((\hat {G}_{22})_{n_{2} \times p}\), as in Fig. 6. One can now utilize the additional information in the CCLE space by employing the complete CCLE data G_{2} & d_{2} for training the prediction model \(\mathcal {M}\) while the mapped GDSC set, \(\hat {G}_{22}\), is used to predict the sensitivity, \((\hat {d}_{22})_{n_{2} \times 1}\), in CCLE space. The desired prediction is then obtained by mapping it back to the GDSC space using f.
The whole process is referred to as the Mapped Prediction (MP) of GDSC data. Furthermore, if only a part of the CCLE gene expression data is available with corresponding drug sensitivity values along with a bigger portion of labeled GDSC data, then this whole process can be used for the prediction of CCLE sensitivity by interchanging the GDSC and CCLE values. For prediction using gene expression, we have used a Bias Corrected Random Forest (BCRF) [19, 22] model where the effect of bias correction is measured using the residual angle [23].
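The three stages of Mapped Prediction compose naturally as functions, which the following sketch makes explicit. The callables passed in are placeholders: here `h`, `model` and `f` are toy stand-ins chosen only so that the example runs, and in particular `model` is a simple row average rather than the Bias Corrected Random Forest trained on the full CCLE data that the paper uses.

```python
def mapped_prediction(G12, h, model, f):
    """Predict GDSC AUCs for unlabeled GDSC expression data G12.

    h     : maps GDSC expression into CCLE space (gene-wise polynomial map)
    model : predicts AUC from expression in CCLE space (trained on CCLE data)
    f     : maps predicted CCLE AUCs back into GDSC space
    """
    G22_hat = h(G12)          # step 1: transfer expression to the CCLE domain
    d22_hat = model(G22_hat)  # step 2: predict sensitivity in the CCLE domain
    return f(d22_hat)         # step 3: transfer predictions back to GDSC


# Toy stand-ins (hypothetical, for illustration only).
def h(G):
    return [[2.0 * v + 1.0 for v in row] for row in G]


def model(G):
    return [sum(row) / len(row) for row in G]


def f(d):
    return [0.1 + 0.5 * v + 0.2 * v * v for v in d]


d12_hat = mapped_prediction([[0.0, 1.0]], h, model, f)
```

Keeping the maps and the predictor as separate arguments makes it straightforward to swap in a different regressor or to reverse the direction of transfer by exchanging the roles of the two databases.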
Abbreviations
AUC: Area under the curve
CCLE: Cancer cell line encyclopedia
CLP: Combined latent prediction
GDSC: Genomics of drug sensitivity in cancer
LLP: Latent-latent prediction
LRP: Latent regression prediction
MP: Mapped prediction
NRMSE: Normalized root mean squared error
RF: Random forest
TL: Transfer learning
References
1. Yang W, Soares J, Greninger P, Edelman EJ, Lightfoot H, Forbes S, Bindal N, Beare D, Smith JA, Thompson IR, et al. Genomics of drug sensitivity in cancer (GDSC): a resource for therapeutic biomarker discovery in cancer cells. Nucleic Acids Res. 2013; 41(D1):955–61.
2. Barretina J, Caponigro G, Stransky N, Venkatesan K, Margolin AA, Kim S, Wilson CJ, Lehár J, Kryukov GV, Sonkin D, et al. The cancer cell line encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature. 2012; 483(7391):603–7.
3. Pan SJ, Yang Q. A survey on transfer learning. IEEE Trans Knowl Data Eng. 2010; 22(10):1345–59.
4. Weiss K, Khoshgoftaar TM, Wang D. A survey of transfer learning. J Big Data. 2016; 3(1):9.
5. Haibe-Kains B, El-Hachem N, Birkbak NJ, Jin AC, Beck AH, Aerts HJ, Quackenbush J. Inconsistency in large pharmacogenomic studies. Nature. 2013; 504(7480):389–93.
6. Cancer Cell Line Encyclopedia Consortium, Genomics of Drug Sensitivity in Cancer Consortium. Pharmacogenomic agreement between two cancer cell line data sets. Nature. 2015; 528(7580):84–87.
7. Celik S, Logsdon BA, Battle S, Drescher CW, Rendi M, Hawkins RD, Lee SI. Extracting a low-dimensional description of multiple gene expression datasets reveals a potential driver for tumor-associated stroma in ovarian cancer. Genome Med. 2016; 8(1):66.
8. Garcke J, Vanck T. Importance weighted inductive transfer learning for regression. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Berlin: Springer: 2014. p. 466–81.
9. Al-Stouhi S, Reddy C. Adaptive boosting for transfer learning using dynamic updates. In: Gunopulos D, Hofmann T, Malerba D, Vazirgiannis M, editors. Proceedings of the 2011 European Conference on Machine Learning and Knowledge Discovery in Databases, Part I (ECML PKDD'11). Berlin: Springer-Verlag: 2011. p. 60–75.
10. Rückert U, Kramer S. Kernel-based inductive transfer. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Berlin: Springer: 2008. p. 220–33.
11. Sugiyama M, Kawanabe M. Machine Learning in Non-stationary Environments: Introduction to Covariate Shift Adaptation. Cambridge: MIT Press; 2012, pp. 48–71.
12. Bonilla EV, Chai KM, Williams C. Multi-task Gaussian process prediction. In: Advances in Neural Information Processing Systems. USA: Curran Associates Inc.: 2008. p. 153–60.
13. Gao J, Fan W, Jiang J, Han J. Knowledge transfer via multiple model local structure mapping. In: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM: 2008. p. 283–91.
14. Jiang J, Zhai C. Instance weighting for domain adaptation in NLP. In: Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, ACL, vol. 7. Prague: Association for Computational Linguistics: 2007. p. 264–71.
15. Liao X, Xue Y, Carin L. Logistic regression with an auxiliary data source. In: Proceedings of the 22nd International Conference on Machine Learning. New York: ACM: 2005. p. 505–12.
16. Kira K, Rendell LA. The feature selection problem: Traditional methods and a new algorithm. In: Proceedings of the 10th National Conference on Artificial Intelligence, AAAI, vol. 2. San Jose: AAAI Press / The MIT Press: 1992. p. 129–34.
17. Breiman L. Random forests. Mach Learn. 2001; 45(1):5–32.
18. Rahman R, Otridge J, Pal R. IntegratedMRF: random forest-based framework for integrating prediction from different data types. Bioinformatics (Oxford, England). 2017; 33(9):1407–1410.
19. Song J. Bias corrections for random forest in regression using residual rotation. J Korean Stat Soc. 2015; 44(2):321–6.
20. Rahman R, Haider S, Ghosh S, Pal R. Design of probabilistic random forests with applications to anticancer drug sensitivity prediction. Cancer Informat. 2015; 14(Suppl 5):57.
21. Draper NR, Smith H. Applied regression analysis. 1966; 709(1):13.
22. Zhang G, Lu Y. Bias-corrected random forests in regression. J Appl Stat. 2012; 39(1):151–60.
23. Matlock K, De Niz C, Rahman R, Ghosh S, Pal R. Investigation of model stacking for drug sensitivity prediction. In: Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics. ACM: 2017. p. 772.
Acknowledgments
Not applicable.
Funding
This work was supported by NIH grant R01GM122084-01. The publication costs of this article were funded by NIH grant R01GM122084.
Availability of data and materials
The MATLAB code for the transfer learning analysis is available at https://github.com/dhruba018/Transfer_Learning_Precision_Medicine. The primary and secondary gene expression and area under the curve data are from the Genomics of Drug Sensitivity in Cancer repository, http://www.cancerrxgene.org/, and the Cancer Cell Line Encyclopedia, https://portals.broadinstitute.org/ccle, respectively.
About this supplement
This article has been published as part of BMC Bioinformatics Volume 19 Supplement 17, 2018: Selected articles from the International Conference on Intelligent Biology and Medicine (ICIBM) 2018: bioinformatics. The full contents of the supplement are available online at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume-19-supplement-17.
Author information
Contributions
SRD, RR, SG and RP conceived of and designed the experiments. SRD and RR performed the experiments. SRD and RP analyzed the data. SRD, RR, KM, SG and RP wrote the paper. All authors have read and approved the final manuscript.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Additional file
Additional file 1
Supplementary information to application of transfer learning for cancer drug sensitivity prediction. Figure S1. Illustration of kNN image regression prediction for unknown GDSC AUC dataset using the available CCLE data. Figure S2. Illustration of change in performance for a single validation set with change in the number of nearest neighbors. Figure S3. Illustration of prediction for a single iteration for the Updated kNN image regression prediction. Figure S4. Illustration of shift between GDSC and CCLE AUC distributions. Table S1. Comparison of MSE for reconstruction and corresponding cost function value for both optimized latent vector and mean latent vector. Table S2. Comparison of Pearson correlation and NRMSE among kNN Image Regression Prediction, Latent Regression Prediction and Direct Prediction of GDSC sensitivity using CCLE data. Table S3. Comparison of Pearson correlation and NRMSE among kNN Image Regression Prediction, Latent Regression Prediction and Direct Prediction of CCLE sensitivity using GDSC data. Table S4. Comparison of Pearson correlation and NRMSE among combined Latent Regression & updated kNN Image Regression Prediction, kNN Image Regression Prediction, Latent Regression Prediction and Direct Prediction of GDSC drug sensitivity using CCLE data. Table S5. Comparison of Pearson correlation and NRMSE among combined Latent Regression & updated kNN Image Regression Prediction, kNN Image Regression Prediction, Latent Regression Prediction and Direct Prediction of CCLE drug sensitivity using GDSC data. Table S6. Comparison of K-fold cross-validation results for 4 GDSC drug sensitivity prediction approaches using CCLE: Mapped Prediction, CCLE model Prediction, Combined Model Prediction and Direct Prediction. (PDF 648 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Dhruba, S., Rahman, R., Matlock, K. et al. Application of transfer learning for cancer drug sensitivity prediction. BMC Bioinformatics 19, 497 (2018). https://doi.org/10.1186/s12859-018-2465-y
Keywords
 Drug sensitivity prediction
 Pharmacogenomic studies
 CCLE
 GDSC
 Transfer learning
 Nonlinear mapping
 Latent variable
 Cost optimization