Application of transfer learning for cancer drug sensitivity prediction.

BACKGROUND
In precision medicine, scarcity of suitable biological data often hinders the design of an appropriate predictive model. In this regard, large scale pharmacogenomics studies, like CCLE and GDSC hold the promise to mitigate the issue. However, one cannot directly employ data from multiple sources together due to the existing distribution shift in data. One way to solve this problem is to utilize the transfer learning methodologies tailored to fit in this specific context.


RESULTS
In this paper, we present two novel approaches for incorporating information from a secondary database for improving the prediction in a target database. The first approach is based on latent variable cost optimization and the second approach considers polynomial mapping between the two databases. Utilizing CCLE and GDSC databases, we illustrate that the proposed approaches accomplish a better prediction of drug sensitivities for different scenarios as compared to the existing approaches.


CONCLUSION
We have compared the performance of the proposed predictive models with database-specific individual models as well as existing transfer learning approaches. We note that our proposed approaches exhibit superior performance compared to the abovementioned alternative techniques for predicting sensitivity for different anti-cancer compounds, particularly the nonlinear mapping model shows the best overall performance.


Background
A consistent challenge in precision medicine is to design appropriate models for predicting the sensitivity of a tumor to an anti-cancer compound with high accuracy. In this aspect, large-scale pharmacogenomic studies of cancer genomes have provided unprecedented insights for studying anti-cancer therapeutics to determine putative prediction of drug sensitivity. The Genomics of Drug Sensitivity in Cancer (GDSC) [1] of the Cancer Genome Project and the Cancer Cell Line Encyclopedia (CCLE) [2] from the Broad Institute are two such studies where drug sensitivity profiles and genomic information across hundreds of compounds and cancer cell lines have been systematically gathered. There exists significant overlaps between the two databases which can further be utilized in designing more accurate sensitivity predictive models. Biological data for designing suitable predictive models are frequently scarce and therefore the availability of a secondary dataset often holds the promise for a better model development. However, majority of the machine learning approaches used in drug sensitivity prediction follow the inherent assumption that both training data and test data are in the same feature space with the same distribution. But, when training and test data, despite being in the same feature space, exhibit different distributions, one need to take the distribution shift into account. This is where transfer learning (TL) methodologies come into play [3].
Often in TL environment, the source and target domains can be considered as linked subspaces as part of a high-level common domain space [4]. We, therefore need to assume that there exists some consistency between the different datasets to be utilized in TL. Haibe-Kains et al. [5] at first pointed out that, although the gene expression from CCLE and GDSC databases are well correlated between themselves, unexpectedly the measured pharmacological drug responses using common estimators such as IC 50 and the area under the curve (AUC) measures are highly discordant. In response, the CCLE and GDSC investigators performed their own analysis [6] and presented results opposing the conclusions in [5]. They pointed out that in majority of the drugs, the exhibited AUC and IC 50 distributions are dominated by drug insensitive lines with a much smaller number of outliers, and postulated that the differences in cell line biology between studies have resulted in the poor correlation. Considering these facts, they have demonstrated significant improvement in correlation between most of the drugs. In any event, the fact that both the databases are providing information about the same biological process, make them suitable candidates for applying transfer learning methodologies.
In case of inconsistent data with different distributions for training and test sets, various TL approaches [3] have been attempted for dataset shift. Unsupervised methods such as INSPIRE (INferring Shared modules from multi-Ple gene expREssion datasets) [7] is primarily focused on the expression datasets to extract a low-dimensional representation and predicts tumor phenotypes using regularized regression approaches. Inductive transfer learning (ITL) approaches, as in [8], tackle the issue of prediction for scarce primary data using a secondary dataset through importance sampling i.e., reweighting the secondary distribution to the primary. While the primary data size is assumed to be significantly smaller than secondary data, for large number of unlabeled data, one has to adapt to covariate shift along with ITL. Boosting based approaches such as Dynamic-TrAdaBoost [9] applies ensemble methods to both source and target instances and then employs an update mechanism incorporating only the source instances useful for target task, with an additional dynamic correction factor. Kernel based ITL methods [10,11] focus on finding an appropriate kernel for the newly available data, modeling the difference with existing data as a problem of finding the suitable bias.
The previous approaches for transfer learning work well under the assumption that the datasets are closely related (such as 9 ovarian cancer datasets in INSPIRE) and the number of samples are significantly larger than the number of features (n > p). However, the scenario is frequently reversed in the case of genomic (or proteomic) data i.e., we usually have tens of thousands of genes and a small number of cell lines. Additionally, the previous methods for TL often involve removing the distribution shift via weighting without any explicit domain transfer. In our work, we have proposed two different TL approaches that consider mapping the data from two different databases to either a common space or to each other's domain, inherently taking care of the n << p problem. The inherent assumptions here for each pair of similar datasets from CCLE and GDSC are -(i) The datasets are monotonically changing in the same direction, and (ii) There exists a functional relationship between them. To build an appropriate prediction model, we utilize the gene expression as the predictors and the drug sensitivity (specifically AUC) as the output. Considering the application of TL on these datasets, the proposed approaches in this paper can be classified into two categories, as illustrated in Fig. 1.
Cost optimization based approach where we employ latent variable models to extract the underlying variables between different datasets. In this case, TL can be applied to only the output ( Fig. 1(a)), as in parameter transfer approach [12,13] or to both model input and output ( Fig. 1(b)), as in [14,15]. Domain transfer approach where we design maps between databases to transfer data from primary domain to secondary and utilize the secondary data to improve the prediction model. Here, TL is applied to both input and output ( Fig. 1(c)), as in instance transfer approach [14,15].
To summarize, the key contributions of the paper iswe have implemented two TL based approach, where the target (primary) data is either transferred to a common latent variable space along with the source (secondary) data, or to the source domain through nonlinear mapping to improve the prediction of limited primary data employing the available secondary data.

Results
To evaluate the performance of our transfer learning algorithms, we have initially retrieved the data common to both CCLE and GDSC. From GDSC (v6.0) and CCLE, there are 15,664 common genes available in 623 common cell lines along with 15 common drugs. We have performed a drug-wise analysis and found that the number of cell lines decreases from 623 after incorporating the available drug sensitivity values, resulting in datasets with cell lines between 91 − 310 along with 15,664 genes and corresponding sensitivity measures. For analysis involving gene expression, we have used ReliefF [16] to select the top 200 genes from each dataset and taken the intersection as the final feature set. For drug sensitivity measure, we have used the AUC values as they have more concordance between databases (median ρ s = 0.34) than  IC 50 (median ρ s = 0.28) [5]. Note that in spite of our discussion on inconsistencies between databases, the main goal here is to consider the scenario where a small portion of database 1 (i.e., GDSC) is available while data for the entire database 2 (i.e., CCLE) is available and we would like to use database 2 to improve the prediction performance for the rest of database 1. Thus, for evaluation, we will use the GDSC experimental AUCs as the gold standard and compare with the predicted AUCs.

Latent variable cost optimization approach
We have performed drug sensitivity prediction using the three latent variable cost optimization based approaches -Latent Regression Prediction (LRP), Latent-Latent Prediction (LLP), Combined Latent Prediction (CLP) (described in the "Methods" section) for 7 common drugs with sufficient cell lines (n > 200). For each method, subsets of 50 randomly chosen GDSC cell lines (X 11 Table 1 illustrates the comparison of prediction performance for all three methods with Direct prediction (DP) for K-fold cross-validation, where DP is defined as training on the 50 available cell lines and predicting for the rest. Here, the number of folds is found as K = n 50 , where 1 fold (containing ∼50 samples) is used for training and the remaining (K − 1) folds are used for testing.

Domain transfer approach
We have performed the Mapped Prediction (MP) approach (described in the "Methods" section) for predicting GDSC sensitivities for 7 common drugs with sufficient cell lines (n > 200) and different levels of database consistency. Figure 4 demonstrates the effect of first-order polynomial mapping for a representative gene expression set, while Fig. 5 illustrates the effect of second-order polynomial mapping for a representative drug sensitivity vector. Again, we used random subsets of 50 cell lines (G 11 , d 11 & G 21 , d 21 in Fig. 6) to retrieve the mapping functions and sensitivities for the rest (d 12 ) are predicted using the known CCLE data (G 22 , d 22 ). Table 2 shows the comparison of prediction performance for MP approach for all 7 drugs with two other methods -Direct Prediction (DP) and CCLE model prediction (CP) for K-fold crossvalidation, as defined above (i.e., K = n 50 and 1 fold is used for training and (K −1) folds for testing). For CP approach, the model is built using the available CCLE data directly and prediction is performed using the GDSC expression data. For prediction of AUC values using gene expression data, we have used a Bias-corrected Random Forest (BC-RF) [17][18][19] model.

Discussion
From Table 1, it is evident that the CLP method yields the best performance. Additionally, even though the LLP method often yield better results than DP, it frequently underperforms than LRP. Overall, 6 drugs out of 7 yield the best performance for CLP method while only Nilotinib performs the best with LRP. The prediction performance is similar in the reverse direction (i.e., CCLE as the primary set and GDSC as secondary) where 5 out of 7 drugs show best performance for CLP. For the Domain Transfer approach, it is evident from Table 2 that the MP approach performs significantly better than the both CP and DP. Furthermore, the performance of the CP approach is much worse compared to either MP or DP, which can be attributed to the existing distribution shift between CCLE and GDSC data in general. Note that among the 7 drugs, 17-AAG and PD-0325901 has moderate concordance (0.5 ≤ ρ s < 0.6) while AZD6244, Nutlin-3 and PD-0332991 have poor concordance (ρ s < 0.4) between databases. For PLX4720 and Nilotinib, there exist moderate to high consistency in terms of Pearson correlation (ρ = 0.57 and ρ = 0.88 respectively), although the rank correlation is low (ρ s = 0.29 and ρ s ≈ 0.1 respectively). We have also implemented a model that uses the ensemble of available CCLE and GDSC data directly

Comparison with inductive transfer learning
We have compared the results from the Mapped Prediction approach with an existing transfer learning approach, namely the Importance-weighted Direct Inductive Transfer Learning (DITL) proposed by Garcke et al. [8]. In DITL, the primary and secondary datasets are assumed to be related in a way so that in some parts of the domain, the two distributions can be similar, and therefore, one can employ the secondary dataset with primary via importance sampling (i.e., reweighting the secondary distribution to the primary so that the secondary data points with positive effect on primary data will have greater weights). For prediction, DITL uses weighted Kernel Ridge regression (KRR) with Gaussian kernels, dubbing the whole approach as DITL-KRR [8]. Table 3 shows the comparison of prediction performance for DITL-KRR approach with MP and DP approaches for 4 representative drugs. Unlike the MP approach, DITL follows the n > p assumption of machine learning and therefore, we used the intersection of top 50 genes from both datasets as the feature set while 50 cell lines were used for training. From Table 3, we can conclude that MP has a superior performance compared  to the other approaches even when the number of features (therefore, information) is reduced to < 50.

Conclusions
In precision medicine, data from multiple large pharmacological studies can be utilized to design better predictive models. In this regard, transfer learning is employed to eliminate the distribution shift between the primary and secondary datasets. In this paper, we have proposed two different TL approaches to incorporate data from two large studies i.e., CCLE and GDSC for designing a better predictive model. In the first approach, we have used a latent variable approach and then optimized the appropriate cost functions to get a pertinent prediction model. The second method uses a nonlinear mapping between both genomic and sensitivity data to transfer the primary data to secondary domain space and perform prediction utilizing the secondary datasets. Both methods show marked  improvement in drug sensitivity prediction compared to direct prediction and existing TL approaches, while the mapping approach shows the best overall performance. We have faced a couple of issues during implementation. The LRP approach utilizes the underlying latent variable between the sensitivity datasets and generate the latent variable corresponding to unknown primary sensitivity data. However, to do so, it uses the available secondary data inferring that the prediction can be only performed for matched pair of datasets. Although the LLP approach overcomes this limitation, it often underperforms than LRP. In Table 4, we have presented the applicability of the sensitivity prediction approaches discussed in this paper for matched vs. unmatched pairs of datasets.
Furthermore, in Mapped Prediction, drug sensitivity mapping between databases using polynomials is drugdependent and thus vulnerable to a user-fault. One potential new step can be modeling the map to be robust against the outliers. Another development can be investigating the effect of model stacking using the proposed approaches.

Latent variable cost optimization approach
In this section, our goal is to analyze the transfer learning approach from the viewpoint of a cost function optimization. Here, the assumption is that-if there exists such a way to transfer data from both CCLE and GDSC to a common space, then the information available in both databases can be incorporated together to result in a better overall performance [3]. Therefore, it can be inferred that in a suitable common space, the individual concordance between the common set (i.e., underlying latent variable) and each dataset will be maximized and the reconstruction errors from the common set will be minimized. This is the rationale behind the cost function optimization approach.

Drug sensitivity prediction via cost optimization of sensitivity data
In this section, we have deployed cost function optimization of CCLE and GDSC sensitivity data to utilize the underlying latent vector for improving the sensitivity prediction to an anti-cancer drug. The hypothesis is that if both CCLE and GDSC sensitivity vectors can be represented as functions of a common latent variable, then this variable can be utilized along with a known set of CCLE sensitivity values to predict the unknown GDSC sensitivity or vice versa. This approach is regarded as the Latent Regression Prediction (LRP), as the final prediction is performed using a regression model on the latent vector. For this method, only the drug sensitivity values (namely AUC) from the two databases are employed without any use of genomic characteristics data. Figure 2 illustrates the use of LRP method for drug sensitivity prediction. Assume that only a small portion, (y 11 ) n 1 ×1 of GDSC AUC set, (y 1 ) n×1 , is known, where n 1 < n. Then, the corresponding AUC set, (y 21 ) n 1 ×1 , in CCLE can be used with y 11 to perform a cost optimization to retrieve the optimum weight vector c for the latent variable, .
where a 1 = a 10 a 11 , a 2 = a 20 a 21 Solving (1), the weight vector, c, and, in turn, a 1 , a 2 are found. From (3), it can be inferred that w 1 is also expressed as a linear function of y 11 or y 21 alone, i.e.
We assume that both CCLE and GDSC sensitivity vectors maintain individual functional relationships with the latent variable, and therefore, the coefficients a 1 , a 2 , b 1 , b 2 will remain the same for the whole response sets (y 1 & y 2 in Fig. 2). Using w 1 and the known CCLE AUC set, y 21 , the coefficient b 2 in (4) can be retrieved using LS minimization.
Using the rest of known CCLE AUC set, (y 22 ) n 2 ×1 , the underlying latent vector, (w 2 ) n 2 ×1 , can be retrieved following (4) Finally, utilizing the coefficient a 1 found initially from solving (1), the unknown GDSC AUC values can be predicted following (3), aŝ If only a part of CCLE drug sensitivity response is known along with a bigger portion of GDSC sensitivity set, then this whole process can be utilized for the prediction of CCLE responses by interchanging the GDSC and CCLE values.
We have also implemented a kNN regression based transfer learning approach for sensitivity prediction [see Additional file 1], which is computationally inexpensive to implement but often underperforms the LRP approach. We then applied an iterative update scheme to improve the performance of kNN approach and combined the updated kNN model with the LRP model [see Additional file 1]. The combined model shows similar performance to LRP model.

Drug sensitivity prediction via cost optimization of genomic and sensitivity data
In this section, we have utilized both gene expression and AUC data in cost optimization to improve the drug sensitivity prediction. Here, the goal is to establish a relationship between the two underlying latent variables corresponding to gene expression and AUC datasets respectively, and then exploiting this relationship for the prediction of unknown AUC values. This method is regarded as the Latent-Latent Prediction (LLP) since it involves the prediction of one latent variable from another. Figure 3 illustrates the use of LLP method for drug sensitivity prediction. Again, we assume that only a small portion, y 11 , of GDSC AUC set, y 1 , is known. Then, the corresponding CCLE AUC set, y 21 , in CCLE is used with y 11 to perform the cost optimization in (1) to generate the latent vector w 1 and the regression coefficients a 1 , a 2 .
Similar to the AUC optimization, the latent vector, (v k ) n×1 , corresponding to the expression vectors, (x 1k ) n 1 ×1 & (x 2k ) n 1 ×1 of gene k in GDSC & CCLE (where k = 1, 2, · · · , p) can be found as follows (An additional section provides the detailed development of the cost function [see Additional file 1]) subject to Again, assuming linear relationships, λ k = λ k0 λ k1 λ k2 T is the weight vector of latent v k corresponding to the expression vectors x 1k & x 2k , k-th columns of the matrices (X 1 ) n×p & (X 2 ) n×p , respectively and α's are the corresponding regression coefficients. The complete latent matrix, V n×p is found performing this optimization for all p genes and concatenating the individual latent vectors, i.e.
For training, the latent matrix (V 1 ) n 1 ×p corresponding to X 11 and X 21 is used as model input and w 1 as the corresponding output. The remaining latent, (V 2 ) n 2 ×p , is utilized for prediction of the latent vector, (w 2 ) n 2 ×1 . The unknown AUC values (y 12 ) n 2 ×1 are predicted using (7) again.
We have used Random Forest (RF) [18,20] as our prediction model here. If only a part of CCLE drug sensitivity response is known along with a bigger portion of GDSC sensitivity set, then this whole process can be utilized for the prediction of CCLE responses by interchanging the GDSC and CCLE values.

Combined latent drug sensitivity prediction
To improve the predictive performance of the LLP model and utilize the available CCLE data more effectively, we have incorporated the two latent variable based models together. Here, we combine the predicted latent variables from the two models i.e., w LRP 2 from (6) and w LLP 2 from (10) via simple averaging and generate the final prediction as before. (12) The whole process is depicted as the Combined Latent Prediction (CLP). Comparisons among the three optimization based approaches yield that the combined method performs the best while the LLP approach often underperforms than LRP.

Domain transfer approach
In this section, our goal is to analyze whether the dependency structure between CCLE and GDSC can be modeled using a common mapping across different cell lines.
The hypothesis is that-if there exists such a common mapping so that the data from one domain can be shifted to the other, then the additional information available in the second database can easily be transferred to the first to produce an overall better performance [3]. For analysis, we have considered a polynomial regression mapping [21] and selected the polynomial order by utilizing the Spearman rank correlation (ρ s ) between each pair of datasets from the two databases. This infers a high concordance for gene expression data between databases but poor consistency for drug sensitivity measures such as AUC or IC 50 [5].

Gene expression mapping
Between GDSC and CCLE, there exist 15,664 common genes in 623 cell lines. Since the rank correlation between CCLE and GDSC gene expression is high (median ρ s = 0.86), we have applied a gene-wise first-order polynomial regression mapping. Assume that (g 1i ) n×1 and (g 2i ) n×1 denote the expressions of the i-th gene in GDSC and CCLE, respectively (where i = 1, 2, · · · , p). Then, for each individual gene, the expression mapping from GDSC space to CCLE spacê whereĝ 2i denotes the mapped gene expression for i-th gene and α's are the regression coefficients quantifying the strength of the association. For the total n × p gene expression matrices, the equation becomes where (A 0 ) p×p and (A 1 ) p×p are two diagonal matrices containing the regression coefficients and E n 1 ×p is the mapping error. Here, ↔ 1 denotes a matrix-of-one.
1 , · · · , α (p) 1 We have performed a drug-wise analysis so that only data corresponding to a single drug is available at a time. Therefore, only a subset of the common 623 × 15664 gene expression matrix is used for each drug, corresponding to the available cell line responses. We used ReliefF [16] to select top 200 genes from both CCLE and GDSC datasets for each drug and took the intersection as the final feature set. Figure 4 illustrates the effect of the mapping for a single gene "DBNDD1". For analysis, we have randomly selected a small subset (i.e., 50 cell lines) of available GDSC samples to get the mapping from the corresponding CCLE data and evaluated the performance on the remaining cell lines. Table 5 shows the correlation between the mapped GDSC expression set with corresponding CCLE set compared to the correlation

Drug sensitivity mapping
For drug sensitivity measure, we used the AUC values again. The overall concordance for AUC between databases is poor (median ρ s = 0.34), and therefore, we have considered a drug-wise second-order polynomial regression mapping. Assume that (d 1j ) n×1 and (d 2j ) n×1 denote the AUC vectors for the j-th drug in GDSC and CCLE, respectively. Then, for each drug, the drug sensitivity mapping from CCLE space to GDSC spacê whered 1j denotes the mapped drug sensitivity dataset for is the design matrix, β contains the regression coefficients quantifying the strength of the association and ε n×1 is the mapping error.
Note that, out of the 15 common drugs, 3 of the drugs have moderate consistency (0.5 ≤ ρ s < 0.6) between databases, 3 have fair consistency (0.4 ≤ ρ s < 0.5) and the rest have poor consistency (ρ s < 0.4). Figure 5 illustrates the effect of the mapping of AUC values from CCLE to GDSC space for the drug AZD6244 with poor consistency between databases (ρ s = 0.26).
For analysis, again we have randomly selected 50 cell lines to get the mapping and evaluated the performance on the rest. Table 6 shows the correlation between the mapped GDSC AUC set with corresponding CCLE set compared to the correlation between the actual GDSC and CCLE sets for two common drugs and MSE for reconstruction. From the correlation and MSE values, it can be inferred that the mapping function captures the interrelationship between CCLE and GDSC drug sensitivity sets satisfactorily.

Drug sensitivity prediction using nonlinear mapping
In this section, we have exploited the interrelationships between CCLE and GDSC through the mapping functions discussed in the previous sections. By using the mapping, we have integrated data from both databases to improve drug sensitivity prediction. Figure 6 illustrates the drug sensitivity prediction procedure using nonlinear mapping. We have performed a drug-wise analysis so that data is available for a single drug at a time. Assume that the GDSC and CCLE gene expression data are expressed as two n × p matrices, G 1 and G 2 , respectively. Furthermore, only a small portion of G 1 i.e., (G 11 ) n 1 ×p , is available with the corresponding AUC values, (d 11 ) n 1 ×1 where n 1 < n, while the whole G 2 matrix is available with the AUC response, (d 2 ) n×1 . The goal is to predict the unknown AUC values, (d 12 ) n 2 ×1 , for the larger GDSC subset, (G 12 ) n 2 ×p . The CCLE datasets, G 21 where A 0 , A 1 are defined from (16).
To predict the AUC for G 12 , we map it to CCLE space using the mapping h as (Ĝ 22 ) n 2 ×p , as in Fig. 6. One can now utilize the additional information in the CCLE space by employing the complete CCLE data G 2 & d 2 for training the prediction model M while the mapped GDSC set, G 22 , is used to predict the sensitivity, (d 22 ) n 2 ×1 , in CCLE space. The desired prediction is then obtained by mapping it back to the GDSC space using f.
The whole process is referred as the Mapped Prediction (MP) of GDSC data. Furthermore, if only a part of CCLE gene expression data is available with corresponding drug sensitivity values along with a bigger portion of labeled GDSC data, then this whole process can be utilized for the prediction of CCLE sensitivity by interchanging the GDSC and CCLE values. For prediction using gene expression, we have used a Bias Corrected Random Forest (BC-RF) [19,22] model where the effect of bias correction is measured using the residual angle [23].

Additional file
Additional file 1: Supplementary information to application of transfer learning for cancer drug sensitivity prediction. Figure S1. Illustration of kNN image regression prediction for unknown GDSC AUC dataset using the available CCLE data. Figure S2. Illustration of change in performance for a single validation set with change in the number of nearest neighbors. Figure S3. Illustration of prediction for a single iteration for the Updated kNN image regression prediction. Figure S4. Illustration of shift between GDSC and CCLE AUC distributions. Table S1. Comparison of MSE for reconstruction and corresponding cost function value for both optimized latent vector and mean latent vector.