 Methodology article
 Open Access
 Published:
AutoHMMLMF: feature selection based method for prediction of drug response via autoencoder and hidden Markov model
BMC Bioinformatics volume 22, Article number: 33 (2021)
Abstract
Background
Predicting the response of cancer cell lines to specific drugs is an essential problem in personalized medicine. Since drug response is closely associated with genomic information in cancer cells, some large panels of several hundred human cancer cell lines are organized with genomic and pharmacogenomic data. Although several methods have been developed to predict the drug response, there are many challenges in achieving accurate predictions. This study proposes a novel feature selectionbased method, named AutoHMMLMF, to predict cell linedrug associations accurately. Because of the vast dimensions of the feature space for predicting the drug response, AutoHMMLMF focuses on the feature selection issue for exploiting a subset of inputs with a significant contribution.
Results
This research introduces a novel method for feature selection of mutation data based on signature assignments and hidden Markov models. Also, we use the autoencoder models for feature selection of gene expression and copy number variation data. After selecting features, the logistic matrix factorization model is applied to predict drug response values. Besides, by comparing to one of the most powerful feature selection methods, the ensemble feature selection method (EFS), we showed that the performance of the predictive model based on selected features introduced in this paper is much better for drug response prediction. Two datasets, the Genomics of Drug Sensitivity in Cancer (GDSC) and Cancer Cell Line Encyclopedia (CCLE) are used to indicate the efficiency of the proposed method across unseen patient cellline. Evaluation of the proposed model showed that AutoHMMLMF could improve the accuracy of the results of the stateoftheart algorithms, and it can find useful features for the logistic matrix factorization method.
Conclusions
We depicted an application of AutoHMMLMF in exploring the new candidate drugs for head and neck cancer that showed the proposed method is useful in drug repositioning and personalized medicine. The source code of AutoHMMLMF method is available in https://github.com/emdadi/AutoHMMLMF.
Background
Computational models for personalized medicine make it possible to understand cancer cell lines on the basis of genomic information. This knowledge makes it possible to recommend individualized therapies to patients with different types of cancer by measuring drug responses. Many effective anticancer drugs have already been developed for each cancer type, such as breast cancer, lung cancer, ovary cancer and Brain cancer. For example, docetaxel, paclitaxel, carboplatin, cisplatin, vinorelbine and eribulin are just a few examples of drugs used to treat breast cancer. Since drug response to cancer treatment depends on multiple factors such as the patient's genomic profile, this process is a complicated problem in cancer treatment. These challenges have generated largescale experiments on human cancer cell lines and various anticancer drugs. For instance, two datasets Genomics of Drug Sensitivity in Cancer [1] (GDSC) and Cancer Cell Line Encyclopedia [2] (CCLE) are created based on drug sensitivity data of established anticancer drugs against diverse cancer cell lines. The various genetic features for the panels of cancer cell lines, such as gene expression profile, copy number alteration, single nucleotide mutation and methylation data, have been provided. By using these databases, machine learning algorithms are increasingly being applied to the predictions of drug responses by integrating data from different sources in a statistically meaningful way.
Several recommender systembased models were proposed for predicting drug response. Wang et al. adopted a similarityregularized matrix factorization (SRMF) method to predict anticancer drug responses of cell lines using the gene expression profile in cell lines and drugs’ chemical structures. They indicated that rapamycin (an mTOR inhibitor) could be a new therapeutic agent for nonsmall cell lung cancer [3]. Suphavilai et al. developed a model, termed Cancer Drug Response prediction using a Recommender System (CaDRReS), to learn projections for drugs and cell lines into a latent space. Also, they demonstrated how to explore drug mechanisms and drugpathway associations using the achieved features [4]. Emdadi et al. proposed DSPLMF method based on a logistic matrix factorization approach for predicting anticancer drug response. DSPLMF focuses on discovering significant features and latent vectors of cell lines and drugs for computing the probability of the cell lines are sensitive to drugs. They used the obtained latent vectors to identify subtypes of the cancer cell line and drugpathway associations [5].
Identifying the optimal subset of features from many genetic candidate features is a crucial issue for classification models for predicting drug response. Thus, a large number of algorithms have proposed using different approaches for feature selection. Xu et al. proposed AutoBorutaRF method based on feature selection for predicting drug response. This method first built an autoencoder network, and it used Boruta algorithm [6] to select important features for applying the RandomForest classifier to predict drug response [7]. Dong et al. proposed a model termed Support Vector Machine Recursive Feature Elimination (SVMRFE), which used a wrapper method using a recursive feature selection and SVM classifier to predict drug response [8].
This study presents a feature selectionbased method for drug response prediction, named AutoHMMLMF, to efficiently predict cell linedrug associations. Gene expression profile, copy number alteration, singlenucleotide mutation, tissue type information of the cell line, and drugs’ chemical structure information were incorporated. Two strategies based on autoencoder and hidden Markov modelmultinomial mixture model are used for selecting the essential features of input information. The autoencoder networks are applied on gene expression profile, copy number alteration data. Also, hidden Markov model and multinomial mixture model are applied on mutation data. A proper evaluation of the AutoHMMLMF method using tenfold crossvalidation was carried out to compare it with the stateoftheart methods. Results show its performance is superior for the tested data sets. Also, by comparing to the ensemble feature selection method (EFS), we showed that two considered strategies for feature selection in the AutoHMMLMF method could select proper features that significantly improved the prediction result.
Methods
This paper proposes a novel method (AutoHMMLMF) to efficiently predict cell linedrug associations by combining and effectively using feature selection approaches. The main scheme of the AutoHMMLMF algorithm is represented in Fig. 1. In the first step, two strategies for selecting the important features of input data are used. A feature selection approach based on autoencoder networks is applied to the gene expression profile of cell lines, and the similarity matrix (Sim_{EXP}) is constructed using selected features. Similarly, the similarity matrix (Sim_{CNV}) is created using the selected feature by applying the autoencoder feature selection method on copy number alteration information. In the next step, the similarity matrix (Sim_{MUT}) is generated using a novel feature selection approach based on the hidden Markov model and multinomial mixture model on singlenucleotide mutation data. Two similarity matrices (Sim_{IC50}) and (Sim_{TISSUE}) are achieved using IC50 values of cell lines across the drugs and tissue type information of each cell line, respectively. Finally, for constructing the latent vectors for each cell line and the drug and predicting whether the cell line is sensitive to the drug or not, the logistic matrix factorization method is applied. For assigning the IC50 values to two labels, sensitivity and resistance, we used the strategy introduced in previous researches [5, 7, 9], which used the median of the IC50 values for individual drugs as a threshold for the classification model. A cell line assigned to the sensitivity or class with label 1, if its IC50 value is smaller than the median of cell lines for an individual drug, and a cell line assigned to the resistance or class with label 0, otherwise. In the next section, we first describe the datasets used in the study and data preprocessing, and then the details of each above step are explained.
Datasets and preprocessing
In this work, we use the GDSC dataset, consisting of 1001 cancer cell lines and 265 tested drugs and the CCLE dataset that has analyzed 1457 cancer cell lines and their genomic profiles against 24 drugs. In these datasets, cell lines were characterized by genomic features such as gene expression profile, copy number alteration, and single nucleotide mutation. The halfmaximal inhibitory concentration (IC50) values are used for the sensitivity measure of cell lines across drugs. We focused on the 98 and 24 drugs for which SDF format (encoding the chemical structure of the drugs) were available from the NCBI PubChem Repository in GDSC and CCLE, respectively. There was no missing value in the gene expression features in these datasets. However, some cell lines have missing values for the response value, the single nucleotide mutation features, and the copy number alteration features. In the first step, the cell lines that contain missing values for more than half of the features were removed.
The known values of knearest neighbors imputed the remaining missing values. The Euclidean distance for each pair of cell lines c_{i} and c_{j} based on their gene expression profiles x_{i} and x_{j} are defined as follows:
Then the mean feature value among knearest cell lines for cell line c was used to impute the missing drug response value (IC50) of drug d as follows:
Similarly, the mean feature value among knearest cell lines for cell line c was used to impute the missing copy number alteration value (CNV) of gene g as follows:
The values of singlenucleotide mutation features are binaryvalued, i.e., 1 for mutation and 0 for wild type. The mean feature value among knearest cell lines for cell line c was considered to impute the missing MUT (singlenucleotide mutation) value of gene g as follows:
Finally, 555 cell lines and 98 drugs are considered for GDSC dataset and 363 cell lines and 24 drugs for CCLE dataset.
Feature selection using autoencoder
Feature selection methods aim to reduce data dimensionality by identifying the subset of informative and nonredundant features in a dataset. Autoencoder is a nonrecurrent neural network for unsupervised learning that reduces the datasets from initial feature space to a more significant feature space. It has an input layer, an output layer, and one or more hidden layers. The number of nodes (neurons) in the output layer is the same as in the input layer. Autoencoder learns the weight vector by assuming the output layer vector as the input layer vector. For constructing the autoencoder network for feature selection of the gene expression profile and copy number alteration information, the strategy introduced by Xu et al. [7] is used. Two autoencoder networks with a single hidden layer (with 100 neurons) and hyperbolic tangent as the activation functions are considered for screening out the gene expression features and copy number alteration data. After selecting the subset of features, a further small set of significant features was identified as two categories of inputs for the logistic matrix factorization model using the Boruta algorithm [6]. For determining the essential genes, the set of selected features by autoencoder networks along with the label of sensitivity and resistance corresponding to cell lines and drugs imported to the Boruta algorithm. Boruta algorithm is a wrapper built based on random forest classification that iteratively removes the less significant features by a statistical test. This algorithm added copies of all the features obtained using the autoencoder and it shuffled the values of the copied features for constructing shadow features, and it tried to find essential features. A random forest classifier is run on the extended information system, and Zscore values compute the importance of all attributes. Boruta algorithm repeats the finding procedure (finding the maximum Zscore among attributes) until the importance is assigned for all the attributes [6].
The first autoencoder with singlehiddenlayer and Boruta algorithm are applied to the gene expression profile of 11, 712 and 19, 389 genes for two datasets GDSC and CCLE, respectively. The numbers of selected essential genes are 798 and 1189 for GDSC and CCLE, respectively. Also, the similar autoencoder and Boruta algorithm are applied to copy number alteration of 24, 959 and 24, 960 genes for two GDSC and CCLE datasets. 67 and 127 features selected for GDSC and CCLE datasets, respectively.
Feature selection using hidden Markov model and multinomial mixture model
Understanding the activity of the mutational processes is critical for cancer treatment and personalized therapy. Since the mutational processes leave signatures of their activity in cancer genomes, characterizing the signatures of active mutational processes in patients from their patterns of single base substitutions is very important. In this study, we used the strategy proposed by Wojtowicz et al. for assigning the known signatures to the corresponding individual mutations for selecting essential mutated genes in cancer types [10]. In this work, we consider only the validated mutation signatures of the Catalogue of Somatic Mutations in Cancer (COSMIC) [11], and we focused on the signatures previously identified as active in cancer types [12]. Table 1 shows the active signatures of 14 cancer types corresponding to cancer cell lines in GDSC and CCLE datasets (only the cancer types with at least 15 cell lines in GDSC and CCLE datasets are considered).
Because there are six classes of base substitution (C:G > A:T, C:G > T:A, C:G > G:C, A:T > C:G, A:T > T:A, A:T > G:C) and four possible 5′, we categorized mutations in a cancer genome into 96 categories that include its base substitution and four possible 3′ bases [13, 14]. We downloaded single base substitutions of cancer types from the International Cancer Genome Consortium Data Portal [15]. We analyzed single base substitutions of several patients from considered cancer types, and the number of these patients (patient group1) corresponding to each cancer type is shown in Table 2. For each cancer type, the following hidden Markov model and multinomial mixture model are applied, and the important genes for the considered cancer type will be determined. In this model, the number of states for each cancer type is determined based on the number of the corresponding signature that is shown in Table 1. For example, the number of states (t) in BRCA cancer is 12.
The detailed stepwise feature selection procedure is described as follows:
Identifying close and isolated mutations
We classified the mutations into two classes, close and isolated mutations, using a distance threshold of 2000 bp (isolated mutations are distant from any other mutation). We set the first mutation of each mutation sequence to close. For other mutations, if the corresponding distance to the previous mutation is greater than 2000 bp, the mutation is labeled as isolated, and close otherwise. Therefore from a sequence of mutation of the patient, we can obtain several subsequences, some corresponding to close and some corresponding to isolated mutations. For example, two subsequences corresponding to the isolated, and three subsequences corresponding to the close mutations of a patient with BRCA cancer is as follows:
Modeling close mutations
Since the subsequences corresponding to close mutations are close to each other, it can be assumed that there is a dependency between them. So these subsequences were modeled using a hidden Markov model (HMM).
An HMM M with t (the number of mutation signatures) hidden states is represented by

Σ = {c_{1}, …, c_{s}} is the set of alphabets of all sequences.

Q = {q_{1}, …, q_{t}} is the set of states, each of which is able to emit symbols of alphabet Σ.

π_{i}, ∀i = 1, …, t is the probability to start with ith state.

A = [a_{i,j}]_{i,j=1,….,t} which a_{i,j} is the transition probability from q_{i} to q_{j}.

E = [e_{i,j}]_{i=1,…,t, j=1,…,s} where e_{i,j} is the probability that state q_{i} emits c_{j}.
The model assumes that each observation, representing a mutation category, is emitted by one of the t states. The sequence of states that generated the observed sequence is unknown, and each state depends on the previous state. For learning the parameters of the model, π, A, E, all obtained close subsequences in the first step considered as the training set for estimation of the parameters of HMM. In this study, the AntMarkov algorithm (the algorithm for parameter estimation of Hidden Markov Model inspired by Ant Colony Optimization) [16] was applied to estimate HMM parameters.
Modeling isolated mutations
Since the isolated subsequences are distant from any other mutation, the assumption of dependency between them is less motivated. So, isolated mutations were modeled using a multinomial mixture model (MMM). An MMM is parameterized by a vector g of t mutation signature marginal probabilities and a t × s emission matrix E, (s = 96). All obtained isolated subsequences in the first step considered as the training set for estimation of the parameters of MMM. The vector g and emission matrix E were estimated based on counting the number of times the isolated mutations were observed in samples (experimental distribution). We consider a vector T with size 96, which T[i] is the total number of times the ith mutation category were observed as isolated mutations in samples. By applying vector T to the initialized vector g and emission matrix E, their estimated values were achieved.
Computing mutation sequence occurrence
After training two above models, the probability of sequence occurrence O_{1}, …, O_{T}, which is decomposed into close and isolated subsequences,{C_{1}, I_{1}, C_{2}, I_{2}, …, C_{k1}, I_{k2}}, is formulated as follows:
The Viterbi algorithm [17] was applied to find the paths of the most likely sequence of states that generated the close subsequences. To determine the most probable paths corresponding to isolated mutations, the estimated values of g vector and emission matrix, E are used. As for each isolated mutation category (O_{t}), the state with maximum probability value (Q_{t}) is obtained from the following formula:
Finally, we append these two most probable paths of states to construct the final path corresponding to the patient. Then, the numbers of observed states (signature) are calculated as signature frequency per sample or signature activities for each path. For example, a patient with BRCA cancer has a vector with size 12 corresponding to the number of signatures, and the elements of this vector are calculated based on the number of times each state is observed in the final path.
Identifying the important genes
For considering the relationship between signature activities and gene expression profiles for each patient, we downloaded the gene expression files for patients from the International Cancer Genome Consortium Data Portal [15]. The number of patients whose gene expression information is available (patient group2) is shown in Table 2. Also, since the information of single nucleotide mutation of 54 and 1667 genes for two GDSC and CCLE datasets were accessible, we analyzed these genes’ expression for computing the Spearman correlation coefficients. Therefore, the Spearman correlation coefficients between the expression of 1721 genes and signature activities across samples are calculated. In this way, we identified essential genes with a high Spearman correlation coefficient (greater than 0.2) in close and isolated regions in each cancer type. The results of the correlation between some genes and signature activity in 14 cancer types that have high Spearman correlation coefficients are illustrated in the Additional file 1: TableS1. We considered these genes as essential features for single nucleotide mutation data in GDSC and CCLE datasets. Finally, 22 and 72 genes are selected by the above strategy based on a hidden Markov model and multinomial mixture model in GDSC and CCLE, respectively. The list of these genes is illustrated in the Additional file 1: Table S1.
Similarity definition
Since similar cell lines and similar drugs may have similar drug responses, the similarities between cell lines and drugs can improve drug response prediction [5, 18].
The similarity matrices are required for the identification of the nearest neighbors in the logistic matrix factorization model. Gene expression profile, copy number alteration, singlenucleotide mutation, and tissue type information are used for cell line similarity, and chemical structures of drugs are used for drug similarity. So, five similarities between each pair of cell lines and the similarity between each pair of drugs are defined as follows:
Cell line similarity

(Sim_{EXP}) is the similarity based on the selected features of the gene expression profile, in which the numbers of identified essential genes for gene expression profile by autoencoder are 798 and 1189 for two datasets GDSC and CCLE, respectively. Sim_{EXP} is defined as the Pearson correlation between the gene expression vector of each pair of n cell lines, arranged in an n × n matrix.

(Sim_{CNV}) is the similarity based on the selected features of copy number alteration data, which 67 and 127 useful features selected by autoencoder in GDSC and CCLE, respectively. Sim_{CNV} matrix is defined as an n × n matrix by Pearson correlation between the copy number alteration vector of each pair of cell lines.

(Sim_{MUT}) is the similarity based on the selected features of single nucleotide mutation information by the hidden Markov model and multinomial mixture model. 22 and 67 essential genes identified by this strategy from GDSC and CCLE datasets, respectively. Then, the Jaccard similarity is applied on each pair of single nucleotide mutation vectors corresponding to n cell lines, and Sim_{MUT} is constructed as an n × n matrix.

(Sim_{IC50}) is the similarity between cell lines based on their IC50 values. This definition of similarity between cell lines proposed by Liu is based on the correlation between response IC50 values of the cell lines [19]. Sim_{IC50} is defined as the Pearson correlation between each of the n cell lines considered an n × n matrix.

(Sim_{TISSUE}) is the similarity between cell lines based on tissue type. The complete set of samples consisted of GDSC and CCLE datasets cancer cell lines originated from around 14 tissue sites. Sim_{TISSUE} is an n × n binaryvalued matrix, which for entry corresponding to row i and column j is 1, if two cell lines c_{i} and c_{j} have the same tissue type and zero otherwise. The Sim_{TISSUE} matrices corresponding to GDSC and CCLE cell lines are represented in the Additional file 2: TableS2 and Additional file 3: TableS3.
Since the correlation coefficient between each pair of the above similarity matrices is very low, there is no collinearity between matrices, and they can be linearly combined. We constructed an integrated matrix similarity, Sim_{CL} = [SC_{ij}]_{n×n}, using the combination of Sim_{EXP}, Sim_{CNV}, Sim_{MUT}, Sim_{IC50} and Sim_{TISSUE} by the following formula:
where γ, λ, ϕ, ψ, and ρ are parameters that control the importance of each of the matrix and tuned in the model. We defined the set N_{k}(c_{i}) that denotes the kmost similar cell lines to c_{i} (except c_{i}) using (Sim_{CL}) matrix. We constructed adjacency matrix A = [a_{ij}]_{n×n} that represents cell line neighborhood information as follow:
Drug similarity
The similarity between drugs is constructed based on chemical substructures (Sim_{DRUG}). For each drug, a zero–one vector of size 881 is considered where 881 is the number of known chemical substructures of a drug. In this vector, 1 indicates the presence of a substructure of the drug and 0 otherwise. Sim_{DRUG} = [SD_{ij}]_{m×m} is constructed as an m × m matrix by Jaccard similarity between each of the chemical substructures vector corresponding to the m drugs. For a drug d_{i}, the set N_{k}(d_{i}) denotes the kmost similar drugs to d_{i} (except d_{i}) using Sim_{DRUG} matrix. The adjacency matrix, B = [b_{ij}]_{m×m}, describes the drug neighborhood information as follows:
Logistic matrix factorization
For drug response prediction of cancer cell lines from GDSC and CCLE datasets using selected features, the DSPLMF method introduced based on the logistic matrix factorization method [5] is applied based on the following objective function:
where u_{i} and v_{j} are the latent vectors of size L corresponding to the cell line c_{i} and drug d_{j}, respectively and the latent vectors of all cell lines and all drugs are denoted by U and V. The positive values \({\upbeta }_{{\text{i }}}^{c}\) and \({\upbeta }_{{\text{j }}}^{d}\) are the bias parameters according to cell line c_{i} and drug d_{j} and \({\upbeta }_{ }^{c} {\text{and}} {\upbeta }_{ }^{d}\) are the bias vectors for cell lines and drugs, respectively [20]. Two parameters, λ_{c} = \(\frac{1}{{{\upsigma }_{{\text{c }}}^{2} }}\), λ_{d} = \(\frac{1}{{{\upsigma }_{{\text{d }}}^{2} }}\), where \({\upsigma }_{{\text{c }}}^{2}\) and \({\upsigma }_{{\text{d }}}^{2}\) are parameters for controlling the variances of prior distributions of cell lines and drugs. The parameters α and β determine the effectiveness of cell line similarity and drug similarity in the DSPLMF method. (r ≥ 1) is a parameter for controlling the importance levels of observed interactions. Since both sensitivity and resistance classes have the same importance in drug response prediction problem, we set r to be one. Also, H^{c} = (E^{c} + \(\widetilde{{{\text{E}}^{c} }}\)) − (A + A^{T}), E^{c} and \(\widetilde{{{\text{E}}^{c} }}\) are two diagonal matrices with \({\text{E}}_{{\text{ii }}}^{c} = \sum\nolimits_{{{\text{j}} = 1}}^{{\text{n}}} {({\text{a}}_{{\text{ij }}} )}\) and \(\widetilde{{{\text{E}}_{{\text{ jj }}}^{c} }} = \sum\nolimits_{{{\text{i}} = 1}}^{{\text{n}}} {({\text{a}}_{{\text{ij }}} )}\), H^{d} = (E^{d} + \(\widetilde{{{\text{E}}^{d} }}\)) − (B + B^{T}) as diagonal elements (n is the numbers of cell lines). E^{d} and \(\widetilde{{{\text{E}}^{d} }}\) are two diagonal matrices with \({\text{E}}_{{\text{ii }}}^{d} = \sum\nolimits_{{{\text{j}} = 1}}^{{\text{m}}} {({\text{b}}_{{\text{ij }}} )}\) and \(\widetilde{{{\text{E}}_{{\text{ jj }}}^{d} }} = \sum\nolimits_{{{\text{i}} = 1}}^{{\text{m}}} {({\text{b}}_{{\text{ij }}} )}\), as diagonal elements (m is the numbers of drugs). After training the proposed model, the latent vectors of cell lines and drugs are determined using the formula 11. Then, for predicting the IC50 values of a given new cell line across all drugs, the knearest neighbors for the new cell line are selected, and the latent vector for this new cell line is estimated based on the average of latent vectors of its neighbors. Since the elements of (Sim_{IC50}) matrix are unknown, the (Sim_{CL}) matrix cannot be used for finding the knearest neighbors for the new cell line. We used the strategy introduced in the DSPLMF method for estimation (Sim_{IC50}) matrix. DSPLMF method is designed a Decision Tree Classifier model for estimation (Sim_{IC50}) matrix using the gene expression profile, copy number alteration, and singlenucleotide mutation information of the new cell line [5]. Then by a similar method, we estimated the latent vector corresponding to the new cell line to predict the probabilities that the new cell line is sensitive to drugs indicated by Eq. 12. For the set of cell lines and drugs, the probability of the cell line c_{i} is sensitive to the drug d_{j} can be modeled as a logistic function as follows:
Finally, a threshold is applied on probabilities to assign a sensitive or resistance class to each new cell linedrug pair.
Results
Evaluation of prediction performance of AutoHMMLMF
Using the feature selection approaches is one of the common methods to reduce the dimensions of the features in drug response prediction problems. In some of the previous predictive methods, such as AutoBorutaRF, the autoencoder approaches are used for selecting significant features of genomic information. One of the most powerful methods of selecting features is the EFS method proposed by Neumann et al. The EFS method integrated eight different feature selection methods and normalized all individual outputs to a common scale, an interval from 0 to 1 [21, 22]. First, to evaluate the efficiency of the feature selection strategies in the AutoHMMLMF model, we use the EFS method to select important features in the gene expression profile, copy number variation and single nucleotide mutation data. In this method, the number of features selected for each group of data is equal to the number of features selected by the AutoHMMLMF method. Then we alternated these features with features selected by Autoencoder and HMMMMM in the AutoHMMLMF method, and we compared the achieved results to other methods. This method is applied to two CCLE and GDSC datasets, and we represent the results of this approach by the name of EFSLMF in Tables 3 and 4. In this study, the tenfold crossvalidation is repeated 30 times, and the mean value of them is used as criteria for evaluating the predictive performance of the AutoHMMLMF method.
We compared the AutoHMMLMF method to six classification models, DSPLMF, EFSLMF, CaDRReS, SRMF, AutoBorutaRF, and SVMRFE for different metrics. DSPLMF and AutoBorutaRF are designed as the classification models, but the CaDRReS and SRMF methods predicted IC50 values as output. So, for comparison of these models with the AutoHMMLMF and EFSLMF methods, we applied the median of predicted IC50 values for each drug as a classification threshold. If the predicted IC50 value corresponding to a cell linedrug pair is smaller than this threshold, the sensitive class was assigned to it; otherwise, it was labeled with resistance class. Seven metrics Accuracy, Recall, Precision, Specificity, F1Score, Matthews correlation coefficient (MCC) and area under the receiver operating characteristic curve (AUC) are used that; these criteria are formulated as follows:
where

TP (true positive): The number of cell lines labeled with sensitivity and predicted as sensitivity.

TN (true negative): The number of cell lines labeled with resistance and predicted as resistance.

FP (false positive): The number of cell lines labeled with resistance and predicted as sensitivity.

FN (false negative): The number of cell lines labeled with sensitivity and predicted as resistance.
Tables 3 and 4 show the results of comparative experiments conducted on the GDSC and CCLE datasets (the bold number represents the best result). As is shown in Table 3, the value of Accuracy, Recall, Precision, F1Score, MCC, and AUC criteria have increased by 0.02, 0.03, 0.01, 0.03, 0.02, and 0.02 compared to the best algorithm, DSPLMF. In the Specificity criterion, the AutoBorutaRF method performs significantly better than the other methods. Concerning the other criteria, the AutoHMMLMF method has very significant results to the results of the AutoBorutaRF method. In Table 4, the value of all criteria by AutoHMMLMF has increased compared to the result of other algorithms, and AutoHMMLMF significantly outperformed the stateoftheartmethods in this dataset. As is shown in Tables 3 and 4, the value of all criteria by AutoHMMLMF has increased compared to the result of other algorithms. These observations demonstrated that the selected features by HMM and MMM strategies for mutation data and autoencoder technique for gene expression and copy number variation data are very effective and essential. Also, the features selected by the EFS method cannot be as powerful as the features selected by the AutoHMMLMF method in predicting drug response.
Tissue specific of cell line type
To demonstrate the AutoHMMLMF method’s performance in different tissue types, we examine whether our proposed method can achieve good performance when considering specific cell line tissue types. In this way, 73 Haematopoietic and lymphoid cell lines in the GDSC dataset are considered, and seven criteria evaluate the AutoHMMLMF method. We trained the AutoHMMLMF method by these cell lines, and we applied a tenfold crossvalidation approach for drug response prediction of considered cell lines. As is shown in Table 5, these results justify that the AutoHMMLMF method can also achieve consistently or, in some criteria, more performance on Haematopoietic and lymphoid cell lines.
Correlation between predicted and observed responses values
We plotted the Bar chart of Pearson correlation coefficients of observed drug responses and predicted values for 24 drugs in the CCLE dataset. As is shown in Fig. 2, (70%) above correlation coefficients, 17 from 24, are higher than 0.5. For four of these drugs (PD—0325901, T opotecan, AZD6244 and Irinotecan) correlation coefficients are greater than 0.65. These plots show the excellent performance of the AutoHMMLMF method in predicting drug response values. The scatter plots of observed and predicted drug responses by the AutoHMMLMF model of 4 above drugs are drawn in Fig. 3, and the scatter plots of 20 other drugs in the CCLE dataset are illustrated in the Additional file 4.
Application for drug repositioning
Drug repositioning is the process of selecting a known drug for an alternative pharmacological purpose. For this issue, we considered 37 US Food and Drug Administration (FDA) approved drugs that were not tested in the GDSC dataset from the study of Choi et al. [23]. The AutoHMMLMF model was trained on the GDSC dataset and the probability of sensitivity of 20 cell lines of head and neck cancer (HNSC) across 20 anticancer drugs of 37 drugs were predicted and were shown in Fig. 4.
As it can be seen in Fig. 4, 11 following drugs have been identified as an effective treatment of HNSC:

Azacitidine: Azacitidine is a type of drug called a hypomethylating agent, and a study reported that Azacitidine and Cisplatin are effective for the treatment of head and neck cancer [24].

Regorafenib: Regorafenib is an oral multikinase inhibitor that targets receptor tyrosine kinase (RTK). Klinghammer et al. established a panel of 65 head and neck squamous cell carcinoma, and they demonstrated that combinational treatment of regorafenib and Everolimus is useful in these patients [25].

Everolimus: Everolimus is used as an immunosuppressant to prevent rejection of organ transplants in the treatment of cancer. Recently, a study showed that patients with TP53 mutations benefited significantly from Everolimus in head and neck cancer [26].

Fulvestrant: Fulvestrant is a drug used to treat hormone receptor (HR)positive metastatic. Grünow et al. [27] showed that Fulvestrant inhibits irradiationinduced ESR2 expression, and their findings demonstrated the efficacy of Fulvestrant in combination with radiotherapy for HNSC patients.

InterferonAlf a–2B: InterferonAlf a–2B is an antiviral or antineoplastic drug that is an effective treatment in head and neck cancer [28].

Porfimer: Porfimer is a photosensitizer, and it is used in radiation therapy in cancer treatment. An in vivo study suggested that this drug can be used in treatment for HNSC patients [29].

lmiquimod: Imiquimod (INN) is a prescription drug that acts as an immune response modifier used to treat basal cell carcinoma. The study showed that topical Imiquimod might offer a reasonable and welltolerated palliative treatment option for patients [30].

Decitabine or 5–aza–2′–deoxycytidine is a nucleic acid synthesis inhibitor for cancer treatment. Cisplatin resistance in head and neck squamous cell carcinoma reduces survival. Viet et al. [31] showed that Decitabine treatment restored Cisplatin sensitivity in HNSC cell lines and significantly reduced the Cisplatin dose required to induce apoptosis.

Hydroxyurea: Hydroxyurea is an anticancer agent used to treat melanoma, resistant, recurrent, and metastatic cancer types. A study displayed Hydroxyurea is a single active agent in head and neck cancer. It has been used clinically as a radiationenhancing drug with radiotherapy [32].

V andetanib: V andetanib acts as a kinase inhibitor of several cell receptors, and it is an anticancer drug for the treatment of cancer cell lines. Sano et al. [33] approved the addition of V andetanib to combination therapy with Cisplatin, and radiation can overcome resistance in vitro and in vivo models of HNSC.

Carf ilzomib: Carf ilzomib is an anticancer drug acting as a selective proteasome inhibitor. By upregulation of proapoptotic Bik, Carf ilzomib and ONX0912 potently induced apoptosis in HNSC cell lines [34].
These results indicate that the AutoHMMLMF model can be useful in drug repositioning. Also, five drugs (Exemestane, Ibrutinib, T halidomide, Romidepsin and Fluoxymesterone) may be novel therapeutic drugs for HNSC.
Hyperparameters settings
Since the numbers of cell lines and drugs in the GDSC dataset are higher than the CCLE dataset, we tuned the hyperparameters on the GDSC dataset, and we used the obtained values of the hyperparameters in both datasets. In this way, the tenfold crossvalidation procedure is applied to GDSC, and hyperparameters are determined by maximizing the AUC criterion.
The learned hyperparameters using the GDSC dataset are shown in Table 6. The threshold parameter applied on Eq. 12 for determining the label of the class for each new cell line was chosen from {0.1, …,1} and this parameter was set to 0.4. The latent space dimension L was chosen from {1, …, min(n, m)}, for GDSC dataset L parameter was set to 95 and for CCLE dataset L was set to 23 (where n and m are the numbers of cell lines and the numbers of drugs, respectively).
Discussion
This paper proposed the AutoHMMLMF method based on feature selection approaches and logistic matrix factorization strategy to predict drug response. The proposed prediction model showed higher predictive efficiency than the existing computational models. Also, we demonstrated that the AutoHMMLMF model could be useful in drug repositioning. So, we identified five drugs (Exemestane, Ibrutinib, T halidomide, Romidepsin and Fluoxymesterone) for HNSC treatment. To illustrate the biological significance of the features selected by the hidden Markov model (HMM) and multinomial mixture model (MMM) on mutation data in the AutoHMMLMF method, we further consider cancer cell lines related to breast cancer (BRCA) and two important processes, namely MMR and HRD. This study selected 30 significant genes by considering the Spearman correlation coefficient between their gene expression file and signature activity for 12 signatures of BRCA cancer cell lines. Among these genes, the gene expression of four genes namely PMS2, MLH1, MSH2, and MSH6 has high Spearman correlation coefficient with signatures 6, 20, and 26 activities. The results of the Spearman correlation between the expression of these genes and three mutation signature activities in BRCA are shown in Fig. 5. On the other hand, a recent study [35] has shown that the three 6, 20, and 26 signatures are associated with MMR deficiency in breast cancer. Defective DNA mismatch repair (MMR) occurs in many cancer types, and mutations in the PMS2, MLH1, MSH2, and MSH6 genes are the most common cause of mismatch repair (MMR) deficient. The above genes are known as DNA mismatch repair (MMR) genes, and these genes are involved in repairing errors in DNA replication (the errors that occur when DNA is copied in preparation for cell division) [36].
Also, the expression of ten other selected genes by proposed model, namely BRCA1, BRCA2, ATM, CHEK2, MRE11A, NBN, FANCA, PALB2, RAD51C, RAD50 has high Spearman correlation coefficient with activity of mutation signature 3. The Spearman correlation coefficients between the expressions of these genes with a signature 3 activity are shown in Fig. 6. Similarly, in a recent study [35] it was shown that homologous recombination deficiency (HRD) is associated with the signature 3 in breast cancer patients. Homologous recombination deficiency is the inability to repair doublestrand breaks in human cells. Several genetic alterations causing HRD include somatic mutations of genes such as the selected 10 genes. The eight other selected genes by the AutoHMMLMF method for BRCA cancer cell lines are APOBEC3A, APOBEC3B, APOBEC1, APOBEC3C, APOBEC3D, APOBEC3F, APOBEC3G and APOBEC4 that belong to the APOBEC family. The Spearman correlation coefficients between the expression of these genes with 2 and 13 signature activities are shown in Fig. 7. A recent study [35] has conducted that APOBEC deamination of cytosine to uracil is thought to initiate mutations of signatures 2 and 13. Therefore, these results show that the genes selected for breast cancer are biologically essential, and the AutoHMMLMF method has been able to detect significant features for single nucleotide mutation data. In addition to increasing the accuracy of drug response prediction compared to other models, one of the most important advantages of the AutoHMMLMF algorithm is that this algorithm's running time is significantly lower than the running time of the other mentioned methods. Since this method is based on the selection of features, one of its limitations is that the results depend on the selected features. So, this method's results can be improved by using a more powerful feature selection approach. The following limitation of the method is that it was designed to solve the cold problem for a new cell line, while some of the proposed methods can also make predictions for the new drug or new pair of cell linedrug.
Conclusion
In this study, we developed a feature selectionbased method, AutoHMMLMF, to predict cancer cell lines’ response to drugs in the GDSC and CCLE datasets. For feature selection of gene expression and copy number variation data, two autoencoder networks are designed. For feature selection of single nucleotide mutation information, the novel approach based on the hidden Markov model (HMM) and multinomial mixture model (MMM) is applied. AutoHMMLMF shows better overall prediction performance than the stateoftheart prediction methods. Also, by comparing to one of the most powerful feature selection methods, the EFS method, we showed that the performance of the predictive model based on selected features introduced in this paper is much better for drug response prediction. Also, we suggest that the proposed model can be useful in numerous therapeutic research areas, such as drug repositioning and personalized medicine. Finally, we found substantial evidence that the selected features and predicted responses by AutoHMMLMF have significant consistency with many previous studies.
Availability of data and materials
The data and implementation are accessible from (https://github.com/emdadi/AutoHMMLMF).
Abbreviations
 GDSC:

Genomics of drug sensitivity in cancer
 CCLE:

Cancer cell line encyclopedia
 EFS:

Ensemble feature selection
 SRMF:

Similarity regularized matrix factorization
 CaDRReS:

Cancer drug response prediction using a recommender system
 DSPLMF:

Drug sensitivity prediction based on logistic matrix factorization
 COSMIC:

Catalogue of somatic mutations in cancer
 HMM:

Hidden Markov model
 MMM:

Multinomial mixture model
 RTK:

Receptor tyrosine kinase
 MMR:

Mismatch repair
 HRD:

Homologous recombination deficiency
References
 1.
Yang W, Soares J, Greninger P, Edelman EJ, Lightfoot H, Forbes S, Bindal N, Beare D, Smith JA, Thompson IR, et al. Genomics of drug sensitivity in cancer (gdsc): a resource for therapeutic biomarker discovery in cancer cells. Nucleic Acids Res. 2012;41(D1):955–61.
 2.
Barretina J, Caponigro G, Stransky N, Venkatesan K, Margolin AA, Kim S, Wilson CJ, Lehár J, Kryukov GV, Sonkin D, et al. The cancer cell line encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature. 2012;483(7391):603.
 3.
Wang L, Li X, Zhang L, Gao Q. Improved anticancer drug response prediction in cell lines using matrix factorization with similarity regularization. BMC Cancer. 2017;17(1):1–12.
 4.
Suphavilai C, Bertrand D, Nagarajan N. Predicting cancer drug response using a recommender system. Bioinformatics. 2018;34(22):3907–14.
 5.
Emdadi A, Eslahchi C. Dsplmf: a method for cancer drug sensitivity prediction using a novel regularization approach in logistic matrix factorization. Front Genet. 2020;11:75.
 6.
Kursa MB, Rudnicki WR, et al. Feature selection with the boruta package. J Stat Softw. 2010;36(11):1–13.
 7.
Xu X, Gu H, Wang Y, Wang J, Qin P. Autoencoder based feature selection method for classification of anticancer drug response. Front Genet. 2019;10:233.
 8.
Dong Z, Zhang N, Li C, Wang H, Fang Y, Wang J, Zheng X. Anticancer drug sensitivity prediction in cell lines from baseline gene expression through recursive feature selection. BMC Cancer. 2015;15(1):1–12.
 9.
Li B, Shin H, Gulbekyan G, Pustovalova O, Nikolsky Y, Hope A, Bessarabova M, Schu M, KolpakovaHart E, Merberg D, et al. Development of a drugresponse modeling framework to identify cell line derived translational biomarkers that can predict treatment outcome to erlotinib or sorafenib. PLoS ONE. 2015;10(6):0130700.
 10.
Wojtowicz D, Sason I, Huang X, Kim YA, Leiserson MD, Przytycka TM, Sharan R. Hidden markov models lead to higher resolution maps of mutation signature activity in cancer. Genome Med. 2019;11(1):1–12.
 11.
Forbes SA, Beare D, Boutselakis H, Bamford S, Bindal N, Tate J, Cole CG, Ward S, Dawson E, Ponting L, et al. Cosmic: somatic cancer genetics at highresolution. Nucleic Acids Res. 2017;45(D1):777–83.
 12.
Ghandi M, Huang FW, JanéValbuena J, Kryukov GV, Lo CC, McDonald ER, Barretina J, Gelfand ET, Bielski CM, Li H, et al. Nextgeneration characterization of the cancer cell line encyclopedia. Nature. 2019;569(7757):503–8.
 13.
NikZainal S, Alexandrov LB, Wedge DC, Van Loo P, Greenman CD, Raine K, Jones D, Hinton J, Marshall J, Stebbings LA, et al. Mutational processes molding the genomes of 21 breast cancers. Cell. 2012;149(5):979–93.
 14.
Alexandrov LB, NikZainal S, Wedge DC, Aparicio SA, Behjati S, Biankin AV, Bignell GR, Bolli N, Borg A, BørresenDale AL, et al. Signatures of mutational processes in human cancer. Nature. 2013;500(7463):415–21.
 15.
Zhang J, Bajari R, Andric D, Gerthoffert F, Lepsa A, NahalBose H, Stein LD, Ferretti V. The international cancer genome consortium data portal. Nat Biotechnol. 2019;37(4):367–9.
 16.
Emdadi A, Moughari FA, Meybodi FY, Eslahchi C. A novel algorithm for parameter estimation of hidden markov model inspired by ant colony optimization. Heliyon. 2019;5(3):01299.
 17.
Durbin R, Eddy S, Krogh A, Mitchison G. Probabilistic models of proteins and nucleic acids. Biol Seq Anal. 1998;14:164–73.
 18.
Moughari FA, Eslahchi C. Adrml: anticancer drug response prediction using manifold learning. Sci Rep. 2020;10(1):1–18.
 19.
Liu H, Zhao Y, Zhang L, Chen X. Anticancer drug response prediction using neighborbased collaborative filtering with global effect removal. Mol TherapyNucleic Acids. 2018;13:303–11.
 20.
Liu Y, Wu M, Miao C, Zhao P, Li XL. Neighborhood regularized logistic matrix factorization for drugtarget interaction prediction. PLoS Comput Biol. 2016;12(2):e1004760.
 21.
Neumann U, Genze N, Heider D. Efs: an ensemble feature selection tool implemented as rpackage and webapplication. BioData Min. 2017;10(1):1–9.
 22.
Neumann U, Riemenschneider M, Sowa JP, Baars T, Kälsch J, Canbay A, Heider D. Compensation of feature selection biases accompanied with improved predictive performance for binary classification by using a novel ensemble feature selection approach. BioData Min. 2016;9(1):1–14.
 23.
Choi J, Park S, Ahn J. Refdnn: a reference drug based neural network for more accurate prediction of anticancer drug resistance. Sci Rep. 2020;10(1):1–11.
 24.
Liao YM, Mirshahidi H, Zhang K, Mirshahidi S, Williamson S, Hsueh CT. Phase I study of azacitidine and cisplatin in patients with advanced head and neck or nonsmall cell lung cancer. AACR. 2012.
 25.
Klinghammer KF, Raguse JD, Albers A, WulfGoldenberg A, Zopf D, Hoffmann J, Fichtner I, Keilholz U. Employing head and neck cancer patient derived xenografts to inform clinical trial design: results from combining regorafenib with everolimus. Am Soc Clin Oncol. 2015;33:15.
 26.
Fuerst ML. Adjuvant Everolimus extends survival in advanced head and neck cancer. LWW. 2020;42:34.
 27.
Grünow J, Rong C, Hischmann J, Zaoui K, Flechtenmacher C, Weber KJ, Plinkert P, Hess J. Regulation of submaxillary gland androgenregulated protein 3a via estrogen receptor 2 in radioresistant head and neck squamous cell carcinoma cells. J Exp Clin Cancer Res. 2017;25:25.
 28.
SeixasSilva JA, Richards T, Khuri FR, Wieand HS, Kim E, Murphy B, Francisco M, Hong WK, Shin DM. Phase 2 bioadjuvant study of interferon alfa2a, isotretinoin, and vitamin e in locally advanced squamous cell carcinoma of the head and neck: longterm followup. Arch Otolaryngol Head Neck Surg. 2005;131:304–7.
 29.
Mang T, Sullivan M, Cooper M, Loree T, Rigual N. The use of photodynamic therapy using 630 nm laser light and porfimer sodium for the treatment of oral squamous cell carcinoma. Photodiagn Photodyn Ther. 2006;3:272–5.
 30.
Wester A, Eyler JT, Swan JW. Topical imiquimod for the palliative treatment of recurrent oral squamous cell carcinoma. JAAD Case Rep. 2017;3:329–31.
 31.
Viet CT, Dang D, Achdjian S, Ye Y, Katz SG, Schmidt BL. Decitabine rescues cisplatin resistance in head and neck squamous cell carcinoma. PLoS ONE. 2014;9:220.
 32.
Vokes EE, Haraf DJ, Panje WR, Schilsky RL, Weichselbaum RR. Hydroxyurea with concomitant radiotherapy for locally advanced head and neck cancer. Semin Oncol. 1992;19:53–8.
 33.
Sano D, Matsumoto F, Valdecanas DR, Zhao M, Molkentine DP, Takahashi Y, Hanna EY, Papadimitrakopoulou V, Heymach J, Milas L, et al. Vandetanib restores head and neck squamous cell carcinoma cells’ sensitivity to cisplatin and radiation in vivo and in vitro. Clin Cancer Res. 2011;17:15–27.
 34.
Zang Y, Thomas SM, Chan ET, Kirk CJ, Freilino ML, DeLancey HM, Grandis JR, Li C, Johnson DE. Carfilzomib and onx 0912 inhibit cell survival and tumor growth of head and neck cancer and their activities are enhanced by suppression of mcl1 or autophagy. Clin Cancer Res. 2012;18:39–49.
 35.
Morganella S, Alexandrov LB, Glodzik D, Zou X, Davies H, Staaf J, Sieuwerts AM, Brinkman AB, Martin S, Ramakrishna M, et al. The topography of mutational processes in breast cancer genomes. Nat Commun. 2016;7:1–11.
 36.
Wimmer K, Kratz CP. Constitutional mismatch repairdeficiency syndrome. Haematologica. 2010;95:699–701.
Acknowledgements
Not applicable.
Funding
No funding was obtained for this study.
Author information
Affiliations
Contributions
A.E. proposed the method, wrote the manuscript, and conducted the implementation, comparisons, and analysis. C.E. evaluated the results, designed the analysis, and reviewed the manuscript. All authors have read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Competing interests
The authors declare no competing interests.
Consent for publication
Not applicable.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Additional file 1.
The results of the correlation between some genes and signature activity in 14 cancer types with high Spearman correlation coefficients.
Additional file 2.
The similarity matrices based on tissue type information corresponding to GDSC dataset.
Additional file 3.
The similarity matrices based on tissue type information corresponding to CCLE dataset.
Additional file 4.
The scatter plots of observed and predicted drug responses by the AutoHMMLMF model of 20 drugs in the CCLE dataset.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Emdadi, A., Eslahchi, C. AutoHMMLMF: feature selection based method for prediction of drug response via autoencoder and hidden Markov model. BMC Bioinformatics 22, 33 (2021). https://doi.org/10.1186/s12859021039743
Received:
Accepted:
Published:
Keywords
 Cancer
 Drug response
 Autoencoder
 Hidden Markov model
 Matrix factorization
 Personalized treatment