A p-Median approach for predicting drug response in tumour cells
 Elisabetta Fersini^{1},
 Enza Messina^{1} and
 Francesco Archetti^{1, 2}
https://doi.org/10.1186/s12859-014-0353-7
© Fersini et al.; licensee BioMed Central Ltd. 2014
Received: 9 December 2013
Accepted: 16 October 2014
Published: 29 October 2014
Abstract
Background
The complexity of biological data related to the genetic origins of tumour cells poses significant challenges to gleaning knowledge that can be used to predict therapeutic responses. In order to discover a link between gene expression profiles and drug responses, a computational framework based on Consensus p-Median clustering is proposed. The main goal is to simultaneously predict (in silico) anticancer responses by extracting common patterns among tumour cell lines, selecting genes that could potentially explain the therapy outcome and finally learning a probabilistic model able to predict the therapeutic responses.
Results
The experimental investigation performed on the NCI60 dataset highlights three main findings: (1) Consensus p-Median is able to create groups of cell lines that are highly correlated both in terms of gene expression and drug response; (2) from a biological point of view, the proposed approach enables the selection of genes that are strongly involved in several cancer processes; (3) the final prediction of drug responses, built upon Consensus p-Median and the selected genes, represents a promising step for predicting potentially useful drugs.
Conclusion
The proposed learning framework represents a promising approach for predicting drug response in tumour cells.
Background
Cancer is a disease treated with various strategies depending on the type of cancer and the stage of the disease. Generally, therapeutic agents are selected according to the specific cancer type and patient population, based on their effectiveness in large-population studies [1],[2]. Now, with the advances of the genomic era, a massive amount of high-throughput data has been made available for understanding cancer systems biology. The publicly available datasets composed of genomic data and drug responses offer the opportunity to reveal valuable knowledge about the hidden relationships between gene expression and drug activity of tumor cells, pointing out the conditions that make a patient more responsive than others to a given therapeutic agent. Although data collection provides the baseline for a better understanding of cancer mechanisms, data integration and interpretation are still an open issue. Mathematical and statistical models of complex biological systems play a fundamental role in systems biology, and in particular in cancer-related issues. They can be exploited for exploratory purposes, to validate hypotheses and to make predictions about quantities that are difficult or impossible to measure in vivo.
In the last decade, several studies have been conducted to develop platforms on which high-throughput computational analysis of cancer can be performed. Many of these computational approaches are targeted at predicting drug sensitivity/resistance by means of statistical inference and regression methods able to take into account genomic information of hundreds of genes for determining a specific drug response [3]-[5]. However, the massive availability of chemical compounds as potential cancer therapies has opened the way to the investigation of in silico therapy response prediction, which requires more sophisticated computational models and methods to optimize the experimental design of cell-drug screenings.
A first attempt at gene-drug integrative analysis was presented in [6] where, thanks to a hierarchical clustering algorithm, several investigations were performed: (1) cell-to-cell correlation on the basis of gene expression and drug activity profiles, (2) relationships between drug activity patterns and mechanisms of action, (3) gene-drug correlation on the basis of gene expression and drug activity profiles. In subsequent investigations [7],[8], the triangle of gene expression profiles, drug responses and cancer types has been explored by integrating unsupervised and supervised machine learning algorithms. The clustering approach based on Soft Topographic Vector Quantization (STVQ) [9] has shown that gene expression profiles are more related to the cancer type than to the drug activity patterns, while thanks to the structure learning of Bayesian Networks (BN) some biologically meaningful relationships among gene expression levels, drug activities and cancer types have been confirmed and in some cases revealed. More recent works [10]-[13] are targeted at integrating explorative approaches with a predictive paradigm towards a computational gene-drug screening. While [10],[11] and [12] are based on non-deterministic clustering approaches (k-Means, STVQ and Genetic Programming) for identifying relevant genes involved in cancer mechanisms and predictive of drug response, [13] introduces a framework based on global optimization to cancel the randomness, and therefore the variance, of stochastic clustering results when predicting a therapy outcome.

The explorative analysis performed through clustering approaches reveals that the tissue of origin is more related to the gene expression profile than the drug activity patterns. This suggests that the genomic information of a cell line plays a fundamental role, independently of the organ of origin, to understand anticancer therapy responses. This idea has been supported by the fact that several cell lines with a relatively high expression level of those genes regulating multidrug resistance have been clustered in the same group. This indicates that chemoresponse mechanisms are distributed across different tissues in the panel and that it should be possible to link drug responses to gene expression profiles.

In order to cancel the variability of results of stochastic clustering and to guarantee the convergence to a global minimum, we need to address the clustering problem by exact approaches able to find globally optimal solutions.

Computational approaches based on Bayesian Networks reveal interesting relationships among subsets of genes and drugs. The potential of Bayesian Networks encourages us to exploit this probabilistic model not only for deductive purposes, but also for prediction issues.
In order to achieve the final goal of simultaneously predicting the drug response of several compounds given a patient genomic profile, we propose a computational framework based on the following assumption: groups of cell lines homogeneous in terms of both gene expression profile and drug activity should be characterized by a subset of genes that explains the drug responses. To this purpose a three-fold analysis has been carried out: p-Median problem formulations to create clusters of homogeneous cell lines, Feature Selection Policies to select relevant genes and finally Bayesian Networks to predict drug responses of tumour cell lines. Computational results show that the proposed Consensus p-Median, combined with gene selection and the BN inference engine, yields homogeneous clusters while guaranteeing good predictive power for inferring drug responses for a new cell line. This is also confirmed by the biological evaluation performed on the selected genes: according to the existing literature, the set of genes used to train the BNs, which has been selected by using the groups of cell lines obtained by the proposed Consensus p-Median, has been shown to be biologically relevant from an oncological point of view.
Methods
Problem formulation
 1. The creation of homogeneous groups of tumor cell lines by means of p-Median formulations. In particular, a novel Consensus p-Median formulation is proposed and compared with traditional state-of-the-art approaches, i.e. k-Means [14], STVQ [9], Relational k-Means [11] and Probabilistic D-Clustering [15].
 2. The selection of relevant genes able to predict the response of hundreds of drugs. We explore the potential of the solutions determined by solving the above mentioned p-Median problem formulation for identifying a subset of genes that characterizes each cluster, i.e. those subsets of genes that could be responsible for drug responses. To accomplish this task two main feature selection policies have been investigated, i.e. Information Gain [16] and Correlation-based Feature Subset Evaluation (CFS) [17].
 3. The simultaneous prediction of different drug responses by exploiting the potential of Bayesian Networks [18]. Establishing a straightforward dependency structure of the Bayesian Network, we explore the ability of the selected genes to predict a panel of drug responses given the genomic profiles of patients.
The proposed computational framework exploits the well known dataset provided by the U.S. National Cancer Institute. The dataset consists of 60 cell lines from 9 kinds of cancers, all extracted from human patients, where the tumors considered in the panel derive from colorectal, renal, ovarian, breast, prostate, lung and central nervous system as well as leukemia and melanoma cancer tissues.
where ${x}_{i}^{G}$ represents the transcript expression level as a vector in the space ${\mathbb{R}}^{m}$ and ${x}_{i}^{D}$ denotes the drug response as a vector in the space ${\mathbb{R}}^{n}$.
Sherf dataset: cDNA arrays and DTP-tested chemical compounds
According to the Sherf representation, ${\mathbb{R}}^{m}$, with m=1375, includes genes selected from the original NCI60 dataset (characterized by 9073 genes) having 5 or fewer missing values and showing a strong pattern of variation among the 60 cell lines (more than 3 measurements must have red-green intensity ratios >2.6 or <0.38). The space ${\mathbb{R}}^{n}$, with n=1400, includes drugs contained in the original dataset, where each compound has been tested one at a time and independently. Since missing values were still present among the 1375 genes and 1400 drugs, they have been replaced by the average gene expression value (or the average drug activity) over the 60 cell lines. The gene expression profiles and drug activity responses for the Sherf dataset are available for download as Additional file 1: Sherf gene expression data and Additional file 2: Sherf drug activity data.
Liu dataset: microRNA arrays and drugs with known mechanism of action
MicroRNAs (miRNA in the following) are a group of short non-coding RNAs that regulate gene expression at the post-transcriptional level. They are involved in many biological processes, including development, differentiation, apoptosis, and carcinogenesis. Because miRNAs may play a role in the initiation and progression of cancer, they comprise a novel class of promising diagnostic and prognostic molecular markers and potential drug targets. In order to achieve our goal by exploiting the miRNA data, we considered the dataset presented in [19]. This dataset leads us to represent the sets Ω^{ G } and Ω^{ D } by means of 422 miRNA expression profiles and 118 GI_{50} responses related to drugs with known mechanism of action. The same selection criterion applied to the Sherf dataset has been used for the Liu dataset. Concerning the miRNA expressions, in this dataset there are no missing values and more than 3 experiments have red-green intensity ratios >2.6 or <0.38, implying no selection of miRNA and therefore a space ${\mathbb{R}}^{m=422}$. The drug space ${\mathbb{R}}^{n=118}$, instead, is characterized by the presence of missing values. As for the Sherf dataset, they have been replaced by the average drug activity over the 60 cell lines. The miRNA expression profiles and drug activity responses for the Liu dataset are available for download as Additional file 3: Liu miRNA expression data and Additional file 4: Liu drug activity data.
Cluster analysis
Cluster analysis is aimed at discovering embedded patterns into a given dataset. From a high level point of view cluster analysis consists of partitioning a set of patterns into subsets (clusters) based on similarity, i.e. a cluster has to contain similar patterns and dissimilar patterns have to be in different clusters. This could be accomplished by partitioning data points into a prespecified number of clusters through the optimization of a cost function related to a similarity/dissimilarity measure between data points.
where x_{ i } and x_{ j } represent two cell lines and corr(x_{ i },x_{ j }) denotes the Pearson Correlation coefficient between x_{ i } and x_{ j }. Thanks to the distance measure denoted by equation (4), cell lines that are far apart because of anti-correlated genes/drugs are likely to be placed in different clusters, while cell lines separated by a small distance are expected to be clustered together. A correlation-based metric is adopted instead of the Euclidean distance because the latter is sensitive to magnitude: Euclidean distance is affected by scaling and by differences in average expression level, whereas correlation is not.
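The scale-invariance argument can be checked on a toy example. The sketch below assumes the common form d = 1 − corr for the correlation-based distance (the exact expression of equation (4) is not reproduced in this text); the vectors `a`, `b` and `c` are hypothetical profiles, not NCI60 data.

```python
import math

def pearson(u, v):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(u)
    mu_u, mu_v = sum(u) / n, sum(v) / n
    cov = sum((a - mu_u) * (b - mu_v) for a, b in zip(u, v))
    var_u = sum((a - mu_u) ** 2 for a in u)
    var_v = sum((b - mu_v) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)

def corr_distance(u, v):
    """Assumed distance form: 1 - corr(u, v), which ranges over [0, 2]."""
    return 1.0 - pearson(u, v)

a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]   # same pattern as a, rescaled by 10
c = [4.0, 3.0, 2.0, 1.0]       # anti-correlated with a

d_ab = corr_distance(a, b)     # rescaling does not move the profiles apart
d_ac = corr_distance(a, c)     # anti-correlation gives the largest distance
euclid_ab = math.dist(a, b)    # Euclidean distance is dominated by the scale
```

Under this assumed form, two profiles that differ only by a scale factor are at distance 0, while the Euclidean distance between them grows with the scale.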
In the following we present three different clustering approaches based on the p-Median formulation: traditional p-Median, Probabilistic D-Clustering and Consensus p-Median.
Traditional p-Median
The p-Median problem was originally designed for facility location planning [21], where the location of p “facilities” relative to a set of “customers” is chosen such that the sum of the shortest demand-weighted distances between “customers” and “facilities” is minimized. In our investigation, the p-Median problem has been formulated as an assignment problem for creating groups of cell lines by using a “flat” representation of data, i.e. by representing each cell line as a vector in ${\mathbb{R}}^{m+n}$. Given a cell line x_{ i }∈Ω and K desired clusters, the clustering problem consists in assigning each x_{ i } to a cluster center x_{ j }, such that the intra-cluster distance is minimized and the inter-cluster distance is maximized.
where z_{ ij } represents the assignment variable that indicates whether a cell line x_{ i } is assigned to a cluster center x_{ j }. Note that the matrix Z has dimension |Ω|×|Ω| because each entry z_{ ij } denotes the potential association of a cell line x_{ i } to any of the points x_{ j } in Ω (where x_{ j } can be a cluster center or not).
According to this formulation, the objective function in equation (6) denotes a combinatorial optimization problem whose objective is to minimize the distance between all data points belonging to the same cluster through the identification of optimal cluster centers x_{ j }∈Ω. Constraint (7) ensures that each cell line x_{ i } is assigned to only one cluster, constraint (8) guarantees that there will be exactly K clusters and constraint (9) ensures that if x_{ i } is assigned to x_{ j } then x_{ j } is a cluster center and therefore a median. The last constraint (10) guarantees integrality.
For the sake of clarity, the above mentioned p-Median is a mathematical programming formulation (also known as generalized Fermat-Weber problem formulation) for uncapacitated facility location problems. The objective of this formulation is to minimize the sum of the distances from all data points x_{ i } to their respective cluster centers (geometric medians). In this paper the p-Median problem is solved deterministically^{a} by means of a canonical branch-and-cut algorithm [22]. The solution of the p-Median problem provides not only the cluster assignments, but also the geometric medians as cluster representatives.
p-Median must not be confused with approaches like k-Means [14], k-Medoids [23] and k-Medians [24], which are heuristic algorithms for approximating the above mentioned objective function. While k-Means computes a cluster representative (centroid) as the mean vector of all points belonging to a cluster, k-Medoids and k-Medians select respectively k of the |Ω| data points as medoids (whose average distance to all the objects in the cluster is minimal) and medians (combination of multiple instances). On the contrary, a branch-and-cut algorithm on a p-Median formulation determines the set of p data points that minimize the sum of weighted distances to all points of the dataset and consequently finds the cluster assignment for each data point. The geometric medians determined by solving the p-Median problem coincide with neither the centroids, the medians nor the medoids (the only exceptions are the 1-dimensional case, where the geometric median coincides with the median, and the case in which the k-Medoids medoids are selected as median objects instead of being computed as combinations of multiple instances). In our investigation, the solutions of the p-Median problem formulations are guaranteed to be globally optimal, while those produced by the heuristic approaches may correspond to local optima.
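To make the role of the objective and of the constraints concrete, the following minimal sketch solves a tiny p-Median instance exactly. It uses exhaustive enumeration of the candidate medians rather than the branch-and-cut algorithm used in the paper, so it is only viable for very small instances; the function name `p_median` and the toy points are hypothetical.

```python
from itertools import combinations

def p_median(dist, K):
    """Exact p-Median by exhaustive enumeration (tiny instances only).

    dist[i][j] is the distance between points i and j. Returns
    (cost, medians, assignment), where assignment[i] is the median of point i.
    """
    n = len(dist)
    best = None
    for medians in combinations(range(n), K):      # exactly K cluster centers
        # With no capacity limits, each point is best served by its closest median,
        # so every point is assigned to exactly one (open) center.
        assign = [min(medians, key=lambda j: dist[i][j]) for i in range(n)]
        cost = sum(dist[i][assign[i]] for i in range(n))
        if best is None or cost < best[0]:
            best = (cost, medians, assign)
    return best

# Hypothetical toy data: five points on a line, distance = absolute difference.
points = [0.0, 1.0, 2.0, 10.0, 11.0]
dist = [[abs(p - q) for q in points] for p in points]
cost, medians, assign = p_median(dist, K=2)
```

Because every subset of K candidate centers is examined, the returned solution is a global optimum, which is the property that distinguishes the exact formulation from heuristics such as k-Means.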
Probabilistic D-Clustering
The assignment problem presented above creates K mutually exclusive clusters of cell lines with similar profiles of gene expression and drug response. The crisp formulation can be relaxed by modelling probabilistic (or soft) assignments (with cluster membership probabilities), leading to a probabilistic p-Median named Probabilistic D-Clustering [15].
where the decision variables c_{ k } and p_{ k }(x_{ i }) denote the cluster centers c_{ k } and the probability of assigning the cell line x_{ i } to the cluster c_{ k } respectively. Each cell line can be finally assigned to the cluster center with the highest probability.
It can be easily noted that the formulation of Probabilistic D-Clustering is a further generalization of the p-Median (Fermat-Weber) problem, slightly different from the one presented in equations (6)-(10) but still belonging to combinatorial optimization. While for Traditional p-Median the creation of K clusters is forced by the constraint (8), in Probabilistic D-Clustering the generation of K clusters is driven by the objective function.

Step 1: Probabilities Update. Given the centers c_{ k } and the distance between each cell line x_{ i } and c_{ k }, the probability that x_{ i } is assigned to the cluster k can be estimated as:
${p}_{k}\left({x}_{i}\right)=\frac{\prod _{j\ne k}d\left({x}_{i},{c}_{j}\right)}{\sum _{l=1}^{K}\prod _{m\ne l}d\left({x}_{i},{c}_{m}\right)}$ (14)
Given the clusters, their centers, and the distances of data points from these centers, the probability of cluster membership at any point is assumed inversely proportional to the distance from (the center of) the cluster.

Step 2: Centers Update. Given the probabilities p_{ k }(x_{ i }), the centers ${c}_{k}^{+}$ can be updated according to the current cluster distribution as:
${c}_{k}^{+}=\frac{\sum _{i=1}^{\left|\Omega \right|}{\mu}_{k}\left({x}_{i}\right){x}_{i}}{\sum _{j=1}^{\left|\Omega \right|}{\mu}_{k}\left({x}_{j}\right)}$ (15)
where
${\mu}_{k}\left({x}_{i}\right)=\frac{{p}_{k}{\left({x}_{i}\right)}^{2}}{d\left({x}_{i},{c}_{k}\right)}$ (16)
The centers are updated as convex combinations of the data points, with weights determined by the working principle.
originating a clustering of cell lines in the space ${\mathbb{R}}^{m+n}$. The optimal clustering solution can be determined (see ref. [15]) by verifying the optimality of centers and assignments through the dual problem corresponding to the primal reported in Eq. (11)-(13).
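The two update steps can be sketched as a short iteration. This is an illustrative sketch, not the authors' implementation: it assumes Euclidean distance (the paper uses a correlation-based distance), a small floor `eps` to avoid division by zero when a point coincides with a center, and hypothetical toy points and starting centers.

```python
import math

def probabilities(x, centers, eps=1e-12):
    """Step 1, eq. (14): membership probabilities of point x, plus its distances."""
    d = [max(math.dist(x, c), eps) for c in centers]   # eps floor is an assumption
    prods = [math.prod(d[j] for j in range(len(d)) if j != k) for k in range(len(d))]
    total = sum(prods)
    return [p / total for p in prods], d

def update_centers(points, centers):
    """Step 2, eqs. (15)-(16): centers as convex combinations of the points."""
    K, dim = len(centers), len(points[0])
    num = [[0.0] * dim for _ in range(K)]
    den = [0.0] * K
    for x in points:
        p, d = probabilities(x, centers)
        for k in range(K):
            u = p[k] ** 2 / d[k]                       # weight of eq. (16)
            den[k] += u
            for t in range(dim):
                num[k][t] += u * x[t]
    return [[num[k][t] / den[k] for t in range(dim)] for k in range(K)]

def d_clustering(points, centers, iters=100, tol=1e-9):
    """Alternate Step 1 and Step 2 until the centers stop moving."""
    for _ in range(iters):
        new = update_centers(points, centers)
        shift = max(math.dist(a, b) for a, b in zip(new, centers))
        centers = new
        if shift < tol:
            break
    labels = []
    for x in points:
        p, _ = probabilities(x, centers)
        labels.append(p.index(max(p)))                 # most probable cluster
    return centers, labels

# Hypothetical toy data: two well separated groups in the plane.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
centers, labels = d_clustering(points, centers=[(1.0, 1.0), (8.0, 8.0)])
```

For K=2 the probability of eq. (14) reduces to p_1 = d_2/(d_1+d_2), which makes the inverse relation between distance and membership probability explicit.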
Consensus p-Median
The cluster analysis of the NCI60 dataset relates to a set of objects (cell lines) that need to be grouped taking into account multiple sources (gene expression profiles and drug activity patterns). Most multi-source clustering approaches follow one of two paradigms: (a) clustering each data source separately and then integrating the separate clustering solutions ad hoc [26],[27], or (b) combining all data sources to determine a single “joint” clustering [28],[29], as in Traditional p-Median and Probabilistic D-Clustering. Approaches of the first kind are characterized by an independent analysis: while they take advantage of modeling source-specific features, they are not able to capture inter-source associations. Approaches of the second kind, on the other hand, are based on a joint analysis that is able to exploit shared structure among data sources, but they disregard the heterogeneity of the data and take no account of important features that are specific to each data source. More flexible methods allow for separate but dependent source clusterings [30],[31].
where ${z}_{\mathit{\text{ij}}}^{\ast}$ denotes the solution of problem (6)-(9). This problem formulation consists in assigning each cell line x_{ i } to a given cluster according to a distance measure computed in one space, d^{(1)} (for example the gene space). The constraints denoted by equations (20)-(23) have the same role as in the traditional p-Median formulation, while equation (19) constrains the cluster assignment by taking into account the cluster placement obtained during the first step. In particular, this constraint prevents the clustering solution of the Consensus p-Median from diverging, according to a value μ and to the distance measure d^{(2)}≠d^{(1)}, from the solution found at the first step. The parameter μ tunes the effect of the solution that optimizes F, i.e. $\sum _{i=1}^{\left|\Omega \right|}\sum _{j=1}^{\left|\Omega \right|}{z}_{\mathit{\text{ij}}}^{\ast}{d}_{\mathit{\text{ij}}}^{\left(2\right)}$. The parameter μ ranges between the lower bound μ=1.0 and an upper bound μ^{∗}. Setting μ=1 implies that the solution of the Consensus p-Median will generate the same assignment as the traditional p-Median solved at the first step. Increasing values of μ weaken the effect of the optimal assignment ${z}_{\mathit{\text{ij}}}^{\ast}$ coming from the first phase (μ can be increased incrementally until the convergence criterion is satisfied, i.e. the solution of the Consensus p-Median does not change for increasing values of μ). Note that for μ<1.0 no feasible solution exists.
The pseudocode reported in Algorithm 1 summarizes the iterative process for solving the Consensus p-Median until the value of μ^{∗} is found, i.e. until constraint (19) becomes redundant. For the sake of simplicity, we will denote with Consensus p-Median (gd) the approach where at step 2 the set Ω^{ G } is used and at steps 3 and 8 the set Ω^{ D } is exploited. On the other hand, we will denote with Consensus p-Median (dg) the approach where Ω^{ D } is exploited at step 2, while at steps 3 and 8 the set Ω^{ G } is used.
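The effect of μ can be illustrated on a toy instance. The sketch below is not Algorithm 1: it enumerates all feasible assignments by brute force, and it assumes that constraint (19) has the form Σ z_{ij} d^{(2)}_{ij} ≤ μ Σ z*_{ij} d^{(2)}_{ij}, which is inferred from the description above rather than copied from the equation. All function names and data are hypothetical.

```python
from itertools import combinations, product

def solutions(n, K):
    """Enumerate every feasible assignment: K medians, each point tied to one median."""
    for medians in combinations(range(n), K):
        others = [i for i in range(n) if i not in medians]
        for choice in product(medians, repeat=len(others)):
            assign = {j: j for j in medians}       # a median represents itself
            assign.update(zip(others, choice))
            yield assign

def cost(assign, dist):
    """Total distance between every point and the median it is assigned to."""
    return sum(dist[i][j] for i, j in assign.items())

def consensus_p_median(d1, d2, K, mu):
    """Minimise the cost in space 1 while staying within mu times the optimum of space 2."""
    n = len(d1)
    f_star = min(cost(a, d2) for a in solutions(n, K))   # first step: p-Median on d2
    feasible = [a for a in solutions(n, K) if cost(a, d2) <= mu * f_star + 1e-12]
    best = min(feasible, key=lambda a: cost(a, d1))      # raises ValueError if mu < 1
    return best, cost(best, d1), cost(best, d2)

# Hypothetical toy data: the two spaces suggest different groupings of four items.
g = [0.0, 1.0, 10.0, 11.0]    # "gene" space: {0,1} and {2,3} are close
d = [0.0, 10.0, 1.0, 11.0]    # "drug" space: {0,2} and {1,3} are close
d1 = [[abs(p - q) for q in g] for p in g]
d2 = [[abs(p - q) for q in d] for p in d]

_, tight_c1, tight_c2 = consensus_p_median(d1, d2, K=2, mu=1.0)
_, loose_c1, loose_c2 = consensus_p_median(d1, d2, K=2, mu=10.0)
```

With μ=1 only the first-step optimum is feasible, so the grouping of the second space is reproduced; with a large μ the constraint becomes redundant and the grouping of the first space is recovered.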
Feature selection
The clusters that can be generated by the above mentioned approaches represent sets of cell lines that show a similar response to anticancer therapy, also taking into account genomic information. This enables a feature selection activity that allows us to identify the subset of genes that could possibly regulate the cell response behavior. To compactly characterize the obtained clusters, we attempt to select a subset of genes that best represents the cell line membership. In order to validate the hypothesis that the obtained groups of cell lines embed useful information for the pharmacology of cancer, we applied two feature selection techniques known as Information Gain and Correlation-based Feature Subset Evaluation.
Information gain
We can therefore consider equation (26) as a measure of dependency between the density of variable a_{ t } (gene) and the distribution of the target c_{ k } (cluster). To compute the entropy in equation (26), the T nominal expression values need to be represented as discrete quantities. In order to discretize genes as up, down and normo regulated, a double filtering approach has been applied.
Once genes have been discretized, the value of IG(C,A) for each attribute can be computed, allowing genes to be ranked accordingly^{b}. The top 10 genes have been selected as the most representative to train the predictive model described subsequently.
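The ranking step can be sketched as follows. The sketch assumes the standard definition IG(C,A) = H(C) − H(C|A) for an already discretized gene A and cluster labels C; the gene names and values are hypothetical and the double filtering discretization is not reproduced.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature, clusters):
    """IG(C, A) = H(C) - H(C | A) for a discretized feature A and cluster labels C."""
    n = len(clusters)
    conditional = 0.0
    for value in set(feature):
        subset = [c for f, c in zip(feature, clusters) if f == value]
        conditional += len(subset) / n * entropy(subset)
    return entropy(clusters) - conditional

# Hypothetical discretized genes for six cell lines split into two clusters.
clusters = [0, 0, 0, 1, 1, 1]
genes = {
    "gene_a": ["up", "up", "up", "down", "down", "down"],       # mirrors the clusters
    "gene_b": ["up", "down", "normo", "up", "down", "normo"],   # unrelated to them
}
scores = {name: information_gain(values, clusters) for name, values in genes.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
```

A gene whose discretized level determines the cluster reaches the maximum gain H(C), while a gene distributed identically in every cluster has zero gain.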
Correlation-based feature subset evaluation
where r_{ bf } is the mean feature-cluster correlation, and r_{ ff } is the average feature-feature inter-correlation. The numerator can be viewed as an indication of how predictive of the cluster a set of features is, while the denominator indicates how much redundancy there is among the features^{b}. More details about CFS can be found in [17].
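The trade-off between predictiveness and redundancy can be seen numerically. The sketch assumes the usual CFS merit from [17], k·r_bf / sqrt(k + k(k−1)·r_ff), for a subset of k features; the equation itself is not reproduced in this text, and the input values are hypothetical.

```python
import math

def cfs_merit(k, r_bf, r_ff):
    """Assumed CFS merit of a subset of k features.

    r_bf: mean feature-cluster correlation; r_ff: mean feature-feature correlation.
    """
    return k * r_bf / math.sqrt(k + k * (k - 1) * r_ff)

single = cfs_merit(1, 0.5, 0.0)        # one feature: merit equals r_bf
independent = cfs_merit(4, 0.5, 0.0)   # four uncorrelated features reinforce each other
redundant = cfs_merit(4, 0.5, 1.0)     # four fully redundant features add nothing
```

Under this assumption, adding features that are individually predictive raises the merit only as long as they are not strongly correlated with each other.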
Prediction
A complete joint probability distribution over a set of random variables must specify a probability value for each possible instantiation of the set. For example, specifying an arbitrary joint distribution P(X^{1},X^{2},..,X^{Q}) for Q dichotomous variables requires a table with 2^{Q} entries. This complexity makes the probability model infeasible for any domain of realistic size. Bayesian Networks [18] offer a possible way to overcome this problem. The key component that reduces the complexity of the probability model is the assumption that each variable is directly influenced by only a few others.
This assumption is captured graphically by the dependency structure: a probability distribution is encoded by a directed acyclic graph whose nodes represent random variables and whose edges denote direct dependencies. Formally, a Bayesian Network asserts that each node (random variable) is conditionally independent of its non-descendants given its parents. This conditional independence assumption allows us to represent the joint probability distribution over the random variables concisely.
Instead of specifying the probability of X^{ s } conditional on all possible realizations of its predecessors X^{1},..,X^{s-1}, we can consider only its set of parents Pa(X^{ s }). More precisely, a set of variables Pa(X^{ s }) is defined as the Markovian parents of X^{ s } if Pa(X^{ s }) is a minimal set of predecessors of X^{ s } that makes X^{ s } independent of all the other predecessors.
where $P\left({x}_{i}^{s}\mid \mathit{\text{Pa}}\left({x}_{i}^{s}\right)\right)$ is described by a conditional probability distribution (CPD). These local conditional distributions correspond to the set μ of parameters.
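The saving obtained through the parent sets can be quantified with a small count. The sketch below compares the 2^{Q} entries of a full joint table with the number of probabilities needed by the local distributions of dichotomous variables (one free probability per configuration of the parents); the chain-shaped network is a hypothetical example, not the structure used in the paper.

```python
def full_joint_entries(q):
    """Table size of an arbitrary joint distribution over q dichotomous variables."""
    return 2 ** q

def bn_parameters(parents):
    """Free probabilities of a discrete BN over dichotomous variables.

    parents maps each variable to the list of its parents; each variable
    needs one probability per joint configuration of its parents.
    """
    return sum(2 ** len(pa) for pa in parents.values())

# Hypothetical chain X1 -> X2 -> X3 -> X4.
chain = {"X1": [], "X2": ["X1"], "X3": ["X2"], "X4": ["X3"]}
joint_size = full_joint_entries(len(chain))
bn_size = bn_parameters(chain)
```

The gap widens quickly: the joint table doubles with every added variable, whereas the factorized model grows only with the size of the parent sets.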
The upper part of the network, which comprises 10 nodes, represents the most relevant genes selected by the policies described in the previous sections. The central part of the network, which is composed of only one variable, denotes the cluster obtained by solving the clustering problems. The bottom part, which comprises n=1400 nodes, represents the drug responses to be predicted. These last variables have been discretized in order to train discrete CPDs and consequently a fully discrete BN. In particular, following the discretization introduced in [33], cell lines with log10(GI50) at least 0.8 standard deviations (SDs) above the mean were defined as resistant to the compound, whereas those with log10(GI50) at least 0.8 SDs below the mean were defined as sensitive. Cell lines with log10(GI50) within 0.8 SDs of the mean were considered to be intermediate. After this discretization process, the CPDs related to the dependency structure of the BN can be easily estimated, and the response of n drugs can then be predicted simultaneously given the expression values of the 10 relevant genes.
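The discretization rule for a single compound can be sketched as below. The sketch assumes the sample standard deviation (the text does not state whether the sample or population estimate is used), and the log10(GI50) values are hypothetical.

```python
import statistics

def discretize(log_gi50, threshold=0.8):
    """Label each cell line as resistant, sensitive or intermediate for one compound."""
    mean = statistics.mean(log_gi50)
    sd = statistics.stdev(log_gi50)      # sample SD: an assumption of this sketch
    labels = []
    for value in log_gi50:
        z = (value - mean) / sd
        if z >= threshold:
            labels.append("resistant")       # at least 0.8 SDs above the mean
        elif z <= -threshold:
            labels.append("sensitive")       # at least 0.8 SDs below the mean
        else:
            labels.append("intermediate")
    return labels

# Hypothetical log10(GI50) values of one drug across five cell lines.
responses = discretize([-7.0, -6.0, -5.0, -4.0, -3.0])
```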
Results and discussion

a cell line x_{ i } is removed from the training set Ω

clustering, feature selection, discretization and BN training are performed on the set {Ω\x_{ i }}

the removed cell line x_{ i } is then used as test for prediction in BN
where n_{ k } is the cardinality of cluster k. More specifically, the coefficient R has been computed with respect both to the gene and to the drug space, originating then two correlation coefficients: R^{ G } is computed considering the correlation between instances represented by their gene expression profiles, while R^{ D } is estimated considering the correlation between instances represented by their drug response profiles.
We also report the correlation indices of some baseline clustering approaches previously investigated for mining the NCI60 dataset: k-Means [14], Soft Topographic Vector Quantization (STVQ) [9] and Relational k-Means [11].
Each point of the series for the Consensus p-Median corresponds to a solution obtained for a given value of the parameter μ. The ordinate represents the correlation coefficient in the drug space (R^{ D } values), while the abscissa represents the correlation coefficient in the gene space (R^{ G } values).
An interesting remark is related to the average correlation indices of the proposed approach. All the solutions provided by the Consensus p-Median show a slightly better (averaged) Pearson Coefficient than the others. This implies that our approach leads to clusters that are more homogeneous both in terms of gene expression and drug activity than the clusters obtained by the other approaches. This is highlighted by the fact that most of the solutions determined by the Consensus p-Median dominate the ones generated by the other approaches. The most promising “competitor” is Relational k-Means, which leads to an almost homogeneous cluster configuration. In order to validate the significance of the results, confidence intervals have been estimated on the clustering solutions. Confidence intervals provide a range around the observed “effect size”, allowing us to understand how reliable the generated solutions are: the smaller the confidence interval, the more certain we are about the solution. In our specific case, the confidence intervals have been computed as follows. First, for each run l of the leave-one-out procedure, the Pearson Correlation Coefficients R_{ l } (specifically ${R}_{l}^{G}$ and ${R}_{l}^{D}$) have been estimated. Then, the mean and the confidence interval have been estimated over the leave-one-out results.
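The aggregation over leave-one-out runs can be sketched as follows. The text does not state which interval estimator was used; the sketch assumes a normal approximation (mean ± 1.96 standard errors), and the per-run correlation values are hypothetical.

```python
import math
import statistics

def mean_ci95(values):
    """Mean and half-width of a 95% confidence interval (normal approximation)."""
    mean = statistics.mean(values)
    half_width = 1.96 * statistics.stdev(values) / math.sqrt(len(values))
    return mean, half_width

# Hypothetical correlation coefficients R_l from five leave-one-out runs.
runs = [0.50, 0.52, 0.48, 0.51, 0.49]
mean, half_width = mean_ci95(runs)
```

With few runs a Student t quantile would give a wider interval than the factor 1.96 used here; the choice between the two is left open by the text.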
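Confidence intervals (at the 95% level) of the clustering solutions on the Sherf dataset
| Method | Parameter | Gene correlation (average) | Gene correlation (confidence ±) | Drug correlation (average) | Drug correlation (confidence ±) |
|---|---|---|---|---|---|
| Consensus p-Median (dg) | μ=1.1 | 0.4777 | 0.0015 | 0.8553 | 0.0007 |
| Consensus p-Median (dg) | μ=1.2 | 0.5073 | 0.0014 | 0.8595 | 0.0010 |
| Consensus p-Median (dg) | μ=1.3 | 0.5200 | 0.0015 | 0.8522 | 0.0009 |
| Consensus p-Median (dg) | μ=1.4 | 0.5265 | 0.0012 | 0.8497 | 0.0006 |
| Consensus p-Median (dg) | μ=1.5 | 0.5373 | 0.0008 | 0.8401 | 0.0005 |
| Consensus p-Median (dg) | μ=1.6 | 0.5401 | 0.0013 | 0.8357 | 0.0010 |
| Consensus p-Median (dg) | μ=1.7 | 0.5449 | 0.0008 | 0.8349 | 0.0006 |
| Consensus p-Median (dg) | μ=1.8 | 0.5464 | 0.0011 | 0.8334 | 0.0007 |
| Consensus p-Median (gd) | μ=1.1 | 0.5054 | 0.0010 | 0.8613 | 0.0008 |
| Consensus p-Median (gd) | μ=1.2 | 0.4586 | 0.0009 | 0.8604 | 0.0006 |
| Consensus p-Median (gd) | μ=1.3 | 0.4232 | 0.0016 | 0.8566 | 0.0014 |
| Consensus p-Median (gd) | μ=1.4 | 0.3735 | 0.0012 | 0.8366 | 0.0008 |
| Consensus p-Median (gd) | μ=1.5 | 0.3689 | 0.0011 | 0.8363 | 0.0007 |
| STVQ | α=0.0 | 0.5450 | 0.0037 | 0.8200 | 0.0035 |
| STVQ | α=0.1 | 0.5300 | 0.0031 | 0.8213 | 0.0030 |
| STVQ | α=0.2 | 0.5110 | 0.0034 | 0.8217 | 0.0019 |
| STVQ | α=0.3 | 0.4960 | 0.0045 | 0.8265 | 0.0033 |
| STVQ | α=0.4 | 0.4800 | 0.0039 | 0.8289 | 0.0042 |
| STVQ | α=0.5 | 0.4770 | 0.0058 | 0.8301 | 0.0028 |
| STVQ | α=0.6 | 0.4536 | 0.0051 | 0.8303 | 0.0031 |
| STVQ | α=0.7 | 0.4298 | 0.0031 | 0.8304 | 0.0039 |
| STVQ | α=0.8 | 0.4022 | 0.0029 | 0.8306 | 0.0033 |
| STVQ | α=0.9 | 0.3713 | 0.0046 | 0.8309 | 0.0027 |
| STVQ | α=1.0 | 0.3598 | 0.0028 | 0.8310 | 0.0029 |
| p-Median | − | 0.4596 | 0.0015 | 0.8366 | 0.0008 |
| k-Means | − | 0.4770 | 0.0058 | 0.8301 | 0.0028 |
| Relational k-Means | − | 0.4983 | 0.0023 | 0.8240 | 0.0012 |
| Probabilistic D-Clustering | − | 0.4122 | 0.0236 | 0.7916 | 0.0160 |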
Concerning the computational complexity of p-Median problems, it is well known that they belong to the NP-hard complexity class. However, some recent metaheuristics allow p-Median problems to be solved in $\mathcal{O}\left({\left|\Omega \right|}^{2}\right)$, making this kind of approach competitive with the others. While a p-Median can be solved in $\mathcal{O}\left({\left|\Omega \right|}^{2}\right)$, approaches like k-Means, Relational k-Means and STVQ have a computational complexity of $\mathcal{O}\left(\left|\Omega \right|IKQ\right)$ (where Q is related to the time spent on computing vector distances during the iterative procedure, I denotes the fixed number of iterations and K the number of clusters to be obtained). Considering that in our case Q≫|Ω|, because Q depends on the dimension of the vector space ${\mathbb{R}}^{m+n}$, it follows that a p-Median approach is more efficient than the others.
Confidence interval (at the 95% level) of the BN predictions (accuracy) on the Sherf dataset
IG  CFS  

Average  Confidence±  Average  Confidence±  
μ=1.1  85.05  0.50  84.49  0.59  
μ=1.2  85.24  0.53  84.56  0.60  
μ=1.3  85.44  0.55  84.70  0.57  
μ=1.4  85.21  0.57  84.78  0.56  
Consensus pMedian (dg)  μ=1.5  85.39  0.59  84.75  0.64 
μ=1.6  85.63  0.53  84.98  0.60  
μ=1.7  85.32  0.54  84.82  0.59  
μ=1.8  85.31  0.55  84.88  0.56  
μ=1.1  85.30  0.52  84.79  0.60  
μ=1.2  85.00  0.44  84.71  0.55  
Consensus pMedian (gd)  μ=1.3  84.75  0.48  84.29  0.69 
μ=1.4  84.55  0.55  84.43  0.63  
μ=1.5  84.20  0.52  83.88  0.60  
α=0.0  82.10  0.47  81.10  0.49  
α=0.1  82.30  0.49  81.40  0.54  
α=0.2  82.50  0.53  81.70  0.56  
α=0.3  82.70  0.51  82.00  0.59  
α=0.4  82.90  0.55  82.30  0.60  
STVQ  α=0.5  83.10  0.55  82.60  0.61 
α=0.6  83.00  0.58  82.40  0.61  
α=0.7  82.90  0.49  82.20  0.53  
α=0.8  82.80  0.45  82.00  0.50  
α=0.9  82.70  0.42  81.80  0.52  
α=1.0  82.60  0.41  81.60  0.51  
pMedian  −  82.90  0.49  83.34  0.61 
kMeans  −  83.10  0.59  82.80  0.65 
Relational kMeans  −  84.00  0.56  84.10  0.60 
Probabilistic DClustering  −  81.80  1.00  83.10  0.77 
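The "Average" and "Confidence±" columns above can be read as a mean accuracy and the half-width of a 95% interval around it. The paper does not state how the interval was computed; the sketch below assumes a normal approximation over repeated runs, and the accuracy values are hypothetical.

```python
import math
import statistics


def confidence_half_width(samples, z=1.96):
    """Half-width of a normal-approximation confidence interval for the mean.

    z = 1.96 corresponds to a 95% level under the normal approximation.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    return z * statistics.stdev(samples) / math.sqrt(len(samples))


# Hypothetical per-run accuracies (percent); not taken from the paper.
accuracies = [84.0, 85.0, 86.0, 85.0, 85.0]
mean = statistics.mean(accuracies)
half = confidence_half_width(accuracies)
print(f"{mean:.2f} ± {half:.2f}")
```

With few runs, a Student t quantile would give a wider and more cautious interval than the fixed z value used here.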
Gene Selection (IG) based on Consensus pMedian on the Sherf dataset
Gene name  Biological process  Role (referred literature) 

SPARC  Regulation of cell proliferation; signal transduction  SPARC is a secreted protein, acidic and rich in cysteines. It is a matrix associated protein that elicits changes in cell shape, inhibits cell cycle progression, and influences the synthesis of extracellular matrix. Clinical evidence indicates that SPARC expression correlates with tumor progression [42]. The gene product has been associated with tumor suppression but has also been correlated with metastasis based on changes to cell shape which can promote tumor cell invasion [43]. 
MAP1B  Microtubule bundle formation; negative regulation of intracellular transport  MAP1B interacts with a wide variety of proteins, and there is growing recognition that it plays a crucial role in cytoskeleton stability and may also have a role in other cellular functions [44]. DAPK1 promotes autophagy by binding to the microtubule-associated protein MAP1B, which is an LC3 interactor with anti-autophagic functions [45]. 
DNAJA3  Apoptosis; cell death; negative regulation of cell proliferation  DNAJA3 is an important cell death regulator and could exert tumor suppressor activity [46]. Results establish DNAJA3 as a novel regulator of p53-mediated apoptosis and suggest that therapies designed to enhance DNAJA3’s function in promoting mitochondrial localization of p53 and apoptosis could be effective in many cancers [47]. 
SGK1  Apoptosis  SGK1 is a downstream target of cell survival and is primarily regulated at the level of transcription [48],[49]. 
ELF3  Inflammatory response  Transcriptional inhibition of ELF3 could be one of the mechanisms of colonic carcinogenesis [50]. 
CDKN2A  Cell cycle arrest; cell cycle checkpoint; negative regulation of cell growth; negative regulation of cell proliferation  CDKN2A is an important tumor suppressor gene and is specifically required for p53 activation under oncogenic stress [51]. Suppression of CDKN2A, a cell-cycle regulator, occurs in essentially all common human cancers [52]. Inactivating these tumor suppressors directly promotes tumorigenesis due to lack of control over cellular processes [53]. 
SPINT2  Cellular component movement  SPINT2 plays an important role in controlling the aggressive nature and spread of cancer, displaying a unique therapeutic potential [54]. 
GJA1  Apoptosis  GJA1 is involved in several kinds of tumor, such as breast, lung, prostate and ovarian [55],[56],[57]. 
AKT3  Signal transduction  The AKT signaling pathway is activated in human cancers, with consequences for molecularly targeted therapies. AKT isoforms may play a positive or negative role in cell migration and invasion. AKT is also involved in the regulation of tumor angiogenesis [58]. 
EpCAM  Positive regulation of cell proliferation  EpCAM has oncogenic potential and is activated by release of its intracellular domain, which can signal into the cell nucleus by engagement of elements of the wnt pathway [59]. Regulated intramembrane proteolysis activates EpCAM as a mitogenic signal transducer in vitro and in vivo [60]. 
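The Information Gain ranking behind the table above can be illustrated with a small, self-contained sketch. It is not the WEKA implementation used for the experiments; it assumes discretised expression levels and takes the cluster label as the class attribute, and all gene names and values are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of discrete labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(feature, target):
    """IG(target; feature) = H(target) - H(target | feature)."""
    total = len(target)
    conditional = 0.0
    for value in set(feature):
        subset = [t for f, t in zip(feature, target) if f == value]
        conditional += len(subset) / total * entropy(subset)
    return entropy(target) - conditional


# Hypothetical data: cluster label per cell line and discretised expression
# levels of two made-up genes across the same six cell lines.
clusters = ["c1", "c1", "c1", "c2", "c2", "c2"]
gene_a = ["low", "low", "low", "high", "high", "high"]  # separates the clusters
gene_b = ["low", "high", "low", "high", "low", "high"]  # weakly informative

# Rank genes by decreasing information gain with respect to the cluster label.
ranking = sorted(
    {"gene_a": gene_a, "gene_b": gene_b}.items(),
    key=lambda item: information_gain(item[1], clusters),
    reverse=True,
)
print([name for name, _ in ranking])
```

Each gene is scored independently of the others, which is consistent with IG being far cheaper than a subset-based criterion such as CFS.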
Gene selection (CFS) based on consensus pMedian on the Sherf dataset
Gene name  Biological process  Role (referred literature) 

POLR2F  Protein kinase activity; DNA binding  POLR2F exhibited elevated levels in carcinomas compared to normal tissue samples suggesting a possible role for these molecules in colorectal cancer [61]. 
SPARC  Regulation of cell proliferation; signal transduction  SPARC is a secreted protein, acidic and rich in cysteines. It is a matrix associated protein that elicits changes in cell shape, inhibits cell cycle progression, and influences the synthesis of extracellular matrix. Clinical evidence indicates that SPARC expression correlates with tumor progression [42]. The gene product has been associated with tumor suppression but has also been correlated with metastasis based on changes to cell shape which can promote tumor cell invasion [43]. 
DNAJA3  Apoptosis; cell death; negative regulation of cell proliferation  DNAJA3 is an important cell death regulator and could exert tumor suppressor activity [46]. Results establish DNAJA3 as a novel regulator of p53-mediated apoptosis and suggest that therapies designed to enhance DNAJA3’s function in promoting mitochondrial localization of p53 and apoptosis could be effective in many cancers [47]. 
PTN  Regulation of cell proliferation and division  PTN is an angiogenic factor and has been found to be constitutively expressed in many human tumors of different cell types [62]. 
AIF1  Regulation of muscle cell proliferation  AIF1 can promote the growth of breast tumors by activating NF-kappaB signaling [63]. 
STMN4  Regulation of microtubule polymerization or depolymerization   
PSAP  Lipid binding  PSAP is involved in prostate cancer invasion [64] and inhibits tumor metastasis via paracrine and endocrine stimulation of stromal p53 and Tsp1 [65]. 
AKT3  Signal transduction  The AKT signaling pathway is activated in human cancers, with consequences for molecularly targeted therapies. AKT isoforms may play a positive or negative role in cell migration and invasion. AKT is also involved in the regulation of tumor angiogenesis [58]. 
FBXO7  Cell death; protein binding   
P4HA2  L-ascorbic acid binding  Overexpression of PRDX4 and P4HA2 was significantly associated with lymphatic metastasis in oral cavity squamous cell carcinoma [66]. P4HA2 was upregulated in breast tumor cells compared with adjacent normal tissues [67]. 
miRNA Selection (IG and CFS) based on Consensus pMedian on the Liu dataset
IG  CFS  

miRNA  Target gene  miRNA  Target gene 
hsa-miR-200a  AP3S1  hsa-miR-196b  HOXA7 
hsa-miR-429  ZEB2  hsa-miR-18b  OTX2 
hsa-miR-200b  ZEB2  hsa-miR-142-5p  SGPP1 
hsa-miR-200c  ZEB2  hsa-miR-100  TMPRSS13 
hsa-miR-141  AP3S1  hsa-miR-106a  DYNC1LI2 
hsa-miR-196b  HOXA7  hsa-miR-145  FAM108C1 
hsa-miR-18b  OTX2  hsa-miR-17*  HMGA2 
hsa-miR-100  TMPRSS13  hsa-miR-376c  PAX4 
hsa-miR-365  ZNF680  hsa-miR-211  ACSM2A 
hsa-miR-494  ARID4B  hsa-miR-503  MYH10 
Efficiency comparison (in seconds) of the entire framework on the Sherf dataset
(a) Clustering  

Clustering  Execution time 
Consensus pMedian (dg)  0.504 
Consensus pMedian (gd)  0.425 
STVQ  0.693 
pMedian  0.250 
kMeans  0.533 
Relational kMeans  0.966 
Probabilistic DClustering  0.590 
(b) Feature selection and prediction  
Feature selection  Execution time 
IG  0.27 
CFS  3980 
Prediction  Execution time 
Training  9.68 
Inference  110.227 
Efficiency comparison (in seconds) of the entire framework on the Liu dataset
(a) Clustering  

Clustering  Execution time 
Consensus pMedian (dg)  0.398 
Consensus pMedian (gd)  0.381 
STVQ  0.473 
pMedian  0.215 
kMeans  0.446 
Relational kMeans  0.687 
Probabilistic DClustering  0.482 
(b) Feature selection and prediction  
Feature selection  Execution time 
IG  0.130 
CFS  1321 
Prediction  Execution time 
Training  0.604 
Inference  4.791 
As a final remark, considering both the qualitative and the quantitative results, we can assert that Consensus pMedian, together with Information Gain and Bayesian Networks, represents an optimal trade-off between efficacy and efficiency for simultaneously predicting (in silico) anticancer responses.
Conclusion
In this paper, the problem of identifying a suitable profile of cancer patients by linking gene expression, drug responses and types of cancer has been addressed. A learning framework based on three building blocks has been proposed. The experimental results highlight three main findings: (1) the proposed Consensus pMedian is able to create groups of cell lines that are highly correlated both in terms of gene expression and drug response; (2) from a biological point of view, the gene selection performed on these clusters allows the identification of genes that are strongly involved in several cancer processes; (3) the prediction of drug responses, by using the patient profile obtained through clustering and gene selection, represents a promising step for predicting potentially useful drugs. Concerning ongoing research, several issues remain to be investigated. Among them, future work will focus on identifying a suitable number of clusters and on using more “selective” discretization policies. As far as the methodological approach is concerned, an interesting comparison is with multi-task learning approaches, which are able to simultaneously predict the drug responses given a subset of gene expressions. Candidates for future investigation are Marginal Regression [70] and Support Vector Machines [71] for multi-task learning. A further development of the proposed investigation relates to the exploitation of additional data sources, such as proteomic expression profiles, to better predict the drug response in tumour cells.
Availability of supporting data
The data sets supporting the results of this article are included within the article (and its additional files).
Endnotes
^{a} The solutions of the pMedian problems have been determined by using the CPLEX commercial solver.
^{b} For the feature selection process, the WEKA environment [72],[73] has been used.
^{c} For training and inference of Bayesian Networks, the BNT Matlab toolbox has been used. The toolbox is available for download at [74].
Additional files
Declarations
Acknowledgements
This work has been partially funded by Regione Lombardia: “Dote Ricercatori”  FSE and NEDD project.
Authors’ Affiliations
References
1. Van Steenbergen L, Elferink M, Krijnen P, Lemmens V, Siesling S, Rutten H, Richel D, KarimKos H, Coebergh J: Improved survival of colon cancer due to improved treatment and detection: a nationwide populationbased study in the Netherlands 19892006. Ann Oncol. 2010, 21 (11): 22062212. 10.1093/annonc/mdq227.
2. Joerger M, Thürlimann B, Savidan A, Frick H, Bouchardy C, Konzelmann I, ProbstHensch N, Ess S: A populationbased study on the implementation of treatment recommendations for chemotherapy in early breast cancer. Clin Breast Cancer. 2012, 12 (2): 102109. 10.1016/j.clbc.2011.10.005.
3. Blower PE, Verducci JS, Lin S, Zhou J, Chung JH, Dai Z, Liu CG, Reinhold W, Lorenzi PL, Kaldjian EP, Croce CM, Weinstein JN, Sadee W: MicroRNA expression profiles for the nci60 cancer cell panel. Mol Cancer Ther. 2007, 6 (5): 14831491. 10.1158/15357163.MCT070009.
4. Grills C, Jithesh PV, Blayney J, Zhang SD, Fennell DA: Gene expression metaanalysis identifies VDAC1 as a predictor of poor outcome in early stage nonsmall cell lung cancer. PLoS ONE. 2011, 6 (1): e14635. 10.1371/journal.pone.0014635.
5. Masica DL, Karchin R: Collections of simultaneously altered genes as biomarkers of cancer cell drug response. Cancer Res. 2013, 73 (6): 16991708. 10.1158/00085472.CAN123122.
6. Scherf U, Ross DT, Waltham M, Smith LH, Lee JK, Tanabe L, Kohn KW, Reinhold WC, Myers TG, Andrews DT, Scudiero DA, Eisen MB, Sausville EA, Pommier Y, Botstein D, Brown PO, Weinstein JN: A gene expression database for the molecular pharmacology of cancer. Nat Genet. 2000, 24 (3): 236244. 10.1038/73439.
7. Chang JH, Hwang KB, Zhang BT: Analysis of gene expression profiles and drug activity patterns by clustering and bayesian network learning. Methods of Microarray Data Analysis II. Edited by: Lin SM, Johnson KF. 2002, Springer US, New York, 169184.
8. Chang JH, Hwang KB, Oh SJ, Zhang BT: Bayesian network learning with feature abstraction for genedrug dependency analysis. J Bioinformatics Comput Biol. 2005, 3 (1): 6177. 10.1142/S0219720005000874.
9. Burger M, Graepel T, Obermayer K: Phase transitions in soft topographic vector quantization. Artificial Neural NetworksICANN’97. Edited by: Gerstner W, Germond A, Hasler M, Nicoud JD. 1997, Springer Berlin Heidelberg, New York, 619624.
10. Fersini E, Giordani I, Messina E, Archetti F: Relational clustering and bayesian networks for linking gene expression profiles and drug activity patterns. Proceedings of IEEE International Conference on Bioinformatics and Biomedicine Workshop: 14 November 2009; Washington DC. Edited by: Chen J. 2009, IEEE Computer Society, Washington DC, 2025.
11. Fersini E, Messina E, Archetti F, Manfredotti C: Combining gene expression profiles and drug activity patterns analysis: A relational clustering approach. J Math Modelling Algorithms. 2010, 9 (3): 275289. 10.1007/s1085201091402.
12. Archetti F, Giordani I, Vanneschi L: Genetic programming for anticancer therapeutic response prediction using the nci60 dataset. Comput Oper Res. 2010, 37 (8): 13951405. 10.1016/j.cor.2009.02.015.
13. Fersini E, Messina E, Leporati A: Discovering genedrug relationships for the pharmacology of cancer. Advances in Computational Intelligence  Communications in Computer and Information Science Series. Edited by: Greco S, BouchonMeunier B, Coletti G, Fedrizzi M, Matarazzo B, Yager R. 2012, Springer Berlin Heidelberg, New York, 117126.
14. MacQueen JB: Some methods for classification and analysis of multivariate observations. Proceedings of the fifth Berkeley Symposium on Mathematical Statistics and Probability; Berkeley. Edited by: LeCam LM, Neyman N. 1967, University of California Press, Berkeley, CA, 281297.
15. Iyigun C, BenIsrael A: A generalized weiszfeld method for the multifacility location problem. Oper Res Lett. 2010, 38 (3): 207214. 10.1016/j.orl.2009.11.005.
16. Quinlan JR: Induction of decision trees. Mach Learn. 1986, 1: 81106.
17. Hall M: Correlationbased feature selection for discrete and numeric class machine learning. Proceedings of Seventeenth International Conference on Machine Learning: June 29  July 2 2000; Stanford, CA. Edited by: Langley P. 2000, Morgan Kaufmann Publishers, San Francisco, 359366.
18. Pearl J: Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. 1988, Morgan Kaufmann Publishers, San Francisco.
19. Liu H, D’Andrade P, FulmerSmentek S, Lorenzi P, Kohn KW, Weinstein JN, Pommier Y, Reinhold WC: mRNA and microRNA expression profiles of the nci60 integrated with drug activities. Mol Cancer Ther. 2010, 9 (5): 10801091. 10.1158/15357163.MCT090965.
20. Lin SM, Johnson K: Methods of Microarray Data Analysis II. 2002, Springer US, New York.
21. Drezner Z: Facility Location: a Survey of Applications and Methods. 1995, Springer US, New York.
22. Järvinen P, Rajala J, Sinervo H: Technical note  a branchandbound algorithm for seeking the PMedian. Oper Res. 1972, 20 (1): 173178. 10.1287/opre.20.1.173.
23. Kaufman L, Rousseeuw PJ: Finding Groups in Data: An Introduction to Cluster Analysis. 1990, Wiley, New York.
24. Bradley PS, Mangasarian OL, Street WN: Clustering via concave minimization. Proceedings of Advances in Neural Information Processing Systems: December 25, 1996; Denver, CO. Edited by: Mozer MC, Jordan MI, Petsche T. 1996, MIT Press, Cambridge, MA, 68374.
25. Weiszfeld E: Sur le point pour lequel la somme des distances de n points donnés est minimum. Tohoku Math J. 1937, 43 (2): 355386.
26. Wang P, Domeniconi C, Laskey KB: Nonparametric bayesian clustering ensembles. Proceedings of the 2010 European conference on Machine learning and knowledge discovery in databases: Part III: 2024 September 2010; Barcelona. Edited by: Balcázar JL, Bonchi F, Gionis A, Sebag M. 2010, SpringerVerlag, Berlin, 435450.
27. Nguyen N, Caruana R: Consensus clusterings. Proceedings of the 7th IEEE International Conference on Data Mining: 2831 October 2007; Omaha, NE. Edited by: Ramakrishnan N, Zaïane OR, Shi Y, Clifton CW, Wu X. 2007, IEEE Computer Society, Washington DC, 607612.
28. Mo Q, Wang S, Seshan VE, Olshen AB, Schultz N, Sander C, Powers RS, Ladanyi M, Shen R: Pattern discovery and cancer gene identification in integrated cancer genomic data. PNAS. 2013, 110 (11): 42454250. 10.1073/pnas.1208949110.
29. Rey M, Roth V: Copula mixture model for dependencyseeking clustering. Proceedings of the 29th International Conference on Machine Learning: June 26July 1 2012; Edinburgh. Edited by: Langford J, Pineau J. 2012, Omnipress, Madison, WI, 927934.
30. Kirk P, Griffin JE, Savage RS, Ghahramani Z, Wild DL: Bayesian correlated clustering to integrate multiple datasets. Bioinformatics. 2012, 28: 32903297. 10.1093/bioinformatics/bts595.
31. Rogers S, Girolami M, Kolch W, Waters KM, Liu T, Thrall B, Wiley HS: Investigating the correspondence between transcriptomic and proteomic expression profiles using coupled cluster models. Bioinformatics. 2008, 24: 28942900. 10.1093/bioinformatics/btn553.
32. Korn EL, Troendle JF, McShane LM, Simon R: Controlling the number of false discoveries: application to highdimensional genomic data. J Stat Plann Inference. 2004, 124 (2): 379398. 10.1016/S03783758(03)002118.
33. Staunton JE, Slonim DK, Coller HA, Tamayo P, Angelo MJ, Park J, Scherf U, Lee JK, Reinhold WO, Weinstein JN, Mesirov JP, Lander ES, Golub TR: Chemosensitivity prediction by transcriptional profiling. PNAS. 2001, 98 (19): 1078710792. 10.1073/pnas.191368598.
34. Langley P, Iba W, Thompson K: An analysis of Bayesian classifiers. Proceedings of the 10th National Conference on Artificial Intelligence: July 1216 1992; San Jose, CA. Edited by: Swartout WR. 1992, AAAI Press, Palo Alto, CA, 223228.
35. Quinlan JR: C4.5: Programs for Machine Learning. 1993, Morgan Kaufmann Publishers, San Francisco, CA.
36. Aha DW, Kibler D, Albert MK: Instancebased learning algorithms. Mach Learn. 1991, 6 (1): 3766.
37. Vapnik V: Statistical Learning Theory. 1998, Wiley, New York.
38. Tsamardinos I, Borboudakis G, Christodoulou E, Røe OD: Chemosensitivity Prediction of Tumours Based on Expression, miRNA, and Proteomics Data. Int J Syst Biol Biomed Technol. 2012, 1 (2): 119.
39. Entrez gene database. [http://www.ncbi.nlm.nih.gov/gene/]
40. Maglott D, Ostell J, Pruitt KD, Tatusova T: Entrez gene: genecentered information at NCBI. Nucleic Acids Res. 2005, 33 (suppl 1): 5458.
41. Nagaraju GPC, Sharma D: Anticancer role of SPARC, an inhibitor of adipogenesis. Cancer Treat Rev. 2011, 37 (7): 559566. 10.1016/j.ctrv.2010.12.001.
42. Clark CJ, Sage EH: A prototypic matricellular protein in the tumor microenvironment where there’s SPARC, there’s fire. J Cell Biochem. 2009, 104 (3): 721732. 10.1002/jcb.21688.
43. Arnold SA, Brekken RA: SPARC: a matricellular regulator of tumorigenesis. J Cell Commun Signal. 2009, 3 (34): 255273. 10.1007/s1207900900724.
44. Riederer BM: Microtubuleassociated protein 1B, a growthassociated and phosphorylated scaffold protein. Brain Res Bull. 2007, 71 (6): 541558. 10.1016/j.brainresbull.2006.11.012.
45. Morselli E, Galluzzi L, Kepp O, Vicencio JM, Criollo A, Maiuri MC, Kroemer G: Antiand protumor functions of autophagy. Biochim Biophys Acta. 2009, 1793 (9): 15241532. 10.1016/j.bbamcr.2009.01.006.
46. Edwards KM, Münger K: Depletion of physiological levels of the human TID1 protein renders cancer cell lines resistant to apoptosis mediated by multiple exogenous stimuli. Oncogene. 2004, 23 (52): 84198431. 10.1038/sj.onc.1207732.
47. Ralph SJ, RodríguezEnríquez S, Neuzil J, Saavedra E, MorenoSánchez R: The causes of cancer revisited: “mitochondrial malignancy” and ROSinduced oncogenic transformation  why mitochondria are targets for cancer therapy. Mol Aspects Med. 2010, 31 (2): 145170. 10.1016/j.mam.2010.02.008.
48. Mikosz CA, Brickley DR, Sharkey MS, Moran TW, Conzen SD: Glucocorticoid receptormediated protection from apoptosis is associated with induction of the serine/threonine survival kinase gene, sgk1. J Biol Chem. 2001, 276 (20): 1664916654. 10.1074/jbc.M010842200.
49. Zhang L, Cui R, Cheng X, Du J: Antiapoptotic effect of serum and glucocorticoidinducible protein kinase is mediated by novel mechanism activating IκB kinase. Cancer Res. 2005, 65 (2): 457464.
50. Lee HJ, Chang JH, Kim YS, Kim SJ, Yang HK: Effect of etsrelated transcription factor (ERT) on transforming growth factor (TGF)beta type II receptor gene expression in human cancer cell lines. J Exp Clin Cancer Res. 2003, 22 (3): 477480.
51. Chen D, Shan J, Zhu WG, Qin J, Gu W: Transcriptionindependent ARF regulation in oncogenic stressmediated p53 responses. Nature. 2010, 464 (7288): 624627. 10.1038/nature08820.
52. Liggett W, Sidransky D: Role of the p16 tumor suppressor gene in cancer. J Clin Oncol. 1998, 16 (3): 11971206.
53. Virani S, Colacino JA, Kim JH, Rozek LS: Cancer epigenetics: a brief review. ILAR J. 2013, 53 (34): 359369.
54. Parr C, Jiang WG: Hepatocyte growth factor activation inhibitors (HAI1 and HAI2) regulate HGFinduced invasion of human breast cancer cells. Int J Cancer. 2006, 119 (5): 11761183. 10.1002/ijc.21881.
55. Toler CR, Taylor DD, GercelTaylor C: Loss of communication in ovarian cancer. Am J Obstet Gynecol. 2006, 194 (5): e2731. 10.1016/j.ajog.2006.01.024.
56. Li Z, Zhou Z, Welch DR, Donahue HJ: Expressing connexin 43 in breast cancer cells reduces their metastasis to lungs. Clin Exp Metastasis. 2008, 25 (8): 893901. 10.1007/s1058500892089.
57. Qin H, Shao Q, Curtis H, Galipeau J, Belliveau DJ, Wang T, AlaouiJamali MA, Laird DW: Retroviral delivery of connexin genes to human breast tumor cells inhibits in vivo tumor growth by a mechanism that is independent of significant gap junctional intercellular communication. J Biol Chem. 2002, 277 (32): 2913229138. 10.1074/jbc.M200797200.
58. Cheung M, Testa JR: Diverse mechanisms of AKT pathway activation in human malignancy. Current Cancer Drug Targets. 2013, 13 (3): 234244. 10.2174/1568009611313030002.
59. Munz M, Baeuerle PA, Gires O: The emerging role of EpCAM in cancer and stem cell signaling. Cancer Res. 2009, 69 (14): 56275629. 10.1158/00085472.CAN090654.
60. Maetzel D, Denzel S, Mack B, Canis M, Went P, Benk M, Kieu C, Papior P, Baeuerle PA, Munz M, Gires O: Nuclear signalling by tumourassociated antigen EpCAM. Nat Cell Biol. 2009, 11 (2): 162171. 10.1038/ncb1824.
61. Antonacopoulou AG, Grivas PD, Skarlas L, Kalofonos M, Scopa CD, Kalofonos HP: POLR2F, ATP6V0A1 and PRNP expression in colorectal cancer: new molecules with prognostic significance?. Anticancer Res. 2008, 28 (2B): 12211227.
62. Zhang N, Zhong R, PerezPinera P, Herradon G, Ezquerra L, Wang ZY, Deuel TF: Identification of the angiogenesis signaling domain in pleiotrophin defines a mechanism of the angiogenic switch. Biochem Biophys Res Commun. 2006, 343 (2): 653658. 10.1016/j.bbrc.2006.03.006.
63. Li T, Feng Z, Jia S, Wang W, Du Z, Chen N, Chen Z: Daintain/AIF1 promotes breast cancer cell migration by upregulated TNFα via activate p38 MAPK signaling pathway. Breast cancer Res Treatment. 2012, 131 (3): 891898. 10.1007/s105490111519x.
64. Hu S, Delorme N, Liu Z, Liu T, VelascoGonzalez C, Garai J, Pullikuth A, Koochekpour S: Prosaposin downmodulation decreases metastatic prostate cancer cell adhesion, migration, and invasion. Mol Cancer. 2010, 9 (30).
65. Kang SY, Halvorsen OJ, Gravdal K, Bhattacharya N, Lee JM, Liu NW, Johnston BT, Johnston AB, Haukaas SA, Aamodt K, Yoo S, Akslen LA, Watnick RS: Prosaposin inhibits tumor metastasis via paracrine and endocrine stimulation of stromal p53 and Tsp1. PNAS. 2009, 106 (29): 1211512120. 10.1073/pnas.0903120106.
66. Pan PW, Zhang Q, Bai F, Hou J, Bai G: Profiling and comparative analysis of glycoproteins in Hs578BST and Hs578T and investigation of prolyl 4hydroxylase alpha polypeptide II expression and influence in breast cancer cells. Biochemistry. 2012, 77 (5): 539545.
67. Chang KP, Yu JS, Chien KY, Lee CW, Liang Y, Liao CT, Yen TC, Lee LY, Huang LL, Liu SC, Chang YS, Chi LM: Identification of PRDX4 and P4HA2 as metastasisassociated proteins in oral cavity squamous cell carcinoma by comparative tissue proteomics of microdissected specimens using iTRAQ technology. J Proteome Res. 2011, 10 (11): 49354947. 10.1021/pr200311p.
68. microRNA.org  targets and expression. [http://www.microrna.org/]
69. Betel D, Wilson M, Gabow A, Marks DS, Sander C: The microRNA.org resource: targets and expression. Nucleic Acids Res. 2008, 36 (Database issue): D149D153.
70. Kolar M, Liu H: Marginal regression for multitask learning. Proceedings of the International Conference on Artificial Intelligence and Statistics: April 2123 2012; La Palma, Canary Islands. Edited by: Lawrence ND, Girolami M. 2012, JMLR.org., Cambridge, 647655.
71. Evgeniou T, Micchelli CA, Pontil M, ShaweTaylor J: Learning multiple tasks with kernel methods. J Mach Learn Res. 2005, 6 (4): 615637.
72. WEKA data mining software. [http://www.cs.waikato.ac.nz/ml/weka/]
73. Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH: The WEKA data mining software: an update. ACM SIGKDD Explorations Newslett. 2009, 11 (1): 1018. 10.1145/1656274.1656278.
74. Bayesian network toolbox. [https://code.google.com/p/bnt/]
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.