Methodology Article  Open  Published:
A pMedian approach for predicting drug response in tumour cells
BMC Bioinformaticsvolume 15, Article number: 353 (2014)
Abstract
Background
The complexity of biological data related to the genetic origins of tumour cells, originates significant challenges to glean valuable knowledge that can be used to predict therapeutic responses. In order to discover a link between gene expression profiles and drug responses, a computational framework based on Consensus pMedian clustering is proposed. The main goal is to simultaneously predict (in silico) anticancer responses by extracting common patterns among tumour cell lines, selecting genes that could potentially explain the therapy outcome and finally learning a probabilistic model able to predict the therapeutic responses.
Results
The experimental investigation performed on the NCI60 dataset highlights three main findings: (1) Consensus pMedian is able to create groups of cell lines that are highly correlated both in terms of gene expression and drug response; (2) from a biological point of view, the proposed approach enables the selection of genes that are strongly involved in several cancer processes; (3) the final prediction of drug responses, built upon Consensus pMedian and the selected genes, represents a promising step for predicting potential useful drugs.
Conclusion
The proposed learning framework represents a promising approach predicting drug response in tumour cells.
Background
Cancer is a disease treated with various strategies depending on the type of cancer and the stage of the disease. Generally, therapeutic agents are selected according to the specific cancer type and patient population, based on the effectiveness in largepopulation studies [1],[2]. Now, with the advances of the genomic era, a massive amount of highthroughput data has been made available for understanding the cancer system biology. The public available datasets composed of genomic data and drug responses offer the opportunity to reveal valuable knowledge about the hidden relationships between gene expression and drug activity of tumor cells, pointing out the conditions that bring a patient to be more responsive than others to a given therapeutic agent. Although data collection provides the baseline to enable a better understanding of cancer mechanisms, data integration and interpretation is still an open issue. Mathematical and statistical models of complex biological systems play a fundamental role in system biology, and in particular in cancer related issues. They can be exploited for exploratory purposes, to validate hypothesis and make predictions about quantities that are difficult or impossible to be measured in vivo.
In the last decade, several studies have been conducted to develop platforms on which cancer highthroughput computational analysis can be performed. Much of these computational approaches are targeted at predicting the drug sensitivity/resistance by means of statistical inference and regression methods able to take into account genomic information of hundreds of genes for determining a specific drug response [3][5]. However, the massive availability of chemical compounds as potential cancer therapies has opened to the investigation of in silico therapy response prediction which requires more sophisticated computational models and methods to optimize the experimental design of celldrug screenings.
A first attempt of genedrug integrative analysis was presented in [6] where, thanks to a hierarchical clustering algorithm, several investigations have been performed: (1) celltocell correlation on the basis of gene expression and drug activity profiles, (2) relationships between drug activity patterns and mechanisms of action, (3) genedrug correlation on the basis of gene expression and drug activity profiles. In subsequent investigations [7],[8], the triangle gene expression profiles, drug responses and cancer types has been explored by integrating unsupervised and supervised machine learning algorithms. The clustering approach based on Soft Topographic Vector Quantization (STVQ) [9] has shown that gene expression profiles are more related to the cancer type than to the drug activity patterns, while thanks to the structure learning of Bayesian Networks (BN) some biologically meaningful relationships among gene expression levels, drug activities and cancer types have been confirmed and in some cases revealed. More recent works [10][13] are targeted at integrating explorative approaches with predictive paradigm towards a computational genedrug screening. While [10],[11] and [12] are based on nondeterministic clustering approaches (kMeans, STVQ and Genetic Programming) for identifying relevant genes involved in cancer mechanisms and predictive of drug response, [13] introduces a framework based global optimization to cancel the randomness, and therefore the variance, of stochastic clustering results when predicting a therapy outcome.
The results of the above quoted papers and of a wide set of related approaches highlight several interesting considerations:

The explorative analysis performed through clustering approaches reveals that the tissue of origin is more related to the gene expression profile than the drug activity patterns. This suggests that the genomic information of a cell line plays a fundamental role, independently of the organ of origin, to understand anticancer therapy responses. This idea has been supported by the fact that several cell lines with a relatively high expression level of those genes regulating multidrug resistance have been clustered in the same group. This indicates that chemoresponse mechanisms are distributed across different tissues in the panel and that it should be possible to link drug responses to gene expression profiles.

In order to cancel the variability of results of stochastic clustering and to guarantee the convergence to a global minimum, we need to address the clustering problem by exact approaches able to find globally optimal solutions.

Computational approaches based on Bayesian Networks reveal interesting relationships among subsets of genes and drugs. The potential of Bayesian Networks encourages us to exploit this probabilistic model not only for deductive purposes, but also for prediction issues.
In order to achieve the final goal of simultaneously predicting the drug response of several compounds given a patient genomic profile, we propose a computational framework based on the following assumption: groups of cell lines homogeneous in terms of both gene expression profile and drug activity should be characterized by a subset of genes that explains the drug responses. To this purpose a threefolds analysis has been investigated: pMedian problem formulations to create clusters of homogeneous cell lines, Feature Selection Policies to select relevant genes and finally Bayesian Networks to predict drug responses of tumour cell lines. Computational results show that the proposed Consensus pMedian, combined with gene selection and BN inference engine, yields homogeneous clusters while guaranteeing good predictive power for inferring drug responses for a new cell line. This is also confirmed by the biological evaluation performed on the selected genes: according to the existing literature the set of genes used to train the BNs, which has been selected by using the groups of cell lines obtained by the proposed Consensus pMedian, has shown to be biologically relevant from an oncological point of view.
Methods
Problem formulation
The problem of simultaneously predicting the response of several therapeutic compounds given the patient genomic profile is addressed by a computational framework composed of three main building blocks:

1.
The creation of homogeneous groups of tumor cell lines by means of pMedian formulations. In particular, a novel Consensus pMedian formulation is proposed and compared with traditional state of the art approaches, i.e. kMeans [14], STVQ [9] and Relational kMeans [11] and Probabilistic DClustering [15].

2.
The selection of relevant genes able to predict the response of hundreds of drugs. We explore the potential of the solutions determined by solving the above mentioned pMedian problem formulation for identifying a subset of genes that characterizes each cluster, i.e. those subsets of genes that could be responsible of drug responses. To accomplish this task two main feature selection policies have been investigated, i.e. Information Gain [16] and Correlationbased Feature Subset Evaluation (CFS) [17].

3.
The simultaneous prediction of different drug responses by exploiting the potential of Bayesian Networks [18]. Establishing a straightforward dependency structure of the Bayesian Network, we explore the ability of the selected genes to predict a panel of drug responses given the genomic profiles of patients.
The proposed computational framework exploits the well known dataset provided by the U.S. National Cancer Institute. The dataset consists of 60 cell lines from 9 kinds of cancers, all extracted from human patients, where the tumors considered in the panel derive from colorectal, renal, ovarian, breast, prostate, lung and central nervous system as well as leukemia and melanoma cancer tissues.
For the cell lines in the panel, both transcript profiling and chemosensitivity patterns have been considered. In the following we will consider two datasets stemmed from the original one: (a) Scherf et al. based on cDNA arrays and (b) Liu et al. based on microRNA arrays. In both cases the dataset is defined as a set Ω of all cell lines x_{ i }, with i={1,…60}, into the real vector space ${\mathbb{R}}^{m+n}$:
where ${x}_{i}^{G}$ represents the transcript expression level as a vector into the space ${\mathbb{R}}^{m}$ and ${x}_{i}^{D}$ denotes the drug response as a vector into the space ${\mathbb{R}}^{n}$.
Sherf dataset: cDNA arrays and DTPtested chemical compounds
The Sherf dataset, originally presented in [6], denotes the gene expression profile ${x}_{i}^{G}$ by using the cDNA microarray technology and the drug response ${x}_{i}^{D}$ by assessing the grown inhibition activities (G I_{50}) after 48 hours of drug treatment through Sulphorhodamine B. We can consequently define Ω^{G} and Ω^{D} as the set of cell lines represented through their gene expression profiles and their drug activity responses respectively:
According to the Sherf representation ${\mathbb{R}}^{m}$, with m= 1375, includes genes selected from the original NCI60 dataset (characterized by 9073 genes) having 5 or fewer missing values and showing strong pattern of variation among the 60 cell lines (more than 3 measurements must have redgreen intensity ratios >2.6 or <0.38). The space ${\mathbb{R}}^{n}$, with n=1400, includes drugs contained into the original dataset, where each compound has been tested one at time and independently. Considering that among 1375 genes and 1400 drugs missing values were still present, they have been replaced by the average gene expression value (or the average drug activity) over the 60 cell lines. The gene expression profiles and drug activity response for Sherf dataset are available for download as Additional file 1: Sherf gene expression data and Additional file 2: Sherf drug activity data.
Liu dataset: microRNA arrays and drugs with known mechanism of action
MicroRNAs (miRNA in the following) are a group of short noncoding RNAs that regulate gene expression at the posttranscriptional level. They are involved in many biological processes, including development, differentiation, apoptosis, and carcinogenesis. Because miRNAs may play a role in the initiation and progression of cancer, they comprise a novel class of promising diagnostic and prognostic molecular markers and potential drug targets. In order to achieve our goal by exploiting the miRNA data, we considered the dataset presented in [19]. This dataset leads us to represent the sets Ω^{G} and Ω^{D} by means of 422 miRNA expression profile and 118 G I_{50} responses related to drugs with known mechanism of action. The same selection criterion applied on Sherf dataset has been exploited for Liu dataset. Concerning the miRNA expressions, in this dataset there are no missing values and more than 3 experiments have redgreen intensity ratios >2.6 or <0.38, implying no selection of miRNA and therefore a space ${\mathbb{R}}^{m=422}$. Regarding the drug space, ${\mathbb{R}}^{n=118}$ is characterized by the presence of missing values. As well as for Sherf dataset, they have been replaced by the average drug activity over the 60 cell lines. The miRNA expression profiles and drug activity response for Liu dataset are available for download as Additional file 3: Liu miRNA expression data and Additional file 4: Liu drug activity data.
Cluster analysis
Cluster analysis is aimed at discovering embedded patterns into a given dataset. From a high level point of view cluster analysis consists of partitioning a set of patterns into subsets (clusters) based on similarity, i.e. a cluster has to contain similar patterns and dissimilar patterns have to be in different clusters. This could be accomplished by partitioning data points into a prespecified number of clusters through the optimization of a cost function related to a similarity/dissimilarity measure between data points.
An important step in any clustering algorithm is to select a distance measure, which will determine how the similarity/dissimilarity of two data points is calculated. In order to perform a cluster analysis we chose one of the most used distance measures [20] based on Pearson Correlation (corr):
where x_{ i } and x_{ j } represent two cell lines and c o r r(x_{ i },x_{ j }) denotes the Pearson Correlation coefficient between x_{ i } and x_{ j }. Thanks to the distance measure denoted by equation (4), cell lines having high distance due to anticorrelated genes/drugs are likely placed in different clusters, while cell lines characterized by a small gap are expected to be clustered together. The adoption of a correlationbased metric instead of the Euclidean distance is motivated by its sensitivity with respect to magnitude: Euclidean distance is sensitive to scaling and differences in average expression level, whereas correlation is not.
In the following we present three different clustering approaches based on pMedian formulation: traditional pMedian, probabilistic dClustering and Consensus pMedian.
Traditional pMedian
The pMedian problem was originally designed for facility location planning [21], where the location of “pfacilities” relative to a set of “customers” has been formulated such that the sum of the shortest demand weighted distance between “customers” and “facilities” is minimized. In our investigation, the pMedian problem has been formulated as an assignment problem for creating groups of cell lines by using a “flat” representation of data, i.e. by representing each cell line as a vector in ${\mathbb{R}}^{m+n}$. Given a cell line x_{ i }∈Ω and K desired clusters, the clustering problem consists in assigning each x_{ i } to a cluster center x_{ j }, such that the intracluster distance is minimized and the intercluster distance is maximized.
Let Z be a matrix of dimension Ω×Ω, as:
where z_{ ij } represents the assignment variable that indicates whether a cell line x_{ i } is assigned to a cluster center x_{ j }. Note that the matrix Z has dimension Ω×Ω because each entry z_{ ij } denotes the potential association of a cell line x_{ i } to any of the points x_{ j } in Ω (where x_{ j } can be a cluster center or not).
The clustering problem can be formulated as follows:
s.t:
According to this formulation, the objective function in equation (6) denotes a combinatorial optimization problem whose objective is to minimize the distance between all data points belonging to the same cluster through the identification of optimal cluster centers x_{ j }∈Ω. Constraint (7) ensures that each cell line x_{ i } is assigned to only one cluster, constraint (8) guarantees that there will be exactly K clusters and constraint (9) ensures that if x_{ i } is assigned to x_{ j } then x_{ j } is a cluster center and therefore a median. The last constraint (10) guarantees integrality.
For seek of clarity, the above mentioned pMedian is a mathematical programming formulation (also known as generalized FermatWeber problem formulation) for uncapacitated facility location problems. The objective of this formulation is to minimize the sum of the distances from all data points x_{ i } to their respective cluster centers (geometric medians). In this paper the pMedian problem is solved deterministically^{a} by means of a canonical “branch and cut” algorithm [22]. The solution of the pMedian problem finds out not only the cluster assignments, but also the geometric medians as cluster representatives.
pMedian must not be confused with approaches like kMeans [14], kMedoids [23] and kMedians [24], which represent heuristic algorithms for approximating the above mentioned objective function. While kMeans computes a cluster representative (centroid) as mean vector of all points belonging to a cluster, kMedoids and kMedians select respectively k of the Ω data points as medoids (whose average distance to all the objects in the cluster is minimal) and medians (combination of multiple instances). On the contrary, a branchandcut algorithm on a pMedian formulation determines the set of p data points that minimize the sum of weighted distances to any points of the dataset and consequently finds out the cluster assignment for each data point. The geometric medians determined by solving the pMedian problem do not coincide neither with the centroids, medians or medoids (the only exceptions are for the 1dimensional case, where the geometric median coincides with the median and when in kMedoids the medoids are selected as median objects instead of computed as combination of multiple instances). In our investigation, the solution of the pMedian problem formulations are ensured to be the global optimum, while the ones originated by the heuristic approaches can correspond to local optimum among all possible solutions.
Probabilistic DClustering
The assignment problem presented above assumes to create K mutually exclusive clusters of cell lines, with similar profiles of gene expression and drug response. The crisp formulation can be relaxed by modelling probabilistic (or soft) assignments (with cluster membership probabilities), leading to a probabilistic pMedian named Probabilistic DClustering [15].
The formulation reported in (6)(10) can be therefore approximated by the following minimization problem:
s.t:
where the decision variables c_{ k } and p_{ k }(x_{ i }) denote the cluster centers c_{ k } and the probability of assigning the cell line x_{ i } to the cluster c_{ k } respectively. Each cell line can be finally assigned to the cluster center with the highest probability.
It can be easily noted that the formulation of Probabilistic DClustering is a further generalization of the pMedian (FermatWeber) problem, slightly different from the ones presented in equation (6)(10) but still belonging to the combinatorial optimization. While for Traditional pMedian the creation of K clusters is forced by the constraint (8), in Probabilistic DClustering the generation of K clusters is driven by the objective function.
A natural working principle to solve (11)(13) is to fix one set of variables and minimize the objective function with respect to the other set of variables, then fix the other set and minimize again, until convergence is achieved. An iterative method has been recently proposed to solve the problem, leading to a generalized Weiszfeld method [25], where centers and probabilities are sequentially updated. The iterative method alternates between:

Step 1: Probabilities UpdateProbabilities Update. Given the centers c_{ k } and the distance between each cell line x_{ i } and c_{ k }, the probabilities that x_{ i } is assigned to the cluster k can be estimated as:
$${p}_{k}\left({x}_{i}\right)=\frac{\underset{j\ne k}{\Pi}d\left({x}_{i},{c}_{j}\right)}{\sum _{l=1}^{K}\underset{m\ne l}{\Pi}d\left({x}_{i},{c}_{m}\right)}$$(14)Given the clusters, their centers, and the distances of data points from these centers, the probability of cluster membership at any point is assumed inversely proportional to the distance from (the center of) the cluster.

Step 2: Centers UpdateCenters Update. Given the probabilities p_{ k }(x_{ i }), the centers ${c}_{k}^{+}$ can be updated according to the current cluster distribution as:
$${c}_{k}^{+}=\frac{\sum _{i=1}^{\left\Omega \right}{\mu}_{k}\left({x}_{i}\right){x}_{i}}{\sum _{j=1}^{\left\Omega \right}{\mu}_{k}\left({x}_{j}\right)}$$(15)where
$${\mu}_{k}\left({x}_{i}\right)=\frac{{p}_{k}\left({x}_{i}^{2}\right)}{d\left({x}_{i},{c}_{k}\right)}$$(16)The centers are updated as convex combinations of these points, with weights determined by the working principle.
The iterative process stops when the centers stabilize, i.e. when
originating a clustering of cell lines in the space ${\mathbb{R}}^{m+n}$. The optimal clustering solution can be determined (see ref. [15]) by verifying the optimality of centers and assignments through the dual problem corresponding to the primal reported in Eq. (11)(13).
Consensus PMedian
The cluster analysis of the NCI60 dataset relates to a set of objects (cell lines) that need to be grouped taking into account multiple sources (gene expression profiles and drug activity patters). Most of the multisource clustering approaches follow one of the following paradigms: (a) clustering each data source separately to then adhoc integrate the separate clustering solutions [26],[27] or (b) combining all data sources to determine a single “joint” clustering [28],[29] as Traditional pMedian and Probabilistic DClustering. The first kind of approaches is characterized by an independent analysis: while they take advantage of modeling sourcespecific features, they are not able to capture intersource associations. On the other side, the second type of approaches is based on a joint analysis that is able to exploit shared structure among data sources, but disregarding the heterogeneity of the data and taking no account important features that are specific to each data source. More flexible methods allow for separate but dependent source clusterings [30],[31].
The characteristics of these more flexible approaches, along with the remarks highlighted in Background section, led us to define a Consensus pMedian formulation based on two steps: the first is aimed at determining groups of cell lines into the gene (or drug) space, whereas the second one determines the clusters of cell lines into the drug (or gene) space, while constraining the optimal solution in order to take into account the assignment of the first step. This approach aims at finding a tradeoff between gene expression and drug response profiles, by defining a sequence of two integer linear programming formulations. While the problem at the first step can be formulated as a traditional pMedian (in one of the two spaces, i.e. either gene or drug space), the second step leads to the definition of the following Consensus pMedian formulation:
s.t:
where ${z}_{\mathit{\text{ij}}}^{\ast}$ denotes the solution of problem (6)(9). This problem formulation consists in assigning each cell line x_{ i } to a given cluster according to a distance measure computed in one space d^{(1)}(for example the gene space). The constraints denoted by equations (20)(23) have the same role as in the traditional pMedian formulation, while equation (19) provides a constraint about the cluster assignment by taking into account the cluster placement occurred during the first step. In particular, this constraint avoids the clustering solution of the Consensus pMedian to diverge, according to a value μ and to the distance measure d^{(2)}≠d^{(1)}, from the solution found at the first step. The parameter μ tunes the effect of the solution that optimizes F, i.e. $\sum _{i=1}^{\left\Omega \right}\sum _{j=1}^{\left\Omega \right}{z}_{\mathit{\text{ij}}}^{\ast}{d}_{\mathit{\text{ij}}}^{\left(2\right)}$. The parameter μ ranges between the lower bound μ=1.0 and an upper bound μ^{∗}. μ=1 implies that the solution of the Consensus pMedian will generate the same assignment as the traditional pMedian solved at the first step. Increasing values of μ cause a decreasing effect of optimal assignment ${z}_{\mathit{\text{ij}}}^{\ast}$ coming from the first phase (μ can be updated incrementally until the convergence criterion is satisfied, i.e. the solution of the Consensus pMedian doesn’t change for increasing values of μ). Note that for μ<1.0 no feasible solution exists.
The pseudocode reported in Algorithm 1 summarizes the iterative process for solving the Consensus pMedian until the value of μ^{∗} is found, i.e. until constraint (19) becomes redundant. For the sake of simplicity, we will denote with Consensus pMedian (gd) the approach where at step 2 the set μ^{G} is used and at steps 3 and 8 the set μ^{D} is exploited. On the other hand, we will denote with Consensus pMedian (dg) the approach where μ^{D} is exploited at step 2, while at step 3 and 8 the set μ^{G} is used.
Feature selection
The clusters that can be generated by the above mentioned approaches represent sets of cell lines that show a similar response to anticancer therapy also taking into account genomic information. This enables a feature selection activity that allows us to to identify the subset of genes that could possibly regulate the cell response behavior. To compactly characterize the obtained clusters, we attempt to select a subset of genes that best represents the cell lines membership. In order to validate the hypothesis that the obtained groups of cell lines embed useful information for helping the pharmacology of cancer, we applied two feature selection techniques known as Information Gain and Correlationbased Feature Subset Evaluation.
Information gain
In order to determine the most relevant genes that characterize a cluster and therefore that can be responsible of drug response for the cell lines belonging to that cluster, a feature selection based on Information Gain has been applied. Information Gain measures the decrease in entropy when the feature is given vs absent. According to this measure a “good” feature can contribute, independently of any other feature, to reduce the uncertainty of each clusters given the attribute values. Formally, given a cluster attribute C representing the obtained clusters and a gene attribute A, denoting the expression level of a given gene, the Information Gain (IG) is computed as follows:
where
We can therefore consider equation (26) as a measure of dependency between the density of variable a_{ t } (gene) and the distribution of the target c_{ k } (cluster).To compute the entropy in equation (26), the T nominal expression values need to be represented as discrete quantities. In order to discretize genes as up, down and normo regulated, a double filtering approach has been applied.
In order to discretize genes as up, down and normo regulated, a double filtering approach has been applied. In particular, genes that are differentially expressed have been identified by applying FDR corrected pvalue test [32], with the requirement that the rate of false significant genes should not exceed 5% with a confidence of 99%. Once nonsignificant genes have been identified, a mean difference cutoff (on the log fold changes) has been applied to discriminate between up and downregulated genes among the significant ones. The mean difference cutoff μ corresponds to the minimum absolute value of expression such that a gene is not considered as nonsignificant. With FDR corrected pvalue, the mean difference cutoff corresponds once more to μ=0.86 (both Sherf and Liu datasets show a cutoff μ=0.86). According to the double filtering approach, genes with a fold change >+0.86 are considered as upregulated, while genes with a fold change <0.86 are considered as downregulated. Gene expression values into the interval [ 0.86,0.86] are identified as normoregulated. The discretization threshold can be easily grasped by looking at the volcano plot reported in Figure 1.
Once genes have been discretized, the value of I G(C,A) for each attribute can be computed allowing genes to be ranked accordinglyb. The top 10 genes have been selected as the most representative to train the predictive model described subsequently.
Correlationbased feature subset evaluation
An alternative feature selection method, able to evaluate the contribution of each gene, is the Correlationbased Feature Subset Evaluation (CFS). This approach assumes that good feature subsets contain features highly correlated with the cluster attribute, while yet uncorrelated with each other. The selection algorithm, which takes as input the genes discretized according to the FDR corrected pvalue test introduced above, is a heuristic that evaluates the merit of a subset of features, taking into account the usefulness of individual features for predicting the class label (cluster assignment) along with the level of intercorrelation among them. The merit of a subset S composed of g features can be estimated as:
where r_{ bf } is the mean featurecluster correlation, and r_{ ff } is the average featurefeature intercorrelation. The numerator can be viewed as an indication of how predictive of the cluster a set of features are, while the denominator of how much redundancy there is among the features^{b}. More details about CFS can be found in [17].
Prediction
The results obtained in the previous steps allow us to train a predictive model able to infer, for a new cancer patient, the multiple drug responses by using his/her gene expression profile of selected genes. One way of deriving a predictive model is to estimate a joint distribution for the set Q of features that characterize the dataset. The joint distribution for a sample x_{ i }∈Ω, where ${x}_{i}=\left\{{x}_{i}^{1},{x}_{i}^{2},\mathrm{..},{x}_{i}^{\leftQ\right}\right\}$, can estimated over the feature space as:
A complete joint probability distribution over a set of random variables must specify a probability value for each of the possible set instantiation. For example, if we consider to specify an arbitrary joint distribution P(X^{1},X^{2},..,X^{Q}) for Q dichotomous variables, a table with 2^{Q} entries is required. This complexity makes an infeasible probability model for any domain of realistic size. A possible solution that tries to overcome this problem is represented by Bayesian Networks [18]. The key component, that reduces the probability model complexity, is the assumption that each variable is directly influenced by only few others.
This assumption is captured graphically by the dependency structure: a probability distribution is encoded by a directed acyclic graph whose nodes represent random variables and edges denote direct dependencies. Formally, a Bayesian Network asserts that each node (random variable) is conditional independent of its nondescendants given its parents. This conditionally independence assumption allows us to represent concisely the joint probability distribution over the random variables.
If we consider a distribution over Q features, which can be arbitrarily ordered as X^{1},X^{2},..,X^{Q}, it can be decomposed as the product of Q conditional distributions:
Instead of specifying the probability of X^{s} conditional on all possible realizations of its predecessors X^{1},..,X^{s1}, we can consider only its set of parents P a(X^{s}). More precisely, a set of variables P a(X^{s}) is defined as the Markovian parents of X^{s} if P a(X^{s}) is a minimal set of predecessors of X^{s} that makes X^{s} independent on all the other predecessors.
The joint probability distribution can therefore defined as:
where $P\left({x}_{i}^{s}\mathit{\text{Pa}}\left({x}_{i}^{s}\right)\right)$ is described by a conditional probability distribution (CPD). These local conditional distributions correspond to the set μ of parameters.
Figure 2 shows the dependency structure of BN used for the prediction task. We could gain an insight of how the expression pattern of genes influences the activity level of drugs through the cluster assignment. This structure of BN has been defined to train a probabilistic model able to simultaneously predict the drug responses of a new cell line, only by providing its (selection of) gene expression profile^{c}.
The upper part of the network, which comprises 10 nodes, represents the most relevant genes selected by the policies described in the previous sections. The central part of the network, which is composed of only one variable, denotes the cluster obtained by solving the clustering problems. The bottom part, which comprises n=1400 nodes, represents the drug responses to be predicted. These last variables have been discretized in order to train discrete CPDs and consequently a fully discrete BN. In particular, following the discretization introduced in [33], cell lines with log10(GI50) at least 0.8 SDs above the mean were defined as resistant to the compound, whereas those with log10(GI50) at least 0.8 Standard Deviations below the mean were defined as sensitive. Cell lines with log10(GI50) within 0.8 Standard Deviations of the mean were considered to be intermediate. The remaining cell lines within 0.8 Standard Deviations were defined as intermediate. After this discretization process, the CPDs related to the dependency structure of the BN can be easily estimated, to then simultaneously predict the response of n drugs given the expression value of the 10 relevant genes.
Results and discussion
In order to evaluate the quality of the proposed framework, a threefold analysis has been performed. In our experimental investigation we consider K=9 clusters to be obtained, which respects the number of tumor types considered by the NCI60 panel. In order to avoid overfitting and to report unbiased experiments, each step of the proposed framework (clustering, feature selection, discretization and prediction) has been enclosed in a leaveoneout cross validation. For seek of clarity, the leaveoneout procedure works as follows:

a cell line x_{ i } is removed from the training set μ

clustering, feature selection, discretization and BN training are performed on the set {Ω\x_{ i }}

the removed cell line x_{ i } is then used as test for prediction in BN
Results reported in the following are therefore averaged over the leaveoneout folds. The first analysis is concerned with the average Pearson Correlation Coefficient for estimating how homogeneous the clusters are. Given the obtained K clusters, the Pearson Correlation Coefficient R is computed as follows:
where n_{ k } is the cardinality of cluster k. More specifically, the coefficient R has been computed with respect both to the gene and to the drug space, originating then two correlation coefficients: R^{G} is computed considering the correlation between instances represented by their gene expression profiles, while R^{D} is estimated considering the correlation between instances represented by their drug response profiles.
We also report the correlation indices of some baseline clustering approaches previously investigated for mining the NCI60 dataset: kMeans [14], Soft Topographic Vector Quantization (SVTQ) [9] and Relational kMeans [11].
In Figures 3 and 4, a comparison in terms of correlation (averaged on the leaveoneout folds) between the investigated clustering approaches is depicted reporting the traditional pMedian, Probabilistic DClustering, kMeans, SVTQ, Relational kMeans and the proposed Consensus pMedian. For the Consensus pMedian two series are reported, i.e. Consensus pMedian (gd) and Consensus pMedian (dg).
Each point of the series for the Consensus pMedian corresponds to a solution obtained according to the parameter μ. The ordinate axis represents the correlation coefficients in the drug space (R^{D} values), while the abscissae axis the correlation in the gene space (R^{G} values).
An interesting remark is related to the average correlation indices of the proposed approach. All the solutions provided by the Consensus pMedian show a slightly better (averaged) Pearson Coefficient than the others. This implies that our approach leads to clusters that are more homogeneous both in terms of gene expression and drug activity than the clusters obtained by the other approaches. This is highlighted by the fact that most of the solutions determined by the Consensus pMedian dominate the ones generated by the other approaches. The most promising “competitor” is Relational kMeans, which leads to almost homogeneous cluster configuration. In order to validate the significance of the results, confidence intervals have been estimated on the clustering solutions. Confidence intervals provide a range about the observed “effect size”, allowing us to understand how likely the generated solutions are: the smaller the confidence interval, the more certain we are about the solution. In our specific case, the confidence intervals have been computed as follows. First, for each run l of the leaveoneout, the Pearson Correlation Coefficients R_{ l } (specifically ${R}_{l}^{G}$ and ${R}_{l}^{D}$) have been estimated. Then, the mean and the confidence interval have been estimated over the leaveoneout results.
In Table 1 confidence intervals are reported for the investigated clustering approaches on the Sherf dataset. We can easily note that the leaveoneout cross validation procedure provides a small effect size for most of the approaches both for correlations in the gene and drug space. We can state therefore that, with a confidence level of 95%, that the results are robust, enabling an easy identification of optimal solutions as Pareto points (marked as bold). Similar results have been obtained on the Liu dataset.
Concerning the computational complexity related to pMedian problems, it is well known that they belong to the NPHard complexity class. However, some recent metaheuristcs allow to solve the pMedian problems in $\mathcal{O}\left(\Omega {}^{2}\right)$ making these kind of approaches competitive in respect of others. While a pMedian can be solved in $\mathcal{O}\left(\Omega {}^{2}\right)$, approaches like kMeans, Relational kMeans and STVQ have a computational complexity of $\mathcal{O}\left(\left\Omega \right\mathit{\text{IKQ}}\right)$ (where Q is related to the time spent on computing vector distances during the iterative procedure, I denotes the fixed number of iterations and K the clusters to be obtained). Considering that in our case Q⟫Ω because Q depends on the vector dimension ${\mathbb{R}}^{m+n}$, it follows that a pMedian approach is more efficient than others.
A further validation is targeted at the correctness of Bayesian Networks to predict the drug responses. In particular, we have measured the prediction accuracy of BNs trained with the top relevant genes characterizing the groups of cell lines derived by the mentioned clustering approaches. We have also reported the accuracy of a trivial classifier as baseline, where the prediction of a drug response is performed according to its majority class on the training data. In Figures 5, 6, 7 and 8 the comparison in terms of accuracy, i.e. percentage of drug response correctly predicted, between the investigated approaches is shown. The BNs are trained according to the (top ten) genes selected by the Information Gain and Correlationbased Subset Evaluation policies. In particular, considering that the experimental investigation is performed by means of a leaveoneout cross validation, the relevant genes to be used for training BNs have been selected as the most frequent over the top ten genes selected for each fold of the cross validation. Specifically, given the L=60 solutions obtained by performing a leaveoneout (for each given clustering approach), a voting mechanism has been applied. Each gene received a vote if, in a given run of the leaveoneout, it appears in the top ten list of relevant genes. Once the votes have been collected, the 10 genes with the highest number of votes are selected as the most important and therefore used to train the BN. It can be easily noted that all the solutions generated by the proposed approach outperform the ones obtained by the other methods. Concerning the Sherf dataset (Figures 5 and 6), the BNs trained according to the Consensus pMedian are able to ensure an average prediction accuracy of 85.63% with IG and 84.98% CFS selection policies, outperforming the accuracy of Probabilistic DClustering (81.8% with IG and 83.1% with CFS), pMedian (82.9% with IG and 83.33% with CFS), STVQ (83.1% with IG and 82.6% with CFS), kMeans (83.2% with IG and 82.8 with CFS), Relational kMeans (84.0% with IG and 84.2 with CFS) and the trivial classifier (80.5%).
Confidence intervals reported in Table 2 show not only the ability of IG selection policy to obtain small variability in the expected predictions, but also the correspondence between Pareto points and the most promising (in terms of average accuracy) Bayesian Networks (analogous results have been obtained on Liu dataset). This correspondence allows us to assert that the proposed Consensus pMedian is able to create groups of homogeneous cell lines taking into account two different data source and, as consequence, to derive a prediction model that outperforms the others. A similar result is obtained on the Liu dataset (Figures 7 and 8), where the proposed approach achieves more accurate predictions with respect to the other ones.
Considering that the ultimate goal of this paper is the prediction of anticancer drug responses, we have also compared the BNs trained according to the Consensus pMedian with some traditional supervised methods. In particular, our approach has been compared with Naive Bayes (NB) [34], Decision Tree (DT) [35], 1Nearest Neighbor (1NN) [36] and Linear Support Vector Machines (SVM) [37] classifiers. Each model has been trained to predict one drug at a time by using all the available genes. Also for these classifiers a leaveoneout validation has been performed. For the Sherf Dataset, our approach obtained the highest performance with 85.63% of accuracy, against the ones of DT (81.00%), 1NN (82.00%), NB (82.51%) and SVM (83.01%). Concerning the Liu dataset, similar results have been obtained. In particular, the proposed approach is able to achieve an accuracy of 84.22% compared with DT (80.00%), 1NN (78.16%), NB (82.00%) and SVM (79.11%). A summary of the accuracy confidence intervals, both for Sherf and Liu datasets, is depicted in Figure 9. We can easily point out that the models based on RNA (Sherf dataset) are able to achieve higher average accuracy, with smaller confidence intervals, than the models based on microRNA (Liu dataset). A more interesting remark relates to the comparison of the considered models. Figure 9 confirms that the proposed approach, based on Consensus pMedian, guarantees not only the highest average prediction accuracy of drug response, but also a nonoverlapping confidence interval. It’s also interesting to note that the performance of all the trained models have a small gap with respect to the trivial classifier, highlighting that (as similarly demonstrated in [38]) a relatively small number of drugs can be actually predicted better than the trivial baseline.
The last evaluation has been performed from a biological point of view in order to highlight the functional role of the most informative genes characterizing each cluster. Indeed, since clusters represent sets of cell lines that show a similar response to anticancer therapies also taking into account genomic information, the feature selection activity should be able to identify the subset of genes that could possibly regulate the cells response behaviour. In order to validate this hypothesis we searched for the biological functions associated to the selected genes by accessing the Entrez Gene Database ([39]), which is a NCBI’s (National Center for Biotechnology Information) database for genespecific information ([40]). In Tables 3 and 4 we report the 10 genes that have been used to train the best Bayesian Network for the Sherf dataset, that correspond to the configuration obtained by the Consensus pMedian (dg) with μ=1.6 both for IG and CFS. Column one reports the official gene name on Entrez Gene, column two contains a description of the main biological processes in which the gene is involved, and finally column three reports a description  extracted from the literature  of the role of the gene with respect to cancer mechanisms. Concerning IG selection policy on the Sherf dataset (Table 3), it’s interesting to highlight that most of the selected genes are recognized in the literature as biologically relevant. It’s also interesting to underline that some of the selected genes have been identified as relevant in previous investigations [7],[11]: SPARC, SGK1, DNAJA3, ELF3 and GJA1. A remarkable evidence is provided by genes CDKN2A and DNAJA3 as tumor suppressors, genes SPARC and GJA1 as tumor marker and finally SGK1 as prognostic marker. In particular SPARC, considered relevant in this study as well as in [7] and [11], has a reputation for being a potent anticancer. It has been shown to be involved in cell cycle, cell invasion, adhesion, migration, angiogenesis and apoptosis both in vitro and in vivo (see [41] for an overview of the multifunctional role of SPARC in cancer). Regarding CFS selection policy on the Sherf dataset (Table 4), we can note that some of the selected genes can be considered marginal from an oncological point of view, while others are shared with the ones obtained by applying IG (SPARC, DNAJA3 and AKT3).
Concerning the Liu dataset, the analysis of the selected relevant miRNAs can be only preliminary. Although miRNAs represent a recently discovered class of noncoding RNAs that play a fundamental role in the regulation of gene expression, most of their functions still remain to be discovered. For this reason, we can only report the genes whose mRNA can interact with the considered miRNA (see Table 5). The listed genes have been selected by accessing the data available at “microRNA.org  Targets and Expression” ([68]), which is a freely available opensource software able to provide microRNA target predictions [69]. In order to provide a preliminary evaluation of the selected miRNA, we can point out their involvement in the “microRNA in cancer” pathway (from KEGG source record hsa05206). As highlighted in Table 5, some miRNAs belong to the above mentioned pathway: hsamiR200a, hsamiR200b, hsamiR200c, hsamiR141, hsamiR100 and hsamiR494 identified by the IG selection, and hsamiR145 and hsamiR17* determined by the CFS policy. We highlight that (as for Sherf dataset) some miRNA are shared between the two selection policies (hsamiR196b, hsamiR18b and hsamiR100).
In order to analyze the computational time of the entire framework, a comparison over the considered datasets has been reported in Tables 6 and 7. The execution time (in terms of seconds) has been reported for the three phases, i.e. clustering, feature selection and prediction. Concerning the clustering phase, it can be noted in Tables 6(a) and 7(a) that the most efficient approach is traditional pMedian, followed by the proposed Consensus pMedian (the proposed approach has to solve two pMedian problems instead of only only one). Tables 6(b) and 7(b) report the computational effort required by the feature selection policies (IG and CFS) and training and inference phases needed for prediction with BNs. The selection strategy based on CFS is clearly computationally intensive because it requires the search of the suboptimal set of features that could compactly represent the clusters. On the contrary, IG is more efficient thanks to its ability to evaluate each feature (gene) independently on the others. Considering BNs, the training step requires a quite limited computational effort because only Conditional Probability Tables need to be estimated (the dependency structure is fixed a priori). The time required by the inference step is mainly influenced by the number of drugs that are simultaneously predicted, i.e. more therapeutic compounds are considered and more time is necessary to estimate their posterior probability of being sensitive, resistant and intermediate.
As final remark, considering both qualitative and quantitative results, we can assert that Consensus pMedian together with Information Gain and Bayesian Network represent an optimal tradeoff between efficacy and efficiency to simultaneously predict (in silico) anticancer responses.
Conclusion
In this paper the problem of identifying a suitable profile of cancer patients by linking gene expressions, drug responses and types of cancer has been addressed. A learning framework based on three building blocks has been proposed. The experimental results highlight three main findings: (1) the proposed Consensus pMedian is able to create groups of cell lines that are highly correlated both in terms of gene expression and drug response; (2) from a biological point of view, the gene selection performed on these clusters allows the identification of genes that are strongly involved in several cancer processes; (3) the prediction of drug responses, by using the patient profile obtained through clustering and gene selection, represents a promising step for predicting potentially useful drugs. Concerning the ongoing research, several issues are still to be investigated. Among them the next future work will be focused to the identification of a suitable number of clusters and the use of more “selective” discretization policies. As far is concerned with the methodological approach, an interesting comparison relates with those approaches, belonging to the multiple tasks learning, able to simultaneously predict the drug responses given a (subset) of gene expressions. Instances of future investigations are Marginal Regression [70] and Support Vector Machine [71] For Multitask Learning. A further development of the proposed investigation relates to the exploitation of additional data sources, such as proteomic expression profiles, to better predict the drug response in tumour cells.
Availability of supporting data
The data sets supporting the results of this article are included within the article (and its additional files).
Endnotes
^{a} The solutions of the pMedian problems have been determined by using the CPLEX commercial solver.
^{b} For the feature selection process, the WEKA environment [72] has been exploited [73].
^{c} For training and inference of Bayesian Networks, the BNT Matlab toolbox has been used. The toolbox is available for download at [74].
Additional files
References
 1.
Van Steenbergen L, Elferink M, Krijnen P, Lemmens V, Siesling S, Rutten H, Richel D, KarimKos H, Coebergh J: Improved survival of colon cancer due to improved treatment and detection: a nationwide populationbased study in the Netherlands 19892006. Ann Oncol. 2010, 21 (11): 22062212. 10.1093/annonc/mdq227.
 2.
Joerger M, Thürlimann B, Savidan A, Frick H, Bouchardy C, Konzelmann I, ProbstHensch N, Ess S: A populationbased study on the implementation of treatment recommendations for chemotherapy in early breast cancer. Clin Breast Cancer. 2012, 12 (2): 102109. 10.1016/j.clbc.2011.10.005.
 3.
Blower PE, Verducci JS, Lin S, Zhou J, Chung JH, Dai Z, Liu CG, Reinhold W, Lorenzi PL, Kaldjian EP, Croce CM, Weinstein JN, Sadee W: MicroRNA expression profiles for the nci60 cancer cell panel. Mol Cancer Ther. 2007, 6 (5): 14831491. 10.1158/15357163.MCT070009.
 4.
Grills C, Jithesh PV, Blayney J, Zhang SD, Fennell DA: Gene expression metaanalysis identifies VDAC1 as a predictor of poor outcome in early stage nonsmall cell lung cancer. PLoS ONE. 2011, 6 (1): e1463510.1371/journal.pone.0014635.
 5.
Masica DL, Karchin R: Collections of simultaneously altered genes as biomarkers of cancer cell drug response. Cancer Res. 2013, 73 (6): 16991708. 10.1158/00085472.CAN123122.
 6.
Scherf U, Ross DT, Waltham M, Smith LH, Lee JK, Tanabe L, Kohn KW, Reinhold WC, Myers TG, Andrews DT, Scudiero DA, Eisen MB, Sausville EA, Pommier Y, Botstein D, Brown PO, Weinstein JN: A gene expression database for the molecular pharmacology of cancer. Nat Genet. 2000, 24 (3): 236244. 10.1038/73439.
 7.
Chang JH, Hwang KB, Zhang BT: Analysis of gene expression profiles and drug activity patterns by clustering and bayesian network learning. Methods of Microarray Data Analysis II. Edited by: Lin SM, Johnson KF. 2002, Springer US, New York, 169184.
 8.
Chang JH, Hwang KB, Oh SJ, Zhang BT: Bayesian network learning with feature abstraction for genedrug dependency analysis. J Bioinformatics Comput Biol. 2005, 3 (1): 6177. 10.1142/S0219720005000874.
 9.
Burger M, Graepel T, Obermayer K: Phase transitions in soft topographic vector quantization. Artificial Neural NetworksICANN’97. Edited by: Gerstner W, Germond A, Hasler M, Nicoud JD. 1997, Springer Berlin Heidelberg, New York, 619624.
 10.
Fersini E, Giordani I, Messina E, Archetti F: Relational clustering and bayesian networks for linking gene expression profiles and drug activity patterns. Proceedings of IEEE International Conference on Bioinformatics and Biomedicine Workshop: 14 November 2009; Washington DC. Edited by: Chen J. 2009, IEEE Computer Society, Washington DC, 2025.
 11.
Fersini E, Messina E, Archetti F, Manfredotti C: Combining gene expression profiles and drug activity patterns analysis: A relational clustering approach. J Math Modelling Algorithms. 2010, 9 (3): 275289. 10.1007/s1085201091402.
 12.
Archetti F, Giordani I, Vanneschi L: Genetic programming for anticancer therapeutic response prediction using the nci60 dataset. Comput Oper Res. 2010, 37 (8): 13951405. 10.1016/j.cor.2009.02.015.
 13.
Fersini E, Messina E, Leporati A: Discovering genedrug relationships for the pharmacology of cancer. Advances in Computational Intelligence  Communications in Computer and Information Science Series. Edited by: Greco S, BouchonMeunier B, Coletti G, Fedrizzi M, Matarazzo B, Yager R. 2012, Springer Berlin Heidelberg, New York, 117126.
 14.
MacQueen JB: Some methods for classification and analysis of multivariate observations. Proceedings of the fifth Berkeley Symposium on Mathematical Statistics and Probability; Berkeley. Edited by: LeCam LM, Neyman N. 1967, University of California Press, Berkeley, CA, 281297.
 15.
Iyigun C, BenIsrael A: A generalized weiszfeld method for the multifacility location problem. Oper Res Lett. 2010, 38 (3): 207214. 10.1016/j.orl.2009.11.005.
 16.
Quinlan JR: Induction of decision trees. Mach Learn. 1986, 1: 81106.
 17.
Hall M: Correlationbased feature selection for discrete and numeric class machine learning. Proceedings of Seventeenth International Conference on Machine Learning: June 29  July 2 2000; Stanford, CA. Edited by: Langley P. 2000, Morgan Kaufmann Publishers, San Francisco, 359366.
 18.
Pearl J: Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. 1988, San Francisco, Morgan Kaufmann Publishers
 19.
Liu H, D’Andrade P, FulmerSmentek S, Lorenzi P, Kohn KW, Weinstein JN, Pommier Y, Reinhold WC: mRNA and microRNA expression profiles of the nci60 integrated with drug activities. Mol Cancer Ther. 2010, 9 (5): 10801091. 10.1158/15357163.MCT090965.
 20.
Lin SM, Johnson K: Methods of Microarray Data Analysis II. 2002, Springer US, New York
 21.
Drezner Z: Facility Location: a Survey of Applications and Methods. 1995, Springer US, New York
 22.
Järvinen P, Rajala J, Sinervo H: Technical note  a branchandbound algorithm for seeking the PMedian. Oper Res. 1972, 20 (1): 173178. 10.1287/opre.20.1.173.
 23.
Kaufman L, Rousseeuw PJ: Finding Groups in Data: An Introduction to Cluster Analysis. 1990, Wiley, New York
 24.
Bradley PS, Mangasarian OL, Street WN: Clustering via concave minimization. Proceedings of Advances in Neural Information Processing Systems: December 25, 1996; Denver, CO. Edited by: Mozer MC, Jordan MI, Petsche T. 1996, MIT Press, Cambridge, MA, 68374.
 25.
Weiszfeld E: Sur le point pour lequel la somme des distances de n points donn’s est minimum. Tohoku Math J. 1937, 43 (2): 355386.
 26.
Wang P, Domeniconi C, Laskey KB: Nonparametric bayesian clustering ensembles. Proceedings of the 2010 European conference on Machine learning and knowledge discovery in databases: Part III: 2024 September 2010; Barcellona. Edited by: Balcázar JL, Bonchi F, Gionis A, Sebag M. 2010, SpringerVerlag, Berlin, 435450.
 27.
Nguyen N, Caruana R: Consensus clusterings. Proceedings of the 7th IEEE International Conference on Data Mining: 2831 October 2007; Omaha, NE. Edited by: Ramakrishnan N, Zaïane OR, Shi Y, Clifton CW, Wu X. 2007, IEEE Computer Society, Washington DC, 607612.
 28.
Mo Q, Wang S, Seshan VE, Olshen AB, Schultz N, Sander C, Powers RS, Ladanyi M, Shen R: Pattern discovery and cancer gene identification in integrated cancer genomic data. PNAS. 2013, 110 (11): 42454250. 10.1073/pnas.1208949110.
 29.
Rey M, Roth V: Copula mixture model for dependencyseeking clustering. Proceedings of the 29th International Conference on Machine Learning: June 26July 1 2012; Edinburgh. Edited by: Langford J, Pineau J. 2012, Omnipress, Madison, WI, 927934.
 30.
Kirk P, Griffin JE, Savage RS, Ghahramani Z, Wild DL: Bayesian correlated clustering to integrate multiple datasets. Bioinformatics. 2012, 28: 32903297. 10.1093/bioinformatics/bts595.
 31.
Rogers S, Girolami M, Kolch W, Waters KM, Liu T, Thrall B, Wiley HS: Investigating the correspondence between transcriptomic and proteomic expression profiles using coupled cluster models. Bioinformatics. 2008, 24: 28942900. 10.1093/bioinformatics/btn553.
 32.
Korn EL, Troendle JF, McShane LM, Simon R: Controlling the number of false discoveries: application to highdimensional genomic data. J Stat Plann Inference. 2004, 124 (2): 379398. 10.1016/S03783758(03)002118.
 33.
Staunton JE, Slonim DK, Coller HA, Tamayo P, Angelo MJ, Park J, Scherf U, Lee JK, Reinhold WO, Weinstein JN, Mesirov JP, Lander ES, Golub TR: Chemosensitivity prediction by transcriptional profiling. PNAS. 2001, 98 (19): 1078710792. 10.1073/pnas.191368598.
 34.
Langley P, Iba W, Thompson K: An analysis of Bayesian classifiers. Proceedings of the 10th National Conference on Artificial Intelligence: July 1216 1992; San Jose, CA. Edited by: Swartout WR. 1992, AAAI Press, Palo Alto, CA, 223228.
 35.
Quinlan JR: C4.5: Programs for Machine Learning. 1993, Morgan Kaufmann Publishers, San Francisco, CA
 36.
Aha DW, Kibler D, Albert MK: Instancebased learning algorithms. Mach Learn. 1991, 6 (1): 3766.
 37.
Vapnik V: Statistical Learning Theory. 1998, Wiley, New York
 38.
Tsamardinos I, Borboudakis G, Christodoulou E, Røe OD: Chemosensitivity Prediction of Tumours Based on Expression, miRNA, and Proteomics Data. Int J Syst Biol Biomed Technol. 2012, 1 (2): 119.
 39.
Entrez gene database. [], [http://www.ncbi.nlm.nih.gov/gene/]
 40.
Maglott D, Ostell J, Pruitt KD, Tatusova T: Entrez gene: genecentered information at NCBI. Nucleic Acids Res. 2005, 33 (suppl 1): 5458.
 41.
Nagaraju GPC, Sharma D: Anticancer role of SPARC, an inhibitor of adipogenesis. Cancer Treat Rev. 2011, 37 (7): 559566. 10.1016/j.ctrv.2010.12.001.
 42.
Clark CJ, Sage EH: A prototypic matricellular protein in the tumor microenvironment where there’s SPARC, there’s fire. J Cell Biochem. 2009, 104 (3): 721732. 10.1002/jcb.21688.
 43.
Arnold SA, Brekken RA: SPARC: a matricellular regulator of tumorigenesis. J Cell Commun Signal. 2009, 3 (34): 255273. 10.1007/s1207900900724.
 44.
Riederer BM: Microtubuleassociated protein 1B, a growthassociated and phosphorylated scaffold protein. Brain Res Bull. 2007, 71 (6): 541558. 10.1016/j.brainresbull.2006.11.012.
 45.
Morselli E, Galluzzi L, Kepp O, Vicencio JM, Criollo A, Maiuri MC, Kroemer G: Antiand protumor functions of autophagy. Biochim Biophys Acta. 2009, 1793 (9): 15241532. 10.1016/j.bbamcr.2009.01.006.
 46.
Edwards KM, Münger K: Depletion of physiological levels of the human TID1 protein renders cancer cell lines resistant to apoptosis mediated by multiple exogenous stimuli. Oncogene. 2004, 23 (52): 84198431. 10.1038/sj.onc.1207732.
 47.
Ralph SJ, RodríguezEnríquez S, Neuzil J, Saavedra E, MorenoSánchez R: The causes of cancer revisited: “mitochondrial malignancy” and ROSinduced oncogenic transformation  why mitochondria are targets for cancer therapy. Mol Aspects Med. 2010, 31 (2): 145170. 10.1016/j.mam.2010.02.008.
 48.
Mikosz CA, Brickley DR, Sharkey MS, Moran TW, Conzen SD: Glucocorticoid receptormediated protection from apoptosis is associated with induction of the serine/threonine survival kinase gene, sgk1. J Biol Chem. 2001, 276 (20): 1664916654. 10.1074/jbc.M010842200.
 49.
Zhang L, Cui R, Cheng X, Du J: Antiapoptotic effect of serum and glucocorticoidinducible protein kinase is mediated by novel mechanism activating IκB Kinas. Cancer Res. 2005, 65 (2): 457464.
 50.
Lee HJ, Chang JH, Kim YS, Kim SJ, Yang HK: Effect of etsrelated transcription factor (ERT) on transforming growth factor (TGF)beta type II receptor gene expression in human cancer cell lines. J Exp Clin Cancer Res. 2003, 22 (3): 477480.
 51.
Chen D, Shan J, Zhu WG, Qin J, Gu W: Transcriptionindependent ARF regulation in oncogenic stressmediated p53 responses. Nature. 2010, 464 (7288): 624627. 10.1038/nature08820.
 52.
Liggett W, Sidransky D: Role of the p16 tumor suppressor gene in cancer. J Clin Oncol. 1998, 16 (3): 11971206.
 53.
Virani S, Colacino JA, Kim JH, Rozek LS: Cancer epigenetics: a brief review. ILAR J. 2013, 53 (34): 359369.
 54.
Parr C, Jiang WG: Hepatocyte growth factor activation inhibitors (HAI1 and HAI2) regulate HGFinduced invasion of human breast cancer cells. Int J Cancer. 2006, 119 (5): 11761183. 10.1002/ijc.21881.
 55.
Toler CR, Taylor DD, GercelTaylor C: Loss of communication in ovarian cancer. Am J Obstet Gynecol. 2006, 194 (5): e2731. 10.1016/j.ajog.2006.01.024.
 56.
Li Z, Zhou Z, Welch DR: Donahue HJ. Expressing connexin 43 in breast cancer cells reduces their metastasis to lungs. Clin Exp Metastasis. 2008, 25 (8): 893901. 10.1007/s1058500892089.
 57.
Qin H, Shao Q, Curtis H, Galipeau J, Belliveau DJ, Wang T, AlaouiJamali MA, Laird DW: Retroviral delivery of connexin genes to human breast tumor cells inhibits in vivo tumor growth by a mechanism that is independent of significant gap junctional intercellular communication. J Biol Chem. 2002, 277 (32): 2913229138. 10.1074/jbc.M200797200.
 58.
Cheung M, Testa JR: Diverse mechanisms of AKT pathway activation in human malignancy. Current Cancer Drug Targets. 2013, 13 (3): 234244. 10.2174/1568009611313030002.
 59.
Munz M, Baeuerle PA, Gires O: The emerging role of EpCAM in cancer and stem cell signaling. Cancer Res. 2009, 69 (14): 56275629. 10.1158/00085472.CAN090654.
 60.
Maetzel D, Denzel S, Mack B, Canis M, Went P, Benk M, Kieu C, Papior P, Baeuerle PA, Munz M, Gires O: Nuclear signalling by tumourassociated antigen EpCAM. Nat Cell Biol. 2009, 11 (2): 162171. 10.1038/ncb1824.
 61.
Antonacopoulou AG, Grivas PD, Skarlas L, Kalofonos M, Scopa CD: Kalofonos HP: POLR2F, ATP6V0A1 and PRNP expression in colorectal cancer: new molecules with prognostic significance?. Anticancer Res. 2008, 28 (2B): 12211227.
 62.
Zhang N, Zhong R, PerezPinera P, Herradon G, Ezquerra L, Wang ZY, Deuel TF: Identification of the angiogenesis signaling domain in pleiotrophin defines a mechanism of the angiogenic switch. Biochem Biophys Res Commun. 2006, 343 (2): 653658. 10.1016/j.bbrc.2006.03.006.
 63.
Li T, Feng Z, Jia S, Wang W, Du Z, Chen N, Chen Z: Daintain/AIF1 promotes breast cancer cell migration by upregulated TNFαvia activate p38 MAPK signaling pathway. Breast cancer Res Treatment. 2012, 131 (3): 891898. 10.1007/s105490111519x.
 64.
Hu S, Delorme N, Liu Z, Liu T, VelascoGonzalez C, Garai J, Pullikuth A, Koochekpour S: Prosaposin downmodulation decreases metastatic prostate cancer cell adhesion, migration, and invasion. Mol Cancer2010, 9(30).
 65.
Kang SY, Halvorsen OJ, Gravdal K, Bhattacharya N, Lee JM, Liu NW, Johnston BT, Johnston AB, Haukaas SA, Aamodt K, Yoo S, Akslen LA, Watnick RS: Prosaposin inhibits tumor metastasis via paracrine and endocrine stimulation of stromal p53 and Tsp1. PNAS. 2009, 106 (29): 1211512120. 10.1073/pnas.0903120106.
 66.
Pan PW, Zhang Q, Bai F, Hou J, Bai G: Profiling and comparative analysis of glycoproteins in Hs578BST and Hs578T and investigation of prolyl 4hydroxylase alpha polypeptide II expression and influence in breast cancer cells. Biochemistry. 2012, 77 (5): 539545.
 67.
Chang KP, Yu JS, Chien KY, Lee CW, Liang Y, Liao CT, Yen TC, Lee LY, Huang LL, Liu SC, Chang YS, Chi LM: Identification of PRDX4 and P4HA2 as metastasisassociated proteins in oral cavity squamous cell carcinoma by comparative tissue proteomics of microdissected specimens using iTRAQ technology. J Proteome Res. 2011, 10 (11): 49354947. 10.1021/pr200311p.
 68.
microRNA.org  targets and expression. [], [http://www.microrna.org/]
 69.
Betel D, Wilson M, Gabow A, Marks DS, Sander C: The microRNA.org resource: targets and expression. Nucleic Acids Res. 2008, 36 (Database issue): D149D153.
 70.
Kolar M, Liu H: Marginal regression for multitask learning. Proceedings of the International Conference on Artificial Intelligence and Statistics: April 2123 2012; La Palma, Canary Islands. Edited by: Lawrence ND, Girolami M. 2012, JMLR.org., Cambridge, 647655.
 71.
Evgeniou T, Micchelli CA, Pontil M, ShaweTaylor J: Learning multiple tasks with kernel methods. J Mach Learn Res. 2005, 6 (4): 615637.
 72.
WEKA data mining software. [], [http://www.cs.waikato.ac.nz/ml/weka/]
 73.
Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH: The WEKA data mining software: an update. ACM SIGKDD Explorations Newslett. 2009, 11 (1): 1018. 10.1145/1656274.1656278.
 74.
Bayesian network toolbox. [], [https://code.google.com/p/bnt/]
Acknowledgements
This work has been partially funded by Regione Lombardia: “Dote Ricercatori”  FSE and NEDD project.
Author information
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
EF, EM and FA conceived the algorithm and designed the experiments. EF performed experiments and analysed the results. All authors read and approved the paper.
Electronic supplementary material
Authors’ original submitted files for images
Rights and permissions
About this article
Received
Accepted
Published
DOI
Keywords
 pMedian clustering
 Bayesian networks
 Drug response prediction