# Integrated QSAR study for inhibitors of hedgehog signal pathway against multiple cell lines:a collaborative filtering method

- Jun Gao†
^{1, 2}, - Dongsheng Che†
^{3}, - Vincent W Zheng
^{4}, - Ruixin Zhu
^{1}and - Qi Liu
^{1}Email author

**13**:186

https://doi.org/10.1186/1471-2105-13-186

© Gao et al.; licensee BioMed Central Ltd. 2012

**Received: **3 February 2012

**Accepted: **11 July 2012

**Published: **31 July 2012

## Abstract

### Background

The Hedgehog Signaling Pathway is one of signaling pathways that are very important to embryonic development. The participation of inhibitors in the Hedgehog Signal Pathway can control cell growth and death, and searching novel inhibitors to the functioning of the pathway are in a great demand. As the matter of fact, effective inhibitors could provide efficient therapies for a wide range of malignancies, and targeting such pathway in cells represents a promising new paradigm for cell growth and death control. Current research mainly focuses on the syntheses of the inhibitors of cyclopamine derivatives, which bind specifically to the Smo protein, and can be used for cancer therapy. While quantitatively structure-activity relationship (QSAR) studies have been performed for these compounds among different cell lines, none of them have achieved acceptable results in the prediction of activity values of new compounds. In this study, we proposed a novel collaborative QSAR model for inhibitors of the Hedgehog Signaling Pathway by integration the information from multiple cell lines. Such a model is expected to substantially improve the QSAR ability from single cell lines, and provide useful clues in developing clinically effective inhibitors and modifications of parent lead compounds for target on the Hedgehog Signaling Pathway.

### Results

In this study, we have presented: (1) a collaborative QSAR model, which is used to integrate information among multiple cell lines to boost the QSAR results, rather than only a single cell line QSAR modeling. Our experiments have shown that the performance of our model is significantly better than single cell line QSAR methods; and (2) an efficient feature selection strategy under such collaborative environment, which can derive the commonly important features related to the entire given cell lines, while simultaneously showing their specific contributions to a specific cell-line. Based on feature selection results, we have proposed several possible chemical modifications to improve the inhibitor affinity towards multiple targets in the Hedgehog Signaling Pathway.

### Conclusions

Our model with the feature selection strategy presented here is efficient, robust, and flexible, and can be easily extended to model large-scale multiple cell line/QSAR data. The data and scripts for collaborative QSAR modeling are available in the Additional file 1.

## Background

The Hedgehog Signaling Pathway plays an important role in regulating embryonic development in vertebrates, and it is highly conserved from flies to humans [1][2][3][4].The pathway name comes from a polypeptide ligand called Hedgehog (Hh), which is an intercellular signaling molecule in Drosophila. In Drosophila, the mutation of the gene in the Hedgehog Signaling Pathway gives rise to an unusual spiky-haired phenotype [1]. The misregulation of such pathways has been directly associated with a variety of inherited and sporadic diseases [4][5][6]. The key role of the Hedgehog Signaling Pathway in the cell differentiation, growth, and proliferation makes it an excellent candidate in drug discovery, and thus targeting such pathway in cells represents a promising new paradigm for cell growth and death control.

The causal relationship between the activation of Hedgehog Signaling Pathway and oncogenesis has driven cancer researchers in the direction of finding specific inhibitors of hedgehog signaling, since this will provide efficient therapies to a wide range of malignancies [1, 2]. To date, several druggable nodes within the pathway have been identified. Assays implanted on various cell lines have shown that small molecules were able to alter the activity of these targets. Among them, murine cell lines such as NIH 3 T3, TM3h12, and C3H10T1/2 have been used [2]. While current cell lines allow the measurement of the inhibitory effects of compounds on the Hh pathway, they, however, provide little or no information about the specific underlying targets. To the best of our knowledge, only specific Smoothened inhibitors have been identified. Among them, the well-known BODIPY–cyclopamine, which is a fluorescent derivative of the naturally occurring Smo antagonist cyclopamine, binds specifically to cells expressing the Smo protein. This is one of the small chemical compounds that specifically inhibit Smoothened in the Hedgehog Signaling Pathway[2]. In our previous study [7], we have performed several quantitatively structure-activity relationship (QSAR) studies for cyclopamine derivatives in multiple cell lines, and such study could reveal useful clues in developing clinically effective drugs and modifications of parent lead compounds for cancer therapy.

- (1)
In our previous QSAR study, for specific cell lines, the activities were categorized into a binary classification under a naïve Bayesian model, and we obtained relatively acceptable QSAR results. However, no matter what kinds of statistical models or 2D descriptors were tested, low testing correlation coefficients were found when numeric activities were used. This may be due to the inherent noise existed in experimental activity measurement, or the relatively small number of training data used for a specific cell line. Due to our compound data tested against multiple cell lines to evaluate their activities, we hypotheses that such information can be integrated to improve the QSAR results rather than only a single cell line QSAR modeling. Such investigation will be extremely useful for the scenario that a small number of compound activities are measured under different experimental conditions (such as different cell lines, targets, assays etc.), and will provide novel insights on the integration of existing information, avoiding repeatable laborious work in drug discovery. In addition, such a study may also lead to novel computational models for integrated QSAR modeling, which is closely related to multi-task QSAR modeling [9] [10], Multi-Assay-Based QSAR modeling [11] , and Multi-target QSAR study [12].

- (2)
Due to the existence of compound activity data against multiple cell lines, how can we integrate such information to derive more robust and efficient feature selection strategies for compound modification under such “collaborative” multi-cell line environment? That is, can we derive the commonly important features related to the entire given cell lines for compound description, while in the meantime present their specific contributions to a specific cell-line? This issue is closely related to the first one, but tougher to be solved since it needs much more domain knowledge.

Inspired by these two problems, we aim to develop an efficient integrated QSAR model for inhibitors of Hedgehog Signal Pathway against multiple cell lines. This type of model has been used for information retrieval in social network, i.e. collaborative filtering [13][14], and it has been widely applied by the web companies such as Google, Amazon. Dumitru Erhan etc. has pioneered to use the term “collaborative filtering” in multiple target study [15]. Nevertheless, their methodology can be categorized as multiple regression or neural network, and a complex kernel function for similarity measurement is needed. In this study, we will present a collective matrix factorization based collaborative filtering model for integrated QSAR modeling, which is more naturally suitable for QSAR modeling, and scales up well on large dataset. Furthermore, we will also derive a powerful feature selection strategy for collaborative compound design to get more efficient inhibitors of Hedgehog Signal Pathway.

## Methods and materials

### Dataset

*PK*

_{ i }, as defined in the following Cheng-Prusoff equation [16]:

(*L* is the concentration of free radioligand used and *K*_{
D
} is its equilibrium dissociation constant for the receptor [16])

Where *IC*_{
50
} (half maximal inhibitory concentration) is a measure of the effectiveness of a compound in inhibiting biological or biochemical function. More specifically, it indicates how much of a particular drug or other substance (inhibitor) needed to inhibit a given biological process (or component of a process, i.e. an enzyme, cell, cell receptor or microorganism) by half. In our study, the data are formulated as a data matrix *X*. Note that the collective matrix factorization requires the matrix to be non-negative. In our original experiments, we measured the compound affinity under the PK_{i} evaluation system, and the activity values were negative. Since the *PK*_{
i
} measurement is calculated by taking *IC*_{
50
} as the input in equation (1), we can just take the absolute value of the PK_{i} in QSAR modeling , and this will not affect our final results.

### Definitions and Notations

In this paper, the different cell lines and the compounds tested for Hedgehog Signal Pathway will be denoted as *t* and *c* respectively, and their corresponding subscripts denote a specific compound and cell line. Thus, for a specific compound *c*_{
i
}, its experimentally activity value (measured as PK_{i} ) against specific cell line *t*_{
j
} is denoted as *x*_{
ij
}. We can build a *m* by *n* dimensional matrix *X*, where *m* is the number of the compounds and *n* is the number of cell lines.

Each compound will be represented by a vector of descriptors, denoted as a matrix *Y* with *m* by *r* dimensions, where *m* is the number of compounds and *r* is the length of the corresponding descriptor vector. Similar to our previous study [7], two different molecule descriptors, general descriptor [17] and drug-like index (DLI) [18] will be used for compound representation.

### Collaborative filtering for multiple cell line QSAR modeling

Based on the above definitions, it can be seen that the traditional single cell line QSAR modeling is applied on the data in a specific column of matrix *X*. In this study, we are more interested in incorporating the information from other columns (cell lines) to enhance the performance of the QSAR modeling for a particular column (cell line). This scenario is similar to the recommendation system presented by Electronic retailers and content providers such as Amazon.com and Netflix [14], which make automatic predictions (filtering) of users’ interests by collecting preferences or taste information from many users (collaborating), naturally termed as “collaborative filtering (CF)”.

*n*users {

*u*

_{ 1 },

*u*

_{ 2 }, . . . ,

*u*

_{ n }} and a list of

*m*items {

*i*

_{ 1 },

*i*

_{ 2 }, . . . ,

*i*

_{ m }}, and each user,

*u*

_{ i }, has a list of items,

*Iu*

_{ i }, which the user has rated, or about which their preferences have been inferred through their behaviors. The ratings can either be explicit indication on a 1–5 scale, or implicit indication such as purchases or click-throughs [13]. Such a user-item relationship can be formulated as a matrix, which may be sparse and can have missing values (i.e. users did not give their preferences). The goal of CF is to predict such missing values based on the existed information of users/items to make the reasonable recommendation (Left Panel of Figure 2).

Such a CF scenario is inherently suitable for our multiple cell line QSAR modeling. In our study, the former “cell line- compound” matrix *X* can be viewed as a kind of “item-user” matrix, where “compound” is analogue to “item” and “cell line” is analogue to “user” (Right Panel of Figure 2). The traditional single cell line QSAR modeling uses the data restricted in a specific column of matrix *X* to train and test. From the perspective of machine learning, we just hold part of the data in the column as testing dataset, and use the other part of the data in the column to train a QSAR model. This procedure can be naturally extend to the multiple cell line QSAR modeling under the CF framework, where we can treat the testing data in a specific column as “missing” value and using the remain data from this column as well as the data from other columns to predict such missing values.

### Collective matrix factorization for collaborative filtering in the multiple cell line QSAR modeling

We formulate the multiple cell line QSAR modeling problem as a collaborative filtering problem. There are two existing techniques for solving collaborative filtering, i.e., the neighborhood methods and latent factor models. Neighborhood methods are centered on computing the relationships between items or, alternatively, between users for missing value prediction, while latent factor models characterize both items and users on, say, 20 to 100 factors inferred from the ratings patterns [19][20][21]. Generally, realizations of latent factor models are based on matrix factorization. In its basic form, matrix factorization characterizes both items and users through vectors of factors inferred from item rating patterns. High correlation between item and user factors leads to a recommendation. These methods have become popular in recent years by combining good scalability with predictive accuracy. Thus, we will present a matrix factorization based multiple cell line QSAR modeling method in our study.

Specifically, we have matrix $X\in {R}_{+}^{m\phantom{\rule{0.5em}{0ex}}x\phantom{\rule{0.5em}{0ex}}n}$, where *X*_{
i j
}*,* epresents the activity measurement of compound *i* against specific cell line *j*. Noted that *X* is sparse in a specific column since we will hold part of elements in this column as the testing data (missing values) for QSAR modeling. We use an indicator matrix $I\in {R}^{m\phantom{\rule{0.5em}{0ex}}x\phantom{\rule{0.5em}{0ex}}n}$ to represent the missing values, where *I*_{
ij
} = 0 if *X*_{
ij
} is missing and *I*_{
ij
} = 1 otherwise.

We denote by ${X}_{j}.\text{,}\phantom{\rule{0.5em}{0ex}}1\le \text{i}\le \text{m and}X{.}_{j}\text{,}1\le \text{j}\le \text{n}$ the *i* th row and *j* th column of *X*, which represent the *i* th compound's activities against all the cell lines and the activities of the *j* th cell line for all the compounds, respectively.

*X*, thus to fill/predict the missing values. Such matrix factorization can be achieved by solve the following optimization function:

In equation (3), the operator “∘” denotes the entry-wise product. $\left|\right|*|{|}_{F}$ denotes the Frobenius norm. The last 2 terms add regularizations to the matrix *U* and *V* by avoiding over-fitting the observed data. The parameters ${\lambda}_{1}$ and ${\lambda}_{2}$ control the extent of regularizations, and they are usually determined by cross-validation.

*Y*in order to use such auxiliary information to aid a more reasonable reconstruction of matrix

*X*. We further presented a collective matrix factorization (CMF)[22] method for multiple cell line QSAR modeling. The CMF method was recently presented by machine learning community [22], and it jointly factorizes multiple matrices simultaneously, assuming that they share several common latent factors. To be more specific, given a compound - cell line matrix $X\in {R}_{+}^{m\times n}$, and a compound description matrix $Y\in {R}_{+}^{m\times n}$, we extend the optimization function (2) and (3) to the following:

*L*(U, V, W)

Equation (5) is similar to equation (3), it reconstructs $X\approx U{V}^{T}$ and $Y\approx U{W}^{T}$ by sharing the common factor *U*, where $X\in {R}_{+}^{m\times n}\text{,}\phantom{\rule{0.25em}{0ex}}Y\in {R}_{+}^{m\times r}\text{,}\phantom{\rule{0.25em}{0ex}}U\in {R}_{+}^{m\times d},V\in {R}_{+}^{n\times d}\text{and}W\in {R}_{+}^{r\times d}$. *U*, *V* and *W* are low-dimensional matrices with dimensiond $\le min(m,n,r)$. By solving such optimization function, we can successfully incorporate the information of the multiple cell line compound activities and compound description for a better missing value prediction.

*U*,

*V*,

*W*, and we cannot get closed-form solutions for minimizing the objective function. Therefore, we will turn to some numerical method such as gradient descent to get the local optimal solutions. Specifically, we have the gradients as:

After obtaining the gradients, we can use gradient descent to iteratively minimize the objective function. The algorithm for the collective two matrix factorization is given below:

Algorithm 1: collective matrix factorization for multiple cell line QSAR modeling

Input: An incomplete matrix *X* and a complete matrix *Y* , where *X* represents the compound activities in multiple cell lines with missing values in specific column, *Y* represents the compound description matrix.

Output: The complete matrix for *X*.

- 1.
*t*= 1; - 2.
While (

*t*<*T*and ${L}_{t}-{L}_{t+1}>\epsilon $ do - 3.
Get the gradients ${\nabla}_{u}L\text{,}\phantom{\rule{0.5em}{0ex}}{\nabla}_{v}L\text{,}\phantom{\rule{0.5em}{0ex}}{\nabla}_{w}L$ by Equation (5)-(7);

- 4.
y = 1;

- 5.
While $(L({\text{U}}_{t}-\gamma {\nabla u}_{t}L,{\text{V}}_{t}-\gamma {\nabla v}_{t}L,{\text{W}}_{t}-\gamma {\nabla}_{w}tL)\ge L({\text{U}}_{t},{\text{V}}_{t},{\text{W}}_{t}))$ do

- 6.;$\gamma =\gamma /2$
- 7.
End

- 8.${\text{U}}_{t+1}={\text{U}}_{t}-\gamma {\nabla u}_{t}L,{\text{V}}_{t+1}={\text{V}}_{t}-\gamma {\nabla v}_{t}L,{\text{W}}_{t+1}={\text{W}}_{t}-\gamma {\nabla w}_{t}L$
- 9.
t = t + 1;

- 10.
End

- 11.
return X;

End

### Performance measurement

In order to demonstrate the efficiency of collective matrix factorization based multiple cell line QSAR modeling, we compare our approach with two other base line methods, i.e., linear ridge regression and support vector regression (SVR) for single cell line QSAR modeling used in our previous study [7]. For the purpose of equal comparison, we apply the following two testing strategies for each specific cell line: (1). We randomly selected 2/3 of the data to train the linear ridge regression and SVR, and the remaining data as to test these two methods. These two base line methods are compared with collective matrix factorization based QSAR method, where the same testing data (missing values) are predicted based on the original training data for this specific cell line plus the data from other cell lines. The whole procedure was repeated 10 times. (2) In order to evaluate the QSAR model more rigorously and consider the representative ability of the compounds in training dataset, we applied another data partition strategy, i.e., *Diverse Subset* data division method [7], which is commonly used in the chemoinformatics community. Generally speaking, the *Diverse Subset* method ranks compound entries based on diversity. In the procedure of data division, the first entry of the original dataset is taken as a reference and will always be viewed as part of a diverse subset. Then the most “distinct” compound data is assigned #2, and then the most distinct compound to these two is assigned #3 and so on until the required number of diverse compounds is identified or the whole dataset is ranked in diversity order [7]. In this study, we also select 1/3 of the data as testing dataset and the remaining data as the training dataset, while such partition is generated in a *Diverse Subset* way rather than randomly to keep the representative and distinct characteristics of data.

Two classical measurements, i.e., Root mean squared error (RMSE) and squared correlation coefficient (*R*-square) were adopted as the performance evaluations for testing results. The definitions of these statistical parameters are provided as follows:

where *n* is the number of test compounds, ${e}_{i}={y}_{i}-{\widehat{y}}_{i}$, is the difference between the observed compound affinity data and the fitted model. ${y}_{i}$ is the observed compound affinity, ${\widehat{y}}_{i}$ is the predicted compound affinity.

where ${P}^{\mathit{avg}}$ is the average value of ${P}_{i}^{exp}$ over the *n* predicted compound affinities.

### Feature selection based on CMF for compound description among multiple cell lines

Under such collaborative QSAR schema, we presented a novel feature selection model for compound descriptions weighting, which is also derived from the content-based recommender systems and collaborative filtering [23]. Basically, we want to quantify the effect of each compound feature against a specific cell line (weighting for intra-cell line) as well as among all the cell lines (weighting for inter-cell line). The final feature weighting is an integration of the two types of weighting, where both specific and the whole cell lines contribute. Such a feature selection strategy is attractive in multiple cell line QSAR modeling, since it can provide useful clues of how to modify chemical compounds to improve their activities for a specific target, or for all given cell lines simultaneously. While the latter one is a key step for multi-target compound design.

Specially, given a compound activity-cell line matrix *X* (*m* by *n*) and compound-feature description matrix *Y* (*m* by *r*), we want to derive a cell line-description feature weighting matrix *Z* (*n* by *r*), where its element *z*_{
ij
} is the weight of a compound feature *j* in cell line *i*. The value of element *z*_{
ij
} is contributed from two sides, i.e. intra-cell line and inter-cell line. The generally procedure for computing a weight for each compound feature is based on (1) the amount of information provided by itself , and (2) the correlation between the compound feature and a specific cell line. Three steps are performed here:

Step 1. Weighting for inter-cell line. For each compound feature *c*_{
j
}, an entropy based method is applied to compute the amount of information that it can offer regardless of cell line, as denoted as *H*_{
j
}.

Step 2. Weighting for intra-cell line. For each compound feature *c*_{
j
} and a specific cell line *t*_{
j
}, the correlation between compound feature and the cell line is calculated. This calculation will depend on the nature of the features (qualitative, quantitative). Two kinds of correlations, i.e., correlation coefficient and contingency coefficient [23] are proposed for quantitative features and qualitative features respectively.

Step 3. Calculation of the final weights. The feature weight is obtained as a result of the product of entropy and degree of dependency.

## Results

We performed a comprehensive study of the collective matrix factorization based multiple cell line QSAR modeling for the inhibitors of Hedgehog Signaling Pathway as described in Section 2. In the rest of this section we present and summarize the key results from this study. The performance of our method was compared with the baseline QSAR models of liner ridge regression and SAR. Details are listed in the following.

### Performance of the collective matrix factorization based QSAR modeling

*diverse subset*to consider the data representative ability in the training and testing dataset.

From Figure 45, it can be seen that that the different data partition strategies actually achieve the similar performance results. For all the cell lines and all the kinds of data representation, the performance improvement of collaborative QSAR modeling was dramatic, especially for the evaluation of *R*-square. The improvement is statistically significant, with significant *p*-value measured by RMSE and *R*-square respectively. We had already noticed in our previous study [7] that under the measurement of *R*-square, the QSAR modeling results for the four cell lines with numeric compound activities were not satisfied, indicating a satisfiable QSAR modeling against single cell line individually was hard to obtain. In contrast, in our current collaborative QSAR modeling, performance against all the cell lines was improved. The significant improvement margin evaluated by *R*-square indicates that our CMF based QSAR modeling could successfully capture the correlation, rather than its absolute value of difference among the dataset as evaluated by RMSE.

*R*-square of different QSAR models, we also investigated their error distribution under the

*diverse subset*partition strategy to give a more rigorous comparison of their performance. It can be seen from the boxplots of the error square (Figure 6–7) that for both two compound descriptions, collaborative QSAR modeling achieved the lowest error means and low variances compared to other two baselines, indicating the best prediction ability among all methods.

It should be noted that in our previous study we found that different cell lines perform differently for modeling the inhibitor affinity based on the linear regression or SVR. Particularly, only the data of NCI-H446 could produce a reasonable model by QSAR analysis, probably due to the fact that the other three cell lines may be less sensitive as HCI-H466 cells to the hedgehog signaling inhibitor [7]. Nevertheless, it can be seen from Figure 47 that if we combine all these data from different cell lines together under the CMF based QSAR modeling, we can greatly reduce such non-specific effects in the cell lines, and result in a reasonable QSAR modeling against all the cell lines respectively. Such improvement is attributed to the fact that the collaborative filtering based framework allows different cell line data tasks to enhance each other during the training process, which eventually makes the efficacy modeling better than those of using the datasets separately. We believe that such “collaborative” scenario for drug analysis will become more popular in the future, as more and more cell line will exist and the drug are often required to be investigated under various circumstances.

*domain of application*(DOA) for the model under the

*diverse subset*partition strategy. The domain of application (DOA) is used to estimate the reliability in the prediction of a new compound [24] for a specific method. Those molecules fall out the domain may lead to unreliable predictions [10]. In the analysis of DOA, a value of leverage ${h}_{i}$ is defined in equation (11) for each chemical molecule:

Where ${X}_{i}$ is the row-vector descriptor of the query compound,*X* is the $n\times k$ matrix containing *k* descriptor values and *n* training samples. The superscript *T* is the transpose of the matrix or vector. Generally, the warning leverage $h*$ is fixed at $3k/n$, where *n* is the number of training compounds, and *k* is the number of descriptors. When the leverage is greater than the warning leverage$h*$, the predicted activity is the result of substantial extrapolation of the model and, therefore, it may not be reliable and tend to be over-fitting.

*h*), and can be used to obtain an immediate and simple graphical detection of both the response outliers (

*Y*outliers) and the structurally influential chemicals (

*X*outliers) of a model. Generally, the points with their values of

*Y*axis fall outside the 3

*σ*line (

*σ*is the standard residuals unit of the compounds) can be considered as

*Y*outliers, while the points with their values of

*X*axis fall outside the warning leverage $h*$line can be considered as

*X*outliers [10]. Figures 8 and 9 represent the William plots for the four cell lines with compound representations of General Descriptor and Drug-like index respectively. It can be seen that for all four cell lines, most of the compounds fall into their corresponding application domain, which indicate that the collaborative QSAR modeling has achieved a reliable activity prediction for the compounds, and they are following a well-defined domain of applicability.

### Impact of the Regularization Parameters

### Feature selection based on CMF for compound description among multiple cell lines

Using collaborative filtering based feature selection strategy we proposed aforementioned, we obtained the feature weighting for intra-cell line and inter-cell line for the inhibitors of Hedgehog Signaling Pathway. The former one can be used to uncover the important features in inhibitor design against a specific cell line, while the later one is used to identify common features that are important for the inhibitors against multiple cell line simultaneously. We compared the difference between these two kinds of feature weighting to provide useful clues for inhibitor modifications and improve their affinities.

In this feature selection, we used Drug-like index to represent each compound, with the total of 28 features, since it is easy to interpret biological meanings. The General descriptor feature space has been hybridized, and the original meanings of compound structure description for current features couldn’t kept. Therefore, GD will not be adopted for feature weighting here. Generally speaking, Drug-like index belongs to the category of structural descriptors. Structural descriptors can correlate with each other; some of them may be redundant. However, if they have different and significant distributions in the considered drug class, they can be used for drug-knowledge extraction and the redundant can be ignored. In our study, the descriptors maintain their identity and clearly interpretable structural significance throughout the process. A table with detailed descriptions of each drug-like index is listed in Additional file 2: Table S1.

- 1)
As shown in Figure 11, the features of ‘# of non-H’ (DLI1), ‘# of non-H polar bonds’ (DLI5) and ‘# of 2-degree cyclic atoms’ (DLI13) were ranked top 3. These findings indicate that the volume of the molecular, the polar of the molecular and the cyclic degree of the molecular are the most important features for the design of multi-target inhibitors of Hedgehog Signaling Pathway. Our findings are actually consistent with the empirical rules for lead compound optimization, which use these three elements to determine their activities.

- 2)
We can see from Figure 11 that the feature ‘# of cap fragments’ (DLI23) was also important when the multi-cell line inhibitors were designed. This is consistent with the empirical rule, which changes the substituent group (functional group) in order to improve the inhibitor activity. However, compared with Figure 12, it can be seen that the importance of this feature for multi-cell line inhibitor design is not as much significant as that for individual cell lines. This could be explained that, although this feature is important for individual cell line inhibitor, their activity improvement directions may be inconsistent, thus reducing its importance when multi-cell lines are confronted.

- 3)
All three figures (Figure 11, 12, 13) have shown that the weight for the feature of '# of 3-level bonding patterns' (DLI18) is 0. This is probably due to the following two reasons:a) All compound samples in our study are lack of this feature, and b) this feature is not considered in most of the in-silico compound optimizations.

## Discussion

### Comparison of CMF based QSAR modeling with other collaborative QSAR modeling

Although the CMF based QSAR modeling was investigated in our study, we do realize the existence of other QSAR modeling with integrated information, and we call such models as the “collaborative” QSAR modeling, like the neural network based [15][25–27] and multi-task learning based [9][10] models, as well as the proteochemometrics modeling (PCM) [28][29]. In order to further uncover the characteristics of such collaborative QSAR modeling, we discuss our CMF based method with the aforementioned methods on our multiple cell line QSAR modeling for the inhibitors of Hedgehog Signaling Pathway.

### Neural network based collaborative QSAR modeling

As we mentioned above, Erhan etc. proposed a neural network based collaborative QSAR modeling for drug discovery [15]. This is one of the first attempts to construct an efficient procedure for integrating multiple drug target information at a time by extending standard multi-layer neural networks. Basically, neural networks provided an ideal test bed for implementing collaborative QSAR modeling: the simplest of such form was to create a shared hidden layer that is trained in parallel for all the learning tasks. In this case, the training procedure would be done on all the tasks (in our study it will be all the cell line QSAR data) in parallel. Because the structure of the network includes a shared layer (weight matrix), it is possible for so-called “shared internal representations” to develop and to be learned.

Specifically, in our multiple cell line QSAR modeling for the inhibitors of Hedgehog Signaling Pathway, we used a 10-cross fold validation schema to test our data from 4 cell lines in this neural network model. The weights from input layer to hidden layer as well as from hidden layer to output layer for the network will be learned through the back propagation (BP) algorithm. Our in-house test indicated that the neural network based collaborative QSAR modeling was comparable to CMF based QSAR modeling, with no surprisingly better than the single QSAR modeling (Results are not shown here).

### Multi-task learning based collaborative QSAR modeling

*N*tuples, ${z}_{i}=({\text{x}}_{i},{y}_{i},{k}_{i})$ for i = {1…

*N*}, where ${\text{x}}_{i}\in {\text{R}}^{d}$ is the drug descriptor, and ${k}_{i}\in ${1…

*M*} is the indicator corresponding to the example ${(\text{x}}_{i},{y}_{i})$. The

*M*tasks correspond to

*M*different cell lines or drug targets. A critical issue in this collaborative QSAR modeling is to learn a set of sparse functions across these tasks for drug activity regression. This is commonly achieved by learning

*M*linear regressions of the form ${w}_{k}^{T}x$, with the following square loss function is adopted (other loss function can also be applied):

where$z=(\text{x},y,k)$, *W*$=[{\text{w}}_{1},{\text{w}}_{2},\dots ,{\text{w}}_{M}]\in {\text{R}}^{d\times M}$ and ${W}^{j}$ be the *j* th row of *W*.

In the multi-task learning framework, *W* can be optimized and calculated by enforcing the joint sparsity across different tasks with adding the different norm of the matrix *W* to the square loss function, which leads to only a few non-zero rows of *W*.

The relationship between collaborative filtering and multi-task learning has been discussed in previous studies [30]. The multi-task learning model is closed related to the multiple response regression models [31]. Multiple response regression is the task of estimating several response variables using a common set of input variables. In general, both multi-task learning and multiple response regression can be used to find the correlation between different tasks, and thus improve the single task learning. Such an approach have many potential applications in various areas, interested readers may be referred to the paper [31]. It should be noted that in the multi-task learning framework, the samples for different tasks should not be identical. In general, the less overlap of the samples containing across different tasks, the more prediction ability of each task. This idea is related to another interesting algorithm, transfer learning [32], whereas multi-task learning can be categorized into this area and the information between different tasks is expected to “transfer” from each other to boost the performance of individual task.

For the particular data in our multiple cell line QSAR modeling for the inhibitors of Hedgehog Signaling Pathway, it can be seen that the drug samples for all the cell line are totally identical, thus it is unnecessary to use multi-task learning in the collaborative QSAR here. Nevertheless, if non-identical samples for multiple cell line exist, multi-task learning will be a good choice for collaborative QSAR modeling with integrating of different data sources.

### Proteochemometric Modeling

## Conclusions

In this study, an efficient collaborative QSAR model for inhibitors of Hedgehog Signal Pathway from multiple cell lines was proposed. The model is derived from the area of information retrieval in social network, i.e. collaborative filtering, and its performance is well demonstrated and explained in our study. By applying this elegant computational model, we successfully addressed two issues remained in our previous study, i.e., (1) The information among multiple cell lines can be integrated to boost the QSAR results, rather than single cell line QSAR modeling. Our extensive experiments indicated that the performance is remarkable compared to other single cell line QSAR methods. (2) A novel feature selection strategy under such collaborative environment was proposed, which can be used to derive the commonly important features related to the entire given cell lines, while meantime presenting their specific contributions to a specific cell-line. Based on the results of feature selection, we presented several ways of chemical modifications which will likely improve the compound affinity towards multiple targets in the Hedgehog Signal Pathway simultaneously. In summary, our study provides useful clues for multiple cell line/targets QSAR modeling when the cell line or target information among a related pathway exist. The proposed collaborative model with the feature selection strategy here is efficient, robust, flexible, and can be easily extended to model large-scale multiple cell line/QSAR data.

## Notes

## Declarations

### Acknowledgements

We thank the lab of Prof. Wei-Dong Zhang in Second Military Medical University, Shanghai to share the compound data. This work was supported in part by Project Shanghai Pujiang Talents Funding (Grant No.11PJ1407400), Young Teachers for the Doctoral Program of Ministry of Education, China (Grant No. 20110072120048) and National Natural Science Foundation of China (Grant No. 30976611, Grant No.31100956 and Grant No. 61173117)

## Authors’ Affiliations

## References

- Di Magliano MP, Hebrok M: Hedgehog signalling in cancer formation and maintenance.
*Nat Rev Cancer*2003, 3(12):903–911. 10.1038/nrc1229View ArticleGoogle Scholar - Chen JK, Taipale J, Cooper MK, Beachy PA: Inhibition of Hedgehog signaling by direct binding of cyclopamine to Smoothened.
*Genes Dev*2002, 16(21):2743. 10.1101/gad.1025302PubMed CentralView ArticlePubMedGoogle Scholar - Ingham PW, Nakano Y, Seger C: Mechanisms and functions of Hedgehog signalling across the metazoa.
*Nat Rev Genet*2011, 12(6):393–406. 10.1038/nrg2984View ArticlePubMedGoogle Scholar - Teichert AE, Elalieh H, Elias PM, Welsh JE, Bikle DD: Overexpression of Hedgehog Signaling Is Associated with Epidermal Tumor Formation in Vitamin D Receptor–Null Mice.
*J Investig Dermatol*2011, 131(11):2289–97. 10.1038/jid.2011.196PubMed CentralView ArticlePubMedGoogle Scholar - Watkins DN, Berman DM, Burkholder SG, Wang B, Beachy PA, Baylin SB: Hedgehog signalling within airway epithelial progenitors and in small-cell lung cancer.
*Nature*2003, 422(6929):313–317. 10.1038/nature01493View ArticlePubMedGoogle Scholar - Zhao C, Chen A, Jamieson CH, Fereshteh M, Abrahamsson A, Blum J, Kwon HY, Kim J, Chute JP, Rizzieri D: Hedgehog signalling is essential for maintenance of cancer stem cells in myeloid leukaemia.
*Nature*2009, 458(7239):776–779. 10.1038/nature07737PubMed CentralView ArticlePubMedGoogle Scholar - Zhu R, Liu Q, Tang J, Li H, Cao Z: Investigations on Inhibitors of Hedgehog Signal Pathway: A Quantitative Structure-Activity Relationship Study.
*Int J Mol Sci*2011, 12(5):3018–3033. 10.3390/ijms12053018PubMed CentralView ArticlePubMedGoogle Scholar - Tang J, Li HL, Shen YH, Jin HZ, Yan SK, Liu XH, Zeng HW, Liu RH, Tan YX, Zhang WD: Antitumor and antiplatelet activity of alkaloids from Veratrum dahuricum.
*Phytother Res*2010, 24(6):821–826.PubMedGoogle Scholar - Liu Q, Che D, Huang Q, Cao Z, Zhu R: Multi‐target QSAR Study in the Analysis and Design of HIV‐1 Inhibitors.
*Chin J Chem*2010, 28(9):1587–1592. 10.1002/cjoc.201090269View ArticleGoogle Scholar - Liu Q, Zhou H, Liu L, Chen X, Zhu R, Cao Z: Multi-target QSAR modelling in the analysis and design of HIV-HCV co-inhibitors: an in-silico study.
*BMC Bioinforma*2011, 12: 294. 10.1186/1471-2105-12-294View ArticleGoogle Scholar - Ning X, Rangwala H, Karypis G: Multi-assay-based structure− activity relationship models: improving structure− activity relationship models by incorporating activity information from related targets.
*J Chem Inf Model*2009, 49(11):2444–2456. 10.1021/ci900182qView ArticlePubMedGoogle Scholar - Medina-Franco JL, Yongye AB:
*P rez-Villanueva J*. Multi-target Structure-Activity Relationships Characterized by Activity-Difference Maps and Consensus Similarity Measure. Journal of chemical information and modeling, Houghten R, Martinez-Mayorga K; 2011.Google Scholar - Herlocker JL, Konstan JA, Terveen LG, Riedl JT: Evaluating collaborative filtering recommender systems.
*ACM Trans Information Syst (TOIS)*2004, 22(1):5–53. 10.1145/963770.963772View ArticleGoogle Scholar - Breese JS, Heckerman D, Kadie C: Empirical Analysis of Predictive Algorithms for Collaborative Filtering. In
*Proceedings of Fourteenth Conference on Uncertainty in Artificial Intelligence*. Madison, WI: Morgan Kaufmann; 1998:43–52.Google Scholar - Erhan D, L'Heureux PJ, Yue SY, Bengio Y: Collaborative filtering on a family of biological targets.
*J Chem Inf Model*2006, 46(2):626–635. 10.1021/ci050367tView ArticlePubMedGoogle Scholar - Lazareno S, Birdsall N: Estimation of antagonist Kb from inhibition curves in functional experiments: alternatives to the Cheng-Prusoff equation.
*Trends Pharmacol Sci*1993, 14(6):237–239. 10.1016/0165-6147(93)90018-FView ArticlePubMedGoogle Scholar - Todeschini R, Consonni V:
*Handbook of molecular descriptors*. Wiley-Vch; 2008. vol. 79 vol. 79Google Scholar - Xu J, Stevenson J: Drug-like index: a new approach to measure drug-like compounds and their diversity.
*J Chem Inf Comput Sci*2000, 40(5):1177–1187. 10.1021/ci000026+View ArticlePubMedGoogle Scholar - Melville P, Mooney RJ, Nagarajan R: Content-boosted collaborative filtering for improved recommendations. In
*2002*. 2002 edition. Menlo Park, CA; Cambridge, MA; London: AAAI Press; MIT Press; 1999:187–192.Google Scholar - Schafer J, Frankowski D, Herlocker J, Sen S: Collaborative filtering recommender systems. The Adaptive Web.
*The Adaptive Web: Methods and Strategies of Web Personalization, Vol. 4321*2007, 291–324.View ArticleGoogle Scholar - Su X, Khoshgoftaar TM: A survey of collaborative filtering techniques.
*Advances in Artificial Intelligence*2009, 2009: 4.View ArticleGoogle Scholar - Singh AP, Gordon GJ: Relational learning via collective matrix factorization. In
*2008*. ACM; 2008:650–658.Google Scholar - Barranco M, Martínez L: A method for weighting multi-valued features in content-based filtering.
*Trends in Applied Intelligent Syst*2010, 418: 409–418.View ArticleGoogle Scholar - Weaver S, Gleeson MP: The importance of the domain of applicability in QSAR modeling.
*J Mol Graph Model*2008, 26(8):1315–1326. 10.1016/j.jmgm.2008.01.002View ArticlePubMedGoogle Scholar - Patra JC, Singh O: Artificial neural networks‐based approach to design ARIs using QSAR for diabetes mellitus.
*J Comput Chem*2009, 30(15):2494–2508. 10.1002/jcc.21240View ArticlePubMedGoogle Scholar - Patra JC, Chua BH: Artificial neural network‐based drug design for diabetes mellitus using flavonoids.
*J Comput Chem*2011, 32(4):555–567. 10.1002/jcc.21641View ArticlePubMedGoogle Scholar - Xu J, Wang L, Shen X, Xu W: QSPR study of Setschenow constants of organic compounds using MLR, ANN, and SVM analyses.
*J Comput Chem*2011, 32(15):3241–52. 10.1002/jcc.21907View ArticlePubMedGoogle Scholar - Lapinsh M, Prusis P, Lundstedt T, Wikberg JES: Proteochemometrics modeling of the interaction of amine G-protein coupled receptors with a diverse set of ligands.
*Mol Pharmacol*2002, 61(6):1465. 10.1124/mol.61.6.1465View ArticlePubMedGoogle Scholar - Wikberg JES, Lapinsh M, Prusis P:
*Proteochemometrics: a tool for modeling the molecular interaction space. Chemogenomics in Drug Discovery: A Medicinal Chemistry Perspective 2005, Chapter 10*. 2004, 289–309.Google Scholar - Yu K, Tresp V: Learning to learn and collaborative filtering.
*Neural Information Processing Systems Workshop on Inductive Transfer: 10 Years Later*2005.Google Scholar - Breiman L, Friedman JH: Predicting multivariate responses in multiple linear regression.
*J Royal Stat Soc: Series B (Stat Methodol)*1997, 59(1):3–54. 10.1111/1467-9868.00054View ArticleGoogle Scholar - Pan SJ, Yang Q: A survey on transfer learning.
*Knowl Data Eng, IEEE Trans on*2010, 22(10):1345–1359.View ArticleGoogle Scholar

## Copyright

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.