Methodology article, Open Access
Integrated QSAR study for inhibitors of hedgehog signal pathway against multiple cell lines: a collaborative filtering method
BMC Bioinformatics volume 13, Article number: 186 (2012)
Abstract
Background
The Hedgehog Signaling Pathway is one of the signaling pathways that are essential to embryonic development. Inhibitors of the Hedgehog Signaling Pathway can control cell growth and death, and novel inhibitors of the pathway are in great demand. Effective inhibitors could provide efficient therapies for a wide range of malignancies, and targeting this pathway in cells represents a promising new paradigm for controlling cell growth and death. Current research focuses mainly on the synthesis of inhibitors derived from cyclopamine, which bind specifically to the Smo protein and can be used for cancer therapy. Although quantitative structure-activity relationship (QSAR) studies have been performed for these compounds in different cell lines, none of them has achieved acceptable results in predicting the activity values of new compounds. In this study, we propose a novel collaborative QSAR model for inhibitors of the Hedgehog Signaling Pathway that integrates information from multiple cell lines. Such a model is expected to improve substantially on QSAR models built from single cell lines, and to provide useful clues for developing clinically effective inhibitors and for modifying parent lead compounds that target the Hedgehog Signaling Pathway.
Results
In this study, we present (1) a collaborative QSAR model that integrates information across multiple cell lines to improve QSAR results over single cell line modeling; our experiments show that its performance is significantly better than that of single cell line QSAR methods; and (2) an efficient feature selection strategy for this collaborative setting, which derives the features that are important across all given cell lines while also showing their specific contributions to each individual cell line. Based on the feature selection results, we propose several possible chemical modifications to improve inhibitor affinity towards multiple targets in the Hedgehog Signaling Pathway.
Conclusions
Our model, together with the feature selection strategy presented here, is efficient, robust, and flexible, and can be easily extended to model large-scale multiple cell line QSAR data. The data and scripts for collaborative QSAR modeling are available in Additional file 1.
Background
The Hedgehog Signaling Pathway plays an important role in regulating embryonic development in vertebrates, and it is highly conserved from flies to humans [1][2][3][4]. The pathway takes its name from a polypeptide ligand called Hedgehog (Hh), an intercellular signaling molecule in Drosophila. In Drosophila, mutation of the gene in the Hedgehog Signaling Pathway gives rise to an unusual spiky-haired phenotype [1]. Misregulation of this pathway has been directly associated with a variety of inherited and sporadic diseases [4][5][6]. The key role of the Hedgehog Signaling Pathway in cell differentiation, growth, and proliferation makes it an excellent candidate for drug discovery, and targeting this pathway in cells therefore represents a promising new paradigm for controlling cell growth and death.
The Hedgehog Signaling Pathway is composed of four important components: Sonic Hedgehog, Patched, Smoothened and the Gli transcription factors [3] (Figure 1). The functional Hh protein is secreted from the membranes of the producing cells and initiates the Hh signaling cascade upon binding to the 12-pass transmembrane receptor Patched (Ptch). In the absence of an Hh ligand, the Patched receptor inhibits the activity of the downstream seven-pass transmembrane receptor Smoothened (Smo), which resembles G-protein-coupled receptors (GPCRs) in general topology. Binding of Hh to Patched relieves this inhibition. Active Smo then signals via a cytosolic complex of proteins including Suppressor of Fused (SuFu), and the cascade culminates in the activation of the glioma (Gli) family of transcription factors and their translocation to the nucleus. This activation results in the expression of specific genes that promote cell proliferation and differentiation [3].
The causal relationship between activation of the Hedgehog Signaling Pathway and oncogenesis has driven cancer researchers to search for specific inhibitors of hedgehog signaling, since these could provide efficient therapies for a wide range of malignancies [1, 2]. To date, several druggable nodes within the pathway have been identified. Assays implemented on various cell lines have shown that small molecules are able to alter the activity of these targets. Among these cell lines, murine lines such as NIH 3T3, TM3h12, and C3H10T1/2 have been used [2]. Although current cell lines allow the inhibitory effects of compounds on the Hh pathway to be measured, they provide little or no information about the specific underlying targets. To the best of our knowledge, only specific Smoothened inhibitors have been identified. Among them, the well-known BODIPY-cyclopamine, a fluorescent derivative of the naturally occurring Smo antagonist cyclopamine, binds specifically to cells expressing the Smo protein. It is one of the small chemical compounds that specifically inhibit Smoothened in the Hedgehog Signaling Pathway [2]. In our previous study [7], we performed several quantitative structure-activity relationship (QSAR) studies of cyclopamine derivatives in multiple cell lines; such studies can reveal useful clues for developing clinically effective drugs and for modifying parent lead compounds for cancer therapy.
Recently, our partners synthesized 93 cyclopamine derivatives and tested their activities against four different cell lines (BxPC-3, NCI-H446, SW1990 and NCI-H157) [7][8]. Based on these experimental data, a systematic QSAR investigation was carried out using various statistical models and different molecular descriptors [7]. However, several issues remain to be solved; we believe that solving them will greatly enhance the understanding of inhibitors of the Hedgehog Signaling Pathway as well as the development of novel QSAR methodology. We describe the two major problems below:

(1) In our previous QSAR study, the activities for specific cell lines were categorized into two classes and modeled with a naïve Bayesian classifier, and we obtained relatively acceptable QSAR results. However, no matter which statistical models or 2D descriptors were tested, low testing correlation coefficients were found when numeric activities were used. This may be due to the inherent noise in experimental activity measurements, or to the relatively small amount of training data available for a specific cell line. Because our compounds were tested against multiple cell lines, we hypothesize that this information can be integrated to improve on single cell line QSAR modeling. Such an investigation will be extremely useful in scenarios where a small number of compound activities are measured under different experimental conditions (such as different cell lines, targets or assays); it will provide novel insights into the integration of existing information and avoid repetitive, laborious work in drug discovery. In addition, such a study may also lead to novel computational models for integrated QSAR modeling, which is closely related to multi-task QSAR modeling [9][10], Multi-Assay-Based QSAR modeling [11], and multi-target QSAR study [12].

(2) Given compound activity data against multiple cell lines, how can we integrate this information to derive more robust and efficient feature selection strategies for compound modification in such a “collaborative” multi-cell-line environment? That is, can we derive the features that are important across all given cell lines while at the same time presenting their specific contributions to each individual cell line? This issue is closely related to the first one, but it is harder to solve since it requires much more domain knowledge.
Motivated by these two problems, we aim to develop an efficient integrated QSAR model for inhibitors of the Hedgehog Signaling Pathway against multiple cell lines. This type of model, known as collaborative filtering [13][14], has been used for information retrieval in social networks and has been widely applied by web companies such as Google and Amazon. Dumitru Erhan et al. pioneered the use of the term “collaborative filtering” in multiple target studies [15]. Nevertheless, their methodology can be categorized as multiple regression or a neural network, and it requires a complex kernel function for similarity measurement. In this study, we present a collaborative filtering model based on collective matrix factorization for integrated QSAR modeling, which is more naturally suited to QSAR modeling and scales well to large datasets. Furthermore, we derive a feature selection strategy for collaborative compound design to obtain more efficient inhibitors of the Hedgehog Signaling Pathway.
Methods and materials
Dataset
The activities of 93 cyclopamine derivatives against four different cell lines (BxPC-3, NCI-H446, SW1990 and NCI-H157) were obtained from our previous work [7]. The compound activity is measured by pK_{i}, as defined by the following Cheng-Prusoff equation [16]:
(L is the concentration of free radioligand used and K_{D} is its equilibrium dissociation constant for the receptor [16])
where IC_{50} (half maximal inhibitory concentration) is a measure of the effectiveness of a compound in inhibiting a biological or biochemical function. More specifically, it indicates how much of a particular drug or other substance (inhibitor) is needed to inhibit a given biological process (or a component of a process, e.g. an enzyme, cell, cell receptor or microorganism) by half. In our study, the data are formulated as a data matrix X. Note that collective matrix factorization requires the matrix to be nonnegative. In our original experiments, we measured compound affinity on the pK_{i} scale, and the activity values were negative. Since pK_{i} is calculated from IC_{50} through equation (1), we can simply take the absolute value of pK_{i} in QSAR modeling, and this will not affect our final results.
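Equation (1) itself is not reproduced in this text. For orientation only, the Cheng-Prusoff relation is commonly written in the form below, using the symbols defined above. How the authors convert K_{i} into pK_{i} (logarithm base, sign convention, concentration units) is not stated here, so that step is deliberately left out.

```latex
K_{i} = \frac{IC_{50}}{1 + \dfrac{L}{K_{D}}}
```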
Definitions and Notations
In this paper, the cell lines and the compounds tested for the Hedgehog Signaling Pathway are denoted by t and c respectively, with subscripts indicating a specific cell line or compound. Thus, for a specific compound c_{i}, its experimental activity value (measured as pK_{i}) against a specific cell line t_{j} is denoted by x_{ij}. We can build an m by n matrix X, where m is the number of compounds and n is the number of cell lines.
Each compound is represented by a vector of descriptors; together these vectors form a matrix Y with m by r dimensions, where m is the number of compounds and r is the length of the descriptor vector. As in our previous study [7], two different molecular descriptors, the general descriptor [17] and the drug-like index (DLI) [18], are used for compound representation.
Collaborative filtering for multiple cell line QSAR modeling
Based on the above definitions, traditional single cell line QSAR modeling is applied to the data in one specific column of matrix X. In this study, we are more interested in incorporating the information from other columns (cell lines) to enhance the performance of QSAR modeling for a particular column (cell line). This scenario is similar to the recommendation systems offered by electronic retailers and content providers such as Amazon.com and Netflix [14], which make automatic predictions (filtering) about a user's interests by collecting preference or taste information from many users (collaborating), hence the term “collaborative filtering (CF)”.
Formally, in a typical CF scenario there is a list of n users {u_{1}, u_{2}, . . . , u_{n}} and a list of m items {i_{1}, i_{2}, . . . , i_{m}}, and each user u_{i} has a list of items, I_{u_{i}}, which the user has rated or for which the user's preferences have been inferred from behavior. The ratings can be either explicit, such as a score on a 1–5 scale, or implicit, such as purchases or click-throughs [13]. Such a user-item relationship can be formulated as a matrix, which may be sparse and can have missing values (i.e. users did not give their preferences). The goal of CF is to predict these missing values from the existing user and item information in order to make reasonable recommendations (left panel of Figure 2).
Such a CF scenario is inherently suitable for our multiple cell line QSAR modeling. In our study, the “compound-cell line” matrix X defined above can be viewed as a kind of “item-user” matrix, where “compound” is analogous to “item” and “cell line” is analogous to “user” (right panel of Figure 2). Traditional single cell line QSAR modeling restricts training and testing to the data in one specific column of matrix X. From a machine learning perspective, we simply hold out part of the data in the column as the testing dataset and use the rest of the column to train a QSAR model. This procedure extends naturally to multiple cell line QSAR modeling under the CF framework: we treat the testing data in a specific column as “missing” values and use the remaining data from this column, together with the data from the other columns, to predict them.
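The hold-out step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `mask_test_entries`, the random seed, and the synthetic 93 x 4 activity matrix are hypothetical and do not come from the original study.

```python
import numpy as np

def mask_test_entries(X, column, test_fraction=1/3, seed=0):
    """Mark a random share of one cell line's activities as 'missing'.

    Returns an indicator matrix (1 = observed, 0 = held out for testing)
    and the sorted row indices of the held-out compounds.
    """
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    n_test = int(round(m * test_fraction))
    test_rows = rng.choice(m, size=n_test, replace=False)
    indicator = np.ones_like(X, dtype=float)
    indicator[test_rows, column] = 0.0   # only this column loses entries
    return indicator, np.sort(test_rows)

# Synthetic stand-in for a compound-by-cell-line matrix (values are made up).
rng = np.random.default_rng(42)
X = rng.uniform(4.0, 8.0, size=(93, 4))
I, test_rows = mask_test_entries(X, column=1)
```

All other columns stay fully observed, so a collaborative model can draw on them when predicting the masked entries, whereas a single cell line model would see only the unmasked rows of column 1.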
Collective matrix factorization for collaborative filtering in the multiple cell line QSAR modeling
We formulate the multiple cell line QSAR modeling problem as a collaborative filtering problem. There are two established families of techniques for collaborative filtering: neighborhood methods and latent factor models. Neighborhood methods center on computing the relationships between items or, alternatively, between users in order to predict missing values, while latent factor models characterize both items and users by, say, 20 to 100 factors inferred from the rating patterns [19][20][21]. Latent factor models are generally realized through matrix factorization. In its basic form, matrix factorization characterizes both items and users through vectors of factors inferred from item rating patterns, and a high correlation between item and user factors leads to a recommendation. These methods have become popular in recent years because they combine good scalability with predictive accuracy. We therefore present a matrix factorization based multiple cell line QSAR modeling method in our study.
Specifically, we have a matrix $X\in {R}_{+}^{m\times n}$, where X_{ij} represents the activity measurement of compound i against cell line j. Note that X is sparse in one specific column, since we hold out part of the elements in this column as the testing data (missing values) for QSAR modeling. We use an indicator matrix $I\in {R}^{m\times n}$ to represent the missing values, where I_{ij} = 0 if X_{ij} is missing and I_{ij} = 1 otherwise.
We denote by $X_{i\cdot}$, $1\le i\le m$, and $X_{\cdot j}$, $1\le j\le n$, the i-th row and j-th column of X, which represent the activities of the i-th compound against all the cell lines and the activities of all the compounds against the j-th cell line, respectively.
In a basic matrix factorization model, we seek two low-rank matrices, $U\in {R}_{+}^{m\times d}$ and $V\in {R}_{+}^{n\times d}$. The row vectors $u_{i}$ and $v_{j}$ are the low-dimensional representations of compounds and cell lines respectively. We use the product $U{V}^{T}$ to approximate the original matrix X and thus to fill in (predict) the missing values. This factorization can be obtained by solving the following optimization problem:
Where
In equation (3), the operator “∘” denotes the entrywise product and $\|\cdot\|_{F}$ denotes the Frobenius norm. The last two terms regularize the matrices U and V to avoid overfitting the observed data. The parameters ${\lambda}_{1}$ and ${\lambda}_{2}$ control the strength of the regularization and are usually determined by cross-validation.
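The typeset forms of equations (2) and (3) are not reproduced in this text. A form consistent with the description (a squared reconstruction error restricted to observed entries by the indicator matrix I, plus Frobenius-norm penalties on U and V) is sketched below. The factors of 1/2 are an assumption made for convenience when differentiating and may differ from the original.

```latex
\min_{U \ge 0,\; V \ge 0} \; L(U, V) \tag{2}

L(U, V) = \frac{1}{2}\,\bigl\| I \circ \bigl(X - U V^{T}\bigr) \bigr\|_{F}^{2}
        + \frac{\lambda_{1}}{2}\,\| U \|_{F}^{2}
        + \frac{\lambda_{2}}{2}\,\| V \|_{F}^{2} \tag{3}
```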
We also have the compound description information in matrix Y, and we use this auxiliary information to obtain a more reasonable reconstruction of matrix X. To this end, we adopt a collective matrix factorization (CMF) [22] method for multiple cell line QSAR modeling. CMF was recently introduced by the machine learning community [22]; it factorizes multiple matrices jointly, assuming that they share common latent factors. More specifically, given a compound-cell line matrix $X\in {R}_{+}^{m\times n}$ and a compound description matrix $Y\in {R}_{+}^{m\times r}$, we extend the optimization problem in equations (2) and (3) to the following:
Where L(U, V, W)
Equation (5) is similar to equation (3): it reconstructs $X\approx U{V}^{T}$ and $Y\approx U{W}^{T}$ by sharing the common factor U, where $X\in {R}_{+}^{m\times n}$, $Y\in {R}_{+}^{m\times r}$, $U\in {R}_{+}^{m\times d}$, $V\in {R}_{+}^{n\times d}$ and $W\in {R}_{+}^{r\times d}$. U, V and W are low-dimensional matrices with dimension $d\le \min(m,n,r)$. By solving this optimization problem, we incorporate both the multiple cell line compound activities and the compound descriptions to better predict the missing values.
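Equations (4) and (5) are not reproduced in this text. One form consistent with the description, in which the masked reconstruction error for X is joined by a reconstruction error for Y that shares the factor U, is sketched below. The trade-off weight $\alpha$ and the separate penalty $\lambda_{3}$ on W are assumptions introduced for illustration; the original may weight or regularize these terms differently.

```latex
\min_{U \ge 0,\; V \ge 0,\; W \ge 0} \; L(U, V, W) \tag{4}

L(U, V, W) = \frac{1}{2}\,\bigl\| I \circ \bigl(X - U V^{T}\bigr) \bigr\|_{F}^{2}
           + \frac{\alpha}{2}\,\bigl\| Y - U W^{T} \bigr\|_{F}^{2}
           + \frac{\lambda_{1}}{2}\,\| U \|_{F}^{2}
           + \frac{\lambda_{2}}{2}\,\| V \|_{F}^{2}
           + \frac{\lambda_{3}}{2}\,\| W \|_{F}^{2} \tag{5}
```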
In general, the objective function (5) is not jointly convex in the variables U, V and W, and no closed-form solution exists for its minimizer. We therefore turn to a numerical method, gradient descent, to obtain locally optimal solutions. Specifically, the gradients are:
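The gradient equations are not reproduced in this text. As a worked sketch, suppose the objective has the form $L = \frac{1}{2}\|I\circ(X-UV^{T})\|_{F}^{2} + \frac{\alpha}{2}\|Y-UW^{T}\|_{F}^{2} + \frac{\lambda_{1}}{2}\|U\|_{F}^{2} + \frac{\lambda_{2}}{2}\|V\|_{F}^{2} + \frac{\lambda_{3}}{2}\|W\|_{F}^{2}$, where $\alpha$ and $\lambda_{3}$ are illustrative assumptions rather than symbols from the original. Because I is binary, $I\circ I = I$, and differentiating term by term gives:

```latex
\nabla_{U} L = -\bigl(I \circ (X - U V^{T})\bigr) V \;-\; \alpha\,(Y - U W^{T})\, W \;+\; \lambda_{1} U

\nabla_{V} L = -\bigl(I \circ (X - U V^{T})\bigr)^{T} U \;+\; \lambda_{2} V

\nabla_{W} L = -\alpha\,(Y - U W^{T})^{T} U \;+\; \lambda_{3} W
```

The shapes match the factors: $\nabla_{U}L$ is m by d, $\nabla_{V}L$ is n by d, and $\nabla_{W}L$ is r by d. Only $\nabla_{U}L$ receives contributions from both X and Y, which is how the descriptor information influences the predicted activities.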
After obtaining the gradients, we use gradient descent to iteratively minimize the objective function. The algorithm for the collective factorization of the two matrices is given below:
Algorithm 1: collective matrix factorization for multiple cell line QSAR modeling
Input: An incomplete matrix X and a complete matrix Y, where X contains the compound activities in multiple cell lines with missing values in a specific column, and Y is the compound description matrix.
Output: The completed matrix X.
Begin
1. t = 1;
2. While ($t < T$ and ${L}_{t}-{L}_{t+1}>\epsilon$) do
3. Get the gradients ${\nabla}_{U}L$, ${\nabla}_{V}L$, ${\nabla}_{W}L$ by Equations (5)-(7);
4. $\gamma = 1$;
5. While ($L({U}_{t}-\gamma {\nabla}_{{U}_{t}}L,\ {V}_{t}-\gamma {\nabla}_{{V}_{t}}L,\ {W}_{t}-\gamma {\nabla}_{{W}_{t}}L)\ge L({U}_{t},{V}_{t},{W}_{t})$) do
6. $\gamma =\gamma /2$;
7. End
8. ${U}_{t+1}={U}_{t}-\gamma {\nabla}_{{U}_{t}}L$, ${V}_{t+1}={V}_{t}-\gamma {\nabla}_{{V}_{t}}L$, ${W}_{t+1}={W}_{t}-\gamma {\nabla}_{{W}_{t}}L$;
9. t = t + 1;
10. End
11. return X;
End
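Algorithm 1 can be sketched in Python with NumPy as follows. This is a minimal illustration, not the authors' released script: the function names (`cmf_loss`, `cmf_gradients`, `cmf_fit`), the random initialization, the single shared penalty `lam`, the trade-off weight `alpha`, and the synthetic data are all assumptions. Like Algorithm 1 as written, it takes plain gradient steps with step halving and does not project the factors onto nonnegative values.

```python
import numpy as np

def cmf_loss(X, I, Y, U, V, W, alpha, lam):
    """Assumed objective: masked error on X, weighted error on Y, L2 penalties."""
    rx = I * (X - U @ V.T)          # residual on observed activities only
    ry = Y - U @ W.T                # residual on the descriptor matrix
    penalty = np.sum(U**2) + np.sum(V**2) + np.sum(W**2)
    return 0.5 * np.sum(rx**2) + 0.5 * alpha * np.sum(ry**2) + 0.5 * lam * penalty

def cmf_gradients(X, I, Y, U, V, W, alpha, lam):
    """Gradients of cmf_loss with respect to U, V and W."""
    rx = I * (X - U @ V.T)
    ry = Y - U @ W.T
    grad_U = -(rx @ V) - alpha * (ry @ W) + lam * U
    grad_V = -(rx.T @ U) + lam * V
    grad_W = -alpha * (ry.T @ U) + lam * W
    return grad_U, grad_V, grad_W

def cmf_fit(X, I, Y, d=3, alpha=1.0, lam=0.1, T=500, eps=1e-6, seed=0):
    """Gradient descent with step halving; returns completed X and the loss trace.

    Entries of X where I == 0 are ignored by the loss but must be finite.
    """
    rng = np.random.default_rng(seed)
    (m, n), r = X.shape, Y.shape[1]
    U, V, W = (rng.uniform(0.0, 1.0, (k, d)) for k in (m, n, r))
    losses = [cmf_loss(X, I, Y, U, V, W, alpha, lam)]
    for _ in range(T):
        gU, gV, gW = cmf_gradients(X, I, Y, U, V, W, alpha, lam)
        gamma = 1.0
        # Steps 5-7: halve the step until the objective decreases.
        while gamma > 1e-12:
            cand = (U - gamma * gU, V - gamma * gV, W - gamma * gW)
            new_loss = cmf_loss(X, I, Y, *cand, alpha, lam)
            if new_loss < losses[-1]:
                break
            gamma /= 2.0
        else:
            break                   # no descent step found; stop early
        U, V, W = cand
        losses.append(new_loss)
        if losses[-2] - losses[-1] <= eps:
            break                   # improvement below tolerance
    X_hat = np.where(I == 1, X, U @ V.T)   # fill only the missing entries
    return X_hat, losses

# Synthetic low-rank data (made up): 30 compounds, 4 cell lines, 6 descriptors.
rng = np.random.default_rng(1)
U0, V0, W0 = rng.uniform(0, 1, (30, 3)), rng.uniform(0, 1, (4, 3)), rng.uniform(0, 1, (6, 3))
X_true, Y = U0 @ V0.T, U0 @ W0.T
I = np.ones_like(X_true)
I[:10, 1] = 0.0                     # hold out 10 activities in one cell line
X_hat, losses = cmf_fit(X_true * I, I, Y)
```

The step-halving loop guarantees that every accepted update lowers the objective, so the loss trace is monotone even though the problem is not jointly convex. If nonnegative factors are required, a projection such as clipping at zero after each update would be one option, though that is a departure from the pseudocode above.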
Performance measurement
To demonstrate the efficiency of collective matrix factorization based multiple cell line QSAR modeling, we compare our approach with two baseline methods for single cell line QSAR modeling used in our previous study [7], namely linear ridge regression and support vector regression (SVR). For a fair comparison, we apply the following two testing strategies to each cell line. (1) We randomly select 2/3 of the data to train linear ridge regression and SVR, and use the remaining data to test these two methods. The two baselines are compared with the collective matrix factorization based QSAR method, in which the same testing data (missing values) are predicted from the original training data for this cell line plus the data from the other cell lines. The whole procedure is repeated 10 times. (2) To evaluate the QSAR model more rigorously and to account for how well the training compounds represent the dataset, we apply a second data partition strategy, the Diverse Subset data division method [7], which is commonly used in the chemoinformatics community. The Diverse Subset method ranks compound entries by diversity. The first entry of the original dataset is taken as a reference and is always part of the diverse subset. The compound most distinct from it is then assigned rank 2, the compound most distinct from these two is assigned rank 3, and so on, until the required number of diverse compounds has been identified or the whole dataset has been ranked in diversity order [7]. Here too we select 1/3 of the data as the testing dataset and the remaining data as the training dataset, but the partition is generated by the Diverse Subset method rather than randomly, in order to preserve the representative and distinct characteristics of the data.
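The diversity ranking described above can be illustrated with a short max-min sketch. It assumes that "most distinct" means the compound whose smallest Euclidean distance to the already ranked compounds is largest; the distance measure actually used in [7] is not stated here, and the function name `diversity_ranking` is hypothetical.

```python
import numpy as np

def diversity_ranking(D):
    """Rank the rows of descriptor matrix D in diversity order.

    Row 0 is the reference. Each following pick is the row whose minimum
    Euclidean distance to the rows ranked so far is largest.
    """
    m = D.shape[0]
    order = [0]
    # min_dist[i]: distance from row i to its nearest already ranked row.
    min_dist = np.linalg.norm(D - D[0], axis=1)
    min_dist[0] = -np.inf                 # never pick a ranked row again
    for _ in range(m - 1):
        nxt = int(np.argmax(min_dist))
        order.append(nxt)
        min_dist = np.minimum(min_dist, np.linalg.norm(D - D[nxt], axis=1))
        min_dist[nxt] = -np.inf
    return order

# Four compounds described by a single made-up descriptor value.
D = np.array([[0.0], [1.0], [10.0], [4.0]])
order = diversity_ranking(D)   # → [0, 2, 3, 1]
```

Starting from the first compound, the farthest one (value 10) is ranked second, then the value 4 (distance 4 to its nearest ranked neighbor), and finally the value 1. Taking a prefix of this ranking yields a diverse subset of any required size.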
Two classical measures, root mean squared error (RMSE) and squared correlation coefficient (R-square), were adopted to evaluate the testing results. These statistical parameters are defined as follows:
Root mean squared error (RMSE):
where n is the number of test compounds and ${e}_{i}={y}_{i}-{\widehat{y}}_{i}$ is the difference between the observed compound affinity ${y}_{i}$ and the predicted compound affinity ${\widehat{y}}_{i}$.
Squared correlation coefficient ($R^{2}$):
where ${P}^{\mathit{avg}}$ is the average value of ${P}_{i}^{exp}$ over the n predicted compound affinities.
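The two formulas are not reproduced in this text. The sketch below computes RMSE as the square root of the mean squared residual, which follows directly from the definition of $e_{i}$, and computes R-square as one minus the ratio of the residual sum of squares to the total sum of squares around the mean observed value. The second form is an assumption suggested by the reference to the average observed value; a squared Pearson correlation would be another possible reading. Function names are illustrative.

```python
import numpy as np

def rmse(y_obs, y_pred):
    """Root mean squared error between observed and predicted affinities."""
    e = np.asarray(y_obs, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(e**2)))

def r_square(y_obs, y_pred):
    """Assumed form: 1 - SS_res / SS_tot, with SS_tot taken around mean(y_obs)."""
    y_obs = np.asarray(y_obs, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_obs - y_pred) ** 2)
    ss_tot = np.sum((y_obs - y_obs.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)
```

Under this definition, a model that always predicts the mean observed affinity scores 0 and a perfect model scores 1, while a model worse than the mean predictor scores below 0.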
Feature selection based on CMF for compound description among multiple cell lines
Under this collaborative QSAR scheme, we present a novel feature selection model for weighting compound descriptors, which is derived from content-based recommender systems and collaborative filtering [23]. Basically, we want to quantify the effect of each compound feature on a specific cell line (intra-cell-line weighting) as well as across all the cell lines (inter-cell-line weighting). The final feature weighting integrates the two types of weighting, so that both the specific cell line and the whole set of cell lines contribute. Such a feature selection strategy is attractive in multiple cell line QSAR modeling, since it can provide useful clues about how to modify chemical compounds to improve their activities for a specific target or for all given cell lines simultaneously; the latter is a key step in multi-target compound design.
Specifically, given a compound-cell line activity matrix X (m by n) and a compound-feature description matrix Y (m by r), we want to derive a cell line-feature weighting matrix Z (n by r), whose element z_{ij} is the weight of compound feature j in cell line i. The value of z_{ij} has two contributions, intra-cell-line and inter-cell-line. The general procedure for computing the weight of each compound feature is based on (1) the amount of information provided by the feature itself, and (2) the correlation between the compound feature and a specific cell line. Three steps are performed:
Step 1. Inter-cell-line weighting. For each compound feature c_{j}, an entropy-based method is applied to compute the amount of information that it can offer regardless of cell line, denoted by H_{j}.
Step 2. Intra-cell-line weighting. For each compound feature c_{j} and a specific cell line t_{i}, the correlation between the compound feature and the cell line is calculated. This calculation depends on the nature of the feature (qualitative or quantitative). Two kinds of correlation, the correlation coefficient and the contingency coefficient [23], are used for quantitative and qualitative features respectively.
Step 3. Calculation of the final weights. The feature weight is obtained as the product of the entropy and the degree of dependency.
A general outline of the proposed feature selection strategy is presented in Figure 3. Details can be found in the original work [23].
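The three steps can be sketched as follows for quantitative features. This is only an illustration under stated assumptions: the entropy H_{j} is estimated from a 10-bin histogram of each feature, the dependency is the absolute Pearson correlation between the feature and the activities in one cell line, and the contingency coefficient for qualitative features is omitted. The estimators actually used in [23] may differ, and the function names are hypothetical.

```python
import numpy as np

def feature_entropy(column, bins=10):
    """Step 1: Shannon entropy (bits) of a feature, from a histogram estimate."""
    counts, _ = np.histogram(column, bins=bins)
    p = counts[counts > 0] / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def feature_weights(X, Y, bins=10):
    """Return Z (n cell lines x r features) and the entropies H (length r).

    Z[i, j] = H[j] * |corr(X[:, i], Y[:, j])|, i.e. Step 1 times Step 2.
    Constant columns have undefined correlation and are given weight 0.
    """
    n, r = X.shape[1], Y.shape[1]
    H = np.array([feature_entropy(Y[:, j], bins) for j in range(r)])
    Z = np.zeros((n, r))
    for i in range(n):
        for j in range(r):
            if np.std(X[:, i]) > 0 and np.std(Y[:, j]) > 0:
                dependency = abs(np.corrcoef(X[:, i], Y[:, j])[0, 1])  # Step 2
                Z[i, j] = H[j] * dependency                            # Step 3
    return Z, H

# Made-up data: 40 compounds, 2 cell lines, 3 features.
rng = np.random.default_rng(0)
X = rng.uniform(size=(40, 2))
Y = np.column_stack([X[:, 0], np.full(40, 3.0), rng.uniform(size=40)])
Z, H = feature_weights(X, Y)
```

In this toy example the first feature copies the activities of the first cell line, so its weight there equals its entropy, while the constant second feature carries no information and receives weight 0 for every cell line.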
Results
We performed a comprehensive study of collective matrix factorization based multiple cell line QSAR modeling for inhibitors of the Hedgehog Signaling Pathway, as described in the Methods and materials section. In the rest of this section we present and summarize the key results from this study. The performance of our method was compared with the baseline QSAR models of linear ridge regression and SVR. Details are given below.
Performance of the collective matrix factorization based QSAR modeling
Figures 4 and 5 present the average improvements achieved by CMF based multiple cell line QSAR modeling over the baseline methods for the four cell lines, with two different compound representations, the general descriptor and the drug-like index. Figure 4 shows the results for the first partition strategy, where the test was carried out under a fixed parameter setting and repeated 10 times, each time randomly selecting 2/3 of the data as the training dataset and 1/3 as the testing dataset. Figure 5 shows the results for the second strategy, where the test was carried out under a fixed parameter setting with the diverse subset partition, which accounts for how representative the training and testing datasets are.
From Figures 4 and 5, it can be seen that the two data partition strategies achieve similar performance. For all cell lines and both data representations, the performance improvement of collaborative QSAR modeling was dramatic, especially in terms of R-square. The improvement is statistically significant for both RMSE and R-square. We had already noticed in our previous study [7] that, measured by R-square, the QSAR modeling results for the four cell lines with numeric compound activities were not satisfactory, indicating that a satisfactory QSAR model for a single cell line was hard to obtain. In contrast, in our current collaborative QSAR modeling, performance was improved for all the cell lines. The large improvement in R-square indicates that our CMF based QSAR modeling successfully captures the correlation between predicted and observed activities, rather than only reducing the absolute differences measured by RMSE.
In addition to the average RMSE and R-square of the different QSAR models, we also investigated their error distributions under the diverse subset partition strategy for a more rigorous comparison. The boxplots of the squared errors (Figures 6 and 7) show that, for both compound descriptions, collaborative QSAR modeling achieved the lowest mean error and a low variance compared with the two baselines, indicating the best predictive ability among all methods.
It should be noted that in our previous study we found that linear regression and SVR models of inhibitor affinity performed differently across cell lines. In particular, only the NCI-H446 data produced a reasonable QSAR model, probably because the other three cell lines are less sensitive than NCI-H446 cells to the hedgehog signaling inhibitors [7]. Nevertheless, Figures 4-7 show that if we combine the data from the different cell lines under CMF based QSAR modeling, we can greatly reduce such nonspecific effects in the cell lines and obtain a reasonable QSAR model for each cell line. This improvement is attributed to the fact that the collaborative filtering framework allows the data from different cell lines to reinforce each other during training, which eventually makes the modeling better than using the datasets separately. We believe that such a “collaborative” scenario for drug analysis will become more popular in the future, as more and more cell lines become available and drugs often need to be investigated under various circumstances.
Finally, to evaluate whether our collaborative QSAR model is general enough for new predictions, we also checked the domain of application (DOA) of the model under the diverse subset partition strategy. The DOA is used to estimate the reliability of the prediction for a new compound [24] by a specific method. Molecules that fall outside the domain may lead to unreliable predictions [10]. In the analysis of DOA, a leverage value ${h}_{i}$ is defined in equation (11) for each chemical molecule:
where ${X}_{i}$ is the row-vector descriptor of the query compound and X is the $n\times k$ matrix containing the k descriptor values of the n training samples. The superscript T denotes the transpose of a matrix or vector. Generally, the warning leverage $h^{*}$ is fixed at $3k/n$, where n is the number of training compounds and k is the number of descriptors. When the leverage is greater than the warning leverage $h^{*}$, the predicted activity is the result of substantial extrapolation of the model and may therefore be unreliable and prone to overfitting.
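Equation (11) is not reproduced in this text. The sketch below uses the standard leverage definition, $h_{i} = X_{i}(X^{T}X)^{-1}X_{i}^{T}$, which matches the symbols described above, together with the stated warning threshold $3k/n$. The function names and the random descriptor matrix are illustrative, and the use of a pseudo-inverse (to tolerate collinear descriptors) is an implementation assumption.

```python
import numpy as np

def leverages(X_train, X_query):
    """Leverage h_i = x_i (X^T X)^+ x_i^T for each row x_i of X_query."""
    gram_inv = np.linalg.pinv(X_train.T @ X_train)
    # Quadratic form evaluated row by row without building the full hat matrix.
    return np.einsum("ij,jk,ik->i", X_query, gram_inv, X_query)

def warning_leverage(X_train):
    """Warning threshold h* = 3k/n for n training compounds and k descriptors."""
    n, k = X_train.shape
    return 3.0 * k / n

# Made-up descriptor matrix: 50 training compounds, 4 descriptors.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 4))
h = leverages(X_train, X_train)
h_star = warning_leverage(X_train)
outside_domain = h > h_star          # flags structurally influential compounds
```

For a full-rank training matrix the training leverages sum to k, so the average leverage is k/n and the warning value corresponds to three times that average.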
Based on the definition of leverage, the Williams plot was used in this study to visualize the DOA of the QSAR model [10]. The Williams plot shows the standardized cross-validated residuals (RES) versus the leverage values (h), and allows immediate and simple graphical detection of both response outliers (Y outliers) and structurally influential chemicals (X outliers) of a model. Generally, points whose Y-axis values fall outside the 3σ line (σ being the standard residual unit of the compounds) can be considered Y outliers, while points whose X-axis values fall beyond the warning leverage $h^{*}$ line can be considered X outliers [10]. Figures 8 and 9 show the Williams plots for the four cell lines with the General Descriptor and Drug-like index compound representations respectively. For all four cell lines, most of the compounds fall within the corresponding application domain, which indicates that collaborative QSAR modeling achieves reliable activity predictions for the compounds within a well-defined domain of applicability.
Impact of the Regularization Parameters
In this subsection, we investigate the impact of the regularization parameters on our CMF-based QSAR modeling. We vary the values of ${\lambda}_{1}$ and ${\lambda}_{2}$ under different dimensionalities of the low-dimensional representations and different proportions of training data, and plot the RMSE over the data of all four cell lines, as shown in Figure 10. The tests were performed with both compound descriptions, General Descriptors and Drug-Like Index. In the figure, the x-axis corresponds to the value of the regularization parameter (0.001, 0.01, 0.1, 1, 10, 100), while the y-axis corresponds to the training ratio for the whole QSAR dataset (15 %, 35 %, 55 %, 75 %). It can be seen that (1) the regularization parameter has little influence on performance, indicating that our proposed method is robust and insensitive to these parameters; (2) performance improves with a larger number of training samples, which is not surprising; and (3) the two compound descriptions, Drug-Like Index and General Descriptor, perform similarly in CMF, with no statistically significant difference.
Feature selection based on CMF for compound description among multiple cell lines
Using the collaborative filtering based feature selection strategy proposed above, we obtained the intra-cell-line and inter-cell-line feature weightings for the inhibitors of the Hedgehog Signaling Pathway. The former can be used to uncover the features that are important in designing inhibitors against a specific cell line, while the latter identifies common features that are important for inhibitors against multiple cell lines simultaneously. We compared these two kinds of feature weighting to provide useful clues for modifying the inhibitors and improving their affinities.
In this feature selection, each compound was represented by the Drug-like index, with a total of 28 features, since its biological meaning is easy to interpret. The General Descriptor (GD) feature space has been hybridized, so the original structural meaning of the individual features could not be kept; GD was therefore not adopted for feature weighting here. Generally speaking, the Drug-like index belongs to the category of structural descriptors. Structural descriptors can correlate with each other, and some of them may be redundant. However, if they have different and significant distributions in the drug class under consideration, they can be used for drug-knowledge extraction and the redundant ones can be ignored. In our study, the descriptors maintain their identity and a clearly interpretable structural significance throughout the process. Detailed descriptions of each Drug-like index feature are listed in Additional file 2: Table S1.
The 28 feature weights for the intra-cell-line, inter-cell-line and final integrated weightings are shown in Figures 11, 12 and 13. In all three figures, the x-axis represents the Drug-like index feature ID and the y-axis represents the corresponding weight. It can be seen that the final integrated feature weighting differs from the intra-cell-line weighting. Moreover, the inter-cell-line feature weighting can be viewed as an efficient way to identify features that are potentially important for multi-target inhibitors of the Hedgehog Signaling Pathway. Based on these figures, we offer the following insights into inhibitor design:

1) As shown in Figure 11, the features ‘# of non-H’ (DLI1), ‘# of non-H polar bonds’ (DLI5) and ‘# of 2-degree cyclic atoms’ (DLI13) were ranked in the top 3. These findings indicate that the volume, the polarity and the degree of cyclization of the molecule are the most important features for the design of multi-target inhibitors of the Hedgehog Signaling Pathway. This is consistent with the empirical rules for lead compound optimization, which use these three elements to determine activity.

2) Figure 11 also shows that the feature ‘# of cap fragments’ (DLI23) is important when multi-cell-line inhibitors are designed. This is consistent with the empirical rule of changing the substituent group (functional group) in order to improve inhibitor activity. However, a comparison with Figure 12 shows that this feature is less important for multi-cell-line inhibitor design than for individual cell lines. A possible explanation is that, although the feature is important for the inhibitors of each individual cell line, the directions in which it improves activity may be inconsistent across cell lines, which reduces its importance when multiple cell lines are considered together.

3) All three figures (Figures 11, 12 and 13) show that the weight of the feature ‘# of 3-level bonding patterns’ (DLI18) is 0. This is probably due to two reasons: a) all compound samples in our study lack this feature, and b) this feature is not considered in most in-silico compound optimizations.
Discussion
Comparison of CMF based QSAR modeling with other collaborative QSAR modeling
Although our study focused on CMF-based QSAR modeling, other QSAR approaches that integrate information also exist. We refer to such models as “collaborative” QSAR modeling; examples include neural network based models [15][25–27], multi-task learning based models [9][10], and proteochemometric modeling (PCM) [28][29]. To further uncover the characteristics of collaborative QSAR modeling, we compare our CMF-based method with these approaches in the context of our multiple cell line QSAR modeling for inhibitors of the Hedgehog Signaling Pathway.
Neural network based collaborative QSAR modeling
As mentioned above, Erhan et al. proposed a neural network based collaborative QSAR model for drug discovery [15]. This was one of the first attempts to construct an efficient procedure for integrating information from multiple drug targets at a time by extending standard multilayer neural networks. Neural networks provide an ideal test bed for implementing collaborative QSAR modeling: the simplest form is a shared hidden layer that is trained in parallel for all the learning tasks. In this case, training is done on all the tasks (in our study, the QSAR data of all cell lines) in parallel. Because the network includes a shared layer (weight matrix), so-called “shared internal representations” can develop and be learned.
Specifically, in our multiple cell line QSAR modeling for inhibitors of the Hedgehog Signaling Pathway, we used a 10-fold cross-validation scheme to test the data from the 4 cell lines with this neural network model. The weights from the input layer to the hidden layer, and from the hidden layer to the output layer, are learned through the back-propagation (BP) algorithm. Our in-house test indicated that the neural network based collaborative QSAR model was comparable to the CMF-based model and, not surprisingly, better than single cell line QSAR modeling (results not shown).
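The shared-hidden-layer idea can be illustrated with a small NumPy network. This is a sketch under assumptions, not the network tested in this study: the layer size, tanh activation, learning rate and full-batch gradient descent are arbitrary choices, and the function name `train_shared_net` is hypothetical. One hidden layer is shared by all tasks, and each output unit predicts the activity for one cell line.

```python
import numpy as np

def train_shared_net(X, Y, n_hidden=8, lr=0.05, n_epochs=500, seed=0):
    """Train a one-hidden-layer network whose hidden layer is shared by all tasks.

    X : (n, d) compound descriptors
    Y : (n, m) activities, one column per task (e.g. per cell line)
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    m = Y.shape[1]
    W1 = rng.normal(scale=0.1, size=(d, n_hidden))   # input -> shared hidden
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(scale=0.1, size=(n_hidden, m))   # shared hidden -> task outputs
    b2 = np.zeros(m)
    losses = []
    for _ in range(n_epochs):
        H = np.tanh(X @ W1 + b1)      # shared internal representation
        P = H @ W2 + b2               # one linear output per task
        E = P - Y
        losses.append(float(np.mean(E ** 2)))
        # Back-propagation of the mean squared error over all tasks.
        dP = 2.0 * E / E.size
        dW2 = H.T @ dP
        db2 = dP.sum(axis=0)
        dZ = (dP @ W2.T) * (1.0 - H ** 2)   # derivative of tanh
        dW1 = X.T @ dZ
        db1 = dZ.sum(axis=0)
        W1 -= lr * dW1
        b1 -= lr * db1
        W2 -= lr * dW2
        b2 -= lr * db2
    return (W1, b1, W2, b2), losses
```

Because every task back-propagates through the same `W1`, the hidden layer is shaped by all cell lines at once, which is the mechanism by which information is shared.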
Multitask learning based collaborative QSAR modeling
A neural network can be viewed as a specific form of multi-task learning. Multi-task learning has been developed for situations where multiple related learning tasks are to be accomplished together. When explicit or hidden interrelationships among the tasks can be exploited [9][10], multi-task learning is more effective than learning each task independently. The intuition underlying the framework is that multiple related tasks can benefit each other by sharing data and features across tasks, thus boosting the learning performance of each single task [30]. It also provides an efficient mechanism for cross-task feature selection, uncovering the common dominant features for all the tasks simultaneously. Our group has successfully applied multi-task learning to QSAR modeling in studies of HIV and HCV inhibitors [9][10]. Assume that the dataset contains N tuples ${z}_{i}=({\text{x}}_{i},{y}_{i},{k}_{i})$ for $i=1,\dots ,N$, where ${\text{x}}_{i}\in {\text{R}}^{d}$ is the drug descriptor, ${y}_{i}$ is the corresponding activity value, and ${k}_{i}\in \{1,\dots ,M\}$ indicates the task to which the example $({\text{x}}_{i},{y}_{i})$ belongs. The M tasks correspond to M different cell lines or drug targets. A critical issue in this collaborative QSAR modeling is to learn a set of sparse functions across these tasks for drug activity regression. This is commonly achieved by learning M linear regressions of the form ${\text{w}}_{k}^{T}\text{x}$ under the following square loss function (other loss functions can also be applied):
$l(W,z)={(y-{\text{w}}_{k}^{T}\text{x})}^{2}$
where $z=(\text{x},y,k)$, $W=[{\text{w}}_{1},{\text{w}}_{2},\dots ,{\text{w}}_{M}]\in {\text{R}}^{d\times M}$, and ${W}^{j}$ denotes the j th row of W.
In the multi-task learning framework, W is optimized by enforcing joint sparsity across the tasks: a suitable norm of the matrix W is added to the square loss function, which leads to only a few nonzero rows of W.
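As a concrete illustration of joint sparsity, the sketch below minimizes the summed square loss plus a row-wise penalty by proximal gradient descent. The choice of the $\ell_{2,1}$ norm (the sum of the Euclidean norms of the rows ${W}^{j}$) is an assumption, since the text does not say which norm is used, and the function names, step size rule and iteration count are hypothetical.

```python
import numpy as np

def row_soft_threshold(W, t):
    """Proximal operator of t * sum_j ||W^j||_2: shrinks whole rows, zeroing small ones."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - t / np.maximum(norms, 1e-12))
    return W * scale

def multitask_l21(X, y, k, n_tasks, lam, n_iter=500):
    """Minimize sum_i (y_i - w_{k_i}^T x_i)^2 + lam * sum_j ||W^j||_2.

    X : (n, d) descriptors, y : (n,) activities,
    k : (n,) integer task index in {0, ..., n_tasks - 1} (0-based here).
    ASSUMPTION: the l2,1 norm is used as the joint-sparsity penalty.
    """
    n, d = X.shape
    W = np.zeros((d, n_tasks))
    # Step size from an upper bound on the Lipschitz constant of the loss gradient.
    step = 1.0 / (2.0 * np.linalg.norm(X, 2) ** 2)
    for _ in range(n_iter):
        # Residual w_{k_i}^T x_i - y_i for every example.
        resid = np.einsum("ij,ij->i", X, W[:, k].T) - y
        G = np.zeros_like(W)
        for t in range(n_tasks):
            sel = k == t
            G[:, t] = 2.0 * X[sel].T @ resid[sel]   # gradient of task t's square loss
        W = row_soft_threshold(W - step * G, step * lam)
    return W
```

A row of `W` that is shrunk to zero removes the corresponding descriptor from every task at once, which is how the penalty performs cross-task feature selection. With `lam=0` the routine reduces to independent least-squares fits per task.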
The relationship between collaborative filtering and multi-task learning has been discussed in previous studies [30]. The multi-task learning model is closely related to multiple response regression models [31]. Multiple response regression is the task of estimating several response variables using a common set of input variables. In general, both multi-task learning and multiple response regression can be used to find the correlation between different tasks and thereby improve on single-task learning. Such an approach has many potential applications in various areas; interested readers are referred to [31]. It should be noted that in the multi-task learning framework, the samples for different tasks should not be identical. In general, the less the samples overlap across tasks, the more each task gains in predictive ability. This idea is related to another interesting family of algorithms, transfer learning [32]: multi-task learning can be categorized into this area, and information is expected to “transfer” between tasks to boost the performance of each individual task.
For the particular data in our multiple cell line QSAR modeling for inhibitors of the Hedgehog Signaling Pathway, the drug samples are identical across all the cell lines, so multi-task learning is unnecessary for the collaborative QSAR here. Nevertheless, if the samples for the cell lines are not identical, multi-task learning will be a good choice for collaborative QSAR modeling that integrates different data sources.
Proteochemometric Modeling
Proteochemometric modeling (PCM) is based on the similarity of a group of ligands and a group of targets; that is, PCM models the so-called ligand-target interaction space [28][29]. Like QSAR modeling, a PCM model is built on chemical descriptors of the compound data set, but it introduces an additional term, a descriptor of the protein-target interaction (Figure 14). Therefore, a PCM model is constructed on both ligand and target similarity, and it can be regarded as an extension of conventional QSAR modeling that models the relationship between multiple compounds and targets simultaneously. Among the approaches discussed here, PCM is intrinsically the most similar to our collaborative filtering based QSAR modeling. PCM explicitly requires the target information as well as descriptions of the protein-target interaction, whereas in our collaborative filtering based QSAR modeling these two kinds of information are implicitly embedded in one computational scheme. From this point of view, our model is more flexible and extensible. Since no explicit target information is available in our multiple cell line QSAR modeling for inhibitors of the Hedgehog Signaling Pathway, we cannot use PCM here. Large-scale studies of ligand-target relationships, and comparisons between collaborative filtering based methods and PCM, remain an interesting and useful topic for future work.
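To make the contrast concrete, the following sketch shows one common way a PCM input vector is assembled: ligand descriptors, target descriptors, and ligand-target cross-terms. The outer-product cross-terms and the function name `pcm_features` are assumptions for illustration; the descriptor values in the usage note are placeholders, not data from this study.

```python
import numpy as np

def pcm_features(ligand_desc, target_desc):
    """Concatenate ligand descriptors, target descriptors and their cross-terms.

    ASSUMPTION: the interaction is encoded as all pairwise products
    (outer product) of ligand and target descriptors.
    """
    ligand_desc = np.asarray(ligand_desc, dtype=float)
    target_desc = np.asarray(target_desc, dtype=float)
    cross = np.outer(ligand_desc, target_desc).ravel()
    return np.concatenate([ligand_desc, target_desc, cross])
```

For a ligand with 2 descriptors and a target with 3, `pcm_features([1, 2], [3, 4, 5])` yields a vector of length 2 + 3 + 6 = 11. The target block is exactly the explicit information that PCM needs and that is unavailable for the cell lines studied here.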
Conclusions
In this study, an efficient collaborative QSAR model for inhibitors of the Hedgehog Signal Pathway across multiple cell lines was proposed. The model is derived from collaborative filtering, a technique from information retrieval in social networks, and its performance is demonstrated and explained in our study. By applying this computational model, we addressed two issues that remained open in our previous study: (1) The information from multiple cell lines can be integrated to boost the QSAR results, rather than modeling each cell line separately. Our experiments indicated that the performance is markedly better than that of single cell line QSAR methods. (2) A novel feature selection strategy for this collaborative setting was proposed, which derives the commonly important features across all the given cell lines while at the same time presenting their specific contributions to each cell line. Based on the feature selection results, we presented several chemical modifications that are likely to improve compound affinity towards multiple targets in the Hedgehog Signal Pathway simultaneously. In summary, our study provides useful clues for multiple cell line/target QSAR modeling when cell line or target information within a related pathway exists. The proposed collaborative model, together with the feature selection strategy, is efficient, robust and flexible, and can be easily extended to model large-scale QSAR data from multiple cell lines.
References
 1.
Di Magliano MP, Hebrok M: Hedgehog signalling in cancer formation and maintenance. Nat Rev Cancer 2003, 3(12):903–911. 10.1038/nrc1229
 2.
Chen JK, Taipale J, Cooper MK, Beachy PA: Inhibition of Hedgehog signaling by direct binding of cyclopamine to Smoothened. Genes Dev 2002, 16(21):2743. 10.1101/gad.1025302
 3.
Ingham PW, Nakano Y, Seger C: Mechanisms and functions of Hedgehog signalling across the metazoa. Nat Rev Genet 2011, 12(6):393–406. 10.1038/nrg2984
 4.
Teichert AE, Elalieh H, Elias PM, Welsh JE, Bikle DD: Overexpression of Hedgehog Signaling Is Associated with Epidermal Tumor Formation in Vitamin D Receptor–Null Mice. J Investig Dermatol 2011, 131(11):2289–97. 10.1038/jid.2011.196
 5.
Watkins DN, Berman DM, Burkholder SG, Wang B, Beachy PA, Baylin SB: Hedgehog signalling within airway epithelial progenitors and in small-cell lung cancer. Nature 2003, 422(6929):313–317. 10.1038/nature01493
 6.
Zhao C, Chen A, Jamieson CH, Fereshteh M, Abrahamsson A, Blum J, Kwon HY, Kim J, Chute JP, Rizzieri D: Hedgehog signalling is essential for maintenance of cancer stem cells in myeloid leukaemia. Nature 2009, 458(7239):776–779. 10.1038/nature07737
 7.
Zhu R, Liu Q, Tang J, Li H, Cao Z: Investigations on Inhibitors of Hedgehog Signal Pathway: A Quantitative Structure-Activity Relationship Study. Int J Mol Sci 2011, 12(5):3018–3033. 10.3390/ijms12053018
 8.
Tang J, Li HL, Shen YH, Jin HZ, Yan SK, Liu XH, Zeng HW, Liu RH, Tan YX, Zhang WD: Antitumor and antiplatelet activity of alkaloids from Veratrum dahuricum. Phytother Res 2010, 24(6):821–826.
 9.
Liu Q, Che D, Huang Q, Cao Z, Zhu R: Multi‐target QSAR Study in the Analysis and Design of HIV‐1 Inhibitors. Chin J Chem 2010, 28(9):1587–1592. 10.1002/cjoc.201090269
 10.
Liu Q, Zhou H, Liu L, Chen X, Zhu R, Cao Z: Multi-target QSAR modelling in the analysis and design of HIV-HCV co-inhibitors: an in-silico study. BMC Bioinforma 2011, 12: 294. 10.1186/1471-2105-12-294
 11.
Ning X, Rangwala H, Karypis G: Multi-assay-based structure-activity relationship models: improving structure-activity relationship models by incorporating activity information from related targets. J Chem Inf Model 2009, 49(11):2444–2456. 10.1021/ci900182q
 12.
Medina-Franco JL, Yongye AB, Pérez-Villanueva J, Houghten R, Martinez-Mayorga K: Multitarget Structure-Activity Relationships Characterized by Activity-Difference Maps and Consensus Similarity Measure. J Chem Inf Model 2011.
 13.
Herlocker JL, Konstan JA, Terveen LG, Riedl JT: Evaluating collaborative filtering recommender systems. ACM Trans Information Syst (TOIS) 2004, 22(1):5–53. 10.1145/963770.963772
 14.
Breese JS, Heckerman D, Kadie C: Empirical Analysis of Predictive Algorithms for Collaborative Filtering. In Proceedings of Fourteenth Conference on Uncertainty in Artificial Intelligence. Madison, WI: Morgan Kaufmann; 1998:43–52.
 15.
Erhan D, L'Heureux PJ, Yue SY, Bengio Y: Collaborative filtering on a family of biological targets. J Chem Inf Model 2006, 46(2):626–635. 10.1021/ci050367t
 16.
Lazareno S, Birdsall N: Estimation of antagonist Kb from inhibition curves in functional experiments: alternatives to the Cheng-Prusoff equation. Trends Pharmacol Sci 1993, 14(6):237–239. 10.1016/0165-6147(93)90018-F
 17.
Todeschini R, Consonni V: Handbook of molecular descriptors. Wiley-VCH; 2008. vol. 79
 18.
Xu J, Stevenson J: Drug-like index: a new approach to measure drug-like compounds and their diversity. J Chem Inf Comput Sci 2000, 40(5):1177–1187. 10.1021/ci000026+
 19.
Melville P, Mooney RJ, Nagarajan R: Content-boosted collaborative filtering for improved recommendations. In 2002. 2002 edition. Menlo Park, CA; Cambridge, MA; London: AAAI Press; MIT Press; 1999:187–192.
 20.
Schafer J, Frankowski D, Herlocker J, Sen S: Collaborative filtering recommender systems. The Adaptive Web: Methods and Strategies of Web Personalization, Vol. 4321 2007, 291–324.
 21.
Su X, Khoshgoftaar TM: A survey of collaborative filtering techniques. Advances in Artificial Intelligence 2009, 2009: 4.
 22.
Singh AP, Gordon GJ: Relational learning via collective matrix factorization. In 2008. ACM; 2008:650–658.
 23.
Barranco M, Martínez L: A method for weighting multi-valued features in content-based filtering. Trends in Applied Intelligent Syst 2010, 418: 409–418.
 24.
Weaver S, Gleeson MP: The importance of the domain of applicability in QSAR modeling. J Mol Graph Model 2008, 26(8):1315–1326. 10.1016/j.jmgm.2008.01.002
 25.
Patra JC, Singh O: Artificial neural networks‐based approach to design ARIs using QSAR for diabetes mellitus. J Comput Chem 2009, 30(15):2494–2508. 10.1002/jcc.21240
 26.
Patra JC, Chua BH: Artificial neural network‐based drug design for diabetes mellitus using flavonoids. J Comput Chem 2011, 32(4):555–567. 10.1002/jcc.21641
 27.
Xu J, Wang L, Shen X, Xu W: QSPR study of Setschenow constants of organic compounds using MLR, ANN, and SVM analyses. J Comput Chem 2011, 32(15):3241–52. 10.1002/jcc.21907
 28.
Lapinsh M, Prusis P, Lundstedt T, Wikberg JES: Proteochemometrics modeling of the interaction of amine G-protein coupled receptors with a diverse set of ligands. Mol Pharmacol 2002, 61(6):1465. 10.1124/mol.61.6.1465
 29.
Wikberg JES, Lapinsh M, Prusis P: Proteochemometrics: a tool for modeling the molecular interaction space. Chemogenomics in Drug Discovery: A Medicinal Chemistry Perspective 2005, Chapter 10. 2004, 289–309.
 30.
Yu K, Tresp V: Learning to learn and collaborative filtering. Neural Information Processing Systems Workshop on Inductive Transfer: 10 Years Later 2005.
 31.
Breiman L, Friedman JH: Predicting multivariate responses in multiple linear regression. J Royal Stat Soc: Series B (Stat Methodol) 1997, 59(1):3–54. 10.1111/1467-9868.00054
 32.
Pan SJ, Yang Q: A survey on transfer learning. Knowl Data Eng, IEEE Trans on 2010, 22(10):1345–1359.
Acknowledgements
We thank the lab of Prof. Wei-Dong Zhang at the Second Military Medical University, Shanghai, for sharing the compound data. This work was supported in part by Project Shanghai Pujiang Talents Funding (Grant No. 11PJ1407400), Young Teachers for the Doctoral Program of Ministry of Education, China (Grant No. 20110072120048) and the National Natural Science Foundation of China (Grant No. 30976611, Grant No. 31100956 and Grant No. 61173117).
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
QL and JG carried out the designing of the whole computational algorithm and drafted the manuscript. JG and DC were responsible for the algorithm implementation. DC, VZ and RZ were responsible for the algorithm design. QL conceived the study and participated in the design and coordination of the analyses. All authors read and approved the final manuscript.
Jun Gao, Dongsheng Che contributed equally to this work.
Keywords
 Root Mean Square Error
 Support Vector Regression
 QSAR Modeling
 Collaborative Filter
 Cyclopamine