Current computational methods for predicting protein function rely to a large extent on amino acid sequence similarity with proteins of known function. The accuracy of such predictions depends on how much information about function is embedded in the sequence similarity and on how well the computational methods are able to extract that information. Other computational methods for prediction of protein function include structural similarity comparisons and molecular dynamics simulations (e.g. molecular docking). Although these latter methods are powerful and may in general offer important 3D mechanistic explanations of interaction and function, they require access to the protein 3D structure. Computational determination of a 3D structure is well known to be resource demanding and error prone, and it generally requires prior knowledge, such as the 3D structure of a homologous protein. This bottleneck makes it important to develop new methods for prediction of protein function when a 3D model is not available.
Recently, a new bioinformatic approach to the prediction of protein function, called proteochemometrics, was introduced; it has several useful features [1–4]. In proteochemometrics, the physico-chemical properties of the interacting molecules are used to characterize protein interactions and to classify the proteins into different categories using multivariate statistical techniques. One major strength of proteochemometrics is that the results are obtained directly from real interaction measurement data and do not require access to any 3D protein structure model to provide quite specific information about interaction.
Proteochemometrics has its roots in chemometrics, the subfield of chemistry associated with statistical planning, modelling and analysis of chemical experiments [5]. In particular, it is closely related to quantitative structure-activity relationship (QSAR) modelling, a branch of chemometrics used in computer-based drug discovery. Modern computer-based drug discovery is based on modelling interactions between small drug candidates (ligands) and proteins. The standard approach is to predict the affinity of a ligand by means of numerical calculations from first principles using molecular dynamics or quantum mechanics. QSAR modelling is an alternative approach where experimental observations are used to design a multivariate regression model.
With $x_i$ denoting descriptor $i$ among $D$ different descriptors and $y$ denoting the biological activity, (linear) QSAR modelling aims at a linear multivariate model

$$y = \mathbf{w}^T \mathbf{x} = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_D x_D \qquad (1)$$

where $\mathbf{w} = [w_0, w_1, w_2, \dots, w_D]^T$ are the regression coefficients and $\mathbf{x} = [1, x_1, x_2, \dots, x_D]^T$. The activity $y$ may be the binding affinity to a receptor but may also be any biological activity, e.g. the growth inhibition of cancer cells. In comparison with numerical calculations from first principles and similar approaches, the main advantages of QSAR modelling are that it does not require access to the molecular details of the biological subsystem of interest and that information can be obtained directly from relatively cheap measurements.
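As a small worked instance of the linear model in Eq. (1), the sketch below evaluates the predicted activity for one compound. The function name and the coefficient and descriptor values are hypothetical and chosen only for illustration; they do not come from the data sets analysed in this paper.

```python
def predict_activity(w, descriptors):
    """Evaluate y = w0 + w1*x1 + ... + wD*xD for one compound.

    w           -- regression coefficients [w0, w1, ..., wD]
    descriptors -- descriptor values [x1, ..., xD] (without the leading 1)
    """
    if len(w) != len(descriptors) + 1:
        raise ValueError("expected one coefficient per descriptor plus an intercept")
    x = [1.0] + list(descriptors)  # augmented vector x = [1, x1, ..., xD]
    return sum(wi * xi for wi, xi in zip(w, x))


# Hypothetical coefficients and descriptor values, for illustration only.
w = [1.0, 2.0, -0.5]
y = predict_activity(w, [3.0, 4.0])  # 1.0 + 2.0*3.0 - 0.5*4.0
```

Prepending the constant 1 to the descriptor vector lets the intercept $w_0$ be handled as an ordinary coefficient, which is what the definition of the augmented vector in Eq. (1) expresses.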
The joint perturbation of both the ligand and the protein in proteochemometrics yields more information about the different combinations of ligand and protein properties involved in an interaction than can be obtained in conventional QSAR modelling, where only the ligand is perturbed. In recent years, various other bioinformatic modifications of conventional QSAR modelling have been reported. These include simultaneous modifications of the ligand and the chemical environment (buffer composition and/or temperature) in which the interaction takes place [6–8], and three-dimensional QSAR modelling of protein-protein interactions that directly yields valuable stereo-chemical information [9].
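To make the idea of combining ligand and protein properties concrete, the following sketch builds one descriptor vector for a ligand-protein pair. This section does not state how the two sets of descriptors are combined; the use of pairwise ligand-protein products (cross-terms) is an assumption made here for illustration, and the function name and numerical values are hypothetical.

```python
def pcm_descriptors(ligand, protein):
    """Concatenate ligand descriptors, protein descriptors and their pairwise products.

    The cross-terms are an assumed way of encoding ligand-protein combinations.
    """
    cross = [l * p for l in ligand for p in protein]
    return list(ligand) + list(protein) + cross


# Hypothetical descriptor values: 2 ligand descriptors, 3 protein descriptors.
x = pcm_descriptors([0.5, -1.0], [2.0, 3.0, 1.0])
```

With 2 ligand and 3 protein descriptors the vector has 2 + 3 + 2 × 3 = 11 entries, which hints at why the number of variables quickly becomes large relative to the number of measured interactions.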
Although proteochemometrics has already proven to be a useful methodology for improved understanding of biochemical interactions directly from measurement data, the quantitative proteochemometric models designed so far have not yet been subjected to a detailed and unbiased statistical evaluation.
A key issue in this evaluation is the problem of overfitting. Since the number of ligand and protein properties available is usually very large, one has to constrain the fitting of the regression coefficients to avoid overfitting. For example, in ridge regression [10] a penalty parameter is tuned based on data, and in partial least squares (PLS) regression [11–13] overfitting is controlled by tuning the number of latent variables employed. In proteochemometrics, as in many reported QSAR studies, the performance estimates are obtained as follows: 1) perform a K-fold cross validation (CV) for different values of the regression parameter, 2) select the parameter value that yields the largest estimated performance value, and 3) report the most promising model found and the associated performance estimate. Although this procedure may seem intuitive and may yield predictive models (as we in fact demonstrate below), the performance estimates obtained in this way may be heavily biased, since the same CV results are used both to choose the parameter and to assess the chosen model. Interestingly, this problem was recently addressed in the context of conventional QSAR modelling [14], and has also been discussed in earlier work [15, 16].
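The three-step practice described above can be sketched as follows for ridge regression. This is a minimal illustration under stated assumptions, not the implementation used in the cited studies: the data are synthetic, the intercept is omitted for brevity, the performance measure is assumed to be the usual cross-validated R², and all function names are hypothetical.

```python
import random


def solve(A, b):
    """Solve the linear system A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    z = [0.0] * n
    for r in reversed(range(n)):
        z[r] = (M[r][n] - sum(M[r][k] * z[k] for k in range(r + 1, n))) / M[r][r]
    return z


def ridge_fit(X, y, lam):
    """Return w minimising ||y - Xw||^2 + lam * ||w||^2 (no intercept)."""
    d = len(X[0])
    A = [[sum(row[i] * row[j] for row in X) + (lam if i == j else 0.0)
          for j in range(d)] for i in range(d)]
    b = [sum(row[i] * t for row, t in zip(X, y)) for i in range(d)]
    return solve(A, b)


def q2(y_true, y_pred):
    """Assumed performance measure: 1 - (sum of squared CV errors) / (total sum of squares)."""
    mean = sum(y_true) / len(y_true)
    press = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return 1.0 - press / sum((t - mean) ** 2 for t in y_true)


def kfold_q2(X, y, lam, K):
    """Step 1: K-fold CV performance estimate for one penalty value."""
    pred = [0.0] * len(y)
    for k in range(K):
        train = [i for i in range(len(y)) if i % K != k]  # interleaved folds
        w = ridge_fit([X[i] for i in train], [y[i] for i in train], lam)
        for i in range(k, len(y), K):                     # held-out fold k
            pred[i] = sum(wj * xj for wj, xj in zip(w, X[i]))
    return q2(y, pred)


# Synthetic data: 30 examples, 10 descriptors, only the first two matter.
rng = random.Random(0)
X = [[rng.gauss(0.0, 1.0) for _ in range(10)] for _ in range(30)]
y = [row[0] - 0.5 * row[1] + rng.gauss(0.0, 0.5) for row in X]

scores = {lam: kfold_q2(X, y, lam, K=5) for lam in [0.01, 0.1, 1.0, 10.0, 100.0]}
best_lam = max(scores, key=scores.get)  # step 2: pick the best-looking parameter
reported_q2 = scores[best_lam]          # step 3: this maximum is the optimistic estimate
```

Because `reported_q2` is the maximum over several noisy CV estimates, it tends to overstate the performance to be expected on new examples; no examples remain that were excluded from both fitting and parameter selection.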
As an alternative or complement to constraining the regression coefficients, one may also reduce the variance by means of variable subset selection (VSS). In QSAR modelling, many algorithms for VSS have been proposed based on various methodologies, for example optimal experimental design [17, 18], sequential refinements [19], and global optimization [20]. VSS is used during model building to exclude variables that are not important for the response variable. Such variables receive low weights in both a PLS and a ridge regression model; however, if the fraction of unimportant variables is very large [21], the overall predictive power of the model is reduced, and in this case VSS can improve the predictivity. If, on the other hand, the fraction of unimportant variables is rather small, VSS will not improve the quality of the model and might even reduce it slightly. In both cases, however, the interpretability of the model will be improved.
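A conceptually simple form of VSS is to rank the descriptors by the absolute value of their correlation with the response and keep the top k. The sketch below illustrates this; correlation ranking is an assumption chosen for illustration and is not necessarily one of the algorithms in [17–20] or the one used in this work. The function names and the toy data are hypothetical.

```python
def pearson(a, b):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    if va == 0.0 or vb == 0.0:
        return 0.0
    return cov / (va * vb) ** 0.5


def select_variables(X, y, k):
    """Return the (sorted) indices of the k descriptors most correlated with y."""
    d = len(X[0])
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(d)]
    ranked = sorted(range(d), key=lambda j: -scores[j])
    return sorted(ranked[:k])


# Toy data: descriptor 0 tracks y, descriptor 1 is constant, descriptor 2 is weakly related.
X = [[1.0, 5.0, 0.3], [2.0, 5.0, -0.1], [3.0, 5.0, 0.4], [4.0, 5.0, 0.0]]
y = [1.1, 1.9, 3.2, 3.9]
top1 = select_variables(X, y, 1)
top2 = select_variables(X, y, 2)
```

Note that the ranking uses the response values, so any examples passed to `select_variables` have influenced the model and can no longer serve as independent test examples.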
Although many of the advanced algorithms for VSS are powerful, they are all computationally demanding. Therefore, to keep the computing time down in the double loop cross validation procedure employed here, conceptually and computationally simple algorithms for VSS were used instead of the more advanced ones presented in, e.g., [17–20]. Most likely, the more advanced algorithms would yield more reliable models with even higher predictive power than the models designed here. However, the main issue of interest in this paper is to confirm the potential of proteochemometrics.
In previously reported proteochemometric modelling, all available examples were used in the VSS. These were then split into K separate parts and a conventional K-fold cross validation (CV) was performed. However, since all available examples had already been used in the VSS, there were no longer any completely independent test examples available for model evaluation. Interestingly, this problem of introducing an optimistic selection bias via VSS was recently also pointed out in the supervised classification of gene expression microarray data [22].
In this paper we employ a procedure that can be used to perform unbiased statistical evaluations of proteochemometric and other QSAR modelling approaches. An overview of this so-called double loop CV procedure is presented in Figure 1. The procedure may be regarded as a refinement of the current practice in proteochemometrics in the following respects:
1. $K_1$ different variable subset selections are performed, one for each step in the outer CV loop. This avoids optimistic selection bias.
2. The best performance estimates ($Q^2$) found in the inner loop by means of $K_2$-fold CV are computed, but not reported as the model's performance estimate. This avoids the second optimistic selection bias mentioned above.
3. An unbiased performance estimate, $P^2$, is computed in the outer loop and is reported as the performance estimate of the modelling approach defined by the procedure in the inner loop (the methods of VSS, regression, and model selection employed). $P^2$ is the result of different models that are designed and selected in the inner loop. It reflects the performance that one should expect on average.
4. Repeated $K_1$-fold CVs are performed, which yield information about the robustness of the results obtained (presented as confidence intervals).
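The four points above can be sketched in code as follows. This is one reading of the procedure under simplifying assumptions, not a reproduction of Figure 1 or of the authors' software: the VSS step keeps a single descriptor, the regression is a univariate ridge fit through the origin, both performance measures are assumed to have the form 1 − SS_res/SS_tot, and the data are synthetic. All function names, and the identifiers K1, K2, q2 and p2, are hypothetical stand-ins.

```python
import random


def r2(y_true, y_pred):
    """1 - (residual sum of squares) / (total sum of squares)."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return 1.0 - ss_res / sum((t - mean) ** 2 for t in y_true)


def fit(x, y, lam):
    """Univariate ridge regression through the origin: slope = x.y / (x.x + lam)."""
    return sum(a * b for a, b in zip(x, y)) / (sum(a * a for a in x) + lam)


def select_variable(X, y, rows):
    """VSS stand-in: the descriptor with the largest |x.y| / sqrt(x.x) over `rows`."""
    def score(j):
        xy = sum(X[i][j] * y[i] for i in rows)
        xx = sum(X[i][j] ** 2 for i in rows)
        return abs(xy) / xx ** 0.5 if xx > 0 else 0.0
    return max(range(len(X[0])), key=score)


def folds(indices, K):
    """Split a list of indices into K interleaved folds."""
    return [indices[k::K] for k in range(K)]


def cv_predictions(x, y, idx, lam, K):
    """K-fold CV predictions for the examples in idx at a fixed penalty."""
    pred = {}
    for test in folds(idx, K):
        train = [i for i in idx if i not in test]
        w = fit([x[i] for i in train], [y[i] for i in train], lam)
        for i in test:
            pred[i] = w * x[i]
    return [pred[i] for i in idx]


def double_loop_cv(X, y, lams, K1, K2, rng):
    """Return (p2, list of best inner-loop q2 values) for one random outer partition."""
    idx = list(range(len(y)))
    rng.shuffle(idx)
    outer_pred, best_q2 = {}, []
    for test in folds(idx, K1):                      # outer loop
        train = [i for i in idx if i not in test]
        j = select_variable(X, y, train)             # point 1: VSS on training part only
        x = [row[j] for row in X]
        y_train = [y[i] for i in train]
        q2 = {lam: r2(y_train, cv_predictions(x, y, train, lam, K2)) for lam in lams}
        lam_best = max(q2, key=q2.get)               # point 2: used for selection only
        best_q2.append(q2[lam_best])
        w = fit([x[i] for i in train], y_train, lam_best)
        for i in test:                               # test examples never seen above
            outer_pred[i] = w * x[i]
    p2 = r2(y, [outer_pred[i] for i in range(len(y))])  # point 3: reported estimate
    return p2, best_q2


# Synthetic data: 40 examples, 20 descriptors, only descriptor 3 carries signal.
data_rng = random.Random(1)
X = [[data_rng.gauss(0.0, 1.0) for _ in range(20)] for _ in range(40)]
y = [3.0 * row[3] + data_rng.gauss(0.0, 0.5) for row in X]

# Point 4: repeat the outer CV with different random partitions.
results = [double_loop_cv(X, y, [0.1, 1.0, 10.0, 100.0], K1=5, K2=4,
                          rng=random.Random(seed)) for seed in range(10)]
p2_values = [p2 for p2, _ in results]
```

The spread of `p2_values` over the repeated partitions indicates how robust the estimate is. The essential design choice is that each outer test fold is touched only once, after variable selection, parameter tuning and fitting have all been completed on the remaining examples.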
In addition to these refinements, this work also demonstrates the potential of fast and straightforward alternative methods for VSS and regression in the inner loop. Moreover, it indicates that the performance estimates reported by certain software packages for QSAR may be quite misleading.
We reanalyzed two of the largest proteochemometric data sets yet reported. The first data set is presented in [2] and contains interaction data for 332 of the combinations of 23 different compounds with 21 different human and rat amine G-protein coupled receptors. In total, there are 23 × 21 = 483 possible interactions and the basic task is to fill in the 483 − 332 = 151 missing values. The second data set is presented in [23] and contains information about the binding of 12 different compounds (4-piperidyl oxazole antagonists) to 18 human α1-adrenoreceptor variants (wild-type, chimeric, and point mutated). As for the first data set, interaction data are not available for all of the 12 × 18 = 216 possible interactions, but only for 131; see [24] for more details about this data set. Below, these two data sets are referred to as the amine data set and the alpha data set, respectively.