Design of selective and multiselective medications requires an understanding of the properties that distinguish the chosen biological target(s) from the numerous similar "anti-targets" encoded in the human genome. Contemporary drug design has to a large extent focused on structure-based methods, in which ligands are designed to fit into a binding pocket of the target. This requires knowledge of the exact 3D structures of the targets and anti-targets, which is a problem for protein kinases because X-ray structures have been solved for only 124 human protein kinase domains.
Proteochemometrics, on the other hand, has a distinct advantage when the studied proteins share the same structural organization, since primary amino acid sequences can then be used without the need for high-resolution 3D structures of the targets. Proteochemometrics also has the advantage that multiple targets and anti-targets can be encompassed in a single model. Structural alignments of protein kinases have shown that they all contain universally conserved subdomains, whereas their amino acid sequences still show quite notable variation. In fact, the 3D structures of protein families are generally much more conserved than their primary sequences. The average pair-wise sequence identity over the kinase domains falls below 30%, and only a small fraction of residues are markedly conserved across the entire superfamily. Use of sequence-derived descriptions can hence be considered a rational approach for kinase representation in multivariate modelling, provided that the sequence descriptions are made in such a way that they are relevant for the structural and functional organization of the kinases. Descriptions can be derived from prior sequence alignments or in alignment-independent ways; the latter approaches are advantageous for less similar sequences, for which unambiguous alignments are impossible to obtain.
In the first phase of this study we performed PCA and PLS-DA using one set of alignment-based and five sets of alignment-independent descriptors of protein kinase amino acid sequences. The purpose of this analysis was to evaluate the ability of the different descriptions to separate kinases into groups according to their functions. PLS-DA for the best model (which exploited alignment-based z-scale descriptions) afforded excellent separation of the seven groups of kinases; the cross-validated squared correlation coefficients fell between 0.93 and 0.98 for six of the groups, while for the more diverse tyrosine kinase-like kinase group the value was 0.89.
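For reference, a cross-validated squared correlation coefficient of the kind quoted above is commonly computed as one minus the ratio of the predictive residual sum of squares (PRESS) to the total sum of squares. The following is a minimal sketch under that assumption; the function name `q2` and the toy class-membership values are illustrative and are not taken from the study.

```python
def q2(y_obs, y_cv):
    """Cross-validated squared correlation coefficient, Q2 = 1 - PRESS / SS.

    y_obs: observed responses (in PLS-DA, dummy values coding class membership).
    y_cv:  predictions for the same objects, each obtained while that object
           was excluded from model fitting.
    """
    if len(y_obs) != len(y_cv) or not y_obs:
        raise ValueError("y_obs and y_cv must be non-empty and of equal length")
    mean = sum(y_obs) / len(y_obs)
    press = sum((o - p) ** 2 for o, p in zip(y_obs, y_cv))
    ss = sum((o - mean) ** 2 for o in y_obs)
    if ss == 0:
        raise ValueError("observed responses have zero variance")
    return 1.0 - press / ss


# Toy example (made-up numbers): 1 = member of the class, 0 = non-member.
membership = [1, 1, 0, 0, 0, 0]
cv_predictions = [0.9, 1.1, 0.1, -0.1, 0.0, 0.05]
print(round(q2(membership, cv_predictions), 3))
```

A value near 1 means the left-out objects are assigned to the correct class almost perfectly; a value near 0 means the model predicts no better than the class mean.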
As explained in the Methods section, PLS-DA models create a regression equation for each of the modelled classes and thus identify properties that are more typical of, or even unique to, a particular class compared with the other classes. Inspection of the alignment-based PLS-DA regression equations exploiting z-scale descriptors reveals that in some cases the description of the physico-chemical properties of very short sequence stretches, and even of single residues, is sufficient to separate all members of one kinase group from all other kinases. In one such example, a conserved proline residue surrounded by two hydrophobic amino acids in the activation loop of the TK sequences is a sufficient pattern for class separation. In the majority of cases this triplet is flanked by two positively charged lysine or arginine residues (e.g., the sequence stretch is KFPIK in ABL1 kinase, KVPIK in EGFR kinase, and RLPVK in KIT kinase). Analysis of the alignment-independent PLS-DA model exploiting AAC-DC descriptors further shows that the model often distinguishes groups of kinases by small sets of dipeptides (for instance, each of the dipeptides CW, VW, RN, and GM is present in more than 90% of TKs compared with only 15-35% of kinases from the six other groups). Specific sequence residues or patterns identified by our models could accordingly be addressed in the design of targeted and multi-targeted drugs. In fact, a few such amino acids (sometimes termed 'selectivity filters') have previously been exploited in drug design for kinases. These include the so-called gatekeeper residue, which is a bulky amino acid in most kinases, whereas 20% of the kinases have a threonine at this position. This property was used in designing selectivity for ABL kinase inhibitors.
(Unfortunately, this residue position is also a common site for mutations that confer resistance to imatinib, gefitinib, and erlotinib.) Cohen et al. designed inhibitors for RSK family kinases by targeting two selectivity filters in the ATP binding site, namely the threonine gatekeeper and a cysteine residue, which is an uncommon amino acid in the kinases' active site. These two amino acids, which distinguish RSKs from other protein kinases, were sufficient to confer high activity of the designed inhibitor.
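The two kinds of sequence pattern discussed above (the proline-centred stretch in the TK activation loop and the group-specific dipeptides) can be illustrated with a short sketch. It is an assumption-laden illustration rather than the procedure used in the study: the set of residues treated as hydrophobic is our own choice, the function names are hypothetical, and the search ignores alignment position, so it is looser than a pattern tied to a specific activation-loop position.

```python
import re

# Assumption: residues treated as "hydrophobic" here; the study may use another set.
HYDROPHOBIC = "AVILMFW"

# Proline between two hydrophobic residues, flanked by lysine or arginine.
FLANKED_MOTIF = re.compile(rf"[KR][{HYDROPHOBIC}]P[{HYDROPHOBIC}][KR]")


def has_flanked_motif(sequence):
    """True if the flanked hydrophobic-Pro-hydrophobic stretch occurs anywhere."""
    return FLANKED_MOTIF.search(sequence) is not None


def dipeptide_presence(sequences, dipeptide):
    """Fraction of sequences that contain the dipeptide at least once."""
    if not sequences:
        raise ValueError("at least one sequence is required")
    return sum(dipeptide in seq for seq in sequences) / len(sequences)


# The three stretches quoted in the text for ABL1, EGFR, and KIT.
for stretch in ("KFPIK", "KVPIK", "RLPVK"):
    print(stretch, has_flanked_motif(stretch))

# Made-up toy sequences, only to show the call; not real kinase domains.
print(dipeptide_presence(["ACWD", "VWAA", "GGGG"], "CW"))
```

Applied to full kinase-domain sequences grouped by class, `dipeptide_presence` would give the per-group percentages of the kind reported for CW, VW, RN, and GM.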
Although we here limited PLS-DA modelling to the separation of the seven major groups of the kinase superfamily, the analysis can be performed hierarchically at any resolution, e.g., to delineate particular families, subfamilies, and even single kinases.
In the subsequent studies we created quantitative models for kinase-inhibitor interaction activities using the six types of kinase descriptions and performing correlations using SVM, PLS, k-NN, and decision trees. In all models the small-molecule inhibitors were represented by a unified set of 3D-structural and physicochemical property descriptors. Models that exploited z-scale descriptions of the alignable parts of the protein kinase sequences performed best. However, ACC or MACC transformations gave only slightly inferior models when correlation to the activity data was done by SVM or PLS. ACC-transformed descriptors performed worse with the k-NN approach, while MACC transformations resulted in a weaker model with decision trees. The advantages of ACC and MACC transforms are that they do not require prior alignment and that they are calculated from full-length sequences of kinase domains, which in the present data set varied from 194 to 606 residues (although for about one half of the kinases the length was 240-260 residues, and for fewer than 30% of the kinases it exceeded 280 residues). Whereas ACCs reflect the covariances of amino acid properties over whole sequences, MACCs pinpoint individual pairs of residues with specific property combinations. MACC-based models may thus identify patterns that are not confined to the same location in each and every protein and/or are situated in sequence stretches that cannot be aligned unambiguously over the whole dataset. Consequently, models exploiting MACCs may complement the alignment-based models in the analysis and prediction of kinase-inhibitor interactions. The three other descriptions of the protein sequences (CTD, SO-PAA, and AAC-DC) performed worse than the z-scale-based descriptions and thus appear less useful in proteochemometric modelling.
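To make concrete why such transforms need no alignment, the sketch below computes auto- and cross-covariance terms in one common formulation: for each lag, the product of property j at position i and property k at position i + lag is averaged over the sequence. This is a hedged illustration only. The property table contains made-up numbers (not real z-scale values), and the lag range and any normalization used in the study are not stated in this section, so `max_lag` and the plain averaging are assumptions.

```python
# Made-up two-property descriptors for four residues; NOT real z-scale values.
TOY_PROPS = {
    "A": (0.1, -1.2),
    "C": (0.8, -0.4),
    "D": (-1.5, 0.9),
    "K": (-1.1, 1.6),
}


def acc_transform(sequence, props, max_lag):
    """Auto- and cross-covariance terms of residue properties.

    Returns max_lag * p * p values (p = number of properties per residue),
    so the output length does not depend on the sequence length.
    """
    n = len(sequence)
    if max_lag >= n:
        raise ValueError("sequence must be longer than max_lag")
    values = [props[residue] for residue in sequence]
    n_props = len(values[0])
    terms = []
    for lag in range(1, max_lag + 1):
        for j in range(n_props):
            for k in range(n_props):
                total = sum(
                    values[i][j] * values[i + lag][k] for i in range(n - lag)
                )
                terms.append(total / (n - lag))
    return terms


short = acc_transform("ACDKAC", TOY_PROPS, max_lag=3)
longer = acc_transform("ACDKACDKKADCAK", TOY_PROPS, max_lag=3)
print(len(short), len(longer))  # 12 12
```

Sequences of 6 and 14 residues both yield 12 terms (3 lags times 2 times 2 property pairs), which is what allows kinase domains of 194 to 606 residues to be placed in one descriptor matrix without alignment.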
SVM outperformed the other data analysis methods, including PLS, both in the prediction accuracy for the active kinase-inhibitor combinations, as manifested by the P2 and P2kin parameters (Tables 1 and 2), and in the ability to distinguish interacting from non-interacting kinase-inhibitor pairs, as revealed by the areas under the ROC curves (Figure 4). Accordingly, SVM seems to be the optimal choice for predicting full kinome-wide selectivity profiles of existing compounds, and for virtual screening to find new hits with desired selectivities. However, an important point is that SVM is essentially a 'black box' technique, which makes interpretation of its models difficult. Thus, even if the performance of SVM in virtual screening is superior to that of PLS, it is difficult to discern which molecular properties of kinases and inhibitors are important in the model. PLS contrasts with 'black box' methods like SVM, and with locally derived k-NN and decision tree models, because it expresses the correlation results in a single, straightforwardly interpretable regression equation. Moreover, PLS provides additional tools for model diagnostics, such as score and loading plots and 'distance to model' parameters, which allow identification of outliers and assessment of the reliability of extrapolations outside the modelled chemical and interaction spaces. Consequently, the parallel use of PLS and SVM modelling techniques may be advantageous when one aims to obtain models for both prediction and interpretation, and to cross-check model performances. (In this context it ought to be mentioned that several approaches have recently been suggested to give SVM models some transparency [17–19], which may favour the use of SVM in proteochemometric modelling.)
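The area under the ROC curve used above as a measure of discriminatory power equals the probability that a randomly chosen interacting pair receives a higher predicted score than a randomly chosen non-interacting pair, with ties counted as one half. A minimal sketch of that rank-based computation follows; the function name and the toy scores are illustrative and unrelated to the values in Figure 4.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores.

    scores: predicted activities (higher means more likely interacting).
    labels: truthy for interacting pairs, falsy for non-interacting pairs.
    """
    positives = [s for s, lab in zip(scores, labels) if lab]
    negatives = [s for s, lab in zip(scores, labels) if not lab]
    if not positives or not negatives:
        raise ValueError("both classes must be represented")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up predictions for four kinase-inhibitor pairs (two interacting).
print(roc_auc([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0]))  # 0.75
```

The quadratic pairwise loop is chosen for clarity; for matrices with thousands of kinase-inhibitor pairs a sort-based implementation would be preferable.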
The models built on small subsets of the dataset demonstrated the robustness of the proteochemometric modelling approach. Even for the smallest dataset, comprising only about 30 kinases, the SVM and PLS models showed acceptable predictive ability. The performance of the models based on small datasets was even more impressive in the prediction of interacting versus non-interacting kinase-inhibitor pairs: the discriminatory power of the SVM and PLS models was, respectively, 0.83 and 0.82 for the models created on 30 kinases (compared with 0.93 and 0.92 for the largest dataset size). These results may have a wide impact on the protein kinase field, as they mean that a relatively limited amount of experimental work is needed to afford qualitative and quantitative interaction models that generalize to the whole kinome.
The success of any empirical modelling depends on the quality of the data, which in proteochemometrics should comprise accurate activity measurements and descriptions of the relevant physico-chemical and/or structural properties of proteins and their ligands. Another prerequisite for proteochemometrics is an adequate composition of the dataset, which should be balanced and include both interacting and non-interacting protein-ligand combinations. Unfortunately, 'negative' results are often omitted from study reports. Moreover, interaction databases populated with data from multiple series typically contain activities for a fairly low fraction of all possible ligand-protein combinations, which implies that the bulk of the non-interacting pairs are absent. Modelling of sparse data matrices in which high-activity data are overrepresented would inevitably give rise to false-positive predictions. Hence, the success of any modelling study owes most to the use of a well-balanced dataset, such as the one used here, which comprises data for both active and inactive kinase-inhibitor combinations for more than one half of the human kinome.
Although the modelled dataset covered more than 12,000 interactions, the series of 38 kinase inhibitors cannot be considered large, even though it included seven of the eight presently approved anticancer agents as well as other compounds with mutually dissimilar inhibition profiles. One can thus expect further improvements from analyzing data for many more chemical compounds that provide wider and denser coverage of the chemical and interaction spaces. In the present study the dataset parts for modelling and validation were selected randomly to assure an objective assessment of the modelling performance. However, it is possible to apply statistical experimental design to choose small representative panels of kinases to be used for assaying and interaction modelling. One such technique is D-optimal design, which could be used to select kinases that cover most of the diversity of the kinase sequence and activity space. Designed molecular libraries have proven much more informative than random collections, and they have been shown in some cases to allow a 10^3- to 10^4-fold reduction of the experimental work required, while still retaining the full generalization ability of the derived interaction models [21, 21]. We can hence conclude that the values in Table 1 are lower bounds on the predictive abilities, which would be surpassed by models for datasets of the same size if the kinases were selected according to the principles of statistical experimental design. Hence, for any experimental work undertaken in the kinase field following this study we would strongly encourage the use of experimental design. The final outcome will be kinome-wide models that can predict the interaction strength of an arbitrary chemical over all known protein kinases.
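As an illustration of the D-optimal idea mentioned above, the sketch below greedily picks the candidates that maximize the determinant of the information matrix X'X of the selected rows. This is a simple forward-selection heuristic written for this illustration, not the exchange algorithms typically found in design software and not a procedure taken from the study. The candidate coordinates are made-up numbers standing in for, e.g., principal component scores of kinase descriptors, and the small `ridge` term is an assumption added so that the determinant is non-zero before enough rows have been selected.

```python
def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    size = len(m)
    det = 1.0
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-15:
            return 0.0
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det
        det *= m[col][col]
        for r in range(col + 1, size):
            factor = m[r][col] / m[col][col]
            for c in range(col, size):
                m[r][c] -= factor * m[col][c]
    return det


def information_matrix(rows, n_features, ridge):
    """X'X of the given rows plus ridge on the diagonal."""
    m = [[ridge if i == j else 0.0 for j in range(n_features)]
         for i in range(n_features)]
    for x in rows:
        for i in range(n_features):
            for j in range(n_features):
                m[i][j] += x[i] * x[j]
    return m


def greedy_d_optimal(candidates, n_select, ridge=1e-6):
    """Indices of n_select candidates chosen by greedy determinant maximization."""
    if n_select > len(candidates):
        raise ValueError("cannot select more items than there are candidates")
    n_features = len(candidates[0])
    chosen = []
    for _ in range(n_select):
        best_idx, best_det = None, float("-inf")
        for idx, x in enumerate(candidates):
            if idx in chosen:
                continue
            rows = [candidates[c] for c in chosen] + [x]
            det = determinant(information_matrix(rows, n_features, ridge))
            if det > best_det:
                best_idx, best_det = idx, det
        chosen.append(best_idx)
    return chosen


# Made-up 2D descriptor scores for four hypothetical kinases.
kinase_scores = [(2.0, 0.0), (1.9, 0.1), (0.0, 2.0), (1.0, 1.0)]
print(sorted(greedy_d_optimal(kinase_scores, n_select=2)))  # [0, 2]
```

In this toy case the two selected candidates lie far apart along different descriptor axes, while the near-duplicate of the first candidate is skipped, which is the behaviour one wants when choosing a small, diverse kinase panel for assaying.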