Skip to main content
  • Research article
  • Open access
  • Published:

Prediction of indirect interactions in proteins



Both direct and indirect interactions determine molecular recognition of ligands by proteins. Indirect interactions can be defined as effects on recognition controlled from distant sites in the proteins, e.g. by changes in protein conformation and mobility, whereas direct interactions occur in close proximity of the protein's amino acids and the ligand. Molecular recognition is traditionally studied using three-dimensional methods, but with such techniques it is difficult to predict the effects caused by mutational changes of amino acids located far away from the ligand-binding site. We recently developed an approach, proteochemometrics, to the study of molecular recognition that models the chemical effects involved in the recognition of ligands by proteins using statistical sampling and mathematical modelling.


A proteochemometric model was built, based on a statistically designed protein library's (melanocortin receptors') interaction with three peptides and used to predict which amino acids and sequence fragments that are involved in direct and indirect ligand interactions. The model predictions were confirmed by directed mutagenesis. The predicted presumed direct interactions were in good agreement with previous three-dimensional studies of ligand recognition. However, in addition the model could also correctly predict the location of indirect effects on ligand recognition arising from distant sites in the receptors, something that three-dimensional modelling could not afford.


We demonstrate experimentally that proteochemometric modelling can be used with high accuracy to predict the site of origin of direct and indirect effects on ligand recognitions by proteins.


The processes of life depend on intermolecular recognition. Molecular recognition by proteins is a complex process that is determined not only by direct interactions of a protein with the interacting molecule, but also by indirect effects arising at distant sites in the proteins. Using three-dimensional (3D) structure approaches it is not straightforward to analyze long distance effects on, for example, protein conformation, mobility, and stability. Recently, a new approach using statistical analysis of protein and ligand interaction space, proteochemometrics, was developed [1, 2]. It was used to model protein-peptide interactions and interactions of proteins with organic compounds [1, 37]. Here we show its utility in predicting indirect effects in proteins.

Proteochemometrics originates from chemometrics, the mathematical methods used to analyze chemical data. Proteochemometrics aims to describe the interactions between a series of macromolecules (such as proteins) and a series of ligands. Proteochemometric models are thereby created. These models are useful for predicting the affinities of new proteins for their ligands if the new molecules fall within the description space of the protein-ligand pairs of the training data set. A proteochemometric experiment is typically described by three descriptor blocks; the ligand descriptor (DL), protein descriptor (DP), and ligand-protein cross-term (DLP) blocks. A vector of numbers, called the ligand descriptors (DL), characterizes each ligand L. Similarly, each protein P has its protein descriptors (DP). If we use a linear method of regression, the negative logarithm of ligand L's affinity (pK LP ) for the protein P is expressed by:

p K L P = p K ¯ + i N C i D i L + j M C j D j P + i N j M C i j D i j L P E q . 1 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGWbaCcqWGlbWsdaWgaaWcbaGaemitaWKaemiuaafabeaakiabg2da9maanaaabaGaemiCaaNaem4saSeaaiabgUcaRmaaqahabaGaem4qam0aaSbaaSqaaiabdMgaPbqabaaabaGaemyAaKgabaGaemOta4eaniabggHiLdGccqWGebardaqhaaWcbaGaemyAaKgabaGaemitaWeaaOGaey4kaSYaaabCaeaacqWGdbWqdaWgaaWcbaGaemOAaOgabeaaaeaacqWGQbGAaeaacqWGnbqta0GaeyyeIuoakiabdseaenaaDaaaleaacqWGQbGAaeaacqWGqbauaaGccqGHRaWkdaaeWbqaamaaqahabaGaem4qam0aaSbaaSqaaiabdMgaPjabdQgaQbqabaaabaGaemOAaOgabaGaemyta0eaniabggHiLdaaleaacqWGPbqAaeaacqWGobGta0GaeyyeIuoakiabdseaenaaDaaaleaacqWGPbqAcqWGQbGAaeaacqWGmbatcqWGqbauaaGccaWLjaGaaCzcaiabdweafjabdghaXjabc6caUiabbccaGiabigdaXaaa@6858@

where p K ¯ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaadaqdaaqaaiabdchaWjabdUealbaaaaa@2F45@ is the average affinity, C i , C j , and C ij are the regression coefficients for ligand descriptors, protein descriptors, and ligand-protein cross-terms, respectively, and N and M are the number of descriptors for ligands and proteins, respectively.

Ligand-protein cross-terms are usually defined by a new vector that is formed by multiplying each ligand descriptor with each receptor descriptor of particular ligand-receptor pairs. Hence,

D i j L P = D i L D j P E q . 2 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGebardaqhaaWcbaGaemyAaKMaemOAaOgabaGaemitaWKaemiuaafaaOGaeyypa0Jaemiraq0aa0baaSqaaiabdMgaPbqaaiabdYeambaakiabdseaenaaDaaaleaacqWGQbGAaeaacqWGqbauaaGccaWLjaGaaCzcaiabdweafjabdghaXjabc6caUiabbccaGiabikdaYaaa@41ED@

and then

p K L P = p K ¯ + i N C i D i L + j M C j D j P + i N D i L [ j M C i j D j P ] E q . 3 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGWbaCcqWGlbWsdaWgaaWcbaGaemitaWKaemiuaafabeaakiabg2da9maanaaabaGaemiCaaNaem4saSeaaiabgUcaRmaaqahabaGaem4qam0aaSbaaSqaaiabdMgaPbqabaaabaGaemyAaKgabaGaemOta4eaniabggHiLdGccqWGebardaqhaaWcbaGaemyAaKgabaGaemitaWeaaOGaey4kaSYaaabCaeaacqWGdbWqdaWgaaWcbaGaemOAaOgabeaaaeaacqWGQbGAaeaacqWGnbqta0GaeyyeIuoakiabdseaenaaDaaaleaacqWGQbGAaeaacqWGqbauaaGccqGHRaWkdaaeWbqaaiabdseaenaaDaaaleaacqWGPbqAaeaacqWGmbataaaabaGaemyAaKgabaGaemOta4eaniabggHiLdGcdaWadaqaamaaqahabaGaem4qam0aaSbaaSqaaiabdMgaPjabdQgaQbqabaGccqWGebardaqhaaWcbaGaemOAaOgabaGaemiuaafaaaqaaiabdQgaQbqaaiabd2eanbqdcqGHris5aaGccaGLBbGaayzxaaGaaCzcaiaaxMaacqWGfbqrcqWGXbqCcqGGUaGlcqqGGaaicqqGZaWmaaa@6B84@

By using Eq. 3, the selectivity, S LAB , between protein A and protein B for some particular ligand L can be expressed as:

S L A B = p K L A p K L B = j M C j ( D j A D j B ) + i N D i L [ j M C i j ( D j A D j B ) ] E q . 4 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGtbWudaWgaaWcbaGaemitaWKaemyqaeKaemOqaieabeaakiabg2da9iabdchaWjabdUealnaaBaaaleaacqWGmbatcqWGbbqqaeqaaOGaeyOeI0IaemiCaaNaem4saS0aaSbaaSqaaiabdYeamjabdkeacbqabaGccqGH9aqpdaaeWbqaaiabdoeadnaaBaaaleaacqWGQbGAaeqaaaqaaiabdQgaQbqaaiabd2eanbqdcqGHris5aOWaaeWaaeaacqWGebardaqhaaWcbaGaemOAaOgabaGaemyqaeeaaOGaeyOeI0Iaemiraq0aa0baaSqaaiabdQgaQbqaaiabdkeacbaaaOGaayjkaiaawMcaaiabgUcaRmaaqahabaGaemiraq0aa0baaSqaaiabdMgaPbqaaiabdYeambaaaeaacqWGPbqAaeaacqWGobGta0GaeyyeIuoakmaadmaabaWaaabCaeaacqWGdbWqdaWgaaWcbaGaemyAaKMaemOAaOgabeaaaeaacqWGQbGAaeaacqWGnbqta0GaeyyeIuoakmaabmaabaGaemiraq0aa0baaSqaaiabdQgaQbqaaiabdgeabbaakiabgkHiTiabdseaenaaDaaaleaacqWGQbGAaeaacqWGcbGqaaaakiaawIcacaGLPaaaaiaawUfacaGLDbaacaWLjaGaaCzcaiabdweafjabdghaXjabc6caUiabbccaGiabisda0aaa@7388@

If a region U of a protein is described by the set of descriptors, V, then the contribution to the selectivity, S L A B U MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGtbWudaqhaaWcbaGaemitaWKaemyqaeKaemOqaieabaGaemyvaufaaaaa@3274@ , by U between proteins A and B for ligand L is obtained by:

S L A B U = j V K C j ( D j A D j B ) + i N D i L [ j V K C i j ( D j A D j B ) ] E q .  5 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGtbWudaqhaaWcbaGaemitaWKaemyqaeKaemOqaieabaGaemyvaufaaOGaeyypa0ZaaabCaeaacqWGdbWqdaWgaaWcbaGaemOAaOgabeaaaeaacqWGQbGAcqGHiiIZcqWGwbGvaeaacqWGlbWsa0GaeyyeIuoakmaabmaabaGaemiraq0aa0baaSqaaiabdQgaQbqaaiabdgeabbaakiabgkHiTiabdseaenaaDaaaleaacqWGQbGAaeaacqWGcbGqaaaakiaawIcacaGLPaaacqGHRaWkdaaeWbqaaiabdseaenaaDaaaleaacqWGPbqAaeaacqWGmbataaaabaGaemyAaKgabaGaemOta4eaniabggHiLdGcdaWadaqaamaaqahabaGaem4qam0aaSbaaSqaaiabdMgaPjabdQgaQbqabaaabaGaemOAaOMaeyicI4SaemOvayfabaGaem4saSeaniabggHiLdGcdaqadaqaaiabdseaenaaDaaaleaacqWGQbGAaeaacqWGbbqqaaGccqGHsislcqWGebardaqhaaWcbaGaemOAaOgabaGaemOqaieaaaGccaGLOaGaayzkaaaacaGLBbGaayzxaaGaaCzcaiaaxMaacqWGfbqrcqWGXbqCcqGGUaGlcqqGGaaicqqG1aqnaaa@6E58@

Accordingly, we can localize regions in a protein that afford selectivity (i.e., functionality difference between protein pairs) for a particular ligand by applying Eq. 5. A region can be a subsequence, a 3D molecular interaction field, a single amino acid, or even a physicochemical property of an individual amino acid, and is only restricted by the way the descriptors are assigned to the proteins [17]. Since Eq. 5 places no restriction how far in space a ligand is located from a region in a protein, proteochemometrics is useful to predict indirect interactions in proteins.


Design and testing of multiple chimeric melanocortin receptor library

It is necessary to have modifications of both the proteins and ligands, preferably both in the form of a series (i.e., "libraries"), in order to use Eq. 5. Here we used a library of multiple chimeric melanocortin receptors [7]. Briefly, the library was created from four natural melanocortin receptors (MC1, MC3–5). Each receptor was divided into five sequence fragments (S1–S5) and multiple chimeric receptors were then obtained by systematically shuffling the fragments. In order to maximize the chemical space information of the receptors, while keeping the number of experiments as low as possible, statistical molecular design was applied to properly select the sequence fragments [710]. The entire receptor library comprised 18 receptors and was tested for its interaction with the native melanocortin ligand, α-MSH and the synthetic MSH analogues NDP-MSH and [125I]-NDP-MSH (see Table 1).

Table 1 Affinities of multiple chimeric melanocortin receptors. Affinities (pK ± standard deviation, SD) of multiple chimeric melanocortin receptors for α-MSH, NDP-MSH and [125I]-NDP-MSH determined by radioligand binding.

Proteochemometrics modelling

Interpretation of Eq.5 is dependent on our description of the proteins. Here two proteochemometrics models were created. One was based on a binary description of the proteins. The other used physicochemical descriptions of amino acids located inside transmembrane regions with presumed proximity to possible ligand binding cleft(s) according to the x-ray structure of bovine rhodopsin [7, 11, 12]. The binary model comprised information on the extent to which segments S1–S5 are involved in the selective recognition of ligands by the receptors, while the model based on physicochemical descriptions of amino acids comprised information on the contributions of single amino acids. Both models performed excellently in state-of-the-art model validations (see Table 2).

Table 2 Performance of proteochemometric models. Performance of proteochemometric models derived using wild-type and multiple chimeric receptors interacting with melanocortins, α-MSH, NDP-MSH and [125]-NDP-MSH. Shown are results from the model based on binary receptor descriptors (A) and the model based on physicochemical description of amino acids of the transmembrane regions presumed to face in a direction opposite to the membrane (B) (see Methods for further details).

Prediction of ligand recognition

The models were used to compute selectivity recognition maps for all melanocortin receptor pairs for the α-MSH hormone (see Tables 3 and 4). α-MSH is recognized by MC receptors, albeit it binds with more than 1000 times higher affinity to the MC1 receptor than to the MC4 receptor. The recognition map derived from the binary model predicted that segments S1, S2, and S4 play major roles for the α-MSH MC1/MC4 selectivity, while S3 and S5 contribute only a little (Fig. 1). The involvement of S1 and S2 was expected, as these regions possess amino acids close to the ligand recognition site according to 3D modelling [11]. However, the whole S4 region was located farther away, and its involvement was therefore more surprising. The model based on amino-acid physicochemical properties in the transmembrane regions indicated that three amino acids in S1, five in S2, one in S4, and one in S5 caused the MC1/MC4 receptor selectivity (see Table 4).

Table 3 Selectivity recognition map predicted from binary model. Selectivity recognition map for wild-type MC receptor pairs for α-MSH computed from the binary proteochemometric model. The contribution to the selectivity Su, log(M), for indicated segments S1–S5, was computed from the model using Eq. 5. Total selectivity represents the entire differences in affinity between receptor pairs computed from the model (see Eq. 4).
Table 4 Predicted amino acid selectivity recognition map. Amino acid selectivity recognition map for MC receptors for α-MSH computed from the proteochemometric model based on the physicochemical description of amino acids of the transmembrane regions presumed to facing in the direction opposite to the membrane. The contributions to selectivity Su, log(M), of the indicated amino acid positions were computed from the model using Eq. 5. Total selectivity represents the entire difference in affinity between receptor pairs computed from the model (see Eq. 4). TM, transmembrane regions; SG, segments.
Figure 1
figure 1

Increase in affinity of α-MSH predicted from the binary proteochemometric model. Shown is the increase in affinity SU (log(M)) for α-MSH afforded by swapping each of segments S1 – S5 in the MC4 receptor with the corresponding segment in the MC1 receptor as predicted from the binary proteochemometrics model.

Verification of predictions using mutagenesis

In order to assess the predictions for the ten amino acid positions predicted with by the model based on physicochemical properties of amino-acids located inside the receptors transmembrane regions, they were subjected to single, double, triple, pentuple, and heptuple mutations in the MC4 receptor, replacing them with the corresponding amino acids in the MC1 receptor. Measurements taken from the mutants showed that seven positions gave the expected increase in affinity for α-MSH (see Fig. 2, Table 5). Combining amino acid substitutions also gave higher increases in affinity than did the single point mutations (see Fig. 2, Table 5).

Figure 2
figure 2

Predicted and experimentally determined increase in α-MSH affinity for site-directed mutants. Predicted and experimentally determined increase in affinity (i.e., computed SU and measured pK increase, log(M), respectively) for α-MSH afforded by point mutations in the MC4 receptor. Shown by red bars is the change in affinity predicted by the model utilizing physicochemical descriptions of amino acids of the receptors' transmembrane regions and facing potential ligand binding clefts to be afforded by the indicated point mutations (i.e., the predicted increase in affinity that should be gained by exchanging the indicated amino acids in the MC4 receptor with the corresponding ones in the MC1 receptor). Shown in blue bars is the experimentally determined change in affinity for these mutations vs. the wild-type MC4 receptor. Significance (nonparametric Wilcoxon Rank Sum statistical test [29]) denoted as follows: * p <0.05; ** p < 0.005; *** p < 0.0005.

Table 5 Experimentally determined pK values of α-MSH for MC4 receptor mutants. Shown is the average pK ± SD determined in radioligand binding competition using [125I]NDP-MSH as radioligand. (The numbers of independent replicates are in brackets). The significance, p, is calculated using nonparametric Wilcoxon Rank Sum test [29] (see Methods for details), when compared with the wild-type MC4 receptor.

However, three of the ten positions (A89S, I102A and I251L) did not show the predicted increase in affinity (see Fig. 2, Table 5). The failure of the A89S and I102A mutations to increase the affinity of the MC4 receptor can be explained by the presence of several amino acid positions in the protein library that show the same or similar variability (i.e., they co-vary). For example, positions A89 and A135 (numbered according to the MC4 receptor) both reside in the same segment and are serines in the MC1 receptor and alanines in the MC3, MC4, and MC5 receptors. Such co-varying sequence positions gain equal importance in a proteochemometric model, even when some of them are not responsible for the explained activity. In the present library, mutations A89S, I102A, I129T, A135S, and L141G represent co-varying amino acid positions. The failure of the A89S and I102A mutations to cause the predicted increase in affinity may thus be explained on the basis of co-variance, where the actual effect is caused by mutations I129T, A135S, and L141G. In fact, the sum of the experimentally determined affinity increase by mutations A89S, I102A, I129T, A135S, and L141G (pKi = 0.69) closely agrees with that predicted from the model (SU = 0.56).

However, the failure of mutation I251L to increase affinity could not be explained by co-variances of amino acids within the model. Accordingly the predicted effect must originate from amino acids actually co-varying in the library but excluded from the modelling in the selections of amino acids based on 3D structure. Three such possible excluded positions partially co-varying with I251 are found in the midst of TM6, three are found towards the intracellular face of the lipid bi-layer in TM6, and several are present in the third intracellular loop. In the 3D structure, these amino acids are obviously located very far from any eventual binding pocket for α-MSH. To verify the prediction that these positions are responsible for the 'missing' affinity they were mutated separately and in combinations (see Fig. 3, Table 5). Although all mutations afforded an increase in affinity, most of the increase (~12 -fold) was afforded by swapping the entire third intracellular loop (see Fig. 3, Table 5). (Mutating all of these positions together afforded an ~20-fold increase in affinity).

Figure 3
figure 3

Experimentally determined increase in α-MSH affinity by mutations in segment S4. Experimentally determined increase in affinity (pK) afforded by mutations N240G/M241L/I245V (indicated in green), and V253I/V255F/V256L (indicated in blue) (i.e., exchanging amino acid residues in the MC4 receptor with the corresponding residues in the MC1 receptor), and swapping intracellular loop 3 (IL3, indicated in dark yellow) in the MC4 receptor with the corresponding in the MC1 receptor. Also shown is the increase in affinity for the whole S4 segment (S4, indicated in pale yellow). Significance (nonparametric Wilcoxon Rank Sum statistical test [29]) denoted as indicated in the legend to Fig. 2.


The experimental data demonstrate the utility of proteochemometrics for mapping ligand recognition. Using it seven amino-acid positions and the third intracellular loop were identified as accounting for most of the difference in affinity of the MC1 and MC4 receptor for α-MSH. The ability to localize a distant effect in a protein distinguishes proteochemometrics from other approaches to mapping molecular recognition. A general scheme for mapping ligand recognition using proteochemometrics is outlined in Fig. 4. The protein library might initially be collected from wild-type proteins, or created by constructing multiple chimeric proteins, as performed in this study. Applying experimental design could, in both cases, substantially reduce the number of entities that need to be constructed and tested without detrimentally compromising the information gained [2]. The application of proteochemometric modelling to the data creates a selectivity recognition map for the proteins and ligands. Analyzing the map (e.g., using Eq. 5) reveals the regions in the proteins involved in recognition of the ligands in a data-set. Regions of high interest can then be identified and analyzed further by extending the library, if necessary, in order to remove ambiguities due to co-variances (Fig. 4). In the selection of the regions for mutations the present study shows that also distant effects on protein function should be taken into account. Overall the approach would lead to a substantial gain in information at reduced experimental effort

Figure 4
figure 4

Outline of the mapping of direct and indirect interactions in molecular recognition using proteochemometrics. Outline of a general procedure for mapping molecular recognition by biomacromolecules. Initially, sets of wild-type macromolecules are identified. Using statistical molecular design a library is then created from the wild-type macromolecules. The library can be selected from the wild-type molecules if the initial collection contains sufficient chemical variation. Chemical variation may also be introduced artificially by mutagensis. Shuffling sequence fragments can then be used to create a library, where three or more segments and three or more macromolecules are used as the starting point. After evaluating the interaction of the library with a suitable library of ligands of interest, a proteochemometric model can be created. This model may be used to localize the regions in each macromolecule that contribute to the selectivity of each particular ligand evaluated. It may happen that it is not possible to unambiguously localize individual amino acids within a particular region, due to co-varying amino acid positions in the macromolecular library. In that case, an extension of the library can be made in order to resolve the ambiguity.

Mapping molecular recognition with proteochemometrics obtains different information than 3D-structure-based methods. The latter generally reveal where one particular ligand touches one particular receptor in one of many bound states. Proteochemometrics, by contrast, reveals regions of biomacromolecules that influence the selectivity of their ligand binding, and it has the capacity to predict effects arising from distant sites in a protein. The information obtained is only partially overlapping and, therefore, the approaches complement each other. The information gained in proteochemometrics relates to protein function and has direct utility for changing the function of a protein in a desired direction by mutation, in a priori protein engineering, or by altering the structure of a chemical entity and improving its interaction with binding sites, in a priori drug design [2].


In the present work we have presented a theoretical basis for the use of proteochemometrics to assess direct and indirect interactions in proteins, and we verified its utility to this end experimentally. We have also outlined a general strategy to afford cost effective mapping of molecular recognition using proteochemometrics and indicate important differences of proteochemometric mapping of ligand recognitions compared to traditional three-dimensional approaches. We propose that proteochemometrics can be used as a complement to classical 3D based methods to the study of ligand recognition, or even as the prime choice, in particular when three-dimensional protein structures are not available. Moreover, since proteochemometrics can be used to map both direct and indirect interactions, it may find a more general use in e.g. protein engineering and mapping protein function.


Receptor clones and multichimeric receptors

The coding sequences of the MC1 and MC5 receptors were cloned earlier [13, 14]. The coding sequences of the MC3 and MC4 receptor were gifts of Dr. Ira Gantz [15, 16]. Multichimeric MC1,3–5 receptors were constructed so that each receptor was divided into three regions, with each region being replaced with a corresponding region from one of the four wild-type MC1,3–5 receptors. Swapping all segments from the four receptors would have created a library with an extensive number of combinations, but this would have been impractical. We applied a statistical molecular design, which combines statistical approaches to maximize information gained from a minimal number of experiments. We used a D-optimal design, which could approximate all combinations of the melanocortin receptor regions making up 16 receptors, of which 4 were wild-type and 12 were multiple chimeric receptors. The multiple chimeric receptors were constructed in two sets, the F- and S-sets. The divisions were made at the end of the third and fifth transmembrane segments for the F-set (9 receptors of the 12 designed could be obtained), and at the beginning of the second and middle of the sixth transmembrane segments for the S-set (5 receptors were thus obtained, completing the library according to the design). Thus, the library came to include a total of 18 receptors: 14 that were multiple chimeric (in five segments) and four wild-type MC1,3–5 receptors. We have previously reported a full account of the construction of this receptor library [7]. Moreover, full accounts of the theory and applicability of statistical molecular design have also been reported previously [810].

Site-directed mutagenesis

Selected individual amino acids in the MC4 receptor were exchanged for the corresponding amino acids of the MC1 receptor, using mutation-inducing primers and PCR [17]. Eight single mutants (I66V, A89S, I102A, I129T, A135S, L141G, I251L, and Y268I), a double mutant (Q43E/P48D), and two triple mutants (N240G/M241L/I245V, V253I/V255F/V256L), were created. Some of these mutants were then used as templates for constructing mutants with multiple mutations, yielding I66V/L141G, I129T/A135S, L141G/Y268I, Q43E/P48D/I66V, Q43E/P48D/I66V/I129T/A135S, and Q43E/P48D/I66V/I129T/A135S/L141G/Y268I. We also manufactured two chimeric receptors in which subsequences of the MC4 receptor were replaced by the corresponding subsequences of the MC1 receptor. First we constructed a receptor ("S4 chimeric receptor") in which the entire S4 segment in the MC4 receptor was replaced with that from the MC1 receptor. Another receptor ("IL3 chimeric receptor") comprised the MC4 receptor with intracellular loop 3 replaced with the corresponding loop of the MC1 receptor.

Peptides and radioligand

Peptide ligands α-MSH (Ac-SYSMGHFRWGKPV-NH2) and NDP-MSH ([Tyr2, Nle4, D-Phe7]-α-MSH) were synthesized using standard solid phase peptide synthesis and purified by HPLC; their correct molecular weights were verified by mass spectrometry. [125I]-NDP-MSH ([125I-Tyr2, Nle4, D-Phe7]-α-MSH) was prepared in radiochemically pure form by custom iodination at EuroDiagnostica AB, Sweden.

Receptor expression and radioligand binding

COS-1 cells were grown in Dulbecco's modified Eagle's medium with 10% fetal calf serum. Eighty-percent confluent cultures were transfected in 100-mm dishes with the DNA constructs (10 μg DNA per dish, mixed with liposomes, as described [18]. Twelve to sixteen hours after transfection, the serum-free medium was replaced with growth medium and the cells were cultivated for approximately 48 hours, then scraped off, centrifuged, and used for radioligand binding. Dissociation constants (Kd) for [125I]-NDP-MSH were estimated for all the receptors by radioligand binding saturation as described [7, 18]. Inhibition constants (Ki) for α-MSH and NDP-MSH were then estimated in competition with [125I]-NDP-MSH using established procedures [7, 18]. Obtained pK values (i.e., the negative logarithms of the Ki and Kd values) are listed in Tables 1 and 5.

Numerical descriptions of receptors for proteochemometric modelling

Two types of descriptions were used for the receptors, namely a binary description and a description based on physicochemical descriptions of amino acids. For the binary description, four binary numbers described each region, with each number corresponding to one receptor subtype. Each of these descriptors was assigned a value of +1 if the region was taken from descriptors' corresponding receptor subtype, otherwise it was assigned the number -1 [7].

The receptor descriptions based on physicochemical descriptions of amino acids at 38 selected sequence positions was used as described [7]. These positions were selected from a three-dimensional model of the transmembrane regions of the MC1 receptor.

Using the crystal structure coordinates of rhodopsin as a template [12], the model was derived by replacing the side chains with the corresponding side chains of the MC1 receptor, using the alignments of the GPCRDB database [19] and the SCWRL program [20]. Only amino acids pointing in the direction opposite the lipid bilayer were considered. This led to a considerable reduction in the number of co-varying sequence positions considered in the proteochemometric modelling, albeit with the obvious risk of excluding important positions. The selected positions are listed in Table 4.

Each position was coded by using five principal components derived from 26 physicochemical properties of amino acids, so called z-scales [21]. These z-scales represent hydrophobicity, steric properties, polarizability (z1–z3), polarity, and electronic effects (z4, z5). The data for the 38 × 5 = 190 descriptors obtained were compressed by applying principal component analysis (PCA) [22] to each of the five segments of the receptor library, which yielded in total 15 descriptors, with three descriptors for each segment. Prior to PCA, the z-scale descriptors had been scaled to unit variance [23]. Principal component analysis was performed using the Simca-P program [23].

Numerical description of peptides for proteochemometric modelling

The three peptide ligands used showed limited structural variation and were assigned two binary descriptors termed Y and F, using the Free-Wilson approach [24]. Y was set to +1 if the Tyr2 residue of the peptides was iodinated; otherwise it was set to -1. Thus, ligand descriptor Y distinguished between [125I]-NDP-MSH and the other two peptides. The descriptor F was set to +1 if Phe7 was in the D-conformation and there was an Nle at position 4; otherwise it was set to -1. Thus, ligand descriptor F distinguished between α-MSH and the other two ligands. We also included the cross-term formed between the peptide descriptors Y and F(termed Y*F). This cross-term distinguishes NDP-MSH from α-MSH and [125I]-NDP-MSH. The peptide ligand description is summarized in Table 6.

Table 6 Description of peptides for proteochemometric modelling. The general sequence of the peptides used herein was: Ac-Ser-Pos2-Ser-Pos4-Gly-His-Pos7-Arg-Trp-Gly-Lys-Pro-Val-NH2. Amino acids in Pos2, Pos4, Pos7, and the corresponding description, were as shown.

Numerical description of binding experiments and proteochemometric modelling

In each binding experiment, the receptor-peptide combination was described using the above receptor and peptide descriptors, and by computing receptor-ligand cross-terms using Eq. 2, each cross-term being calculated as the product of one peptide and one receptor descriptor. Prior to calculating cross-terms, all descriptors were mean-centred and scaled to unit variance [23]. In order to account for differences in the number and mutual correlation of each descriptor type, the peptide descriptors, receptor descriptors, and cross-terms, block scaling was applied. All descriptors and cross-terms were mean centred and scaled to unit variance prior to block scaling [23]. Descriptors were finally correlated to the pK values using partial least squares projection to latent structures (PLS) regression [22]. PLS modelling was done using Simca-P [23].

Validation of modelling

The goodness-of-fit was estimated by the correlation coefficient R2 and root mean square error (RMSE) [23]. Models were further validated using cross-validation (CV) [25, 26], validation by response permutations [27], and validation by external prediction.

In cross-validation, one divides the data into several fractions. Seven were used in this study. Each fraction is repeatedly excluded once and then predicted from the model developed on the remaining data. The goodness of prediction of the CV is assessed by the Q2 measure [23].

In response permutation validation, many models are created using randomly permuted response data. Twenty permutations were used here. For each permuted model, the R2, Q2, and correlation coefficients between the original and permuted response values are estimated. The correlation coefficients are plotted against the R2 and Q2 values. The two corresponding linear correlation lines are estimated, one for R2 and one for Q2, and the intercepts iR2 and iQ2 of the two regression lines with the zero correlation coefficient line are calculated [5, 23]. These intercept values indicate the R2 and Q2 of random response data. For example, a negative Q2 intercept shows that it is not possible to obtain predictive models with random data, and indicates that a high Q2 value of the original model is not obtained by pure chance.

External prediction assesses a model's stability when a substantial fraction of the data is excluded (e.g., more than one-third). External prediction may aim to predict the properties of new entities, in other words, entities that are entirely excluded from the data set. In the present case we predicted pKs for the S-set receptors using only data for F-set receptors. The goodness of external prediction was assessed by the external Q2 (eQ2) value [5]. Further details on these model validation approaches and how to interpret their results have been previously reported [28].

Computation of the selectivity contribution of amino acids

The contributions of individual amino acids to the selectivity of peptide binding between pairs of receptors were computed using Eq. 5. In order to apply Eq. 5, the regression coefficients for the individual z-scales of the receptor's amino acids were computed from the corresponding PLS and PCA models. Nine amino acids (Q43, P48, I66, A89, I102, I129, L141, I251, Y268, according to the numbering in the MC4 receptor) contributed most to the selectivity of α-MSH (see Table 4) and were selected for site-directed mutagenesis. Moreover, upon analyzing the 3D model of the MC1 receptor it was deemed that S83 (corresponding to A89 in the MC4 receptor) had a strong H-bond interaction with S130 (A135 in the MC4 receptor). Since the amino acid positions A89/A135 (MC4 receptor numbering) showed identical co-variance in the receptor library, A135 was also selected for site-directed mutagenesis.

Statistical tests

The distribution of measured affinity for MC4 receptor and mutant receptors presented here as well as its logarithm values (pK) did not correspond to normal distribution. Therefore we decided to use nonparametric Wilcoxon Rank Sum statistical test [29] to verify the hypothesis that affinity for corresponding mutant receptor differs from wild-type MC4 receptor. The test was performed using R program [30].


  1. Prusis P, Muceniece R, Andersson P, Post C, Lundstedt T, Wikberg JE: PLS modeling of chimeric MS04/MSH-peptide and MC1/MC3-receptor interactions reveals a novel method for the analysis of ligand-receptor interactions. Biochim Biophys Acta 2001, 1544(1–2):350–357.

    Article  CAS  PubMed  Google Scholar 

  2. Wikberg JES, Lapinsh M, Prusis P: Proteochemometrics: A tool for modelling the molecular interaction space. In Chemogenomics in Drug Discovery – A Medicinal Chemistry Perspective. Edited by: Kubinyi H, Müller G, Mannhold R, Folkers G. Weinheim: Wiley-VCH; 2004:289–309.

    Google Scholar 

  3. Lapinsh M, Prusis P, Gutcaits A, Lundstedt T, Wikberg JE: Development of proteo-chemometrics: a novel technology for the analysis of drug-receptor interactions. Biochim Biophys Acta 2001, 1525(1–2):180–190.

    Article  CAS  PubMed  Google Scholar 

  4. Lapinsh M, Prusis P, Lundstedt T, Wikberg JE: Proteochemometrics modeling of the interaction of amine G-protein coupled receptors with a diverse set of ligands. Mol Pharmacol 2002, 61(6):1465–1475. 10.1124/mol.61.6.1465

    Article  CAS  PubMed  Google Scholar 

  5. Prusis P, Lundstedt T, Wikberg JE: Proteo-chemometrics analysis of MSH peptide binding to melanocortin receptors. Protein Eng 2002, 15(4):305–311. 10.1093/protein/15.4.305

    Article  CAS  PubMed  Google Scholar 

  6. Lapinsh M, Prusis P, Mutule I, Mutulis F, Wikberg JE: QSAR and proteo-chemometric analysis of the interaction of a series of organic compounds with melanocortin receptor subtypes. J Med Chem 2003, 46(13):2572–2579. 10.1021/jm020945m

    Article  CAS  PubMed  Google Scholar 

  7. Lapinsh M, Veiksina S, Uhlén S, Petrovska R, Mutule I, Mutulis F, Yahorava S, Prusis P, Wikberg JE: Proteochemometric mapping of the interaction of organic compounds with melanocortin receptor subtypes. Mol Pharmacol 2005, 67(1):50–59. 10.1124/mol.104.002857

    Article  CAS  PubMed  Google Scholar 

  8. Linusson A, Wold S, Nordén B: Statistical molecular design of peptoid libraries. Mol Divers 1998, 4(2):103–114. 10.1023/A:1026416430656

    Article  CAS  PubMed  Google Scholar 

  9. Lundstedt T, Seifert E, Abramo L, Thelin B, Nystrom A, Pettersen J, Bergman R: Experimental design and optimization. Chemometrics and Intelligent Laboratory Systems 1998, 42(1–2):3–40. 10.1016/S0169-7439(98)00065-3

    Article  CAS  Google Scholar 

  10. Linusson A, Gottfries J, Lindgren F, Wold S: Statistical Molecular Design of Building Blocks for Combinatorial Chemistry. J Med Chem 2000, 43(7):1320–1328. 10.1021/jm991118x

    Article  CAS  PubMed  Google Scholar 

  11. Prusis P, Schiöth HB, Muceniece R, Herzyk P, Afshar M, Hubbard RE, Wikberg JE: Modeling of the three-dimensional structure of the human melanocortin 1 receptor, using an automated method and docking of a rigid cyclic melanocyte-stimulating hormone core peptide. J Mol Graph Model 1997, 15(5):307–17. 334 10.1016/S1093-3263(98)00004-7

    Article  CAS  PubMed  Google Scholar 

  12. Palczewski K, Kumasaka T, Hori T, Behnke CA, Motoshima H, Fox BA, Le Trong I, Teller DC, Okada T, Stenkamp RE, Yamamoto M, Miyano M: Crystal structure of rhodopsin: A G protein-coupled receptor. Science 2000, 289(5480):739–745. 10.1126/science.289.5480.739

    Article  CAS  PubMed  Google Scholar 

  13. Chhajlani V, Wikberg JE: Molecular cloning and expression of the human melanocyte stimulating hormone receptor cDNA. FEBS Lett 1992, 309(3):417–420. 10.1016/0014-5793(92)80820-7

    Article  CAS  PubMed  Google Scholar 

  14. Chhajlani V, Muceniece R, Wikberg JE: Molecular cloning of a novel human melanocortin receptor. Biochem Biophys Res Commun 1993, 195(2):866–873. 10.1006/bbrc.1993.2125

    Article  CAS  PubMed  Google Scholar 

  15. Gantz I, Konda Y, Tashiro T, Shimoto Y, Miwa H, Munzert G, Watson SJ, DelValle J, Yamada T: Molecular cloning of a novel melanocortin receptor. J Biol Chem 1993, 268(11):8246–8250.

    CAS  PubMed  Google Scholar 

  16. Gantz I, Miwa H, Konda Y, Shimoto Y, Tashiro T, Watson SJ, DelValle J, Yamada T: Molecular cloning, expression, and gene localization of a fourth melanocortin receptor. J Biol Chem 1993, 268(20):15174–15179.

    CAS  PubMed  Google Scholar 

  17. Ho SN, Hunt HD, Horton RM, Pullen JK, Pease LR: Site-directed mutagenesis by overlap extension using the polymerase chain reaction. Gene 1989, 77(1):51–59. 10.1016/0378-1119(89)90358-2

    Article  CAS  PubMed  Google Scholar 

  18. Schioth HB, Muceniece R, Wikberg JE: Characterisation of the melanocortin 4 receptor by radioligand binding. Pharmacol Toxicol 1996, 79(3):161–165.

    Article  CAS  PubMed  Google Scholar 

  19. Horn F, Bettler E, Oliveira L, Campagne F, Cohen FE, Vriend G: GPCRDB information system for G protein-coupled receptors. Nucleic Acids Res 2003, 31(1):294–297. 10.1093/nar/gkg103

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  20. Bower MJ, Cohen FE, Dunbrack RL: Prediction of protein side-chain rotamers from a backbone-dependent rotamer library: a new homology modeling tool. J Mol Biol 1997, 267(5):1268–1282. 10.1006/jmbi.1997.0926

    Article  CAS  PubMed  Google Scholar 

  21. Sandberg M, Eriksson L, Jonsson J, Sjöström M, Wold S: New chemical descriptors relevant for the design of biologically active peptides. A multivariate characterization of 87 amino acids. J Med Chem 1998, 41(14):2481–2491. 10.1021/jm9700575

    Article  CAS  PubMed  Google Scholar 

  22. Geladi P, Kowalski BR: Partial least-squares regression: A tutorial. Anal Chim Acta 1986, 185: 1–17. 10.1016/0003-2670(86)80028-9

    Article  CAS  Google Scholar 

  23. SIMCA-P 9 User Guide and Tutorial. Umeå: Umetrics AB; 2001.

  24. Free SM, Wilson JW: A mathematical contribution to structure-activity studies. J Med Chem 1964, 1: 395–399. 10.1021/jm00334a001

    Article  Google Scholar 

  25. Wold S: Cross-validatory estimation of the number of components in factor and principal component models. Technometrics 1978, 20: 397–405.

    Article  Google Scholar 

  26. Wakeling IN, Morris JJ: A test of significance for partial least squares regression. J Chemometr 1993, 7: 291–304. 10.1002/cem.1180070407

    Article  CAS  Google Scholar 

  27. Efron B: Better bootstrap confidence intervals. J Am Stat Assoc 1987, 78: 171–200.

    Article  Google Scholar 

  28. Eriksson L, Johansson E, Wold S: Quantitative structure-activity relationship validation. In Quantitative structure-activity relationships in environmental sciences-VII. Edited by: Schuurmann G, Chen F. Pensacola: SETAC; 1997:381–397.

    Google Scholar 

  29. Wilcoxon F: Individual Comparisons by Ranking Methods. Biometrics 1945, 1(6):80–83.

    Article  Google Scholar 

  30. R Development Core Team: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing; 2004.

    Google Scholar 

  31. Schiöth HB, Petersson S, Muceniece R, Szardenings M, Wikberg JE: Deletions of the N-terminal regions of the human melanocortin receptors. FEBS Lett 1997, 410(2–3):223–228. 10.1016/S0014-5793(97)00593-0

    Article  PubMed  Google Scholar 

Download references


Supported by the Swedish Research Council (04X-05957 and 621-2002-4711). We thank Santa Veiksina for technical assistance.

Author information

Authors and Affiliations


Corresponding author

Correspondence to Jarl ES Wikberg.

Additional information

Authors' contributions

PP did most of modeling, data analysis and interpretation and manuscript preparation work. SU was responsible for all molecular biology work. RP assisted SU as well as performed pharmacological measurements. ML gave significant intellectual contribution for modeling, data analysis and interpretation. JESW was the major initiator of the project and supervisor of it and he contributed significantly to drafting the manuscript.

Authors’ original submitted files for images

Rights and permissions

Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article is distributed under the terms of the Creative Commons Attribution License ( ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and permissions

About this article

Cite this article

Prusis, P., Uhlén, S., Petrovska, R. et al. Prediction of indirect interactions in proteins. BMC Bioinformatics 7, 167 (2006).

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: