Prediction of indirect interactions in proteins

Background Both direct and indirect interactions determine molecular recognition of ligands by proteins. Indirect interactions can be defined as effects on recognition controlled from distant sites in the proteins, e.g. by changes in protein conformation and mobility, whereas direct interactions occur in close proximity of the protein's amino acids and the ligand. Molecular recognition is traditionally studied using three-dimensional methods, but with such techniques it is difficult to predict the effects caused by mutational changes of amino acids located far away from the ligand-binding site. We recently developed an approach, proteochemometrics, to the study of molecular recognition that models the chemical effects involved in the recognition of ligands by proteins using statistical sampling and mathematical modelling. Results A proteochemometric model was built, based on a statistically designed protein library's (melanocortin receptors') interaction with three peptides and used to predict which amino acids and sequence fragments that are involved in direct and indirect ligand interactions. The model predictions were confirmed by directed mutagenesis. The predicted presumed direct interactions were in good agreement with previous three-dimensional studies of ligand recognition. However, in addition the model could also correctly predict the location of indirect effects on ligand recognition arising from distant sites in the receptors, something that three-dimensional modelling could not afford. Conclusion We demonstrate experimentally that proteochemometric modelling can be used with high accuracy to predict the site of origin of direct and indirect effects on ligand recognitions by proteins.


Background
The processes of life depend on intermolecular recognition. Molecular recognition by proteins is a complex process that is determined not only by direct interactions of a protein with the interacting molecule, but also by indirect effects arising at distant sites in the proteins. Using threedimensional (3D) structure approaches it is not straightforward to analyze long distance effects on, for example, protein conformation, mobility, and stability. Recently, a new approach using statistical analysis of protein and lig-and interaction space, proteochemometrics, was developed [1,2]. It was used to model protein-peptide interactions and interactions of proteins with organic compounds [1,[3][4][5][6][7]. Here we show its utility in predicting indirect effects in proteins.
Proteochemometrics originates from chemometrics, the mathematical methods used to analyze chemical data. Proteochemometrics aims to describe the interactions between a series of macromolecules (such as proteins) and a series of ligands. Proteochemometric models are thereby created. These models are useful for predicting the affinities of new proteins for their ligands if the new molecules fall within the description space of the protein-ligand pairs of the training data set. A proteochemometric experiment is typically described by three descriptor blocks; the ligand descriptor (D L ), protein descriptor (D P ), and ligand-protein cross-term (D LP ) blocks. A vector of numbers, called the ligand descriptors (D L ), characterizes each ligand L. Similarly, each protein P has its protein descriptors (D P ). If we use a linear method of regression, the negative logarithm of ligand L's affinity (pK LP ) for the protein P is expressed by: where is the average affinity, C i , C j , and C ij are the regression coefficients for ligand descriptors, protein descriptors, and ligand-protein cross-terms, respectively, and N and M are the number of descriptors for ligands and proteins, respectively.
Ligand-protein cross-terms are usually defined by a new vector that is formed by multiplying each ligand descriptor with each receptor descriptor of particular ligandreceptor pairs. Hence, and then By using Eq. 3, the selectivity, S LAB , between protein A and protein B for some particular ligand L can be expressed as: If a region U of a protein is described by the set of descriptors, V, then the contribution to the selectivity, , by U between proteins A and B for ligand L is obtained by:  Accordingly, we can localize regions in a protein that afford selectivity (i.e., functionality difference between protein pairs) for a particular ligand by applying Eq. 5. A region can be a subsequence, a 3D molecular interaction field, a single amino acid, or even a physicochemical property of an individual amino acid, and is only restricted by the way the descriptors are assigned to the proteins [1][2][3][4][5][6][7]. Since Eq. 5 places no restriction how far in space a ligand is located from a region in a protein, proteochemometrics is useful to predict indirect interactions in proteins.

Design and testing of multiple chimeric melanocortin receptor library
It is necessary to have modifications of both the proteins and ligands, preferably both in the form of a series (i.e., "libraries"), in order to use Eq. 5. Here we used a library of multiple chimeric melanocortin receptors [7]. Briefly, the library was created from four natural melanocortin receptors (MC 1 , MC [3][4][5]. Each receptor was divided into five sequence fragments (S1-S5) and multiple chimeric receptors were then obtained by systematically shuffling the fragments. In order to maximize the chemical space information of the receptors, while keeping the number of experiments as low as possible, statistical molecular design was applied to properly select the sequence fragments [7][8][9][10]. The entire receptor library comprised 18 receptors and was tested for its interaction with the native melanocortin ligand, α-MSH and the synthetic MSH analogues NDP-MSH and [ 125 I]-NDP-MSH (see Table 1).

Proteochemometrics modelling
Interpretation of Eq.5 is dependent on our description of the proteins. Here two proteochemometrics models were created. One was based on a binary description of the proteins. The other used physicochemical descriptions of amino acids located inside transmembrane regions with presumed proximity to possible ligand binding cleft(s) according to the x-ray structure of bovine rhodopsin [7,11,12]. The binary model comprised information on the extent to which segments S1-S5 are involved in the selective recognition of ligands by the receptors, while the model based on physicochemical descriptions of amino acids comprised information on the contributions of single amino acids. Both models performed excellently in state-of-the-art model validations (see Table 2).

Prediction of ligand recognition
The models were used to compute selectivity recognition maps for all melanocortin receptor pairs for the α-MSH hormone (see Tables 3 and 4). α-MSH is recognized by MC receptors, albeit it binds with more than 1000 times higher affinity to the MC 1 receptor than to the MC 4 receptor. The recognition map derived from the binary model predicted that segments S1, S2, and S4 play major roles for the α-MSH MC 1 /MC 4 selectivity, while S3 and S5 contribute only a little (Fig. 1). The involvement of S1 and S2 was expected, as these regions possess amino acids close to the ligand recognition site according to 3D modelling [11]. However, the whole S4 region was located farther away, and its involvement was therefore more surprising. The model based on amino-acid physicochemical properties  in the transmembrane regions indicated that three amino acids in S1, five in S2, one in S4, and one in S5 caused the MC 1 /MC 4 receptor selectivity (see Table 4).

Verification of predictions using mutagenesis
In order to assess the predictions for the ten amino acid positions predicted with by the model based on physicochemical properties of amino-acids located inside the receptors transmembrane regions, they were subjected to single, double, triple, pentuple, and heptuple mutations in the MC 4 receptor, replacing them with the correspond-ing amino acids in the MC 1 receptor. Measurements taken from the mutants showed that seven positions gave the expected increase in affinity for α-MSH (see Fig. 2, Table  5). Combining amino acid substitutions also gave higher increases in affinity than did the single point mutations (see Fig. 2, Table 5).
However, three of the ten positions (A89S, I102A and I251L) did not show the predicted increase in affinity (see Fig. 2, Table 5). The failure of the A89S and I102A mutations to increase the affinity of the MC 4 receptor can be Increase in affinity of α-MSH predicted from the binary proteochemometric model Figure 1 Increase in affinity of α-MSH predicted from the binary proteochemometric model. Shown is the increase in affinity S U (log(M)) for α-MSH afforded by swapping each of segments S1 -S5 in the MC 4 receptor with the corresponding segment in the MC 1 receptor as predicted from the binary proteochemometrics model.
Predicted and experimentally determined increase in α-MSH affinity for site-directed mutants Figure 2 Predicted and experimentally determined increase in α-MSH affinity for site-directed mutants. Predicted and experimentally determined increase in affinity (i.e., computed SU and measured pK increase, log(M), respectively) for α-MSH afforded by point mutations in the MC 4 receptor. Shown by red bars is the change in affinity predicted by the model utilizing physicochemical descriptions of amino acids of the receptors' transmembrane regions and facing potential ligand binding clefts to be afforded by the indicated point mutations (i.e., the predicted increase in affinity that should be gained by exchanging the indicated amino acids in the MC 4 receptor with the corresponding ones in the MC 1 receptor). Shown in blue bars is the experimentally determined change in affinity for these mutations vs. the wild-type MC 4 receptor. Significance (nonparametric Wilcoxon Rank Sum statistical test [29]) denoted as follows: * p <0.05; ** p < 0.005; *** p < 0.0005.
tors. Such co-varying sequence positions gain equal importance in a proteochemometric model, even when some of them are not responsible for the explained activity. In the present library, mutations A89S, I102A, I129T, A135S, and L141G represent co-varying amino acid positions. The failure of the A89S and I102A mutations to cause the predicted increase in affinity may thus be explained on the basis of co-variance, where the actual effect is caused by mutations I129T, A135S, and L141G. In fact, the sum of the experimentally determined affinity increase by mutations A89S, I102A, I129T, A135S, and L141G (pKi = 0.69) closely agrees with that predicted from the model (S U = 0.56).
However, the failure of mutation I251L to increase affinity could not be explained by co-variances of amino acids within the model. Accordingly the predicted effect must originate from amino acids actually co-varying in the library but excluded from the modelling in the selections of amino acids based on 3D structure. Three such possible excluded positions partially co-varying with I251 are found in the midst of TM6, three are found towards the intracellular face of the lipid bi-layer in TM6, and several are present in the third intracellular loop. In the 3D structure, these amino acids are obviously located very far from any eventual binding pocket for α-MSH. To verify the prediction that these positions are responsible for the 'missing' affinity they were mutated separately and in combinations (see Fig. 3, Table 5). Although all mutations afforded an increase in affinity, most of the increase (~12 -fold) was afforded by swapping the entire third intracellular loop (see Fig. 3, Table 5). (Mutating all of these positions together afforded an ~20-fold increase in affinity).

Discussion
The experimental data demonstrate the utility of proteochemometrics for mapping ligand recognition. Using it seven amino-acid positions and the third intracellular loop were identified as accounting for most of the difference in affinity of the MC 1 and MC 4 receptor for α-MSH.
The ability to localize a distant effect in a protein distinguishes proteochemometrics from other approaches to mapping molecular recognition. A general scheme for mapping ligand recognition using proteochemometrics is outlined in Fig. 4. The protein library might initially be collected from wild-type proteins, or created by constructing multiple chimeric proteins, as performed in this study.
Applying experimental design could, in both cases, substantially reduce the number of entities that need to be constructed and tested without detrimentally compromising the information gained [2]. The application of proteochemometric modelling to the data creates a selectivity recognition map for the proteins and ligands. Analyzing the map (e.g., using Eq. 5) reveals the regions in the proteins involved in recognition of the ligands in a data-set. Regions of high interest can then be identified and analyzed further by extending the library, if necessary, in order to remove ambiguities due to co-variances (Fig. 4).
In the selection of the regions for mutations the present  4

receptor mutants. Shown is the average pK ± SD determined in radioligand binding competition using [ 125 I]NDP-MSH as radioligand. (The numbers of independent replicates are in brackets). The significance, p, is calculated using nonparametric Wilcoxon Rank Sum test [29] (see Methods for details), when compared with the wild-type MC 4 receptor.
Name pK i (log(nM)) ± SD (Repeats) p Experimentally determined increase in α-MSH affinity by mutations in segment S4 Figure 3 Experimentally determined increase in α-MSH affinity by mutations in segment S4. Experimentally determined increase in affinity (pK) afforded by mutations N240G/M241L/I245V (indicated in green), and V253I/V255F/V256L (indicated in blue) (i.e., exchanging amino acid residues in the MC 4 receptor with the corresponding residues in the MC 1 receptor), and swapping intracellular loop 3 (IL3, indicated in dark yellow) in the MC 4 receptor with the corresponding in the MC 1 receptor. Also shown is the increase in affinity for the whole S4 segment (S4, indicated in pale yellow). Significance (nonparametric Wilcoxon Rank Sum statistical test [29]) denoted as indicated in the legend to Fig. 2.
Outline of the mapping of direct and indirect interactions in molecular recognition using proteochemometrics Figure 4 Outline of the mapping of direct and indirect interactions in molecular recognition using proteochemometrics. Outline of a general procedure for mapping molecular recognition by biomacromolecules. Initially, sets of wild-type macromolecules are identified. Using statistical molecular design a library is then created from the wild-type macromolecules. The library can be selected from the wild-type molecules if the initial collection contains sufficient chemical variation. Chemical variation may also be introduced artificially by mutagensis. Shuffling sequence fragments can then be used to create a library, where three or more segments and three or more macromolecules are used as the starting point. After evaluating the interaction of the library with a suitable library of ligands of interest, a proteochemometric model can be created. This model may be used to localize the regions in each macromolecule that contribute to the selectivity of each particular ligand evaluated. It may happen that it is not possible to unambiguously localize individual amino acids within a particular region, due to co-varying amino acid positions in the macromolecular library. In that case, an extension of the library can be made in order to resolve the ambiguity.
study shows that also distant effects on protein function should be taken into account. Overall the approach would lead to a substantial gain in information at reduced experimental effort Mapping molecular recognition with proteochemometrics obtains different information than 3D-structurebased methods. The latter generally reveal where one particular ligand touches one particular receptor in one of many bound states. Proteochemometrics, by contrast, reveals regions of biomacromolecules that influence the selectivity of their ligand binding, and it has the capacity to predict effects arising from distant sites in a protein. The information obtained is only partially overlapping and, therefore, the approaches complement each other. The information gained in proteochemometrics relates to protein function and has direct utility for changing the function of a protein in a desired direction by mutation, in a priori protein engineering, or by altering the structure of a chemical entity and improving its interaction with binding sites, in a priori drug design [2].

Conclusion
In the present work we have presented a theoretical basis for the use of proteochemometrics to assess direct and indirect interactions in proteins, and we verified its utility to this end experimentally. We have also outlined a general strategy to afford cost effective mapping of molecular recognition using proteochemometrics and indicate important differences of proteochemometric mapping of ligand recognitions compared to traditional three-dimensional approaches. We propose that proteochemometrics can be used as a complement to classical 3D based methods to the study of ligand recognition, or even as the prime choice, in particular when three-dimensional protein structures are not available. Moreover, since proteochemometrics can be used to map both direct and indirect interactions, it may find a more general use in e.g. protein engineering and mapping protein function.

Receptor clones and multichimeric receptors
The coding sequences of the MC 1 and MC 5 receptors were cloned earlier [13,14]. The coding sequences of the MC 3 and MC 4 receptor were gifts of Dr. Ira Gantz [15,16]. Multichimeric MC 1,3-5 receptors were constructed so that each receptor was divided into three regions, with each region being replaced with a corresponding region from one of the four wild-type MC 1,3-5 receptors. Swapping all segments from the four receptors would have created a library with an extensive number of combinations, but this would have been impractical. We applied a statistical molecular design, which combines statistical approaches to maximize information gained from a minimal number of experiments. We used a D-optimal design, which could approximate all combinations of the melanocortin receptor regions making up 16 receptors, of which 4 were wildtype and 12 were multiple chimeric receptors. The multiple chimeric receptors were constructed in two sets, the Fand S-sets. The divisions were made at the end of the third and fifth transmembrane segments for the F-set (9 receptors of the 12 designed could be obtained), and at the beginning of the second and middle of the sixth transmembrane segments for the S-set (5 receptors were thus obtained, completing the library according to the design). Thus, the library came to include a total of 18 receptors: 14 that were multiple chimeric (in five segments) and four wild-type MC 1,3-5 receptors. We have previously reported a full account of the construction of this receptor library [7]. Moreover, full accounts of the theory and applicability of statistical molecular design have also been reported previously [8][9][10].

Receptor expression and radioligand binding
COS-1 cells were grown in Dulbecco's modified Eagle's medium with 10% fetal calf serum. Eighty-percent confluent cultures were transfected in 100-mm dishes with the DNA constructs (10 µg DNA per dish, mixed with lipo-somes, as described [18]. Twelve to sixteen hours after transfection, the serum-free medium was replaced with growth medium and the cells were cultivated for approximately 48 hours, then scraped off, centrifuged, and used for radioligand binding. Dissociation constants (K d ) for [ 125 I]-NDP-MSH were estimated for all the receptors by radioligand binding saturation as described [7,18]. Inhibition constants (K i ) for α-MSH and NDP-MSH were then estimated in competition with [ 125 I]-NDP-MSH using established procedures [7,18]. Obtained pK values (i.e., the negative logarithms of the K i and K d values) are listed in Tables 1 and 5.

Numerical descriptions of receptors for proteochemometric modelling
Two types of descriptions were used for the receptors, namely a binary description and a description based on physicochemical descriptions of amino acids. For the binary description, four binary numbers described each region, with each number corresponding to one receptor subtype. Each of these descriptors was assigned a value of +1 if the region was taken from descriptors' corresponding receptor subtype, otherwise it was assigned the number -1 [7].
The receptor descriptions based on physicochemical descriptions of amino acids at 38 selected sequence positions was used as described [7]. These positions were selected from a three-dimensional model of the transmembrane regions of the MC 1 receptor.
Using the crystal structure coordinates of rhodopsin as a template [12], the model was derived by replacing the side chains with the corresponding side chains of the MC 1 receptor, using the alignments of the GPCRDB database [19] and the SCWRL program [20]. Only amino acids pointing in the direction opposite the lipid bilayer were considered. This led to a considerable reduction in the number of co-varying sequence positions considered in the proteochemometric modelling, albeit with the obvious risk of excluding important positions. The selected positions are listed in Table 4.
Each position was coded by using five principal components derived from 26 physicochemical properties of amino acids, so called z-scales [21]. These z-scales represent hydrophobicity, steric properties, polarizability (z1-z3), polarity, and electronic effects (z4, z5). The data for the 38 × 5 = 190 descriptors obtained were compressed by applying principal component analysis (PCA) [22] to each of the five segments of the receptor library, which yielded in total 15 descriptors, with three descriptors for each segment. Prior to PCA, the z-scale descriptors had been scaled to unit variance [23]. Principal component analysis was performed using the Simca-P program [23].

Numerical description of peptides for proteochemometric modelling
The three peptide ligands used showed limited structural variation and were assigned two binary descriptors termed Y and F, using the Free-Wilson approach [24]. Y was set to +1 if the Tyr 2 residue of the peptides was iodinated; otherwise it was set to -1. Thus, ligand descriptor Y distinguished between [ 125 I]-NDP-MSH and the other two peptides. The descriptor F was set to +1 if Phe 7 was in the D-conformation and there was an Nle at position 4; otherwise it was set to -1. Thus, ligand descriptor F distinguished between α-MSH and the other two ligands. We also included the cross-term formed between the peptide descriptors Y and F (termed Y*F). This cross-term distinguishes NDP-MSH from α-MSH and [ 125 I]-NDP-MSH. The peptide ligand description is summarized in Table 6.

Numerical description of binding experiments and proteochemometric modelling
In each binding experiment, the receptor-peptide combination was described using the above receptor and peptide descriptors, and by computing receptor-ligand crossterms using Eq. 2, each cross-term being calculated as the product of one peptide and one receptor descriptor. Prior to calculating cross-terms, all descriptors were mean-centred and scaled to unit variance [23]. In order to account for differences in the number and mutual correlation of each descriptor type, the peptide descriptors, receptor descriptors, and cross-terms, block scaling was applied. All descriptors and cross-terms were mean centred and scaled to unit variance prior to block scaling [23]. Descriptors were finally correlated to the pK values using partial least squares projection to latent structures (PLS) regression [22]. PLS modelling was done using Simca-P [23].

Validation of modelling
The goodness-of-fit was estimated by the correlation coefficient R 2 and root mean square error (RMSE) [23]. Models were further validated using cross-validation (CV) [25,26], validation by response permutations [27], and validation by external prediction.
In cross-validation, one divides the data into several fractions. Seven were used in this study. Each fraction is repeatedly excluded once and then predicted from the model developed on the remaining data. The goodness of prediction of the CV is assessed by the Q 2 measure [23].
In response permutation validation, many models are created using randomly permuted response data. Twenty permutations were used here. For each permuted model, the R 2 , Q 2 , and correlation coefficients between the original and permuted response values are estimated. The correlation coefficients are plotted against the R 2 and Q 2 values. The two corresponding linear correlation lines are estimated, one for R 2 and one for Q 2 , and the intercepts iR 2 and iQ 2 of the two regression lines with the zero correlation coefficient line are calculated [5,23]. These intercept values indicate the R 2 and Q 2 of random response data. For example, a negative Q 2 intercept shows that it is not possible to obtain predictive models with random data, and indicates that a high Q 2 value of the original model is not obtained by pure chance.
External prediction assesses a model's stability when a substantial fraction of the data is excluded (e.g., more than one-third). External prediction may aim to predict the properties of new entities, in other words, entities that are entirely excluded from the data set. In the present case we predicted pKs for the S-set receptors using only data for F-set receptors. The goodness of external prediction was assessed by the external Q 2 (eQ 2 ) value [5]. Further details on these model validation approaches and how to interpret their results have been previously reported [28].

Computation of the selectivity contribution of amino acids
The contributions of individual amino acids to the selectivity of peptide binding between pairs of receptors were computed using Eq. 5. In order to apply Eq. 5, the regression coefficients for the individual z-scales of the receptor's amino acids were computed from the corresponding PLS and PCA models. Nine amino acids (Q43, P48, I66, A89, I102, I129, L141, I251, Y268, according to the numbering in the MC 4 receptor) contributed most to the selectivity of α-MSH (see Table 4) and were selected for sitedirected mutagenesis. Moreover, upon analyzing the 3D model of the MC 1 receptor it was deemed that S83 (corresponding to A89 in the MC 4 receptor) had a strong Hbond interaction with S130 (A135 in the MC 4 receptor).
Since the amino acid positions A89/A135 (MC 4 receptor numbering) showed identical co-variance in the receptor library, A135 was also selected for site-directed mutagenesis.

Statistical tests
The distribution of measured affinity for MC 4 receptor and mutant receptors presented here as well as its logarithm values (pK) did not correspond to normal distribution. Therefore we decided to use nonparametric Wilcoxon Rank Sum statistical test [29] to verify the hypothesis that affinity for corresponding mutant receptor differs from wild-type MC 4 receptor. The test was performed using R program [30].

Authors' contributions
PP did most of modeling, data analysis and interpretation and manuscript preparation work. SU was responsible for all molecular biology work. RP assisted SU as well as performed pharmacological measurements. ML gave significant intellectual contribution for modeling, data analysis and interpretation. JESW was the major initiator of the project and supervisor of it and he contributed significantly to drafting the manuscript.