### Monomeric structure selection

We used 810 X-ray structures from the PDBselect collection [37]. Sequence identity between all pairs was less than 25%. The structures were monomeric and did not contain any cofactors or metal ions. Crystallographic resolution was better than 3.3 Å (average of 2.0 Å). They were divided into two sets, having similar average chain lengths (around 190 amino acids). 615 proteins were used as an Optimization set, for the parameterization of the energy functions below. The other 200 were used as a Test set, for evaluating the parameters' performance.

### Monomeric decoys

For each native structure, we performed a gapless threading of the sequence onto segments of matching size from all the structures in our 810 protein set. On average, we obtained 349 decoys per native structure. For each amino acid position in a decoy, the most probable rotamer for the given amino acid type was arbitrarily taken from the Dunbrack backbone-independant rotamer library [38] and constructed. Thus, steric constraints could be violated in the decoy structures, because the rotamers were not optimized with respect to the particular backbone and sidechain environment. With the energy functions used below, this is not a difficulty. Furthermore, this simple method for constructing monomeric decoys was deemed sufficient, because our main focus is not fold recognition, but the identification of protein–protein complexes.

### Dimeric structure selection

We used 219 dimer structures from the DIMER-1 set of Lu et al [19], which contains 340 structures. We retained only dimers, as opposed to higher-order complexes. This distinction is not always straightforward, and one higher-order complex was identified in the data set at a late stage. It is not expected to affect the results, judging by the similar parameters obtained with OS1 and OS2, or when an optimized parameter set is refined further with the higher-order structure left out. All the structures have a crystallographic resolution better than 2.5 Å, 30 interface contacts or more (using a 4.5 Å distance cutoff), and a pairwise sequence identity with any other member of the set below 35%. All correspond to biological dimers, as opposed to crystal contacts. This was established by running an automated test with the PQS server [39] and by checking the literature.

Our final set included 195 homodimers and just 17 heterodimers. A second set of 23 heterodimers was set aside to perform blind tests after parameterization was finished. The predominance of homodimers was largely imposed by the structures available in the PDB [40]. A typical homodimer is expected to have a larger, somewhat more hydrophobic interface, and a stronger binding constant, compared to a typical heterodimer. This is both a limitation and an advantage for our purposes. We expect that there will be a stronger dimerization signal in the homodimer interfaces and, furthermore, the signal is repeated twice, since the complexes are almost always symmetrical. This should facilitate parameterization, without damaging the heterodimer performance. Indeed, heterodimer association is driven by largely the same physical effects as homodimerization, so that the easier homodimer parameterization should carry over and be applicable to heterodimers. This is confirmed by our blind tests on the separate set of 23 heterodimers, which are not used in any step of the parameterization, and are not homologous to any of the proteins used.

### Dimeric decoys

Decoy sets were built by a docking procedure, starting from each native, dimeric structure. First, the two, native partners were shifted apart through a random translation and rotation of one of the partners. Second, rigid body minimization was done, using the Charmm19 molecular mechanics energy function [

41]. No solvent model was included; ie, electrostatic interactions were computed with a dielectric constant of one. A harmonic pulling restraint was applied between the two centers of masses. The target distance was the native separation distance; the force constant was 60 kcal/mol/Å

^{2}. After 100 steps of rigid body minimization, the number of interprotein contacts was evaluated as the number of pairs of residues with at least one interatomic distance below 4.5 Å. We then compared the number of interprotein contacts in the decoy and the native complex. We define the "relative contact number"

*F* of a given structure by:

where *N*
_{dec, }
*N*
_{
nat
}are the number of residue–residue contacts in the decoy and native structure, respectively. If *F* was at least 80%, the decoy coordinates were output at this stage. If not (the vast majority of cases), minimization was continued, without the pulling restraint, but allowing now intramolecular deformation of one of the partners. Also, an electrostatic contrast was introduced by enhancing the atomic charges on the two partners (with respect to the Charmm19 values), to increase the interprotein attraction. The charges of one partner were all increased by 0.25*e*, while the charges of the other partner were decreased by 0.25*e*. Intraprotein sidechain–sidechain electrostatic interactions were omitted and intraprotein sidechain-backbone electrostatic interactions were reduced by 1/2.

After 50 steps of Powell minimization [42], we compared the decoy structure to the native one. For the "deformable" partner, we performed a best fit of the monomer on its native structure and evaluated the "intramolecular" rms deviation between the two, considering its sidechains only. If the sidechain deviation was above 4.5 Å, the decoy was discarded. Decoys with *F* < 0.45 were also discarded. Decoys with a van der Waals energy greater than 4000 kcal/mol were also discarded. If the decoy was too similar to the native complex it was discarded. The criterion for similarity was the rms deviation for the "mobile" partner in the decoy and the native complex: if the rms deviation was less than 3 Å, the decoy was discarded. The deviation was computed without superimposing the two structures, so that it measures the overall motion of the mobile partner away from its starting, native position. Finally, to eliminate redundant decoy structures, we computed rms deviations between all pairs of decoy structures, rejecting decoys that were within 3.5 Å of another decoy. In all, we obtained a total of about 290000 decoys, for 219 native complexes, giving an average of 1326 decoys per native complex, with a minimum of 277 and a maximum of 1843 decoys. Typically, several tries were needed per decoy, so that the first step of the protocol (random displacement of one partner) was done over one million times.

To assess the quality of decoy structures, we used a detailed, atomic energy function, corresponding to the Charmm19 molecular mechanics force field [41]. The solvent environment was modelled with the ACE variant [43] of the Generalized Born model, parameterized as in [44]. These calculations were done with the XPLOR program [45].

### Empirical, residue-based energy function

Consider a protein with the amino acid sequence

*S* and a given three-dimensional conformation. We use a residue-based energy function, of the form:

where the sum is over all pairs of amino acids *i*, *j* in the protein. *U*(*S*
_{
i
}, *S*
_{
j
}) is an interaction energy, which depends on the amino acid types *S*
_{
i
}, *S*
_{
j
}at positions *i*, *j*. *C*
_{
ij
}is one if the pair *i*, *j* is close together, and zero otherwise. Specifically, *i*, *j* are assumed to interact if they have at least one pair of nonhydrogen atoms within a distance of 4.5 Å, and if they are not first or second neighbors along the polypeptide chain; i.e., |*i* – *j*| ≥ 3. We refer to *U* as an energy matrix, and to *C* as a "contact map".

With a complete amino acid alphabet of twenty types, the 210 elements in the energy matrix *U* are all different. With a reduced alphabet, amino acid types belonging to the same group share the same coefficients in the energy matrix *U*. For example, with a binary, hydrophobic/hydrophilic alphabet, the 210 elements in *U* fall into just three categories, and have just three possible values, corresponding to hydrophobic–hydrophobic, hydrophobic–hydrophilic, or hydrophilic–hydrophilic pairs. In what follows, we consider alphabets with two, three, four, six, and twenty classes.

### Reduced amino acid alphabets

To define reduced amino acid alphabets, we mainly used a hierarchical clustering proposed by Levy and coworkers [

30], shown on the right of Fig.

1. It is based on the following similarity measure between two amino acid types

*x*,

*y*:

*M* is the Blosum50 matrix [34], and the sums are over the twenty amino acid types. The clustering method assigns the two amino acid types with the highest similarity to a group. Then, the pair with the next-highest similarity value is considered. If one member of this pair belongs to the first group, the other member is added to that group. If not, the pair is assigned to a new group. This process is repeated until all the amino acid types are divided into a predefined, desired number of groups.

We also performed our own groupings, using the same clustering method, but a different similarity measure, defined by our own empirical energy function. In this case, the similarity between two amino acid types was related to the similarity between the associated coefficients in the energy matrix

*U*, constructed with the complete amino acid alphabet of twenty types. The pairwise similarity was defined as the Pearson correlation coefficient between their sets of coefficients in the energy matrix

*U*:

Here, *x* and *y* are two different amino acids, *U*
_{x, i}is the pairwise energy of interaction between the residues *x* and *i*,
is the average over the *U*
_{x, i}, and similarly for *y*.

### Optimizing the energy function: general method

We define different energy functions, corresponding to amino acid alphabets of varying complexities. For a given amino acid alphabet, the energy function is optimized by adjusting the elements of the corresponding energy matrix

*U*. The goal is to assign low energies to conformations that ressemble the native structure (which can be monomeric or dimeric). We follow a method introduced by Bastolla and coworkers [

20]. To characterize the similarity between two conformations, with the contact maps

*C* and

*C'*, we consider the number

*q*(

*C*,

*C'*) of contacts that are present in both maps, divided by the number of contacts in the larger of the two maps (the one with the most contacts).

*q*(

*C*,

*C'*) is referred to as the "contact similarity". Then; for any native structure, we compute the Boltzmann-averaged overlap

*Q* with all of its decoys:

where the sum is over all the decoys, *C* is the contact map of a decoy, *C*
_{
nat
}is the native contact map, *k* is Boltzmann's constant, and *T* is the absolute temperature. If the Boltzmann-averaged overlap *Q* is close to one, then the lowest-energy conformation is either equal to the native structure or very close to it. At physiological temperatures, a high value of *Q* implies that structures very different from the native one have high energies, while native-like structures have low energies. The energy landscape is said to be well-correlated [20].

The interaction parameters are chosen to maximize the quantity

,

where the sum is over all the proteins of our Optimization set. Following Bastolla et al, we optimize

through a gradient ascent method, adjusting the energy matrix

*U* iteratively, according to the rule:

where *t*, *t* + 1 are successive iterations; the sum is over the proteins *S* in the Optimization set; *q*
_{0} is the overlap between the minimum energy conformation and the native structure of *S*; *δ* and *γ* are numerical parameters, chosen to facilitate convergence to a maximum. Typical values were 0.2 and 0.0075. For more details, see [20].

For monomer structure recognition, we optimized the energy function using the 615 native structures in our Optimization set, along with their associated decoys. For dimer recognition, we optimized the energy function using a combination of monomeric and dimeric structures. We considered two Optimization sets, each containing 109 or 110 dimeric structures and their associated decoys, along with 110 monomeric structures and their decoys. We denote the Optimization sets by OS1 and OS2. The dimeric structures were assigned randomly to OS1 or OS2. The monomeric structures were chosen so that their average chain length is equal to that of the Monomeric Optimization Set, above (189 amino acids). The OS2 dimer structures form the Test Set 1 (TSl); the OS1 dimer structures form Test Set 2 (TS2). The same monomeric structures were used for both OS1 and OS2. Notice that cross-validation, below, is done using the performance of the energy functions on interface recognition only. Parameter bias towards the monomer structures can exist but is not a concern. Notice also that with the normalization of *Q* in Eq. (6), the monomers and dimers have the same weight in the combined optimization, despite the larger number of decoys for the dimers.

### Optimizing the energy function: starting parameters

For the more complex alphabets, the number of energy parameters to be optimized is large (210 in the case of the complete amino acid alphabet). This means that there are many local maxima for
, and the optimization must be done repeatedly, with multiple starting points. We used three methods, both for the monomeric and dimeric protein sets. For the simplest alphabets, with two or three amino acid types, we performed a systematic scan of possible parameter choices. The range of values considered was from -10 to + 10 kcal/mol, with a step of 0.25 (binary alphabet) or 0.5 (ternary alphabet). For the larger alphabets, a systematic scan was no longer possible; instead, we used 64000 random starting points, with each parameter drawn randomly from the same range of -10 to +10 kcal/mol. In addition, for alphabets with three amino acid types or more, we used starting points taken from the optimization of the smaller alphabets. Indeed, the alphabets form a hierarchy, where the finer groupings are subdivisions of the larger groupings. For example, the binary alphabet uses the groups {LVIMCAGSTPFYW} (hydrophobic) and {EDNQKRH} (hydrophilic), while the ternary alphabet splits the hydrophobic group into two subgroups.

The ternary optimization can be started from the binary optimum, assigning initially the same, "binary" parameters to the two hydrophobia subgroups. As the optimization proceeds, the parameters associated with two subgroups will shift to different values.

For the alphabets with 6 and 20 classes, to explore the effect of an overall scaling of the energy parameters, we also generated sets of random starting parameters with a fine-grained sampling of possible values, limited to the range -1 to +1 kcal/mol. These parameter sets were optimized in combination with four different values of the temperature *T* (Eq. 6), corresponding to *kT* products of 0.6, 1.0, 1.5, and 4.0 kcal/mol. 2000 parameter sets were optimized with each temperature, for each each alphabet, giving 8000 additional parameter sets per alphabet.

### Testing the energy function

To test the potentials, we used a cross validation procedure. As described above, the set of native structures were split into Optimization and Test sets. Only the Optimization set was used in Eqs. (7, 8), above. The total
was then evaluated for the Test set. A value close to 100% means that the native structure is almost always ranked as the lowest-energy conformation. We also compute a "discrimination" percentage. We mostly employ a "strong" discrimination, *D*, which is the fraction of native structures that are ranked as the lowest-energy conformation (compared to their decoy set). We occasionally compute a "weaker" discrimination, *D*
_{
k
}, which is the fraction of native structures that are ranked among the *k* lowest-energy conformations (so that *D* = *D*
_{1}).

Another form of cross validation was done by performing several blind tests. In one test, we used our energy functions to rank native and decoy structures available on the web, from Vakser, Sternberg and coworkers. Vakser's decoys sets were produced by a rigid body docking method, using the GRAMM program at "high resolution" (grid step ≤ 2 Å) [27]. They correspond to five biological dimeric complexes. Each decoy series includes 99 decoys. A few of these are within 5 Å (rms deviation) of the native structure; the others are farther away. The Sternberg decoys were produced by a rigid body docking method using the MultiDock program [28]. They correspond to ten biological dimeric complexes, including one that is present in the Vakser set (with different decoys). Each series includes 99 decoys. An electrostatic filter is used during the docking simulations, and all ten decoy series include at least three decoys that are close to the native structure. The complexes are listed in Supplementary Material.

Finally, in another blind test, we applied our energy functions to native and decoy structures of protein dimers submitted to the CAPRI experiment for protein–protein complex structure prediction (rounds 2–5) [8, 9]. We considered 13 native structures, with an average of 173 decoys each and a minimum of 66.