Homology modelling of protein-protein complexes: a simple method and its possibilities and limitations

Background Structure-based computational methods are needed to help identify and characterize protein-protein complexes and their function. For individual proteins, the most successful technique is homology modelling. We investigate a simple extension of this technique to protein-protein complexes. We consider a large set of complexes of known structures, involving pairs of single-domain proteins. The complexes are compared with each other to establish their sequence and structural similarities and the relation between the two. Compared to earlier studies, a simpler dataset, a simpler structural alignment procedure, and an additional energy criterion are used. Next, we compare the Xray structures to models obtained by threading the native sequence onto other, homologous complexes. An elementary requirement for a successful energy function is to rank the native structure above any threaded structure. We use the DFIREβ energy function, whose quality and complexity are typical of the models used today. Finally, we compare near-native models to distinctly non-native models. Results If weakly stable complexes are excluded (defined by a binding energy cutoff), as well as a few unusual complexes, a simple homology principle holds: complexes that share more than 35% sequence identity share similar structures and interaction modes; this principle was less clearcut in earlier studies. The energy function was then tested for its ability to identify experimental structures among sets of decoys, produced by a simple threading procedure. On average, the experimental structure is ranked above 92% of the alternate structures. Thus, discrimination of the native structure is good but not perfect. The discrimination of near-native structures is fair. Typically, a single, alternate, non-native binding mode exists that has a native-like energy. Some of the associated failures may correspond to genuine, alternate binding modes and/or native complexes that are artefacts of the crystal environment. In other cases, additional model filtering with more sophisticated tools is needed. Conclusion The results suggest that the simple modelling procedure applied here could help identify and characterize protein-protein complexes. The next step is to apply it on a genomic scale.


Background
Many cellular functions are mediated by protein-protein interactions [1][2][3]. An aim of modern genomics is to iden-tify and characterize these interactions. With hundreds of genomes completely sequenced, computational methods that exploit sequence data are an attractive goal. Methods have been proposed to identify, classify, and validate putative interactions [4][5][6][7][8]. For example, amino acids that participate in a stable protein-protein interface tend to undergo correlated mutations during evolution, providing an interaction signature [4,9,10]; proteins that interact, physically or functionally, have an increased chance of being encoded by genes that are physically nearby, so that analyses of genome structure can also provide information on protein interactions [5,6].
Structure-based computational methods provide additional information, and are especially useful to characterize direct, physical interactions between proteins [11][12][13][14][15][16][17][18][19]. Docking algorithms are increasingly powerful and can provide a detailed description of the interaction [13,[20][21][22]; however, they are too costly for routine large scale studies. A more attractive possibility is to perform homology modelling, and exploit the ever-growing structural databases [23] to model putative protein-protein complexes [16,18]. Homology modeling of a protein-protein complex requires that a structural similarity principle should hold: similar proteins should interact in a similar way. Previous studies show that this is true in many cases [24]. However, exceptions have also been found, where two homologous pairs of proteins interact in two very different ways [25]. In particular, a recent, systematic study by Aloy & Russell [24] found only a moderate correlation between sequence and structural similarity for proteinprotein complexes.
Here, we examine the possibilities and limitations of a simple homology modelling procedure for protein-protein complexes. The method is closely related to earlier methods [18], but differs in many significant details. Ultimately, the method should allow us to predict whether a pair of proteins interact, given that a homologous pair forms a complex of known structure. We focus on the simplest type of protein-protein interactions: stable interactions between pairs of monodomain proteins. Domains are tightly-packed, globular structures. Their interactions can be considered as fairly representative of the interactions that take place between larger proteins [12,14,15]. Protein domains are usually thought to be structures that have been conserved in the course of evolution. The SCOP database [26], for example, provides a hierarchical classification of domains of known 3D structure according to their probable evolutionary relationships. Thus, domains are the logical starting point to attempt homology modelling.
As a first step, we examine the structural and sequence similarity among known domain-domain complexes. This question has been examined at length in the past [15,24]. Here, we use a slightly simpler dataset (binary complexes between single-domain proteins), a simpler method to align and superimpose pairs of complexes, and an additional energy criterion. A simple structural homology principle holds for most cases. Excluding complexes with weak association energies (which may be artefacts of the crystal environment), most complexes in the dataset that share over 35% sequence identity always share similar interaction modes and structures. Note that we are referring here to the sequence identity averaged over the whole complex, not just the interface region. This result is more clearcut than the earlier studies.
As a second step, we examine the performance of a simple homology modelling approach, which could be used for large-scale studies. It combines structure-based alignments with a very simple threading procedure and an empirical energy function. For a given domain-domain complex, several possible templates are considered, made up of homologous complexes. We compare the association energy of experimental complex to the energies of the modelled complexes. This test represents the most basic requirement for a successful energy function. With the simple, DFIRE β energy function [21], discrimination is good but not perfect for this relatively easy test.
Finally, we perform a more realistic test, comparing nearnative structures to distinctly non-native models. Most near-native structures are ranked above most non-native models. Typically, however, a single alternate, non-native binding mode is found with a native-like energy. In some cases, this could correspond to a genuine binding mode; in other cases, such modes artefacts that would have to be filtered out using more sophisticated models.
Overall, it appears that our simple homology modelling, despite its limitations, can help identify and characterize protein-protein complexes, at least in a preliminary way. The method is computationally inexpensive and could be applied on a large scale. It can also be used in combination with other, purely sequence-based methods. In the following, we describe the Results. The next section is a Discussion. Computational Methods are described last.

Datasets of protein-protein complexes
Three datasets of protein-protein complexes are used. Importantly, all the complexes are formed from pairs of single-domain proteins. Two of the datasets are from previous publications [15,18,27]. The first, "Keskin" dataset [15,27] includes 21686 pairs of domains, divided into 3799 groups of complexes with similar interaction modes. It will be used to judge the accuracy of our structural alignments. Indeed, the domains in this set were carefully aligned and grouped into clusters by Keskin et al [15]; therefore, they provide a benchmark to check that our own alignment procedure is reasonable. The second, much smaller, "Aloy" dataset [18] includes 35 complexes, divided into nine functional groups. This dataset was used by Aloy and Russell to test their own, earlier modelling procedure [18]. By applying our procedure to this dataset, we can directly compare our performance to theirs, and judge the accuracy of our procedure.
Finally, a third set of complexes was constructed here, larger than the Aloy set, based on the SCOP classification, and comprising only single domain protein complexes (unlike the Keskin set). Most members of this dataset are also found in the Keskin set. This set was constructed as follows. The starting point was the ASTRAL Compendium of SCOP domains [26,28], which is based on version 1.67 of SCOP and contains 65122 domains. From the Protein Data Bank [23], we collected all structures that included more than one chain. Cross-checking with the ASTRAL Compendium [26,28], we discarded structures that contained more than two SCOP domains or were absent from ASTRAL. We also required that the two domains be continuous and carried by different polypeptide chains, thus excluding complexes between two domains within the same protein. We were left with 4765 structures. To exclude complexes that are obviously non-biological, such as crystal contacts, we discarded structures that had fewer than ten contacts between their two domains. An interface contact was said to occur when one domain of the complex had a nonhydrogen atom less than 8 Å from a nonhydrogen atom of the other domain [29,30]. This left just 1509 complexes. Each one was checked with the program PQS [31]; only those for which PQS returned a dimeric status were retained. The 750 remaining complexes were partitioned into groups based on proximity in the SCOP classification. Specifically, two complexes A:B and A':B' were put in the same group if A and A' are part of the same SCOP superfamily (say, S A ) and B and B' are part of the same SCOP superfamily (which can be different from S A ). Groups containing three or fewer complexes were discarded, since they allow just a few threading models to be built (see below) and just a few discrimination tests to be done. At the end of this selection process, we were left with 743 domain-domain complexes, partitioned into 40 groups, with between 4 and 71 complexes per group. 667 complexes are homodimers; 66 are heterodimers. Each group will be labelled by its pair of SCOP superfamily identifiers. We will refer to each of the 40 groups as an "Interacting Superfamily Group", or ISG.

Testing our alignments by classifying interaction modes
We first evaluate the accuracy of our structural alignments by comparing to the earlier, benchmark study by Keskin et al. [32]. The Keskin dataset contains 21686 complexes, clustered [27] according to the structural similarity of their binding modes [33]. 621 of the Keskin complexes are also part of our dataset. We performed a similar geo-metrical analysis of our own dataset, to identify the binding modes present in each Interacting Superfamily Group, or ISG (see Methods). All the complexes within each ISG (40 groups; 743 complexes in all) were compared to each other (see Methods), for a total of 9630 pairwise comparisons. The structural deviation is measured by an "Interaction rmsd", or Irmsd, which corresponds to the rms C α coordinate deviation between the smaller partners after superimposition of the larger partners; see Methods. The Irmsd was computed for each pair of complexes. A hierachical, average-linkage clustering was then performed, using the Irmsd as the distance metric. A maximum distance of 6 Å was allowed between any member of a cluster. With this procedure, each ISG yields a certain number of clusters, corresponding to distinct interaction modes between the two domains.
With this procedure, our clusters are in good agreement with the Keskin set (Table 1). This shows that our alignment method is reasonable. In particular, the use of a sequence alignment of regions that flank the MATRAS structural alignment (see Methods) does not cause difficulties. Most ISGs contain several interaction modes, most of which were also identified by Keskin  Compared to the Keskin interaction modes, our analysis returned nine additional modes ( Table 1). Most of these modes should probably be considered false positives, or "overcounts" of our mode counting. They correspond to structural similarities between interaction modes that were underestimated by our comparison method. Note that a few are genuine, since our dataset includes some additional complexes that were not part of the Keskin study.
Both our own and the Keskin analyses frequently identify very diverse modes of interaction within the same SCOP superfamily. In fact, this diversity is misleading. We will see below that the different interaction modes are almost always associated either with sets of complexes having a low mutual sequence identity, or with small, weakly-stable interfaces that are probably non-biological, induced by a given crystal environment.

Relation between sequence and structure similarity
Homology modelling of protein-protein complexes is only viable if similar sequences lead to similar structures. To understand more clearly the relation between sequence and structural homology, we consider our 743 complexes, grouped into 40 ISGs. We begin by identifying complexes whose interfaces are sufficiently large and energetically stable. Specifically, we identify complexes that have a buried surface of at least 600 Å 2 and a DFIRE β interaction energy of -10 kcal/mol or better. Following earlier analyses of protein-protein complexes [29,30], we assume that complexes with such large and stable interfaces are likely to be biologically meaningful complexes, whereas the other complexes are much more likely to be artefacts of the crystal environment. The energy cutoff is a small but partly arbitrary value; see below. Of the 743 complexes, 30 are eliminated by their small surface areas. Another 233 are eliminated by the energy cutoff. We are left with 480 "large and stable" complexes.
From each pairwise comparison, we obtain a sequence identity and a structural similarity score, as described in Methods. The sequence identity is given by the optimal alignment (structural, plus flanking sequence alignment if required; see Methods). The structural similarity is measured by the pairwise Irmsd (computed by first superimposing the larger members of a complex, say A i and A j , then comparing all equivalent C α 's of the smaller members, say B i and B j ; see Methods). The Irmsd values are plotted vs. the corresponding sequence identities in Fig. 1 for the whole dataset (9630 points). In Fig. 1, each large dot corresponds to a comparison between two of the "large and stable" complexes. Each small dot corresponds to a comparison where one or (usually) both of the complexes being compared have small and/or less stable interfaces. Large, gray dots correspond to comparisons where the structural alignment with MATRAS provided fewer than 80% of the equivalent residues used for the Irmsd calculation; the rest of the equivalent residues are pro-vided by a sequence alignment (see Methods). All the gray points lie in the lefthand part of the plot, below a 35% sequence identity. Thus, our use of a flanking sequence alignment for some complexes does not affect the qualitative results. We can now comment on our choice of energy cutoff: -10 kcal/mol is in fact the largest (least negative) value that gives a clean separation between the large and small dots in Figure 1.
In a previous study, Aloy and Russell extensively studied the relationship between the sequence and structural similarities of numerous protein complexes; see Fig. 1 in reference [24]. In contrast with their approach, which used a simplified geometric representation to measure structural deviations (four sites per subunit), we compute the Irmsd The relationship between sequence and structural similarity Figure 1 The relationship between sequence and structural similarity. 743 complexes from 40 interacting superfamily groups (ISGs) were analyzed. All pairs within each ISG were compared, for a total of 9630 pairwise comparisons. Small points correspond to comparisons involving at least one complex with either a small interface (buried area < 600 Å 2 ) or a weak association energy (E int > -10 kcal/mol; see text). Points labelled A-G are discussed in the text. The horizontal line corresponds to Irmsd = 6 Å; the vertical line corresponds to a 35% sequence identity. Gray points correspond to comparisons where the MATRAS structural alignment provided fewer than 80% of the equivalent residues used for the Irmsd calculation. All the gray points lie below the 35% similarity threshold.
between pairs of complexes using an atomic level of detail. We also apply an energy cutoff that yields additional insights, and we use a simpler, more homogeneous dataset (binary complexes between single-domain proteins).
With these modifications, the sequence/structure homology principle is more clearcut. Indeed, considering only the large dots (black or gray), we see that when two complexes share at least 35% sequence identity between their subunits, they almost always have a similar mode of interaction. Note that the sequence identity is averaged here over the whole complex, not just the interface region. At this level of sequence identity, almost all of the measured Irmsd's are below 6 Å (Fig. 1). Thus, in the vast majority of cases, sequence homology implies structural homology. This is a basic requirement to allow homology modelling of protein-protein complexes. Below 35% of sequence identity, the interaction modes often differ.
In contrast, the small dots are widely distributed throughout Fig. 1. One group of points, for example, forms a band at around 100% sequence identity, extending up to Irmsd values of 80 Å. These points presumably arise from different packing modes in different crystal forms. A detailed analysis of crystal packing effects is beyond the scope of this paper; see [29,30].
In some cases, two "large and stable" complexes have subunits that share more than 35% sequence identity and yet they do present strongly dissimilar interaction modes (Irmsd above 6 Å). In Fig. 1, such complexes form nine groups, identified by the letters A-G. Group A involves two structures of the endonuclease EcoRV, which is a homodimer, either bound to DNA (1BSU) or unbound (1RVE). Although the whole enzyme is considered to form a single domain by SCOP (c.52.1 superfamily), the structure is actually made of two distinct, globular regions. Upon binding to the DNA, there is a relative motion of these two regions, which leads to a fairly high Irmsd between the two structures (7.8 Å), despite their nearly identical sequences (98.3% identity). Group B corresponds to another two endonucleases, with the same structural change upon DNA binding. Group C is similar. One complex has a large overall Irmsd value with respect to two others; if only the regions close to the interface are compared, the structural deviations are very small (<2 Å). The large Irmsd values are thus produced by a compact, globular region far from the interface which could be considered as a separate domain [34]. All these cases can be considered artefacts of the SCOP domain definitions.
Group D involves the complex 1A64, a member of the Immunoglobulin-like ISG. The subunits of this complex are are artificially engineered mutants of the variable anti-body domains. The mutants were specifically selected to produce different interaction modes [35]. Group E corresponds to Ribonuclease A structures. These homodimeric structures are known to provide remarkable examples of domain swapping [36,37]. This leads to geometrically different interfaces, despite identical or very similar sequences. Such phenomena are relatively rare among protein dimers. Group F corresponds to a dimeric snake toxin protein [38], compared to a complex of a homologous toxin with an inhibitor. The inhibitor is expressed in the snake to modulate the activity of the toxin [39].
Remarkably, the inhibitor is strongly homologous to the toxin (~50% identity), and modulates its action by forming a heterodimer with one of the toxin subunits. The mode of interaction between the toxin and the inhibitor differs from the toxin homodimerization mode. This is an unusual case where a new biological interface has evolved that violates the homology principle (similar sequences but different interaction modes).
Another region of interest is the one between 33 and 37% sequence identity. Here, there are three small groups with large Irmsd values: G1, G2, G3. Group G3 is another case where structural deviations arise from a domain swapping event (complexes 1E8A, 1HT9, 1KSO). Group G1 arises from three complexes (1DEK, 1KO5, 1KO8), two of which (1KO5, 1KO8) are very similar (Irmsd of 1.5 Å). Comparing 1DEK and the other two, the interaction modes are very different, despite a 35% sequence identity. However, the interfaces are not very large (~2000 Å 2 ) and the DFIRE β association energies are about -11 kcal/mol, barely below our -10 kcal/mol threshold. Finally, Group G2 contains two complexes (1G64 and 1HOP) with different interaction modes despite a 35% sequence identity, large buried surfaces, and large association energies (<-20 kcal/mol). Thus, the region between 35 and 37% sequence identity is a limiting region, where the structural homology principle begins to be violated. Notice that the precise location of this threshold region (~35%) is a result of our precise method of sequence and structural comparison, and the precise energy cutoff employed; slightly different methods would lead to a slightly different threshold.
Overall, structural homology holds in the vast majority of cases when the sequence identity is above 35% and the association energy is sufficiently large.

Recognition of native interfaces: comparing the experimental structure to threading models
The structural homology principle observed above (in agreement with earlier work [24]) is a necessary but not a sufficient condition for homology modelling of proteinprotein complexes. We turn now to the problem of the energy function. We first perform a very simple test, both with DFIRE β and with a simpler, residue-based energy function [40]. Their discrimination power was evaluated by their ability to assign a lower association energy E int to the experimental, Xray structure, compared to alternate structures obtained by threading the sequence onto other, homologous dimers. This discrimination, again, is a basic, necessary condition for a successful modelling procedure. 17 groups of homologous dimers were used, for a total of 123 complexes. Eight of these groups, containing 88 complexes, came from our own, ISG dataset. Nine groups, containing 35 complexes, came from the work of Aloy and Russell [18], who performed the same discrimination test.
The DFIRE β function has a 92% success rate in the discrimination tests (1035 successes out of 1124 tests). The number of successful and failed tests are given in Table 2 for each superfamily group. Each sequence within a group is threaded onto every other structure in the group (see Methods). Success means that a positive S thread score is obtained when an experimental structure is compared to a threaded model (see Methods). The residue-based energy function has a respectable, but poorer discrimination rate of 66%. Among the structures used by Aloy and Russell, those from the peptidase and squash trypsin groups were the hardest to discriminate. This is in fair agreement with the results they reported [18], using a different energy function and a different measure of structural similarity.
36 complexes out of the 123 were not fully discriminated: their sequences had a better energy for at least one of the models than for their Xray structure (negative S thread ). For most of these, either the model was very similar to the Xray structure (Irmsd values of 2 Å or less), or the Xray structure had a weak association energy (suggesting that it may not be a biological complex). We consider each case in detail.
In the b. 29 (Table  1). Only 14 were used in the discrimination tests (Table  2), as several redundant complexes were excluded (100% sequence identity with other complexes in the group and very similar structures; for example 1fk8 and 1fjh, 1keu and 1ker). Five of the 14 complexes are not discriminated. Overall, if we include near successes, exclude very weak complexes, and take into account mutually similar, redundant structures, this first, basic, discrimination test has just 20 real failures.

Recognition of native interfaces: comparing near-native and distinctly non-native models
In a practical modelling situation, one does not know the experimental structure; rather, one must distinguish between models that are close to the real structure ("nearnative" models) and others that deviate significantly from the true structure ("non-native" models). It is therefore of interest to examine the variation of the association energy as a function of the deviation Irmsd from the native structure. Ideally, a smooth increase should be observed. To illustrate the possibilities and limitations of the present modelling procedure, we focus on two ISGs in particular: c.2.1 and a.133.1. These groups are fairly representative of the full set of discrimination tests ( Table 2). Fig. 2 shows the complete set of threading events for each group (650 and 210 events, respectively), relating each S thread score to the structural similarity (Irmsd) between the Xray structure and the corresponding model. The dataset includes both "non-native" and "near-native" tests; a "near-native" model occurs when the native sequence is threaded onto another complex that happens to be structurally homologous. The energy "surfaces" thus constructed appear complex, with no simple relation between S thread and Irmsd. We do not observe a smooth increase with Irmsd. A more detailed analysis shows, however, that the model does display a distinct, albeit imperfect correlation between energy and structure.
Indeed, Fig. 3 gives a more detailed view of the c.2.1 data. Several distinct binding modes were identified above (Table 1). Using a somewhat finer clustering here, we distinguish 11 modes (instead of nine in Table 1). These are shown in the upper part of Fig. 3. Complexes are colored according to their binding mode; eleven colors can be seen. For simplicity, only 18 complexes are shown in the Figure (of the 26 in this group; some very similar structures are left out of the Figure). In the lower part of Fig. 3, each panel corresponds to one Xray complex and one binding mode; the Xray complex is compared to all possible threading models. Only eight of the eleven binding modes are shown. With this detailed representation, the few, very mild discrimination failures within this group can be seen individually (points with a negative S thread ; e.g., there are two failures in the 1EK6 and 1KEW panels, which correspond to the yellow and green binding modes). Except for a few points, S thread does tend to increase with Irmsd. This indicates a correlated energy surface. In the 1ID1 panel, for example, S thread increases rapidly then remains roughly constant for 40 ≤ Irmsd < 60 Å.
In the 1JAX and 1EK6 panels, the situation is similar, except for one brown point. In all panels, any points with low energies but large Irmsds are associated with one or all of the red, blue, and especially brown binding modes. These binding modes are mutually similar; see tree in upper panel of Fig. 3. Clearly, this group of binding modes represents a strong competitor for several other, experimental modes. It may be that this group represents a genuine, alternate mode that could be observed under the right experimental conditions; it may also be that some of the other, experimental modes are artefacts of the crystal environment. Even if the low energy assigned to this group of modes is a pure artefact of the model, we note that our procedure has allowed most other modes to be correctly positioned in the high energy region. The few, low energy, competing structures (red, brown, and blue) all correspond to a single, competing binding mode that could possibly be discriminated by further, more sophisticated energy calculations such as all-atom models. some or all of the corresponding, experimental binding modes could be artefacts of the crystal environment; this would explain the observed violations of the energy/structure correlation. For this ISG, we compared the sequence identity at the interface to the overall sequence identity for several pairs of complexes that share the same interaction mode. In 7 cases out of 7, the overall sequence identity was higher than the interface sequence identity.
Overall, the procedure is much more successful than suggested by the simplistic analysis shown in Fig. 2. The energy/structure correlation is mainly violated by a few models, corresponding to one or a small group of very similar binding modes. In both ISGs analyzed, some or all of these violations may be due to experimental binding modes induced artificially by the crystal environment.
Even if the violations are pure artefacts of our modelling procedure, their number is small enough so that a more sophisticated modelling procedure could be used for their further analysis. Indeed, a larger number of high-Irmsd models are clearly eliminated by our energy values. A clustering of binding modes and their detailed energy analysis was needed to reveal this more promising picture. Such an analysis had not been performed in the past.

Discussion and conclusion
For individual proteins, structure-based homology modelling is by far the most important prediction method.
Most of the difficulties that exist for individual protein modelling can also affect complex modelling. However, complexes include an additional structural level, namely quaternary structure. Therefore additional difficulties occur, especially for large, multi-domain, multi-protein complexes. Here, we limited ourselves to the simplest class of complexes: binary complexes between single domain proteins. These already present several specific difficulties, at least four of which were illustrated in this work. First, in the PDB, there are far fewer structures of complexes than of individual proteins. Second, it is often hard to distinguish biologically-meaningful complexes from those induced by the crystal environment. Third, two proteins with similar structures may have very different modes of interaction with a third protein, simply because a few surface residues differ. Fourth, domains within the same protein may shift with respect to each other in different environments.
To overcome these difficulties, we took two main steps. We limited ourselves to binary complexes between singledomain proteins, and we identified complexes that have a weak association free energy (E int >-10 kcal/mol). When comparing two complexes, we also paid care to the definition of equivalent residues (see Methods), so that a simple and intuitive measure of structural distances could be used (the Irmsd), and to the treatment of gaps in the threading procedure.
We obtained two main results. First, complexes that share over 35% sequence identity usually share similar structures and interaction modes. The same qualitative result was obtained in earlier studies [24]. Here, however, our dataset is simpler, our analysis somewhat more detailed, and the overall result is more clearcut. Note that the reverse situation has also been observed: complexes with weak sequence homology can occasionally have similar binding modes. The literature provides just a few known cases, such as cytochome c' and its homolog, the Erv2p thiol oxydase [41]. The precise value of the sequence identity threshold obtained here, 35%, obviously depends on the details of our data set and our alignment method.
Exceptions to the similarity principle were either weaklystable complexes or a few unusual cases (such as domain swapped structures). Below 35% identity it is common to find different interaction modes for homologous complexes.
Our second main result is the imperfect but still respectable success rate obtained when comparing near-native and non-native models (Figs. 3 and 4). The energy/structure correlation is mainly violated by a few models, corresponding to one or a small group of similar binding modes. In both ISGs analyzed above, some of the failures correspond to weakly-stable complexes. These failures and possibly others may be due to experimental binding modes induced artificially by the crystal environment. In general, we expect poor discrimination for highly-transient and weakly-stable complexes, which were shown above to depart from the simple similarity principle. We expect that the small number of other failures could be resolved using a more sophisticated, all-atom energy function. Such an energy function will require a detailed model for the protein sidechains (in contrast to DFIRE β , which only uses the C β ); this could introduce additional errors. Note that the performance of DFIRE β is already comparable to several all-atom energy functions [42,43].
Overall, our results suggest that the simple modelling procedure applied here could help identify and characterize, at least in a preliminary way, protein-protein complexes. The next step is to apply it on a genomic scale [17].

Methods
We describe first the procedure used for the structural and sequence comparisons. Next, we describe the energy function employed. Finally, we describe the structure modeling procedure, which involves a simple threading technique.

Measuring similarities between pairs of protein domains
The dimers within each ISG were compared in terms of their structures and sequences. To compare two complexes, A i :B i and A j :B j , we first superimpose the structures of A i onto A j and B i onto B j . Both pairs were aligned using the MATRAS web server [44], which provides a local structural alignment. MATRAS uses a heuristic method to solve the problem of the structural alignment of two proteins. It employs different similarity scores which are all constructed by applying a Markovian Transition Model to observed data frequencies [45]. A first, rough alignment of the two proteins is obtained by superimposing pairs of secondary structures. This initial alignment is refined by performing two consecutive dynamic programming alignments. Each one uses a specific score and an affine gap penalty. The first score is an "environment score", characterizing the local chain structure combined with a binary description of the solvent accessibility [45]. The second dynamic programming alignment uses a "distance score" between pairs of residues in each protein. This score function was parameterized based on the observed pairwise distances between C β 's in proteins.
The final alignments usually included only a subset of one or both proteins (8054 out of 9630 pairs of dimers, corresponding to 11055 out of 19260 alignments). The regions flanking the aligned segments were sometimes quite large; for 6.9% of the alignments, they represented over 20% of the sequence length (Fig. 5). In these cases, we supplemented the MATRAS alignment with a sequence-based alignment of the regions not aligned by MATRAS. Indeed, we want to characterize the relation between sequence/ structure similarity for single-domain, binary complexes. While it is reasonable to limit ourselves to regions with a significant sequence homology, there is no reason to exclude regions where the structures are less similar (i.e., they are not aligned by MATRAS). If we excluded such regions, we would assume, not prove a correlation between sequence and structure similarity. It is more appropriate to take into account flanking regions whose sequences can be aligned, even though their structures were not aligned by MATRAS.
Therefore, we aligned separately the sequences N-terminal and C-terminal with respect to the segment aligned by Alignments MATRAS. We used the EMBOSS implementation [46] of the Needleman and Wunsch algorithm [47], with 10 and 0.5 as penalties for the opening and the extension of gaps. This way, we could include additional N-or C-terminal regions in the final alignment. The two alignments (with and without the flanking regions) define two sets of equivalent residues in A i and A j (respectively, B i and B j ).

MATRAS contributions to the alignments
The equivalent residue sets were then used to compute structural deviations, as follows. The largest of the two domains A i and B i was chosen, say A i . It was superimposed onto A j using the C α coordinates of equivalent residues. Then, an rms coordinate deviation was computed for B i and B j . This deviation measures the displacement of the smaller domain, B j , at the surface of the larger, A j , when going from the A i :B i complex to A j :B j . The rms deviation is computed for all equivalent residues, defined by the larger alignment (structural alignment, plus sequence alignment of flanking residues). We refer to this deviation measure as an "interaction rmsd", or Irmsd [48]. Occasionally, we also employ a deviation computed for interface residues only.

Measuring the stabilities of the complexes
An energy function is used to compare the different complexes. We rely on pairwise statistical potentials derived in previous studies [40,43]. The stability of a given complex can be characterized by the total energy of interaction between the partners. Alternate structures will be constructed for each dimer using as templates other, homologous dimers from the same ISG. Therefore, a method is also needed to thread the sequence of a given complex onto the structure of another, homologous complex.

Statistical energy function for protein-protein interactions
For a complex A:B, an energy score E int is defined: Here, C is an atomic or residue contact map describing the A:B interface; l, m are amino acid positions in A and B, respectively; S l , S m are the corresponding amino acid types; and U is a table of interaction parameters. This score is computed using either of two knowledge-based energy functions. The first is a residue-based energy function described earlier [40]. It uses a residue-residue contact map. If two nonhydrogen atoms from two residues belonging to the two proteins are less than 4.5 Å apart, the residues are said to be in contact. The interaction parameters form a 6 × 6 table, corresponding to six groups of amino acids: {LVIMCAGSTPFYW}, {ASTP}, {FYW}, {EDNQ}, {KRH}, and {G}. This very simple energy function performed well in several applications [40]. The second energy function is atom-based. It is taken from the DFIRE potentials developed by Zhou and coworkers [42,43]; the protein structures are reduced to their backbone and C β atoms. We refer to it as the DFIRE β potential.
Threading a sequence onto a structure Each experimental, or "native" complex is compared to a series of modelled complexes. Modelling is done by threading the sequence of the native complex, A i :B i , onto the structures of homologous complexes A j :B j . Hence, the target dimer A i :B i and the template structure A j :B j are always part of the same ISG. An interaction energy is computed for each modelled complex. Ideally, templates that are structurally close to the native complex should lead to models with strong interaction energies. Templates that are structurally more distant should lead to poorer energies.
Mapping the A i :B i sequence onto the A j :B j template structure is a classic threading problem [49][50][51][52]. Here, threading is performed very simply, by changing each amino acid type in A j :B j to the type found at the equivalent position in A i :B i . Equivalency of positions is defined by the alignment described above. Gaps are handled as follows. When a residue R of the template A j :B j is aligned with a gap, we add a distance-dependent energy penalty P to its other interactions. By trial and error, we found that the following form worked well: where the sum is over the atoms k of the "gapped" residue R and all the atoms l of the target protein. A is a real number, empirically set to 10 kcal/mol and α is a dimensionless DFIRE scaling factor, α = 0.0157 [43]. With this form, the penalty P is roughly equivalent to applying a 4.6 kcal/mol cost for removing a residue from the rim of a protein-protein interface [53]. No penalty is applied for the opening or extension of gaps at the beginning or the end of either polypeptide chain (A j or B j ). Indeed, the definitions of domain boundaries within SCOP are somewhat imprecise, so that differences at the beginning or the end of a domain should not be penalized.
For each threading of a pair of sequences A i :B i onto a template complex A j :B j , a score S thread is computed, based on the protein-protein interaction energy: Here, E int (s i , C j ) is the interaction energy of the A i :B i sequence, s i , in its threaded conformation, C j (i.e., threaded onto the A j :B j structure). E int (s i , C i ) is the energy of the s i sequence in its own, native conformation, C i . If the sequence s i is threaded onto its own, native structure, the score is zero. If the sequence s i is poorly suited to the threaded conformation C j , the score is positive. Occasionally, one can obtain a negative score, in which case the threaded conformation may represent a superior mode of interaction for the A i , B i proteins. This might occur if the "native" A i :B i structure is, in fact, an artefact of the crystal environment.

Basic criteria detecting crystal interfaces
By inspecting all the complexes with PQS [31] and applying a threshold of ten for the minimal number of contacts in a complex (see above), we eliminated many spurious interactions from our dataset. However, large and physically-plausible interfaces that are, in fact, crystal contacts remain in the dataset. This is actually useful, since they provide alternate models that can be compared to even larger, biological interfaces. Following previous studies [29,30], three main indicators are used above to differentiate crystal and biological interfaces: the DFIRE β interaction energy E int , the number of interfacial contacts, and the total surface buried upon dimerization.