Interaction profile-based protein classification of death domain
© Lett et al 2004
Received: 07 February 2004
Accepted: 09 June 2004
Published: 09 June 2004
The increasing number of protein sequences and 3D structures obtained from genomic initiatives is leading many of us to focus on proteomics, and to dedicate our experimental and computational efforts to the creation and analysis of information derived from 3D structure. In particular, the high-throughput generation of protein-protein interaction data from a few organisms makes such an approach very important for understanding the molecular recognition that makes up the entire protein-protein interaction network. Since sequences and experimental protein-protein interactions are generated faster than the 3D structures of protein complexes can be determined, there is tremendous interest in developing in silico methods that generate such structures for prediction and classification purposes. In this study we focused on classifying protein family members based on the distinctiveness of their protein-protein interactions. Structure-based classification of protein-protein interfaces was described initially by Ponstingl et al. and more recently by Valdar et al. and Mintseris et al., using complex structures that have been solved experimentally. However, little has been done on protein classification based on the prediction of protein-protein complexes obtained from homology modeling and docking simulation.
We have developed an in silico classification system entitled HODOCO (Homology modeling, Docking and Classification Oracle), in which protein Residue Potential Interaction Profiles (RPIPs) are used to summarize protein-protein interaction characteristics. This system was applied to a dataset of 64 proteins of the death domain superfamily to classify each member into its proper subfamily. Two classification methods were attempted: a heuristic method and support vector machine learning. Both methods were tested with 5-fold cross-validation. The heuristic approach yielded a 61% average accuracy, while the machine learning approach yielded an 89% average accuracy.
We have confirmed the reliability and potential value of classifying proteins via their predicted interactions. Our results are in the same range of accuracy as other studies that classify protein-protein interactions from experimentally obtained 3D complex structures. While our classification scheme does not directly take sequence information into account, our results are in agreement with functional and sequence-based classification of death domain family members.
The genomic revolution has provided vast protein data resources now waiting to be transformed into usable knowledge that can be applied to solve pressing biological problems. Classification remains a favorite method for performing such transformations because of its intuitiveness and robustness against errors. Several schemes have now been proposed for automatic classification of proteins [4, 5]. They range from simple amino acid sequence comparisons, through more localized motif-based methods [6–8] further improved by position specific scoring matrices  and finally to hidden Markov model profile-based methods . Alternatively, structure-based classification provides a more direct means of inferring function, albeit on the much smaller structural databases [11, 12]. Recently, groups have taken an integrated approach that blends the advantages of the methods discussed above [13, 14].
The high-throughput generation of protein-protein interaction data from a few organisms has been carried out [15–18]. This wealth of experimental data requires new computational mining approaches to help us understand molecular recognition in protein-protein interaction networks. Since sequences and experimental protein-protein interactions are generated faster than the 3D structures of protein complexes can be determined, there is tremendous interest in developing in silico methods that could predict macromolecular structures and assemblies for prediction and classification purposes.
For example, computational approaches based on sequence, expression and literature abstract data have been developed to predict protein-protein interactions . These methods are based on the assumption that non-homologous pairs of genes that show correlated behavior across data from different sources should interact with each other. In addition, structure-based classification of protein-protein interfaces has been described initially by Ponstingl et al.  and more recently by Valdar et al.  and Mintseris et al. , from complex structures that have been solved experimentally.
The last decade has seen enormous progress in the reliability and accuracy of 3D structure-based in silico techniques including 3D structure prediction based on sequence homology and macromolecular docking. Competitions in both domains have spurred the ingenuity necessary for tackling these challenging problems [20, 21]. In this study we combine these two approaches to perform protein classification.
To efficiently dock two molecules that participate in a protein-protein or protein-ligand interaction, several steps must be carried out. The process first involves an efficient search and matching algorithm that covers the conformational space, and then one or more selective scoring functions that can efficiently discriminate between native and non-native solutions. Docking algorithms are classified by the extent of flexibility that they attempt to address: (1) rigid body docking, where the two molecules are treated as rigid solid bodies; (2) semi-flexible docking, where one molecule, the receptor, is considered a rigid body while the ligand, generally smaller, is considered flexible; and (3) flexible docking, where both molecules are considered flexible. Flexible docking is becoming more popular because it takes into account the conformational changes that generally occur when proteins interact with each other. However, rigid body docking has already been widely and successfully used to dock protein-protein complexes [22, 23]. In this method, flexibility can be incorporated through a "soft belt" into which atoms from the second molecule can penetrate, drastically reducing the complexity and increasing the speed of the simulation. Rigid body docking is based on the observation that 3D protein complexes reveal a close geometric match at the interface of a receptor and a ligand. Since many false positives with better scores than the true solution are very often obtained, additional rescoring functions have been introduced to eliminate these wrong solutions.
In this study, rigid body docking is applied to the classification of protein-protein interactions in the death domain superfamily. We chose rigid body docking because of its higher speed. The scheme uses in silico protein-protein interaction predictions, applied to 3D protein structures built using homology modeling, as its exclusive means of performing classifications. We implemented the approach in a system called HODOCO (HOmology modeling, DOcking, and Classifying Oracle), and used Residue Pair Interaction Profiles (RPIPs) as a means to summarize protein interaction characteristics. The system was applied successfully to the problem of classifying members of the human death domain superfamily. We show that despite the limited reliability of current docking algorithms, interaction profile-based classification of this family can be obtained with 90% accuracy.
Human death domain superfamily members with known structures. PDB codes are followed by chain identifiers; an underscore represents the only chain (http://www.rcsb.org).
1c15:A 2ygs:A 3ygs:C 1cy5:A 1cww:A
1e3y:A 1e41:A 1fad:A
Gene names and RefSeq IDs of the sequences used in this study. Data from the UCSC Genome Browser  April 2003 assembly.
Great care was taken during the model building process to ensure that only the highest quality models were accepted. This degree of caution was required to avoid the risk of propagating inaccuracies throughout the system. GRAMM was chosen as the docking engine for its ability to return raw putative complexes that have not been further filtered, which allowed us to estimate a signal-to-noise ratio from the raw data and, in turn, to devise a procedure for reducing the search space. The docking algorithm was used to build a database that could be mined for specific complexes with properties unique to a given family. We chose to perform docking only between members of the same family, as the intrafamily complexes provided a broad enough sampling to yield high accuracy rates while limiting our computational cost. Keeping in mind GRAMM's asymmetric algorithm, 21² + 10² + 21² + 12² = 1126 docking simulations were conducted, each offering 1000 putative complexes, for a total of over 1.1 million putative complexes.
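The simulation count can be checked with a few lines of arithmetic. The sketch below assumes that the four terms are the squares of the family sizes (21, 10, 21 and 12, which sum to the 64 proteins in the dataset), so that each family contributes one run per ordered pair of models, self-pairs included. That reading is inferred from the total, and the assignment of sizes to named families is not stated here, so the sizes are left unlabeled.

```python
# Family sizes inferred from the total in the text (21 + 10 + 21 + 12 = 64).
# Which size belongs to which family is not stated, so they stay unlabeled.
family_sizes = [21, 10, 21, 12]
COMPLEXES_PER_RUN = 1000  # putative complexes kept per docking simulation

# GRAMM is order dependent, so A-vs-B and B-vs-A are separate runs.
# Counting every ordered pair (self-pairs included) gives n * n runs per family.
runs_per_family = [n * n for n in family_sizes]
total_runs = sum(runs_per_family)
total_complexes = total_runs * COMPLEXES_PER_RUN

print(runs_per_family)   # [441, 100, 441, 144]
print(total_runs)        # 1126
print(total_complexes)   # 1126000
```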
The goal of this work was to show that in silico interaction-based protein classification can be obtained reliably for the death domain superfamily. To that end, we developed a classification pipeline that assigns proteins to families from their predicted interactions.
While others have used in silico interaction profiles to characterize the docking ability of small molecules binding to experimentally determined 3D structures , or to discover novel protein interactions using known 3D complexes , our method is unique in that it applies to 3D molecular models of protein complexes, and it considers not only multiple binding partners but also multiple interfaces for each partner.
Much further work can be done. Similar to , RPIPs could be used to cluster models rather than classify them. Here the goal would be to find alternative "families" based on interaction data, without regard to sequence homology. For example, one could discover groups of models that share a common interaction interface, or outliers that have a unique interface. This possibility is highlighted by the consistent misclassification in this study of the CARD domain of APAF1, which has unique binding properties within the family.
It is very important to note that this study is only as powerful as the methods it builds upon. In particular, docking is still a highly active area of research with much work remaining to be done on model flexibility, solvent simulation, and force field optimization. Similarly, model building of the protein complex remains a difficult procedure requiring great care, making it difficult to automate accurately . Our results have shown that despite the error introduced into the classification pipeline by the limited reliability of the underlying tools, interaction profile-based protein classification can be obtained with confidence. The fact that multiple parties have independently begun researching the potential of interaction profiles suggests that it may become a popular method for biological data mining in the future.
Three parallel approaches were used to obtain the set of human death domain protein sequences examined in this study. First, a literature survey was conducted to identify an incomplete set of protein sequences from well-known family members. Second, the Pfam database was consulted to extract a set of protein sequences from members of the CARD, DED, DEATH and PAAD/DAPIN/PYRIN families. Third, the sequences found in the previous two methods were pooled to conduct a distant homologue search via an iterative BLAST procedure .
Atomic coordinates from solved 3D protein structures were retrieved from the Protein Data Bank and used for docking studies where available. Atomic coordinates for the remaining family members were obtained by homology modeling. Each amino acid sequence (the target sequence) from the death domain superfamily was submitted to the Polish metaserver. Once pair-wise alignments between the target sequence and the templates were generated, the alignment with the best score and the corresponding template structure were submitted to the MODELLER program to generate homology models for the target sequence. The default parameters for MODELLER were used, with the "loop-modelling" option enabled. A total of 6 models were generated for each target, and the models were refined by molecular dynamics with simulated annealing (a functionality in MODELLER) to improve their quality. All 6 models were checked for favorable geometrical and stereochemical properties using Verify 3D and PROCHECK, and for the RMS deviation (RMSD) between the model and the template from which it originated. Based on these criteria, the best model was selected as the representative model for the target. Low quality models were discarded if they exhibited an RMSD greater than 1.5 Å over all main chain atoms relative to the structure template from which they were built. All remaining models were then used as input to the classification analysis, as if they were the original input to the system. For the docking studies, models were structurally superimposed on a reference model (ASC) using the Combinatorial Extension method. We refer to the models as μ1 through μm, where m is the number of models.
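The RMSD filter described above can be sketched as follows. This is an illustration only: it assumes the model and template main chain atoms are already paired and superimposed, and it stands in for the Verify 3D and PROCHECK assessment with a single hypothetical `quality` number, since the text does not say how the criteria were combined. The function names are ours.

```python
import math

RMSD_CUTOFF = 1.5  # Å, the main chain threshold quoted in the text


def rmsd(coords_a, coords_b):
    """RMSD between two equal-length lists of (x, y, z) tuples.

    Assumes the atoms are already paired and superimposed.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


def select_representative(candidates, template_coords, cutoff=RMSD_CUTOFF):
    """Pick one model name from (name, coords, quality) candidates.

    Models above the RMSD cutoff are discarded; among the rest, the one
    with the highest (hypothetical) quality value is returned, or None.
    """
    passing = [c for c in candidates if rmsd(c[1], template_coords) <= cutoff]
    if not passing:
        return None
    return max(passing, key=lambda c: c[2])[0]


# Toy data: model_1 is shifted 1 Å from the template, model_2 is shifted 2 Å.
template = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
model_1 = [(x + 1.0, y, z) for x, y, z in template]
model_2 = [(x + 2.0, y, z) for x, y, z in template]
candidates = [("model_1", model_1, 0.4), ("model_2", model_2, 0.9)]
print(select_representative(candidates, template))  # model_1
```

In the toy example, model_2 has the better quality value but is rejected because its RMSD exceeds 1.5 Å.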
Putative complexes were predicted through computational docking using GRAMM. GRAMM performs an exhaustive 6-dimensional search of all translations and rotations between a given pair of macromolecules and returns a list of high-scoring complexes based on rigid-body geometric fit and hydrophobicity. Because of GRAMM's algorithm, the order of the models matters: the list of complexes returned from docking molecule A against molecule B is not guaranteed to be the same as the list returned from docking molecule B against molecule A. We refer to these related lists as sister lists. GRAMM's parameters were chosen such that the two known death domain superfamily complexes, the human caspase recruitment domains of APAF1 with pro-Caspase-9 (PDB: 3YGS) and the Drosophila death domains of PELLE with TUBE (PDB: 1D2Z), had the best possible rank when their respective individual monomers were docked. The chosen parameters were: matching mode = generic, grid step = 1.5, repulsion attraction = 5, attraction double range = 0, potential range type = atom_radius, projection = black and white, representation = all, number of matches to output = 1000 and angle of rotation = 10. Each model within each family was docked against every other model within the same family, and the top 1000 complexes from each docking simulation were retained in a MySQL database for further filtering. It could be argued that the GRAMM parameters giving the best rank for the individual monomers of the two known protein complexes (bound form) may not be optimal when death domain superfamily members are docked against each other (unbound form). Indeed, docking from unbound monomers has to consider conformational changes of the protein partners upon binding, which do not arise when the monomers are taken from a known protein complex.
To limit the complexity of the problem and allow comparison between our simulations we used the low resolution docking parameters of GRAMM. In such a procedure flexibility is handled through a "soft belt" into which atoms from the second molecule can penetrate reducing drastically the complexity , and increasing the speed of the simulation.
The database of resulting complexes was mined for those with the maximum information gain with respect to family classification. We refer to the mined complexes as informative interfaces, since each complex defines an interface between the component models. We labeled the informative interfaces ι1 through ιn, where n is the number of interfaces. The mining procedure consisted of two parts, described shortly: (i) rescoring the complexes and (ii) taking the intersection of sister lists. All mining was performed via a combination of shell scripts, Perl scripts and C++ programs.
The first mining technique was to rescore each complex using the software Rpdock, a member of the 3D-Dock suite, and to reject those below a threshold. Rpdock uses empirically gathered evidence to quantify the probability of a complex's existence and returns a score (RPScore) based on the results. The algorithm uses residue pair potentials across protein interfaces as the basis for the score. Ranked by this alternative scoring system, the top 10 complexes from each docking experiment were retained, while the other 990 complexes were rejected.
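The rank-and-truncate step can be sketched in a few lines. The example assumes that a higher RPScore indicates a more plausible complex and represents each complex as a small dictionary; both the record layout and the function name are hypothetical, and the scores are toy values, not Rpdock output.

```python
def keep_top_scoring(complexes, n_keep=10):
    """Return the n_keep complexes with the highest RPScore.

    `complexes` is a list of dicts with a numeric "rpscore" entry
    (a hypothetical layout used only for this sketch).
    """
    ranked = sorted(complexes, key=lambda c: c["rpscore"], reverse=True)
    return ranked[:n_keep]


# Toy docking run: 1000 putative complexes with made-up scores 0..999.
docking_run = [{"id": i, "rpscore": float(i)} for i in range(1000)]
retained = keep_top_scoring(docking_run)

print(len(retained))                       # 10
print(len(docking_run) - len(retained))    # 990
print([c["id"] for c in retained][:3])     # [999, 998, 997]
```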
The second mining technique took advantage of the input-model order dependence of GRAMM. Every pair of sister lists was examined for complexes found in both lists. More precisely, every pair-wise combination of complexes from all sister lists was considered in turn. Recall that complexes in sister lists are composed of the same two models (call them μi and μj, with i, j ∈ [1...m]). If, after the two instances of μi were structurally superimposed, the distance between the two instances of μj fell below a given threshold, then both complexes were retained. Complexes never retained in this way were rejected (see Figure 2). Note that the two instances of μj need not be exactly structurally superimposed. We refer to μi as the aligned model and μj as the orientation-independent model. The informative interface is then the interface that lies between these models.
Given the 3D structure of a sequence of interest (the target) and an informative interface ιi (i ∈ [1...n]), consisting of an aligned model and an orientation-independent model, the aligned model was replaced with the target while preserving its orientation. (Recall that all models were previously structurally superimposed on a reference model, thereby normalizing rotations across models.) We refer to this modified complex as the hybrid complex.
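The sister-list intersection can be sketched as below. Several assumptions are made for illustration: each complex is reduced to the coordinates of the orientation-independent model after the aligned model has already been superimposed into a common frame, the "distance" between the two placements is taken to be an RMSD, and the threshold value is a free parameter (none is given in the text). The function names are hypothetical.

```python
import math


def placement_distance(coords_1, coords_2):
    """RMSD between two placements of the same model (paired atoms)."""
    squared = sum(
        (x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2
        for (x1, y1, z1), (x2, y2, z2) in zip(coords_1, coords_2)
    )
    return math.sqrt(squared / len(coords_1))


def intersect_sister_lists(list_ab, list_ba, threshold):
    """Return indices of complexes retained from each sister list.

    A complex is retained if at least one complex in the other list
    places the orientation-independent model within `threshold` of it.
    """
    kept_ab, kept_ba = set(), set()
    for a_idx, coords_a in enumerate(list_ab):
        for b_idx, coords_b in enumerate(list_ba):
            if placement_distance(coords_a, coords_b) < threshold:
                kept_ab.add(a_idx)
                kept_ba.add(b_idx)
    return sorted(kept_ab), sorted(kept_ba)


# Toy sister lists: only the first complex of each list agrees (0.5 Å apart).
list_ab = [[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)], [(10.0, 0.0, 0.0), (11.0, 0.0, 0.0)]]
list_ba = [[(0.5, 0.0, 0.0), (1.5, 0.0, 0.0)], [(50.0, 0.0, 0.0), (51.0, 0.0, 0.0)]]
print(intersect_sister_lists(list_ab, list_ba, threshold=1.0))  # ([0], [0])
```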
Rpdock was used to calculate the RPScore of the hybrid complex, and the result was stored in element i of the target's RPIP. Note that the RPScore of the hybrid complex could be grossly different from that of the unmodified complex.
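Building an RPIP amounts to filling a vector with one score per informative interface. The sketch below shows that loop only. The scoring function is passed in as a parameter because the real scores come from Rpdock, whose invocation is not reproduced here; the toy scorer, the dictionary layout and the function name are all hypothetical.

```python
def build_rpip(target, interfaces, score_hybrid):
    """Return the RPIP of `target`: one score per informative interface.

    For interface i, the aligned model is swapped for the target (the
    hybrid complex) and `score_hybrid` supplies the value of element i.
    """
    rpip = []
    for interface in interfaces:
        hybrid = {"aligned": target, "partner": interface["partner"]}
        rpip.append(score_hybrid(hybrid))
    return rpip


# Toy stand-in for Rpdock: NOT a real scoring function.
def toy_score(hybrid):
    return float(len(hybrid["aligned"]) + len(hybrid["partner"]))


interfaces = [
    {"aligned": "m1", "partner": "m2"},
    {"aligned": "m3", "partner": "m4x"},
]
print(build_rpip("target", interfaces, toy_score))  # [8.0, 9.0]
```

The resulting vector has length n, the number of informative interfaces, regardless of which family the target belongs to.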
Two methods of classification were attempted. First, it was postulated that the RPIP elements pertaining to a model's true family would be, on average, greater than those that do not. Thus, a classifier was built that compared the mean RPScores across RPIP elements for each of the four families, and made a prediction based on the greatest mean. This classifier was termed the family-based classifier.
The second method used the RPIPs to build four support vector machines, one for each family. This classifier was termed the SVM-based classifier. The software SVMlight (http://svmlight.joachims.org) was used in this study. Each machine was trained to discriminate members of one family, so that all four machines would have to be used to make a final prediction. A linear kernel was used when building the machines to avoid undue distortion of the underlying RPScores.
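A minimal sketch of the family-based classifier follows. It assumes that each RPIP element is tagged with the family of the informative interface it came from, averages the RPScores per family, and predicts the family with the greatest mean. The function name and the toy scores are illustrative, not taken from the study.

```python
from statistics import mean


def classify_by_family_mean(rpip, interface_families):
    """Predict the family whose RPIP elements have the greatest mean RPScore.

    `interface_families[i]` is the family of the interface behind rpip[i].
    Returns (predicted_family, per_family_means).
    """
    by_family = {}
    for score, family in zip(rpip, interface_families):
        by_family.setdefault(family, []).append(score)
    means = {family: mean(scores) for family, scores in by_family.items()}
    return max(means, key=means.get), means


# Toy RPIP with two elements per family (made-up scores).
rpip = [1.0, 2.0, 5.0, 6.0, 0.0, 1.0]
families = ["CARD", "CARD", "DED", "DED", "DEATH", "DEATH"]
predicted, means = classify_by_family_mean(rpip, families)
print(predicted)  # DED
```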
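One way to combine four one-family machines into a single prediction is sketched below. The text does not specify the combination rule, so the choice made here (pick the family whose linear machine gives the largest decision value) is an assumption, and the weights and biases are invented toy numbers rather than trained SVMlight models.

```python
def decision_value(weights, bias, rpip):
    """Linear-kernel decision value: w · x + b."""
    return sum(w * x for w, x in zip(weights, rpip)) + bias


def predict_family(machines, rpip):
    """Pick the family whose machine returns the largest decision value.

    `machines` maps a family name to (weights, bias). This argmax rule is
    an assumed way of combining the four machines, not the paper's stated one.
    """
    values = {
        family: decision_value(weights, bias, rpip)
        for family, (weights, bias) in machines.items()
    }
    return max(values, key=values.get), values


# Toy machines over a 4-element RPIP (hypothetical weights and biases).
toy_machines = {
    "CARD": ([1.0, 0.0, 0.0, 0.0], -0.5),
    "DED": ([0.0, 1.0, 0.0, 0.0], -0.5),
    "DEATH": ([0.0, 0.0, 1.0, 0.0], -0.5),
    "PAAD": ([0.0, 0.0, 0.0, 1.0], -0.5),
}
family, values = predict_family(toy_machines, [0.2, 0.9, 0.1, 0.3])
print(family)  # DED
```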
To test the accuracy of HODOCO we conducted five runs of 5-fold cross-validation for each classification method. Generally, k-fold cross-validation randomly divides the target dataset into k equal-sized partitions, iteratively using one partition for testing and the other partitions for training. Method accuracy is measured as the average number of correctly classified models divided by the total number of models over a series of cross-validation runs. Referring back to Figure 1(c), a target is classified by building its model, building its RPIP and finally applying one of the classification methods to the RPIP. Note that it is assumed that the unknown sequence has been previously screened to be a member of the superfamily.
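The k-fold procedure can be sketched generically. The code below shuffles the items, deals them into k near-equal partitions (64 items cannot be split into 5 exactly equal parts), and reports the fraction classified correctly over one run. The `train_and_predict` callback and the toy labels are hypothetical placeholders for either classification method.

```python
import random


def k_fold_partitions(items, k, rng):
    """Shuffle `items` and deal them into k partitions of near-equal size."""
    shuffled = list(items)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def cross_validate(items, labels, k, train_and_predict, rng):
    """Run one k-fold cross-validation and return the accuracy.

    `train_and_predict(train_items, train_labels)` must return a function
    mapping one item to a predicted label.
    """
    folds = k_fold_partitions(range(len(items)), k, rng)
    correct = 0
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(items)) if i not in held_out]
        predict = train_and_predict(
            [items[i] for i in train_idx], [labels[i] for i in train_idx]
        )
        correct += sum(predict(items[i]) == labels[i] for i in test_idx)
    return correct / len(items)


# Toy problem: 64 items whose label is item % 4, with a "perfect" classifier.
items = list(range(64))
labels = [i % 4 for i in items]
accuracy = cross_validate(
    items, labels, 5, lambda train_x, train_y: (lambda x: x % 4), random.Random(0)
)
print(accuracy)  # 1.0
```

Averaging the accuracy over several such runs, each with a different shuffle, gives the figure reported for a method.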
We would especially like to thank Maggie Lau and Tony Zhan for performing much of the pre-analytical data gathering, and the SFU Co-operative Education Program for making their employment possible. D.L.'s stipend was provided by the Canadian Institutes of Health Research (CIHR) and by the Michael Smith Foundation. We are also grateful for funding support from the Natural Sciences and Engineering Research Council of Canada (NSERC) and the British Columbia Advanced Systems Institute (BC ASI).
This article is published under license to BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.