An integrated approach to the prediction of domain-domain interactions
 Hyunju Lee^{1, 3},
 Minghua Deng^{2},
 Fengzhu Sun^{3} and
 Ting Chen^{3}
DOI: 10.1186/1471-2105-7-269
© Lee et al; licensee BioMed Central Ltd. 2006
Received: 19 December 2005
Accepted: 25 May 2006
Published: 25 May 2006
Abstract
Background
The development of high-throughput technologies has produced several large-scale protein interaction data sets for multiple species, and significant efforts have been made to analyze the data sets in order to understand protein activities. Considering that the basic units of protein interactions are domain interactions, it is crucial to understand protein interactions at the level of the domains. The availability of many diverse biological data sets provides an opportunity to discover the underlying domain interactions within protein interactions through an integration of these biological data sets.
Results
We combine protein interaction data sets from multiple species, molecular sequences, and gene ontology to construct a set of high-confidence domain-domain interactions. First, we propose a new measure, the expected number of interactions for each pair of domains, to score domain interactions based on protein interaction data in one species and show that its performance is similar to that of the E-score defined by Riley et al. [1]. Our new measure is applied to the protein interaction data sets from yeast, worm, fruitfly and humans. Second, information on pairs of domains that co-exist in known proteins and on pairs of domains with the same gene ontology function annotations is incorporated to construct a high-confidence set of domain-domain interactions using a Bayesian approach. Finally, we evaluate the set of domain-domain interactions by comparing predicted domain interactions with those defined in the iPfam database [2, 3], which were derived from protein structures. The accuracy of the predicted domain interactions is also confirmed by comparison with experimentally obtained domain interactions from H. pylori [4]. As a result, a total of 2,391 high-confidence domain interactions are obtained, and these domain interactions are used to unravel detailed protein and domain interactions in several protein complexes.
Conclusion
Our study shows that integration of multiple biological data sets based on the Bayesian approach provides a reliable framework to predict domain interactions. By integrating multiple data sources, the coverage and accuracy of predicted domain interactions can be significantly increased.
Background
With the completion of genome sequences of many species, comparative analysis of these organisms becomes increasingly important in understanding the function and evolution of genes and proteins. Comparison of the genome sequences of worm and yeast has revealed that most of the core biological functions are carried out by orthologous proteins, and that the multicellular worm has more diverse proteins than the unicellular yeast [5]. In addition, more than 50 bacterial, archaeal, and eukaryotic genomes have been analyzed for protein function prediction, phylogenetic profiling of domains, and eukaryotic-signature domain organizations [6].
The development of high-throughput technologies such as yeast two-hybrid assays has produced large-scale protein interaction data sets for several species, and significant efforts have been made to analyze them. By combining protein interaction data sets and orthology information on yeast protein sequences and a bacterial pathogen, Kelley et al. [7] and Sharan et al. [8] identified conserved protein interaction pathways and complexes. Further studies on conserved protein complexes and functional modules can be found in [9, 10].
The basic units of proteins are domains, and proteins interact with each other through their domains. Therefore, it is crucial to understand protein interactions at the level of the domains [11]. Several groups have developed methods to understand domain interactions based on protein interactions. Sprinzak and Margalit [12] selected domain interaction pairs based on the ratio of the frequency of observed protein interactions that contain the pair of domains to its expected value. Deng et al. [13] developed a maximum likelihood estimation (MLE) method and an Expectation-Maximization (EM) algorithm to infer underlying domain interactions from protein interactions. Liu et al. [14] extended the MLE method to combine protein interactions from multiple species, and showed that the extension resulted in a higher accuracy in predicting protein interactions than using the yeast protein interactions alone. Liu et al. [14] also showed that, for a single species, the approach of Deng et al. [13] was comparable to that of Gomez et al. [15] and outperformed those of Sprinzak and Margalit [12] and Gomez et al. [16] for predicting protein interactions. More recently, Riley et al. [1] modified the Deng et al. [13] approach to be applicable to all the protein interactions in DIP [17, 18], assuming no false positives and no false negatives. Most importantly, they presented a new score for domain interactions, the E-score, defined as the log likelihood ratio of the observed interactions assuming the domain pair interacts over assuming the domain pair does not interact. They showed that the E-score outperformed the Deng et al. [13] method in predicting domain interactions. Other approaches for predicting domain interactions using multiple data sources were developed in [19, 20]. In this study, we focus on the integration of multiple data sources from multiple species to predict high-confidence domain interactions.
First, we calculate the probability of domain interactions separately for each of four species: yeast [21–23], worm [24], fruitfly [25] and humans [26]. Using these probabilities, we compute the expected number of interactions for each pair of domains within a species. Second, we investigate information on protein fusion and domain functions. Third, a Bayesian approach is used to integrate those data sources to predict high-confidence domain interactions. These predictions help us to unravel the domain interactions in protein complexes and protein interactions. Our study differs from previous studies in several significant ways. Compared to Liu et al. and Ng et al. [14, 19, 20], our approach develops a new measure to score domain-domain interactions and validates it with experimentally derived domain interactions instead of using indirect means such as validating re-inferred protein interactions. Compared to Riley et al. [1], protein fusion and Gene Ontology (GO) [27] functions are also integrated using a Bayesian approach. We show that the integration significantly increases the accuracy of predicted domain-domain interactions.
The paper is organized as follows. In the Methods section, we present the various data sources used in our analysis, followed by the methods for analyzing and integrating the different data sources. In the Results section, we present the results based on the various data sources separately, followed by the results based on the integrated analysis. We evaluate our results by comparing them with the domain-domain interactions in iPfam. Finally, we discuss limitations of our approach and directions for further study.
Methods
Data sources
Table 1. Data sets. The characteristics of the protein interaction data sets for yeast, worm, fruitfly and humans, the corresponding domain information, and the values of fn and fp used in the analysis. Only protein pairs in which both proteins contain Pfam-A domains are included in the protein interaction data sets, and only proteins in those protein interactions are counted. Numbers in parentheses are the total numbers of available protein interactions.

| | Yeast | Worm | Fruitfly | Humans |
|---|---|---|---|---|
| Proteins | 2,568 | 1,580 | 2,444 | 3,493 |
| Protein-protein interactions | 7,985 (15,461) | 2,193 (4,030) | 3,944 (20,429) | 10,906 (15,274) |
| Domains | 1,386 | 888 | 1,195 | 1,401 |
| False negative rate (fn) | 0.25 | 0.67 | 0.61 | 0.25 |
| False positive rate (fp) | 0.0009 | 0.0007 | 0.0005 | 0.0007 |
Protein interactions for yeast and worm
We download the protein interaction data sets for yeast and worm from the DIP database [17, 18]. Each protein is associated with a DIP number, SWISS-PROT ID, GI number, etc. We use the SWISS-PROT accession numbers to associate domain information from the Pfam database [29] with the proteins in DIP. We also use the GI numbers to obtain additional Pfam domain information from the National Center for Biotechnology Information [30]. For worm, the domain information collected using the GI numbers increases the number of protein interactions with domain information.
Protein interactions for fruitfly
We obtain the protein interaction data set for fruitfly from Giot et al. [25]. In this data set, proteins are identified by CG numbers. To obtain the relationship between proteins and domains, we associate the CG numbers with the SWISS-PROT accession numbers using the protein table Integr8 at EMBL-EBI [31]. The compiled SWISS-PROT accession numbers are used to extract protein-domain relationships from the Pfam database.
Protein interactions for human
We obtain the human protein interaction data set from the Human Protein Reference Database (HPRD) [26], which contains protein-protein interactions from individual small-scale experiments published in the literature. The proteins are identified by NP numbers. We associate the NP numbers in the HPRD with the SWISS-PROT accession numbers using the protein table Integr8 at EMBL-EBI, and then extract protein-domain relationships from the Pfam database.
Domain functions
We obtain domain functions (GO biological process terms) using the mapping table from Pfam to GO on the Gene Ontology webpage [27], and use the domains in the table to compile domain pairs with the same function.
Domain fusion
We use protein-domain information in Pfam-A to identify pairs of domains co-existing in one protein. This method is referred to as domain fusion in the rest of the paper.
Databases of domain interactions
We use two sets of structure-based domain interactions, iPfam [3] and Protein Quaternary Structure (PQS) [32], to estimate the reliability of predicted domain-domain interactions. iPfam contains 2,580 domain interactions (July 2004 version). The domain interactions in iPfam are obtained by calculating all bonds between all pairs of residues between domains based on the protein structures in the Protein Data Bank (PDB). PQS provides probable quaternary states for structures based on PDB. In PQS, an attempt is made to distinguish biologically relevant interactions from crystal packing based on known properties such as hydrophobicity, shape analysis, and the size of the solvent-accessible surface area (ASA). Note that biologically relevant domain interactions and crystal contacts are not distinguished in iPfam. As domains in PQS are annotated by SCOP superfamily, we associate them with the Pfam domains using the mapping table on the SCOP webpage [33]. Finally, we obtain 36,439 domain interactions from PQS.
Computational methods
In this subsection, we describe (1) the computational methods for calculating the probability of domain-domain interactions, (2) a new measure to evaluate the strength of domain-domain interactions, and (3) a Bayesian method for integrating different data sources to construct a high-confidence set of domain-domain interactions.
The maximum likelihood estimation for probabilities of domain-domain interactions
The maximum likelihood estimation method proposed by Deng et al. [13] has been shown to have good performance in estimating the probabilities of domain-domain interactions. We adopt this method in this study and briefly describe it as follows.
The basic assumption of the MLE method is that two proteins interact if and only if at least one pair of domains, one from each of the two proteins, interacts. Given two proteins P_{i} and P_{j}, the probability that they interact is
$$\Pr(P_{ij} = 1) = 1 - \prod_{D_{mn} \in P_{ij}} \bigl(1 - \Pr(D_{mn} = 1)\bigr), \qquad (1)$$
where P_{ij} = 1 if they interact and 0 otherwise, D_{mn} ∈ P_{ij} denotes that domains D_{m} and D_{n} belong to proteins P_{i} and P_{j}, respectively, and D_{mn} = 1 if domain D_{m} interacts with domain D_{n}. For an experiment in a species, the false positive rate (fp) is defined as the probability that two non-interacting proteins are observed to interact, and the false negative rate (fn) is defined as the probability that two truly interacting proteins are not observed to interact in the experiment. Let O_{ij} = 1 if the interaction between proteins P_{i} and P_{j} is observed and O_{ij} = 0 otherwise. Thus, the probability of an observed protein interaction is
$$\Pr(O_{ij} = 1) = \Pr(P_{ij} = 1)(1 - fn) + \bigl(1 - \Pr(P_{ij} = 1)\bigr) fp. \qquad (2)$$
The likelihood function, i.e., the probability of the whole interaction data set, is
$$L = \prod_{i,j} \Pr(O_{ij} = 1)^{O_{ij}} \bigl(1 - \Pr(O_{ij} = 1)\bigr)^{1 - O_{ij}}. \qquad (3)$$
Our objective is to maximize the likelihood L, which can be written as a function of Pr(D_{mn} = 1) with fixed fp and fn by combining Equations 1, 2, and 3. Pr(D_{mn} = 1) can be estimated by an expectation-maximization (EM) algorithm [13]. Deng et al. [13] presented a method to approximate the values of fn and fp based on the number of observed interactions. We combine this idea with the reliability of the protein interaction data sets to approximate the values of fn and fp for each species used in this study. The results are shown in Table 1. The details are presented in Additional file 1.
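To make Equations 1 to 3 concrete, the following is a minimal Python sketch that evaluates the protein interaction probability, the observation probability, and the log-likelihood for a toy data set. The function names, the protein and domain identifiers, and the domain interaction probabilities are hypothetical and chosen only for illustration; the fn and fp values are the yeast values from Table 1. This evaluates the likelihood for fixed domain probabilities and is not the EM estimation procedure of Deng et al. [13].

```python
import math

def protein_interaction_prob(domain_pairs, lam):
    """Equation 1: 1 - product over domain pairs of (1 - Pr(D_mn = 1))."""
    none_interact = 1.0
    for pair in domain_pairs:
        none_interact *= 1.0 - lam.get(pair, 0.0)
    return 1.0 - none_interact

def observed_prob(p_interact, fn, fp):
    """Equation 2: probability that an interaction is observed."""
    return p_interact * (1.0 - fn) + (1.0 - p_interact) * fp

def log_likelihood(protein_pairs, observed, lam, fn, fp):
    """Logarithm of Equation 3; requires 0 < Pr(O_ij = 1) < 1 for every pair."""
    total = 0.0
    for pair, domain_pairs in protein_pairs.items():
        p_obs = observed_prob(protein_interaction_prob(domain_pairs, lam), fn, fp)
        total += math.log(p_obs) if pair in observed else math.log(1.0 - p_obs)
    return total

# Hypothetical toy data: domain interaction probabilities and protein pairs.
lam = {("A", "B"): 0.5, ("A", "C"): 0.2}
protein_pairs = {
    ("P1", "P2"): [("A", "B"), ("A", "C")],
    ("P1", "P3"): [("A", "C")],
}
observed = {("P1", "P2")}

p12 = protein_interaction_prob(protein_pairs[("P1", "P2")], lam)
ll = log_likelihood(protein_pairs, observed, lam, fn=0.25, fp=0.0009)
```

In an estimation setting, the domain probabilities in `lam` would be the free parameters adjusted to increase `ll`.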
The expected number of occurrences of domain interactions
Deng et al. [13] used the estimated value of Pr(D_{mn} = 1) to rank domain-domain interactions. One problem with this approach is that the estimated value of Pr(D_{mn} = 1) is generally large if (1) each of the two domains appears in only one protein, (2) each of these two proteins contains only one domain, and (3) these two proteins interact. Another problem is that the value of Pr(D_{mn} = 1) is generally small if (1) both domains appear in many proteins and (2) only a small proportion of the protein pairs having these two domains interact.
In order to overcome these problems, we score each domain pair by the expected number of occurrences of domain interactions,
E(#D_{ mn }) = N_{ mn }Pr(D_{ mn }= 1), (4)
where N_{mn} is the number of protein pairs having domains D_{m} and D_{n}. Our intuition is that if a pair of domains is observed in multiple protein interactions, this pair of domains is more likely to interact. We use E() as a feature in our integrative model.
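The score in Equation 4 can be sketched in a few lines of Python. The helper names, the toy protein-to-domain map, and the probabilities below are hypothetical, and counting only pairs of distinct proteins when computing N_{mn} is an assumption of this sketch rather than something the text specifies.

```python
from itertools import combinations

def count_protein_pairs(protein_domains):
    """N_mn: number of protein pairs with D_m in one protein and D_n in the other.

    Assumption of this sketch: only pairs of distinct proteins are counted.
    """
    counts = {}
    for p_i, p_j in combinations(sorted(protein_domains), 2):
        pairs = {
            tuple(sorted((m, n)))
            for m in protein_domains[p_i]
            for n in protein_domains[p_j]
        }
        for pair in pairs:
            counts[pair] = counts.get(pair, 0) + 1
    return counts

def expected_interactions(counts, lam):
    """Equation 4: E(#D_mn) = N_mn * Pr(D_mn = 1)."""
    return {pair: n * lam.get(pair, 0.0) for pair, n in counts.items()}

# Hypothetical toy data.
protein_domains = {"P1": {"A"}, "P2": {"B"}, "P3": {"A", "B"}}
lam = {("A", "B"): 0.5, ("A", "A"): 0.9}

counts = count_protein_pairs(protein_domains)
scores = expected_interactions(counts, lam)
```

In this toy case the pair (A, A) has the higher probability but occurs in a single protein pair, so it scores below the pair (A, B), which occurs in three protein pairs. This is the behaviour the expected-number score is meant to produce.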
Domain fusion
In addition to the protein interaction data, we also incorporate information on domain fusion and domain function to build a set of high-confidence domain-domain interactions. Enright et al. [34] and Marcotte et al. [35] showed that two proteins are more likely to interact if they are fused into one protein in another species. This idea can be extended to domains: if two domains are fused in one protein in any species, they are more likely to interact. Thus, we search for proteins having multiple Pfam-A domains and obtain 9,615 Pfam-A domain pairs that co-exist in the same proteins. We define CE(D_{mn}), where CE stands for Co-Existence, as the number of times that domain D_{m} and domain D_{n} co-exist in the same protein. It is expected that the larger CE(D_{mn}) is, the more likely domain D_{m} and domain D_{n} are to interact. We use CE() as a feature in our integrative model.
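A minimal sketch of how the co-existence count CE could be computed from a protein-to-domain table is shown below. The function name and the toy table are hypothetical, and the sketch assumes that each protein contributes at most one count per pair of distinct domains.

```python
from collections import Counter
from itertools import combinations

def coexistence_counts(protein_domains):
    """CE(D_mn): number of proteins that contain both D_m and D_n.

    Assumption of this sketch: repeated copies of a domain within one protein
    are collapsed, and only pairs of distinct domains are counted.
    """
    ce = Counter()
    for domains in protein_domains.values():
        for pair in combinations(sorted(set(domains)), 2):
            ce[pair] += 1
    return ce

# Hypothetical toy table of proteins and their Pfam-A domains.
protein_domains = {
    "P1": ["A", "B"],
    "P2": ["A", "B", "C"],
    "P3": ["C"],
}
ce = coexistence_counts(protein_domains)
```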
Domain functions
We obtain gene ontology terms of domains and find 57,907 domain pairs having the same GO terms in the biological process category. The gene ontology has a hierarchical structure (a directed acyclic graph), in which parents denote more general terms and children denote more specific terms. It is expected that two domains participating in the same GO function (biological process) are more likely to interact than two domains participating in different functions. Moreover, two domains participating in a more specific function are more likely to interact than two domains participating in a more general function. A more specific function generally covers a smaller number of domains. Assume that domain D_{m} and domain D_{n} have the same function F_{f}. We define SG(D_{mn}), where SG stands for Same Gene ontology, as the number of domains having the function F_{f}. We use SG() as a feature in our integrative model.
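The SG feature can be sketched as follows. The function name, the domain names, and the term labels are hypothetical. The text does not say which term is used when two domains share several GO terms, so this sketch assumes the shared term covering the fewest domains (the most specific one).

```python
from collections import defaultdict
from itertools import combinations

def same_go_counts(domain_go):
    """SG(D_mn): number of domains annotated with a GO term shared by D_m and D_n.

    Assumption of this sketch: if a pair shares several terms, the term that
    covers the fewest domains is used.
    """
    term_domains = defaultdict(set)
    for domain, terms in domain_go.items():
        for term in terms:
            term_domains[term].add(domain)
    sg = {}
    for domains in term_domains.values():
        size = len(domains)
        for pair in combinations(sorted(domains), 2):
            if pair not in sg or size < sg[pair]:
                sg[pair] = size
    return sg

# Hypothetical annotations: one general and one more specific process.
domain_go = {
    "A": {"process_general", "process_specific"},
    "B": {"process_general", "process_specific"},
    "C": {"process_general"},
}
sg = same_go_counts(domain_go)
```

A smaller SG value corresponds to a more specific shared function, which the text expects to be stronger evidence for interaction.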
Integrating multiple data sources
The six information sources can be combined to construct a high-confidence set of domain-domain interactions. Several heuristic methods can be used for data integration. Here we consider three approaches: evidence counting, naïve Bayesian, and logistic regression.
For each pair of domains, six information sources for their interaction can be obtained from the analysis of the expected number of domain interactions derived from protein interactions of four species, the number of occurrences in the domain fusion, and the number of domains with the same GO annotation. We applied the aforementioned three computational methods to integrate these six biological evidences to predict domain interactions. The methods are described as follows.
Evidence counting
The number of evidences supporting a domain interaction is used to score domain pairs for potential interactions. For a pair of domains D_{m} and D_{n}, we say that the interaction between D_{m} and D_{n} is supported by the yeast protein interactions if the expected number of occurrences of domain interactions in yeast is at least 1, i.e., E(#D_{mn}) ≥ 1. We count this as one evidence. A domain interaction can have a maximum of 4 evidences from yeast, worm, fruitfly and humans. Similarly, we say that the interaction between D_{m} and D_{n} is supported by domain fusion if CE(D_{mn}) ≥ 1, and by domain function if SG(D_{mn}) ≥ 1. The number of evidences for a pair of domains therefore ranges from 0 to 6.
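The counting rule above is simple enough to state directly in code. The following sketch uses a hypothetical function name and made-up feature values for a single domain pair.

```python
def count_evidences(expected_by_species, ce, sg):
    """Number of supporting evidences (0 to 6) for one domain pair."""
    n = sum(1 for e in expected_by_species.values() if e >= 1)
    n += 1 if ce >= 1 else 0  # domain fusion
    n += 1 if sg >= 1 else 0  # same GO biological process
    return n

# Hypothetical feature values for one domain pair.
expected_by_species = {"yeast": 2.3, "worm": 0.4, "fruitfly": 1.0, "humans": 0.0}
evidences = count_evidences(expected_by_species, ce=3, sg=0)
```

Here yeast and fruitfly each contribute one evidence and domain fusion contributes a third.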
Naïve Bayesian
The naïve Bayesian approach assumes the independence of data sources, and has been applied to the integration of multiple data sources for predicting protein interactions [36, 37]. The basic idea is to calculate the likelihood ratio of each of the six evidences and then multiply these likelihood ratios. We define the set of observed interactions (Obs) as the interacting domain pairs in iPfam and the set of non-observed interactions (Nobs) as the domain pairs not present in iPfam. The likelihood ratios for the six data sources are calculated as follows. For each species, we split the values of E(#D_{mn}) into 7 intervals. We call each interval a bin, and this process binning. Let d = E(#D_{mn}) and suppose d falls into the t-th bin. Let Pr(d | Obs) be the fraction of the observed interactions in the t-th bin and let Pr(d | Nobs) be the fraction of the non-observed interactions in the t-th bin. Then, the likelihood ratio for the t-th bin is Pr(d | Obs)/Pr(d | Nobs). Similarly, we bin the values of CE(D_{mn}) and SG(D_{mn}) and then calculate the likelihood ratio for each of them. Additional file 2 shows the likelihood ratios for each data source. Let d_{1},..., d_{4} be the values of E(#D_{mn}) in yeast, worm, fruitfly, and humans, respectively, and let d_{5} and d_{6} be the values of CE(D_{mn}) and SG(D_{mn}), respectively. Then, the total likelihood ratio is
$$LR(d_{1}, \ldots, d_{6}) = \prod_{k=1}^{6} \frac{\Pr(d_{k} \mid \mathrm{Obs})}{\Pr(d_{k} \mid \mathrm{Nobs})}.$$
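The binning and multiplication steps can be sketched as below. The bin edges, feature values, and function names are hypothetical (the text specifies 7 bins per species but not their boundaries, so the toy example uses 2 bins), and the handling of empty bins is an assumption of this sketch.

```python
import bisect

def bin_index(value, edges):
    """Bin t covers edges[t-1] <= value < edges[t]; edges must be sorted."""
    return bisect.bisect_right(edges, value)

def likelihood_ratios(values_obs, values_nobs, edges):
    """Per-bin ratio Pr(d | Obs) / Pr(d | Nobs) for one data source."""
    n_bins = len(edges) + 1
    obs = [0] * n_bins
    nobs = [0] * n_bins
    for v in values_obs:
        obs[bin_index(v, edges)] += 1
    for v in values_nobs:
        nobs[bin_index(v, edges)] += 1
    ratios = []
    for o, n in zip(obs, nobs):
        if n == 0:
            # Assumption: a bin with no non-observed pairs is treated as
            # uninformative if it is empty, otherwise as infinitely favourable.
            ratios.append(float("inf") if o > 0 else 1.0)
        else:
            ratios.append((o / len(values_obs)) / (n / len(values_nobs)))
    return ratios

def total_likelihood_ratio(features, tables, edges_by_source):
    """Product of the per-source likelihood ratios (naive independence)."""
    total = 1.0
    for source, value in features.items():
        total *= tables[source][bin_index(value, edges_by_source[source])]
    return total

# Hypothetical toy data with two bins split at 1.0.
edges = [1.0]
ratios = likelihood_ratios([0.0, 2.0, 3.0, 5.0], [0.0, 0.0, 0.0, 2.0], edges)
tables = {"yeast": ratios, "fusion": ratios}
edges_by_source = {"yeast": edges, "fusion": edges}
lr = total_likelihood_ratio({"yeast": 2.0, "fusion": 4.0}, tables, edges_by_source)
```

In the cross-validation setting described later, the per-bin ratios would be computed on the training part of iPfam only and then applied to the held-out pairs.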
Logistic regression
Let E_{ y }(#D_{ mn }), E_{ w }(#D_{ mn }), E_{ f }(#D_{ mn }), and E_{ h }(#D_{ mn }) denote the expected number of occurrences of the domain interactions in yeast, worm, fruitfly and humans, respectively. Let I(d) be the indicator function: I(d) = 1 if d ≥ 1 and 0, otherwise. Let EV(D_{ mn }) be the number of evidences from the evidence counting method. We use the following model,
Validating the predicted domain interactions
To evaluate the reliability of the predicted domain interactions, we compare them with the domain interactions in iPfam. The interactions in iPfam are treated as the observed interactions. Although many domain interactions are not included in the database, a good score function for domain interactions should include a higher fraction of observed interactions in the highest ranked predictions than a random scoring function. Therefore, for a given scoring range, the fraction of the observed interactions among all domain pairs having scores within the range is calculated. We also calculate the ratio of this fraction over that from a random scoring function and refer to it as the fold value. For a good score function, the fold value should increase with the score.
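As a worked example of the fold value, the sketch below computes it from raw counts. The function name is hypothetical; the counts are those reported in this paper for domain pairs with two or more evidences (820 of 1,624 pairs in iPfam) and for all domain pairs (2,580 of 1,539,135).

```python
def fold_value(n_observed_in_range, n_pairs_in_range, n_observed_total, n_pairs_total):
    """Fraction of observed interactions in a score range, relative to random."""
    fraction = n_observed_in_range / n_pairs_in_range
    baseline = n_observed_total / n_pairs_total
    return fraction / baseline

# Domain pairs with two or more evidences, counts as reported in the paper.
fold = fold_value(820, 1624, 2580, 1539135)
```

With unrounded inputs the result is slightly above 300, whereas dividing the rounded percentages (50.5% by 0.17%) gives about 297, so small differences from tabulated fold values can arise from rounding alone.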
Another method to evaluate the reliability of predicted domain interactions is the Receiver Operating Characteristic (ROC) curve, which represents the relationship between the false positive rate (FPR) and the sensitivity (SN). As mentioned before, we use domain pairs in iPfam as the observed interactions and domain pairs not in iPfam as the non-observed interactions. Because this gives too many non-observed interactions (1,536,555), we randomly remove domain pairs without any evidence and finally obtain 84,385 domain pairs, about twice the number of domain pairs with at least one evidence, for the non-observed set. For a given threshold value t, domain pairs with scores larger than t are predicted as interacting and the others as non-interacting. The results can be represented as a confusion matrix (Table 6) with counts of true positives (TP), false negatives (FN), false positives (FP), and true negatives (TN).
The FPR and SN are defined as
$$\mathrm{FPR} = \frac{FP}{FP + TN}, \qquad \mathrm{SN} = \frac{TP}{TP + FN}.$$
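The thresholding step and the two standard rates, FPR = FP/(FP + TN) and SN = TP/(TP + FN), can be sketched as follows. Function names, scores, and domain pair labels are hypothetical.

```python
def confusion_counts(scores, observed, threshold):
    """Count TP, FN, FP, TN when pairs scoring above threshold are predicted."""
    tp = fn = fp = tn = 0
    for pair, score in scores.items():
        predicted = score > threshold
        if pair in observed:
            if predicted:
                tp += 1
            else:
                fn += 1
        else:
            if predicted:
                fp += 1
            else:
                tn += 1
    return tp, fn, fp, tn

def roc_point(scores, observed, threshold):
    """Return (FPR, SN) for one threshold; a rate is 0.0 if its class is empty."""
    tp, fn, fp, tn = confusion_counts(scores, observed, threshold)
    fpr = fp / (fp + tn) if fp + tn else 0.0
    sn = tp / (tp + fn) if tp + fn else 0.0
    return fpr, sn

# Hypothetical scores for four domain pairs, two of which are observed.
scores = {("A", "B"): 5.0, ("A", "C"): 0.5, ("B", "C"): 2.0, ("C", "D"): 0.1}
observed = {("A", "B"), ("A", "C")}
fpr, sn = roc_point(scores, observed, threshold=1.0)
```

Sweeping the threshold over the range of scores and collecting the resulting (FPR, SN) points traces out the ROC curve.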
We use five-fold cross-validation to compare the performance. We use a subset of the iPfam domain interactions for training, i.e., to calculate the likelihood ratios of the Bayesian approach and the coefficients of the logistic regression. The remaining iPfam domain interactions are used for testing.
Results
Conserved domain interactions across multiple species
Table 2. The numbers of predicted domain interactions using protein interactions. The predicted domain interactions are classified by the number of species (1, 2, 3 and 4) supporting them, together with their overlaps with iPfam and PQS.

| Species | 1 | 2 | 3 | 4 | All |
|---|---|---|---|---|---|
| Predicted domain interactions | 19,520 | 707 | 95 | 10 | 20,332 |
| Overlap with iPfam (Ratio) | 468 (2.4%) | 115 (16.2%) | 28 (29.5%) | 5 (50%) | 616 (3.0%) |
| Overlap with PQS (Ratio) | 883 (4.5%) | 147 (20.8%) | 31 (32.6%) | 4 (40%) | 1,065 (5.2%) |
Contributions of each data source to the accuracy of predicted domain interactions
Table 3. The numbers of predicted domain interactions using domain fusion, domain function, and the combination of six data sets. The predicted domain interactions, the number of evidences, and the overlaps with iPfam. In the lower rows, the first column gives the number of evidences for the domain interactions, and the second column gives the number of interactions having that number of evidences. "PPI" represents the protein interaction data sets. "Fraction" indicates the fraction of domain interactions in a given set that are in iPfam. "Fold" indicates the ratio of this fraction to the expected value (0.17%).

| Evidence | Interactions | Overlap with iPfam | Fraction | Fold |
|---|---|---|---|---|
| Random domain pairs | 1,539,135 | 2,580 | 0.17% | |
| Domain fusion | 9,615 | 1,141 | 11.8% | 69 |
| Domain fusion & PPI | 859 | 283 | 32.9% | 194 |
| Same GO terms | 57,907 | 1,302 | 2.2% | 13 |
| Same GO terms & PPI | 1,031 | 234 | 22.7% | 134 |
| ≥ 1 | 23,606 | 2,071 | 8.8% | 52 |
| ≥ 2 | 1,624 | 820 | 50.5% | 297 |
| ≥ 3 | 307 | 200 | 65.1% | 383 |
| ≥ 4 | 58 | 43 | 74.1% | 436 |
| ≥ 5 | 13 | 10 | 76.9% | 452 |
| = 6 | 0 | | | |
We also incorporate information on domain pairs with the same GO annotations. It is known that proteins having similar functions are more likely to interact [38, 39]. In fact, the observation is true for domains as well. We find 57,907 domain pairs having the same GO terms in the category of biological process. 1,031 domain pairs are also found in predicted domain interactions based on protein interaction data, among which 234 (22.7%) domain interactions are found in iPfam (Table 3).
Integration of multiple biological data sources
We integrate six data sources using the different methods described in the Methods section, and compare their performance using five-fold cross-validation. We first show the improvement gained by integrating multiple biological data sources. Table 3 shows the percentages of overlap between iPfam and the predicted domain interactions with multiple evidences. The results indicate that a single evidence is not sufficient for predicting domain interactions, as only 8.8% of the domain interactions with at least one evidence overlap with iPfam. However, the percentage of overlap increases to 50.5% for domain interactions with two or more evidences. As the number of evidences increases, the predictions become more accurate, but the number of predictions decreases at the same time. Only 58 predicted domain interactions have four or more evidences, and 43 of these 58 (74.1%) belong to iPfam.
Table 4. The likelihood ratio values of predicted domain interactions. The likelihood ratio values of the predicted domain interactions, the numbers of predicted domain interactions, and the overlap with iPfam. The first column gives the likelihood ratio threshold for the domain interactions, and the second column gives the number of interactions having the corresponding likelihood ratio values.

| Likelihood ratio values | Interactions | Overlap with iPfam | Fraction | Fold |
|---|---|---|---|---|
| Random domain pairs | 1,539,135 | 2,580 | 0.17% | |
| > 0 | 25,352 | 2,080 | 8.2% | 48 |
| ≥ 1 | 6,386 | 1,641 | 25.7% | 151 |
| ≥ 4 | 2,391 | 1,241 | 51.9% | 305 |
| ≥ 6 | 2,044 | 1,142 | 55.9% | 329 |
| ≥ 11 | 1,683 | 1,011 | 60.1% | 353 |
| ≥ 21 | 886 | 634 | 71.6% | 421 |
| ≥ 51 | 420 | 336 | 80.0% | 471 |
Table 5. The ten highest ranked domain-domain interactions. The ten highest ranked domain-domain interactions from the Bayesian approach that are not in iPfam. An "x" in the iPfam_2005 column marks domain interactions found in the updated version of iPfam (Oct 2005 version).

| Domain 1 Pfam ID | Domain 1 accession | Domain 2 Pfam ID | Domain 2 accession | iPfam_2005 |
|---|---|---|---|---|
| WD40 | PF00400 | Pkinase | PF00069 | |
| zf-C2H2 | PF00096 | Pkinase | PF00069 | |
| zf-C3HC4 | PF00097 | zf-C3HC4 | PF00097 | |
| F-box | PF00646 | Skp1_POZ | PF03931 | |
| zf-C4 | PF00105 | Hormone_recep | PF00104 | x |
| SMC_hinge | PF06470 | SMC_N | PF02463 | x |
| Cation_ATPase_N | PF00690 | Cation_ATPase_C | PF00689 | |
| MutS_V | PF00488 | MutS_I | PF01624 | |
| Cadherin | PF00028 | Cadherin_C | PF01049 | |
| dsrm | PF00035 | dsrm | PF00035 | x |
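Table 6. Prediction outcomes for observed and non-observed domain pairs.

| | Predicted interacting | Predicted non-interacting |
|---|---|---|
| Observed | TP | FN |
| Non-observed | FP | TN |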
Comparison with domain interactions in H. pylori
Rain et al. [4] reported a protein-protein interaction data set for H. pylori obtained using yeast two-hybrid assays. This data set provides the sequence ranges of the prey proteins that interact with the bait proteins. We map these ranges in the preys to the Pfam-A domains when the overlap between them is larger than 50% of the Pfam domain. As we do not have such information for the baits, we assume that all domains in the baits interact with the specific site of the preys. We obtain a total of 1,101 interactions between Pfam-A domains. Note that the domain interactions from H. pylori may contain false positives, as the interacting domains in the baits are not known. We compare our predicted domain interactions from the six data sources using the Bayesian approach with the experimentally derived domain interactions from H. pylori. For comparison, we use the subset of the predicted domain interactions whose domains are involved in domain interactions in H. pylori. Additional file 8 shows the percentages of overlap between the domain interactions from H. pylori and the predicted domain interactions. The fraction of domain pairs that overlap with the domain interactions in H. pylori increases as the likelihood ratio score increases, confirming the accuracy of the predicted domain interactions.
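The mapping of prey sequence ranges to Pfam-A domains can be sketched as an interval overlap test. The function name, the coordinates, and the domain labels below are hypothetical, and the sketch assumes 1-based inclusive residue coordinates.

```python
def domains_in_range(prey_start, prey_end, domains, min_fraction=0.5):
    """Return domains whose overlap with the prey range exceeds min_fraction
    of the domain length. Coordinates are assumed 1-based and inclusive."""
    hits = []
    for name, d_start, d_end in domains:
        overlap = min(prey_end, d_end) - max(prey_start, d_start) + 1
        length = d_end - d_start + 1
        if overlap > min_fraction * length:
            hits.append(name)
    return hits

# Hypothetical prey fragment (residues 50 to 200) and domain annotations.
domains = [("D1", 10, 100), ("D2", 120, 180), ("D3", 190, 300)]
hits = domains_in_range(50, 200, domains)
```

In this toy case D1 and D2 are retained because more than half of each domain lies inside the interacting range, while D3 is dropped because only 11 of its 111 residues overlap.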
We also study our scoring algorithm using the H. pylori data set. We infer domain interactions from the H. pylori protein interactions using four scoring functions and compare the predicted domain interactions with the domain interactions from H. pylori. The number of domains in H. pylori is 848, giving 848 × 849/2 = 359,976 potential interacting pairs. From the Expectation scoring function, we obtain 1,150 predicted domain interactions (with scores larger than zero). Among them, 750 predicted domain interactions overlap with the 1,011 domain interactions in H. pylori. Additional file 9 shows that the true positive rate is around 0.8 among the 1,150 ranked domain interactions, showing the accuracy of the scoring functions.
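The count of potential pairs above follows the formula n(n + 1)/2, which counts unordered pairs of domains and includes pairs of a domain with itself. A short check, with a hypothetical function name:

```python
def potential_pairs(n_domains):
    """Unordered domain pairs, including a domain paired with itself."""
    return n_domains * (n_domains + 1) // 2

h_pylori_pairs = potential_pairs(848)
```

The same formula with 1,754 domains gives 1,539,135, the number of random domain pairs listed in the results tables, which suggests (although the text does not state it) that the same counting convention was used there.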
Domain interactions in yeast complexes
Figure 6(b) shows a pyruvate dehydrogenase (PDH) complex. This complex converts pyruvate to acetyl-CoA. The interaction between protein Lat1 and protein Pdb1 is mainly due to the interaction between domain PF02817 and domain PF02780. Domain PF02817 is an E3 binding domain, and PF02780 is the C-terminal domain of transketolase, which has been proposed as a regulatory molecule binding site. The interaction between protein Lat1 and protein Lpd1 occurs through the interaction of domain PF02817 and domain PF02852, which is the pyridine nucleotide-disulphide oxidoreductase dimerisation domain.
Discussion
The basic units of proteins are domains. If two proteins interact, at least one pair of domains, one from each of the two proteins, interacts. However, current biotechnologies such as the yeast two-hybrid system can only detect protein interactions, and it is tedious and labor intensive to derive domain interactions. The prediction of domain interactions based on protein interactions from one species has been formulated as a missing value problem, and an EM algorithm has been developed to achieve this objective [13]. The method has been modified to integrate protein interaction data sets from multiple species, and the results have been improved [1, 14]. In this study, we further explore the problem of domain-domain interactions using multiple data sources, including protein interactions from four species (yeast, worm, fruitfly, and humans) as well as domain fusion and domain function information. We first provide a score function, the expected number of domain-domain interactions in the observed interactions, to infer the reliability of domain interactions. By comparing with domain interactions in iPfam, we show that the new score outperforms the score of Deng et al. [13] for predicting domain interactions. The true positive rate among highly ranked domain interactions predicted from the new score is higher than that from Deng et al. [13]. We further show that, by including the domain fusion and gene ontology information, the accuracy of the predicted domain interactions can be significantly increased. We also show that the simple naïve Bayesian approach works well for combining multiple sources of biological information to predict high-confidence domain interactions. There are several limitations of this study. First, we did not include all the interaction data from all the species as Riley et al. [1] did. The reason is that the data sets for other species are much smaller than those for the four species.
Second, the protein interaction data sets used in this study are incomplete and contain many false positives. Additional file 1 shows the ROC curves of the prediction results using various values of false positive (fp) and false negative (fn). In particular, we compared the result based on the fp and fn values presented in Table 1 with the result based on fp = fn = 0 used in Riley et al. [1]. Depending on species, the former approach is sometimes better than or similar to the latter approach, and sometimes is worse. Third, although we have shown that the naïve Bayesian approach outperforms the evidence counting and the logistic regression methods, there is room to improve the prediction by considering the correlations between data sources.
Conclusion
We have shown that the likelihood ratio score provides a means for evaluating the reliability of domain interactions. Based on the likelihood ratio score, we have derived a set of high-confidence domain interactions. This set has important implications for understanding protein functions at the domain level as well as for understanding protein interactions.
Abbreviations
MLE: Maximum Likelihood Estimation
EM: Expectation Maximization
HPRD: Human Protein Reference Database
GO: Gene Ontology
ROC: Receiver Operating Characteristic
FPR: False Positive Rate
SN: Sensitivity
PQS: Protein Quaternary Structure
Declarations
Acknowledgements
We thank two anonymous reviewers for several helpful suggestions, which significantly improved the manuscript. One reviewer suggested the comparison with H. pylori data, which is now included in the manuscript. This research is supported by the NIH/NSF joint mathematical biology initiative DMS-0241102. MH Deng is supported by grants from the National Key Basic Research Project of China (No. 2003CB715903) and the National Natural Science Foundation of China (No. 90208022, No. 30570425).
References
1. Riley R, Lee C, Sabatti C, Eisenberg D: Inferring protein domain interactions from databases of interacting proteins. Genome Biol 2005, 6(10):R89. 10.1186/gb-2005-6-10-r89
2. iPfam [http://www.sanger.ac.uk/Software/Pfam/iPfam/]
3. Finn R, Bateman A: Visualisation of protein-protein interactions at domains and amino acid resolutions. Bioinformatics 2005, 21: 410–412. 10.1093/bioinformatics/bti011
4. Rain JC, Selig L, Reuse HD, Battaglia V, Reverdy C, Simon S, Lenzen G, Petel F, Wojcik J, Schachter V, Chemama Y, Labigne A, P L: The protein-protein interaction map of Helicobacter pylori. Nature 2001, 409: 211–215. 10.1038/35051615
5. Chervitz S, Aravind L, Sherlock G, Ball CA, Koonin EV, Dwight SS, Harris MA, Dolinski K, Mohr S, Smith T, Weng S, Cherry JM, D B: Comparison of the Complete Protein Sets of Worm and Yeast: Orthology and Divergence. Nucleic Acids Res 1998, 282: 2022–2028.
6. Ye Y, Godzik A: Comparative Analysis of Protein Domain Organization. Genome Res 2004, 14: 343–353. 10.1101/gr.1610504
7. Kelley BP, Sharan R, Karp RM, Sittler T, Root DE, Stockwell BR, Ideker T: Conserved pathways within bacteria and yeast as revealed by global protein network alignment. Proc Natl Acad Sci USA 2003, 20: 11394–11399. 10.1073/pnas.1534710100
8. Sharan R, Ideker T, Kelley BP, Shamir R, Karp RM: Identification of protein complexes by comparative analysis of yeast and bacterial protein interaction data. J Comput Biol 2005, 12(6):835–846. 10.1089/cmb.2005.12.835
9. Butland G, Peregrin-Alvarez JM, Li J, Yang W, Yang X, Canadien V, Starostine A, Richards D, Beattie B, Krogan N, Davey M, Parkinson J, Greenblatt J, A E: Interaction network containing conserved and essential protein complexes in Escherichia coli. Nature 2005, 433: 531–537. 10.1038/nature03239
10. Pereira-Leal JB, Teichmann SA: Novel specificities emerge by stepwise duplication of functional modules. Genome Res 2005, 15: 552–559. 10.1101/gr.3102105
11. Wojcik J, Schachter V: Protein-protein interaction map inference using interaction domain profile pairs. Bioinformatics 2001, 17(Suppl 1):S296–305.
12. Sprinzak E, Margalit H: Correlated Sequence-signatures as Markers of Protein-Protein Interaction. J Mol Biol 2001, 311: 681–692. 10.1006/jmbi.2001.4920
13. Deng M, Sun F, Chen T: Inferring domain-domain interactions from protein-protein interactions. Genome Res 2002, 12: 1540–1548. 10.1101/gr.153002
14. Liu Y, Liu N, Zhao H: Inferring protein-protein interactions through high-throughput interaction data from diverse organisms. Bioinformatics 2005, 21(15):3279–3285. 10.1093/bioinformatics/bti492
15. Gomez SM, Noble WS, A R: Learning to predict protein-protein interactions from protein sequences. Bioinformatics 2003, 19(15):1875–1881. 10.1093/bioinformatics/btg352
16. Gomez SM, Lo SH, A R: Probabilistic prediction of unknown metabolic and signal-transduction networks. Genetics 2001, 159(3):1291–1298.
17. DIP [http://dip.doe-mbi.ucla.edu/]
18. Salwinski L, Miller CS, Smith AJ, Pettit FK, Bowie JU, Eisenberg D: The Database of Interacting Proteins: 2004 update. Nucleic Acids Res 2004, (32 Database):D449–51. 10.1093/nar/gkh086
19. Ng SK, Zhang Z, Tan SH: Integrative approach for computationally inferring protein domain interactions. Bioinformatics 2003, 19(8):923–929. 10.1093/bioinformatics/btg118
20. Ng SK, Zhang Z, Tan SH, Lin K: InterDom: a database of putative interacting protein domains for validating predicted protein interactions and complexes. Nucleic Acids Res 2003, 31(1):251–254. 10.1093/nar/gkg079
21. Mewes HW, Frishman D, Guldener U, Mannhaupt G, Mayer K, Mokrejs M, Morgenstern B, Munsterkotter M, Rudd S, Weil B: MIPS: a Database for Genomes and Protein Sequences. Nucleic Acids Res 2002, 30: 31–34. 10.1093/nar/30.1.31
22. Uetz P, Giot L, Cagney G, Mansfield TA, Judson RS, Knight JR, Lockshon D, Narayan V, Srinivasan M, Pochart P, Qureshi-Emili A, Li Y, Godwin B, Conover D, Kalbfleisch T, Vijayadamodar G, Yang M, Johnston M, Fields S, Rothberg JM: A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature 2000, 403: 623–627. 10.1038/35001009
23. Ito T, Chiba T, Ozawa R, Yoshida M, Hattori M, Sakaki Y: A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc Natl Acad Sci USA 2001, 98: 4569. 10.1073/pnas.061034498
24. Li S, Armstrong CM, Bertin N: A map of the interactome network of the metazoan C. elegans. Science 2003, 303(5657):540–543. 10.1126/science.1091403
25. Giot L, Bader JS, Brouwer C, Chaudhuri A: A protein interaction map of Drosophila melanogaster. Science 2003, 302(5651):1727–1736. 10.1126/science.1090289
26. Peri S, Navarro J, Amanchy R, Kristiansen T, Jonnalagadda C, Surendranath V, Niranjan V, Muthusamy B, Gandhi TK, Gronborg M, Ibarrola N, Deshpande N, Shanker K, Shivashankar HN, Rashmi BP, Ramya MA, Zhao Z, Chandrika KN, Padma N, Harsha HC, Yatish AJ, Kavitha MP, Menezes M, Choudhury DR, Suresh S, Ghosh N, Saravana R, Chandran S, Krishna S, Joy M, Anand SK, Madavan V, Joseph A, Wong GW, Schiemann WP, Constantinescu SN, Huang L, Khosravi-Far R, Steen H, Tewari M, Ghaffari S, Blobe GC, Dang CV, Garcia JG, Pevsner J, Jensen ON, Roepstorff P, Deshpande KS, Chinnaiyan AM, Hamosh A, Chakravarti A, A P: Development of human protein reference database as an initial platform for approaching systems biology in humans. Genome Res 2003, 13: 2363–2371. 10.1101/gr.1680803
27. Gene Ontology [http://www.geneontology.org/]
28. Bateman A, Birney E, Cerruti L, Durbin R, L E, Eddy SR, S GJ, Howe KL, Marshall M, Sonnhammer EL: The Pfam Protein Families Database. Nucleic Acids Res 2002, 30: 276–280. 10.1093/nar/30.1.276
29. Pfam [http://www.sanger.ac.uk/Software/Pfam/]
30. NCBI [http://www.ncbi.nlm.nih.gov/]
31. EMBL-EBI [http://www.ebi.ac.uk/integr8/]
32. Henrick K, Thornton JM: PQS: a protein quaternary structure file server. Trends Biochem Sci 1998, 23(9):358–61. 10.1016/S0968-0004(98)01253-5
33. SCOP [http://scop.berkeley.edu/]
34. Enright AJ, Iliopoulos I, Kyrpides NC, Ouzounis CA: Protein interaction maps for complete genomes based on gene fusion events. Nature 1999, 402(6757):86–90. 10.1038/47056
35. Marcotte EM, Pellegrini M, Ng HL, Rice DW, Yeates TO, Eisenberg D: Detecting Protein Function and Protein-protein Interactions from Genome Sequences. Science 1999, 285: 751–753. 10.1126/science.285.5428.751
36. Jansen R, Yu H, Greenbaum D, Kluger Y, Krogan NJ, Chung S, Emili A, Snyder M, Greenblatt JF, Gerstein M: A Bayesian networks approach for predicting protein-protein interactions from genomic data. Science 2003, 302: 449–453. 10.1126/science.1087361
37. Lee I, Date S, Adai A, Marcotte E: A probabilistic functional network of yeast genes. Science 2004, 306: 1555–1558. 10.1126/science.1099511
38. Lehner B, Fraser A: A first-draft human protein-interaction map. Genome Biol 2004, 5: R63. 10.1186/gb-2004-5-9-r63
39. Deng M, Zhang K, Mehta S, Chen T, Sun F: Prediction of Protein Function Using Protein-protein Interaction Data. J Comput Biol 2003, 10(6):197–206. 10.1089/106652703322756168
40. Patton EE, Willems AR, Sa D, Kuras L, Thomas D, Craig KL, Tyers M: Cdc53 is a scaffold protein for multiple Cdc34/Skp1/F-box protein complexes that regulate cell division and methionine biosynthesis in yeast. Genes Dev 1998, 12(5):692–705.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.