InPrePPI: an integrated evaluation method based on genomic context for predicting protein-protein interactions in prokaryotic genomes
- Jingchun Sun†1,
- Yan Sun†2, 3,
- Guohui Ding†2, 3,
- Qi Liu4,
- Chuan Wang2,
- Youyu He2,
- Tieliu Shi2,
- Yixue Li2 and
- Zhongming Zhao1, 5, 6
© Sun et al; licensee BioMed Central Ltd. 2007
Received: 30 March 2007
Accepted: 26 October 2007
Published: 26 October 2007
Although many genomic features have been used in the prediction of protein-protein interactions (PPIs), frequently only one is used in a computational method. After realizing the limited power in the prediction using only one genomic feature, investigators are now moving toward integration. So far, there have been few integration studies for PPI prediction; one failed to yield appreciable improvement of prediction and the others did not conduct performance comparison. It remains unclear whether an integration of multiple genomic features can improve the PPI prediction and, if it can, how to integrate these features.
In this study, we first performed a systematic evaluation on the PPI prediction in Escherichia coli (E. coli) by four genomic context based methods: the phylogenetic profile method, the gene cluster method, the gene fusion method, and the gene neighbor method. The number of predicted PPIs and the average degree in the predicted PPI networks varied greatly among the four methods. Further, no method outperformed the others when we tested using three well-defined positive datasets from the KEGG, EcoCyc, and DIP databases. Based on these comparisons, we developed a novel integrated method, named InPrePPI. InPrePPI first normalizes the AC value (an integrated value of the accuracy and coverage) of each method using three positive datasets, then calculates a weight for each method, and finally uses the weight to calculate an integrated score for each protein pair predicted by the four genomic context based methods. We demonstrate that InPrePPI outperforms each of the four individual methods and, in general, the other two existing integrated methods: the joint observation method and the integrated prediction method in STRING. These four methods and InPrePPI are implemented in a user-friendly web interface.
This study evaluated the PPI prediction by four genomic context based methods, and presents an integrated evaluation method that shows better performance in E. coli.
Uncovering all protein-protein interactions (PPIs), or, the interactome, of an organism is essential for understanding its complex biological processes [1, 2]. Recently, many high-throughput experimental and computational methods have been developed and applied to model organisms such as Escherichia coli (E. coli), yeast, and humans [3–10]. High-throughput experimental methods can directly detect the set of PPIs in a genome, but the capacity to identify PPIs is still limited by present technology. Computational approaches, which usually mine and then utilize the features from the known PPIs and the genomic information from one or multiple genomes, can largely meet this strong demand . The major limitation in both the computational and experimental approaches is their uncertain confidence in the identification of PPIs, with high false-positive and false-negative rates [12, 13].
Genomic context information has been frequently used in computational methods for PPI prediction. There are four major genomic context based methods: the phylogenetic profile method, the gene cluster method, the gene fusion method, and the gene neighbor method. Each method mainly utilizes one specific genomic context feature; thus, its prediction is biased toward the information it relies on. One comparison of the phylogenetic profile, gene fusion, and gene neighbor methods suggested that the gene neighbor method might outperform the other two [17, 18]. To date, there have been no other systematic evaluations of these four methods. It is likely that an integration of these methods would take advantage of different genomic features and thus outperform each of these four methods. Indeed, investigators now realize the importance of integration [19, 20]. The integration strategy has been applied in two methods: the joint observation method [3, 14, 21] and STRING. The joint observation method selects the PPIs that are predicted or identified by more than one method [10, 21]. Its rationale is that the confidence of a PPI prediction relies on the amount of supporting evidence, and that the confidence increases with more evidence (i.e., methods). This strategy was successfully demonstrated in Uetz et al. and von Mering et al. However, the joint observation method results in a strong decrease of the coverage, especially when the number of methods becomes large. Since an efficient approach to inferring PPIs needs to consider both coverage and accuracy, the joint observation method has limited applications [12, 24]. STRING calculates a combined score for each pair of proteins, assuming that the features from various sources are independent. While this scoring algorithm has been implemented in the STRING database, its improvement of PPI prediction has not been evaluated.
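The two integration strategies described above can be illustrated with a minimal Python sketch. The function names and the input layout (a dictionary from method name to a set of protein pairs) are hypothetical. The product formula is one common way to combine confidences under an independence assumption; it is shown as an illustration and is not claimed to reproduce the exact STRING implementation.

```python
from collections import Counter
from math import prod


def joint_observation(predictions, min_methods=2):
    """Keep the pairs reported by at least `min_methods` methods.

    predictions: dict mapping a method name to a set of frozenset pairs
    (hypothetical layout chosen for this sketch).
    """
    counts = Counter(pair for pairs in predictions.values() for pair in pairs)
    return {pair for pair, n in counts.items() if n >= min_methods}


def combined_score_independent(scores):
    """Combine per-method confidences in [0, 1] as 1 - prod(1 - s_i).

    Assumes the evidence sources are independent (an assumption of this sketch).
    """
    return 1.0 - prod(1.0 - s for s in scores)


predictions = {
    "method_A": {frozenset(("x", "y")), frozenset(("y", "z"))},
    "method_B": {frozenset(("x", "y"))},
}
supported = joint_observation(predictions, min_methods=2)
score = combined_score_independent([0.5, 0.5])
```

Raising `min_methods` shrinks the returned set, which mirrors the loss of coverage described for the joint observation method.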
In this study, we first performed a systematic evaluation on the prediction efficacy of these four genomic context based methods by using three gold standards of positive datasets obtained from the KEGG, EcoCyc, and DIP databases, respectively. We used E. coli K12 in this study because it is the most studied prokaryotic organism and its protein annotations are available in several databases. Our evaluation indicated that there is no consensus among these methods and no method could outperform the others in all tests. Based on these comparisons, we developed a new method to integrate the features used in all four methods. We named the method InPrePPI (an Integrated method for Prediction of Protein-Protein Interactions). InPrePPI first calculates a score for each protein-protein pair predicted by each method, then optimally weighs the score, and finally obtains an integrated score. Based on the integrated score, InPrePPI extracts the PPIs with high confidence from all of the predicted protein pairs. Our comparison of InPrePPI with the joint observation method and STRING indicates that InPrePPI in general outperforms the others. Finally, we implemented the four genomic context based methods and InPrePPI in a user-friendly platform-independent system.
Comparison of the PPIs predicted by the four methods
Table 1. Protein-protein interactions predicted by four methods. Columns: number of PPIs, number of proteins involved, and number of PPIs covered by two methods.
We next examined the average degree for the PPIs predicted by the four methods. The degree is the most elementary characteristic of a node in a biological network . If the average degree in the predicted network is much lower than the expected, it may reflect that the prediction does not have a good coverage of the PPIs in the genome. Conversely, if it is much higher than the expected, it may reflect many false positive results in the prediction (i.e., low accuracy). Note that this comparison does not directly test the performance. We measured the average degree by the average number of links in the predicted PPIs. The average degree was close to 1 in the GCM or GNM, remarkably lower than that in the PPM (21.4) or GFM (5.4) (Table 1). According to the previous estimations, an average degree should be in a range of 2 to 10 links for each protein in a typical functioning cell [29, 30]. Thus, it seems that only the GFM had a reasonable average degree. Overall, the prediction of PPIs varied greatly among these four genomic context methods.
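As an illustration of the average degree used above, the following Python sketch counts, for each protein, the number of distinct partners in a list of predicted pairs and averages that count over all proteins that appear in at least one pair. The function name and the input layout are hypothetical, and whether the published values were computed with exactly this convention is an assumption.

```python
from collections import defaultdict


def average_degree(pairs):
    """Average number of distinct interaction partners per protein.

    pairs: iterable of (protein_a, protein_b) tuples; duplicate and
    reversed pairs are counted once, and self-pairs are ignored.
    """
    neighbors = defaultdict(set)
    for a, b in pairs:
        if a == b:
            continue  # a self-pair adds no link between two proteins
        neighbors[a].add(b)
        neighbors[b].add(a)
    if not neighbors:
        return 0.0
    return sum(len(partners) for partners in neighbors.values()) / len(neighbors)


triangle = [("a", "b"), ("b", "c"), ("a", "c")]
single_link = [("a", "b"), ("b", "a")]
```

With this convention, a network made only of isolated pairs has an average degree of exactly 1.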
Finally, we examined the PPIs that were similarly predicted by more than one method. A total of 1,155 PPIs were predicted by both the GCM and GNM. They accounted for 47% of the total predicted PPIs by the GCM and 32% by the GNM (Table 1). For the PPIs predicted by the GFM and PPM, 1,532 overlapped, which accounted for 23% of the total PPIs by the GFM and 3% by the PPM, respectively. The number of overlapped PPIs in the remaining comparisons between two methods was smaller (Table 1). Furthermore, there were only 298 PPIs that were predicted by three or more methods. Of those 298 PPIs, 55 were predicted by all four methods. The comparison suggests that (1) GCM and GNM, which likely share some common genetic context information, have similar predictions of PPIs to some extent, and (2) there was no consensus in the prediction of PPIs by these methods that utilize different features of genomic context. The lack of consensus in prediction by different methods was similarly reported in the previous study , implying that they could complement each other.
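The overlap percentages reported above can be computed with a simple set intersection. The sketch below is illustrative only; the function name and the toy protein identifiers are hypothetical, and pairs are treated as unordered.

```python
def overlap_report(pairs_a, pairs_b):
    """Return (shared count, fraction of A shared, fraction of B shared)."""
    a = {frozenset(p) for p in pairs_a}  # unordered pairs: (x, y) == (y, x)
    b = {frozenset(p) for p in pairs_b}
    shared = a & b
    frac_a = len(shared) / len(a) if a else 0.0
    frac_b = len(shared) / len(b) if b else 0.0
    return len(shared), frac_a, frac_b


method_a = [("p1", "p2"), ("p2", "p3")]
method_b = [("p2", "p1"), ("p3", "p4"), ("p4", "p5"), ("p5", "p6")]
shared_count, frac_a, frac_b = overlap_report(method_a, method_b)
```

The same shared set accounts for a different fraction of each method's predictions because the two prediction sets differ in size, which is why two percentages are reported for each comparison.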
Biological biases of the PPIs predicted by the four methods
We further compared the features of these four methods by evaluating the performance of PPI prediction using three well-defined datasets from the KEGG, EcoCyc, and DIP databases. The KEGG dataset included pathway information, the EcoCyc dataset included protein complexes, and the DIP dataset included the protein interactions with evidence. The performance of each method was measured by an AC value, which is an integrated value of the accuracy and coverage (see Methods), because an assessment of the prediction needs to consider both accuracy and coverage.
When k = 15, we assigned an integrated score to each of the 54,911 pairs predicted by the four methods (Table 1). These 54,911 pairs were separated into three classes based on the prediction confidence: InPrePPI_high (1,194 pairs), InPrePPI_medium (5,403), and InPrePPI_low (48,314). The data are available at the InPrePPI web site or upon request.
Comparison of InPrePPI with other methods
We first compared the PPI prediction by InPrePPI with the four individual methods. The AC value was higher in InPrePPI than each of the four methods (Figure 2).
Table 2. Accuracy and coverage in three integrated methods. Column: number of PPIs. Row group: joint observation method (JOM).
A web-based, user-friendly application (InPrePPI) for PPI prediction was implemented in Java. This InPrePPI web interface allows the user to predict PPIs using one of the four methods (PPM, GCM, GFM, and GNM) or InPrePPI. If the user chooses InPrePPI, the application first predicts PPIs using the four methods and then assigns an integrated score to each pair of the predicted PPIs. The user has the option to set or modify parameters such as the BLASTP E-value, the target organism, or the list of reference organisms. The package can be downloaded at no cost from the web site and installed on a local computer. Because the system was designed to provide flexibility in PPI prediction, the data are not pre-computed. This may lead to a long computation time; therefore, we recommend that the user retrieve the results via email or run the package directly on a local computer.
Many biological features have been explored in the prediction of protein-protein interactions, and it has been found that prediction power is limited when only one genomic feature is utilized. Investigators are now moving toward integration [12, 22, 35]. A systematic assessment of the existing methods is a prerequisite to an effective integration. In this study, we focused on four major methods (PPM, GCM, GFM, and GNM) that utilize genomic context information. Each method has its own characteristics. We hypothesized that an efficient integration of these four major methods would improve prediction performance. We first performed extensive comparisons of these four methods using three positive datasets (KEGG, EcoCyc, and DIP). We found that these four methods lacked consensus but complemented each other to some extent. Based on these comparisons, we developed an integrated method, InPrePPI, which optimally weighs the scores of protein pairs predicted by the four methods. Our performance comparison indicates that InPrePPI outperforms each individual method (Figure 2) and, in general, the other two integrated methods: the JOM and STRING (Table 2, Figures 4 and 5).
However, InPrePPI did not outperform the JOM or STRING in all tests. In the JOM, the accuracy values were higher for the PPIs that were consistently predicted by at least three methods. Such high values were reached by dramatically decreasing the coverage. This makes JOM impractical when multiple methods or supporting evidence is employed. InPrePPI does not have this limitation because it uses an integration score, rather than an intersection of multiple data. Compared to STRING, InPrePPI had consistently higher accuracy values and its coverage values were higher or close, in most cases, except in the high confidence class of the EcoCyc and DIP datasets. In the latter two cases, the difference was not as remarkable as it was in the comparison between the JOM and InPrePPI. For example, the coverage value in InPrePPI was 33.19% in the high confidence class of EcoCyc; this is comparable to the 42.33% in STRING but much higher than the 2.65% in the JOM4 (Table 2). When we considered both the accuracy and coverage values, InPrePPI outperformed STRING in all tests except in the high confidence class of EcoCyc (Figure 4). Furthermore, our independent test using COG annotations indicates that the fractions of true positives in InPrePPI were consistently higher than those in STRING in all three classes of predicted PPIs (Figure 5).
The STRING database provides a comprehensive, high quality collection of protein-protein associations for a large number of organisms . The association data were compiled from high-throughput experimental data, mining of other databases and literature, and the predicted PPIs by genomic context approaches. We demonstrated that InPrePPI has an overall better performance than the prediction methods (phylogenetic co-occurrence, conserved neighborhood, and gene fusion methods) in STRING. However, InPrePPI is limited to the evaluation and prediction of protein-protein pairs based on the genomic context features and its web site provides only prediction function rather than a comprehensive evidence collection. While the STRING database provides a powerful system for proteomics research, the amount of PPI data collected by the high-throughput experiments, or from the existing literature, is still very limited at present in most organisms in nature and is likely to be limited for some time. Computational approaches are expected to play an important role in uncovering the interactomes of most genomes. Although one recent study failed to improve the prediction by adding more features , the InPrePPI method demonstrates that an integration, if appropriate, can improve prediction power. Thus, our integrated method based on the genomic context, which is to be further optimized and enhanced, can be applied to the prediction of PPIs in many other (prokaryotic) genomes and also integrated into the comprehensive database such as STRING.
InPrePPI integrates four genomic context based methods. These four methods are currently the best computational methods for prokaryotic genomes. This implies that InPrePPI may be applied to the discovery of PPIs at least in prokaryotic genomes. InPrePPI uses a constant, k, to normalize the AC value and calculate the weight of each method. This constant depends on the data used and the methods integrated and can be obtained by a heuristic approach. When true positives are available in a genome, the optimal k value and the weight of each method can be directly obtained by the method in this study. Predicting PPIs in a genome without true positive data is very challenging at present and always relies on the knowledge in other well-studied organisms. In that case, we may use the optimal k value and the weights available in E. coli or any other genome that is related to the target genome and then refine them after some of the predicted PPIs have been validated (i.e., true positives). InPrePPI may be extended to eukaryotic genomes as well. Recent assessments of phylogenetic profiling in E. coli and yeast confirmed a similar strategy of reference organism selection in the construction of phylogenetic profiles [36–38] and indicate that phyletic patterns of proteins in prokaryotes alone are adequate to predict functional linkages between proteins in prokaryotic and eukaryotic genomes. Some studies have reported that neighboring genes have similar expression patterns in higher eukaryotes, implying possible interactions [39–41]. Qi et al. found that gene co-expression is consistently the most important feature in their comprehensive evaluation of PPI prediction in yeast using an integrated framework, which supports the previous finding that the most obvious co-expression comes from permanent complexes such as the ribosome and proteasome [42, 43]. Therefore, we may consider both the genomic context information and the gene co-expression data when we extend InPrePPI to eukaryotic genomes.
We used the gold standards of positives to evaluate the PPI prediction methods. In previous studies, positive data were selected from the standardized SWISS-PROT keywords [3, 30], the metabolic map in KEGG, the pathway information in COG, or the protein complexes. So far, there has been no complete biological database to serve as a gold standard of positives. To avoid a biased selection of positive data, we used three well-documented datasets: (1) biological pathway information from KEGG, (2) protein complexes from EcoCyc, and (3) protein-protein interactions identified by experiments from DIP. The prediction performance of each method varied among these three datasets (Figure 1), suggesting that the selection of positive control data should be made carefully and should consider the types of interactions.
Computational prediction will play a major role in the exploration of the interactomes of many genomes. However, a computational method that relies on one specific genomic context feature has limited power in PPI prediction. We believe that an integration approach, which efficiently takes advantage of the different genomic features, will outperform individual methods. In this study, we first evaluated the prediction performance of the four major genomic context based methods (PPM, GCM, GFM, and GNM), then we developed a novel integrated method (InPrePPI) based on the comparisons of these four methods in three datasets (KEGG, EcoCyc, and DIP). We demonstrated that InPrePPI, which is an evaluation rather than prediction method, outperforms these four individual methods and, in general, the other two existing integrated methods (JOM and STRING).
We downloaded genes and their annotations (e.g., name, length, orientation, and protein sequence) in the 226 available complete genomes from the NCBI RefSeq database. We chose E. coli K12 as the target organism and the remaining 225 organisms as reference organisms. The predicted operons in prokaryotes were downloaded from SHOPS. We downloaded the PPI data in STRING from its web site and then retrieved those PPIs predicted by the methods (phylogenetic co-occurrence, conserved neighborhood, and gene fusion) in STRING. We retrieved the COG annotations for E. coli K12 proteins from the NCBI E. coli K12 genome database.
Four genomic context based methods
We predicted PPIs using the genome datasets collected above by four genomic context based methods: the phylogenetic profile method, the gene cluster method [3, 33], the gene fusion method, and the gene neighbor method. We briefly describe these methods below; the details of these methods are provided in their original publications.
In the phylogenetic profile method, we used the refined method described in Sun et al. to obtain an optimal reference organism set from the 225 available complete genomes. The homology of a protein was identified by the BLASTP program with an E-value < 1 × 10⁻⁴. We chose the E-value threshold of 1 × 10⁻⁴ because of its optimal performance in our previous evaluation. The phylogenetic profile for each E. coli protein was then constructed and assessed using the mutual information (MI) value calculated by the method in Date and Marcotte. The MI value of each protein pair reflects the confidence level of the link between the two proteins. To identify the candidate interactions, we calculated the threshold of mutual information (TMI) values using the method in Sun et al. A pair of proteins was considered to interact when its MI value was higher than the TMI value.
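To make the mutual information step concrete, here is a minimal Python sketch that computes MI(A, B) = H(A) + H(B) - H(A, B) for two phylogenetic profiles. It assumes discrete profiles (for example, 1 for presence and 0 for absence of a homolog in each reference organism); the cited methods may discretize profiles differently, and the function names and the threshold value used here are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(counts, total):
    """Shannon entropy (in bits) of a distribution given as raw counts."""
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)


def mutual_information(profile_a, profile_b):
    """MI between two equal-length discrete profiles."""
    if len(profile_a) != len(profile_b) or len(profile_a) == 0:
        raise ValueError("profiles must be non-empty and of equal length")
    n = len(profile_a)
    h_a = entropy(Counter(profile_a).values(), n)
    h_b = entropy(Counter(profile_b).values(), n)
    h_ab = entropy(Counter(zip(profile_a, profile_b)).values(), n)
    return h_a + h_b - h_ab


def is_candidate(profile_a, profile_b, tmi):
    """A pair is a candidate interaction when its MI exceeds the threshold."""
    return mutual_information(profile_a, profile_b) > tmi


identical = mutual_information([1, 0, 1, 0], [1, 0, 1, 0])
unrelated = mutual_information([1, 1, 0, 0], [1, 0, 1, 0])
```

Two proteins with identical presence/absence patterns reach the maximum MI for their profile entropy, whereas statistically unrelated patterns give an MI of zero.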
In the gene cluster method, the genes that belong to one operon in E. coli and have homologues also belonging to another operon in the reference genome(s) were considered to have functional links with each other. In the gene fusion method, two or more proteins were identified to be functionally linked when they were not encoded by neighboring genes in E. coli but were uniquely homologous to a single protein in a reference organism. In the gene neighbor method, we identified those genes that were located as neighbors (i.e., physically linked) among multiple genomes.
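One possible reading of the gene cluster criterion is sketched below in Python: two genes in the same target operon are linked when they have distinct homologs that share an operon in a reference genome. The data layout (lists of operons and a homolog dictionary), the function name, and the requirement that the two homologs be distinct are assumptions of this sketch and are not taken from the original implementation.

```python
from itertools import combinations


def gene_cluster_links(target_operons, reference_operons, homologs):
    """Return linked target gene pairs as a set of frozensets.

    target_operons / reference_operons: lists of lists of gene ids.
    homologs: dict mapping a target gene id to a set of reference gene ids.
    """
    # Map every reference gene to the index of the operon that contains it.
    operon_of = {
        gene: index
        for index, operon in enumerate(reference_operons)
        for gene in operon
    }
    links = set()
    for operon in target_operons:
        for a, b in combinations(operon, 2):
            homs_a = [h for h in homologs.get(a, ()) if h in operon_of]
            homs_b = [h for h in homologs.get(b, ()) if h in operon_of]
            if any(
                ha != hb and operon_of[ha] == operon_of[hb]
                for ha in homs_a
                for hb in homs_b
            ):
                links.add(frozenset((a, b)))
    return links


target = [["g1", "g2", "g3"]]
reference = [["r1", "r2"], ["r3"]]
homolog_map = {"g1": {"r1"}, "g2": {"r2"}, "g3": {"r3"}}
links = gene_cluster_links(target, reference, homolog_map)
```

In this toy input only g1 and g2 are linked, because their homologs r1 and r2 sit in the same reference operon while r3 is in a different one.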
Identification of each protein pair is based on the genomic context within a variety of genomes; some were closely related while the others were not. Thus, we assigned a score to each protein pair by the evolutionary distance between the target organism and the reference organism where the pair was present. We used the conserved 16S rRNA gene to estimate the evolutionary distance between E. coli and the other prokaryotic genomes. We downloaded the 16S rRNA gene sequences in E. coli and the other 211 prokaryotic genomes from NCBI. We then aligned them using the ClustalW program. After a manual check and adjustment of the alignments, we estimated the genetic distance using the PHYLIP package. Finally, we calculated the score for each protein pair, which is the sum of the evolutionary distances between E. coli and the other genomes where the protein pair was present.
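The distance-based score described above amounts to a sum over the reference genomes that support a pair. A minimal Python sketch follows; the function name, the genome labels, and the distance values are made up for illustration.

```python
def pair_score(genomes_with_pair, distance_to_target):
    """Sum the evolutionary distances of the genomes where the pair is present.

    genomes_with_pair: iterable of reference genome ids (duplicates ignored).
    distance_to_target: dict mapping a genome id to its distance from the
    target organism; a missing genome id raises KeyError.
    """
    return sum(distance_to_target[g] for g in set(genomes_with_pair))


# Hypothetical 16S rRNA-based distances from the target organism.
distances = {"genome_A": 0.1, "genome_B": 0.25, "genome_C": 0.4}
score = pair_score(["genome_A", "genome_C"], distances)
```

Under this scheme, a pair supported by a distantly related genome contributes more to the score than a pair supported only by a close relative.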
Gold standard positives and negatives
Table 3. Summary of the positive and negative control data. Column: number of protein pairs. Rows include EcoCyc (8.0), DIP (Ecoli20060116), and KEGG + EcoCyc + DIP.
Evaluation of PPI prediction
In the equations above, TP (true positive) is the number of the predicted PPIs that were found in the positive control dataset, FP (false positive) is the number of the predicted PPIs that were not found in the positive control dataset, and FN (false negative) is the number of PPIs in the positive control dataset that failed to be predicted by the method.
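The counts defined above can be turned into accuracy and coverage with a few set operations. The Python sketch below assumes the standard forms accuracy = TP / (TP + FP) and coverage = TP / (TP + FN), which are consistent with the definitions of TP, FP, and FN given here; how the two values are combined into the AC value is not shown. The function name and toy pairs are hypothetical.

```python
def accuracy_and_coverage(predicted, positives):
    """Return (accuracy, coverage) for predicted pairs against a positive set.

    Pairs are treated as unordered. Assumed forms (see lead-in):
    accuracy = TP / (TP + FP), coverage = TP / (TP + FN).
    """
    predicted = {frozenset(p) for p in predicted}
    positives = {frozenset(p) for p in positives}
    tp = len(predicted & positives)   # predicted and in the positive set
    fp = len(predicted - positives)   # predicted but not in the positive set
    fn = len(positives - predicted)   # in the positive set but not predicted
    accuracy = tp / (tp + fp) if tp + fp else 0.0
    coverage = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, coverage


predicted_pairs = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]
positive_pairs = [("b", "a"), ("c", "b"), ("x", "y")]
accuracy, coverage = accuracy_and_coverage(predicted_pairs, positive_pairs)
```

In the toy data, two of the four predictions are in the positive set and two of the three positives are recovered.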
where S_j is the score of the pair by method j.
We categorized the predicted PPIs into three groups according to their prediction confidence. We first obtained two average scores to serve as the cutoff values: Score_P, the average score among the predicted protein pairs whose interactions are known to be true (i.e., in the positive dataset), and Score_N, the average score among the predicted protein pairs whose interactions are known to be false (i.e., in the negative dataset). The predicted protein pairs whose scores were higher than Score_P were considered to have high confidence and were categorized into the InPrePPI_high class. The predicted protein pairs whose scores were lower than Score_N were considered to have low confidence and were categorized into the InPrePPI_low class. The remaining protein pairs, whose scores were between Score_N and Score_P, were categorized into the InPrePPI_medium class.
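The three-class assignment described above can be sketched in Python as follows. The function names, the pair identifiers, and the scores are hypothetical; the sketch assumes that Score_N is lower than Score_P and that a score exactly equal to a cutoff falls into the medium class.

```python
from statistics import mean


def confidence_cutoffs(scores, positives, negatives):
    """Return (Score_P, Score_N): mean scores of known-true and known-false pairs.

    scores: dict mapping a pair id to its integrated score.
    positives / negatives: sets of pair ids from the control datasets.
    """
    score_p = mean(s for pair, s in scores.items() if pair in positives)
    score_n = mean(s for pair, s in scores.items() if pair in negatives)
    return score_p, score_n


def classify(score, score_p, score_n):
    """Assign a predicted pair to one of the three confidence classes."""
    if score > score_p:
        return "InPrePPI_high"
    if score < score_n:
        return "InPrePPI_low"
    return "InPrePPI_medium"


scores = {"p1": 0.9, "p2": 0.7, "n1": 0.2, "n2": 0.4, "u1": 0.5}
score_p, score_n = confidence_cutoffs(scores, {"p1", "p2"}, {"n1", "n2"})
classes = {pair: classify(s, score_p, score_n) for pair, s in scores.items()}
```

Because both cutoffs are averages, some known positives fall below Score_P and some known negatives fall above Score_N, so the medium class will contain pairs of both kinds.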
List of abbreviations
InPrePPI: an integration method for prediction of protein-protein interactions
PPM: phylogenetic profile method
GCM: gene cluster method
GFM: gene fusion method
GNM: gene neighbor method
JOM: joint observation method
We thank Jill Opalesky and Emily Mitchell for critically reading the manuscript and three anonymous reviewers for valuable comments. This project was supported by the Thomas F. and Kate Miller Jeffress Memorial Trust Fund, the 863 Hi-Tech Program grants, China State Key Program of Basic Research grants, and a China National Natural Science Foundation grant.
- Auerbach D, Thaminy S, Hottiger MO, Stagljar I: The post-genomic era of interactive proteomics: facts and perspectives. Proteomics 2002, 2: 611–623. 10.1002/1615-9861(200206)2:6<611::AID-PROT611>3.0.CO;2-Y
- Eisenberg D, Marcotte EM, Xenarios I, Yeates TO: Protein function in the post-genomic era. Nature 2000, 405: 823–826. 10.1038/35015694
- Strong M, Mallick P, Pellegrini M, Thompson M, Eisenberg D: Inference of protein function and protein linkages in Mycobacterium tuberculosis based on prokaryotic genome organization: a combined computational approach. Genome Biol 2003, 4: R59. 10.1186/gb-2003-4-9-r59
- Lehner B, Fraser AG: A first-draft human protein-interaction map. Genome Biol 2004, 5: R63. 10.1186/gb-2004-5-9-r63
- Rain JC, Selig L, De Reuse H, Battaglia V, Reverdy C, Simon S, Lenzen G, Petel F, Wojcik J, Schachter V, Chemama Y, Labigne A, Legrain P: The protein-protein interaction map of Helicobacter pylori. Nature 2001, 409: 211–215. 10.1038/35051615
- Schwikowski B, Uetz P, Fields S: A network of protein-protein interactions in yeast. Nat Biotechnol 2000, 18: 1257–1261. 10.1038/82360
- Li S, Armstrong CM, Bertin N, Ge H, Milstein S, Boxem M, Vidalain PO, Han JD, Chesneau A, Hao T, Goldberg DS, Li N, Martinez M, Rual JF, Lamesch P, Xu L, Tewari M, Wong SL, Zhang LV, Berriz GF, Jacotot L, Vaglio P, Reboul J, Hirozane-Kishikawa T, Li Q, Gabel HW, Elewa A, Baumgartner B, Rose DJ, Yu H, Bosak S, Sequerra R, Fraser A, Mango SE, Saxton WM, Strome S, Van Den Heuvel S, Piano F, Vandenhaute J, Sardet C, Gerstein M, Doucette-Stamm L, Gunsalus KC, Harper JW, Cusick ME, Roth FP, Hill DE, Vidal M: A map of the interactome network of the metazoan C. elegans. Science 2004, 303: 540–543. 10.1126/science.1091403
- Giot L, Bader JS, Brouwer C, Chaudhuri A, Kuang B, Li Y, Hao YL, Ooi CE, Godwin B, Vitols E, Vijayadamodar G, Pochart P, Machineni H, Welsh M, Kong Y, Zerhusen B, Malcolm R, Varrone Z, Collis A, Minto M, Burgess S, McDaniel L, Stimpson E, Spriggs F, Williams J, Neurath K, Ioime N, Agee M, Voss E, Furtak K, Renzulli R, Aanensen N, Carrolla S, Bickelhaupt E, Lazovatsky Y, DaSilva A, Zhong J, Stanyon CA, Finley RL Jr., White KP, Braverman M, Jarvie T, Gold S, Leach M, Knight J, Shimkets RA, McKenna MP, Chant J, Rothberg JM: A protein interaction map of Drosophila melanogaster. Science 2003, 302: 1727–1736. 10.1126/science.1090289
- Stelzl U, Worm U, Lalowski M, Haenig C, Brembeck FH, Goehler H, Stroedicke M, Zenkner M, Schoenherr A, Koeppen S, Timm J, Mintzlaff S, Abraham C, Bock N, Kietzmann S, Goedde A, Toksoz E, Droege A, Krobitsch S, Korn B, Birchmeier W, Lehrach H, Wanker EE: A human protein-protein interaction network: a resource for annotating the proteome. Cell 2005, 122: 957–968. 10.1016/j.cell.2005.08.029
- Marcotte EM, Pellegrini M, Thompson MJ, Yeates TO, Eisenberg D: A combined algorithm for genome-wide prediction of protein function. Nature 1999, 402: 83–86. 10.1038/47048
- Walhout AJ, Vidal M: Protein interaction maps for model organisms. Nat Rev Mol Cell Biol 2001, 2: 55–62. 10.1038/35048107
- von Mering C, Krause R, Snel B, Cornell M, Oliver SG, Fields S, Bork P: Comparative assessment of large-scale data sets of protein-protein interactions. Nature 2002, 417: 399–403. 10.1038/nature750
- Qi Y, Bar-Joseph Z, Klein-Seetharaman J: Evaluation of different biological data and computational classification methods for use in protein interaction prediction. Proteins 2006, 63: 490–500. 10.1002/prot.20865
- Pellegrini M, Marcotte EM, Thompson MJ, Eisenberg D, Yeates TO: Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc Natl Acad Sci USA 1999, 96: 4285–4288. 10.1073/pnas.96.8.4285
- Enright AJ, Iliopoulos I, Kyrpides NC, Ouzounis CA: Protein interaction maps for complete genomes based on gene fusion events. Nature 1999, 402: 86–90. 10.1038/47056
- Dandekar T, Snel B, Huynen M, Bork P: Conservation of gene order: a fingerprint of proteins that physically interact. Trends Biochem Sci 1998, 23: 324–328. 10.1016/S0968-0004(98)01274-2
- Huynen M, Snel B, Lathe W 3rd, Bork P: Predicting protein function by genomic context: quantitative evaluation and qualitative inferences. Genome Res 2000, 10: 1204–1210. 10.1101/gr.10.8.1204
- Huynen MA, Snel B, von Mering C, Bork P: Function prediction and protein networks. Curr Opin Cell Biol 2003, 15: 191–198. 10.1016/S0955-0674(03)00009-7
- Gerstein M, Lan N, Jansen R: Enhanced: integrating interactomes. Science 2002, 295: 284–287. 10.1126/science.1068664
- Bertone P, Gerstein M: Integrative data mining: the new direction in bioinformatics. IEEE Eng Med Biol Mag 2001, 20: 33–40. 10.1109/51.940042
- Chen Y, Xu D: Computational analyses of high-throughput protein-protein interaction data. Curr Protein Pept Sci 2003, 4: 159–181. 10.2174/1389203033487225
- von Mering C, Jensen LJ, Snel B, Hooper SD, Krupp M, Foglierini M, Jouffre N, Huynen MA, Bork P: STRING: known and predicted protein-protein associations, integrated and transferred across organisms. Nucleic Acids Res 2005, 33: D433–7. 10.1093/nar/gki005
- Uetz P, Giot L, Cagney G, Mansfield TA, Judson RS, Knight JR, Lockshon D, Narayan V, Srinivasan M, Pochart P, Qureshi-Emili A, Li Y, Godwin B, Conover D, Kalbfleisch T, Vijayadamodar G, Yang M, Johnston M, Fields S, Rothberg JM: A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature 2000, 403: 623–627. 10.1038/35001009
- Salwinski L, Eisenberg D: Computational methods of analysis of protein-protein interactions. Curr Opin Struct Biol 2003, 13: 377–382. 10.1016/S0959-440X(03)00070-8
- KEGG Database [http://www.genome.jp/kegg/]
- EcoCyc Database [http://ecocyc.org/]
- DIP Database [http://dip.doe-mbi.ucla.edu/]
- Barabasi AL, Oltvai ZN: Network biology: understanding the cell's functional organization. Nat Rev Genet 2004, 5: 101–113. 10.1038/nrg1272
- Grigoriev A: On the number of protein-protein interactions in the yeast proteome. Nucleic Acids Res 2003, 31: 4157–4161. 10.1093/nar/gkg466
- Marcotte EM, Pellegrini M, Ng HL, Rice DW, Yeates TO, Eisenberg D: Detecting protein function and protein-protein interactions from genome sequences. Science 1999, 285: 751–753. 10.1126/science.285.5428.751
- Tsoka S, Ouzounis CA: Prediction of protein interactions: metabolic enzymes are frequently involved in gene fusion. Nat Genet 2000, 26: 141–142. 10.1038/79847
- Bowers PM, Pellegrini M, Thompson MJ, Fierro J, Yeates TO, Eisenberg D: Prolinks: a database of protein functional linkages derived from coevolution. Genome Biol 2004, 5: R35. 10.1186/gb-2004-5-5-r35
- Zheng Y, Roberts RJ, Kasif S: Genomic functional annotation using co-evolution profiles of gene clusters. Genome Biol 2002, 3: R60. 10.1186/gb-2002-3-11-research0060
- Lu LJ, Xia Y, Paccanaro A, Yu H, Gerstein M: Assessing the limits of genomic data integration for predicting protein networks. Genome Res 2005, 15: 945–953. 10.1101/gr.3610305
- Sun J, Li Y, Zhao Z: Phylogenetic profiles for the prediction of protein-protein interactions: how to select reference organisms? Biochem Biophys Res Commun 2007, 353: 985–991. 10.1016/j.bbrc.2006.12.146
- Jothi R, Przytycka TM, Aravind L: Discovering functional linkages and uncharacterized cellular pathways using phylogenetic profile comparisons: a comprehensive assessment. BMC Bioinformatics 2007, 8: 173. 10.1186/1471-2105-8-173
- Sun J, Zhao Z: Construction of phylogenetic profiles based on the genetic distance of hundreds of genomes. Biochem Biophys Res Commun 2007, 355: 849–853. 10.1016/j.bbrc.2007.02.048
- Lercher MJ, Blumenthal T, Hurst LD: Coexpression of neighboring genes in Caenorhabditis elegans is mostly due to operons and duplicate genes. Genome Res 2003, 13: 238–243. 10.1101/gr.553803
- Williams EJ, Bowles DJ: Coexpression of neighboring genes in the genome of Arabidopsis thaliana. Genome Res 2004, 14: 1060–1067. 10.1101/gr.2131104
- Lercher MJ, Urrutia AO, Hurst LD: Clustering of housekeeping genes provides a unified model of gene order in the human genome. Nat Genet 2002, 31: 180–183. 10.1038/ng887
- Shoemaker BA, Panchenko AR: Deciphering protein-protein interactions. Part I. Experimental techniques and databases. PLoS Comput Biol 2007, 3: e42. 10.1371/journal.pcbi.0030042
- Jansen R, Greenbaum D, Gerstein M: Relating whole-genome expression data with protein-protein interactions. Genome Res 2002, 12: 37–46. 10.1101/gr.205602
- NCBI RefSeq Database [ftp://ftp.ncbi.nih.gov/genomes/]
- NCBI E. coli COG Annotations [ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_K12/]
- Sun J, Xu J, Liu Z, Liu Q, Zhao A, Shi T, Li Y: Refined phylogenetic profiles method for predicting protein-protein interactions. Bioinformatics 2005, 21: 3409–3415. 10.1093/bioinformatics/bti532
- Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucl Acids Res 1997, 25: 3389–3402. 10.1093/nar/25.17.3389
- Date SV, Marcotte EM: Discovery of uncharacterized cellular systems by genome-wide analysis of functional linkages. Nat Biotechnol 2003, 21: 1055–1062. 10.1038/nbt861
- Overbeek R, Fonstein M, D'Souza M, Pusch GD, Maltsev N: The use of gene clusters to infer functional coupling. Proc Natl Acad Sci USA 1999, 96: 2896–2901. 10.1073/pnas.96.6.2896
- Thompson JD, Higgins DG, Gibson TJ: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 1994, 22: 4673–4680. 10.1093/nar/22.22.4673
- Felsenstein J: PHYLIP - phylogeny inference package (version 3.2). Cladistics 1989, 5: 164–166.
- KEGG Orthology (KO) [http://www.genome.jp/dbget-bin/get_htext?KO+-s+F+-f+F/]
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.