Homology-based prediction of interactions between proteins using Averaged One-Dependence Estimators
© Murakami and Mizuguchi; licensee BioMed Central Ltd. 2014
Received: 30 January 2014
Accepted: 17 June 2014
Published: 23 June 2014
Identification of protein-protein interactions (PPIs) is essential for a better understanding of biological processes, pathways and functions. However, experimental identification of the complete set of PPIs in a cell/organism (“an interactome”) is still a difficult task. To circumvent limitations of current high-throughput experimental techniques, it is necessary to develop high-performance computational methods for predicting PPIs.
In this article, we propose a new computational method to predict interaction between a given pair of protein sequences using features derived from known homologous PPIs. The proposed method is capable of predicting interaction between two proteins (of unknown structure) using Averaged One-Dependence Estimators (AODE) and three features calculated for the protein pair: (a) sequence similarities to a known interacting protein pair (FSeq), (b) statistical propensities of domain pairs observed in interacting proteins (FDom) and (c) a sum of edge weights along the shortest path between homologous proteins in a PPI network (FNet). Feature vectors were defined to lie in a half-space of the symmetrical high-dimensional feature space to make them independent of the protein order. The predictability of the method was assessed by a 10-fold cross validation on a recently created human PPI dataset with randomly sampled negative data, and the best model achieved an Area Under the Curve of 0.79 (pAUC0.5% = 0.16). In addition, the AODE trained on all three features (named PSOPIA) showed better prediction performance on a separate independent data set than a recently reported homology-based method.
Our results suggest that FNet, a feature representing proximity in a known PPI network between two proteins that are homologous to a target protein pair, contributes to the prediction of whether the target proteins interact or not. PSOPIA will help identify novel PPIs and estimate complete PPI networks. The method proposed in this article is freely available on the web at http://mizuguchilab.org/PSOPIA.
Keywords: Prediction of protein-protein interactions; Homology; Machine learning; Averaged One-Dependence Estimators (AODE)
Many biological processes and pathways are mediated by protein-protein interactions (PPIs). Identification of individual PPIs and the whole set of them in a cell/organism (“an interactome”) is, therefore, essential for a better understanding of biological functions of proteins in living cells and elucidating biochemical pathways. Various high-throughput experimental techniques, such as yeast two-hybrid assays and methods based on mass spectrometry, have been used to discover a large number of PPIs in several organisms. Although the amount of interaction data in public PPI databases continues to rise, many of them represent an incomplete interactome, because the available experimental techniques are expensive and can typically identify only a small part of the set of PPIs in specific organisms [1, 2].
To circumvent such limitations of the experimental techniques, a number of computational methods have been developed for predicting PPIs based on prior knowledge obtained from known interacting protein sequences and using machine-learning (ML) techniques [3–14]. Efforts have been made to develop methods based only on information about amino acid sequences, for example, by using the number of amino acid triplets in each sequence [6, 10, 13], a product of signatures defined as a set of subsequences, auto-correlation values of seven different physicochemical scales [11, 15] and normalized counts of single or pairs of consecutive amino acid residues. These purely sequence-based approaches have reported prediction accuracies of 70-84% on a human data set and about 70% on a yeast data set. Furthermore, information about protein domains has been incorporated in several other methods [16, 17]. Although domain information has been shown to be an informative feature for predicting PPIs, methods utilizing it alone are not applicable to proteins without domain assignments.
Identifying proteins homologous to a newly determined protein is often attempted to infer the biological functions of the new protein of unknown function, because homologues tend to have similar functions as well as similar three-dimensional structures. This deductive inference has been applied to the identification of PPIs, on the assumption that homologous proteins share similar interaction patterns as well as similar functions . A pair of interacting proteins in one species and their respective orthologs in another species, which are also known to interact with each other, have been traditionally defined as interaction-orthologs (interologs) [19, 20]. However, this idea can be extended to interaction-homologs, because orthologs and paralogs are not always clearly distinguished [18, 21].
There have been several computational studies about interologs. For example, Yu et al. found that PPIs can be transferred when, for the two pairs of proteins, the geometric mean of the sequence identities is >80% or the geometric mean of the e-values is <$10^{-70}$. Wiles et al. predicted PPIs from known interactions in five species and developed InterologFinder, a web server to search for information about predicted as well as experimentally determined PPIs for given proteins of interest. Chen et al. developed PPISearch, a web server to search for homologous PPIs given a single protein pair of interest against an integrated database of PPIs in 576 species. Gallone et al. developed a Perl module to search for putative PPIs and prioritize them based on interologs. Garcia et al. developed BIPS, a web server to predict PPIs based on information about known PPIs in multiple species and additional information about domain interactions and GO annotations. It uses BIANA, an integrated database of PPIs from several repositories [21, 24]. In these prediction approaches, collecting as many PPIs as possible in multiple species is an important factor for the reliability of the predicted interactions.
Furthermore, developing a confidence score for PPIs is also key to improving the reliability of the prediction. Most of the previously reported methods used a simple joint sequence identity or e-value for two pairs of interacting proteins [18, 20, 21], whereas one unified score based on the level of homology, conservation of the interactions across multiple species and the number of supporting experimental types was also proposed. These methods are largely dependent on the existence of orthologous or homologous PPIs, i.e., it would be very difficult to detect a novel PPI with no interologs in an integrated database.
To improve the discrimination power of the homology-based PPI prediction, we here apply Averaged One-Dependence Estimators (AODE) to this problem. The AODE is an ML algorithm, a variant of the Naïve Bayes classifier (NBC), which weakens NBC’s independence assumption by allowing a one-dependence. So far, the AODE has been used to combine the outputs of several protein interaction prediction methods; it has been shown to be useful for extracting distinctive information from large imbalanced datasets and it can also be retrained easily and efficiently. Furthermore, it has been reported to be more accurate than NBC, and it can efficiently process a large number of training feature vectors in a high dimensional space without increasing the computational cost significantly [25, 27]. In addition, the AODE requires neither model selection nor parameter optimization. These strengths, therefore, allowed us to train the AODE on massive PPI data collected from several repositories without incurring a large computational cost.
In this study, the AODE is trained using three features: (a) sequence similarities to known interacting proteins (FSeq), (b) statistical propensities of domain pairs observed in interacting proteins (FDom) and (c) a sum of edge weights along the shortest path between homologous proteins in a PPI network (FNet). The idea of feature (c) is based on the hypothesis that a target protein pair would have more potential to interact if their homologous proteins exist in proximity of each other in a known PPI network. Such a proximal pair, even if not known to interact directly, may form a complex with other proximal proteins or reside in common subcellular locations, thereby increasing the chances of their homologues interacting directly. In a previous study, the topology of a PPI network has been used to predict interactions missing in the network (i.e., those not detected by large-scale experiments), by searching for defective cliques (with a few missing edges) in the PPI network graph . However, this approach can be applied only to proteins with at least one experimentally defined interaction. In addition, the computational cost of this method has been reported to be expensive. Our method, in contrast, searches for a pair of sequences in the graph homologous to the query proteins, which may be unannotated and with no known interactions. Then, a sum of edge weights along the shortest path between them is computed and trained with other features, thus dramatically reducing the computational cost. We demonstrate high predictive performance of the AODE on a recently created human PPI data set with randomly sampled negative data , which had been used for benchmarking previously reported sequence-based methods.
In this section, we first introduce the data set used for training and testing, and describe three features calculated for a pair of proteins. Next, we describe how to construct a feature vector, dealing with symmetry in the protein order. Then, we describe the AODE for probabilistic classification of protein pairs into interacting (positive) or non-interacting (negative) classes, and introduce prediction accuracy measures to assess prediction models developed and the validation method.
Preparation of a PPI data set
Dset1 is a recently created non-redundant human PPI data set (ensuring ≤40% pairwise sequence identity and protein sequence length of >50 amino acids) obtained from the Human Protein Reference Database (HPRD; release 7; ), created by . This data set was divided into three independent sets, each of which contained about 2,000 proteins with about 5,000 positive pairs and 2,000,000 negative pairs, i.e., 400 times as many non-interacting as interacting protein pairs. The negative pairs were generated by randomly pairing proteins that appeared in the positive pairs and removing real positive pairs. This is a highly imbalanced data set in which the classification categories are unequally represented. Park and Marcotte used these subsets to benchmark four different sequence-based PPI prediction methods [29, 31] (see Additional file 1: Table S1).
Dset2 was constructed to compare prediction performance of the AODE trained on Dset1 with BIPS, a recently developed homology-based prediction server. First, a set of human physical PPIs was obtained from the BioGrid dataset (release 3.2.95, December 2012). Then, from this dataset, we removed PPIs found in the previous BioGrid dataset (release 3.1.93, October 2012) compiled after BIPS was released, ensuring that Dset2 includes only recently discovered PPIs. In addition, we used only a set of interacting proteins, each of which was annotated in UniProt. This procedure left a set of 4,430 PPIs. Finally, negative PPI pairs 400 times larger in number than the positive ones were generated in a manner similar to that of Dset1.
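The random generation of negative pairs described above can be illustrated with a minimal Python sketch. The function name `sample_negatives`, the use of rejection sampling and the fixed random seed are assumptions made for illustration only; the procedure actually used to build Dset1 and Dset2 may differ in its details.

```python
import random

def sample_negatives(positive_pairs, ratio=400, seed=0):
    """Randomly pair proteins seen in positive pairs, skipping known positives.

    Illustrative sketch only: unordered pairs are stored as frozensets so that
    (A, B) and (B, A) count as the same pair.
    """
    rng = random.Random(seed)
    positives = {frozenset(pair) for pair in positive_pairs}
    proteins = sorted({p for pair in positive_pairs for p in pair})
    n = len(proteins)
    # Number of distinct-protein pairs that are not known positives.
    available = n * (n - 1) // 2 - sum(1 for p in positives if len(p) == 2)
    # Cap the request so the rejection loop below always terminates.
    target = min(ratio * len(positives), available)
    negatives = set()
    while len(negatives) < target:
        pair = frozenset(rng.sample(proteins, 2))
        if pair not in positives:
            negatives.add(pair)
    return negatives
```

With about 2,000 proteins there are roughly 2,000,000 possible pairs, which is consistent with the 400:1 ratio of negative to positive pairs quoted for each subset of Dset1.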
Homology-based features for a pair of proteins
- (a) Sequence similarities to known interacting proteins (FSeq): Known interacting pairs with sequence similarity to a target pair ($S_A$, $S_B$) were searched by running BLAST (version 2.2.25+) against the database created from the sequences in Dset1, with an e-value cutoff of ≤$10^{2}$. (The high e-value cutoff was chosen to allow for partial matches.) Then, of these pairs, the interacting pair ($T_A$, $T_B$) with the smallest value of $\sqrt{\text{e-value}_A^2 + \text{e-value}_B^2}$ was selected, where $\text{e-value}_x$ is the BLAST e-value between $S_x$ and $T_x$ and $x$ is either A or B. The minimum coverage (mincov) for $S_x$ and $T_x$ was also calculated as the number of positive matches (i.e., alignment positions with a positive BLOSUM62 score) divided by the length of the longer sequence. These two BLAST e-values and two minimum coverage values, ($\text{e-value}_A$, $\text{mincov}_A$) for $S_A$ and ($\text{e-value}_B$, $\text{mincov}_B$) for $S_B$, were used as features for training (Figure 1-a). If no known homologous interacting pair was found, an e-value of $10^{2}$ and a mincov of 0 were assigned to FSeq.
- (b) Statistical propensities of domain pairs observed in interacting proteins (FDom): Each sequence in Dset1 was scanned against Pfam-A (release 25.0; Pfam-A.hmm), and the number of Pfam domain pairs ($d_A$, $d_B$) that appeared in either positive or negative pairs was counted. Knowledge-based interaction propensities for Pfam domain pairs were calculated as:
- (c) A sum of edge weights along the shortest path between homologous proteins in the PPI network (FNet): BLAST hits (with an e-value cutoff ≤$10^{-3}$) for each sequence in a target pair ($S_A$, $S_B$) were collected from the database created from Dset1. Then, for each possible pair of hits ($p_A$, $p_B$), where $p_A$ and $p_B$ were among the hits for $S_A$ and $S_B$, respectively, a sum of edge weights along the shortest path (the shortest path weight; SPW) was calculated. In this study, we set the default edge weight to be 1.0. The shortest path between $p_A$ and $p_B$ was calculated using Dijkstra’s shortest path algorithm implemented in the Boost::Graph Perl module (version 1.4; downloaded from http://search.cpan.org/~dburdick/Boost-Graph/), which is a Perl interface to the Boost-Graph C++ libraries (release 1.47.0; downloaded from http://www.boost.org/). The lowest SPW was used as a feature for training. If no SPW was defined for any of the pairs ($p_A$, $p_B$), an FNet value of −1 was given to the target pair (Figure 1-c).
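The selection rules for FSeq and FNet can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' Perl/Boost implementation: the tuple layout of the BLAST results, the function names (`fseq`, `build_network`, `shortest_path_weight`, `fnet`) and the treatment of a hit that is shared by both targets (distance 0) are hypothetical choices. The default values (e-value $10^{2}$, mincov 0, FNet −1, edge weight 1.0) are taken from the text.

```python
import heapq
import math

NO_TEMPLATE = (1e2, 0.0, 1e2, 0.0)  # defaults from the text: e-value 10^2, mincov 0
NO_PATH = -1.0                      # FNet value when no path is defined

def fseq(candidates):
    """Return the template pair minimising sqrt(e_A^2 + e_B^2).

    candidates: iterable of (e_a, mincov_a, e_b, mincov_b) tuples, one per
    known interacting pair matched by both target sequences (assumed layout).
    """
    candidates = list(candidates)
    if not candidates:
        return NO_TEMPLATE
    return min(candidates, key=lambda c: math.hypot(c[0], c[2]))

def build_network(interactions, weight=1.0):
    """Undirected PPI graph as a dict of dicts with a uniform edge weight."""
    graph = {}
    for a, b in interactions:
        graph.setdefault(a, {})[b] = weight
        graph.setdefault(b, {})[a] = weight
    return graph

def shortest_path_weight(graph, source, target):
    """Dijkstra's algorithm; returns None when target is unreachable."""
    if source not in graph or target not in graph:
        return None
    best = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        dist, node = heapq.heappop(queue)
        if node == target:
            return dist
        if dist > best[node]:
            continue  # stale queue entry
        for neighbour, w in graph[node].items():
            candidate = dist + w
            if candidate < best.get(neighbour, math.inf):
                best[neighbour] = candidate
                heapq.heappush(queue, (candidate, neighbour))
    return None

def fnet(graph, hits_a, hits_b):
    """Lowest shortest-path weight over all pairs of BLAST hits, or -1."""
    weights = []
    for p_a in hits_a:
        for p_b in hits_b:
            w = shortest_path_weight(graph, p_a, p_b)
            if w is not None:
                weights.append(w)
    return min(weights) if weights else NO_PATH
```

For example, in a toy network with edges P1-P2 and P2-P3, a target pair whose hits are P1 and P3 receives an FNet value of 2.0, whereas a pair whose hits lie in disconnected components receives −1.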
Constructing a feature vector
After the construction of FVs, feature values for i-th feature of the FVs used for training were discretized using the entropy-based discretization method . The optimized intervals (split points), the number of which varied with each feature, were then applied to the construction of FVs for testing.
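To make the idea of an entropy-based split point concrete, the following Python sketch finds the single cut on one feature that minimises the weighted class entropy of the two resulting intervals. It is a simplified illustration, not the cited discretization method: a full entropy-based discretizer would normally apply such splits recursively with a stopping criterion, which is omitted here, and the function names are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(values, labels):
    """Return the cut point with the lowest weighted class entropy, or None.

    Only one binary split is searched; None means no cut lowers the entropy.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_cut, best_score = None, entropy(labels)
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between identical feature values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        if score < best_score:
            # Place the cut midway between the two neighbouring values.
            best_cut, best_score = (pairs[i - 1][0] + pairs[i][0]) / 2, score
    return best_cut
```

Split points learned in this way on the training FVs would then be reused unchanged to discretize the test FVs, as described above.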
Averaged One-Dependence Estimator (AODE)
A probabilistic graphical model of the AODE modeled in this study is shown in Figure 3.
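For readers unfamiliar with the AODE, the general algorithm can be sketched as follows: each feature in turn acts as a "super-parent" on which all other features depend, and the resulting one-dependence estimates of $P(y, \mathbf{x})$ are averaged. The Python class below is a minimal sketch of that general scheme for discretized features, assuming Laplace smoothing, a minimum parent frequency of 1 and a fallback to class priors when no parent value was seen in training. These choices and all names are assumptions, not the configuration used for PSOPIA.

```python
from collections import Counter

class AODE:
    """Minimal Averaged One-Dependence Estimators for discrete features."""

    def fit(self, rows, labels):
        self.n = len(rows)
        self.classes = sorted(set(labels))
        n_feat = len(rows[0])
        # Number of distinct values per feature, used for Laplace smoothing.
        self.n_values = [len({r[j] for r in rows}) for j in range(n_feat)]
        self.class_count = Counter(labels)
        self.value_count = Counter()  # key: (i, x_i)
        self.joint = Counter()        # key: (y, i, x_i)
        self.triple = Counter()       # key: (y, i, x_i, j, x_j)
        for row, y in zip(rows, labels):
            for i, xi in enumerate(row):
                self.value_count[(i, xi)] += 1
                self.joint[(y, i, xi)] += 1
                for j, xj in enumerate(row):
                    if j != i:
                        self.triple[(y, i, xi, j, xj)] += 1
        return self

    def predict_proba(self, row, min_count=1):
        scores = {}
        for y in self.classes:
            total = 0.0
            for i, xi in enumerate(row):
                if self.value_count[(i, xi)] < min_count:
                    continue  # parent value too rare to give reliable estimates
                # Smoothed estimate of P(y, x_i).
                p = (self.joint[(y, i, xi)] + 1.0) / (
                    self.n + len(self.classes) * self.n_values[i])
                for j, xj in enumerate(row):
                    if j != i:
                        # Smoothed estimate of P(x_j | y, x_i).
                        p *= (self.triple[(y, i, xi, j, xj)] + 1.0) / (
                            self.joint[(y, i, xi)] + self.n_values[j])
                total += p
            scores[y] = total
        if sum(scores.values()) == 0.0:
            # No usable parent: fall back to class priors (an assumption).
            scores = {y: self.class_count[y] / self.n for y in self.classes}
        norm = sum(scores.values())
        return {y: s / norm for y, s in scores.items()}
```

Training only requires counting co-occurrences, which is why the model can be retrained quickly on large data sets; the probability returned for the positive class can then be compared with a threshold to call an interaction.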
Evaluation measures and validation
Performances of AODEs were estimated by the Area Under the Curve (AUC), which gives an AUC = 1.0 for a perfect model and gives an AUC = 0.5 for a random model for which a Receiver Operating Characteristic (ROC) curve is drawn as a diagonal line. A ROC curve is most often used for model comparison and is represented by plotting sensitivity (true positive rate; TPR, or recall) against 1.0 – specificity (false positive rate; FPR). Sensitivity (recall) measures the proportion of the known positive pairs that are correctly predicted as interacting and is defined as TP/(TP + FN), and specificity measures the proportion of the known negative pairs that are correctly predicted as non-interacting and is defined as TN/(TN + FP), where TP is the number of true positives (i.e., known positive pairs correctly predicted as interacting), FP is the number of false positives (i.e., known negative pairs incorrectly predicted as interacting), TN is the number of true negatives (i.e., known negative pairs correctly predicted as non-interacting), and FN is the number of false negatives (i.e., known positive pairs incorrectly predicted as non-interacting). The AUC is known to be insensitive to imbalanced data  and is therefore expected to be a reliable measure of prediction performance. In addition, performances of AODEs were also estimated by a normalized partial AUC up to the FPR ≤ x% (pAUC x%), following  and . We set x to be 0.5. A prediction model with a high pAUC can predict more true positives with few FPs, so such a model is known to be most useful for users to identify PPIs from the top-ranked predictions .
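The AUC and the partial AUC can be computed directly from ranked prediction scores, as in the Python sketch below. The normalization used here, dividing the area under the ROC curve up to the FPR limit by that limit so that a perfect model scores 1.0, is an assumption about what "normalized" means; the exact definition follows the cited works. Function names are hypothetical, and both classes must be present in `labels`.

```python
def roc_points(scores, labels):
    """ROC curve as (FPR, TPR) points; labels are 1 (positive) or 0 (negative)."""
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        j = i
        # Process tied scores together so ties do not create artificial steps.
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            tp += ranked[j][1]
            fp += 1 - ranked[j][1]
            j += 1
        points.append((fp / neg, tp / pos))
        i = j
    return points

def auc(scores, labels, max_fpr=1.0):
    """Trapezoidal area under the ROC curve up to max_fpr, divided by max_fpr."""
    points = roc_points(scores, labels)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 >= max_fpr:
            break
        if x1 > max_fpr:
            # Interpolate the last segment so that it ends exactly at max_fpr.
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area / max_fpr
```

Under this convention, pAUC0.5% corresponds to `auc(scores, labels, max_fpr=0.005)`.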
Furthermore, we used two other common measures, MCC (Matthews correlation coefficient) and the F-measure. MCC indicates the degree of the correlation between the actual and predicted classes of the protein pair, and its values range between 1, where all the predictions are correct, and −1, where none are correct. MCC is defined as $\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$. The F-measure combines precision and recall into their harmonic mean, and is defined as 2 × precision × recall/(precision + recall), where precision is defined as TP/(TP + FP) and measures the proportion of the pairs predicted as interacting that are known positive pairs.
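The threshold-dependent measures defined above follow directly from the four confusion counts. The short Python sketch below implements those definitions; returning 0.0 when a denominator is zero is a convention chosen for this illustration, and the function name is hypothetical.

```python
import math

def classification_metrics(tp, fp, tn, fn):
    """Precision, recall, specificity, F-measure and MCC from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f_measure": f_measure,
        "mcc": mcc,
    }
```

Because the harmonic mean is dominated by the smaller of its two inputs, a method cannot obtain a high F-measure by trading away almost all recall for precision, or vice versa.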
To evaluate the prediction performance of each AODE, a 10-fold cross validation (CV) was carried out. In the 10-fold CV, a data set was divided into 10 subsets, and each subset was used as a testing set and the remaining subsets were used as a training set. This process was repeated 10 times, and then the prediction performances were averaged over all the test results.
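The splitting scheme can be sketched as a small Python generator. It assumes a plain random, unstratified partition of the protein pairs, which is only one possible reading of the procedure; the text does not state how the subsets were drawn, and the function name is hypothetical.

```python
import random

def k_fold_indices(n_items, k=10, seed=0):
    """Yield (train, test) index lists; each item is tested exactly once."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for f, fold in enumerate(folds) if f != i for idx in fold]
        yield train, test
```

A model is fitted on each `train` list and scored on the matching `test` list, and the ten scores are then averaged.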
In this section, we first assess critically the AODE models based on three homology-based features encoded in a single feature vector. We then demonstrate high predictive performance of our proposed method using a large, human PPI data set compiling recently identified interactions.
Can proximity between homologous proteins in a PPI network contribute to predictions?
We hypothesized that two proteins would have more potential to interact, if their homologous proteins exist in proximity of each other in a known PPI network. Such a proximal pair, even if not known to interact directly, may form a complex with other proximal proteins or reside in common subcellular locations, thereby increasing the chances of their homologues interacting directly. To confirm our hypothesis, we divided Dset1 into 10 subsets, treated each subset as a test set and constructed a PPI network from the remaining subsets. For each pair in the test set, we identified homologous protein pairs (with a BLAST e-value cut-off ≤$10^{-3}$) and obtained the smallest SPW (a sum of edge weights along the shortest path; see Methods) in the PPI network. In this study, an edge weight of 1.0 was used as a default weight value. This process was repeated 10 times, and the average number of protein pairs with a given SPW was counted.
Prediction performance of AODEs
Performances of AODEs and NBCs trained on Dset1

Features                 AUC (AODE)     AUC (NBC)
(row label lost)         0.69 ± 0.01    0.69 ± 0.01
(row label lost)         0.57 ± 0.01    0.57 ± 0.01
(row label lost)         0.77 ± 0.01    0.77 ± 0.01
FSeq + FDom              0.71 ± 0.01    0.70 ± 0.01
FSeq + FNet              0.79 ± 0.01    0.77 ± 0.01
FDom + FNet              0.79 ± 0.01    0.77 ± 0.01
FSeq + FDom + FNet       0.79 ± 0.01    0.77 ± 0.01
Evaluation of PSOPIA using an independent data set
In order to evaluate our proposed method further, we compared PSOPIA (AODE-VII) with BIPS, a recently developed prediction server based on homologues of two interacting proteins . Because BIPS is based on large, up-to-date PPI data, integrated from several PPI databases by using the BIANA software framework , it is considered to have advantages over other similar methods in retrieving homologous PPIs [18, 22]. In addition, BIPS can use heterogeneous information similar to PSOPIA for filtering out prediction results, such as information about domain-domain interactions (DDIs) in iPfam  and 3DID  and annotations from UniProt  and GO , as well as BLAST-based sequence similarities to a known interacting protein pair. For these reasons, we evaluated the predictability of both PSOPIA and BIPS on Dset2, a data set, which was compiled from a recent release of the BioGrid database and which included only the PPIs identified after BIPS was developed and Dset1 was created (see Methods).
PSOPIA was retrained on the whole of Dset1 and a sequence database used for BLAST was formatted with all the sequences in Dset1. A threshold value of 0.293 was chosen, because it gave the highest F-measure (0.160) in the 10-fold CV on Dset1 (recall = 15.5%, precision = 17.0%, specificity = 99.8%, MCC = 0.160). For BIPS, since we were unable to optimize the parameters, we used the default values provided by the web server: joint identities (the geometric mean of individual BLAST sequence identities) ≥ 80%, joint e-values (the geometric mean of individual BLAST e-values) ≤ $1.0 \times 10^{-10}$ and template sequence coverage ≥ 80% (see  for more details of these parameters). In addition to the default “filter by template interactions”, we also examined two additional filtering conditions: information about DDIs in iPfam or 3DID, and GO annotations (biological process, cellular component or molecular function). The BIPS server accepts sequences of interest or a list of protein identifiers, evaluates potential interactions between all possible sequence pairs and reports only likely (high-scoring) interactions. Therefore, we submitted all the unique sequences in Dset2 to the BIPS server, retrieved the results and defined all the reported pairs to be positive predictions (interacting) and all non-reported pairs to be negative predictions (non-interacting). If a positively predicted pair was found in either the positive or the negative set of Dset2, it was regarded as a true positive or a false positive, respectively. If a negatively predicted pair was found in either the positive or the negative set of Dset2, it was regarded as a false negative or a true negative, respectively. All the other predicted interactions were ignored. In this comparison, we aimed to evaluate the true predictability of these methods, i.e., whether they can predict novel PPIs that have never been observed before, rather than their ability to retrieve already known PPIs from a database.
Thus, we excluded from the evaluation any protein pair ($S_A$, $S_B$) if either BIPS or PSOPIA detected a known interacting protein pair ($T_A$, $T_B$) in their database (with BLAST e-values of 0 for $S_A$-$T_A$ and $S_B$-$T_B$).
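The bookkeeping described above, in which reported pairs count as positive predictions, non-reported pairs as negative predictions, and pairs with an identical known template are excluded, can be expressed compactly with set operations. The Python sketch below is an illustration under the assumption that every pair is an unordered pair of protein identifiers; the function and argument names are hypothetical.

```python
def confusion_from_reported(reported, positives, negatives, excluded=()):
    """Count TP, FP, TN and FN for a method that only reports likely pairs.

    reported:  pairs returned by the predictor (positive predictions)
    positives: known interacting pairs in the benchmark
    negatives: sampled non-interacting pairs in the benchmark
    excluded:  pairs removed from the evaluation (e.g. identical templates)
    """
    def as_set(pairs):
        # frozenset makes (A, B) and (B, A) the same unordered pair.
        return {frozenset(pair) for pair in pairs}

    skip = as_set(excluded)
    reported = as_set(reported) - skip
    positives = as_set(positives) - skip
    negatives = as_set(negatives) - skip
    tp = len(reported & positives)
    fp = len(reported & negatives)
    fn = len(positives - reported)
    tn = len(negatives - reported)
    # Reported pairs found in neither benchmark set are ignored.
    return tp, fp, tn, fn
```

Precision and recall for each method then follow from these four counts.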
Evaluation of true prediction performance on Dset2
PSOPIA (θ = 0.293, the highest F)
PSOPIA (θ = 0.670)
PSOPIA (θ = 0.890)
(I) BIPS, only filtered by the template interactions
(A) Template: Taxonomy ID = 9606 (human)
(B) Template: all species
(II) BIPS, filtered by known DDIs (iPfam or 3DID)
(A) Template: Taxonomy ID = 9606 (human)
(B) Template: all species
(III) BIPS, filtered by known DDIs (iPfam or 3DID) and GO; biological process, cellular component or molecular function
(A) Template: Taxonomy ID = 9606 (human)
(B) Template: all species
We have proposed a new AODE-based method for predicting PPIs based on known homologous PPIs by using three different features, FSeq, FDom and FNet. In constructing Dset1  used for training and testing the AODEs, randomly sampled protein pairs that had not been known to interact with each other were used as a negative data set, because of the limited availability of high-quality negative PPI data, either manually curated or experimentally determined (for example, only 1,892 negative PPIs constructed with 1,257 proteins in the negatome database ). In reality the number of negative PPIs should be much larger than that of positive PPIs [29, 31] and therefore, we trained and evaluated the AODEs on a data set with a large number of negative data. The AODEs were able to deal with this large and imbalanced PPI dataset effectively and they were easily trained within several CPU minutes.
In order to deal with symmetry in the protein order and allow the concatenation of a set of features for individual proteins in a FV, several kernels have been developed in sequence-based methods using a support vector machine (SVM) [6, 7, 10]. In this study, we proposed a simple geometric selection of FVs in a half space of the symmetrical FV space. Although no comparison can be made between these two approaches, our FV selection method is simple and can be incorporated in any ML method.
The predictability of the AODEs, which include a single dependency between the features, was illustrated in a 10-fold CV on Dset1, in which the AODE trained using all three features, named PSOPIA, achieved the highest performance in terms of both AUC (0.79) and pAUC0.5% (0.16). In comparison with the NBC, which assumes conditional independence of all three features, PSOPIA improved AUC by 0.02 (p-value < 2.8e-08) and pAUC10% by 0.03 (p-value = 6.4e-08). We further tested PSOPIA on Dset2, an independent data set, and compared its performance with that of BIPS, a recently reported homology-based method. When the identification of interacting protein pairs already in the database was excluded, PSOPIA (threshold = 0.670) achieved a higher precision (13.71%) than BIPS (2.72%) at a recall level of 0.5 to 0.6%, thus demonstrating higher predictability than BIPS in terms of the F-measure. The F-measure is generally known as a useful and reliable measure for comparing methods that have different trade-offs between precision and recall.
Further improvements of PSOPIA may be possible by creating a large up-to-date PPI dataset integrated from several databases, because a larger PPI database provides a better chance of detecting known PPIs homologous to a target protein pair. It is still unclear, however, whether we should include cross-species data in such a database. In this study, we evaluated BIPS on Dset2 and showed that the use of interactions from different species did not reduce the false positives. Also, Park  and Pitre et al. investigated whether interactions for a pair of proteins in a target species can be predicted using a method trained on known PPI data from different species and observed no significant improvements in the performance of the predictors. Thus, it remains to be seen whether the AODE, a probability-based ML method, can improve the prediction performance using interactions from different species as a training dataset. Moreover, it will be worth attempting to change edge weights in a PPI network and distinguish the interaction type, for example, using numerical parameters given by Kerrien et al. or similarities in GO annotations .
In this study, we have illustrated that proximity in a known PPI network between two proteins homologous to a target protein pair contributes to the prediction of whether the target proteins interact or not. Then, we have applied this feature FNet to the PPI prediction with two other features, FSeq and FDom. Our best AODE, which achieved an AUC of 0.79 (pAUC0.5% = 0.16) in a 10-fold CV on a highly imbalanced data set, will hopefully contribute to the identification of novel PPIs and the estimation of complete PPI networks. The method proposed in this study is freely available on the web at http://mizuguchilab.org/PSOPIA, and Dset2 used for the evaluation can be downloaded from the same URL.
This work was supported by Platform for Drug Discovery, Informatics, and Structural Life Science from the Ministry of Education, Culture, Sports, Science and Technology, Japan. Furthermore, this study was also in part supported by the Industrial Technology Research Grant Program in 2007 (Grant Number 07C46056a) from New Energy and Industrial Technology Development Organization (NEDO) of Japan, and also by Grants-in-Aid for Scientific Research from the Ministry of Education, Culture, Sports, Science, and Technology (Grant Numbers 25430186 and 25293079) and from the Ministry of Health, Labor, and Welfare (“The Adjuvant database project”) to K.M. We thank Shandar Ahmad for carefully reading the manuscript and for helpful comments.
- von Mering C, Krause R, Snel B, Cornell M, Oliver SG, Fields S, Bork P: Comparative assessment of large-scale data sets of protein-protein interactions. Nature. 2002, 417 (6887): 399-403.View ArticlePubMedGoogle Scholar
- Han JD, Dupuy D, Bertin N, Cusick ME, Vidal M: Effect of sampling on topology predictions of protein-protein interaction networks. Nat Biotechnol. 2005, 23 (7): 839-844.View ArticlePubMedGoogle Scholar
- Bock JR, Gough DA: Predicting protein–protein interactions from primary structure. Bioinformatics. 2001, 17 (5): 455-460.View ArticlePubMedGoogle Scholar
- Sprinzak E, Margalit H: Correlated sequence-signatures as markers of protein-protein interaction. J Mol Biol. 2001, 311 (4): 681-692.View ArticlePubMedGoogle Scholar
- Gomez SM, Noble WS, Rzhetsky A: Learning to predict protein-protein interactions from protein sequences. Bioinformatics. 2003, 19 (15): 1875-1881.View ArticlePubMedGoogle Scholar
- Ben-Hur A, Noble WS: Kernel methods for predicting protein-protein interactions. Bioinformatics. 2005, 21 (Suppl 1): i38-46.View ArticlePubMedGoogle Scholar
- Martin S, Roe D, Faulon JL: Predicting protein-protein interactions using signature products. Bioinformatics. 2005, 21 (2): 218-226.View ArticlePubMedGoogle Scholar
- Nanni L, Lumini A: An ensemble of K-local hyperplanes for predicting protein-protein interactions. Bioinformatics. 2006, 22 (10): 1207-1210.View ArticlePubMedGoogle Scholar
- Pitre S, Dehne F, Chan A, Cheetham J, Duong A, Emili A, Gebbia M, Greenblatt J, Jessulat M, Krogan N, Luo X, Golshani A: PIPE: a protein-protein interaction prediction engine based on the re-occurring short polypeptide sequences between known interacting protein pairs. BMC Bioinformatics. 2006, 7: 365-View ArticlePubMed CentralPubMedGoogle Scholar
- Shen J, Zhang J, Luo X, Zhu W, Yu K, Chen K, Li Y, Jiang H: Predicting protein-protein interactions based only on sequences information. Proc Natl Acad Sci U S A. 2007, 104 (11): 4337-4341.View ArticlePubMed CentralPubMedGoogle Scholar
- Guo Y, Yu L, Wen Z, Li M: Using support vector machine combined with auto covariance to predict protein-protein interactions from protein sequences. Nucleic Acids Res. 2008, 36 (9): 3025-3030.View ArticlePubMed CentralPubMedGoogle Scholar
- Roy S, Martinez D, Platero H, Lane T, Werner-Washburne M: Exploiting amino acid composition for predicting protein-protein interactions. PLoS One. 2009, 4 (11): e7813.
- Yu CY, Chou LC, Chang DT: Predicting protein-protein interactions in unbalanced data using the primary structure of proteins. BMC Bioinformatics. 2010, 11: 167.
- Yu J, Guo M, Needham CJ, Huang Y, Cai L, Westhead DR: Simple sequence-based kernels do not predict protein-protein interactions. Bioinformatics. 2010, 26 (20): 2610-2614.
- Guo Y, Li M, Pu X, Li G, Guang X, Xiong W, Li J: PRED_PPI: a server for predicting protein-protein interactions based on sequence data with probability assignment. BMC Res Notes. 2010, 3: 145.
- Deng M, Mehta S, Sun F, Chen T: Inferring domain-domain interactions from protein-protein interactions. Genome Res. 2002, 12 (10): 1540-1548.
- Hayashida M, Kamada M, Song J, Akutsu T: Conditional random field approach to prediction of protein-protein interactions using domain information. BMC Syst Biol. 2011, 5 (Suppl 1): S8.
- Chen CC, Lin CY, Lo YS, Yang JM: PPISearch: a web server for searching homologous protein-protein interactions across multiple species. Nucleic Acids Res. 2009, 37 (Web Server issue): W369-375.
- Matthews LR, Vaglio P, Reboul J, Ge H, Davis BP, Garrels J, Vincent S, Vidal M: Identification of potential interaction networks using sequence-based searches for conserved protein-protein interactions or “interologs”. Genome Res. 2001, 11 (12): 2120-2126.
- Yu H, Luscombe NM, Lu HX, Zhu X, Xia Y, Han JD, Bertin N, Chung S, Vidal M, Gerstein M: Annotation transfer between genomes: protein-protein interologs and protein-DNA regulogs. Genome Res. 2004, 14 (6): 1107-1118.
- Garcia-Garcia J, Schleker S, Klein-Seetharaman J, Oliva B: BIPS: BIANA Interolog Prediction Server. A tool for protein-protein interaction inference. Nucleic Acids Res. 2012, 40 (Web Server issue): W147-151.
- Wiles AM, Doderer M, Ruan J, Gu TT, Ravi D, Blackman B, Bishop AJ: Building and analyzing protein interactome networks by cross-species comparisons. BMC Syst Biol. 2010, 4: 36.
- Gallone G, Simpson TI, Armstrong JD, Jarman AP: Bio::Homology::InterologWalk - a Perl module to build putative protein-protein interaction networks through interolog mapping. BMC Bioinformatics. 2011, 12: 289.
- Garcia-Garcia J, Guney E, Aragues R, Planas-Iglesias J, Oliva B: Biana: a software framework for compiling biological interactions and analyzing networks. BMC Bioinformatics. 2010, 11: 56.
- Webb GI, Boughton JR, Wang Z: Not so naive Bayes: Aggregating one-dependence estimators. Mach Learn. 2005, 58 (1): 5-24.
- Garcia-Jimenez B, Juan D, Ezkurdia I, Andres-Leon E, Valencia A: Inference of functional relations in predicted protein networks with a machine learning approach. PLoS One. 2010, 5 (4): e9969.
- Webb GI, Boughton JR, Zheng F, Ting KM, Salem H: Learning by extrapolation from marginal to full-multivariate probability distributions: decreasingly naive Bayesian classification. Mach Learn. 2012, 86 (2): 233-272.
- Yu H, Paccanaro A, Trifonov V, Gerstein M: Predicting interactions in protein networks by completing defective cliques. Bioinformatics. 2006, 22 (7): 823-829.
- Park Y, Marcotte EM: Revisiting the negative example sampling problem for predicting protein-protein interactions. Bioinformatics. 2011, 27 (21): 3024-3028.
- Stark C, Breitkreutz BJ, Reguly T, Boucher L, Breitkreutz A, Tyers M: BioGRID: a general repository for interaction datasets. Nucleic Acids Res. 2006, 34 (Database issue): D535-539.
- Park Y: Critical assessment of sequence-based protein-protein interaction prediction methods that do not require homologous protein sequences. BMC Bioinformatics. 2009, 10: 419.
- UniProt Consortium: Reorganizing the protein space at the Universal Protein Resource (UniProt). Nucleic Acids Res. 2012, 40 (Database issue): D71-75.
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol. 1990, 215 (3): 403-410.
- Henikoff S, Henikoff JG: Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci U S A. 1992, 89 (22): 10915-10919.
- Punta M, Coggill PC, Eberhardt RY, Mistry J, Tate J, Boursnell C, Pang N, Forslund K, Ceric G, Clements J, Heger A, Holm J, Sonnhammer EL, Eddy SR, Bateman A, Finn RD: The Pfam protein families database. Nucleic Acids Res. 2012, 40 (Database issue): D290-301.
- Fayyad UM, Irani KB: Multi-interval discretization of continuous-valued attributes for classification learning. Proceedings of the International Joint Conference on Uncertainty in AI (Q334 I571 1993). 1993, 1022-1027.
- Fawcett T: An introduction to ROC analysis. Pattern Recognition Lett. 2006, 27 (8): 861-874.
- Matthews BW: Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim Biophys Acta. 1975, 405 (2): 442-451.
- Hripcsak G, Rothschild AS: Agreement, the f-measure, and reliability in information retrieval. J Am Med Inform Assoc. 2005, 12 (3): 296-298.
- Murakami Y, Mizuguchi K: Applying the Naive Bayes classifier with kernel density estimation to the prediction of protein-protein interaction sites. Bioinformatics. 2010, 26 (15): 1841-1848.
- Finn RD, Marshall M, Bateman A: iPfam: visualization of protein-protein interactions in PDB at domain and amino acid resolutions. Bioinformatics. 2005, 21 (3): 410-412.
- Stein A, Ceol A, Aloy P: 3did: identification and classification of domain-based interactions of known three-dimensional structure. Nucleic Acids Res. 2011, 39 (Database issue): D718-723.
- Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 2000, 25 (1): 25-29.
- Smialowski P, Pagel P, Wong P, Brauner B, Dunger I, Fobo G, Frishman G, Montrone C, Rattei T, Frishman D, Ruepp A: The Negatome database: a reference set of non-interacting protein pairs. Nucleic Acids Res. 2010, 38 (Database issue): D540-544.
- Pitre S, Hooshyar M, Schoenrock A, Samanfar B, Jessulat M, Green JR, Dehne F, Golshani A: Short Co-occurring Polypeptide Regions Can Predict Global Protein Interaction Maps. Sci Rep. 2012, 2: 239.
- Kerrien S, Aranda B, Breuza L, Bridge A, Broackes-Carter F, Chen C, Duesbury M, Dumousseau M, Feuermann M, Hinz U, Jandrasits C, Jimenez RC, Khadake J, Mahadevan U, Masson P, Pedruzzi I, Pfeiffenberger E, Porras P, Raghunath A, Roechert B, Orchard S, Hermjakob H: The IntAct molecular interaction database in 2012. Nucleic Acids Res. 2012, 40 (Database issue): D841-846.
- Pitre S, North C, Alamgir M, Jessulat M, Chan A, Luo X, Green JR, Dumontier M, Dehne F, Golshani A: Global investigation of protein-protein interactions in yeast Saccharomyces cerevisiae using re-occurring short polypeptide sequences. Nucleic Acids Res. 2008, 36 (13): 4286-4294.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.