- Research article
- Open Access
Identifying protein complexes based on node embeddings obtained from protein-protein interaction networks
© The Author(s) 2018
- Received: 8 January 2018
- Accepted: 9 September 2018
- Published: 21 September 2018
Protein complexes are one of the keys to deciphering the behavior of a cell system. During the past decade, most computational approaches used to identify protein complexes have been based on discovering densely connected subgraphs in protein-protein interaction (PPI) networks. However, many true complexes are not dense subgraphs, so these approaches show limited performance in detecting protein complexes from PPI networks.
To solve these problems, we propose a supervised learning method based on network node embeddings, which uses the informative properties of known complexes to guide the search for new protein complexes. First, node embeddings are obtained from the human protein interaction network. Then the protein interactions are weighted by the similarities between node embeddings. After that, the supervised learning method is used to detect candidate protein complexes. Finally, a random forest model is used to filter the candidate complexes in order to obtain the final predicted complexes. Experimental results on real human and yeast protein interaction networks show that our method effectively improves the performance of protein complex detection.
We provided a new method for identifying protein complexes from human and yeast protein interaction networks, which has great potential to benefit the field of protein complex detection.
- Node embeddings
- Random forest
- Supervised learning method
- Protein complex detection
In recent years, with the development of human genomics and high-throughput techniques, massive protein-protein interaction (PPI) data have been generated. These PPI data have made it possible to detect protein complexes automatically from PPI networks. During the past decade, most computational approaches used to identify protein complexes have been based on discovering densely connected subgraphs in PPI networks [1, 2]. However, many true complexes are not dense subgraphs, so these approaches show limited performance in detecting protein complexes from PPI networks. At the same time, the unreliable relations in the PPI data also pose a great challenge for protein complex identification [3–5].
Recently, a number of methods have been developed for protein complex identification. Van Dongen [6] proposed a protein complex discovery algorithm named MCL, which manipulates the adjacency matrix of yeast PPI networks with two operators called expansion and inflation. By iterating these two operators, it finds the clusters that are most likely to be protein complexes. Bader et al. [7] proposed a protein complex detection algorithm named MCODE, which clusters nodes based on local density. Zhang et al. [8] introduced a protein complex detection method which measures the likelihood of a subgraph being a real complex based on the number of three-node cliques. Liu et al. [9] came up with an algorithm named CMC for protein complex discovery, which uses maximal complete subgraphs as seeds and searches for protein complexes in weighted PPI networks. In this algorithm, the protein interactions are weighted by an iterative scoring method called AdjustCD. Moreover, some methods, such as COACH [10] and Core&Peel [11], detect protein complexes based on the core-attachment structure of protein complexes. However, most of the above methods are unable to detect overlapping complexes. Recently, Nepusz et al. [12] proposed a method named ClusterONE, which uses greedy aggregation to identify overlapping protein complexes. Some methods, such as ProRank+ [13], also consider the overlap between protein complexes. In addition, some researchers have tried to reduce the negative effects of unreliable PPI data on protein complex detection. For example, Zaki et al. [14] introduced a graph mining algorithm (PEWCC) which assesses the reliability of protein interactions by weighting clustering coefficients and removing unreliable edges, and then identifies protein complexes from the new weighted PPI network.
All of these algorithms are based on the topological structure of the PPI network and do not use the information of known complexes, and they have been applied only to yeast protein interaction networks.
In recent years, some supervised learning methods have been proposed to detect complexes from PPI networks by using informative properties of known complexes, including SCI-BN [15], NN [16] and ClusterEPs [17]. These methods usually have three main steps: first, they extract features from the known complexes; then they train a supervised classification model or score function to judge whether a subgraph is a true complex; finally, they use the trained classification model or score function to guide the search for new protein complexes. However, insufficient extracted features and noise in the PPI data make the classification model imprecise [18]. At the same time, some features are tied to particular characteristics of the network, so they only work on protein networks that have those characteristics, and the performance of complex detection decreases on networks that lack them [19]. Therefore, with the increasing amount of data with different characteristics, using traditional features alone fails to further improve the performance of complex detection methods.
However, with the rapid development of deep learning, using self-learned features has become an alternative way to obtain effective features from networks, even those with various characteristics. Tang et al. [20] proposed a spectral clustering method based on graph theory in 2011. The basic idea of this method is to use the similarity matrix of the sample data to decompose the features, and then to cluster the obtained eigenvectors, which depends only on sample size rather than sample characteristics. In 2014, Perozzi et al. [21] proposed a method named DeepWalk, which learns latent representations of vertices in a network from truncated random walks. This method achieved remarkable performance on multi-label network classification tasks in social networks. In 2015, Tang et al. [22] proposed a method named LINE, which learns a d-dimensional feature representation in two phases: the first d/2 dimensions are learned from breadth-first-search-style simulations over immediate neighbors, and the other d/2 dimensions from nodes at a 2-hop distance. In 2016, Grover et al. [23] proposed an algorithm, node2vec, to learn the representations of the nodes in a network. This method creates ordered node sequences by simulating breadth-first search and depth-first search. All of the above feature learning approaches aim to learn node embeddings by exploring the structure of networks, and node embedding methods have gained prominence because they produce continuous, low-dimensional features, which obviate the need for task-specific feature engineering and are effective for various tasks [24]. Thus, these methods enable us to extract more of the hidden information from networks, so as to effectively improve the performance of complex detection methods.
For the above reasons, in this paper we propose a method, NodeEmbed-SLPC-RF, which uses node embeddings to identify protein complexes in PPI networks. Firstly, it learns the node representations of the protein interaction network, and then uses the similarities between node representations to quantify the reliability of the protein interactions in order to filter existing interactions or add new interactions. Secondly, a supervised learning method (SLPC [25, 26]) is used to identify candidate protein complexes. Finally, a random forest (RF) model is used to classify the candidate protein complexes, and the candidates with positive labels are output as the final predicted complexes. Experimental results show that our method outperforms the state-of-the-art methods in detecting protein complexes from PPI networks.
We detail our NodeEmbed-SLPC-RF method in this section. Specifically, the node embeddings used in the algorithm are presented first, then SLPC and RF are briefly described, and finally the NodeEmbed-SLPC-RF algorithm is introduced.
At present, there are many approaches to generating network node embeddings. Node embeddings are distributed representations of the network nodes, which can be learned automatically from the adjacency information and topological structure of the network. Compared with traditional network structural features, node embedding methods can learn different vector representations for different networks according to their own structures, and thus can quickly mine the characteristics of different networks. Features of this kind are usually expressed not as single values but as dense vectors.
In order to obtain high-quality node embeddings, we use the node2vec method [23] to automatically obtain vector representations for all the nodes in the network. The node2vec method learns a low-dimensional representation for each node while preserving the structural information of both the nodes and the network. In particular, node2vec adopts random walks and an alias sampling strategy to capture the different local structures of a node. Therefore, the low-dimensional representations of the nodes are essentially feature representations of the nodes.
The node2vec algorithm can be roughly divided into three steps. Step 1: compute the transition probability matrix π from the return parameter p and the in-out parameter q. Step 2: generate node sequences from the graph G and π, where walk denotes the set of all node sequences. Specifically, r node sequences of fixed length l are generated for each node vi using the alias sampling strategy. Step 3: train the model on walk with stochastic gradient descent (SGD) to obtain a vector for each node, where k is the sliding window size used in training and d is the dimension of each vector. The parameter p controls how likely a walk is to return to the node it has just left: the larger p is, the less likely the walk is to revisit that node. The parameter q determines the traversal strategy: the larger q is, the more the walk behaves like a breadth-first search. All the generated node sequences are used as the contexts of the corresponding nodes, and the skip-gram architecture [27] is then used to train the node2vec model. After training, the vectors obtained are the learned feature representations of the nodes. Note that, with the alias sampling strategy, choosing the next node to add to a node sequence takes O(1) time.
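The walk-generation step (step 2) can be illustrated with a minimal sketch. It assumes an unweighted, undirected graph stored as a dictionary of neighbour sets; the function names `biased_walk` and `generate_walks` are hypothetical, and Python's `random.choices` stands in for the alias sampling table, so each step here costs time proportional to the node degree rather than O(1). Skip-gram training (step 3) is not shown.

```python
import random


def biased_walk(adj, start, length, p=1.0, q=1.0, rng=None):
    """Simulate one second-order random walk of at most `length` nodes.

    adj maps each node to the set of its neighbours.
    """
    rng = rng or random.Random()
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        nbrs = sorted(adj[cur])
        if not nbrs:
            break  # isolated node: the walk cannot continue
        if len(walk) == 1:
            walk.append(rng.choice(nbrs))  # the first step is uniform
            continue
        prev = walk[-2]
        weights = []
        for nxt in nbrs:
            if nxt == prev:
                weights.append(1.0 / p)  # return to the node just left
            elif nxt in adj[prev]:
                weights.append(1.0)      # stay within one hop of prev
            else:
                weights.append(1.0 / q)  # move two hops away from prev
        walk.append(rng.choices(nbrs, weights=weights, k=1)[0])
    return walk


def generate_walks(adj, r=10, l=10, p=1.0, q=8.0, seed=0):
    """Generate r walks of length l starting from every node."""
    rng = random.Random(seed)
    walks = []
    for _ in range(r):
        nodes = sorted(adj)
        rng.shuffle(nodes)  # visit start nodes in a fresh order each round
        for node in nodes:
            walks.append(biased_walk(adj, node, l, p, q, rng))
    return walks
```

With q greater than 1, the weight 1/q makes outward moves less likely, so the walks stay close to the start node, which matches the breadth-first behaviour described above.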
where ϕi(i=1,2,⋯,m) denotes the node embedding of the corresponding protein in the complex, Z is the matrix which is composed by ϕi in the complex set, d denotes the dimension of ϕi, and Z(·,j) denotes the j-th column of the matrix Z.
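As an illustration of how a complex vector might be built from the matrix Z, the sketch below aggregates each column Z(·,j) by its mean. The aggregation rule is an assumption made for this example, since the formula itself is not reproduced here, and the name `complex_vector` is hypothetical; other column-wise strategies could be substituted in the same place.

```python
def complex_vector(embeddings, complex_members):
    """Average the node embeddings of a complex column by column.

    embeddings maps a protein name to its d-dimensional vector (a list).
    Members without an embedding are skipped.
    """
    rows = [embeddings[prot] for prot in complex_members if prot in embeddings]
    if not rows:
        raise ValueError("no member of the complex has an embedding")
    d = len(rows[0])
    # Column j of Z collects the j-th component of every member embedding.
    return [sum(row[j] for row in rows) / len(rows) for j in range(d)]
```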
where X=(x1,x2,⋯,xn), Y=(y1,y2,⋯,yn) and n is the dimension of the corresponding vector.
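A minimal sketch of weighting interactions by embedding similarity follows. It assumes cosine similarity between the vectors X and Y, which is a common choice but is not confirmed by the text above; `cosine_similarity` and `weight_edges` are hypothetical names.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_x * norm_y)


def weight_edges(edges, embeddings):
    """Attach a similarity weight to every interaction (u, v)."""
    return {(u, v): cosine_similarity(embeddings[u], embeddings[v])
            for u, v in edges}
```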
Supervised learning method SLPC
The details of the supervised learning method (SLPC) used in our work can be found in references [25] and [26]. The SLPC method mainly includes three steps. Firstly, a training set, including positive, middle and negative data, is constructed. Secondly, the feature vector space for the complexes in the training set is constructed from the networks and a regression model is trained. Specifically, a rich feature set of eleven topological features is constructed for the complexes, and the regression model is trained with the feature vectors. Thirdly, the proteins whose degrees are greater than the average degree of the network are selected as the initial cliques, and the initial cliques are expanded according to the scores given by the regression model in order to generate the final cliques, which are likely to be real complexes. The main reason for using a supervised learning method in this work is that it can combine manually selected features with automatically learned features to further improve the performance of protein complex detection.
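The seed selection and expansion step can be sketched as follows. This is only an illustration of the control flow: the trained regression model over eleven topological features is replaced by a placeholder score (subgraph density), and the names `seed_proteins`, `density` and `expand` are hypothetical. It is not the SLPC implementation described in [25, 26].

```python
def seed_proteins(adj):
    """Proteins whose degree exceeds the average degree of the network."""
    avg_degree = sum(len(nbrs) for nbrs in adj.values()) / len(adj)
    return [v for v in sorted(adj) if len(adj[v]) > avg_degree]


def density(adj, nodes):
    """Placeholder score: fraction of possible edges present among nodes."""
    n = len(nodes)
    if n < 2:
        return 0.0
    edges = sum(1 for u in nodes for v in adj[u] if v in nodes) / 2
    return 2 * edges / (n * (n - 1))


def expand(adj, seed, score=density):
    """Greedily add the best-scoring neighbour while the score does not drop."""
    cluster = {seed}
    best = score(adj, cluster)
    while True:
        frontier = set().union(*(adj[u] for u in cluster)) - cluster
        candidates = [(score(adj, cluster | {v}), v) for v in sorted(frontier)]
        if not candidates:
            break
        new_score, chosen = max(candidates, key=lambda item: item[0])
        if new_score < best:
            break  # every extension would lower the score
        cluster.add(chosen)
        best = new_score
    return cluster
```

In a real setting, `score` would be the output of the trained regression model applied to the feature vector of the candidate subgraph.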
Random forest [28] is a model that trains a large number of decision trees on sample data for classification, and the class label is determined by the outputs of the decision trees. The main idea of the random forest model is as follows. A forest is built in a random way; it is composed of many decision trees, and the trees are independent of each other. When a new sample comes in, each tree makes a decision, and the class label selected by the majority of the decision trees is assigned to the sample.
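The voting step described above can be sketched in a few lines. The trees are represented here as plain callables that return a label, which is an assumption made to keep the example self-contained; `forest_predict` and `keep_positive` are hypothetical names, and tree training is not shown.

```python
from collections import Counter


def forest_predict(trees, sample):
    """Return the label chosen by the largest number of trees."""
    votes = Counter(tree(sample) for tree in trees)
    label, _count = votes.most_common(1)[0]
    return label


def keep_positive(trees, candidates):
    """Keep the names of candidates that the forest labels as positive (1).

    candidates is a list of (name, feature_vector) pairs.
    """
    return [name for name, vec in candidates if forest_predict(trees, vec) == 1]
```

Using an odd number of trees for a two-class problem avoids tied votes.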
The random forest model is tolerant of missing and unbalanced data, and it can handle high-dimensional data directly without a feature selection process. During training, the samples and candidate features used to build each tree are randomly selected, which helps to avoid over-fitting. In addition, the importance of each feature can be obtained after training. For the protein complex detection task, it is well known that there are false negative relations in PPI networks [4, 5], and the number of known standard complexes is quite limited. Therefore, we use a random forest model to further filter the candidate complexes based on their features.
Dataset and parameter setting
We conducted the experiments on two different types of PPI networks: human and yeast. For human, protein-protein relations were downloaded from the Human Protein Reference Database (HPRD) [29], which contained 39,254 interactions among 9678 proteins. For yeast, the commonly used DIP network [30] was obtained, which contained 17,203 interactions among 4930 proteins. After removing duplicated and self-linked relations, we obtained 37,060 interactions among 9521 proteins for human and 17,201 interactions among 4928 proteins for yeast. The gold standard human protein complexes were also downloaded from HPRD, while the gold standard yeast protein complexes were constructed by combining MIPS [31], Aloy [32], SGD [33] and TAP06 [34]. There are 1514 gold standard complexes for human and 673 for yeast, with sizes ranging from 3 to 129 and from 3 to 359, respectively.
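The preprocessing step of removing duplicated and self-linked relations can be sketched as follows, assuming that interactions are undirected pairs of protein identifiers; `clean_interactions` and `count_proteins` are hypothetical names.

```python
def clean_interactions(pairs):
    """Drop self-links and duplicate interactions, ignoring direction."""
    seen = set()
    cleaned = []
    for a, b in pairs:
        if a == b:
            continue  # self-linked relation
        key = frozenset((a, b))  # (a, b) and (b, a) are the same interaction
        if key in seen:
            continue  # duplicated relation
        seen.add(key)
        cleaned.append((a, b))
    return cleaned


def count_proteins(pairs):
    """Number of distinct proteins that appear in at least one interaction."""
    return len({protein for pair in pairs for protein in pair})
```

A protein that only appears in self-links disappears from the cleaned network, which is one way the protein count can drop after cleaning.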
We evaluated the performance of NodeEmbed-SLPC-RF against SLPC, ClusterONE, MCODE, MCL, CMC, COACH, ProRank+ and PEWCC. We referred to the previous studies [10, 12–14] and used their recommended settings. For ClusterONE, the density threshold, merging threshold and penalty value of each node were set to 0.6, 0.8 and 2, respectively. For MCODE, MCL, CMC and COACH, we used the recommended settings for unweighted networks. For ProRank+ and PEWCC, we used their default settings. In NodeEmbed-SLPC-RF, the node2vec algorithm is used to learn the feature representations of the nodes in the PPI network. In order to embed nodes with similar structure closer together, as suggested by [23], the parameters of node2vec were set as follows: p=1, q=8, r=10, l=10, k=10. In addition, 1000 trees were used in the random forest model.
where V denotes the set of proteins belonging to the corresponding complex and |V| denotes its size. Similar to many previous studies, a predicted complex p is regarded as matching a gold standard complex q if the NA(p,q) score is not lower than 0.25.
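The matching criterion can be sketched as follows. The sketch assumes the widely used neighborhood affinity definition, NA(p,q) = |Vp ∩ Vq|² / (|Vp| × |Vq|), since the formula itself is not reproduced above; `neighborhood_affinity` and `is_match` are hypothetical names.

```python
def neighborhood_affinity(p, q):
    """Overlap score between a predicted complex p and a gold complex q."""
    p, q = set(p), set(q)
    if not p or not q:
        return 0.0
    common = len(p & q)
    return common * common / (len(p) * len(q))


def is_match(p, q, threshold=0.25):
    """A prediction matches when its NA score is not lower than the threshold."""
    return neighborhood_affinity(p, q) >= threshold
```

For example, two complexes of two proteins each that share one protein score 1/4 and therefore sit exactly on the matching threshold.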
Using complex vectors to classify the candidate complexes
[Table: Performance comparison results on HPRD and DIP datasets. Surviving column header: No. of complexes.]
Performance comparison results on DIP datasets using the MIPS gold standard
Using node embedding similarities to filter edges from original PPI network
[Table: Experimental results obtained by using RF to filter the candidate complexes predicted from the modified HPRD network, built by filtering edges with different simi-thres. Surviving column headers: No. of edges left, No. of complexes.]
[Table: Experimental results obtained by using RF to filter the candidate complexes predicted from the modified DIP network, built by filtering edges with different simi-thres. Surviving column headers: No. of edges left, No. of complexes.]
Using node embedding similarities to augment the original network
Since node2vec provides a feature vector for each node in the network, and the similarity between two vectors may reflect the connectivity between the two protein nodes, we generated one new candidate relation for each target node by pairing it with the node whose embedding was most similar to its own. A candidate relation was integrated into the original network only if the similarity between its two nodes was larger than a certain threshold. Finally, the NodeEmbed-SLPC-RF algorithm was used to identify candidate complexes from the integrated network.
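The augmentation step can be sketched as follows. The similarity measure is passed in as a function `sim(u, v)`, so the sketch does not commit to a particular formula. It also assumes that the most similar node is searched among all other nodes and that pairs already present in the network are skipped; both points are interpretations of the description above, and `propose_edges` is a hypothetical name.

```python
def propose_edges(nodes, edges, sim, simi_thres):
    """Return new edges linking each node to its most similar node.

    A pair is added only if its similarity exceeds simi_thres and the
    interaction is not already in the network.
    """
    existing = {frozenset(edge) for edge in edges}
    added = set()
    for u in nodes:
        others = [v for v in nodes if v != u]
        if not others:
            continue
        best = max(others, key=lambda v: sim(u, v))
        pair = frozenset((u, best))
        if pair not in existing and sim(u, best) > simi_thres:
            added.add(pair)
    return sorted(tuple(sorted(pair)) for pair in added)
```

This exhaustive search compares every pair of nodes, so its cost grows quadratically with the number of proteins.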
[Table: Experimental results obtained by using RF to filter the candidate complexes predicted from the modified HPRD network, built by adding edges with different simi-thres. Surviving column headers: No. of added edges, No. of complexes.]
[Table: Experimental results obtained by using RF to filter the candidate complexes predicted from the modified DIP network, built by adding edges with different simi-thres. Surviving column headers: No. of added edges, No. of complexes.]
Link prediction by using different methods
where the product is taken over all l such that (u,l)∈E and (v,l)∈E. In the experiment, the number of iterations k was set to 2, as suggested by [38].
Comparison results for link prediction on HPRD and DIP
Comparison results for link prediction with different dimensions by using node2vec on HPRD and DIP
Using different strategies to generate complex vectors
[Table: Performance comparison using different vector generation strategies on HPRD and DIP datasets. Surviving column header: No. of complexes.]
In the previous section, each complex vector was generated from the corresponding node embeddings, and the complex vectors were used as features for the RF model to further classify the candidate complexes. From Table 1 we can see that using the RF model to further classify candidate complexes improves the performance of protein complex detection in terms of F-score, although the improvement on DIP is relatively slight. For example, when the dimension of the vector is set to 64, the F-score improves by 8.93% compared with using SLPC alone on the HPRD network, but by only 2.33% compared with using SLPC alone on the DIP network. In order to measure the effectiveness of RF, we also compared it with SVM and LR, and the comparison result is shown in Fig. 2. It can be seen from the figure that using a classifier does not necessarily improve the experimental results. Compared with the RF model, the SVM and LR models are less effective, especially on the HPRD network. This shows that RF can learn effective information from the complex feature vectors, while SVM and LR can learn only relatively limited information. The reason may be that they learn features in different ways. In addition, the decision function of SVM is determined by a small number of support vectors, and the overlap between complexes may interfere with its decision function, thus leading to the poor performance of SVM. Moreover, the LR model is based on a linear function, which normally cannot achieve promising results on linearly non-separable problems.
As mentioned in the section on filtering edges, the original PPI network was reconstructed by filtering out less reliable edges based on the node embedding similarities between nodes. SLPC was then used to identify candidate complexes from the modified PPI network, and finally the RF model was used to classify the candidate complexes based on their complex feature vectors in order to obtain the final predicted complexes. It can be seen from Fig. 3 that the similarities of the majority of relations in the original PPI network are greater than 0.8 and 0.65 on HPRD and DIP respectively, which indicates that the network as a whole is more closely connected in HPRD than in DIP. However, there are still some relations with lower similarities. Filtering out these relations helps to delete unreliable relations and thereby improves the performance of complex detection methods. As can be seen from Tables 3 and 4, using the NodeEmbed-SLPC-RF method on the modified networks can greatly improve the experimental results. For HPRD, the highest F-score is obtained on the modified network with simi-thres equal to 0.86, which is about 10.37% higher than that obtained on the original network using SLPC alone. For DIP, the highest F-score is obtained on the modified network with simi-thres equal to 0.66, but it is only 3.53% higher than that obtained on the original network using SLPC alone. Nevertheless, the results show that filtering the relations according to their similarities with a proper simi-thres can help to improve the performance of protein complex detection.
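The edge filtering step can be sketched as follows, assuming that a similarity score has already been computed for each interaction and that edges whose score is at least simi-thres are kept (whether the comparison is strict is not stated in the text). `filter_edges` is a hypothetical name and the scores in the usage example are made up.

```python
def filter_edges(weighted_edges, simi_thres):
    """Keep the interactions whose similarity reaches the threshold.

    weighted_edges maps an interaction (u, v) to its similarity score.
    """
    return [edge for edge, score in weighted_edges.items()
            if score >= simi_thres]


# Usage with illustrative scores and the HPRD threshold reported above.
example = {("A", "B"): 0.91, ("A", "C"): 0.40, ("C", "D"): 0.86}
kept = filter_edges(example, 0.86)
```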
As mentioned in the section on augmenting networks, we calculated the similarities between all node pairs using their node embeddings and then added new relations whose similarities were greater than a threshold to the original PPI networks. The candidate complexes were then predicted by the SLPC algorithm, and finally the candidate complexes were further classified by the RF model to obtain the final predicted complexes. It can be seen from Fig. 4 that the number of addable edges varies when simi-thres ranges from 0.65 to 0.75 on HPRD and from 0.35 to 0.45 on DIP, which indicates that the similarity scores of most relations are greater than 0.65 and 0.35 on HPRD and DIP respectively. In order to find a more effective similarity threshold for adding new relations to the original networks, we tested the performance of the NodeEmbed-SLPC-RF method with the similarity threshold ranging from 0.65 to 0.75 on HPRD and from 0.35 to 0.45 on DIP. The experimental results are shown in Tables 5 and 6. As can be seen from Table 5, the highest F-score is obtained when the similarity threshold is 0.72, and it is 9.17% higher than that obtained on the original network with SLPC alone. In addition, Table 6 shows that the best F-score is obtained with simi-thres equal to 0.36, but the improvement is slight compared with the F-score obtained by SLPC alone on the original network. Overall, these results show that adding reliable relations according to their similarity scores can effectively improve the performance of our model.
[Table: Experimental results obtained by using RF to filter the candidate complexes predicted from the modified HPRD network, built by filtering edges first and then adding edges with different simi-thres. Surviving column header: No. of complexes. Row groups: filtering simi-thres fixed to 0.86; adding simi-thres fixed to 0.72.]
[Table: Experimental results obtained by using RF to filter the candidate complexes predicted from the modified DIP network, built by filtering edges with simi-thres 0.66 first and then adding edges with different simi-thres. Surviving column header: No. of complexes. Row groups: filtering simi-thres fixed to 0.66; adding simi-thres fixed to 0.38.]
In this paper, we propose a protein complex detection method based on node embeddings, and the results demonstrate that our method can effectively improve the performance of detecting protein complexes from PPI networks. Specifically, when the RF model is used to classify the candidate complexes generated by SLPC based on their complex feature vectors, and the candidates labeled as positive by the RF model are taken as the final predicted complexes, the F-score improves by up to 8.93% and 2.33% on HPRD and DIP, respectively, compared with using SLPC alone. In addition, when the original relations are filtered based on the similarity scores of node embeddings and the candidate complexes are further classified according to their complex vectors, the F-score increases by up to 10.37% and 3.53% on HPRD and DIP, respectively, compared with using SLPC alone. The results indicate that the performance of protein complex detection methods can be improved by using node embeddings obtained by node2vec to measure the reliability of existing relations in the PPI networks. Moreover, when new relations are added according to their similarity scores and complex vectors are used to filter the candidate complexes, the F-score increases by up to 9.17% and 2.88% on HPRD and DIP, respectively, compared with using the SLPC algorithm alone. To sum up, the experimental results demonstrate the effectiveness of using node embeddings and complex vectors for detecting protein complexes from PPI networks. In future work, we will further explore how to combine node embeddings with biological resources for predicting complexes from PPI networks.
This work was supported by the grants from the National Key Research and Development Program of China (No. 2016YFC0901902), National Natural Science Foundation of China (No. 61272373, 61572102, 61572098 and 61502071), and Trans-Century Training Program Foundation for the Talents by the Ministry of Education of China (NCET-13-0084).
XL, ZY and ZZ conceived the project, and XL, SS and ZZ carried out all experiments and drafted the manuscript. LW, YZ, HL, JW and BX revised this manuscript. All authors read and approved the final manuscript.
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
1. Chen B, Fan W, Liu J, Wu F-X. Identifying protein complexes and functional modules - from static PPI networks to dynamic PPI networks. Brief Bioinform. 2013; 15(2):177–94.
2. Li X, Wang J, Zhao B, Wu F-X, Pan Y. Identification of protein complexes from multi-relationship protein interaction networks. Hum Genomics. 2016; 10(2):17.
3. Von Mering C, Krause R, Snel B, Cornell M, Oliver SG, Fields S, Bork P. Comparative assessment of large-scale data sets of protein–protein interactions. Nature. 2002; 417(6887):399–403.
4. Sprinzak E, Sattath S, Margalit H. How reliable are experimental protein–protein interaction data? J Mol Biol. 2003; 327(5):919–23.
5. Wodak SJ, Vlasblom J, Turinsky AL, Pu S. Protein–protein interaction networks: the puzzling riches. Curr Opin Struct Biol. 2013; 23(6):941–53.
6. Van Dongen S. Graph clustering via a discrete uncoupling process. SIAM J Matrix Anal Appl. 2008; 30(1):121–41.
7. Bader GD, Hogue CW. An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinforma. 2003; 4(1):2.
8. Zhang W, Zou X. A new method for detecting protein complexes based on the three node cliques. IEEE/ACM Trans Comput Biol Bioinforma (TCBB). 2015; 12(4):879–86.
9. Liu G, Wong L, Chua HN. Complex discovery from weighted PPI networks. Bioinformatics. 2009; 25(15):1891–7.
10. Wu M, Li X, Kwoh C-K, Ng S-K. A core-attachment based method to detect protein complexes in PPI networks. BMC Bioinforma. 2009; 10(1):169.
11. Pellegrini M, Baglioni M, Geraci F. Protein complex prediction for large protein protein interaction networks with the Core&Peel method. BMC Bioinforma. 2016; 17(12):372.
12. Nepusz T, Yu H, Paccanaro A. Detecting overlapping protein complexes in protein-protein interaction networks. Nat Methods. 2012; 9(5):471–2.
13. Hanna EM, Zaki N. Detecting protein complexes in protein interaction networks using a ranking algorithm with a refined merging procedure. BMC Bioinforma. 2014; 15(1):204.
14. Zaki N, Efimov D, Berengueres J. Protein complex detection using interaction reliability assessment and weighted clustering coefficient. BMC Bioinforma. 2013; 14(1):163.
15. Qi Y, Balem F, Faloutsos C, Klein-Seetharaman J, Bar-Joseph Z. Protein complex identification by supervised graph local clustering. Bioinformatics. 2008; 24(13):250–68.
16. Shi L, Lei X, Zhang A. Protein complex detection with semi-supervised learning in protein interaction networks. BioMed Central; 2011. p. 5.
17. Liu Q, Song J, Li J. Using contrast patterns between true complexes and random subgraphs in PPI networks to predict unknown protein complexes. Sci Rep. 2016; 6:21223.
18. Beniwal S, Arora J. Classification and feature selection techniques in data mining. Int J Eng Res Technol (IJERT). 2012; 1(6):6.
19. Liu X, Yang Z, Zhou Z, et al. The impact of protein interaction networks’ characteristics on computational complex detection methods. J Theor Biol. 2018; 439:141–51.
20. Tang L, Liu H. Leveraging social media networks for classification. Data Min Knowl Disc. 2011; 23(3):447–78.
21. Perozzi B, Al-Rfou R, Skiena S. DeepWalk: Online learning of social representations. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM; 2014. p. 701–10.
22. Tang J, Qu M, Wang M, Zhang M, Yan J, Mei Q. LINE: Large-scale information network embedding. In: Proceedings of the 24th International Conference on World Wide Web. Florence: International World Wide Web Conferences Steering Committee; 2015. p. 1067–77.
23. Grover A, Leskovec J. node2vec: Scalable feature learning for networks. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. San Francisco: ACM; 2016. p. 855–64.
24. Zhang Y, Xiong Y, Kong X, Zhu Y. Learning node embeddings in interaction graphs. In: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management. Singapore: ACM; 2017. p. 397–406.
25. Yu FY, Yang Z, Hu XH, Sun YY, Lin H, Wang J. Protein complex detection in PPI networks based on data integration and supervised learning method. BMC Bioinforma. 2015; 16(12):3.
26. Zhou Z, Gui Y, Yang Z, Liu X, Wang L, Zhang Y, Lin H, Wang J. Disease-specific protein complex detection in the human protein interaction network with a supervised learning method. In: Bioinformatics and Biomedicine (BIBM), 2016 IEEE International Conference On. Shenzhen: IEEE; 2016. p. 1296–301.
27. Mikolov T, Chen K, Corrado G, Dean J. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781. 2013.
28. Breiman L. Random forests. Mach Learn. 2001; 45(1):5–32.
29. Prasad TK, Goel R, Kandasamy K, Keerthikumar S, Kumar S, Mathivanan S, Telikicherla D, Raju R, Shafreen B, Venugopal A, et al. Human protein reference database - 2009 update. Nucleic Acids Res. 2009; 37(suppl 1):767–72.
30. Xenarios I, Salwinski L, Duan XJ, Higney P, Kim S-M, Eisenberg D. DIP, the database of interacting proteins: a research tool for studying cellular networks of protein interactions. Nucleic Acids Res. 2002; 30(1):303–5.
31. Mewes H-W, Amid C, Arnold R, Frishman D, Güldener U, Mannhaupt G, Münsterkötter M, Pagel P, Strack N, Stümpflen V, et al. MIPS: analysis and annotation of proteins from whole genomes. Nucleic Acids Res. 2004; 32(suppl_1):41–44.
32. Aloy P, Böttcher B, Ceulemans H, Leutwein C, Mellwig C, Fischer S, Gavin A-C, Bork P, Superti-Furga G, Serrano L, et al. Structure-based assembly of protein complexes in yeast. Science. 2004; 303(5666):2026–9.
33. Dwight SS, Harris MA, Dolinski K, Ball CA, Binkley G, Christie KR, Fisk DG, Issel-Tarver L, Schroeder M, Sherlock G, et al. Saccharomyces genome database (SGD) provides secondary gene annotation using the gene ontology (GO). Nucleic Acids Res. 2002; 30(1):69–72.
34. Gavin A-C, Aloy P, Grandi P, Krause R, Boesche M, Marzioch M, Rau C, Jensen LJ, Bastuck S, Dümpelfeld B, et al. Proteome survey reveals modularity of the yeast cell machinery. Nature. 2006; 440(7084):631.
35. Moreno PJ, Ho PP, Vasconcelos N. A Kullback-Leibler divergence based kernel for SVM classification in multimedia applications. In: Advances in Neural Information Processing Systems. Vancouver, British Columbia; 2004. p. 1385–92.
36. Foody GM, Mathur A. Toward intelligent training of supervised image classifications: directing training data acquisition for SVM classification. Remote Sens Environ. 2004; 93(1):107–17.
37. Hosmer Jr DW, Lemeshow S, Sturdivant RX. Applied logistic regression, vol. 398. Hoboken: Wiley; 2013.
38. Liben-Nowell D, Kleinberg J. The link-prediction problem for social networks. J Assoc Inf Sci Technol. 2007; 58(7):1019–31.