Protein function prediction by collective classification with explicit and implicit edges in protein-protein interaction networks
© Xiong et al; licensee BioMed Central Ltd. 2013
Published: 24 September 2013
Protein function prediction is an important problem in the post-genomic era. Recent advances in experimental biology have enabled the production of vast amounts of protein-protein interaction (PPI) data. Thus, using PPI data to functionally annotate proteins has been extensively studied. However, most existing network-based approaches do not work well when annotation and interaction information is inadequate in the networks.
In this paper, we proposed a new method that combines PPI information and protein sequence information to boost the prediction performance based on collective classification. Our method divides function prediction into two phases: First, the original PPI network is enriched by adding a number of edges that are inferred from protein sequence information. We call the added edges implicit edges, and the existing ones explicit edges correspondingly. Second, a collective classification algorithm is employed on the new network to predict protein function.
We conducted extensive experiments on two real, publicly available PPI datasets. Compared to four existing protein function prediction approaches, our method performs better in many situations, which shows that adding implicit edges can indeed improve the prediction performance. Furthermore, the experimental results also indicate that our method is significantly better than the compared approaches in sparsely-labeled networks, and it is robust to the change of the proportion of annotated proteins.
The past decade has witnessed a revolution in high-throughput sequencing techniques, resulting in huge numbers of sequenced proteins. However, experimental determination of protein functions is both expensive and time-consuming. As a consequence, there is growing interest in using computational methods to predict protein functions. Although many efforts have been made in this regard, the functions of most proteins in fully sequenced genomes remain unknown. This is true even for the six well-studied model species. Taking yeast as an example, approximately one-fourth of the proteins have no annotated functions. Therefore, functional annotation of proteins is one of the fundamental issues in the post-genomic era.
The most common approach to computational prediction of protein functions is to use sequence or structure similarity to transfer functional information among proteins. According to a recent survey, homology-based transfer approaches can be further divided into two classes: sequence-based approaches and structure-based approaches. BLAST is one of the most widely used sequence-based approaches; it assigns un-annotated proteins the functions of their homologous proteins. Although sequence similarity is undoubtedly correlated with functional similarity, in many cases there is no need to treat a protein as a whole. This is because typically only the 100-300 amino acids in a functional protein domain perform its functions. Therefore, a protein can be represented as several sequence-based (or structure-based) signatures (motifs) that are associated with particular functions. PROSITE, for example, is a database of manually selected sequence motifs. Structure-based approaches are based on the observation that protein structure is far more conserved than sequence, and thus structure is a useful indicator of function. FATCAT and PAST are popular tools for comparing and searching 3D protein structures. The reason for using structure motifs is analogous to that for sequence motifs. One example is PROCAT, a library of 3D enzyme structure motifs. However, sequence similarity does not necessarily imply functional equivalence, so homology-based transfer approaches can result in erroneous predictions, and the original erroneous annotations may be propagated and amplified in databases. Furthermore, as the databases expand, the utility of the homology-based transfer approaches begins to break down. For example, it has been estimated that < 35% of all proteins could be annotated automatically when accepting error rates of ≤ 5%, while even allowing for error rates of > 40%, there is no annotation for > 30% of all proteins.
Recent advances in experimental biology have enabled the production of vast amounts of protein-protein interaction (PPI) data across human and most model species. These data are commonly represented as networks, where a node corresponds to a protein and an edge corresponds to an interaction between a pair of proteins. Thus, using PPI data to assign protein function has been extensively studied. Approaches based on PPI data assume that proteins with similar functions are topologically close in the network. In a review of the existing computational approaches based on PPI data for protein function prediction, Sharan et al. distinguished two types of approaches: direct annotation schemes and module-assisted schemes.
Direct annotation schemes predict the functions of a protein from the known functions of its neighbors. Representatives are neighborhood counting approaches [12–15], graph theoretic approaches [16–18] and Markov random field (MRF) approaches [19–21]. Majority and Indirect neighbors are two neighborhood counting approaches. Majority is the simplest direct approach. It relies on the biological hypothesis that interacting proteins probably have similar functions, and it ranks each candidate function by its number of occurrences among the immediate neighbors. Indirect neighbors assumes that proteins that interact with the same proteins may also have some similar functions. It exploits both indirect and immediate neighbors to rank each candidate function. Functional flow is a graph theoretic approach that simulates a discrete-time flow of functions from all proteins. At every time step, the function weight transferred along an edge is proportional to the edge's weight, and the direction of transfer is determined by the functional gradient. Deng et al. devised an MRF model in which the function of a protein is independent of all other proteins given the functions of its immediate neighbors. The parameters of the model are first estimated using a quasi-likelihood method, and then Gibbs sampling is used to infer the functions of unannotated proteins.
Instead of predicting functions for individual proteins, module-assisted schemes first identify modules of related proteins and then annotate each module based on the known functions of its members. Examples include hierarchical clustering-based approaches [22, 23] and graph clustering approaches [24–27]. A key problem for this kind of approach is how to define the similarity between two proteins. Arnau et al. used the shortest path between proteins as a distance measure and applied hierarchical clustering to detect functional modules. To date, numerous graph-clustering algorithms have been applied to detect functional modules, such as spectral clustering, edge-betweenness clustering, clique percolation and overlapping clustering.
Additionally, Chua et al. presented a simple framework for integrating a large amount of diverse information for protein function prediction. This framework integrates diverse information using simple weighting strategies and a local prediction method. Hu et al. hybridized the PPI information and the biochemical/physicochemical features of protein sequences to predict protein function. The prediction is carried out as follows: if the query protein has PPI information, the network-based method is applied; otherwise, the hybrid-property based method is employed.
However, most existing network-based approaches do not work well if there is not enough PPI information. In view of this, we propose a new method that combines PPI information and protein sequence information to improve the prediction performance based on collective classification. Our method divides function prediction into two phases. First, the original PPI network is enriched by adding a number of edges that are computed based on protein sequence similarity. Second, based on the new network, a collective classification algorithm is employed to predict protein function. The main idea behind this method stems from the observation that existing network-based approaches ignore protein sequence information. Therefore, we increase the amount of useful information in the networks by adding a number of computed (or implicit) edges, which consequently improves the prediction performance.
We conducted experiments on S.cerevisiae and M.musculus functional annotation datasets. Compared to four existing protein function prediction methods, our method performs better in many situations, which shows that adding implicit edges can indeed improve the prediction performance. Furthermore, the experimental results also indicate that our method is significantly better than the compared methods in sparsely-labeled networks, and it is robust to the change of the proportion of annotated proteins.
Notation and problem definition
Protein function prediction is a multi-label classification problem over a set of functions $\{F_j\}$. Given a protein set $\{P_1, \ldots, P_n\}$, where the first $l$ proteins are labeled as $y_1, \ldots, y_l$, each $y_i$ is a vector with $y_{ij} = 1$ if the protein $P_i$ is associated with the $j$-th function $F_j$, and $y_{ij} = 0$ otherwise. Our goal is to predict the labels $y_{l+1}, \ldots, y_n$ for the remaining unlabeled proteins $P_{l+1}, \ldots, P_n$. In this study, we denote the PPI network as a finite undirected weighted graph with vertex set $V$, edge set $\varepsilon$ and weight set $W$, where $V$ is the union of the set of annotated proteins and the set of un-annotated proteins. Each edge $E_{i,j} \in \varepsilon$ denotes an observed interaction between proteins $V_i$ and $V_j$, and a weight $w_{i,j} \in W$ indicates the interaction confidence between $V_i$ and $V_j$. Here, we employ collective classification to tackle this problem. In addition, we use both explicit edges that are extracted from PPI datasets and implicit edges that are computed from protein sequence information. In what follows, we present the method in detail.
Generating BLAST-inferred edges
As we pointed out above, most existing network-based approaches do not work well when there is not enough interaction information in the PPI networks. Considering this, we propose a novel method that combines PPI information and protein sequence information to improve the prediction performance based on collective classification. The first step of our method is to enrich the original PPI network by adding a number of computed edges based on protein sequence similarity. Note that similarity between two proteins is not reliable evidence that the two proteins interact. Nevertheless, enriching a PPI network with computed edges can increase the amount of useful information in the network and hence improve the prediction performance. In this paper, the basic local alignment search tool (BLAST) is employed to compute the similarity score between each pair of proteins.
It is worth noting that there are two types of edges in the new network: BLAST-inferred edges (implicit edges) and the explicit edges that are already there. Here, two questions need to be answered. The first is how many edges should be added for each protein, that is, how to set the value of parameter k. The second is how to combine the weights of these two types of edges, which have different semantics. We will answer the first question in the experimental evaluation section and the second question in the next subsection.
The second step of our method is to employ the Gibbs sampling (GS) based collective classification method to predict protein function on the new network. GS is the most commonly used collective classification algorithm. It aims at finding the best label estimate for each un-annotated vertex by sampling each vertex label iteratively. GS based collective classification is divided into two phases: bootstrapping and iterative classification. Its high-level pseudo-code is given in Algorithm 1, and a detailed description of the algorithm is presented in the following subsections.
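The enrichment step can be illustrated with a minimal sketch. The text does not spell out how the k edges per protein are selected, so the sketch assumes that each protein is linked to its k highest-scoring BLAST hits and that pairs already joined by an explicit edge are skipped. The function name `add_implicit_edges` and the input layout (one similarity score per unordered protein pair) are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]


def _edge(a: str, b: str) -> Edge:
    """Canonical (sorted) form of an undirected edge."""
    return (a, b) if a < b else (b, a)


def add_implicit_edges(
    explicit_edges: Set[Edge],
    similarity: Dict[Edge, float],
    k: int,
) -> Set[Edge]:
    """Return implicit edges linking each protein to its top-k similar proteins.

    Assumptions (not taken from the paper): `similarity` holds one BLAST-style
    score per unordered pair, and a top-k hit that is already an explicit
    edge is simply dropped rather than replaced by the next-best hit.
    """
    explicit = {_edge(a, b) for a, b in explicit_edges}

    # Collect every scored partner of every protein.
    hits: Dict[str, List[Tuple[float, str]]] = {}
    for (a, b), score in similarity.items():
        if a == b:
            continue
        hits.setdefault(a, []).append((score, b))
        hits.setdefault(b, []).append((score, a))

    implicit: Set[Edge] = set()
    for protein, partners in hits.items():
        # Highest score first; ties broken by partner name for determinism.
        partners.sort(key=lambda h: (-h[0], h[1]))
        for _, other in partners[:k]:
            edge = _edge(protein, other)
            if edge not in explicit:
                implicit.add(edge)
    return implicit
```

With a larger k the network becomes denser, which is why the choice of k is treated as a tunable parameter in the experimental evaluation.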
Algorithm 1 Gibbs sampling based collective classification for protein function prediction with implicit and explicit edges in PPI networks.
1: // bootstrapping
2: for each query protein V x do
3:     compute the initial label vector of V x using its labeled explicit neighbors and its labeled implicit neighbors
4: end for
5: // burn-in period
6: for i=1 to B do
7:     for each query protein V x do
8:         update the label vector of V x using current assignments to its explicit neighbors and its implicit neighbors
9:     end for
10: end for
11: // sampling period
12: for i=1 to S do
13:     for each query protein V x do
14:         update the label vector of V x using current assignments to its explicit neighbors and its implicit neighbors
15:         record the m-rank result in matrix M x
16:     end for
17: end for
18: for each query protein V x do
19:     calculate the final result based on matrix M x
20: end for
Based on the observation that proteins at a shorter distance from each other in the network are more likely to have similar functions, weighted voting is employed to predict an initial functional probability distribution for the query protein. Note that two types of annotated neighbors take part in the vote: implicit neighbors (BLAST-inferred neighbors) and explicit neighbors. Thus, we introduce a combination parameter λ ∈ (0, 1) to control the tradeoff between these two types of neighbors.
Note that when predicting the functions of the query protein V x , we consider only its labeled neighbor proteins (either implicitly connected or explicitly connected). That is the reason why only the labeled explicit and labeled implicit neighbors are used in Algorithm 1 (Line 3): unlabeled neighbor proteins cannot be exploited in the bootstrapping phase. The bootstrapping phase corresponds to Lines 2 to 4 of Algorithm 1.
The iterative classification process is divided into two periods: the burn-in period and the sampling period. The burn-in period consists of a fixed number B of iterations in which we update the label vector of each query protein using weighted voting. This period is implemented in Algorithm 1 from Line 6 to Line 10. The sampling period consists of S iterations. In each iteration, we not only update the label vector of each query protein but also maintain count statistics of how many times we have sampled the j-th function F j for protein V x . This period is implemented in Algorithm 1 from Line 12 to Line 20.
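As an illustration of the voting step, the following sketch combines the votes of explicit and implicit neighbors. The exact voting formula is not reproduced in this text, so the form used here is an assumption: each neighbor type yields a weight-normalized vote per function, and the two votes are mixed as λ times the explicit vote plus (1 - λ) times the implicit vote. The function name `weighted_vote` and the argument layout are hypothetical.

```python
from typing import Dict, List


def weighted_vote(
    explicit_nbrs: Dict[str, float],
    implicit_nbrs: Dict[str, float],
    labels: Dict[str, List[int]],
    num_functions: int,
    lam: float,
) -> List[float]:
    """Score each function for one query protein.

    explicit_nbrs / implicit_nbrs map neighbor name -> edge weight.
    labels maps annotated protein name -> 0/1 label vector.
    Assumption: lam weights the explicit vote and (1 - lam) the implicit vote.
    """

    def vote(nbrs: Dict[str, float]) -> List[float]:
        scores = [0.0] * num_functions
        total = 0.0
        for nbr, weight in nbrs.items():
            if nbr not in labels:  # unlabeled neighbors cannot vote
                continue
            total += weight
            for j, y in enumerate(labels[nbr]):
                scores[j] += weight * y
        # Normalize by the total weight of the voting neighbors, if any.
        return [s / total for s in scores] if total > 0 else scores

    explicit_scores = vote(explicit_nbrs)
    implicit_scores = vote(implicit_nbrs)
    return [
        lam * e + (1.0 - lam) * i
        for e, i in zip(explicit_scores, implicit_scores)
    ]
```

Under this assumed form, a value of λ close to 1 lets the observed interactions dominate, while a value close to 0 relies mostly on the BLAST-inferred neighbors.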
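The three phases described above (bootstrapping, burn-in and sampling) can be sketched as follows. This is a simplified illustration and not the authors' implementation: it assumes that each function label is sampled independently as a 0/1 value with probability equal to its voting score, that the vote mixes explicit and implicit neighbors as λ and (1 - λ), and that the final ranking orders functions by how often they were sampled. All names are hypothetical; the defaults of 20 burn-in and 100 sampling iterations follow the experimental set-up reported in this paper.

```python
import random
from typing import Dict, List

Graph = Dict[str, Dict[str, float]]  # protein -> {neighbor: edge weight}
Labels = Dict[str, List[int]]        # protein -> 0/1 label vector


def vote(node: str, graph: Graph, labels: Labels, n_funcs: int) -> List[float]:
    """Weight-normalized vote of the currently labeled neighbors of node."""
    scores, total = [0.0] * n_funcs, 0.0
    for nbr, weight in graph.get(node, {}).items():
        if nbr in labels:
            total += weight
            for j in range(n_funcs):
                scores[j] += weight * labels[nbr][j]
    return [s / total for s in scores] if total > 0 else scores


def gibbs_collective_classification(
    explicit: Graph,
    implicit: Graph,
    known: Labels,
    queries: List[str],
    n_funcs: int,
    lam: float = 0.5,
    burn_in: int = 20,
    samples: int = 100,
    seed: int = 0,
) -> Dict[str, List[int]]:
    """Return, for each query protein, function indices ranked best first."""
    rng = random.Random(seed)

    def scores(node: str, labels: Labels) -> List[float]:
        e = vote(node, explicit, labels, n_funcs)
        i = vote(node, implicit, labels, n_funcs)
        return [lam * a + (1.0 - lam) * b for a, b in zip(e, i)]

    def sample(probs: List[float]) -> List[int]:
        return [1 if rng.random() < p else 0 for p in probs]

    # Bootstrapping: only proteins with known labels take part in the vote.
    current: Labels = dict(known)
    current.update({q: sample(scores(q, known)) for q in queries})

    # Burn-in followed by sampling; counts are kept only while sampling.
    counts = {q: [0] * n_funcs for q in queries}
    for iteration in range(burn_in + samples):
        for q in queries:
            current[q] = sample(scores(q, current))
            if iteration >= burn_in:
                for j, y in enumerate(current[q]):
                    counts[q][j] += y

    # Final result: rank functions by how often they were sampled.
    return {
        q: sorted(range(n_funcs), key=lambda j: (-counts[q][j], j))
        for q in queries
    }
```

During the iterative phase, query proteins also vote for each other through their current label estimates, which is what distinguishes collective classification from a single pass of neighbor voting.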
Results and discussion
Interaction and annotation data
We evaluated the performance of our approach with two PPI datasets. The first dataset (denoted as Dataset A) is based on the Gene Ontology (GO) annotation scheme. GO annotations are arranged in a hierarchical order and consist of three basic GO namespaces: molecular function, biological process and cellular component. There are 19655 GO terms that constitute 15 levels of annotations; higher-level terms are more generic, while lower-level terms are more specific. In this setting, some vague terms such as "GO:0005554 molecular function unknown" and annotations with evidence code "IEA" (Inferred from Electronic Annotation) were excluded. Furthermore, to avoid bias in the annotations, we applied the concept of the informative Functional Class to selectively identify GO terms for validation. An informative GO term is one that 1) is annotated by at least 30 proteins; and 2) has no child terms annotated by at least 30 proteins. This ensures that terms used for validation have a reasonable number of annotations and do not have overlapping descriptions. Predictions were performed separately for each namespace. As a result, in the S.cerevisiae annotation dataset, there are 39, 95 and 66 informative GO terms and in the M.musculus annotation dataset, there are 103, 334 and 130 informative GO terms for the molecular function, biological process and cellular component namespaces, respectively.
Protein interactions of Dataset A were downloaded from the Biological General Repository for Interaction Datasets (BioGRID). BioGRID is a public database that archives and disseminates genetic and protein interaction data from model organisms and humans. It currently holds 347966 interactions (170162 genetic, 177804 physical) obtained from both high-throughput data sets and individual focused studies, which were derived from over 23000 publications in the literature.
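The two conditions that define an informative GO term can be expressed as a short filter. The sketch below is illustrative only: it assumes that annotations have already been propagated up the GO hierarchy, that `children` lists the direct child terms of each term, and it uses hypothetical term identifiers in the usage example.

```python
from typing import Dict, Set


def informative_terms(
    annotations: Dict[str, Set[str]],
    children: Dict[str, Set[str]],
    min_proteins: int = 30,
) -> Set[str]:
    """Select terms with >= min_proteins proteins and no such child term.

    annotations maps GO term -> set of annotated proteins.
    children maps GO term -> set of direct child terms (an assumption about
    the input layout, not something specified in the paper).
    """

    def well_annotated(term: str) -> bool:
        return len(annotations.get(term, set())) >= min_proteins

    return {
        term
        for term in annotations
        if well_annotated(term)
        and not any(well_annotated(c) for c in children.get(term, set()))
    }


# Toy usage with a threshold of 3 instead of 30 (identifiers are made up).
toy_annotations = {
    "GO:parent": {"p1", "p2", "p3", "p4"},
    "GO:childA": {"p1", "p2", "p3"},
    "GO:childB": {"p4"},
}
toy_children = {"GO:parent": {"GO:childA", "GO:childB"}}
selected = informative_terms(toy_annotations, toy_children, min_proteins=3)
```

In the toy example the parent term is rejected because one of its children is itself well annotated, so only the more specific child term is kept.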
Statistics for Dataset A.
Statistics for Dataset B
MIPS Functional Category
CELL CYCLE AND DNA PROCESSING
PROTEIN FATE (folding, modification, destination)
PROTEIN WITH BINDING FUNCTION OR COFACTOR REQUIREMENT (structural or catalytic)
REGULATION OF METABOLISM AND PROTEIN FUNCTION
CELLULAR TRANSPORT, TRANSPORT FACILITIES AND TRANSPORT ROUTES
CELLULAR COMMUNICATION/SIGNAL TRANSDUCTION MECHANISM
CELL RESCUE, DEFENSE AND VIRULENCE
INTERACTION WITH THE ENVIRONMENT
SYSTEMIC INTERACTION WITH THE ENVIRONMENT
TRANSPOSABLE ELEMENTS, VIRAL AND PLASMID PROTEINS
BIOGENESIS OF CELLULAR COMPONENTS
CELL TYPE DIFFERENTIATION
CELL TYPE LOCALIZATION
Protein interactions of Dataset B were downloaded from the STRING database, which is an integrated protein interaction database containing known and predicted protein interactions. These interactions were mainly derived from four data sources: genomic context, high-throughput experiments, conserved co-expression and previous knowledge. The most recent version of STRING covers about 5.2 million proteins from 1133 organisms. For Dataset B, we constructed two PPI networks (one for S.cerevisiae and another for M.musculus); proteins without interaction data or sequence information were deleted. As a result, the S.cerevisiae interaction network contains 388846 distinct interactions among 4687 proteins, and the M.musculus interaction network contains 832124 interactions among 14269 proteins. Additionally, protein sequence information for Dataset A and Dataset B was also downloaded from the STRING database.
We compared our method with a sequence similarity based approach (termed BLAST-mined) that does not take the PPI network into account. The BLAST-mined approach was performed in two steps. First, BLAST was adopted to compute the similarity score between each pair of proteins. Second, we employed the kNN classifier to predict the functions of un-annotated proteins. We also conducted comparisons with a graph based method, Functional flow, as well as two neighbor counting methods, Majority and Indirect neighbors. Functional flow treats each annotated protein as the source of a functional flow. After simulating the spread of this functional flow through the network over time, each un-annotated protein is assigned a score for having the function based on the amount of flow it received during the simulation. Majority makes use of the observation that interacting proteins are more likely to have similar functions. It determines the functions of a protein based on the known functions of proteins lying in its immediate neighborhood. The principal advantages of Majority are its simplicity and effectiveness. Indirect neighbors exploits both direct and indirect function associations. It computes scores based on level 1 and level 2 interaction partners of a protein.
For traditional classification problems, the standard evaluation criterion is accuracy. However, in this paper we cannot simply determine whether a prediction is correct or wrong, because predictions in multi-label classification problems can be partially correct. Therefore, we adopted a widely-used performance measure, the ratio TP/FP, which depicts the relative magnitude of the number of true positives and the number of false positives. In this setup, we define the i-th rank overall true positive (TP) as the number of proteins whose i-th rank predicted function is one of the true functions of the protein V x , and the i-th rank overall false positive (FP) as the number of proteins whose i-th rank predicted function is not one of the true functions of the protein V x . To evaluate the prediction performance of our method, leave-one-out cross validation was used to compare it with the competing approaches. The idea behind leave-one-out cross validation is simply to treat each annotated protein as un-annotated in turn, then run the algorithm and compare the predicted functions to the known functions of the protein. It is worth noting that the iterative classification step is omitted in leave-one-out validation, because the label vector of the query protein is never updated after bootstrapping. However, real PPI networks contain a significant number of un-annotated proteins, so leave-one-out cross validation does not reflect practical conditions. Therefore, we also compared the performance of our method with that of the competing approaches in sparsely-labeled networks. In our implementation, the proportion of annotated proteins was varied from 10% to 90%. We ran 10 experiments for each given proportion of annotated proteins and reported the average performance. Moreover, the burn-in period and the sampling period were set to contain 20 and 100 iterations respectively.
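For orientation, the Majority baseline described above can be sketched in a few lines. This is a minimal illustration of neighborhood counting and not the implementation used in the experiments; the function name `majority` and the unweighted counting of neighbors are assumptions.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set


def majority(
    protein: str,
    neighbors: Dict[str, Iterable[str]],
    annotations: Dict[str, Set[str]],
    top: int = 3,
) -> List[str]:
    """Rank candidate functions by how often they occur among direct neighbors."""
    counts: Counter = Counter()
    for nbr in neighbors.get(protein, ()):
        # Un-annotated neighbors contribute nothing to the count.
        counts.update(annotations.get(nbr, ()))
    # Most frequent first; ties broken alphabetically for determinism.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [function for function, _ in ranked[:top]]
```

Because it looks only at immediate neighbors, this baseline has nothing to count for a protein whose neighbors are all un-annotated, which is the situation the sparsely-labeled experiments probe.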
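The i-th rank TP and FP counts defined above can be computed as in the following sketch. The data layout (a ranked list of predicted functions per protein and a set of true functions per protein) and the function name `rank_tp_fp` are assumptions made for illustration; proteins with fewer than i predictions are skipped.

```python
from typing import Dict, List, Set, Tuple


def rank_tp_fp(
    predictions: Dict[str, List[str]],
    truth: Dict[str, Set[str]],
    rank: int,
) -> Tuple[int, int]:
    """Count proteins whose rank-th prediction is correct (TP) or wrong (FP).

    rank is 1-based, matching the "i-th rank" wording in the text.
    """
    tp = fp = 0
    for protein, ranked in predictions.items():
        if len(ranked) < rank:
            continue  # no prediction at this rank for this protein
        if ranked[rank - 1] in truth.get(protein, set()):
            tp += 1
        else:
            fp += 1
    return tp, fp


# Toy usage with made-up proteins and functions.
toy_predictions = {
    "P1": ["f1", "f2"],
    "P2": ["f3", "f1"],
    "P3": ["f2", "f4"],
}
toy_truth = {"P1": {"f1"}, "P2": {"f1"}, "P3": {"f2", "f3"}}
tp1, fp1 = rank_tp_fp(toy_predictions, toy_truth, rank=1)
ratio1 = tp1 / fp1 if fp1 else float("inf")
```

The ratio is undefined when there are no false positives at a given rank, so the usage example reports infinity in that case; how such cases should be reported is a design choice not discussed in the text.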
Effect of parameters λ and k
The effect of the combination parameter λ (Dataset A: S.cerevisiae)
The effect of the combination parameter λ (Dataset A: M.musculus)
The effect of the combination parameter λ (Dataset B).
The effect of the number of BLAST-inferred edges k (Dataset A: S.cerevisiae)
The effect of the number of BLAST-inferred edges k (Dataset A: M.musculus)
The effect of the number of BLAST-inferred edges k (Dataset B).
Leave-one-out cross validation experiments
Performance in sparsely-labeled networks
In this paper, we proposed a new method for protein function prediction that combines PPI information and protein sequence information to improve prediction performance. It first reconstructs PPI networks by adding a number of BLAST-inferred implicit edges, and then applies the collective classification method to predict protein functions based on the new networks. The key idea of our work is to enrich the information in PPI networks by adding a number of computed edges, which subsequently improves the prediction performance. We carried out experiments on S.cerevisiae and M.musculus functional annotation datasets. The experimental results demonstrate that our method outperforms the existing approaches across a range of labeling situations, especially in sparsely-labeled networks where the existing approaches do not work well because of inadequate PPI information. Experimental results also validate the robustness of the proposed approach to the number of labeled proteins in PPI networks.
In this paper, we used a very simple scheme (BLAST alignment) to infer implicit edges. There are other methods that can be used to mine useful implicit edges, such as random walk. Random walk exploits both local and global network information and should be able to discover more useful hidden edges. We will explore this direction in the future.
Based on "Effectively predicting protein functions by collective classification - An extended abstract", by Wei Xiong, Hui Liu, Jihong Guan and Shuigeng Zhou, which appeared in Bioinformatics and Biomedicine Workshops (BIBMW), 2012 IEEE International Conference on. © 2012 IEEE.
This study was supported by China 863 Program (grant No. 2012AA020403) and 973 program (grant No. 2010CB126604). Hui Liu was supported by NSFC (grant No. 31100954). Jihong Guan was supported by NSFC (grant No. 61173118). Shuigeng Zhou was also supported by NSFC (grant No. 61272380).
The publication costs for this article were funded by the corresponding author.
This article has been published as part of BMC Bioinformatics Volume 14 Supplement 12, 2013: Selected articles from the IEEE International Conference on Bioinformatics and Biomedicine 2012: Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/14/S12.
1. Sharan R, Ulitsky I, Shamir R: Network-based prediction of protein function. Mol Syst Biol. 2007, 3: 88.
2. Sleator R, Walsh P: An overview of in silico protein function prediction. Arch Microbiol. 2010, 192: 151-155. 10.1007/s00203-010-0549-9.
3. Altschul S, Madden T, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman D: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research. 1997, 25: 3389-3402. 10.1093/nar/25.17.3389.
4. Friedberg I: Automated protein function prediction-the genomic challenge. Brief Bioinform. 2006, 7: 225-242. 10.1093/bib/bbl004.
5. Hulo N, Bairoch A, Bulliard V, Cerutti L: The 20 years of PROSITE. Nucleic Acids Research. 2008, 36: D245-D249.
6. Wallace A, Laskowski R, Thornton J: Predicting protein function from sequence and structural data. Curr Opin Struct Biol. 2005, 15: 275-284. 10.1016/j.sbi.2005.04.003.
7. Ye Y, Godzik A: FATCAT: a web server for flexible structure comparison and structure similarity searching. Nucleic Acids Research. 2004, 32: W582-W585. 10.1093/nar/gkh430.
8. Taubig H, Buchner A, Griebsch J: PAST: fast structure-based searching in the PDB. Nucleic Acids Research. 2006, 34: W20-W23.
9. Wallace A, Laskowski R, Thornton J: Derivation of 3D coordinate templates for searching structural databases: application to Ser-His-Asp catalytic triads in the serine proteinases and lipases. Protein Sci. 1996, 5: 1001-1013.
10. Gilks WR, Audit B, de Angelis D: Percolation of annotation errors through hierarchically structured protein sequence databases. Mathematical Biosciences. 2005, 193 (2): 223. 10.1016/j.mbs.2004.08.001.
11. Rost B, Liu J, Nair R: Automatic prediction of protein function. Cellular and Molecular Life Sciences. 2003, 60 (12): 2637-2650. 10.1007/s00018-003-3114-8.
12. Schwikowski B, Uetz P, Fields S: A Network of Protein-Protein Interactions in Yeast. Nature Biotechnology. 2000, 18: 1257-1261. 10.1038/82360.
13. Chua HN, Wong L: Exploiting indirect neighbours and topological weight to predict protein function from protein-protein interactions. Bioinformatics. 2006, 22: 1623-1630. 10.1093/bioinformatics/btl145.
14. Ng KL, Ciou JS, Huang CH: Prediction of protein functions based on function-function correlation relations. Computers in Biology and Medicine. 2010, 40 (3): 300-305. 10.1016/j.compbiomed.2010.01.001.
15. Xiong W, Liu H, Guan J, Zhou S: Effectively predicting protein functions by collective classification - An extended abstract. Bioinformatics and Biomedicine Workshops (BIBMW), 2012 IEEE International Conference on: 4-7 October 2012. 2012, 634-639. 10.1109/BIBMW.2012.6470212.
16. Vazquez A, Flammini A, Maritan A: Global protein function prediction from protein-protein interaction networks. Nature Biotechnology. 2003, 21 (6): 697-700. 10.1038/nbt825.
17. Karaoz U, Murali TM, Letovsky S: Whole-genome annotation by using evidence integration in functional-linkage networks. Proceedings of the National Academy of Sciences of the United States of America. 2004, 101 (9): 2888-2893. 10.1073/pnas.0307326101.
18. Nabieva E, Jim K, Agarwal A, Chazelle B, Singh M: Whole-proteome prediction of protein function via graph-theoretic analysis of interaction maps. Bioinformatics. 2005, 21 (Suppl 1): i302-i310. 10.1093/bioinformatics/bti1054.
19. Deng M, Zhang K, Mehta S: Prediction of protein function using protein-protein interaction data. Journal of Computational Biology. 2003, 10 (6): 947-960. 10.1089/106652703322756168.
20. Letovsky S, Kasif S: Predicting protein function from protein/protein interaction data: a probabilistic approach. Bioinformatics. 2003, 19 (Suppl 1): i197-i204. 10.1093/bioinformatics/btg1026.
21. Kourmpetis YAI, van Dijk ADJ, Bink MCAM: Bayesian Markov Random Field analysis for protein function prediction based on network data. PLoS ONE. 2010, 5 (2): e9293. 10.1371/journal.pone.0009293.
22. Brun C, Chevenet F, Martin D: Functional classification of proteins for the prediction of cellular function from a protein-protein interaction network. Genome Biol. 2003, 5 (1): R6. 10.1186/gb-2003-5-1-r6.
23. Arnau V, Mars S, Marin I: Iterative cluster analysis of protein interaction data. Bioinformatics. 2005, 21: 364-378. 10.1093/bioinformatics/bti021.
24. Bu D, Zhao Y, Cai L: Topological structure analysis of the protein-protein interaction network in budding yeast. Nucleic Acids Research. 2003, 31 (9): 2443-2450. 10.1093/nar/gkg340.
25. Dunn R, Dudbridge F, Sanderson C: The use of edge-betweenness clustering to investigate biological function in protein interaction networks. BMC Bioinformatics. 2005, 6: 39. 10.1186/1471-2105-6-39.
26. Adamcsek B, Palla G, Farkas IJ, Derenyi I, Vicsek T: CFinder: locating cliques and overlapping modules in biological networks. Bioinformatics. 2006, 22: 1021-1023. 10.1093/bioinformatics/btl039.
27. Becker E, Robisson B, Chapple CE: Multifunctional proteins revealed by overlapping clustering in protein interaction network. Bioinformatics. 2012, 28 (1): 84-90. 10.1093/bioinformatics/btr621.
28. Chua H, Sung W, Wong L: An efficient strategy for extensive integration of diverse biological data for protein function prediction. Bioinformatics. 2007, 23 (24): 3364-3373. 10.1093/bioinformatics/btm520.
29. Hu L, Huang T, Shi X, Lu W, Cai Y: Predicting functions of proteins in mouse based on weighted protein-protein interaction network and protein hybrid properties. PLoS ONE. 2011, 6 (1): e14556. 10.1371/journal.pone.0014556.
30. Sen P, Namata G, Bilgic M, Getoor L, Gallagher B, Eliassi-Rad T: Collective classification in network data. AI Magazine. 2008, 29: 93-106.
31. Ashburner M, Catherine AB, Judith AB: Gene Ontology: tool for the unification of biology. Nature Genetics. 2000, 25: 25-29. 10.1038/75556.
32. Stark C, Breitkreutz BJ, Chatr-Aryamontri A: The BioGRID Interaction Database: 2011 update. Nucleic Acids Research. 2011, 39: D698-704. 10.1093/nar/gkq1116.
33. Ruepp A, Zollner A, Maier D, Albermann K: The FunCat, a functional annotation scheme for systematic classification of proteins from whole genomes. Nucleic Acids Research. 2004, 32: 5539-5545. 10.1093/nar/gkh894.
34. Güldener U, Münsterkötter M, Kastenmüller G, Strack N: CYGD: the comprehensive yeast genome database. Nucleic Acids Research. 2005, 33: D364-D368.
35. Ruepp A, Doudieu O, van den Oever J, Brauner B: The mouse functional genome database (MfunGD): functional annotation of proteins in the light of their cellular context. Nucleic Acids Research. 2006, 34: D568-D571. 10.1093/nar/gkj074.
36. Damian S, Andrea F, Michael K, Milan S: The STRING database in 2011: functional interaction networks of proteins, globally integrated and scored. Nucleic Acids Research. 2011, 39: D561-D568. 10.1093/nar/gkq973.
37. Fan RE, Lin CJ: A study on threshold selection for multi-label classification. 2007, Tech. rep., National Taiwan University.
38. Bogdanov P, Singh AK: Molecular Function Prediction Using Neighborhood Features. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2010, 7: 208-217.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.