Collective prediction of protein functions from protein-protein interaction networks

Background Automated assignment of functions to unknown proteins is one of the most important task in computational biology. The development of experimental methods for genome scale analysis of molecular interaction networks offers new ways to infer protein function from protein-protein interaction (PPI) network data. Existing techniques for collective classification (CC) usually increase accuracy for network data, wherein instances are interlinked with each other, using a large amount of labeled data for training. However, the labeled data are time-consuming and expensive to obtain. On the other hand, one can easily obtain large amount of unlabeled data. Thus, more sophisticated methods are needed to exploit the unlabeled data to increase prediction accuracy for protein function prediction. Results In this paper, we propose an effective Markov chain based CC algorithm (ICAM) to tackle the label deficiency problem in CC for interrelated proteins from PPI networks. Our idea is to model the problem using two distinct Markov chain classifiers to make separate predictions with regard to attribute features from protein data and relational features from relational information. The ICAM learning algorithm combines the results of the two classifiers to compute the ranks of labels to indicate the importance of a set of labels to an instance, and uses an ICA framework to iteratively refine the learning models for improving performance of protein function prediction from PPI networks in the paucity of labeled data. Conclusion Experimental results on the real-world Yeast protein-protein interaction datasets show that our proposed ICAM method is better than the other ICA-type methods given limited labeled training data. This approach can serve as a valuable tool for the study of protein function prediction from PPI networks.


Background
We have witnessed a revolution in sequencing technologies in last decade. The biological sciences are undergoing an explosion in the amount of genome sequences. There are increasing interests about using computational methods to identify the biological functions of the protein sequences [1], as experimentally determining protein functions is time-consuming and it cannot catch up with the fast growth of newly found proteins [2].
Various studies have applied machine learning methods to protein data from biological experiments to predict the functions for unknown proteins. (e.g. [3,4]). Classical computational approaches for protein function prediction represent each protein as a set of features, and employ machine learning algorithms to automatically predict the protein function based on these features. The most well-established methods [5] are the BLAST [6] approach based on sequence, PROSITE [7] based on sequence motifs, and PFAM [8] based on profile methods.
In recent years, the development of experimental methods for genome scale analysis of molecular interaction networks offers new ways to infer protein function in the context of protein-protein interaction (PPI) network, wherein proteins and detected PPIs are represented by nodes and edges, respectively. The basic idea is that the direct interaction partners of a protein are likely to share similar biological functions [9]. Assignment of protein functions using PPI data has also been extensively studied, such as neighborhood counting based method [10], graph theoretic methods [11], hierarchical clusteringbased methods [12] and graph clustering methods [13]. Although many efforts have been made in protein function prediction, most of them were based on either sequence similarity that ignores the protein interactions, or PPI information without using attributes derived from the content of protein sequence. The former method often fails to work if a query protein has no or very little sequence similarity to any proteins of known labels, the latter method has similar problem if there are insufficient relevant PPI information.
To explicitly use the information of the content of the data and the links information of the PPI network to improve the prediction performance, collective classification (CC) is proposed. It received considerable attentions in the last decade. Various CC algorithms has been proposed in the literature [14], such as the iterative classification algorithm (ICA) [15], Gibbs sampling (Gibbs) [16], and variants of the weighted-vote relational neighbor algorithm (wvRN) [17]. Here, we focus on ICA-type approaches, which consist of a local classifier, such as kNN, to infer the class labels of related instances. The key idea is to construct new relational feature vectors by summarizing the label information from neighborhood nodes, and then use the relational features together with the attribute features derived from the content of data to learn local classifiers for prediction. Figure 1 is an illustration of how ICA proceeds. In Figure 1(a), an attribute-only classifier M A induced from using only the attribute features is first learned to estimate the classes of unlabeled instances. The algorithm then employs an aggregation function to compute the relational features by counting the number of neighbors with respect to each label. Once the features are constructed, a collective classifier, M AR , is learned using both the attribute features and relational features (Figure 1(b)); The algorithm repeats step c and step d to make new prediction for unlabeled instances (Figure 1(c)), and to update the relational features based on the new generated predictions (Figure 1(d)). The ICA-type of algorithms usually assume a separate training graph with abundant labeled data. However, in many applications such as protein function prediction problems, the number of labeled protein data is actually very limited and very expensive to obtain. In this situation, most data have no connection to labeled data, and supervision knowledge cannot be obtained from the local connections (as illustrated in Figure 1(a)). As a result, the collective classifier M AR learned from these networks may suffer a reduction in the classification performance.
This paper describes an effective Markov chain based CC algorithm (ICAM) to tackle the label deficiency problem in CC for protein function prediction from PPI networks. Our idea is to model the classifier M AR via the Markov chain with restart. The Markov chain model computes the ranks of labels to indicate the importance of a set of labels to an instance by propagating the label information in a graph constructed from labeled and unlabeled data. The ICAM algorithm further refines the Markov chain model using an ICA framework to generate the possible labels for a given instance. By these techniques, M AR can be learned more effectively. Experiments on the realworld Yeast PPI datasets have demonstrated that our proposed ICAM method improves the classification performance when compared with the ICA-type CC methods. The main contributions of this paper are as follows.
• We study the label deficiency problem of collective classification (CC) and show that the protein function prediction problem from PPI networks can be formulated as a CC task.
• We extend the ICA-type CC algorithm and propose the ICAM algorithm to leverage the unlabeled portion of the data to improve the classification performance of CC via the Markov chain with restart.
• We demonstrate the effectiveness of our proposed ICAM algorithm using the Yeast benchmark datasets. We find that ICAM leads to significant accuracy gains compared to other ICA-type methods when there are limited numbers of labeled data available.

Preliminaries
Assume that the PPI network data are represented as a graph G(V, E, X A , Y, c), where V is a set of nodes, E is a set of edges representing the interactions between the instances. Each instance/node v i ∈ V is described by an attribute vector x i ∈ X A . Each Y i ∈ Y is a set of labels for v i , and c is the number of possible labels. Assume that we have a set of labeled nodes V K ⊂ V with known labels Y K = {Y i |v i ∈ V K }, and the task is to predict the labels Y U for unlabeled nodes V U = V -V K . In this paper, we are primarily interested in generating a ranking of possible labels for a given protein such that its correct functions receive higher ranking than the less unlikely one.

The ICAM algorithm
Inspired by the ICA approach, we introduce the ICAM algorithm for collective classification. The algorithm is summarized in Algorithm 1. Similar to the ICA framework, the ICAM algorithm has two parts as follows: bootstrap and iterative inference. The bootstrap part learns an attribute-only classifier M A from the known nodes, and uses M A to predict labels for the unknown nodes V U (step 1-2). In the iterative inference part, the relational features X R are updated based on the estimated class labels of data (step 4). Specifically, X R of the (i + 1)-th iteration is based on the known and predicted labels from the i-th iteration. Next, the algorithm trains a collective classifier M AR using both attribute features X A and relational features X R to compute the labels for unlabeled data. The iterative process stops when the predictions of M AR are stabilized or a fixed number of iteration is reached.
An important component of the ICA algorithm is to build the relational features that summarizes the relational information, and to construct new feature vectors to train the classifier M AR . For instance, Neville et al. [15] summarize the labels of neighboring nodes as relational features as illustrated in Figure 1(b), where node "B" has two positive neighboring nodes and two negative neighboring nodes. Here, the relational features is "¡2, 2¿", and then "¡2, 2¿" is appended onto the original feature vector, <x i,1 , x i,2 , ...>, as new features, " <x i,1 , x i,2 , ..., 2, 2 >". ICA-type CC methods usually increase accuracy for network data using a large amount of labeled data to train M AR . In this scenario, the supervision knowledge can be effectively propagated in the network and improve the learning accuracy [18]. However, the labeled data are time-consuming to obtain and the number of labeled data is very limited. Most of the nodes may not link to the labeled nodes, as illustrated in Figure 1(a). As a result, the prediction accuracy of the collective classifier M AR will be decreased greatly.
Algorithm 1 ICAM (V, E, X A , Y K , n) Input: V = nodes, E = edges, X A = attribute feature vectors, Y K = labels of known nodes, n = # of iterations, Procedure: In our ICAM algorithm, we assume that the attribute features x i and the relational features r i are conditionally independent given the class label Y i [19]. We then use two distinct classifiers to make two separate predictions for attribute features X A and relational features X R . The prediction is given as follows: where g is a constant independent of Y i . The attribute classifier to estimate p(Y i |x i ) is referred to as M A , and the relational classifier to estimate p(Y i |r i ) is referred to as M R .
There are two main advantages of this prediction method. First, this method allows us to train classifiers M A and M R for attribute features X A and relational features X R in parallel. Second, in the collective inference process, the classifier M R can be re-trained in each iteration based on the re-constructed relational features X R to improve the prediction accuracy of the collective classifier M AR .
Various traditional supervised learning methods can be used to train M A and M R where the classifier, such as kNN, naive bayes and logistic regression [16,20], is learned from a separate training data with a large amount of labeled data. However, when dealing with label deficiency problem in PPI networks, we propose to use transductive learning method for acquiring additional information from unlabeled data to improve the classification performance. Specifically, we set up Markov chain based learning models to estimate p(Y i |x i ) and p(Y i |r i ).

Markov chain based learning
The Markov chain based learning model is based on the idea of random walk with restart. We note that there are many learning tasks using random walk techniques such as protein network cluster discovery [21], community discovery [22], multi-instance multi-label learning [23], and transfer learning [24]. The idea of random walk with restart is to consider a random walker that starts from labeled nodes, and iteratively transmits to its neighborhood with probabilities proportional to their edge weights. At each step, it has a probability a(0 < a <1) to return back to labeled node. The steady-state probability that the walker will finally stay at node j is the relevance score of node j with respect to the labeled nodes [25]: is the steady-state probability of relevance scores of different nodes, P is the affinity matrix associated with the instances in Markov chain transition probability graph, and q is the label distribution vector containing the elements of labeled instances being 1 and 0 for others. Here, the steady-state probability (relevance score of the instances) captures the global structure of the graph and relationship between the nodes. The advantage of this random walk procedure is that it converges to a unique solution for any initial u(0). The process converges fast, needing just a few iterations. The random walk and related methods have been shown to have good performance on the learning tasks mentioned above. In the following, we introduce the learning of M A via the Markov chain with restart using all the instances (both labeled and unlabeled). The process of learning M R is similar.
Given the constructed attribute feature vector x i ∈ X A for a node v i ∈ V, pairwise affinity A ∈ R m×m between the nodes based on relational information are computed using the Gaussian kernel function as follows where ||x i -x j || is the Euclidean distance between the i-th feature vector and the j-th feature vector in X A . The parameter s is a positive number to control the linkage in the manifold [26]. The m-by-m matrix A, with its (i, j)-th entry given by a i,j , is always nonnegative. Similar to (1), using the Gaussian kernel to r i ∈ X R leads to the affinity matrix R for relational features. We then set up Markov chain models for classifiers M A and M R based on A and R, respectively.
For the classifier M A , we construct an m-by-m Markov transition probability matrix P by normalizing the entries of A with respect to each column, i.e., each column sum of P is equal to one, i [P] i,j = 1. For such P, we model the probabilities of visiting the other instances from the current instance in a Markov chain transition probability graph. We construct a transition probability graph, all the labeled and unlabeled instances are linked together. Intuitively, a random walker starts from nodes with known label to propagate labels among labeled instances to the other unlabeled instances. The walker iteratively visits its neighborhood of nodes with the transition probability graph based on A.
Next we use the idea in topic-sensitive PageRank [27] as a Markov chain with restart [25] to solve the learning problem. The random walker has a probability of a to return to labeled instances at each step. It can be interpreted that during each iteration each instance receives the label information from its neighbors via the random walk, and also retains its initial label information. The parameter aspecifies the relative amount of the information from its neighbors and its initial label information. Using this approach, we compute the steady state probabilities that the random walker finally stay at different instances. These steady state probabilities give ranking of labels to indicate the importance of a set of labels to an unlabeled instance.
More formally, we adopt the following equation: to compute the steady probabilities U = [u 1 , u 2 , ..., u c ] (m-by-c matrix) according to P and Q = [q 1 , q 2 , ... , q c ] (m-by-c matrix) which is the assigned probability distribution vector of the class labels that are constructed from the labeled data. The restart parameter 0 < a <1 controls the importance of the assigned probability distributions in the labeled data to the resulting label ranking scores of instances. Given the training data, one simple way to construct q d is using a uniform distribution on the instances with the label class d. The summations of the entries of q d is equal to 1. More precisely, where l d is the number of instances with the label class d in the training data.
The steady probability distribution vector U is solved by the iteration method with an initial matrix U 0 where each column is a probability distribution vector. The overall algorithm is summarized in Algorithm 2.

Experimental results
In this section, we compare the performance of ICAM algorithm with other ICA-type collective classification algorithms: ICA, Gibbs and ICML. We show that the proposed algorithm outperforms these algorithms given limited number of labeled training data.

KDD Cup 2001 data and baselines
The first experiment is conducted for Yeast gene function prediction from KDD Cup 2001 [28]. The dataset includes 1,243 genes and 1,806 interactions among the pair of genes encoding the proteins physically interact with one another. These interaction relationships are symmetric. The protein functions are autocorrelated in this dataset and a subset of the data have been withheld for testing. The task is to predict the functions of the proteins encoded by the genes. There are 14 functions and a protein can have one (or several) function(s). We compare our proposed method with the following three baseline learners: 1. ICA. The Iterative Classification Algorithm (ICA) algorithm proposed by Neville et al. [15] is one of the simplest and most popular CC methods that is frequently used as baseline for CC evaluation in previous studies. For multi-label problem, we transform it into multiple singlelabel prediction problems using one-against-all strategy and employ ICA to make prediction for each single-label problem.
2. Gibbs. This baseline is another ICA-type CC algorithm using the ICA iterative classification framework. In each iteration, Gibbs re-samples the label of each node based on the estimated label distribution [16]. We also use one against-all strategy to convert the multi-label problem into multiple single-label problems for the Gibbs algorithm.
3. ICML. This baseline is a multi-label CC algorithm proposed by Kongetal. [29]. ICML extends the ICA algorithm to multi-label problems by considering dependencies among the label set in the iteration process.
In the experiments, we use kNN as node classifier for ICA, Gibbs and ICML. The parameter k was automatically selected in the range of 10 to 30 at an increment of 5 using 3-fold cross validation on the training set. For the proposed ICAM method, we learn the classifiers M A and M R using Markov chain based models to perform separate predictions. We set the value of a in the Markov chain model to be 0.95 as suggested in [23].

Evaluation criteria
We evaluate the performance of our proposed method with four multi-label evaluation measures: average precision, coverage, ranking loss, and one-error. They are commonly used for multi-label learning algorithm evaluation.

Given a multi-label dataset
Here x i belongs to the j-th label when Y ij = 1, otherwise Y ij = 0, and c is the number of possible labels. The evaluation measures are defined using the following two outputs provided by the learning algorithms: s(x i , l) returns a real-value that indicates the confidence for the class label l to be a proper label of x i ; rank s (x i , l) returns the ranks of class label l derived from s(x i , l).
Coverage [30] evaluates how far we need, on the average, to go down on the list of labels in order to cover all the true labels of an instance: Ranking loss [30] evaluates the average fraction of label pairs that are reverse ordered for the instance: One-error [30] evaluates how many times the topranked label is not in the set of true labels of the instance. Define a classifier H that assigns a single label to an instance x i by H(x i ) = arg max l∈Y h (x i , l), then the one-error is Average precision [30] evaluates the average fraction of labels ranked above a particular label l ∈ Y in Y: The smaller the value of coverage ranking loss and one-error, the better the performance. As for average precision, the bigger the value the better the performance, we report the results of 1-average precision. Thus, for all evaluation metrics, the smaller the value the better the performance.

Results on KDD Cup 2001 data
In this experiment, we test the performance of our proposed ICAM algorithm on the KDD Cup 2001 dataset. We randomly select 50% of data as training set, and use the remaining 50% of data as test set. The experiment is conducted 10 times by randomly selected training/test split (each with a different random seed), and we report the results of mean as well as standard deviation of each compared algorithm. The mean as well as standard deviation of each compared method over the same 10 trails are reported. Table 1 shows the performance of each compared method on the Yeast protein dataset. The best performance achieved among different compared algorithms is marked in bolded. The results show the competitiveness of the ICAM method against other learning methods. Difference evaluation measures for the learning performance are used in the experiments. One algorithm rarely outperforms another algorithm in all criteria. For example, a method that is optimal for instance ranking loss usually does not perform well in coverage or oneerror [31]. In the experiments, we find that ICAM is able to produce better results across all evaluation metrics. These results are impressive and imply that the ICAM algorithm is a good collective classification method for protein function prediction.
We also test the performance of different comparable algorithms with different number of labeled instances ranging from 200 to 1000 with an increment of 200. For example, we randomly pick up 200 instances as training data. The remaining data is used for testing. The experiment is conducted 10 times by randomly selecting training/test split. We report the results of mean as well as standard deviation of each compared algorithms. Figure 2 shows the performance of ICAM and other learning methods with respect to different number of labeled instances.
We can see from the figure that ICAM (the black line) has the best performance in general. ICAM outperforms other algorithms using different number of training data, especially when the size of training data is small. Specifically, ICAM achieves coverage improvement of 0.4916 over the second best method Gibbs (ICAM:4.2213 versus Gibbs:4.7129) and achieves 0.0439 improvement on ranking loss (ICAM:0.1184 versus ICML:0.1623) when the number of training instance is 200. As the size of training data increases, ICAM consistently achieves better performance than other learning algorithms across all evaluation criteria.
We find that ICAM outperforms the other ICA-type methods substantially in terms of coverage. On the other hand, ICAM only slightly outperforms other methods in terms of one-error. We note that one-error and coverage are two different quantitative measures. One-error evaluates how many times the top-ranked label is not in the set of possible labels. Thus, if the goal of a prediction system is to assign a single function to a protein (singlelabel classification), the one-error is identical to test error. Whilst coverage measures how far we need, on the average, to go down on the list of the labels in order to cover all the possible labels assigned to a protein. Coverage is loosely related to precision at the level of perfect recall [30]. The experimental results indicate that the top-rank label predictions from other ICA-type methods are as accurate as those from ICAM, but the predictions from ICAM are more complete than other ICA-type methods. A reasonable explanation for this finding is that the ICA-type methods focused on the single-label setting. In this case, the multi-label problem is first transformed into multiple single-label prediction problems, and then the ICA-type methods use independent classifiers induced from labeled training data for each problem. Nevertheless, ICA-type approaches ignore the effect of unlabeled data and the interdependence of the protein functions. On the other hand, our proposed ICAM approach is based on Markov chain based transductive learning method that uses both label and unlabeled data for label propagation. The Markov chain based method takes the correlation of the classes into account to effectively compute ranking of labels to an instance. Therefore, ICAM provides an opportunity to leverage the individual ICA-type classifiers to achieve higher coverage of predictions.

Results on KDD Cup 2002 data
To validate the effectiveness of the proposed method when there are only a limited number of positive labeled training data, we conduct additional experiments on a relatively large scale Yeast dataset from KDD Cup 2002. It consists of 4507 instances (i.e., genes) from experiments with a set of cerevisiae (Yeast) strains. Each instance is described by various types of information that characterize the gene associated with the instance. The data sources for describing the instances include abstracts from the scientific literature (MEDLINE), gene localization and functions. We represent each instance by a feature vector with 20545 dimensions. The pairs of genes whose encoded proteins physically interact with one another. Such protein-protein interaction network consists of 1218 links.
Each instance is labeled with one of three class labels "nc", "control" and "change". The "change" label indicates instances in which the activity of the hidden system was Table 1 The performance (mean ± standard deviation) of compared algorithms on the Yeast protein dataset. significantly changed, but the activity of the control system was not significantly changed. The goal of the KDD Cup 2002 task is to learn a model that can accurately predict the genes that affect the hidden system but not the control system. In this case, the positive class consists of those genes with "change" labels and the negative class consists of those genes with either "nc" or the "control" label. This partition is highly imbalanced. The rate of positive instances is only 1.2%. Therefore, we base our evaluation analysis on Receiver Operating Characteristic (ROC) curves, which reflect the true positive rate of a classifier as a function of its false positive rate. ROC curves are commonly used for evaluating highly skewed binary classification problems. Recent study has shown that ROC curves have a deep connection to the precision-recall (PR) curves [32].
To evaluate the performance of our ICAM algorithm, we compare it with the linear kernel SVM method that implemented by LIBSVM [33]. Figure 3 shows the results of ROC curves on the KDD Cup 2002 task for ICAM and SVM. The x-axis and y-axis of the figure refer to the false positive rate and true positive rate respectively. We see from the figure that our ICAM (the red curve), outperforms the SVM method (the blue curve) in general.  ICAM achieves improvement of 10.0% (0.713 versus 0.613) on area under the ROC curves. The experimental results imply that the proposed ICAM method is able to deliver better performance in the situation of small positive labeled data size.

Experiments on collaboration networks
In this section, we compare the performance of the proposed ICAM algorithm with other collective classification algorithms on 2 multi-label collaboration networks datasets to validate the effectiveness of the proposed method more thoroughly. These collaboration networks datasets are collected from the DBLP computer science bibliography website, and used in prior work to study the multi-label collective classification problems [29]. Their characteristics are listed in Table 2. Specifically, we extract DBLP coauthorship networks that contain authors who have published papers during the years 2000-2010 as the nodes of the networks, and link any two authors who have collaborated with each other. At each node, we extract a bag-ofwords representation of all the paper titles published by the author, and used it as the attributes of the node. Each author has one (or multiple) research topic(s) of interests from 6 research areas. The representative conferences from each area are selected as class labels. If an author has published papers in any of these conferences, we assume the author is interested in the corresponding research class. The task is to classify each author with a set of multiple research classes of interest. The conferences corresponding to the class labels of two datasets (DBLP-A and DBLP-B) are given as follows.
The classes of DBLP-A are as follows: We test ICAM and other ICA-type algorithms with different number of labeled instances from 1000 to 5000 with an increment of 500. The average results as well as standard deviation of a 10-time data split are given in Figure 4. The experimental results are in concordant with our previous study. We observe that ICAM consistently outperforms the other ICA-type methods on these two datasets, especially when there are only limited number of Table 2 The description of experimental datasets used in the experiments on collaboration networks.  labeled instances, i.e. larger(smaller) improvement is obtained with less(more) labeled data.

Conclusion
In this paper, we studied the label deficiency problem in collective classification (CC). We showed the protein function prediction problem from PPI networks can be formulated as a problem, and proposed an effective and novel Markov chain based CC learning algorithm, namely ICAM. It focuses on how to use labeled and unlabeled data to enhance the classification performance of PPI network data. Experimental results on two real-world Yeast PPI network datasets and two collaboration network datasets showed that our proposed ICAM method is effective in learning CC tasks in the paucity of labeled data. In future, we will consider other semi-supervised learning techniques for collective classification in PPI network data and we will also research on other complex biological networks, such as heterogeneous network classification.