Adaptive diffusion kernel learning from biological networks for protein function prediction

Background Machine-learning tools have gained considerable attention during the last few years for analyzing biological networks for protein function prediction. Kernel methods are suitable for learning from graph-based data such as biological networks, as they only require the abstraction of the similarities between objects into the kernel matrix. One key issue in kernel methods is the selection of a good kernel function. Diffusion kernels, the discretization of the familiar Gaussian kernel of Euclidean space, are commonly used for graph-based data. Results In this paper, we address the issue of learning an optimal diffusion kernel, in the form of a convex combination of a set of pre-specified kernels constructed from biological networks, for protein function prediction. Most prior work on this kernel learning task focus on variants of the loss function based on Support Vector Machines (SVM). Their extensions to other loss functions such as the one based on Kullback-Leibler (KL) divergence, which is more suitable for mining biological networks, lead to expensive optimization problems. By exploiting the special structure of the diffusion kernel, we show that this KL divergence based kernel learning problem can be formulated as a simple optimization problem, which can then be solved efficiently. It is further extended to the multi-task case where we predict multiple functions of a protein simultaneously. We evaluate the efficiency and effectiveness of the proposed algorithms using two benchmark data sets. Conclusion Results show that the performance of linearly combined diffusion kernel is better than every single candidate diffusion kernel. When the number of tasks is large, the algorithms based on multiple tasks are favored due to their competitive recognition performance and small computational costs.


Background
Many types of genomic data can be represented as a graph (network), where the nodes represent genes or proteins, and edges may represent similarities between protein sequences, edges in a metabolic pathway, and physical interactions between proteins [1]. Machine learning tools have been commonly used to analyze biological networks for knowledge discovery and pattern analysis [2]. In this paper, we focus on learning from biological networks for protein function prediction. This problem has been studied extensively by using computational approaches recently [1]. Neighborhood-based methods [3,4] assign functions to proteins based on the most frequent functions within a neighborhood of the protein and they differ mainly in how the "neighborhood" of a protein is defined. More sophisticated prediction functions have been exploited in [5,6]. Methods based on network diffusion [7,8] view the protein network as a flow network and functions of proteins are diffused from annotated proteins to their neighbors in various ways. Other approaches for protein function annotation from biological networks include the graph-cut-based approaches [9,10] and those derived from the kernel methods [11][12][13].
Kernel methods are versatile tools for learning from graph-based data, as they only require the characterization of similarities between objects by the use of kernel trick [2,14]. Diffusion kernels [15], which can be considered as the discretization of the well-known Gaussian kernel of Euclidean space, are commonly used for graphbased data. In kernel methods, the information on the data is conveyed only in the kernel function, which uniquely determines the mapping of the original inputs onto a feature space. Thus, one of the central issues in kernel methods is the selection of a good kernel function for a specific problem at hand. A recent trend in kernel learning (selection) is to formulate it as convex programs, which lead to a globally optimal solution [16]. The idea of learning a linear combination of pre-specified kernels for Support Vector Machines (SVM) was originally proposed in [17] where this problem was formulated as semidefinite programs (SDP) and Quadratically Constrained Quadratic Programs (QCQP). In general, approaches based on learning a convex combination of kernels offer the additional advantage of facilitating heterogeneous data integration from different sources [18].
The objective functions for kernel learning used in [17] are performance measures for hard margin SVM, 1-norm soft margin SVM, and 2-norm soft margin SVM. An alternative criterion for kernel matrix learning is the Kullback-Leibler (KL) divergence [19] between the two zero-mean Gaussian distributions defined by the input and output kernel matrices [20]. One particularly appealing feature of the KL divergence criterion is that unlabeled (test) data can be integrated naturally into the training process, thereby improving generalizations. The formulations in [17] also use unlabeled data, but in a weak form by enforcing the trace magnitude of the kernel matrix including both training and test data in the constraint. Direct incorporation of unlabeled data by the formulations in [17] through the KL divergence criterion involves a matrix determinant term. The resulting formulation is a so-called maximum-determinant problem [21], which is a general framework that contains semidefinite programming (SDP) [16] as a special case. Although its theoretical soundness, experiences with semidefinite programming indicate that it is computationally expensive and thus can not be scaled to large-scale problems. The maximumdeterminant problem is a more general framework than SDP and the path-following algorithms used to solve it is more expensive.
Diffusion kernels [15] capture the long-range relationships between vertices of graphs and are state-of-the-art for building kernels for graphs. In this paper, we focus on learning diffusion kernels constructed from biological networks, using the KL divergence criterion. In particular, we show that when the KL divergence criterion is used to optimize a convex combination of diffusion kernels with different parameters, the resulting optimization problem does not involve the matrix determinant term and thus can be solved by gradient descent methods. Previous studies [22,23] have shown that the removal of the matrixdeterminant term in the KL divergence criterion has limited effect on its performance. When this modified criterion is used to learn a linear combination of diffusion kernels, the resulting optimization problem is convex and thus solutions by gradient descent methods are guaranteed to be globally optimal. A protein typically performs multiple functions. Most existing approaches formulate a separate task for each of the functions and they are learned independently. They decouple the functions of proteins and potentially compromise the performance as the functions of proteins are usually related. We show that our single-task kernel learning formulation based on the KL divergence criterion can be extended to the multi-task case by enforcing all tasks to share a common kernel. The resulting formulation leads to a single optimization problem, which learns multiple functions of proteins simultaneously. Experimental results show that this multiple-task kernel learning in a joint optimization framework keeps competitive prediction performance, while its computational cost is similar to that for a single task, thus dramatically reducing the time complexity.

Methods
We study the problem of protein function prediction from biological networks, which are represented as graphs. For a graph , the vertices represent proteins and edges characterize the relationship between proteins. In the following discussion, the vertex and edge sets are denoted as V and E, respectively. The total number of proteins in the network is n = |V|. The adjacency matrix A is used to denote the similarity between vertices where A i,j describes the similarity between vertices v i and v j . The functions of some proteins in the network are already known and the goal of protein function prediction is to infer the functions of unannotated proteins based on the functions of annotated proteins and the network topology. In particu- lar, for a graph = (V, E), the vertices in V can be partitioned into a training set and a test set. The functions of proteins in the training set are already known while those of proteins in the test set are unknown. Each edge in E reflects the local similarities between its ending vertices. The learning problem is to predict the functions of proteins in the test set based on the label information of training set and the topology of the graph.

Background and Related Work
Kernel methods are particularly suitable for learning from graph-based data, as they only require the similarities between proteins to be encoded in the kernel matrix. In kernel methods, a symmetric function , where denotes the input space, is called a kernel function if it satisfies the Mercer's condition [14]. When used for a finite number of samples in practice, this condition can be stated as follows: for any x 1 , ...,x n ∈ the Gram Any kernel function κ implicitly maps the input set to a high-dimensional (possibly infinite) Hilbert space equipped with the inner product through a mapping The adjacency matrix A can't be directly used as a kernel matrix. First, the adjacency matrix contains the local similarity information only, which may not be effective for function prediction. Secondly, the adjacency matrix may not even be positive semidefinite. To derive a kernel matrix from the adjacency matrix, the idea of random walk and network diffusion has been used. The basic idea is to compute the global similarity between vertices v i and v j as the probability of reaching v j at some time point T when the random walker starts from v i . This idea is justified at least to some extent by observing that the random walker tends to meander around its origin as there is a larger number of paths of length |T| to its neighbors than to remote vertices [2].
To avoid some potential problems such as the choice of value for T and assurance of positive semidefiniteness for the kernel matrix, a random walk with an infinite number of infinitesimally small steps is used instead. It can be formally described as: where β is a parameter for controlling the extent of diffusion and L ∈ ‫ޒ‬ n × n is the graph Laplacian matrix defined as where A is the adjacency matrix, e is the vector of all ones, and diag(Ae) is a diagonal matrix with the diagonal entries being the corresponding row summation of the matrix A. It turns out that for any symmetric matrix L, e βL is always positive definite and thus can be used as a kernel matrix. The diffusion effect of such kernel can be explicitly seen when it is expanded as [2]: where the local information encoded in L is continuously diffused by repeated multiplications. The parameter β in the diffusion kernel controls the extent of diffusion and it has a similar effect as the scaling parameter in Gaussian kernels. If the β is too small, the local information can not be diffused effectively, resulting in a kernel matrix that only captures local similarities. On the other hand, if it is too large, the neighborhood information will be lost. Furthermore, the optimal value for β is problem and datadependent. Thus it is highly desirable to tune the β value adaptively from the data.
We approach the kernel tuning problem by learning an optimal kernel as a linear combination of pre-specified diffusion kernels constructed with different values of β. This is motivated from the work in [17] where the optimal kernel for SVM, in the form of a linear combination of pre-specified kernels, is learned based on the large margin criteria. In particular, the generalized performance measure based on 1-norm soft margin SVM used in [17] is where C > 0 is the regularization parameter in SVM, e is the vector of all ones, G(K) is defined by G ij (K) = k(x i , x j )y i y j , and the i-th entry of y denoted as y i is the class label (1 or -1) of the i-th data point x i . Lanckriet et al. [17] showed that when the optimal kernel is restricted to the linear combination of the given p kernels K 1 , ..., K p , the kernel learning problem can be formulated as a semidefinite program. Furthermore, when the coefficients of the linear combination are constrained to be non-negative, the kernel learning problem can be formulated as a Quadratically Constrained Quadratic Program [16]. As was shown in [20], an alternative performance measure is the KL divergence between the two zero-mean Gaussian distributions associated with the input and output kernel matrices. We show that when this KL divergence criterion is used to learn a linear combination of diffusion kernels constructed with different values of β, the resulting optimization problem can be solved efficiently. We further show that it can be extended to the multiple-task case. Such integration of multiple tasks into one optimization problem can potentially exploit the complementary information among different tasks.

Diffusion Kernel Learning: The Single-Task Case
We focus on learning an optimal kernel for a single task, which will then be extended to the multi-task case. The underlying idea is that the Laplacian matrix L, defined in Eq. (3), contains the connectivity information of all vertices in the graph. By adaptively tuning the kernel constructed from L on the training vertices, the entries corresponding to test vertices are expected to be tuned in some optimal way as well. To restrict the search space and improve the generalization ability, we focus on learning an optimal kernel as a linear combination of a set of diffusion kernels constructed with different values of β, indicating different extents of diffusion. In particular, we choose a sequence of values for β as β 1 , ...,β p , and the corresponding diffusion kernels can be constructed as We may assume that the kernels defined in Eq. (6) reflect our (weak) prior knowledge about the problem. The goal is to integrate the tuning of the coefficients into the learning process and the algorithm can adaptively select an optimal linear combination of the given kernels. Note that it is numerically favorable to normalize the kernels though this does not affect the results theoretically [14]. We normalize the kernels as follows: and the optimal kernel can be represented as for a set of non-negative coefficients

Kullback-Leibler Divergence Formulation
Kernel matrices are positive semidefinite and thus can be used as the covariance matrices for Gaussian distributions. It was shown in [20] that the kernel matrix can be learned by minimizing the Kullback-Leibler (KL) divergence between the zero-mean Gaussian distributions associated with the input and output kernel matrices. In this paper, we focus on learning the optimal coefficients α i from the data automatically based on minimizing this KL divergence criterion. As described in [20], the KL divergence between the zero-mean Gaussian distributions defined by the input kernel K x and output kernel K y can be expressed as where |·| denotes the matrix determinant, N x and N y denote the zero-mean Gaussian distributions associated with K x and K y , respectively, and n is the number of samples. When the output kernel K y is defined as K y = yy T , the KL divergence in Eq. (9) can be expressed as where "const" denotes terms that are independent of K x , and K x is the input kernel matrix, which is defined as a linear combination of the given p kernels as Note that a regularization term, with λ as the regularization parameter, is added to Eq. (11) to deal with the singularity problem of kernel matrices as in [20], and we require as in multiple kernel learning (MKL) [17]. The optimal coefficients α = [α 1 , ..., α p ] T are computed by minimizing KL(N y |N x ). By substituting Eq. (11) into Eq. (10), and removing the constant term, we obtain the following optimization problem: where α = (α 1 , ..., α p ) T , α 0 denotes that all components of α are non-negative, and the vector a ∈ ‫ޒ‬ n is the problem-specific target vector, corresponding to the general target in Eq. (9), defined as follows: Note that we assign the label 0 to vertices in the test set so that it will not bias towards either class. Similar idea has been used in [24] for semi-supervised learning. In multiple kernel learning [17], the sum-to-one constraint on the weights is enforced as in Eq. (12). We present results on both constrained and unconstrained formulations in the experiments. Results show that the constrained formulations achieved better performance than the unconstrained ones.
Recall that the graph Laplacian matrix L is symmetric, so its eigen-decomposition can be expressed as is the diagonal matrix of eigenvalues and P ∈ ‫ޒ‬ n × n is the orthogonal matrix of corresponding eigenvectors. According to the definition of the function of matrices [25], we have where The main result is summarized in the following theorem:

Theorem 1. Given a set of p diffusion kernels, as defined in
Eq. (7), the problem of learning the optimal kernel matrix, in the form of a convex combination of the given p kernel matrices as in Eq. (12), can be formulated as the following optimization problem: where b = (b 1 , ..., b n ) = P T a, g j is the j-th diagonal entry of the diagonal matrix G, defined as and D i is the diagonal matrix defined in Eq. (16).
Proof. The first term in Eq. (12) can be written as: where the third equality follows from the property of the trace, that is, Similarly, the second term in Eq. (12) can be written as: By combining the first term in Eq. (21) and the second term in Eq. (22), we prove the theorem.
The formulation in Theorem 1 is a nonlinear optimization problem. It involves a nonlinear objective function with p variables and linear equality and inequality con-  (22) straints. Due to the presence of the log term in the objective, it is a non-convex problem and a globally optimal solution may not exist. However, our experimental results show that this formulation consistently produces superior performance.

Convex Formulation
The optimization problem in Theorem 1 is not convex. Previous studies [22,23] indicate that the removal of the log determinant term in the KL divergence criterion in Eq. (12) has a limited effect on the performance. This leads to the following optimization problem: Following Theorem 1, we can show that the optimization problem above can be simplified as where g j and b are defined as in Theorem 1.
The optimization problem in Eq. (26) is convex and thus a globally optimal solution exists. Numerical experiments indicate that the simple gradient descent algorithm converges very quickly to the optimal solution. Furthermore, the prediction performance of this convex formulation is comparable to that of the formulation proposed in Theorem 1. This convex formulation shares some similarities with the one in [26], where a set of Laplacian matrices derived from multiple networks is combined.

Diffusion Kernel Learning: The Multi-Task Case
It is known that proteins often perform multiple functions, which are typically related. Many existing function prediction approaches decouple multiple functions and formulate each function prediction problem as a separate binary-class classification problem. Such methods do not consider the relationship among the multiple functions of a protein and potentially compromise the overall performance.
We propose to extend our formulation for the single-task case to deal with multiple tasks simultaneously. In particular, we formulate a single optimization problem for the simultaneous prediction of multiple functions for a protein. The joint learning of multiple functions can potentially exploit the relationship among functions and improve the performance. In terms of computational complexity, the proposed joint optimization problem is shown to be comparable to that of the single-task formulation.
A key observation is that when the pre-specified diffusion kernels are computed from the same biological network with different values of β, the graph Laplacian matrices are the same for all tasks. By enforcing all tasks to share a common linear combination of kernels, we obtain the following joint optimization problem: where a (k) ∈ ‫ޒ‬ n for i = 1, ..., t is the vector of class labels for the k-th task as in Eq. (13), and t is the number of tasks. Note that all t tasks are related in this joint formulation by enforcing a common kernel matrix for all tasks. The objective function in Eq. (27) uses an equal weight for all tasks. If some tasks are known to be more important than others, a more general objective function with varying weights for different tasks may be used instead. Following Theorem 1, we can simplify the optimization problem in Eq. (27), as summarized in the following theorem: Theorem 2. Given a set of p diffusion kernels, as defined in Eq. (7), the problem of optimal multi-task kernel learning, in the form of a convex combination of the given p kernels, can be formulated as the following optimization problem: where g j is defined as in Theorem 1, b k = P T a (k) , a (k) is defined as in Eq. (13) for the k-th task, and t is the total number of tasks.
Proof. The first term in Eq. (27) can be rewritten as Similarly, the second term can be rewritten as The detailed intermediate steps of derivation are the same as those in the proof of Theorem 1 and thus are omitted. By combining these two terms together, we prove the theorem.
The optimization problem in Theorem 2 is not convex. Similar to the single-task case, the log determinant term in Eq. (27) may be removed, which leads to the following convex optimization problem: Experimental evidences show that this convex optimization problem is comparable to the formulation in Theorem 2 in prediction performance.

Results and Discussion
We evaluate the performance of the proposed formulations on two benchmark data sets, and compare them with relevant methods, including the Neighbor Counting approach [4] and the FS-Weighted Averaging approach [5,6]. We construct 60 diffusion kernels from each data set using different values for β and the proposed formulations are applied to compute a linear combination of the pre-computed kernels. The performance of the obtained kernel is compared with that of the individual kernel. To see the relative performance of the objective functions, we also use the 1-norm soft margin SVM criterion, proposed in [17], to compute the linear combination of kernels and the results are presented. All of the formulations proposed in this paper are solved using the MATLAB [27] function fmincon which employs the sequential quadratic programming method [28]. The QCQP formulation for optimizing the 1-norm soft margin SVM criterion is solved using the MOSEK [29] software package. After the kernels are computed, they are fed into SVM for classification and the LIBSVM [30] software package is used in the experiments. All of the experiments are performed on a PC with Intel Pentium D 820 2.8G CPU and 2G RAM.
In the following experiments, a total of 60 diffusion kernels are pre-computed and the values of β used are β i = -0.1 × i, for i = 1, ..., 60. In order to investigate the performance of each individual kernel, we use each kernel for the classification and compute the average Receiver Operating Characteristic (ROC) values over all of the tasks. The ROC value produced by the best averaged individual kernel is used as a baseline. It is called rBaseline as all tasks are restricted to use the same kernel. We further relax the requirement that all tasks use the same kernel and compute the sequence of ROC values achieved by the best individual kernel for each of the tasks. This is considered another baseline called uBaseline as the kernel used by each task is unrestricted. Note that the kernel matrices for both rBaseline and uBaseline represent the single best candidate kernel in the ideal case that the labels of test data are known, and their performance is not guaranteed in practice. In contrast, the kernel matrices computed by the proposed formulations are the optimal kernel matrices in the form of linear combination of the given candidate kernel matrices. In order to evaluate the effectiveness of the weights obtained by the proposed formulations, we assign each kernel the same weight and compute the performance of the combined kernel. It is called eBaseline as all kernel matrices have an equal weight.
DKL KL u mDKL KL u norm soft margin SVM criterion by solving QCQP proposed in [17] is denoted as SM1. The six proposed formulations are summarized in Table 1.

Experiments on the Ligand Data Set
The Ligand data set was derived by Vert and Kanehisa [31] from the Ligand database of chemical reactions in biological pathways [32]. It contains a graph reflecting the interactions between proteins and the function information for them. The graph is a yeast biological network in which a path between vertices implies a possible series of reactions catalyzed by proteins along it. The numbers of vertices and edges in this graph are 753 and 7860, respectively. For the functions of proteins, the functional categories of the MIPS Comprehensive Yeast Genome Database (CYGD) [33] are considered as the gold standard. These categories are not mutually exclusive, and each protein may have multiple functions. There are 36 different functions considered for this data set.

Comparison of ROC Values
We use the ROC as the performance measure and the λ value is fixed to 10 -6 in the experiments. Our experimental results show that the algorithms are not sensitive to the value of λ, as long as it is neither too large nor too small. Figure 1 plots the number of tasks with ROC value above a threshold for all methods. The average ROC values achieved by all methods are also summarized in Table 2.
In order to test statistical significance, we also compute the p-values of Wilcoxon signed test and the results are reported in Table 3. We can observe that mDKL achieves the best performance among all methods. All the pro-posed formulations except outperform the three baseline methods. This implies that the computed linear combination of kernels can potentially exploit the complementary information in different kernels and thus improve performance. The ROC value achieved by SM1 is lower than those of the three baseline methods, implying that the SVM criterion is less effective for such tasks. Note that the SM1 criterion also uses information from unlabeled data, but in a weak form. The formulation achieves a ROC value lower than the three baseline methods. This shows that the constraints have important normalizing effects and can not be removed. By comparing the relative performance of formulations with and without the log term, we can conclude that removing this term usually does not affect the performance. Another important observation is that mDKL and mDKL KL outperform DKL and DKL KL , implying that constraining the multiple tasks to share a common kernel does not degrade the performance if the kernel used is a linear combination of kernels obtained by the proposed formulations. In contrast, if the kernel used is a single kernel, this restriction will degrade the performance, as illustrated by the relative performance of rBaseline and uBaseline. For the eBaseline method, it can be observed that, except for , all of other proposed formulations outperform it. This illustrates that our formulations can compute an optimal kernel matrix by assigning different weights to the candidate kernel matrices. We can observe from Table 3 that the difference between the performance of the two baseline methods (rBaseline and eBaseline) and that of DKL and mDKL are statistically significant. All diffusion kernel based approaches are competitive with the Neighbor Counting approach [4] and the FS-Weighted Averaging approach [5,6]. Neighbor Counting and FS-Weighted Averaging use the local information, more specifically the level-1 neighborhood (Neighbor Counting) and both level-1 and level-2 neighborhoods (FS-Weighted Averaging), for the prediction. The experimental results show the effectiveness of capturing the long-range relationships (global information) between proteins in the network in diffusion kernels [15].   Figure 4 the ROC values obtained by the proposed formulations with respect to uBaseline using scatter plots. We can observe that there are two points below the 45-degree line in each plot. Those two points correspond to tasks 29 and 33 and they are difficult to classify by all methods. As most points in the plots are above the 45-degree line, we can conclude that the proposed formulations outperform uBasline on most tasks.

Comparison of Execution Time
In order to compare the efficiency of various kernel learning methods, we list in Table 2 the execution time of the compared methods. It can be observed that all methods based on multiple tasks are more efficient than their single-task counterparts. In particular, the execution time of mDKL is roughly 1/36 of that of DKL, which is consistent with our theoretical analysis. In general, convex formulations are more efficient than their non-convex original  Mean ROC values over 36 tasks for each kernel on the Lig-and Data Set (the kernel with the maximum mean ROC value is used in rBaseline) Figure 2 Mean ROC values over 36 tasks for each kernel on the Ligand Data Set (the kernel with the maximum mean ROC value is used in rBaseline). The horizontal axis denotes the -β values used to build the corresponding kernel and the vertical axis is the mean ROC value. formulations and the optimization problems with the constraints removed take a longer time to converge. By taking the performance into account, the DKL and mDKL may be the best choices in practice.

Stability Test
In order to obtain a robust performance estimate for the various methods, we randomly partition the data set into a training set and a test set ten times and the average ROC values and standard deviations across splittings are reported in Table 4. Compared with the results in Table 2, we can see that the relative performance of each method in these two tables is very similar. In particular, mDKL and mDKL KL achieve the best overall performance. Except for the two unconstrained formulations and , all of other proposed formulations achieve higher ROC values than the three baseline methods. It is worth noting that the performance of uBaseline and rBaseline is obtained by using the labels of both the training and test data and such performance is not guaranteed in practice when only the labels of the training data are used.

Experiments on the von Mering Data Set
The von Mering data set was created by von Mering et al. [34] from protein-protein interactions identified via six different methods. It contains a graph consisting of 2617 vertices (proteins) and 11855 edges. There are 76 different functions (tasks) associated with the proteins in the graph. The performance of different methods is reported in Figure 5. Two baseline methods, rBaseline and uBaseline, constructed exactly the same way as those for the Ligand data set are used and their performance is summarized in Figure 6 and Figure 7, respectively. The value for is again set to 10 -6 in the experiments. Figure 8 compares the relative performance of the proposed formulations with that of the uBaseline graphically.

Comparison of ROC Values
We use the ROC values of each method to compare their relative performance. Similar to Figure 1 for the Ligand data set, Figure 5 plots the change of the number of tasks with ROC value above a certain threshold as the threshold varies for each of the compared method. For ease of comparison, Table 5 also lists the average ROC values achieved by the compared methods. Similarly, the p-values of Wilcoxon signed test for this data set are reported in Table 6. As the SM1 formulation requires excessive storage and computational time for this relatively large data set, we are not able to obtain its result in this experiment. From these results we can observe that mDKL and mDKL KL achieve the best performance. In general, the performance of DKL, Best ROC values for diferent tasks achieved by different ker-nels on the von Mering Data Set    the three baseline methods. The difference between DKL and DKL KL as well as the difference between mDKL and mDKL KL is very small, which further confirms that the removal of the log term does not affect the performance of algorithm much. For the formulations with constraints removed, i.e., and , their performance is the lowest among the proposed formulations. Similar to the case for the Ligand data set, we conclude that constraining the multiple tasks to share a common kernel does not degrade the performance if the kernel used is a linear combination of kernels obtained by the proposed formulations. In contrast, if the kernel used is a single kernel, this restriction will degrade the performance, as illustrated by the relative performance of rBaseline and uBaseline. In terms of the eBaseline, we can observe from Table 5 that all of our proposed formulations achieve higher ROC values than the eBaseline method, in which all of the kernel matrices are assigned the same weight. We can observe from Table 6 that the difference between the performance of all of the three baselines and that of DKL and mDKL is statistically significant. We can again observe that all diffusion kernel based approaches are competitive with the Neighbor Counting approach and the FS-Weighted Averaging approach. Figure 8 presents the scatter plots of four proposed formulations with respect to uBaseline. It can be observed that most points are above the 45-degree line, which implies that the linear combination of kernels is better than the ideally best individual kernel. In general, the performance of DKL KL , DKL, mDKL KL , mDKL is better than uBaseline. And this is also confirmed by the mean ROC values listed in Table 5. Table 5 also lists the execution time of various kernel learning methods. Similar conclusions can be drawn from this table as to the execution time on the Ligand data set. All methods based on multiple tasks are more efficient than their single-task counterparts. By comparing the results in Table 2 and Table 5 we can also observe that as the number of tasks increases, the time difference between methods based on multiple tasks and those based on single tasks increases too. Thus, the formulations based on multiple tasks are preferred when the number of tasks is large.

Stability Test
Similar to the Ligand data set, we generate ten random splittings of the data into training and test sets and report the average ROC values and standard deviations in Table  7. By comparing with results in Table 5, we can see that the relative performance of each method is similar in both tables. All of the proposed formulations outperform eBaseline.

Conclusion
In this paper, we address the issue of learning an optimal diffusion kernel based on KL divergence criterion for protein function prediction. By exploiting the special structure of the diffusion kernel, we show that this KL   --------DKL KL u mDKL KL u divergence based kernel learning problem can be formulated as a simple optimization problem, which can be solved efficiently. We also extend the formulation to the multi-task case where we predict multiple functions of a protein simultaneously.
We have conducted experiments on two benchmark data sets. Our results show that the performance of linearly combined diffusion kernel is better than every single candidate diffusion kernel. Results also show that the removal of the log term in the KL divergence criterion does not degrade its recognition performance, while it leads to a reduced computational cost. When the number of tasks is large, the algorithms based on multiple tasks are favored due to their competitive recognition performance and small computational costs. One possible extension is to incorporate the learning of the regularization parameter in the proposed formulations as in [17]. The difference between the proposed learning framework and those in [17] is that our formulations require that the eigenvectors of the candidate kernel matrices to be the same. Thus the proposed formulations may not be applied for heterogeneous data integration. We plan to apply the proposed algorithms for the analysis of other graph-based biological data.

Authors' contributions
LS designed the methodology, implemented programs, and participated in manuscript preparation. SJ derived the KL divergence formulation, and drafted the manuscript. JY originally conceived the project, guided the implementation, and drafted the manuscript. All authors have read and approved the final manuscript.