We study the problem of protein function prediction from biological networks, which are represented as graphs. For a graph \mathcal{G}, the vertices represent proteins and the edges characterize the relationships between proteins. In the following discussion, the vertex and edge sets are denoted by *V* and *E*, respectively. The total number of proteins in the network is *n* = |*V*|. The adjacency matrix *A* is used to encode the similarity between vertices, where *A*_{i,j} describes the similarity between vertices *v*_{i} and *v*_{j}. The functions of some proteins in the network are already known, and the goal of protein function prediction is to infer the functions of unannotated proteins based on the functions of annotated proteins and the network topology. In particular, for a graph \mathcal{G} = (*V*, *E*), the vertices in *V* can be partitioned into a training set and a test set. The functions of proteins in the training set are known, while those of proteins in the test set are not. Each edge in *E* reflects the local similarity between its end vertices. The learning problem is to predict the functions of proteins in the test set based on the label information of the training set and the topology of the graph.

### Background and Related Work

Kernel methods are particularly suitable for learning from graph-based data, as they only require the similarities between proteins to be encoded in a kernel matrix. In kernel methods, a symmetric function \kappa :\mathcal{X}\times \mathcal{X}\to \mathbb{R}, where \mathcal{X} denotes the input space, is called a kernel function if it satisfies Mercer's condition [14]. When used for a finite number of samples in practice, this condition can be stated as follows: for any *x*_{1}, ..., *x*_{n} ∈ \mathcal{X}, the *Gram* matrix *K* ∈ ℝ^{n × n}, defined by *K*_{ij} = *κ*(*x*_{i}, *x*_{j}), is positive semidefinite. Any kernel function *κ* implicitly maps the input set \mathcal{X} to a high-dimensional (possibly infinite-dimensional) Hilbert space {\mathscr{H}}_{\kappa}, equipped with the inner product {\left(\cdot ,\cdot \right)}_{{\mathscr{H}}_{\kappa}}, through a mapping {\phi}_{\kappa}:\mathcal{X}\to {\mathscr{H}}_{\kappa}:

\kappa (x,z)={({\phi}_{\kappa}(x),{\phi}_{\kappa}(z))}_{{\mathscr{H}}_{\kappa}}.

(1)

The adjacency matrix *A* cannot be used directly as a kernel matrix. First, the adjacency matrix contains only local similarity information, which may not be effective for function prediction. Second, the adjacency matrix may not even be positive semidefinite. To derive a kernel matrix from the adjacency matrix, the ideas of random walks and network diffusion have been used. The basic idea is to compute the global similarity between vertices *v*_{i} and *v*_{j} as the probability of reaching *v*_{j} at some time point *T* when the random walker starts from *v*_{i}. This idea is justified, at least to some extent, by observing that the random walker tends to meander around its origin, as there is a larger number of paths of length *T* to its neighbors than to remote vertices [2].

To avoid potential problems such as the choice of a value for *T* and the assurance of positive semidefiniteness of the kernel matrix, a random walk with an infinite number of infinitesimally small steps is used instead. It can be formally described as:

K=\underset{s\to \infty}{\mathrm{lim}}{\left(I+\frac{\beta L}{s}\right)}^{s}={e}^{\beta L},

(2)

where *β* is a parameter controlling the extent of diffusion and *L* ∈ ℝ^{n × n} is the graph Laplacian matrix defined as

*L* = diag(*Ae*) - *A*,

(3)

where *A* is the adjacency matrix, *e* is the vector of all ones, and diag(*Ae*) is a diagonal matrix whose diagonal entries are the corresponding row sums of the matrix *A*. It turns out that for any symmetric matrix *L*, *e*^{βL} is always positive definite and thus can be used as a kernel matrix. The diffusion effect of such a kernel can be seen explicitly when it is expanded as [2]:

{e}^{\beta L}=I+\beta L+\frac{{\beta}^{2}}{2}{L}^{2}+\frac{{\beta}^{3}}{6}{L}^{3}+\cdots ,

(4)

where the local information encoded in *L* is continuously diffused by repeated multiplications. The parameter *β* in the diffusion kernel controls the extent of diffusion, and it has an effect similar to that of the scaling parameter in Gaussian kernels. If *β* is too small, the local information cannot be diffused effectively, resulting in a kernel matrix that captures only local similarities. On the other hand, if it is too large, the neighborhood information will be lost. Furthermore, the optimal value of *β* is problem- and data-dependent. It is thus highly desirable to tune the value of *β* adaptively from the data.
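The construction in Eqs. (2)-(4) is easy to verify numerically. The following sketch (the 4-vertex path graph, the value of *β*, and the step count *s* are illustrative assumptions, not from the text) builds *L* = diag(*Ae*) - *A*, computes *e*^{βL} exactly through the eigendecomposition of *L*, and checks it against the limit definition of Eq. (2):

```python
import numpy as np

# Hypothetical 4-vertex path graph: adjacency matrix A (assumed example).
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = np.diag(A @ np.ones(4)) - A               # L = diag(Ae) - A

beta = 0.5
d, P = np.linalg.eigh(L)                      # L = P diag(d) P^T
K = P @ np.diag(np.exp(beta * d)) @ P.T       # exact e^{beta L}

# Eq. (2): (I + beta*L/s)^s approaches e^{beta L} for large s.
s = 1_000_000
K_limit = np.linalg.matrix_power(np.eye(4) + beta * L / s, s)
assert np.allclose(K, K_limit, atol=1e-4)

# e^{beta L} is positive definite for any symmetric L, hence a valid kernel.
assert np.all(np.linalg.eigvalsh(K) > 0)
```

The positive-definiteness check at the end is exactly the property that makes *e*^{βL} usable as a kernel matrix regardless of whether *A* itself is positive semidefinite.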

We approach the kernel tuning problem by learning an optimal kernel as a linear combination of pre-specified diffusion kernels constructed with different values of *β*. This is motivated by the work in [17], where the optimal kernel for SVM, in the form of a linear combination of pre-specified kernels, is learned based on the large margin criterion. In particular, the generalized performance measure based on the 1-norm soft margin SVM used in [17] is

{\omega}_{S1}(K)=\underset{\alpha :C\ge \alpha \ge 0,{\alpha}^{T}y=0}{\mathrm{max}}\{2{\alpha}^{T}e-{\alpha}^{T}G(K)\alpha \},

(5)

where *C* > 0 is the regularization parameter in SVM, *e* is the vector of all ones, *G*(*K*) is defined by *G*_{ij}(*K*) = *κ*(*x*_{i}, *x*_{j})*y*_{i}*y*_{j}, and the *i*-th entry of *y*, denoted as *y*_{i}, is the class label (1 or -1) of the *i*-th data point *x*_{i}. Lanckriet *et al*. [17] showed that when the optimal kernel is restricted to a linear combination of the given *p* kernels *K*_{1}, ..., *K*_{p}, the kernel learning problem can be formulated as a semidefinite program. Furthermore, when the coefficients of the linear combination are constrained to be non-negative, the kernel learning problem can be formulated as a quadratically constrained quadratic program [16]. As shown in [20], an alternative performance measure is the KL divergence between the two zero-mean Gaussian distributions associated with the input and output kernel matrices. We show that when this KL divergence criterion is used to learn a linear combination of diffusion kernels constructed with different values of *β*, the resulting optimization problem can be solved efficiently. We further show that it can be extended to the multi-task case. Such integration of multiple tasks into one optimization problem can potentially exploit the complementary information among different tasks.

### Diffusion Kernel Learning: The Single-Task Case

We focus on learning an optimal kernel for a single task, which will then be extended to the multi-task case. The underlying idea is that the Laplacian matrix *L*, defined in Eq. (3), contains the connectivity information of all vertices in the graph. By adaptively tuning the kernel constructed from *L* on the training vertices, the entries corresponding to test vertices are expected to be tuned in some optimal way as well. To restrict the search space and improve the generalization ability, we focus on learning an optimal kernel as a linear combination of a set of diffusion kernels constructed with different values of *β*, representing different extents of diffusion. In particular, we choose a sequence of values *β*_{1}, ..., *β*_{p}, and the corresponding diffusion kernels can be constructed as

\begin{array}{cc}{K}_{i}={e}^{{\beta}_{i}L},& i=1,\cdots ,p.\end{array}

(6)

We may assume that the kernels defined in Eq. (6) reflect our (weak) prior knowledge about the problem. The goal is to integrate the tuning of the coefficients into the learning process so that the algorithm can adaptively select an optimal linear combination of the given kernels. Note that it is numerically favorable to normalize the kernels, though this does not affect the results theoretically [14]. We normalize the kernels as follows:

{\tilde{K}}_{i}=\frac{{e}^{{\beta}_{i}L}}{\text{trace}({e}^{{\beta}_{i}L})},

(7)

and the optimal kernel can be represented as

{K}_{opt}={\displaystyle \sum _{i=1}^{p}{\alpha}_{i}{\tilde{K}}_{i}}={\displaystyle \sum _{i=1}^{p}{\alpha}_{i}\frac{{e}^{{\beta}_{i}L}}{\text{trace}({e}^{{\beta}_{i}L})},}

(8)

for a set of non-negative coefficients {\left\{{\alpha}_{i}\right\}}_{i=1}^{p}.
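The normalization of Eq. (7) and the combination of Eq. (8) can be sketched in a few lines. In the snippet below, the triangle graph, the *β* grid, and the particular weights α are assumed for illustration; note that because each normalized kernel has unit trace, the trace of the combination equals the sum of the weights:

```python
import numpy as np

# Assumed toy example: triangle graph, three beta values.
A = np.array([[0, 1, 1],
              [1, 0, 1],
              [1, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A
d, P = np.linalg.eigh(L)

betas = [0.1, 1.0, 10.0]                 # beta_1, ..., beta_p
K_tilde = []
for b in betas:
    K = P @ np.diag(np.exp(b * d)) @ P.T          # e^{beta_i L}
    K_tilde.append(K / np.trace(K))               # Eq. (7): unit trace

alpha = np.array([0.2, 0.5, 0.3])        # non-negative coefficients, sum to 1
K_opt = sum(w * K for w, K in zip(alpha, K_tilde))  # Eq. (8)

# trace(K_opt) = sum_i alpha_i * trace(K~_i) = sum_i alpha_i.
assert np.isclose(np.trace(K_opt), alpha.sum())
```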

#### Kullback-Leibler Divergence Formulation

Kernel matrices are positive semidefinite and thus can be used as the covariance matrices of Gaussian distributions. It was shown in [20] that the kernel matrix can be learned by minimizing the Kullback-Leibler (KL) divergence between the zero-mean Gaussian distributions associated with the input and output kernel matrices. In this paper, we focus on learning the optimal coefficients *α*_{i} from the data automatically by minimizing this KL divergence criterion. As described in [20], the KL divergence between the zero-mean Gaussian distributions defined by the input kernel *K*_{x} and output kernel *K*_{y} can be expressed as

\text{KL}({N}_{y}|{N}_{x})=\frac{1}{2}\text{trace}({K}_{y}{K}_{x}^{-1})+\frac{1}{2}\mathrm{log}|{K}_{x}|-\frac{1}{2}\mathrm{log}|{K}_{y}|-\frac{n}{2},

(9)

where |·| denotes the matrix determinant, *N*_{x} and *N*_{y} denote the zero-mean Gaussian distributions associated with *K*_{x} and *K*_{y}, respectively, and *n* is the number of samples. When the output kernel *K*_{y} is defined as *K*_{y} = **yy**^{T}, the KL divergence in Eq. (9) can be expressed as

\text{KL}({N}_{y}|{N}_{x})=\frac{1}{2}{y}^{T}{K}_{x}^{-1}y+\frac{1}{2}\mathrm{log}|{K}_{x}|+\text{const},

(10)

where "const" denotes terms that are independent of *K*_{x}, and *K*_{x} is the input kernel matrix, defined as a linear combination of the given *p* kernels as

{K}_{x}={\displaystyle \sum _{i=1}^{p}{\alpha}_{i}{\tilde{K}}_{i}+\lambda I}={\displaystyle \sum _{i=1}^{p}{\alpha}_{i}\frac{{e}^{{\beta}_{i}L}}{\text{trace}({e}^{{\beta}_{i}L})}+\lambda I}.

(11)

Note that a regularization term, with *λ* as the regularization parameter, is added in Eq. (11) to deal with the potential singularity of kernel matrices, as in [20], and we require {\displaystyle {\sum}_{i=1}^{p}{\alpha}_{i}=1} as in multiple kernel learning (MKL) [17]. The optimal coefficients *α* = [*α*_{1}, ..., *α*_{p}]^{T} are computed by minimizing KL(*N*_{y}|*N*_{x}). By substituting Eq. (11) into Eq. (10) and removing the constant term, we obtain the following optimization problem:

\begin{array}{l}\underset{\alpha}{\mathrm{min}}\left\{{a}^{T}{\left({\displaystyle \sum _{i=1}^{p}{\alpha}_{i}{\tilde{K}}_{i}+\lambda I}\right)}^{-1}a+\mathrm{log}\left|{\displaystyle \sum _{i=1}^{p}{\alpha}_{i}{\tilde{K}}_{i}+\lambda I}\right|\right\}\hfill \\ \begin{array}{cc}\text{s}\text{.t}\text{.}& {\displaystyle \sum _{i=1}^{p}{\alpha}_{i}=1,}\\ \alpha \ge 0,\end{array}\hfill \end{array}

(12)

where *α* = (*α*_{1}, ..., *α*_{p})^{T}, *α* ⩾ 0 denotes that all components of *α* are non-negative, and the vector *a* ∈ ℝ^{n} is the problem-specific target vector, corresponding to the general target in Eq. (9), defined as follows:

{a}_{i}=\{\begin{array}{ll}1\hfill & \text{if }{v}_{i}\text{ is in the positive class},\hfill \\ -1\hfill & \text{if }{v}_{i}\text{ is in the negative class},\hfill \\ 0\hfill & \text{if }{v}_{i}\text{ is in the test set}.\hfill \end{array}

(13)

Note that we assign the label 0 to vertices in the test set so that they do not bias the solution towards either class. A similar idea has been used in [24] for semi-supervised learning. In multiple kernel learning [17], the sum-to-one constraint on the weights is enforced, as in Eq. (12). We present results for both the constrained and unconstrained formulations in the experiments. The results show that the constrained formulations achieve better performance than the unconstrained ones.
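A minimal illustration of the target vector of Eq. (13), for an assumed toy split (vertices 0 and 1 annotated positive, vertex 2 negative, and the rest forming the test set):

```python
import numpy as np

n = 6
a = np.zeros(n)        # test vertices keep the neutral label 0
a[[0, 1]] = 1.0        # vertices in the positive class (assumed split)
a[[2]] = -1.0          # vertex in the negative class (assumed split)

assert list(a) == [1.0, 1.0, -1.0, 0.0, 0.0, 0.0]
```

Starting from zeros and overwriting only the annotated entries mirrors the definition directly: every vertex not in the training set automatically receives the neutral label.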

Recall that the graph Laplacian matrix *L* is symmetric, so its eigendecomposition can be expressed as

*L* = *PDP*^{T},

(14)

where *D* = diag(*d*_{1}, ..., *d*_{n}) is the diagonal matrix of eigenvalues and *P* ∈ ℝ^{n × n} is the orthogonal matrix of corresponding eigenvectors. According to the definition of a function of a matrix [25], we have

{e}^{{\beta}_{i}L}=P{D}_{i}{P}^{T},

(15)

where

{D}_{i}=\text{diag}({e}^{{\beta}_{i}{d}_{1}},\cdots ,{e}^{{\beta}_{i}{d}_{n}}).

(16)

The main result is summarized in the following theorem:

**Theorem 1**. *Given a set of p diffusion kernels, as defined in Eq. (7), the problem of learning the optimal kernel matrix, in the form of a convex combination of the given p kernel matrices as in Eq. (12), can be formulated as the following optimization problem:*

\begin{array}{cc}{\mathrm{min}}_{\alpha}& {\displaystyle \sum _{j=1}^{n}\left(\frac{{b}_{j}^{2}}{{g}_{j}}+\mathrm{log}({g}_{j})\right)}\end{array}

(17)

\begin{array}{cc}subject\phantom{\rule{0.5em}{0ex}}to& {\displaystyle \sum _{i=1}^{p}{\alpha}_{i}=1},\end{array}

(18)

*α* ≥ 0,

*where b* = (*b*_{1}, ..., *b*_{n})^{T} = *P*^{T}*a*, *g*_{j} *is the j-th diagonal entry of the diagonal matrix G, defined as*

G={\displaystyle \sum _{i=1}^{p}{\alpha}_{i}\frac{{D}_{i}}{\text{trace}({D}_{i})}+\lambda I},

(20)

*and D*_{i} *is the diagonal matrix defined in Eq. (16).*

*Proof*. The first term in Eq. (12) can be written as:

\begin{array}{lll}{a}^{T}{\left({\displaystyle \sum _{i=1}^{p}{\alpha}_{i}{\tilde{K}}_{i}+\lambda I}\right)}^{-1}a\hfill & =\hfill & {a}^{T}{\left({\displaystyle \sum _{i=1}^{p}{\alpha}_{i}\frac{{e}^{{\beta}_{i}L}}{\text{trace}({e}^{{\beta}_{i}L})}+\lambda I}\right)}^{-1}a\hfill \\ =\hfill & {a}^{T}P{\left({\displaystyle \sum _{i=1}^{p}{\alpha}_{i}\frac{{D}_{i}}{\text{trace}({e}^{{\beta}_{i}L})}+\lambda I}\right)}^{-1}{P}^{T}a\hfill \\ =\hfill & {b}^{T}{\left({\displaystyle \sum _{i=1}^{p}{\alpha}_{i}\frac{{D}_{i}}{\text{trace}({D}_{i})}+\lambda I}\right)}^{-1}b\hfill \\ =\hfill & {b}^{T}{G}^{-1}b={\displaystyle \sum _{j=1}^{n}\frac{{b}_{j}^{2}}{{g}_{j}},}\hfill \end{array}

(21)

where the third equality follows from the property of the trace, that is,

\text{trace}({e}^{{\beta}_{i}L})=\text{trace}(P{D}_{i}{P}^{T})=\text{trace}({D}_{i}).

Similarly, the second term in Eq. (12) can be written as:

\begin{array}{lll}\mathrm{log}\left|{\displaystyle \sum _{i=1}^{p}{\alpha}_{i}{\tilde{K}}_{i}+\lambda I}\right|\hfill & =\hfill & \mathrm{log}\left|{\displaystyle \sum _{i=1}^{p}{\alpha}_{i}\frac{{e}^{{\beta}_{i}L}}{\text{trace}({e}^{{\beta}_{i}L})}+\lambda I}\right|\hfill \\ =\hfill & \mathrm{log}\left|{\displaystyle \sum _{i=1}^{p}{\alpha}_{i}\frac{{e}^{{\beta}_{i}D}}{\text{trace}({e}^{{\beta}_{i}D})}+\lambda I}\right|\hfill \\ =\hfill & \mathrm{log}\left|G\right|\hfill \\ =\hfill & \mathrm{log}({\displaystyle \prod _{j=1}^{n}{g}_{j}})\hfill \\ =\hfill & {\displaystyle \sum _{j=1}^{n}\mathrm{log}({g}_{j})}.\hfill \end{array}

(22)

By combining the first term in Eq. (21) and the second term in Eq. (22), we prove the theorem.

The formulation in Theorem 1 is a nonlinear optimization problem. It involves a nonlinear objective function with *p* variables and linear equality and inequality constraints. Due to the presence of the log term in the objective, the problem is non-convex, and standard solvers are only guaranteed to find a locally optimal solution. However, our experimental results show that this formulation consistently produces superior performance.
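The reduction in Theorem 1 can be checked numerically: the diagonalized objective of Eq. (17) must agree with the matrix form of Eq. (12) for any feasible α. The toy graph, *β* values, weights, and labels below are assumed for illustration:

```python
import numpy as np

# Assumed toy data: triangle graph, two kernels, a fixed feasible alpha.
A = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A
d, P = np.linalg.eigh(L)

betas = [0.5, 2.0]
lam = 0.1
alpha = np.array([0.4, 0.6])
a = np.array([1.0, -1.0, 0.0])           # target vector of Eq. (13)

# Matrix form of Eq. (12).
K_tilde = []
for bta in betas:
    K = P @ np.diag(np.exp(bta * d)) @ P.T
    K_tilde.append(K / np.trace(K))
M = sum(w * K for w, K in zip(alpha, K_tilde)) + lam * np.eye(3)
obj_matrix = a @ np.linalg.inv(M) @ a + np.log(np.linalg.det(M))

# Diagonalized form of Eq. (17): sum_j b_j^2/g_j + sum_j log(g_j).
E = np.exp(np.outer(np.array(betas), d))           # row i holds diag(D_i)
g = (E / E.sum(axis=1, keepdims=True)).T @ alpha + lam
b = P.T @ a
obj_diag = np.sum(b**2 / g) + np.sum(np.log(g))

assert np.isclose(obj_matrix, obj_diag)
```

The agreement follows exactly from the proof: conjugating by *P* turns the matrix inverse and log determinant into entrywise operations on the diagonal entries *g*_{j}.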

#### Convex Formulation

The optimization problem in Theorem 1 is not convex. Previous studies [22, 23] indicate that the removal of the log determinant term in the KL divergence criterion in Eq. (12) has a limited effect on the performance. This leads to the following optimization problem:

\begin{array}{cc}{\mathrm{min}}_{\alpha}& {a}^{T}{\left({\displaystyle \sum _{i=1}^{p}{\alpha}_{i}{\tilde{K}}_{i}+\lambda I}\right)}^{-1}a\end{array}

(23)

\begin{array}{cc}\text{subject to}& {\displaystyle \sum _{i=1}^{p}{\alpha}_{i}=1},\end{array}

(24)

*α* ≥ 0.

Following Theorem 1, we can show that the optimization problem above can be simplified as

\begin{array}{cc}{\mathrm{min}}_{\alpha}& {\displaystyle \sum _{j=1}^{n}\frac{{b}_{j}^{2}}{{g}_{j}}}\end{array}

(26)

\begin{array}{cc}\text{subject to}& {\displaystyle \sum _{i=1}^{p}{\alpha}_{i}=1},\end{array}

*α* ≥ 0,

where *g*_{j} and *b* are defined as in Theorem 1.

The optimization problem in Eq. (26) is convex, and thus a globally optimal solution exists. Numerical experiments indicate that a simple gradient descent algorithm converges quickly to the optimal solution. Furthermore, the prediction performance of this convex formulation is comparable to that of the formulation proposed in Theorem 1. This convex formulation shares some similarities with the one in [26], where a set of Laplacian matrices derived from multiple networks is combined.
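As one concrete way to solve Eq. (26), the sketch below uses multiplicative (exponentiated-gradient) updates, which keep α on the simplex {α ≥ 0, Σ α_{i} = 1} by construction. The graph, *β* grid, labels, regularization λ, step size, and the choice of update rule are all illustrative assumptions, not the paper's prescribed solver:

```python
import numpy as np

# Assumed toy problem: 4-vertex graph, three candidate betas, one test vertex.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A
d, P = np.linalg.eigh(L)

betas = np.array([0.1, 1.0, 5.0])
lam = 1e-2
E = np.exp(np.outer(betas, d))
E = E / E.sum(axis=1, keepdims=True)     # row i holds diag(D_i)/trace(D_i)

a = np.array([1.0, 1.0, -1.0, 0.0])      # Eq. (13); the test vertex gets 0
b = P.T @ a

def objective(alpha):
    g = E.T @ alpha + lam                # g_j of Theorem 1
    return np.sum(b**2 / g)              # Eq. (26)

alpha = np.ones(3) / 3                   # start at the simplex center
best, best_val = alpha.copy(), objective(alpha)
for _ in range(200):
    g = E.T @ alpha + lam
    grad = -E @ (b**2 / g**2)            # gradient of Eq. (26) w.r.t. alpha
    step = 0.1 / (np.abs(grad).max() + 1e-12)   # normalized step size
    alpha = alpha * np.exp(-step * grad)        # multiplicative update ...
    alpha = alpha / alpha.sum()                 # ... renormalized to the simplex
    val = objective(alpha)
    if val < best_val:
        best, best_val = alpha.copy(), val

assert best_val <= objective(np.ones(3) / 3)
assert np.isclose(best.sum(), 1.0) and np.all(best >= 0)
```

Because the problem is convex over the simplex, any reasonable first-order method converges to the global optimum; the multiplicative form is used here only because it makes the constraints trivial to maintain.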

### Diffusion Kernel Learning: The Multi-Task Case

It is known that proteins often perform multiple functions, which are typically related. Many existing function prediction approaches decouple the multiple functions and formulate each function prediction problem as a separate binary classification problem. Such methods do not consider the relationships among the multiple functions of a protein and may compromise the overall performance.

We propose to extend our formulation for the single-task case to deal with multiple tasks simultaneously. In particular, we formulate a single optimization problem for the simultaneous prediction of multiple functions for a protein. The joint learning of multiple functions can potentially exploit the relationship among functions and improve the performance. In terms of computational complexity, the proposed joint optimization problem is shown to be comparable to that of the single-task formulation.

A key observation is that when the pre-specified diffusion kernels are computed from the same biological network with different values of *β*, the graph Laplacian matrices are the same for all tasks. By enforcing all tasks to share a common linear combination of kernels, we obtain the following joint optimization problem:

\begin{array}{cc}\underset{\alpha}{\mathrm{min}}& {\displaystyle \sum _{k=1}^{t}{({a}^{(k)})}^{T}{\left({\displaystyle \sum _{i=1}^{p}{\alpha}_{i}{\tilde{K}}_{i}+\lambda I}\right)}^{-1}{a}^{(k)}+t\mathrm{log}\left|{\displaystyle \sum _{i=1}^{p}{\alpha}_{i}{\tilde{K}}_{i}+\lambda I}\right|}\end{array}

(27)

\begin{array}{cc}\text{subject to}& {\displaystyle \sum _{i=1}^{p}{\alpha}_{i}=1},\end{array}

(28)

*α* ≥ 0,

where *a*^{(k)} ∈ ℝ^{n}, for *k* = 1, ..., *t*, is the vector of class labels for the *k*-th task, as in Eq. (13), and *t* is the number of tasks. Note that all *t* tasks are related in this joint formulation through the enforcement of a common kernel matrix. The objective function in Eq. (27) uses an equal weight for all tasks. If some tasks are known to be more important than others, a more general objective function with varying weights for different tasks may be used instead. Following Theorem 1, we can simplify the optimization problem in Eq. (27), as summarized in the following theorem:

**Theorem 2**. *Given a set of p diffusion kernels, as defined in Eq. (7), the problem of optimal multi-task kernel learning, in the form of a convex combination of the given p kernels, can be formulated as the following optimization problem:*

\begin{array}{cc}{\mathrm{min}}_{\alpha}& {\displaystyle \sum _{k=1}^{t}{\displaystyle \sum _{j=1}^{n}\frac{{b}_{k}^{2}(j)}{{g}_{j}}}+t{\displaystyle \sum _{j=1}^{n}\mathrm{log}({g}_{j})}}\end{array}

(30)

\begin{array}{cc}subject\phantom{\rule{0.5em}{0ex}}to& {\displaystyle \sum _{i=1}^{p}{\alpha}_{i}=1},\end{array}

(31)

*α* ≥ 0,

*where g*_{j} *is defined as in Theorem 1, b*_{k} = *P*^{T}*a*^{(k)}, *a*^{(k)} *is defined as in Eq. (13) for the k-th task, and t is the total number of tasks.*

*Proof*. The first term in Eq. (27) can be rewritten as

\begin{array}{lll}{\displaystyle \sum _{k=1}^{t}\left({a}^{{(k)}^{T}}{\left({\displaystyle \sum _{i=1}^{p}{\alpha}_{i}{\tilde{K}}_{i}+\lambda I}\right)}^{-1}{a}^{(k)}\right)}\hfill & =\hfill & {\displaystyle \sum _{k=1}^{t}\left({b}_{k}^{T}{\left({\displaystyle \sum _{i=1}^{p}{\alpha}_{i}\frac{{D}_{i}}{\text{trace}({D}_{i})}+\lambda I}\right)}^{-1}{b}_{k}\right)}\hfill \\ =\hfill & {\displaystyle \sum _{k=1}^{t}({b}_{k}^{T}{G}^{-1}{b}_{k})={\displaystyle \sum _{k=1}^{t}{\displaystyle \sum _{j=1}^{n}\frac{{b}_{k}^{2}(j)}{{g}_{j}}.}}}\hfill \end{array}

Similarly, the second term can be rewritten as

t\mathrm{log}\left|{\displaystyle \sum _{i=1}^{p}{\alpha}_{i}{\tilde{K}}_{i}+\lambda I}\right|=t{\displaystyle \sum _{j=1}^{n}\mathrm{log}({g}_{j})}.

(33)

The detailed intermediate steps of derivation are the same as those in the proof of Theorem 1 and thus are omitted. By combining these two terms together, we prove the theorem.

The optimization problem in Theorem 2 is not convex. Similar to the single-task case, the log determinant term in Eq. (27) may be removed, which leads to the following convex optimization problem:

\begin{array}{cc}{\mathrm{min}}_{\alpha}& {\displaystyle \sum _{k=1}^{t}{\displaystyle \sum _{j=1}^{n}\frac{{b}_{k}^{2}(j)}{{g}_{j}}}}\end{array}

(34)

\begin{array}{cc}\text{subject to}& {\displaystyle \sum _{i=1}^{p}{\alpha}_{i}=1},\end{array}

(35)

*α* ≥ 0.

Experimental evidence shows that this convex optimization problem is comparable in prediction performance to the formulation in Theorem 2.
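The multi-task objective of Eq. (34) differs from the single-task one only in that the per-task quadratic terms add up while α (and hence the *g*_{j}) is shared. The sketch below, with an assumed toy graph, *β* grid, λ, two hypothetical tasks, and the same illustrative multiplicative-update solver as in the single-task case, makes that structure explicit:

```python
import numpy as np

# Assumed toy problem shared by two tasks.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A
d, P = np.linalg.eigh(L)

betas = np.array([0.1, 1.0, 5.0])
lam = 1e-2
E = np.exp(np.outer(betas, d))
E = E / E.sum(axis=1, keepdims=True)        # row i holds diag(D_i)/trace(D_i)

# Two hypothetical tasks (functions); test vertices carry label 0, Eq. (13).
a_tasks = np.array([[1.0, 1.0, -1.0, 0.0],
                    [1.0, -1.0, 1.0, 0.0]])
B = a_tasks @ P                             # row k equals (P^T a^{(k)})^T

def objective(alpha):
    g = E.T @ alpha + lam                   # shared g_j across all tasks
    return np.sum(B**2 / g)                 # double sum of Eq. (34): tasks k, entries j

alpha = np.ones(3) / 3
best_val = objective(alpha)
for _ in range(200):
    g = E.T @ alpha + lam
    grad = -E @ (B**2 / g**2).sum(axis=0)   # the t per-task gradients add up
    step = 0.1 / (np.abs(grad).max() + 1e-12)
    alpha = alpha * np.exp(-step * grad)
    alpha = alpha / alpha.sum()
    best_val = min(best_val, objective(alpha))

assert best_val <= objective(np.ones(3) / 3)
```

The cost per iteration is essentially the single-task cost times a cheap sum over tasks, which is why the joint formulation remains comparable in complexity to the single-task one.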