We study the problem of protein function prediction from biological networks, which are represented as graphs. For a graph \mathcal{G}, the vertices represent proteins and the edges characterize the relationships between proteins. In the following discussion, the vertex and edge sets are denoted by *V* and *E*, respectively. The total number of proteins in the network is *n* = |*V*|. The adjacency matrix *A* is used to encode the similarity between vertices, where *A*_{i,j} describes the similarity between vertices *v*_{i} and *v*_{j}. The functions of some proteins in the network are already known, and the goal of protein function prediction is to infer the functions of unannotated proteins based on the functions of annotated proteins and the network topology. In particular, for a graph \mathcal{G} = (*V*, *E*), the vertices in *V* can be partitioned into a training set and a test set. The functions of proteins in the training set are known, while those of proteins in the test set are not. Each edge in *E* reflects the local similarity between its end vertices. The learning problem is to predict the functions of proteins in the test set based on the label information of the training set and the topology of the graph.

### Background and Related Work

Kernel methods are particularly suitable for learning from graph-based data, as they only require the similarities between proteins to be encoded in a kernel matrix. In kernel methods, a symmetric function \kappa :\mathcal{X}\times \mathcal{X}\to \mathbb{R}, where \mathcal{X} denotes the input space, is called a kernel function if it satisfies Mercer's condition [14]. When used for a finite number of samples in practice, this condition can be stated as follows: for any *x*_{1}, ..., *x*_{n} ∈ \mathcal{X}, the *Gram* matrix *K* ∈ ℝ^{n × n}, defined by *K*_{ij} = *κ*(*x*_{i}, *x*_{j}), is positive semidefinite. Any kernel function *κ* implicitly maps the input set \mathcal{X} to a high-dimensional (possibly infinite-dimensional) Hilbert space {\mathscr{H}}_{\kappa}, equipped with the inner product {\left(\cdot ,\cdot \right)}_{{\mathscr{H}}_{\kappa}}, through a mapping {\phi}_{\kappa}:\mathcal{X}\to {\mathscr{H}}_{\kappa}:

\kappa (x,z)={({\phi}_{\kappa}(x),{\phi}_{\kappa}(z))}_{{\mathscr{H}}_{\kappa}}.

(1)

The adjacency matrix *A* cannot be used directly as a kernel matrix. First, the adjacency matrix contains only local similarity information, which may not be effective for function prediction. Second, the adjacency matrix may not even be positive semidefinite. To derive a kernel matrix from the adjacency matrix, the ideas of random walks and network diffusion have been used. The basic idea is to compute the global similarity between vertices *v*_{i} and *v*_{j} as the probability of reaching *v*_{j} at some time point *T* when the random walker starts from *v*_{i}. This idea is justified, at least to some extent, by observing that the random walker tends to meander around its origin, as there is a larger number of paths of length *T* to its neighbors than to remote vertices [2].

To avoid potential problems such as the choice of a value for *T* and the assurance of positive semidefiniteness of the kernel matrix, a random walk with an infinite number of infinitesimally small steps is used instead. It can be formally described as:

K=\underset{s\to \infty}{\mathrm{lim}}{\left(I+\frac{\beta L}{s}\right)}^{s}={e}^{\beta L},

(2)

where *β* is a parameter controlling the extent of diffusion and *L* ∈ ℝ^{n × n} is the graph Laplacian matrix defined as

*L* = diag(*Ae*) - *A*,

(3)

where *A* is the adjacency matrix, *e* is the vector of all ones, and diag(*Ae*) is a diagonal matrix whose diagonal entries are the corresponding row sums of the matrix *A*. It turns out that for any symmetric matrix *L*, *e*^{βL} is always positive definite and thus can be used as a kernel matrix. The diffusion effect of such a kernel can be seen explicitly when it is expanded as [2]:

{e}^{\beta L}=I+\beta L+\frac{{\beta}^{2}}{2}{L}^{2}+\frac{{\beta}^{3}}{6}{L}^{3}+\cdots ,

(4)

where the local information encoded in *L* is continuously diffused by repeated multiplications. The parameter *β* in the diffusion kernel controls the extent of diffusion, and it has an effect similar to that of the scaling parameter in Gaussian kernels. If *β* is too small, the local information cannot be diffused effectively, resulting in a kernel matrix that captures only local similarities. On the other hand, if it is too large, the neighborhood information will be lost. Furthermore, the optimal value of *β* is problem- and data-dependent. It is thus highly desirable to tune the value of *β* adaptively from the data.
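The construction in Eqs. (2)-(4) is easy to verify numerically. The following sketch (the 4-vertex path graph, the value of *β*, and the step count *s* are illustrative assumptions, not from the text) builds *L* = diag(*Ae*) - *A*, computes *e*^{βL} exactly through the eigendecomposition of *L*, and checks it against the limit definition of Eq. (2):

```python
import numpy as np

# Hypothetical 4-vertex path graph: adjacency matrix A (assumed example).
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = np.diag(A @ np.ones(4)) - A               # L = diag(Ae) - A

beta = 0.5
d, P = np.linalg.eigh(L)                      # L = P diag(d) P^T
K = P @ np.diag(np.exp(beta * d)) @ P.T       # exact e^{beta L}

# Eq. (2): (I + beta*L/s)^s approaches e^{beta L} for large s.
s = 1_000_000
K_limit = np.linalg.matrix_power(np.eye(4) + beta * L / s, s)
assert np.allclose(K, K_limit, atol=1e-4)

# e^{beta L} is positive definite for any symmetric L, hence a valid kernel.
assert np.all(np.linalg.eigvalsh(K) > 0)
```

The positive-definiteness check at the end is exactly the property that makes *e*^{βL} usable as a kernel matrix regardless of whether *A* itself is positive semidefinite.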

We approach the kernel tuning problem by learning an optimal kernel as a linear combination of pre-specified diffusion kernels constructed with different values of *β*. This is motivated by the work in [17], where the optimal kernel for SVM, in the form of a linear combination of pre-specified kernels, is learned based on the large margin criterion. In particular, the generalized performance measure based on the 1-norm soft margin SVM used in [17] is

{\omega}_{S1}(K)=\underset{\alpha :C\ge \alpha \ge 0,{\alpha}^{T}y=0}{\mathrm{max}}\{2{\alpha}^{T}e-{\alpha}^{T}G(K)\alpha \},

(5)

where *C* > 0 is the regularization parameter in SVM, *e* is the vector of all ones, *G*(*K*) is defined by *G*_{ij}(*K*) = *κ*(*x*_{i}, *x*_{j})*y*_{i}*y*_{j}, and the *i*-th entry of *y*, denoted as *y*_{i}, is the class label (1 or -1) of the *i*-th data point *x*_{i}. Lanckriet *et al*. [17] showed that when the optimal kernel is restricted to a linear combination of the given *p* kernels *K*_{1}, ..., *K*_{p}, the kernel learning problem can be formulated as a semidefinite program. Furthermore, when the coefficients of the linear combination are constrained to be non-negative, the kernel learning problem can be formulated as a quadratically constrained quadratic program [16]. As shown in [20], an alternative performance measure is the KL divergence between the two zero-mean Gaussian distributions associated with the input and output kernel matrices. We show that when this KL divergence criterion is used to learn a linear combination of diffusion kernels constructed with different values of *β*, the resulting optimization problem can be solved efficiently. We further show that it can be extended to the multi-task case. Such integration of multiple tasks into one optimization problem can potentially exploit the complementary information among different tasks.

### Diffusion Kernel Learning: The Single-Task Case

We focus on learning an optimal kernel for a single task, which will then be extended to the multi-task case. The underlying idea is that the Laplacian matrix *L*, defined in Eq. (3), contains the connectivity information of all vertices in the graph. By adaptively tuning the kernel constructed from *L* on the training vertices, the entries corresponding to test vertices are expected to be tuned in some optimal way as well. To restrict the search space and improve the generalization ability, we focus on learning an optimal kernel as a linear combination of a set of diffusion kernels constructed with different values of *β*, representing different extents of diffusion. In particular, we choose a sequence of values *β*_{1}, ..., *β*_{p}, and the corresponding diffusion kernels can be constructed as

\begin{array}{cc}{K}_{i}={e}^{{\beta}_{i}L},& i=1,\cdots ,p.\end{array}

(6)

We may assume that the kernels defined in Eq. (6) reflect our (weak) prior knowledge about the problem. The goal is to integrate the tuning of the coefficients into the learning process so that the algorithm can adaptively select an optimal linear combination of the given kernels. Note that it is numerically favorable to normalize the kernels, though this does not affect the results theoretically [14]. We normalize the kernels as follows:

{\tilde{K}}_{i}=\frac{{e}^{{\beta}_{i}L}}{\text{trace}({e}^{{\beta}_{i}L})},

(7)

and the optimal kernel can be represented as

{K}_{opt}={\displaystyle \sum _{i=1}^{p}{\alpha}_{i}{\tilde{K}}_{i}}={\displaystyle \sum _{i=1}^{p}{\alpha}_{i}\frac{{e}^{{\beta}_{i}L}}{\text{trace}({e}^{{\beta}_{i}L})},}

(8)

for a set of non-negative coefficients {\left\{{\alpha}_{i}\right\}}_{i=1}^{p}.
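The normalization of Eq. (7) and the combination of Eq. (8) can be sketched in a few lines. In the snippet below, the triangle graph, the *β* grid, and the particular weights α are assumed for illustration; note that because each normalized kernel has unit trace, the trace of the combination equals the sum of the weights:

```python
import numpy as np

# Assumed toy example: triangle graph, three beta values.
A = np.array([[0, 1, 1],
              [1, 0, 1],
              [1, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A
d, P = np.linalg.eigh(L)

betas = [0.1, 1.0, 10.0]                 # beta_1, ..., beta_p
K_tilde = []
for b in betas:
    K = P @ np.diag(np.exp(b * d)) @ P.T          # e^{beta_i L}
    K_tilde.append(K / np.trace(K))               # Eq. (7): unit trace

alpha = np.array([0.2, 0.5, 0.3])        # non-negative coefficients, sum to 1
K_opt = sum(w * K for w, K in zip(alpha, K_tilde))  # Eq. (8)

# trace(K_opt) = sum_i alpha_i * trace(K~_i) = sum_i alpha_i.
assert np.isclose(np.trace(K_opt), alpha.sum())
```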

#### Kullback-Leibler Divergence Formulation

Kernel matrices are positive semidefinite and thus can be used as the covariance matrices of Gaussian distributions. It was shown in [20] that the kernel matrix can be learned by minimizing the Kullback-Leibler (KL) divergence between the zero-mean Gaussian distributions associated with the input and output kernel matrices. In this paper, we focus on learning the optimal coefficients *α*_{i} from the data automatically by minimizing this KL divergence criterion. As described in [20], the KL divergence between the zero-mean Gaussian distributions defined by the input kernel *K*_{x} and output kernel *K*_{y} can be expressed as

\text{KL}({N}_{y}|{N}_{x})=\frac{1}{2}\text{trace}({K}_{y}{K}_{x}^{-1})+\frac{1}{2}\mathrm{log}|{K}_{x}|-\frac{1}{2}\mathrm{log}|{K}_{y}|-\frac{n}{2},

(9)

where |·| denotes the matrix determinant, *N*_{x} and *N*_{y} denote the zero-mean Gaussian distributions associated with *K*_{x} and *K*_{y}, respectively, and *n* is the number of samples. When the output kernel *K*_{y} is defined as *K*_{y} = **yy**^{T}, the KL divergence in Eq. (9) can be expressed as

\text{KL}({N}_{y}|{N}_{x})=\frac{1}{2}{y}^{T}{K}_{x}^{-1}y+\frac{1}{2}\mathrm{log}|{K}_{x}|+\text{const},

(10)

where "const" denotes terms that are independent of *K*_{x}, and *K*_{x} is the input kernel matrix, defined as a linear combination of the given *p* kernels as

{K}_{x}={\displaystyle \sum _{i=1}^{p}{\alpha}_{i}{\tilde{K}}_{i}+\lambda I}={\displaystyle \sum _{i=1}^{p}{\alpha}_{i}\frac{{e}^{{\beta}_{i}L}}{\text{trace}({e}^{{\beta}_{i}L})}+\lambda I}.

(11)

Note that a regularization term, with *λ* as the regularization parameter, is added in Eq. (11) to deal with the potential singularity of kernel matrices, as in [20], and we require {\displaystyle {\sum}_{i=1}^{p}{\alpha}_{i}=1} as in multiple kernel learning (MKL) [17]. The optimal coefficients *α* = [*α*_{1}, ..., *α*_{p}]^{T} are computed by minimizing KL(*N*_{y}|*N*_{x}). By substituting Eq. (11) into Eq. (10) and removing the constant term, we obtain the following optimization problem:

\begin{array}{l}\underset{\alpha}{\mathrm{min}}\left\{{a}^{T}{\left({\displaystyle \sum _{i=1}^{p}{\alpha}_{i}{\tilde{K}}_{i}+\lambda I}\right)}^{-1}a+\mathrm{log}\left|{\displaystyle \sum _{i=1}^{p}{\alpha}_{i}{\tilde{K}}_{i}+\lambda I}\right|\right\}\hfill \\ \begin{array}{cc}\text{s}\text{.t}\text{.}& {\displaystyle \sum _{i=1}^{p}{\alpha}_{i}=1,}\\ \alpha \ge 0,\end{array}\hfill \end{array}

(12)

where *α* = (*α*_{1}, ..., *α*_{p})^{T}, *α* ⩾ 0 denotes that all components of *α* are non-negative, and the vector *a* ∈ ℝ^{n} is the problem-specific target vector, corresponding to the general target in Eq. (9), defined as follows:

{a}_{i}=\{\begin{array}{ll}1\hfill & \text{if }{v}_{i}\text{ is in the positive class},\hfill \\ -1\hfill & \text{if }{v}_{i}\text{ is in the negative class},\hfill \\ 0\hfill & \text{if }{v}_{i}\text{ is in the test set}.\hfill \end{array}

(13)

Note that we assign the label 0 to vertices in the test set so that they do not bias the solution towards either class. A similar idea has been used in [24] for semi-supervised learning. In multiple kernel learning [17], the sum-to-one constraint on the weights is enforced, as in Eq. (12). We present results for both the constrained and unconstrained formulations in the experiments. The results show that the constrained formulations achieve better performance than the unconstrained ones.
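A minimal illustration of the target vector of Eq. (13), for an assumed toy split (vertices 0 and 1 annotated positive, vertex 2 negative, and the rest forming the test set):

```python
import numpy as np

n = 6
a = np.zeros(n)        # test vertices keep the neutral label 0
a[[0, 1]] = 1.0        # vertices in the positive class (assumed split)
a[[2]] = -1.0          # vertex in the negative class (assumed split)

assert list(a) == [1.0, 1.0, -1.0, 0.0, 0.0, 0.0]
```

Starting from zeros and overwriting only the annotated entries mirrors the definition directly: every vertex not in the training set automatically receives the neutral label.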

Recall that the graph Laplacian matrix *L* is symmetric, so its eigendecomposition can be expressed as

*L* = *PDP*^{T},

(14)

where *D* = diag(*d*_{1}, ..., *d*_{n}) is the diagonal matrix of eigenvalues and *P* ∈ ℝ^{n × n} is the orthogonal matrix of corresponding eigenvectors. According to the definition of a function of a matrix [25], we have

{e}^{{\beta}_{i}L}=P{D}_{i}{P}^{T},

(15)

where

{D}_{i}=\text{diag}({e}^{{\beta}_{i}{d}_{1}},\cdots ,{e}^{{\beta}_{i}{d}_{n}}).

(16)

The main result is summarized in the following theorem:

**Theorem 1**. *Given a set of p diffusion kernels, as defined in Eq. (7), the problem of learning the optimal kernel matrix, in the form of a convex combination of the given p kernel matrices as in Eq. (12), can be formulated as the following optimization problem:*

\begin{array}{cc}{\mathrm{min}}_{\alpha}& {\displaystyle \sum _{j=1}^{n}\left(\frac{{b}_{j}^{2}}{{g}_{j}}+\mathrm{log}({g}_{j})\right)}\end{array}

(17)

\begin{array}{cc}subject\phantom{\rule{0.5em}{0ex}}to& {\displaystyle \sum _{i=1}^{p}{\alpha}_{i}=1},\end{array}

(18)

*α* ≥ 0,

*where b* = (*b*_{1}, ..., *b*_{n})^{T} = *P*^{T}*a*, *g*_{j} *is the j-th diagonal entry of the diagonal matrix G, defined as*

G={\displaystyle \sum _{i=1}^{p}{\alpha}_{i}\frac{{D}_{i}}{\text{trace}({D}_{i})}+\lambda I},

(20)

*and D*_{i} *is the diagonal matrix defined in Eq. (16).*

*Proof*. The first term in Eq. (12) can be written as:

\begin{array}{lll}{a}^{T}{\left({\displaystyle \sum _{i=1}^{p}{\alpha}_{i}{\tilde{K}}_{i}+\lambda I}\right)}^{-1}a\hfill & =\hfill & {a}^{T}{\left({\displaystyle \sum _{i=1}^{p}{\alpha}_{i}\frac{{e}^{{\beta}_{i}L}}{\text{trace}({e}^{{\beta}_{i}L})}+\lambda I}\right)}^{-1}a\hfill \\ =\hfill & {a}^{T}P{\left({\displaystyle \sum _{i=1}^{p}{\alpha}_{i}\frac{{D}_{i}}{\text{trace}({e}^{{\beta}_{i}L})}+\lambda I}\right)}^{-1}{P}^{T}a\hfill \\ =\hfill & {b}^{T}{\left({\displaystyle \sum _{i=1}^{p}{\alpha}_{i}\frac{{D}_{i}}{\text{trace}({D}_{i})}+\lambda I}\right)}^{-1}b\hfill \\ =\hfill & {b}^{T}{G}^{-1}b={\displaystyle \sum _{j=1}^{n}\frac{{b}_{j}^{2}}{{g}_{j}},}\hfill \end{array}

(21)

where the third equality follows from the property of the trace, that is,

\text{trace}({e}^{{\beta}_{i}L})=\text{trace}(P{D}_{i}{P}^{T})=\text{trace}({D}_{i}).

Similarly, the second term in Eq. (12) can be written as:

\begin{array}{lll}\mathrm{log}\left|{\displaystyle \sum _{i=1}^{p}{\alpha}_{i}{\tilde{K}}_{i}+\lambda I}\right|\hfill & =\hfill & \mathrm{log}\left|{\displaystyle \sum _{i=1}^{p}{\alpha}_{i}\frac{{e}^{{\beta}_{i}L}}{\text{trace}({e}^{{\beta}_{i}L})}+\lambda I}\right|\hfill \\ =\hfill & \mathrm{log}\left|{\displaystyle \sum _{i=1}^{p}{\alpha}_{i}\frac{{e}^{{\beta}_{i}D}}{\text{trace}({e}^{{\beta}_{i}D})}+\lambda I}\right|\hfill \\ =\hfill & \mathrm{log}\left|G\right|\hfill \\ =\hfill & \mathrm{log}({\displaystyle \prod _{j=1}^{n}{g}_{j}})\hfill \\ =\hfill & {\displaystyle \sum _{j=1}^{n}\mathrm{log}({g}_{j})}.\hfill \end{array}

(22)

By combining the first term in Eq. (21) and the second term in Eq. (22), we prove the theorem.

The formulation in Theorem 1 is a nonlinear optimization problem. It involves a nonlinear objective function with *p* variables and linear equality and inequality constraints. Due to the presence of the log term in the objective, the problem is non-convex, and standard solvers are only guaranteed to find a locally optimal solution. However, our experimental results show that this formulation consistently produces superior performance.
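The reduction in Theorem 1 can be checked numerically: the diagonalized objective of Eq. (17) must agree with the matrix form of Eq. (12) for any feasible α. The toy graph, *β* values, weights, and labels below are assumed for illustration:

```python
import numpy as np

# Assumed toy data: triangle graph, two kernels, a fixed feasible alpha.
A = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A
d, P = np.linalg.eigh(L)

betas = [0.5, 2.0]
lam = 0.1
alpha = np.array([0.4, 0.6])
a = np.array([1.0, -1.0, 0.0])           # target vector of Eq. (13)

# Matrix form of Eq. (12).
K_tilde = []
for bta in betas:
    K = P @ np.diag(np.exp(bta * d)) @ P.T
    K_tilde.append(K / np.trace(K))
M = sum(w * K for w, K in zip(alpha, K_tilde)) + lam * np.eye(3)
obj_matrix = a @ np.linalg.inv(M) @ a + np.log(np.linalg.det(M))

# Diagonalized form of Eq. (17): sum_j b_j^2/g_j + sum_j log(g_j).
E = np.exp(np.outer(np.array(betas), d))           # row i holds diag(D_i)
g = (E / E.sum(axis=1, keepdims=True)).T @ alpha + lam
b = P.T @ a
obj_diag = np.sum(b**2 / g) + np.sum(np.log(g))

assert np.isclose(obj_matrix, obj_diag)
```

The agreement follows exactly from the proof: conjugating by *P* turns the matrix inverse and log determinant into entrywise operations on the diagonal entries *g*_{j}.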

#### Convex Formulation

The optimization problem in Theorem 1 is not convex. Previous studies [22, 23] indicate that the removal of the log determinant term in the KL divergence criterion in Eq. (12) has a limited effect on the performance. This leads to the following optimization problem:

\begin{array}{cc}{\mathrm{min}}_{\alpha}& {a}^{T}{\left({\displaystyle \sum _{i=1}^{p}{\alpha}_{i}{\tilde{K}}_{i}+\lambda I}\right)}^{-1}a\end{array}

(23)

\begin{array}{cc}\text{subject to}& {\displaystyle \sum _{i=1}^{p}{\alpha}_{i}=1},\end{array}

(24)

*α* ≥ 0.

Following Theorem 1, we can show that the optimization problem above can be simplified as

\begin{array}{cc}{\mathrm{min}}_{\alpha}& {\displaystyle \sum _{j=1}^{n}\frac{{b}_{j}^{2}}{{g}_{j}}}\end{array}

(26)

\begin{array}{cc}\text{subject to}& {\displaystyle \sum _{i=1}^{p}{\alpha}_{i}=1},\end{array}

*α* ≥ 0,

where *g*_{j} and *b* are defined as in Theorem 1.

The optimization problem in Eq. (26) is convex, and thus a globally optimal solution exists. Numerical experiments indicate that a simple gradient descent algorithm converges quickly to the optimal solution. Furthermore, the prediction performance of this convex formulation is comparable to that of the formulation proposed in Theorem 1. This convex formulation shares some similarities with the one in [26], where a set of Laplacian matrices derived from multiple networks is combined.
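As one concrete way to solve Eq. (26), the sketch below uses multiplicative (exponentiated-gradient) updates, which keep α on the simplex {α ≥ 0, Σ α_{i} = 1} by construction. The graph, *β* grid, labels, regularization λ, step size, and the choice of update rule are all illustrative assumptions, not the paper's prescribed solver:

```python
import numpy as np

# Assumed toy problem: 4-vertex graph, three candidate betas, one test vertex.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A
d, P = np.linalg.eigh(L)

betas = np.array([0.1, 1.0, 5.0])
lam = 1e-2
E = np.exp(np.outer(betas, d))
E = E / E.sum(axis=1, keepdims=True)     # row i holds diag(D_i)/trace(D_i)

a = np.array([1.0, 1.0, -1.0, 0.0])      # Eq. (13); the test vertex gets 0
b = P.T @ a

def objective(alpha):
    g = E.T @ alpha + lam                # g_j of Theorem 1
    return np.sum(b**2 / g)              # Eq. (26)

alpha = np.ones(3) / 3                   # start at the simplex center
best, best_val = alpha.copy(), objective(alpha)
for _ in range(200):
    g = E.T @ alpha + lam
    grad = -E @ (b**2 / g**2)            # gradient of Eq. (26) w.r.t. alpha
    step = 0.1 / (np.abs(grad).max() + 1e-12)   # normalized step size
    alpha = alpha * np.exp(-step * grad)        # multiplicative update ...
    alpha = alpha / alpha.sum()                 # ... renormalized to the simplex
    val = objective(alpha)
    if val < best_val:
        best, best_val = alpha.copy(), val

assert best_val <= objective(np.ones(3) / 3)
assert np.isclose(best.sum(), 1.0) and np.all(best >= 0)
```

Because the problem is convex over the simplex, any reasonable first-order method converges to the global optimum; the multiplicative form is used here only because it makes the constraints trivial to maintain.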

### Diffusion Kernel Learning: The Multi-Task Case

It is known that proteins often perform multiple functions, which are typically related. Many existing function prediction approaches decouple the multiple functions and formulate each function prediction problem as a separate binary classification problem. Such methods do not consider the relationships among the multiple functions of a protein and may compromise the overall performance.

We propose to extend our formulation for the single-task case to deal with multiple tasks simultaneously. In particular, we formulate a single optimization problem for the simultaneous prediction of multiple functions for a protein. The joint learning of multiple functions can potentially exploit the relationship among functions and improve the performance. In terms of computational complexity, the proposed joint optimization problem is shown to be comparable to that of the single-task formulation.

A key observation is that when the pre-specified diffusion kernels are computed from the same biological network with different values of *β*, the graph Laplacian matrices are the same for all tasks. By enforcing all tasks to share a common linear combination of kernels, we obtain the following joint optimization problem:

\begin{array}{cc}\underset{\alpha}{\mathrm{min}}& {\displaystyle \sum _{k=1}^{t}{({a}^{(k)})}^{T}{\left({\displaystyle \sum _{i=1}^{p}{\alpha}_{i}{\tilde{K}}_{i}+\lambda I}\right)}^{-1}{a}^{(k)}+t\mathrm{log}\left|{\displaystyle \sum _{i=1}^{p}{\alpha}_{i}{\tilde{K}}_{i}+\lambda I}\right|}\end{array}

(27)

\begin{array}{cc}\text{subject to}& {\displaystyle \sum _{i=1}^{p}{\alpha}_{i}=1},\end{array}

(28)

*α* ≥ 0,

where *a*^{(k)} ∈ ℝ^{n}, for *k* = 1, ..., *t*, is the vector of class labels for the *k*-th task, as in Eq. (13), and *t* is the number of tasks. Note that all *t* tasks are related in this joint formulation through the enforcement of a common kernel matrix. The objective function in Eq. (27) uses an equal weight for all tasks. If some tasks are known to be more important than others, a more general objective function with varying weights for different tasks may be used instead. Following Theorem 1, we can simplify the optimization problem in Eq. (27), as summarized in the following theorem:

**Theorem 2**. *Given a set of p diffusion kernels, as defined in Eq. (7), the problem of optimal multi-task kernel learning, in the form of a convex combination of the given p kernels, can be formulated as the following optimization problem:*

\begin{array}{cc}{\mathrm{min}}_{\alpha}& {\displaystyle \sum _{k=1}^{t}{\displaystyle \sum _{j=1}^{n}\frac{{b}_{k}^{2}(j)}{{g}_{j}}}+t{\displaystyle \sum _{j=1}^{n}\mathrm{log}({g}_{j})}}\end{array}

(30)

\begin{array}{cc}subject\phantom{\rule{0.5em}{0ex}}to& {\displaystyle \sum _{i=1}^{p}{\alpha}_{i}=1},\end{array}

(31)

*α* ≥ 0,

*where g*_{j} *is defined as in Theorem 1, b*_{k} = *P*^{T}*a*^{(k)}, *a*^{(k)} *is defined as in Eq. (13) for the k-th task, and t is the total number of tasks.*

*Proof*. The first term in Eq. (27) can be rewritten as

\begin{array}{lll}{\displaystyle \sum _{k=1}^{t}\left({a}^{{(k)}^{T}}{\left({\displaystyle \sum _{i=1}^{p}{\alpha}_{i}{\tilde{K}}_{i}+\lambda I}\right)}^{-1}{a}^{(k)}\right)}\hfill & =\hfill & {\displaystyle \sum _{k=1}^{t}\left({b}_{k}^{T}{\left({\displaystyle \sum _{i=1}^{p}{\alpha}_{i}\frac{{D}_{i}}{\text{trace}({D}_{i})}+\lambda I}\right)}^{-1}{b}_{k}\right)}\hfill \\ =\hfill & {\displaystyle \sum _{k=1}^{t}({b}_{k}^{T}{G}^{-1}{b}_{k})={\displaystyle \sum _{k=1}^{t}{\displaystyle \sum _{j=1}^{n}\frac{{b}_{k}^{2}(j)}{{g}_{j}}.}}}\hfill \end{array}

Similarly, the second term can be rewritten as

t\mathrm{log}\left|{\displaystyle \sum _{i=1}^{p}{\alpha}_{i}{\tilde{K}}_{i}+\lambda I}\right|=t{\displaystyle \sum _{j=1}^{n}\mathrm{log}({g}_{j})}.

(33)

The detailed intermediate steps of derivation are the same as those in the proof of Theorem 1 and thus are omitted. By combining these two terms together, we prove the theorem.

The optimization problem in Theorem 2 is not convex. Similar to the single-task case, the log determinant term in Eq. (27) may be removed, which leads to the following convex optimization problem:

\begin{array}{cc}{\mathrm{min}}_{\alpha}& {\displaystyle \sum _{k=1}^{t}{\displaystyle \sum _{j=1}^{n}\frac{{b}_{k}^{2}(j)}{{g}_{j}}}}\end{array}

(34)

\begin{array}{cc}\text{subject to}& {\displaystyle \sum _{i=1}^{p}{\alpha}_{i}=1},\end{array}

(35)

*α* ≥ 0.

Experimental evidence shows that this convex optimization problem is comparable in prediction performance to the formulation in Theorem 2.
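The multi-task objective of Eq. (34) differs from the single-task one only in that the per-task quadratic terms add up while α (and hence the *g*_{j}) is shared. The sketch below, with an assumed toy graph, *β* grid, λ, two hypothetical tasks, and the same illustrative multiplicative-update solver as in the single-task case, makes that structure explicit:

```python
import numpy as np

# Assumed toy problem shared by two tasks.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A
d, P = np.linalg.eigh(L)

betas = np.array([0.1, 1.0, 5.0])
lam = 1e-2
E = np.exp(np.outer(betas, d))
E = E / E.sum(axis=1, keepdims=True)        # row i holds diag(D_i)/trace(D_i)

# Two hypothetical tasks (functions); test vertices carry label 0, Eq. (13).
a_tasks = np.array([[1.0, 1.0, -1.0, 0.0],
                    [1.0, -1.0, 1.0, 0.0]])
B = a_tasks @ P                             # row k equals (P^T a^{(k)})^T

def objective(alpha):
    g = E.T @ alpha + lam                   # shared g_j across all tasks
    return np.sum(B**2 / g)                 # double sum of Eq. (34): tasks k, entries j

alpha = np.ones(3) / 3
best_val = objective(alpha)
for _ in range(200):
    g = E.T @ alpha + lam
    grad = -E @ (B**2 / g**2).sum(axis=0)   # the t per-task gradients add up
    step = 0.1 / (np.abs(grad).max() + 1e-12)
    alpha = alpha * np.exp(-step * grad)
    alpha = alpha / alpha.sum()
    best_val = min(best_val, objective(alpha))

assert best_val <= objective(np.ones(3) / 3)
```

The cost per iteration is essentially the single-task cost times a cheap sum over tasks, which is why the joint formulation remains comparable in complexity to the single-task one.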