Enhanced protein fold recognition through a novel data integration approach
Yiming Ying^{1}, Kaizhu Huang^{2} and Colin Campbell^{1}
https://doi.org/10.1186/1471-2105-10-267
© Ying et al; licensee BioMed Central Ltd. 2009
Received: 16 April 2009
Accepted: 26 August 2009
Published: 26 August 2009
Abstract
Background
Protein fold recognition is a key step in protein three-dimensional (3D) structure discovery. There are multiple fold discriminatory data sources, which use physicochemical and structural properties as well as further data sources derived from local sequence alignments. This raises the issue of finding the most efficient method for combining these different informative data sources and exploring their relative significance for protein fold classification. Kernel methods have been extensively used for biological data analysis. They can incorporate separate fold discriminatory features into kernel matrices which encode the similarity between samples in their respective data sources.
Results
In this paper we consider the problem of integrating multiple data sources using a kernel-based approach. We propose a novel information-theoretic approach based on a Kullback-Leibler (KL) divergence between the output kernel matrix and the input kernel matrix so as to integrate heterogeneous data sources. One of the most appealing properties of this approach is that it can easily cope with multiclass classification and multi-task learning by an appropriate choice of the output kernel matrix. Based on the position of the output and input kernel matrices in the KL-divergence objective, there are two formulations, which we respectively refer to as MKLdiv-dc and MKLdiv-conv. We propose to efficiently solve MKLdiv-dc by a difference of convex (DC) programming method and MKLdiv-conv by a projected gradient descent algorithm. The effectiveness of the proposed approaches is evaluated on a benchmark dataset for protein fold recognition and a yeast protein function prediction problem.
Conclusion
Our proposed methods MKLdiv-dc and MKLdiv-conv are able to achieve state-of-the-art performance on the SCOP PDB-40D benchmark dataset for protein fold prediction and provide useful insights into the relative significance of informative data sources. In particular, MKLdiv-dc further improves the fold discrimination accuracy to 75.19%, which is more than a 5% improvement over competitive Bayesian probabilistic and SVM margin-based kernel learning methods. Furthermore, we report a competitive performance on the yeast protein function prediction problem.
Background
A huge number of protein coding sequences have been generated by genome sequencing projects. In contrast, there is a much slower increase in the number of known three-dimensional (3D) protein structures. Determination of a protein's 3D structure is a formidable challenge if there is no sequence similarity to proteins of known structure, and thus protein structure prediction remains a core problem within computational biology.
Computational prediction of protein structure has achieved significant successes [1, 2]. Focusing on the fold prediction problem of immediate interest to this paper, one computational method, known as the taxonomic approach [3, 4], presumes the number of folds is restricted and focuses on structural predictions in the context of a particular classification of 3D folds. Proteins are in a common fold if they share the same major secondary structures in the same arrangement with the same topological connections [5, 6]. In the taxonomic method for protein fold classification, there are several fold discriminatory data sources, or groups of attributes, available, such as amino acid composition, predicted secondary structure, and selected structural and physicochemical properties of the constituent amino acids. Previous methods for integrating these heterogeneous data sources include simply merging them together or combining classifiers trained on the individual data sources [3, 4, 7, 8]. However, how to integrate fold discriminatory data sources systematically and efficiently, without resorting to ad hoc ensemble learning, remains a challenging problem.
Kernel methods [9, 10] have been successfully used for data fusion in biological applications. Kernel matrices encode the similarity between data objects within a given input space, and these data objects can include graphs and sequence strings in addition to real-valued or integer data. Thus the problem of data integration is transformed into the problem of learning the most appropriate combination of candidate kernel matrices representing these heterogeneous data sources. The typical framework is to learn a linear combination of candidate kernels. This is often termed multiple kernel learning (MKL) in machine learning, and nonparametric group lasso in statistics. Recent trends in kernel learning are usually based on the margin maximization criterion used by Support Vector Machines (SVMs) or variants [11]. The popularity of SVM margin-based kernel learning stems from its efficient optimization formulations [11–14] and sound theoretical foundation [11, 15, 16]. Other data integration methods include the COSSO estimate for additive models [17], kernel discriminant analysis [18], multi-label multiple kernel learning [19, 20] and Bayesian probabilistic models [21, 22]. These methods, in general, can combine multiple data sources to enhance biological inference [21, 23] and provide insights into the significance of the different data sources used.
Following a different approach, in this paper we propose an alternative criterion for kernel matrix learning and data integration, which we will call MKLdiv. Specifically, we propose an information-theoretic approach to learn a linear combination of kernel matrices, encoding information from different data sources, through the use of a Kullback-Leibler divergence [24–28] between two zero-mean Gaussian distributions defined by the input matrix and output matrix. The potential advantage of this approach is that, by choosing different output matrices, the method can be easily extended to different learning tasks, such as multiclass classification and multi-task learning. These are common tasks in biological data analysis.
To illustrate the method, we will focus on learning a linear combination of candidate kernel matrices (heterogeneous data sources) using the KL-divergence criterion, with a main application to the protein fold prediction problem. There are two different formulations based on the relative position of the input kernel matrix and the output kernel matrix in the KL-divergence objective. For the first formulation, although this approach involves a matrix determinant term which is not convex in general, we elegantly reformulate the learning task as a difference of convex problem, which can be efficiently solved by a sequence of convex optimizations. Hence we refer to it as MKLdiv-dc. The second KL-divergence formulation for kernel integration, called MKLdiv-conv, is convex and can be solved by a projected gradient descent algorithm. Experimental results show that these formulations lead to state-of-the-art prediction performance. In particular, MKLdiv-dc outperforms the best reported performance on the important task of protein fold recognition, for the benchmark dataset used.
Methods
In the following we first revisit kernel learning approaches based on SVMs [11] and kernel discriminant analysis [18]. Then, we introduce our novel information-theoretic approach for data integration based on a KL-divergence criterion. Finally we discuss how to solve the optimization task efficiently. For brevity, we use the conventional notation ℕ_{ n }= {1, 2, ..., n} for any n ∈ ℕ.
Background and Related Work
Kernel methods are extensively used for biological data analysis. A symmetric function K : X × X → ℝ is called a kernel function if it is positive semidefinite, by which we mean that, for any n ∈ ℕ and {x_{ i }∈ X: i ∈ ℕ_{ n }}, the Gram matrix (K(x_{ i }, x_{ j }))_{i,j ∈ ℕ_{ n }} is positive semidefinite. According to [29], its corresponding reproducing kernel Hilbert space (RKHS), usually denoted by ℋ_{ K }, can be defined as the completion of the linear span of the set of functions {K_{ x }(·) := K(x, ·): x ∈ X}, with inner product satisfying, for any x ∈ X and g ∈ ℋ_{ K }, the reproducing property ⟨K_{ x }, g⟩_{ K }= g(x). By Mercer's theorem, there exists a high-dimensional (possibly infinite-dimensional) Hilbert feature space ℱ with inner product ⟨·, ·⟩_{ℱ} and a feature map ϕ: X → ℱ such that K(x, t) = ⟨ϕ (x), ϕ (t)⟩_{ℱ} for all x, t ∈ X. Intuitively, the kernel function K implicitly maps the data space X into the high-dimensional space ℱ; see [9, 10] for more details.
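As a small numerical illustration of these definitions (not part of the original study; the Gaussian kernel and random data below are arbitrary choices), the following sketch builds a Gram matrix and verifies that it is positive semidefinite:

```python
import numpy as np

def rbf_gram(X, gamma=1.0):
    """Gram matrix of the Gaussian (RBF) kernel K(x, t) = exp(-gamma * ||x - t||^2)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 5))
K = rbf_gram(X)

# Positive semidefiniteness of the Gram matrix: all eigenvalues are (numerically) >= 0.
assert np.linalg.eigvalsh(K).min() > -1e-10
```

Any symmetric positive semidefinite matrix constructed this way can serve as one candidate kernel matrix in the combinations considered below.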
The above kernel learning formulation can be solved by a semidefinite programming (SDP) approach (see Section 4.7 of [11]). However, an SDP formulation is computationally intensive.
The L^{1}-regularization term encourages sparsity [32] of the RKHS-norm terms, and thus indicates the relative importance of data sources. It was shown in [13] that the standard L^{2}-regularization is equivalent to the use of uniformly weighted kernel weights, i.e. λ_{ℓ} = 1/m for any ℓ ∈ ℕ_{ m }. Recently, Ye et al. [18] proposed an appealing kernel learning approach based on regularized kernel discriminant analysis. This can similarly be shown to be equivalent to a sparse L^{1}-regularization formulation with a least square loss; see Appendix 2 for more details.
Information-theoretic Data Integration
In this paper we adopt a novel information-theoretic approach to learn the kernel combinatorial weights. The main idea is to quantify the similarity between K_{ λ } and K_{ y } through a Kullback-Leibler (KL) divergence, or relative entropy, term [24–28]. This approach is based on noting that these kernel matrices encode the similarity of data objects within their respective input and label data spaces. Furthermore, there is a simple bijection between the set of distance measures in these data spaces and the set of zero-mean multivariate Gaussian distributions [25]. Using this bijection, the difference between two distance measures, parameterized by K_{ λ } and K_{ y }, can be quantified by the relative entropy or Kullback-Leibler (KL) divergence between the corresponding multivariate Gaussians. Matching the kernel matrices K_{ λ } and K_{ y } can therefore be realized by minimizing a KL divergence between these distributions, and we exploit this approach below in the context of multiple kernel learning.
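For reference, the KL divergence between two zero-mean multivariate Gaussians with positive definite covariances P and Q has the standard closed form ½[Tr(Q^{-1}P) − n + log det Q − log det P]. The sketch below evaluates this textbook identity numerically, on illustrative random matrices standing in for the (σ-regularized) kernel matrices rather than the paper's actual data:

```python
import numpy as np

def kl_zero_mean_gaussians(P, Q):
    """KL( N(0, P) || N(0, Q) ) = 0.5 * ( Tr(Q^{-1} P) - n + log det Q - log det P )."""
    n = P.shape[0]
    _, logdet_P = np.linalg.slogdet(P)
    _, logdet_Q = np.linalg.slogdet(Q)
    return 0.5 * (np.trace(np.linalg.solve(Q, P)) - n + logdet_Q - logdet_P)

rng = np.random.default_rng(1)
A = rng.standard_normal((8, 8))
P = A @ A.T + 8 * np.eye(8)   # stand-in for K_y (plus a ridge, as in the formulations)
B = rng.standard_normal((8, 8))
Q = B @ B.T + 8 * np.eye(8)   # stand-in for K_lambda (plus a ridge)

assert abs(kl_zero_mean_gaussians(P, P)) < 1e-9   # divergence of a Gaussian to itself is 0
assert kl_zero_mean_gaussians(P, Q) >= 0.0        # KL divergence is nonnegative
```

The asymmetry of this divergence in P and Q is exactly what gives rise to the two formulations discussed next.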
where I_{ n }denotes the n × n identity matrix and σ > 0 is a small parameter added to avoid the singularity of K_{ λ }.
where the parameter σ > 0 is added to avoid the singularity of K_{ y }. If there is no positive semidefiniteness restriction over K_{ℓ}, this formulation is a well-known convex maximum-determinant problem [33], which is more general than semidefinite programming (SDP); its implementation is computationally intensive and thus cannot be extended to large-scale problems, according to [33]. However, formulation (5) has a special structure here: λ_{ℓ} is nonnegative and all candidate kernel matrices are positive semidefinite. Hence, we can solve this problem by a simple projected gradient descent method; see below for more details.
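A generic sketch of such a projected gradient descent over the simplex Δ = {λ : λ_ℓ ≥ 0, Σ_ℓ λ_ℓ = 1} is given below. The toy quadratic objective is a stand-in only; the actual MKLdiv-conv gradient, which involves the kernel matrices, is not reproduced here:

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the simplex {x : x >= 0, sum(x) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > css - 1.0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def pgd(grad, lam0, eta=0.1, iters=200):
    """Projected gradient descent: a gradient step followed by simplex projection."""
    lam = lam0.copy()
    for _ in range(iters):
        lam = project_simplex(lam - eta * grad(lam))
    return lam

# toy objective ||lam - c||^2 over the simplex; its minimizer is the projection of c
c = np.array([0.5, 0.8, -0.2])
lam = pgd(lambda l: 2.0 * (l - c), np.full(3, 1.0 / 3.0))
assert abs(lam.sum() - 1.0) < 1e-8 and lam.min() >= -1e-12
```

The projection step keeps every iterate a valid set of kernel weights, so no extra constraint handling is needed inside the gradient loop.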
The KL-divergence criterion for kernel integration was also successfully used in [27, 28], which formulated the problem of supervised network inference as a kernel matrix completion problem. In terms of information geometry [34], formulation (4) corresponds to finding the m-projection of K_{ y } over an e-flat submanifold. The convex problem (5) can be regarded as finding the e-projection of K_{ y } over an m-flat submanifold. In [26], formulation (4) was developed for learning an optimal linear combination of diffusion kernels for biological networks. A gradient-based method was employed in [26] to learn a proper linear combination of diffusion kernels. This optimization method relies largely on the special property that all candidate diffusion kernel matrices share the same eigenvectors, and the gradient-based learning method could be problematic if we deal with general kernel matrices. In the next section, we propose to solve the general kernel learning formulation (4) using a difference of convex optimization method.
where I_{ n }denotes the n × n identity matrix and the latent random variable f = (f(x_{1}), ..., f(x_{ n })) is distributed as a Gaussian process prior. The Gaussian process prior can be fully specified by a kernel K with the covariance matrix associated with the samples x = {x_{ i }: i ∈ ℕ_{ n }}; specifically, it can be written as f | x ~ N(0, K_{ λ }). We assume a uniform distribution over λ, i.e. a Dirichlet prior distribution with α_{0} = 1. If we let K_{ y } = yy^{⊤} in the objective function of formulation (4), then one can easily check that, up to a constant term, the objective function in formulation (4) is the negative of the log likelihood of Gaussian process regression, and maximizing the log likelihood is equivalent to the minimization problem (4).
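In display form, this standard Gaussian process identity reads as follows (reproduced here because the extraction lost the displayed equations; with K_{ y } = yy^{⊤}, the identity Tr((K_{λ} + σI_{ n })^{-1} yy^{⊤}) = y^{⊤}(K_{λ} + σI_{ n })^{-1} y turns the trace term of (4) into the quadratic term below):

```latex
-\log p(y \mid x, \lambda)
  \;=\; \tfrac{1}{2}\, y^{\top}\!\left(K_{\lambda} + \sigma I_n\right)^{-1} y
  \;+\; \tfrac{1}{2}\,\log\det\!\left(K_{\lambda} + \sigma I_n\right)
  \;+\; \tfrac{n}{2}\,\log 2\pi .
```

Minimizing this negative log likelihood over λ therefore matches formulation (4) up to the additive constant (n/2) log 2π.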
Optimization Formulation
Proof It suffices to prove the convexity of f and g. To this end, from [38] we observe that the functions C ↦ −log det(C) and C ↦ Tr(K_{ y }C^{−1}) are convex with respect to positive definite matrices C. Hence, f and g are convex with respect to λ ∈ Δ. This completes the proof of the theorem.
For simplicity we refer to the KL-divergence kernel learning formulation (4) as MKLdiv-dc, since it is a difference of convex problem, and refer to formulation (5) as MKLdiv-conv, since it is a convex problem.
Projected Gradient Descent Method for MKLdiv-conv
where ‖·‖_{Fro} denotes the Frobenius norm of a matrix. Hence, the projected gradient descent algorithm may take longer to converge if the value of σ is very small.
Difference of Convex Algorithm for MKLdiv-dc
which means that the objective value ℒ(λ^{(t)}) monotonically decreases with each iteration. Consequently, we can use the relative change of the objective function as a stopping criterion. Local convergence of the DCA algorithm is proven in [36] (Lemma 3.6, Theorem 3.7). Tao and An [36] state that the DCA often converges to the global solution. Overall, the DC programming approach to MKLdiv-dc can be summarized as follows.

1. Choose an initial point λ^{(0)} and a stopping threshold ε.

2. Given the solution λ^{(t)} at step t, for step t + 1 first compute ▽g(λ^{(t)}) by equation (9); then compute the solution λ^{(t+1)} of the convex subproblem (13).

3. Stop when the relative change of the objective, |ℒ(λ^{(t)}) − ℒ(λ^{(t+1)})| / |ℒ(λ^{(t)})|, falls below the threshold ε.
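A generic sketch of this DCA loop is given below, on a toy one-dimensional difference-of-convex objective with a closed-form subproblem (the actual MKLdiv-dc subproblem (13) is instead solved via the SILP formulation described next):

```python
import math

def dca(f, g, grad_g, argmin_linearized, x0, eps=1e-8, max_iter=100):
    """DC algorithm for min f(x) - g(x): linearize g at the current iterate and
    solve the convex subproblem x_{t+1} = argmin_x f(x) - grad_g(x_t) * x,
    stopping on the relative change of the objective."""
    x, obj = x0, f(x0) - g(x0)
    for _ in range(max_iter):
        x = argmin_linearized(grad_g(x))
        new_obj = f(x) - g(x)
        if abs(obj - new_obj) <= eps * abs(obj):
            break
        obj = new_obj
    return x

# toy DC objective: f(x) = x^2 and g(x) = 2 log(1 + x^2); the linearized
# subproblem argmin_x x^2 - s*x has the closed form s/2, and the global
# minimizers of f - g are x = +/- 1
x_star = dca(f=lambda x: x * x,
             g=lambda x: 2.0 * math.log(1.0 + x * x),
             grad_g=lambda x: 4.0 * x / (1.0 + x * x),
             argmin_linearized=lambda s: s / 2.0,
             x0=2.0)
assert abs(x_star - 1.0) < 1e-4
```

Each subproblem is convex, so the loop matches the monotone-decrease property of the objective noted above.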
SILP Formulation for the Convex Subproblem (13)
In (16), there are an infinite number of constraints (indexed by α), indicative of a semi-infinite linear programming (SILP) problem. The SILP task can be solved by an iterative column generation algorithm (or exchange method) which is guaranteed to converge to a global optimum. A brief description of the column generation method is given in Appendix 1.
Alternatively, we could apply the projected gradient descent (PGD) method of the above subsection directly to the convex subproblem (13). However, the gradient of its objective function involves the matrix . In analogy to the argument of inequality (12), the Lipschitz constant of the gradient of the objective function in (13) is very large when the value of σ is very small, and thus the projected gradient descent algorithm may take longer to converge. Hence, this could make the overall DC programming unacceptably slow. In contrast, in the SILP formulation (16) we introduce the auxiliary variables α to avoid this matrix. In addition, the gradient descent algorithm generally needs to determine the step size η according to the value of σ; see also the discussion in the experimental section.
Prior Choice of the Output Kernel Matrix
For the protein fold recognition and yeast protein function prediction projects discussed below, we choose K_{ y } = YY^{⊤} as stated.
In general, though, K_{ y } might encode a known structural relationship between the labels. For example, in supervised gene or protein network inference (see e.g. [41, 42]) the output information corresponds to an adjacency (square) matrix A, where A_{ ij }= 1 means there is an interaction between the gene or protein pair (e_{ i }, e_{ j }) of an organism, and otherwise A_{ ij }= 0. In this case, the output kernel matrix K_{ y } can potentially be chosen as the graph Laplacian defined by L = diag(A1) − A, where 1 is the vector of all ones. It can also be chosen as a diffusion kernel [43], defined by e^{−βL} with hyperparameter β > 0. Other potential choices of K_{ y } can be found in [19, 20] for multi-labeled datasets.
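A sketch of these two output-kernel choices for a small interaction graph is given below, assuming the standard Kondor-Lafferty form exp(−βL) for the diffusion kernel [43] (the three-node graph is an illustrative toy, not data from the paper):

```python
import numpy as np

def graph_laplacian(A):
    """Graph Laplacian L = diag(A 1) - A of a symmetric adjacency matrix A."""
    return np.diag(A.sum(axis=1)) - A

def diffusion_kernel(A, beta=1.0):
    """Diffusion kernel exp(-beta * L), computed by eigendecomposition of the
    symmetric Laplacian (assuming the standard form of [43])."""
    w, V = np.linalg.eigh(graph_laplacian(A))
    return (V * np.exp(-beta * w)) @ V.T

# toy 3-protein interaction graph: pairs (e1, e2) and (e2, e3) interact
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
K_y = diffusion_kernel(A, beta=0.5)
assert np.allclose(K_y, K_y.T) and np.linalg.eigvalsh(K_y).min() > 0.0
```

Unlike the raw Laplacian, the diffusion kernel is strictly positive definite, so it can be used directly as an output kernel matrix without extra regularization.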
Results and Discussion
We mainly evaluate the MKLdiv methods (MKLdiv-dc and MKLdiv-conv) on protein fold recognition, and then consider an extension to the problem of yeast protein function prediction. In these tasks we first compute the kernel weights by MKLdiv and then feed these into a one-against-all multiclass SVM to make predictions. The trade-off parameter in the multiclass SVM is adjusted by 3-fold cross validation over the training dataset. For all experiments with MKLdiv-dc we choose σ = 10^{−5}, and for MKLdiv-conv we tune σ ∈ {10^{−5}, ..., 10^{−1}} using cross validation. In both methods we use a stopping criterion of ε = 10^{−5} and initialize the kernel weights by setting λ_{ℓ} = 1/m for any ℓ ∈ ℕ_{ m }, where m is the number of candidate kernel matrices.
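The composite kernel used by both methods is the linear combination K_λ = Σ_ℓ λ_ℓ K_ℓ with λ on the simplex. A minimal sketch of the uniform initialization λ_ℓ = 1/m and the combination step follows, with random PSD matrices standing in for the candidate kernel matrices:

```python
import numpy as np

def combine_kernels(kernels, lam):
    """Composite kernel K_lambda = sum_l lam_l * K_l, with lam on the simplex."""
    lam = np.asarray(lam, dtype=float)
    assert lam.min() >= 0.0 and abs(lam.sum() - 1.0) < 1e-12
    return sum(w * K for w, K in zip(lam, kernels))

rng = np.random.default_rng(2)
m = 4                                    # number of candidate kernel matrices
Ks = [B @ B.T for B in (rng.standard_normal((10, 3)) for _ in range(m))]  # PSD candidates
lam0 = np.full(m, 1.0 / m)               # uniform initialization lam_l = 1/m
K = combine_kernels(Ks, lam0)
assert np.linalg.eigvalsh(K).min() > -1e-10   # a nonnegative combination stays PSD
```

Because every weight is nonnegative, the combined matrix is itself a valid kernel and can be passed to any kernel classifier, such as the one-against-all SVM used here.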
Synthetic Data
Protein Fold Recognition
Next we evaluated MKLdiv on a well-known protein fold prediction dataset [3]. This benchmark dataset (based on SCOP PDB-40D) was originally proposed by Ding and Dubchak [3]; it has 27 SCOP fold classes with 313 samples for training and 385 for testing, and there is less than 35% sequence identity between any two proteins in the training and test sets. We follow Shen and Chou [4] in excluding two proteins from the training and test datasets due to a lack of sequence information, leaving 311 proteins for training and 383 for testing. We compare our MKLdiv methods with kernel learning based on the one-against-all multiclass SVM using the SimpleMKL software package [44], kernel learning for regularized discriminant analysis (MKL-RKDA) [18] (http://www.public.asu.edu/~jye02/Software/DKL/), and a probabilistic Bayesian model for kernel learning (VBKC) [21]. The trade-off parameters in SimpleMKL and MKL-RKDA were also adjusted by 3-fold cross validation on the training set.
Description of the Fold Discriminatory Data Sources
Performance with individual and all data sources
Data sources  MKLdiv-dc  MKLdiv-conv  SimpleMKL  VBKC  MKL-RKDA
Amino acid composition (C)  51.69  51.69  51.83  51.2 ± 0.5  45.43
Predicted secondary structure (S)  40.99  40.99  40.73  38.1 ± 0.3  38.64
Hydrophobicity (H)  36.55  36.55  36.55  32.5 ± 0.4  34.20
Polarity (P)  35.50  35.50  35.50  32.2 ± 0.3  30.54
van der Waals volume (V)  37.07  37.07  37.85  32.8 ± 0.3  30.54
Polarizability (Z)  37.33  37.33  36.81  33.2 ± 0.4  30.28
PseAA λ = 1 (L1)  44.64  44.64  45.16  41.5 ± 0.5  36.55
PseAA λ = 4 (L4)  44.90  44.90  44.90  41.5 ± 0.4  38.12
PseAA λ = 14 (L14)  43.34  43.34  43.34  38 ± 0.2  40.99
PseAA λ = 30 (L30)  31.59  31.59  31.59  32 ± 0.2  36.03
SW with BLOSUM62 (SW1)  62.92  62.92  62.40  59.8 ± 1.9  61.87
SW with PAM50 (SW2)  63.96  63.96  63.44  49 ± 0.7  64.49
All data sources  73.36  71.01  66.57  68.1 ± 1.2  68.40
Uniform weighted  68.40  68.40  68.14  –  66.06
As in [21], we employ linear kernels (Smith-Waterman scores) for SW1 and SW2 and second order polynomial kernels for the other data sources. Ding and Dubchak [3] conducted an extensive study on the use of various multiclass variants of standard SVMs and neural network classifiers. For these authors the best test set accuracy (TSA) was 56%, and the most informative among their six data sources (CSHPVZ) were amino acid composition (C), predicted secondary structure (S) and hydrophobicity (H). Shen and Chou [4] introduced four additional PseAA data sources to replace the amino acid composition (C) and raised test performance to 62.1%. The latter authors used an ad hoc ensemble learning approach involving a combination of multiclass k-nearest-neighbor classifiers individually trained on each data source. Recently, test performance was greatly improved by Damoulas and Girolami [21] using a Bayesian multiclass multi-kernel algorithm. They reported a best test accuracy of 70% on a single run.
Performance with Individual and All Data Sources
We ran MKLdivdc, MKLdivconv, SimpleMKL and MKLRKDA on the overall set of 12 data sources, also evaluating performance on a uniformly weighted (averaged) composite kernel in addition to individual performance on each separate data source. In Table 1 we report the test set accuracy on each individual data source. The performance of MKLdivdc and MKLdivconv inclusive of all data sources achieves a test set accuracy of 73.36% and 71.01% respectively, consistently outperforming all individual performances and the uniformly weighted composite kernel (68.40%). Moreover, individual performance for MKLdivdc, SimpleMKL and MKLRKDA indicates that the most informative data sources are local sequence alignments (SW1 and SW2) and the amino acid composition (C). The performance with individual data sources for MKLdivdc, MKLdivconv, and SimpleMKL are almost the same since, for a fixed kernel, they use the same oneagainstall multiclass SVM.
Performance with Sequential Addition of Data Sources
Effects of sequentially adding data sources
Data sources  MKLdiv-dc  MKLdiv-conv  VBKC  SimpleMKL  MKL-RKDA
C  51.69  51.69  51.2 ± 0.5  51.69  47.25 
CS  56.39 (20.23 s)  55.35 (0.32 s)  55.7 ± 0.5 ()  55.61 (14.67 s)  48.30 (0.15 s) 
CSH  57.70 (50.35 s)  58.22 (2.44 s)  57.7 ± 0.6 ()  56.91 (10.40 s)  55.61 (0.12 s) 
CSHP  58.48 (39.02 s)  53.52 (72.14 s)  57.9 ± 0.9 ()  57.96 (17.84 s)  56.65 (0.18 s) 
CSHPV  60.05 (75.05 s)  53.26 (86.39 s)  58.1 ± 0.8 ()  57.96 (15.05 s)  55.87 (0.17 s) 
CSHPVZ  59.26 (135.08 s)  53.52 (99.64 s)  58.6 ± 1.1 ()  59.00 (20.02 s)  57.70 (0.20 s) 
CSHPVZL1  60.05 (221.75 s)  52.74 (122.74 s)  60.0 ± 0.8 ()  61.35 (27.38 s)  57.70 (0.21 s) 
CSHPVZL1L4  62.14 (315.70 s)  52.74 (129.08 s)  60.8 ± 1.1 ()  61.61 (151.38 s)  58.22 (0.25 s) 
CSHPVZL1L4L14  62.14 (450.57 s)  61.09 (57.09 s)  61.5 ± 1.2 ()  60.05 (42.81 s)  59.53 (0.25 s) 
CSHPVZL1L4L14L30  62.14 (612.72 s)  62.14 (67.29 s)  62.2 ± 1.3 ()  62.40 (64.74 s)  55.61 (0.25 s) 
CSHPVZL1L4L14L30SW1  71.80 (620.16 s)  71.54 (17.97 s)  66.4 ± 0.8 ()  65.79 (78.94 s)  66.84 (0.31 s) 
CSHPVZL1L4L14L30SW1SW2  73.36 (805.11 s)  71.01 (84.21 s)  68.1 ± 1.2 ()  66.57 (196.42 s)  68.40 (0.31 s) 
SHPVZL1L4L14L30  60.57 (438.89 s)  61.09 (67.92 s)  61.1 ± 1.4 ()  59.00 (44.79 s)  54.56 (0.25 s) 
Effects of sequentially adding data sources (continued)
Data sources  MKLdiv-dc  MKLdiv-conv  SimpleMKL  MKL-RKDA
SW1  62.92  62.92  62.40  61.87 
SW1S  65.27 (24.72 s)  66.31 (10.49 s)  64.22 (40.60 s)  64.75 (0.12 s) 
SW1SW2S  67.10 (48.79 s)  66.05 (4.65 s)  64.75 (61.71 s)  64.49 (0.15 s) 
SW1SW2CS  73.36 (40.65 s)  72.32 (23.43 s)  65.01 (62.81 s)  67.62 (0.17 s) 
SW1SW2CSH  74.67 (72.19 s)  72.32 (8.69 s)  66.31 (75.11 s)  67.88 (0.15 s) 
SW1SW2CSHP  74.93 (123.98 s)  74.41 (11.63 s)  66.31 (74.85 s)  69.71 (0.18 s) 
SW1SW2CSHPZ  75.19 (189.91 s)  73.36 (15.00 s)  68.92 (109.09 s)  66.05 (0.20 s) 
SW1SW2CSHPZV  74.41 (278.47 s)  74.41 (17.47 s)  66.31 (117.14 s)  69.19 (0.25 s) 
SW1SW2CSHPZVL1  73.10 (404.82 s)  73.32 (49.41 s)  66.84 (101.01 s)  68.66 (0.25 s) 
SW1SW2CSHPZVL1L4  72.84 (576.29 s)  72.06 (57.83 s)  67.10 (107.88 s)  67.62 (0.25 s) 
SW1SW2CSHPZVL1L4L14  72.58 (748.72 s)  72.36 (19.43 s)  66.84 (163.85 s)  69.19 (0.28 s) 
SW1SW2CSHPZVL1L4L14L30  73.36 (811.54 s)  71.01 (83.93 s)  66.57 (197.57 s)  68.40 (0.31 s) 
We first report in Table 2 the effect of sequentially adding the sources in the order used in [3] and [21]; MKLdiv-dc and MKLdiv-conv consistently outperform the competitive kernel learning methods VBKC, SimpleMKL and MKL-RKDA, as well as the best performing SVM combination methodology of [3]. As suggested by the kernel weights of MKLdiv-dc in subfigure (b) of Figure 2, the sequence alignment based data source SW1 is most informative, then S, then SW2, and so on. Hence, in Table 3 we further report the effect of sequentially adding data sources in this rank order. As shown in Table 3, there is a significant improvement over SW1SW2 in MKLdiv-dc when we sequentially add the amino acid composition (C) and predicted secondary structure (S) data sources. The performance of MKLdiv-dc keeps increasing until we include CSHPZ, giving the best performance of 75.19%, although according to [4] the PseAA data sources are believed to contain more information than the conventional amino acid composition. However, the performance of both MKLdiv-dc and MKLdiv-conv degenerates if we continue to add PseAA composition data sources. Similar observations were made in [21], suggesting that the PseAA measurements may carry information that is non-complementary to the conventional amino acid compositions.
Comparison of Running Time
To investigate the run-time efficiency of MKLdiv on the protein fold recognition dataset, we list CPU times in Tables 2 and 3; the running time in seconds is the term inside the parentheses. The SILP approach for MKL-RKDA is very efficient, while SimpleMKL takes a bit longer. The reason could be that MKL-RKDA essentially uses the least-square loss for multiclass classification, in contrast to the one-against-all SVM used in SimpleMKL; generally, running the interior-point method for the one-against-all SVM takes more time than directly computing the solution of least-square regression. The projected gradient descent method for MKLdiv-conv is also slower than MKL-RKDA. It is to be expected that MKLdiv-conv converges faster than MKLdiv-dc, since the DC algorithm for MKLdiv-dc is non-convex and needs to solve the subproblem (13) in each iteration of CCCP. Nevertheless, the price paid in running time for MKLdiv-dc is worthwhile given its significantly better performance on the protein fold prediction problem.
Sensitivity to the Parameter σ
Extension of Investigation to Yeast Protein Classification
We next extend our investigation of MKLdiv-dc and MKLdiv-conv to a yeast membrane protein classification problem [23]. This binary classification task has 2316 examples derived from the MIPS comprehensive Yeast Genome Database (CYGD) (see [46]). There are eight kernel matrices (http://noble.gs.washington.edu/proj/sdp-svm/). The first three kernels (K_{SW}, K_{B}, and K_{Pfam}) are designed to measure the similarity of protein sequences using, respectively, the BLAST and Smith-Waterman pairwise sequence comparison algorithms and a generalization of pairwise comparison derived from hidden Markov models. The fourth sequence-based kernel matrix (K_{FFT}), computed by fast Fourier transform, incorporates information about hydrophobicity, which is known to be useful in identifying membrane proteins. The fifth and sixth kernel matrices (K_{LI}, K_{D}) are, respectively, linear and diffusion kernels derived from protein-protein interaction information. The seventh kernel matrix (K_{E}) is a Gaussian kernel encoding gene expression data. Finally, we added a noise kernel matrix K_{Ran}, a linear kernel computed on randomly generated numbers.
Conclusion
In this paper we developed a novel information-theoretic approach to learning a linear combination of kernel matrices based on the KL-divergence [24–28], with a particular focus on the protein fold recognition problem. Based on the different positions of the input kernel matrix and the output kernel matrix in the KL-divergence objective, there are two formulations. The first is a difference of convex (DC) problem termed MKLdiv-dc, and the second is a convex formulation called MKLdiv-conv. The sparse formulation for kernel learning based on discriminant analysis [18] was also established. Our proposed methods are able to achieve state-of-the-art performance on the SCOP PDB-40D benchmark dataset for the protein fold recognition problem. In particular, MKLdiv-dc further improves the fold discrimination accuracy to 75.19%, which is more than a 5% improvement over a competitive Bayesian probabilistic approach [21], SVM margin-based kernel learning methods [11], and kernel learning based on discriminant analysis [18]. We further extended the investigation to the problem of yeast protein function prediction.
Generally, it is difficult to determine which criterion is better for multiple kernel combination since this problem is highly data-dependent. For the information-theoretic approaches MKLdiv-dc and MKLdiv-conv, although MKLdiv-dc is not convex and its DC algorithm tends to find a local minimum, in practice we would recommend MKLdiv-dc for the following reasons. Firstly, as mentioned above, MKLdiv-dc has a close relation with the kernel matrix completion problem using information geometry [27, 28] and the maximization of the log likelihood of Gaussian process regression [35], which partly explains its success. Secondly, we empirically observed that MKLdiv-dc outperforms MKLdiv-conv in protein fold recognition and yeast protein function prediction. Finally, as we showed in Figure 4, the performance of MKLdiv-conv is quite sensitive to the parameter σ, and the choice of σ remains a challenging problem, whereas MKLdiv-dc is relatively stable with respect to small values of σ and we can fix σ to a very small number, e.g. σ = 10^{−5}. In future work, we plan to empirically compare performance with other existing kernel integration formulations on various datasets, and to discuss convergence properties of the DC algorithm for MKLdiv-dc based on the theoretical results of [36].
Appendix
Appendix 1 – Column generation method for SILP
For instance, the threshold ε is usually chosen to be 5 × 10^{−4}.
Appendix 2 – Sparse formulation of kernel learning based on discriminant analysis
The equivalence between the above algorithm and RKDA kernel learning becomes clear if we formulate its dual problem as follows:
Now, replacing α_{ i }by μα_{ i }and letting completes the argument. □
Now we can see that the dual problem of algorithm (24) is exactly the same as the formulation (see equation (36) in [18]) of RKDA kernel learning.
Declarations
Acknowledgements
We would like to thank the referees for their constructive comments and suggestions, which greatly improved the paper. We also thank Prof. Mark Girolami, Dr. Theodoros Damoulas and Dr. Simon Rogers for stimulating discussions. This work is supported by EPSRC grant EP/E027296/1.
Authors’ Affiliations
References
 Baker D, Sali A: Protein structure prediction and structural genomics. Science 294: 93–96. 10.1126/science.1065659Google Scholar
 Jones DT, et al.: A new approach to protein fold recognition. Nature 1992, 358: 86–89. 10.1038/358086a0
 Ding C, Dubchak I: Multi-class protein fold recognition using support vector machines and neural networks. Bioinformatics 2001, 17: 349–358. 10.1093/bioinformatics/17.4.349
 Shen HB, Chou KC: Ensemble classifier for protein fold pattern recognition. Bioinformatics 2006, 22: 1717–1722. 10.1093/bioinformatics/btl170
 Andreeva A, et al.: SCOP database in 2004: refinements integrate structure and sequence family data. Nucleic Acids Res 2004, 32: 226–229. 10.1093/nar/gkh039
 Lo Conte L, et al.: SCOP: a structural classification of proteins database. Nucleic Acids Res 2000, 28: 257–259. 10.1093/nar/28.1.257
 Chou K, Zhang C: Prediction of protein structural classes. Crit Rev Biochem Mol Biol 1995, 30: 275–349. 10.3109/10409239509083488
 Dubchak I, et al.: Prediction of protein folding class using global description of amino acid sequence. Proc Natl Acad Sci 1995, 92: 8700–8704. 10.1073/pnas.92.19.8700
 Schölkopf B, Smola AJ: Learning with Kernels. The MIT Press, Cambridge, MA, USA; 2002.
 Shawe-Taylor J, Cristianini N: Kernel Methods for Pattern Analysis. Cambridge University Press; 2004.
 Lanckriet GRG, Cristianini N, Bartlett PL, El Ghaoui L, Jordan MI: Learning the kernel matrix with semidefinite programming. J of Machine Learning Research 2004, 5: 27–72.
 Bach F, Lanckriet GRG, Jordan MI: Multiple kernel learning, conic duality and the SMO algorithm. Proceedings of the Twenty-first International Conference on Machine Learning (ICML) 2004.
 Rakotomamonjy A, Bach F, Canu S, Grandvalet Y: More efficiency in multiple kernel learning. Proceedings of the 24th International Conference on Machine Learning (ICML) 2007.
 Sonnenburg S, Rätsch G, Schäfer C, Schölkopf B: Large scale multiple kernel learning. J of Machine Learning Research 2006, 7: 1531–1565.
 Bach F: Consistency of the group Lasso and multiple kernel learning. J of Machine Learning Research 2008, 9: 1179–1225.
 Ying Y, Zhou DX: Learnability of Gaussians with flexible variances. J of Machine Learning Research 2007, 8: 249–276.
 Lin Y, Zhang H: Component selection and smoothing in multivariate nonparametric regression. Annals of Statistics 2006, 34: 2272–2297. 10.1214/009053606000000722
 Ye J, Ji S, Chen J: Multi-class discriminant kernel learning via convex programming. J of Machine Learning Research 2008, 9: 719–758.
 Ye J, et al.: Heterogeneous data fusion for Alzheimer's disease study. The Fourteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD) 2008.
 Ji S, Sun L, Jin R, Ye J: Multi-label multiple kernel learning. The Twenty-Second Annual Conference on Neural Information Processing Systems (NIPS) 2008.
 Damoulas T, Girolami M: Probabilistic multiclass multi-kernel learning: on protein fold recognition and remote homology detection. Bioinformatics 2008, 24(10): 1264–1270. 10.1093/bioinformatics/btn112
 Girolami M, Zhong M: Data integration for classification problems employing Gaussian process priors. In Advances in Neural Information Processing Systems. Cambridge, MA: MIT Press; 2007.
 Lanckriet GRG, De Bie T, Cristianini N, Jordan MI, Noble WS: A statistical framework for genomic data fusion. Bioinformatics 2004, 20(16): 2626–2635. 10.1093/bioinformatics/bth294
 Lawrence ND, Sanguinetti G: Matching kernels through Kullback-Leibler divergence minimization. Technical Report CS-04-12, Department of Computer Science, University of Sheffield; 2004.
 Davis JV, Kulis B, Jain P, Sra S, Dhillon IS: Information-theoretic metric learning. Proceedings of the 24th International Conference on Machine Learning 2007.
 Sun L, Ji S, Ye J: Adaptive diffusion kernel learning from biological networks for protein function prediction. BMC Bioinformatics 2008, 9: 162. 10.1186/1471-2105-9-162
 Tsuda K, Akaho S, Asai K: The em algorithm for kernel matrix completion with auxiliary data. J of Machine Learning Research 2003, 4: 67–81. 10.1162/153244304322765649
 Kato T, Tsuda K, Asai K: Selective integration of multiple biological data for supervised network inference. Bioinformatics 2005, 21: 2488–2495. 10.1093/bioinformatics/bti339
 Aronszajn N: Theory of reproducing kernels. Trans Amer Math Soc 1950, 68: 337–404. 10.2307/1990404
 Cristianini N, Shawe-Taylor J, Elisseeff A: On kernel-target alignment. In Advances in Neural Information Processing Systems 14. Cambridge, MA: MIT Press; 2002.
 Micchelli CA, Pontil M: Learning the kernel function via regularization. J of Machine Learning Research 2005, 6: 1099–1125.
 Hastie T, Tibshirani R, Friedman J: Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer; 2001.
 Vandenberghe L, Boyd S, Wu S: Determinant maximization with linear matrix inequality constraints. SIAM J Matrix Analysis and Applications 1998, 19: 499–533. 10.1137/S0895479896303430
 Amari S: Information geometry of the EM and em algorithms for neural networks. Neural Networks 1995, 8: 1379–1408. 10.1016/0893-6080(95)00003-8
 Rasmussen CE, Williams C: Gaussian Processes for Machine Learning. The MIT Press; 2006.
 Tao PD, An LTH: A D.C. optimization algorithm for solving the trust region subproblem. SIAM J Optim 1998, 8: 476–505. 10.1137/S1052623494274313
 Yuille AL, Rangarajan A: The concave-convex procedure. Neural Computation 2003, 15: 915–936. 10.1162/08997660360581958
 Borwein JM, Lewis AS: Convex Analysis and Nonlinear Optimization: Theory and Examples. CMS Books in Mathematics, Springer-Verlag, New York; 2000.
 Nesterov Y: Introductory Lectures on Convex Optimization: A Basic Course. Springer; 2003.
 Hettich R, Kortanek KO: Semi-infinite programming: theory, methods, and applications. SIAM Review 1993, 35(3): 380–429. 10.1137/1035089
 Bleakley K, Biau G, Vert JP: Supervised reconstruction of biological networks with local models. Bioinformatics 2007, 23: 57–65. 10.1093/bioinformatics/btm204
 Yamanishi Y, Vert JP, Kanehisa M: Protein network inference from multiple genomic data: a supervised approach. Bioinformatics 2004, 20: 363–370. 10.1093/bioinformatics/bth910
 Kondor RI, Lafferty JD: Diffusion kernels on graphs and other discrete structures. Proceedings of the Nineteenth International Conference on Machine Learning 2002.
 The SimpleMKL Toolbox [http://asi.insa-rouen.fr/enseignants/~arakotom/code/mklindex.html]
 Liao L, Noble WS: Combining pairwise sequence similarity and support vector machine for detecting remote protein evolutionary and structural relationships. J Comput Biol 2003, 10(6): 857–868. 10.1089/106652703322756113
 Mewes HW, et al.: MIPS: a database for genomes and protein sequences. Nucleic Acids Res 2000, 28: 37–40. 10.1093/nar/28.1.37
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.