Multiple-kernel learning for genomic data mining and prediction

Background: Advances in medical technology have allowed for customized prognosis, diagnosis, and treatment regimens that utilize multiple heterogeneous data sources. Multiple kernel learning (MKL) is well suited for the integration of multiple high-throughput data sources, yet it remains under-utilized by genomic researchers, partly due to the lack of unified guidelines for its use and of benchmark genomic datasets.

Results: We provide three implementations of MKL in R. These methods are applied to simulated data to illustrate that MKL can select appropriate models. We also apply MKL to combine clinical information with miRNA gene expression data from an ovarian cancer study into a single analysis. Lastly, we show that MKL can identify gene sets that are known to play a role in the prognostic prediction of 15 cancer types using gene expression data from The Cancer Genome Atlas, as well as identify new gene sets for future research.

Conclusion: Multiple kernel learning coupled with modern optimization techniques provides a promising learning tool for building predictive models based on multi-source genomic data. MKL also provides an automated scheme for kernel prioritization and parameter tuning. The methods used in this paper are implemented in an R package called RMKL, which is freely available for download through CRAN at https://CRAN.R-project.org/package=RMKL.

Electronic supplementary material: The online version of this article (10.1186/s12859-019-2992-1) contains supplementary material, which is available to authorized users.

Note that $y_i(\mathbf{w} \cdot \mathbf{x}_i + b)$ is positive when $y_i$ and $\mathbf{w} \cdot \mathbf{x}_i + b$ have the same sign, i.e., when the sample is correctly classified, and the hard margin constraint requires it to be at least 1. Problem (1) is known as the hard margin formulation and is only feasible when the two groups can be perfectly separated by a linear function [1]. It is rare that data are perfectly linearly separable. We can relax (1) so that samples are allowed to be misclassified by incorporating a penalty for misclassified samples. The following convex optimization problem is referred to as the soft margin problem:

$$\min_{\mathbf{w}, b, \boldsymbol{\xi}} \; \tfrac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{n}\xi_i \quad \text{subject to} \quad y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1 - \xi_i, \;\; \xi_i \ge 0, \;\; i = 1, \dots, n, \tag{2}$$

where the $\xi_i$ are slack variables. Commonly we use $\xi_i = \max\{0, 1 - y_i(\mathbf{w} \cdot \mathbf{x}_i + b)\}$, which is known as the hinge loss function. The parameter $C$ controls the penalty for misclassification, and a value for $C$ is typically found via cross-validation. Larger values of $C$ lead to a smaller margin in order to minimize misclassifications, while smaller values of $C$ produce a larger margin that can allow more misclassifications (illustrated in Supplemental Figure 1).

Problem (2) is typically not solved directly, but rather through its Lagrangian dual [1]. The Lagrangian is the sum of the original objective function and terms that involve the constraints and their multipliers. The Lagrangian of (2) is given below:

$$L(\mathbf{w}, b, \boldsymbol{\xi}, \boldsymbol{\alpha}, \boldsymbol{\mu}) = \tfrac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{n}\xi_i - \sum_{i=1}^{n}\alpha_i\left[y_i(\mathbf{w} \cdot \mathbf{x}_i + b) - 1 + \xi_i\right] - \sum_{i=1}^{n}\mu_i\xi_i, \tag{3}$$

where $\alpha_i, \mu_i \ge 0$. The minimizers of $L$ are found by setting the gradient of $L$ equal to zero and solving the resulting system of equations:

$$\frac{\partial L}{\partial \mathbf{w}} = \mathbf{w} - \sum_{i=1}^{n}\alpha_i y_i \mathbf{x}_i = 0, \tag{4a}$$
$$\frac{\partial L}{\partial b} = -\sum_{i=1}^{n}\alpha_i y_i = 0, \tag{4b}$$
$$\frac{\partial L}{\partial \xi_i} = C - \alpha_i - \mu_i = 0. \tag{4c}$$

Equations (4b) and (4c) provide two new constraints, namely $\sum_{i=1}^{n}\alpha_i y_i = 0$ and $\alpha_i = C - \mu_i$ (so that $0 \le \alpha_i \le C$, since $\mu_i \ge 0$), and (4a) provides a representation of the optimal hyperplane, $\mathbf{w} = \sum_{i=1}^{n}\alpha_i y_i \mathbf{x}_i$. Plugging the solutions of (4) into the Lagrangian (3) yields the dual:

$$\max_{\boldsymbol{\alpha}} \; \sum_{i=1}^{n}\alpha_i - \tfrac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j (\mathbf{x}_i \cdot \mathbf{x}_j) \quad \text{subject to} \quad \sum_{i=1}^{n}\alpha_i y_i = 0, \;\; 0 \le \alpha_i \le C, \tag{5}$$

where $(\mathbf{x}_i \cdot \mathbf{x}_j)$ denotes the dot product between $\mathbf{x}_i$ and $\mathbf{x}_j$. This is a quadratic programming problem that can be handled by many solvers and can be solved much more efficiently than (2). The Karush-Kuhn-Tucker (KKT) conditions are necessary conditions for solving a non-linear programming problem; for the SVM these conditions are (4a)-(4c), together with

$$\alpha_i\left[y_i(\mathbf{w} \cdot \mathbf{x}_i + b) - 1 + \xi_i\right] = 0, \qquad \mu_i\xi_i = 0, \tag{6}$$

which lead to the following conditions for the support vectors:

$$\alpha_i = 0 \;\Rightarrow\; y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1 \text{ and } \xi_i = 0, \tag{7a}$$
$$0 < \alpha_i < C \;\Rightarrow\; y_i(\mathbf{w} \cdot \mathbf{x}_i + b) = 1 \text{ and } \xi_i = 0, \tag{7b}$$
$$\alpha_i = C \;\Rightarrow\; y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \le 1 \text{ and } \xi_i \ge 0. \tag{7c}$$

The resulting classification function produced by SVM algorithms is computed using only the samples with $\alpha_i > 0$; those satisfying $0 < \alpha_i < C$ lie exactly on the margin. These points are called support vectors, and the number of support vectors is typically much smaller than the number of samples, which helps make SVM algorithms fast [2].

Kernels can be employed to map the data into a higher-dimensional feature space where the data are linearly separable. A kernel function $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a function that, for all $\mathbf{x}_i, \mathbf{x}_j$, satisfies $K(\mathbf{x}_i, \mathbf{x}_j) := (\phi(\mathbf{x}_i) \cdot \phi(\mathbf{x}_j))$, where $\phi : \mathcal{X} \to \mathcal{H}$ and $\mathcal{H}$ is a Hilbert space. Kernel functions provide different similarity measures between samples, and the resulting kernel matrix $K$ is symmetric and positive definite. The above derivation can be extended to non-linear classification by simply replacing $\mathbf{w} \cdot \mathbf{x}$ and $\mathbf{x}_i \cdot \mathbf{x}_j$ in (2) and (3) with $f(\mathbf{x}) = K(\mathbf{w}, \mathbf{x})$ and $K(\mathbf{x}_i, \mathbf{x}_j)$, giving problem (8). The Lagrangian dual of (8) can be constructed in the same fashion as above, which yields:

$$\max_{\boldsymbol{\alpha}} \; \sum_{i=1}^{n}\alpha_i - \tfrac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j) \quad \text{subject to} \quad \sum_{i=1}^{n}\alpha_i y_i = 0, \;\; 0 \le \alpha_i \le C. \tag{9}$$

Since $K$ is a symmetric positive definite matrix, (9) is a quadratic programming problem. Unlike the linear SVM, there is typically no closed-form expression for $f(\mathbf{x})$ in terms of the original features, which makes the results more difficult to interpret [1], [2].
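The single-kernel SVM described above can be explored directly in R. The following is a minimal sketch using the kernlab package rather than RMKL; the simulated data, the radial basis kernel, the value $\sigma = 0.5$, and $C = 1$ are illustrative assumptions only.

```r
library(kernlab)

## Simulated two-class data that are not linearly separable
set.seed(1)
x <- matrix(rnorm(200 * 2), ncol = 2)
y <- factor(ifelse(x[, 1]^2 + x[, 2]^2 > 1, 1, -1))

## Linear soft margin SVM, problems (2)/(5)
fit_lin <- ksvm(x, y, type = "C-svc", kernel = "vanilladot", C = 1)

## Kernelized SVM, problem (9), with a radial basis kernel
fit_rbf <- ksvm(x, y, type = "C-svc", kernel = "rbfdot",
                kpar = list(sigma = 0.5), C = 1)

## Compare training errors of the two fits
error(fit_lin)
error(fit_rbf)
```

On data such as these, the radial basis kernel maps the samples into a feature space where they are (approximately) linearly separable, so its training error is typically much lower than that of the linear fit.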

Multiple Kernel Learning
It has been shown that a convex combination of kernel functions is itself a kernel function. An avenue for improvement over single-kernel learning is therefore to utilize several different representations of the data and allow the algorithm to use a weighted average of these representations. Using a combination of kernel functions from a set of candidate kernels helps automate kernel prioritization; this is the main idea of multiple kernel learning (MKL). Combining kernels is possible by decomposing the input space into blocks, $\mathcal{X} = \mathcal{X}_1 \times \cdots \times \mathcal{X}_m$, where each sample can be expressed as $\mathbf{x}_i = (\mathbf{x}_i^{(1)}, \dots, \mathbf{x}_i^{(m)})$. MKL can be formulated as the following optimization problem:

$$\min_{f_1, \dots, f_m, b, \boldsymbol{\xi}, \boldsymbol{\gamma}} \; \tfrac{1}{2}\sum_{j=1}^{m}\frac{1}{\gamma_j}\|f_j\|_{\mathcal{H}_j}^2 + C\sum_{i=1}^{n}\xi_i \quad \text{subject to} \quad y_i\Big(\sum_{j=1}^{m} f_j(\mathbf{x}_i^{(j)}) + b\Big) \ge 1 - \xi_i, \;\; \xi_i \ge 0, \;\; \sum_{j=1}^{m}\gamma_j = 1, \;\; \gamma_j \ge 0.$$

This problem remains convex; however, it is not smooth, which leads to computational issues. Using the same procedure as before, the Lagrangian dual is given below:

$$\min_{\boldsymbol{\gamma}} \max_{\boldsymbol{\alpha}} \; \sum_{i=1}^{n}\alpha_i - \tfrac{1}{2}\sum_{i=1}^{n}\sum_{k=1}^{n}\alpha_i\alpha_k y_i y_k \sum_{j=1}^{m}\gamma_j K_j(\mathbf{x}_i^{(j)}, \mathbf{x}_k^{(j)}) \quad \text{subject to} \quad \sum_{i=1}^{n}\alpha_i y_i = 0, \;\; 0 \le \alpha_i \le C, \;\; \sum_{j=1}^{m}\gamma_j = 1, \;\; \gamma_j \ge 0,$$

where $\gamma_j$ is the weight of the $j$th kernel. MKL offers the flexibility to assign kernels on a per-variable basis, or to act as a data integration tool by assigning different kernels to different data sources [3].
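Before any weights are learned, a fixed convex combination of kernel matrices already yields a valid kernel, as noted above. The following R sketch, using kernlab rather than RMKL, builds one Gram matrix per (simulated) data source and trains an SVM on their weighted sum; the weights 0.7 and 0.3, the choice of kernels, and $C = 1$ are assumptions made purely for illustration.

```r
library(kernlab)

set.seed(2)
## Two "views" of the same 100 samples, e.g. a clinical block and an expression block
x_clin <- matrix(rnorm(100 * 3), ncol = 3)
x_expr <- matrix(rnorm(100 * 50), ncol = 50)
y <- factor(ifelse(x_clin[, 1] + rowMeans(x_expr[, 1:5]) > 0, 1, -1))

## One candidate kernel per data source
K1 <- kernelMatrix(vanilladot(), x_clin)
K2 <- kernelMatrix(rbfdot(sigma = 0.01), x_expr)

## A fixed convex combination of kernel matrices is itself a valid kernel matrix
gamma <- c(0.7, 0.3)
K <- as.kernelMatrix(gamma[1] * K1 + gamma[2] * K2)

## Train an SVM directly on the pre-computed combined kernel
fit <- ksvm(K, y, type = "C-svc", kernel = "matrix", C = 1)
fit
```

MKL algorithms replace the hand-picked weights above with weights learned from the data.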
There have been many algorithms proposed for MKL. One class of MKL algorithms is wrapper methods, which iteratively solve a single-kernel learning problem for a given combination of kernel weights. Wrapper methods alternate between optimizing $f$, $b$, and $\boldsymbol{\alpha}$ with the kernel weights fixed, sometimes referred to as the fixed-weights problem, and optimizing the kernel weights with $f$, $b$, and $\boldsymbol{\alpha}$ fixed. A characteristic of wrapper methods is that they reformulate either the dual or the primal of the MKL problem so that efficient off-the-shelf solvers can be used. A shortcoming of wrapper methods is that the optimization of $f$, $b$, and $\boldsymbol{\alpha}$ is inefficient, and may be unnecessary when the kernel weights are not yet optimal [4].
SimpleMKL uses gradient descent to find the direction of greatest improvement in the kernel weights and then uses a line search to choose the step, with $f_j$, $b$, and $\boldsymbol{\xi}$ fixed. For each candidate $\boldsymbol{\gamma}$, SimpleMKL must solve an SVM problem [3]. The generic wrapper scheme can be summarized as follows (a minimal R sketch of this loop appears at the end of this subsection):

1. Compute $K_j$ and set $\gamma_j = \frac{1}{m}$ for $j = 1, \dots, m$.
2. Combine the $K_j$ into a single kernel, $K = \sum_j \gamma_j K_j$.
3. Solve the dual of the SVM problem with $K$.
4. Update the kernel weights using gradient descent, and repeat steps 2-4 until convergence.

Though wrapper methods are easy to implement, they may have poor convergence or produce solutions that are far from the global optimum. While SimpleMKL updates $\boldsymbol{\gamma}$ using gradient descent, SEMKL computes $\boldsymbol{\gamma}$ directly via the closed-form update

$$\gamma_j = \frac{\|f_j\|_{\mathcal{H}_j}}{\sum_{k=1}^{m}\|f_k\|_{\mathcal{H}_k}}.$$

Smaller $\gamma_j$ corresponds to a smoother $f_j$, and larger $\gamma_j$ corresponds to a noisier $f_j$ [5].

DALMKL optimizes the dual augmented Lagrangian of a proximal formulation of the MKL problem. This formulation imposes its own requirements: the conjugate of the loss function must have no non-differentiable points in the interior of its domain and cannot have a finite gradient at the boundary of its domain. Additional primal variables are added so that $\varphi_\gamma(\cdot, \boldsymbol{\alpha}, b)$ becomes differentiable. The inner function (14) is differentiable, and its gradient and Hessian depend only on the active kernels, which makes the descent steps efficient. Though DALMKL and the wrapper methods both attempt to construct a kernel from a convex combination of kernels, their parameterizations are quite different. Suzuki points this out for the $C$ parameter in particular, recommending values of $C = 0.5$, $0.05$, and $0.005$, whereas there is no clear recommendation for $C$ for wrapper methods. Below is a more formal derivation of DALMKL [6].

DALMKL was formulated to utilize the block 1-norm proposed by Bach [7]. The block 1-norm problem is solved by optimizing the dual augmented Lagrangian. Consider the following general formulation of MKL:

$$\min_{\boldsymbol{\alpha}, b} \; L(\bar{K}\boldsymbol{\alpha} + b\mathbf{1}) + \varphi_C(\boldsymbol{\alpha}), \tag{15}$$

where $\bar{K}\boldsymbol{\alpha} = \sum_{j=1}^{m} K_j \boldsymbol{\alpha}_j$ and $\varphi_C(\boldsymbol{\alpha}) = C\sum_{j=1}^{m}\|\boldsymbol{\alpha}_j\|_{K_j}$. A non-decreasing sequence $\Omega = \{\omega^{(1)}, \omega^{(2)}, \dots\}$ is introduced, which allows (15) to be solved iteratively as

$$(\boldsymbol{\alpha}^{(t+1)}, b^{(t+1)}) = \operatorname*{arg\,min}_{\boldsymbol{\alpha}, b} \; L(\bar{K}\boldsymbol{\alpha} + b\mathbf{1}) + \varphi_C(\boldsymbol{\alpha}) + \frac{1}{2\omega^{(t)}}\left(\|\boldsymbol{\alpha} - \boldsymbol{\alpha}^{(t)}\|^2 + (b - b^{(t)})^2\right),$$

where $\boldsymbol{\alpha}^{(t)}$ and $b^{(t)}$ are the $t$th updated values of $\boldsymbol{\alpha}$ and $b$. This problem is referred to as the proximal MKL problem. The constraint $\mathbf{z} = \bar{K}\boldsymbol{\alpha} + b\mathbf{1}$ is then introduced, and the resulting Lagrangian dual is problem (16). Minimizing (16) with respect to $(\mathbf{z}, \boldsymbol{\alpha}, b)$ produces an inner problem in the dual vector $\boldsymbol{\rho}$ whose objective involves $-L^*(-\boldsymbol{\rho})$, where $L^*$ is the convex conjugate of $L$. We optimize (16) in two cycles: in the inner cycle we optimize the dual vector $\boldsymbol{\rho}$, and in the outer cycle we update $\boldsymbol{\alpha}^{(t)}$ and $b^{(t)}$ by applying the proximal operator associated with $\varphi_C$ to the inner solution. The paper provides a detailed overview of the derivation and outlines many scenarios that use elastic net and block $q$-norm ($q \ge 1$) regularizations, as well as logistic, squared, hinge, and $\epsilon$-insensitive loss functions. These scenarios make DALMKL a promising method for extension to other arenas, such as causal inference or survival analysis.
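To make the generic wrapper loop listed above concrete, here is a minimal R sketch using kernlab rather than the RMKL implementations; the simulated data, the two candidate kernels, $C = 1$, the fixed number of iterations, and the SEMKL-style closed-form weight update are all illustrative assumptions, and the sketch omits the convergence checks a real implementation would need.

```r
library(kernlab)

set.seed(3)
x <- matrix(rnorm(200 * 4), ncol = 4)
y <- factor(ifelse(x[, 1] + x[, 2]^2 > 1, 1, -1))

## Candidate kernels (step 1): linear and radial basis
K <- list(kernelMatrix(vanilladot(), x),
          kernelMatrix(rbfdot(sigma = 0.5), x))
m <- length(K)
gamma <- rep(1 / m, m)                       # uniform initial weights

for (iter in 1:10) {
  ## Step 2: combine into a single kernel matrix
  Kc <- as.kernelMatrix(Reduce(`+`, Map(`*`, gamma, K)))
  ## Step 3: solve the fixed-weights SVM dual with the combined kernel
  fit <- ksvm(Kc, y, type = "C-svc", kernel = "matrix", C = 1)
  ay  <- unlist(coef(fit))                   # alpha_i * y_i for the support vectors
  sv  <- unlist(alphaindex(fit))             # indices of the support vectors
  ## Step 4: SEMKL-style closed-form update, gamma_j proportional to ||f_j||
  norms <- sapply(seq_len(m), function(j)
    gamma[j] * sqrt(drop(t(ay) %*% K[[j]][sv, sv] %*% ay)))
  gamma <- norms / sum(norms)
}
round(gamma, 3)                              # learned kernel weights
```

Here $\|f_j\|_{\mathcal{H}_j}$ is computed as $\gamma_j \sqrt{\sum_{i,k} \alpha_i \alpha_k y_i y_k K_j(\mathbf{x}_i, \mathbf{x}_k)}$ over the support vectors, matching the SimpleMKL parameterization of $f_j$.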

Summary of Hoadley Data
The cancer types listed in Supplemental Table 1 were selected because they had more than 300 samples before removal of censored patients. Prostate Adenocarcinoma (PRAD) and Thyroid Carcinoma (THCA) were not considered because they had only 10 and 16 deaths, respectively, before the end of follow-up. The survival outcome was dichotomized at a threshold chosen so that the survival rate fell between 0.4 and 0.6. Supplemental Table 2
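As a hedged illustration of this type of dichotomization (the exact thresholds used for the Hoadley data appear in the supplemental tables, not here), the following R sketch uses the survival package to choose a cutoff time at which the Kaplan-Meier survival estimate lies in [0.4, 0.6] and labels patients accordingly; the lung dataset and the handling of censoring are assumptions made for illustration only.

```r
library(survival)

## Example data: 'lung' ships with the survival package (status: 1 = censored, 2 = dead)
d <- lung
fit <- survfit(Surv(time, status) ~ 1, data = d)

## Candidate cutoff: earliest time at which the KM survival estimate is within [0.4, 0.6]
cutoff <- min(fit$time[fit$surv >= 0.4 & fit$surv <= 0.6])

## Dichotomize: 1 = survived past the cutoff, 0 = died on or before it;
## patients censored before the cutoff carry no label and are dropped in this sketch
d$event <- d$status == 2
d <- d[!(d$time <= cutoff & !d$event), ]
d$long_survivor <- as.integer(d$time > cutoff)
table(d$long_survivor)
```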