Volume 11 Supplement 8
Proceedings of the Neural Information Processing Systems (NIPS) Workshop on Machine Learning in Computational Biology (MLCB)
Inferring latent task structure for Multitask Learning by Multiple Kernel Learning
 Christian Widmer^{1}, Nora C Toussaint^{2}, Yasemin Altun^{3} and Gunnar Rätsch^{1}
DOI: 10.1186/1471-2105-11-S8-S5
© Widmer et al; licensee BioMed Central Ltd. 2010
Published: 26 October 2010
Abstract
Background
The lack of sufficient training data is the limiting factor for many Machine Learning applications in Computational Biology. If data is available for several different but related problem domains, Multitask Learning algorithms can be used to learn a model based on all available information. In Bioinformatics, many problems can be cast into the Multitask Learning scenario by incorporating data from several organisms. However, combining information from several tasks requires careful consideration of the degree of similarity between tasks. Our proposed method simultaneously learns or refines the similarity between tasks along with the Multitask Learning classifier. This is done by formulating the Multitask Learning problem as Multiple Kernel Learning, using the recently published q-norm MKL algorithm.
Results
We demonstrate the performance of our method on two problems from Computational Biology. First, we show that our method is able to improve performance on a splice site dataset with a given hierarchical task structure by refining the task relationships. Second, we consider an MHC-I dataset, for which we assume no knowledge about the degree of task relatedness. Here, we are able to learn the task similarities ab initio along with the Multitask classifiers. In both cases, we outperform the baseline methods that we compare against.
Conclusions
We present a novel approach to Multitask Learning that is capable of learning task similarity along with the classifiers. The framework is very general, as it allows incorporating prior knowledge about task relationships when available, but is also able to identify task similarities in the absence of such prior information. Both variants show promising results in applications from Computational Biology.
Background
In Machine Learning, model quality is most often limited by the lack of sufficient training data. In the presence of data from different but related tasks, it is possible to boost the performance on each task by leveraging all available information. Multitask Learning (MTL), a subfield of Machine Learning, considers the problem of inferring a model for each task simultaneously while imposing a regularity criterion or shared representation in order to allow learning across tasks. There has been an active line of research exploring various methods (e.g., [1, 2]), providing empirical findings [3] and theoretical foundations [4, 5]. Most of these methods assume uniform relations across tasks. However, MTL methods stand to benefit further from taking into account the degree of relatedness among tasks. Recently, this direction has been investigated in the context of hierarchies [6, 7] and clusters [8] of tasks, where the relations across tasks as well as the models for each task are inferred simultaneously.
In this paper, we follow this line of research and investigate Multitask Learning scenarios in which there exists a latent structural relation across tasks. In particular, we model the relatedness between tasks by defining meta-tasks. Each meta-task corresponds to a subset of all tasks, representing the common properties of the tasks within this subset. The model of each task can then be derived as a convex combination of the meta-tasks it belongs to. Moreover, the latent structure over tasks can be represented by a collection of meta-tasks. Information is transferred between two tasks t, t′ according to their relatedness in the latent structure (the number of meta-tasks in which t and t′ co-occur, and the importance of each of these meta-tasks as defined by the mixture weights).
Clearly, such an approach is highly sensitive to the chosen structure, and in the absence of prior knowledge, learning the latent structure is a crucial component of MTL with non-uniform relatedness. Starting from a special case, in which there exists a single meta-task consisting of all tasks (standard MTL), we show that inferring the latent structure can be cast as a Multiple Kernel Learning problem, where the base kernels are defined with respect to Dirac kernels [9] that establish the relatedness of all possible task combinations and hence correspond to all possible meta-tasks.
Kernel methods are used in a wide range of problems, as the kernel abstracts the input space away from the Machine Learning algorithm. One can use several kernels to incorporate different aspects of the same instance (e.g., genomic sequence data and data from blood measurements for one patient) and combine them in the same optimization problem. Multiple Kernel Learning can be used to determine the combination of kernels that is best for the problem at hand. This is done by learning an optimal weighting of the individual kernels along with training a predictor.
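As a concrete illustration of this idea (a minimal sketch, not the authors' implementation; `combine_kernels` and the toy matrices are hypothetical), a weighted linear combination of precomputed kernel matrices can be formed as follows:

```python
import numpy as np

def combine_kernels(kernel_matrices, betas):
    """Weighted linear combination of precomputed kernel matrices.

    A non-negative combination of valid kernels is again a valid
    (positive semi-definite) kernel, which is what MKL exploits.
    """
    betas = np.asarray(betas, dtype=float)
    assert np.all(betas >= 0), "mixture weights must be non-negative"
    return sum(b * K for b, K in zip(betas, kernel_matrices))

# Two toy base kernels on three instances, e.g. one per data source.
K1 = np.array([[1.0, 0.2, 0.1],
               [0.2, 1.0, 0.3],
               [0.1, 0.3, 1.0]])
K2 = np.eye(3)
K = combine_kernels([K1, K2], [0.7, 0.3])
```

MKL then amounts to optimizing the weights passed to such a combination jointly with the SVM, rather than fixing them by hand.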
Our contribution is the combination of MTL and MKL to address a central question in Multitask Learning: how to identify the relationships between tasks and translate them into meaningful parameters in the formulation of the learning algorithm used. We show that MKL can be used 1) to refine a given hierarchical structure relating the tasks at hand and 2) to identify subsets of tasks for which information transfer pays off in the absence of prior information on task relations.
Besides applications in Natural Language Processing [10] and medical domains, Multitask Learning is particularly relevant to Computational Biology. In this setting, tasks often correspond to organisms, giving rise to a whole range of problems. The fact that data describing the same biological mechanism is often available for several organisms is a recurring theme, making the Multitask Learning approach particularly well suited for many applications in the field. There has been previous work using Domain Adaptation (closely related to Multitask Learning) in the context of splice site prediction [3]. Furthermore, it was shown [9] that Multitask Learning can be used to advance the state of the art in peptide MHC-I binding prediction, a problem relevant for vaccine design. Given the success of MTL in Computational Biology and the highly structured relations across organisms (tasks), we apply our method to two important Computational Biology problems, namely MHC-I binding prediction and splice site prediction. The competitiveness of our results demonstrates the validity of our approach.
Preliminaries
In a single-task supervised learning scenario, a sample of example-label pairs D = {(x_{i}, y_{i})}_{i=1,…,n} is given, where the x_{i} live in an input space X and y_{i} ∊ {−1, 1} (for binary classification). The goal is to learn a function f such that f(x_{i}) ≈ y_{i} that generalizes well to unseen data.
Before we describe our formulation of MTL as an MKL problem, we briefly review the formulations of MTL and MKL that lay the foundation for our approach.
Multitask Learning
In MTL [1], we are given one labeled sample D_{t} for each of T tasks. Similar to the single-task supervised learning scenario, we are now interested in obtaining T hypotheses f_{t}, one for each task.
We will formulate our method based on the Support Vector Machine (SVM), which has proven to generalize well [11], scales to large amounts of training data [12, 13] and is able to incorporate arbitrary data sources by means of kernels (e.g., [14]). The generalization to other learning approaches appears straightforward as we mainly consider the extension of kernels to reflect task similarity, although details regarding the learning of their linear combination may differ.
Therefore, we start out with a regularization-based Multitask Learning method that was similarly proposed in the context of SVMs [2, 10, 15]. The basic idea is that models w_{t} are learned simultaneously for all tasks. Information transfer between tasks is achieved by sharing a general component w_{0} and enforcing similarity of each w_{t} to w_{0} via regularization in the joint optimization problem. We use the following formulation, leaving out some constants for readability
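The display equation did not survive extraction. In the regularization-based form of [2, 15] that the surrounding text describes, it plausibly reads (with constants omitted, as noted; the superscript t indexing the examples of task t is introduced here for clarity):

```latex
\min_{w_0, w_1, \dots, w_T}\;
  \frac{1}{2}\,\lVert w_0 \rVert^2
  \;+\; \frac{1}{2}\sum_{t=1}^{T} \lVert w_t - w_0 \rVert^2
  \;+\; C \sum_{t=1}^{T} \sum_{i=1}^{n_t}
      \ell\bigl(\langle w_t, x_i^t \rangle,\; y_i^t\bigr)
```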
where l is the hinge loss, l(z, y) = max{1 − yz, 0}.
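The multi-task kernel to which the following weight constraints refer was also lost in extraction. In the Dirac-kernel construction the text describes, it plausibly takes the form below, where K_B denotes the base kernel on inputs and δ_{t,t′} the Dirac kernel on task identity (both symbols are introduced here):

```latex
\hat{K}\bigl((x,t),(x',t')\bigr)
  \;=\; \beta_1\, K_B(x,x') \;+\; \beta_2\, \delta_{t,t'}\, K_B(x,x')
```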
where β_{1}, β_{2} ≥ 0 and β_{1} + β_{2} = 1.
Clearly, such a kernel is a convex combination of base kernels and is thus itself a valid kernel. MKL is a technique to learn the individual weights of a weighted linear combination of kernels. Thus, it seems natural to utilize MKL to learn an optimal weighting for this combination.
Multiple Kernel Learning
Lanckriet et al. considered conic combinations of kernel matrices for classification [16], leading to a convex quadratically constrained quadratic program. Later, it was shown that the problem can be formulated as a semi-infinite linear program, allowing standard SVM solvers (e.g., SVMlight [17], LibSVM [18]) to be used for the recurring subproblems [13]. Only recently, methods were proposed to generalize MKL to arbitrary l_{q} norms [19].
Learning with multiple kernels gives rise to M different feature mappings ϕ_{m} : X → H_{m}, m = 1,…,M, each leading to a kernel K_{m} for a Hilbert space H_{m}. In MKL, we consider linear mixtures of kernels K = Σ_{m=1}^{M} β_{m} K_{m}, where β_{m} ≥ 0. To avoid non-convexity of the resulting optimization problem, the original parameter vector is substituted (e.g., w_{m} ↦ √β_{m} w_{m}). For an in-depth discussion of this, please consider [19].
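The display optimization problem (Equation 2, referenced later in the text) was lost in extraction. In the standard ℓ_q-norm MKL form of [19], it plausibly reads:

```latex
\min_{\beta \ge 0,\ \lVert \beta \rVert_q \le 1}\;
\min_{\{w_m\}}\;
  \frac{1}{2}\sum_{m=1}^{M} \frac{\lVert w_m \rVert^2}{\beta_m}
  \;+\; C \sum_{i=1}^{n}
      \ell\Bigl(\sum_{m=1}^{M} \langle w_m, \phi_m(x_i) \rangle,\; y_i\Bigr)
  \tag{2}
```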
where l is the hinge loss, l(z, y) = max{1 − yz, 0}, and q denotes the norm used to penalize the weights β. To solve the above optimization problem, we follow the ideas presented in [13, 19] and iteratively alternate between solving a convex optimization problem involving only the β's and solving for w alone. This method is known to converge quickly even for a relatively large number of kernels [13].
Multitask Multiple Kernel Learning
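The defining display equations of this section did not survive extraction. Consistent with the meta-task construction described in the Background, the combined multi-task kernel plausibly takes the form below, where K_B is the base kernel over inputs, I is the collection of candidate task subsets S (meta-tasks), and the indicator functions play the role of Dirac kernels on the task identity (this reconstruction is an assumption):

```latex
\hat{K}\bigl((x,t),(x',t')\bigr)
  \;=\; \sum_{S \in I} \beta_S \,
        \mathbb{1}[t \in S]\;\mathbb{1}[t' \in S]\;
        K_B(x, x')
```

The MKL problem of Equation 2 is then solved over the pooled examples of all tasks.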
where N is the total number of training examples of all tasks combined.
What remains to be discussed is how to define a collection I of candidate subsets S_{i} (i.e., meta-tasks), which are subsequently weighted by MKL. We consider two scenarios: one in which we assume access to a hierarchical structure relating the tasks at hand, and one in which we assume no prior knowledge about task relations. Generally, however, any prior domain knowledge indicating how tasks are related can be used to design an appropriate I.
Powerset MT-MKL
With no prior information given, a natural choice is to take into account all possible subsets of tasks. Given a set of tasks T, this corresponds to considering the power set P(T) of T (excluding the empty set): I_{P} = {S | S ∈ P(T) ∧ S ≠ Ø}.
Clearly, this gives an exponential number (i.e., 2^{|T|} − 1) of task sets S_{i}, of which only a few will be relevant. To identify the relevant task sets, we propose to use an L1-regularized MKL approach (i.e., q = 1 in Equation 2) to obtain a sparse solution. Most subset weights will be set to zero, leaving only a few relevant subsets with weights greater than zero. We expect that the examples in these subsets come from similar distributions and that it is therefore beneficial to consider interactions between them when obtaining a multitask predictor.
While L1regularization of MKL results in a sparse combination of kernels, it does not address the computational complexity of the optimization problem over this exponential search space. With the current implementation, the method is limited to approximately 10 tasks depending on the number of training examples and available resources. However, there are techniques to handle the case where the number of tasks may become prohibitive, for instance, as proposed in [20]. The idea is to iteratively generate new kernels based on the current solution (β, w). These methods are known to converge to the optimal solution, if one can identify appropriate kernels in a larger set. In the current case, this could be done by solving an integer linear program.
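The power-set construction can be sketched as follows (a minimal illustration with hypothetical names; the actual method learns the subset weights β_S by L1-norm MKL rather than fixing them):

```python
import numpy as np
from itertools import chain, combinations

def nonempty_subsets(tasks):
    """All non-empty subsets of the task set: the candidate meta-tasks."""
    return list(chain.from_iterable(
        combinations(tasks, r) for r in range(1, len(tasks) + 1)))

def metatask_kernel(K_base, task_ids, subset):
    """Base kernel masked so that only example pairs whose tasks both
    lie in `subset` interact; pairs outside the meta-task get zero."""
    member = np.isin(task_ids, list(subset)).astype(float)
    return K_base * np.outer(member, member)

tasks = ["t1", "t2", "t3"]
subsets = nonempty_subsets(tasks)           # 2^3 - 1 = 7 meta-tasks
task_ids = np.array(["t1", "t1", "t2", "t3"])  # task of each example
K_base = np.ones((4, 4))                    # toy base kernel
K_12 = metatask_kernel(K_base, task_ids, ("t1", "t2"))
```

Each such masked matrix would enter the MKL problem as one base kernel, and the sparse L1 solution selects the few meta-tasks worth sharing information within.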
Hierarchical MT-MKL
As an example, consider the kernel defined by a hierarchical decomposition according to Figure 1. Clearly, the number of weights β_{i} corresponds to the number of nodes. For a perfect binary tree this leads to 2m − 1 nodes, where m is the number of leaves/tasks. We expect learning the contributions of the individual levels of the taxonomy to pay off in cases where the edge lengths of G are unequal.
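Such a hierarchical decomposition can be sketched as follows (a toy taxonomy with hypothetical names; one masked base kernel per node, whose weights β_{i} would then be learned by MKL). Note that the toy tree has 3 leaves and 2·3 − 1 = 5 nodes, matching the count above:

```python
import numpy as np

# Toy taxonomy as parent pointers; leaves are tasks (hypothetical names).
parent = {"human": "primates", "chimp": "primates",
          "primates": "root", "mouse": "root"}

def ancestors(task):
    """Nodes on the path from a leaf task up to the root, inclusive."""
    path = [task]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return set(path)

def node_kernel(K_base, task_ids, node):
    """Base kernel restricted to example pairs whose tasks both lie
    below `node` in the taxonomy; one such kernel per node."""
    below = np.array([node in ancestors(t) for t in task_ids], float)
    return K_base * np.outer(below, below)

task_ids = ["human", "chimp", "mouse"]   # task of each example
K_base = np.ones((3, 3))                 # toy base kernel
K_primates = node_kernel(K_base, task_ids, "primates")
K_root = node_kernel(K_base, task_ids, "root")
```

The root kernel couples all examples (standard MTL), while deeper nodes couple only closely related tasks; MKL effectively re-estimates how much each level contributes.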
Relation to task similarity
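The definition of the induced task similarity γ_{k,l} was lost in extraction. Consistent with the meta-task construction, it plausibly aggregates the learned weights of all meta-tasks containing both tasks k and l (this reconstruction is an assumption):

```latex
\gamma_{k,l} \;=\; \sum_{S \in I \,:\, k \in S,\ l \in S} \beta_S
```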
This similarity measure can be used for downstream analyses, as it provides insight into the task relationships. A high γ_{k,l} suggests a considerable resemblance between tasks k and l and could help to generate domain knowledge (e.g., evidence that two cell receptors bind to a similar class of proteins, or that the molecular mechanisms of the splicing machinery are particularly similar).
Results and discussion
We performed experiments in two settings. In the first setting, we considered a set of MHC-I (major histocompatibility complex class I) proteins, for which we assume we are not given any prior information relating them. In the second setting, we used splice site data from 15 organisms and assumed that the task relationship is given by a hierarchical structure according to their evolutionary history. The examples are string data over the alphabet {A,T,G,C} (DNA) in the splicing case and over the alphabet of 20 amino acids in the MHC-I case. To incorporate string features, we used the Weighted Degree String Kernel [21], which, along with other string kernels such as the Spectrum Kernel [22], has been successfully employed in problems from Computational Biology.

We compare against the following baseline methods:

Union: One global model is obtained on the union of examples from all tasks.

Plain: For each task, a model is trained independently, not taking into account any out-of-domain information.

Vanilla MTL: Our algorithm consists of two components: the MTL formulation and the adjustment of the weights β_{i} with MKL. In the vanilla approach, we fix all weights at β_{i} = 1.
Experiments were performed using cross-validation for model selection on the training splits. We only tuned one hyperparameter, C, for which we considered values between 0.01 and 1000 on a logarithmic scale in 8 steps. After having obtained an optimal regularization parameter, a classifier is retrained on all training splits and final performance is obtained on a dedicated test set that was not involved in hyperparameter selection.
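Under one reading of the grid described above (8 logarithmically spaced values from 0.01 to 1000; the exact spacing is an assumption), the candidate C values can be generated as:

```python
import numpy as np

# 8 logarithmically spaced candidate values for the SVM
# regularization parameter C, from 1e-2 to 1e3.
C_grid = np.logspace(-2, 3, num=8)
```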
MHC-I binding prediction using Powerset MT-MKL
In this experiment, the task is to predict whether a peptide binds to a given MHC molecule (binder) or not (non-binder). It has previously been shown that sharing information between related molecules (alleles), and thus casting the problem as a Multitask Learning scenario, can be beneficial [9]. In the MHC setting, different tasks correspond to different MHC proteins. The data consists of peptide sequences of length l = 9 for 7 tasks. In total, we are given 7367 examples (A_2403=254, A_6901=833, A_0201=3089, A_0202=1447, A_0203=1443, A_2402=197, A_2301=104). For cross-validation, the data was split randomly into 5 splits of equal size. Unlike the splice site prediction setting, we do not have a hierarchical structure relating the tasks at hand. To demonstrate that meaningful groups of tasks can be identified by Powerset MT-MKL, we do not assume any prior knowledge of task relationships. Note, however, that we do have access to the sequences of the MHC-I proteins; we use these sequences to evaluate the task similarities obtained by our approach.
Results for the MHC experiment in auPRC for the model selection step and the final prediction on the test set. Reported is the average performance over all tasks.
auPRC  |  Plain  |  Union  |  Vanilla MTL  |  Powerset MT-MKL
cross-validation  |  0.668  |  0.637  |  0.676  |  0.692
test set  |  0.671  |  0.576  |  0.679  |  0.699
From Figure 2, we observe that the MKL-based approach outperforms the baseline methods. Furthermore, simply combining the data from different tasks to obtain a single model (Union) does not outperform the naïve method of training an individual classifier for each task (Plain). This hints at rather large differences between the tasks. Learning the weights with MKL improves performance over the Vanilla MTL approach, which already outperforms the other two baselines.
List of task sets and their respective weights β_{i} that were assigned by 1-norm MKL.
Task set  |  Weight
A_0201, A_0202, A_0203, A_6901  |  0.186
A_0201, A_0202, A_0203, A_2301  |  0.178
A_0202, A_0203, A_2301, A_2402, A_2403  |  0.110
A_0201, A_0203, A_2301, A_2402, A_2403  |  0.091
A_0201, A_0202, A_2301, A_2402, A_6901  |  0.074
A_0201, A_0202, A_2301, A_2402, A_2403  |  0.066
Using MKL, we successfully identify groups of tasks among which information sharing is sensible, thus allowing for a smart combination of information from different tasks in the absence of prior knowledge.
The improvement in performance over the Vanilla MTL method is relatively small (a property most likely inherited from MKL). However, we are compensated for this by simultaneously obtaining a sensible task structure.
Splice site prediction using Hierarchical MT-MKL
We report the area under the precision-recall curve (auPRC), which is well suited for unbalanced data sets. For the Vanilla MTL method, we use the given hierarchy G to define the initial task sets, but do not further optimize their individual influence.
Results for the splice site experiment in auPRC for the model selection step and the final prediction on the test set. Reported is the average performance over all tasks. This table shows only the performance of the best-performing variant of Hierarchical MT-MKL, with norm q = 2.
auPRC  |  Plain  |  Union  |  Vanilla MTL  |  Hierarchical MT-MKL
cross-validation  |  0.043  |  0.092  |  0.087  |  0.100
test set  |  0.059  |  0.153  |  0.169  |  0.190
The second observation is that we obtain different results for different q-norms. In particular, we see degraded performance for q = 1, which complies with our expectation that the weights for this approach (assuming the hierarchy is correct) should not be sparse. Among the q-norms that we considered, q = 2 performs best. Lastly, we are able to outperform the Vanilla MTL method (all β_{i} = 1) by refining the task relations given by the structure G with MKL. Intuitively, using Hierarchical MT-MKL corresponds to estimating the edge lengths of G, whereas the Vanilla method is restricted to directly using the similarities encoded in the taxonomy.
Conclusions
We have presented a principled way of formulating Multitask Learning as a Multiple Kernel Learning approach. Following the basic idea of a task-set-wise decomposition of the kernel matrix, we presented a hierarchical decomposition and a power-set-based approach.
These two methods allow us to elegantly identify or refine the structure relating the tasks at hand within one global optimization problem. We expect our methods to work particularly well in cases where edge weights differ within the hierarchical structure or where the task structure is unknown.
Our experiments illustrate that the MT-MKL approach on the power set of all tasks works well for the MHC-I binding problem: first, it increases the accuracy of the predictors compared to the baseline methods; second, the inferred task similarity reflects the prior knowledge that is available for this problem. Also for the splice site prediction problem, where the task hierarchy is given by the organisms' phylogeny, our approach achieves an improvement over standard approaches. Using MKL on top of regular Multitask Learning methods may uncover latent task structure and thereby provide insight into the problem domain, which may be relevant to downstream analyses. In conclusion, this work constitutes a valuable proof-of-concept outlining a principled way of using MKL to improve Multitask Learning.
List of abbreviations used
MKL: Multiple Kernel Learning
MTL: Multitask Learning
MHC: Major Histocompatibility Complex
SVM: Support Vector Machine
auPRC: area under the Precision-Recall Curve
Declarations
Acknowledgements
We would like to acknowledge Sören Sonnenburg for help with the implementation of our methods and Magdalena Feldhahn for providing the hierarchical clustering of MHC-I sequences.
This article has been published as part of BMC Bioinformatics Volume 11 Supplement 8, 2010: Proceedings of the Neural Information Processing Systems (NIPS) Workshop on Machine Learning in Computational Biology (MLCB). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/11?issue=S8.
Authors’ Affiliations
References
1. Caruana R: Multitask Learning. Machine Learning 1997, 28:41–75. doi:10.1023/A:1007379606734
2. Evgeniou T, Micchelli C, Pontil M: Learning Multiple Tasks with Kernel Methods. Journal of Machine Learning Research 2005, 6:615–637.
3. Schweikert G, Widmer C, Schölkopf B, Rätsch G: An Empirical Analysis of Domain Adaptation Algorithms. In Advances in Neural Information Processing Systems (NIPS), Volume 22. Vancouver, B.C.; 2008.
4. Ben-David S, Schuller R: Exploiting Task Relatedness for Multiple Task Learning. In COLT: Proceedings of the Workshop on Computational Learning Theory. Morgan Kaufmann Publishers; 2003.
5. Blitzer J, Crammer K, Kulesza A, Pereira F, Wortman J: Learning bounds for domain adaptation. Advances in Neural Information Processing Systems 2008, 20:129–136.
6. Daumé H III: Bayesian Multitask Learning with Latent Hierarchies. In Conference on Uncertainty in Artificial Intelligence 2009.
7. Xue Y, Liao X, Carin L, Krishnapuram B: Multi-task learning for classification with Dirichlet process priors. Journal of Machine Learning Research 2007, 8.
8. Jacob L, Bach F, Vert JP: Clustered Multi-Task Learning: A Convex Formulation. In NIPS. MIT Press; 2009:745–752.
9. Jacob L, Vert JP: Efficient peptide-MHC-I binding prediction for alleles with few known binders. Bioinformatics 2008, 24(3):358. doi:10.1093/bioinformatics/btm611
10. Daumé H: Frustratingly Easy Domain Adaptation. In ACL. The Association for Computational Linguistics; 2007.
11. Vapnik V: The Nature of Statistical Learning Theory. Springer Verlag; 2000.
12. Bottou L, Chapelle O, Decoste D, Weston J: Large-Scale Kernel Machines. MIT Press; 2007.
13. Sonnenburg S, Rätsch G, Schäfer C, Schölkopf B: Large scale multiple kernel learning. Journal of Machine Learning Research 2006, 7:1565.
14. Schölkopf B, Smola A: Learning with Kernels. Cambridge, MA: The MIT Press; 2002.
15. Evgeniou T, Pontil M: Regularized multi-task learning. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Seattle, Washington, USA, August 22–25, 2004. Edited by: Kim W, Kohavi R, Gehrke J, DuMouchel W. ACM; 2004:109–117.
16. Bach F, Lanckriet G, Jordan M: Multiple kernel learning, conic duality, and the SMO algorithm. In Proceedings of the Twenty-First International Conference on Machine Learning. ACM, New York, NY, USA; 2004.
17. Joachims T: Making Large-Scale SVM Learning Practical. In Advances in Kernel Methods - Support Vector Learning. Edited by: Schölkopf B, Burges C, Smola A. MIT Press; 1999.
18. Chang C, Lin C: LIBSVM: a library for support vector machines. 2001.
19. Kloft M, Brefeld U, Sonnenburg S, Laskov P, Müller KR, Zien A: Efficient and Accurate Lp-Norm Multiple Kernel Learning. In Advances in Neural Information Processing Systems, Volume 22. Edited by: Bengio Y, Schuurmans D, Lafferty J, Williams CKI, Culotta A. MIT Press; 2009:997–1005.
20. Gehler P, Nowozin S: Infinite Kernel Learning. In NIPS Workshop on Kernel Learning: Automatic Selection of Optimal Kernels 2008.
21. Rätsch G, Sonnenburg S: Accurate Splice Site Detection for Caenorhabditis elegans. MIT Press; 2004.
22. Leslie C, Eskin E, Noble WS: The Spectrum Kernel: A String Kernel for SVM Protein Classification. In Proceedings of the Pacific Symposium on Biocomputing 2002, 564–575.
23. Robinson J, Waller MJ, Parham P, Bodmer JG, Marsh SG: IMGT/HLA Database: a sequence database for the human major histocompatibility complex. Nucleic Acids Res 2001, 29:210–213. doi:10.1093/nar/29.1.210
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.