 Research
 Open Access
 Published:
Inferring latent task structure for Multitask Learning by Multiple Kernel Learning
BMC Bioinformatics volume 11, Article number: S5 (2010)
Abstract
Background
The lack of sufficient training data is the limiting factor for many Machine Learning applications in Computational Biology. If data is available for several different but related problem domains, Multitask Learning algorithms can be used to learn a model based on all available information. In Bioinformatics, many problems can be cast into the Multitask Learning scenario by incorporating data from several organisms. However, combining information from several tasks requires careful consideration of the degree of similarity between tasks. Our proposed method simultaneously learns or refines the similarity between tasks along with the Multitask Learning classifier. This is done by formulating the Multitask Learning problem as Multiple Kernel Learning, using the recently published qNorm MKL algorithm.
Results
We demonstrate the performance of our method on two problems from Computational Biology. First, we show that our method is able to improve performance on a splice site dataset with given hierarchical task structure by refining the task relationships. Second, we consider an MHCI dataset, for which we assume no knowledge about the degree of task relatedness. Here, we are able to learn the task similarities ab initio along with the Multitask classifiers. In both cases, we outperform baseline methods that we compare against.
Conclusions
We present a novel approach to Multitask Learning that is capable of learning task similarity along with the classifiers. The framework is very general as it allows to incorporate prior knowledge about tasks relationships if available, but is also able to identify task similarities in absence of such prior information. Both variants show promising results in applications from Computational Biology.
Background
In Machine Learning, model quality is most often limited by the lack of sufficient training data. In presence of data from different but related tasks, it is possible to boost the performance of each task by leveraging all available information. Multitask learning (MTL), a subfield of Machine Learning, considers the problem of inferring models for each task simultaneously while imposing some regularity criteria or shared representation in order to allow learning across tasks. There has been an active line of research exploring various methods (e.g. [1, 2]), providing empirical findings [3] and theoretical foundations [4, 5]. Most of these methods assume uniform relations across tasks. However, it is conceivable to leverage MTL methods by taking into account the degree of relatedness among tasks. Recently, this direction has been investigated in the context of hierarchies [6, 7] and clusters [8] of tasks, where the relation across tasks as well as the models for each task are inferred simultaneously.
In this paper, we follow this line of research and investigate Multitask Learning scenarios where there exists a latent structural relation across tasks. In particular, we model the relatedness between tasks by defining metatasks. Here, each metatask corresponds to a subset of all tasks, representing the common properties of the tasks within this subset. Then, the model of each task can be derived by a convex combination of the metatasks it belongs to. Moreover, the latent structure over tasks can be represented by a collection of the metatasks. Information is transferred between two tasks t, t′ with respect to their relatedness according to the latent structure (number of metatasks in which t, t′ cooccur and the importance of each of these metatasks defined by the mixture weights).
Clearly, such an approach is highly sensitive to the chosen structure and in the absence of prior knowledge, learning the latent structure is a crucial component of MTL with nonuniform relatedness. Starting from a special case, where there exists a single metatask consisting of all tasks (standard MTL), we show that inferring the latent structure can be cast as a Multiple Kernel Learning problem, where the base kernels are defined with respected to Dirac kernels [9] that establish relatedness of all possible task combinations and hence correspond to all possible metatasks.
Kernel methods are used in a widerange of problems, as the kernel abstracts the input space from the Machine Learning algorithm. One can use several kernels to incorporate different aspects of the same instance (e.g. genomic sequence data and data from blood measurements for one patient) and combine them into the same optimization problem. Multiple Kernel Learning can be used to determine the combination of kernels that is best for the problem at hand. This is done by learning an optimal weighting of the individual kernels along with training a predictor.
Our contribution is the combination of MTL and MKL to address the central question in Multitask Learning, of how to identify the relationships between tasks and to translate them into meaningful parameters in the formulation of the used learning algorithm. We show that MKL can be used to 1) refine a given hierarchical structure that relates the tasks at hand and 2) to identify subsets of tasks for which information transfer pays off in absence of prior information on task relations.
Besides applications in Natural Language Processing [10] and Medical Domains, Multitask Learning is particularly relevant to Computational Biology. In this setting, tasks often correspond to organisms, giving rise to a whole range of problems. The fact that the availability of data describing the same biological mechanism in several organisms is a reoccurring theme makes the Multitask Learning approach particularly well suited for many applications in the field. There has been previous work using Domain Adaptation (closely related to Multitask Learning) in the context of splice site prediction [3]. Furthermore, it was shown [9] that Multitask Learning can be used to leverage the stateoftheart in peptide MHCI binding prediction, which is a problem relevant for vaccine design. Given the success of MTL in Computational Biology and highly structured relation across organisms (tasks), we apply our method to two important Computational Biology problems, namely MHCI binding prediction and splice site prediction. The competitiveness of our results shows the validity of our approach.
Preliminaries
In a singletask supervised learning scenario, a sample of examplelabel pairs D = {(x_{ i },y_{ i } )}_{i=}_{1,…,}_{n}is given, where the x_{i} live in an input space X and y_{ i } ∊ {−1,1} (for binary classification). The goal is to learn a function f such that f(x_{ i } ) ≈ y_{ i } that generalizes well to unseen data.
Before we describe our formulation of MTL as MKL approach, we briefly review the formulations of MTL and MKL that lay the foundations for our approach.
Multitask Learning
In MTL [1], we are given one labeled sample D_{ t } for each of T tasks. Similar to the singletask supervised learning scenario, we are now interested in obtaining T hypotheses f_{ t } , one for each task.
We will formulate our method based on the Support Vector Machine (SVM), which has proven to generalize well [11], scales to large amounts of training data [12, 13] and is able to incorporate arbitrary data sources by means of kernels (e.g., [14]). The generalization to other learning approaches appears straightforward as we mainly consider the extension of kernels to reflect task similarity, although details regarding the learning of their linear combination may differ.
Therefore, we start out with a regularizationbased Multitask Learning method that was similarly proposed in the context of SVMs [2, 10, 15]. The basic idea is that models w _{t} are learned simultaneously for all tasks. Information transfer between tasks is achieved by sharing a general component and enforcing similarity of each w_{t} to w_{0} in the joint optimization problem via regularization. We use the following formulation, leaving out some constants for readability
where l is the hinge loss, l(z, y) = max{1 − yz, 0}.
It was shown in [15], that the dual formulation of the above corresponds to the standard SVM using a modified kernel function:
where K_{ B } denotes the base kernel that captures the interactions between examples from all tasks and
(1)
Here, t(i) denotes the task of example x_{ i } . In the above formulation, is composed of the general kernel K_{ B } and the kernel δ_{ t }_{(}_{i}_{),}_{t}_{(}_{j}_{)}K_{ B } (x_{ i } , x_{ j } ) that captures only intradomain interactions. In [9], the latter kernel is referred to as Dirac kernel. A slightly more general formulation of is the following, which allows to adjust the tradeoff between the general kernel and the taskspecific kernel:
where β_{1}, β_{2} ≥ 0 and β_{1} + β_{2} = 1.
Clearly, is a convex combination of base kernels and thus a valid kernel. MKL is a technique to learn the individual weights of a weighted linear combination of kernels. Thus, it seems natural to utilize MKL to learn an optimal weighting for .
Multiple Kernel Learning
Lanckriet et al. considered conic combinations of kernel matrices for classification [16], leading to a convex quadratically constrained quadratic program. Later on, it was shown that the problem can be formulated as a semiinfinite linear program, allowing to use standard SVM solvers (e.g. SVMLight [17], LibSVM [18]) for solving the reoccurring subproblems [13]. Only recently, methods were proposed to generalize MKL to an arbitrary l_{ q } norm [19].
Learning with multiple kernels gives rise to M different feature mappings ϕ_{ m } : X → H_{ m } , m = 1,…, M, each leading to a kernel K_{ m } for a Hilbert space H_{ m } . In MKL, we consider linear mixtures of kernels , where β_{ i } ≥ 0. To avoid nonconvexity, the original parameter vector is substituted . For an in depth discussion of this, please consider [19].
We use the following formulation in the primal:
(2)
where l is the hinge loss, l(z, y) = max{1 − yz, 0} and q denotes the norm used to penalize the weights β. To solve the above optimization problem, we follow ideas presented in [13],[19] to iteratively solve a convex optimization problem involving only the β’s and then to solve for w only. This method is known to converge fast even for a relatively large number of kernels [13].
Multitask Multiple Kernel Learning
To be able to use MKL for Multitask Learning, we need to reformulate the Multitask Learning problem as a weighted linear combination of kernels. In the spirit of Equation 1, the basic idea of our decomposition is to define taskbased block masks on the kernel matrix of K_{ B } . Given a list of tasks T = {t_{1},…,t_{ T } }, we define a kernel K_{ S } on a subset of tasks S ⊆ T as follows:
where t(x) denotes the task of example x. Here, each S corresponds to a metatask as defined in the introduction. In the most general formulation, we define a collection I = {S_{1},…,S_{ p } } of an arbitrary number p of task sets S_{ i } , which defines the latent structure over tasks. This collection leads to the following linear combination of kernels, which is positive semidefinite:
Using , we can readily utilize existing MKL methods to learn the β_{ i }. This corresponds to identifying the groups of tasks S_{ i } for which sharing information leads to improved performance. After training using MKL, we have obtained a classifier f_{ t } for each task t:
where N is the total number of training examples of all tasks combined.
What remains to be discussed is how to define a collection I of candidate subsets S_{ i } (i.e. metatasks), which are subsequently to be weighted by MKL. We consider two scenarios, one where we assume to have access to a hierarchical structure relating the tasks at hand and one, where we assume no prior knowledge given about task relations. Generally, however, it is possible to utilize prior domain knowledge indicating how tasks are related to design an appropriate I.
Powerset MTMKL
With no prior information given, a natural choice is to take into account all possible subsets of tasks. Given a set tasks T, this corresponds to considering the power set P of T (excluding the empty set) I_{ p } = {SS ∈ P(T) ∧ S ≠ Ø}.
Clearly, this gives us an exponential number (i.e. 2 ^{T} ) of task sets S_{ i } of which only a few will be relevant. To identify the relevant task sets, we propose to use an L1regularized MKL approach (i.e. q = 1 in Equation 2) to yield a sparse solution. Most subset weights will be set to zero, yielding only a few relevant subsets with weights greater than zero. We expect that the examples in these subsets come from similar distributions and that it is therefore beneficial to consider interactions between them, when obtaining a multitask predictor.
While L1regularization of MKL results in a sparse combination of kernels, it does not address the computational complexity of the optimization problem over this exponential search space. With the current implementation, the method is limited to approximately 10 tasks depending on the number of training examples and available resources. However, there are techniques to handle the case where the number of tasks may become prohibitive, for instance, as proposed in [20]. The idea is to iteratively generate new kernels based on the current solution (β, w). These methods are known to converge to the optimal solution, if one can identify appropriate kernels in a larger set. In the current case, this could be done by solving an integer linear program.
Hierarchical MTMKL
In the second scenario, we assume that we are given a tree structure G that relates our tasks at hand (see Figure 1). In this context, a task t_{ i } corresponds to a leaf in G. Assuming hierarchical relations between tasks is particularly relevant to Computational Biology where often different tasks correspond to different organisms. In this context, we expect that the longer the common evolutionary history between two organisms, the more beneficial it is to share information between these organisms in a MTL setting. We can exploit the hierarchical structure G to determine which subsets might play a role for Multitask Learning. In other words, we use the hierarchy to restrict the space of task sets. Let leaves(n) = {ll is descendant of n} be the set of leaves below the subtree rooted at node n. Then, we can give the following definition for the hierarchically decomposed kernel function
As an example, consider the kernel defined by a hierarchical decomposition according to Figure 1. Clearly, the number of β_{ i } corresponds to the number of nodes. For a perfect binary tree this leads to 2m − 1 nodes, where m is the number of leaves/tasks. We expect that learning the contributions of the individual levels of the taxonomy makes sense for cases, where the edge lengths of G are unequal.
Relation to task similarity
The learned weights β_{ i } reflect the importance of the subset S_{ i } . Clearly, if two tasks t_{ k } and t_{ l } are often jointly present in subsets with high weights, we expect those tasks to be similar to each other. One can infer a measure of pairwise similarities between tasks γ_{k,l} from the weights β_{ i } of the subsets S_{ i } . We define the collection of task sets containing task t_{ k } as . Using this definition, we can define the similarity γ_{k,l} between two tasks by summing up the weights of the shared task sets S_{ i }
(3)
This similarity measure can be used for downstream analyses, as it provides insight about the task relationships. A high γ_{k,l} between tasks suggests a considerable resemblance between tasks and could help to generate domain knowledge (e.g., evidence that two cellreceptors bind to similar class of proteins, or the molecular mechanisms of the splicing machinery particularly similar).
Results and discussion
We performed experiments in two settings. In the first setting, we considered a set of MHCI (major histocompatibility complex) proteins. Here, we assume we are not given any prior information to relate them. In the second setting, we used splice site data from 15 organisms and assumed that the task relationship is given by a hierarchical structure according to their evolutionary history. The examples are string data over an alphabet {A,T,G,C} (DNA) in the splicing case and the alphabet of 20 amino acids in the MHCI case. To incorporate string features, we used the Weighted Degree String Kernel [21], which amongst other kernels such as the Spectrum Kernel [22], has been successfully employed in problems from Computational Biology.
In addition to the two MKLbased methods, we evaluated the following baseline methods:

Union  One global model is obtained on the union of examples from all tasks.

Plain  For each task, a model is trained independently, not taking into account any outofdomain information.

Vanilla MTL  Our algorithms consists of two components  the MTL formulation and the adjustment of weights β_{ i } with MKL. In the vanilla approach, we fix all weights at β_{ i } = 1.
Experiments were performed by using crossvalidation for modelselection on the training splits. We only tuned one hyperparameter C, for which we considered values between 0. 01 and 1000 on a logarithmic scale in 8 steps. After having obtained an optimal regularization parameter, a classifier is retrained on all training splits and final performance is obtained on a dedicated test set, that was not involved in hyperparameter selection.
MHCI binding prediction using Powerset MTMKL
In this experiment, the task is to predict whether a peptide binds to a certain MHC molecule (binder) or not (nonbinder). It has been previously shown that sharing information between related molecules (alleles) and thus casting the problem in a Multitask Learning scenario, can be beneficial [9]. In the MHC setting, different tasks correspond to different MHC proteins. The data consists of peptide sequences of length l = 9 for 7 tasks. In total, we are given 7367 examples (A_2403=254, A_6901=833, A_0201=3089, A_0202=1447, A_0203=1443, A_2402=197, A_2301=104). For crossvalidation, the data was split randomly into 5 splits of the same size. Unlike the setting of splice site prediction, we do not have a hierarchical structure relating our tasks at hand. To demonstrate that meaningful groups of tasks can be identified by Powerset MTMKL, we do not assume any prior knowledge of task relationships. Please note, however, that we do have access to the sequences of the MHCI proteins. We use these sequences to evaluate the task similarities obtained by our approach.
We report the area under the precision recall curve (auPRC) for the individual tasks in Figure 2 and the summary of performances in Table 1.
From Figure 2, we observe that the MKLbased approach outperforms the baseline methods. Furthermore, simply combining the data for different tasks to obtain a single model (Union) does not outperform the naïve method of obtaining an individual classifier for each task (Plain). This hints at rather large differences between the tasks. Learning the weights with MKL, improves performance compared to the Vanilla MTL approach, which already outperforms the other two baselines.
Figure 3 shows the distribution of weights obtained by the L1regularized MKL approach. As expected, we observe that most task sets are assigned a weight of zero (or close to zero). Only a few get assigned a higher weight, so it is worthwhile to investigate the list of tasks that get assigned a weight β_{ i } > 0.05. From Table 2, we observe that the tasks A_0201, A_0202, A_0203, are often grouped in the same task set, which is in agreement with domain knowledge. Based on the assigned weights, we compute the task similarity as defined in Equation (3). For evaluation of the learned similarities, we compare them to the hamming distance (or similarity) between the amino acids in the binding pocket [23] of the MHCI molecules (Figure 4). By visual inspection, we find a good agreement between the inferred task similarity and the moleculebased similarity.
Using MKL, we successfully identify groups of tasks among which information sharing is sensible, thus allowing for a smart combination of information from different tasks in the absence of prior knowledge.
The improvement in performance over the Vanilla MTL method is relatively small (a property most likely inherited from MKL). However, we are compensated for this by simultaneously obtaining a sensible task structure.
Splicesite prediction using hierarchical MTMKL
In this setting, we take into account a given hierarchy (see Figure 5) relating the 15 organisms in our data set. The data set consists of 6000 examples for 15 tasks, each at a positive to negative ratio of 1:100, similar to the one used in [3]. The data is split into 4 splits, three splits with 333 examples each and a large test split with 5000 examples. The dataset was created that way to establish a scenario where positive training examples are extremely rare.
We report the area under the precision recall curve (auPRC), which is well suited for unbalanced data sets. For the Vanilla MTL method, we use the given hierarchy G to define the initial task sets, but not further optimize their individual influence.
From Figure 6, we can make a few very interesting observations. First, in accordance with the results from the MHCI experiment (see Table 3), the nonsparse Hierarchical MTMKL methods outperform the baselines Union and Plain.
The second observation is that we get different results for different qnorms. In particular, we see a degraded performance for q = 1, which complies with our expectation that weights for this approach (assuming the hierarchy is correct) should not be sparse. For the qnorms that we considered, q = 2 performs best. Lastly, we can show that we are able to outperform the Vanilla MTL method (all β_{ i } = 1) by refining the task relations given by the structure G with MKL. Intuitively, using Hierarchical MTMKL corresponds to estimating the edge lengths of G, whereas the other method is restricted to directly using the similarities encoded into the taxonomy.
Conclusions
We have presented a principle way of formulating Multitask Learning as a Multiple Kernel Learning approach. Following the basic idea of tasksetwise decomposition of the kernel matrix, we present a hierarchical decomposition and a power set based approach.
These two methods allow us to elegantly identify or refine structure relating the tasks at hand in one global optimization problem. We expect our methods to work particularly well in cases, where edge weights differ within the hierarchical structure or where the task structure is unknown.
Our experiments illustrate that the MTMKL approach on the power set of all tasks works well for the MHC binding problem: First it increases the accuracy of the predictors compared to the baseline methods and second, the inferred task similarity reflects the prior knowledge that is available for this problem. Also for the splice site prediction problem where the task hierarchy is given by the organisms’ phylogeny, our approach manages to achieve an improvement over standard approaches. Using MKL on top of regular Multitask Learning methods may uncover latent task structure and thereby provide insight into the problem domain, which might be relevant to downstream analyses. In conclusion, this work constitutes a valuable proofofconcept outlining a principle way of using MKL to improve Multitask Learning.
Abbreviations
 MKL:

Multiple Kernel Learning
 MTL:

Multi Task Learning
 MHC:

Major Histocompatibility Complex
 SVM:

Support Vector Machine
 auPRC:

area under the Precision Recall Curve.
References
 1.
Caruana R: Multitask Learning. Machine Learning 1997, 28: 41–75. 10.1023/A:1007379606734
 2.
Evgeniou T, Micchelli C, Pontil M: Learning Multiple Tasks with Kernel Methods. Journal of Machine Learning Research 2005, 6: 615–637.
 3.
Schweikert G, Widmer C, Scholkopf B, Rätsch G: An Empirical Analysis of Domain Adaptation Algorithms. In Advances in Neural Information Processing System, NIPS, Volume. Volume 22. Vancouver, B.C.; 2008.
 4.
BenDavid Schuller: Exploiting Task Relatedness for Multiple Task Learning. COLT: Proceedings of the Workshop on Computational Learning Theory, Morgan Kaufmann Publishers 2003.
 5.
Blitzer J, Crammer K, Kulesza A, Pereira F, Wortman J: Learning bounds for domain adaptation. Advances in Neural Information Processing Systems 2008, 20: 129–136.
 6.
Daumé H III: Bayesian Multitask Learning with Latent Hierarchies. In Conference on Uncertainty in Artificial Intelligence 2009.
 7.
Xue Y, Liao X, Carin L, Krishnapuram B: Multitask learning for classification with dirichlet process priors. Journal of Machine Learning Research 2007, 8: 2007.
 8.
Jacob L, Bach F, Vert JP: Clustered MultiTask Learning: A Convex Formulation. In NIPS. MIT Press; 2009:745–752.
 9.
Jacob L, Vert J: Efficient peptideMHCI binding prediction for alleles with few known binders. Bioinformatics 2008, 24(3):358. 10.1093/bioinformatics/btm611
 10.
Daumé H: Frustratingly Easy Domain Adaptation. In ACL. The Association for Computer Linguistics; 2007.
 11.
Vapnik V: . In The nature of statistical learning theory. Springer Verlag; 2000.
 12.
Bottou L, Chapelle O, Decoste D, Weston J: . In Largescale kernel machines. MIT Press; 2007.
 13.
Sonnenburg S, Rätsch G, Schäfer C, Schölkopf B: Large scale multiple kernel learning. The Journal of Machine Learning Research 2006, 7: 1565.
 14.
Schölkopf B, Smola A: . In Learning with Kernels. Cambridge, MA: The MIT Press; 2002.
 15.
Evgeniou T, Pontil M: Regularized multitask learning. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Seattle, Washington, USA, August 22–25, 2004. Edited by: Kim W, Kohavi R, Gehrke J, DuMouchel W,. ACM; 2004:109–117.
 16.
Bach F, Lanckriet G, Jordan M: Multiple kernel learning, conic duality, and the SMO algorithm. In Proceedings of the twentyfirst international conference on Machine learning. ACM New York, NY, USA; 2004.
 17.
Joachims T: Making largeScale SVM Learning Practical. In Advances in Kernel Methods  Support Vector Learning. Edited by: Schölkopf B, Burges C, Smola A. MIT Press; 1999.
 18.
Chang C, Lin C: LIBSVM: a library for support vector machines. 2001.
 19.
Kloft M, Brefeld U, Sonnenburg S, Laskov P, Muller KR, Zien A: Efficient and Accurate LpNorm Multiple Kernel Learning. In Advances in Neural Information Processing Systems. Volume 22. Edited by: Bengio Y, Schuurmans D, Lafferty J, Williams CKI, Culotta A,. MIT Press; 2009:997–1005.
 20.
Gehler P, Nowozin S: Infinite Kernel Learning. In NIPS Workshop on Kernel Learning: Automatic Selection of Optimal Kernels 2008.
 21.
Rätsch G, Sonnenburg S: . In Accurate Splice Site Detection for Caenorhabditis elegans. MIT Press; 2004.
 22.
Leslie C, Eskin E, Noble WS: The Spectrum Kernel: A String Kernel For SVM Protein Classification. In Proceedings of the Pacific Symposium on Biocomputing 2002, 564–575.
 23.
Robinson J, Waller MJ, Parham P, Bodmer JG, Marsh SG: IMGT/HLA Database:a sequence database for the human major histocompatibility complex. Nucleic Acids Res 2001, 29: 210–213. 10.1093/nar/29.1.210
Acknowledgements
We would like to acknowledge Sören Sonnenburg for help with the implementation of our methods and Magdalena Feldhahn for providing the hierarchical clustering of MHCI sequences.
This article has been published as part of BMC Bioinformatics Volume 11 Supplement 8, 2010: Proceedings of the Neural Information Processing Systems (NIPS) Workshop on Machine Learning in Computational Biology (MLCB). The full contents of the supplement are available online at http://www.biomedcentral.com/14712105/11?issue=S8.
Author information
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
Christian Widmer worked out the idea and implementation, performed the experiments and prepared part of the manuscript. Yasemin Altun was involved in the discussions, the development of methods on which this paper is based and the preparation of the manuscript. Nora C. Toussaint contributed to the discussions, provided the data for the MHCI experiments and contributed to the preparation of the manuscript. Gunnar Rätsch came up with the original idea for the project and supervised the project at each step.
Rights and permissions
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Published
DOI
Keywords
 Baseline Method
 Multiple Kernel Learn
 Precision Recall Curve
 Hinge Loss
 Standard Support Vector Machine