KOPLS package: Kernelbased orthogonal projections to latent structures for prediction and interpretation in feature space
 Max Bylesjö†^{1},
 Mattias Rantalainen†^{2},
 Jeremy K Nicholson^{2},
 Elaine Holmes^{2} and
 Johan Trygg^{1}Email author
DOI: 10.1186/147121059106
© Bylesjö et al; licensee BioMed Central Ltd. 2008
Received: 28 August 2007
Accepted: 19 February 2008
Published: 19 February 2008
Abstract
Background
Kernelbased classification and regression methods have been successfully applied to modelling a wide variety of biological data. The Kernelbased Orthogonal Projections to Latent Structures (KOPLS) method offers unique properties facilitating separate modelling of predictive variation and structured noise in the feature space. While providing prediction results similar to other kernelbased methods, KOPLS features enhanced interpretational capabilities; allowing detection of unanticipated systematic variation in the data such as instrumental drift, batch variability or unexpected biological variation.
Results
We demonstrate an implementation of the KOPLS algorithm for MATLAB and R, licensed under the GNU GPL and available at http://www.sourceforge.net/projects/kopls/. The package includes essential functionality and documentation for model evaluation (using crossvalidation), training and prediction of future samples. Incorporated is also a set of diagnostic tools and plot functions to simplify the visualisation of data, e.g. for detecting trends or for identification of outlying samples. The utility of the software package is demonstrated by means of a metabolic profiling data set from a biological study of hybrid aspen.
Conclusion
The properties of the KOPLS method are well suited for analysis of biological data, which in conjunction with the availability of the outlined opensource package provides a comprehensive solution for kernelbased analysis in bioinformatics applications.
Background
Orthogonal Projections to Latent Structures (OPLS) [1, 2] is a linear regression method that has been employed successfully for prediction modelling in various biological and biochemical applications [3–5]. Among the benefits provided by the OPLS method is its innate ability to model data with both noisy as well as multicollinear variables, such as spectral data from metabolic profiling and other omics platforms [6]. The OPLS method employs the descriptor matrix X (N × K), where N denotes the number of samples and K the number of variables in X, to predict the response matrix Y (N × M), where M denotes the number of variables in Y. The unique property of the OPLS method compared to other linear regression methods is its ability to separate the modelling of covarying variation from structured noise, defined as systematic Yorthogonal variation, while simultaneously maximising the covariance between X and Y.
The OPLS algorithm models the variation in the data matrix X by means of two sets of latent variables [7] (score matrices) T_{p} and T_{o}; see Equation 1. Here, T_{p} (N × A) denotes the Ypredictive score matrix for X, P_{p}^{T} (A × K) denotes the Ypredictive loading matrix for X, T_{o} (N × A_{o}) denotes the corresponding Yorthogonal score matrix, P_{o}^{T} (A_{o} × K) denotes the loading matrix of Yorthogonal components and E denotes the residual matrix of X. Both the Ypredictive and Yorthogonal score matrices describe properties of the modelled observations that are useful for identifying expected and unexpected trends, clusters or outlying samples in data. The relationship between OPLS and other linear regression methods is discussed explicitly elsewhere [1, 3].
X = T_{p}P_{p}^{T} + T_{o}P_{o}^{T} + E (1)
Kernelbased pattern recognition methods [8] such as Support Vector Machines (SVMs) [9], KernelPCA (KPCA) [10, 11] and KernelPLS (KPLS) [12, 13] have previously been applied in a multitude of contexts for exploratory analysis and classification, including biological applications [14–17]. Common among these kernelbased methods is their application of the 'kernel trick' [18]; allowing the kernel matrix to be treated as dot products in a highdimensional feature space. Specifically, this is achieved by adopting a linear method to socalled dual form, so that all instances of the descriptor matrix X are expressed in terms of dot products, e.g. XX^{T}. Subsequently, XX^{T} is substituted for the kernel Gram matrix K with entries K_{i,j}= k(x_{ i }, x_{ j }), where x_{ i }and x_{ j }corresponds to the i th and j th rowvector in the descriptor matrix X, respectively, and k(·,·) represents the kernel function. Hence, one can avoid explicitly mapping X to higherdimensional spaces as well as computing dot products in the feature space, which is computationally beneficial. The transformation to higherdimensional spaces is performed implicitly by the kernel function k(·,·); where common kernel functions include polynomial or Gaussian functions (see Equations 2 and 3).
k(x,y) = (x^{T}y + 1)^{ p } (2)
k(x,y) = exp(xy^{2}/2σ^{2}) (3)
The kernel functions in Equations 2–3 depend on the choice of the parameters p and σ, respectively, which typically influences the predictive ability of the kernelbased method. The traditional approach to kernel parameter selection is to predefine parameter limits and subsequently perform an exhaustive grid search over the entire parameter space. At each setting, the generalisation properties of the model are evaluated using e.g. crossvalidation [19] to identify the parameter setting yielding the lowest possible generalisation error. Unfortunately, even moderately short step sizes can result in a large number of evaluations and unacceptable run times. The alternative in such cases is to utilise stochastic methods, such as simulated annealing [20], which may identify reasonable approximations of the global generalisation error minimum using less evaluations.
Pseudocode for the KOPLS model training algorithm. K denotes the original kernel matrix, K_{i} the kernel matrix deflated by i Yorthogonal components and Q_{i} the K_{i} matrix deflated by A predictive components.
Step  Description  

1.  Estimate the predictive Yweights (C_{p}) by eigenvector decomposition of Y^{T}KY  
2.  Project Y onto C_{p} to achieve the predictive score matrix of Y: U_{p} ← YC_{p}  
3.  Calculate the predictive score matrix of X: T_{p} ← KU_{p}  
4.  Repeat for i : 1 to A_{o}  
4.1  Estimate the Yorthogonal loadings c_{o} by eigenvector decomposition of T_{p}^{T}Q_{i}T_{p}.  
4.2.  Calculate the Yorthogonal score vector: t_{o,i} ← Q_{i}T_{p}c_{o}  
4.3.  Deflate K_{i} by t_{o,i}, yielding K_{i+1}  
4.4.  Update the predictive score matrix: T_{p} ← K_{i+1}U_{p}  
5.  Predictions of Y: Y_{hat} ← T* _{p} (T_{p}^{T}T_{p})^{1}T_{p}^{T}U_{p}C_{p}^{T}. For predictions of future samples, T* _{p} originates from the prediction set. 
Implementations of various kernelbased methods are available in the literature for the R and MATLAB environments. Among the R packages available on CRAN [23], a few relevant examples include kernlab (kernelbased regression and classification), e1071 (including SVMs) and PLS (implementing a linear kernelbased implementation of the PLS algorithm). kernlab provides a number of kernelbased methods for regression and classification, including SVMs and leastsquares SVMs, with functionality for nfold crossvalidation. The e1071 package contains functions for training and prediction using SVMs, including (randomised) nfold crossvalidation. The PLS package includes an implementation of both linear PLS as well as a linear kernelbased PLS version. This enables more efficient computations in situations where the number of observations is very large in relation to the number of features. The PLS package also provides a flexible crossvalidation functionality.
MATLAB toolboxes implementing kernelbased methods include e.g. the SVM and Kernel Methods MATLAB Toolbox [24], Least Squares – Support Vector Machines MATLAB/C toolbox [25] and libsvm [26]. The latter contains a general collection of SVM related algorithms implemented in C++ and Java, including interfaces for MATLAB, Python and a number of other environments. All of these packages provide implementations of various kernelbased methods as well as crossvalidation functionality and basic plot functions. Additional kernelbased software packages can be found at kernelmachines.org [27].
An implementation of the original linear OPLS method [1] is available in the Windowsbased software SIMCAP+ 11.0 (Umetrics AB, Umeå, Sweden). SIMCAP includes a vast number of visualisation features as well as nfold crossvalidation functionality to estimate the number of Ypredictive and Yorthogonal components.
Here, we describe an implementation of the KOPLS algorithm for R [28] and MATLAB (The Mathworks, Natick, MA, USA) licensed under the GNU GPL. To the best knowledge of the authors, there are no other software packages currently available that implement the KOPLS method. The package includes fundamental functionality for model training, prediction of unknown samples and evaluation by means of crossvalidation. Included is also a set of diagnostic tools and plot functions to simplify the visualisation of data, e.g. for detecting trends or for identification of outlying samples.
The KOPLS method can be used for both regression as well as classification tasks and has optimal performance in cases where the number of variables is much higher than the number of observations. Typical application areas are nonlinear regression and classification problems using omics data sets. Properties of the KOPLS method make it particularly helpful in cases where detecting and interpreting patterns in the data is of interest. This may e.g. involve instrumental drift over time in metabolic profiling applications using e.g. LCMS or when there is a risk of dissimilarities between different experimental batches collected at different days. In addition, structured noise (Yorthogonal variation) may also be present as a result of the biological system itself and can therefore be applied for the explicit detection and modelling of such variation. This is accomplished by interpretation of the Ypredictive and the Yorthogonal score components in the KOPLS model. The separation of Ypredictive and Yorthogonal variation in the feature space is unique to the KOPLS method and is not present in any other kernelbased method.
The utility of the KOPLS software package is demonstrated by means of a metabolic profiling data set from a biological study of hybrid aspen, where the KOPLS method is compared in parallel to the similar KPLS method.
Implementation
The KOPLS algorithm has been implemented as an opensource and platformindependent software package for MATLAB and R, in accordance with [21]. The KOPLS package provides functionality for model training, prediction and evaluation using crossvalidation. Additionally, model diagnostics and plot functions have been implemented to facilitate and further emphasise the interpretational strengths of the KOPLS method compared to other related methods.
The following features are available for both MATLAB and R:
(1) Estimation (training) of KOPLS models
An implementation of the pseudocode in Table 1 for modelling the relation between a kernel matrix K (N × N) and a response matrix Y using A predictive and A_{o} Yorthogonal score vectors.
(2) Prediction of new data using the estimated KOPLS model in step (1)
An implementation of the prediction of Y_{hat} (N_{ test }× M) given a test kernel K_{test} (N_{ test }× N_{ test }).
(3) Crossvalidation functionality to estimate the generalisation error of a KOPLS model
This is intended to guide the selection of the number of Ypredictive components A and the number of Yorthogonal components A_{o}. The supported implementations are:

nfold crossvalidation. Data is split into n separate groups and models are sequentially built from n1 groups while the n th group is predicted and used to measure the generalisation error.

Monte Carlo CrossValidation (MCCV) [29]. Data is randomly split into crossvalidation training and test sets. A model is built from the crossvalidation training set while the test set is predicted and used to measure the generalisation error. The procedure is repeated n times to achieve a distribution of prediction errors.

Monte Carlo Classbalanced CrossValidation (for discriminant analysis cases). Same as regular MCCV except that the split into crossvalidation training and test sets is balanced with respect to the existing class labels.
(4) Kernel functions
including the polynomial (Equation 2) and Gaussian (Equation 3) kernel functions.
(5) Model statistics

The explained variation of X (R^{2}_{X}).

The explained variation of Y (R^{2}_{Y}).

Prediction statistics over crossvalidation for regression tasks (Q^{2}_{Y}, which is inversely proportional to the generalisation error).

Prediction statistics over crossvalidation for classification tasks (sensitivity and specificity measures).
(6) Plot functions for visualisation

Scatter plot matrices for model score components.

Model statistics and diagnostics plots.
Code examples for the functionality described above is available in Additional File 1 for both MATLAB and R. The KOPLS package, including source code and documentation, is available for different operating systems in Additional Files 2, 3, 4 or for download on the project home page (see Availability and requirements).
Results and Discussion
The utility of the method has previously been demonstrated using simulated data and for applications in analytical chemistry [21]. Here, we describe a biological data set originating from a study measuring differences in biochemical composition across two genotypes of hybrid aspen. The genotypes will be denoted mutant and wildtype (WT) throughout. Samples have been taken from three biological replicates of each genotype at eight different positions of the tree (internodes 1–8, starting from the top), constituting 48 different observations, of which 46 are included in the analysis (data collection failed for two samples). The internode gradient denotes an approximate growth gradient of the tree. Metabolic profiling data has been collected by means of highresolution magic angle spinning proton nuclear magnetic resonance (^{1}H HR/MAS NMR) spectroscopy. Data pretreatment, including bucketing and removal of residual water, is described in the original study [30].
The modelled descriptor matrix X (46 × 655) contains the NMR data and the response matrix Y (46 × 1) contains the genotypes labelled as 1 and +1. The aim in this study is to predict an unknown sample into the correct category (mutant or WT) based on the metabolic profile. An additional 10 samples were used as an independent test set to further estimate the generalisation error. Both data sets were columnwise meancentred prior to modelling.
From Figure 1B one can also note that there is a joint internode gradient, formed by a linear combination of t^{1}_{o} and t^{2}_{o}. Furthermore, Figure 1B reveals a somewhat bimodal behaviour of the mutant internode gradient. In Figure 1C the joint internode gradient is shown only for the mutant samples, colourcoded by biological replicate. Biological replicate A displays a deviant behaviour, which is an intermediate between the profiles of biological replicates B and C and the WT samples (Figure 1B) and explains the bimodal behaviour. Also from the original study one can superficially see (Figure 4A on page 356 in [30]) that biological replicate A is an approximate intermediate of the stronger mutants B and C and the WT samples. A plausible explanation for this behaviour is that the antisense construct used to create the modified samples is not as strongly active in biological replicate A; either due to the process involved in generating the mutant or slight differences in growth conditions.
For comparison, a KPLS model was fitted in parallel using the Gaussian kernel function with σ = 0.5 and 10 Yorthogonal components as recommended by sevenfold crossvalidation. The first latent variable t_{1} is plotted against the second t_{2} in Figure 1D. One can note that the discriminatory direction is now a linear combination of both of the latent variables (and possible also subsequent components). The different internode gradients are distinctly seen also in the KPLS model, although the internode gradient of the WT samples is correlating perfectly with the discriminatory direction, implying that this direction is related to the class separation. In relation to the KOPLS model, one can clearly see that this is not the case from Figure 1A–B and previous discussions, which highlights the advantages of the KOPLS method. Furthermore, it is not possible in the KPLS model to quantify the amount of variance related to class discrimination (34.3% from the KOPLS model) in relation to the variance related to the internode gradient (47.3% based on the variance in t^{1}_{o} and t^{2}_{o} in the KOPLS model).
Practical code examples of the functionality of the package are available in Additional File 1, describing both MATLAB and R code including illustrations from an additional demonstration data set. This demonstration data set also is available with the supplied package (Additional Files 2, 3, 4).
Conclusion
Kernel methods have previously been applied successfully in many different pattern recognition applications due to the strong predictive abilities and availability of the methods. The KOPLS method is well suited for analysis of biological data, foremost through its innate capability to separately model predictive variation and structured noise. This property of the KOPLS method has the potential to improve the interpretation of biological data, as was demonstrated by a plant NMR data set where interpretation is enhanced compared to the related method KPLS. In conjunction with the availability of the outlined opensource package, KOPLS provides a comprehensive solution for kernelbased analysis in bioinformatics applications.
Availability and requirements

Project name: kopls

Project home page: http://www.sourceforge.net/projects/kopls/

Operating systems: OS Portable (Source code to work with many OS platforms).

Programming languages: MATLAB and R

Other requirements: MATLAB version 7.0 or newer, R version 2.0 or newer.

License: GNU GPL version 2.
Notes
Abbreviations
 OPLS:

Orthogonal Projections to Latent Structures
 KOPLS:

Kernelbased Orthogonal Projections to Latent Structures
 SVM:

Support Vector Machine
 KPCA:

Kernel Principal Component Analysis
 KPLS:

Kernel Partial Least Squares
 NMR:

Nuclear Magnetic Resonance
 ^{1}H HR/MAS NMR:

Highresolution magic angle spinning proton NMR
 LCMS:

Liquid chromatographymass spectrometry
Declarations
Acknowledgements
The authors are grateful to Andreas Sjödin at the Umeå Plant Science Centre, Umeå, Sweden, for useful comments. This work was supported by grants from The METAGRAD Project funded by AstraZeneca and Unilever plc. (MR), The Swedish Foundation for Strategic Research (MB, JT) and The Swedish Research Council (MB, JT).
Authors’ Affiliations
References
 Trygg J, Wold S: Orthogonal projections to latent structures (OPLS). J Chemometrics 2002, 16: 119–128. 10.1002/cem.695View ArticleGoogle Scholar
 Trygg J, Wold S: O2PLS, a twoblock (XY) latent variable regression (LVR) method with an integral OSC filter. J Chemometrics 2003, 17: 53–64. 10.1002/cem.775View ArticleGoogle Scholar
 Bylesjö M, Eriksson D, Sjödin A, Jansson S, Moritz T, Trygg J: Orthogonal Projections to Latent Structures as a Strategy for Microarray Data Normalization. BMC Bioinformatics 2007, 8: 207. 10.1186/147121058207PubMed CentralView ArticlePubMedGoogle Scholar
 Bylesjö M, Rantalainen M, Cloarec O, Nicholson JK, Holmes E, Trygg J: OPLS discriminant analysis: combining the strengths of PLSDA and SIMCA classification. J Chemometrics 2006, 20: 341–351. 10.1002/cem.1006View ArticleGoogle Scholar
 Cloarec O, Dumas ME, Trygg J, Craig A, Barton RH, Lindon JC, Nicholson JK, Holmes E: Evaluation of the orthogonal projection on latent structure model limitations caused by chemical shift variability and improved visualization of biomarker changes in 1H NMR spectroscopic metabonomic studies. Anal Chem 2005, 77(2):517–526. 10.1021/ac048803iView ArticlePubMedGoogle Scholar
 Cloarec O, Dumas ME, Craig A, Barton RH, Trygg J, Hudson J, Blancher C, Gauguier D, Lindon JC, Holmes E, Nicholson J: Statistical total correlation spectroscopy: an exploratory approach for latent biomarker identification from metabolic 1H NMR data sets. Anal Chem 2005, 77(5):1282–1289. 10.1021/ac048630xView ArticlePubMedGoogle Scholar
 Kvalheim OM: The latent variable. Chemometrics Intell Lab Syst 1992, 14: 1–3. 10.1016/01697439(92)80088LView ArticleGoogle Scholar
 ShaweTaylor J, Cristianini N: Kernel methods for pattern analysis. Cambridge , Cambridge University Press; 2004:462.View ArticleGoogle Scholar
 Schölkopf B, Smola A: Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. Cambridge , MIT Press; 2001.Google Scholar
 Rosipal R, Girolami M, Trejo LJ, Cichocki A: Kernel PCA for feature extraction and denoising in nonlinear regression. Neural Comput Appl 2001, 10(3):231–243. 10.1007/s5210018051zView ArticleGoogle Scholar
 Schölkopf B, Smola A, Müller KR: Nonlinear component analysis as a kernel eigenvalue problem. Neural Comput 1998, 10(5):1299–1319. 10.1162/089976698300017467View ArticleGoogle Scholar
 Lindgren F, Geladi P, Wold S: The kernel algorithm for PLS. J Chemometrics 1993, 7(1):45–59. 10.1002/cem.1180070104View ArticleGoogle Scholar
 Rosipal R, Trejo LJ: Kernel partial least squares regression in Reproducing Kernel Hilbert Space. J Mach Learn Res 2002, 2(2):97–123. 10.1162/15324430260185556Google Scholar
 Anderson DC, Li W, Payan DG, Noble WS: A new algorithm for the evaluation of shotgun peptide sequencing in proteomics: support vector machine classification of peptide MS/MS spectra and SEQUEST scores. J Proteome Res 2003, 2(2):137–146. 10.1021/pr0255654View ArticlePubMedGoogle Scholar
 Brown MP, Grundy WN, Lin D, Cristianini N, Sugnet CW, Furey TS, Ares M Jr., Haussler D: Knowledgebased analysis of microarray gene expression data by using support vector machines. Proc Natl Acad Sci U S A 2000, 97(1):262–267. 10.1073/pnas.97.1.262PubMed CentralView ArticlePubMedGoogle Scholar
 Furey TS, Cristianini N, Duffy N, Bednarski DW, Schummer M, Haussler D: Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics 2000, 16(10):906–914. 10.1093/bioinformatics/16.10.906View ArticlePubMedGoogle Scholar
 Pochet N, De Smet F, Suykens JA, De Moor BL: Systematic benchmarking of microarray data classification: assessing the role of nonlinearity and dimensionality reduction. Bioinformatics 2004, 20(17):3185–3195. 10.1093/bioinformatics/bth383View ArticlePubMedGoogle Scholar
 Aizerman M, Braverman E, Rozonoer L: Theoretical foundations of the potential function method in pattern recognition learning. Automat Rem Contr 1964, 25: 821–837.Google Scholar
 Wold S: Cross Validatory Estimation of the Number of Components in Factor and Principal Components Models. Technometrics 1978, 20: 397–406. 10.2307/1267639View ArticleGoogle Scholar
 Kirkpatrick S, Gelatt CD Jr., Vecchi MP: Optimization by Simulated Annealing. Science 1983, 220(4598):671–680. 10.1126/science.220.4598.671View ArticlePubMedGoogle Scholar
 Rantalainen M, Bylesjö M, Cloarec O, Nicholson JK, Holmes E, Trygg J: Kernelbased orthogonal projections to latent structures (KOPLS). J Chemometrics 2007, 21: 376–385. 10.1002/cem.1071View ArticleGoogle Scholar
 Czekaj T, Wu W, Walczak B: About kernel latent variable approaches and SVM. J Chemometrics 2005, 19(5–7):341–354. 10.1002/cem.937View ArticleGoogle Scholar
 The Comprehensive R Archive Network (CRAN)[http://cran.rproject.org/]
 SVM and Kernel Methods Matlab Toolbox[http://asi.insarouen.fr/enseignants/~arakotom/toolbox/index.html]
 Least Squares  Support Vector Machines MATLAB/C toolbox[http://www.esat.kuleuven.ac.be/sista/lssvmlab/home.html]
 libsvm[http://www.csie.ntu.edu.tw/~cjlin/libsvm/]
 kernelmachines.org[http://www.kernelmachines.org/software]
 The R project for Statistical Computing[http://www.rproject.org/]
 Shao J: LinearModel Selection by CrossValidation. J Am Stat Assoc 1993, 88(422):486–494. 10.2307/2290328View ArticleGoogle Scholar
 Wiklund S, Karlsson M, Antti H, Johnels D, Sjöström M, Wingsle G, Edlund U: A new metabonomic strategy for analysing the growth process of the poplar tree. Plant Biotechnol J 2005, 3(3):353–362. 10.1111/j.14677652.2005.00129.xView ArticlePubMedGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.