A flexible framework for sparse simultaneous component based data integration
 Katrijn Van Deun^{1},
 Tom F Wilderjans^{2},
 Robert A van den Berg^{1, 3},
 Anestis Antoniadis^{4} and
 Iven Van Mechelen^{1}
DOI: 10.1186/1471-2105-12-448
© Van Deun et al; licensee BioMed Central Ltd. 2011
Received: 14 July 2011
Accepted: 15 November 2011
Published: 15 November 2011
Abstract
Background
High-throughput data are complex, and methods that reveal the structure underlying the data are most useful. Principal component analysis, frequently implemented as a singular value decomposition, is a popular technique in this respect. Nowadays, the challenge is often to reveal structure in several sources of information (e.g., transcriptomics, proteomics) that are available for the same biological entities under study. Simultaneous component methods are most promising in this respect. However, the interpretation of the principal and simultaneous components is often daunting because the contributions of each of the biomolecules (transcripts, proteins) have to be taken into account.
Results
We propose a sparse simultaneous component method that makes many of the parameters redundant by shrinking them to zero. It includes principal component analysis, sparse principal component analysis, and ordinary simultaneous component analysis as special cases. Several penalties can be tuned that account in different ways for the block structure present in the integrated data. This yields known sparse approaches such as the lasso, the ridge penalty, the elastic net, the group lasso, the sparse group lasso, and the elitist lasso. In addition, the algorithmic results can be easily transposed to the context of regression. Metabolomics data obtained with two measurement platforms for the same set of Escherichia coli samples are used to illustrate the proposed methodology and the properties of the different penalties with respect to sparseness across and within data blocks.
Conclusion
Sparse simultaneous component analysis is a useful method for data integration: First, simultaneous analyses of multiple blocks offer advantages over sequential and separate analyses and, second, interpretation of the results is greatly facilitated by their sparseness. The proposed approach is flexible and allows the block structure to be taken into account in different ways. As such, structures can be found that are exclusively tied to one data platform (group lasso approach) as well as structures that involve all data platforms (elitist lasso approach).
Availability
The additional file contains a MATLAB implementation of the sparse simultaneous component method.
Background
The integrated analysis of multiple data sets obtained for the same biological entities under study has become one of the major challenges for data analysis in bioinformatics and computational biology. Two main causes for this trend are the availability of complementary measurement platforms and the systemic approach to biology; in both cases, multiple data sets are obtained on the same set of samples (e.g., culture samples, tissues). First, examples involving several measurement platforms include the study of the metabolome composition of Escherichia coli (E. coli) using several analytical chemical methods to screen for metabolites [1] and the combination of cDNA and Affymetrix chips applied to sixty cancer cell lines [2]. In both examples, there is overlap in the metabolites or genes screened but also complementarity. Second, the modern systemic approach to biology leads to a probing of the biological system at different levels of cellular organization, such as, for example, the transcript, protein, and metabolite level [3]. These approaches lead to situations where several data blocks are obtained that are coupled in the sense that they were obtained for the same set of samples. A key issue in integrative data analysis is to analyze such data simultaneously instead of separately or sequentially, as this yields an aggregated view. In this respect, simultaneous component methods, which are an extension of principal component analysis (PCA) to the case of multiple coupled data blocks, have been proposed and successfully used [4–7].
However, a drawback of component based methods like PCA is their lack of sparseness: Processes underlying the data are revealed by a weighted combination of all variables (these are the genes, transcripts, proteins, and metabolites in the aforementioned examples). From an interpretational point of view, this is not very attractive, and it also does not reflect that biological processes are expected to be governed by a limited number of genes [8]. The problem holds even more for simultaneous component methods, as these involve multiple large sets of variables. To deal with this issue, sparse approaches have been proposed mainly within the context of regression analysis (e.g., [9, 10]) but also for principal component analysis [8, 11–14]: These select a limited number of variables by shrinking many of the weights to zero, which is accomplished by proper penalization of these (regression) weights. A favorable characteristic of such penalty based methods is that the selection is built-in (in contrast to, for example, first filtering and then doing the regression/PCA). Here, we extend sparse PCA to sparse simultaneous component methods, accounting for the fact that the data are structured in several data blocks holding both shared and complementary information. The estimation procedure used is efficient and the associated MATLAB code can be found in the additional file.
First, we present the sparse simultaneous component model, starting from ordinary principal component analysis and sparse PCA. A generic modeling framework is introduced that incorporates several types of penalties. Then we present some results for metabolomics data obtained with two measurement platforms for the same set of E. coli samples and we validate the method by means of simulated data.
Results
Algorithm
Notation
We will make use of the following formal notation: matrices are denoted by bold uppercase letters, vectors by bold lowercase letters, the transpose by the superscript ^{ T }, and the cardinality by the capital of the letter used to run the index (e.g., this paper deals with K data matrices X_{ k } with k running from 1 to K); see [15].
Throughout the paper, we suppose that all variables are mean-centered and scaled to norm one.
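As an illustrative sketch (not the paper's MATLAB implementation from the additional file), this preprocessing step can be written in a few lines of NumPy; the function name `preprocess` and the block dimensions are our own choices:

```python
import numpy as np

def preprocess(X):
    """Mean-center each variable (column) and scale it to norm one."""
    Xc = X - X.mean(axis=0)               # remove column means
    norms = np.linalg.norm(Xc, axis=0)
    norms[norms == 0] = 1.0               # guard against constant variables
    return Xc / norms

# Example: an I x J_k data block with samples in the rows.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(28, 10))
Xp = preprocess(X)
```

After this step, each column has mean zero and unit Euclidean norm, as assumed throughout the paper.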
Model
such that ${P}_{k}^{T}{P}_{k}=\mathrm{I}$ and with λ_{ L } ≥ 0 and λ_{ R } ≥ 0 tuning parameters for the lasso and ridge penalties respectively, $\|W_k\|_1 = \sum_{j_k,r} |w_{j_k r}|$ and $\|W_k\|^2 = \sum_{j_k,r} w_{j_k r}^2$. The lasso, tuned by the parameter λ_{ L }, has the property of simultaneously shrinking coefficients and selecting variables, keeping only those variables with the highest coefficients. The higher λ_{ L }, the stronger the shrinkage and selection. Note that the selection is done in an unstructured way, meaning that correlations between variables are not taken into account. The ridge penalty, tuned by λ_{ R }, only shrinks the coefficients and does not perform variable selection (none of the coefficients becomes zero). It is often introduced when it is of interest to group correlated variables [10] or in case of ill-conditioned optimization problems (see [18]) to resolve the non-uniqueness of the parameter estimates. A particular case is regression analysis with more variables than objects, J_{ k } > I, which yields an underdetermined estimation problem. In the context of PCA, this is of relevance for model (5) because the estimation of the component weights boils down to a regression analysis. Adding the ridge penalty with λ_{ R } > 0 resolves the non-uniqueness; in addition, with the appropriate normalization, the ridge ensures that the solution of (5) yields the principal components in case λ_{ L } = 0 (see [14]).
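The contrasting behavior of the two penalties can be seen from their proximal operators. The following sketch is our illustration, not part of the paper's code: the lasso sets small coefficients exactly to zero, while the ridge only rescales them.

```python
import numpy as np

def soft_threshold(w, lam):
    # lasso proximal operator: shrink toward zero and set small coefficients to zero
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

def ridge_shrink(w, lam):
    # ridge proximal operator: uniform rescaling, no coefficient becomes exactly zero
    return w / (1.0 + lam)

w = np.array([2.0, -0.3, 0.05, -1.2])
w_lasso = soft_threshold(w, 0.5)   # two of the four coefficients become zero
w_ridge = ridge_shrink(w, 0.5)     # all four coefficients stay nonzero
```

Increasing `lam` strengthens both effects, matching the role of λ_{ L } and λ_{ R } in the text.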
under the constraint of a principal axes orientation and orthogonal loadings: $[P_1^T \dots P_K^T][P_1^T \dots P_K^T]^T = I$. Simultaneous component model (7) shows that the common component scores T lie in the space spanned by all variables, that is, from all data blocks. For ease of notation, we will use the shorthand X_{ c } = [X_{1} ... X_{ K }] (of size I × Σ_{ k }J_{ k }) and ${P}_{c}=[P_1^T \dots P_K^T]^T$ and ${W}_{c}=[W_1^T \dots W_K^T]^T$ (both of size Σ_{ k }J_{ k } × R). Note that several simultaneous component models have been proposed in the literature: [6] gives an overview that emphasizes the different ways of weighting the data blocks in connection with different principles for realizing a fair integration of the data.
The elitist lasso was introduced by [19] in the context of regression analysis. The behavior of this penalty can be understood by observing that it acts as the lasso within blocks and as the ridge between blocks, resulting in shrinkage and selection of the variables with the highest coefficients within each block (lasso) and shrinkage without selection between blocks (ridge).
Sparse approaches
Norm  Properties  Lasso  Elastic net  Group lasso  Sparse group lasso  Elitist lasso
l_{1}  selection and shrinkage at the level of the concatenated data  YES  YES  NO  YES  NO
l_{2}^{2}  shrinkage, groups correlated variables  NO  YES  NO  NO  NO
l_{2,1}  selection and shrinkage of entire blocks  NO  NO  YES  YES  NO
l_{1,2}  selection and shrinkage within each block  NO  NO  NO  NO  YES
which has to be minimized with respect to T and P_{ k }under the constraint that T^{ T }T = I. Note that estimation of the loadings is not a regression problem. Therefore, unlike the model based on sparse weights, unique solutions are obtained when J_{ k }> I. This is the case even when λ_{ R }= 0.
The generic loss functions (11) and (12) allow for a flexible use of all these approaches to sparseness. All combinations of the four penalties are possible. However, often some prior idea about the structure (selection within blocks, between blocks, or both within and between blocks) exists, such that it is not necessary to consider all possible combinations. Furthermore, some combinations are not advisable. For example, the combination of the group lasso and the elitist lasso does not seem useful because the behavior of the one interferes with that of the other. By setting the appropriate tuning parameters in the objective functions to zero, particular known sparse approaches can be obtained. For example, with λ_{ G }= λ_{ E }= 0 the extension of sparse PCA to simultaneous component analysis is obtained, and with λ_{ R }= λ_{ E }= 0 a sparse simultaneous component version of the sparse group lasso in linear regression is obtained. With all four tuning parameters set equal to zero, the ordinary simultaneous component analysis model results. K = 1 leads to principal component analysis, and additionally setting λ_{ G }= λ_{ E }= 0 yields sparse PCA as proposed by [14]. Table 1 summarizes these different existing sparse approaches in terms of which penalties are active.
Algorithm
Given fixed values for the different tuning parameters (λ_{ L }, λ_{ R }, λ_{ G }, and λ_{ E }) and a fixed number of components R, we make use of an alternating scheme to minimize (11) or (12) with respect to W_{ c }(or T) and P_{ c }: W_{ c }(or T) and P_{ c }are alternatingly updated, conditional on fixed values for the other parameters. For example, focusing on (11):

Step 1: Initialize W_{ c }

Step 2: Conditional on the current estimate of W_{ c }, obtain the optimal least-squares estimate of P_{ c }under the orthogonality constraint as follows (see [22]): P_{ c }= UV^{ T }with USV^{ T }the singular value decomposition of ${W}_{c}^{T}{X}_{c}^{T}{X}_{c}$

Step 3: Check the stop criteria: 1) Is the difference in loss with the previous iteration smaller than 1e-12 or, 2) has a maximum of 5000 iterations been reached? If yes, terminate; else continue.

Step 4: Conditional on the current estimate of P_{ c }, obtain the update of W_{ c }using a majorization minimization procedure (see [23–25] for a general introduction); see the Methods Section for a derivation of the estimate. Return to Step 2.
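Step 2, the orthogonality-constrained loadings update, can be sketched in NumPy as follows (our illustration, not the MATLAB code from the additional file; we adopt the convention that P_{ c } has orthonormal columns and accordingly take the SVD of the transpose of the matrix named in the text):

```python
import numpy as np

def update_P(Xc, Wc):
    # Orthogonality-constrained least-squares update of the loadings:
    # an orthogonal Procrustes step on X_c^T X_c W_c.
    M = Xc.T @ Xc @ Wc                        # (sum_k J_k) x R
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt                             # P_c with P_c^T P_c = I

def sca_loss(Xc, Wc, P):
    # unpenalized part of the objective: ||X_c - X_c W_c P^T||^2
    return np.linalg.norm(Xc - Xc @ Wc @ P.T) ** 2

rng = np.random.default_rng(1)
Xc = rng.normal(size=(28, 188))   # e.g. GC-MS and LC-MS blocks concatenated
Wc = rng.normal(size=(188, 5))
P = update_P(Xc, Wc)
```

Because the remaining terms of the loss do not depend on P_{ c } under the orthogonality constraint, this SVD-based update minimizes the loss over all loading matrices with orthonormal columns.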
This particular scheme guarantees that the loss is a nonincreasing function of the iterations. Due to the convexity (not strict) and the fact that the loss function is bounded from below by zero, the procedure will converge to a fixed point for suitable starting values. The majorization minimization (MM) procedure has a linear rate of convergence; this slow convergence rate may, however, be compensated for by the efficiency of the calculations (see for example [26]). To account for the problem that the fixed point may represent a local minimum instead of the global optimum, a multi-start procedure can be used. See the Methods Section for details on the algorithm used to minimize (12). MATLAB code implementing the algorithms can be found in the supplementary material.
Testing and implementation
In this section we apply the proposed approach both to empirical and simulated data. The application to empirical data (metabolomics) is mainly for illustrative purposes. The simulated data are used to check how the different penalties (and their interactions) behave under various conditions, and to compare the sparse component weights and sparse component loadings modeling approaches.
Metabolomics data
As an illustrative case, we use empirical data on the metabolome composition of 28 samples of E. coli. The different samples refer to different environmental conditions and different elapsed fermentation times. Mass spectrometry (MS) in combination with, on the one hand, gas chromatography (GC) and, on the other hand, liquid chromatography (LC) as a separation method was used, resulting in two coupled data blocks: a GC-MS block with the peak areas of 144 metabolites in the 28 conditions and an LC-MS block with the peak areas of 44 metabolites in these same conditions. Simultaneous component analysis was previously successfully applied, describing the data well by five components (see [5, 6]). However, a better understanding of the processes underlying the data may be obtained by a sparse simultaneous component analysis (SCA) approach, as this characterizes the components by a few instead of all metabolites and thus facilitates interpretation.
Our proposed method allows the data to be modeled in several ways, depending on the one hand on the choice of penalizing either the weights or the loadings and on the other hand on the particular values of the different tuning parameters. Therefore, we will analyze the data under different options, namely either under model (11) or under model (12) and, for both models, with several combinations of values for the different tuning parameters. Here we explain how we chose a suitable range of values for the tuning parameters, using the notation for the model with penalized weights. The different values of λ_{ L }, λ_{ G }, λ_{ E }, and λ_{ R } were chosen in a way that reflects the balance between lack-of-fit and strength of the penalty by setting them as a fraction of $\|X_c\|^2$ (maximal lack-of-fit) and $\|W_c\|_{p,q}$ (maximal value of the penalty), with W_{ c } obtained from the ordinary SCA solution. Let λ_{p,q} denote the tuning parameter of the penalty corresponding to the (mixed) l_{p,q} norm; this yields λ_{p,q} = f $\|X_c\|^2 / \|W_c\|_{p,q}$ with f taking the values 0, 10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 0.2, 0.5, and 1. We only consider those combinations of nonzero values for the tuning parameters that have been considered in the regression literature, namely the lasso, elastic net, group lasso, sparse group lasso, and elitist lasso (see Table 1). Note that the case with all tuning parameters equal to zero corresponds to regular simultaneous component analysis.
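The scaling rule above can be sketched as follows (our illustration; `l1_norm` covers the lasso case, and the other mixed norms would be plugged in analogously):

```python
import numpy as np

def lam_grid(Xc, Wc, penalty_norm,
             fs=(0, 1e-4, 1e-3, 1e-2, 1e-1, 0.2, 0.5, 1)):
    # lambda_{p,q} = f * ||X_c||^2 / ||W_c||_{p,q}, with W_c taken from
    # the ordinary (unpenalized) SCA solution.
    scale = np.linalg.norm(Xc, 'fro') ** 2 / penalty_norm(Wc)
    return [f * scale for f in fs]

def l1_norm(W):
    # lasso penalty value ||W||_1
    return np.abs(W).sum()

rng = np.random.default_rng(2)
Xc = rng.normal(size=(28, 188))
Wc = rng.normal(size=(188, 5))
grid = lam_grid(Xc, Wc, l1_norm)
```

The grid starts at zero (ordinary SCA) and ends at a value where the penalty term matches the maximal lack-of-fit in magnitude.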
First we discuss the results for the approach based on penalized weights, then the approach based on penalized loadings, followed by a brief comparison of the two approaches. We end the empirical application section with a discussion on the choice and interpretation of a particular sparse simultaneous component analysis.
Penalized weights
Summary results for the different simultaneous component analyses with sparse weights
Lasso  Group lasso  Elitist lasso

f  Fit  % zeros  Fit  % zeros  Fit  % zeros 
0  0.57  0  0.57  0  0.57  0 
0.0001  0.57  86  0.57  0  0.57  88 
0.001  0.57  87  0.57  9  0.56  92 
0.01  0.57  88  0.57  9  0.52  96 
0.1  0.56  92  0.56  9  0.26  99 
0.2  0.55  94  0.55  25  0.16  97 
0.5  0.52  97  0.47  45  0.08  98 
1  0.43  99  0.23  50  0.04  99 
Penalized loadings
Summary results for the different simultaneous component analyses with sparse loadings
Lasso  Group lasso  Elitist lasso

f  Fit  % zeros  Fit  % zeros  Fit  % zeros 
0  0.57  0  0.57  0  0.57  0 
0.0001  0.57  0  0.57  0  0.57  0 
0.001  0.57  0  0.57  0  0.57  4 
0.01  0.57  0  0.57  0  0.54  19 
0.1  0.57  7  0.57  0  0.36  37 
0.2  0.56  10  0.56  0  0.28  41 
0.5  0.53  20  0.54  0  0.17  47 
1  0.46  28  0.46  0  0.10  47 
Reflections on penalizing the weights versus the loadings
As illustrated, the results obtained under the model with penalized loadings differ from the results obtained under the model with penalized weights. In our view, the most important differences are at the level of data reconstruction and at the level of interpretation. With respect to data reconstruction, the model based on weights yields a better fit, while the model with sparse loadings may yield many zero vectors for the reconstructed data. Also, in this respect, the components based on sparse weights have a higher correlation with the components of the ordinary SCA solution than the components resulting from a model with sparse loadings. With respect to interpretation of the underlying components, for the model based on sparse weights this is done in a regression-like way, while for the model based on sparse loadings it is based on considering loadings as correlations of the variables with the component. In ordinary SCA, the loadings are the correlations, and in the sparse model we observed a close connection in that zero loadings represent close-to-zero correlations and higher loadings represent higher correlations. The weights do not have such a relation with the correlation between the variable and the component.
Selection and interpretation of the sparse SCA solution
Overview of fit and sparseness for retained sparse group lasso models
f _{ L }  f _{ R }  f _{ G }  Fit  Number of zeros in  

C1  C2  C3  C4  C5  
0.5  0.0001  0.01  0.52  178  176  176  178  179 
0.5  0.0001  0.1  0.49  166  156  167  161  159 
0.5  0.0001  0.2  0.44  150  173  145  143  169 
0.5  0.001  0.1  0.49  158  167  161  166  156 
0.5  0.001  0.2  0.44  169  150  146  145  173 
0.5  0.01  0.1  0.48  154  154  166  165  160 
1  0.0001  0.01  0.43  184  181  183  184  182 
1  0.001  0.001  0.43  181  185  185  185  186 
1  0.001  0.01  0.43  181  182  184  183  180 
1  0.01  0.0001  0.42  180  183  180  179  174 
1  0.01  0.001  0.42  182  179  174  180  179 
1  0.01  0.01  0.41  177  180  173  178  181 
Metabolites with nonzero weights in the two selected solutions
metabolite  f_{ L }= 0.5  f_{ L }= 1  

C1  3,5-dihydroxypentanoate:  0.68  0.88
C1  valine:  0.58  0.20 
C1  3-phenyllactate or isomer:  0.55  1.21
C1  isoleucine:  0.48  
C1  tyrosine:  0.41  0.03 
C1  phenylalanine:  0.40  
C1  unknown mass 304, 319 and 406:  0.01  
C1  spectrum not complete6:  0.06  
C1  mixed spectrum3:  0.43  0.36 
C1  ketogluconate (?):  0.46  0.25 
C2  fumarate:  1.40  1.99 
C2  malate:  0.96  1.06 
C2  aspartate:  0.42  
C2  monomethylphosphate:  0.39  
C2  C18:1 fatty acid3:  0.37  0.19 
C2  unknown1:  0.37  
C2  spectrum not complete4:  0.20  
C2  mixed spectrum2:  0.19  
C2  glycerate:  0.14  
C2  unknown20:  0.02  
C3  lactate:  1.23  2.18 
C3  pyruvate:  0.71  0.39 
C3  disaccharide12:  0.49  0.11 
C3  3-dehydroquinate:  0.38  
C3  disaccharide8:  0.33  
C3  citrate:  0.29  
C3  disaccharide9:  0.27  
C3  unknown mass 318 and 420:  0.17  
C3  unknown mass 217 and 191:  0.17  
C3  disaccharide13:  0.11  
C3  2-hydroxybutanoate:  0.09  
C4  ADP:  1.16  1.01 
C4  GDP:  0.96  1.21 
C4  UDP-glucose:  0.71  0.14
C4  UTP:  0.34  
C4  unknown27:  0.20  
C4  GMP:  0.20  
C4  FBP:  0.09  
C5  spectrum not found7:  1.41  2.04 
C5  guanine:  0.73  
C5  orotate:  0.51  0.34 
C5  spectrum not complete5:  0.31  
C5  mixed spectrum6:  0.24  
C5  N-acetylaspartate
C5  + beta-phenylpyruvate:  0.23  
C5  thymine:  0.12 
Component scores for the selected solution
Condition  Ferm. time  C1  C2  C3  C4  C5 

Reference  16  0.42  0.11  0.06  0.30  0.27 
24  0.26  0.14  0.00  0.29  0.09  
32  0.30  0.09  0.26  0.07  0.05  
40  0.40  0.15  0.27  0.24  0.03  
48  0.38  0.06  0.06  0.34  0.09  
pH +  16  0.35  0.13  0.28  0.99  0.25 
24  0.08  0.22  0.14  0.35  0.10  
40  0.46  0.20  0.35  0.30  0.13  
48  0.54  0.26  0.38  0.10  0.12  
oxygen +  40  0.21  0.05  0.51  0.02  0.13 
oxygen −  16  0.44  0.24  0.00  0.24  0.21
24  0.22  0.03  0.42  0.32  0.15  
40  0.34  0.10  1.05  0.24  0.03  
64  0.59  0.05  0.50  0.24  0.08  
phosphate +  16  0.54  0.23  0.08  0.23  0.23 
24  0.53  0.26  0.18  0.27  0.17  
40  0.09  0.06  0.59  0.26  0.10  
48  0.14  0.01  0.02  0.13  0.13  
phosphate −  16  0.27  0.25  0.03  0.04  0.14
24  0.26  0.21  0.19  0.35  0.01  
40  0.53  0.21  0.33  0.56  0.14  
succinate  24  0.10  1.03  0.09  0.13  0.19 
40  0.06  1.21  0.13  0.05  0.08  
48  0.12  1.07  0.11  0.02  0.19  
Wild type  16  0.42  0.27  0.34  0.20  0.05 
24  0.23  0.14  0.31  0.38  0.44  
40  0.11  0.17  0.22  0.22  0.94  
48  0.04  0.19  0.26  0.14  1.06 
Simulated data
To validate the proposed sparse simultaneous component method, we make use of simulated data. The general setup is that data are generated under specific conditions and with known structure; after addition of noise, the performance of the method in terms of recovering the underlying structure is assessed. Here, we are particularly interested in two aspects: first, whether the penalties reflect the structure in the selection of the variables (i.e., between data blocks, within data blocks, or both between and within data blocks); second, the behavior of the method as a function of the model (i.e., sparse weights or sparse loadings). We also manipulated the amount of error in the data (5 and 30 percent) and the degree of sparseness (50 and 90 percent of zero weights/loadings). All factors were fully crossed and for each of the resulting 2 × 3 × 2 × 2 = 24 conditions, 5 data sets were generated, resulting in a total of 120 data sets. To obtain a realistic simulation, we generated the data using the metabolomics data described in the previous section. Twenty-eight samples were drawn with replacement from the original data; then a singular value decomposition was performed to obtain three components: the three loading and weight vectors were obtained as the three right singular vectors corresponding to the three largest singular values, multiplied by these singular values; the three component score vectors were set equal to the corresponding left singular vectors.
Sparseness was imposed by setting either weights or loadings equal to zero as follows: In case of sparseness between blocks, all weights/loadings of the first component that correspond to the first data block (the first 144 weights/loadings) were set equal to zero, and for the second and third components the weights/loadings corresponding to the second data block (the last 44 weights/loadings) were set equal to zero; in case of sparseness within blocks, 50 or 90 percent of variable indices were randomly sampled and their corresponding weights/loadings were set equal to zero; in case of sparseness within and between data blocks, the two previous strategies were combined. The resulting component loadings and weights were used to generate the true data part using the model part of expressions (1) and (2) (i.e., without the addition of the residual matrices). Noise was then added to this true part of the data, with the noise being generated from a normal distribution with mean zero and variance such that these residual matrices account for 5 or 30 percent of the total variation [34]. Each of the data sets was analyzed under both models (sparse weights or sparse loadings) and with varying values for the tuning parameters (f equal to 0, 10^{-3}, 0.1, 0.5, and 10). The elitist lasso penalty was only combined with the ridge penalty because it interferes with the lasso and the group lasso (see earlier).
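The noise step can be sketched as follows (our illustration; we scale the residual matrix so that it accounts for the target proportion exactly rather than in expectation, which is one common reading of [34]):

```python
import numpy as np

def add_noise(true_part, prop, rng):
    # Draw i.i.d. normal noise and rescale it so that the residual matrix
    # accounts for exactly `prop` of the total variation (sum of squares).
    E = rng.normal(size=true_part.shape)
    target_ss = np.sum(true_part ** 2) * prop / (1.0 - prop)
    E *= np.sqrt(target_ss / np.sum(E ** 2))
    return true_part + E, E

rng = np.random.default_rng(3)
T = rng.normal(size=(28, 188))          # "true" data part
X, E = add_noise(T, 0.30, rng)          # 30 percent noise condition
```

The same function with `prop=0.05` gives the low-noise condition.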
Discussion
We proposed an extension of sparse PCA to the context of several data blocks, relying on a generic modeling framework that allows either for sparse component weights or for sparse component loadings and that incorporates several approaches to sparsity taken in the regression literature (including the lasso, elastic net, group lasso, elitist lasso, and sparse group lasso). A very flexible algorithm was developed that allows the data to be analyzed under a variety of approaches that take the structure of the data into account in different ways. It also allows for combinations of penalties that have not yet been considered in the regression literature.
The flexibility of the approach is important, as often a particular kind of structure is needed from data integration methods. The group lasso is a popular tool to find structures that involve only one data block. This is, for example, relevant in comparative genomics when the focus is on divergence [35] or on tissue-specificity [36]. The elitist lasso, on the other hand, finds sparse structures that involve each of the data blocks. Not only is this of relevance in the aforementioned case of comparative genomics, to find conserved processes, but also in a top-down systems biology approach, for example to integrate microarray gene expression data and interaction data with the aim of finding transcription factors and their target genes [37].
Although the model and algorithm were proposed in the context of simultaneous component analysis, they can easily be translated to the context of principal component analysis and of regression analysis. In fact, the algorithm can be used as is for PCA, and the adaptation to regression analysis is a minor one. In the context of simultaneous component analysis, adaptations of the model (and algorithm) that allow for different values of the tuning parameter for each component and/or each block would be valuable. However, such an extension is not trivial. Moreover, the problem of selecting an optimal model becomes more difficult, as more parameters need to be tuned. A major theoretical challenge for many sparse methods is to find a good method for selecting the values of the tuning parameters.
Conclusions
We offered a flexible and sparse framework for data integration based on simultaneous component methods. The method is flexible both with respect to the component model and with respect to the sparse structure imposed: Sparsity can be imposed either on the component weights or on the loadings, and can be imposed either within data blocks, across data blocks, or both within and across data blocks. As such, it makes it possible to find structures exclusively tied to one data platform as well as structures that involve all data platforms. A penalty based approach is used that includes the lasso, the ridge penalty, the group lasso, and the elitist lasso. The method includes principal component analysis, sparse principal component analysis, and ordinary simultaneous component analysis as special cases. Real and simulated data were used to validate the method. We believe the method offers a very flexible and versatile tool for many data integration problems.
Methods
Here we derive the estimates used in the alternating least squares and iterative majorization algorithm. First, it is shown how the conditional estimates for the objective function relying on sparse component weights can be obtained and then for the objective function relying on sparse loadings.
Sparse component weights
with U and V the left and right singular vectors of ${W}_{c}^{T}{X}_{c}^{T}{X}_{c}$.
The minimization of (13) with respect to W_{ c }is not a standard problem due to the lasso, group lasso, and elitist lasso penalties on W_{ c }. We will make use of a numerical procedure, known as Majorization Minimization (MM) or Iterative Majorization, which has proven to be a superior algorithmic strategy in regularization problems [25, 38]. Briefly stated, MM replaces functions that are complicated to minimize by surrogate functions that are easy to minimize, that lie on or above the original function, and that touch the original function at the so-called supporting point. These properties lead to the sandwich inequality [23].
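For the absolute-value terms in the penalties, a standard quadratic majorizer is available; the following check (our illustration of the general MM idea, not the paper's exact update) verifies the two defining properties, majorization everywhere and touching at the supporting point:

```python
import numpy as np

def abs_majorizer(w, w_o):
    # Quadratic surrogate for |w| at supporting point w_o != 0:
    # |w| <= w^2 / (2 |w_o|) + |w_o| / 2, with equality at w = w_o
    # (a consequence of the arithmetic-geometric mean inequality).
    return w ** 2 / (2.0 * np.abs(w_o)) + np.abs(w_o) / 2.0

w = np.linspace(-3.0, 3.0, 601)
w_o = 1.3
g = abs_majorizer(w, w_o)
```

Minimizing such quadratic surrogates instead of the original penalized loss is what makes each MM update a simple (ridge-like) least-squares problem.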
with D_{3} a diagonal matrix containing the $\left(\sum_{j_k,r} |w_{j_k r}^{o}|\right) |w_{j_k r}^{o}|^{-1}$ on its diagonal.
which may be useful when J_{ k }> I.
Sparse loadings
with U and V the left and right singular vectors of ${P}_{c}^{T}{X}^{T}$.
List of abbreviations
E. coli: Escherichia coli
GC: Gas Chromatography
LC: Liquid Chromatography
MM: Majorization Minimization
MS: Mass Spectrometry
PCA: Principal Component Analysis
SCA: Simultaneous Component Analysis
SVD: Singular Value Decomposition
Declarations
Acknowledgements
This work was supported by the Research Fund of Katholieke Universiteit Leuven (SymBioSys: CoE EF/05/007, GOA/2005/04, PDM: Tom Wilderjans); by IWT-Flanders (IWT/060045/SBO Bioframe); and by the Belgian Federal Science Policy Office (IUAP P6/03 and P6/04). We would like to thank TNO, Quality of Life, Zeist, The Netherlands, for making the data available. The authors also wish to thank the reviewers for their valuable comments and suggestions.
References
 van der Werf MJ, Overkamp KM, Muilwijk B, Coulier L, Hankemeier T: Microbial metabolomics: toward a platform with full metabolome coverage. Analytical Biochemistry 2007, 370:17–25. doi:10.1016/j.ab.2007.07.022
 Lê Cao KA, Martin P, Robert-Granié C, Besse P: Sparse canonical methods for biological data integration: application to a cross-platform study. BMC Bioinformatics 2009, 10:34. [http://www.biomedcentral.com/1471-2105/10/34] doi:10.1186/1471-2105-10-34
 Ishii N, Nakahigashi K, Baba T, Robert M, Soga T, Kanai A, Hirasawa T, Naba M, Hirai K, Hoque A, Ho PY, Kakazu Y, Sugawara K, Igarashi S, Harada S, Masuda T, Sugiyama N, Togashi T, Hasegawa M, Takai Y, Yugi K, Arakawa K, Iwata N, Toya Y, Nakayama Y, Nishioka T, Shimizu K, Mori H, Tomita M: Multiple high-throughput analyses monitor the response of E. coli to perturbations. Science 2007, 316(5824):593–597. [http://www.sciencemag.org/cgi/content/abstract/316/5824/593] doi:10.1126/science.1132067
 de Tayrac M, Lê S, Aubry M, Mosser J, Husson F: Simultaneous analysis of distinct omics data sets with integration of biological knowledge: Multiple Factor Analysis approach. BMC Genomics 2009, 10:32. doi:10.1186/1471-2164-10-32
 van den Berg R, Van Mechelen I, Wilderjans T, Van Deun K, Kiers H, Smilde A: Integrating functional genomics data using maximum likelihood based simultaneous component analysis. BMC Bioinformatics 2009, 10:340. [http://www.biomedcentral.com/1471-2105/10/340] doi:10.1186/1471-2105-10-340
 Van Deun K, Smilde A, van der Werf M, Kiers H, Van Mechelen I: A structured overview of simultaneous component based data integration. BMC Bioinformatics 2009, 10:246. [http://www.biomedcentral.com/1471-2105/10/246] doi:10.1186/1471-2105-10-246
 Wilderjans TF, Ceulemans E, Van Mechelen I, van den Berg RA: Simultaneous analysis of coupled data matrices subject to different amounts of noise. British Journal of Mathematical and Statistical Psychology 2011, 64(2):277–290. doi:10.1348/000711010X513263
 Lee D, Lee W, Lee Y, Pawitan Y: Super-sparse principal component analyses for high-throughput genomic data. BMC Bioinformatics 2010, 11:296. [http://www.biomedcentral.com/1471-2105/11/296] doi:10.1186/1471-2105-11-296
 Tibshirani R: Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 1996, 58:267–288.
 Zou H, Hastie T: Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B 2005, 67:301–320. doi:10.1111/j.1467-9868.2005.00503.x
 Jenatton R, Obozinski G, Bach F: Structured sparse principal component analysis. Journal of Machine Learning Research 2010, 9:366–373.
 Jolliffe I, Trendafilov N, Uddin M: A modified principal component technique based on the LASSO. Journal of Computational and Graphical Statistics 2003, 12(3):531–547. doi:10.1198/1061860032148
 Witten DM, Tibshirani R, Hastie T: A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics 2009, 10(3):515–534. [http://biostatistics.oxfordjournals.org/content/10/3/515.abstract] doi:10.1093/biostatistics/kxp008
 Zou H, Hastie T, Tibshirani R: Sparse principal component analysis. Journal of Computational and Graphical Statistics 2006, 15(2):265–286. doi:10.1198/106186006X113430
 Kiers H: Towards a standardized notation and terminology in multiway analysis. Journal of Chemometrics 2000, 14:105–122. doi:10.1002/1099-128X(200005/06)14:3<105::AID-CEM582>3.0.CO;2-I
 Gabriel KR: The biplot graphic display of matrices with application to principal component analysis. Biometrika 1971, 58:453–467. doi:10.1093/biomet/58.3.453
 Jolliffe IT: Principal component analysis. New York: Springer; 2002.
 Hoerl AE, Kennard RW: Ridge regression: biased estimation for nonorthogonal problems. Technometrics 1970, 12:55–67. [http://www.jstor.org/stable/1267351] doi:10.2307/1267351
 Kowalski M, Torrésani B: Structured sparsity: from mixed norms to structured shrinkage. SPARS'09 – Signal Processing with Adaptive Sparse Structured Representations 2009, 53:814–861.
 Yuan M, Lin Y: Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B 2006, 68:49–67. doi:10.1111/j.1467-9868.2005.00532.x
 Friedman J, Hastie T, Tibshirani R: A note on the group lasso and a sparse group lasso. Technical report, Statistics Department, Stanford University; 2010.
 Ten Berge JMF: Least squares optimization in multivariate analysis. Leiden: DSWO; 1993.
 de Leeuw J: Block relaxation algorithms in statistics. In Information Systems and Data Analysis. Edited by: Bock HH, Lenski W, Richter MM. Berlin: Springer-Verlag; 1994:308–325.
 Heiser WJ: Convergent computation by iterative majorization: theory and applications in multidimensional data analysis. In Recent Advances in Descriptive Multivariate Analysis. Edited by: Krzanowski WJ. Oxford: Oxford University Press; 1995:157–189.
 Lange K, Hunter DR, Yang I: Optimization transfer using surrogate objective functions. Journal of Computational and Graphical Statistics 2000, 9:1–20. doi:10.2307/1390605
 Van Deun K, Groenen PJF: Majorization algorithms for inspecting circles, ellipses, squares, rectangles, and rhombi. Operations Research 2005, 53:957–967. doi:10.1287/opre.1050.0253
 Barabási AL, Oltvai ZN: Network biology: understanding the cell's functional organization. Nature Reviews Genetics 2004, 5:101–113. doi:10.1038/nrg1272
 Hochreiter S, Bodenhofer U, Heusel M, Mayr A, Mitterecker A, Kasim A, Khamiakova T, Van Sanden S, Lin D, Talloen W, Bijnens L, Göhlmann HWH, Shkedy Z, Clevert DA: FABIA: factor analysis for bicluster acquisition. Bioinformatics 2010, 26(12):1520–1527. doi:10.1093/bioinformatics/btq227
 Huang J, Ma S, Xie H, Zhang CH: A group bridge approach for variable selection. Biometrika 2009, 96(2):339–355. doi:10.1093/biomet/asp020
 Zhao P, Rocha G, Yu B: Grouped and hierarchical model selection through composite absolute penalties. Technical report, Department of Statistics, University of California, Berkeley; 2006.
 Ma S, Song X, Huang J: Supervised group Lasso with applications to microarray data analysis. BMC Bioinformatics 2007, 8:60. doi:10.1186/1471-2105-8-60
 Meier L, van de Geer S, Bühlmann P: The group lasso for logistic regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 2008, 70.
 Kim Y, Kim J, Kim Y: Blockwise sparse regression. Statistica Sinica 2006, 16:375–390.
 Wilderjans T, Ceulemans E, Van Mechelen I: Simultaneous analysis of coupled data blocks differing in size: a comparison of two weighting schemes. Computational Statistics & Data Analysis 2009, 53:1086–1098. [http://dl.acm.org/citation.cfm?id=1497631.1497740] doi:10.1016/j.csda.2008.09.031
 Alter O, Brown PO, Botstein D: Generalized singular value decomposition for comparative analysis of genome-scale expression data sets of two different organisms. Proceedings of the National Academy of Sciences 2003, 100:3351–3356. doi:10.1073/pnas.0530258100
 Van Deun K, Hoijtink H, Thorrez L, Van Lommel L, Schuit F, Van Mechelen I: Testing the hypothesis of tissue selectivity: the intersection-union test and a Bayesian approach. Bioinformatics 2009, 25(19):2588–2594. doi:10.1093/bioinformatics/btp439
 Lemmens K, De Bie T, Dhollander T, De Keersmaecker S, Thijs I, Schoofs G, De Weerdt A, De Moor B, Vanderleyden J, Collado-Vides J, Engelen K, Marchal K: DISTILLER: a data integration framework to reveal condition dependency of complex regulons in Escherichia coli. Genome Biology 2009, 10(3):R27. [http://genomebiology.com/2009/10/3/R27] doi:10.1186/gb-2009-10-3-r27
 Kiers HAL: Setting up alternating least squares and iterative majorization algorithms for solving various matrix optimization problems. Computational Statistics and Data Analysis 2002, 41:157–170. doi:10.1016/S0167-9473(02)00142-1
 Groenen PJF: Iterative majorization algorithms for minimizing loss functions in classification. Working paper presented at the 8th conference of the IFCS, Krakow, Poland; 2002.
 Borg I, Groenen PJF: Modern Multidimensional Scaling: Theory and Applications. Springer Series in Statistics. 2nd edition. New York: Springer-Verlag; 2005.
 McLachlan GJ, Peel D: Finite Mixture Models. New York: Wiley; 2000.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.