A flexible framework for sparse simultaneous component based data integration

1 Background High throughput data are complex and methods that reveal structure underlying the data are most useful. Principal component analysis, frequently implemented as a singular value decomposition, is a popular technique in this respect. Nowadays often the challenge is to reveal structure in several sources of information (e.g., transcriptomics, proteomics) that are available for the same biological entities under study. Simultaneous component methods are most promising in this respect. However, the interpretation of the principal and simultaneous components is often daunting because contributions of each of the biomolecules (transcripts, proteins) have to be taken into account. 2 Results We propose a sparse simultaneous component method that makes many of the parameters redundant by shrinking them to zero. It includes principal component analysis, sparse principal component analysis, and ordinary simultaneous component analysis as special cases. Several penalties can be tuned that account in different ways for the block structure present in the integrated data. This yields known sparse approaches as the lasso, the ridge penalty, the elastic net, the group lasso, sparse group lasso, and elitist lasso. In addition, the algorithmic results can be easily transposed to the context of regression. Metabolomics data obtained with two measurement platforms for the same set of Escherichia coli samples are used to illustrate the proposed methodology and the properties of different penalties with respect to sparseness across and within data blocks. 3 Conclusion Sparse simultaneous component analysis is a useful method for data integration: First, simultaneous analyses of multiple blocks offer advantages over sequential and separate analyses and second, interpretation of the results is highly facilitated by their sparseness. The approach offered is flexible and allows to take the block structure in different ways into account. As such, structures can be found that are exclusively tied to one data platform (group lasso approach) as well as structures that involve all data platforms (Elitist lasso approach). 4 Availability The additional file contains a MATLAB implementation of the sparse simultaneous component method.


Background
The integrated analysis of multiple data sets obtained for the same biological entities under study, has become one of the major challenges for data analysis in bioinformatics and computational biology. Two main causes for this trend are the availability of complementary measurement platforms and the systemic approach to biology; in both cases, multiple data sets are obtained on the same set of samples (e.g., culture samples, tissues).
First, examples where several measurement platforms are included are the study of the metabolome composition of Escherichia coli (E. coli) using several analytical chemical methods to screen for metabolites [1] and the combination of cDNA and Affymetrix chips applied to sixty cancer cell lines [2]. In both examples, there is overlap in the metabolites or genes screened but also complementarity. Second, the modern systemic approach to biology leads to a probing of the biological system on different levels in the cellular organization, such as for example the transcript, protein, and metabolite level [3]. These approaches lead to situations where several data blocks are obtained that are coupled in the sense that they were obtained for the same set of samples. A key issue in integrative data analysis is to analyze such data simultaneously instead of separately or sequentially as this yields an aggregated view. In this respect, simultaneous component methods, that are an extension of principal component analysis (PCA) to the case of multiple coupled data blocks, were proposed and successfully used [4][5][6][7].
However, a drawback of component based methods like PCA is their lack of sparseness: Processes underlying the data are revealed by a weighted combination of all variables (these are the genes, transcripts, proteins, metabolites in the aforementioned examples). From an interpretational point of view, this is not very attractive and it also does not reflect that biological processes are expected to be governed by a limited number of genes [8]. The problem holds even more for simultaneous component methods as these involve multiple large sets of variables. To deal with this issue, sparse approaches have been proposed mainly within the context of regression analysis (e.g., [9,10]) but also for principal component analysis [8,[11][12][13][14]: These select a limited number of variables by shrinking many of the weights to zero which is accomplished by proper penalization of these (regression) weights. A favorable characteristic of such penalty based methods is that the selection is built-in (in contrast to, for example, first filtering and then doing the regression/ PCA). Here, we extent sparse PCA to sparse simultaneous component methods, accounting for the fact that the data are structured in several data blocks holding both shared and complementary information. The estimation procedure used is efficient and the associated MATLAB code can be found in the additional file.
First, we present the sparse simultaneous component model, starting from ordinary principal component analysis and sparse PCA. A generic modeling framework is introduced that incorporates several types of penalties. Then we present some results for metabolomics data obtained with two measurement platforms for the same set of E. coli samples and we validate the method by means of simulated data.

Algorithm Notation
We will make use of the following formal notation: matrices are denoted by bold uppercases, vectors by bold lower case, the transpose by the superscript T , and the cardinality by the capital of the letter used to run the index (e.g., this paper deals with K data matrices X k with k running from 1 to K), see [15].
Throughout the paper, we suppose that all variables are mean-centered and scaled to norm one.

Model
Simultaneous component analysis is an extension of principal component analysis (PCA) to the case of multiple coupled data matrices. Consider the PCA of a single data block X k containing the scores of I k objects (e.g., batches, arrays) on J k variables (e.g., metabolites, genes). In a first model formulation [16] based on component scores, PCA decomposes the data as follows, with T k the component scores of the I k objects on the R components, P k (of size J k × R) the loadings, and E k (of size I k × J k ) the matrix of residuals. To identify the model, usually the constraints are imposed that the axes have a principal axes orientation and that the component scores are orthogonal: T T k T k = I. Another formulation of the PCA model is based on component weights as follows [17], with W k (of size J k × R) the component weights. Note that we can write T k = X k W k resulting in the equivalence of models (1) and (2). However, usually (2) is constrained to have orthogonal weights: W T k W k = I. Note that under a least squares approach to PCA, P k = W k and thus also P T k P k = I. The principal components are interpreted by considering the contribution of the variables to the components. For the score-based model (1) this is based on the fact that the loadings are equal to the correlation of the variables with the components (we suppose the variables to be mean-centered and scaled to norm one). Let x jk be the jth variable in data block k and t rk the rth component for block k, then with r(.,.) used as a notation for correlation and p jrk the loading of the jth variable on the rth component of block k. In the weight-based model (2), interpretation of the components is based on the weights as these express each component as a weighted linear combination of the variables, For both model formulations this implies that for each component a total of J k correlations or weights have to be taken into account in the interpretation. Especially in the case of omics data, that usually consist of thousands of variables, there is a need for methods that facilitate interpretation. To that end, [14] proposed a sparse PCA method for the weight based model (2), that shrinks a (large) number of component weights to zero. Their method is based on a least-squares approach to PCA model (2) in which the objective function is augmented with an l 1 penalty (also named lasso) and an l 2 (ridge) penalty: Minimize with respect to W k and P k such that P T k P k = I and with λ L ≥ 0 and λ R ≥ 0 tuning parameters for the lasso and ridge penalties respectively, W k 1 = j k ,r w j k r and W k 2 = j k, r w 2 j k r . The lasso, tuned by the parameter λ L , has the property to simultaneously shrink coefficients and select variables, keeping only those variables with the highest coefficients. The higher λ L , the stronger the shrinkage and selection. Note that the selection is done in an unstructured way meaning that correlations between variables are not taken into account. The ridge penalty, tuned by λ R , only shrinks the coefficients and does not perform variable selection (none of the coefficients becomes zero). It is often introduced when it is of interest to group correlated variables [10] or in case of ill-conditioned optimization problems (see [18]) to solve the non-uniqueness of the parameter estimates. A particular case is regression analysis with more variables than objects, J k >I k , which yields an under determined estimation problem. In the context of PCA, this is of relevance for model (5) because the estimation of the component weights boils down to a regression analysis. Adding the ridge penalty with λ R > 0 solves the non-uniqueness; in addition, with the appropriate normalization, the ridge ensures that the solution of (5) yields the principal components in case λ L = 0 (see [14]).
The simultaneous component decomposition of K coupled data blocks X k having a common set of samples (so I 1 = ... = I k = I) is given by imposing the constraint that all T k are equal. Applied to the score based model this gives: for all k and under the constraints of a principal axes orientation and orthogonality of the component scores: Applying the idea of a common matrix of component scores to the weight based model as used by [14], can be realized as follows, = T P T 1 . .
under the constraint of a principal axes orientation and orthogonal loadings: Simultaneous component model (7) shows that the common component scores T lie in the space spanned by all variables, this is from all data blocks. For ease of notation, we will use the shorthand notation X c = [X 1 ...
Note that several simultaneous component models were proposed in the literature: [6] gives an overview that emphasizes the different ways of weighting the data blocks in connection to different principles to realize a fair integration of the data. The problem that a lot of variables have to be taken into account when interpreting the components is exacerbated in the case of simultaneous component analysis as this involves several blocks of variables. To solve for this problem, we propose to go for a sparse simultaneous component method by penalizing either the loadings (in the context of the score based model) or the component weights (in the context of the weights based model) within a least-squares approach. One possibility, in line with sparse PCA, is to use the lasso penalty if necessary in conjunction with a ridge penalty (when grouping of correlated variables is of interest or when Σ k J k >I). However, other types of penalties can be used that, when selecting variables, explicitly take into account that variables belong to (pre-defined) groups/ blocks by selecting variables within blocks only, between blocks only (by setting all weights/loadings of an entire block to zero, i.e. dropping an entire group of variables at once), or both within and between blocks. A penalty that introduces selection only within each group is Elitist lasso (mixed l 1,2 norm), defined for the rth component as Elitist lasso was introduced by [19] in the context of regression analysis. The behavior of this penalty can be understood by observing that it behaves as the lasso within blocks and as the ridge between blocks, resulting in shrinkage and a selection of the variables with the highest coefficients within each block (lasso) and a shrinkage but with no selection between blocks (ridge).
To select entire (pre-defined) groups of variables, the group lasso [20] was introduced. It uses the Euclidean norm (also known as a mixed l 2,1 norm; see [19]) of the group coefficients as a penalty, This penalty behaves as the lasso at the block level and as the ridge within blocks: within blocks shrinkage and grouping of correlated variables occurs however with no selection (behavior of the ridge penalty); between blocks selection of those blocks with the highest sum of squared coefficients occurs while other blocks are dropped (behavior of the lasso). The group lasso applied to groups consisting of one variable only is the same as the lasso. (Note that taking the square root of a squared value is the same as taking the absolute value.) To obtain also sparsity within the groups that are not dropped by the group lasso, [21] proposed the sparse group lasso that blends the lasso with the group lasso and implies shrinkage and selection both within and between groups. The behavior of each of the four penalties and associated norms is summarized in Table  1.
We propose the following generic functions that combine all penalties: First, for the approach based on sparse component weights, which has to be minimized with respect to W k and P c under the constraint that P T c P c = I. Second, for the approach based on sparse component loadings, which has to be minimized with respect to T and P k under the constraint that T T T = I. Note that estimation of the loadings is not a regression problem. Therefore, unlike the model based on sparse weights, unique solutions are obtained when J k >I. This is the case even when λ R = 0.
The generic loss functions (11) and (12) allow for a flexible use of all these approaches to sparseness. All combinations of the four penalties are made possible. However, often some prior idea about the structure (selection within blocks, between blocks, both within and between blocks) exists such that it is not necessary to consider all possible combinations. Furthermore, some combinations are not advisable. For example the combination of the group lasso and elitist lasso does not seem useful because the behavior of the one interferes with the behavior of the other. By setting the appropriate tuning parameters in the objective functions to zero, particular known sparse approaches can be obtained. For example, with λ G = λ E = 0 the extension of sparse PCA to simultaneous component analysis is obtained and with λ R = λ E = 0 a sparse simultaneous component version of the sparse group lasso in linear regression is obtained. With all four tuning parameters set equal to zero, the ordinary simultaneous component analysis model results. K = 1 leads to principal component analysis and setting λ G = λ E = 0 yields sparse PCA as proposed by [14]. In Table 1 a summary is given of these different existing sparse approaches in terms of which penalties are active.

Algorithm
Given fixed values for the different tuning parameters (λ l , λ R , λ G , and λ E ) and a fixed number of components R, we make use of an alternating scheme to minimize (11) or (12) with respect to W c (or T) and P c : W c (or T) and P c are alternatingly updated, conditional on fixed values for the other parameters. For example, focusing on (11): • Step 1: Initialize W c • Step 2: Conditional on the current estimate of W c , obtain the optimal least-squares estimate of P c under the orthogonality constraint as follows (see Step 3: Check the stop criteria: 1) Is the difference in loss with the previous iteration smaller than 1e -12 or, 2) is a maximum of 5000 iterations reached? If yes, terminate, and else continue.
• Step 4: Conditional on the current estimate of P c , obtain the update of W c using a majorization minimization procedure (see [23][24][25] for a general introduction); see the Methods Section for a derivation of the estimate. Return to Step 2.
This particular scheme guarantees that the loss is a non-increasing function of the iterations. Due to the convexity (not strict) and the fact that the loss function is bounded from below by zero, the procedure will converge to a fixed point for suitable starting values. The majorization minimization (MM) procedure has a linear rate of convergence; this slow convergence rate may, however, be compensated for by the efficiency of the calculations (see for example [26]). To account for the problem that the fixed point may represent a local minimum instead of the global optimum, a multistart procedure can be used. See the Methods Section for details on the algorithm used to minimize (12). MATLAB code implementing the algorithms can be found in the supplementary material.

Testing and implementation
In this section we apply the proposed approach both to empirical and simulated data. The application to empirical data (metabolomics) is mainly for illustrative purposes. The simulated data are used to check how the different penalties (and their interactions) behave under various conditions, and to compare the sparse component weights and sparse component loadings modeling approaches.

Metabolomics data
As an illustrative case, we use empirical data on the metabolome composition of 28 samples of E. coli. The different samples refer to different environmental conditions and different elapsed fermentation times. Mass spectrometry (MS) in combination with on the one hand gas chromatography (GC) and on the other hand liquid chromatography (LC) as a separation method was used, resulting in two coupled data blocks: a GC-MS block with the peak areas of 144 metabolites in the 28 conditions and a LC-MS block with the peak areas of 44 metabolites in these same conditions. Simultaneous component analysis was previously successfully applied describing the data well by five components (see [5,6]). However, a better understanding of the processes underlying the data may be obtained by a sparse simultaneous component analysis (SCA) approach as this characterizes the components by a few instead of all metabolites and thus facilitates interpretation.
Our proposed method allows to model the data in several ways, depending on the one hand on the choice of penalizing either the weights or the loadings and on the other hand on the particular values of the different tuning parameters. Therefore, we will analyze the data under different options, namely either under model (11) or under model (12) and, for both models, with several combinations of values for the different tuning parameters. Here we explain how we chose a suitable range of values for the tuning parameters using the notation for the model with penalized weights. The different values of λ L , λ G , λ E , and λ R were chosen in a way that reflects the balance between lack-of-fit and strength of the penalty by setting them as a fraction of ||X c || 2 (maximal lack-of-fit) and |W c | p,q with W c obtained from the ordinary SCA solution (maximal value of the penalty). Let λ p,q denote the tuning parameter of the penalty corresponding to the (mixed) l p,q norm, then this yields λ p,q = f||X c || 2 /|W c | p,q with f taking values 0,10 -4 ,10 -3 ,10 -2 ,10 -1 ,0.2, 0.5, and 1. We only consider those combinations of non-zero values for the tuning parameters that were considered in the regression literature, namely the lasso, elastic net, group lasso, sparse group lasso, and elitist lasso (see Table 1). Note that the case with all tuning parameters equal to zero corresponds to regular simultaneous component analysis.
First we discuss the results for the approach based on penalized weights, then the approach based on penalized loadings, followed by a brief comparison of the two approaches. We end the empirical application section with a discussion on the choice and interpretation of a particular sparse simultaneous component analysis. Table 2 summarizes the results for the approach with a penalty on the component weights and with only one of the tuning parameters different from zero (the ridge penalty on its own is not considered as it does not induce sparsity). Five components are assumed (R = 5). For the three resulting types of sparse simultaneous component analyses we report on the one hand the fit of the model to the data and on the other hand the percentage of component weights that are zero. The fit is

Penalized weights
As could be expected, it holds that increasing the tuning parameter results in a decrease of fit and an increase of the proportion of zero component weights. Comparing the lasso and Elitist lasso, we see that the lasso has a better fit for a similar proportion of zeros which may be attributed to the fact the lasso is less constrained because it does not have to reflect the block structure in the variable selection. Both for the lasso and Elitist lasso the proportion of zeros is very high, even for small values of the tuning parameter. This could be expected as the number of variables is larger than the number of samples and warrants the inclusion of a ridge penalty (see also [14] for the case of the lasso), else non-unique solutions are obtained. Also, at most I non-zero weights will be obtained for each component and this may be too sparse, e.g. in the case of micro-array gene expression data obtained for a limited number of (tissue) samples. Such solutions with only I non-zero values fit as well as the regular simultaneous component model. To understand this, consider a model with one component: t = Xw which represents an underdetermined system in case I < Σ k J k that can be solved exactly by taking only I non-zero weights. The group lasso operates at the level of the block and therefore does not show this effect. The effect of adding a ridge penalty to the lasso and elitist lasso is visualized in Figure 1: The higher the value of the parameter that tunes the ridge penalty, the lower the fit and the lower the proportion of zeros. In Figure  2 the results for the sparse group lasso (i.e., combination of lasso and ridge penalty) are summarized. The lines express the fit and the proportion of zero weights in function of the lasso tuning parameter with different lines referring to different values of the group lasso. As illustrated by the figure, there is a qualitative interaction between the two types of penalties in the sense that lower values of the lasso parameter have a strong effect on the number of zeros when the group lasso parameter takes lower values while, conversely, higher values of the lasso parameter have a strong effect when the group lasso parameter takes higher values: As the group lasso shrinks the component weights, the penalty for the lasso     becomes lower and hence low values of the lasso tuning parameter are ineffective. Addition of a ridge penalty to the lasso and group lasso parameters may be considered when grouping is important (e.g., as is usual with gene expression data to find modules of co-expressed genes; [27]).

Penalized loadings
A summary of the results obtained for the approach with sparse loadings is given in Table 3. The general result that increasing the tuning parameters yields a decrease in fit and an increase in sparsity also holds here. A comparison between Tables 2 and 3 shows that for an equal proportion of zeros, the fit of models with sparse loadings is (much) lower. This can be understood from the fact that the loadings contribute more directly to the reconstruction of the data than the component weights (compare equations (1) and (2)): for example, in a model with one component, a zero loading results in a zero vector for the reconstructed variable. This also explains why different from the approach based on penalizing the weights with an L 1 penalty, the number of non-zeros is not bounded by I. Table 3 also shows that to obtain zero loadings with the group lasso, high values of the tuning parameter are needed.

Reflections on penalizing the weights versus the loadings
As illustrated, the results obtained under the model with penalized loadings are different from the results obtained under the model with penalized weights. In our view, the most important differences are at the level of data reconstruction and at the level of interpretation. With respect to data reconstruction, the model based on weights yields a better fit while the model with sparse loadings may yield many zero vectors for the reconstructed data. Also, in this respect, the components based on sparse weights have a higher correlation with the components of the ordinary SCA solution than the components resulting from a model with sparse loadings. With respect to interpretation of the underlying components, for the model based on sparse weights this is done in a regression-like way, while for the model based on sparse loadings it is based on considering loadings as correlations of the variables with the component. In ordinary SCA, the loadings are the correlations and in the sparse model we observed a close connection in that zero loadings represent close to zero correlations and higher loadings represent higher correlations. The weights do not have such a relation with the correlation between the variable and the component.

Selection and interpretation of the sparse SCA solution
As has been illustrated in the previous Results Section, the data can be analyzed in many ways depending on choices made with respect to the generic model ((11) or (12)) and with respect to the values of the different tuning parameters. Selection of the appropriate model is of key importance and substantive issues may form a good point of departure. First, concerning the choice of the generic model, the model with penalized weights seems more appropriate for the data at hand because all metabolites can be considered to be involved in the biological processes underlying the data. For applications of  component models with sparse loadings to microarray gene expression data, see [28] and [13]. Second, to choose appropriate values for the tuning parameters we consider the properties of the associated penalties. Having components for which the interpretation is tied exclusively to one type of analytical platform (corresponding to the block structure) is convenient. Also, because for each platform many metabolites result, sparseness within each platform/block is needed. This means that we are interested in selection both across and within groups. Recently, there has been a growing interest for methods that perform such a selection [29,30], with particular interest for the group lasso that has been extended and applied in several ways [31][32][33]. Therefore, we will restrict ourselves to a group lasso type of simultaneous component model, however, including a ridge penalty to account for the fact that grouping is useful (because, within each analytical method, the metabolites belong to several classes of strongly related compounds): λ L > 0, l R > 0, λ G > 0, and l L = 0. Then, we eliminate solutions that 1) yield components with all weights equal to zero, 2) yield components having non-zero weights for both data blocks, and 3) solutions that do not fit well (fit < .40) or that are not sparse (less than 50 percent of zero weights in a block). The remaining solutions are summarized in Table 4  and sparsity. We select these solutions for an interpretation.
The metabolites with non-zero component weights are displayed for both selected solutions in Table 5; Table 6 contains the component scores corresponding to the weights of the solution with f L = 0.5. Observe that the solution with f L = 1 is a further selection of the metabolites in the solution with f L = 0.5. The first component shows an effect of phenyllactate, 3,5-dihydroxypentanoate, and two aromatic amino acids (phenylalanine and tyrosine), together with two branched-chain amino acids (isoleucine and valine); the corresponding component scores (see C1 in Table 6) show a clear increasing linear effect of fermentation time. The second component is made up by metabolites like fumarate, malate, aspartate and are associated to succinate catabolism (see C2 in Table 6) making biological sense as these metabolites are close to succinate in central metabolism. For C3, we find non-zero weights for a large number of (unidentified) disaccharides and pyruvate and lactate and high scores in the oxygen related conditions. The identification of pyruvate and lactate could be indicative of a changing, i.e. reduced, dissolved oxygen concentration in the course of the fermentation as pyruvate can be converted into lactate during anaerobic growth. The fourth component is made up by nucleotides important for the energy metabolism in a cell (i.e. ADP, GDP, UDP) and is associated to the growth condition with an elevated pH at the early (16hrs) phase. Finally, C5 seems specific for the wild type strain, although the relation to the metabolites guanine and thymine (both nucleobases) and the other metabolites is not very clear.

Simulated data
To validate the proposed sparse simultaneous component method, we make use of simulated data. The general setup is that data are generated under some specific conditions and with known structure; after addition of noise, the performance of the method in terms of recovering the underlying structure is assessed. Here, we are particularly interested in two aspects: A first one is whether the penalties reflect the structure in the selection of the variables (i.e., between data blocks; within data blocks; or both between and within data blocks); a second one is the behavior of the method in function of the model (i.e., sparse weights or sparse loadings). We also manipulated the amount of error in the data (5 and 30 percent) and the degree of sparseness (50 and 90 percent of zero weights/loadings). All factors were fully crossed and for each of the resulting 2 × 3 × 2 × 2 = 24 conditions, 5 data sets were generated, resulting in a total of 120 data sets. To obtain a realistic simulation, we generated the data using the metabolomics data described in the previous section. 28 samples were sampled with replacement from the original data; then a singular value decomposition was performed to obtain three components: the three loading and weight vectors were obtained as the three right singular vectors corresponding to the three largest singular values and multiplied by these, the three component score vectors were set equal to the corresponding left singular vectors. Sparseness was imposed by setting either weights or loadings equal to zero as follows: In case of sparseness between blocks, all weights/loadings of the first component that correspond to the first data block (the first 144 weights/loadings) were set equal to zero and for the second and third component the weights/loadings corresponding to the second data block (the last 44 weights/ loadings) were set equal to zero; in case of sparseness within blocks, 50 or 90 percent of variable indices were randomly sampled and their corresponding weights/ loadings were set equal to zero; in case of sparseness within and between data blocks, the two previous   strategies were combined. The resulting component loadings and weights were used to generate the true data part using the model part of expressions (1) and (2) (i.e., without the addition of the residual matrices). Noise was then added to this true part of the data with the noise being generated from a normal distribution with mean zero and variance such that these residual matrices account for 5 or 30 percent of the total variation [34]. Each of the data sets was analyzed under both models (sparse weights or sparse loadings) and with varying values for the tuning parameters (f equal to 0, 10 -3 , 0.1, 0.5, and 10). The Elitist lasso penalty was only combined with the ridge penalty because it interferes with the lasso and group lasso (see earlier).
In the discussion of the results of the simulation study, we first focus on the conditions where the data are generated and analyzed under the same model (either sparse weights or sparse loadings), the error amounting to 30 percent of the total variation in the data, and the ridge penalty set equal to the smallest non-zero value. Figures 3 and 4  versus dropped) in function of the value of the tuning parameter. Figure 3 refers to the case with 50 percent zero weights/loadings, Figure 4 to the case with 90 percent zero weights/loadings. In each Figure, the different panels refer to the different combinations of structure in the variable selection (from top to bottom: within blocks, between blocks, within and between blocks) and of sparseness approach (from left to right: lasso, group lasso, Elitist lasso, and sparse group lasso). The panels referring to the sparse group lasso are with varying values for the lasso tuning parameter and with the group lasso tuning parameter fixed at f G = 10. In general, the results confirm the expected relation between the structure of the variable selection and the different approaches to sparseness: The best recovery for selection within blocks is by Elitist lasso with a value of 0.5 for the tuning parameter f E , for selection between blocks is by the group lasso with f G = 10, and for selection between and within blocks the sparse group lasso (f L = 0.1 for the lasso). Deviations from the expected behavior occur for the sparse group lasso when selection is both within and between blocks in case of many zeros (see Figure 4): the lasso and Elitist lasso then outperform the group lasso. This can be attributed to the fact that the group lasso is less aggressive than the lasso and Elitist lasso [11]. On the other hand, the lasso and Elitist lasso perform less well when selection is within blocks and the true structure is not so sparse (50 percent of zeros, see the top row of Figure 3) because of their aggressive behavior. Note that a penalty that selects between groups in a more aggressive way was proposed by [11]. The same pattern of results is obtained when the error amounts to 5 percent (though shifted upwards as in these conditions the status of the variables is better recovered) or when the tuning parameter of the ridge penalty takes higher values. In case the ridge equals zero, the box plots show worse results for the lasso and Elitist tuning parameters equal to zero (because there are more variables than objects thus at most 28 nonzero values are obtained for the approach based on sparse weights). A second point of interest, is the influence of the model used to generate and analyze the data. Figure 5 displays four panels of boxplots for the proportion of correctly classified variables. Within panels, the boxplots are displayed in function of the block structure present in the variable selection. The upper panels refer to data generated under a model with sparse loadings, the lower panels to data generated under a model with sparse weights. The panels at the left were obtained when analyzing the data with a model based on sparse weights and at the right with sparse loadings. In general, analyzing the data with the sparse weights model yields less misclassifications than using the sparse loadings model. However, generating the (underlying) data under a model with sparse weights, in general, results in more misclassifications than generating under a sparse loadings model. These results can be explained by the more direct relation between the loadings and generated or modeled data: Generating the data with sparse loadings imposes a clearer structure than generating them with sparse weights; analyzing/modeling the data with sparse loadings imposes a stronger structure on the modeled data than modeling them with sparse weights. This is because 1) unlike a zero loading, a zero weight for a variable does not necessarily imply a modeled score of zero, because a zero weight for one variable can be compensated by nonzero weights for other variables, and 2) unlike shrinking the weights, shrinking the loadings results more directly in shrunken modeled scores. The latter can be explained by the dependence of the scale of the data, as modeled by PCA model (1), on the scale of the loadings (the model has orthonormal component scores).

Discussion
We proposed an extension of sparse PCA to the context of several data blocks, relying on a generic modeling framework that allows either for sparse component weights or for sparse component loadings and that incorporates several approaches that were taken to sparsity in the regression literature (including the lasso, elastic net, group lasso, Elitist lasso, and sparse group lasso). A very flexible algorithm was developed that allows to analyze the data under a variety of approaches that take the structure of the data in different ways into account. It also allows for combinations of penalties that were not yet considered in the regression literature.
The flexibility of the approach is important as often a particular kind of structure is needed from data integration methods. The group lasso is a popular tool to find structures that only involve one data block. This is for example relevant in comparative genomics when the focus is on divergence [35] or on tissue-specificity [36]. Elitist lasso, on the other hand, finds sparse structures that involve each of the data blocks. Not only is this of relevance in the aforementioned case of comparative genomics to find conserved processes, but also in a topdown systems biology approach. For example, to integrate microarray gene expression data and interaction data with the aim of finding transcription factors and their target genes [37].
Although the model and algorithm were proposed in the context of simultaneous component analysis, it can be easily translated to the context of principal component analysis and also of regression analysis. In fact the algorithm can be used as it is for PCA and the adaptation to regression analysis is a minor one. In the context of simultaneous component analysis, adaptations of the model (and algorithm) to a context that allows for different values of the tuning parameter for each component and/or each block would be valuable. However, such an extension is not trivial. Moreover, the problem of selecting an optimal model becomes more difficult in that more parameters need to be tuned and this would make the choice of selecting appropriate values for the tuning parameter even more difficult than it already is. A major theoretic challenge for many sparse methods is to find a good method to select the value of the tuning parameters.

Conclusions
We offered a flexible and sparse framework for data integration based on simultaneous component methods. The method is flexible both with respect to the component model and with respect to the sparse structure imposed: Sparsity can be imposed either on the component weights or on the loadings, and can be imposed either within data blocks, across data blocks, or both within and across data blocks. As such, it allows to find structures exclusively tied to one data platform as well as structures that involve all data platforms. A penalty based approach is used that includes the lasso, the ridge penalty, the group lasso, and Elitist lasso. The method includes principal component analysis, sparse principal component analysis, and ordinary simultaneous component analysis as special cases. Real and simulated data were used to validate the method. We believe the method offers a very flexible and versatile tool for many data integration problems.

Methods
Here we derive the estimates used in the alternating least squares and iterative majorization algorithm. First, it is shown how the conditional estimates for the objective function relying on sparse component weights can be obtained and then for the objective function relying on sparse loadings.

Sparse component weights
The generic objective function that we rely on to find a simultaneous component solution with sparse component weights is to minimize with respect to W c and P c and under the constraint P T c P c = I. λ L , λ R , λ G , and λ E are considered to be known non negative constants. We use an alternating approach in which each set of parameters is updated in turn while keeping the remaining sets fixed on their last update. Let P c be the first set to be updated, conditionally upon fixed values for W c . Rewriting (13) gives with the terms that are constant with respect to P c . Using P T c P c = I yields with k 2 = k 1 + trW T c X T c X c W c . The minimization of (15) under the constraint of orthogonal loadings is equivalent to the maximization of trW T c X T c X c P c under the same constraint. This is a problem with known closed form solution [22] P c = VU T (16) with U and V the left and right singular vectors of W T c X T c X c . The minimization of (13) with respect to W c is not a standard problem due to the Lasso, Group Lasso, and Elitist Lasso penalties on W c . We will make use of a numerical procedure, known as Majorization Minimization (MM) or also Iterative Majorization, which has been proven to be a superior algorithmic strategy in regularization problems [25,38]. Briefly stated, MM replaces functions that are complicated to minimize by surrogate functions that are easy to minimize, that lie on/above the original function, and that touch the original function in the so-called supporting point. These properties lead to the sandwich inequality [23].
A useful property of majorizing functions is that a sum of terms can be majorized by majorizing the terms [39]. Therefore, a majorizing function can be obtained for (13) by finding a linear or quadratic majorizing function for each of the penalty terms except the ridge. First we consider the Lasso penalty: λ L W c 1 = hk,r,k λ L w j k r k . Applying the additivity property again, we need to find a majorizing function for w j k r k . Such a function is [40] w j k r k ≤ 1 2 with w o j k r k the current estimate of w j k r k that was obtained in the previous iteration. This yields λ jk,r,k with the Vec notation indicating that the matrix is vectorized, with k 3 = j k ,r,k Second, we consider the k Group Lasso penalty terms . A majorizing function for the root is (see [39]) with with D 3 a diagonal matrix containing the Combining (13) with the results (18), (19), and (20), we obtain the following majorizing function for (13): with D sup = λ L 2 D 1 + λ G 2 D 2 + λ E D 3 + λ R I, I, an identity matrix, and k = k 3 + k 4 + k 5 . This function can be minimized with respect to W c by finding the value for which the partial derivative of (21) is zero. The partial derivative equals and is equal to zero for where the inverse is taken of a block-diagonal matrix. W c is obtained by rearranging Vec(W c ). Note that the second derivative is positive so (23) is a minimum of (21). In this equation, the penalty terms occur as diagonal matrices that are summed together in the matrix D sup and with the variance-covariance matrix of the data; the resulting matrix is inverted and will be dominated by large values on the diagonal (yielding small values after inversion). This shows the behavior of the penalties: increasing the tuning parameters results in such large diagonal values; furthermore, the diagonal matrices themselves are inverse functions of the weights in the previous iteration of the algorithm such that small weights further enhance the shrinkage or selection. Note that the matrix to be inverted in equation (23) is of the form D + A T A with D a diagonal matrix; then, the following holds [41], which may be useful when J k >I.

Sparse loadings
The generic objective function that we rely on to find a simultaneous component solution with sparse component weights is to minimize with respect to T and P k under the constraint that T T T = I. λ L , λ R , λ G , and λ E are considered to be known non negative constants. In case all tuning parameters are equal to zero, a regular simultaneous component analysis results and in that case the algorithm should be based on SVD of the concatenated data. We use an alternating approach in which each set of parameters is updated in turn while keeping the remaining sets fixed on their last update. Let T be the first set to be updated, conditionally upon fixed values for P c . Rewriting (25) gives with k 6 = tr X T c X c + P c P T c + λ L P c 1 + λ R P c 2 2 + k λ G J k P k 2 + λ E P k 1,2 the terms that are constant with respect to T. Minimizing function (26) is equivalent to maximizing trP T c X T c T with known closed form solution [22] T = VU T (27) with U and V the left and right singular vectors of P T c X T . Combining (25) with the results (18), (19), and (20) adapted to the case of loadings, we obtain the following majorizing function for (25) and the first derivative of Q(T, P c ) with respect to P c is equal to zero for Additional material