A structured overview of simultaneous component based data integration
 Katrijn Van Deun^{1},
 Age K Smilde^{2},
 Mariët J van der Werf^{3},
 Henk AL Kiers^{4} and
 Iven Van Mechelen^{1}
DOI: 10.1186/1471-2105-10-246
© Van Deun et al; licensee BioMed Central Ltd. 2009
Received: 24 April 2009
Accepted: 11 August 2009
Published: 11 August 2009
Abstract
Background
Data integration is currently one of the main challenges in the biomedical sciences. Often different pieces of information are gathered on the same set of entities (e.g., tissues, culture samples, biomolecules) with the different pieces stemming, for example, from different measurement techniques. This implies that more and more data appear that consist of two or more data arrays that have a shared mode. An integrative analysis of such coupled data should be based on a simultaneous analysis of all data arrays. In this respect, the family of simultaneous component methods (e.g., SUM-PCA, unrestricted PCovR, MFA, STATIS, and SCA-P) is a natural choice. Yet, different simultaneous component methods may lead to quite different results.
Results
We offer a structured overview of simultaneous component methods that frames them in a principal components setting such that both the common core of the methods and the specific elements with regard to which they differ are highlighted. An overview of principles is given that may guide the data analyst in choosing an appropriate simultaneous component method. Several theoretical and practical issues are illustrated with an empirical example on metabolomics data for Escherichia coli as obtained with different analytical chemical measurement methods.
Conclusion
Of the aspects in which the simultaneous component methods differ, preprocessing and weighting are consequential. In particular, the type of weighting of the different matrices is essential for simultaneous component analysis. These types are shown to be linked to different specifications of the idea of a fair integration of the different coupled arrays.
Background
Recently, technological developments have led to a situation where data analysts in different domains face data that are more and more complex. A special case of complex data are coupled data that consist of different data matrices for the same set of variables or experimental units. In systems biology, an example of matrices sharing the same set of variables is the study of the expression profile of a certain organism (e.g., Saccharomyces cerevisiae) on the basis of, on the one hand, different microarray compendia that can be downloaded from public repositories and, on the other hand, ChIP-chip or motif data [1, 2]. An example of data matrices with shared experimental units are metabolomics data (e.g., the metabolome of Escherichia coli) gathered from different fermentations using mass spectrometry (MS), with different MS data sets being available from different separation methods (e.g., gas chromatography and liquid chromatography [3]). In the first example each of the data matrices provides information on the same transcriptome and in the second example on the same set of metabolites, with some parts of the information being common to the different data matrices and some parts being specific: For example, gas chromatography mass spectrometry (GC/MS) and liquid chromatography MS (LC/MS) in general measure both a few classes of common compounds and many classes of compounds that are measured by one of the two methods only [3, 4].
A major challenge for researchers dealing with such coupled data is to represent them in such a way that both the shared and the specific information contained in the different data matrices is captured (with all information in question pertaining to variance within each of the matrices under study). For example, in the case of coupled gene expression and ChIP-chip data one may wish to retrieve modules of genes that have the same transcription factors and that are co-regulated under the same conditions, which is common information contained in the transcriptome and ChIP-chip data matrices; in the metabolomics example, a coupled analysis of gas and liquid chromatography MS data should make it possible to highlight the classes of compounds that are measured by both separation methods, as well as those that are measured by only one of them.
Several tools are available that can be used for the analysis of coupled data. Here we will focus on methods that simultaneously extract components from all data blocks. Examples of such methods include SUM-PCA [5], unrestricted PCovR (Gurden: Multiway covariates regression, unpublished), SCA-P [6], multiple factor analysis [7], and STATIS [8]. Whereas all these methods are based on the idea of a simultaneous component extraction, they have been developed independently in different disciplines (including chemometrics and psychometrics) and rely on different terminologies and mathematical frameworks. As a consequence, comparing them is not straightforward. The primary objective of this paper is to provide a structured overview in which all the methods fit, and to highlight their common core and particularities.
The paper starts by introducing some terminology to delineate the types of data to which the methods are applicable; then, a general framework is introduced that encompasses all the different simultaneous component methods and that frames them mathematically in a principal components setting. Then, each of the methods is discussed with respect to this framework. An application is presented on simultaneous component analyses of gas and liquid chromatography MS data; in this application we compare the results obtained by applying the different methods and discuss how to interpret the results obtained by one of the methods (multiple factor analysis).
Methods
Some terminology
A structural overview of Simultaneous component methods
In this section, first a general framework is presented that encompasses all simultaneous component methods; then each of the methods is discussed with respect to the general framework.
The general framework
The aspects that constitute the general framework are 1) the data to which the methods apply, 2) preprocessing steps included in the methods, 3) a general mathematical model for the data, 4) the objective function that is needed to estimate the model parameters, 5) identification constraints to obtain a unique solution for the parameters, and 6) the algorithmic strategy used to derive the model parameters (on the basis of the objective function, subject to the identification constraints).
We will follow the notation introduced in [9], denoting matrices by bold upper case and vectors by bold lower case letters, and using indexes for which the cardinality is indicated by the capital of the letter used to run the index. For example, the different data matrices will be indexed by k, with k running from 1 to K.
The data
As discussed in section 2, simultaneous component methods are applicable to data consisting of at least two two-way two-mode data matrices that have one mode in common. The common mode can be either the object or the variable mode.
Preprocessing
The data might be preprocessed in order to correct for irrelevant differences between variables or matrices. For example, when different measurement units are used to measure the variables, the resulting differences in offset and scale might be accounted for by centering each variable (such that the resulting variable has mean zero) and by scaling to sum of squares one or to range one [3] (the combined operation of centering and scaling to sum of squares one is usually called autoscaling or standardizing); see also [10]. Coupled matrices may also differ in size, such that the results of a particular statistical analysis may be dominated by the largest matrices; to correct for this effect, a possible strategy is to scale each data matrix to sum of squares one.
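These preprocessing steps can be sketched as follows; this is a minimal NumPy illustration, where the function names and the example matrices are ours and not from the paper:

```python
import numpy as np

def autoscale(X):
    """Center each variable (column) and scale it to sum of squares one."""
    Xc = X - X.mean(axis=0)
    return Xc / np.sqrt((Xc ** 2).sum(axis=0))

def scale_block(X):
    """Scale a whole data matrix to total sum of squares one."""
    return X / np.sqrt((X ** 2).sum())

rng = np.random.default_rng(0)
X1 = rng.normal(size=(10, 5))   # e.g., a GC/MS block (hypothetical data)
X2 = rng.normal(size=(10, 3))   # e.g., an LC/MS block (hypothetical data)

# autoscale per variable, then give each block an equal total sum of squares
blocks = [scale_block(autoscale(X)) for X in (X1, X2)]
```

After these two steps every variable has mean zero and every block contributes the same total sum of squares, whatever its number of variables.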
The model
Except for the equality constraint on the common component scores in (2) or on the loadings in (3), identification constraints (e.g., orthogonality or orthonormality) might be applicable, but the latter do not affect the data reproduced by the model.
Objective function
under the restriction that either T_{1} = ... = T_{ K }= T (common object mode) or P_{1} = ... = P_{ K }= P (common variable mode).
Identification constraints
In general it holds for any nonsingular matrix B that if T_{ k } and P_{ k } are solutions of (5), T_{ k }B and P_{ k }(B^{-1})^{ T } are solutions too (e.g., B can be a rotation matrix). Examples of identification constraints often encountered in practice are that the component scores and loadings are expressed with respect to orthogonal axes in the direction of the highest variance (principal axes orientation) and that the component scores or loadings are orthonormal.
Model estimation
Optimal parameters that minimize (5) can be obtained either by a singular value decomposition (SVD) of the concatenated weighted data matrices or by a two-step approach in which first the common structure is found by an eigendecomposition of the sum of cross-product matrices and second the K companion matrices are found by means of a suitable regression analysis. Both strategies can be used for data with a coupled object (row) mode as well as for data with a coupled variable (column) mode, and they result in the same solution. We refer to the additional file (see additional file 1: Estimation.pdf) for details on both estimation procedures and a proof of their equivalence.
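For data with a common object (row) mode, the SVD-based strategy can be sketched as follows. This is our own minimal implementation with hypothetical names; the weights w_k are assumed to be prespecified:

```python
import numpy as np

def sca(blocks, weights, R):
    """Simultaneous component analysis of blocks sharing the object (row)
    mode: an SVD of the column-wise concatenation of the weighted blocks."""
    Xconc = np.hstack([w * X for w, X in zip(weights, blocks)])
    U, s, Vt = np.linalg.svd(Xconc, full_matrices=False)
    T = U[:, :R]                               # common component scores, T^T T = I
    P_conc = Vt[:R].T * s[:R]                  # concatenated loadings
    sizes = np.cumsum([X.shape[1] for X in blocks])[:-1]
    return T, np.split(P_conc, sizes, axis=0)  # block-specific loadings P_k
```

With R equal to the rank of the concatenated matrix, T P_k^T reproduces the weighted block w_k X_k exactly; with a smaller R it gives the least-squares approximation.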
Specific simultaneous component methods
In the previous section we have set up a general framework for a simultaneous component representation of data that consist of at least two matrices sharing a common mode. Here, we will show how five published methods are specializations of the general framework. These are SUM-PCA [5], unrestricted principal covariates regression (unrestricted PCovR; see [11] and also Gurden: Multiway covariates regression, unpublished), multiple factor analysis (MFA [7, 12]), STATIS [8], and SCA-P [6].
The general framework includes six aspects: (1) the data, (2) preprocessing, (3) the model, (4) the objective function, (5) the identification constraints, and (6) the algorithmic strategy. We will discuss the published methods with respect to the specific choices they make for the first, second, third, and fifth aspect. The fourth aspect (the objective function) is not discussed because it is the same for all methods; the sixth aspect (the algorithmic strategy) is not discussed either, because the choice made with respect to this aspect does not affect the obtained solution.
SUM-PCA
A first published method that fits within our general framework is SUM-PCA. A confusing element, though, is that a similar name has been used for two different methods: this paragraph bears on SUM-PCA (with a hyphen) as proposed in the chemometric literature [5] and not on SUMPCA as proposed in the psychometric literature [13, 14]; we will come back to the latter method when discussing SCA-P.

Data
SUM-PCA [5] was developed for data that are linked in the row mode.

Preprocessing
The preprocessing steps consist of first autoscaling the data per variable and, second, scaling each data block to sum of squares equal to one [5].

Model
which implies prespecified weights w_{ k } = 1. Note, however, that the preprocessing step of scaling each block to sum of squares one, applied to autoscaled data, is equivalent to using weights on the autoscaled data; this means that blocks with more variables are down-weighted more than blocks with fewer variables, in order to avoid that larger blocks dominate the solution.
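This equivalence can be verified numerically (a sketch with made-up data): an autoscaled block with J_k variables has total sum of squares J_k, so scaling it to sum of squares one amounts to applying the weight w_k = 1/√J_k, which down-weights larger blocks more.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(10, 4))
Xc = X - X.mean(axis=0)
Xa = Xc / np.sqrt((Xc ** 2).sum(axis=0))  # autoscaled: each column has SS one

# scaling the autoscaled block to total sum of squares one ...
X_scaled = Xa / np.sqrt((Xa ** 2).sum())

# ... equals applying the weight 1/sqrt(J_k) to the autoscaled block
J_k = Xa.shape[1]
assert np.allclose(X_scaled, Xa / np.sqrt(J_k))
```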

Identification constraints
The SUM-PCA model (6) is estimated under the identification constraints of a principal axes orientation and orthonormality of the common component scores: T^{ T }T = I. The latter constraint is unusual in chemometrics, where it is common practice to have orthonormal loadings. However, it implies that the method finds exactly the same global scores as CPCA-W, a method that extracts the components in a sequential way (for a proof of the equivalence, see [5]).
Unrestricted PCovR

Data
Principal covariates regression (PCovR) was proposed for the analysis of data consisting of a matrix of dependent variables X_{1} and a matrix of independent variables X_{2} for the same set of objects [11].

Preprocessing
In [11], different options are mentioned for preprocessing on the level of the variables, including centering, scaling and autoscaling.

Model
with the common component scores T_{1} being restricted to belong to the column space of X_{1} and with 0 ≤ β ≤ 1. Such a restriction does not fit within the general framework as outlined above; yet, in the more general context of multiway covariates regression, an unrestricted model has also been proposed in which this restriction is dropped (Gurden: Multiway covariates regression, unpublished); clearly, the unrestricted PCovR model fits within the general framework. The prespecified weight is determined by a strategy that minimizes the cross-validation error for the prediction of X_{2}; this introduces some asymmetry in the treatment of X_{1} and X_{2}. Both simulation studies (Gurden: Multiway covariates regression, unpublished) and empirical results [15] showed the proposed cross-validation approach to be unsatisfactory, because it tends to yield β-values close to zero or one, implying that all weight is placed on a single matrix.

Identification constraints
In [11], it is suggested to constrain the common component scores to be column-wise orthonormal: T^{ T }T = I.
MFA

Data
Multiple Factor Analysis or MFA [7, 12, 16], also known as Analyse Factorielle Multiple or AFM, was proposed in the French literature as a method for the analysis of data consisting of several sets of variables for the same group of subjects (common object mode). Recently, the method was applied to integrate distinct omics data [17].

Preprocessing
The variables are supposed to be autoscaled.

Model
which is the correction used by MFA. Note that scaling the matrices (e.g., to unit sum of squares) has no influence on the MFA results, as the first singular value of f times X_{ k } equals fσ_{ k } and the first step of MFA is to divide each block by this singular value.
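The MFA weighting and its invariance to block scaling can be sketched as follows (our own illustration with hypothetical data):

```python
import numpy as np

def mfa_weight(X):
    """MFA weight for a block: the inverse of its largest singular value."""
    return 1.0 / np.linalg.svd(X, compute_uv=False)[0]

rng = np.random.default_rng(2)
X = rng.normal(size=(8, 4))

# Rescaling a block by a factor f does not change its weighted version:
# the first singular value of f*X is f*sigma, so (f*X) / (f*sigma) = X / sigma.
f = 5.0
assert np.allclose(mfa_weight(X) * X, mfa_weight(f * X) * (f * X))
```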

Identification constraints
MFA estimates the parameters under a principal axes constraint and orthonormality of the component scores: T^{ T }T = I.
STATIS

Data
STATIS was proposed in the French literature [8] (see also [13] for an English description of the method). In both publications, three-way data are used, but the authors note that the method is also applicable to coupled two-way data matrices with a common object mode.

Preprocessing
In [8], nothing is mentioned on preprocessing. Centering and scaling to unit variance are mentioned as options of the STATIS (ACT) method in [18]; a weighting to account for differences in size of the matrices is described in [19].

Model
with a_{ k } being the weight associated with the k-th data matrix. These weights are obtained from the first component of a PCA of a matrix that is derived from the data in the following way: 1) Derive the cross-product matrices S_{ k } = X_{ k }X_{ k }^{ T } (note that for vectors of autoscaled object scores, this would be the matrix of correlations between objects). 2) Construct the matrix F of size N^{2} × K by inserting the vectorized matrix S_{ k }, formally written as vec(S_{ k }), in the k-th column. The weights a_{ k } are the loadings on the first principal component of F and can be obtained from the first right singular vector of F or from an eigendecomposition of the K × K matrix S = F^{ T }F (note that the element on the intersection of row k and column k' of F^{ T }F equals vec(S_{ k })^{ T }vec(S_{ k' })). Larger weights can be expected for: (a) data matrices with larger values (for such matrices, the values on the diagonal of the cross-product matrix S_{ k }, and thus the loading associated with the k-th column of F, vec(S_{ k }), will be larger), (b) larger matrices, (c) data matrices with more covariation between the vectors of object scores, and (d) matrices whose cross-product matrices are more similar to those of the other matrices (K > 2). The latter property is the motivation for the proposed weighting strategy: STATIS wants to find a compromise of the different cross-product matrices S_{ k }. This idea of a compromise or a consensus is prominent in the analysis of three-way data and underlies Generalized Procrustes analysis (GPA) and Generalized Canonical Correlation analysis [20]. See [21, 22] for a comparison of STATIS and GPA.
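The computation of the STATIS weights can be sketched as follows (a minimal illustration; the variable names are ours). Because the entries of F^T F are inner products of positive semi-definite matrices and hence non-negative, the dominant eigenvector can be taken with non-negative entries:

```python
import numpy as np

def statis_weights(blocks):
    """STATIS weights: loadings on the first principal component of the
    matrix F whose k-th column is the vectorized cross-product matrix
    S_k = X_k X_k^T."""
    F = np.column_stack([(X @ X.T).ravel() for X in blocks])
    S = F.T @ F                       # K x K, entries vec(S_k)^T vec(S_k')
    evals, evecs = np.linalg.eigh(S)  # eigh returns ascending eigenvalues
    a = evecs[:, -1]                  # eigenvector of the largest eigenvalue
    return a if a.sum() >= 0 else -a  # fix the sign so the weights are positive
```

For two identical blocks this yields equal weights, in line with the compromise idea.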

Identification constraints
STATIS looks for simultaneous components with a principal axes orientation and under the restriction of orthonormal loadings (P_{ conc }^{ T }P_{ conc } = I, with P_{ conc } representing the matrix of concatenated block-specific loadings).
SCA-P

Data
In the psychometric literature, SCA-P was proposed as a method for the analysis of multiple data matrices obtained by measuring the same set of variables in different groups; the variable mode is therefore considered to be the common mode [6]. The acronym SCA-P stands for simultaneous component analysis with a common pattern matrix, where "pattern matrix" is a psychometric term for "loading matrix".

Preprocessing
In the original applications of SCA-P, where the main interest is to account for variation within the matrices X_{ k }, it is recommended to autoscale each variable per data block [6]. Another proposed strategy is to center each variable per matrix and to scale the variables to sum of squares ∑_{ k }I_{ k } for the concatenated data, that is, over matrices [23]. The latter strategy preserves the variability within the matrices X_{ k }. In the remainder of the paper we will refer to SCA-P with a variable-wise autoscaling per block.

Model

Identification constraints
SCA-P looks for simultaneous components that have a principal axes orientation and orthonormal component scores (T_{ conc }^{ T }T_{ conc } = I, with T_{ conc } representing the matrix of concatenated block-specific component scores).
Reflection on the general framework
Characterization of published simultaneous component methods in function of the general framework.
Method  Common mode  Preprocessing  Matrix-specific weights  Identification constraint

SUM-PCA  object  Variables: autoscaling; matrices: scaled to sum of squares one  All w_{ k } = 1  Principal axes; T^{ T }T = I
unr. PCovR  object  Variables: autoscaling  Minimize cross-validation error  Principal axes; T^{ T }T = I
MFA  object  Variables: autoscaling  Inverse of largest singular value  Principal axes; T^{ T }T = I
STATIS  object  None  Compromise weights  Principal axes; P_{ conc }^{ T }P_{ conc } = I
SCA-P  variable  Variables: autoscaling  All w_{ k } = 1  Principal axes; T_{ conc }^{ T }T_{ conc } = I
Structured overview of the methods
A first aspect concerns the labeling of the common mode as the object or as the variable mode. Ignoring for a moment what happens on the level of preprocessing and weighting, but taking into account whether the orthonormality restriction is imposed on the common mode or on the other mode, changing the label of the common mode from object to variable or vice versa turns STATIS into SCA-P and vice versa. Second, with respect to preprocessing, all methods except STATIS autoscale each variable. Furthermore, SUM-PCA additionally scales each matrix to unit sum of squares (alternatively, this can also be considered a matrix-specific weighting strategy). Third, except for SCA-P, all methods use a distinct matrix-specific weighting strategy (note that, as indicated above, SUM-PCA can be considered to do this weighting via the preprocessing step). Fourth, all methods use a principal axes identification constraint but differ in their choice of putting an orthonormality constraint either on the common structure (SUM-PCA, unr. PCovR, and MFA) or on the concatenated structure (STATIS and SCA-P); as discussed previously, however, this affects the scaling of the components only and yields the same reconstructed data.
Consequential differences between the methods
Next, one may wonder for which of the aspects discussed above the choices may be consequential for the obtained results. A first such aspect is the type of preprocessing, with all methods except STATIS relying on a variable-wise autoscaling. This means that the solutions obtained with STATIS can be dominated by the variables with the largest sums of squares; for that reason, these solutions can be very different from the results obtained with the other methods. A second possibly consequential aspect is the matrix-specific weighting strategy. This can strongly influence the obtained results, with, as an extreme, all weight being put on a particular matrix, which results in a structure of the common mode (T or P) that equals the structure obtained from a separate component analysis of that matrix. Note that in case the R dominant directions of the different data matrices span approximately the same subspace, a matrix-specific weighting will have little impact on the obtained results (Escoufier's RV coefficient can be used to measure this dependency between the data matrices [25, 26]). Third, although the labeling of the common mode has no direct consequential effect, it may have important indirect consequences through what happens on the level of preprocessing. Indeed, methods that perform autoscaling do this in the direction of the variables: Labeling the common mode as 'variables' results in autoscaling in the direction of the common mode, while labeling it as 'objects' results in autoscaling in the other direction.
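Escoufier's RV coefficient mentioned above can be computed as follows (a sketch; the function name is ours):

```python
import numpy as np

def rv_coefficient(X, Y):
    """Escoufier's RV coefficient between two matrices sharing the row mode:
    a matrix analogue of a squared correlation, between 0 and 1."""
    Sx, Sy = X @ X.T, Y @ Y.T
    return np.trace(Sx @ Sy) / np.sqrt(np.trace(Sx @ Sx) * np.trace(Sy @ Sy))
```

The RV coefficient equals 1 when the object configurations of the two matrices are identical up to rotation and rescaling, in which case matrix-specific weighting hardly matters.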
Furthermore, matrix-specific weighting operates differently for autoscaled data with a common object mode than for autoscaled data with a common variable mode. To understand this, let the row elements (which may pertain to either objects or variables) be common, such that the common structure (T or P) can be derived from the sum of the cross-product matrices X_{ k }X_{ k }^{ T }: when the row mode is the variable mode, X_{ k }X_{ k }^{ T } is a correlation matrix, and when the row mode is the object mode it will be a cross-product matrix. Now, as the size of a correlation does not depend on the sample size, in case of a common variable mode larger matrices will not necessarily have a larger contribution; on the other hand, entries of cross-product matrices may take larger values in case of larger matrices. As a consequence, in case of a common object mode, larger matrices may contribute more to the sum; hence, in such a case, unlike in a case with a common variable mode, a down-weighting of larger matrices may be desired. Note that in the description of the methods used here, we are strict in the choices made on the level of preprocessing. The methods, however, could also be defined on the basis of the matrix-specific weighting strategy apart from any preprocessing strategy. Then, the labeling of the common mode is not consequential (such that SCA-P can also be used for data that are coupled in the object mode, while SUM-PCA, unr. PCovR, MFA, and STATIS can also be used for data coupled in the variable mode).
Reflection on preprocessing and weighting
The proposed preprocessing and weighting strategies reflect data-analytic concerns to correct for possible disturbing factors on the level of the variables and on the level of the matrices. First, on the level of the variables a disturbing factor may stem from different variables being measured by different measurement techniques or with different measurement scales. Often such differences are irrelevant from a substantive perspective, but they may nevertheless distort the results, which may necessitate a suitable type of preprocessing. Now STATIS as originally published [8], unlike all other published methods, does not autoscale the variables; therefore, in case of irrelevant differences between variables in measurement scale, STATIS should not be used, or only in combination with a preliminary centering, scaling, or autoscaling of the data as proposed in later publications on the method [18, 19].
Principles to realize a fair integration of different data matrices.
Principle  Methods aiming at this principle 

Same weight for all matrices (naive approach)  SCA-P
More weight for smaller matrices  SUM-PCA, MFA
More weight for less redundant matrices  MFA 
More weight for matrices with more stable predictive information  PCovR 
More weight for matrices that share more information with other matrices (K > 2)  STATIS 
It can be useful to apply these principles of fairness in a flexible way. For example, different principles may be combined when there are different reasons for unwanted dominance of a matrix. This is the case for the weighting strategy underlying MFA; another example is a modified weighting strategy for STATIS that corrects for differences in size between the matrices and also gives more weight to matrices that have more in common with other matrices [21, 22]. Also, different principles can be used for different data matrices. For example, suppose that a particular matrix is very noisy, that all matrices have equal sum of squares, and that the interest is in avoiding redundancies. In the noisy (i.e., extremely heterogeneous) block all singular values will be almost equal, such that applying the principle of more weight for the less redundant matrices will give most weight to the noisy data block. In such a case one may consider using the principle of more weight for the matrices with more stable predictive information in order to give less weight to this particular block, and additionally using the principle of avoiding redundancies for the other blocks.
Results
In this Section we will apply each of the existing methods to an example biological dataset. First the data are described, then each of the methods is applied and the solution obtained with MFA is interpreted.
Description of the data
Phenylalanine production in Escherichia coli NST 74 and a wild-type strain was studied with the aim of identifying bottlenecks in the production of this compound [3]. Metabolomes were screened at different fermentation times and obtained under various environmental conditions using both gas chromatography (GC) and liquid chromatography (LC) in combination with mass spectrometry (MS). In general, GC/MS and LC/MS methods are known to detect different classes of chemical compounds, although some classes are detected by both methods [4]. The data consist of two coupled data matrices: a GC/MS matrix with the measurements of 144 metabolites (including 13 intermediates of the phenylalanine biosynthesis route) and an LC/MS matrix with the measurements of 44 metabolites, both for the same 28 samples of E. coli (common object mode); no metabolite was measured on both platforms, hence there is no trivial overlap. In the data considered here, only those metabolites that were detected in at least 20 percent of the experiments were used; furthermore, the data were manually curated and normalized as described by [27]. Measurement values below the detection threshold were set equal to one half of the smallest detected value [3]: 687 (17 percent) of the GC measurements and 34 (3 percent) of the LC measurements were below the limit of detection. Because many values were below the detection limit for a few metabolites in the GC data, some extreme outliers were observed. To deal with skewness and asymmetry, all values were log-transformed. No influential outliers were found among the log-transformed values.
Application of different simultaneous component methods
The different published simultaneous component methods were applied to the coupled GC/MS – LC/MS data. Due to the asymmetry associated with the cross-validation procedure in PCovR, we report two unrestricted PCovR analyses, denoted PCovR GC and PCovR LC, with the former denoting a leave-one-out cross-validation based on using the observed scores in the LC data to reproduce the scores of the GC data and the latter a cross-validation that uses the observed GC data to reproduce the LC data. We also report the analysis based on the unweighted but variable-wise autoscaled data. This can be considered to be SCA-P applied to the transposed versions of the data matrices.
Weights put on GC versus LC by different SCA methods.
GC  LC  

PCovR GC^{1}  1.00  0 
STATIS  .99  .01 
SCA-P  .77  .23 
MFA  .66  .34 
SUM-PCA  .50  .50 
PCovR LC^{1}  0  1.00 
Tucker's coefficient of congruence between the component scores (R = 5).
PCovR GC  SCA-P  MFA  SUM-PCA  PCovR LC  LC^{2}  STATIS  

GC^{1}  1  0.91  0.86  0.81  0.55  0.55  0.13 
PCovR GC^{3}  0.91  0.86  0.81  0.55  0.55  0.13  
SCA-P  0.99  0.96  0.73  0.73  0.12  
MFA  0.99  0.79  0.79  0.12  
SUM-PCA  0.84  0.84  0.11  
PCovR LC^{4}  1  0.08  
LC  0.08 
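Tucker's coefficient of congruence used in the table above can be computed per component as follows (a sketch of the standard definition with hypothetical names; the aggregation of the per-component values over the five components is not shown):

```python
import numpy as np

def tucker_congruence(t1, t2):
    """Tucker's coefficient of congruence between two component-score
    vectors: an uncentered correlation, equal to 1 for proportional vectors."""
    return abs(t1 @ t2) / np.sqrt((t1 @ t1) * (t2 @ t2))
```

Unlike the ordinary correlation, the congruence coefficient does not center the vectors, so it is sensitive to differences in the means of the component scores.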
Interpretation of the multiple factor analysis solution
Component scores (labeled 'SC1' to 'SC5') after VARIMAX rotation of the MFA solution with five components and, on the last three lines, the variance accounted for by these components in the GC data (the 'GC' line), the LC data (the 'LC' line), and the concatenated data (the 'TOTAL' line).
SC1  SC2  SC3  SC4  SC5  

Reference  
16 hrs  0.07  0.33  0.04  0.01  0.12 
24 hrs  0.05  0.02  0.10  0.09  0.16 
32 hrs  0.01  0.30  0.05  0.07  0.06 
40 hrs  0.10  0.27  0.01  0.09  0.01 
48 hrs  0.18  0.17  0.17  0.03  0.10 
pH +  
16 hrs  0.18  0.11  0.29  0.10  0.61 
24 hrs  0.03  0.07  0.36  0.10  0.03 
40 hrs  0.35  0.17  0.26  0.03  0.09 
48 hrs  0.22  0.10  0.05  0.10  0.03 
oxygen +  
40 hrs  0.24  0.04  0.02  0.00  0.06 
oxygen −  
16 hrs  0.04  0.34  0.01  0.11  0.06 
24 hrs  0.26  0.02  0.02  0.08  0.16 
40 hrs  0.45  0.13  0.09  0.05  0.08 
64 hrs  0.37  0.10  0.11  0.02  0.07 
phosphate +  
16 hrs  0.01  0.38  0.02  0.04  0.03 
24 hrs  0.09  0.43  0.04  0.10  0.14 
40 hrs  0.33  0.00  0.04  0.01  0.01 
48 hrs  0.04  0.12  0.01  0.01  0.15 
phosphate −  
16 hrs  0.03  0.06  0.00  0.19  0.09 
24 hrs  0.03  0.02  0.41  0.12  0.07 
40 hrs  0.32  0.09  0.32  0.01  0.14 
succinate  
24 hrs  0.07  0.10  0.02  0.55  0.13 
40 hrs  0.01  0.02  0.01  0.57  0.03 
48 hrs  0.05  0.05  0.05  0.46  0.11 
Wild type  
16 hrs  0.19  0.27  0.25  0.10  0.06 
24 hrs  0.02  0.11  0.38  0.12  0.04 
40 hrs  0.08  0.20  0.32  0.06  0.33 
48 hrs  0.11  0.06  0.26  0.03  0.55 
GC  0.14  0.12  0.08  0.14  0.06 
LC  0.10  0.14  0.20  0.07  0.13 
TOTAL  0.13  0.13  0.12  0.12  0.09 
Discussion
 1.
Labeling the common mode as object versus variable is consequential for the obtained results only when considered in combination with preprocessing (the variable-wise autoscaling).
 2.
Preprocessing can have a strong influence on the obtained results. This is the case when the variables are measured on different scales or when the blockspecific sum of squares differ substantially. In the former case a single/few variables can dominate the solution, in the latter a single/few matrices. Applying a simultaneous component analysis to data corrected for such differences may yield very different results from an analysis on the uncorrected data.
 3.
In general, different matrixspecific weighting strategies lead to different results. The different simultaneous component methods each proposed specific weighting strategies based on different principles of fairness. As illustrated, these strategies may yield very different weights ranging from one extreme (all weight is put on a particular matrix) to the other (all weight is put on another particular matrix).
 4.
The identification constraint to impose orthonormality on the common versus on the concatenated model structure affects the scale of the component scores and loadings, but it is not consequential: it yields the same reconstructed data.
Conclusion
Summarizing, of the four aspects that differentiate the simultaneous component methods, the choices made with respect to preprocessing and weighting are consequential; this was also illustrated by applying the different simultaneous component methods to two coupled metabolomics data matrices. In particular, the type of weighting of the different matrices is essential for simultaneous component analysis. These types were shown to be linked to different specifications of the idea of a fair integration of the different coupled matrices. A summary overview of these principles (see Table 2) may be of help to the data analyst in choosing an appropriate weighting scheme for the analysis of a data set at hand. As discussed, such a weighting scheme may be based on a flexible integration of different principles of fairness.
Abbreviations
GC: Gas Chromatography
GPA: Generalized Procrustes Analysis
LC: Liquid Chromatography
MFA: Multiple Factor Analysis
MS: Mass Spectrometry
PCovR: Principal Covariates Regression
PLS: Partial Least Squares
SCA-P: Simultaneous Component Analysis with a fixed Pattern matrix
STATIS: Structuration des Tableaux à Trois Indices de la Statistique
SVD: Singular Value Decomposition
Declarations
Acknowledgements
This work was supported by the Research Fund of Katholieke Universiteit Leuven (SymBioSys: CoE EF/05/007 and GOA/2005/04). We wish to thank the anonymous reviewers for helpful comments.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.