Mathematical basis of multiple factor analysis (MFA)
MFA [16] is devoted to the simultaneous exploration of data tables in which the same individuals are described by several sets of variables. In MFA, the number and the type of variables (quantitative or categorical) may vary from one table to another, but within each table the nature of the variables is the same. Here we focus on quantitative variables. The aims of MFA are similar to those of PCA, namely to study the similarities between individuals from a multidimensional point of view, to analyze the relationships between variables, and to characterize individuals based on these relationships. However, beyond these conventional uses, MFA can also be used to study the links between tables of variables and to compare the information contributed by each table.
MFA analyzes a set of J data tables \(K_{1},\dots,K_{J}\), where each \(K_{j}\) corresponds to a table of quantitative variables measured on the same I individuals (for a schematic overview of MFA see Additional file 1: Figure S1). The core of MFA is a PCA in which weights are assigned to variables. More formally, the variance-covariance matrix associated with each data table \(K_{j}\) is decomposed by PCA and its largest eigenvalue \({\lambda _{1}^{j}}\) is derived. Then, each variable belonging to \(K_{j}\) is weighted by \(1/\sqrt {{\lambda _{1}^{j}}}\). Finally, a global PCA is performed on the merged and weighted data table \(K=[K_{1},\dots,K_{J}]\) to obtain the configuration F (the scores matrix, or principal components). The main reason for the weighting step is to remove from each table all information related to its own dimensionality or variance. Therefore, no single table can dominate the first dimension of the global analysis.
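The weighting scheme can be sketched in a few lines. The following Python/NumPy function is an illustrative sketch, not the FactoMineR implementation (the function name and the `d` parameter are ours): each centered table is weighted by \(1/\sqrt{\lambda_1^j}\), and a global PCA is run on the merged table.

```python
import numpy as np

def mfa_weighted_pca(tables, d=2):
    """Sketch of the MFA core: weight each table by 1/sqrt(lambda1),
    then run a global PCA on the merged, weighted table."""
    weighted = []
    for K_j in tables:
        X = K_j - K_j.mean(axis=0)           # center each variable
        # largest eigenvalue of the covariance matrix = first PCA eigenvalue
        lam1 = np.linalg.svd(X, compute_uv=False)[0] ** 2 / (X.shape[0] - 1)
        weighted.append(X / np.sqrt(lam1))   # down-weight high-inertia tables
    K = np.hstack(weighted)                  # merged, weighted table
    U, s, _ = np.linalg.svd(K, full_matrices=False)
    return U[:, :d] * s[:d]                  # configuration F (scores matrix)
```

After this weighting, the first eigenvalue of each table equals 1, which is what prevents any single table from dominating the first global dimension.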
MFA provides the same graphical representations as PCA (i.e., representations of individuals and variables) and, owing to the table structure, additional representations such as the table representation and the superimposed representation [16].
The multiple imputation multiple factor analysis approach (MIMFA)
To deal with multiple tables with missing rows, we propose the MIMFA approach, a multiple imputation (MI) method adapted to the framework of MFA. The aim of our method is not to obtain the best possible estimates of the missing values, but to replace them with plausible values in order to provide estimates of the MFA configurations. Following MI methodology, the MIMFA approach is carried out by performing the following three steps:

1. Imputation: generate M different imputed datasets \(K^{(1)},\dots,K^{(m)},\dots,K^{(M)}\) of K.

2. MFA analysis: perform an MFA on each imputed dataset \(K^{(m)}\), leading to M different configurations \(F_{1},\dots,F_{m},\dots,F_{M}\).

3. Combination: find a consensus configuration between all the configurations \(F_{1},\dots,F_{M}\).
These steps are outlined in Fig. 2 and described in detail in the following sections.
Generating imputed data: multiple hot-deck imputation
Hot-deck imputation involves replacing missing values of one or more variables with available values from a similar unit [17]. The observation from which these available values are taken is called the donor, and the observation with the missing value, which receives the donor’s value, is the recipient. The donor can be randomly selected from a set of potential donors, called the donor pool. Selecting a suitable donor pool is not an easy task and is beyond the scope of this article [18, 19]. The general principle is to choose donor units that are as close as possible to the recipient with respect to some affinity score. Affinity is defined as the degree to which each potential donor matches the recipient’s values across all variables other than the one being imputed. Intuitively, in the framework of stratified multiple omics tables, the donor pool can be formed of available individuals belonging to the same stratum (e.g., cancerous cell line, treatment, etc.) and the same omics table as the recipient.
Multiple hot-deck imputation differs from other forms of hot-deck imputation by using several donors for a single recipient [20]. It proceeds as follows. Let \(K=[K_{1},\dots,K_{J}]\) be the merged data table containing missing rows, with strata s=1,…,S; then carry out the following steps (see Fig. 2, hot-deck imputation step):
Step 1. Create donor pools by taking donors belonging to the same stratum s and the same table \(K_{j}\) as the recipient. Recipients within the same stratum share the same donor pool. We assume throughout that each stratum contains a sufficiently large number of donors.
Step 2. For each recipient in \(K_{j}\), impute the missing individual by randomly drawing, with replacement, a donor from the corresponding donor pool. Repeat this procedure until all missing individuals in the J tables have been imputed.
Step 3. Repeat Step 2 until M different imputed datasets \(K^{(1)},\dots,K^{(M)}\) of K are obtained.
By conducting the imputations in this way, it is reasonable to assume that the within-unit, between-variables multivariate relationships are preserved.
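The three steps above can be sketched as follows. This NumPy function is a hypothetical helper, not the authors' implementation; it represents a missing row as an all-NaN row and assumes each stratum contains enough observed donors.

```python
import numpy as np

def multiple_hotdeck(tables, strata, M, seed=0):
    """Multiple hot-deck imputation sketch: for each missing row,
    draw a donor row (with replacement) from the observed rows of the
    same stratum in the same table; repeat M times."""
    strata = np.asarray(strata)
    rng = np.random.default_rng(seed)
    imputed_sets = []
    for _ in range(M):
        imputed = []
        for K_j in tables:
            K = K_j.copy()
            missing = np.isnan(K_j).all(axis=1)      # recipients (Step 1)
            for i in np.flatnonzero(missing):
                # donor pool: observed individuals of the same stratum
                pool = np.flatnonzero((strata == strata[i]) & ~missing)
                donor = pool[rng.integers(len(pool))]  # draw with replacement
                K[i] = K_j[donor]                      # Step 2
            imputed.append(K)
        imputed_sets.append(imputed)                   # Step 3
    return imputed_sets
```

Because entire donor rows are copied, the multivariate relationships among the variables of a donor are carried over intact, which is what motivates the preservation claim above.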
The combination procedure: the STATIS method
The question that arises after using MI in an MFA framework is how all the configurations resulting from the analyses should be combined to obtain a single estimate of the consensus configuration. While averaging is an appropriate combination procedure in many other statistical techniques, it is not recommended for MFA because of possible reflection, dilation or rotation of the different configurations with respect to each other [21]. Here we consider an alternative approach, implementing the STATIS method, which provides a compromise configuration balancing all configurations.
The STATIS method [22] (which stands for Structuration des Tableaux à Trois Indices de la Statistique in French) is a generalization of PCA used to simultaneously study several tables of variables collected on the same individuals. The goal of this method is to analyze the structure of the individual tables (i.e., the relations between them) and to derive from this structure an optimal set of weights for computing a common configuration of the observations. The solution obtained, called the compromise, is the configuration that agrees most with all the other configurations. An overview of the STATIS method is presented in Additional file 1: Figure S2.
STATIS analyzes a set of N tables \(\boldsymbol{X}_{1},\dots,\boldsymbol{X}_{N}\), where each \(\boldsymbol{X}_{n}\) is a table of quantitative variables measured on the same individuals. The first stage of STATIS consists of calculating, for each table, a matrix of cross-products between individuals \(\boldsymbol {W}_{\!\!n} = \boldsymbol {X}_{\!n} \boldsymbol {X}_{\!n}^{T}\) (\(\boldsymbol{A}^{T}\) denotes the transpose of a vector or matrix \(\boldsymbol{A}\)), reflecting the similarities between individuals within that table. The use of the matrices \(\boldsymbol{W}_{\!\!n}\) instead of \(\boldsymbol{X}_{\!n}\) simplifies the computation because it obviates the determination of rotations when matching the \(\boldsymbol{X}_{\!n}\). The basic idea in STATIS is then to find a compromise space \(\boldsymbol {W_{\!c}} = \sum _{n=1}^{N} \alpha _{n} \boldsymbol {W}_{\!\!n}\) that globally balances these cross-product matrices by choosing a suitable optimal set of weights \(\alpha_{1},\dots,\alpha_{N}\). These weights are obtained from the PCA of the matrix \(\boldsymbol{R}\) whose generic term \(\boldsymbol{R}_{\!jk}\) gives the cosine between tables (also known as the RV coefficient [23]), defined as:
$$\boldsymbol{R}_{\!jk} = \frac{\text{trace}(\boldsymbol{W}_{\!\!j}^{T}\, \boldsymbol{W}_{\!k})}{\sqrt{\text{trace}(\boldsymbol{W}_{\!\!j}^{T}\, \boldsymbol{W}_{\!\!j})\cdot \text{trace}(\boldsymbol{W}_{\!k}^{T}\, \boldsymbol{W}_{\!k})}}\;, $$
where the trace is the sum of the main diagonal elements of a square matrix. The first eigenvector obtained from the PCA of \(\boldsymbol{R}\) represents the “agreement between tables”. Its elements are normalized so that their sum equals 1 and are used as the weights \(\alpha_{n}\) defining \(\boldsymbol{W_{\!c}}\). Tables with larger values of \(\alpha_{n}\) are more similar to the other tables and therefore receive a larger weight, while the weights of “outlier tables” are closer to zero relative to the other weights. The principal components from the PCA of \(\boldsymbol{W_{\!c}}\) then give the coordinates of the individuals in the compromise space, called the compromise configuration.
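The weighting step can be sketched directly from these definitions. The function below is a hypothetical name, not the ade4 `statis` implementation; it assumes each table is supplied as an I×p score matrix, forms the cross-product matrices, computes the RV matrix, and derives the weights from its first eigenvector.

```python
import numpy as np

def statis_compromise(configurations):
    """Sketch of the STATIS weighting step: W_n = F_n F_n^T, RV matrix R,
    weights alpha = first eigenvector of R normalized to sum to 1,
    compromise W_c = sum_n alpha_n W_n."""
    Ws = [F @ F.T for F in configurations]
    N = len(Ws)
    R = np.empty((N, N))
    for j in range(N):
        for k in range(N):
            # W matrices are symmetric, so trace(W_j^T W_k) = trace(W_j W_k)
            R[j, k] = np.trace(Ws[j] @ Ws[k]) / np.sqrt(
                np.trace(Ws[j] @ Ws[j]) * np.trace(Ws[k] @ Ws[k]))
    # np.linalg.eigh sorts eigenvalues in ascending order: last column is
    # the first (Perron) eigenvector, which is nonnegative up to sign
    v = np.abs(np.linalg.eigh(R)[1][:, -1])
    alpha = v / v.sum()                        # normalize to sum to 1
    Wc = sum(a * W for a, W in zip(alpha, Ws))
    return alpha, Wc
```

The eigendecomposition of the returned `Wc` then yields the compromise configuration, exactly as described above.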
Implementation of MIMFA
The MIMFA algorithm can be summarized as follows (see Fig. 2):
Step 0. Start with an observed, incomplete dataset K. Define the number of imputations M and the dimensionality d of the compromise configuration.
Step 1. Perform multiple hot-deck imputation. For m=1,…,M:

Obtain an imputed version \(K^{(m)}\) of K, such that \(K^{(m)}\neq K^{(m')}\) for m≠m′. The imputed datasets are identical for the non-missing data entries, but differ in the imputed values. The imputed version of the data is obtained using the hot-deck imputation approach.

Perform an MFA using d components on the imputed dataset \(K^{(m)}\) to obtain the configuration \(F_{m}\).
Step 2. Perform a STATIS on the set of configurations \(F_{1},\dots,F_{M}\) to obtain \(F_{c}\), the compromise configuration.
Note that the number of dimensions d used in the algorithm has to be chosen a priori. However, this number affects neither the estimation of the imputed values nor the estimation of the compromise configuration. Moreover, for given imputed datasets \(K^{(1)},\dots,K^{(M)}\), the solutions provided by the algorithm are nested (the solution with d dimensions is included in the solution with d+1 dimensions). Since the core of MIMFA is a weighted PCA, the strategies suggested for choosing the number of components in PCA can be adapted to MIMFA, but work remains to be done to validate the quality of these extensions.
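Putting the steps together, a compact end-to-end sketch of the algorithm in NumPy (again a hypothetical helper, not the authors' R implementation built on FactoMineR and ade4) might look like this:

```python
import numpy as np

def mimfa_sketch(tables, strata, M=5, d=2, seed=0):
    """MIMFA sketch: hot-deck imputation (Step 1), weighted-PCA MFA per
    imputed dataset (Step 1, second part), STATIS compromise (Step 2).
    Missing rows are all-NaN rows; strata label each individual."""
    strata = np.asarray(strata)
    rng = np.random.default_rng(seed)
    Ws = []
    for _ in range(M):
        blocks = []
        for K_j in tables:
            K = K_j.copy()
            miss = np.isnan(K_j).all(axis=1)
            for i in np.flatnonzero(miss):           # hot-deck imputation
                pool = np.flatnonzero((strata == strata[i]) & ~miss)
                K[i] = K_j[pool[rng.integers(len(pool))]]
            X = K - K.mean(axis=0)                   # MFA weighting
            lam1 = np.linalg.svd(X, compute_uv=False)[0] ** 2 / (len(X) - 1)
            blocks.append(X / np.sqrt(lam1))
        U, s, _ = np.linalg.svd(np.hstack(blocks), full_matrices=False)
        F = U[:, :d] * s[:d]                         # configuration F_m
        Ws.append(F @ F.T)
    # STATIS: RV matrix, weights from its first eigenvector, compromise
    R = np.array([[np.trace(Wj @ Wk) /
                   np.sqrt(np.trace(Wj @ Wj) * np.trace(Wk @ Wk))
                   for Wk in Ws] for Wj in Ws])
    v = np.abs(np.linalg.eigh(R)[1][:, -1])
    Wc = sum(a * W for a, W in zip(v / v.sum(), Ws))
    vals, vecs = np.linalg.eigh(Wc)                  # PCA of the compromise
    top = np.maximum(vals[::-1][:d], 0.0)
    return vecs[:, ::-1][:, :d] * np.sqrt(top)       # compromise configuration
```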
How many imputations?
When using MI, one of the uncertainties concerns the number M of imputed datasets needed to obtain satisfactory results. The number of imputed datasets in MI depends to a large extent on the proportion of missing data: the greater the missingness, the larger the number of imputations needed to obtain stable results. However, in multiple hot-deck imputation, the number of imputed datasets is limited by the size of the donor pools. In any case, the total number of possible imputations \(M_{total}\) can be calculated before applying the imputation approach (see Additional file 2). If \(M_{total}\) is small (\(M_{total}\leq 50\)), then \(M=M_{total}\) can be used in MIMFA. The appropriate number of imputations can be informally determined by carrying out MIMFA on N replicate sets of \(M_{l}\) imputations for l=0,1,2,…, with \(M_{0}<M_{1}<M_{2}<\dots<M_{total}\), until the estimated compromise configurations stabilize. More precisely, this approach can be carried out by applying the following steps:
Step 0. Start with an observed, incomplete dataset K. Define the numbers of imputations \(M_{l}\), with \(M_{0}<M_{1}<M_{2}<\dots<M_{total}\), and the number N of replicate sets of \(M_{l}\) imputations.
Step 1. Create collections \(\mathcal {I}_{n}^{M_{l}}\), n=1,…,N, each containing \(M_{l}\) different imputed datasets of K, such that \(\mathcal {I}_{n}^{M_{l}}\neq \mathcal {I}_{n'}^{M_{l}}\) for n≠n′ and \(\mathcal {I}_{n}^{M_{l-1}}\subset \mathcal {I}_{n}^{M_{l}}\) for \(M_{0}<M_{1}<M_{2}<\dots<M_{total}\).
Step 2. For n=1,…,N, perform an MIMFA using \(\mathcal {I}_{n}^{M_{0}}\), to obtain N different compromise configurations \(\boldsymbol {F_{\!c}}_{1}^{\!M_{0}},\dots,\boldsymbol {F_{\!c}}_{N}^{\!M_{0}}\).
Step 3. Let l=1.
For n=1,…,N,

perform an MIMFA using the collection \(\mathcal {I}_{n}^{M_{l}}\), to obtain a compromise configuration \(\boldsymbol {F_{\!c}}_{n}^{\!M_{l}}\);

calculate \(r_{n}^{\,l}=r(\boldsymbol {F_{\!c}}_{n}^{\!M_{l}},\,\boldsymbol {F_{\!c}}_{n}^{\!M_{l-1}})\), a measure of the distance or correlation between configurations (for example the RV coefficient [23]).
Step 4. Calculate \(\overline {r}^{\,l}=\frac {1}{N}\sum _{n=1}^{N} r_{n}^{\,l}\) (or \(\sigma (\overline {r}^{\,l})\) the standard error of \(\overline {r}^{\,l}\)).
Step 5. Repeat steps 3 to 4 for l=2,3,… until the differences between two subsequent \(\overline {r}^{\,l}\) (or \(\sigma (\overline {r}^{\,l})\)) become smaller than a certain convergence criterion.
Uncertainty of MIMFA solutions
In an MIMFA framework, after estimating the configurations from the imputed datasets, a new source of variability due to missing values can be taken into account. Here we describe two approaches to visualizing the uncertainty of the estimated MFA configurations attributable to missing row values. First, an individual plot of all estimated MFA configurations is constructed by projecting each estimated MFA configuration onto the compromise configuration (named the trajectories by Lavit [22]). Each individual is represented by M points, each corresponding to one of the M MFA configurations. Confidence ellipses and convex hulls can then be constructed around the M positions of each individual. The computed convex hull results in a polygon containing all M solutions. All individuals have confidence areas, even those without missing values: even if the imputed values are the only change, they can affect all MFA parameters. The area of such an ellipse (or convex hull) therefore provides insight into the uncertainty of the estimated configuration: the larger the area, the more uncertain the exact location of the individual. Thus, when the area of an ellipse is large, the scientist should remain cautious when interpreting that individual’s position.
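As an illustration of the ellipse-area computation, the following sketch uses the standard formula for the area of a bivariate normal confidence ellipse, with 5.991 the 0.95 quantile of the chi-square distribution with 2 degrees of freedom; this is a generic formula applied to one individual's M projected positions, not code from the paper.

```python
import numpy as np

def ellipse_area(points, q=5.991):
    """Area of the 95% confidence ellipse for one individual's M projected
    positions (rows of `points`, in 2 compromise dimensions).
    q is the chi-square(2 df) 0.95 quantile."""
    cov = np.cov(points, rowvar=False)   # 2x2 covariance of the M points
    return np.pi * q * np.sqrt(np.linalg.det(cov))
```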
Performance of the method
We conducted two case studies to assess the performance of our method. Instead of using theoretical distributions to generate simulated data, our studies were based on two real datasets, denoted the original datasets. Specific patterns of missingness were then created in these datasets, as illustrated in Fig. 1, resulting in what we call the incomplete datasets. This approach was used to more closely mirror situations that may occur in the omics context. Next, missing row values were estimated, and the resulting complete datasets are referred to as the imputed datasets.
We then compared our MIMFA method to the RIMFA approach [9] and to the mean variable imputation MFA (MVIMFA) method, in which missing values are simply replaced by the mean of each variable, after which an MFA is performed on the imputed dataset. This latter approach was considered a common baseline for comparing the MIMFA and RIMFA methods. For each configuration obtained using MI, RI and MVIMFA, the similarity between the configuration solution and the true configuration (based on an MFA of the original dataset) was assessed using the RV coefficient [23]. The RV coefficient, which ranges from zero to one, can be interpreted as a correlation coefficient between two matrices, allowing the relative positions of objects to be compared from one configuration to another.
We also provide comprehensive graphical comparisons of the true vs. the MI, RI or MVI MFA configurations. The individuals from both configurations are drawn in the same plot and connected by an arrow, the length of which indicates the divergence between the two configurations.
Implementation of the analyses
All analyses were performed using the R computing environment [24]. MFA was performed using the MFA function of the FactoMineR R package [25]. The statis function of the ade4 R package [26] was used to determine the compromise configuration. The RIMFA method is implemented in the imputeMFA function available in the missMDA R package [27]. Note that the number of components ncp used to predict the missing entries in the imputeMFA function has to be chosen a priori; this choice is crucial and difficult [9]. As the true configuration was known in our case, the number of components ncp was chosen to maximize the RV coefficient between the true and the imputeMFA configurations.
The appropriateness of the results from MI, RI and MVI MFA was then determined by comparing the configurations resulting from these three strategies with the true MFA configuration. Because of possible misalignment (order changes, sign reversals of the components, and rotation) between two configurations (the true vs. the MI, RI or MVI MFA configuration), it was necessary to align them before comparison. Ordinary Procrustes Analysis [28] was used for this alignment.
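The rotation/reflection part of this alignment can be sketched with a singular value decomposition. This minimal version assumes both configurations are already centered and omits the translation and scaling steps of full Procrustes analysis; the function name is ours.

```python
import numpy as np

def procrustes_align(F_ref, F):
    """Find the orthogonal matrix Q (rotation or reflection) minimizing
    ||F_ref - F Q|| and apply it to F, so the two configurations can be
    compared point by point."""
    U, _, Vt = np.linalg.svd(F.T @ F_ref)
    Q = U @ Vt                     # optimal orthogonal transformation
    return F @ Q
```

After alignment, the arrows between corresponding individuals (and the RV coefficient) measure genuine divergence rather than an arbitrary rotation or sign flip.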
Datasets
Liver toxicity
The datasets originated from a liver toxicity study [29] in which 64 male rats of the inbred Fisher F344/N strain were exposed to toxic doses of acetaminophen (paracetamol) in a controlled experiment. Necropsies were performed 6, 18, 24 and 48 h after exposure and mRNA was extracted. The data consisted of the expression of 3,116 genes and 10 clinical variables considered to be markers of liver injury. The 64 subjects (rats) were cross-classified into eight strata (or treatments) according to two factors:

exposure time: 6, 18, 24 and 48 h;

toxic doses of acetaminophen: high (1500 mg/kg or 2000 mg/kg) or low (50 mg/kg or 150 mg/kg).
Eight subjects per stratum were included. These datasets were downloaded from the mixOmics R package [30].
NCI60 data
The NCI60 dataset contained transcriptomic [31] and proteomic [32] tables for a collection of 60 cell lines from the National Cancer Institute (NCI60). The NCI60 panel included cell lines derived from various cancer types: colon (7 cell lines), renal (8), ovarian (6), breast (8), prostate (2), lung (9) and central nervous system (6), as well as leukemia (6) and melanoma (8). The gene expression profiles used here were generated using an Agilent platform [31] and downloaded from Cellminer [33]. Data were log2-transformed. To facilitate data interpretation and computation, the transcriptomic data were filtered to exclude probes that did not map to an official HUGO gene symbol and to retain only the probe with the highest average value when multiple probes mapped to the same gene, as previously described in [34]. Genes invariant across all 60 cell lines, i.e., genes showing no differences between cancer types, were also removed. Filtering produced a dataset of 1,433 genes. The NCI60 proteome table was also downloaded from Cellminer [33]. Proteomic data were obtained using high-density reverse-phase lysate microarrays [32], and log2-transformed protein abundance levels were available for 162 proteins [32].