 Methodology article
 Open access
 Published:
Multiblock variable influence on orthogonal projections (MBVIOP) for enhanced interpretation of total, global, local and unique variations in OnPLS models
BMC Bioinformatics volume 22, Article number: 176 (2021)
Abstract
Background
For multivariate data analysis involving only two input matrices (e.g., X and Y), the previously published methods for variable influence on projection (e.g., VIP_{OPLS} or VIP_{O2PLS}) are widely used for variable selection purposes, including (i) variable importance assessment, (ii) dimensionality reduction of big data and (iii) interpretation enhancement of PLS, OPLS and O2PLS models. For multiblock analysis, the OnPLS models find relationships among multiple data matrices (more than two blocks) by calculating latent variables; however, a method for improving the interpretation of these latent variables (model components) by assessing the importance of the input variables was not available up to now.
Results
A method for variable selection in multiblock analysis, called multiblock variable influence on orthogonal projections (MBVIOP) is explained in this paper. MBVIOP is a model based variable selection method that uses the data matrices, the scores and the normalized loadings of an OnPLS model in order to sort the input variables of more than two data matrices according to their importance for both simplification and interpretation of the total multiblock model, and also of the unique, local and global model components separately. MBVIOP has been tested using three datasets: a synthetic fourblock dataset, a real threeblock omics dataset related to plant sciences, and a real sixblock dataset related to the food industry.
Conclusions
We provide evidence for the usefulness and reliability of MBVIOP by means of three examples (one synthetic and two realworld cases). MBVIOP assesses in a trustable and efficient way the importance of both isolated and ranges of variables in any type of data. MBVIOP connects the input variables of different data matrices according to their relevance for the interpretation of each latent variable, yielding enhanced interpretability for each OnPLS model component. Besides, MBVIOP can deal with strong overlapping of types of variation, as well as with many data blocks with very different dimensionality. The ability of MBVIOP for generating dimensionality reduced models with high interpretability makes this method ideal for big data mining, multiomics data integration and any study that requires exploration and interpretation of large streams of data.
Background
Multivariate data analysis can involve thousands of input (manifest) variables in just one data block. These variables may contain latent information that can help (i) to extract inferences and explain phenomena and relationships that might not be obvious from the experimental results obtained in the laboratory, (ii) to get a more meaningful and visual interpretation of the data, (iii) to optimize processes in both industry and research environments, and (iv) to understand the holistic pattern in complex biological systems where different parts interact by underlying connections. Compared to the analysis of a single dataset, the analysis of a large number of datasets (blocks) implies that the number of variables and their underlying interconnections grow very much indeed; at this point, reducing the number of variables involved in the multiblock data analysis becomes a meaningful and much needed strategy.
Interest in multiblock approaches has risen in psychology [1,2,3], chemistry [4,5,6,7], biology [8, 9] and sensory science [10, 11], among other; an interest mainly motivated by the goal of extracting the maximum useful information from two or more datasets interrelated among themselves. Early multiblock methods based on projections and latent variables, e.g. partial least squares (PLS) [12, 13], allowed the analysis of a limited number (usually two or three) of data matrices, but without taking full advantage of how the data blocks were connected. Two commonly used multiblock approaches based on principal components were consensus principal component analysis (CPCA) [14, 15] and hierarchical principal component analysis (HPCA) [16], whose algorithms are very similar, differing only in the normalization steps [5]. For PLS applied to multiblock analysis, it is worth mentioning hierarchical partial least squares (HPLS) [14] and multiblock partial least squares (MBPLS) [17], which are similar but with two main differences: (i) the normalization is done on different model parameters, and (ii) the regression of the Yblock is done on different matrices [5]. Some interesting applications of multiblockPLS were reported by Wise and Gallagher in 1996 [18], and a better understanding of the underlying patterns in latent models was attempted by Kourti et al. [4] using multiblock multiway PLS for analyzing batch polymerization processes in 1995. Although many different multiblock methods based in different criteria and principles can be found in the literature (e.g. regularized generalized canonical correlation analysis, RGCCA [19]), this paper will mainly keep its scope inside methods based on partial least squares regression [20,21,22,23,24,25,26,27,28,29,30], such as sparse partial least squares presented by Le Cao et al. [31] (and further implemented by Rohart et al. [32]). Multiblock methods based on orthogonal projections have received interest within lifesciences provided the model structure it can decompose the data blocks into; two examples of this are the multiomics factor analysis (MOFA) presented by Argelaguet et al. in 2018 [33] and the Nblock orthogonal projections to latent structures (OnPLS) method presented by Löfstedt and Trygg in 2011 [34]. The latter can be used to provide some input parameters for improved model interpretation using MBVIOP. From a methodology perspective, OnPLS provides means to take full advantage of the shared and unique variations of more than two data blocks. Examples of alternative methods with different objective functions include JIVE (joint and individual variation explained) [35], GSVD (generalized singular value decomposition) [36], and msPLS (multiset sparse partial least squares path modelling) [37].
The numerous variable selection methods for multivariate analysis of one data matrix [38,39,40,41,42,43,44,45,46,47] cannot handle the complexity and the underlying patterns of a large number of datasets; therefore, data integration and multiblock variable selection methods are needed. An important consideration is to be aware of the multiset structure since the integration of multiple datasets can be performed in different ways, and different methods may have specific requirements on this aspect. For instance, OnPLS followed by MBVIOP has a similar integration framework than the Nintegration of block sparse PLS requiring the same number of samples (N) for all data matrices, whilst mint sparse PLS has a Kintegration (also called Pintegration in the literature) framework which requires the same number of variables (K) instead of the same number of samples [32]. Besides, some methods are more suitable for improving model interpretability, whilst other are more suitable for improving predictability; hereby, the importance of selecting the appropriate variable selection method according to the purpose of the data analysis, an example of this was shown by comparing the obtained root mean square error of prediction (RMSEP) using two different variable selection methods on the Marzipan dataset in GalindoPrieto et al. [48]. The fact that variable influence on projection (VIP) approaches for OPLS (VIP_{OPLS}) [39], O2PLS (VIP_{O2PLS}) [48] and OnPLS (MBVIOP) base their calculations on the product between the normalized loadings (p) and the sum of squares of X and Y leads to an enhanced model interpretability that other methods cannot achieve. However, if the aim of the analysis is to achieve enhanced model predictability, other methods such as sparse PLS [31] (that uses the Q2 parameter as criterion to choose the number of model components, and the root means square error of prediction criterion for evaluation of the predictive power of each Y variable between the original non penalized PLS models and the sparse PLS model) may be more suitable. We include a comparison for unsupervised multiblock variable selection using the sparse PLS method for multiblock cases (blocksPLS) [32] and MBVIOP in the Results and Discussion section.
In addition, variable selection aiming to enhance the interpretation of latent variables containing uncorrelated (orthogonal) variation can be challenging. An example of an approach able to deal with multiple datasets is the sparse generalized canonical correlation analysis (SGCCA) for variable selection that combines RGCCA with the L1penalty [49]; however, to deal also with orthogonalization in an analysis of multiple datasets, methods such as VIP_{O2PLS} (also called O2PLSVIP) [48], MOFA [33], or the MBVIOP explained here are more suitable options. We include a comparison for unsupervised integrated feature selection between MOFA and MBVIOP in the Results and Discussion section.
It is worth mentioning that for one PLS component, loadings or weights can be used for determining which variables are more influential [50], but this has limited use. There is a need for a diagnostic giving the described variable influence in a PLS model, or any of its derived orthogonal versions, using more than 1 component. All VIP diagnostics are constructed for that purpose.
A multiblock variable selection method called multiblock variable influence on orthogonal projections (MBVIOP) for OnPLS models was developed as part of previous thesis work [51] and is now published and explained in this paper. The mathematical principles of MBVIOP relate to those used in VIP_{OPLS} (a.k.a., OPLSVIP) [39, 44] and VIP_{O2PLS} (a.k.a., O2PLSVIP) [48]. However, the cornerstone of MBVIOP is its interblock connectivity with emphasis on the variable influence, making MBVIOP substantially different (i) from its two predecessors VIP_{OPLS} and VIP_{O2PLS} in terms of connectivity, and also (ii) from OnPLS regression [34] since the normalized OnPLS p loadings cannot provide by themselves a reliable and precise variable importance assessment while this is easily achieved by MBVIOP by taking these normalized loadings as starting point for the variable importance assessment (as it will be shown in the synthetic example). MBVIOP allows the selection of the most important variables for enhanced interpretation of OnPLS models when three or more data blocks are simultaneously modelled. It is worth mentioning that MBVIOP is also applicable to O2PLS® models that involve only two data blocks. Furthermore, MBVIOP provides four MBVIOP profiles (total, global, local and unique) to help answer questions such as:

a.
Total MBVIOP profile: Which are the variables that are more relevant for the interpretation of the whole model? Which variables could be eliminated from the model in order to improve it?

b.
Global MBVIOP profile: Which variables help to interpret the variation that is common to all the data blocks involved in the model?

c.
Local MBVIOP profile: Which variables are important to interpret the variation that is common to some of (but not all) the blocks? And how do these variables connect among the data blocks to explain the information shared by them (i.e., the variation related to the same component or latent variable)?

d.
Unique MBVIOP profile: Which are the variables that contain unique information that can be only found in one specific data block? And which inferences related to the data can be elucidated from the selected variables in the unique MBVIOP profiles?
The MBVIOP algorithm has been tested by using three multiblock datasets, (i) a simulated fourblock dataset called SD16_235GLU, (ii) a real threeblock omics dataset here called Hybrid Aspen, and (iii) a real sixblock industrial dataset called Marzipan. The three datasets are described in detail in sections "Synthetic dataset (four blocks)"–"Metabolomics, proteomics and transcriptomics data of hybrid aspen (three blocks)".
Results and discussion
The results and the discussion aim to validate the multiblock variable influence on orthogonal projections (MBVIOP) method for its application in OnPLS models (extended interpretations related to biology or spectroscopy are out of the scope of this paper). Thus, an OnPLS model followed by an MBVIOP variable selection will be performed in all multiblock analyses. The input variables will be sorted according to their importance for the entire multiblock model (i.e., the total variation), but also for each model component separately (i.e., the unique, the local and the global variations). Figure 1 shows the different types of variation present in a generic OnPLS model.
Description of the OnPLS models
For the synthetic fourblock SD16_235GLU data, an OnPLS model was built in MATLAB. The OnPLS algorithm found two global components (in black and blue in Fig. 2), three local components (in cyan, orange and green in Fig. 2), and three unique components (in pink color in Fig. 2); which points to a conservative, but well conducted, modelling by the OnPLS algorithm. Only two unique components included in the design of the synthetic data were not found; i.e., one unique component in block D_{1} (which represented a 14.3% of the variation of D_{1}) and one unique component in block D_{4} (which contained a 20% of the variation of D_{4}). The rest of the variation was extracted by the model (see Table 1); the percentage of total variation explained by the model was 85.8% for D_{1}, 100% for D_{2}, 100% for D_{3} and 80% for D_{4}.
For the Marzipan data, the six data matrices were used to generate an OnPLS model, which yielded two global components and two unique components (the percentages of explained variation per component and per block are shown in Table 2). The model was able to explain almost all variation; more specifically, a 96.2% of total variation for the NIRS1 block, a 93.8% for the NIRS2 block, a 95.8% for the INFRAPROVER block, a 97.0% for the BOMEM block, a 99.9% for the INFRATECH block and a 75.5% for the IR block. Since all blocks are related to NIR/IR spectroscopy, it is not surprising that the OnPLS algorithm found two global components. The Marzipan data mostly has predictive (joint) variation, which is absolutely dominant over the orthogonal (unique) variation [48].
For the Hybrid Aspen data, an OnPLS model was built obtaining four global components, two local components (one shared between the transcript and the metabolite data, and another shared between the transcript and the protein data), and two unique components (one for the transcriptomics block, and another for the metabolomics block). The OnPLS model explained 75.0% of the total variation for the transcriptomics data block (14,738 variables), 55.0% for the proteomics data block (3132 variables), and 58.3% for the metabolomics data block (281 variables). The decomposition of explained variation for the different types of variation is shown in Table 3.
Evidence of the reliability and the efficiency of MBVIOP using synthetic data
For the variation contained in the local component that D_{1} shares with D_{4}, MBVIOP selected as relevant variables 10–18, represented as a peak marked in cyan in the local MBVIOP plot for D_{1} (Fig. 2); in the same local MBVIOP plot, variables 35–47 (marked in orange) were considered important for explaining the variation that D_{1} shares with D_{2}. The unique MBVIOP plot for D_{1} pointed at variables 7–19 as the important ones for explaining the unique variation of D_{1}; interestingly, variable 13 stood out from the rest of variables.
By comparing the MBVIOP variable importance results to the normalized loadings (Fig. 2), it can be seen that the MBVIOP method is very reliable finding the exact variables that are important for the different types of variation of D_{1}; furthermore, MBVIOP assesses the correct proportion of importance for each variable, which cannot be achieved by the normalized loadings plot. Hence, looking at variable 13 in the normalized loadings plot, it can be seen that this variable was related to the two unique components of D_{1} (explaining 28.6% of variation), whereas the other variables (7–12 and 14–19) linked to the unique variation of D_{1} were only related to one of the unique components (explaining only 14.3% of the variation); however, the normalized loading plot did not highlight such an important variable (no. 13) in any way. Auspiciously, MBVIOP highlighted the importance of variable 13 (marked in dark pink color in Fig. 2) as an intense peak standing out from the crowd; this variable was also depicted in the total MBVIOP plot for D_{1}. Therefore, the total and the unique MBVIOP plots for D_{1} evidence the efficiency of MBVIOP algorithm to not lose track of any variable, even if it is a lonely variable.
The MBVIOP results obtained for block D_{2} are encouraging, since, even with a high overlapping of the normalized loadings (profiles), the MBVIOP algorithm identified the variables that were relevant for each type of variation (see Fig. 2).
For block D_{3}, the variables considered important in the global MBVIOP plot (Fig. 2) contributed to explain a 50% of the total variation of the OnPLS model, whilst the variables related to explain other types of variation did not overpass the 25%; therefore, the variables related to the information globally shared by all the data matrices were selected as the most important ones for the whole model, leaving out the variables related to information that was local or unique. The unique variation of D_{3} (25% of the total variation) was explained by the large range of variables 15–74. For an overview assessment of the variable importance, the total MBVIOP plot pointed at variables 33–52 and 75–89 as the most relevant ones. Interestingly, the total MBVIOP plot emphasizes the efficiency of MBVIOP giving the proportionally fair importance to the variables according to the amount of information that they help to explain in the OnPLS model; the absence of the large amount of variables which were relevant for the unique variation (i.e., variables 15–74 of D_{3}) enlightened another achievement of the MBVIOP algorithm: it does not matter if there is an outsize number of variables that are important for a specific type of variation, in case that their importance for interpreting/explaining variation in the whole model is not significant enough, they will not be considered relevant variables in the total MBVIOP plot. The latter fact demonstrates that MBVIOP properly sorts the variables according to their importance for explaining a specific type of variation.
Enhancement of the interpretability in an OnPLS model for the Marzipan case by using MBVIOP
The MBVIOP results (see Fig. 3) obtained for the OnPLS model generated using the Marzipan dataset (previously described in section "Description of the OnPLS models") helped to better interpret the pattern of information overlapping between the six data matrices (that would be a painstaking task if it was done by using the normalized loadings provided in Fig. 3). There is not significant amount of local variation in the Marzipan dataset, which explains the fact that no important variables for explaining local variation were selected by MBVIOP. In addition, due to the extreme dominance of the joint variation over the unique variation, the MBVIOP results for the global latent variables were very similar to the MBVIOP results for the total variation, as can be seen by comparison of the plots in Fig. 3.
Giving an overall look at the MBVIOP plots of Fig. 3, the manifest variables selected as relevant for the two global latent variables (global model components) seemed to relate to (i) the sugar content (majorly sucrose, but also small amounts of invert sugar and glucose syrup), and (ii) the almonds and apricot kernels. The unique MBVIOP plots were related to special and unique characteristics of some marzipan samples and/or some spectrometers, as it will be explained in this section.
Block NIRS1 contains measurements done using an instrument that was able to cover, not only the NIR region, but also the visual light range (400–800 nm). Thanks to this, differences in color could be detected for the marzipan samples. Interestingly, MBVIOP determined that some variables corresponding to the range between 450 and 800 nm (visual light region) were relevant for explaining variation only detectable in NIRS1 (i.e., unique for this data block). These important variables relate to the cocoa that was added to some marzipan samples (they had a more brownish color). Besides, by looking at the whole unique MBVIOP plot (from 450 to 2448 nm) in Fig. 3, it can be seen that, aside from the variables with high MBVIOP values detected in the visual light range, there were also important variables located at 1232–1396 nm, 1428–1506 nm, 1638–1682 nm, 1818–1872 nm, and 1902–1986 nm. The cocoa NIR spectrum has been described in the literature [52], thus by matching of some of the important wavelengths found by MBVIOP and the known composition of the cocoa, it is possible to realize the enhanced and easier model interpretation achieved by using MBVIOP (which is not possible by using the OnPLS model loadings provided in Fig. 3). The wavelengths at 1478–1506 nm are important to uncover the OnPLS model variation related to the first overtones of the CH groups of the cocoa, and variables at 1902–1986 nm explain the variation related to the second overtones of the C = O groups of the cocoa (see Fig. 3).
The Infratec MBVIOP revealed three clear regions of important variables located at 960–972 nm, 978–990 nm and 996–1002 nm (see MBVIOP plots for Infratec in Fig. 3). These variables are selected as relevant by the MBVIOP algorithm because they are related to the carbohydrates, proteins, water and lipids (i.e., the second overtones of O–H and N–H stretching vibrations, and the third overtones of CH stretching vibrations). These substances are common to all the marzipan samples, which explains that these wavelengths (variables) were highlighted in the global MBVIOP plot. It is worth noticing that these three wavelength regions can be also seen (albeit not so clearly) in the MBVIOP plots of NIRS2.
As in the VIP_{O2PLS} analysis of Marzipan data published in 2017 [48], the multiblock model generated for the VIP analysis is only between spectra, not between spectra and concentrations; which can be unusual, but also useful either for technical reasons (e.g., to compare spectrometers) or for spectroscopic reasons (e.g., to see the correspondence between bands in IR and bands in NIR – overtones –). The MBVIOP plots for NIRS1 and Bomem (Fig. 3) were very similar because of the characteristics that the NIR spectrometers had in common, however MBVIOP found some differences in the variable importance that could (maybe) be attributable to the different optical principles of the two instruments (dispersive scanning for the NIRS1, and FT interferometer for the Bomem). On the other hand, the IR data block contained relevant variables (wavenumbers) that explained information that is unique for this block, due to the differences in type of spectroscopy (IR/NIR) and instrumentation (spectrometer components).
Some very intense peaks in the MBVIOP plots correspond to variables that are important for some major marzipan compounds. For example, the peak around 1440 nm in the MBVIOP plot for NIRS2 could be related to the O–H bonds, and the peak around 2100 nm in the MBVIOP plot for Bomem could relate to the protein amino acids.
Selection of the most relevant variables in systems biology multiblock analysis for enhanced model interpretation and dimensionality reduction
For the Hybrid Aspen data, the variables were sorted by importance using MBVIOP, and afterwards, this information was used for achievement of enhanced interpretability (higher percentage of explained model variation) and reduced model dimensions (less variables). The purpose was not only to validate MBVIOP as a method for variable importance sorting, but also for multiblock variable selection. To this end, two MBVIOP variable selections (both of them from the original model, i.e. not sequentially done) were carried out, one choosing the variables with MBVIOP values over the default threshold (MBVIOP ≥ 1), and another variable selection with a more conservative criterion (i.e., MBVIOP ≥ 0.5). Afterwards, two new OnPLS models were generated using only the variables selected by MBVIOP; the number of variables used in the original and the two new reduced multiblock models, as well as the percentages of total explained variation, are summarized in Table 4. We want to emphasize that the MBVIOP profile used for selecting the variables was the total MBVIOP because the goal was to improve the total model interpretation without focusing on any concrete part of the model. Nevertheless, it would be possible to select the variables that are more convenient for improving the interpretation of a specific type of variation (e.g., the local variation) by using its corresponding MBVIOP profile (e.g., the local MBVIOP) and building a new model with this selected subset of variables; hereby, MBVIOP is a variable selection method à la carte according to the part of the model (total, global, local or unique) targeted to be improved. In order to show possible sensitivity differences among MBVIOP profiles due to threshold choice (i.e., MBVIOP ≥ 1 or MBVIOP ≥ 0.5), the number of selected variables is shown in Additional file 1: Table S1 in the Supporting Information and as bar plots in Fig. 4 for each type of variation and each threshold choice. From Fig. 4, it does not seem to exist significant differences between total and global profiles in relation to the number of selected variables. However, the number of variables selected when using the threshold MBVIOP ≥ 1 (blue bars in Fig. 4) was clearly lower than when using the threshold MBVIOP ≥ 0.5 (green bars in Fig. 4). For the unique variance, the reduction of number of selected variables using MBVIOP ≥ 0.5 was substantially more significant than for the joint variation types.
The blocks of the original OnPLS model contained 14,738 microarray elements (variables of the transcriptomics data block) that explained the 75.0% of total variation, 3132 extracted chromatographic peaks (variables of the proteomics data block) that explained the 55.0% of total variation, and 281 extracted chromatographic peaks (variables of the metabolomics data block) that explained the 58.3% of total variation. After performing a conservative (i.e., with threshold at 0.5 a.u.) MBVIOP selection of variables, a subset of variables was used for building a new multiblock model obtaining an increase of model interpretability; as shown in Table 4, 13,127 variables from the transcriptomics data explained the 80.1% of total variation, 2186 variables from the proteomics data explained the 67.3%, and 232 variables from the metabolomics data explained the 65.5%. The second new multiblock model with reduced dimensions (using MBVIOP ≥ 1 as criterion for selecting the subset of variables) had substantially less variables (approximately, 1/3 of the original ones) and, at the same time, increased the interpretability (measured as percentage of explained total variation in Table 4); more specifically, only 4452 transcript variables were needed to explain the 85.2% of total variation, 683 protein variables explained the 71.6%, and 81 metabolite variables the 76.2%. Due to the latter improvement, a deep exploration of the forty most important variables of each block, for interpreting the total multiblock model, was carried out. The identification of these variables is provided in Additional file 1: Table S2 for each block.
The variables with global MBVIOP values above the threshold (Additional file 1: Table S3) are important for explaining the variation related to common characteristics of the growth processes of the plants, as well as both the genotype and the internode effects (common to all data blocks). Some of the most important variables to explain this latent information were PU07944 from the transcript data, the protein variables 966 and 1071, and Win022_C04 from the metabolite data.
MBVIOP determined that the PU06931 was the most important microarray element for explaining the locally joint information, related to lignin biosynthesis, between the transcript and the protein data, with a local MBVIOP value of 8.05 a.u. (Additional file 1: Table S4), followed by PU07326 and PU06434; whilst for explaining the locally shared information with the metabolite data, the most important microarray elements were PU00630 (4.50 a.u.), PU03044 and PU22639. Connecting to, variable 966 (local MBVIOP value equal to 9.76 a.u.), followed by variables 2121 and 1115, were the most important protein variables for explaining the variation locally shared with the transcriptomics block. In the metabolite space, variable Win031_C01 (5.39 a.u.), followed by Win021_C05 and Win034_C06, were selected as the most relevant metabolite variables for explaining the local variation shared with the transcript data.
The housekeepinglike events, and the differences between the instrumentation used to characterize the data in the three different platforms, were uncovered by the variables listed in Additional file 1: Table S5 (i.e., the variables with higher values of unique MBVIOP).
In order to explore the possibility of finding variables that could explain more than one type of variation (i.e., the special cases illustrated in Fig. 1), it is worth comparing the tables and plots for the unique, local and global MBVIOP values. For example, in this biological case, the variable Win021_C05 of the metabolomics data block helps to explain variation that is globally shared by all the data blocks, and also contributes to explain variation that is locally shared only between the metabolomics and the transcriptomics data blocks. Therefore, one variable can contain information related to more than one type of variation, and MBVIOP is able to detect and distinguish this feature.
Comparison of MBVIOP to MOFA and blocksPLS
Two unsupervised variable selection methods, i.e. block sparse partial least squares (blocksPLS) and multiomics factor analysis (MOFA), have been compared to multiblock variable influence on orthogonal projections (MBVIOP). All three methods have been run in symmetric mode, i.e. giving the same importance to all data blocks and considering all of them as descriptor matrices. The results have been evaluated and we present the highlighted remarks of the comparison in this section. Further details about the procedures and calculations are described in section "Determination of variable importance in blocksPLS and MOFA for comparison to MBVIOP variable selection".
MBVIOP and MOFA comparison for synthetic data and real omics data
In order to compare the performance of MBVIOP and MOFA, an 8component MOFA model was generated yielding a percentage of total explained variation of 54.5% for D1, 100% for D2, 100% for D3 and 80% for D4; i.e., similar to the percentage of total explained variation obtained by MBVIOP (85.8% for D_{1}, 100% for D_{2}, 100% for D_{3} and 80% for D_{4}). The distribution of the model components had similarities and differences in relation to the one obtained by MBVIOP. Whilst MBVIOP found two global components and three local components as expected from the design of the synthetic data, MOFA found 3 global components and three local components (see Additional file 1: Figure S1). For the local variation, both methods found the local components shared by D2D3D4 and D1D2, but yielded different local assessments for the other latent variables. There were also differences in the discovering of the unique components; however, both methods found a unique component for D1. In general, it seems that MBVIOP assessed better the explained variation per model component than MOFA.
Interestingly, the results of the variable selection performed by MOFA shared many similarities with MBVIOP. When looking at the absolute MOFA loadings for the first global component, most of the variables selected by MOFA for the four data blocks were the same variables selected by MBVIOP (marked in purple in Fig. 2). The second and third components of MOFA contained a mix in the selection of the variables that seemed to partially match the variables selected by MBVIOP for the second global component (marked in grey in Fig. 2). There was also similarity in the selected variables from both methods when looking at the explained local variation, e.g. the same variables were selected as important in the absolute loadings assessment for the fourth component of MOFA and the local D1D2 component of MBVIOP (marked in orange in Fig. 2). The evaluation of the variable selection for the unique components found by both methods, i.e. for the unique components of D1 (in pink in Fig. 2), also showed a similar variable importance assessment; however, MOFA did not highlight variable 13 that helps to explain two unique components (as explained in section "Evidence of the reliability and the efficiency of MBVIOP using synthetic data") over the variables that were only helping to interpret one unique component. As an example of how the assessment has been visualized in MOFA, the absolute loading plot from MOFA for the latter example has been included as Additional file 1: Figure S2.
For the Hybrid Aspen case, MOFA yielded 8 model components (see Additional file 1: Figure S3). The total variation explained by the model was 24.6% for metabolomics, 29.5% for proteomics and 69.2% for transcriptomics. The MOFA algorithm found two global components and two unique components for the transcriptomics and the proteomics data. It also uncovered local variation shared by the transcriptomics and the metabolomics data. However, the components distribution seems difficult to assess by looking at Additional file 1: Figure S3 due to the low values of the R2 parameter for some cases.
The variable importance assessment performed using MOFA shared some similarities with the one performed using MBVIOP. For instance, the metabolites ranked as the most important ones in the MOFA model (e.g. Win022_C04, Win020_C03, Win009_C09, Win034_C06, Win031_C01 or Win021_C05) were selected as important top variables to explain global variation in both MBVIOP (Additional file 1: Table S3 and section "Selection of the most relevant variables in systems biology multiblock analysis for enhanced model interpretation and dimensionality reduction") and MOFA (Additional file 1: Figures S4–S5). The variable selection for the transcripts and the proteins was also consistent for both MBVIOP and MOFA; e.g. top selected transcripts for explaining the unique variation in MBVIOP (such as PU27903 or PU28218) were also determined as important by MOFA, and proteins such as 847 or 270 were also selected in both methods. For the total models, the same 2239 transcripts, 175 proteins and 32 metabolites were selected as important features by both methods.
MBVIOP and blocksPLS comparison for the Hybrid Aspen data
For the comparison between the MBVIOP and the blocksPLS methods, the number of variables used in the original and reduced models and the total explained variation are summarized in Tables 4–5. Both methods, as specified in section "Determination of variable importance in blocksPLS and MOFA for comparison to MBVIOP variable selection", used similar specifications (such as the number of components for explaining the predictive variation or the constraint/penalization degree). The percentages of explained variation obtained by the blocksPLS algorithm were inferior to the ones obtained by MBVIOP. MBVIOP was able to explain more total variance than blocksPLS. Furthermore, when generating the models with a reduced number of variables, MBVIOP improved the percentage of explained variation by using only the subset of MBVIOP selected variables for the new models instead of all original variables. On the contrary, the reduced models generated by blocksPLS explained less variance than the original blocksPLS model.
The overlap between the selected variables by MBVIOP and blocksPLS was assessed. For the moderately constrained (threshold of 0.5 a.u.) reduced MBVIOP and blocksPLS models, the same 4257 transcripts, 559 proteins, and 75 metabolites, were selected by both methods as important. For the normally constrained (threshold of 1.0 a.u.) reduced MBVIOP and blocksPLS models, the same 2053 transcripts, 207 proteins, and 33 metabolites, were selected by both methods as important. Considering the total number of variables selected by both methods (see Tables 4–5), this seems a good overlap for the variable selection performed using MBVIOP and blocksPLS. Besides, some variables mentioned in section Selection of the most relevant variables in systems biology multiblock analysis for enhanced model interpretation and dimensionality reduction were selected by both methods as important for interpreting the joint variation. For example, both MBVIOP and blocksPLS selected Win022_C04 as the most important variable in the metabolomics data, and proteins such as 1071, or transcripts such as PU07944, we selected for the proteomics and the transcriptomics data respectively.
Conclusions
A novel multiblock variable selection method, called multiblock variable influence on orthogonal projections (MBVIOP), has been tested and validated here. Evidence of its reliability, efficiency and usefulness have been shown. MBVIOP can assess in a reliable and efficient way the importance of both isolated and ranges of variables in any type of data. Furthermore, MBVIOP can deal with strong overlapping of types of variation, as well as with many data blocks with very different dimensionality. In addition, MBVIOP connects the variables of different data matrices according to their relevance for the data interpretation of each latent variable (component) of an OnPLS model.
MBVIOP also takes advantage of the full symmetry of the OnPLS model, which points at some advantages over the combination of sequential multiblock modelling techniques and variable selection methods. In sequential multiblock regression, even if the parameters keep the information of all parts of the sequence (i.e., other blocks of the multiblock dataset), the sequential approach only allows the weighting of the variables in a unique path (sequence) previously established, without any symmetry. Thus, the possibility of taking into account shared influences of the variables in other combinations, not considered by the preestablished path, is missing. MBVIOP uses the symmetry of OnPLS for establishing fairer relationships/influences between variables of different blocks iterating over all components and all blocks, i.e. considering all combinations. In addition, it is worth emphasizing the ability of VIP_{OPLS} [39], VIP_{O2PLS} [48] and MBVIOP to uncover the variables that are important for the uncorrelated (orthogonal) variation. However, for enhanced model interpretability, the synthetic example (section Evidence of the reliability and the efficiency of MBVIOP using synthetic data) has shown how MBVIOP surpasses any try of variable importance assessment done by means of OnPLS p loadings. More specifically, MBVIOP provides a correctly proportionated importance assessment of the variables, even when the profiles are affected by high overlapping or when there is an outsizing number of variables related to a specific type of variation, assessment that cannot be achieved by the normalized OnPLS loadings.
MBVIOP has been compared to blocksPLS and MOFA multiblock methods. Even if the comparisons are limited by the component distribution assessed by each method, the modelling and variable selection performed led to interesting conclusions. In relation to the modelling, MBVIOP explained a higher percentage of total variation than MOFA and blocksPLS. For the feature selection, when using synthetic data, the variables selected by MBVIOP and MOFA seemed to be consistent; however, when using real omics data, even if some of the most important variables were selected in both methods, differences in the final sorting seemed to rise when the values of the weights of the ranked variables were too adjusted. The overlapping of selected variables between blocksPLS and MBVIOP, and MOFA and MBVIOP, were both significant, consistent, and similar in number of variables. It is also worth mentioning, that MBVIOP was able to keep the proportionality in the variable importance assessment (e.g., showed as a peak variable 13 of the synthetic data because of explaining more variation than the other variables); however, MOFA did not keep this proportionality as explained in the Results section.
Nevertheless, it is interesting to compare the results for the Marzipan example obtained here with the ones obtained in 2017 [48], for the NIRS2 and the IR data blocks, using an O2PLS model and the VIP_{O2PLS} variable selection method. As expected, the importance assessments are very similar. However, the absence of the other four data blocks in the VIP_{O2PLS} variable selection [48] made the establishment of a clear relationship between the variables of the two present blocks and the variables of the four absent blocks totally impossible, which led to classify those variables as containers of orthogonal variation; however, when the variable assessment was performed in a sixblock multiblock analysis with MBVIOP, the same variables were selected as relevant for explaining variation shared between NIRS2 and the other data blocks (e.g., variables around 1200 nm, 1400 nm and 1800 nm). Hereby, when using all the blocks in a full multiblock system, the assessment was improved in relation to the twoblock combination analysis.
MBVIOP was able to reduce the number of variables of an OnPLS model (in a third for the Hybrid Aspen example) and, at the same time, increase the model interpretability. Besides, it has been shown that MBVIOP is a variable selection method à la carte for OnPLS models that allows to target a concrete type of variation (global, local or unique), or, if desired, target the total model, for afterwards building a stronger reduced OnPLS model with better interpretability than the original model.
The above achievements entail valuable advantages for industry and research groups (e.g., time optimization, fast and reliable variable selection, or enhanced interpretation in multiblock analysis). We envisage the use of MBVIOP in fields like chemistry, biology, medicine, psychology, economy, physics, cybernetics, and engineering, inter alia. Since VIP_{OPLS} [39] can be applied to both OPLS® and PLS models, it is expected by the authors that MBVIOP could be successfully applied not only to OnPLS models but also to multiblock PLS (e.g., MBPLS and HPLS models). This should lead to a more reliable and accurate variable sorting/selection in the MBPLS analysis than using other methods because of the more efficient and detailed weighting of the variables (especially due to the further connectivity ability, and the use of not only the amount of variation in Y explained by the model SSY but also the explained amount of variation in X SSX) of MBVIOP compared to PLSVIP (VIP_{PLS}) method applied to multiblock analysis. The verification of the latter hypotheses is part of future work.
Methods
General notation
Scalars are written using italic characters (e.g. h, and H), vectors are typed in bold lowercase characters (e.g. h), and matrices are defined as bold uppercase characters (e.g. H). When necessary, the dimensions of the matrices are specified by the subscript r x c, where r is the number of rows and c is the number of columns. Transposed matrices are marked with the superscript T. The symbol ○ indicates a Hadamard power or product. Matrix elements are represented by the corresponding matrix italic lowercase character adding as subscripts the row and the column where they are located (e.g., for an H matrix, an element located in row i and column k would be indicated as h_{ik}). Model components are represented by a. Subscripts g, l and u stand for global, local and unique respectively. The units a.u. stand for arbitrary units for the MBVIOP values. Notation referring to specific cases is explained insitu.
Determination of the variable importance in OnPLS models
MBVIOP is a model based variable selection method that uses a number n of preprocessed data matrices (D), and the scores (t) and the normalized loadings (p) from an OnPLS model. The Hadamard products of the normalized loadings (denoted as p^{○2}, i.e. p ○ p) are computed, and afterwards, they are multiplied by the ratio between the variation explained by the corresponding model component and the cumulated variation. The latter sum of squares (SS) ratio helps to assess the variable importance focusing on interpretability, i.e. the SS ratio helps to know which variables are more helpful to explain the maximum amount of variation. The scores are used for the calculation of the residuals prior to computation of the sum of squares. The MBVIOP values, which will conform the MBVIOP vectors, are obtained by iterative calculations among both the components (latent variables) and the data matrices, with specific combinations according to the type of variation. As final step, the square root is taken, and a normalization is performed by applying the Euclidean norm (2norm) and multiplying by the number of manifest variables raised to the ½ power. The latter explanation is the general procedure for all types of variation (see Fig. 1), details and specifications are provided below. We also describe the calculations, equations (for the unique, the local, the global, and the total variations), and how to interpret the results provided by the MBVIOP algorithm, in the subsequent sections.
Threshold of MBVIOP values for importance assessment
The threshold for importance assessment according to the MBVIOP values is similar to VIP_{OPLS} [39] and VIP_{O2PLS} [48] cases. Generally, variables with MBVIOP values higher than 1 are considered important for the model interpretation, whereas variables with MBVIOP values below 1 could be considered irrelevant. Since the sum of squares of all MBVIOP values is equal to the number of manifest variables of the respective data matrix, the average MBVIOP is equal to 1; therefore, if all variables would have the same contribution to the OnPLS model, they would have MBVIOP values equal to 1. The threshold is represented in all plots by a red horizontal line at MBVIOP = 1 for fast visual assessment. However, since this is a datadriven methodology, there can be special cases that justify the use of other threshold values according to either the goal of the variable selection or the demand level of dimensionality reduction, as shown in section "Selection of the most relevant variables in systems biology multiblock analysis for enhanced model interpretation and dimensionality reduction.
Calculation of MBVIOP for the unique components
The first computation performed in the algorithm is the unique MBVIOP (Eq. 1), which allows to assess the importance of the variables related to the unique information contained in each data block. It is worth noting that the unique information contained in the unique variation (exclusive of one block, i.e. not shared with other blocks) can be elucidated focusing on a reduced subset of important variables selected by MBVIOP without need to inspect all variables. This subset of important variables is found using Eq. 1.
In Eq. 1, d_{i} indicates which data block we are referring to, K is the number of manifest (input) variables of the data block, A_{u} represents the total number of unique components (unique latent variables), a_{u} indicates a specific unique component, p corresponds to the normalized loadings extracted from the OnPLS model, SSD_{au,di} stands for sum of squares of a data block for an a_{u}^{th} component, SSD_{cum,di} stands for the cumulated sum of squares of a data block, and the Euclidean normalization is indicated using the subscript 2 and enclosing the normalized expression between doubleline brackets.
Calculation of MBVIOP for the local components
MBVIOP_{Local} gives values higher than 1 to those input variables that are important for explaining the variation (information) of a specific local component in an OnPLS model. The local MBVIOP (Eq. 2) is calculated iterating among all the local components, selecting the blocks that have variables locally connected (see Fig. 1), and leaving out any data block that is related to either global variation or local variation linked to a different local component. Furthermore, the local part of the MBVIOP algorithm is constrained to ignore the connection of a data block with itself, since this would increase the importance of the locally connected variables in relation to the whole model variable influence, making the weighting system unfairly favorable to the variables with locally shared information.
In Eq. 2, the local MBVIOP calculation is summarized. The calculation iterates among all the local components A_{l}, and the local MBVIOP values for each local component are calculated considering all the combinations (direct and reverse) of the locally connected blocks, here denoted D_{LC}. It should be mentioned that D_{LC} includes the data block d_{i} and also the blocks connected to it (d_{LC}) in Eq. 2. For instance, in a multiblock analysis involving four or more data blocks, if the variation of a local component is shared by three blocks, the corresponding local MBVIOP values will be calculated using exclusively these three blocks in an iterative and exchangeable way either to provide the normalized loading (p) or to provide the sum of squares values (SSD). In the end, all three connected blocks will have contributed as both d_{i} and d_{LC} according to the specific ongoing calculation.
The iterative computation of the local MBVIOP is condensed in Eq. 2, where A_{l} represents the total number of local components, a_{l} stands for a specific local component, β (beta) represents the connectivity degree, SSD_{al,dLC} stands for sum of squares explained by an a_{l}^{th} component for a data block d_{LC}, SSD_{cum,dLC} is the cumulated sum of squares of the data block d_{LC}. The rest of nomenclature is analogous to section “Calculation of MBVIOP for the unique components”.
The connectivity degree β is based on the number of local connections, which makes MBVIOP different from VIP_{O2PLS}, since the latter uses the number of local components. It is worth noting that in VIP_{O2PLS} the number of local components will always be equal to the number of local connections among blocks since there are only twoblock connections (since O2PLS cannot handle more than two blocks). However, in MBVIOP, there can be connections among more than two blocks related to the same local component, which implies that the number of local components will not match the number of connections. Hereby, the connectivity degree is different in MBVIOP.
Calculation of MBVIOP for the global components
MBVIOP_{Global} pinpoints the variables that are relevant for explaining the variation (information) that is shared by all the data blocks related to a specific global component (these variables would be the ones filled in white inside the grey zone of Fig. 1), e.g., a common biological effect present in all data matrices. The global MBVIOP (Eq. 3) is calculated by iterating over all the data block combinations (direct and reverse modes) and all the global components. In Eq. 3, for a more intuitive explanation, d_{i} is used as the data block to which the normalized loading of an iteration belongs, and d_{j} as the data block to which the SSD values of an iteration belong. The blocks exchange these roles on the spot (i.e., at the exact iteration corresponding to a specific calculation); thus, all D data blocks are used as both d_{i} and d_{j}, but in different moments of the global MBVIOP computation.
In Eq. 3, A_{g} represents the total number of global components (global latent variables), a_{g} indicates a specific global component, SSD_{ag,dj} stands for sum of squares of an a_{g}^{th} component related to a data block d_{j}, and SSD_{cum,dj} stands for the cumulated sum of squares of the data block d_{j}, and the rest of nomenclature is analogous to Eqs. 1 and 2.
Calculation for the total variable influence for interpreting the whole model
The overview of which variables are more relevant for the total model interpretation (i.e., considering the global, the local and the unique variations involved in the OnPLS model) is highly appreciated in industrial environments; this is achieved by MBVIOP_{Total}. In the total MBVIOP the contributions of the global, local and unique MBVIOP vectors are joined achieving a proper weighting of all variables for the total variable influence on all projections. Equation 4 summarizes its computation.
The nomenclature of Eq. 4 is analogous to the nomenclature mentioned in the previous sections. As in the other cases, MBVIOP leads to a vector which contains the MBVIOP values for the variables of each data block (but the calculations take all blocks into consideration). As it will be explained in section “Graphical representation of the MBVIOP results for variable importance assessment”, the visualization by plotting the MBVIOP vectors for each block is one of the various options.
Graphical representation of the MBVIOP results for variable importance assessment
Equations 1–4 lead to four MBVIOP vectors (i.e., MBVIOP_{Unique}, MBVIOP_{Local}, MBVIOP_{Global}, MBVIOP_{Total}). It is always possible to look at the numerical values of MBVIOP for each variable of the OnPLS model to assess their importance for the data interpretation. However, this can become a very timeconsuming and painstaking task. Hence, a reduced table containing only target variables and its MBVIOP values, or a graphical representation of these MBVIOP vectors, seem a more convenient way to present the results. The MATLAB code created for MBVIOP allows several ways to plot the results; for this paper, blockwise plots have been chosen (even though the calculation of each MBVIOP has involved all the data blocks because of being a multiblock variable sorting). Other graphical representations could be possible; in a case where all data blocks of the OnPLS model would contain the same manifest variables, it would be possible to make a 3D (cube) plot locating the manifest variables on the Xaxis, labeling the data blocks on the Yaxis, and inserting the MBVIOP values on the Zaxis (the vertical one); this visualization becomes ideal for matrices with the same variables (e.g., in some comparison studies), but it is not recommended when the data blocks have different variables (which is frequently the case).
In section “Results and discussion”, the results were represented visualizing the MBVIOP values for each data block (by rows in the figures), and for type of variation interpreted by the variables (by columns); thus, each column of plots separately represents the unique, the local, the global and the total MBVIOP results (Fig. 2 can be used as an example). As mentioned in section “Threshold of MBVIOP values for importance assessment”, a threshold at MBVIOP = 1 (represented by a red horizontal line) is included in each plot; variables with values above the red line are relevant for the interpretation of the type of variation corresponding to the plotted MBVIOP. The variables of different blocks that contribute to explain the same variation (e.g., a common biological effect among data blocks, or a common feature of several instruments) are marked with the same color in all blockwise plots (see Fig. 2).
Determination of variable importance in blocksPLS and MOFA for comparison to MBVIOP variable selection.
Variable importance assessment using MOFA on the SD16365GLU and the Hybrid Aspen data
MOFA [33] performs unsupervised data integration aiming to uncover the principal sources of variation in multiomics datasets, and, in some aspects, it can be seen as a statistical generalization of principal component analysis for omics data. MOFA infers a set of factors (model components) that contain biological or technical variation that can be either shared by multiple data matrices or unique of a specific data matrix. MOFA achieves factorwise sparsity by identifying factors (model components), but also featurewise sparsity by means of the variable weights.
For the synthetic data, a MOFA model was generated yielding 8 latent factors (model components). The weights were plot as shown in Additional file 1: Figure S2 and the explained variation was calculated. For the Hybrid Aspen data, an 8component MOFA model was generated. Due to MOFA characteristics, 314 protein variables needed to be removed because of having nearly zero variance, and the model was built with the remaining 2818 protein variables. The absolute loadings were plotted, and the explained variation of the model calculated.
Variable importance assessment using blocksPLS
BlocksPLS [31, 32] is a onestep method that combines data integration and variable selection by using partial least squares (PLS). Sparsity is achieved by applying a LASSO penalization of the PLS loading vectors when computing a singular value decomposition. The Q2 parameter is used to select the number of model components, and the root mean square error of prediction serves as criterion to evaluate the predictive power of the variables between the original (nonpenalized) PLS model and the sparse PLS model. Therefore, the resulting selected variables are appropriate for prediction purposes.
In order to compare the feature selection results of MBVIOP and blocksPLS, three 6component blocksPLS models were generated using different constraint degrees for the Hybrid Aspen data (see Table 5). Both canonical and regression modes were tested, leading to better results when the canonical approach was used. The model was built using the canonical mode available from the mixOmics Rpackage that is appropriate to ensure that all data matrices are considered descriptors in a symmetric framework similar to the one used in MBVIOP. A design matrix was set to maximize correlations among the data blocks. The resulting selected variables and the percentage of total explained variation were compared to MBVIOP.
Materials and software
The code of the MBVIOP algorithm was developed using MATLAB version R2019b (The MathWorks Inc., Natick, MA, USA). The fourblock synthetic data set (SD16_235GLU), the blockscaling preprocesses, the OnPLS models, and the MBVIOP results (values and plots) were also done using MATLAB (The MathWorks Inc., Natick, MA, USA). The Marzipan dataset [53] was provided by the University of Copenhagen through the website www.models.life.ku.dk/Marzipan, and preprocessed using PLStoolbox version 8.1.1 (Eigenvector Research, Inc.). The blocksPLS analysis was performed using the mixOmics Rpackage version 6.8.5. The MOFA analysis was performed using the MOFA Rpackage version 1.6.1.
Synthetic dataset (four blocks)
The synthetic dataset, named SD16_235GLU, was created by the authors for testing and validating the MBVIOP MATLAB code. The name of the dataset, SD16_235GLU, stands for synthetic data (SD) designed in 2016 for having 2 global components (G), 3 local components (L), and 5 unique components (U). The dataset is conformed of four data blocks (D_{1}, D_{2}, D_{3}, D_{4}) and 50 observations (samples) common to all blocks. The first block (D_{1}) contains 61 manifest variables, the second block (D_{2}) contains 79, and the third and fourth blocks (D_{3} and D_{4}) contain 96 manifest variables each one. The joint (predictive) normalized loadings (p_{g}, p_{l}) were created using Gaussian pure profiles, which are visualized as a bell shape in the plots; whereas the unique (orthogonal) normalized loadings (p_{u}) were created using unit pulse pure profiles, visualized as a rectangular step in the plots. The scores, both predictive (t_{g}, t_{l}) and orthogonal (t_{u}), were randomly generated, meancentered, scaled to unit norm, and orthogonalized among themselves. The latent variables (components) were calculated as the individual products of scores and transposed normalized loadings (t_{a*}p_{a}^{T}). Finally, the four data blocks were created as the sum of global, local and unique components plus the residual matrices R. The noise was randomized, and its level was set to 0.1%. A generic Dblock is described in Eq. 5; where A_{g} stands for the total number of global components, A_{l} represents the total number of local components, and A_{u} the total number of unique components. All blocks follow the pattern of Eq. 5.
Equations 6 – 9 show the combination of components for each data matrix. To simulate a global component, the corresponding score vector (t_{ag}) was shared among all blocks; for the local components, the corresponding score vector (t_{al}) was shared among the locally connected blocks for that specific local component; and for the unique components individual scores (t_{au}) were used.
The SD16_235GLU was designed (i) to be exigent/difficult in relation to the five unique components when modelling, (ii) to have one local component shared by three data blocks (D_{2}, D_{3}, D_{4}), (iii) to have a local component shared by D_{1} and D_{4}, (iv) to have a local component shared by D_{1} and D_{2}, and (v) to have two global components shared by all data blocks. The percentage of variation per component is: 14.3% in D_{1}, 25% in D_{2}, 25% in D_{3}, and 20% in D_{4} (thus, D_{1} has a total of seven components, D_{2} has four, D_{3} also four, and D_{4} has five).
Marzipan dataset (six blocks).
The Marzipan dataset consists of six data blocks obtained from the analysis of thirtytwo marzipan samples, of nine different recipes, performed using six different spectrometers setups. The marzipan samples contained different amounts of almonds, apricot kernels, water, sucrose, invert sugar, glucose syrup, and minor contributions of additives; cocoa was added in some of the marzipan samples, giving them a distinctive brown color. The six spectrometers (including optical principles, spectral range, and other details) were described by Christensen et al. [53] in 2004. An additional set of measurements using an InfraAlyzer 260 spectrometer was originally considered as a seventh data block [53], but it has been excluded from this work because of not using exactly the same samples than the other six instrumental analyses.
The first data block (NIRS1) contained 1000 variables (400—2500 nm), and the second data block (NIRS2) had 600 variables (800—2100 nm); both NIRS1 and NIRS2 datasets were obtained using a NIRSystems 6500 spectrometer. The third (from an Infraprover II instrument) contained 406 variables, the fourth (from a Bomem MB 160 Diffusir) consisted of 664 variables, the fifth (from an Infratec 1255) had 100 variables, and the sixth (from a PerkinElmer System 2000) had 950 variables. Thus, the dimensions of the different data blocks varied from 100 to 1000 variables (i.e., a ten times difference between the smallest and the largest). NIRS1, Infraprover II and Bomem data blocks were preprocessed by extended multiplicative signal correction (EMSC) [54]; whilst NIRS2, Infratec and PerkinElmer data blocks were preprocessed by SavitskyGolay differentiation (2nd derivative, 3rd order, 15 points window) [55]. In addition, all data blocks were meancentered and normalized to equal sum of squares before building the OnPLS model.
Metabolomics, proteomics and transcriptomics data of hybrid aspen (three blocks)
The Hybrid Aspen dataset used here, previously pretreated and analyzed in Bylesjö et al. [56] in 2009 and in Löfstedt et al. [57] in 2013, contains thirtythree samples of hybrid aspen (Populus tremula x Populus tremuloides) labeled according to the plant internode from where they were sampled (categories A, B, and C) and according to three different genotypes of hybrid aspen (WT, G5, and G3). The wild type (WT) played the role of reference sample. The G5 and G3 genotypes were related to the PttMYB21a gene, which is known to primarily affect lignin biosynthesis and plant growth characteristics. The G5 genotype contained several antisense constructs of the PttMYB21a gene, affecting plant growth; thus, this genotype displays a distinct phenotype with slower growth compare to the WT samples. The G3 genotype contained only one antisense construct of the PttMYB21a gene, displaying a similar but less distinct phenotype compared to the G5 samples. Further details are described by Bylesjö et al. [56].
All thirtythree samples were measured for transcript (cDNA), protein (UPLC/MS) and metabolite (GC/TOFMS) quantities [57]. As result, three data blocks were obtained: a transcript data block containing 14,738 variables (microarray elements), a protein data block containing 3132 variables (extracted chromatographic peaks), and a metabolite data block containing 281 variables (extracted chromatographic peaks).
Availability of data and materials
The Marzipan dataset analyzed in the current study is available through the website www.models.life.ku.dk/Marzipan of the University of Copenhagen. The Hybrid Aspen and the SD16_235GLU datasets are available from the authors on reasonable request.
Abbreviations
 blocksPLS:

Multiblock sparse partial least squares
 CPCA:

Consensus principal component analysis
 GSVD:

Generalized singular value decomposition
 HPCA:

Hierarchical principal component analysis
 HPLS:

Hierarchical partial least squares
 JIVE:

Joint and individual variation explained
 MBPLS:

Multiblock partial least squares
 MBVIOP:

Multiblock variable influence on orthogonal projections
 MOFA:

Multiomics factor analysis
 msPLS:

Multiset sparse partial least squares
 OPLS:

Orthogonal projections to latent structures
 O2PLS:

2Block orthogonal projections to latent structures
 OnPLS:

Nblock orthogonal projections to latent structures
 PCA:

Principal component analysis
 PLS:

Partial least squares to latent structures
 RGCCA:

Regularized generalized canonical correlation analysis
 RMSEP:

Root mean square error of prediction
 SGCCA:

Sparse generalized canonical correlation analysis
 sPLS:

Sparse partial least squares
 SSX:

Sum of squares of X
 SSY:

Sum of squares of Y
 VIP:

Variable influence on projection
References
Horst P. Relations among m sets of measures. Psychometrika. 1961;26:129–49.
Levin J. Simultaneous factor analysis of several Gramian matrices. Psychometrika. 1966;31:413–9.
Curran PJ, Hussong AM. Integrative data analysis: The simultaneous analysis of multiple data sets. Psychol Methods. 2009;14:81–100.
Kourti T, Nomikos P, MacGregor JF. Analysis, monitoring and fault diagnosis of batch processes using multiblock and multiway PLS. J Process Control. 1995;5:277–84.
Westerhuis JA, Kourti T, MacGregor JF. Analysis of multiblock and hierarchical PCA and PLS models. J Chemom. 1998;12:301–21.
Frank I, Feikema J, Constantine N, Kowalski B. Prediction of product quality from spectral data using the partial leastsquares method. J Chem Inf Comput Sci. 1984;24:20–4.
Mazerolles G, Boccard J, Hanafi M, Rudaz S. Analysis of experimental design with multivariate response: a contribution using multiblock techniques. Chemom Intell Lab Syst. 2011;106:65–72.
Conesa A, PratsMontalbán JM, Tarazona S, Nueda MJ, Ferrer A. A multiway approach to data integration in systems biology based on Tucker3 and NPLS. Chemom Intell Lab Syst. 2010;104:101–11.
Reinke SN, GalindoPrieto B, Skotare T, Broadhurst DI, Singhania A, Horowitz D, Djukanović R, Hinks TSC, Geladi P, Trygg J, Wheelock CE. OnPLSbased multiblock data integration: a multivariate approach to interrogating biological interactions in asthma. Anal Chem. 2018;90:13400–8.
Qannari EM, Wakeling I, Courcoux P, MacFie HJH. Defining the underlying sensory dimensions. Food Qual Prefer. 2000;11:151–4.
Tenenhaus M, Pagès J, Ambroisine L, Guinot C. PLS methodology to study relationships between hedonic judgements and product characteristics. Food Qual Prefer. 2005;16:315–25.
Geladi P, Kowalski BR. Partial leastsquares regression: a tutorial. Anal Chim Acta. 1986;185:1–17.
Wold S, Martens H, Wold H. The multivariate calibrationproblem in chemistry solved by the PLS method. Lecture Notes Math. 1983;973:286–93.
Wold, S., Hellberg, S., Lundstedt, T., Sjöström, M. & Wold, H. PLS modeling with latent variables in two or more dimensions. in Symposium on PLS model building: theory and application. (1987).
Geladi, P., Martens, H., Martens, M., Kalvenes, S. & Esbensen, K. Multivariate comparison of laboratory measurements. in Proc. Symposium on Applied Statistics 49–61 (1988).
Wold S, Kettaneh N, Tjessem K. Hierarchical multiblock PLS and PC models for easier model interpretation and as an alternative to variable selection. J Chemom. 1996;10:463–82.
Wangen LE, Kowalski BR. A multiblock partial least squares algorithm for investigating complex chemical systems. J Chemom. 1988;3:3–20.
Wise BM, Gallagher NB. The process chemometrics approach to process monitoring and fault detection. J Process Control. 1996;6:329–48.
Tenenhaus A, Tenenhaus M. Regularized generalized canonical correlation analysis. Psychometrika. 2011;76:257–84.
Qin SJ, Valle S, Piovoso MJ. On unifying multiblock analysis with application to decentralized process monitoring. J Chemom. 2001;15:715–42.
el Bouhaddani S, Uh HW, Jongbloed G, Hayward C, Klarić L, Kiełbasa SM, HouwingDuistermaat J. Integrating omics datasets with the OmicsPLS package. BMC Bioinform. 2018;19:371.
Trygg J. O2PLS for qualitative and quantitative analysis in multivariate calibration. J Chemom. 2002;16:283–93.
Smilde AK, Westerhuis JA, de Jong S. A framework for sequential multiblock component methods. J Chemom. 2003;17:323–37.
Gabrielsson J, Jonsson H, Airiau C, Schmidt B, Escott R, Trygg J. The OPLS methodology for analysis of multiblock batch process data. J Chemom. 2006;20:362–9.
Höskuldsson A. Multiblock and path modelling procedures. J Chemom. 2008;22:571–9.
Hanafi M, Kohler A, Qannari EM. Shedding new light on hierarchical principal component analysis. J Chemom. 2010;24:703–9.
Mazerolles G, Preys S, Bouchut C, Meudec E, Fulcrand H, Souquet JM, Cheynier V. Combination of several mass spectrometry ionization modes: a multiblock analysis for a rapid characterization of the red wine polyphenolic composition. Anal Chim Acta. 2010;678:195–202.
El Ghaziri A, Cariou V, Rutledge DN, Qannari EM. Analysis of multiblock datasets using ComDim: overview and extension to the analysis of (K + 1) datasets. J Chemom. 2016;30:420–9.
Jourdren S, SaintEve A, Panouillé M, Lejeune P, Déléris I, Souchon I. Respective impact of bread structure and oral processing on dynamic texture perceptions through statistical multiblock analysis. Food Res Int. 2016;87:142–51.
Smilde, A., Bro, R. & Geladi, P. Multiway analysis: applications in the chemical sciences. in 118221349 (John Wiley and Sons, 2004).
Lê Cao KA, Rossouw D, RobertGranié C, Besse P. A sparse PLS for variable selection when integrating omics data. Stat Appl Genet Mol Biol. 2008;7:35.
Rohart, F., Gautier, B., Singh, A. & Lê Cao, K.A. mixOmics: An R package for ‘omics feature selection and multiple data integration. PLOS Computational Biology 13, e1005752 (2017).
Argelaguet, R., Velten, B., Arnol, D., Dietrich, S., Zenz, T., Marioni, J. C., Buettner, F., Huber, W. & Stegle, O. Multi‐omics factor analysis—a framework for unsupervised integration of multi‐omics data sets. Mol Syst Biol. 14, (2018).
Löfstedt T, Trygg J. OnPLSa novel multiblock method for the modelling of predictive and orthogonal variation. J Chemom. 2011;25:441–55.
Lock EF, Hoadley KA, Marron JS, Nobel AB. Joint and individual variation explained (JIVE) for integrated analysis of multiple data types. Ann Appl Stat. 2013;7:523–42.
Van Loan CF. Generalizing the Singular Value Decomposition. SIAM J Numer Anal. 1976;13:76–83.
Csala A, Zwinderman AH, Hof MH. Multiset sparse partial least squares path modeling for high dimensional omics data analysis. BMC Bioinform. 2020;21:9.
Andersen CM, Bro R. Variable selection in regressiona tutorial. J Chemom. 2010;24:728–37.
GalindoPrieto B, Eriksson L, Trygg J. Variable influence on projection (VIP) for orthogonal projections to latent structures (OPLS). J Chemom. 2014;28:623–32.
Kvalheim OM, Arneberg R, Bleie O, Rajalahti T, Smilde AK, Westerhuis JA. Variable importance in latent variable regression models. J Chemom. 2014;28:615–22.
Leardi R. Genetic algorithms in chemometrics and chemistry: a review. J Chemom. 2001;15:559–69.
Lindgren, F., Geladi, P., Rännar, S. & Wold, S. Interactive variable selection (IVS) for PLS. Part 1: Theory and algorithms. J Chemom. 8, 349–363 (1994).
Lindgren, F., Geladi, P., Berglund, A., Sjöström, M. & Wold, S. Interactive variable selection (IVS) for PLS. Part II: Chemical applications. J. Chemom. 9, 331–342 (1995).
GalindoPrieto B, Eriksson L, Trygg J. Variable influence on projection (VIP) for OPLS models and its applicability in multivariate time series analysis. Chemom Intell Lab Syst . 2015;146:297–304.
Farrokhnia M, Karimi S. Variable selection in multivariate calibration based on clustering of variable concept. Anal Chim Acta. 2016;902:70–81.
Nørgaard L, Saudland A, Wagner J, Nielsen JP, Munck L, Engelsen SB. Interval partial leastsquares regression (iPLS): A comparative chemometric study with an example from nearinfrared spectroscopy. Appl Spectrosc. 2000;54:413–9.
Tibshirani, R. Regression Shrinkage and Selection via the Lasso. J R Stat Soc. Ser B (Methodological) 58, 267–288 (1996).
GalindoPrieto B, Trygg J, Geladi P. A new approach for variable influence on projection (VIP) in O2PLS models. Chemom Intell Lab Syst . 2017;160:110–24.
Tenenhaus A, Philippe C, Guillemot V, Le Cao KA, Grill J, Frouin V. Variable selection for generalized canonical correlation analysis. Biostatistics. 2014;15:569–83.
Wold, S., Johansson, E. & Cocchi, M. PLS  partial leastsquares projections to latent structures. 3D QSAR Drug Design (Ed. Kubinyi H.), Theory Methods and Applications, ESCOM Science Publishers, Leiden 523–550 (1993).
GalindoPrieto, B. Novel variable influence on projection (VIP) methods in OPLS, O2PLS, and OnPLS models for singleand multiblock variable selection: VIPOPLS, VIPO2PLS, and MBVIOP methods. (Umeå University, 2017).
Sunoj S, Igathinathane C, Visvanathan R. Nondestructive determination of cocoa bean quality using FTNIR spectroscopy. Comput Electron Agric. 2016;124:234–42.
Christensen J, Nørgaard L, Heimdal H, Pedersen J, Engelsen S. Rapid spectroscopic analysis of marzipan—comparative instrumentation. J Near Infrared Spectrosc. 2004;12:63–75.
Martens H, Stark E. Extended multiplicative signal correction and spectral interference subtraction: new preprocessing methods for near infrared spectroscopy. J Pharm Biomed Anal. 1991;9:625–35.
Savitzky A, Golay MJE. Smoothing and differentiation of data by simplified least squares procedures. Anal Chem. 1964;36:1627–39.
Bylesjö M, Nilsson R, Srivastava V, Grönlund A, Johansson AI, Jansson S, Karlsson J, Moritz T, Wingsle G, Trygg J. Integrated analysis of transcript, protein and metabolite data to study lignin biosynthesis in hybrid aspen. J Proteome Res. 2009;8:199–210.
Löfstedt T, Hoffman D, Trygg J. Global, local and unique decompositions in OnPLS for multiblock data analysis. Anal Chim Acta. 2013;791:13–24.
Acknowledgements
The authors want to thank the anonymous reviewers for helping to improve this paper, the Chemistry Department of Umeå University where MBVIOP was developed as part of the PhD thesis of BGP, and the University of Copenhagen for providing the Marzipan dataset via the website www.models.life.ku.dk/Marzipan.
Funding
The authors are grateful for the financial support given by MKS Instruments AB (BGP), eSSENCE (JT), and Industrial Doctoral School (BGP), Umeå University, Sweden. In addition, part of this work was carried out during the tenure of an ERCIM “Alain Bensoussan” Fellowship Programme (BGP). The funding body did not play any roles in the design of the study and collection, the analysis, the data interpretation, or the manuscript writing.
Author information
Authors and Affiliations
Contributions
For the MBVIOP algorithm, BGP generated the theory, equations, MATLAB code, results and figures for the three datasets during her PhD under the supervision of JT and PG. BGP also generated the R codes and results for the comparisons using the MOFA and the blocksPLS methods. JT provided the OnPLS models generated using the OnPLS algorithm/code. PG advised on the theory and equations of MBVIOP, method validation and spectroscopy interpretation. BGP wrote the manuscript draft, and PG checked it and improved it. The manuscript was revised and approved by all authors. All authors read and approved the final manuscript.
Corresponding authors
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Additional file 1
: Supporting information that contains Tables S1–S5 and Figures S1–S5.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
GalindoPrieto, B., Geladi, P. & Trygg, J. Multiblock variable influence on orthogonal projections (MBVIOP) for enhanced interpretation of total, global, local and unique variations in OnPLS models. BMC Bioinformatics 22, 176 (2021). https://doi.org/10.1186/s12859021040159
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s12859021040159