Skip to main content
  • Methodology article
  • Open access
  • Published:

Statistical integration of two omics datasets using GO2PLS



Nowadays, multiple omics data are measured on the same samples in the belief that these different omics datasets represent various aspects of the underlying biological systems. Integrating these omics datasets will facilitate the understanding of the systems. For this purpose, various methods have been proposed, such as Partial Least Squares (PLS), decomposing two datasets into joint and residual subspaces. Since omics data are heterogeneous, the joint components in PLS will contain variation specific to each dataset. To account for this, Two-way Orthogonal Partial Least Squares (O2PLS) captures the heterogeneity by introducing orthogonal subspaces and better estimates the joint subspaces. However, the latent components spanning the joint subspaces in O2PLS are linear combinations of all variables, while it might be of interest to identify a small subset relevant to the research question. To obtain sparsity, we extend O2PLS to Group Sparse O2PLS (GO2PLS) that utilizes biological information on group structures among variables and performs group selection in the joint subspace.


The simulation study showed that introducing sparsity improved the feature selection performance. Furthermore, incorporating group structures increased robustness of the feature selection procedure. GO2PLS performed optimally in terms of accuracy of joint score estimation, joint loading estimation, and feature selection. We applied GO2PLS to datasets from two studies: TwinsUK (a population study) and CVON-DOSIS (a small case-control study). In the first, we incorporated biological information on the group structures of the methylation CpG sites when integrating the methylation dataset with the IgG glycomics data. The targeted genes of the selected methylation groups turned out to be relevant to the immune system, in which the IgG glycans play important roles. In the second, we selected regulatory regions and transcripts that explained the covariance between regulomics and transcriptomics data. The corresponding genes of the selected features appeared to be relevant to heart muscle disease.


GO2PLS integrates two omics datasets to help understand the underlying system that involves both omics levels. It incorporates external group information and performs group selection, resulting in a small subset of features that best explain the relationship between two omics datasets for better interpretability.


With the advancements in high throughput technology, multiple omics data are commonly available on the same subjects. To identify a set of relevant related features across the omics levels, these datasets need to be integrated and analyzed jointly. For statistical integration of omics data, there are several challenges to overcome: complex correlation structure within and between omics data, high-dimensionality (\(p\gg n\), or “large p, small n”), heterogeneity between different omics datasets, and selection of relevant features in each dataset. To deal with the first two challenges, Partial Least Squares (PLS) has been proposed [1, 2]. Dimension reduction is achieved by decomposing two datasets X and Y into joint and residual subspaces. The joint (low-dimensional) subspace of one dataset represents the best approximation of X or Y based on maximizing the covariance of the two. However, by integrating two heterogeneous omics datasets, the PLS joint components also contain (strong) omic-specific variation. This heterogeneity can be caused by differences (e.g. between methylation and glycomics) in size, distribution, and measurement platform. Ignoring these omic-specific characteristics (variation specific to each of the data) in the model may lead to a biased representation of the underlying system. Two-way orthogonal partial least squares (O2PLS) [3, 4] was proposed to decompose each dataset into joint, orthogonal, and residual subspaces. The orthogonal subspaces in X and Y capture variation unrelated to each other, making the joint subspaces better estimates for the true relation between X and Y. Hence, O2PLS accounts for the heterogeneity of two omics datasets. However, the resulting low-dimensional latent components spanning the joint subspaces are linear combinations of all the observed variables. Therefore, to select a small subset of relevant features for better interpretation, one can impose sparsity on the loadings of the principal components. A straightforward approach is to ignore all loadings smaller than some threshold value, effectively treating them as zero, which can be misleading [5].

Several sparse methods based on PLS have been proposed. Chun and Keleş proposed sparse PLS (SPLS) [6] which fits PLS on a reduced X space, consisting of pre-selected X-variables using a penalized regression. Sparse PLS (sPLS) by Lê Cao et al. [7] imposes \(L_1\) penalty on the singular value decomposition (SVD) of the covariance matrix of X and Y, resulting in sparse loading vectors for both datasets. Often it is of interest to select a group of features instead of individual features, e.g. features within a gene or a pathway. By so doing, one can improve power by identifying aggregate effects of the selected features [8,9,10]. Liquet et al. extended sPLS to group PLS (gPLS) [10], imposing group-wise \(L_2\) penalties on the loadings of the pre-defined feature groups. It results in group-wise sparsity (i.e., features belonging to the same group will always be selected altogether).

In this work, we propose to extend O2PLS to incorporate sparsity, called Group Sparse O2PLS (GO2PLS). GO2PLS obtains sparse solutions by pushing a large number of small non-zero weights (or loading values) to zeros, instead of employing hard thresholding using arbitrary cut-off values. Therefore, GO2PLS constructs joint low-dimensional latent components representing the underlying systems involving both omics levels while taking into account the heterogeneity of different omics data, incorporates external biological information such as known group structure, and performs variable selection by imposing group-wise penalties on the loading vectors in the joint subspaces.

For illustration, we apply GO2PLS to datasets from two studies. Firstly, TwinsUK is a population based study [11, 12], where methylation (482K CpG sites) and 22 immunoglobulin G (IgG) glycans were measured. A previous research [13] suggested the presence of an indirect influence of methylation on IgG glycosylation that may in part capture environmental exposures. We integrate the two omics datasets, aiming to identify groups of CpG sites affecting IgG glycosylation. In the CVON-DOSIS case-control study [14], regulomics (histone modification) and transcriptomics data were measured on 13 hypertrophic cardiomyopathy (HCM) patients and 10 controls. Histone modification can have an impact on gene expression. Therefore we integrate the two omics datasets and identify a small set of regulatory regions and transcripts explaining this relationship. Moreover, the extreme imbalance in a high-dimensional setting (33K ChIP-seq and 15K RNA-seq vs 23 subjects) poses computational challenges. The resulting selected features are further studied using gene set enrichment analysis [15]. Several possible scenarios containing these characteristics are designed and investigated in an extensive simulation study.

This paper is organized as follows. In the methods section, an overview of O2PLS is presented, followed by the formulation of GO2PLS. Via a simulation study, we explore the properties of GO2PLS and compare its performance to other competitive methods. We then apply GO2PLS to integrate methylation and glycomics in the TwinsUK study and regulomics and transcriptomics in the CVON-DOSIS study. We conclude with a discussion and possible directions to further extend the method.


Data description

TwinsUK datasets

Whole blood methylation (using Infinium HumanMethylation450 BeadChip) and IgG glycomics (Ultra Performance Liquid Chromatography) data were measured on 405 independent individuals, among which 392 are females and 13 are males. The age ranges from 18 to 81, with a median of 58. The methylation dataset consists of beta values (ratio of intensities between methylated and unmethylated alleles) at 482563 CpG sites. CpG sites with missing values, on allosomes, or labeled cross-active [16] were removed. We kept only the CpG sites on CpG islands or surrounding areas (shelves and shores) that mapped to genetic regions. Age, sex, batch effect, and cell counts were corrected for using multiple regression. The glycomics dataset contains 22 glycan peaks. These peaks were normalized using median quotient (MQ) normalization [17], log-transformed, and adjusted for batch effect, age, and sex as well. The remaining 126299 CpG sites were then divided into 16892 groups based on their target genes (biological information from the UCSC database [18, 19]). No group information was available for the glycomics data.

CVON-DOSIS datasets

In the CVON-DOSIS study, regulomics and transcriptomics datasets were measured on the samples taken from the heart tissues of 13 HCM patients and 10 healthy controls. HCM is a heart muscle disease that makes it harder for the heart to pump blood, leading to heart failure. The regulomics data were measured using ChIP-seq, providing counts of histone modification H3K27ac in 33642 regulatory regions. The transcriptomics data contain counts of 15882 transcripts, measured by RNA-seq. The raw counts of regulomics data were normalized with reads per kilobase million (RPKM) to adjust for sequencing depth. Transcriptomics data were normalized with counts per million (CPM) with effective library size (estimated using the Trimmed Mean of M-values (TMM) method in EdgeR R package [20]). Further, both normalized data were log-transformed.

Two-way orthogonal partial least squares (O2PLS)

Let X and Y be two data matrices with the number of rows equal to the sample size N and the number of columns equal to the dimensionality p and q, respectively. Let the number of joint, X-orthogonal (unrelated to Y) and Y-orthogonal components be \(K\), \(K_x\) and \(K_y\), respectively, where \(K\), \(K_x\) and \(K_y\) are typically much smaller than p and q. The O2PLS model decomposes X and Y as follows:

$$\begin{aligned} X&= TW^{\top } + T_\perp P_\perp ^{\top } + E, \\ Y&= UC^{\top } + U_\perp Q_\perp ^{\top } + F. \end{aligned}$$

The relation between X and Y is captured through the inner relation between T and U,

$$\begin{aligned} U&=TB_T + H, \\ T&= UB_U + \tilde{H}. \end{aligned}$$

In this model, the scores are: \(T\, (N \times K), \, U \,(N \times K), \, T_\perp \, (N \times K_x), \, U_\perp \, (N \times K_y).\) They represent projections of the observed data X and Y to lower-dimensional subspaces. The loadings, \(W\,(p \times K), \, C\,(q \times K), \, P_\perp \,(p \times K_x), \, Q_\perp \,(q \times K_y),\) indicate relative importance of each X and Y variable in forming the corresponding scores. Further, \(E\,(N \times p), \, F\,(N \times q), \, H\,(N \times K), \, \tilde{H}\,(N \times K),\) represent the residual matrices.

In O2PLS, estimates of the joint subspaces are obtained by first filtering out the orthogonal variation. The filtered data matrices \({{\tilde{X}}}\) and \({{\tilde{Y}}}\) are constructed as follows:

$$\begin{aligned} {{\tilde{X}}}&= ( I_N - T_\perp (T_\perp ^{\top } T_\perp )^{-1} T_\perp ^{\top }) X, \\ {{\tilde{Y}}}&= ( I_N - U_\perp (U_\perp ^{\top } U_\perp )^{-1} U_\perp ^{\top }) Y, \end{aligned}$$

where \(T_\perp, U_\perp\) are estimates for the orthogonal subspaces, and \(I_N\) is identity matrix of size N. For more details see [3]. The joint parts maximize the covariance between the joint scores \(T = {{\tilde{X}}}W\) and \(U = {{\tilde{Y}}}C\). Here, W and C consist of loading vectors (\(w_1, \ldots , w_K\)) and (\(c_1, \ldots , c_K\)), which can be found as the right and left singular vectors of the covariance matrix \({{\tilde{Y}}}^{\top } {{\tilde{X}}}\) [4]. Calculating and storing \({{\tilde{Y}}}^{\top } {{\tilde{X}}}\) of dimension \(q \times p\) can be cumbersome for high dimensional omics data. Therefore we consider the following optimization problem sequentially for components \(k = 1, \ldots , K\):

$$\begin{aligned} \max _{\left\Vert c_k\right\Vert _2=1,\left\Vert w_k\right\Vert _2=1} c_k^{\top } {{\tilde{Y}}}_k^{\top } {{\tilde{X}}}_k w_k, \end{aligned}$$

where parameters \(w_k, \, c_k\) are the loading vectors of the k-th joint components and \({{\tilde{X}}}_k, \, {{\tilde{Y}}}_k\) are the filtered data matrices after \(k-1\) times of deflation. This can be solved efficiently using the Nonlinear Iterative Partial Least Squares (NIPALS) [21] algorithm, which starts with random initialization of the X-space score vector t and repeats a sequence of the following steps until convergence:

$$\begin{aligned} &\left. 1\right) c_k = \frac{{{\tilde{Y}}}_k^{\top } t}{t^{\top } t}, \quad \left. 2\right) \left\Vert c_k\right\Vert _2 \rightarrow 1, \quad \left. 3\right) u = {{\tilde{Y}}}_k c_k, \\&\left. 4\right) w_k = \frac{{{\tilde{X}}}_k^{\top } u}{u^{\top } u}, \quad \left. 5\right) \left\Vert w_k\right\Vert _2 \rightarrow 1 , \quad \left. 6\right) t = {{\tilde{X}}}_k w_k.\\ \end{aligned}$$

In step 1 and 4, \(Y_k\) and \(X_k\) are projected onto the X-space score vector t and the Y-space score u to get the loading vectors \(c_k\) and \(w_k\). The loading vectors are then unitized (step 2 and 5) and used to calculated the new scores u and t. Convergence of the algorithm is guaranteed. A detailed description and proof of optimality of the O2PLS algorithm can be found in [3, 4].

While standard cross-validation (CV) over a 3-dimensional grid is often used to determine the optimal number of components \(K\), \(K_x\), and \(K_y\), the procedure is not optimal for O2PLS, since there is not a single optimization criterion for all three parameters. As in [4], we use an alternative CV procedure that first performs a 2-dimensional grid search of \(K_x\) and \(K_y\), with a fixed \(K\), to optimize prediction performance of \(T \rightarrow U\) and \(U \rightarrow T\). Then a sequential search of optimal \(K\) is conducted to minimize the sum of mean squared errors (MSE) of prediction concerning \(X \rightarrow Y\) and \(Y \rightarrow X\).

Group sparse O2PLS (GO2PLS)

GO2PLS extends O2PLS by introducing a penalty in the NIPALS optimization on the filtered data \({{\tilde{X}}}\) and \({{\tilde{Y}}}\). This penalty encourages sparse, or group-sparse solutions for the joint loading matrices W and C, leading to a subset of the original features corresponding to non-zero loading values being selected in each joint component.

Briefly, we introduce an \(L_1\) penalty on each pair of joint loading vectors. The optimization problem for the k-th pair of joint loadings \(c_k, \, w_k\) is:

$$\begin{aligned} \max _{\left\Vert c_k\right\Vert _2=1,\left\Vert w_k\right\Vert _2=1} c_k^{\top } {{\tilde{Y}}}_k^{\top } {{\tilde{X}}}_k w_k + \lambda _c \left\Vert c_k\right\Vert _1 + \lambda _w \left\Vert w_k\right\Vert _1, \end{aligned}$$

where \(\lambda _c, \, \lambda _w\) are penalization parameters that regulate the sparsity level. The optimization problem (6) can be solved [22] by iterating over the k-th pair of joint loadings,

$$\begin{aligned} c_k = \frac{S({{\tilde{Y}}}_k^{\top } t, \lambda _c)}{\left\Vert S({{\tilde{Y}}}_k^{\top } t, \lambda _c)\right\Vert _2}, \; w_k = \frac{S({{\tilde{X}}}_k^{\top } u, \lambda _w)}{\left\Vert S({{\tilde{X}}}_k^{\top } u, \lambda _w)\right\Vert _2}, \end{aligned}$$

where \(t = {{\tilde{X}}}_k w_k\) and \(u = {{\tilde{Y}}}_k c_k\). Here, \(S(\cdot )\) is the soft thresholding operator: \(S(a, \text {const}) = sgn(a)(|a| - \text {const})_+\) (\(\text {const} \ge 0\) is a non-negative constant, \((x)_+\) equals to x if \(x > 0\) and equals to 0 if \(x \le 0\)).

To perform group selection, we impose group-wise \(L_2\) penalty on the joint loading vectors. Let \({{\tilde{X}}}\) and \({{\tilde{Y}}}\) be partitioned into \(J \,(J \le p )\) and \(M \,(M \le q )\) groups, respectively. The submatrices \({{\tilde{X}}}^{(j)}\) and \({{\tilde{Y}}}^{(m)}\) (\(j = 1,\ldots , J; \, m = 1, \ldots , M\)) contain the j-th and m-th group of variables, with corresponding loading vectors \(w^{(j)}\) (of size \(p_j\)) and \(c^{(m)}\) (of size \(q_m\)). The optimization problem for the k-th pair of loading vectors \(c_k = ({{c_k^{(1)}}}^\top , \ldots , {{c_k^{(M)}}^{\top }})^\top\) and \(w_k = ({{w_k^{(1)}}}^\top , \ldots , {{w_k^{(J)}}^{\top }} )^\top\) can be written as follows:

$$\begin{aligned} &\min _{c_k^{(m)},w_k^{(j)}} \left \{ -\sum_{j=1}^{J} \sum _{m=1}^{M} {c_{k}^{(m)}}^{\top } {{\tilde{Y}}}_{k}^{{(m)}^\top } {{\tilde{X}}}_k^{(j)} w_k^{(j)}\right.\\ & \quad \left.+\,\lambda _c \sum _{m=1}^{M} \sqrt{q_m} \left\Vert c_k^{(m)}\right\Vert _2 + \lambda _w \sum _{j=1}^{J} \sqrt{p_j} \left\Vert w_k^{(j)}\right\Vert _2\right. \\ & \quad \left. +\, \phi_c \left( \sum _{m=1}^{M} \left\Vert c_k^{(m)}\right\Vert _2^2 -1 \right) + \phi _w \left( \sum _{j=1}^{J} \left\Vert w_k^{(j)}\right\Vert _2^2 -1 \right) \right \}, \end{aligned}$$

where the last two terms are reformulations of the unit norm constraints on \(c_k\) and \(w_k\), with \(\phi _c\) and \(\phi _w\) being the Lagrangian multipliers. The effective penalization parameters on each group (\(\lambda _c, \, \lambda _w\)) are adjusted by the square root of the group size to correct for the fact that larger groups are more likely to be selected. This optimization problem can be solved using block coordinate descent (for details, see Additional file 1). The solution takes the form:

$$\begin{aligned} &c_k^{(m)} = \frac{\left( \left\Vert {{\tilde{Y}}}_k^{{(m)}^{\top }} t\right\Vert _2-\sqrt{q_m}\lambda_c\right) _+}{2\phi _c \left\Vert {{\tilde{Y}}}_k^{{(m)}^{\top }} t\right\Vert _2}\quad {{\tilde{Y}}}_k^{{(m)}^{\top }} t, \\ &w_k^{(j)} = \frac{\left( \left\Vert {{\tilde{X}}}_k^{{(j)}^{\top }} u\right\Vert _2-\sqrt{p_j}\lambda _w\right) _+}{2\phi _w \left\Vert {{\tilde{X}}}_k^{{(j)}^{\top }} u\right\Vert _2}\quad {{\tilde{X}}}_k^{{(j)}^{\top }} u. \end{aligned}$$

The \({{\tilde{X}}}\)-variables within the j-th group will have non-zero weights if \(\left\Vert {{\tilde{X}}}_k^{{(j)}^{\top }}u\right\Vert _2\) (i.e., the contribution of the whole group to the covariance) is larger than the size-adjusted penalization parameter \(\sqrt{p_j}\lambda _w\). In the same way, the \({{\tilde{Y}}}\)-variables within the m-th group will be assigned non-zero loading values if \(\left\Vert {{\tilde{Y}}}_k^{{(m)}^{\top }} t\right\Vert _2 > \sqrt{q_m}\lambda _c\).

Note that when all the groups have size 1, the summation of group-wise \(L_2\) penalties is equivalent to an \(L_1\) penalty on the unpartitioned loading vector and individual features will be selected (i.e., (8) reduces to (6)). In this specific case, to avoid confusion, we call the method Sparse O2PLS (SO2PLS). When the penalization parameters \(\lambda _w = \lambda _c = 0\), GO2PLS becomes to O2PLS. If the number of orthogonal components \(K_x = K_y = 0\), GO2PLS, SO2PLS, O2PLS are equivalent to gPLS, sPLS, and PLS, respectively.

The k-th pair of joint loadings are orthogonalized with respect to the previous \(k-1\) loading vectors. Let \(\pi\) be an index set for selected variables in \(w_k\). The orthogonalization is achieved by first projecting \(w_k^{(\pi )}\) onto \(span\{ w_1^{(\pi )}, \ldots , w_{k-1}^{(\pi )} \}\), and then subtracting this projection from \(w_k^{(\pi )}\). When the previous \(k-1\) components do not select any variable in \(\pi\), \(span\{ w_1^{(\pi )}, \ldots , w_{k-1}^{(\pi )} \}\) is actually a zero subspace and no orthogonalization is needed.

To determine the optimal sparsity level, it is more convenient and intuitive to focus on the number of selected \({{\tilde{X}}}\), \({{\tilde{Y}}}\) groups (donote \(h_x\), \(h_y\), respectively). If prior biological knowledge does not already specify certain \(h_x\) and \(h_y\), cross-validation can be used to search for combinations of \(h_x\) and \(h_y\) that maximize the covariance between each pair of estimated joint components \({\mathrm{Cov}}({\hat{t}}, {\hat{u}})\). Similar to LASSO [23], the “one-standard-error-rule” [24] can be applied to obtain a more stable CV result. The GO2PLS algorithm is described below:

figure a

Simulation Study

We evaluate the performance of GO2PLS in two scenarios. First, we investigate the ability to select the relevant groups under various scenarios, focusing on the joint subspace, where the group selection takes place. Second, we compare the performance of GO2PLS and SO2PLS with other methods: O2PLS, PLS, sPLS, and gPLS. We investigate joint score estimation, joint loading estimation, and feature selection performances.

In the first scenario, we set the number of variables in X and Y to be \(p = 5000\) and \(q = 20\), respectively. There are 10 groups of variables in X with non-zero loading values. The first 5 groups have group sizes of 100, 50, 20, 5, and 1, respectively, in which all the variables have loading values equal to 1. The remaining 5 groups are of size 10, with loading values of variables equal to 5. Note that large loading values are assigned to the latter 5 groups to make the detection of the first 5 groups more difficult. The remaining variables have zero loading values and are divided into groups of size 10. All the Y-variables have the same loading values and are not grouped. The sample size N is set to 30. We simulate both data matrices with 1 joint component (T and U from Equation 1 are both standard normally distributed and have correlation 1). We perform 1000 simulation runs and record the number of the runs GO2PLS selected relevant groups; we compute the proportion of each truly relevant group (with non-zero loadings) being selected across the simulation runs (number of times being selected divided by 1000). The group importance measurement \({\left\Vert {X^{(j)}}^{\top } U\right\Vert _2}/ {\sqrt{p_j}}\), that determines whether a group is selected or not is recorded for the first 5 groups (with loading value 1) to investigate the stability of the selection procedure.

In the second scenario, we vary the sample size N from 30 to 600, and set \(p = 20{,}000\) and \(q = 10{,}000\), mimicking the dimensionality of the CVON-DOSIS datasets. Both X- and Y- variables are evenly divided into 1000 groups. For each joint component, we select 50 relevant groups and assign non-zero loadings to the variables contained in them. Within each group, variables have the same loading values: 1 for the first group, 2 for the second,..., and 50 for the last relevant group. We set the number of joint components \(K = 2\) and the number of orthogonal components \(K_x = K_y = 1\). The scores \(T, T_\perp , U, U_\perp\) from Equation 1 are generated from normal distributions with zero mean. The relationship between the joint scores is represented by \(U = T + H\), where H accounts for \(20\%\) of the variation in U. The noise matrices E, F are generated from normal distributions with zero mean and variance such that the variance of the noise matrix accounts for a proportion \(\alpha\) (\(0< \alpha < 1\)) of the variance of the data matrix (i.e., \(\alpha = \mathrm{Var}(E)/\mathrm{Var}(X) = \mathrm{Var}(F)/\mathrm{Var}(Y)\)). The ratio of the variance of the orthogonal components to the variance of the joint components (\(\sigma ^2_{T_{\perp }} / \sigma _T^2\)), and noise level \(\alpha\) are varied. For evaluating the accuracy of the joint score estimation, we computed \(R^2_{\widehat{T}T} = 1 - \sum (\widehat{T} - T)^2/\sum T^2\) and \(R^2_{\widehat{T}\widehat{U}} = 1 - \sum (\widehat{U} - \widehat{T})^2/\sum \widehat{U}^2\), which quantify how well the true parameter T and the estimated Y-joint component \(\widehat{U}\) can be explained by the estimated X-joint component \(\widehat{T}\). The performance of feature selection and the accuracy of estimated loadings are evaluated by true positive rate (TPR = TP/(TP+FN), where TP = True Positive, FN = False Negative) and \(W^{\top } \widehat{W}\), which represents the cosine of the angle between the estimated loading vector and the true one. The performances of all methods are evaluated on an independent test dataset of size 1000. For each setting, 500 replications are generated.

An overview of scenario settings is presented in Tables 12. To make a clearer comparison of the behavior across all the methods, we use the optimum values for the tuning parameters (number of components and number of relevant variables or groups).

Table 1 Settings of Scenario 1 to study the performance of selecting relevant groups
Table 2 Settings of Scenario 2 to compare the performances regarding joint score estimation, joint loading estimation, and feature selection

Results of simulation study

Scenario 1

Fig. 1
figure 1

Simulation Scenario 1: Selection proportion of relevant groups with different sizes under varying noise. The proportion for larger groups is higher at low to moderate (\(\alpha < 0.7\)) noise levels, and shows robustness against increasing noise

Fig. 2
figure 2

Simulation Scenario 1: Density plot of estimated group importance measurement \(\left\Vert {X^{(j)}}^{\top } U\right\Vert _2/{\sqrt{p_j}}\) for each group size under 3 different noise levels. The vertical dotted red line is the average threshold. When the measurement of a group is larger than the threshold, the group is selected. The total area on the right side of the threshold under each density curve equals to the selection proportion for the corresponding group. The less the density curve spreads out, the more stable is the estimate

Figure 1 shows the selection proportion for each relevant group under each noise level. Compared to smaller groups, the proportion for larger groups is higher at low to moderate (\(\alpha < 0.7\)) noise levels, and shows robustness against increasing noise. When the noise level is very high (\(\alpha > 0.8\)), the method loses power to detect relevant group of any size, particularly, of larger size. Figure 2 shows the density of the group importance measurement \(\left\Vert {X^{(j)}}^{\top } U\right\Vert _2 /{\sqrt{p_j}}\) for the first 5 relevant groups with different group sizes under 3 different noise levels. The vertical dotted lines indicate the average threshold given the correct number of relevant groups. Since a group will be selected if exceeds the threshold, the total area on the right side of the threshold under each density curve equals the selection proportion for the corresponding group. The measurement for larger relevant group shows higher precision at all noise levels. The threshold increases along with the noise.

Scenario 2

Fig. 3
figure 3

Simulation Scenario 2: comparison of joint score estimation performance, under varying relative orthogonal signal strength (top row), and varying sample size (bottom row). On the Y-axis, \(R^2_{\widehat{T}T}\) (left) and \(R^2_{\widehat{T}\widehat{U}}\) (right) are the coefficient of determination of regressing T on \(\widehat{T}\), and \(\widehat{U}\) on \(\widehat{T}\), respectively, quantifying the joint score estimation performances. Boxes show the results of 500 repetition

The performance of the joint score estimation is compared focusing on the difference between methods with orthogonal parts (GO2PLS, SO2PLS, O2PLS) and their counterparts without the “O2” filtering (gPLS, sPLS, PLS). The top row of Fig. 3 shows the performance measured by \(R^2_{\widehat{T}T}\) & \(R^2_{\widehat{T}\widehat{U}}\) under \(N = 30\), \(\alpha = 0.1\) and varying relative orthogonal signal strength from one fifth to five times of the joint signal. In the left panel, \(R^2_{\widehat{T}T}\) of the various methods is depicted, representing how well the joint component \(\hat{T}\) captured the true underlying T. Overall, penalized methods performed better than non-penalized ones, especially when the orthogonal variation is relatively small. PLS performed poorly compared to O2PLS, when the orthogonal variation exceeds the joint variation. As the orthogonal variation further increases, performances of sPLS and gPLS deteriorated, while SO2PLS and GO2PLS were less affected. In the right panel, \(R^2_{\widehat{T}\widehat{U}}\) is presented, an estimate of the true parameters \(R^2_{TU}\), capturing correlation of T and U. Across different settings, O2PLS-based methods performed better, especially when the orthogonal variation is large.

The bottom row of Fig. 3 shows the score estimation performance under fixed relative orthogonal signal strength of 1, \(\alpha =0.1\), and varying sample size N from 30 to 600. Penalized methods performed better compared to non-penalized methods in general, when the sample size is small. Regardless of the sample size, O2PLS-based methods outperformed PLS-based methods.

Fig. 4
figure 4

Simulation Scenario 2: comparison of feature selection and joint loading estimation performance, under varying noise level (top row), and varying sample size (bottom row). On the Y-axis are the True Positive Rate (left) and \(W^{\top } \widehat{W}\) (right), which is the cosine of the angle between the estimated loading vector \(\widehat{W}\) and the true one W. Boxes show the results of 500 repetition

Lastly, we present the results of GO2PLS, SO2PLS, and O2PLS with regard to feature selection and estimation of joint loadings. Results of PLS-based methods are not included since the performances of gPLS, sPLS, and PLS in this regard are very similar to GO2PLS, SO2PLS, and O2PLS, respectively. In Fig. 4, the top row shows the TPR and \(W^{\top } \widehat{W}\) under \(N = 30\) and varying noise levels \(\alpha\) from low to high. At all noise levels, GO2PLS had higher TPR than SO2PLS and O2PLS, and performed robustly against increasing noise. Regarding \(W^{\top } \widehat{W}\), GO2PLS outperformed the other two as well. In the bottom row, when increasing sample size at a fixed noise level of 0.5, the variance appeared to decrease and the performances of all the methods converged. Overall, GO2PLS outperformed others.

Application to data

We demonstrate SO2PLS and GO2PLS on datasets from two distinct studies. In the TwinsUK study, our aim is to integrate methylation and glycomics data and identify important groups of CpG sites underlying glycosylation. In the CVON-DOSIS study, we integrate regulomics and transcriptomics data and select a subset of genes and regions that drive their relationship.

TwinsUK study

We performed GO2PLS on the data with 1 joint, no methylation-orthogonal, and 3 glycomics-orthogonal components based on 5-fold cross-validation. We set the sparsity parameters to select the top 100 groups in the methylation and kept all the 22 glycan variables. The selected CpG groups from GO2PLS were mapped to their targeted genes for interpretation.

We performed gene set enrichment analyses on the selected genes using the ToppGene Suite [25]. The results appeared to be related to immune response. We listed the most significant molecular function, biological process, and pathway in Table 3 (the full list of significant results can be found in Additional file 2).

An extra analysis was performed using another grouping strategy, where we grouped 55531 CpG sites that map to the promoter region (0-1500 bases upstream of the transcriptional start site (TSS)) of a gene to 14491 groups based on their targeted genes. We applied GO2PLS and selected 100 groups. Note that the sizes of these groups became smaller since many CpG sites in gene bodies are excluded. Enrichment analysis did not result in significant results.

Table 3 TwinsUK study: top results of gene set enrichment analysis


We applied SO2PLS on the regulomics and transcriptomics datasets, with 2 joint and 1 orthogonal components for each omics dataset. In each pair of the joint components, 1000 regulomics and 500 transcriptomics variables were selected. We then further identified the genes corresponding to the promoter regions where the selected 1000 histone modification locates (using ± 10K window from the transcription start site of the gene). These genes are of interest since they are likely to be related to epigenetic regulation of gene expression. Genes corresponding to the selected transcripts were also identified. These gene sets identified from each joint component of the two omics data were investigated separately using gene set enrichment analysis. The top results were listed in Table 4. The Gene Ontology (GO) enrichment analysis of the selected genes and regions showed terms related to HCM that were also found previously [28]. Due to the presence of the case-control status in both omics levels, we expect the joint components related to the disease. Plotting the joint scores of the two datasets showed a separation between HCM cases and controls (Fig. 5). For a comparison of score plots of PCA, PLS, O2PLS, and SO2PLS, please see Additional file 3.

Fig. 5
figure 5

CVON-DOSIS study: SO2PLS joint score plots of regulomics (left) and transcriptomics (right). HCM patients and controls were plotted in different colors. Ellipses are the 95% confidence regions of each group

Table 4 CVON-DOSIS study: Gene set enrichment analysis results


Statistical integration of two omics datasets is becoming increasingly popular to gain insight into underlying biological systems. O2PLS is a method that integrates two heterogeneous datasets and takes into account omic-specific variation. The resulting joint and orthogonal components are linear combinations of all variables, making interpretation difficult. To introduce sparsity and identify relevant groups, GO2PLS incorporates biological information on group structures to perform group selection in the joint subspace. Depending on the group size, such an approach may also lead to a higher selection probability of relevant features. We performed an extensive simulation study and showed that O2PLS-based methods generally outperformed PLS-based methods regarding joint score estimation when orthogonal variation was present in the data. Since PLS does not take into account orthogonal parts, the joint components also include part of the orthogonal variation. Further, when the sample size was small or the noise level was high, penalized methods appeared to be much less prone to overfitting than non-penalized methods. This suggests that results based on GO2PLS are likely to be generalizable when applied to new datasets. Concerning feature selection, adding external group information led to higher TPR, and larger groups of relevant features had a higher proportion of being detected under a moderate noise level. We then applied GO2PLS to the TwinsUK study, where we selected 100 target genes comprising of CpG sites that are most related to IgG glycosylation. The results of the enrichment analysis on the selected genes showed GO-terms involving the immune system in which the IgG glycans play important roles. In the CVON-DOSIS study, we integrated regulomics and transcriptomics and identified 1000 regulatory regions and 500 transcripts, and mapped them to genes. Further analysis of the selected gene sets showed enrichment for terms related to heart muscle diseases. Moreover, the implementation of GO2PLS is computationally fast and memory efficient. It relies on an algorithm based on NIPALS that does not store large matrices of size \(p \times q\) when performing the group-penalized optimization. A regular laptop (8G RAM, quad-core 2.6 GHz) was able to run GO2PLS on omics data from both case studies.

The group information should be chosen together with domain experts based on the research question and biological knowledge. For example, in our TwinsUK data application, we aimed to identify the genes comprising of CpG sites, rather than the individual CpG sites. Therefore, we grouped CpG sites in the same genetic region. Furthermore, the biological knowledge that close-by CpG sites tend to function together supported the choice of grouping. Different grouping information leads to a changed definition of groups, consequently the selected groups will have a different interpretation. In the same example of the TwinsUK study, extra analysis using smaller groups led to no significant results in the enrichment analysis, supposedly due to weaker aggregated group effects. When group information is not available, or the research goal is to identify individual features (e.g., in our CVON-DOSIS data application), SO2PLS can be used.

In the CVON-DOSIS study, Plotting the first two joint components showed two distinct classes corresponding to the case-control status. This might be expected since the analysis was conditional on case-control status, yielding a correlation between the two omics datasets. This phenomenon is well known in regression analysis of secondary phenotypes [29], but not well studied in PLS type of methods. This is a topic of future research. Often omics data are collected to study their relationship with an outcome variable or to predict an outcome variable. To this end, our approach has to be extended to incorporate the outcome variable. Such an approach might also lead to a more sparse solution since the selected features have to be correlated among the two omics datasets and the outcome variable. Further extensions of GO2PLS are to incorporate more than two omics datasets to represent the actual biological system even better.

Finally, it is possible to extend the GO2PLS algorithm to a probabilistic model. Extending latent variable methods to probabilistic models is not new. PCA was extended to Probabilistic PCA in [30], and Probabilistic PLS (PPLS) [31] was proposed to provide a probabilistic framework for PLS. It has been shown that the probabilistic counterpart has a lower bias in estimation and is robust to non-normally distributed variables [31]. More importantly, the probabilistic model will allow statistical inference, making it possible to interpret the relevance and importance of features in the population, and facilitating follow-up studies. These extensions of GO2PLS will be suited for various studies with more complicated designs.


In this article, we proposed GO2PLS to integrate two omics data by estimating joint latent components. GO2PLS takes into account heterogeneity between different omics levels by including orthogonal components for each dataset. The method utilizes known group information among the features to select relevant groups of features, by imposing group-wise penalties in the joint subspace. Alternatively, the method can also choose features at the individual level. This flexibility facilitates investigation into different research questions. Our simulation study showed that GO2PLS behaved robust against noise and outperformed competing methods in terms of accuracy of joint score estimation, joint loading estimation, and feature selection. We applied GO2PLS to the datasets from two studies with distinct designs and showed that the results were biologically interpretable. To conclude, GO2PLS is a robust and flexible method for integrating two omics datasets and selecting important groups of features.

Availability of data and materials

The R scripts and functions for GO2PLS are publicly available in the OmicsPLS R package and can be installed in R via install.packages(“OmicsPLS”). Because of the sensitive nature of the data collected for the CVON-DOSIS study, requests to access the dataset from qualified researchers trained in human subject confidentiality protocols may be sent to the corresponding authors. Individual level methylation and glycomics data from TwinsUK are not permitted to be shared or deposited due to the original consent given at the time of data collection. However, access to methylation and glycomics data can be applied for through the TwinsUK data access committee. For information on access and how to apply, visit



Group Sparse Two-way Orthogonal Partial Least Squares


Sparse Two-way Orthogonal Partial Least Squares


Principal Component Analysis


Partial Least Squares


Two-way Orthogonal Partial Least Squares


Sparse Partial Least Squares (Chun and Keleş)


Sparse Partial Least Squares (Lê Cao et al.)


Group Partial Least Squares


Singular Value Decomposition


Nonlinear Iterative Partial Least Squares




Mean Squared Error


True Positive Rate


Immunoglobulin G


Hypertrophic Cardiomyopathy


Reads per Kilobase Million


Counts per Million


Trimmed Mean of M-values


Gene Ontology


  1. Boulesteix A-LL, Strimmer K. Partial least squares: A versatile tool for the analysis of high-dimensional genomic data. Briefings Bioinform. 2007;8(1):32–44.

    Article  CAS  PubMed  Google Scholar 

  2. Wold S, Ruhe A, Wold H, Dunn III WJ. The Collinearity Problem in Linear Regression. The Partial Least Squares (PLS) Approach to Generalized Inverses. SIAM J Sci Stat Comput. 1984;5(3):735–43. arXiv:1308.0863v1

  3. Trygg J, Wold S. O2-PLS, a two-block (X-Y) latent variable regression (LVR) method with an integral OSC filter. J Chemom. 2003;17(1):53–64.

    Article  CAS  Google Scholar 

  4. el Bouhaddani S, Houwing-Duistermaat J, Salo P, Perola M, Jongbloed G, Uh HW. Evaluation of O2PLS in Omics data integration. BMC Bioinform. 2016;17(2):1–20.

    Article  CAS  Google Scholar 

  5. Jolliffe IT, Trendafilov NT, Uddin M. A modified principal component technique based on the LASSO. J Comput Graph Stat. 2003;12(3):531–47., arXiv:1205.0121v2

  6. Chun, H., Keleş, S.: Sparse partial least squares regression for simultaneous dimension reduction and variable selection. J R Stat Soc Ser B: Stat Methodol 72(1), 3–25 (2010).

  7. Lê Cao, K.A., Rossouw, D., Robert-Granié, C., Besse, P. A sparse PLS for variable selection when integrating omics data. Statist Appl Genet Mol Biol. 7(1) (2008).

  8. Tyekucheva S, Marchionni L, Karchin R, Parmigiani G. Integrating diverse genomic data using gene sets. Genome Biology. 2011;12(10):105.

    Article  CAS  Google Scholar 

  9. Yuan M, Lin Y. Model selection and estimation in regression with grouped variables. J R Stat Soc Ser B: Stat Methodol. 2006;68(1):49–67.

    Article  Google Scholar 

  10. Liquet B, De Micheaux PL, Hejblum BP, Thiébaut R. Group and sparse group partial least square approaches applied in genomics context. Bioinformatics. 2016;32(1):35–42.

    Article  CAS  PubMed  Google Scholar 

  11. Spector TD, Williams FMK. The UK Adult Twin Registry (TwinsUK). Twin Res Hum Genet. 2006;9(6):899–906.

    Article  PubMed  Google Scholar 

  12. Moayyeri A, Hammond CJ, Hart DJ, Spector TD. The UK adult twin registry (twinsUK resource). Twin Res Hum Genet. 2013;16(1):144–9.

    Article  PubMed  Google Scholar 

  13. Wahl A, Kasela S, Carnero-Montoro E, van Iterson M, Štambuk J, Sharma S, van den Akker E, Klaric L, Benedetti E, Razdorov G, Trbojević-Akmačić I, Vučković F, Ugrina I, Beekman M, Deelen J, van Heemst D, Heijmans BT, Consortium BIOS, Wuhrer M, Plomp R, Keser T, Šimurina M, Pavić T, Gudelj I, Krištić J, Grallert H, Kunze S, Peters A, Bell JT, Spector TD, Milani L, Slagboom PE, Lauc G, Gieger C. IgG glycosylation and DNA methylation are interconnected with smoking. Biochimica et Biophysica Acta (BBA) - General Subjects 1862(3), 637–648 (2018).

  14. CVON-DOSIS – Cardiovascular Research Consortium. Accessed 18 Nov 2020

  15. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP. Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci USA. 2005;102(43):15545–50.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  16. Chen Y-aA, Lemire M, Choufani S, Butcher DT, Grafodatskaya D, Zanke BW, Gallinger S, Hudson TJ, Weksberg R. Discovery of cross-reactive probes and polymorphic CpGs in the Illumina Infinium HumanMethylation450 microarray. Epigenetics. 2013;8(2):203–9.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  17. Uh H-W, Klarić L, Ugrina I, Lauc G, Smilde AK, Houwing-Duistermaat JJ. Choosing proper normalization is essential for discovery of sparse glycan biomarkers. Mol Omics. 2020.

  18. Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler aD. The human genome browser at UCSC. Genome Res. 2002;12(6):996–1006.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  19. UCSC Genome Browser Home. Accessed 19 Nov 2020

  20. Robinson MD, Oshlack A. A scaling normalization method for differential expression analysis of RNA-seq data. Tech Rep (2010).

  21. Wold H. Nonlinear Iterative Partial Least Squares (NIPALS) Modelling: Some Current Developments. In: Multivariate Analysis–III, pp. 383–407 (1973).

  22. Witten DM, Tibshirani R, Hastie T. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics. 2009;10(3):515–34.

    Article  PubMed  PubMed Central  Google Scholar 

  23. Tibshirani R. Regression Shrinkage and Selection Via the Lasso. J R Stat Soc: Ser B (Methodological). 1996;58(1):267–88.

    Article  Google Scholar 

  24. Hastie T, Tibshirani R, Wainwright M. Statistical learning with sparsity: the lasso and generalizations. Stat Learn Spars: Lasso General. 2015;84(1):1–337.

    Article  Google Scholar 

  25. Chen J, Bardes EE, Aronow BJ, Jegga AG. ToppGene Suite for gene list enrichment analysis and candidate gene prioritization. Nucleic Acids Res. 2009;37(SUPPL.2).

  26. Benjamini Y, Hochberg Y. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. J R Stat Soc: Ser B (Methodological). 1995;57(1):289–300.

    Article  Google Scholar 

  27. Storey JD. A direct approach to false discovery rates. Technical Report. 2002;3.

  28. Gao J, Collyer J, Wang M, Sun F, Xu F. Genetic dissection of hypertrophic cardiomyopathy with myocardial rna-seq. Int J Mol Sci. 2020;21(9).

  29. Tissier R, Tsonaka R, Mooijaart SP, Slagboom E, Houwing-Duistermaat JJ. Secondary phenotype analysis in ascertained family designs: application to the Leiden longevity study. Stat Med. 2017;36(14):2288–301.

    Article  PubMed  PubMed Central  Google Scholar 

  30. Bishop CM, Tipping ME. Probabilistic Principal Component Analysis. J R Stat Soc. Ser B 61(iii), 611–622 (1999).

  31. el Bouhaddani S, Uh HW, Hayward C, Jongbloed G, Houwing-Duistermaat J. Probabilistic partial least squares model: Identifiability, estimation and application. J Multivar Anal. 2018;167:331–46. arXiv:1706.03597

Download references


The authors would like to thank M. Harakalova, and M. Mokry from the Dept. of Cardiology, UMC Utrecht for providing the CVON-DOSIS data and discussion on the analysis of the CVON-DOSIS datasets. We thank M. Michels and J. van der Velden for providing the HCM tissues, the biobank of UMC Utrecht, the biobank of the Washington University School of Medicine, and the Sydney Heart Bank for providing non-failing donor tissue. This work has received support from the EU/EFPIA Innovative Medicines Initiative 2 Joint Undertaking BigData@Heart grant (116074). TwinsUK is funded by the Wellcome Trust, Medical Research Council, European Union, Chronic Disease Research Foundation (CDRF), Zoe Global Ltd and the National Institute for Health Research (NIHR)-funded BioResource, Clinical Research Facility and Biomedical Research Centre based at Guy’s and St Thomas’ NHS Foundation Trust in partnership with King’s College London.


The research leading to these results has received funding and support from the European Union’s Horizon 2020 research and innovation programme IMforFUTURE under H2020-MSCA-ITN grant agreement number 721815, from the EU/EFPIA Innovative Medicines Initiative 2 Joint Undertaking BigData@Heart grant (116074), and from the ERA-Net for Research Programmes on Rare Diseases (E-rare 3 – MSAomics project). The funding bodies did not play any role in the design of the study and collection, analysis, and interpretation of data and in writing the manuscript.

Author information

Authors and Affiliations



ZG performed the mathematical work, simulation study, data anaylysis, and wrote the manuscript. JH and H-WU provided the underlying idea of the methods. SB supported the programming. All authors contributed to the discussion of the methods, simulation and data analysis. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Zhujie Gu.

Ethics declarations

Ethics approval and consent to participate

TwinsUK study: Ethical approval was granted by the National Research Ethics Service London-Westminster, the St Thomas’ Hospital Research Ethics Committee (EC04/015 and 07/H0802/84). All research participants have signed informed consent prior to taking part in any research activities. CVON-DOSIS study: the study protocol was approved by the local ethics committee of the Erasmus MC (2010-409), the Biobank Research Ethics Committee of University Medical Center Utrecht (protocol number 12/387), and the Washington University School of Medicine Ethics Committee (Institutional Review Board). Informed consent was obtained from each patient prior to surgery or was waived by the ethics committee when acquiring informed consent was not possible due to the death of the donor.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Additional file 1

. The details of solving the optimization problem (8).

Additional file 2

. The full lists of significant results of gene set enrichment analyses in the TwinsUK study and the CVON-DOSIS study.

Additional file 3

. Additional analysis of the CVON-DOSIS datasets, where score plots of PCA, PLS, O2PLS, SO2PLS are shown and compared.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit The Creative Commons Public Domain Dedication waiver ( applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Gu, Z., el Bouhaddani, S., Pei, J. et al. Statistical integration of two omics datasets using GO2PLS. BMC Bioinformatics 22, 131 (2021).

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: