 Methodology article
 Open Access
 Published:
Statistical integration of two omics datasets using GO2PLS
BMC Bioinformatics volume 22, Article number: 131 (2021)
Abstract
Background
Nowadays, multiple omics data are measured on the same samples in the belief that these different omics datasets represent various aspects of the underlying biological systems. Integrating these omics datasets will facilitate the understanding of the systems. For this purpose, various methods have been proposed, such as Partial Least Squares (PLS), decomposing two datasets into joint and residual subspaces. Since omics data are heterogeneous, the joint components in PLS will contain variation specific to each dataset. To account for this, Twoway Orthogonal Partial Least Squares (O2PLS) captures the heterogeneity by introducing orthogonal subspaces and better estimates the joint subspaces. However, the latent components spanning the joint subspaces in O2PLS are linear combinations of all variables, while it might be of interest to identify a small subset relevant to the research question. To obtain sparsity, we extend O2PLS to Group Sparse O2PLS (GO2PLS) that utilizes biological information on group structures among variables and performs group selection in the joint subspace.
Results
The simulation study showed that introducing sparsity improved the feature selection performance. Furthermore, incorporating group structures increased robustness of the feature selection procedure. GO2PLS performed optimally in terms of accuracy of joint score estimation, joint loading estimation, and feature selection. We applied GO2PLS to datasets from two studies: TwinsUK (a population study) and CVONDOSIS (a small casecontrol study). In the first, we incorporated biological information on the group structures of the methylation CpG sites when integrating the methylation dataset with the IgG glycomics data. The targeted genes of the selected methylation groups turned out to be relevant to the immune system, in which the IgG glycans play important roles. In the second, we selected regulatory regions and transcripts that explained the covariance between regulomics and transcriptomics data. The corresponding genes of the selected features appeared to be relevant to heart muscle disease.
Conclusions
GO2PLS integrates two omics datasets to help understand the underlying system that involves both omics levels. It incorporates external group information and performs group selection, resulting in a small subset of features that best explain the relationship between two omics datasets for better interpretability.
Background
With the advancements in high throughput technology, multiple omics data are commonly available on the same subjects. To identify a set of relevant related features across the omics levels, these datasets need to be integrated and analyzed jointly. For statistical integration of omics data, there are several challenges to overcome: complex correlation structure within and between omics data, highdimensionality (\(p\gg n\), or “large p, small n”), heterogeneity between different omics datasets, and selection of relevant features in each dataset. To deal with the first two challenges, Partial Least Squares (PLS) has been proposed [1, 2]. Dimension reduction is achieved by decomposing two datasets X and Y into joint and residual subspaces. The joint (lowdimensional) subspace of one dataset represents the best approximation of X or Y based on maximizing the covariance of the two. However, by integrating two heterogeneous omics datasets, the PLS joint components also contain (strong) omicspecific variation. This heterogeneity can be caused by differences (e.g. between methylation and glycomics) in size, distribution, and measurement platform. Ignoring these omicspecific characteristics (variation specific to each of the data) in the model may lead to a biased representation of the underlying system. Twoway orthogonal partial least squares (O2PLS) [3, 4] was proposed to decompose each dataset into joint, orthogonal, and residual subspaces. The orthogonal subspaces in X and Y capture variation unrelated to each other, making the joint subspaces better estimates for the true relation between X and Y. Hence, O2PLS accounts for the heterogeneity of two omics datasets. However, the resulting lowdimensional latent components spanning the joint subspaces are linear combinations of all the observed variables. Therefore, to select a small subset of relevant features for better interpretation, one can impose sparsity on the loadings of the principal components. A straightforward approach is to ignore all loadings smaller than some threshold value, effectively treating them as zero, which can be misleading [5].
Several sparse methods based on PLS have been proposed. Chun and Keleş proposed sparse PLS (SPLS) [6] which fits PLS on a reduced X space, consisting of preselected Xvariables using a penalized regression. Sparse PLS (sPLS) by Lê Cao et al. [7] imposes \(L_1\) penalty on the singular value decomposition (SVD) of the covariance matrix of X and Y, resulting in sparse loading vectors for both datasets. Often it is of interest to select a group of features instead of individual features, e.g. features within a gene or a pathway. By so doing, one can improve power by identifying aggregate effects of the selected features [8,9,10]. Liquet et al. extended sPLS to group PLS (gPLS) [10], imposing groupwise \(L_2\) penalties on the loadings of the predefined feature groups. It results in groupwise sparsity (i.e., features belonging to the same group will always be selected altogether).
In this work, we propose to extend O2PLS to incorporate sparsity, called Group Sparse O2PLS (GO2PLS). GO2PLS obtains sparse solutions by pushing a large number of small nonzero weights (or loading values) to zeros, instead of employing hard thresholding using arbitrary cutoff values. Therefore, GO2PLS constructs joint lowdimensional latent components representing the underlying systems involving both omics levels while taking into account the heterogeneity of different omics data, incorporates external biological information such as known group structure, and performs variable selection by imposing groupwise penalties on the loading vectors in the joint subspaces.
For illustration, we apply GO2PLS to datasets from two studies. Firstly, TwinsUK is a population based study [11, 12], where methylation (482K CpG sites) and 22 immunoglobulin G (IgG) glycans were measured. A previous research [13] suggested the presence of an indirect influence of methylation on IgG glycosylation that may in part capture environmental exposures. We integrate the two omics datasets, aiming to identify groups of CpG sites affecting IgG glycosylation. In the CVONDOSIS casecontrol study [14], regulomics (histone modification) and transcriptomics data were measured on 13 hypertrophic cardiomyopathy (HCM) patients and 10 controls. Histone modification can have an impact on gene expression. Therefore we integrate the two omics datasets and identify a small set of regulatory regions and transcripts explaining this relationship. Moreover, the extreme imbalance in a highdimensional setting (33K ChIPseq and 15K RNAseq vs 23 subjects) poses computational challenges. The resulting selected features are further studied using gene set enrichment analysis [15]. Several possible scenarios containing these characteristics are designed and investigated in an extensive simulation study.
This paper is organized as follows. In the methods section, an overview of O2PLS is presented, followed by the formulation of GO2PLS. Via a simulation study, we explore the properties of GO2PLS and compare its performance to other competitive methods. We then apply GO2PLS to integrate methylation and glycomics in the TwinsUK study and regulomics and transcriptomics in the CVONDOSIS study. We conclude with a discussion and possible directions to further extend the method.
Methods
Data description
TwinsUK datasets
Whole blood methylation (using Infinium HumanMethylation450 BeadChip) and IgG glycomics (Ultra Performance Liquid Chromatography) data were measured on 405 independent individuals, among which 392 are females and 13 are males. The age ranges from 18 to 81, with a median of 58. The methylation dataset consists of beta values (ratio of intensities between methylated and unmethylated alleles) at 482563 CpG sites. CpG sites with missing values, on allosomes, or labeled crossactive [16] were removed. We kept only the CpG sites on CpG islands or surrounding areas (shelves and shores) that mapped to genetic regions. Age, sex, batch effect, and cell counts were corrected for using multiple regression. The glycomics dataset contains 22 glycan peaks. These peaks were normalized using median quotient (MQ) normalization [17], logtransformed, and adjusted for batch effect, age, and sex as well. The remaining 126299 CpG sites were then divided into 16892 groups based on their target genes (biological information from the UCSC database [18, 19]). No group information was available for the glycomics data.
CVONDOSIS datasets
In the CVONDOSIS study, regulomics and transcriptomics datasets were measured on the samples taken from the heart tissues of 13 HCM patients and 10 healthy controls. HCM is a heart muscle disease that makes it harder for the heart to pump blood, leading to heart failure. The regulomics data were measured using ChIPseq, providing counts of histone modification H3K27ac in 33642 regulatory regions. The transcriptomics data contain counts of 15882 transcripts, measured by RNAseq. The raw counts of regulomics data were normalized with reads per kilobase million (RPKM) to adjust for sequencing depth. Transcriptomics data were normalized with counts per million (CPM) with effective library size (estimated using the Trimmed Mean of Mvalues (TMM) method in EdgeR R package [20]). Further, both normalized data were logtransformed.
Twoway orthogonal partial least squares (O2PLS)
Let X and Y be two data matrices with the number of rows equal to the sample size N and the number of columns equal to the dimensionality p and q, respectively. Let the number of joint, Xorthogonal (unrelated to Y) and Yorthogonal components be \(K\), \(K_x\) and \(K_y\), respectively, where \(K\), \(K_x\) and \(K_y\) are typically much smaller than p and q. The O2PLS model decomposes X and Y as follows:
The relation between X and Y is captured through the inner relation between T and U,
In this model, the scores are: \(T\, (N \times K), \, U \,(N \times K), \, T_\perp \, (N \times K_x), \, U_\perp \, (N \times K_y).\) They represent projections of the observed data X and Y to lowerdimensional subspaces. The loadings, \(W\,(p \times K), \, C\,(q \times K), \, P_\perp \,(p \times K_x), \, Q_\perp \,(q \times K_y),\) indicate relative importance of each X and Y variable in forming the corresponding scores. Further, \(E\,(N \times p), \, F\,(N \times q), \, H\,(N \times K), \, \tilde{H}\,(N \times K),\) represent the residual matrices.
In O2PLS, estimates of the joint subspaces are obtained by first filtering out the orthogonal variation. The filtered data matrices \({{\tilde{X}}}\) and \({{\tilde{Y}}}\) are constructed as follows:
where \(T_\perp, U_\perp\) are estimates for the orthogonal subspaces, and \(I_N\) is identity matrix of size N. For more details see [3]. The joint parts maximize the covariance between the joint scores \(T = {{\tilde{X}}}W\) and \(U = {{\tilde{Y}}}C\). Here, W and C consist of loading vectors (\(w_1, \ldots , w_K\)) and (\(c_1, \ldots , c_K\)), which can be found as the right and left singular vectors of the covariance matrix \({{\tilde{Y}}}^{\top } {{\tilde{X}}}\) [4]. Calculating and storing \({{\tilde{Y}}}^{\top } {{\tilde{X}}}\) of dimension \(q \times p\) can be cumbersome for high dimensional omics data. Therefore we consider the following optimization problem sequentially for components \(k = 1, \ldots , K\):
where parameters \(w_k, \, c_k\) are the loading vectors of the kth joint components and \({{\tilde{X}}}_k, \, {{\tilde{Y}}}_k\) are the filtered data matrices after \(k1\) times of deflation. This can be solved efficiently using the Nonlinear Iterative Partial Least Squares (NIPALS) [21] algorithm, which starts with random initialization of the Xspace score vector t and repeats a sequence of the following steps until convergence:
In step 1 and 4, \(Y_k\) and \(X_k\) are projected onto the Xspace score vector t and the Yspace score u to get the loading vectors \(c_k\) and \(w_k\). The loading vectors are then unitized (step 2 and 5) and used to calculated the new scores u and t. Convergence of the algorithm is guaranteed. A detailed description and proof of optimality of the O2PLS algorithm can be found in [3, 4].
While standard crossvalidation (CV) over a 3dimensional grid is often used to determine the optimal number of components \(K\), \(K_x\), and \(K_y\), the procedure is not optimal for O2PLS, since there is not a single optimization criterion for all three parameters. As in [4], we use an alternative CV procedure that first performs a 2dimensional grid search of \(K_x\) and \(K_y\), with a fixed \(K\), to optimize prediction performance of \(T \rightarrow U\) and \(U \rightarrow T\). Then a sequential search of optimal \(K\) is conducted to minimize the sum of mean squared errors (MSE) of prediction concerning \(X \rightarrow Y\) and \(Y \rightarrow X\).
Group sparse O2PLS (GO2PLS)
GO2PLS extends O2PLS by introducing a penalty in the NIPALS optimization on the filtered data \({{\tilde{X}}}\) and \({{\tilde{Y}}}\). This penalty encourages sparse, or groupsparse solutions for the joint loading matrices W and C, leading to a subset of the original features corresponding to nonzero loading values being selected in each joint component.
Briefly, we introduce an \(L_1\) penalty on each pair of joint loading vectors. The optimization problem for the kth pair of joint loadings \(c_k, \, w_k\) is:
where \(\lambda _c, \, \lambda _w\) are penalization parameters that regulate the sparsity level. The optimization problem (6) can be solved [22] by iterating over the kth pair of joint loadings,
where \(t = {{\tilde{X}}}_k w_k\) and \(u = {{\tilde{Y}}}_k c_k\). Here, \(S(\cdot )\) is the soft thresholding operator: \(S(a, \text {const}) = sgn(a)(a  \text {const})_+\) (\(\text {const} \ge 0\) is a nonnegative constant, \((x)_+\) equals to x if \(x > 0\) and equals to 0 if \(x \le 0\)).
To perform group selection, we impose groupwise \(L_2\) penalty on the joint loading vectors. Let \({{\tilde{X}}}\) and \({{\tilde{Y}}}\) be partitioned into \(J \,(J \le p )\) and \(M \,(M \le q )\) groups, respectively. The submatrices \({{\tilde{X}}}^{(j)}\) and \({{\tilde{Y}}}^{(m)}\) (\(j = 1,\ldots , J; \, m = 1, \ldots , M\)) contain the jth and mth group of variables, with corresponding loading vectors \(w^{(j)}\) (of size \(p_j\)) and \(c^{(m)}\) (of size \(q_m\)). The optimization problem for the kth pair of loading vectors \(c_k = ({{c_k^{(1)}}}^\top , \ldots , {{c_k^{(M)}}^{\top }})^\top\) and \(w_k = ({{w_k^{(1)}}}^\top , \ldots , {{w_k^{(J)}}^{\top }} )^\top\) can be written as follows:
where the last two terms are reformulations of the unit norm constraints on \(c_k\) and \(w_k\), with \(\phi _c\) and \(\phi _w\) being the Lagrangian multipliers. The effective penalization parameters on each group (\(\lambda _c, \, \lambda _w\)) are adjusted by the square root of the group size to correct for the fact that larger groups are more likely to be selected. This optimization problem can be solved using block coordinate descent (for details, see Additional file 1). The solution takes the form:
The \({{\tilde{X}}}\)variables within the jth group will have nonzero weights if \(\left\Vert {{\tilde{X}}}_k^{{(j)}^{\top }}u\right\Vert _2\) (i.e., the contribution of the whole group to the covariance) is larger than the sizeadjusted penalization parameter \(\sqrt{p_j}\lambda _w\). In the same way, the \({{\tilde{Y}}}\)variables within the mth group will be assigned nonzero loading values if \(\left\Vert {{\tilde{Y}}}_k^{{(m)}^{\top }} t\right\Vert _2 > \sqrt{q_m}\lambda _c\).
Note that when all the groups have size 1, the summation of groupwise \(L_2\) penalties is equivalent to an \(L_1\) penalty on the unpartitioned loading vector and individual features will be selected (i.e., (8) reduces to (6)). In this specific case, to avoid confusion, we call the method Sparse O2PLS (SO2PLS). When the penalization parameters \(\lambda _w = \lambda _c = 0\), GO2PLS becomes to O2PLS. If the number of orthogonal components \(K_x = K_y = 0\), GO2PLS, SO2PLS, O2PLS are equivalent to gPLS, sPLS, and PLS, respectively.
The kth pair of joint loadings are orthogonalized with respect to the previous \(k1\) loading vectors. Let \(\pi\) be an index set for selected variables in \(w_k\). The orthogonalization is achieved by first projecting \(w_k^{(\pi )}\) onto \(span\{ w_1^{(\pi )}, \ldots , w_{k1}^{(\pi )} \}\), and then subtracting this projection from \(w_k^{(\pi )}\). When the previous \(k1\) components do not select any variable in \(\pi\), \(span\{ w_1^{(\pi )}, \ldots , w_{k1}^{(\pi )} \}\) is actually a zero subspace and no orthogonalization is needed.
To determine the optimal sparsity level, it is more convenient and intuitive to focus on the number of selected \({{\tilde{X}}}\), \({{\tilde{Y}}}\) groups (donote \(h_x\), \(h_y\), respectively). If prior biological knowledge does not already specify certain \(h_x\) and \(h_y\), crossvalidation can be used to search for combinations of \(h_x\) and \(h_y\) that maximize the covariance between each pair of estimated joint components \({\mathrm{Cov}}({\hat{t}}, {\hat{u}})\). Similar to LASSO [23], the “onestandarderrorrule” [24] can be applied to obtain a more stable CV result. The GO2PLS algorithm is described below:
Simulation Study
We evaluate the performance of GO2PLS in two scenarios. First, we investigate the ability to select the relevant groups under various scenarios, focusing on the joint subspace, where the group selection takes place. Second, we compare the performance of GO2PLS and SO2PLS with other methods: O2PLS, PLS, sPLS, and gPLS. We investigate joint score estimation, joint loading estimation, and feature selection performances.
In the first scenario, we set the number of variables in X and Y to be \(p = 5000\) and \(q = 20\), respectively. There are 10 groups of variables in X with nonzero loading values. The first 5 groups have group sizes of 100, 50, 20, 5, and 1, respectively, in which all the variables have loading values equal to 1. The remaining 5 groups are of size 10, with loading values of variables equal to 5. Note that large loading values are assigned to the latter 5 groups to make the detection of the first 5 groups more difficult. The remaining variables have zero loading values and are divided into groups of size 10. All the Yvariables have the same loading values and are not grouped. The sample size N is set to 30. We simulate both data matrices with 1 joint component (T and U from Equation 1 are both standard normally distributed and have correlation 1). We perform 1000 simulation runs and record the number of the runs GO2PLS selected relevant groups; we compute the proportion of each truly relevant group (with nonzero loadings) being selected across the simulation runs (number of times being selected divided by 1000). The group importance measurement \({\left\Vert {X^{(j)}}^{\top } U\right\Vert _2}/ {\sqrt{p_j}}\), that determines whether a group is selected or not is recorded for the first 5 groups (with loading value 1) to investigate the stability of the selection procedure.
In the second scenario, we vary the sample size N from 30 to 600, and set \(p = 20{,}000\) and \(q = 10{,}000\), mimicking the dimensionality of the CVONDOSIS datasets. Both X and Y variables are evenly divided into 1000 groups. For each joint component, we select 50 relevant groups and assign nonzero loadings to the variables contained in them. Within each group, variables have the same loading values: 1 for the first group, 2 for the second,..., and 50 for the last relevant group. We set the number of joint components \(K = 2\) and the number of orthogonal components \(K_x = K_y = 1\). The scores \(T, T_\perp , U, U_\perp\) from Equation 1 are generated from normal distributions with zero mean. The relationship between the joint scores is represented by \(U = T + H\), where H accounts for \(20\%\) of the variation in U. The noise matrices E, F are generated from normal distributions with zero mean and variance such that the variance of the noise matrix accounts for a proportion \(\alpha\) (\(0< \alpha < 1\)) of the variance of the data matrix (i.e., \(\alpha = \mathrm{Var}(E)/\mathrm{Var}(X) = \mathrm{Var}(F)/\mathrm{Var}(Y)\)). The ratio of the variance of the orthogonal components to the variance of the joint components (\(\sigma ^2_{T_{\perp }} / \sigma _T^2\)), and noise level \(\alpha\) are varied. For evaluating the accuracy of the joint score estimation, we computed \(R^2_{\widehat{T}T} = 1  \sum (\widehat{T}  T)^2/\sum T^2\) and \(R^2_{\widehat{T}\widehat{U}} = 1  \sum (\widehat{U}  \widehat{T})^2/\sum \widehat{U}^2\), which quantify how well the true parameter T and the estimated Yjoint component \(\widehat{U}\) can be explained by the estimated Xjoint component \(\widehat{T}\). The performance of feature selection and the accuracy of estimated loadings are evaluated by true positive rate (TPR = TP/(TP+FN), where TP = True Positive, FN = False Negative) and \(W^{\top } \widehat{W}\), which represents the cosine of the angle between the estimated loading vector and the true one. The performances of all methods are evaluated on an independent test dataset of size 1000. For each setting, 500 replications are generated.
An overview of scenario settings is presented in Tables 1, 2. To make a clearer comparison of the behavior across all the methods, we use the optimum values for the tuning parameters (number of components and number of relevant variables or groups).
Results of simulation study
Scenario 1
Figure 1 shows the selection proportion for each relevant group under each noise level. Compared to smaller groups, the proportion for larger groups is higher at low to moderate (\(\alpha < 0.7\)) noise levels, and shows robustness against increasing noise. When the noise level is very high (\(\alpha > 0.8\)), the method loses power to detect relevant group of any size, particularly, of larger size. Figure 2 shows the density of the group importance measurement \(\left\Vert {X^{(j)}}^{\top } U\right\Vert _2 /{\sqrt{p_j}}\) for the first 5 relevant groups with different group sizes under 3 different noise levels. The vertical dotted lines indicate the average threshold given the correct number of relevant groups. Since a group will be selected if exceeds the threshold, the total area on the right side of the threshold under each density curve equals the selection proportion for the corresponding group. The measurement for larger relevant group shows higher precision at all noise levels. The threshold increases along with the noise.
Scenario 2
The performance of the joint score estimation is compared focusing on the difference between methods with orthogonal parts (GO2PLS, SO2PLS, O2PLS) and their counterparts without the “O2” filtering (gPLS, sPLS, PLS). The top row of Fig. 3 shows the performance measured by \(R^2_{\widehat{T}T}\) & \(R^2_{\widehat{T}\widehat{U}}\) under \(N = 30\), \(\alpha = 0.1\) and varying relative orthogonal signal strength from one fifth to five times of the joint signal. In the left panel, \(R^2_{\widehat{T}T}\) of the various methods is depicted, representing how well the joint component \(\hat{T}\) captured the true underlying T. Overall, penalized methods performed better than nonpenalized ones, especially when the orthogonal variation is relatively small. PLS performed poorly compared to O2PLS, when the orthogonal variation exceeds the joint variation. As the orthogonal variation further increases, performances of sPLS and gPLS deteriorated, while SO2PLS and GO2PLS were less affected. In the right panel, \(R^2_{\widehat{T}\widehat{U}}\) is presented, an estimate of the true parameters \(R^2_{TU}\), capturing correlation of T and U. Across different settings, O2PLSbased methods performed better, especially when the orthogonal variation is large.
The bottom row of Fig. 3 shows the score estimation performance under fixed relative orthogonal signal strength of 1, \(\alpha =0.1\), and varying sample size N from 30 to 600. Penalized methods performed better compared to nonpenalized methods in general, when the sample size is small. Regardless of the sample size, O2PLSbased methods outperformed PLSbased methods.
Lastly, we present the results of GO2PLS, SO2PLS, and O2PLS with regard to feature selection and estimation of joint loadings. Results of PLSbased methods are not included since the performances of gPLS, sPLS, and PLS in this regard are very similar to GO2PLS, SO2PLS, and O2PLS, respectively. In Fig. 4, the top row shows the TPR and \(W^{\top } \widehat{W}\) under \(N = 30\) and varying noise levels \(\alpha\) from low to high. At all noise levels, GO2PLS had higher TPR than SO2PLS and O2PLS, and performed robustly against increasing noise. Regarding \(W^{\top } \widehat{W}\), GO2PLS outperformed the other two as well. In the bottom row, when increasing sample size at a fixed noise level of 0.5, the variance appeared to decrease and the performances of all the methods converged. Overall, GO2PLS outperformed others.
Application to data
We demonstrate SO2PLS and GO2PLS on datasets from two distinct studies. In the TwinsUK study, our aim is to integrate methylation and glycomics data and identify important groups of CpG sites underlying glycosylation. In the CVONDOSIS study, we integrate regulomics and transcriptomics data and select a subset of genes and regions that drive their relationship.
TwinsUK study
We performed GO2PLS on the data with 1 joint, no methylationorthogonal, and 3 glycomicsorthogonal components based on 5fold crossvalidation. We set the sparsity parameters to select the top 100 groups in the methylation and kept all the 22 glycan variables. The selected CpG groups from GO2PLS were mapped to their targeted genes for interpretation.
We performed gene set enrichment analyses on the selected genes using the ToppGene Suite [25]. The results appeared to be related to immune response. We listed the most significant molecular function, biological process, and pathway in Table 3 (the full list of significant results can be found in Additional file 2).
An extra analysis was performed using another grouping strategy, where we grouped 55531 CpG sites that map to the promoter region (01500 bases upstream of the transcriptional start site (TSS)) of a gene to 14491 groups based on their targeted genes. We applied GO2PLS and selected 100 groups. Note that the sizes of these groups became smaller since many CpG sites in gene bodies are excluded. Enrichment analysis did not result in significant results.
CVONDOSIS study
We applied SO2PLS on the regulomics and transcriptomics datasets, with 2 joint and 1 orthogonal components for each omics dataset. In each pair of the joint components, 1000 regulomics and 500 transcriptomics variables were selected. We then further identified the genes corresponding to the promoter regions where the selected 1000 histone modification locates (using ± 10K window from the transcription start site of the gene). These genes are of interest since they are likely to be related to epigenetic regulation of gene expression. Genes corresponding to the selected transcripts were also identified. These gene sets identified from each joint component of the two omics data were investigated separately using gene set enrichment analysis. The top results were listed in Table 4. The Gene Ontology (GO) enrichment analysis of the selected genes and regions showed terms related to HCM that were also found previously [28]. Due to the presence of the casecontrol status in both omics levels, we expect the joint components related to the disease. Plotting the joint scores of the two datasets showed a separation between HCM cases and controls (Fig. 5). For a comparison of score plots of PCA, PLS, O2PLS, and SO2PLS, please see Additional file 3.
Discussion
Statistical integration of two omics datasets is becoming increasingly popular to gain insight into underlying biological systems. O2PLS is a method that integrates two heterogeneous datasets and takes into account omicspecific variation. The resulting joint and orthogonal components are linear combinations of all variables, making interpretation difficult. To introduce sparsity and identify relevant groups, GO2PLS incorporates biological information on group structures to perform group selection in the joint subspace. Depending on the group size, such an approach may also lead to a higher selection probability of relevant features. We performed an extensive simulation study and showed that O2PLSbased methods generally outperformed PLSbased methods regarding joint score estimation when orthogonal variation was present in the data. Since PLS does not take into account orthogonal parts, the joint components also include part of the orthogonal variation. Further, when the sample size was small or the noise level was high, penalized methods appeared to be much less prone to overfitting than nonpenalized methods. This suggests that results based on GO2PLS are likely to be generalizable when applied to new datasets. Concerning feature selection, adding external group information led to higher TPR, and larger groups of relevant features had a higher proportion of being detected under a moderate noise level. We then applied GO2PLS to the TwinsUK study, where we selected 100 target genes comprising of CpG sites that are most related to IgG glycosylation. The results of the enrichment analysis on the selected genes showed GOterms involving the immune system in which the IgG glycans play important roles. In the CVONDOSIS study, we integrated regulomics and transcriptomics and identified 1000 regulatory regions and 500 transcripts, and mapped them to genes. Further analysis of the selected gene sets showed enrichment for terms related to heart muscle diseases. Moreover, the implementation of GO2PLS is computationally fast and memory efficient. It relies on an algorithm based on NIPALS that does not store large matrices of size \(p \times q\) when performing the grouppenalized optimization. A regular laptop (8G RAM, quadcore 2.6 GHz) was able to run GO2PLS on omics data from both case studies.
The group information should be chosen together with domain experts based on the research question and biological knowledge. For example, in our TwinsUK data application, we aimed to identify the genes comprising of CpG sites, rather than the individual CpG sites. Therefore, we grouped CpG sites in the same genetic region. Furthermore, the biological knowledge that closeby CpG sites tend to function together supported the choice of grouping. Different grouping information leads to a changed definition of groups, consequently the selected groups will have a different interpretation. In the same example of the TwinsUK study, extra analysis using smaller groups led to no significant results in the enrichment analysis, supposedly due to weaker aggregated group effects. When group information is not available, or the research goal is to identify individual features (e.g., in our CVONDOSIS data application), SO2PLS can be used.
In the CVONDOSIS study, Plotting the first two joint components showed two distinct classes corresponding to the casecontrol status. This might be expected since the analysis was conditional on casecontrol status, yielding a correlation between the two omics datasets. This phenomenon is well known in regression analysis of secondary phenotypes [29], but not well studied in PLS type of methods. This is a topic of future research. Often omics data are collected to study their relationship with an outcome variable or to predict an outcome variable. To this end, our approach has to be extended to incorporate the outcome variable. Such an approach might also lead to a more sparse solution since the selected features have to be correlated among the two omics datasets and the outcome variable. Further extensions of GO2PLS are to incorporate more than two omics datasets to represent the actual biological system even better.
Finally, it is possible to extend the GO2PLS algorithm to a probabilistic model. Extending latent variable methods to probabilistic models is not new. PCA was extended to Probabilistic PCA in [30], and Probabilistic PLS (PPLS) [31] was proposed to provide a probabilistic framework for PLS. It has been shown that the probabilistic counterpart has a lower bias in estimation and is robust to nonnormally distributed variables [31]. More importantly, the probabilistic model will allow statistical inference, making it possible to interpret the relevance and importance of features in the population, and facilitating followup studies. These extensions of GO2PLS will be suited for various studies with more complicated designs.
Conclusions
In this article, we proposed GO2PLS to integrate two omics data by estimating joint latent components. GO2PLS takes into account heterogeneity between different omics levels by including orthogonal components for each dataset. The method utilizes known group information among the features to select relevant groups of features, by imposing groupwise penalties in the joint subspace. Alternatively, the method can also choose features at the individual level. This flexibility facilitates investigation into different research questions. Our simulation study showed that GO2PLS behaved robust against noise and outperformed competing methods in terms of accuracy of joint score estimation, joint loading estimation, and feature selection. We applied GO2PLS to the datasets from two studies with distinct designs and showed that the results were biologically interpretable. To conclude, GO2PLS is a robust and flexible method for integrating two omics datasets and selecting important groups of features.
Availability of data and materials
The R scripts and functions for GO2PLS are publicly available in the OmicsPLS R package https://cran.rproject.org/package=OmicsPLS and can be installed in R via install.packages(“OmicsPLS”). Because of the sensitive nature of the data collected for the CVONDOSIS study, requests to access the dataset from qualified researchers trained in human subject confidentiality protocols may be sent to the corresponding authors. Individual level methylation and glycomics data from TwinsUK are not permitted to be shared or deposited due to the original consent given at the time of data collection. However, access to methylation and glycomics data can be applied for through the TwinsUK data access committee. For information on access and how to apply, visit http://www.twinsuk.ac.uk/dataaccess/submissionprocedure/.
Abbreviations
 GO2PLS::

Group Sparse Twoway Orthogonal Partial Least Squares
 SO2PLS::

Sparse Twoway Orthogonal Partial Least Squares
 PCA::

Principal Component Analysis
 PLS::

Partial Least Squares
 O2PLS::

Twoway Orthogonal Partial Least Squares
 SPLS::

Sparse Partial Least Squares (Chun and Keleş)
 sPLS::

Sparse Partial Least Squares (Lê Cao et al.)
 gPLS::

Group Partial Least Squares
 SVD::

Singular Value Decomposition
 NIPALS::

Nonlinear Iterative Partial Least Squares
 CV::

CrossValidation
 MSE::

Mean Squared Error
 TPR::

True Positive Rate
 IgG::

Immunoglobulin G
 HCM::

Hypertrophic Cardiomyopathy
 RPKM::

Reads per Kilobase Million
 CPM::

Counts per Million
 TMM::

Trimmed Mean of Mvalues
 GO::

Gene Ontology
References
 1.
Boulesteix ALL, Strimmer K. Partial least squares: A versatile tool for the analysis of highdimensional genomic data. Briefings Bioinform. 2007;8(1):32–44. https://doi.org/10.1093/bib/bbl016.
 2.
Wold S, Ruhe A, Wold H, Dunn III WJ. The Collinearity Problem in Linear Regression. The Partial Least Squares (PLS) Approach to Generalized Inverses. SIAM J Sci Stat Comput. 1984;5(3):735–43. https://doi.org/10.1137/0905052 arXiv:1308.0863v1
 3.
Trygg J, Wold S. O2PLS, a twoblock (XY) latent variable regression (LVR) method with an integral OSC filter. J Chemom. 2003;17(1):53–64. https://doi.org/10.1002/cem.775.
 4.
el Bouhaddani S, HouwingDuistermaat J, Salo P, Perola M, Jongbloed G, Uh HW. Evaluation of O2PLS in Omics data integration. BMC Bioinform. 2016;17(2):1–20. https://doi.org/10.1186/s128590150854z.
 5.
Jolliffe IT, Trendafilov NT, Uddin M. A modified principal component technique based on the LASSO. J Comput Graph Stat. 2003;12(3):531–47. https://doi.org/10.1198/1061860032148, arXiv:1205.0121v2
 6.
Chun, H., Keleş, S.: Sparse partial least squares regression for simultaneous dimension reduction and variable selection. J R Stat Soc Ser B: Stat Methodol 72(1), 3–25 (2010). https://doi.org/10.1111/j.14679868.2009.00723.x
 7.
Lê Cao, K.A., Rossouw, D., RobertGranié, C., Besse, P. A sparse PLS for variable selection when integrating omics data. Statist Appl Genet Mol Biol. 7(1) (2008). https://doi.org/10.2202/15446115.1390
 8.
Tyekucheva S, Marchionni L, Karchin R, Parmigiani G. Integrating diverse genomic data using gene sets. Genome Biology. 2011;12(10):105. https://doi.org/10.1186/gb20111210r105.
 9.
Yuan M, Lin Y. Model selection and estimation in regression with grouped variables. J R Stat Soc Ser B: Stat Methodol. 2006;68(1):49–67. https://doi.org/10.1111/j.14679868.2005.00532.x.
 10.
Liquet B, De Micheaux PL, Hejblum BP, Thiébaut R. Group and sparse group partial least square approaches applied in genomics context. Bioinformatics. 2016;32(1):35–42. https://doi.org/10.1093/bioinformatics/btv535.
 11.
Spector TD, Williams FMK. The UK Adult Twin Registry (TwinsUK). Twin Res Hum Genet. 2006;9(6):899–906. https://doi.org/10.1375/twin.9.6.899.
 12.
Moayyeri A, Hammond CJ, Hart DJ, Spector TD. The UK adult twin registry (twinsUK resource). Twin Res Hum Genet. 2013;16(1):144–9. https://doi.org/10.1017/thg.2012.89.
 13.
Wahl A, Kasela S, CarneroMontoro E, van Iterson M, Štambuk J, Sharma S, van den Akker E, Klaric L, Benedetti E, Razdorov G, TrbojevićAkmačić I, Vučković F, Ugrina I, Beekman M, Deelen J, van Heemst D, Heijmans BT, Consortium BIOS, Wuhrer M, Plomp R, Keser T, Šimurina M, Pavić T, Gudelj I, Krištić J, Grallert H, Kunze S, Peters A, Bell JT, Spector TD, Milani L, Slagboom PE, Lauc G, Gieger C. IgG glycosylation and DNA methylation are interconnected with smoking. Biochimica et Biophysica Acta (BBA)  General Subjects 1862(3), 637–648 (2018). https://doi.org/10.1016/J.BBAGEN.2017.10.012
 14.
CVONDOSIS – Cardiovascular Research Consortium. http://cvondosis.nl/. Accessed 18 Nov 2020
 15.
Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP. Gene set enrichment analysis: A knowledgebased approach for interpreting genomewide expression profiles. Proc Natl Acad Sci USA. 2005;102(43):15545–50. https://doi.org/10.1073/pnas.0506580102.
 16.
Chen YaA, Lemire M, Choufani S, Butcher DT, Grafodatskaya D, Zanke BW, Gallinger S, Hudson TJ, Weksberg R. Discovery of crossreactive probes and polymorphic CpGs in the Illumina Infinium HumanMethylation450 microarray. Epigenetics. 2013;8(2):203–9. https://doi.org/10.4161/epi.23470.
 17.
Uh HW, Klarić L, Ugrina I, Lauc G, Smilde AK, HouwingDuistermaat JJ. Choosing proper normalization is essential for discovery of sparse glycan biomarkers. Mol Omics. 2020. https://doi.org/10.1039/c9mo00174c.
 18.
Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler aD. The human genome browser at UCSC. Genome Res. 2002;12(6):996–1006. https://doi.org/10.1101/gr.229102.
 19.
UCSC Genome Browser Home. https://genome.ucsc.edu/. Accessed 19 Nov 2020
 20.
Robinson MD, Oshlack A. A scaling normalization method for differential expression analysis of RNAseq data. Tech Rep (2010). http://genomebiology.com/2010/11/3/R25
 21.
Wold H. Nonlinear Iterative Partial Least Squares (NIPALS) Modelling: Some Current Developments. In: Multivariate Analysis–III, pp. 383–407 (1973). https://doi.org/10.1016/b9780124266537.500326. https://www.sciencedirect.com/science/article/pii/B9780124266537500326
 22.
Witten DM, Tibshirani R, Hastie T. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics. 2009;10(3):515–34. https://doi.org/10.1093/biostatistics/kxp008.
 23.
Tibshirani R. Regression Shrinkage and Selection Via the Lasso. J R Stat Soc: Ser B (Methodological). 1996;58(1):267–88. https://doi.org/10.1111/j.25176161.1996.tb02080.x.
 24.
Hastie T, Tibshirani R, Wainwright M. Statistical learning with sparsity: the lasso and generalizations. Stat Learn Spars: Lasso General. 2015;84(1):1–337. https://doi.org/10.1201/b18401.
 25.
Chen J, Bardes EE, Aronow BJ, Jegga AG. ToppGene Suite for gene list enrichment analysis and candidate gene prioritization. Nucleic Acids Res. 2009;37(SUPPL.2). https://doi.org/10.1093/nar/gkp427.
 26.
Benjamini Y, Hochberg Y. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. J R Stat Soc: Ser B (Methodological). 1995;57(1):289–300. https://doi.org/10.1111/j.25176161.1995.tb02031.x.
 27.
Storey JD. A direct approach to false discovery rates. Technical Report. 2002;3. https://doi.org/10.1111/14679868.00346.
 28.
Gao J, Collyer J, Wang M, Sun F, Xu F. Genetic dissection of hypertrophic cardiomyopathy with myocardial rnaseq. Int J Mol Sci. 2020;21(9). https://doi.org/10.3390/ijms21093040
 29.
Tissier R, Tsonaka R, Mooijaart SP, Slagboom E, HouwingDuistermaat JJ. Secondary phenotype analysis in ascertained family designs: application to the Leiden longevity study. Stat Med. 2017;36(14):2288–301. https://doi.org/10.1002/sim.7281.
 30.
Bishop CM, Tipping ME. Probabilistic Principal Component Analysis. J R Stat Soc. Ser B 61(iii), 611–622 (1999). https://doi.org/10.1111/14679868.00196
 31.
el Bouhaddani S, Uh HW, Hayward C, Jongbloed G, HouwingDuistermaat J. Probabilistic partial least squares model: Identifiability, estimation and application. J Multivar Anal. 2018;167:331–46. https://doi.org/10.1016/j.jmva.2018.05.009. arXiv:1706.03597
Acknowledgements
The authors would like to thank M. Harakalova, and M. Mokry from the Dept. of Cardiology, UMC Utrecht for providing the CVONDOSIS data and discussion on the analysis of the CVONDOSIS datasets. We thank M. Michels and J. van der Velden for providing the HCM tissues, the biobank of UMC Utrecht, the biobank of the Washington University School of Medicine, and the Sydney Heart Bank for providing nonfailing donor tissue. This work has received support from the EU/EFPIA Innovative Medicines Initiative 2 Joint Undertaking BigData@Heart grant (116074). TwinsUK is funded by the Wellcome Trust, Medical Research Council, European Union, Chronic Disease Research Foundation (CDRF), Zoe Global Ltd and the National Institute for Health Research (NIHR)funded BioResource, Clinical Research Facility and Biomedical Research Centre based at Guy’s and St Thomas’ NHS Foundation Trust in partnership with King’s College London.
Funding
The research leading to these results has received funding and support from the European Union’s Horizon 2020 research and innovation programme IMforFUTURE under H2020MSCAITN grant agreement number 721815, from the EU/EFPIA Innovative Medicines Initiative 2 Joint Undertaking BigData@Heart grant (116074), and from the ERANet for Research Programmes on Rare Diseases (Erare 3 – MSAomics project). The funding bodies did not play any role in the design of the study and collection, analysis, and interpretation of data and in writing the manuscript.
Author information
Affiliations
Contributions
ZG performed the mathematical work, simulation study, data anaylysis, and wrote the manuscript. JH and HWU provided the underlying idea of the methods. SB supported the programming. All authors contributed to the discussion of the methods, simulation and data analysis. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
TwinsUK study: Ethical approval was granted by the National Research Ethics Service LondonWestminster, the St Thomas’ Hospital Research Ethics Committee (EC04/015 and 07/H0802/84). All research participants have signed informed consent prior to taking part in any research activities. CVONDOSIS study: the study protocol was approved by the local ethics committee of the Erasmus MC (2010409), the Biobank Research Ethics Committee of University Medical Center Utrecht (protocol number 12/387), and the Washington University School of Medicine Ethics Committee (Institutional Review Board). Informed consent was obtained from each patient prior to surgery or was waived by the ethics committee when acquiring informed consent was not possible due to the death of the donor.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Additional file 1
. The details of solving the optimization problem (8).
Additional file 2
. The full lists of significant results of gene set enrichment analyses in the TwinsUK study and the CVONDOSIS study.
Additional file 3
. Additional analysis of the CVONDOSIS datasets, where score plots of PCA, PLS, O2PLS, SO2PLS are shown and compared.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Gu, Z., el Bouhaddani, S., Pei, J. et al. Statistical integration of two omics datasets using GO2PLS. BMC Bioinformatics 22, 131 (2021). https://doi.org/10.1186/s12859021039583
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s12859021039583
Keywords
 Integration of Omics data
 Dimension reduction
 Feature selection
 Group structure
 O2PLS