A multivariate approach to the integration of multi-omics datasets
© Meng et al.; licensee BioMed Central Ltd. 2014
Received: 22 January 2014
Accepted: 14 May 2014
Published: 29 May 2014
Skip to main content
© Meng et al.; licensee BioMed Central Ltd. 2014
Received: 22 January 2014
Accepted: 14 May 2014
Published: 29 May 2014
To leverage the potential of multi-omics studies, exploratory data analysis methods that provide systematic integration and comparison of multiple layers of omics information are required. We describe multiple co-inertia analysis (MCIA), an exploratory data analysis method that identifies co-relationships between multiple high dimensional datasets. Based on a covariance optimization criterion, MCIA simultaneously projects several datasets into the same dimensional space, transforming diverse sets of features onto the same scale, to extract the most variant from each dataset and facilitate biological interpretation and pathway analysis.
We demonstrate integration of multiple layers of information using MCIA, applied to two typical “omics” research scenarios. The integration of transcriptome and proteome profiles of cells in the NCI-60 cancer cell line panel revealed distinct, complementary features, which together increased the coverage and power of pathway analysis. Our analysis highlighted the importance of the leukemia extravasation signaling pathway in leukemia that was not highly ranked in the analysis of any individual dataset. Secondly, we compared transcriptome profiles of high grade serous ovarian tumors that were obtained, on two different microarray platforms and next generation RNA-sequencing, to identify the most informative platform and extract robust biomarkers of molecular subtypes. We discovered that the variance of RNA-sequencing data processed using RPKM had greater variance than that with MapSplice and RSEM. We provided novel markers highly associated to tumor molecular subtype combined from four data platforms. MCIA is implemented and available in the R/Bioconductor “omicade4” package.
We believe MCIA is an attractive method for data integration and visualization of several datasets of multi-omics features observed on the same set of individuals. The method is not dependent on feature annotation, and thus it can extract important features even when there are not present across all datasets. MCIA provides simple graphical representations for the identification of relationships between large datasets.
There has been rapid progress in high-throughput technologies and platforms to assay global mRNA, miRNA, methylation, proteins, and metabolite profiles of cells are readily available. Advances in RNA-sequencing and mass spectrometry (MS) based proteomics have dramatically improved coverage and quality of genomic, transcriptomic and proteomic profiling [1–4]. Increasing number of studies including The Cancer Genome Atlas (TCGA) and ENCyclopedia of DNA Elements (ENCODE) projects systematically profile large number of biological samples resulting in multiple levels of quantitative information [5–8]. Recent advances of MS based proteomics provide a complementary approach to genomics and transcriptomic technologies [3, 4] and systematic analyses can now be carried out to identify and quantify the majority of proteins expressed in human cells [9–12]. These data yield unprecedented views of molecular building blocks and the machinery of cells. Interpreting these large-scale datasets to derive information about a biological system represents a considerable challenge often faced by investigators.
Multiple omics data analysis can be broadly defined by some common questions, which are dependent on the data collected; multiple datasets measuring the same biological molecules or multiple datasets each measuring different biological molecules. In the first case, given multiple transcriptomics data from different microarray or RNA-sequencing studies, the aim may be to discover which platform is the most informative with highest quality data, identify robust biomarkers across datasets or highlight platform specific discrepancies in measurements. In the second case, given multiple different data such as transcripts, proteins and metabolites, the objective may be to integrate and concatenate information to increase the breath and coverage of available data in a biological network. In this case, specific platform discrepancies are less important and performance of data integration is more likely to be assessed using system biology or pathway approaches.
Nevertheless, both analyses face common challenges associated with integrating data from disparate technologies. Several meta-analysis studies map identifiers from each platform to a common set of identifiers to generate a single concatenated matrix for subsequent analysis [13, 14]. However, this data simplification overlooks several fundamental platform and biological biases. Platforms are not universal and measure different molecules. Filtering genes to their intersection may considerably reduce data coverage. In addition, the many-to-many mapping of gene identifiers from multiple platforms complicates direct comparison of molecules across multiple levels. Moreover, because correlations between different platforms are probably lower than expected , it may not provide gains in data quality or study power. Such filtering may also introduce bias because platform discrepancies could reflect biological variation. For instance, poor correlation between a transcript and its translated protein may result from biological processes such as microRNA post-transcriptional repression [16, 17]. Similarly, correlations between proteins and metabolites of pathways can diverge if proteins are expressed in an inactive form, in which case its abundance may not represent activity.
Ordination methods, such as principal component analysis (PCA), independent component analysis (ICA) and correspondence analysis (COA), are exploratory data analysis approaches that have been applied to analyze omics data including transcriptome and proteome studies [18–22]. Graphical representation of measurements (samples) and variables (genes, proteins) on a lower dimensional space facilitates interpretation of global variance structure and identification of the most informative (or variant) features across datasets. These methods permit visualization of data that have considerable levels of noise and data where the number of variables exceeds the number of measurements, which is typical in omics studies. However, these approaches do not solve the problem of comparing many datasets simultaneously.
Studies have extended these approaches to couple two datasets together . One such approach is co-inertia analysis (CIA) . CIA was originally applied to study ecological and environmental tables, where it was employed to link environmental variables with species characteristics . Culhane and colleagues introduced CIA in genomics, when they compared data from two microarray platforms . An advantage of this method is that it does not require the mapping or filtering of genes to a common set. The relationship between co-inertia analysis and related methods including Procrustes analysis , canonical correlation analysis with Elastic Net penalization (CCA-EN) and sparse Partial Least Squares (sPLS) have been described previously . CIA and sPLS both maximize the covariance between eigenvectors and are efficient in determining main individual effects in paired dataset analysis. By contrast CCA-EN maximizes the correlation between eigenvectors and tends to discover effects present in both datasets, but may omit to discover strong individual effects. Variables selected by CCA-EN and sPLS are highly similar but CIA selected marginally different marker genes that may have some redundancy . A noteworthy advantage of CIA is that it can be coupled with several dimension reduction approaches, including PCA or correspondence analysis, such that it can accommodate both discrete count data (e.g. somatic mutation) and continuous data . These approaches are performed on each dataset separately and can be integrated using CIA . However, all above methods including CIA are limited to the analysis of two datasets, limiting their application in modern multi-omics studies. Several approaches have been proposed for integrating more than two datasets, such as consensus PCA (CPCA) , regularized generalized canonical correlation analysis (RGCCA) , sparse generalized canonical correlation analysis (SGCCA)  and penalized canonical correlation analysis (PCCA) . SGCCA and PCCA originally focus on the feature selection from multiple datasets, but also can be used for multiple table integration problem.
Here, we describe another method, multiple co-inertia analysis (MCIA), for the analysis of more than two omics datasets, extending its application in the field of environmental science and, recently, phylogenetics . MCIA is related to consensus PCA (CPCA) which both maximize the square covariance between eigenvectors and are subject to similar constraints . CPCA is less sensitive to multi-collinearity within each dataset than generalized canonical correlation analysis . We illustrate the application of MCIA using two different examples, and show that integrated analysis is more insightful than analysis of the individual datasets. First, we demonstrated the power of MCIA via applying it to the integration and comparison of multi-omics data independent of data annotation. We employed MCIA to identify common relationships among multiple gene and protein expression data of the NCI-60 cancer cell line panel of the National Cancer Institute [8, 11, 33]. The integrated analysis revealed that cell lines are clustered according to anatomical tissue source and showed a significant degree of correlation between transcript and protein expression. Second, we assessed the concordance in gene expression data obtained from microarray and next generation RNA-sequencing of 266 samples of high grade serous ovarian cancer. MCIA integrated ovarian cancer gene expression data from different sources which captured distinct subsets of the transcriptome (<47% of genes were present on all four platforms) to reveal a set of biomarkers that were consistently highly ranked by all four platforms and were biologically relevant to ovarian cancer. To enable community access to MCIA, we implemented the method into the R-Bioconductor (omicade4) package as an easy-to-use tool for bioinformaticians and biologists.
A typical omics dataset is a matrix where the number of features exceeds the number of measurements (rows and columns of the matrix, respectively). MCIA requires a set of tables where either features or measurements are matched and have equal weights. MCIA is performed in a two-step process. First a one table ordination method, such as PCA, COA or non-symmetric correspondence analysis (NSC)  is applied on each dataset separately, which transforms data into comparable lower dimensional spaces.
where xij is the centered row profile, i.e. the relative abundance of selected variable to the measurement’s weight.
The superscript T and −1 stand for matrix transposition and matrix inversion respectively. Therefore, the formula (6) removes the dimension that is spanned by vector u k 1 (k = 1, …, K) to get a residual matrix, which is passed to the SVD to find the second order solution. These steps are repeated until the desired number of axes (principal components, dimensions) is generated. As a result, MCIA provides a simultaneous ordination of columns (measurements) and rows (features) of multiple tables within the same hyperspace, with features or measurements sharing similar trends will be closely projected. The detailed description of MCIA and the proof that these axes are maximally co-variant are given in Chessel and Hanafi [26, 35].
We analyzed publicly available sets of data from two studies: (i) transcriptomic [8, 33, 36] and proteomic  datasets of the NCI-60 cancer cell line panel, the latter one generated in our group, and (ii) an ovarian cancer dataset generated as part of the TCGA project . In each study, there are multiple datasets measuring molecules (mRNA or proteins) from the same samples (cell lines or tumors).
The NCI-60 panel is a collection of 59 cancer cell lines of leukemia, lymphomas, melanomas and carcinomas of ovarian, renal, breast, prostate, colon, lung and central nervous system (CNS) origin. The NCI-60 transcriptome data were downloaded from Cellminer  and were obtained on four different platforms; Affymetrix HG-U133 plus 2.0, HG-U133, HG-U95 and Agilent GE 4x44K . Affymetrix data were normalized using GC robust multichip averaging GCRMA;  and Agilent data were log transformed as obtained from the Cellminer. Although data filtering is not required to perform MCIA, to facilitate data interpretation, microarray data were filtered to exclude probes that do not map to an official HUGO gene symbol. The probe with highest average value was retained when multiple probes mapped to the same gene. Filtering produced datasets of 11,051; 8,803; 9,044 and 10,382 genes on Agilent, HG-U95, HG-U133 and HG-U133 plus 2.0 platforms respectively. The lung cancer cell line NCI-H23 was excluded since its expression profile was not available on the HG-U133 platform. A Venn diagram representing the overlapping genes in the processed data for each platform is provided in Additional file 1: Figure S1.
The proteome profiles of cell lines were produced from a conventional GeLC-MS/MS approach and label-free quantification, as described in . The international protein index (IPI) identifiers were mapped to official gene symbol to facilitate subsequent pathway interpretation. Data were log transformed (base 10) and no filtering or additional normalization were performed. This dataset represents 7,150 protein expressions across 58 cell line in NCI-60 panel.
Gene expression of tumors from ovarian cancer patients were profiled using two microarray platforms (Agilent customized platform G4502A and Affymetrix GeneChip HG U133 plus 2.0) and RNA-sequencing on Illumina HiSeq platform. Data were downloaded from the NCI-TCGA data portal 07/08/2013; . Patient samples (266 out of 489) that were present in all four datasets were included in the analysis. The Agilent and Affymetrix data were normalized and summarized by lowess and robust multichip averaging (RMA), respectively . The transcript expression levels of the Illumina RNA-sequencing data were determined using two different pre-processing pipelines (RPKM and RSEM) denoted as RNASeq and RNASeqV2, respectively. Normalization and quantification of RNASeq followed the RPKM method  whereas the alignment and gene expression quantification in RNAseqV2 were obtained by MapSplice and RSEM [42, 43]. In RNASeq and RNASeqV2; 20,657 and 20,135 genes were detected (before filtering). These data were filtered to exclude genes with more than 15 missing values. Only genes mapped to an official gene symbol were retained. For the features mapped to the same gene symbol, the one with the largest average expression value was kept. Remaining missing values were replaced with a positive value far smaller than the lowest expression value in each dataset (10−15 in RNASeq and 10−10 in RNASeqV2) and then, the expression values were log transformed (base 10). After filtering, the Agilent, Affymetrix, RNASeq and RNASeqV2 datasets contained 17,814; 12,042; 16,769 and 15,840 gene expression measurements respectively. The Venn diagram representing the overlap of genes in these datasets is shown in Additional file 1: Figure S2.
The NCI-60 panel, a collection of 59 cancer cell lines derived from nine different tissues (brain, blood and bone marrow, breast, colon, kidney, lung, ovary, prostate and skin) has been extensively used in in vitro high-throughput drug screen assays. They have been molecularly profiled using comparative genomic hybridization array , karyotype analysis , DNA mutational analysis [46, 47], transcripts expression array [33, 48], microarrays for microRNA expression  and protein expression .
There was greater divergence in the cell lines from tumors with more intrinsic molecular heterogeneity (e.g. breast and NSCLC cell lines). The transcriptome and proteome profiles of the individual breast and lung cell lines were projected close in space demonstrating that the expression profiles shared a high degree of consensus. The tight projection of multiple data types from diverse technology platforms provides evidence that the observed spread of cell lines reflected the biological variance (tumor cell lines heterogeneity), as opposed to inter-study technical or other stochastic variance. For instance, we observed that the estrogen receptor positive breast cancer cell line MCF7 displays an epithelial phenotype and clustered to colon cancer lines. In contrast, the cell line negative for the estrogen receptor, HS578T, clustered with the stromal/mesenchymal cluster of glioblastoma and renal tumor cell lines. This suggests that HS578T exhibits more invasive mesenchymal features compared to T47D and MCF7.
Each PC has an associated eigenvalue which represents the amount of variability contained in that PC. The first three PCs of the MCIA accounted for 17.4%, 14.2% and 9.7% of variance respectively (each eigenvalue divided by the sum of all eigenvalues; Additional file 1: Figure S4). The observation that the first two PCs capture less than a third of the structure in the datasets (Figure 1A) reflects the complexity inherent in cell lines of 58 tumors from nine different organs. In order to identify the contribution of each dataset to the total variance, that is, to what extent each dataset deviates or agrees with what the majority of datasets support, we extracted the MCIA pseudo-eigenvalues. Figure 1B shows the pseudo-eigenvalues associated with the first two principal components of each dataset. Comparison within microarray data revealed that Affymetrix HGU133 Plus 2.0 accounts for the highest variance on axis 1 and 2, possibly because this platform contains informative features, or features that are poorly detected or absent on other platforms. We observed that the similarity within transcriptome datasets is greater than the similarity between transcriptome and proteome data, which is consistent with the results shown by Figure 1A.
One of the most attractive features of MCIA is that it can be used to highlight lack or presence of co-structure between datasets, thus it allows selection of the strongest features from each dataset for subsequent analysis. For instance, we observed in particular, large variation between the protein and transcript expression patterns of two cell lines, melanoma SKMEL2 and ovarian IGROV1. The proteome coordinates of SKMEL2 were close to the origin and far from the transcriptomic data that was projected on the negative end of PC2 with the other melanoma cell line data. The poor information content in proteome data of the SKMEL2 cell line could reflect the lack of expression of melanin related genes on protein level. Similarly, the incongruence of the proteome and transcriptome data of the ovarian cell line IGROV1 may be due to expression of less epithelial markers that projected on the positive direction of axis 2.
To characterize the overall correlation between each pair of high dimensional data we calculated the pair-wise RV coefficient, a multivariate generalization of the squared Pearson correlation coefficient . For each pair of datasets, the RV-coefficient is calculated as the total co-inertia (sum of eigenvalues of co-inertia, i.e. sum of eigenvalues of the product of two cross product matrices) divided by the square root of the product of the squared total inertia (sum of the eigenvalues) from the individual analysis. As the co-structure between two datasets increases, the RV score move towards to 1. A zero RV score indicates no similarity. The overall similarity in structure between microarray data was higher than the similarity between microarray and proteomics data; average RV coefficient > 0.9 and 0.76 respectively (Additional file 1: Figure S5).
When MCIA was performed on the same transcriptome data and the subset of proteome data that were quantified in all 58 cell lines (n = 524 proteins, no missing values), the filtered proteome data had a higher consensus to the co-structure and increased pseudo-eigenvalues (Additional file 1: Figure S6).
The first axis (PC1, horizontal axis), which explains the largest variance, separated cells with epithelial or mesenchymal characteristics, suggesting that epithelial-mesenchymal transition (EMT) is an essential mechanism defining different classes of cancers (Figure 2A). EMT has been shown to play an important role in epithelial cell malignancy and metastasis . Epithelial markers, including SLC27A2, CDH1, SPINT1, S100P and EPCAM had high weights on the positive side of PC1 (Figures 2B-F). At the opposite end, mesenchymal and collagen markers, including GJA1, which is involved in epicardial to mesenchymal cell transition, and TGFβ2 were observed (Additional file 2: Table S1). The second (vertical) axis, PC2, clearly separated melanoma and leukemia from other epithelial cancer types. The strongest determinant of the vertical axis is a set of melanoma-related genes, namely melanoma-associated antigens (MAGEA), melanogenic enzyme (GPNMB) as well as TYR, DCT, TYRP1, MALANA, S100B and BCL2A1. The top 100 genes with greatest weights on PC1 and PC2 were selected from each dataset (Figures 2B-F) and the full list of markers is provided in Additional file 2: Table S1. Among 1,377 selected genes, 145 were measured in three or more datasets. MCIA enables the study of the union of features from all studies. Among the NCI-60 datasets, less than 12% of the total 17,805 genes studied were measured in all five datasets. By observing highly ranked genes across studies, one can identify robust markers that could be subject to further analysis.
Complementary information can be obtained by integrating data from different platforms and data types which increases the genome coverage and power of subsequent pathway analysis. While numerous genes were over-expressed in both the transcriptome and proteome data, some (HCLS1, PECAM and two integrins, ITGAL, ITGB2) were identified exclusively in the proteome dataset (Figure 3A). We observed that leukocyte related biological functions, such as activation of mononuclear leukocyte, mobilization of Ca2+ and activation of lymphocyte were most strongly associated with the leukemia cell lines (Additional file 3: Table S2). Enrichment analysis suggested that the most significantly enriched pathways are, leukemia extravasation signaling pathway (p = 1.04−11; Figure 3B), which is responsible for leukocyte migration and related to metastasis in leukemia cell lines , T cell receptor signaling (p = 5.25−5) and iCos-iCosL signaling in T helper cells (p = 8.32−5; Additional file 4: Table S3).
To further demonstrate the advantage of combining multiple layers of information in pathway analysis, we performed identical analysis only based on transcriptome markers from all of the four microarray studies. Although leukocyte extravasation signaling was still the most enriched pathway, it did not reach the same level of significance (p = 1.14−4). In addition, pathways that are not strongly associated with leukemia were also significantly enriched (p < 0.01; hereditary breast cancer signaling and NFAT in Cardiac Hypertrophy). Several pathways that are associated with leukemia and were detected in the combined analysis were absent, including NF-kB pathway and PI3K Signaling in B lymphocytes (Additional file 4: Table S3).
We repeated this analysis on the set of MCIA discovered features associated with melanoma (Additional file 3: Table S2). The selected features comprised of proteins and genes that are highly expressed in melanoma cell lines, such as TYR, TYRP1 and BCL2A1. These were significantly enriched in the biological functions or pathways associated with eumelanin biosynthesis and disorder of pigmentation including the melanocyte development and pigmentation signaling pathway (Additional file 3: Table S2; Figure 3C). Melanocytic development and pigmentation is regulated in large part by the bHLH-Lz microphthalmia-associated transcription factor (MITF) and MITF activity is controlled by at least two pathways: MSH and Kit signaling. BCL2A1 is transcriptionally activated by MITF and serves as an anti-apoptosis factor . Interestingly, the upstream regulator of MITF, lEF1, was also consistently identified on the same direction in all transcriptome datasets (Figure 2). It is of note that, although all five datasets contributed to the coverage of this pathway, MITF was solely detected in the Affymetrix data. MCIA can increase coverage and, the power of pathway (and other annotation) analyses as it does not require mapping or pre-filtering of features to the subset common to all datasets. MCIA allows easy integration of multiple omics levels to identify classes that are relevant in the given biological context.
In generalized canonical correlation analysis (GCCA) several sets of variables are analyzed simultaneously. Several generalizations of CCA have been described. These employ different methods, including sum of correlations (SUMCOR), sum of squared correlations (SSQCOR) and sum of absolute value correlations (SABSCOR) . Recently Tenenhaus and coworkers introduced regularized generalized canonical correlation analysis (RGCCA) to generalize RCCA to multi-block data analysis of data where the number of variables exceed the number of cases . We compare MCIA to several RGCCA methods that are defined by different shrinkage parameters and optimization criteria (Additional file 1: Figure S7-S9).
First, we compared three different optimization criteria in RGCCA, namely SUMCOR, SABSCOR, SSQCOR with MCIA. As depicted in Additional file 1: Figure S7, the SUMCOR method and MCIA algorithm consistently return similar results with positively correlated axes (Additional file 1: Figure S7). Also the identified components from the SABSCOR and SSQCOR methods are always highly correlated to the MCIA results, but it is important to note that the correlation could be either positive or negative. This is inconvenient for the comparison and integration of multiple omics datasets, as the components from one dataset might be inverted in another dataset.
By tuning the shrinkage parameter т, which can range from 0 to 1, RGCCA balances optimizing the intra-table and inter-table covariance. Additional file 1: Figure S8 and S9 show that the identified components are nearly identical across datasets for т = 0. The smaller the shrinkage parameter т, the higher is the correlation between neighboring components from different datasets. But the variance of each individual dataset is less well explained by the components. In contrast, the results of RGCCA with a shrinkage parameter of т = 1 are very similar to MCIA results. In this case, RGCCA gives priority to finding a component that explains its own block well . Similarly, MCIA maximizes the variance within each table and the covariance of components of each table with a consensus reference structure through a synthetic analysis. It is important to note that in omics data analyses, the number of features is generally much larger than the number of observations. Therefore, a low т should be avoided as it results in overfitting of the data and apparently perfect correlations, which rarely represent meaningful information.
We applied MCIA to compare the consistency and discrepancy in gene expression profiles of ovarian cancer tumors obtained by RNA-sequencing and Affymetrix and Agilent microarray technologies (Figure 4A). The results revealed high overall similarity in structure between the four datasets and three platforms.
Recent microarray gene expression profiling studies have reported four subtypes of ovarian cancer (proliferative, immunoreactive, mesenchymal and differentiated) [37, 55]. These HGS-OvCa subtypes can be clearly distinguished along the first two MCIA axes (Figure 4A). The first axis generally separated samples with immunoreactive versus proliferative characteristics. Whereas the second axis distinguished tumors with a mesenchymal subtype which show a short survival time  from the differentiated ovarian cancer samples. Consistent with other studies, MCIA identified large overlap between the four subtypes, indicating that most samples exhibited features of multiple subtype signatures . In order to find whether this classification correlates with clinical factors, we compared the first two PCs with clinical records provided from the TCGA data portal and the Verhaak study . This comparison revealed that age at diagnosis was significantly negatively correlated with PC1 and positively correlated PC2 (Pearson correlation p = 1.29−3 and p = 3.56−4 respectively), suggesting that differentiated and immunoreactive patients tend to present at younger age. The percentage of stromal cells is positively correlated with PC2 (Pearson correlation p = 1.79−3), which is in consensus with the mesenchymal subtype having greater percentage of stromal cells . Other clinical factors, such as somatic mutation, drug treatment and tumor stages did not significantly correlate with either axis.
Both microarray and RNA sequencing data resulted in a similar ordination of tumor samples in the MCIA space. In order to identify which genes contribute significantly to the divergence of samples, we examined the gene expression variables superimposed onto the same space (Figure 4B-E). The top 100 genes from the end of each axis were selected. The full list of selected genes and their enriched pathways are provided in Additional file 5: Table S4. Each dataset contained different genes. Approximately 47% of genes (9,755 genes) were measured on all four datasets (Additional file 1: Figure S2). Among 1096 genes selected as the top 100 genes from each datasets on PC1 and PC2 only 82 genes were in at least three platforms and 27 (2.5%) were present in all datasets. Several of these “robust” markers, have been previously implicated in ovarian cancer [37, 56]. Many T-cell activation and trafficking genes, such as CXCL9, CD2 and CD3D were projected onto the positive end of the first axes, which represented the immunoreactive subtype tumors. MCIA revealed new markers that might be associated with the immune system, including SH2D1A, RHOH, SAA1, SAA2 and GNLY. This is further corroborated by numerous GO terms significantly associated with genes on this end of the axis (DAVID functional annotation) . For instance, significantly enriched gene sets include glycoprotein, chemotaxis, defense and immune response (FDR < 0.01, Additional file 5: Table S4). The genes at the opposite end of the MCIA axes included transcriptional factors SOX11, HMGA2, along with several cell cycle promoters, such as BEX1, MAPK4 as well as nerve system development regulators (TBX1, TUBB2B), which characterize the proliferative subtype. Genes that are expressed on the positive end of axis 2, such as POSTN, CXCL14 and HOXA5, define the mesenchymal cluster. Other potential mesenchymal subtype markers for ovarian cancer include ASPN, homeobox alpha genes as well as collagens. ASPN is a critical regulator of TGF-beta pathway that induces the epithelial mesenchymal transition. Gene set analysis revealed that mesenchymal genes are enriched in GO terms including cell adhesion, skeletal system development, collagen and ECM receptor interaction pathway (Additional file 5: Table S4).
The robust markers at the differentiated end include oviductal glycoprotein 1 (OVGP1/MUC9), SPDEF, KIAA1324, GJB1 and ALPPL2, some of which have already been reported as ovarian biomarkers. For instance, OVGP1 has been suggested as a possible serum marker for the detection of low grade ovarian cancer . Although the TCGA dataset is all high grade serous ovarian cancer, in our analysis, it was highly expressed in differentiated subtype. Human SPDEF protein plays a significant role in tumorigenesis in multiple cancers, including ovarian cancer and has been reported to suppress prostate tumor metastasis. A recent study on prostate cancer demonstrated that SPDEF suppresses cancer metastasis through down-regulation of matrix metalloproteinase 9 and 13 (MMP9, MMP13), which are required for the invasive phenotype of cells . Our analysis implied that SPDEF and matrix metalloproteinase plays a similar role in the development of ovarian cancer. In addition, it has been shown that, in a mouse model, POSTN down-regulates ALPP mRNA . POSTN and ALPPL2 were projected onto the diametral ends of axes 2, which implies that the same mechanism of regulation exists in ovarian cancer and can be exploited to distinguish subtypes. Interestingly, the DAVID gene set analysis of markers for the differentiated phenotype did not reveal as strong gene set enrichments as described for the other subtypes (lowest FDR = 0.0022 vs. 10−47 to 10−9; Additional file 5: Table S4) indicating that this subtype exhibits considerably higher degree of heterogeneity.
In the present study, we described multiple co-inertia analysis (MCIA), an exploratory data analysis method that can identify co-relationships between multiple high dimensional datasets. MCIA projects multiple sets of features onto the same dimensional space and provides a simple graphical representation for the efficient identification of concordance between datasets. The sets of features may have none or few features in common. By transforming multiple sources of data onto the same scale, the most variant features are transformed onto the same scale. This allows one to extract and easily combine sets of omic features (genes, proteins, etc.) for greater power in subsequent pathway analysis. MCIA provides a consensus reference structure of datasets, revealing similar trends among multiple tables. Compared to RGCCA, we found that MCIA is most similar to the SUMCOR version of RGCCA with т = 1 in practice.
Our integrative analysis of NCI-60 cell line panel indicated that, although both transcriptome and proteome cell lines were clustered according to their lineage, they provides complementary information. We demonstrated that integrated analysis of gene and protein expression data increases the power of pathway analysis and yields more information than an analysis of gene expression alone. MCIA highly ranked the leukemia extravasation signaling pathway. This pathway was overrepresented with features that were predominantly from the proteomics data and were enriched in biological functions of “activation of mononuclear leukocyte and lymphocyte”. MCIA of high grade serous ovarian cancer revealed four previously described subtypes of ovarian cancer and provided novel subtype markers. An advantage of MCIA is that it couples multiple set of features measured on the same set of samples. Since it does not rely on feature annotation, it is not limited by the immaturity of annotations. There is no prerequisite to filter or map features (genes) to a common set thereby considerably increasing genome coverage.
In a study that compares CIA with other sparse multiple table analysis methods (sPLS and CCA-EN), LeCao et al. suggested that CIA may result in redundancy when it is used for feature selection since it does not include a built-in procedure for variable selection . Similarly, MCIA does not impose any sparsity in the result, so MCIA selects much more features than methods introducing the Lasso penalty, such as SGCCA  or PCCA . Hence, the interpretation of MCIA selected features would have to be coupled with other methods, such as enrichment analysis, in order to reveal functional insights. We also note that the MCIA algorithm finds solutions in a sequential manner and each order of components requires a singular value decomposition (SVD) for a large dataset. The computationally intensity of the algorithm increases with sample size as more components are retained. For instance, the CPU time of analysis of the NCI-60 data with 5 kept principal components was around 68 seconds on Intel Xeon 1596 MHz using one thread of a Linux server.
In conclusion, we believe MCIA is a useful method for integration of multiple omics datasets where the same tissue or cell lines have been assayed multiple times. MCIA is available to the community via an R-Bioconductor (“omicade4”) package which includes documentation and a vignette.
The microarray data of NCI-60 cell lines are available through CELLMINER (http://discover.nci.nih.gov/cellminer/home.do). The NCI-60 proteomic data can be downloaded from http://wzw.tum.de/proteomics/NCI60/ as well as from https://www.proteomicsdb.org. The ovarian cancer data are available through the TCGA download portal (https://tcga-data.nci.nih.gov/tcga/tcgaHome2.jsp).
Canonical correlation analysis
Canonical correlation analysis with elastic net penalty
Central nervous system
Consensus principal component analysis
The encyclopedia of DNA elements
Generalized canonical correlation analysis
GC robust multichip averaging
In-gel digestion and liquid chromatography tandem mass spectrometry
High grade serous ovarian cancer
Independent component analysis
Ingenuity pathways analysis
International protein index
Multiple co-inertia analysis
The national cancer institute
Non-symmetric correspondence analysis
Non-small-cell lung carcinoma
Polyacrylamide gel electrophoresis
Principal component analysis
Penalized canonical correlation analysis
Regularized generalized canonical correlation analysis
Robust multichip averaging
Reads per kilo base per million
RNA-Seq by Expectation Maximization
Singular value decomposition
Sparse generalized canonical correlation analysis
Sparse partial least square
The cancer genome atlas.
The authors gratefully acknowledge Hannes Hahne and Mathias Wilhelm of TUM and Curtis Huttenhower of Harvard School of Public Health for critical reading of the manuscript. Funding for this work was provided by Dana-Farber Cancer Institute Women’s Cancers Program, the Claudia Adams Barr foundation and the National Cancer Institute at the National Institutes of Health [grant numbers 1RC4CA156551-01, 1U19CA148065].
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.