Integrated analysis of the NCI-60 cell line transcriptome and proteome
The NCI-60 panel, a collection of 59 cancer cell lines derived from nine different tissues (brain, blood and bone marrow, breast, colon, kidney, lung, ovary, prostate and skin) has been extensively used in in vitro high-throughput drug screen assays. They have been molecularly profiled using comparative genomic hybridization array [44], karyotype analysis [45], DNA mutational analysis [46, 47], transcripts expression array [33, 48], microarrays for microRNA expression [8] and protein expression [11].
MCIA was applied as an exploratory analysis of four transcriptomic studies (Agilent n = 11,051; HGU95 n = 8,803; HGU133 n = 9,044 and HGU133 plus 2.0 n = 10,382) and one proteomic study (GeLC-MS/MS; n = 7,150) of the 58 cell lines. Figure 1A shows the projection of cell lines onto the first two principal components (PCs) of MCIA. Similar to the visualization employed in CIA [26], the datasets are transformed into the same projection. The coordinates of the five measurements for each cell line are connected by lines. The length of which indicates the divergence (the shorter the line, the higher the level of concordance) between the mRNAs and protein expression levels for a particular cell line. The MCIA plot of the first two principal components shows similar trends in transcriptome and proteome profiles, indicating that the most variant sources of biological information were similar. Cell lines originating from the same or closely related anatomical source of tissue were projected close to each other and converged into groups. The colon, leukemia, melanoma, CNS, renal and ovarian cell lines segregated largely according to their tissue of origin. Seven out of eight melanoma lines clustered together, and the remaining one, LOX-IMVI, has been reported to lack melanin production [49]. These results are consistent with independently performed hierarchical clustering analysis (Additional file 1: Figure S3).
There was greater divergence in the cell lines from tumors with more intrinsic molecular heterogeneity (e.g. breast and NSCLC cell lines). The transcriptome and proteome profiles of the individual breast and lung cell lines were projected close in space demonstrating that the expression profiles shared a high degree of consensus. The tight projection of multiple data types from diverse technology platforms provides evidence that the observed spread of cell lines reflected the biological variance (tumor cell lines heterogeneity), as opposed to inter-study technical or other stochastic variance. For instance, we observed that the estrogen receptor positive breast cancer cell line MCF7 displays an epithelial phenotype and clustered to colon cancer lines. In contrast, the cell line negative for the estrogen receptor, HS578T, clustered with the stromal/mesenchymal cluster of glioblastoma and renal tumor cell lines. This suggests that HS578T exhibits more invasive mesenchymal features compared to T47D and MCF7.
Overall co-structure comparison using MCIA
Each PC has an associated eigenvalue which represents the amount of variability contained in that PC. The first three PCs of the MCIA accounted for 17.4%, 14.2% and 9.7% of variance respectively (each eigenvalue divided by the sum of all eigenvalues; Additional file 1: Figure S4). The observation that the first two PCs capture less than a third of the structure in the datasets (Figure 1A) reflects the complexity inherent in cell lines of 58 tumors from nine different organs. In order to identify the contribution of each dataset to the total variance, that is, to what extent each dataset deviates or agrees with what the majority of datasets support, we extracted the MCIA pseudo-eigenvalues. Figure 1B shows the pseudo-eigenvalues associated with the first two principal components of each dataset. Comparison within microarray data revealed that Affymetrix HGU133 Plus 2.0 accounts for the highest variance on axis 1 and 2, possibly because this platform contains informative features, or features that are poorly detected or absent on other platforms. We observed that the similarity within transcriptome datasets is greater than the similarity between transcriptome and proteome data, which is consistent with the results shown by Figure 1A.
One of the most attractive features of MCIA is that it can be used to highlight lack or presence of co-structure between datasets, thus it allows selection of the strongest features from each dataset for subsequent analysis. For instance, we observed in particular, large variation between the protein and transcript expression patterns of two cell lines, melanoma SKMEL2 and ovarian IGROV1. The proteome coordinates of SKMEL2 were close to the origin and far from the transcriptomic data that was projected on the negative end of PC2 with the other melanoma cell line data. The poor information content in proteome data of the SKMEL2 cell line could reflect the lack of expression of melanin related genes on protein level. Similarly, the incongruence of the proteome and transcriptome data of the ovarian cell line IGROV1 may be due to expression of less epithelial markers that projected on the positive direction of axis 2.
To characterize the overall correlation between each pair of high dimensional data we calculated the pair-wise RV coefficient, a multivariate generalization of the squared Pearson correlation coefficient [50]. For each pair of datasets, the RV-coefficient is calculated as the total co-inertia (sum of eigenvalues of co-inertia, i.e. sum of eigenvalues of the product of two cross product matrices) divided by the square root of the product of the squared total inertia (sum of the eigenvalues) from the individual analysis. As the co-structure between two datasets increases, the RV score move towards to 1. A zero RV score indicates no similarity. The overall similarity in structure between microarray data was higher than the similarity between microarray and proteomics data; average RV coefficient > 0.9 and 0.76 respectively (Additional file 1: Figure S5).
When MCIA was performed on the same transcriptome data and the subset of proteome data that were quantified in all 58 cell lines (n = 524 proteins, no missing values), the filtered proteome data had a higher consensus to the co-structure and increased pseudo-eigenvalues (Additional file 1: Figure S6).
MCIA axes describe biological properties
In contrast to traditional clustering methods, MCIA projects the original data onto a lower dimensional space, maximizing the covariance of each dataset with respect to the reference structure. In MCIA plots, a gene that is highly expressed in a certain cell line will be projected in the direction of this cell line and the greater the distance from the origin, the stronger the association. In order to identify biomarkers that are highly associated with cancer cell lines of different origins, we examined the feature space of mRNAs and proteins that were projected in the same direction and space (Figure 2).
The first axis (PC1, horizontal axis), which explains the largest variance, separated cells with epithelial or mesenchymal characteristics, suggesting that epithelial-mesenchymal transition (EMT) is an essential mechanism defining different classes of cancers (Figure 2A). EMT has been shown to play an important role in epithelial cell malignancy and metastasis [51]. Epithelial markers, including SLC27A2, CDH1, SPINT1, S100P and EPCAM had high weights on the positive side of PC1 (Figures 2B-F). At the opposite end, mesenchymal and collagen markers, including GJA1, which is involved in epicardial to mesenchymal cell transition, and TGFβ2 were observed (Additional file 2: Table S1). The second (vertical) axis, PC2, clearly separated melanoma and leukemia from other epithelial cancer types. The strongest determinant of the vertical axis is a set of melanoma-related genes, namely melanoma-associated antigens (MAGEA), melanogenic enzyme (GPNMB) as well as TYR, DCT, TYRP1, MALANA, S100B and BCL2A1. The top 100 genes with greatest weights on PC1 and PC2 were selected from each dataset (Figures 2B-F) and the full list of markers is provided in Additional file 2: Table S1. Among 1,377 selected genes, 145 were measured in three or more datasets. MCIA enables the study of the union of features from all studies. Among the NCI-60 datasets, less than 12% of the total 17,805 genes studied were measured in all five datasets. By observing highly ranked genes across studies, one can identify robust markers that could be subject to further analysis.
Integration of proteomics and transcriptomics complements the biological information
To further evaluate the biological significance of the features selected by MCIA, we employed Ingenuity Pathway Analysis (IPA: http://www.ingenuity.com) to discover significant canonical pathways which discriminate different cell lines (Figure 3). In MCIA plots, samples and features are projected onto the same space. The genes with strongest association to a cell line are those projected in the same direction and have the highest weights (greater distance from the origin). As features have been transformed on the same scale, the union of features from each individual dataset can be easily extracted and concatenated to provide greater coverage in pathway analysis. Features strongly associated to each tissue type from both transcriptome and proteome datasets can be concatenated and mapped to signaling pathways. There is no requirement to extract equal numbers of features from each data type. For example we observed that features strongly associated with leukemia related features tended to be enriched in the proteins (Figure 3A). The most extreme features associated with the leukemia cell lines were selected from all platforms using their coordinates and were subjected to the functional and pathway analysis. The full list of features, the coordinate feature selection criteria and their functional and pathway analysis are provided in Additional file 3: Tables S2 and Additional file 4: Table S3.
Complementary information can be obtained by integrating data from different platforms and data types which increases the genome coverage and power of subsequent pathway analysis. While numerous genes were over-expressed in both the transcriptome and proteome data, some (HCLS1, PECAM and two integrins, ITGAL, ITGB2) were identified exclusively in the proteome dataset (Figure 3A). We observed that leukocyte related biological functions, such as activation of mononuclear leukocyte, mobilization of Ca2+ and activation of lymphocyte were most strongly associated with the leukemia cell lines (Additional file 3: Table S2). Enrichment analysis suggested that the most significantly enriched pathways are, leukemia extravasation signaling pathway (p = 1.04−11; Figure 3B), which is responsible for leukocyte migration and related to metastasis in leukemia cell lines [52], T cell receptor signaling (p = 5.25−5) and iCos-iCosL signaling in T helper cells (p = 8.32−5; Additional file 4: Table S3).
To further demonstrate the advantage of combining multiple layers of information in pathway analysis, we performed identical analysis only based on transcriptome markers from all of the four microarray studies. Although leukocyte extravasation signaling was still the most enriched pathway, it did not reach the same level of significance (p = 1.14−4). In addition, pathways that are not strongly associated with leukemia were also significantly enriched (p < 0.01; hereditary breast cancer signaling and NFAT in Cardiac Hypertrophy). Several pathways that are associated with leukemia and were detected in the combined analysis were absent, including NF-kB pathway and PI3K Signaling in B lymphocytes (Additional file 4: Table S3).
We repeated this analysis on the set of MCIA discovered features associated with melanoma (Additional file 3: Table S2). The selected features comprised of proteins and genes that are highly expressed in melanoma cell lines, such as TYR, TYRP1 and BCL2A1. These were significantly enriched in the biological functions or pathways associated with eumelanin biosynthesis and disorder of pigmentation including the melanocyte development and pigmentation signaling pathway (Additional file 3: Table S2; Figure 3C). Melanocytic development and pigmentation is regulated in large part by the bHLH-Lz microphthalmia-associated transcription factor (MITF) and MITF activity is controlled by at least two pathways: MSH and Kit signaling. BCL2A1 is transcriptionally activated by MITF and serves as an anti-apoptosis factor [53]. Interestingly, the upstream regulator of MITF, lEF1, was also consistently identified on the same direction in all transcriptome datasets (Figure 2). It is of note that, although all five datasets contributed to the coverage of this pathway, MITF was solely detected in the Affymetrix data. MCIA can increase coverage and, the power of pathway (and other annotation) analyses as it does not require mapping or pre-filtering of features to the subset common to all datasets. MCIA allows easy integration of multiple omics levels to identify classes that are relevant in the given biological context.
Comparison of MCIA and regularized generalized canonical correlation analysis (RCGGA)
In generalized canonical correlation analysis (GCCA) several sets of variables are analyzed simultaneously. Several generalizations of CCA have been described. These employ different methods, including sum of correlations (SUMCOR), sum of squared correlations (SSQCOR) and sum of absolute value correlations (SABSCOR) [29]. Recently Tenenhaus and coworkers introduced regularized generalized canonical correlation analysis (RGCCA) to generalize RCCA to multi-block data analysis of data where the number of variables exceed the number of cases [29]. We compare MCIA to several RGCCA methods that are defined by different shrinkage parameters and optimization criteria (Additional file 1: Figure S7-S9).
First, we compared three different optimization criteria in RGCCA, namely SUMCOR, SABSCOR, SSQCOR with MCIA. As depicted in Additional file 1: Figure S7, the SUMCOR method and MCIA algorithm consistently return similar results with positively correlated axes (Additional file 1: Figure S7). Also the identified components from the SABSCOR and SSQCOR methods are always highly correlated to the MCIA results, but it is important to note that the correlation could be either positive or negative. This is inconvenient for the comparison and integration of multiple omics datasets, as the components from one dataset might be inverted in another dataset.
By tuning the shrinkage parameter т, which can range from 0 to 1, RGCCA balances optimizing the intra-table and inter-table covariance. Additional file 1: Figure S8 and S9 show that the identified components are nearly identical across datasets for т = 0. The smaller the shrinkage parameter т, the higher is the correlation between neighboring components from different datasets. But the variance of each individual dataset is less well explained by the components. In contrast, the results of RGCCA with a shrinkage parameter of т = 1 are very similar to MCIA results. In this case, RGCCA gives priority to finding a component that explains its own block well [29]. Similarly, MCIA maximizes the variance within each table and the covariance of components of each table with a consensus reference structure through a synthetic analysis. It is important to note that in omics data analyses, the number of features is generally much larger than the number of observations. Therefore, a low т should be avoided as it results in overfitting of the data and apparently perfect correlations, which rarely represent meaningful information.
Integrated analysis of microarray and RNA-sequencing ovarian cancer datasets
In the ovarian cancer datasets, MCIA was applied to several microarray and RNA-seq gene expression datasets; Agilent, Affymetrix, RNASeq, RNASeqV2 which contained 17,814, 12,042, 16,769, and 15,840 genes respectively. In the MCIA space, the first PC (horizontal axis) accounted for 19.6% of the total variance and the second PC (vertical axis) accounted for 10.6% of variance (Additional file 1: Figure S10). In comparison to microarray data, RNA sequencing data typically contains many missing values. These are generated when multiple experiments are combined. We excluded genes (rows) with high number of missing values. After filtering genes with more than 15 missing values in RNA-seq data, the four datasets contributed similarly to the total variance (Figure 4 and Additional file 1: Figure S11). Among the two RNA-seq datasets, RNASeq consistently tended to be more variant than RNASeqV2 on PC1-5 (Additional file 1: Figure S12). RNASeq and RNASeqV2 were generated from the same Illumina RNA-sequencing data but using two different pre-processing approaches. MCIA results indicated that normalization and quantification with the RPKM method (RNASeq) outperforms MapSplice and RSEM (RNASeqV2). The informativeness or variance in RNA sequencing data tended to be sensitive to pre-processing and filtering algorithms which is expected given that methods for processing these data are still emerging. In addition, Affymetrix profiles were generally more variant than Agilent as indicated by greater pseudo-eigenvalues on PC1-3. When the microarray and RNASeq data were compared, we detected several outlier genes that were highly variant on PC1 and PC2 on RNASeq but absent on the microarray platforms. These include CDHR4 and HESRG which are highly expressed by the differentiated subtype (Figure 4) [54].
MCIA identified ovarian subtypes
We applied MCIA to compare the consistency and discrepancy in gene expression profiles of ovarian cancer tumors obtained by RNA-sequencing and Affymetrix and Agilent microarray technologies (Figure 4A). The results revealed high overall similarity in structure between the four datasets and three platforms.
Recent microarray gene expression profiling studies have reported four subtypes of ovarian cancer (proliferative, immunoreactive, mesenchymal and differentiated) [37, 55]. These HGS-OvCa subtypes can be clearly distinguished along the first two MCIA axes (Figure 4A). The first axis generally separated samples with immunoreactive versus proliferative characteristics. Whereas the second axis distinguished tumors with a mesenchymal subtype which show a short survival time [56] from the differentiated ovarian cancer samples. Consistent with other studies, MCIA identified large overlap between the four subtypes, indicating that most samples exhibited features of multiple subtype signatures [56]. In order to find whether this classification correlates with clinical factors, we compared the first two PCs with clinical records provided from the TCGA data portal and the Verhaak study [56]. This comparison revealed that age at diagnosis was significantly negatively correlated with PC1 and positively correlated PC2 (Pearson correlation p = 1.29−3 and p = 3.56−4 respectively), suggesting that differentiated and immunoreactive patients tend to present at younger age. The percentage of stromal cells is positively correlated with PC2 (Pearson correlation p = 1.79−3), which is in consensus with the mesenchymal subtype having greater percentage of stromal cells [56]. Other clinical factors, such as somatic mutation, drug treatment and tumor stages did not significantly correlate with either axis.
MCIA suggests robust subtype biomarkers
Both microarray and RNA sequencing data resulted in a similar ordination of tumor samples in the MCIA space. In order to identify which genes contribute significantly to the divergence of samples, we examined the gene expression variables superimposed onto the same space (Figure 4B-E). The top 100 genes from the end of each axis were selected. The full list of selected genes and their enriched pathways are provided in Additional file 5: Table S4. Each dataset contained different genes. Approximately 47% of genes (9,755 genes) were measured on all four datasets (Additional file 1: Figure S2). Among 1096 genes selected as the top 100 genes from each datasets on PC1 and PC2 only 82 genes were in at least three platforms and 27 (2.5%) were present in all datasets. Several of these “robust” markers, have been previously implicated in ovarian cancer [37, 56]. Many T-cell activation and trafficking genes, such as CXCL9, CD2 and CD3D were projected onto the positive end of the first axes, which represented the immunoreactive subtype tumors. MCIA revealed new markers that might be associated with the immune system, including SH2D1A, RHOH, SAA1, SAA2 and GNLY. This is further corroborated by numerous GO terms significantly associated with genes on this end of the axis (DAVID functional annotation) [57]. For instance, significantly enriched gene sets include glycoprotein, chemotaxis, defense and immune response (FDR < 0.01, Additional file 5: Table S4). The genes at the opposite end of the MCIA axes included transcriptional factors SOX11, HMGA2, along with several cell cycle promoters, such as BEX1, MAPK4 as well as nerve system development regulators (TBX1, TUBB2B), which characterize the proliferative subtype. Genes that are expressed on the positive end of axis 2, such as POSTN, CXCL14 and HOXA5, define the mesenchymal cluster. Other potential mesenchymal subtype markers for ovarian cancer include ASPN, homeobox alpha genes as well as collagens. ASPN is a critical regulator of TGF-beta pathway that induces the epithelial mesenchymal transition. Gene set analysis revealed that mesenchymal genes are enriched in GO terms including cell adhesion, skeletal system development, collagen and ECM receptor interaction pathway (Additional file 5: Table S4).
The robust markers at the differentiated end include oviductal glycoprotein 1 (OVGP1/MUC9), SPDEF, KIAA1324, GJB1 and ALPPL2, some of which have already been reported as ovarian biomarkers. For instance, OVGP1 has been suggested as a possible serum marker for the detection of low grade ovarian cancer [58]. Although the TCGA dataset is all high grade serous ovarian cancer, in our analysis, it was highly expressed in differentiated subtype. Human SPDEF protein plays a significant role in tumorigenesis in multiple cancers, including ovarian cancer and has been reported to suppress prostate tumor metastasis. A recent study on prostate cancer demonstrated that SPDEF suppresses cancer metastasis through down-regulation of matrix metalloproteinase 9 and 13 (MMP9, MMP13), which are required for the invasive phenotype of cells [59]. Our analysis implied that SPDEF and matrix metalloproteinase plays a similar role in the development of ovarian cancer. In addition, it has been shown that, in a mouse model, POSTN down-regulates ALPP mRNA [60]. POSTN and ALPPL2 were projected onto the diametral ends of axes 2, which implies that the same mechanism of regulation exists in ovarian cancer and can be exploited to distinguish subtypes. Interestingly, the DAVID gene set analysis of markers for the differentiated phenotype did not reveal as strong gene set enrichments as described for the other subtypes (lowest FDR = 0.0022 vs. 10−47 to 10−9; Additional file 5: Table S4) indicating that this subtype exhibits considerably higher degree of heterogeneity.