 Methodology article
 Open access
 Published:
A networkassisted coclustering algorithm to discover cancer subtypes based on gene expression
BMC Bioinformatics volume 15, Article number: 37 (2014)
Abstract
Background
Cancer subtype information is critically important for understanding tumor heterogeneity. Existing methods to identify cancer subtypes have primarily focused on utilizing generic clustering algorithms (such as hierarchical clustering) to identify subtypes based on gene expression data. The networklevel interaction among genes, which is key to understanding the molecular perturbations in cancer, has been rarely considered during the clustering process. The motivation of our work is to develop a method that effectively incorporates molecular interaction networks into the clustering process to improve cancer subtype identification.
Results
We have developed a new clustering algorithm for cancer subtype identification, called “networkassisted coclustering for the identification of cancer subtypes” (NCIS). NCIS combines gene network information to simultaneously group samples and genes into biologically meaningful clusters. Prior to clustering, we assign weights to genes based on their impact in the network. Then a new weighted coclustering algorithm based on a seminonnegative matrix trifactorization is applied. We evaluated the effectiveness of NCIS on simulated datasets as well as largescale Breast Cancer and Glioblastoma Multiforme patient samples from The Cancer Genome Atlas (TCGA) project. NCIS was shown to better separate the patient samples into clinically distinct subtypes and achieve higher accuracy on the simulated datasets to tolerate noise, as compared to consensus hierarchical clustering.
Conclusions
The weighted coclustering approach in NCIS provides a unique solution to incorporate gene network information into the clustering process. Our tool will be useful to comprehensively identify cancer subtypes that would otherwise be obscured by cancer heterogeneity, using highthroughput and highdimensional gene expression data.
Background
For a given type of cancer, there are often subtypes that harbor unique sets of genomic changes and exhibit different patterns of gene expression [1–5]. Subtype information is critically important to tailor more effective treatments for patients, as varying subtypes often respond disparately to the same treatment. In the past decade, many generic clusteringbased approaches have been developed to identify cancer subtypes based on gene expression data. Typically, expression levels of d genes measured on n samples are presented as a d × n realvalued matrix with the entries representing the corresponding expression level. A clustering method can be applied to partition the columns/rows of this matrix into different clusters such that items in one cluster have similar expression patterns. The partition of columns offers clues to potential cancer subtypes, while the partition of rows can highlight potentially relevant coexpressed genes. The most popular clustering methods used in cancer subtype identification include hierarchical clustering and kmeans [6, 7]. Recently, a number of other clustering methods have also been developed. Consensus clustering [8] is a clustering framework where the same clustering algorithm is applied to different subsets of the data multiple times. A consensus result is then collected to better describe the similarities between samples. This framework has been widely used in cancer subtype analysis [9]. To address the highdimensional feature space problem (which is almost always the case for gene expression analysis), the method developed in [10] uses sparse clustering techniques to adaptively select a small set of informative features to cluster the samples. There are also several clustering methods that were specifically designed for cancer subtype clustering. In [11, 12], survival time information was used to select survivalassociated genes and then the samples were clustered using gene expression. In [13, 14], an integrated approach was developed to consider multiple types of omics data (e.g. gene expression, mutation, copy number, methylation) to help identify cancer subtypes. However, we are more interested in only utilizing the gene expression data (which is much more accessible than any other types of omics data) and getting the subtype information of a patient (based on molecular features such as gene expression) without using clinical features (such as survival time). More importantly, most previous studies did not incorporate biological information, particularly molecular interactions networks, into the clustering step. Indeed, network is key to understanding the molecular perturbations in cancer [15, 16]. If we consider the interconnection between genes during the clustering process, we have more knowledge of gene interactions at a systems level and may improve our ability to identify cancer subtypes. This will in turn allow us to analyze the perturbation of a group of genes and pathways rather than individual ones to better understand tumor heterogeneity.
The motivation of our work is to develop a method that effectively incorporates molecular interaction networks into the clustering process to improve cancer subtype identification. To this end, one previous work employed information of biological networks during the clustering [17]. The method first defined a network distance based on the proximity of two genes in the network and an expression distance of two genes, and then constructed the overall distance metric as a function of network and expression distance metrics for hierarchical clustering. Another recent work incorporated network information to cluster genotypes and phenotypes based on phenotypegene association matrix [18]. The authors did so by adding penalty and regularization terms into the clustering objective to keep the final results consistent with clusters obtained from prior knowledge on the disease phenotype similarity network. However, these two approaches are not appropriate for our cancer subtype identification. First, as our goal is to cluster cancer patients (not genes as in [17]), we cannot add a networkbased distance in the distance metric defined for patients. Second, networkderived clusters [18] are also difficult to define for patients since there is no network structure linking all the patients (like the phenotype similarity network). Finally, simply combining network proximitydefined gene clusters directly with gene expression clusters may be misleading, since neighboring genes can have entirely different expression patterns.
In this work, we introduce a new coclustering algorithm to effectively integrate network information with expression variation across samples. We call our method “networkassisted coclustering for the identification of cancer subtypes” (NCIS). The method first learns a weight for each gene as an indicator of its importance to be used in the clustering. The key idea is that genes regulating many other genes and showing highly variable expression patterns will be considered as more informative in the clustering process. Another important contribution of this work is that we embed the gene weights directly into the coclustering objective function.
Coclustering simultaneously clusters both samples and features [19, 20]. In coclustering, similarity is a measure of the coherence of features (e.g. genes) and samples in a bicluster, rather than a function of feature pairs or sample pairs. Consequently, it considers the local context and is able to automatically select subsets that share similar attributes [21, 22]. The method we utilize in NCIS is based on SemiNonnegative Matrix TriFactorization (SNMTF) [23, 24], a member of the matrix factorizationbased clustering family. A common underlying assumption of such coclustering methods is that there exist cluster centroids that characterize the behavior and trend of cluster members, which is mathematically formed as matrix trifactorization. Matrix factorization has simple formalization when compared to other methods, and was shown to be useful in gene expression analysis [25, 26]. To our knowledge, NCIS is the first method to apply SNMTF to achieve weighted coclustering in cancer subtype identification.
Methods
Method overview
We developed a clustering method that incorporates the gene network (i.e. the interactions between genes) as prior knowledge and simultaneously cluster samples and genes into distinct groups. Adding network structure to the clustering step will help us better select representative genes for clustering. We expect that such a method will generate more biologically informative clusters. The main scheme of our method is shown in Figure 1. Note that we assume the input expression data has already been preprocessed such that the batch effects and technical artifacts are already removed. We implemented the algorithm in MATLAB. Source code and users’ manual can be found at http://bioencompbio.bioen.illinois.edu/NCIS/.
Assigning weights to genes
Feature selection is essential for successful pattern recognition from the highdimensional data. In many previous studies, genes were selected based on their median absolute deviation (MAD) or coefficient of variation (CV) [9, 27]. The cutoff was set rather arbitrarily and typically only a small subset of genes was retained for subsequent analysis, which drastically reduces the amount of information used in clustering. Other dimensionreduction methods such as principal component analysis (PCA) [28] are useful, but the biological interpretation is not always straightforward as expression vectors of the samples are projected to a lowdimensional principal component space [29, 30]. On the other hand, incorporating additional biologically relevant information as prior knowledge could help resolve ambiguities in the data because it provides, to a certain extent, insight into how the gene expression profiles were generated. Therefore, we utilize the gene network as well as expression information to select genes that both play more important roles in the network and show larger variations among samples. Our method assigns a weight to each gene; genes with higher weights will be prioritized during the weighted coclustering.
We use a modified PageRank algorithm to assign weights to genes. The original PageRank [31] views the web as a directed graph. Suppose there are N nodes (web pages), then E is a N × N matrix denoting the connections among the nodes. A link from page i to page j is shown by an edge pointing from node i to node j, and in the matrix form denoted by E_{ ij } = 1. More links to node j raises its confidence level. The algorithm ranks all the nodes based on the iterative formula:
where r^{n} is the confidence level in the n^{th} iteration and {deg}_{i}={\displaystyle {\sum}_{j=1}^{N}}{E}_{\mathit{ij}}, referring to the total number of web pages that i points to; α is a parameter representing the extent to which the ranking depends on the structure of the graph. In our case, we have a similar network representing the molecular interactions among genes. Our method that assigns weights to genes is similar to a gene ranking method, GeneRank [32], which extended the idea of PageRank to microarray gene differential expression analysis. However, we use a directed graph (rather than undirected graph in GeneRank), because we believe the direction of edges that models gene regulation is important. Additionally, rather than using differential expression, we use an alternative method to consider gene expression variation across samples as described below.
In our graph, a directed edge from node i to node j means that gene i regulates the expression of gene j. Genes with larger variations among samples tend to have more distinguishing power. We incorporate such variation into the model to assign weights to genes. Our main idea is that genes having a lot of heavyweight downstream targets should be assigned large weights – a rationale similar to the confidence vote in the original PageRank, except that outgoing edges increase a gene’s weight while incoming edges increase a web page’s weight. Our weighttraining approach is:
where {deg}_{i}={\displaystyle {\sum}_{j=1}^{N}}{E}_{\mathit{ji}} is the total number of genes that regulate gene i; w^{n} is the weight vector of genes in the n^{th} iteration and NMAD is the normalized median absolute deviation (MAD):
where max(MAD) is the maximum of vector MAD.
We use MAD as a measurement of the expression variation of a gene among all the samples. The values are normalized by the maximum value in all MAD’s to make the weighttraining mechanism stable and comparable with different overall expression levels. In each iteration, every gene i is evaluated by its own MAD as well as the weights and connections of the genes that i regulates. The final weight of each gene reflects both its impact in the network and its ability to separate the samples.
The convergence of this iterative algorithm is guaranteed for any 0 < α < 1 [32, 33]. Let w^{n+1} = w^{n}, we have
We can write in the matrix form as
where D is a diagonal matrix with entries deg_{ i }, 1 ≤ i ≤ N.
The final weight for all the genes can be represented as:
where I is the N × N identity matrix.
To make weights trained with different parameters more comparable, we normalized w such that the maximum of w is 1. We chose a relatively large α value (α = 0.85) to make the weights rely more on the network structure.
Weighted coclustering algorithm
After assigning weights to genes, our input data include the gene expression profile of each sample and the weights for all the genes from the previous step. We developed a new weighted coclustering method to simultaneously separate samples into subtypes and group genes into functionally relevant subclasses. Our method is based on SemiNonnegative Matrix TriFactorization (SNMTF), where the nonnegative constraint on the data matrix imposed on Orthogonal Nonnegative Matrix TriFactorization (ONMTF) is relaxed to make it suitable for general dataset.
Objective
Suppose our gene expression matrix X contains d genes and n samples, and we would like to group the genes into m clusters and group the samples into c clusters (subtypes). For convenience, the main notations used in the rest of the paper are presented in Table 1. Our method can be specified as minimizing the following objective,
Here, G denotes the cluster each gene belongs to and F denotes the cluster of every sample. Entries of matrix S can be treated as centroids of the blocks generated. The aforementioned weights are presented in the diagonal matrix W, and we incorporate an “importance indicator” by multiplying the weights to the row (gene) norms. This is to prioritize genes with large weights in the optimization step. Due to difficulties in minimizing the objective with the binaryvalue constraint on F and G, we relax F and G into continuous nonnegative domain as in previous related work [24]. We only require {\displaystyle \sum _{j=1}^{m}}{G}_{\mathit{ij}}=1,{\displaystyle \sum _{j=1}^{c}}{F}_{\mathit{ij}}=1. Thus our objective is to minimize:
Optimization
We set:
Then we have:
We can get a clearer understanding of S from this expression. If G and F are defined as in Table 1, i.e., 0/1valued partition matrix, F^{T}F should be a c × c diagonal matrix, whose entries represent the number of samples belonging to each sample cluster. G^{T}WG should be a m × m diagonal matrix, with entries equal to the total weights of the features (genes) belonging to each of the m feature clusters. Similar to the interpretation of F^{T}F, G^{T}WG can be considered as the weighted total number of features in each feature cluster (taking feature i as w_{ i } features when counting the total number). Therefore, (G^{T}WG)^{1}G^{T}WX represents the feature cluster centroids on the sample space (ndimension) and XF(F^{T}F)^{1} represents the sample cluster centroids on the feature space (ddimension). The difference is that all the samples are assumed to have the same weight of 1, while features are assigned different weights W. Entries of S can be viewed as feature cluster centroids on the samplecentroids space (cdimension) or as sample cluster centroids on the genecentroids space (mdimension). Therefore, S gives the centroids information of the biclusters after partitioning.
Now, assume S and G are fixed. Let β ∈ ℝ^{n × c} be the Lagrangian multiplier for F, then Lagrangian function for F is
We set:
Using KarushKuhnTucker condition [34], we have
where A = X^{T}WGS, B = S^{T}G^{T}WGS; M^{+} and M^{} are the positive and negative of matrix M defined as {M}^{+}=\frac{\leftM\right+M}{2},{M}^{}=\frac{\leftM\rightM}{2}, respectively. Therefore, we obtain the iterating formula for F:
Similar derivation leads to the iterative formula of G:
where C = WXFS^{T}, D = SF^{T}FS^{T}.
The iterations decrease the value of the objective function J. Convergence of the algorithm can be shown using a typical approach for the convergence proof of NMFbased methods. For more details, see the proof in the (Additional file 1: Supplementary Materials).
Our algorithm is as follows:

Initialize F and G.

While not convergent and iterations less than a predefined value
Update S by
S = (G^{T}WG)^{− 1}G^{T}WXF(F^{T}F)^{− 1};
Update F by
Update G by
m and c selection
A question raised in almost all clustering methods is how to determine the cluster numbers. There is no agreedupon solution. Here we utilize an approach that takes advantage of the stochastic property of our algorithm: although NCIS may not converge to the same solution on each run with different initiations, we could expect the results to be very stable if the clustering is strong enough [8, 35]. As in [8, 35], we ran NCIS 50 times with randomly selected initiations and defined a sample consensus matrix \overline{{M}_{s}} and a gene consensus matrix \overline{{M}_{g}}. For each run, a n × n sample connectivity matrix M_{ s } and a d × d gene connectivity matrix M_{ g } are obtained:
Consensus matrices \overline{{M}_{s}} and \overline{{M}_{g}} are the averages of M_{ s }’s and M_{ g }’s over 50 runs respectively. The entries range between 0 and 1, where 0 indicates that the corresponding samples (or genes) belong to different clusters in every run and 1 indicates that they belong to the same clusters in all the cases. Therefore, 1\overline{M} offers a new distance metric between the items (1\overline{{M}_{s}} for samples and 1\overline{{M}_{g}} for genes). Similar to [35], we used 1\overline{{M}_{s}} and 1\overline{{M}_{g}} to hierarchically cluster samples and genes, and then we define an average cophenetic correlation coefficient \frac{\rho \left({\overline{M}}_{s}\right)+\rho \left(\overline{{M}_{g}}\right)}{2} to evaluate the stability. Cophenetic correlation coefficient ρ(C) is defined as the Pearson correlation between distance matrix 1C and the distance matrix induced by the linkage used in hierarchical clustering for reordering C. If a clustering is stable, the entries would be close to 0 and 1 (two modes), and in the ideal case (only 0 and 1) ρ(C) would be exactly 1. We observe how the cophenetic correlation coefficients change as m and c change and select values where the averaged coefficient starts to decrease.
Results and discussion
We applied NCIS to two largescale datasets from TCGA as well as simulated datasets to evaluate the effectiveness of our method. We built the network using a variety of sources, including the network used in [36] as well as our uptodate curated information from Reactome [37], the NCINature Curated PID [38], and KEGG [39]. The network from [36] consisted of inferred geneinteraction from sources of information such as protein interactions, gene coexpression, protein domain interaction, and textmined interaction described by [40]. To aggregate all of the networks together, all redundant edges were collapsed to single edges. We combined the edges of each of the databases such that a link between any two nodes A and B exists in the aggregated network if a link between A and B exists in any of the databases we used. The resulting aggregated network consisted of 11,648 genes and 211,794 edges. Our method assumes that the network is an aggregation of different biological networks, such as proteinprotein interaction network, transcriptional regulatory network, and signaling network etc. In the MATLAB implementation of our program, we also allow users to provide other network information as needed.
Application to TCGA breast cancer dataset
The first dataset we used is from a recent largescale breast cancer study from TCGA [41]. This dataset contains the expression levels of 17,814 genes across 547 samples. We first integrated the gene expression profile with the aggregated network information mentioned above, and trained weights for 8,726 genes included in both of these resources (for genes that are not included in either expression profile or network information, we ignored them). We set α = 0.85 (α is a tuning parameter that represents the extent to which gene weights rely on network structure; see Methods part). The 8,726 weighted genes and 547 samples were the input of NCIS. Figure 2 shows the heatmap with genes and samples rearranged according to the NCIS’s clustering result. Based on the cophenetic correlation coefficient calculated from 50 runs (see Methods part), we chose number of patient clusters c = 5 and number of gene clusters m = 8 (Additional file 1: Table S1).
Since we did not know the true class each sample belongs to or the number of subtypes, we used clinical features to evaluate the effectiveness of the clustering algorithm. The underlying idea was that patients in different subgroups were expected to have some different clinical characteristics. We used the following available clinical information to evaluate subtypes identification result: survival time, AJCC staging information (neoplasm disease lymph node stage, neoplasm disease stage and tumor stage) and tumor nuclei percentage. AJCC neoplasm disease lymph node stage represents the stage of cancer based on the lymph nodes present. Neoplasm disease stage represents the extent of a cancer, especially whether the disease has spread from the original site to other parts of the body. Tumor stage is a class assigned to a malignancy which allows for the grouping of similar cancer types based on the extent of disease in the primary tumor, regional lymph nodes, and metastatic sites. Tumor nuclei percentage represents the percentage of tumor nuclei in a malignant neoplasm specimen (from TCGA data dictionary). Table 2 gives the significance level of the difference among all subtypes for each feature. Given pvalue threshold 0.05, we conclude that the NCIS (α = 0.85)defined subtypes successfully separated the patients according to these clinical features.
We also set α = 0 (no network information was used) in the coclustering process to show the impact of network information. Similar statistical tests were performed (Table 2). In general, NCIS (α = 0.85) showed better pvalues in separating the patients in terms of clinical features than NCIS (α = 0).
In the original TCGA paper [41], the authors performed a hierarchical clustering using a subset of genes (most varied across samples) and identified 13 subtypes (test results for clinical features are shown in Table 2 as TCGA/BRCA paper). Since consensus hierarchical clustering generally performs better than the traditional hierarchical clustering, we also applied a consensus average linkage hierarchical clustering [8, 42]. To make a fair comparison, we used all 8,726 genes. The program was run over 1,000 iterations and the resampling rate of the sample was set to 0.8. The distance metric was 1 minus Pearson’s correlation coefficient. The algorithm suggested 3 subtypes. However, in Table 2, we listed the tests’ pvalues for both 3subtypes and 5subtypes conditions to make it easier to compare with the results of NCIS. The results indicated that in general, clusters generated by consensus clustering were not as informative as those from NCIS. We think the most important reason is the lack of effective feature selection in consensus clustering when there are a large number of genes as input.
The advantage of NCIS is the incorporation of the network and assigning an “importance indicator” to each gene. Therefore, in addition to generating the subtypes, we also obtained a biproduct  the gene weights, which describe the genes’ roles in the network and their abilities to distinguish the patient samples. We further performed ANOVA tests for each gene’s expression level across the five subtypes. In the heatmap in Figure 2, we showed the first 50 genes that are shared between the ordered pvalue list and the ordered gene weight list (pvalues are ordered from smallest to largest and gene weights are ordered from largest to smallest). To illustrate the difference for specific genes in the five subtypes at network level, we extracted the subnetwork of ABCC8 as an example (Figure 3). There are 9 genes targeted by ABCC8 in the network we used. We chose this subnetwork because it has a small size and is easily and clearly presented. Although we did not find literature studying the effect of ABCC8 in breast cancer, MRP has been reported to be highly associated with breast cancer tumor progression and treatment outcomes [43–45]. As shown in Figure 3, ABCC8 is highly expressed in Luminal A and Luminal B subtypes. Several of its downstream genes have very similar expression pattern. Such examples demonstrate the differential expression pattern between subtypes at the network level and the advantage of prioritizing genes with higher impact in the network.
The running time of NCIS (α = 0.85, c = 5, m = 8) on an 8GB memory desktop for this dataset is about 5 minutes.
Application to TCGA GBM dataset
The second dataset we used was from a recent largescale TCGA Glioblastoma Multiforme (GBM) subtype identification work [9]. This dataset contains the expression of 11,861 genes on 200 GBM and 2 normal brain samples. In the original paper, the authors first selected 1,903 variably expressed genes according to the MAD and applied consensus hierarchical clustering with agglomerative average linkage [8]. Four subtypes were detected.
We integrated the gene expression data with the network information to train a weight for each of the 7,183 genes included in both sets (network and expression). Tuning parameter α was set to 0.85. After obtaining the weights, these 7,183 weightedgenes and the 202 samples were used in NCIS (result in Figure S1). We set m = 7 and c = 4 (Additional file 1: Table S2).
We again used clinical characteristics to evaluate the effectiveness of our method. We used survival time, tumor necrosis percentage, and tumor nuclei percentage. Tumor necrosis percentage represents the percentage of cell death in a malignant tumor specimen (from TCGA data dictionary). Additional file 1: Table S3 provides the significance level of the difference among all subtypes for each feature. We also ran consensus average linkage hierarchical clustering [8, 42] on the 7,183gene dataset. The program was run over 1,000 iterations and the resampling rate of the samples was set to 0.8. The distance metric is 1 minus Pearson’s correlation coefficient. We identified 4 subtypes. Overall, NCIS (α = 0.85) performed the best. Interestingly, we observed that Subtype Proneural has a much higher survival rate than the other three subtypes (Additional file 1: Figure S2). The underlying mechanism requires more study. In the heatmap in Figure S1, we also showed the first 50 genes that are shared between the ordered pvalue list (based on ANOVA test of each gene’s expression across the four subtypes) and the ordered gene weight list. Figure S3 shows the subnetwork of C1QA, which is involved in immune response, to illustrate the difference among subtypes at network level.
The running time of NCIS (α = 0.85, c = 4, m = 7) on an 8GB desktop for this dataset is about 2 minutes.
Evaluation by simulation
To further assess the performance of NCIS, we simulated a dataset with 300 samples and 3 subgroups. We designed a method to simulate gene expression data based on network interaction structure (see Additional file 1: Supplementary Method). For the 3 subtypes we simulated, the mean expression levels of each gene were estimated from the gene expression profiles of Luminal A, Luminal B, and Basal subtypes in the breast cancer dataset. The final datasets contained 300 samples and 8,726 genes.
To make the simulated datasets more realistic, noisy genes were added. We first trained a weight for each gene based only on the network structure and then chose l genes with lowest weights as “uninformative” genes. We randomly permutated the expression levels of these genes across the samples. l was set to 1000, 2000, 3000, 4000, and 5000 (we generated 5 datasets for each l).
We set m = 8 and c = 3 in NCIS. The results for multiple trials of the simulation studies were shown in the Figure 4. The running time of NCIS (α = 0.85, c = 5, m = 8) on an 8GB desktop is about 3 minutes for each simulated dataset. We found that when the number of “noisy” genes is small (1000 and 2000), both NCIS (α = 0.85) and consensusclustering have 100% accuracy. When the number of noisy genes is increased to 3000, NCIS (α = 0.85) starts to perform better than consensus clustering. As expected, once the number of noisy genes becomes excessive (5000 out of 9000), neither method can achieve high accuracy. Overall, our simulation result indicated that NCIS is a more robust method than consensus clustering to tolerate noise.
We also observed that NCIS (α = 0.85) outperformed NCIS (α = 0) significantly in our simulated dataset (Figure 4). However, in the two real datasets, the advantage is marginal. We think the main reason could be that in our simulated datasets, the expression levels are strongly related to the network structure we collected (i.e. the interaction network well reflects how the gene expression is generated), while in real datasets there are more uncertainties and the network information we used is incomplete.
Conclusions
Cancer subtype information is of critical importance in designing better treatment strategies. We developed a novel clustering method, called NCIS, to identify cancer subtypes from highthroughput gene expression data. NCIS incorporates the network information within the clustering step to detect more informative sample subtypes. NCIS assigns a weight to each gene based on its connection in network and its distinguishing ability in expression level across all samples. Our approach avoids excluding a large number of genes, which results in much information loss for subsequent analysis in previous methods. In addition, we utilize a weighted coclustering method to capture the duality of gene expression data, i.e. similarity is treated as a level of coherence of the samples and genes in the bicluster.
The future directions of this problem should ideally address three key challenges. First, the network we used is assumed to be a generic molecular interaction network; it is not specific for the particular type of cancer or the tissuetype. Second, the network does not contain all the genes. Third, many edges in our current network do not have a high confidence level and the directions of many edges are unclear. These three problems can be addressed as we gain more complete understanding of the network.
Further research is needed to design better approaches to choose the optimal parameters in NCIS, including α, c, and m. Since there is often no gold standard available for the clustering problem of a specific type of cancer, it is difficult to find the optimal parameters of α, c, and m. In our study, we use α = 0.85 to keep the balance between network information and gene expression information. We did test the results using different values of α, such as 0.8 and 0.7, and the results were comparable with minor differences in the clustering result. We believe the problem of choosing the optimal α may require further studies when more data is collected through largescale projects with detailed clinical features in the future. Such knowledge can be utilized to help select α. Additionally, how to determine the number of clusters (c and m) remains a difficult problem in clustering algorithms. In our work, we utilized cophenetic correlation coefficients used in [35]. We compared the results using different m and c combinations where the cophenetic coefficients are slightly lower than the optimal combination (Additional file 1: Table S1 and S2). For both BRCA and GBM data, we observed that the subtypes identified were very similar (i.e., they had high correlation with the results from optimal combination) based on the pvalues of a Chisquared test between subtypes identified in the optimal combination. Therefore, the small variation in the choices of m and c results in very similar clustering here. However, further research is still needed to develop better approaches to automatically select the most appropriate m and c.
We believe our new NCIS algorithm will be useful to comprehensively identify cancer subtypes, which would otherwise be obscured by cancer heterogeneity, using highthroughput and highdimensional gene expression data. Results from NCIS are expected to enhance our ability to discover important subtype patterns and key genes involved in each subtype, which will in turn help us better understand important network perturbations in a subtypespecific manner.
References
Perou CM, Sorlie T, Eisen MB, van de Rijn M, Jeffrey SS, Rees CA, Pollack JR, Ross DT, Johnsen H, Akslen LA, et al: Molecular portraits of human breast tumours. Nature. 2000, 406 (6797): 747752. 10.1038/35021093.
Sorlie T, Perou CM, Tibshirani R, Aas T, Geisler S, Johnsen H, Hastie T, Eisen MB, van de Rijn M, Jeffrey SS, et al: Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications. Proc Natl Acad Sci U S A. 2001, 98 (19): 1086910874. 10.1073/pnas.191367098.
Alizadeh AA, Eisen MB, Davis RE, Ma C, Lossos IS, Rosenwald A, Boldrick JC, Sabet H, Tran T, Yu X, et al: Distinct types of diffuse large Bcell lymphoma identified by gene expression profiling. Nature. 2000, 403 (6769): 503511. 10.1038/35000501.
Bhattacharjee A, Richards WG, Staunton J, Li C, Monti S, Vasa P, Ladd C, Beheshti J, Bueno R, Gillette M, et al: Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc Natl Acad Sci U S A. 2001, 98 (24): 1379013795. 10.1073/pnas.191502998.
Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, et al: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science. 1999, 286 (5439): 531537. 10.1126/science.286.5439.531.
Lehmann BD, Bauer JA, Chen X, Sanders ME, Chakravarthy AB, Shyr Y, Pietenpol JA: Identification of human triplenegative breast cancer subtypes and preclinical models for selection of targeted therapies. J Clin Invest. 2011, 121 (7): 27502767. 10.1172/JCI45014.
Bullinger L, Döhner K, Bair E, Fröhling S, Schlenk RF, Tibshirani R, Döhner H, Pollack JR: Use of geneexpression profiling to identify prognostic subclasses in adult acute myeloid leukemia. N Engl J Med. 2004, 350 (16): 16051616. 10.1056/NEJMoa031046.
Monti S, Tamayo P, Mesirov J, Golub T: Consensus clustering: a resamplingbased method for class discovery and visualization of gene expression microarray data. Machine learning. 2003, 52 (1): 91118.
Verhaak RG, Hoadley KA, Purdom E, Wang V, Qi Y, Wilkerson MD, Miller CR, Ding L, Golub T, Mesirov JP, et al: Integrated genomic analysis identifies clinically relevant subtypes of glioblastoma characterized by abnormalities in PDGFRA, IDH1, EGFR, and NF1. Cancer Cell. 2010, 17 (1): 98110. 10.1016/j.ccr.2009.12.020.
Witten DM, Tibshirani R: A framework for feature selection in clustering. J Am Stat Assoc. 2010, 105 (490): 713726. 10.1198/jasa.2010.tm09415.
Bair E, Tibshirani R: Semisupervised methods to predict patient survival from gene expression data. PLoS Biol. 2004, 2 (4): E10810.1371/journal.pbio.0020108.
Koestler DC, Marsit CJ, Christensen BC, Karagas MR, Bueno R, Sugarbaker DJ, Kelsey KT, Houseman EA: Semisupervised recursively partitioned mixture models for identifying cancer subtypes. Bioinformatics. 2010, 26 (20): 25782585. 10.1093/bioinformatics/btq470.
Shen R, Olshen AB, Ladanyi M: Integrative clustering of multiple genomic data types using a joint latent variable model with application to breast and lung cancer subtype analysis. Bioinformatics. 2009, 25 (22): 29062912. 10.1093/bioinformatics/btp543.
Mo Q, Wang S, Seshan VE, Olshen AB, Schultz N, Sander C, Powers RS, Ladanyi M, Shen R: Pattern discovery and cancer gene identification in integrated cancer genomic data. Proc Natl Acad Sci U S A. 2013, 110 (11): 42454250. 10.1073/pnas.1208949110.
Barabasi AL, Gulbahce N, Loscalzo J: Network medicine: a networkbased approach to human disease. Nat Rev Genet. 2011, 12 (1): 5668. 10.1038/nrg2918.
Chuang HY, Lee E, Liu YT, Lee D, Ideker T: Networkbased classification of breast cancer metastasis. Mol Syst Biol. 2007, 3: 140
Hanisch D, Zien A, Zimmer R, Lengauer T: Coclustering of biological networks and gene expression data. Bioinformatics. 2002, 18 (Suppl 1): S145S154. 10.1093/bioinformatics/18.suppl_1.S145.
Hwang T, Atluri G, Xie M, Dey S, Hong C, Kumar V, Kuang R: Coclustering phenomegenome for phenotype classification and disease gene discovery. Nucleic Acids Res. 2012, 40 (19): e14610.1093/nar/gks615.
Tanay A, Sharan R, Shamir R: Biclustering algorithms: A survey. Handbook Comput Mol Biol. 2005, 9: 2126.
Pan F, Zhang X, Wang W: CRD: Fast Coclustering on Large Datasets Utilizing SamplingBased Matrix Decomposition. Proceedings of the ACM SIGMOD International Conference on Management of Data: 1012 June 2008. Edited by: Shasha D, Wang JTL. 2008, Vancouver: ACM, 173184.
Cheng Y, Church GM: Biclustering of expression data. Proceedings / International Conference on Intelligent Systems for Molecular Biology; ISMB International Conference on Intelligent Systems for. Mol Biol. 2000, 8: 93103.
Eren K, Deveci M, Kucuktunc O, Catalyurek UV: A comparative analysis of biclustering algorithms for gene expression data. Brief Bioinform. 2013, 14 (3): 279292. 10.1093/bib/bbs032.
Ding C, Li T, Peng W, Park H: Orthogonal nonnegative matrix tfactorizations for clustering. Proceedings of the 12th ACM SIGKOD International Conference on Knowledge Discovery and Data Mining: 2730 August 2006. Edited by: Ungar LU, Craven M, Gunopulos D, EliassiRad T. 2006, Philadelphia: ACM, 126135.
Gu Q, Zhou J: Coclustering on manifolds. Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining: 28 June1 July 2009; Paris. Edited by: John E, Fogelman FS, Flach P, Zaki M. 2009, ACM, 359368.
Li A, Walling J, Ahn S, Kotliarov Y, Su Q, Quezado M, Oberholtzer JC, Park J, Zenklusen JC, Fine HA: Unsupervised analysis of transcriptomic profiles reveals six glioma subtypes. Cancer Res. 2009, 69 (5): 20912099. 10.1158/00085472.CAN082100.
Gao Y, Church G: Improving molecular cancer class discovery through sparse nonnegative matrix factorization. Bioinformatics. 2005, 21 (21): 39703975. 10.1093/bioinformatics/bti653.
Liu Y, Hayes DN, Nobel A, Marron J: Statistical significance of clustering for highdimension, low–sample size data. J Am Stat Assoc. 2008, 103 (483): 12811293. 10.1198/016214508000000454.
Alter O, Brown PO, Botstein D: Singular value decomposition for genomewide expression data processing and modeling. Proc Natl Acad Sci U S A. 2000, 97 (18): 1010110106. 10.1073/pnas.97.18.10101.
Jiang DX, Tang C, Zhang AD: Cluster analysis for gene expression data: A survey. Ieee T Knowl Data En. 2004, 16 (11): 13701386. 10.1109/TKDE.2004.68.
Barillot E, Calzone L, Hupe P, Vert JP, Zinovyev A: Computational systems biology of cancer, vol. 47. 2012, CRC Press
Brin S, Page L: The anatomy of a largescale hypertextual Web search engine. Comput Networks ISDN Sys. 1998, 30 (1): 107117.
Morrison JL, Breitling R, Higham DJ, Gilbert DR: GeneRank: using search engine technology for the analysis of microarray experiments. BMC Bioinforma. 2005, 6: 23310.1186/147121056233.
Higham DJ, Taylor A: The sleekest link algorithm. Institute of Mathematics and Its Applications (IMA) Mathematics Today. 2003, 39: 192197.
Boyd S, Vandenberghe L: Convex optimization. 2004, Cambridge, UK: Cambridge university press
Brunet JP, Tamayo P, Golub TR, Mesirov JP: Metagenes and molecular pattern discovery using matrix factorization. Proc Natl Acad Sci U S A. 2004, 101 (12): 41644169. 10.1073/pnas.0308531101.
Ciriello G, Cerami E, Sander C, Schultz N: Mutual exclusivity analysis identifies oncogenic network modules. Genome Res. 2012, 22 (2): 398406. 10.1101/gr.125567.111.
Croft D, O'Kelly G, Wu G, Haw R, Gillespie M, Matthews L, Caudy M, Garapati P, Gopinath G, Jassal B, et al: Reactome: a database of reactions, pathways and biological processes. Nucleic Acids Res. 2011, 39 (Database issue): D691D697.
Schaefer CF, Anthony K, Krupa S, Buchoff J, Day M, Hannay T, Buetow KH: PID: the Pathway Interaction Database. Nucleic Acids Res. 2009, 37 (Database issue): D674D679.
Kanehisa M, Goto S, Sato Y, Furumichi M, Tanabe M: KEGG for integration and interpretation of largescale molecular data sets. Nucleic Acids Res. 2012, 40 (Database issue): D109D114.
Wu G, Feng X, Stein L: A human functional protein interaction network and its application to cancer data analysis. Genome biology. 2010, 11 (5): R5310.1186/gb2010115r53.
Network TCGA: Comprehensive molecular portraits of human breast tumours. Nature. 2012, 490 (7418): 6170. 10.1038/nature11412.
Wilkerson MD, Hayes DN: ConsensusClusterPlus: a class discovery tool with confidence assessments and item tracking. Bioinformatics. 2010, 26 (12): 15721573. 10.1093/bioinformatics/btq170.
Nooter K, Brutel de la Riviere G, Look MP, van Wingerden KE, HenzenLogmans SC, Scheper RJ, Flens MJ, Klijn JG, Stoter G, Foekens JA: The prognostic significance of expression of the multidrug resistanceassociated protein (MRP) in primary breast cancer. Br J Cancer. 1997, 76 (4): 486493. 10.1038/bjc.1997.414.
Doyle LA, Yang W, Abruzzo LV, Krogmann T, Gao Y, Rishi AK, Ross DD: A multidrug resistance transporter from human MCF7 breast cancer cells. Proc Natl Acad Sci U S A. 1998, 95 (26): 1566515670. 10.1073/pnas.95.26.15665.
Wind NS, Holen I: Multidrug resistance in breast cancer: from in vitro models to clinical studies. Int J Breast Cancer. 2011, 2011: 967419
Acknowledgements
We thank the members of the Ma lab for helpful discussions. This work was supported by an award from the Interdisciplinary Innovation Initiative (In3) program at the University of Illinois (to J.M. and J.H.) and an Illinois Distinguished Fellowship (to J.P.H.).
We also thank the Cancer Genome Atlas Research Network for making the data publicly available.
Author information
Authors and Affiliations
Corresponding author
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
YL, QG, JH, and JM conceived the research. YL developed the algorithm and performed the data analysis with contributions from QG and JHP. YL drafted the manuscript. QG, JPH, JH, and JM revised the manuscript. All authors read and approved the final manuscript.
Electronic supplementary material
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.
Rights and permissions
Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article is distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Liu, Y., Gu, Q., Hou, J.P. et al. A networkassisted coclustering algorithm to discover cancer subtypes based on gene expression. BMC Bioinformatics 15, 37 (2014). https://doi.org/10.1186/147121051537
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/147121051537