A network-assisted co-clustering algorithm to discover cancer subtypes based on gene expression
© Liu et al.; licensee BioMed Central Ltd. 2014
Received: 10 June 2013
Accepted: 31 January 2014
Published: 4 February 2014
Cancer subtype information is critically important for understanding tumor heterogeneity. Existing methods to identify cancer subtypes have primarily focused on utilizing generic clustering algorithms (such as hierarchical clustering) to identify subtypes based on gene expression data. The network-level interaction among genes, which is key to understanding the molecular perturbations in cancer, has been rarely considered during the clustering process. The motivation of our work is to develop a method that effectively incorporates molecular interaction networks into the clustering process to improve cancer subtype identification.
We have developed a new clustering algorithm for cancer subtype identification, called “network-assisted co-clustering for the identification of cancer subtypes” (NCIS). NCIS combines gene network information to simultaneously group samples and genes into biologically meaningful clusters. Prior to clustering, we assign weights to genes based on their impact in the network. Then a new weighted co-clustering algorithm based on a semi-nonnegative matrix tri-factorization is applied. We evaluated the effectiveness of NCIS on simulated datasets as well as large-scale Breast Cancer and Glioblastoma Multiforme patient samples from The Cancer Genome Atlas (TCGA) project. NCIS was shown to better separate the patient samples into clinically distinct subtypes and achieve higher accuracy on the simulated datasets to tolerate noise, as compared to consensus hierarchical clustering.
The weighted co-clustering approach in NCIS provides a unique solution to incorporate gene network information into the clustering process. Our tool will be useful to comprehensively identify cancer subtypes that would otherwise be obscured by cancer heterogeneity, using high-throughput and high-dimensional gene expression data.
KeywordsCancer subtype Clustering Gene expression
For a given type of cancer, there are often subtypes that harbor unique sets of genomic changes and exhibit different patterns of gene expression [1–5]. Subtype information is critically important to tailor more effective treatments for patients, as varying subtypes often respond disparately to the same treatment. In the past decade, many generic clustering-based approaches have been developed to identify cancer subtypes based on gene expression data. Typically, expression levels of d genes measured on n samples are presented as a d × n real-valued matrix with the entries representing the corresponding expression level. A clustering method can be applied to partition the columns/rows of this matrix into different clusters such that items in one cluster have similar expression patterns. The partition of columns offers clues to potential cancer subtypes, while the partition of rows can highlight potentially relevant co-expressed genes. The most popular clustering methods used in cancer subtype identification include hierarchical clustering and k-means [6, 7]. Recently, a number of other clustering methods have also been developed. Consensus clustering  is a clustering framework where the same clustering algorithm is applied to different subsets of the data multiple times. A consensus result is then collected to better describe the similarities between samples. This framework has been widely used in cancer subtype analysis . To address the high-dimensional feature space problem (which is almost always the case for gene expression analysis), the method developed in  uses sparse clustering techniques to adaptively select a small set of informative features to cluster the samples. There are also several clustering methods that were specifically designed for cancer subtype clustering. In [11, 12], survival time information was used to select survival-associated genes and then the samples were clustered using gene expression. In [13, 14], an integrated approach was developed to consider multiple types of omics data (e.g. gene expression, mutation, copy number, methylation) to help identify cancer subtypes. However, we are more interested in only utilizing the gene expression data (which is much more accessible than any other types of omics data) and getting the subtype information of a patient (based on molecular features such as gene expression) without using clinical features (such as survival time). More importantly, most previous studies did not incorporate biological information, particularly molecular interactions networks, into the clustering step. Indeed, network is key to understanding the molecular perturbations in cancer [15, 16]. If we consider the interconnection between genes during the clustering process, we have more knowledge of gene interactions at a systems level and may improve our ability to identify cancer subtypes. This will in turn allow us to analyze the perturbation of a group of genes and pathways rather than individual ones to better understand tumor heterogeneity.
The motivation of our work is to develop a method that effectively incorporates molecular interaction networks into the clustering process to improve cancer subtype identification. To this end, one previous work employed information of biological networks during the clustering . The method first defined a network distance based on the proximity of two genes in the network and an expression distance of two genes, and then constructed the overall distance metric as a function of network and expression distance metrics for hierarchical clustering. Another recent work incorporated network information to cluster genotypes and phenotypes based on phenotype-gene association matrix . The authors did so by adding penalty and regularization terms into the clustering objective to keep the final results consistent with clusters obtained from prior knowledge on the disease phenotype similarity network. However, these two approaches are not appropriate for our cancer subtype identification. First, as our goal is to cluster cancer patients (not genes as in ), we cannot add a network-based distance in the distance metric defined for patients. Second, network-derived clusters  are also difficult to define for patients since there is no network structure linking all the patients (like the phenotype similarity network). Finally, simply combining network proximity-defined gene clusters directly with gene expression clusters may be misleading, since neighboring genes can have entirely different expression patterns.
In this work, we introduce a new co-clustering algorithm to effectively integrate network information with expression variation across samples. We call our method “network-assisted co-clustering for the identification of cancer subtypes” (NCIS). The method first learns a weight for each gene as an indicator of its importance to be used in the clustering. The key idea is that genes regulating many other genes and showing highly variable expression patterns will be considered as more informative in the clustering process. Another important contribution of this work is that we embed the gene weights directly into the co-clustering objective function.
Co-clustering simultaneously clusters both samples and features [19, 20]. In co-clustering, similarity is a measure of the coherence of features (e.g. genes) and samples in a bi-cluster, rather than a function of feature pairs or sample pairs. Consequently, it considers the local context and is able to automatically select subsets that share similar attributes [21, 22]. The method we utilize in NCIS is based on Semi-Nonnegative Matrix Tri-Factorization (SNMTF) [23, 24], a member of the matrix factorization-based clustering family. A common underlying assumption of such co-clustering methods is that there exist cluster centroids that characterize the behavior and trend of cluster members, which is mathematically formed as matrix tri-factorization. Matrix factorization has simple formalization when compared to other methods, and was shown to be useful in gene expression analysis [25, 26]. To our knowledge, NCIS is the first method to apply SNMTF to achieve weighted co-clustering in cancer subtype identification.
Assigning weights to genes
Feature selection is essential for successful pattern recognition from the high-dimensional data. In many previous studies, genes were selected based on their median absolute deviation (MAD) or coefficient of variation (CV) [9, 27]. The cutoff was set rather arbitrarily and typically only a small subset of genes was retained for subsequent analysis, which drastically reduces the amount of information used in clustering. Other dimension-reduction methods such as principal component analysis (PCA)  are useful, but the biological interpretation is not always straightforward as expression vectors of the samples are projected to a low-dimensional principal component space [29, 30]. On the other hand, incorporating additional biologically relevant information as prior knowledge could help resolve ambiguities in the data because it provides, to a certain extent, insight into how the gene expression profiles were generated. Therefore, we utilize the gene network as well as expression information to select genes that both play more important roles in the network and show larger variations among samples. Our method assigns a weight to each gene; genes with higher weights will be prioritized during the weighted co-clustering.
where r n is the confidence level in the n th iteration and , referring to the total number of web pages that i points to; α is a parameter representing the extent to which the ranking depends on the structure of the graph. In our case, we have a similar network representing the molecular interactions among genes. Our method that assigns weights to genes is similar to a gene ranking method, GeneRank , which extended the idea of PageRank to microarray gene differential expression analysis. However, we use a directed graph (rather than undirected graph in GeneRank), because we believe the direction of edges that models gene regulation is important. Additionally, rather than using differential expression, we use an alternative method to consider gene expression variation across samples as described below.
where max(MAD) is the maximum of vector MAD.
We use MAD as a measurement of the expression variation of a gene among all the samples. The values are normalized by the maximum value in all MAD’s to make the weight-training mechanism stable and comparable with different overall expression levels. In each iteration, every gene i is evaluated by its own MAD as well as the weights and connections of the genes that i regulates. The final weight of each gene reflects both its impact in the network and its ability to separate the samples.
where D is a diagonal matrix with entries deg i , 1 ≤ i ≤ N.
where I is the N × N identity matrix.
To make weights trained with different parameters more comparable, we normalized w such that the maximum of w is 1. We chose a relatively large α value (α = 0.85) to make the weights rely more on the network structure.
Weighted co-clustering algorithm
After assigning weights to genes, our input data include the gene expression profile of each sample and the weights for all the genes from the previous step. We developed a new weighted co-clustering method to simultaneously separate samples into subtypes and group genes into functionally relevant subclasses. Our method is based on Semi-Nonnegative Matrix Tri-Factorization (SNMTF), where the nonnegative constraint on the data matrix imposed on Orthogonal Nonnegative Matrix Tri-Factorization (ONMTF) is relaxed to make it suitable for general dataset.
Main notations used in this paper
Number of samples
Number of genes
Number of sample clusters
Number of gene clusters
Gene expression matrix of size d × n
The i th column of X, representing the expression of the i th sample
The i th row of X, representing the expression of the i th gene
Sample partition matrix of size n × c; F ij ∈[0,1]: F ij = 1 if X ∙i belongs to sample cluster j and F ij = 0 otherwise
Feature partition matrix of size d × m; G ij ∈[0,1]: G ij = 1 if X i∙ belongs to gene cluster j and G ij = 0 otherwise
A m × c matrix
A d × d diagonal matrix; entries are the weights of the genes
We can get a clearer understanding of S from this expression. If G and F are defined as in Table 1, i.e., 0/1-valued partition matrix, F T F should be a c × c diagonal matrix, whose entries represent the number of samples belonging to each sample cluster. G T WG should be a m × m diagonal matrix, with entries equal to the total weights of the features (genes) belonging to each of the m feature clusters. Similar to the interpretation of F T F, G T WG can be considered as the weighted total number of features in each feature cluster (taking feature i as w i features when counting the total number). Therefore, (G T WG)-1G T WX represents the feature cluster centroids on the sample space (n-dimension) and XF(F T F)-1 represents the sample cluster centroids on the feature space (d-dimension). The difference is that all the samples are assumed to have the same weight of 1, while features are assigned different weights W. Entries of S can be viewed as feature cluster centroids on the sample-centroids space (c-dimension) or as sample cluster centroids on the gene-centroids space (m-dimension). Therefore, S gives the centroids information of the bi-clusters after partitioning.
Using Karush-Kuhn-Tucker condition , we have
where C = WXFS T , D = SF T FS T .
The iterations decrease the value of the objective function J. Convergence of the algorithm can be shown using a typical approach for the convergence proof of NMF-based methods. For more details, see the proof in the (Additional file 1: Supplementary Materials).
Our algorithm is as follows:
Initialize F and G.
While not convergent and iterations less than a pre-defined value
Update S by
S = (G T WG)− 1G T WXF(F T F)− 1;
m and c selection
Consensus matrices and are the averages of M s ’s and M g ’s over 50 runs respectively. The entries range between 0 and 1, where 0 indicates that the corresponding samples (or genes) belong to different clusters in every run and 1 indicates that they belong to the same clusters in all the cases. Therefore, offers a new distance metric between the items ( for samples and for genes). Similar to , we used and to hierarchically cluster samples and genes, and then we define an average cophenetic correlation coefficient to evaluate the stability. Cophenetic correlation coefficient ρ(C) is defined as the Pearson correlation between distance matrix 1-C and the distance matrix induced by the linkage used in hierarchical clustering for re-ordering C. If a clustering is stable, the entries would be close to 0 and 1 (two modes), and in the ideal case (only 0 and 1) ρ(C) would be exactly 1. We observe how the cophenetic correlation coefficients change as m and c change and select values where the averaged coefficient starts to decrease.
Results and discussion
We applied NCIS to two large-scale datasets from TCGA as well as simulated datasets to evaluate the effectiveness of our method. We built the network using a variety of sources, including the network used in  as well as our up-to-date curated information from Reactome , the NCI-Nature Curated PID , and KEGG . The network from  consisted of inferred gene-interaction from sources of information such as protein interactions, gene co-expression, protein domain interaction, and text-mined interaction described by . To aggregate all of the networks together, all redundant edges were collapsed to single edges. We combined the edges of each of the databases such that a link between any two nodes A and B exists in the aggregated network if a link between A and B exists in any of the databases we used. The resulting aggregated network consisted of 11,648 genes and 211,794 edges. Our method assumes that the network is an aggregation of different biological networks, such as protein-protein interaction network, transcriptional regulatory network, and signaling network etc. In the MATLAB implementation of our program, we also allow users to provide other network information as needed.
Application to TCGA breast cancer dataset
P-value of the dependence test for different clinical features and breast cancer subtypes
Neoplasm disease lymph node stage
Neoplasm disease stage
Tumor nuclei percentage
NCIS (α = 0.85)
2.03 × 10-3
1.68 × 10-3
2.33 × 10-3
6.17 × 10-3
NCIS (α = 0)
6.22 × 10-3
3.84 × 10-3
2.67 × 10-3
6.24 × 10-3
Consensus (k = 3)
5.90 × 10-3
Consensus (k = 5)
3.29 × 10-3
2.08 × 10-4
8.35 × 10-3
We also set α = 0 (no network information was used) in the co-clustering process to show the impact of network information. Similar statistical tests were performed (Table 2). In general, NCIS (α = 0.85) showed better p-values in separating the patients in terms of clinical features than NCIS (α = 0).
In the original TCGA paper , the authors performed a hierarchical clustering using a subset of genes (most varied across samples) and identified 13 subtypes (test results for clinical features are shown in Table 2 as TCGA/BRCA paper). Since consensus hierarchical clustering generally performs better than the traditional hierarchical clustering, we also applied a consensus average linkage hierarchical clustering [8, 42]. To make a fair comparison, we used all 8,726 genes. The program was run over 1,000 iterations and the resampling rate of the sample was set to 0.8. The distance metric was 1 minus Pearson’s correlation coefficient. The algorithm suggested 3 subtypes. However, in Table 2, we listed the tests’ p-values for both 3-subtypes and 5-subtypes conditions to make it easier to compare with the results of NCIS. The results indicated that in general, clusters generated by consensus clustering were not as informative as those from NCIS. We think the most important reason is the lack of effective feature selection in consensus clustering when there are a large number of genes as input.
The running time of NCIS (α = 0.85, c = 5, m = 8) on an 8GB memory desktop for this dataset is about 5 minutes.
Application to TCGA GBM dataset
The second dataset we used was from a recent large-scale TCGA Glioblastoma Multiforme (GBM) subtype identification work . This dataset contains the expression of 11,861 genes on 200 GBM and 2 normal brain samples. In the original paper, the authors first selected 1,903 variably expressed genes according to the MAD and applied consensus hierarchical clustering with agglomerative average linkage . Four subtypes were detected.
We integrated the gene expression data with the network information to train a weight for each of the 7,183 genes included in both sets (network and expression). Tuning parameter α was set to 0.85. After obtaining the weights, these 7,183 weighted-genes and the 202 samples were used in NCIS (result in Figure S1). We set m = 7 and c = 4 (Additional file 1: Table S2).
We again used clinical characteristics to evaluate the effectiveness of our method. We used survival time, tumor necrosis percentage, and tumor nuclei percentage. Tumor necrosis percentage represents the percentage of cell death in a malignant tumor specimen (from TCGA data dictionary). Additional file 1: Table S3 provides the significance level of the difference among all subtypes for each feature. We also ran consensus average linkage hierarchical clustering [8, 42] on the 7,183-gene dataset. The program was run over 1,000 iterations and the resampling rate of the samples was set to 0.8. The distance metric is 1 minus Pearson’s correlation coefficient. We identified 4 subtypes. Overall, NCIS (α = 0.85) performed the best. Interestingly, we observed that Subtype Proneural has a much higher survival rate than the other three subtypes (Additional file 1: Figure S2). The underlying mechanism requires more study. In the heatmap in Figure S1, we also showed the first 50 genes that are shared between the ordered p-value list (based on ANOVA test of each gene’s expression across the four subtypes) and the ordered gene weight list. Figure S3 shows the subnetwork of C1QA, which is involved in immune response, to illustrate the difference among subtypes at network level.
The running time of NCIS (α = 0.85, c = 4, m = 7) on an 8GB desktop for this dataset is about 2 minutes.
Evaluation by simulation
To further assess the performance of NCIS, we simulated a dataset with 300 samples and 3 subgroups. We designed a method to simulate gene expression data based on network interaction structure (see Additional file 1: Supplementary Method). For the 3 subtypes we simulated, the mean expression levels of each gene were estimated from the gene expression profiles of Luminal A, Luminal B, and Basal subtypes in the breast cancer dataset. The final datasets contained 300 samples and 8,726 genes.
To make the simulated datasets more realistic, noisy genes were added. We first trained a weight for each gene based only on the network structure and then chose l genes with lowest weights as “uninformative” genes. We randomly permutated the expression levels of these genes across the samples. l was set to 1000, 2000, 3000, 4000, and 5000 (we generated 5 datasets for each l).
We also observed that NCIS (α = 0.85) outperformed NCIS (α = 0) significantly in our simulated dataset (Figure 4). However, in the two real datasets, the advantage is marginal. We think the main reason could be that in our simulated datasets, the expression levels are strongly related to the network structure we collected (i.e. the interaction network well reflects how the gene expression is generated), while in real datasets there are more uncertainties and the network information we used is incomplete.
Cancer subtype information is of critical importance in designing better treatment strategies. We developed a novel clustering method, called NCIS, to identify cancer subtypes from high-throughput gene expression data. NCIS incorporates the network information within the clustering step to detect more informative sample subtypes. NCIS assigns a weight to each gene based on its connection in network and its distinguishing ability in expression level across all samples. Our approach avoids excluding a large number of genes, which results in much information loss for subsequent analysis in previous methods. In addition, we utilize a weighted co-clustering method to capture the duality of gene expression data, i.e. similarity is treated as a level of coherence of the samples and genes in the bi-cluster.
The future directions of this problem should ideally address three key challenges. First, the network we used is assumed to be a generic molecular interaction network; it is not specific for the particular type of cancer or the tissue-type. Second, the network does not contain all the genes. Third, many edges in our current network do not have a high confidence level and the directions of many edges are unclear. These three problems can be addressed as we gain more complete understanding of the network.
Further research is needed to design better approaches to choose the optimal parameters in NCIS, including α, c, and m. Since there is often no gold standard available for the clustering problem of a specific type of cancer, it is difficult to find the optimal parameters of α, c, and m. In our study, we use α = 0.85 to keep the balance between network information and gene expression information. We did test the results using different values of α, such as 0.8 and 0.7, and the results were comparable with minor differences in the clustering result. We believe the problem of choosing the optimal α may require further studies when more data is collected through large-scale projects with detailed clinical features in the future. Such knowledge can be utilized to help select α. Additionally, how to determine the number of clusters (c and m) remains a difficult problem in clustering algorithms. In our work, we utilized cophenetic correlation coefficients used in . We compared the results using different m and c combinations where the cophenetic coefficients are slightly lower than the optimal combination (Additional file 1: Table S1 and S2). For both BRCA and GBM data, we observed that the subtypes identified were very similar (i.e., they had high correlation with the results from optimal combination) based on the p-values of a Chi-squared test between subtypes identified in the optimal combination. Therefore, the small variation in the choices of m and c results in very similar clustering here. However, further research is still needed to develop better approaches to automatically select the most appropriate m and c.
We believe our new NCIS algorithm will be useful to comprehensively identify cancer subtypes, which would otherwise be obscured by cancer heterogeneity, using high-throughput and high-dimensional gene expression data. Results from NCIS are expected to enhance our ability to discover important subtype patterns and key genes involved in each subtype, which will in turn help us better understand important network perturbations in a subtype-specific manner.
We thank the members of the Ma lab for helpful discussions. This work was supported by an award from the Interdisciplinary Innovation Initiative (In3) program at the University of Illinois (to J.M. and J.H.) and an Illinois Distinguished Fellowship (to J.P.H.).
We also thank the Cancer Genome Atlas Research Network for making the data publicly available.
- Perou CM, Sorlie T, Eisen MB, van de Rijn M, Jeffrey SS, Rees CA, Pollack JR, Ross DT, Johnsen H, Akslen LA, et al: Molecular portraits of human breast tumours. Nature. 2000, 406 (6797): 747-752. 10.1038/35021093.View ArticlePubMedGoogle Scholar
- Sorlie T, Perou CM, Tibshirani R, Aas T, Geisler S, Johnsen H, Hastie T, Eisen MB, van de Rijn M, Jeffrey SS, et al: Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications. Proc Natl Acad Sci U S A. 2001, 98 (19): 10869-10874. 10.1073/pnas.191367098.View ArticlePubMed CentralPubMedGoogle Scholar
- Alizadeh AA, Eisen MB, Davis RE, Ma C, Lossos IS, Rosenwald A, Boldrick JC, Sabet H, Tran T, Yu X, et al: Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature. 2000, 403 (6769): 503-511. 10.1038/35000501.View ArticlePubMedGoogle Scholar
- Bhattacharjee A, Richards WG, Staunton J, Li C, Monti S, Vasa P, Ladd C, Beheshti J, Bueno R, Gillette M, et al: Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc Natl Acad Sci U S A. 2001, 98 (24): 13790-13795. 10.1073/pnas.191502998.View ArticlePubMed CentralPubMedGoogle Scholar
- Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, et al: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science. 1999, 286 (5439): 531-537. 10.1126/science.286.5439.531.View ArticlePubMedGoogle Scholar
- Lehmann BD, Bauer JA, Chen X, Sanders ME, Chakravarthy AB, Shyr Y, Pietenpol JA: Identification of human triple-negative breast cancer subtypes and preclinical models for selection of targeted therapies. J Clin Invest. 2011, 121 (7): 2750-2767. 10.1172/JCI45014.View ArticlePubMed CentralPubMedGoogle Scholar
- Bullinger L, Döhner K, Bair E, Fröhling S, Schlenk RF, Tibshirani R, Döhner H, Pollack JR: Use of gene-expression profiling to identify prognostic subclasses in adult acute myeloid leukemia. N Engl J Med. 2004, 350 (16): 1605-1616. 10.1056/NEJMoa031046.View ArticlePubMedGoogle Scholar
- Monti S, Tamayo P, Mesirov J, Golub T: Consensus clustering: a resampling-based method for class discovery and visualization of gene expression microarray data. Machine learning. 2003, 52 (1): 91-118.View ArticleGoogle Scholar
- Verhaak RG, Hoadley KA, Purdom E, Wang V, Qi Y, Wilkerson MD, Miller CR, Ding L, Golub T, Mesirov JP, et al: Integrated genomic analysis identifies clinically relevant subtypes of glioblastoma characterized by abnormalities in PDGFRA, IDH1, EGFR, and NF1. Cancer Cell. 2010, 17 (1): 98-110. 10.1016/j.ccr.2009.12.020.View ArticlePubMed CentralPubMedGoogle Scholar
- Witten DM, Tibshirani R: A framework for feature selection in clustering. J Am Stat Assoc. 2010, 105 (490): 713-726. 10.1198/jasa.2010.tm09415.View ArticlePubMed CentralPubMedGoogle Scholar
- Bair E, Tibshirani R: Semi-supervised methods to predict patient survival from gene expression data. PLoS Biol. 2004, 2 (4): E108-10.1371/journal.pbio.0020108.View ArticlePubMed CentralPubMedGoogle Scholar
- Koestler DC, Marsit CJ, Christensen BC, Karagas MR, Bueno R, Sugarbaker DJ, Kelsey KT, Houseman EA: Semi-supervised recursively partitioned mixture models for identifying cancer subtypes. Bioinformatics. 2010, 26 (20): 2578-2585. 10.1093/bioinformatics/btq470.View ArticlePubMed CentralPubMedGoogle Scholar
- Shen R, Olshen AB, Ladanyi M: Integrative clustering of multiple genomic data types using a joint latent variable model with application to breast and lung cancer subtype analysis. Bioinformatics. 2009, 25 (22): 2906-2912. 10.1093/bioinformatics/btp543.View ArticlePubMed CentralPubMedGoogle Scholar
- Mo Q, Wang S, Seshan VE, Olshen AB, Schultz N, Sander C, Powers RS, Ladanyi M, Shen R: Pattern discovery and cancer gene identification in integrated cancer genomic data. Proc Natl Acad Sci U S A. 2013, 110 (11): 4245-4250. 10.1073/pnas.1208949110.View ArticlePubMed CentralPubMedGoogle Scholar
- Barabasi AL, Gulbahce N, Loscalzo J: Network medicine: a network-based approach to human disease. Nat Rev Genet. 2011, 12 (1): 56-68. 10.1038/nrg2918.View ArticlePubMed CentralPubMedGoogle Scholar
- Chuang HY, Lee E, Liu YT, Lee D, Ideker T: Network-based classification of breast cancer metastasis. Mol Syst Biol. 2007, 3: 140-View ArticlePubMed CentralPubMedGoogle Scholar
- Hanisch D, Zien A, Zimmer R, Lengauer T: Co-clustering of biological networks and gene expression data. Bioinformatics. 2002, 18 (Suppl 1): S145-S154. 10.1093/bioinformatics/18.suppl_1.S145.View ArticlePubMedGoogle Scholar
- Hwang T, Atluri G, Xie M, Dey S, Hong C, Kumar V, Kuang R: Co-clustering phenome-genome for phenotype classification and disease gene discovery. Nucleic Acids Res. 2012, 40 (19): e146-10.1093/nar/gks615.View ArticlePubMed CentralPubMedGoogle Scholar
- Tanay A, Sharan R, Shamir R: Biclustering algorithms: A survey. Handbook Comput Mol Biol. 2005, 9: 21-26.Google Scholar
- Pan F, Zhang X, Wang W: CRD: Fast Co-clustering on Large Datasets Utilizing Sampling-Based Matrix Decomposition. Proceedings of the ACM SIGMOD International Conference on Management of Data: 10-12 June 2008. Edited by: Shasha D, Wang JTL. 2008, Vancouver: ACM, 173-184.View ArticleGoogle Scholar
- Cheng Y, Church GM: Biclustering of expression data. Proceedings / International Conference on Intelligent Systems for Molecular Biology; ISMB International Conference on Intelligent Systems for. Mol Biol. 2000, 8: 93-103.Google Scholar
- Eren K, Deveci M, Kucuktunc O, Catalyurek UV: A comparative analysis of biclustering algorithms for gene expression data. Brief Bioinform. 2013, 14 (3): 279-292. 10.1093/bib/bbs032.View ArticlePubMed CentralPubMedGoogle Scholar
- Ding C, Li T, Peng W, Park H: Orthogonal nonnegative matrix t-factorizations for clustering. Proceedings of the 12th ACM SIGKOD International Conference on Knowledge Discovery and Data Mining: 27-30 August 2006. Edited by: Ungar LU, Craven M, Gunopulos D, Eliassi-Rad T. 2006, Philadelphia: ACM, 126-135.View ArticleGoogle Scholar
- Gu Q, Zhou J: Co-clustering on manifolds. Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining: 28 June-1 July 2009; Paris. Edited by: John E, Fogelman FS, Flach P, Zaki M. 2009, ACM, 359-368.View ArticleGoogle Scholar
- Li A, Walling J, Ahn S, Kotliarov Y, Su Q, Quezado M, Oberholtzer JC, Park J, Zenklusen JC, Fine HA: Unsupervised analysis of transcriptomic profiles reveals six glioma subtypes. Cancer Res. 2009, 69 (5): 2091-2099. 10.1158/0008-5472.CAN-08-2100.View ArticlePubMed CentralPubMedGoogle Scholar
- Gao Y, Church G: Improving molecular cancer class discovery through sparse non-negative matrix factorization. Bioinformatics. 2005, 21 (21): 3970-3975. 10.1093/bioinformatics/bti653.View ArticlePubMedGoogle Scholar
- Liu Y, Hayes DN, Nobel A, Marron J: Statistical significance of clustering for high-dimension, low–sample size data. J Am Stat Assoc. 2008, 103 (483): 1281-1293. 10.1198/016214508000000454.View ArticleGoogle Scholar
- Alter O, Brown PO, Botstein D: Singular value decomposition for genome-wide expression data processing and modeling. Proc Natl Acad Sci U S A. 2000, 97 (18): 10101-10106. 10.1073/pnas.97.18.10101.View ArticlePubMed CentralPubMedGoogle Scholar
- Jiang DX, Tang C, Zhang AD: Cluster analysis for gene expression data: A survey. Ieee T Knowl Data En. 2004, 16 (11): 1370-1386. 10.1109/TKDE.2004.68.View ArticleGoogle Scholar
- Barillot E, Calzone L, Hupe P, Vert J-P, Zinovyev A: Computational systems biology of cancer, vol. 47. 2012, CRC PressGoogle Scholar
- Brin S, Page L: The anatomy of a large-scale hypertextual Web search engine. Comput Networks ISDN Sys. 1998, 30 (1): 107-117.View ArticleGoogle Scholar
- Morrison JL, Breitling R, Higham DJ, Gilbert DR: GeneRank: using search engine technology for the analysis of microarray experiments. BMC Bioinforma. 2005, 6: 233-10.1186/1471-2105-6-233.View ArticleGoogle Scholar
- Higham DJ, Taylor A: The sleekest link algorithm. Institute of Mathematics and Its Applications (IMA) Mathematics Today. 2003, 39: 192-197.Google Scholar
- Boyd S, Vandenberghe L: Convex optimization. 2004, Cambridge, UK: Cambridge university pressView ArticleGoogle Scholar
- Brunet JP, Tamayo P, Golub TR, Mesirov JP: Metagenes and molecular pattern discovery using matrix factorization. Proc Natl Acad Sci U S A. 2004, 101 (12): 4164-4169. 10.1073/pnas.0308531101.View ArticlePubMed CentralPubMedGoogle Scholar
- Ciriello G, Cerami E, Sander C, Schultz N: Mutual exclusivity analysis identifies oncogenic network modules. Genome Res. 2012, 22 (2): 398-406. 10.1101/gr.125567.111.View ArticlePubMed CentralPubMedGoogle Scholar
- Croft D, O'Kelly G, Wu G, Haw R, Gillespie M, Matthews L, Caudy M, Garapati P, Gopinath G, Jassal B, et al: Reactome: a database of reactions, pathways and biological processes. Nucleic Acids Res. 2011, 39 (Database issue): D691-D697.View ArticlePubMed CentralPubMedGoogle Scholar
- Schaefer CF, Anthony K, Krupa S, Buchoff J, Day M, Hannay T, Buetow KH: PID: the Pathway Interaction Database. Nucleic Acids Res. 2009, 37 (Database issue): D674-D679.View ArticlePubMed CentralPubMedGoogle Scholar
- Kanehisa M, Goto S, Sato Y, Furumichi M, Tanabe M: KEGG for integration and interpretation of large-scale molecular data sets. Nucleic Acids Res. 2012, 40 (Database issue): D109-D114.View ArticlePubMed CentralPubMedGoogle Scholar
- Wu G, Feng X, Stein L: A human functional protein interaction network and its application to cancer data analysis. Genome biology. 2010, 11 (5): R53-10.1186/gb-2010-11-5-r53.View ArticlePubMed CentralPubMedGoogle Scholar
- Network TCGA: Comprehensive molecular portraits of human breast tumours. Nature. 2012, 490 (7418): 61-70. 10.1038/nature11412.View ArticleGoogle Scholar
- Wilkerson MD, Hayes DN: ConsensusClusterPlus: a class discovery tool with confidence assessments and item tracking. Bioinformatics. 2010, 26 (12): 1572-1573. 10.1093/bioinformatics/btq170.View ArticlePubMed CentralPubMedGoogle Scholar
- Nooter K, Brutel de la Riviere G, Look MP, van Wingerden KE, Henzen-Logmans SC, Scheper RJ, Flens MJ, Klijn JG, Stoter G, Foekens JA: The prognostic significance of expression of the multidrug resistance-associated protein (MRP) in primary breast cancer. Br J Cancer. 1997, 76 (4): 486-493. 10.1038/bjc.1997.414.View ArticlePubMed CentralPubMedGoogle Scholar
- Doyle LA, Yang W, Abruzzo LV, Krogmann T, Gao Y, Rishi AK, Ross DD: A multidrug resistance transporter from human MCF-7 breast cancer cells. Proc Natl Acad Sci U S A. 1998, 95 (26): 15665-15670. 10.1073/pnas.95.26.15665.View ArticlePubMed CentralPubMedGoogle Scholar
- Wind NS, Holen I: Multidrug resistance in breast cancer: from in vitro models to clinical studies. Int J Breast Cancer. 2011, 2011: 967419-View ArticlePubMed CentralPubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.