Semi-supervised gene shaving method for predicting low variation biological pathways from genome-wide data
© Zhu; licensee BioMed Central Ltd. 2009
Published: 30 January 2009
The gene shaving algorithm and many other clustering algorithms identify gene clusters showing high variation across samples. However, gene expression in many signaling pathways show only modest and concordant changes that fail to be identified by these methods. The increasingly available signaling pathway prior knowledge provide new opportunity to solve this problem.
We propose an innovative semi-supervised gene clustering algorithm, where the original gene shaving algorithm was extended and generalized so that prior knowledge of signaling pathways can be incorporated. Different from other methods, our method identifies gene clusters showing concerted and modest expression variation as well as strong expression correlation. Using available pathway gene sets as prior knowledge, whether complete or incomplete, our algorithm is capable of forming tightly regulated gene clusters showing modest variation across samples. We demonstrate the advantages of our algorithm over the original gene shaving algorithm using two microarray data sets. The stability of the gene clusters was accessed using a jackknife approach.
Our algorithm represents one of the first clustering algorithms that is particularly designed to identify signaling pathways of low and concordant gene expression variation. The discriminating power is achieved by manufacturing a principal component enriched by signaling pathways.
Gene clustering that assigns group membership(s) to each gene is a widespread pattern extraction technique. Genes sharing the same membership are often hypothesized to be regulated by the same defined or undefined genomic influence, such as cellular pathway. Model-free clustering techniques such as K-means and hierarchical clustering [1–3] are widely used. One limitation of these approaches, as pointed out by many researchers, e.g. , is that each gene can only belong to a single cluster. These types of gene clustering algorithms are thus called mutually exclusive clustering. In the context of cellular pathways, they assume that one gene can only be regulated by one pathway at a time, which apparently, is not the case. Model-based clustering or soft clustering [5–8] provides mechanisms to relax this stringent assumption by introducing "probabilistic" or "fuzzy" memberships. However, these "soft" memberships do not biologically account for the fact that one gene is often simultaneously regulated by multiple genomic influences.
Singular value decomposition (SVD) [9–11] has shown great promise towards deconvolving channels of genomic influence. Assuming rows of data matrix correspond to genes and columns correspond to physiological/genetic conditions under which the gene expression abundance was interrogated using gene chips, the SVD factors the data matrix into three matrices. The first matrix, which contains most of information, is called a gene coefficient matrix where each column (principal component, PC) defines a preliminary gene cluster that might be regulated by a specific genomic influence. We will describe more details of SVD in the method section. SVD has been repeatedly shown to be able to deconvolve the observed gene expression signal into a composite of multiple overlapping genomic influences, many of them correspond to signaling pathways [9, 11].
Thus SVD provides a methodology base for non-mutually exclusive clustering. The gene clusters generated by SVD are often preliminary due to the fact that many non-relevant genes might contaminate the PC's that define gene clusters. Hastie et al  proposed removing non-relevant genes in an iterative fashion, in which the least correlated genes with the leading PC is treated as non-relevant. The gene shaving algorithm quickly became an important tool in the pattern discovery arsenal. It iteratively searches for clusters of genes showing high variation across the samples, and correlation across the genes . The former is achieved by working with the leading PC and the latter is achieved by iteratively discarding non-relevant genes to the cluster. There are other types of non-mutually exclusive clustering methods as well, such as plaid model .
The underlying assumption of the gene shaving algorithm is that the leading PC accounting for the largest portion of variation is always of exclusive interest to the investigator [4, 13]. Consequently the algorithm iteratively refines the first gene cluster defined by the first PC by shaving off a proportion of genes that are least correlated with the leading PC. The second gene cluster is formed by performing the same procedure on the orthogonal data, resulting from the residuals of regression, and so on. However, the underlying assumption that the whole algorithm is based on is not always true for every single case. In fact, gene expression in many signaling pathways show modest but concordant changes. The gene shaving algorithm would most likely to fail in these cases by working exclusively with the leading PC.
Gene set based methods, such as Gene Set Enrichment Analysis (GSEA) were designed to overcome this limitation. Since it's first introduction in 2003 , it has been widely applied to interpret genome-wide expression profiles [15, 16]. However, the approach only ranks pre-compiled gene sets according to the relevancy to the data and does not predict any new genes in the gene sets. Therefore, it strictly depends on the availability and validity of a priori defined gene sets. In reality a gene set is not always available in a complete and accurate format. What is typically available is partial pathway learned from empirical experimental studies.
We seek a seamless combination of the strengths of the two methodological frameworks. We manufacture a PC that is most enriched by prior knowledge (signaling pathway of interest). Performing the analysis iteratively we will be able to identify the gene cluster showing modest but concordant changes. In many cases, we are further interested in finding genes that are concordantly up or down-regulated by genomic influences. Therefore, it might be beneficial to turn our attention not only to the PC that the prior knowledge is most enriched, but also to the positive PC and the negative PC separatively. The hypothesis can be substantiated by previous works that positive and negative PC's can be enriched by completely different biological functions, e.g. .
We aim to demonstrate that the proposed algorithm is capable of identifying tightly regulated gene sets showing modest and concerted variation using incomplete prior knowledge and real-world microarray data set. Ground truth, which indicates a "complete" gene set used as precondition for applying GSEA algorithm [14, 16], is desirable to demonstrate the claimed advantages of our algorithm. It is often not available. Therefore, we use four "high-amplitude" and four "low-amplitude" gene sets identified in  as ground truth to evaluate the ability of our algorithm to recover them using subsets of a variety of lengths. The high and low amplitude genes used in this example are well-studied genes in the cell cycle, and many of them are co-regulated by a number of signaling pathways [17, 18]. We then use incomplete prior knowledge supplied by our collaborator and apply our algorithm to predict new WNT and NOTCH pathway genes in the somitogenesis process.
Recovering low and high amplitude gene sets using incomplete prior knowledge
As a proof of concept, we first analyzed a cell cycle data set originally reported in . The data set consists of whole yeast genome expression profiles interrogated over two full cell cycles (20 evenly spaced time points) synchronized by elutriation. We considered the same 308 genes as in the paper derived using Fourier transform. In each of the four gene sets, genes were further classified into high-amplitude and low-amplitude groups according to magnitude of variation. The processed data are available from the authors' website at .
We treated the high-amplitude genes and low-amplitude genes in each gene set as "complete", as assumed in classical GSEA analysis. We sampled subsets of increasing sizes from 5 to complete (e.g. 40) with a step size of 5. In each step experiment, we generated 500 subsets of the same size (with replicates), and for each subset we applied our algorithm to demonstrate its ability to recover the full gene set using the hypergeometric test explained in method section. The P-values of the tests were used as a measure for such an ability. For visualization convenience, the P-values were negatively log-transformed and higher value corresponds to better recovery of the complete gene set.
Our algorithm can be viewed as an generalization of the gene shaving algorithm. Gene shaving algorithm exclusively works with the leading PC. Therefore, it is only capable of identifying high-amplitude signaling pathways. Our algorithm adaptively works with the PC that is most enriched by prior knowledge. Therefore, it is capable of identifying either high-amplitude or low-amplitude signaling pathways wherever prior knowledge is available. Comparing Figure 2b to Figure 2c more closely, it is evident that our algorithm recovers low-amplitude gene sets even better than gene shaving algorithm recovers high-amplitude ones. This is demonstrated by uniformly larger mean values and overall smaller variance on the vertical axis. The results of analyzing other complete gene sets of appropriate size lead to the same conclusion (see additional file 1). The proof-of-concept analysis provided compelling evidence that our algorithm is particularly suitable for identifying sets of tightly regulated genes with modest variation.
Predicting WNT and NOTCH pathway genes using prior knowledge
Microarray data and prior knowledge
We then proceed to re-analyze microarray data originally reported in Dequeant et al  to predict genes in WNT and NOTCH pathways. In this experiment, the genome-wide gene expression was interrogated over 17 developmental stages using Affymetrix GeneChip 430A. Using the Lomb-Scargle periodogram  the top 687 genes were used for gene clustering so that all prior knowledge genes are included. Microrarray data are available at ArrayExpress at .
Prior knowledge corresponds to a list of experimentally validated cyclic genes regulated by the segmentation clock, a molecular oscillator acting during somitogenesis . The segmentation clock is a set of periodic processes linked to the formation of the vertebrate embryo segments (somites) that give rise to the segments in the adult body plan of a vertebrate animal. Malfunction of cyclic genes are the direct cause of many developmental diseases, such as Noonan syndrome and truncated tail . Therefore, predicted cyclic genes are potential human disease genes. In particular, we have incomplete sets of 11 genes in the WNT pathway, and 9 genes in the NOTCH pathway as our prior knowledge. Our objective is to predict more WNT and NOTCH genes using prior knowledge, microarray data and our proposed algorithm.
Finding the most enriched PC using prior knowledge
Comparing our semi-supervised algorithms with the gene shaving algorithm
Biological function enrichment analysis and transcription factor association analysis.
embryonic development (1.13E-04)
MyoG_Myotubes (9.47E-03) 
MyoD_Growing cells (1.99E-05) 
cytosolic part (4.48E-08)
iron ion binding (3.92E-06)
tube development (3.86E-04)
branching morphogenesis of a tube (9.88E-06)
tube morphogenesis (7.26E-05)
patterning of blood vessels (3.57E-05)
embryonic pattern specification (1.11E-04)
oxygen binding (6.89E-14)
gas transport (4.60E-14)
hemoglobin complex (1.12E-14)
developmental maturation (3.86E-04)
MyoG_Myotubes (9.47E-03) 
negative regulation of cell differentiation (3.01E-04)
MyoD_Growing cells (1.99E-05) 
ectoderm development (1.91E-05)
cell maturation (1.94E-04)
tissue morphogenesis (1.12E-05)
epidermis morphogenesis (2.00E-06)
hair cell differentiation (5.26E-06)
mechanoreceptor differentiation (7.56E-06)
negative regulation of neuron differentiation (3.49E-06)
regulation of neuron differentiation (3.93E-05)
cell fate determination (9.65E-06)
auditory receptor cell fate commitment (3.78E-08)
Stability of clusters against perturbation of prior knowledge
With exception of a few recent works [26–28], most clustering algorithms these days are non-supervised in the sense that prior knowledge is not properly utilized to guide the learning process. Instead prior knowledge is often used in the post-learning phase in that researchers predict functions of unknown genes based on genes of known functions lying in the same cluster. The traditional gene shaving method focuses on the leading PC that accounts for most of variation in the data. On one hand, it is useful in discovering high variation pathway genes [4, 29], on the other hand, it tends to overlook essential pathway genes that have modest expression variation. We hypothesized that highly concerted expression behavior of these genes, albeit modest in variation, may help shape its pattern out of the noisy microarray data using appropriate analysis techniques, i.e., SVD.
The main contribution of this work is that we proposed an optimization algorithm combining the strengths of gene set based analysis and iterative gene selection. The iterative fashion inspired from the gene shaving algorithm allows distilling desired gene cluster using prior knowledge, while the latter enables us to discover gene clusters of modest and concerted expression change. The PC's that define gene clusters group a series of tightly regulated genes ranked by variance over samples. The orthogonality as specified in SVD analysis indicates those gene clusters of different variation were regulated by orthogonal defined or undefined genomic influences (Table 1 of ).
Our method is particularly suitable for identifying gene clusters with modest and concerted expression change, therefore it is not limited to identify periodically expressed gene clusters. When there is no prior knowledge available, the optimization process can be done through optimizing the enrichment of interesting Gene Ontology (GO) vocabulary, for example, somitogenesis [GO:0001756]. The technique for testing enrichment of GO term is very similar to that was used here, also see review in . A recursive dendrogram can be constructed as a foundation to generate overlapping gene clusters, from which the optimal clusters can be identified and retrieved according to the enrichment of the interesting GO term(s) .
Our algorithm represents one of the first clustering algorithms that is particularly designed to identify signaling pathways of low and concordant gene expression variation. The discriminating power is achieved by manufacturing a principal component enriched by the prior knowledge.
Singular Value Decomposition
Refer to supplemental figure 1 for a schematic illustration of the procedure. As shown in later data analysis examples, the separation operation is the key to enhance the prior knowledge enrichment level and to differentiate between antiphased WNT and NOTCH clusters.
Testing gene coefficients
Each element in and is compared to the value , where n is the number of genes and p is a weight factor whose recommended value is 3. If the magnitude of the element in and is greater than , the corresponding gene is determined to contribute significantly to the PC's. Alternatively the list of genes that are significantly up-regulated or down-regulated by the underlying genomic influence corresponding to each PC.
where is a parameter corresponding to the probability of genes in the prior knowledge belonging to the PC, and is a parameter corresponding to the probability of genes not in prior knowledge belonging to the PC. Under H0, the test statistic x follows a hypergeometric distribution with known parameters m, n and k.
Semi-supervised gene shaving algorithm
1: Start with the centered data matrix X that each row has zero mean
2: while TRUE do
3: Singular value decomposition
4: for all column of and do
5: if column elements are greater than a cut-off then
6: NO change
8: Set to 0
9: end if
10: end for
11: for all Gene sets correspond to each columns do
12: Test enrichment of prior knowledge in each gene set
13: end for
14: if Two or more columns that are most enriched with prior knowledge exist then
17: Retrieve the best PC that are most enriched by prior knowledge
18: end if
19: Sort genes according to absolute correlation with the best PC
20: Discard α% least correlated genes (α = 10% followed from )
21: Assign the reduced data matrix to X
22: end while
23: Trace-back to retrieve the best gene cluster
As shown in the above Algorithm and Figure 1, the algorithm iterates until there are two or more most enriched PC's coexisting as defined by prior knowledge. The iterations stop here since we don't yet know a good way to further reduce the size of the cluster. Inconsiderate reduction might cause a loss of important genes. There are two ways of tracing back to retrieve the best gene cluster. One is to find the smallest cluster containing all prior knowledge, another is to find the cluster in which the enrichment of prior knowledge optimized. We chose the latter because it does not rely on the assumption that all prior knowledge need to be accurate. In fact, each gene coefficient can be used to measure the relative importance of genes in forming the cluster pattern. Genes in prior knowledge that help shaping out patterns receive higher weight, otherwise receive lower weight.
Stability analysis of gene clusters – a jackknife approach
where tα/2, n-1satisfies Pr(t n ≥ tα/2, n-1) = α, with t n denoting a t-distributed random variable with n degree of freedom.
DZ is supported by Research Start-up Grants from the University of New Orleans and Research Institute for Children of Children's Hospital New Orleans.
This article has been published as part of BMC Bioinformatics Volume 10 Supplement 1, 2009: Proceedings of The Seventh Asia Pacific Bioinformatics Conference (APBC) 2009. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/10?issue=S1
- Eisen M, Spellman P, Brown P, Botstein D: Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA 1998, 95: 14587–15151. 10.1073/pnas.95.25.14863View ArticleGoogle Scholar
- Tamayo P, Slonim D, Mesirov J, Zhu Q, Kitareewan S, Dmitrovsky E, Lander E, Golub T: Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation. Proc Natl Acad Sci USA 1999, 96(6):2907–2912. 10.1073/pnas.96.6.2907PubMed CentralView ArticlePubMedGoogle Scholar
- Zhu D, Hero A, Cheng H, Khanna R, Swaroop A: Network constrained clustering for gene microarray data. Bioinformatics 2005, 21(21):4014–4020. 10.1093/bioinformatics/bti655View ArticlePubMedGoogle Scholar
- Hastie T, Tibshirani R, Eisen M, Alizadeh A, Levy R, Staudt L, Chan W, Botstein D, Brown P: 'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns. Genome Biology 2000., 1(2):Google Scholar
- Yeung K: Model-based clustering and data transformations for gene expression data. Bioinformatics 2006, 17(10):977–087. 10.1093/bioinformatics/17.10.977View ArticleGoogle Scholar
- Gasch A, Eisen M: Exploring the conditional coregulation of yeast gene expression through fuzzy k-means clustering. Genome Biology 2002., 3(11):Google Scholar
- Qin Z: Clustering microarray gene expression data using weighted Chinese restaurant process. Bioinformatics 2006, 22(16):1988–1997. 10.1093/bioinformatics/btl284View ArticlePubMedGoogle Scholar
- Do K: Applications of gene shaving and mixture models to cluster microarray gene expression data. Cancer Informatics 2007, 2: 25–43.Google Scholar
- Alter O: Singular value decomposition for genome-wide expression data processing and modeling. Proc Natl Acad Sci USA 2000, 97(18):10101–10106. 10.1073/pnas.97.18.10101PubMed CentralView ArticlePubMedGoogle Scholar
- Wall M, Dyck P, Brettin T: SVDMAN – singular value decomposition analysis of microarray data. Bioinformatics 2001, 17(6):566–568. 10.1093/bioinformatics/17.6.566View ArticlePubMedGoogle Scholar
- Carter G, Rupp S, Fink G, Galitski T: Disentangling information flow in the Ras-cAMP signaling network. Genome Research 2006, 16: 520–526. 10.1101/gr.4473506PubMed CentralView ArticlePubMedGoogle Scholar
- Lazzroni L: Plaid models for gene expression data. Statistica Sinica 2002, 12: 61–86.Google Scholar
- Liang L, Mandal V, Lu Y, Kumar D: MCM-test: a fuzzy-set-theory-based approach to differential analysis of gene pathways. BMC Bioinformatics 2008, 9(Suppl 6):S16. 10.1186/1471-2105-9-S6-S16PubMed CentralView ArticlePubMedGoogle Scholar
- Mootha V, Lindgren C, Eriksson K, Subramanian A, Sihag S, Lehar J, Puigserver P, Carlsson E, Ridderstrale M, Laurila E, Houstis N, Daly M, Patterson N, Mesirov J, Golub T, Tamayo P, Spiegelman B, Lander E, Hirschhorn J, Altshuler D, Groop L: PGC-1 α -responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nature Genetics 2003, 34(3):267–273. 10.1038/ng1180View ArticlePubMedGoogle Scholar
- Tian L, Greenberg S, Kong S, Altschuler J, Kohane I, Park P: Model-based analysis of oligonucleotide arrays: expression index computation and outlier detection. Proc Natl Acad Sci USA 2005, 102(38):13544–13549. 10.1073/pnas.0506577102PubMed CentralView ArticlePubMedGoogle Scholar
- Subramanian A, Tamayo P, Mootha V, Mukherjee S, Ebert B, Gillette M, Paulovich A, Pomeroy S, Golub T, Lander E, Mesirov J: Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci USA 2005, 102(43):15545–15550. 10.1073/pnas.0506580102PubMed CentralView ArticlePubMedGoogle Scholar
- Rustici G, Mata J, Kivinen K, Lio P, Penkett C, Burns G, Hayles J, Brazma A, Nurse P, Bahler J: Periodic gene expression program of the fission yeast cell cycle. Nature Genetics 2004, 36(8):809–817. 10.1038/ng1377View ArticlePubMedGoogle Scholar
- Spellman P, Sherlock G, Zhang M, Iyer V, Anders K, Eisen M, Brown P, Botstein D, Futcher B: Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol Biol Cell 1998, 9(12):3273–3297.PubMed CentralView ArticlePubMedGoogle Scholar
- PTC versus paired normal thyroid tissue[http://www.sanger.ac.uk/PostGenomics/S\_prombe/]
- Dequeant M, Glynn E, Gaudenz K, Wahl M, Chen J, Mushegian A, Pourquie O: A Complex Oscillating Network of Signaling Genes Underlies the Mouse Segmentation Clock. Science 2006, 314(5805):1595–1598. 10.1126/science.1133141View ArticlePubMedGoogle Scholar
- Glynn E, Chen J, Mushegian A: Detecting periodic patterns in unevenly spaced gene expression time series using Lomb-Scargle periodograms. Bioinformatics 2006, 22: 310–316. 10.1093/bioinformatics/bti789View ArticlePubMedGoogle Scholar
- A Complex Oscillating Network of Signaling Genes Underlies the Mouse Segmentation Clock[http://www.ebi.ac.uk/microarray-as/aer/#ae-browse/q=E-TABM-163]
- Segal lab website[http://genie.weizmann.ac.il/genomicaweb/enrichment/genesets.jsp]
- Blais A, Tsikitis M, Acosta-Alvear D, Sharan R, Kluger Y, Dynlacht B: An initial blueprint for myogenic differentiation. Gene & Development 2005, 19(48):553–569. 10.1101/gad.1281105View ArticleGoogle Scholar
- Cao Y, Kumar R, Bennett H, Charlotte A, Kooperberg C, Boyer L, Young R, Tapscott S: Global and gene-specific analyses show distinct roles for Myod and Myog at a common set of promoters. The EMBO Journal 2006, 25: 502–511. 10.1038/sj.emboj.7600958PubMed CentralView ArticlePubMedGoogle Scholar
- Pittler S, Zhang Y, Chen S, Mears A, Zack D, Ren Z, Swain P, Yao S, Swaroop A, White J: Incorporating gene functions as priors in model-based clustering of microarray gene expression data. Bioinformatics 2006, 22(7):795–801. 10.1093/bioinformatics/btl011View ArticleGoogle Scholar
- Larson P, Almasri E, Chen G, Dai Y: A statistical method to incorporate biological knowledge for generating testable novel gene regulatory interactions from microarray experiments. BMC Bioinformatics 2007., 8(317):Google Scholar
- Tseng G: Penalized and weighted K-means for clustering with scattered objects and prior information in high-throughput biological data. Bioinformatics 2007, 23(17):2247–2255. 10.1093/bioinformatics/btm320View ArticlePubMedGoogle Scholar
- Tomphor J, Lu J, Kepler T: Pathway level analysis of gene expression using singular value decomposition. BMC Bioinformatics 2005, 6: 225. 10.1186/1471-2105-6-225View ArticleGoogle Scholar
- Rivas E, Personnaz L: Enrichment or depletion of a GO category within a class of genes: which test? Bioinformatics 2006, 23(4):401–407. 10.1093/bioinformatics/btl633View ArticleGoogle Scholar
- Manly B: Randomization, Bootstrap and Monte Carlo Methods in Biology. Boca Raton: Chapman and Hall; 1997.Google Scholar
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.