Graph ranking for exploratory gene data analysis
© Gao et al; licensee BioMed Central Ltd. 2009
Published: 8 October 2009
Microarray technology has made it possible to simultaneously monitor the expression levels of thousands of genes in a single experiment. However, the large number of genes greatly increases the challenges of analyzing, comprehending and interpreting the resulting mass of data. Selecting a subset of important genes is necessary to address this challenge. Gene selection has been investigated extensively over the last decade. Most selection procedures, however, are not sufficient for accurate inference of the underlying biology, because a biologically significant gene is not necessarily statistically significant. Additional biological knowledge needs to be integrated into the gene selection procedure.
We propose a general framework for gene ranking. We construct a bipartite graph from the Gene Ontology (GO) and gene expression data. The graph describes the relationship between genes and their associated molecular functions. Under a species condition, the edge weights of the graph are set to the gene expression levels. Such a graph provides a mathematical means to represent both species-independent and species-dependent biological information. We also develop a new ranking algorithm to analyze the weighted graph via a kernelized spatial depth (KSD) approach. Consequently, the importance of genes and molecular functions can be ranked simultaneously by a real-valued measure, KSD, which incorporates the global and local structure of the graph. Over-expressed and under-expressed genes can also be ranked separately.
The gene-function bigraph integrates molecular function annotations into gene expression data. The relevance of genes is described in the graph (through a common function). The proposed method provides an exploratory framework for gene data analysis.
Microarray technology has made it possible to simultaneously monitor the expression levels of thousands of genes during important biological processes and across collections of related samples. Elucidating the patterns hidden in gene expression data offers a tremendous opportunity for an enhanced understanding of functional genomics. However, the large number of genes greatly increases the challenges of analyzing, comprehending and interpreting the resulting mass of data. Selecting a subset of important genes is necessary to address the challenge for two primary reasons. First, multivariate methods are prone to overfitting. This problem is aggravated when the number of variables is large compared to the number of examples, and even worse for gene expression data which usually has ten or twenty thousand genes but with only a very limited number of samples. It is not uncommon to use a variable ranking method to filter out the least promising variables before using a multivariate method. The second reason for ranking the importance of genes is that identifying important genes is, in and of itself, interesting. For example, to answer the question of what genes are important for distinguishing between cancerous and normal tissue may lead to new medical practices.
Gene selection has been investigated extensively over the last decade by researchers from the statistics, data mining and bioinformatics communities. There are basically two approaches. One approach treats gene selection as a pre-processing step and usually comes with a measure to rank genes. Fold change is a simple measure used in . Dudoit, et al. selected genes based on the ratio of between-group to within-group variance. Golub, et al. used a different method of standardizing the data for selecting genes. Pepe, et al. considered two measures related to the receiver operating characteristic (ROC) curve for ranking genes. Measures of the strength of statistical evidence, such as p-values from hypothesis tests, are also commonly used for gene selection. Storey and Tibshirani proposed a measure of significance called the q-value, based on the concept of false discovery rate. The other common approach embeds gene selection into a specific learning procedure. Fan and Li proposed penalized likelihood methods for regression to select variables and estimate coefficients simultaneously. Lee, et al. proposed a hierarchical Bayesian model for gene selection. They employed latent variables to specialize the model to a regression setting and used a Bayesian mixture prior to perform the variable selection. Recursive feature elimination (RFE) methods with support vector machines (SVM), e.g. [9–12], have been shown to be successful for gene selection and classification. L1 SVMs perform variable selection automatically by solving a quadratic optimization problem, e.g. [13–15]. Díaz, et al. applied a random forest algorithm for classification and, at the same time, for selecting genes based on the permuted importance score.
Mukherjee and Roberts  provided a theoretical analysis of gene selection, in which the probability of successfully selecting relevant genes, using a given gene ranking function, is explicitly calculated in terms of population parameters. For a more comprehensive survey of this subject, the reader is referred to [18, 19], and .
In most cases, genes selected by the aforementioned procedures are not sufficient for accurate inference of the underlying biology, because a biologically significant gene is not necessarily statistically significant. For example, suppose a gene with low differential expression is a transcription factor that controls the expression of some other genes. The transcription factor itself may be activated by the treatment even though its expression does not change significantly. An ideal selection procedure should be able to highlight such a transcription factor, and to do so, additional biological knowledge must be integrated into it. With the development of biological knowledge databases, biologically interesting sets of genes, such as genes that belong to a pathway or genes known to have the same molecular function, can be compiled, for instance from the Gene Ontology; see GO Consortium (2008). Many recent publications combine gene expression with GO. One common approach is to find enriched gene sets, annotated by GO terms, that are over-represented among the differentially expressed genes in the analysis of microarray data. See [23–26], and  for details of enrichment. The other approach is to use a GO graph to improve the identification of differentially expressed genes. Morrison, et al. constructed a gene-gene graph derived from GO and used GeneRank, a modification of PageRank (the ranking algorithm used in the Google search engine), to prioritize genes. Gene expression data was used to specify "the personalization vector" in PageRank. Ma et al. first computed an individual score for each gene from gene expression profiles, then combined the scores of a gene and its direct and indirect neighbors in a gene-gene graph derived from GO or a protein-protein interaction network to obtain a more accurate gene ranking.
Daigle and Altman developed a probabilistic model that integrates biological knowledge with microarray data to identify differentially expressed (DE) genes. They introduced a latent binary variable (DE/not DE) and used a learning algorithm on a stochastic, binary state network to estimate a ranking score. Srivastava, et al. used the GO structure to compute the similarity between genes and combined it with gene expression data in a ridge regression for gene selection. An approach integrating GO and gene data captures the dependence structure of genes without sacrificing gene-level resolution. It provides more reliable results than methods relying on gene expression data alone, as we demonstrate later.
In this paper, we propose an exploratory framework of gene ranking that utilizes gene expression profiles and GO annotations. The contributions of this paper are described as follows.
Bi-graph representation of biological information of genes. We extract biological information from the GO database. One of the three GO ontologies, molecular function, is used; the other two types of annotations, biological process and cellular component, can be used similarly. A bipartite graph is constructed with one partition being genes and the other molecular functions. If a gene is associated with a particular function, the gene and the function are joined by an edge. Such a graph structure represents species-independent biological knowledge among genes indirectly (through common functions). Furthermore, using gene expression studies, the weight of each edge is set to the expression level of the gene associated with the edge. This integrates the species-dependent information into the graph. The weighted graph thus conveys the dependency structure among genes.
A new graph ranking algorithm. We introduce a new measure, kernelized spatial depth (KSD), to rank the nodes of a graph. Spatial depth (SD) provides a center-outward ordering of a data set in a Euclidean space $\mathbb{R}^d$. It is a global concept. KSD generalizes the notion of spatial depth by incorporating the local perspective of the data set. Applying KSD to a graph provides a ranking of nodes that takes into consideration both the global and local structure of the graph. For sparse graphs, the algorithm is efficient, with computational complexity $O(n^2)$, where n is the number of nodes of the graph. The algorithm can be easily modified to handle dynamic data sets. It can also be parallelized to scale up for large data sets.
Better interpretation. Under a specified condition, not only is the importance of genes ranked, but the importance of functions is also ranked. This provides us with a better understanding and insight into the roles of various genes and molecular functions by analyzing bigraphs with gene expression profiles under different conditions. We demonstrate the performance of the proposed procedure using gene data from Gene Expression Omnibus (GEO). The new methods exhibit a higher level of biological relevance than competing methods.
Unlike the gene-gene network construction used in GeneRank, the gene-function bigraph structure has several advantages. It incorporates gene expression profiles easily and naturally by using them as the weights of the graph. In addition, the importance of genes and molecular functions can be ranked simultaneously. Because of these advantages, bipartite graph modeling was also used by Dhillon and Zha, et al. to co-cluster documents and words. Tanay, et al. formed a gene-condition bigraph to find gene clusters in gene expression data.
The rest of this paper is organized as follows. After a brief introduction of some preliminaries on graphs, we introduce the KSD measure to rank vertices of a graph, followed by a discussion of choice of kernels and their comparison. In application, gene-function bigraphs are constructed to combine biological species-independent knowledge extracted from GO and species-dependent information contained in gene expression profiles. We apply our KSD ranking method to real data sets. Our conclusions and discussion are given in the last section.
Preliminaries of graphs and a motivating example
A graph G consists of a set of vertices (nodes) V and a set of edges E that connect vertices. The vertices are entities of interest and the edges represent relationships between the entities. Edges can be assigned positive weights W to quantify how strong the relationships are. Such a graph is called a weighted graph. Un-weighted graphs are just the special case with all the weights equally being 1.
A bipartite graph (or bigraph) is a graph whose vertices can be divided into two disjoint sets V1 and V2 such that every edge connects a vertex in V1 to one in V2. In our application, a bipartite graph is constructed with one set of vertices being genes and the other set of vertices being one of the Gene Ontology (GO) molecular functions.
The degree of a vertex $v \in V$, denoted $d_v$, is defined as the sum of the weights of the edges incident to $v$, i.e. $d_v = \sum_{u:\,(v,u) \in E} W(v, u)$. For an un-weighted graph, the degree of $v$ is simply the number of incident edges.
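To make the definitions above concrete, here is a minimal Python sketch that stores a small weighted bigraph as a symmetric weight matrix and computes the weighted degrees. The toy weights and the helper names `bipartite_adjacency` and `degrees` are illustrative assumptions, not part of the original method description.

```python
import numpy as np

# Hypothetical toy bigraph: rows are genes, columns are molecular functions.
# B[t, i] > 0 is the weight of the edge joining gene t and function i.
B = np.array([
    [2.0, 0.0],
    [1.5, 0.5],
    [0.0, 3.0],
])

def bipartite_adjacency(B):
    """Embed gene-by-function weights in a symmetric weight matrix W.

    Vertices are ordered genes first, then functions; edges only join
    the two partitions, so the diagonal blocks stay zero.
    """
    n_genes, n_funcs = B.shape
    n = n_genes + n_funcs
    W = np.zeros((n, n))
    W[:n_genes, n_genes:] = B
    W[n_genes:, :n_genes] = B.T
    return W

def degrees(W):
    """Weighted degree d_v: sum of the weights of edges incident to v."""
    return W.sum(axis=1)

W = bipartite_adjacency(B)
d = degrees(W)
```

With unit weights the same function returns the number of incident edges, which is the un-weighted special case.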
Spatial depth and kernelized spatial depth
We first introduce spatial depth in the Euclidean space $\mathbb{R}^d$, then generalize it to kernelized spatial depth, which is the spatial depth on the feature space induced by a positive definite kernel. In order to extend the concept of KSD to a graph, the kernel on the graph must be specified. We define several graph kernels and present the KSD algorithm to obtain the depth of every vertex of the graph.
From the definition, it is not difficult to see that points deep inside a data cloud receive high depth and those on the outskirts get lower depth. Each observation from a data set contributes equally, as a unit vector, to the value of the depth function. In this sense, spatial depth takes a global view of the data set. On the one hand, spatial depth downplays the significance of distance and hence reduces the impact of extreme observations whose extremity is measured in (Euclidean) distance, so that it gains resistance against these extreme observations. Robustness is a favorable property of spatial depth. Ding, et al. constructed a robust clustering algorithm based on it. On the other hand, the robustness of the depth function trades off some distance measurement, resulting in a certain loss of the measurement of (dis)similarity between data points. To overcome this limitation of spatial depth, Chen, et al. proposed kernelized spatial depth (KSD), which incorporates into the depth function a distance metric (or a similarity measure) induced by a positive definite kernel function.
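As an illustration of the center-outward ordering described above, the following Python sketch computes a sample spatial depth in Euclidean space. It assumes the commonly used form in which the depth of x is one minus the norm of the average unit vector pointing from the observations to x; the function name and the 1/n normalization are choices made here for illustration and may differ from the exact definition used by the authors.

```python
import numpy as np

def spatial_depth(x, X):
    """Sample spatial depth of point x with respect to the rows of X.

    Assumed form: 1 - || (1/n) * sum_i S(x - x_i) ||, where S(v) = v/||v||
    and S(0) = 0. Each observation contributes only a direction, so its
    distance from x does not matter.
    """
    X = np.asarray(X, dtype=float)
    diffs = np.asarray(x, dtype=float) - X
    norms = np.linalg.norm(diffs, axis=1)
    keep = norms > 0                      # observations equal to x contribute 0
    units = diffs[keep] / norms[keep, None]
    return 1.0 - np.linalg.norm(units.sum(axis=0)) / len(X)

# Toy data: four points placed symmetrically around the origin.
X = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
center_depth = spatial_depth([0.0, 0.0], X)     # unit vectors cancel out
outlier_depth = spatial_depth([100.0, 0.0], X)  # unit vectors nearly align
```

Moving the outlying point from distance 100 to distance 1000 barely changes its depth, which is the robustness to extreme observations noted in the text.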
Kernelized spatial depth
The value of KSD depends only on κ; the mapping ϕ need not be known explicitly. In $\mathbb{R}^d$, one popular positive definite kernel is the Gaussian kernel $\kappa(x, y) = \exp(-\|x - y\|^2/\sigma^2)$, which can be interpreted as a similarity between x and y. For a graph, we must consider what a good similarity measure would be and how to construct an appropriate kernel matrix efficiently.
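The reason κ alone suffices can be seen from two standard kernel-trick identities, written here as a reading aid under the usual assumption that $\kappa(x, y) = \langle \phi(x), \phi(y) \rangle$ for some feature map ϕ. Distances and inner products of difference vectors in the feature space reduce to kernel evaluations:

```latex
\begin{align}
\|\phi(x) - \phi(y)\|^2
  &= \kappa(x, x) + \kappa(y, y) - 2\,\kappa(x, y), \\
\langle \phi(x) - \phi(y),\; \phi(x) - \phi(z) \rangle
  &= \kappa(x, x) + \kappa(y, z) - \kappa(x, y) - \kappa(x, z).
\end{align}
```

The second expression has the same form as the quantity $M_{ij} = K_{mm} + K_{ij} - K_{mi} - K_{mj}$ that appears in Algorithm 1.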
Choice of graph kernels
Various kernels on graphs can be found in the recent literature, for example [39–41], and . Ando and Zhang provide some theoretical insights into the role of normalization of the graph Laplacian matrix. We consider five Laplacian kernels, including the complement Laplacian kernel, which is proposed here. Each kernel is described, followed by a comparison and a discussion of the computational issues of these kernels.
From the above result, we can see that the distance between two adjacent vertices in the feature space is larger than that between two disconnected vertices. The mapping ϕ reverses the relationship between two vertices in the graph. In this sense, we can view the Laplacian kernel as a dissimilarity matrix. In other words, a vertex close to the center in the graph turns into a vertex far from the center in the feature space. Therefore, a smaller KSD value indicates a higher rank of the vertex in the graph when the Laplacian is chosen as the kernel. This is interesting, but it is not consistent with the usual kernels, which describe the similarity between two vertices. Next we look at several alternatives to the Laplacian kernel.
Laplacian of complement graph kernel
There is no question that $nI - E - L$ is symmetric and positive semi-definite. Notice that the Laplacian of the complement graph is defined in terms of the negative Laplacian of the original graph. Hence it reverses the dissimilarity measure of $L_G$. In other words, the Laplacian of the complement graph is a similarity matrix. Therefore, a larger KSD value with the Laplacian of the complement as the kernel indicates a deeper vertex in the graph, as we expect. This kernel is especially useful for dense graphs: the Laplacian of the complement of a dense graph may be a sparse matrix, which leads to an efficient implementation of the KSD algorithm.
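The expression $nI - E - L$ can be checked numerically on a small example. The sketch below is only illustrative and rests on two assumptions that are not stated in the surrounding text: that E denotes the n-by-n matrix of all ones, and that L is the combinatorial Laplacian D - W of an un-weighted graph. The function names are hypothetical.

```python
import numpy as np

def laplacian(W):
    """Combinatorial Laplacian L = D - W of a symmetric weight matrix."""
    return np.diag(W.sum(axis=1)) - W

def complement_laplacian(W):
    """Laplacian of the complement graph, computed as n*I - E - L.

    Assumes E is the all-ones matrix and W is a 0/1 adjacency matrix
    with a zero diagonal.
    """
    n = W.shape[0]
    E = np.ones((n, n))
    return n * np.eye(n) - E - laplacian(W)

# Path graph 0 - 1 - 2; its complement has the single edge 0 - 2.
W = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
K_C = complement_laplacian(W)

# Direct construction from the complement adjacency, for comparison.
W_bar = np.ones((3, 3)) - np.eye(3) - W
K_direct = laplacian(W_bar)
```

Both constructions give the same matrix, and its eigenvalues are non-negative, in line with the positive semi-definiteness claimed above.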
Diffusion Laplacian kernel
From the Taylor expansion of the exponential function, it is not difficult to show that $K_D$ is symmetric positive definite and that all of its entries are non-negative.
The diffusion Laplacian kernel behaves in the "opposite" way to the Laplacian kernel. Therefore, as with the Laplacian of the complement graph kernel, a larger KSD value under the diffusion Laplacian kernel indicates a more "central" vertex in the graph.
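The stated properties of the diffusion kernel can be illustrated numerically. The sketch below assumes the common parametrization $K_D = \exp(-\beta L)$ with the combinatorial Laplacian L and a positive parameter beta; both the parametrization and the name `beta` are assumptions made here, since the exact form used by the authors is not shown in this text.

```python
import numpy as np

def diffusion_kernel(L, beta=1.0):
    """Matrix exponential exp(-beta * L) via the eigendecomposition of L.

    For symmetric L = U diag(lam) U^T, the kernel is
    U diag(exp(-beta * lam)) U^T, so every eigenvalue is strictly positive.
    """
    lam, U = np.linalg.eigh(L)
    return (U * np.exp(-beta * lam)) @ U.T

# Combinatorial Laplacian of the path graph 0 - 1 - 2.
L = np.array([[ 1.0, -1.0,  0.0],
              [-1.0,  2.0, -1.0],
              [ 0.0, -1.0,  1.0]])
K_D = diffusion_kernel(L, beta=0.5)
```

The eigendecomposition costs cubic time and the result is generally a dense matrix even when L is sparse.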
Pseudo-inverse Laplacian kernel
where $\Lambda^{-}$ is a diagonal matrix whose $(i, i)$ diagonal element is $1/\lambda_i$. For convenience, we define $1/\lambda_i = 0$ if $\lambda_i = 0$. Clearly, $K_P$ is also positive semi-definite, which means that it is indeed a valid kernel.
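The convention of inverting only the non-zero eigenvalues can be sketched in a few lines of Python. This is an illustrative implementation for a symmetric Laplacian matrix; the function name and the numerical tolerance `tol` used to decide which eigenvalues count as zero are assumptions introduced here.

```python
import numpy as np

def pseudo_inverse_kernel(L, tol=1e-10):
    """Pseudo-inverse of a symmetric Laplacian via its eigendecomposition.

    Eigenvalues above `tol` are replaced by their reciprocals; eigenvalues
    that are numerically zero are mapped to 0, as in the text.
    """
    lam, U = np.linalg.eigh(L)
    inv = np.zeros_like(lam)
    nonzero = lam > tol
    inv[nonzero] = 1.0 / lam[nonzero]
    return (U * inv) @ U.T

# Combinatorial Laplacian of the path graph 0 - 1 - 2.
L = np.array([[ 1.0, -1.0,  0.0],
              [-1.0,  2.0, -1.0],
              [ 0.0, -1.0,  1.0]])
K_P = pseudo_inverse_kernel(L)
```

Since every retained eigenvalue of the result is positive and the rest are zero, the matrix is positive semi-definite.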
P-step random walk kernel
where p is a positive integer and $a \geq 2$. The name of the kernel is based on the fact that $(aI - \mathcal{L})^p$ is, up to scaling terms, equivalent to a p-step random walk on the graph with random restarts. Since ℒ enters with a negative sign, it is a similarity kernel.
In particular, the p-step random walk kernel with a = 2 and p = 1, $K_R = 2I - \mathcal{L}$, converts the off-diagonal dissimilarities of the Laplacian kernel into off-diagonal similarities. It is simple in form and therefore attractive for practical purposes.
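A small Python sketch of this kernel follows. It assumes that ℒ is the normalized Laplacian $I - D^{-1/2} W D^{-1/2}$, whose eigenvalues lie in [0, 2], which would explain why $a \geq 2$ keeps the kernel positive semi-definite. That reading of ℒ, the function names, and the treatment of isolated vertices are assumptions made for illustration.

```python
import numpy as np

def normalized_laplacian(W):
    """Normalized Laplacian I - D^{-1/2} W D^{-1/2} of a symmetric W.

    Isolated vertices (degree 0) get a zero row in the scaled matrix.
    """
    W = np.asarray(W, dtype=float)
    d = W.sum(axis=1)
    inv_sqrt = np.zeros_like(d)
    pos = d > 0
    inv_sqrt[pos] = 1.0 / np.sqrt(d[pos])
    S = inv_sqrt[:, None] * W * inv_sqrt[None, :]
    return np.eye(len(d)) - S

def random_walk_kernel(W, a=2.0, p=1):
    """p-step random walk kernel (a*I - normalized Laplacian)^p."""
    n = np.asarray(W).shape[0]
    base = a * np.eye(n) - normalized_laplacian(W)
    return np.linalg.matrix_power(base, p)

# Path graph 0 - 1 - 2 with unit weights.
W = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
K_R = random_walk_kernel(W, a=2.0, p=1)
```

For a = 2 and p = 1 the kernel equals the identity plus the degree-scaled weight matrix, so it is exactly as sparse as the graph itself (apart from the diagonal).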
Ranking algorithm based on KSD for graphs
Given a graph G and a specified kernel, the following pseudocode describes the procedure to calculate the kernelized spatial depth values of all vertices.
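Algorithm 1 KSD Algorithm
Get the Laplacian ℒ of the input graph G with n vertices
Choose and compute the kernel matrix K
FOR (every vertex m in G)
    FOR (every vertex i in G)
        $t = \sqrt{K_{mm} + K_{ii} - 2K_{mi}}$
        IF t = 0
            $\alpha_i = 0$
        ELSE
            $\alpha_i = 1/t$
        END IF
    END FOR
    FOR (every pair of vertices i, j in G)
        $M_{ij} = K_{mm} + K_{ij} - K_{mi} - K_{mj}$
    END FOR
    $D_\kappa(m) = 1 - \frac{1}{n-1}\sqrt{\sum_{i,j} \alpha_i \alpha_j M_{ij}}$
END FOR
OUTPUT $D_\kappa$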
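The loop structure of Algorithm 1 can be sketched in Python as follows. This is a dense-matrix illustration rather than the authors' implementation (which is in R): the feature-space distance used for t and the 1/(n - 1) normalization follow the usual definition of kernelized spatial depth and should be treated as assumptions, as should the function name `ksd`.

```python
import numpy as np

def ksd(K):
    """Kernelized spatial depth of every vertex, given a kernel matrix K.

    Dense sketch: O(n^2) work per vertex, O(n^3) in total.
    Requires at least two vertices.
    """
    K = np.asarray(K, dtype=float)
    n = K.shape[0]
    diag = np.diag(K)
    depth = np.empty(n)
    for m in range(n):
        # t_i: feature-space distance between vertex m and vertex i
        t = np.sqrt(np.clip(K[m, m] + diag - 2.0 * K[m, :], 0.0, None))
        alpha = np.zeros(n)
        nonzero = t > 1e-12            # alpha_i = 0 when t = 0 (e.g. i = m)
        alpha[nonzero] = 1.0 / t[nonzero]
        # M_ij = K_mm + K_ij - K_mi - K_mj for all pairs (i, j)
        M = K[m, m] + K - K[m, :][:, None] - K[m, :][None, :]
        total = alpha @ M @ alpha
        depth[m] = 1.0 - np.sqrt(max(total, 0.0)) / (n - 1)
    return depth

# Sanity check with a linear kernel, where KSD reduces to spatial depth:
# one central point surrounded by four symmetric points.
X = np.array([[0.0, 0.0], [1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
depths = ksd(X @ X.T)
ranking = np.argsort(-depths)          # vertices from deepest to shallowest
```

Because each pass of the outer loop touches only row m of K and the shared matrix K itself, different vertices can be processed by independent workers.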
From the above algorithm, the computation cost of KSD for all vertices depends on the sparseness of the kernel matrix. For a sparse kernel matrix it is $O(n^2)$; otherwise it is $O(n^3)$. It is worthwhile to remark that, because the depth of each vertex m is computed independently of the others, the algorithm can be sped up by running it on multiple CPUs or computers even without the help of parallel programming techniques.
Comparison of kernels
In the real world, most networks (graphs), such as the world wide web and biological networks, including the gene-function bipartite graphs we construct later, are sparse, which means that the associated weight matrices are sparse. The complement Laplacian kernel is not suitable for such graphs because of its expensive $O(n^3)$ computation cost. The diffusion kernel and the pseudo-inverse kernel require a spectral decomposition of ℒ, which has $O(n^3)$ complexity, and the resulting kernels are usually very dense, so they are not attractive either. The Laplacian kernel is difficult to interpret, so we prefer the p-step random walk kernel.
In our application work in the next section, we rank the importance of genes by KSD using the p-step random walk kernel with a = 2 and p = 1.
Application to gene data
In our application, gene expression of budding yeast (Saccharomyces cerevisiae) cells treated with the DNA-reactive compounds cisplatin (CIS), methyl methanesulfonate (MMS), and bleomycin (BLE) to induce genotoxic stress is compared with gene expression of Saccharomyces cells treated with the DNA non-reactive compounds ethanol (EtOH) and sodium chloride (NaCl) to produce cytotoxic stress. Our goal is to identify a small number of biologically relevant genes capable of differentiating the mechanisms of toxicity of the known genotoxic compounds from those of the cytotoxic compounds. To do so, we use the following basic methodology:
Construct an unweighted gene-function bigraph based on GO with one partition representing genes and the other representing molecular function.
Preprocess and combine data from the gene expression samples into one set per compound.
For each compound, add weights to the bigraph using the gene expression data.
Run the KSD algorithm on each bigraph to develop a gene expression profile of ranked genes for each compound.
Compare the ranked gene sets.
Details of these steps are provided below.
General construction of gene-function bigraph
In order to integrate biological information and gene expression data, we use one of the gene ontologies, the molecular function descriptions of genes. In the GO database, the ontologies are structured as rooted directed acyclic graphs (DAGs). The terms close to the root are more abstract than the terms far away from the root. We first extract the most specific functions associated with each gene to form the set of GO function terms. With one set of functions and the other set of genes, a bipartite graph is established. Consider Figure 3. Gene YGR098C is associated with the GO function term 0004197, which describes cysteine-type endopeptidase activity. Genes YMR154C and YNL223W also have the same function. So in the bipartite graph, gene YGR098C is more closely related to YMR154C and YNL223W than it is to YBL069W.
Algorithm 2 Gene-Function bigraph Construction Algorithm
Input c, user-specified parameter
Input gene data
Extract the associated GO function terms F
Form the weighted bigraph G = (V, E, W)
FOR each term $f_i$ in F
    Obtain all ancestors m of $f_i$ and their generation levels $l_{im}$
END FOR
FOR every pair i, j in F
    Find the nearest common ancestor s
    $k = \max(l_{is}, l_{js})$
    For each gene $g_t$ with $(g_t, f_i) \in E$, add the edge between $f_j$ and $g_t$ with weight $W_{ti} \times c^k$ to G
    For each gene $g_t$ with $(g_t, f_j) \in E$, add the edge between $f_i$ and $g_t$ with weight $W_{tj} \times c^k$ to G
END FOR
OUTPUT G
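The steps of Algorithm 2 can be sketched in Python on plain dictionaries. Everything about the data layout is a hypothetical choice made for illustration: `parents` maps a GO term to its parent terms, `annotations` maps a gene to its most specific terms, and `expression` holds one positive weight per gene. The sketch also assumes that a term counts as its own ancestor at level 0, that the "nearest" common ancestor is the one minimizing max(l_is, l_js), and that an existing direct edge is never overwritten. The cutoff `max_k` is a convenience for truncating distant term pairs and is not part of Algorithm 2 as listed.

```python
from itertools import combinations

def ancestor_levels(term, parents):
    """Map the term (level 0) and each ancestor to its fewest steps upward."""
    levels = {term: 0}
    frontier = [term]
    while frontier:                        # breadth-first walk up the DAG
        next_frontier = []
        for node in frontier:
            for parent in parents.get(node, ()):
                if parent not in levels:
                    levels[parent] = levels[node] + 1
                    next_frontier.append(parent)
        frontier = next_frontier
    return levels

def build_bigraph(annotations, expression, parents, c=0.2, max_k=1):
    """Return {(gene, term): weight} for the weighted gene-function bigraph."""
    edges, genes_of = {}, {}
    for gene, terms in annotations.items():        # direct annotation edges
        for term in terms:
            edges[(gene, term)] = expression[gene]
            genes_of.setdefault(term, []).append(gene)
    terms = sorted(genes_of)
    levels = {t: ancestor_levels(t, parents) for t in terms}
    for fi, fj in combinations(terms, 2):
        common = levels[fi].keys() & levels[fj].keys()
        if not common:
            continue
        # nearest common ancestor s, scored by k = max(l_is, l_js)
        k = min(max(levels[fi][s], levels[fj][s]) for s in common)
        if k > max_k:
            continue
        for src, dst in ((fi, fj), (fj, fi)):      # added, down-weighted edges
            for gene in genes_of[src]:
                edges.setdefault((gene, dst), expression[gene] * c ** k)
    return edges

# Hypothetical toy ontology: F1 and F2 share parent P; F3 sits under Q.
parents = {"F1": ["P"], "F2": ["P"], "F3": ["Q"], "P": ["R"], "Q": ["R"]}
annotations = {"gA": ["F1"], "gB": ["F2"], "gC": ["F3"]}
expression = {"gA": 2.0, "gB": 1.0, "gC": 4.0}
edges = build_bigraph(annotations, expression, parents, c=0.2, max_k=1)
```

In the toy example the sibling terms F1 and F2 exchange down-weighted edges, while F3 stays unconnected to their genes because its nearest common ancestor with them is two levels up.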
The construction of the gene-function bigraph combines gene expression profiles and topological similarity in a single framework. Khatri and Drăghici  summarized three ways to determine the abstraction level of annotation in their section 2.7. Our approach is a variation of their second method. The user may decide k, the bottom-up level, for annotations. The difference is that we treat the children terms unequally, similar to the weight strategy presented in .
Figure 3 demonstrates how to build the structure of gene-function bigraph. The yellow rectangles represent genes at the bottom level. The above blue ellipses and arrows form a subgraph of the DAG in the GO database. Solid edges represent the association between gene and function. Dashed lines are added edges that reflect the semantic similarity of function annotations. The graph inside the red dashed box is the gene-function bipartite graph.
Preprocessing of gene expression data
Bigraphs for gene data under each treatment
In our application, we choose c = 1/5. Since r decreases rapidly as k grows for this choice of c, we truncate r to zero for k > 1 to reduce memory use and computation time. Under Algorithm 2, the bigraph for the MMS treatment has a total of 5232 vertices, comprising 4675 genes and 557 function terms, and 22659 edges. The resulting bigraph is therefore very sparse, with sparsity 0.0017 compared with 1 for the full graph (the graph with edges between all pairs of vertices). We use the p-step random walk kernel to analyze the graph. Since we take log-2 expression differences with respect to the control agent, up-regulated genes have positive log-2 expression differences and down-regulated genes have negative ones. Because edge weights must be positive, these values cannot be assigned directly as edge weights in the bigraph. We therefore separate the bigraph into two subgraphs: one with all over-expressed genes and the other with all under-expressed genes. For the subgraph containing the down-regulated genes, the weights are the absolute values of the log-2 expression differences. We then rank the important genes in the two graphs separately. This is reasonable because we are interested in both important induced genes and important repressed genes. All graph construction and algorithms are implemented using R and Bioconductor.
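Two of the computations in this paragraph are simple enough to sketch in Python: the sparsity figure and the split into induced and repressed genes. The function names and the toy log-2 values are hypothetical, and sparsity is taken here to mean the number of edges divided by the number of vertex pairs, which reproduces the reported value.

```python
def sparsity(n_vertices, n_edges):
    """Fraction of all possible vertex pairs that are joined by an edge."""
    return n_edges / (n_vertices * (n_vertices - 1) / 2)

def split_by_sign(log2_diff):
    """Split log-2 expression differences into two positive weight maps.

    Up-regulated genes keep their values; down-regulated genes get the
    absolute value, so both subgraphs have positive edge weights.
    Genes with a difference of exactly zero are left out of both.
    """
    up = {gene: v for gene, v in log2_diff.items() if v > 0}
    down = {gene: -v for gene, v in log2_diff.items() if v < 0}
    return up, down

# Figures quoted in the text for the MMS bigraph.
mms_sparsity = sparsity(5232, 22659)

# Hypothetical log-2 differences for three genes.
up, down = split_by_sign({"g1": 1.5, "g2": -2.0, "g3": 0.0})
```

Each of the two weight maps can then be used to weight its own bigraph before ranking.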
Validation of improvement using GO
Before we present the results on the genes that can potentially differentiate genotoxicity and cytotoxicity, we would like to demonstrate that integrating GO provides more reliable results than methods that use gene expression data only. We consider the three NaCl samples individually, ranking differentially expressed genes in each sample and comparing the degree of overlap of the top 100 gene lists.
For the simplest fold-change method, which ranks genes by the ratio of the expression level in a NaCl-treated sample to the mean expression in the control group, there are seven common genes appearing in the top 100 of the three samples, and only three overlapping in the top 50. When t-statistics are used for ranking genes, there are no genes in the overlap of the top 50 genes from the three samples, and only five genes in the overlap of the top 100 genes. Moreover, only one gene is identified in each sample by both methods. The reasons for such poor performance include the noise level and experimental variability of microarrays. Ranking each gene independently of the others is another contributing factor. Incorporating gene expression profiles and biological knowledge can improve performance.
By integrating GO annotations, a gene-function bigraph is constructed with weights being fold-changes or t-statistics for each sample. The KSD ranking on fold-change weighted graph provides an overlap of 60 genes in the top 100 and 32 in the top 50. There are 45 common genes in all the top 100 and 24 in the top 50 if we rank the t-statistic weighted bigraph. Furthermore, 38 common genes are identified in every bigraph based on each sample using either a fold change or t-statistic. For other compounds, we obtained a similar result: a small overlap for methods on gene data alone, a relatively larger overlap for our approach on the GO derived graph. While our testing used GO function annotations, similar results are expected with the other two ontologies. It is noted that there is a complete overlap if only GO information is used. Gene-function bigraphs which combine gene data with GO enhance the experimental signal and capture the dependent structure of genes. Hence, ranking on bigraphs improves the results.
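The overlap comparison used in this validation amounts to intersecting top-k lists across samples. A minimal Python sketch follows; the function names and the toy scores are hypothetical, and ties are broken by dictionary order.

```python
def top_k(scores, k):
    """Set of the k highest-scoring genes in a {gene: score} mapping."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])

def common_top_k(score_maps, k):
    """Genes that appear in the top k of every sample."""
    return set.intersection(*(top_k(scores, k) for scores in score_maps))

# Hypothetical ranking scores for four genes in three samples.
sample1 = {"a": 5, "b": 4, "c": 3, "d": 1}
sample2 = {"a": 1, "b": 9, "c": 8, "d": 7}
sample3 = {"a": 2, "b": 3, "c": 9, "d": 1}

overlap_top2 = common_top_k([sample1, sample2, sample3], 2)
overlap_top3 = common_top_k([sample1, sample2, sample3], 3)
```

The same helper applies whether the scores are fold changes, t-statistics, or KSD values, so the different ranking methods can be compared on equal terms.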
Top 10 induced (up) and repressed (down) genes for each agent.
Genes with similar responses under genotoxic or cytotoxic stress
We enlarge the search for differentially regulated genes between the two groups to the top 100 genes. Eight other genes are capable of discriminating between genotoxic and cytotoxic agents. They behave similarly within each group but completely differently between groups. Genes over-expressed under genotoxic treatments but down-regulated under cytotoxic agents include TFS1, NTH1, ATG27 and the uncharacterized YMR090W. TFS1 is a carboxypeptidase Y inhibitor, which is targeted to vacuolar membranes during stationary phase and is involved in the protein kinase A signaling pathway. NTH1 is required for thermotolerance and may mediate resistance to other cellular stresses. The type I membrane protein ATG27 is involved in autophagy and the cytoplasm-to-vacuole targeting pathway. Gene YMR090W, whose function is unknown, should be treated with caution. GO term 0003674 is manually created for unknown molecular functions. Because our method utilizes the GO DAG structure, the identification of YMR090W may be caused by its annotation to 0003674 (unknown function) rather than by significant changes in mRNA levels. Further study of this gene is worthwhile.
Four genes, PUS2, CAX4, WSC4 and MLP2, are induced under cytotoxic stress but repressed under genotoxic stress. The PUS2 protein is a mitochondrial tRNA:pseudouridine synthase that is targeted to mitochondria and specifically dedicated to mitochondrial tRNA modification. In response to the decreased yeast viability and slow growth caused by cytotoxic stress, CAX4 is induced to increase the level of N-linked glycosylation. WSC4 is an ER membrane protein involved in the translocation of soluble secretory proteins and the insertion of membrane proteins into the ER, which plays an important role in the stress response. MLP2, a myosin-like protein associated with the nuclear envelope, connects the nuclear pore complex with the nuclear interior and is involved in the Tel1p pathway that controls telomere length.
Significant important genes distinguishing genotoxicity and cytotoxicity
- Ribonucleotide-diphosphate reductase (RNR)
- TFS1: Carboxypeptidase Y inhibitor
- NTH1: Neutral trehalase, degrades trehalose
- ATG27: Type I membrane protein
- PUS2: Mitochondrial tRNA:pseudouridine synthase
- CAX4: Dolichyl pyrophosphate (Dol-P-P) phosphatase
- WSC4: ER membrane protein
- MLP2: Myosin-like protein associated with the nuclear envelope
Comparison with PageRank
We also use PageRank to analyze each weighted bigraph under each treatment. It yields results very similar to those of KSD. For up-regulated genes under MMS, 85 of the top 100 genes ranked by PageRank coincide with the top 100 ranked by KSD. For down-regulated genes under MMS, 77 genes appear in both top 100 lists. The other compounds show a similar overlap in their top 100 lists. PageRank and KSD produce similar ranking lists for gene data, so why do we need KSD?
There are two major advantages of KSD over PageRank. First, PageRank needs a damping parameter to be specified. Empirical studies suggest that a value of 0.85 (the default value in R) gives a good balance between convergence rate and stability in many applications, but there are circumstances where 0.85 may be far from the "optimal" value. The choice of the damping parameter is a concern for PageRank and hence also for GeneRank. This is not an issue for KSD if we use the Laplacian, complement Laplacian or pseudo-inverse kernels. Second, since spatial depth is a robust measure of centrality, we expect that KSD will inherit this property and produce a more robust ranking. To demonstrate the robustness, we design the following experiment to compare the sensitivity of our approach and PageRank to incorrect annotations on artificial data.
The gene-function bigraph integrates molecular function annotations with gene expression data. The general relevance of genes is described in the graph (through a common function). Weights of the graph are assigned to be gene response expressions. The resulting bigraph includes more biological information than the gene data alone. Consequently, ranking on the bigraph may provide more biologically significant genes than ranking procedures based only on gene data. Also, we propose a new ranking algorithm for graphs based on the KSD measure. KSD balances the local and global topological structure of the graph, hence it provides a good and meaningful ordering of vertices of the graph. Experimental results on artificial data show that KSD is more robust than the well-known PageRank against incorrect annotations. The proposed method provides an exploratory framework for gene data analysis.
Support under National Science Foundation Grant DMS-0707074 is gratefully acknowledged by XD.
This article has been published as part of BMC Bioinformatics Volume 10 Supplement 11, 2009: Proceedings of the Sixth Annual MCBIOS Conference. Transformational Bioinformatics: Delivering Value from Genomes. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/10?issue=S11.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.