GEOGLE: context mining tool for the correlation between gene expression and the phenotypic distinction
© Yu et al. 2009
Received: 13 May 2009
Accepted: 25 August 2009
Published: 25 August 2009
In the post-genomic era, the development of high-throughput gene expression detection technology provides huge amounts of experimental data, which challenges the traditional pipelines for data processing and analysis in scientific research.
In our work, we integrated gene expression information from the Gene Expression Omnibus (GEO), biomedical ontology from Medical Subject Headings (MeSH) and signaling pathway knowledge from sigPathway entries to develop a context mining tool for gene expression analysis, named GEOGLE. GEOGLE offers a rapid and convenient way to search for relevant experimental datasets, pathways and biological terms according to multiple types of queries, including biomedical vocabularies, GDS IDs, gene IDs, pathway names and signature lists. Moreover, GEOGLE summarizes the signature genes from a subset of GDSes and estimates the correlation between gene expression and the phenotypic distinction with an integrated p value.
This approach, which performs a global search of expression data, may expand the traditional way of collecting heterogeneous gene expression experiment data. GEOGLE is a novel tool that gives researchers a quantitative way to understand the correlation between gene expression and phenotypic distinction through meta-analysis of gene expression datasets from different experiments, as well as the biological meaning behind it. The web site and user guide of GEOGLE are available at: http://omics.biosino.org:14000/kweb/workflow.jsp?id=00020
The rapid development of high-throughput gene expression detection technology provides a huge amount of experimental data for advanced research on associating gene expression signatures with biological phenotypes. The application of microarrays to identify gene expression signatures of human diseases has been widely accepted [1, 2]. Accordingly, a vast number of tools for microarray analysis are available, including ArrayPipe, GEPAS, GeneTrailExpress, and recently reported Perl modules for microarray analysis. In addition, gene set analysis has become prominent in microarray analysis. Gene sets are usually defined as sets of genes that function in concert; detailed analysis of them can lead to a functional-level map of the transcriptome data. Popular gene set analysis tools include Babelomics and WebGestalt. Furthermore, to address the problems of limited samples in a single biological experiment and the heterogeneity of gene expression datasets from different sources, methods for large-scale meta-analysis of microarray data have been developed [9–11]. Tools for meta-analysis-like studies, such as Connectivity Map, require a huge amount of supporting data resources and associated information from existing biological databases. There is a clear need to retrieve associated datasets for meta-analysis efficiently, avoiding manual mining of a large number of references.
The Gene Expression Omnibus (GEO), curated by the National Center for Biotechnology Information (NCBI), was designed in response to this demand as a public warehouse for the submission, storage and retrieval of high-throughput gene expression and genomic hybridization experiments. Several tools and strategies for operating the GEO database have been developed to enable comparisons of microarray data across experimental platforms, different laboratories and multiple species [14–18]. However, most of these tools for retrieving data from the GEO repository pay little attention to mining further information about the gene expression signatures, such as linking to the biological functions of genes or integrating the related pathway information in the biological processes. The National Library of Medicine's controlled vocabulary thesaurus (MeSH) is one of the best resources for biomedical vocabularies. MeSH is useful as an index that links experimental conditions and biological concepts, including disease phenotypes.
Focusing on this issue, we developed a state-of-the-art online bioinformatics tool, named GEOGLE, for mining the experimental data from the GEO database and constructing relationships among the datasets, genes, pathways and the genes' biological significance. Our system integrates information from multiple sources, such as sigPathway for pathway information and MeSH for biomedical vocabularies. Investigators can query with multiple types of data, including disease information, gene symbols, pathway names, expression datasets (GDS IDs), and signature lists, to search a large collection of related microarray information. GEOGLE introduces an integrated p value, which can be considered an estimate of the correlation between gene expression and the phenotypic distinction. This mining technology may have great value in discovering the linkages between known phenotypes and experimental data, as well as in retrieving suitable datasets for further research.
The analysis in GEOGLE consists of two major parts: meta-analysis that integrates literature information, and similarity search for signatures and datasets. The gene expression data and the basic signature for each dataset are derived from public expression data warehouses, such as GEO. MeSH terms are used as the vocabulary dictionary for associating gene expression data with other biological terms, such as pathways and diseases. Dataset searching is mainly based on checking whether synonyms from the descriptions of the MeSH terms relevant to a pathway or disease appear in the dataset annotation. Through literature searching and dataset filtering, summarized signatures are available from the integration of GEO, MeSH and sigPathway. For the second part, signature similarity search, a similar method has been used in Connectivity Map. GEOGLE searches the databases for similar datasets that share the same signatures. By associating the attributes of these datasets, GEOGLE summarizes the common features and suggests potential relationships between genes and diseases.
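The synonym-based dataset search described above can be sketched roughly as follows. This is a minimal illustration, not GEOGLE's implementation: the vocabulary, the annotations, the dataset identifiers and the function names `build_index` and `search` are all hypothetical, and whole-word, case-insensitive matching is an assumed matching rule.

```python
import re

# Hypothetical MeSH-like vocabulary: preferred term -> recorded synonyms.
MESH_SYNONYMS = {
    "Smoking": ["smoking", "cigarette smoke", "tobacco smoking"],
    "Lung Neoplasms": ["lung cancer", "lung neoplasm", "pulmonary neoplasm"],
}

# Hypothetical dataset annotations keyed by made-up GDS-style identifiers.
GDS_ANNOTATIONS = {
    "GDS_A": "Cigarette smoke effect on the bronchial epithelium.",
    "GDS_B": "Lung cancer cell line response to drug treatment.",
    "GDS_C": "Heat shock time course in yeast.",
}


def build_index(synonyms, annotations):
    """Map each vocabulary term to the datasets whose annotation mentions a synonym."""
    index = {}
    for term, names in synonyms.items():
        # Assumed rule: whole-word, case-insensitive match on any synonym.
        alternatives = "|".join(re.escape(name) for name in names)
        pattern = re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)
        index[term] = {gds for gds, text in annotations.items() if pattern.search(text)}
    return index


def search(query, synonyms, index):
    """Return sorted dataset IDs for every term that the query names directly or by synonym."""
    q = query.strip().lower()
    hits = set()
    for term, names in synonyms.items():
        if q == term.lower() or q in (name.lower() for name in names):
            hits |= index[term]
    return sorted(hits)


INDEX = build_index(MESH_SYNONYMS, GDS_ANNOTATIONS)
print(search("tobacco smoking", MESH_SYNONYMS, INDEX))
```

Building the term-to-dataset index once and resolving queries against it mirrors the idea of precomputed linkages between vocabulary and datasets; a query that is itself a synonym still reaches datasets annotated with a different synonym of the same term.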
These data are processed by a second integrating and indexing engine. Three kinds of relationships were constructed through this engine and stored in the GEOGLE database: the linkage between gene and experiment dataset, between pathway and dataset, and between vocabulary and dataset. The linkage between gene and experiment dataset was represented in two ways: the individual p value, which estimates the significance of differential gene expression in one dataset, and the integrated p value, which estimates the correlation between gene expression and a phenotypic distinction that may span several datasets with similar phenotypes. The algorithm for calculating the integrated p value is presented in the next paragraph. To construct the linkage between pathway and dataset, signatures from datasets were mapped to pathways based on information from the sigPathway dataset. MeSH terms were mapped to the experiments' annotations by context mining of the recorded synonyms in the GDS descriptions.
The procedure for calculating the integrated p value consists of two major parts. Firstly, the p value of each gene in each individual dataset was calculated with the SAM method, as mentioned before. Secondly, a novel procedure was developed to calculate an integrated p value for evaluating the relationship between a signature and a group of referred datasets (reflecting a phenotype). Steps (1) to (5) were performed:
(1) The p values of different genes in each dataset were organized into a vector. Then a gene – GDS matrix (named P gc, where gc stands for gene – condition) was generated from the set of p value vectors calculated independently from different GDSes. Each element in P gc represents a p value prepared beforehand using SAM.
(3) The Z score was then summarized from Z gc_sub with the function:
$$Z = \frac{\sum_{i=1}^{n} z_i}{\sqrt{n}}$$
where the z_i are the elements of Z gc_sub and n is the number of elements in Z gc_sub.
(4) A new p value was calculated to represent the significance of the Z score using the cumulative distribution function of the normal distribution (as mentioned in (2)). Let a parameter ('alpha') be the threshold of the p value from this test.
(5) If these genes were not signatures of a group of GDSes, P gc_sub would follow a uniform distribution. If P gc_sub followed a uniform distribution, Z gc_sub would follow a normal distribution. As a result, the Z score would also follow a normal distribution. A significantly small value of Z relative to the normal distribution corresponds to a significant perturbation of these genes under these conditions. The p value from this test is considered the integrated p value of the whole searching task. The integrated p value lets us judge whether a certain gene should be considered a signature in the group of GDSes.
(6) Judging the relationship between candidate signatures and vocabularies is very similar to the procedure from (1) to (5) described above. Each vocabulary (MeSH term) contains a group of expression datasets (GDSes). The integrated p value for the significance of the correlation between a signature and a certain vocabulary is equal to that for the correlation between the signature and the GDSes in the vocabulary.
(7) The next step is to evaluate the relationship between a pathway and a group of genes (signatures). We used a very similar procedure with some modification. We constructed a pathway – gene matrix (named P pg, where pg stands for pathway – gene). The relationships between pathways and genes were derived from sigPathway. Each element in this matrix is the integrated p value of a gene in a group of GDSes (this group is determined by the previous GDS search). Then the procedure from (1) to (5) was repeated, with P pg taking the place of P gc as the initial input. The newly calculated integrated p values were considered to be the estimate of the significance of the pathways in the searching task.
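The steps above can be sketched in a few lines of Python. This is a hedged illustration under stated assumptions, not the authors' code: it assumes that each p value is converted to a Z value with the inverse cumulative distribution function of the standard normal distribution, and that the Z values are combined as their sum divided by the square root of n, so that the combined Z is standard normal when the p values are uniform. The function names, the clipping bound `EPS`, and the matrix contents are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1
EPS = 1e-12  # assumed clipping bound so the inverse CDF stays finite


def integrated_p(p_values):
    """Combine a list of p values into one integrated p value."""
    clipped = [min(max(p, EPS), 1.0 - EPS) for p in p_values]
    # Assumed transform: p value -> Z value via the inverse normal CDF.
    z_values = [STD_NORMAL.inv_cdf(p) for p in clipped]
    # Assumed summary: Z = sum(z) / sqrt(n), standard normal under the null.
    z = sum(z_values) / sqrt(len(z_values))
    # Lower tail: many small p values give a strongly negative Z.
    return STD_NORMAL.cdf(z)


def gene_integrated_p(p_gc, gds_group):
    """Integrated p value per gene over the selected group of datasets."""
    result = {}
    for gene, row in p_gc.items():
        sub = [row[gds] for gds in gds_group if gds in row]
        if sub:
            result[gene] = integrated_p(sub)
    return result


def pathway_integrated_p(pathway_genes, gene_p):
    """Repeat the combination with gene-level integrated p values as input."""
    result = {}
    for pathway, genes in pathway_genes.items():
        sub = [gene_p[g] for g in genes if g in gene_p]
        if sub:
            result[pathway] = integrated_p(sub)
    return result


# Hypothetical gene x GDS matrix of per-dataset p values (illustrative numbers).
P_GC = {
    "geneA": {"GDS_1": 0.01, "GDS_2": 0.02, "GDS_3": 0.03},
    "geneB": {"GDS_1": 0.50, "GDS_2": 0.50, "GDS_3": 0.50},
}
GENE_P = gene_integrated_p(P_GC, ["GDS_1", "GDS_2", "GDS_3"])
PATHWAY_P = pathway_integrated_p({"pathwayX": ["geneA", "geneB"]}, GENE_P)
print(GENE_P, PATHWAY_P)
```

In this toy input, the gene with consistently small p values across the group receives an integrated p value far below any single one of them, while the gene with p values of 0.5 everywhere stays at 0.5. Reusing `integrated_p` on the gene-level results mirrors the pathway step, where the pathway – gene matrix replaces the gene – GDS matrix as input.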
Some optional parameters can be set by users. For instance, in Vocabulary Miner, setting 'F' (false) for 'list_Mesh_GDS.only' makes the miner search for additional information about genes and pathways according to the query. Setting 'T' for this parameter gives a more efficient search, at the cost of performing no pathway and gene information searching. Another common parameter is 'alpha', which sets the threshold for the integrated p value used to estimate the correlation between gene expression and experiment datasets. GEOGLE provides a task management system that lets users review the states of their previously conducted tasks and retrieve the results later; results are saved temporarily on the server. The detailed processing pipelines of these miners, a step-by-step tutorial for using GEOGLE, and an explanation of the input and results can be found in the supplementary material.
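The effect of these two parameters can be illustrated with a small sketch. The function `vocabulary_miner` below is a hypothetical stand-in, not GEOGLE's interface; its arguments, the toy data, and the rule that a gene passes when its integrated p value is at most alpha are assumptions made for illustration.

```python
def vocabulary_miner(term_to_gds, gene_p, query, list_mesh_gds_only=True, alpha=0.05):
    """Hypothetical stand-in showing how the two options shape a result."""
    # Datasets linked to the queried vocabulary are always returned.
    result = {"gds": sorted(term_to_gds.get(query, []))}
    if not list_mesh_gds_only:
        # 'F': also report genes whose integrated p value passes the threshold
        # (assumed rule: p <= alpha).
        result["genes"] = sorted(g for g, p in gene_p.items() if p <= alpha)
    return result


# Hypothetical precomputed linkages and integrated p values.
TERM_TO_GDS = {"smoking": {"GDS_1", "GDS_2"}}
GENE_P = {"geneA": 0.001, "geneB": 0.04, "geneC": 0.30}

fast = vocabulary_miner(TERM_TO_GDS, GENE_P, "smoking", list_mesh_gds_only=True)
full = vocabulary_miner(TERM_TO_GDS, GENE_P, "smoking", list_mesh_gds_only=False, alpha=0.01)
print(fast)
print(full)
```

With the flag set to true the result lists datasets only; with it set to false, lowering `alpha` narrows the reported genes to the most strongly supported ones.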
Using 'Vocabulary Miner', we searched for 'smoking'-related gene expression data in human and obtained 4 GDSes as candidate datasets (GDS1304, GDS1436, GDS1673 and GDS534). According to their annotation, these GDSes are all related to the effect of cigarette smoking and are suitable for further meta-analysis. Terms like 'Breast Cancer/Estrogen Receptor Signaling' and 'Stress Response to Cellular Damage' are returned with significant p values in the pathway section of the results, which suggests that these pathways are closely related to the 'smoking' phenotype. Genes such as GALNT1 are identified as signature genes across all these datasets. According to previous reports, GALNT1 is strongly associated with tobacco use and the risk of lung cancer [21, 22]. In our previous work, GEOGLE served as the main tool for expression data analysis associated with metabolomics data, which revealed distinct variations related to nicotine consumption in humans. The combination of several miners provided not only suitable expression datasets but also candidate genes that might be related to the influence of smoking. The gene for alkylglycerone phosphate synthase (alkyl-DHAP, or AGPS) was found to be strongly down-regulated in the lung tissue of human smokers. This is consistent with metabolic profiling. The down-regulation of this gene was found to influence both the ether lipid and glycerophospholipid pathways, and to shift the ratios of plasmalogens to diacyl-phosphatidylcholines.
In this report we introduce GEOGLE, an online web service for GEO dataset mining and biomedical information integration. GEOGLE provides an efficient way for users to search for related experiment datasets according to their own research interests with various types of input. Another significant feature of GEOGLE is the novel concept of an integrated system for signatures, pathways, biological terms and disease information. Public data warehouses such as GEO are high-quality resources for the automatic mining and integration of gene expression datasets and reference literature that GEOGLE performs, which is a substantial improvement over manually collecting experimental data for biological research. A few tools for operating the GEO database already exist. For example, Oncomine [24, 25] is a previously published cancer gene expression analysis platform. CleanEx also contains experiment datasets re-annotated with MeSH terms and some on-line analysis tools for gene expression data. Compared with these tools, GEOGLE has some distinguishing features and adds value for this kind of study. Firstly, the main objective of GEOGLE is to search for candidate datasets from different experiments for further meta-analysis, according to certain biological vocabularies and/or genes of interest. Secondly, GEOGLE provides a quantitative method to evaluate the correlation between each gene and a series of gene expression datasets that might represent a certain phenotypic distinction. Thirdly, GEOGLE collects a wide range of information about different kinds of diseases, including cancer (over 60,000 MeSH terms are involved). Fourthly, GEOGLE performs further mining for related gene function information, pathway annotation and reference knowledge, and introduces an integrated p value for estimating the correlation between gene expression and the phenotypic distinction. Fifthly, GEOGLE allows multiple types of input, such as keywords, datasets, pathways, genes and user-defined signatures. Technically, a modular design allows each part of GEOGLE to be replaced by a more advanced one; for instance, another BLAST engine with higher accuracy could be used for the similarity search. The container of GEOGLE (Omics Explorer) is hosted on a standard online service platform supported by InforSense Ltd. Thus no separate GUI is needed for GEOGLE's online user interface. In addition, the GEOGLE database can be easily updated to keep it synchronized with public gene expression databases.
Further steps in the development of GEOGLE should focus on the integration of high-throughput gene expression databases other than GEO, such as ArrayExpress and the Stanford Microarray Database (SMD). One improvement in progress is a large-scale effort to mine gene and disease information from reference databases and to integrate this information with existing signature data. The reference mining results are expected to help verify the reliability of the relationships between signatures and diseases discovered by GEOGLE. Moreover, since GEOGLE provides a potential network of diseases, genes and pathways, further analysis of this network will be considered in the future.
Project name: GEOGLE
Project home page: http://omics.biosino.org:14000/kweb/workflow.jsp?id=00020
Operating system(s): Developed on Linux; access is platform independent
Programming language: Java 1.5 and R 2.5.1
Other requirements: Internet Explorer, Firefox or Safari is required to access the website.
GEO: Gene Expression Omnibus
MeSH: Medical Subject Headings
SAM: Significance Analysis of Microarray
NCBI: National Center for Biotechnology Information
Funding: The 863 Hi-Tech Program of China (863) (grant 2007AA02Z304, 2006AA020406), the Shanghai Committee of Science and Technology (Grant 07dz22004, 08JC1416600) and Research Program of CAS (grant KSCX2-YW-R-112).
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.