GeneBins: a database for classifying gene expression data, with application to plant genome arrays
BMC Bioinformatics volume 8, Article number: 87 (2007)
To interpret microarray experiments, several ontological analysis tools have been developed. However, current tools are limited to specific organisms.
We developed a bioinformatics system to assign the probe set sequences of any organism to a hierarchical functional classification modelled on the KEGG ontology. The GeneBins database currently supports the functional classification of expression data from four Affymetrix arrays: Arabidopsis thaliana, Oryza sativa, Glycine max and Medicago truncatula. An online analysis tool to identify relevant functions is also provided.
GeneBins provides resources to interpret gene expression results from microarray experiments. It is available at http://bioinfoserver.rsbs.anu.edu.au/utils/GeneBins/
Microarrays enable us to study the expression of thousands of genes simultaneously, providing a comprehensive overview of gene activity in a given tissue. A number of ontological tools now support the functional interpretation of gene expression data by identifying significantly enriched Gene Ontology (GO) terms associated with a list of (differentially expressed) genes, such as Onto-Tools, BlastSets, NetAffx, ArrayXPath or FatiGO. However, Gene Ontology is a controlled vocabulary designed to organize information on molecular function, biological processes and cellular components, and thus does not directly reflect metabolic pathways. In addition, these tools are limited to organisms with well-annotated genomes.
We propose a new strategy that assigns genes to hierarchical categories (BINs) modelled on the ontology provided by the KEGG database. KEGG is a pathway-orientated database, which integrates the genes of many species. The top level of the classification contains four categories (metabolism, genetic information processing, environmental information processing and cellular processes); the next levels correspond to subcategories (e.g. metabolic pathways, multiprotein complexes, protein families, etc.) or to individual functions. By converting the entire KEGG Orthologous database into a new BIN structure (GeneBins), we define a generic hierarchical classification (i.e. not species-specific). Any protein-coding gene can then be assigned to a BIN in this ontology based on the similarity of its amino acid sequence to the sequences in four reference databases (KEGG, Clusters of Orthologous Groups (COG), Swiss-Prot and Gene Ontology), using the cross-references provided by KEGG. Based on this approach, GeneBins currently contains probe set assignments to the KEGG-based ontology for the Affymetrix arrays of Arabidopsis thaliana, Oryza sativa (rice) and the model legumes Glycine max (soybean) and Medicago truncatula (barrel medic).
Based on these assignments, we have developed an online tool to identify the significantly over- or under-represented metabolic pathways in a set of sequences using a method based on the hypergeometric distribution, as developed in the BlastSets system. This can, for example, be used to interpret sets of up- or down-regulated microarray sequences.
In addition, the classification system can be used in MapMan [11–13] to display gene expression data on images representing the functional context of these genes. For this purpose, GeneBins provides both the BIN structure and the file mapping probe sets to this ontology.
Construction and contents
The GeneBins database is a web-based tool combining a PostgreSQL database management system with a dynamic web interface based on PHP and Perl. Data pre-processing is implemented in Perl and statistical analyses are performed using Perl and the R statistical package.
The database contains three components:
The functional hierarchy (GeneBins structure) consists of two tables: the first contains the identifiers (BIN codes) and their descriptions (BIN names), and the second contains the hierarchical structure of the classification.
The reference databases with identifiers, description and protein sequences from KEGG Orthologous, COG, Swiss-Prot and the reference set of sequences provided by Gene Ontology.
The genome arrays containing data from the Affymetrix arrays. Each probe set is described by its identifier, the database from which the sequence used to design the probe set was taken, the accession number and description of a representative sequence, and the consensus sequence spanning from the most 5' to the most 3' probe position in the public Unigene cluster.
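The two-table layout of the functional hierarchy can be pictured with a minimal sketch. This is an illustration only: plain Python dictionaries stand in for the PostgreSQL tables, and the dotted BIN codes, the rows, and the helper names `lineage` and `describe` are hypothetical, not taken from GeneBins.

```python
# Table 1 (hypothetical rows): BIN code -> BIN name.
BIN_NAMES = {
    "1": "Metabolism",
    "1.1": "Carbohydrate Metabolism",
    "1.1.1": "Glycolysis / Gluconeogenesis",
    "2": "Genetic Information Processing",
}

# Table 2 (hypothetical rows): child BIN code -> parent BIN code.
BIN_PARENT = {
    "1.1": "1",
    "1.1.1": "1.1",
}


def lineage(bin_code):
    """Return the path of BIN codes from the top level down to bin_code."""
    path = [bin_code]
    while path[-1] in BIN_PARENT:
        path.append(BIN_PARENT[path[-1]])
    return list(reversed(path))


def describe(bin_code):
    """Render the lineage of a BIN as a readable chain of BIN names."""
    return " > ".join(BIN_NAMES[code] for code in lineage(bin_code))


print(describe("1.1.1"))
```

Keeping the names and the parent links in separate structures mirrors the description above: a BIN can be renamed without touching the hierarchy, and the hierarchy can be walked without loading descriptions.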
Probe sets are assigned to the GeneBins hierarchy based on their sequence similarity with amino acid sequences in the reference databases. BINs are linked to these sequences by the cross-references provided by KEGG. We used BLASTX to find the best matches (E-value < 10⁻⁸) for each consensus sequence of a given Affymetrix array in each reference database. From these we extracted cross-references to assign the probe set to the corresponding BIN in the GeneBins classification.
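The assignment step can be sketched roughly as follows, assuming the BLASTX hits have already been parsed into tuples. All identifiers, the tuple layout, and the function names are hypothetical placeholders; only the E-value cut-off of 10⁻⁸ and the one-best-match-per-reference-database rule come from the text.

```python
E_VALUE_CUTOFF = 1e-8  # threshold stated in the text


def best_hits(hits, cutoff=E_VALUE_CUTOFF):
    """Keep, for each (probe set, database) pair, the hit with the lowest
    E-value, ignoring hits whose E-value is not below the cutoff.

    hits: iterable of (probe_set, database, subject_id, e_value) tuples.
    """
    best = {}
    for probe_set, database, subject_id, e_value in hits:
        if e_value >= cutoff:
            continue
        key = (probe_set, database)
        if key not in best or e_value < best[key][1]:
            best[key] = (subject_id, e_value)
    return best


def assign_bins(hits, xrefs, cutoff=E_VALUE_CUTOFF):
    """Map each probe set to the BINs cross-referenced by its best hits.

    xrefs: dict from reference sequence identifier to a set of BIN codes.
    """
    assignments = {}
    for (probe_set, _database), (subject_id, _e) in best_hits(hits, cutoff).items():
        bins = xrefs.get(subject_id, ())
        if bins:
            assignments.setdefault(probe_set, set()).update(bins)
    return assignments


# Hypothetical example data (placeholder identifiers and BIN codes).
EXAMPLE_HITS = [
    ("probe_A", "KEGG", "kegg_seq_1", 1e-40),
    ("probe_A", "KEGG", "kegg_seq_2", 1e-12),       # weaker hit, same database
    ("probe_A", "Swiss-Prot", "sprot_seq_1", 1e-30),
    ("probe_B", "COG", "cog_seq_1", 1e-5),          # fails the cut-off
]
EXAMPLE_XREFS = {
    "kegg_seq_1": {"1.1.1"},
    "kegg_seq_2": {"1.2.3"},
    "sprot_seq_1": {"1.1.1", "2.1.2"},
}

print(assign_bins(EXAMPLE_HITS, EXAMPLE_XREFS))
```

In this sketch `probe_A` ends up in two BINs because its best matches in different reference databases cross-reference different BINs, while `probe_B` stays unclassified.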
As of August 2006, data for the Affymetrix arrays of four plants (Arabidopsis thaliana, Oryza sativa, Glycine max and Medicago truncatula) are available in the database (Table 1).
Utility and discussion
The GeneBins web interface can be used to search the classification of a given probe set or to analyse a list of identifiers according to their assignments in the hierarchy.
Search for classification
The classification of a probe set in a selected genome array can be retrieved by its Affymetrix probe set identifier or by the GenBank accession number of the representative sequence. The results of a database query provide information on the probe set sequence, its position in the functional hierarchy, and the BLAST matches (Figure 1). Note that a probe set can be assigned to more than one BIN. The cross-references associated with these BINs are displayed with a hyperlink to the entry in the corresponding database. The best BLAST matches are used to assign the probe set sequence to the BINs, provided that their E-value is below a pre-defined threshold (10⁻⁸).
Gene expression analysis
GeneBins can be used to identify the functional categories associated with a set of sequences (e.g. differentially expressed genes) and thus find the metabolic pathways or other cellular functions that are up- or down-regulated in microarray experiments. The list of probe set identifiers (Affymetrix probe set identifiers and/or GenBank accession numbers) belonging to a given genome array can be pasted into a text box or uploaded from a file on the GeneBins website.
To provide an overview of the functions affected, a bar plot representing the distribution of the submitted identifiers in the second level of the classification is displayed (Figure 2a). Note that the sum of the percentages can be more than 100% as a gene can be assigned to several BINs.
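A small sketch shows why the percentages can add up to more than 100%. It assumes dotted BIN codes in which the first two components identify the second-level category, and it uses the number of submitted identifiers as the denominator. Both are assumptions made for illustration, as are the identifiers and function names.

```python
def second_level(bin_code):
    """Truncate a dotted BIN code to its first two levels (assumed convention)."""
    return ".".join(bin_code.split(".")[:2])


def distribution(submitted, assignments):
    """Percentage of submitted identifiers found in each second-level BIN.

    An identifier is counted once per second-level BIN it belongs to, so an
    identifier assigned to several BINs contributes to several bars.
    """
    counts = {}
    for probe_set in submitted:
        categories = {second_level(b) for b in assignments.get(probe_set, ())}
        for category in categories:
            counts[category] = counts.get(category, 0) + 1
    total = len(submitted)
    return {category: 100.0 * n / total for category, n in counts.items()}


# Hypothetical example: p1 belongs to two different second-level BINs.
EXAMPLE_ASSIGNMENTS = {"p1": {"1.1.1", "2.3.1"}, "p2": {"1.1.4"}}
percentages = distribution(["p1", "p2"], EXAMPLE_ASSIGNMENTS)
print(percentages, sum(percentages.values()))
```

Here both identifiers fall into BIN 1.1 and one of them also falls into BIN 2.3, so the bars sum to 150%.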
To detect whether a functional category is statistically over-represented in the selected group of genes compared to the rest of the genome array, a p-value is calculated for every BIN in the classification using the hypergeometric distribution. This p-value is the probability that the overlap between the set of submitted sequences and the set of sequences belonging to the given BIN occurs by chance. The significance threshold can be specified, with a default cut-off of 0.05. Because multiple hypothesis tests are performed, the threshold can also be adjusted using a Bonferroni correction. The results page lists, by increasing p-value, the BINs with assigned probe sets belonging to the submitted group (Figure 2b). Significant BINs are highlighted. The list of all probe sets assigned to a given BIN can also be retrieved. This page can be bookmarked, as the results are stored for seven days, and can also be downloaded as a tabular file.
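As a minimal sketch of this kind of test, the over-representation p-value can be computed as the upper tail of the hypergeometric distribution. The one-sided form, the parameter names, and the way the Bonferroni correction divides the threshold by the number of BINs tested are assumptions for illustration. GeneBins itself performs these analyses in Perl and R, and its exact procedure is not spelled out here.

```python
from math import comb


def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) for a hypergeometric variable X.

    N: probe sets on the array, K: probe sets assigned to the BIN,
    n: submitted probe sets, k: submitted probe sets assigned to the BIN.
    """
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


def significant_bins(p_values, alpha=0.05, bonferroni=False):
    """Return the BINs whose p-value passes the (optionally corrected) threshold.

    p_values: dict from BIN code to p-value. With bonferroni=True the
    threshold is divided by the number of BINs tested.
    """
    threshold = alpha / len(p_values) if bonferroni and p_values else alpha
    return {bin_code for bin_code, p in p_values.items() if p <= threshold}


# Hypothetical numbers: 8 of 50 submitted probe sets fall in a BIN that
# holds 100 of the 20000 probe sets on the array.
p = hypergeom_upper_tail(k=8, n=50, K=100, N=20000)
print(p)
```

Exact integer binomial coefficients from `math.comb` avoid rounding problems in the tail sum; for an under-representation test the lower tail, P(X <= k), would be summed instead.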
In addition, the complete probe set classification for each organism can be downloaded in the appropriate MapMan format, so that gene expression data can be displayed on images representing the functional context of these genes (e.g. metabolic pathways). It is also available in an XML format that can be explored locally using any outliner.
In the near future, we plan to apply our approach to other Affymetrix arrays. The classification process will be improved by taking into account the domain composition of the proteins. We are currently developing an interface allowing the submission of a set of sequences (e.g. custom DNA microarrays) to be classified automatically.
Conclusion
GeneBins provides a hierarchical functional classification, modelled on the KEGG ontology, of probe set sequences of four plant Affymetrix arrays. Based on these assignments, an online analysis tool is available to interpret gene expression results from microarray experiments by identifying the most relevant pathways or functions involved in a submitted list of genes.
Availability and requirements
Access to GeneBins is via a web interface, freely available to all interested users, at http://bioinfoserver.rsbs.anu.edu.au/utils/GeneBins/
It has been tested to work with Safari 2.0, Mozilla Firefox 1.5 and Internet Explorer 6.0 web browsers and does not require any particular plug-in.
References
Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 2000, 25(1):25–29. 10.1038/75556
Draghici S, Khatri P, Bhavsar P, Shah A, Krawetz SA, Tainsky MA: Onto-Tools, the toolkit of the modern biologist: Onto-Express, Onto-Compare, Onto-Design and Onto-Translate. Nucleic Acids Res 2003, 31(13):3775–3781. 10.1093/nar/gkg624
Barriot R, Poix J, Groppi A, Barre A, Goffard N, Sherman D, Dutour I, de Daruvar A: New strategy for the representation and the integration of biomolecular knowledge at a cellular scale. Nucleic Acids Res 2004, 32(12):3581–3589. 10.1093/nar/gkh681
Cheng J, Sun S, Tracy A, Hubbell E, Morris J, Valmeekam V, Kimbrough A, Cline MS, Liu G, Shigeta R, Kulp D, Siani-Rose MA: NetAffx Gene Ontology Mining Tool: a visual approach for microarray data analysis. Bioinformatics 2004, 20(9):1462–1463. 10.1093/bioinformatics/bth087
Chung HJ, Park CH, Han MR, Lee S, Ohn JH, Kim J, Kim J, Kim JH: ArrayXPath II: mapping and visualizing microarray gene-expression data with biomedical ontologies and integrated biological pathway resources using Scalable Vector Graphics. Nucleic Acids Res 2005, 33(Web Server issue):W621–6. 10.1093/nar/gki450
Al-Shahrour F, Diaz-Uriarte R, Dopazo J: FatiGO: a web tool for finding significant associations of Gene Ontology terms with groups of genes. Bioinformatics 2004, 20(4):578–580. 10.1093/bioinformatics/btg455
Kanehisa M, Goto S, Kawashima S, Okuno Y, Hattori M: The KEGG resource for deciphering the genome. Nucleic Acids Res 2004, 32(Database issue):D277–80. 10.1093/nar/gkh063
Tatusov RL, Natale DA, Garkavtsev IV, Tatusova TA, Shankavaram UT, Rao BS, Kiryutin B, Galperin MY, Fedorova ND, Koonin EV: The COG database: new developments in phylogenetic classification of proteins from complete genomes. Nucleic Acids Res 2001, 29(1):22–28. 10.1093/nar/29.1.22
Bairoch A, Apweiler R, Wu CH, Barker WC, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez R, Magrane M, Martin MJ, Natale DA, O'Donovan C, Redaschi N, Yeh LS: The Universal Protein Resource (UniProt). Nucleic Acids Res 2005, 33(Database issue):D154–9. 10.1093/nar/gki070
Irizarry RA, Bolstad BM, Collin F, Cope LM, Hobbs B, Speed TP: Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Res 2003, 31(4):e15. 10.1093/nar/gng015
Thimm O, Blasing O, Gibon Y, Nagel A, Meyer S, Kruger P, Selbig J, Muller LA, Rhee SY, Stitt M: MAPMAN: a user-driven tool to display genomics data sets onto diagrams of metabolic pathways and other biological processes. Plant J 2004, 37(6):914–939. 10.1111/j.1365-313X.2004.02016.x
Usadel B, Nagel A, Thimm O, Redestig H, Blaesing OE, Palacios-Rojas N, Selbig J, Hannemann J, Piques MC, Steinhauser D, Scheible WR, Gibon Y, Morcuende R, Weicht D, Meyer S, Stitt M: Extension of the visualization tool MapMan to allow statistical analysis of arrays, display of corresponding genes, and comparison with known responses. Plant Physiol 2005, 138(3):1195–1204. 10.1104/pp.105.060459
Goffard N, Weiller G: Extending MapMan: application to legume genome arrays. Bioinformatics 2006, 22(23):2958–2959. 10.1093/bioinformatics/btl517
The R Project for Statistical Computing [http://www.R-project.org]
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol 1990, 215(3):403–410.
Cho RJ, Huang M, Campbell MJ, Dong H, Steinmetz L, Sapinoso L, Hampton G, Elledge SJ, Davis RW, Lockhart DJ: Transcriptional regulation and function during the human cell cycle. Nat Genet 2001, 27(1):48–54. 10.1038/83751
Bonferroni CE: Teoria statistica delle classi e calcolo delle probabilità. Pubblicazioni del Regio Istituto Superiore di Scienze Economiche e Commerciali di Firenze 1936, 8: 3–62.
Acknowledgements
This study was funded by an Australian Research Council Centre of Excellence grant. Funding to pay the Open Access publication charges for this article was provided by the same grant.
Authors' contributions
NG participated in the design, implemented the system and drafted the manuscript with revisions provided by GW. GW conceived and supervised the project. Both authors read and approved the final manuscript.