GOParGenPy: a high throughput method to generate Gene Ontology data matrices
© Kumar et al.; licensee BioMed Central Ltd. 2013
Received: 21 March 2013
Accepted: 11 July 2013
Published: 8 August 2013
Gene Ontology (GO) is a popular standard in the annotation of gene products and provides information related to genes across all species. The structure of GO is dynamic and is updated on a daily basis. However, the popular existing methods use outdated versions of GO. Moreover, these tools are slow to process large datasets consisting of more than 20,000 genes.
We have developed GOParGenPy, a platform independent software tool to generate the binary data matrix showing the GO class membership, including parental classes, of a set of GO annotated genes. GOParGenPy is at least an order of magnitude faster than popular tools for Gene Ontology analysis and it can handle larger datasets than the existing tools. It can use any available version of the GO structure and allows the user to select the source of GO annotation. GO structure selection is critical for analysis, as we show that GO classes have rapid turnover between different GO structure releases.
GOParGenPy is an easy to use software tool which can generate sparse or full binary matrices from GO annotated gene sets. The obtained binary matrix can then be used with any analysis environment and with any analysis methods.
Keywords: Gene Ontology, Large-scale datasets, Data Mining, Machine learning, Bioinformatics
Gene Ontology (GO) is a popular standard in the annotation of gene products, providing information related to genes across all species. It presents a shared, controlled, structured vocabulary of terms that describe gene products. GO is structured as a Directed Acyclic Graph (DAG) whose terms describe the molecular function, biological process, and cellular component of a gene product. Its hierarchy runs from more specific terms to more general ones.
GO is currently being used for various analysis tasks like a) over-representation of the GO classes from a selected group of genes, b) semantic similarity between two genes, c) threshold free gene set analysis [4, 5], d) machine learning to classify unknown genes to various GO categories [6, 7], and e) explorative analysis of large-scale datasets. The linking of the reported GO categories to the GO DAG structure and their parent nodes is critical for all these tasks. In tasks a, c, d, and e, the parental nodes of the GO structure provide different levels of detail, allowing simultaneous monitoring of very detailed and very broad functional classes. In the case of task b, the link to the GO hierarchy is crucial for finding a path between the two genes across the GO graph.
There are many existing methods [9-13] freely available for processing (i.e. linking gene products to the GO hierarchy) and analyzing Gene Ontology terms. Most of these tools perform well enough on small data sets, but on a larger scale, such as in the case of microarray data, their execution time becomes prohibitive. Moreover, most of these methods use quite old GO structures, causing them to miss a large proportion of the currently used GO classes (see Results).
The AnnotationDbi and GO.db packages in Bioconductor are the most widely used tools for Gene Ontology analysis in the R environment. GO.db stores links from GO classes to their parent GO classes, keeping all the GO classes together with their parent, child and ancestor terms in a database for easy retrieval and processing. Although the Gene Ontology consortium updates GO class annotations and linkages on a daily basis, these GO related R packages are updated only biannually. Indeed, the best source of GO information is the annotation files themselves, which are available from the GO consortium web pages.
Even the GO consortium cannot help with research carried out with novel species. This is critical as we can expect a growing number of novel sequenced species with next-generation sequencing methods. These will require in-house GO annotation of sequences. Also, with the analysis of more exotic organisms, there might be alternative sources for GO annotations, like species specific databases. Current GO processing tools use only a pre-fixed annotation source for analysis.
We present a fast Python program named GOParGenPy (GO Parent Generation Python) that can process large annotation files, incorporate any version of the OBO structure and generate GO data matrices. Users of GOParGenPy will mainly be biologists and bioinformaticians who do their analysis in languages such as R, Matlab or Python. It is freely available from the project web page (see Availability and requirements).
Reads the ‘gene_ontology_edit.obo’ file in standard format, parses it and stores all the GO classes and their attributes.
Reads the GO annotations of the analyzed genes (various input formats are supported).
Links GO annotations to their parent GO classes. The linking also looks up alternative ids for GO classes that have become obsolete.
Outputs a list of genes with added parent GO classes.
Outputs a sparse or full matrix with genes as rows and GO classes as columns. The sparse matrix is the default format.
The OBO flat file format stores GO classes and attributes such as id, name, namespaces, definition, etc. OBO file GO classes and their respective attribute values are stored in a hash table using the numeric part of GO id as keys. Hence, the parent or ancestor class(es) for any given GO class can be retrieved recursively by looking through the attribute values of GO classes, namely ‘is_a’, ‘part_of’ and ‘consider’ links.
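The hash table described above can be illustrated with a minimal Python sketch. It is not GOParGenPy's own code: the function names `parse_obo` and `numeric_key`, the layout of the stored attributes, and the toy terms in `SAMPLE_OBO` are assumptions made for this example only.

```python
from collections import defaultdict

# Toy OBO fragment; the three terms and their names are invented for illustration.
SAMPLE_OBO = """\
format-version: 1.2

[Term]
id: GO:0000001
name: root process
namespace: biological_process

[Term]
id: GO:0000002
name: child process
namespace: biological_process
is_a: GO:0000001 ! root process

[Term]
id: GO:0000003
name: grandchild process
namespace: biological_process
is_a: GO:0000002 ! child process
relationship: part_of GO:0000001 ! root process

[Typedef]
id: part_of
name: part of
"""


def numeric_key(go_id):
    """Return the numeric part of a GO id, e.g. 'GO:0048308' -> '0048308'."""
    return go_id.split(":", 1)[1]


def parse_obo(lines):
    """Parse the [Term] stanzas of an OBO file into a dict keyed by numeric GO id."""
    terms = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if line.startswith("["):
            # A new stanza starts; only [Term] stanzas are kept.
            current = defaultdict(list) if line == "[Term]" else None
            continue
        if current is None or ": " not in line:
            continue
        tag, value = line.split(": ", 1)
        value = value.split(" ! ", 1)[0].strip()  # drop the trailing '! name' comment
        if tag == "id":
            terms[numeric_key(value)] = current
        elif tag in ("is_a", "consider", "alt_id"):
            current[tag].append(numeric_key(value))
        elif tag == "relationship" and value.startswith("part_of "):
            current["part_of"].append(numeric_key(value.split()[1]))
        elif tag in ("name", "namespace", "is_obsolete"):
            current[tag] = value
    return terms


terms = parse_obo(SAMPLE_OBO.splitlines())
print(terms["0000003"]["is_a"], terms["0000003"]["part_of"])
```

With the classes stored this way, the direct parents of a class are the union of its ‘is_a’, ‘part_of’ and ‘consider’ lists, and ancestors follow by repeating the lookup on each parent.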
Next, the intermediate file obtained in the first step is iterated over so that, for each gene and its GO classes, all parent or ancestor GO classes are retrieved recursively using the above hash table. Redundant steps are avoided by a second hash table that is built dynamically as the iteration progresses through the file. This table stores each GO class together with all its parent or ancestor classes, so that when the same GO class is encountered again the retrieval does not have to go back to the first GO hash table. Thus, at any instant the maximum size of this data structure is the total number of GO classes present in the given OBO file. Hence, beyond a certain stage the processing of the input annotation file becomes independent of the number of genes and their associated GO annotations.
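The role of the second hash table can be sketched in Python as follows. This is an illustrative reading of the description above, not the program's actual implementation; the names `ancestors`, `parents`, `cache` and the toy genes and classes are hypothetical.

```python
def ancestors(go_key, parents, cache):
    """Return every ancestor of go_key, storing each result in cache."""
    if go_key in cache:
        # Seen before: no need to walk the parent table again.
        return cache[go_key]
    found = set()
    for parent in parents.get(go_key, ()):
        found.add(parent)
        found |= ancestors(parent, parents, cache)
    cache[go_key] = found
    return found


# Toy parent table (direct parents only) and toy gene annotations.
parents = {
    "0000001": [],
    "0000002": ["0000001"],
    "0000003": ["0000002", "0000001"],
}
annotations = {
    "geneA": ["0000003"],
    "geneB": ["0000002", "0000003"],
}

cache = {}  # grows to at most one entry per GO class
expanded = {}
for gene, classes in annotations.items():
    full = set(classes)
    for go_key in classes:
        full |= ancestors(go_key, parents, cache)
    expanded[gene] = full

print(sorted(expanded["geneA"]), len(cache))
```

Because `cache` never holds more entries than there are GO classes, repeated annotations in a large input file are resolved by a single dictionary lookup once each class has been seen.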
Moreover, the program also does a lookup in the OBO file of alternate ids for any GO class which has become obsolete in order to retrieve parent/ancestor classes also in these cases. This functionality is optional.
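One possible way to implement such an optional lookup is sketched below in Python. The text does not specify how GOParGenPy resolves these ids, so the logic here is an assumption: ids that appear only as an `alt_id` are mapped to the class that declares them, and obsolete classes are replaced by their ‘consider’ targets. The names `build_alt_index`, `resolve` and `use_alternatives`, and the toy classes, are hypothetical.

```python
def build_alt_index(terms):
    """Map every alternate id to the key of the class that declares it."""
    return {
        alt: key
        for key, term in terms.items()
        for alt in term.get("alt_id", [])
    }


def resolve(go_key, terms, alt_index, use_alternatives=True):
    """Return the current class keys that stand in for go_key (possibly none)."""
    term = terms.get(go_key)
    if term is not None and not term.get("is_obsolete", False):
        return [go_key]  # a live class needs no substitution
    if not use_alternatives:
        return []  # the optional lookup is switched off
    if term is None:
        # The id is not a primary id; it may be an alternate id of a live class.
        primary = alt_index.get(go_key)
        return [primary] if primary is not None else []
    # The class is obsolete; fall back to its live 'consider' targets.
    return [
        key
        for key in term.get("consider", [])
        if key in terms and not terms[key].get("is_obsolete", False)
    ]


# Toy classes, invented for illustration.
terms = {
    "0000010": {"name": "current class", "alt_id": ["0000011"]},
    "0000012": {"name": "retired class", "is_obsolete": True, "consider": ["0000010"]},
}
alt_index = build_alt_index(terms)

print(resolve("0000011", terms, alt_index), resolve("0000012", terms, alt_index))
```

The resolved keys can then be passed to the parent lookup in place of the original id, so that annotations made against retired ids still receive parent classes.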
The obtained sparse matrix can be further processed with standard analysis pipelines. The sparse matrix format is supported by many analysis environments, like R and Matlab.
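As a rough illustration of the two output forms, the Python sketch below builds a binary gene-by-class matrix as a list of (row, column) coordinates and, optionally, as a full matrix. The exact file layout written by GOParGenPy is not described here, so the coordinate representation and the names `to_sparse` and `to_full` are assumptions for this example.

```python
def to_sparse(expanded):
    """Return row labels, column labels and the (row, col) positions of the ones."""
    genes = sorted(expanded)
    classes = sorted(set().union(*expanded.values()))
    col_of = {go_key: j for j, go_key in enumerate(classes)}
    entries = [
        (i, col_of[go_key])
        for i, gene in enumerate(genes)
        for go_key in sorted(expanded[gene])
    ]
    return genes, classes, entries


def to_full(n_rows, n_cols, entries):
    """Expand (row, col) positions into a dense 0/1 matrix."""
    matrix = [[0] * n_cols for _ in range(n_rows)]
    for i, j in entries:
        matrix[i][j] = 1
    return matrix


# Toy input: genes with their GO classes, parents already added.
expanded = {
    "geneA": {"0000001", "0000002", "0000003"},
    "geneB": {"0000001", "0000002"},
}

genes, classes, entries = to_sparse(expanded)
full = to_full(len(genes), len(classes), entries)
print(entries)
print(full)
```

Only the positions of the ones are stored in the sparse form, which is why it stays compact for large gene sets where each gene belongs to a small fraction of all GO classes. Coordinate lists of this kind map directly onto sparse matrix constructors in common environments, bearing in mind that R and Matlab use 1-based indices.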
Instability of OBO files
OBO files are central to all GO analysis. However, they vary significantly between GO analysis tools, with DAVID using version 6.7, agriGO using version 1.2 and GO.db/AnnotationDbi from R/Bioconductor using a biannually updated version.
DAVID uses the OBO file dated 01.12.2009.
GO.db uses the OBO file dated 01.03.2011.
agriGO uses the OBO file dated 01.04.2010.
The reference OBO file against which these packages are compared is dated 01.02.2012.
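A comparison of this kind can be sketched in Python by collecting the non-obsolete class ids from two OBO releases and taking set differences. This is only an illustration of the idea, not the procedure used to produce the reported figures; the names `live_term_ids` and `compare_versions` and the two toy releases are hypothetical.

```python
def live_term_ids(lines):
    """Collect the ids of non-obsolete [Term] stanzas from OBO lines."""
    ids = set()
    current = None
    in_term = False
    for raw in lines:
        line = raw.strip()
        if line.startswith("["):
            in_term = line == "[Term]"
            current = None
        elif in_term and line.startswith("id: "):
            current = line[len("id: "):]
            ids.add(current)
        elif in_term and line == "is_obsolete: true":
            ids.discard(current)
    return ids


def compare_versions(old_ids, reference_ids):
    """Summarise how an older release differs from the reference release."""
    added = reference_ids - old_ids      # classes the old release does not know
    removed = old_ids - reference_ids    # classes no longer live in the reference
    return {
        "added": len(added),
        "removed": len(removed),
        "fraction_missing": len(added) / len(reference_ids),
    }


# Two toy releases, invented for illustration.
OLD_RELEASE = """\
[Term]
id: GO:0000001

[Term]
id: GO:0000002
"""
REFERENCE_RELEASE = """\
[Term]
id: GO:0000001

[Term]
id: GO:0000002
is_obsolete: true

[Term]
id: GO:0000003

[Term]
id: GO:0000004
"""

report = compare_versions(
    live_term_ids(OLD_RELEASE.splitlines()),
    live_term_ids(REFERENCE_RELEASE.splitlines()),
)
print(report)
```

Here `fraction_missing` is the share of reference classes that a tool built on the older release cannot report at all.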
Relative execution time
The execution time was compared only between the most widely used standalone packages. These are the GeneOntology package from the Bioperl Toolkit, and GO.db and AnnotationDbi from R/Bioconductor. The aim is to compare the performance of GOParGenPy with these packages in processing large datasets. Parent GO classes were generated by GOParGenPy using the current version of the GO structure (01.03.2013). First, the methods were tested with a randomly chosen set of 80975 genes from UNIPROT-GOA. This is 2-3 times the size of the largest genomes in gene expression analysis. Next, in order to measure the performance on an extremely large file, the tools were tested with all the GO annotated sequences (>21 million sequences) available from UNIPROT-GOA.
Comparative analysis of GO packages
Table: A comparative view of the change in the total number of GO classes per year throughout 9 years of OBO files data. Columns: Total GO classes; Change in number of GO classes per year; Average change in GO classes in 8 years.
Table: Execution time of various available packages. Columns: Number of genes; AnnotationDbi/GO.db package; GeneOntology package.
We present a new standalone software tool, GOParGenPy, for generating high-throughput GO data matrices from any selected input annotation file and any version of the OBO file. We have shown the importance of the OBO structure and presented an effective way of storing and retrieving GO classes and their attributes for any downstream analysis involving GO data. All the existing methods, whether web based applications or standalone offline tools, utilize an outdated OBO structure from the GO consortium. As shown in Figures 3 - 5, up to 25% of GO classes (in the case of DAVID) are outdated with respect to the current version of the OBO file. Hence, any downstream analysis that incorporates GO data obtained from these tools may lead to erroneous results.
GOParGenPy outperforms all these existing tools in terms of incorporating users’ choice of OBO structure and speed of generating GO data matrices. It is also able to process extremely large datasets. It incorporates a dynamic hash table that stores all GO classes from the input file with their parent GO classes retrieved from OBO structure. This unique feature enables generation of data matrices independent of size of input data as the maximum size of this hash table is the total number of GO classes present in the OBO structure file used. Hence, this makes GOParGenPy faster in the generation of GO data matrices for large gene sets. Also, GOParGenPy looks for alternative ids of those GO classes which have become obsolete or have their definition altered.
Although GOParGenPy does not perform any data analysis or visualization itself, its output files can be easily imported into environments like Matlab, R or Python. The output GO data can be used as input for various analysis tasks, such as prediction of new GO annotations with classifiers, visualization, or correlation analysis between GO data and large-scale data. Thus, GOParGenPy encourages modular thinking in bioinformatics.
GOParGenPy allows the user to select both the GO annotation file and the GO structure file. This allows the use of the latest annotation data files and the latest GO structure. However, it can also be used with older annotation files. This is useful when older work needs to be replicated, or when comparing against a method that uses an old GO structure.
Additionally, GOParGenPy's features can be extended to other ontology resources, and it has already been tested with Plant Ontology (PO). Its optional features can take any PO annotated gene list and the corresponding OBO file to generate a sparse binary matrix representation (see Project homepage).
GOParGenPy is a fast Python program for generating GO binary data matrices from an annotated set of genes. GOParGenPy outperforms existing tools by allowing any available version of the OBO structure and by handling large scale input annotation datasets with over 21 million annotated sequences. The output files can be easily incorporated into various platforms such as MATLAB, R or Python for further GO related downstream analysis.
Availability and requirements
Project name: GOParGenPy
Project homepage: http://ekhidna.biocenter.helsinki.fi/users/ajay/private/GOParGenPy.htm
Operating system: Platform independent
Programming language: Python version 2.5/2.6
License: Free for academic use
PT described the problem statement. AAK developed and implemented the software tool. LH and PT supervised the project. All authors contributed equally in writing the manuscript. All authors read and approved the final manuscript.
We would like to thank anonymous reviewers for their comments and Alan Medlar for help with language. We would also like to thank Biocenter Finland for funding this project.
- Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G: The gene ontology consortium. Nat Genet. 2000, 25 (1): 25-29. 10.1038/75556.
- Khatri P, Drăghici S: Ontological analysis of gene expression data: current tools, limitations, and open problems. Bioinformatics. 2005, 21 (18): 3587-3595. 10.1093/bioinformatics/bti565.
- Lord PW, Stevens RD, Brass A, Goble CA: Semantic similarity measures as tools for exploring the gene ontology. Pac Symp Biocomput. 2003, 601-612. ISSN 1793-5091.
- Törönen P, Ojala PJ, Marttinen P, Holm L: Robust extraction of functional signals from gene set analysis using a generalized threshold free scoring function. BMC Bioinforma. 2009, 10: 307. 10.1186/1471-2105-10-307.
- Ackermann M, Strimmer K: A general modular framework for gene set enrichment analysis. BMC Bioinforma. 2009, 10: 47. 10.1186/1471-2105-10-47.
- Peña-Castillo L, Tasan M, Myers CL, Lee H, Joshi T, Zhang C, Guan Y, Leone M, Pagnani A, Kim WK, Krumpelman C, Tian W, Obozinski G, Qi Y, Mostafavi S, Lin GN, Berriz GF, Gibbons FD, Lanckriet G, Qiu J, Grant C, Barutcuoglu Z, Hill DP, Warde-Farley D, Grouios C, Ray D, Blake JA, Deng M, Jordan MI, Noble WS, et al: A critical assessment of Mus musculus gene function prediction using integrated genomic evidence. Genome Biol. 2008, 9 (Suppl 1): S2. 10.1186/gb-2008-9-s1-s2.
- Radivojac P, et al: A large-scale evaluation of computational protein function prediction. Nat Methods. 2013, 10 (3): 221-7. 10.1038/nmeth.2340. Epub 2013 Jan 27.
- Nikkilä J, Törönen P, Kaski S, Venna J, Castrén E, Wong G: Analysis and visualization of gene expression data using Self-Organizing Maps. Neural Netw. 2002, 15 (8-9): 953-966.
- Beissbarth T, Speed TP: GOstat: Find statistically overrepresented Gene Ontologies within a group of genes. Bioinformatics. 2004, 20 (9): 1464-1465. 10.1093/bioinformatics/bth088.
- Carlson M, Falcon S, Pages H, Li N: AnnotationDbi: Annotation Database Interface. R package version 1.12.0.
- Carlson M, Falcon S, Pages H, Li N: GO.db: A set of annotation maps describing the entire Gene Ontology. R package version 2.5.
- da-Huang W, Sherman BT, Lempicki RA: Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc. 2009, 4 (1): 44-57.
- Du Z, Zhou X, Ling Y, Zhang Z, Su Z: agriGO: a GO analysis toolkit for the agricultural community. Nucl Acids Res. 2010, 38: W64-W70. 10.1093/nar/gkq310.
- Stajich JE, Block D, Boulez K, Brenner SE, Chervitz SA, Dagdigian C, Fuellen G, Gilbert JG, Korf I, Lapp H, Lehväslaiho H, Matsalla C, Mungall CJ, Osborne BI, Pocock MR, Schattner P, Senger M, Stein LD, Stupka E, Wilkinson MD, Birney E: The Bioperl toolkit: perl modules for the life sciences. Genome Res. 2002, 12 (10): 1611-1618. 10.1101/gr.361602.
- UNIPROT - GOA data set download link: http://www.ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.