iGepros: an integrated gene and protein annotation server for biological nature exploration
© Zheng et al; licensee BioMed Central Ltd. 2011
Published: 14 December 2011
In the post-genomic era, transcriptomics and proteomics provide important information for understanding genomes. With the rapid development of high-throughput technology, transcriptomics and proteomics data are being generated at an unprecedented rate. This creates a need for software that can annotate these omics data and explore their biological nature. Over the past decade, several pioneering tools were presented to address this issue, but limitations remain. For example, some of these tools offer only a command-line interface, which is not suitable for users with little or no programming experience. In addition, some tools do not support large-scale gene and protein analysis.
To overcome these limitations, an integrated gene and protein annotation server named iGepros has been developed. The server provides user-friendly interfaces and detailed on-line examples, so most researchers, even those with little or no programming experience, can use it smoothly. Moreover, the server provides many functions for comparing transcriptomics and proteomics data. In particular, the server is built on a model-view-controller framework, which makes it easy to incorporate more functions in the future.
In this paper, we present a server with powerful capabilities not only for gene and protein functional annotation, but also for transcriptomics and proteomics data comparison. By applying the server, researchers can survey the biological characteristics behind gene and protein datasets and accelerate their investigation of the transcriptome and proteome. The server is publicly available at http://www.biosino.org/iGepros/.
In the post-genomic era, one of the important goals of biological research is to explain genome contexts and understand the function of genetic information [1]. Experiments at the transcription and translation levels are widely carried out to decipher the functions behind a genomic sequence. In recent decades, high-throughput technologies such as microarrays, next generation sequencing (NGS), and mass spectrometry (MS) have been introduced to meet these requirements. For example, microarray technology is often used to detect mRNA expression under a specific physiological condition [2, 3]. More recently, RNA-seq technology was developed to inspect RNA expression on the whole-genome scale [4, 5]. Mass spectrometry, in turn, is generally coupled with liquid chromatography to quantify protein expression levels [6, 7].
After a large-scale gene or protein set is obtained by preliminary analysis of the raw data from transcriptomics and proteomics experiments, the candidate genes and proteins are annotated to survey their biological characteristics. In this annotation process, collecting mappings of genes and proteins to Gene Ontology (GO) terms is regarded as the primary step, which helps researchers decipher their roles in terms of biological process, cellular component, or molecular function. Many GO annotation tools have been developed to browse, search, and access GO terms for genes and proteins. For example, Carbon and colleagues provided a web tool for online access to GO terms [8]. The Bioconductor community released several GO packages written in the R language, which provide a comprehensive annotation method for gene and protein sets [9, 10]. In addition, for a large gene or protein set, an enrichment analysis is necessary to determine whether the set is statistically significantly associated with particular GO terms [11]. In recent years, several pioneering tools have been reported for this purpose [12–14]. The KEGG pathway is regarded as another pivotal resource for annotating the functions of genes and proteins, and pathway information also helps researchers understand the relations between genes and proteins. In 2007, Moriya and colleagues offered a KEGG pathway annotation server for high-throughput data, which automatically generates pathways based on the annotation results of the input data [15]. Furthermore, an enrichment analysis on KEGG pathways is required to investigate the relationships between genes and proteins in depth. For example, a knowledge-base website named DAVID has been created for GO and pathway annotation [16]. The pioneering works mentioned above provide useful GO and pathway annotation tools for the biology community, but some limitations still exist, which prevent researchers from analyzing the large data sets produced by high-throughput technology.
First, some of these tools offer only a command-line mode, which is not convenient for users with little or no programming skill. Second, some tools do not support the analysis of a large data set at one time, so they are not suitable for annotating datasets generated by high-throughput technology. Finally, experiments performed at both the transcription and translation levels need integration software that can annotate gene and protein sets simultaneously and combine the annotation results. To our knowledge, no software is available for this purpose.
In this paper, we present an integrated web server with a user-friendly interface for gene and protein annotation. The server supports large datasets as inputs, so it can be used as an analysis tool for high-throughput experiments. In particular, it offers a powerful association module that combines the results of gene and protein annotations. This helps researchers effectively analyze datasets produced by assays performed at both the transcriptional and translational levels. In short, the server can help to explore the biological nature behind gene and protein data.
In practice, two data uploading methods were developed for large data set analysis. Users can either copy and paste their data, in a valid format, into the text box offered by the server, or choose a local data file in the proper format and upload it to the server. The controllers of the server then call the relevant programs to carry out the users' tasks. The results of those tasks are returned as legible tables, so that users can quickly read and save the outputs. The server was deployed on a cluster machine with 4 CPUs and 8 GB of memory to ensure its performance on large data sets. Currently, the server can handle a gene or protein list containing 300 database IDs at a time, and such a task can be completed within 20 minutes.
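The input handling described above can be illustrated with a minimal Python sketch. It is not the iGepros implementation: the function name `parse_id_list` and the accepted separators (whitespace, commas, semicolons) are assumptions made for illustration, since the exact input format is not specified here. Only the limit of 300 IDs per task comes from the text.

```python
import re

MAX_IDS = 300  # per-task limit reported in the text


def parse_id_list(text, max_ids=MAX_IDS):
    """Split pasted or uploaded text into unique database IDs.

    Assumption (not from the paper): IDs are separated by whitespace,
    commas, or semicolons. Order of first appearance is preserved.
    """
    tokens = [tok for tok in re.split(r"[\s,;]+", text) if tok]
    ids = list(dict.fromkeys(tokens))  # drop duplicates, keep order
    if not ids:
        raise ValueError("no identifiers found in the input")
    if len(ids) > max_ids:
        raise ValueError(
            f"{len(ids)} identifiers exceed the limit of {max_ids} per task"
        )
    return ids
```

The same function would serve both upload paths, because a pasted text box and the contents of an uploaded file both arrive as plain text.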
iGepros is a user-friendly web server that aims to provide powerful annotation tools for large gene and protein sets. Furthermore, the server offers useful association tools to connect gene and protein annotation results. It can combine the outcomes of transcriptomics and proteomics experiments, which helps researchers understand biological nature at the transcriptional and translational levels. To let researchers analyze massive data conveniently, three modules (named Gene module, Protein module, and Gepro module) were set up. The detailed functionality of these three modules is described in the following sections.
To demonstrate the efficiency of the iGepros server, we collected transcription-level and translation-level data from a published article [18], in which Hartl and colleagues investigated the mouse brain transcriptome and proteome at embryonic days 9.5, 11.5, and 13.5. Differentially expressed genes and proteins between days 9.5 and 11.5 were collected for this case study, comprising 35 genes and 52 proteins [see additional file 1].
First, we used the Gene module to annotate the 35 genes, applying the GO enrichment analysis tool to identify the GO terms in which these genes are concentrated. The results of the GO enrichment analysis are listed in additional file 2. According to these results, the genes were mainly associated with basic metabolic processes and nervous system development. These findings include the results reported in the published article.
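To make the idea of GO enrichment concrete, the following Python sketch scores each term with a one-sided hypergeometric test, a common choice for this kind of analysis. Which statistical test iGepros actually applies is not stated here, so this is an assumption; the function names and the toy gene and term identifiers are hypothetical.

```python
from math import comb


def hypergeom_sf(k, M, n, N):
    """P(X >= k) for X ~ Hypergeometric(M total, n annotated, N drawn)."""
    total = comb(M, N)
    upper = min(n, N)
    # math.comb returns 0 when the lower index exceeds the upper one,
    # so impossible configurations contribute nothing to the sum.
    hits = sum(comb(n, i) * comb(M - n, N - i) for i in range(k, upper + 1))
    return hits / total


def go_enrichment(study_genes, annotations):
    """Rank GO terms by enrichment in a study set.

    annotations: dict mapping gene ID -> set of GO term IDs; its keys
    are taken as the background population (an assumption of this sketch).
    Returns (term, study hits, background hits, p-value), smallest p first.
    No multiple-testing correction is applied here.
    """
    background = set(annotations)
    study = set(study_genes) & background
    term_to_genes = {}
    for gene, terms in annotations.items():
        for term in terms:
            term_to_genes.setdefault(term, set()).add(gene)
    results = []
    for term, genes in term_to_genes.items():
        k = len(genes & study)
        if k:
            p = hypergeom_sf(k, len(background), len(genes), len(study))
            results.append((term, k, len(genes), p))
    return sorted(results, key=lambda row: row[3])


# Toy data with made-up identifiers, for illustration only.
toy_annotations = {f"g{i}": {"TERM_A"} for i in range(1, 5)}
toy_annotations.update({f"g{i}": {"TERM_B"} for i in range(5, 11)})
toy_ranking = go_enrichment(["g1", "g2", "g5"], toy_annotations)
```

In practice the p-values would also need a multiple-testing correction, which is omitted to keep the sketch short.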
Then, we used the Protein module to annotate the 52 proteins, applying the protein annotation tool for KEGG pathways to identify the pathways in which the proteins are involved. The pathway information is summarized in additional file 3. These proteins mainly participated in basic metabolic pathways and were associated with several neurological disorders, such as Alzheimer's, Parkinson's, and Huntington's disease. This was consistent with the results of the GO enrichment analysis for the genes and with the results of the published article.
Connection between the 35 genes and the 52 proteins
Table: GO terms of genes and proteins associated together (table body not recovered)
Table: Pathway information of genes and proteins associated together (recovered rows: PPAR signaling pathway; Fc gamma R-mediated phagocytosis)
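The idea of connecting genes and proteins through shared GO terms or pathways can be sketched in Python as a join on the annotation term. This is an illustrative sketch, not the server's code: the function name `associate_by_term` is hypothetical, and the gene and protein identifiers in the example are made up. Only the two pathway names are taken from the text.

```python
from collections import defaultdict


def associate_by_term(gene_terms, protein_terms):
    """Link genes and proteins that share an annotation term.

    gene_terms / protein_terms: dict mapping an ID to a set of terms
    (GO terms or pathway names). Returns a dict mapping each shared
    term to (sorted gene IDs, sorted protein IDs).
    """
    def invert(mapping):
        inverted = defaultdict(set)
        for item, terms in mapping.items():
            for term in terms:
                inverted[term].add(item)
        return inverted

    by_gene = invert(gene_terms)
    by_protein = invert(protein_terms)
    shared = sorted(by_gene.keys() & by_protein.keys())
    return {t: (sorted(by_gene[t]), sorted(by_protein[t])) for t in shared}


# Hypothetical IDs; the pathway names are the ones mentioned in the text.
example_genes = {
    "geneA": {"PPAR signaling pathway"},
    "geneB": {"Fc gamma R-mediated phagocytosis"},
}
example_proteins = {
    "protX": {"PPAR signaling pathway"},
    "protY": {"Fc gamma R-mediated phagocytosis", "PPAR signaling pathway"},
}
example_links = associate_by_term(example_genes, example_proteins)
```

Terms annotated on only one side are dropped, so the output contains just the terms that connect the transcriptomics and proteomics results.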
In the post-genomic era, genomics and proteomics data generated by high-throughput technologies are accumulating quickly. The biology community needs powerful tools to analyze large gene and protein sets conveniently. Moreover, researchers want to understand the biological meaning of their genes and proteins of interest. Software that provides biologists with large-scale gene and protein annotation capabilities is therefore in high demand. To meet this need, we have established a web server called iGepros. The server is user-friendly and offers on-line demos, which makes it suitable for a wide range of users, even those with little or no programming experience. In addition, the iGepros server provides operations to connect genes and proteins, which allows users to compare different omics data sets. As demonstrated in the case study, users can first annotate genes and proteins with GO terms and pathway information through the Gene and Protein modules. This primary annotation helps researchers interpret the genes and proteins of interest detected by high-throughput technology. Then, users can perform a comparative analysis of transcriptomics and proteomics data by integrating genes and proteins through the GePro module, which gives researchers the ability to associate information at the transcription and translation levels and to comprehend the biological nature of the issue under investigation.
Currently, the iGepros server supports the following model organisms: human, mouse, rat, chicken, cow, zebrafish, and Arabidopsis. In the future, more model organisms will be supported, and more tools will be developed for the GePro module, giving the server a stronger capability for comparing transcriptomics and proteomics data.
iGepros is an integrated gene and protein analysis server with powerful capabilities for gene and protein functional annotation as well as for transcriptomics and proteomics data comparison. This on-line server allows (1) retrieval of gene information for a model organism, (2) GO term annotation for a large set of genes or proteins, (3) pathway annotation for a large set of genes or proteins, (4) GO enrichment analysis for very large gene and protein sets, (5) pathway enrichment analysis for very large gene and protein sets, (6) de novo annotation of gene and protein sequences without functional information, (7) mapping of genes and proteins with cross-reference information, (8) connecting genes and proteins through GO terms, (9) connecting genes and proteins through pathway information, (10) comparative analysis of transcriptomics and proteomics data at the GO level, and (11) comparative analysis of transcriptomics and proteomics data at the pathway level.
This work was supported by the Main Direction Program of Knowledge Innovation of the Chinese Academy of Sciences (grant KSCX2-EW-R-04), the China Postdoctoral Science Foundation (No. 20110490758), the Shanghai Pujiang Program (No. 09PJ1407900), the National Natural Science Foundation of China (No. 60970050), and the National High-Tech R&D Program (863) (No. 2009AA02Z310, No. 2009AA02Z306).
This article has been published as part of BMC Bioinformatics Volume 12 Supplement 14, 2011: 22nd International Conference on Genome Informatics: Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/12?issue=S14.
- The ENCODE (ENCyclopedia Of DNA Elements) Project. Science 2004, 306(5696):636–640.
- Schena M, Shalon D, Davis RW, Brown PO: Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 1995, 270(5235):467–470. 10.1126/science.270.5235.467
- Behr MA, Wilson MA, Gill WP, Salamon H, Schoolnik GK, Rane S, Small PM: Comparative genomics of BCG vaccines by whole-genome DNA microarray. Science 1999, 284(5419):1520–1523. 10.1126/science.284.5419.1520
- Wang Z, Gerstein M, Snyder M: RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet 2009, 10(1):57–63. 10.1038/nrg2484
- Marioni JC, Mason CE, Mane SM, Stephens M, Gilad Y: RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Res 2008, 18(9):1509–1517. 10.1101/gr.079558.108
- Aebersold R, Mann M: Mass spectrometry-based proteomics. Nature 2003, 422(6928):198–207. 10.1038/nature01511
- Peng J, Elias JE, Thoreen CC, Licklider LJ, Gygi SP: Evaluation of multidimensional chromatography coupled with tandem mass spectrometry (LC/LC-MS/MS) for large-scale protein analysis: the yeast proteome. J Proteome Res 2003, 2(1):43–50. 10.1021/pr025556v
- Carbon S, Ireland A, Mungall CJ, Shu S, Marshall B, Lewis S: AmiGO: online access to ontology and annotation data. Bioinformatics 2009, 25(2):288–289. 10.1093/bioinformatics/btn615
- Falcon S, Gentleman R: Using GOstats to test gene lists for GO term association. Bioinformatics 2007, 23(2):257–258. 10.1093/bioinformatics/btl567
- Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, et al.: Bioconductor: open software development for computational biology and bioinformatics. Genome Biol 2004, 5(10):R80. 10.1186/gb-2004-5-10-r80
- Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, et al.: Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A 2005, 102(43):15545–15550. 10.1073/pnas.0506580102
- Zheng Q, Wang XJ: GOEAST: a web-based software toolkit for Gene Ontology enrichment analysis. Nucleic Acids Res 2008, 36(Web Server issue):W358–363.
- Bauer S, Grossmann S, Vingron M, Robinson PN: Ontologizer 2.0--a multifunctional tool for GO term enrichment analysis and data exploration. Bioinformatics 2008, 24(14):1650–1651. 10.1093/bioinformatics/btn250
- Conesa A, Gotz S, Garcia-Gomez JM, Terol J, Talon M, Robles M: Blast2GO: a universal tool for annotation, visualization and analysis in functional genomics research. Bioinformatics 2005, 21(18):3674–3676. 10.1093/bioinformatics/bti610
- Moriya Y, Itoh M, Okuda S, Yoshizawa AC, Kanehisa M: KAAS: an automatic genome annotation and pathway reconstruction server. Nucleic Acids Res 2007, 35(Web Server issue):W182–185.
- Sherman BT, Huang da W, Tan Q, Guo Y, Bour S, Liu D, Stephens R, Baseler MW, Lane HC, Lempicki RA: DAVID Knowledgebase: a gene-centered database integrating heterogeneous gene annotation resources to facilitate high-throughput gene functional analysis. BMC Bioinformatics 2007, 8: 426. 10.1186/1471-2105-8-426
- The MVC framework webpage http://en.wikipedia.org/wiki/model-view-controller
- Hartl D, Irmler M, Romer I, Mader MT, Mao L, Zabel C, de Angelis MH, Beckers J, Klose J: Transcriptome and proteome analysis of early embryonic mouse brain development. Proteomics 2008, 8(6):1257–1265. 10.1002/pmic.200700724
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.