G-InforBIO: integrated system for microbial genomics
© Tanaka et al; licensee BioMed Central Ltd. 2006
Received: 28 March 2006
Accepted: 04 August 2006
Published: 04 August 2006
Genome databases contain diverse kinds of information, including gene annotations and nucleotide and amino acid sequences. It is not easy to integrate such information for genomic study. Because few tools exist for integrated analyses of genomic data, we developed software that enables users to handle, manipulate, and analyze genome data with a variety of sequence analysis programs.
The G-InforBIO system is a novel tool for genome data management and sequence analysis. The system can import genome data, including annotations and sequences, from DNA Data Bank of Japan (DDBJ) and GenBank flat files, from eXtensible Markup Language (XML) documents, and from formatted text documents. The genome database is constructed automatically after importing, and the database can be exported as XML documents or tab-delimited text. Users can retrieve data from the database by keyword searches, edit annotation data of genes, and process data with G-InforBIO. In addition, information in the G-InforBIO database can be analyzed seamlessly with nine different software programs, including programs for clustering and homology analyses.
The G-InforBIO system simplifies genome analyses by integrating several available software programs to allow efficient handling and manipulation of genome data. G-InforBIO is freely available from the download site.
The number of microbial genomes for which sequence data are available is increasing each year. Currently, complete nucleotide sequences of more than 300 strains are available in the International Nucleotide Sequence Database (INSD), which includes DDBJ, EMBL, and GenBank, and the sequence data are summarized in the portal site, Genome Information Broker (GIB) [2, 3]. Genome data are composed primarily of annotation and sequence data, and the large volume of annotation data and long nucleotide sequences must be integrated for effective genome research. Such genome data are used for analyses that include comparisons of genomic structures between closely related species [4, 5], phylogenetic analysis, and detection of ubiquitous [7, 8] and species-specific genes (ORFans) [9, 10]. Such genomic analyses generally require high-capacity computers and many programs to study multiple long sequences.
Software programs, including Artemis, ASAP [12, 13], ERGO, and GenDB, have been developed to integrate annotation data and the results of various sequence analyses. However, a compact and easy-to-use sequence analysis package is needed by research laboratories outside of genome sequencing centers. Therefore, we developed the G-InforBIO system, an integrated system for analysis of microbial genomes that functions as a data management and sequence analysis program. Herein, we describe the functions of the G-InforBIO system and illustrate its uses with microbial genome data.
The G-InforBIO system is programmed in Java (j2sdk1.4.2_05), which is one of the most widely used computer languages in bioinformatics. The inputs are annotation and whole-genome sequence files. Annotation files formatted as a flat file (FF), eXtensible Markup Language (XML), or tab-delimited text can be imported. The genome database is constructed automatically after importing. The database items generated on the G-InforBIO system are Accession, Feature, Location, Qualifier-key, Qualifier-value, and Whole Sequence. The definitions of these items, except Accession and Whole Sequence, are provided on the International Nucleotide Sequence Database Collaboration (INSDC) web site. Terms in the Accession and Whole Sequence fields should be unique to each genome. The Whole Sequence field holds the name of a whole-genome sequence file, which is in fasta format and has the extension wgs. When an FF is imported, the whole-genome sequence file is produced automatically. In the G-InforBIO system database, the annotation data for one gene in a genome are recorded on multiple lines. These lines share the same terms in the Accession, Location, and Whole Sequence fields, which identify the genome and the gene location, and each line has its own terms in the Feature, Qualifier-key, and Qualifier-value fields, which represent the annotation information. Lines can be extracted from the database by keyword searches of the annotation data.
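The one-line-per-qualifier layout and the keyword search can be illustrated with a small sketch. This is not the G-InforBIO implementation (which is written in Java); it is a minimal Python illustration in which the `Row` type, the helper names, the sample values, and the tab-delimited column headers are all hypothetical.

```python
from typing import List, NamedTuple, Sequence, Tuple


class Row(NamedTuple):
    """One database line: the six items named in the text."""
    accession: str
    feature: str
    location: str
    qualifier_key: str
    qualifier_value: str
    whole_sequence: str


# Assumed column headers for a tab-delimited export (not taken from the tool).
COLUMNS = ("Accession", "Feature", "Location",
           "Qualifier-key", "Qualifier-value", "Whole Sequence")


def keyword_search(rows: Sequence[Row], keyword: str) -> List[Row]:
    """Return the lines in which any field contains the keyword (case-insensitive)."""
    needle = keyword.lower()
    return [row for row in rows if any(needle in field.lower() for field in row)]


def gene_key(row: Row) -> Tuple[str, str, str]:
    """Lines describing the same gene share Accession, Location, and Whole Sequence."""
    return (row.accession, row.location, row.whole_sequence)


def lines_for_same_gene(rows: Sequence[Row], hit: Row) -> List[Row]:
    """Collect every line that belongs to the same gene as a search hit."""
    return [row for row in rows if gene_key(row) == gene_key(hit)]


def to_tab_delimited(rows: Sequence[Row]) -> str:
    """Serialize lines as tab-delimited text with a header line."""
    lines = ["\t".join(COLUMNS)]
    lines.extend("\t".join(row) for row in rows)
    return "\n".join(lines) + "\n"


# Hypothetical sample data: two genes, two qualifier lines each.
ROWS = [
    Row("ACC0001", "CDS", "1..9", "gene", "geneA", "ACC0001.wgs"),
    Row("ACC0001", "CDS", "1..9", "product", "hypothetical protein", "ACC0001.wgs"),
    Row("ACC0001", "CDS", "complement(20..31)", "gene", "geneB", "ACC0001.wgs"),
    Row("ACC0001", "CDS", "complement(20..31)", "product", "DNA polymerase", "ACC0001.wgs"),
]
```

Searching `ROWS` for "polymerase" returns a single line, and `lines_for_same_gene` then recovers the other qualifier line recorded for that gene location.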
Gene and protein sequence data retrieval from the database
Nucleotide and predicted protein sequences in the database can be retrieved for use by the analysis programs integrated in G-InforBIO. Specific nucleotide sequences are cut out of the whole-genome sequence file named in the Whole Sequence field, using the Location field of the extracted lines. The excised nucleotide sequences may be complementary, joined, or partial sequences, whose description styles are defined by INSDC. Predicted protein sequences are recorded in the Qualifier-value field of lines whose Qualifier-key field is translation. The predicted protein sequences retrieved are those recorded for the same gene locations as the extracted lines. Sequences in a multi-fasta file are named with the line number in the database, followed by the Accession, Feature, and Location fields of each extracted line. If no lines are extracted from the database, all nucleotide or predicted protein sequences are retrieved. Retrieved nucleotide and protein sequences can be transferred to the analysis programs in G-InforBIO or exported as a multi-fasta file. Additionally, locations selected by clicking on the database can also be retrieved and transferred to the analysis programs in the same manner.
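As an illustration of how a Location value can drive excision, the following Python sketch handles simple ranges, `complement(...)`, `join(...)`, and the partial markers `<` and `>`, using 1-based inclusive coordinates as in the INSDC feature table. It is an assumption-laden sketch, not the G-InforBIO code: the function names are hypothetical, other INSDC location forms (single bases, remote references) are not handled, and the `|` separator in the fasta header is an assumption, since the text specifies only the order of the fields.

```python
import re

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Reverse-complement a nucleotide sequence (A, C, G, T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def _split_top_level(text: str) -> list:
    """Split on commas that are not nested inside parentheses."""
    parts, depth, start = [], 0, 0
    for i, ch in enumerate(text):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "," and depth == 0:
            parts.append(text[start:i])
            start = i + 1
    parts.append(text[start:])
    return parts


def excise(genome: str, location: str) -> str:
    """Cut the sequence described by a location string out of a genome."""
    loc = location.strip()
    if loc.startswith("complement(") and loc.endswith(")"):
        return reverse_complement(excise(genome, loc[len("complement("):-1]))
    if loc.startswith("join(") and loc.endswith(")"):
        inner = loc[len("join("):-1]
        return "".join(excise(genome, part) for part in _split_top_level(inner))
    # Simple range; '<' and '>' mark partial ends and do not change the slice.
    match = re.fullmatch(r"<?(\d+)\.\.>?(\d+)", loc)
    if match is None:
        raise ValueError(f"unsupported location: {location!r}")
    start, end = int(match.group(1)), int(match.group(2))
    return genome[start - 1:end]


def fasta_record(line_no: int, accession: str, feature: str,
                 location: str, seq: str) -> str:
    """Name a record by line number, Accession, Feature, and Location."""
    # The '|' separator is an assumption made for this sketch.
    return f">{line_no}|{accession}|{feature}|{location}\n{seq}\n"
```

For example, with the toy genome `ATGCCCGGGTAAACGT`, the location `join(1..3,10..12)` selects `ATG` and `TAA` and concatenates them, and wrapping a location in `complement(...)` returns the reverse complement of the selected sequence.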
The G-InforBIO system contains nine programs for sequence analysis. ClustalW, BLASTCLUST, and Self-Organizing Map (SOM) [20, 21] can be used for clustering analyses based on sequence similarity. BLAST, Blat, DDBJ Blast, MegaBLAST, and Sim4 can be used for homology analyses, and primer3 can be used for primer design. Some of these programs report their results as text documents, which can be difficult to interpret. Therefore, graphical result viewers were designed to display results of ClustalW, BLASTCLUST, SOM [20, 21], BLAST, Blat, MegaBLAST, and Sim4 analyses in G-InforBIO.
Furthermore, results from one analysis program can be passed to another analysis program as a sequence file. For example, a dataset (fasta file format) of nucleotide sequence clusters generated by BLASTCLUST can be imported into ClustalW for phylogenetic analysis.
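The hand-off from a clustering result to an alignment program can be sketched outside the GUI as follows. This Python sketch assumes the plain-text BLASTCLUST output in which each line lists the whitespace-separated identifiers of one cluster; the function names and the `min_size` parameter are hypothetical, and G-InforBIO itself performs this transfer internally.

```python
from typing import Dict, List


def parse_fasta(text: str) -> Dict[str, str]:
    """Read multi-fasta text into {identifier: sequence}."""
    chunks: Dict[str, List[str]] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the identifier.
            name = line[1:].split()[0]
            chunks[name] = []
        elif name is not None:
            chunks[name].append(line)
    return {key: "".join(parts) for key, parts in chunks.items()}


def clusters_to_fasta(cluster_text: str, sequences: Dict[str, str],
                      min_size: int = 2) -> List[str]:
    """Build one multi-fasta string per cluster that has at least min_size members."""
    datasets = []
    for line in cluster_text.splitlines():
        ids = line.split()
        if len(ids) < min_size:
            continue  # a singleton cannot be aligned
        datasets.append("".join(f">{i}\n{sequences[i]}\n" for i in ids))
    return datasets
```

Each returned string could be saved as a fasta file and given to an alignment program such as ClustalW as its input dataset.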
Graphical genome viewer (Feature Viewer)
The Feature Viewer in the G-InforBIO system can display maps of two genomes contained in the database. Gene location and annotation information are retrieved from the database. There are two Feature View windows, and each window is composed of the map viewer and the sequence viewer. In the map viewer, genes are drawn as pentagons positioned according to their location information, and annotation data of genes appear in a table in this viewer. In the sequence viewer, users can browse the nucleotide sequence around a gene selected by clicking on the map viewer.
A specific nucleotide sequence selected by users can also be excised from a sequence as a text file in the Feature Viewer. Additionally, the selected nucleotide sequence is translated automatically into six-frame protein sequences. Retrieved nucleotide and protein sequences can then be captured and transferred to the analysis programs in G-InforBIO.
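Six-frame translation can be illustrated with a short Python sketch. It assumes the standard genetic code, translates stop codons as `*` and unrecognized codons as `X`, and labels the frames `+1` to `+3` and `-1` to `-3`; these conventions and the function names are assumptions made for illustration, not details taken from G-InforBIO.

```python
from itertools import product
from typing import Dict

# Standard genetic code, codons enumerated in T, C, A, G order at each position.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(_BASES, repeat=3), _AMINO)}

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(seq: str) -> str:
    """Translate complete codons from the first base; incomplete tail is ignored."""
    seq = seq.upper()
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frame(seq: str) -> Dict[str, str]:
    """Translate three forward frames and three reverse-complement frames."""
    forward = seq.upper()
    reverse = forward.translate(_COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(forward[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames
```

For the nine-base input `ATGGCCTAA`, frame `+1` reads the codons ATG, GCC, and TAA and yields `MA*`, while frame `-1` reads the reverse complement `TTAGGCCAT` and yields `LGH`.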
Download of genomic data from DDBJ
We used the Simple Object Access Protocol (SOAP) interface [28, 29] in the G-InforBIO system to download FFs of genome data, which have the extension .seq, from the SOAP server of XML Central of DDBJ in the GIB [2, 3]. FFs published by GenBank, which have the extension .gbk, can be imported after they are downloaded manually from the file transfer protocol (ftp) site.
We used the G-InforBIO system to analyze FFs containing genomic data of two Xylella fastidiosa strains. It was reported that their genomic differences are limited to phage-associated chromosomal rearrangements and deletions, and their genome structures were compared with G-InforBIO.
Download and import of FFs
Keyword searches of genes and retrieving sequences from the database
Graphical viewer for comparative genomics
We developed the G-InforBIO system, which allows seamless handling of genome data from management to analysis. The system is also helpful for interpretation of results because it provides a graphical view of the linkage of the data and results of various analyses. The results of analyses, however, depend on the quality of the annotation information, such as predicted coding regions, for specific genes. Genome data can be kept up to date by downloading current data from INSDC through the system.
New genome analysis tools and algorithms will be developed in the future, and the object-oriented architecture of the G-InforBIO system will allow integration of programs constructed in Java or C language. Therefore, we anticipate that this system will expand to contain additional tools for genomic analysis. The system allows comprehensive utilization of genome information and can also be used to analyze fungal genomes.
Availability and requirements
Project name: InforBIO project;
Project homepage: http://wdcm.nig.ac.jp/inforbio/index_e.html;
Operating systems: Windows 2000/XP, Macintosh OSX, Linux, UNIX;
Other requirements: CPU ≥ 1 GHz, Memory ≥ 512 MB, HD ≥ 60 MB (+ capacity for genome data), Screen resolution ≥ 800 × 600 pixels;
Programming language: Java (j2sdk1.4.2_05);
License: GNU GPL;
Any restrictions to use by non-academics: none.
The authors would like to express their sincere thanks to K. Koorikawa of Hitachi Software Engineering Co., Ltd., for programming. Development of the G-InforBIO system was supported in part by the Project of Fundamental Research and Development for Databasing and Networking Bio-resource Information as part of the Promotion System for Intellectual Infrastructure of Research and Development, Special Coordination Funds for Promoting Science and Technology.
- Tateno Y, Saitou N, Okubo H, Sugawara H, Gojobori T: DDBJ in collaboration with mass-sequencing teams on annotation. Nucleic Acids Res 2005, 33: D25-D28. 10.1093/nar/gki020
- Fumoto M, Miyazaki S, Sugawara H: Genome Information Broker (GIB): data retrieval and comparative analysis system for completed microbial genomes and more. Nucleic Acids Res 2002, 30: 66–68. 10.1093/nar/30.1.66
- Boekhorst J, Siezen RJ, Zwahlen MC, Vilanova D, Pridmore RD, Mercenier A, Kleerebezem M, de Vos WM, Brussow H, Desiere F: The complete genomes of Lactobacillus plantarum and Lactobacillus johnsonii reveal extensive differences in chromosome organization and gene content. Microbiology 2004, 150: 3601–3611. 10.1099/mic.0.27392-0
- Glöckner G, Lehmann R, Romualdi A, Pradella S, Schulte-Spechtel U, Schilhabel M, Wilske B, Suhnel J, Platzer M: Comparative analysis of the Borrelia garinii genome. Nucleic Acids Res 2004, 32: 6038–6046. 10.1093/nar/gkh953
- Lerat E, Daubin V, Moran NA: From gene trees to organismal phylogeny in prokaryotes: the case of the gamma-Proteobacteria. PLoS Biol 2003, 1: 101–109. 10.1371/journal.pbio.0000019
- Margulies EH, Blanchette M, Haussler D, Green ED: Identification and characterization of multi-species conserved sequences. Genome Res 2003, 13: 2507–2518. 10.1101/gr.1602203
- Charlebois RL, Doolittle WF: Computing prokaryotic gene ubiquity: rescuing the core from extinction. Genome Res 2004, 14: 2469–2477. 10.1101/gr.3024704
- Charlebois RL, Clarke GD, Beiko RG, St Jean A: Characterization of species-specific genes using a flexible web-based querying system. FEMS Microbiol Lett 2003, 225: 213–220. 10.1016/S0378-1097(03)00512-3
- Daubin V, Ochman H: Bacterial genomes as new gene homes: the genealogy of ORFans in E. coli. Genome Res 2004, 14: 1036–1042. 10.1101/gr.2231904
- Rutherford K, Parkhill J, Crook J, Horsnell T, Rice P, Rajandream MA, Barrell B: Artemis: sequence visualization and annotation. Bioinformatics 2000, 16: 944–945. 10.1093/bioinformatics/16.10.944
- Glasner JD, Liss P, Plunkett G 3rd, Darling A, Prasad T, Rusch M, Byrnes A, Gilson M, Biehl B, Blattner FR, Perna NT: ASAP, a systematic annotation package for community analysis of genomes. Nucleic Acids Res 2003, 31: 147–151. 10.1093/nar/gkg125
- Glasner JD, Rusch M, Liss P, Plunkett G 3rd, Cabot EL, Darling A, Anderson BD, Infield-Harm P, Gilson MC, Perna NT: ASAP: a resource for annotating, curating, comparing, and disseminating genomic data. Nucleic Acids Res 2006, 34: D41-D45. 10.1093/nar/gkj164
- Overbeek R, Larsen N, Walunas T, D'Souza M, Pusch G, Selkov E Jr, Liolios K, Joukov V, Kaznadzey D, Anderson I, Bhattacharyya A, Burd H, Gardner W, Hanke P, Kapatral V, Mikhailova N, Vasieva O, Osterman A, Vonstein V, Fonstein M, Ivanova N, Kyrpides N: The ERGO genome analysis and discovery system. Nucleic Acids Res 2003, 31: 164–171. 10.1093/nar/gkg148
- Meyer F, Goesmann A, McHardy AC, Bartels D, Bekel T, Clausen J, Kalinowski J, Linke B, Rupp O, Giegerich R, Puhler A: GenDB – an open source genome annotation system for prokaryote genomes. Nucleic Acids Res 2003, 31: 2187–2195. 10.1093/nar/gkg312
- Maruyama H, Tamura K, Uramoto N, Murata M, Clark A, Nakamura Y, Neyama R, Kosaka K, Hada S: XML and Java: Developing Web Applications. Addison-Wesley Boston; 2002.
- Definition of items [http://www.insdc.org/feature_table.html]
- Thompson JD, Higgins DG, Gibson TJ: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 1994, 22: 4673–4680.
- Kanaya S, Kinouchi M, Abe T, Kudo Y, Yamada Y, Nishi T, Mori H, Ikemura T: Analysis of codon usage diversity of bacterial genes with a self-organizing map (SOM): characterization of horizontally transferred genes with emphasis on the E. coli O157 genome. Gene 2001, 276: 89–99. 10.1016/S0378-1119(01)00673-4
- Abe T, Kanaya S, Kinouchi M, Ichiba Y, Kozuki T, Ikemura T: Informatics for unveiling hidden genome signatures. Genome Res 2003, 13: 693–702. 10.1101/gr.634603
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol 1990, 215: 403–410. 10.1006/jmbi.1990.9999
- Kent WJ: BLAT-the BLAST-like alignment tool. Genome Res 2002, 12: 656–664. 10.1101/gr.229202. Article published online before March 2002
- DDBJ Blast [http://www.ddbj.nig.ac.jp/search/blast-e.html]
- Zhang Z, Schwartz S, Wagner L, Miller W: A greedy algorithm for aligning DNA sequences. J Comput Biol 2000, 7: 203–214. 10.1089/10665270050081478
- Florea L, Hartzell G, Zhang Z, Rubin GM, Miller W: A computer program for aligning a cDNA sequence with a genomic DNA sequence. Genome Res 1998, 8: 967–974.
- Sugawara H, Miyazaki H: Biological SOAP servers and web services provided by the public sequence data bank. Nucleic Acids Res 2003, 31: 3836–3839. 10.1093/nar/gkg558
- XML Central of DDBJ [http://www.xml.nig.ac.jp/index.html]
- Van Sluys MA, de Oliveira MC, Monteiro-Vitorello CB, Miyaki CY, Furlan LR, Camargo LE, da Silva AC, Moon DH, Takita MA, Lemos EG, Machado MA, Ferro MI, da Silva FR, Goldman MH, Goldman GH, Lemos MV, El-Dorry H, Tsai SM, Carrer H, Carraro DM, de Oliveira RC, Nunes LR, Siqueira WJ, Coutinho LL, Kimura ET, Ferro ES, Harakava R, Kuramae EE, Marino CL, Giglioti E, Abreu IL, Alves LM, do Amaral AM, Baia GS, Blanco SR, Brito MS, Cannavan FS, Celestino AV, da Cunha AF, Fenille RC, Ferro JA, Formighieri EF, Kishi LT, Leoni SG, Oliveira AR, Rosa VE Jr, Sassaki FT, Sena JA, de Souza AA, Truffi D, Tsukumo F, Yanai GM, Zaros LG, Civerolo EL, Simpson AJ, Almeida NF Jr, Setubal JC, Kitajima JP: Comparative analyses of the complete genome sequences of Pierce's disease and citrus variegated chlorosis strains of Xylella fastidiosa. J Bacteriol 2003, 185: 1018–1026. 10.1128/JB.185.3.1018-1026.2003
- Saitou N, Nei M: The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol 1987, 4: 406–425.
- G-InforBIO download site [http://www.wdcm.org/inforbio/G-InforBIO/download.html]
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.