- Open Access
G-InforBIO: integrated system for microbial genomics
BMC Bioinformaticsvolume 7, Article number: 368 (2006)
Genome databases contain diverse kinds of information, including gene annotations and nucleotide and amino acid sequences. It is not easy to integrate such information for genomic study. There are few tools for integrated analyses of genomic data, therefore, we developed software that enables users to handle, manipulate, and analyze genome data with a variety of sequence analysis programs.
The G-InforBIO system is a novel tool for genome data management and sequence analysis. The system can import genome data encoded as eXtensible Markup Language documents as formatted text documents, including annotations and sequences, from DNA Data Bank of Japan and GenBank encoded as flat files. The genome database is constructed automatically after importing, and the database can be exported as documents formatted with eXtensible Markup Language or tab-deliminated text. Users can retrieve data from the database by keyword searches, edit annotation data of genes, and process data with G-InforBIO. In addition, information in the G-InforBIO database can be analyzed seamlessly with nine different software programs, including programs for clustering and homology analyses.
The G-InforBIO system simplifies genome analyses by integrating several available software programs to allow efficient handling and manipulation of genome data. G-InforBIO is freely available from the download site.
The number of microbial genomes for which sequence data are available is increasing each year. Currently, complete nucleotide sequences of more than 300 strains are available in the International Nucleotide Sequence Database (INSD), which includes DDBJ, EMBL, and GenBank , and the sequence data are summarized in the portal site, Genome Information Broker (GIB) [2, 3]. Genome data are composed primarily of annotation and sequence data, and the large volume of annotation data and long nucleotide sequences must be integrated for effective genome research. Such genome data are used for analyses that include comparisons of genomic structures between closely related species [4, 5], phylogenetic analysis , and detection of ubiquitous [7, 8] and species-specific genes (ORFans) [9, 10]. It appears that genomic analyses require high-capacity computers and many programs to study multiple long sequences.
Software programs, including Artemis , ASAP [12, 13], ERGO , and GenDB , have been developed to integrate annotation data and the results of various sequence analyses. However, a compact and easy-to-use sequence analysis package is needed by research laboratories outside of genome sequencing centers. Therefore, we developed the G-InforBIO system, an integrated system for analysis of microbial genomes that functions as a data management and sequence analysis program. Herein, we describe the functions of the G-InforBIO system and illustrate its uses with microbial genome data.
The G-InforBIO system is programmed in Java (j2sdk1.4.2_05), which is one of the most widely used computer languages in bioinformatics . The inputs are annotation and whole-genome sequence files. Annotation files formatted as a flat file (FF), eXtensible Markup Language (XML), and tab-deliminated text can be imported. The genome database is constructed automatically after importing. The database items generated on the G-InforBIO system are Accession, Feature, Location, Qualifier-key, Qualifier-value, and Whole Sequence. The definitions of these items, except Accession and Whole sequence, are provided on the International Nucleotide Sequence Database Collaboration (INSDC) web site . Terms in the Accession and Whole sequence fields should be unique to each genome. The Whole Sequence is the name of a whole-genome sequence file with the extension wgs, formatted as a fasta file. When an FF is imported, the whole-genome sequence file is produced automatically. In the G-InforBIO system database, annotation data are listed, and annotation data for one gene in a genome are recorded separately in multiple lines. The lines have a common term in each Accession, Location or Whole Sequence field to identify the gene location and the genome, and each of the lines has specific terms in Feature, Qualifier-key, and Qualifier-value fields respectively to represent annotation information. Lines can be extracted using keyword searches by annotation data from the database in the G-InforBIO system.
Gene and protein sequence data retrieval from the database
Nucleotide and predicted protein sequences in the database can be retrieved for use by the analysis programs integrated in G-InforBIO. Specific nucleotide sequences can be cut out with reference to the Location field of the extracted lines in the database from the whole-genome sequence files assigned in the "Whole Sequence" field. The excised nucleotide sequences are complementary, joined, and partial sequences, whose description styles are defined in INSDC . Predicted protein sequences are recorded in the Qualifier-value field, which lines have translation in the Qualifier-key field in the database. Specific predicted protein sequences encoded by the same gene locations as the extracted lines are retrieved from their predicted protein sequences recorded in the database. Sequences in a multi-fasta file are named with line numbers in the database, followed by Accession, Feature, and Location fields for each line extracted. If no lines are extracted in the database, all nucleotide or predicted protein sequences are retrieved. Retrieved nucleotide and protein sequences can be transferred to the analysis programs in G-InforBIO or be exported as a multi-fasta file. Additionally, locations selected by clicking on the database can be also retrieved and transferred to the analysis programs as the same manner.
The G-InforBIO system contains nine programs for sequence analysis. ClustalW , BLASTCLUST , and Self-Organizing Map (SOM) [20, 21] can be used for clustering analyses based on sequence similarity. BLAST , Blat , DDBJ Blast , MegaBLAST , and Sim4  can be used for homology analyses, and primer3 can be used for primer design . Results of some of these programs are displayed as text documents, and it may be difficult to interpret the data. Therefore, graphical result viewers were designed to display results of ClustalW , BLASTCLUST , SOM [20, 21], BLAST , Blat , MegaBLAST , and Sim4  analyses in G-InforBIO.
Furthermore, results from one analysis program can be simply utilized for the other analysis programs through a sequence file. For example, a dataset (fasta file format) of nucleotide sequence clusters generated by BLASTCLUST  can be imported into ClustalW  for phylogenetic analysis.
Graphical genome viewer (feature Viewer)
The Feature Viewer in the G-InforBIO system can display maps of two genomes contained in the database. Gene location and annotation information are retrieved from the database. There are two Feature View windows, and each window is composed of the map viewer and the sequence viewer. In the map viewer, gene regions are represented as pentagons for genes with reference to their location information, and annotation data of genes appear in a table in this viewer. In the sequence viewer, users can browse the nucleotide sequence around a selected gene by clicking on the map viewer.
A specific nucleotide sequence selected by users can be also excised from a sequence as a text file in the Feature Viewer. Additionally, the selected nucleotide sequence is translated automatically into six-frame protein sequences. Retrieved nucleotide and protein sequences can then be captured and transferred to the analysis programs in G-InforBIO.
Download of genomic data from DDBJ
We used the Simple Object Access Protocol (SOAP) interface [28, 29] in the G-InforBIO system to download FFs with the extension .seq of genome data from the SOAP server of XML Central of DDBJ  in the GIB [2, 3]. FFs published from GenBank, which have the extension .gbk, can be imported after they are downloaded manually from the file transfer protocol (ftp) site .
We used the G-InforBIO system to analyze FFs containing genomic data of two Xylella fastidiosa strains. It was reported that their genomic differences are limited to phage-associated chromosomal rearrangements and deletions , and their genome structures were compared with G-InforBIO.
Download and import of FFs
Available genomes, including chromosomes and plasmids, are listed in the Remote DB window of G-InforBIO. As shown in Fig. 1, target genomes were retrieved from the list with Xylella as keywords of Organism name, and their FFs were downloaded. The X. fastidiosa genome database were constructed on G-InforBIO by importing 5 FFs downloaded, including genomes of X. fastidiosa 9a5c (1 chromosome and 2 plasmids) and X. fastidiosa Temecula (1 chromosome and 1 plasmid).
Keyword searches of genes and retrieving sequences from the database
Gene annotation information is listed in the Search Entry window, and imported genome data can be browsed. From the X. fastidiosa genome database, 216 phage-related genes on both X. fastidiosa chromosomes were extracted with keywords of product for the Qualifier-key field and phage-related for the Qualifier-value field as shown in Fig. 2. Protein sequences encoded by the 216 extracted genes were directly transferred to the analysis programs.
Graphical viewer for comparative genomics
The retrieved phage-related protein sequences and genome sequences of two Xylella fastidiosa strains, 9a5c and Temecula, were analyzed with some analysis programs integrated in G-InforBIO and compared with graphical result viewers. BLASTCLUST , which is based on the BLAST score-based single-linkage clustering, was used for identification of similar phage-related proteins between the two strains under the default conditions, and then 111 of 216 retrieved sequences were assigned to 36 clusters, which respectively encompassed 2 to 7 proteins. The graphical result viewer of BLASTCLUST , which shows the distribution of the clustered protein coding regions on both chromosomes as shown in Fig. 3A, revealed that genes for the clustered phage-related proteins are concentrated in particular locations on each chromosome. MegaBLAST , which is called Alignment View in the system, is used for alignment across entire genomes to identify common regions and identified it many regions common between the two chromosomes. The graphical result viewer of MegaBLAST , which shows the distribution of regions common between two genomes as shown in Fig. 3B, revealed fragmentations and complex inversions in the chromosome structures. Green dashed lines in Fig. 3B show locations of genes for phage-related proteins in a cluster, generated with BLASTCLUST, on each chromosome. Interestingly, the clustered phage-related proteins by BLASTCLUST were encoded near the ends of the inverted fragments and near deleted regions on each chromosome. Additionally, six phage-related DNA polymerase sequences from two strains, which are encoded near other phage-related protein genes, were retrieved as shown in Fig. 2 and were aligned by using ClustalW . Their phylogenetic relationships were examined by using the neighbor-joining method  with the DNA polymerase sequence encoded by ORF40 on Xanthomonas oryzae phage OP1 genome [DDBJ: AP008979] as an out group in G-InforBIO. The capture of the result is shown in Fig. 3C. They were assigned to three clusters, which encompassed the 9a5c cluster (two phage-related DNA polymerases on 9a5c strain chromosome), the Temecula cluster (three phage-related DNA polymerases on Temecula strain chromosome), and a single cluster (one phage-related DNA polymerase on Temecula strain chromosome). The 9a5c and the Temecula clusters were closely related to each other. Thus, it seems that chromosomal rearrangements and deletions of two Xylella fastidiosa strains are affected by the infection of closely related bacteriophages and that Temecula strain might be additionally infected by another bacteriophage.
We developed the G-InforBIO system, which allows seamless handling of genome data from management to analysis. The system is also helpful for interpretation of results because it provides a graphical view of the linkage of the data and results of various analyses. The results of analyses, however, depend on the quality of the annotation information, such as predicted coding regions, for specific genes. Genome data can be constantly updated through downloads of current data from INSDC  by the system.
New genome analysis tools and algorithms will be developed in the future, and the object-oriented architecture of the G-InforBIO system will allow integration of programs constructed in Java or C language. Therefore, we anticipate that this system will expand to contain additional tools for genomic analysis. The system allows comprehensive utilization of genome information. This system can be used to analyze fungal genomes in G-InforBIO.
Availability and requirements
Project name: InforBIO project;
Project homepage: http://wdcm.nig.ac.jp/inforbio/index_e.html;
Operating systems: Windows 2000/XP, Macintosh OSX, Linux, UNIX;
Other requirements: CPU ≥ 1 GHz, Memory ≥ 512 MB, HD ≥ 60 MB (+ capacity for genome data), Screen resolution ≥ 800 × 600 pixels;
Programming language: Java (j2sdk1.4.2_05);
License: GNU GPL;
Any restrictions to use by non-academics: none.
Tateno Y, Saitou N, Okubo H, Sugawara H, Gojobori T: DDBJ in collaboration with mass-sequencing teams on annotation. Nucleic Acids Res 2005, 33: D25-D28. 10.1093/nar/gki020
Fumoto M, Miyazaki S, Sugawara H: Genome Information Broker (GIB): data retrieval and comparative analysis system for completed microbial genomes and more. Nucleic Acids Res 2002, 30: 66–68. 10.1093/nar/30.1.66
Boekhorst J, Siezen RJ, Zwahlen MC, Vilanova D, Pridmore RD, Mercenier A, Kleerebezem M, de Vos WM, Brussow H, Desiere F: The complete genomes of Lactobacillus plantarum and Lactobacillus johnsonii reveal extensive differences in chromosome organization and gene content. Microbioiol 2004, 150: 3601–3611. 10.1099/mic.0.27392-0
Göckner G, Lehmann R, Romualdi A, Pradella S, Schulte-Spechtel U, Schilhabel M, Wilske B, Suhnel J, Platzer M: Comparative analysis of the Borrelia garinii genome. Nucleic Acids Res 2004, 32: 6038–6046. 10.1093/nar/gkh953
Lerat E, Daubin V, Moran NA: From gene trees to organismal phylogeny in prokaryotes: the case of the gamma-Proteobacteria. PLoS Biol 2003, 1: 101–109. 10.1371/journal.pbio.0000019
Margulies EH, Blanchette M, Haussler D, Green ED: Identification and characterization of multi-species conserved sequences. Genome Res 2003, 13: 2507–2518. 10.1101/gr.1602203
Charlebois RL, Doolittle WF: Computing prokaryotic gene ubiquity: rescuing the core from extinction. Genome Res 2004, 14: 2469–2477. 10.1101/gr.3024704
Charlebois RL, Clarke GD, Beiko RG, St Jean A: Characterization of species-specific genes using a flexible web-based querying system. FEMS Microbiol Lett 2003, 225: 213–220. 10.1016/S0378-1097(03)00512-3
Daubin V, Ochman H: Bacterial genomes as new gene homes: the genealogy of ORFans in E coli . Genome Res 2004, 14: 1036–1042. 10.1101/gr.2231904
Rutherford K, Parkhill J, Crook J, Horsnell T, Rice P, Rajandream MA, Barrell B: Artemis: sequence visualization and annotation. Bioinformatics 2000, 16: 944–945. 10.1093/bioinformatics/16.10.944
Glasner JD, Liss P, Plunkett G 3rd, Darling A, Prasad T, Rusch M, Byrnes A, Gilson M, Biehl B, Blattner FR, Perna NT: ASAP a systematic annotation package for community analysis of genomes. Nucleic Acids Res 2003, 31: 147–151. 10.1093/nar/gkg125
Glasner JD, Rusch M, Liss P, Plunkett G 3rd, Cabot EL, Darling A, Anderson BD, Infield-Harm P, Gilson MC, Perna NT: ASAP: a resource for annotating, curating, comparing, and disseminating genomic data. Nucleic Acids Res 2006, 34: D41-D45. 10.1093/nar/gkj164
Overbeek R, Larsen N, Walunas T, D'Souza M, Pusch G, Selkov E Jr, Liolios K, Joukov V, Kaznadzey D, Anderson I, Bhattacharyya A, Burd H, Gardner W, Hanke P, Kapatral V, Mikhailova N, Vasieva O, Osterman A, Vonstein V, Fonstein M, Ivanova N, Kyrpides N: The ERGO genome analysis and discovery system. Nucleic Acids Res 2003, 31: 164–171. 10.1093/nar/gkg148
Meyer F, Goesmann A, McHardy AC, Bartels D, Bekel T, Clausen J, Kalinowski J, Linke B, Rupp O, Giegerich R, Puhler A: GenDB – an open source genome annotation system for prokaryote genomes. Nucleic Acids Res 2003, 31: 2187–2195. 10.1093/nar/gkg312
Maruyama H, Tamura K, Uramoto N, Murata M, Clark A, Nakamura Y, Neyama R, Kosaka K, Hada S: XML and Java: Developing Web Applications. Addison-Wesley Boston; 2002.
Definition of items[http://www.insdc.org/feature_table.html]
Thompson JD, Higgins DG, Gibson TJ: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting position-specific gap penalties and weight matrix choice. Nucleic Acids Res 1994, 22: 4673–4680.
Kanaya S, Kinouchi M, Abe T, Kudo Y, Yamada Y, Nishi T, Mori H, Ikemura T: Analysis of codon usage diversity of bacterial genes with a self-organizing map (SOM): characterization of horizontally transferred genes with emphasis on the E. coli O157 genome. Gene 2001, 276: 89–99. 10.1016/S0378-1119(01)00673-4
Abe T, Kanaya S, Kinouchi M, Ichiba Y, Kozuki T, Ikemura T: Informatics for unveiling hidden genome signatures. Genome Res 2003, 13: 693–702. 10.1101/gr.634603
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol 1990, 215: 403–410. 10.1006/jmbi.1990.9999
Kent WJ: BLAT-the BLAST-like alignment tool. Genome Res 2002, 14: 656–664. 10.1101/gr.229202. Article published online before March 2002
Zhang Z, Schwartz S, Wagner L, Miller W: A greedy algorithm for aligning DNA sequences. J Comput Biol 2000, 7: 203–214. 10.1089/10665270050081478
Florea L, Hartzell G, Zhang Z, Rubin GM, Miller W: A computer program for aligning a cDNA sequence with a genomic DNA sequence. Genome Res 1998, 8: 967–974.
Sugawara H, Miyazaki H: Biological SOAP servers and web services provided by the public sequence data bank. Nucleic Acids Res 2003, 31: 3836–3839. 10.1093/nar/gkg558
XML Central of DDBJ[http://www.xml.nig.ac.jp/index.html]
Van Sluys MA, de Oliveira MC, Monteiro-Vitorello CB, Miyaki CY, Furlan LR, Camargo LE, da Silva AC, Moon DH, Takita MA, Lemos EG, Machado MA, Ferro MI, da Silva FR, Goldman MH, Goldman GH, Lemos MV, El-Dorry H, Tsai SM, Carrer H, Carraro DM, de Oliveira RC, Nunes LR, Siqueira WJ, Coutinho LL, Kimura ET, Ferro ES, Harakava R, Kuramae EE, Marino CL, Giglioti E, Abreu IL, Alves LM, do Amaral AM, Baia GS, Blanco SR, Brito MS, Cannavan FS, Celestino AV, da Cunha AF, Fenille RC, Ferro JA, Formighieri EF, Kishi LT, Leoni SG, Oliveira AR, Rosa VE Jr, Sassaki FT, Sena JA, de Souza AA, Truffi D, Tsukumo F, Yanai GM, Zaros LG, Civerolo EL, Simpson AJ, Almeida NF Jr, Setubal JC, Kitajima JP: Comparative analyses of the complete genome sequences of Pierce's disease and citrus variegated chlorosis strains of Xylella fastidiosa . J Bacteriol 2003, 185: 1018–1026. 10.1128/JB.185.3.1018-1026.2003
Saitoh N, Nei M: The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol 1987, 4: 406–425.
G-InforBIO download site[http://www.wdcm.org/inforbio/G-InforBIO/download.html]
The authors would like to express their sincere thanks to K. Koorikawa of Hitachi Software Engineering Co., Ltd., for programming. Development of the G-InforBIO system was supported in part by the Project of Fundamental Research and Development for Databasing and Networking Bio-resource Information as part of the Promotion System for Intellectual Infrastructure of Research and Development, Special Coordination Funds for Promoting Science and Technology.
NT participated in the design and coordination of the study and drafted the manuscript. SM conceived of the study and participated in its design. TA and HS participated in the design of the study. All authors read and approved the final manuscript.
Takashi Abe, Satoru Miyazaki and Hideaki Sugawara contributed equally to this work.
Electronic supplementary material
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.