GeneViTo: Visualizing gene-product functional and structural features in genomic datasets
- Georgios S Vernikos†1,
- Christos G Gkogkas†1,
- Vasilis J Promponas1 and
- Stavros J Hamodrakas1
© Vernikos et al; licensee BioMed Central Ltd. 2003
Received: 22 July 2003
Accepted: 31 October 2003
Published: 31 October 2003
The availability of increasing amounts of sequence data from completely sequenced genomes boosts the development of new computational methods for automated genome annotation and comparative genomics. Therefore, there is a need for tools that facilitate the visualization of raw data and results produced by bioinformatics analysis, providing new means for interactive genome exploration. Visual inspection can be used as a basis to assess the quality of various analysis algorithms and to aid in-depth genomic studies.
GeneViTo is a JAVA-based computer application that serves as a workbench for genome-wide analysis through visual interaction. The application deals with various experimental information concerning both DNA and protein sequences (derived from public sequence databases or proprietary data sources) and meta-data obtained by various prediction algorithms, classification schemes or user-defined features. Interaction with a Graphical User Interface (GUI) allows easy extraction of genomic and proteomic data referring to the sequence itself, sequence features, or general structural and functional features. Emphasis is laid on the potential comparison between annotation and prediction data in order to offer a supplement to the provided information, especially in cases of "poor" annotation, or an evaluation of available predictions. Moreover, desired information can be output in high quality JPEG image files for further elaboration and scientific use. A compilation of properly formatted GeneViTo input data for demonstration is available to interested readers for two completely sequenced prokaryotes, Chlamydia trachomatis and Methanococcus jannaschii.
GeneViTo offers a visual overview of genomic functional elements, combining data from database annotation with the output of analysis tools to support the overall analysis of existing genomes. The application is compatible with Linux and Windows (ME, 2000, XP) operating systems, provided that the appropriate Java Runtime Environment is already installed on the system.
The impressive progress in Molecular Biology, enhanced by the development of rapid genome sequencing technologies, has led to an exponential growth in the number of available DNA/protein sequences deposited in public databases. Between the early 1990s, when the Human Genome Project began, and 1996, the complete genome sequences of 5 unicellular organisms had been determined. At the time of this writing (September 2003), 160 genomes (including the Human Genome) have been completely sequenced, while 643 genome projects are still in progress [1, 2]. In addition, the intensive research activity in the field of Bioinformatics generates a large amount of heterogeneous meta-data which, examined on a large scale, demand further analysis in order to extract valuable biological information.
DNA or protein sequence retrieval from specialized curated databases (GenBank, SWISS-PROT) is quite effective by means of well-established tools, such as SRS or Entrez. Cross-references between entries from dispersed biological databases are abundant and allow easy navigation over the World Wide Web, but they cannot offer an overview of the way sequence features are distributed in ordered sequence sets, such as complete genomes.
Moreover, several bioinformatics analysis and prediction tools are available, either as web services or as standalone applications, attempting to give further insight into existing sequence information. These tools produce different output, according to the analysis type, and results are mostly presented one functional element at a time. These analyses complement experimental data and guide further research activities.
Once information concerning a genome is obtained (sequence, annotation and meta-data), an integration step is required in order to reach more advanced biological conclusions. Such a task is time-consuming and painstaking, since data for hundreds or thousands of sequences are densely packed in structured text files. Furthermore, the monotonous machine-readable file format does not reveal the features contained in a set of sequences at a glance or in an intuitive way. It becomes quite clear, especially in the case of completely sequenced genomes, that organizing data in text files constitutes only a primitive level of presentation. Thus, a more sophisticated approach is required for easier, more efficient and less chaotic representation.
Data visualization, using specialized computer graphics software, acts as an intermediate link between raw data and the user, allowing more effective and elaborate manipulation of numerous genomes. Such computational workbenches become even more useful when they incorporate, apart from the already deposited data, additional tools, making large-scale in silico experiments easier.
Several powerful genome visualization tools are already available, mainly focused on features related to nucleotide sequences: gff2ps, Artemis, SeqVista, NCBI Map Viewer, TIGR Genome browser, ENSEMBL project viewer, ERGO™. Each of these tools follows a different philosophy in the type of input data (e.g. sequences, maps, nucleotide sequence features), the accepted formats and the way that features are visualized. Our approach is mainly focused on presenting features related to gene products and their distribution along genomic regions.
We have developed GeneViTo, a JAVA-based computer application that incorporates in a single depiction the sequence features found in annotation records from nucleotide and protein sequence databases (GenBank, SWISS-PROT) and in the output of prediction methods (e.g. PRED-CLASS, PRED-TMR2, orienTM, SIGNALP).
GeneViTo provides interfaces to additional analysis tools, as well as several search utilities, to easily manipulate and further examine sequenced genomes. Existing annotations may easily be extracted with a mouse click on the color boxes representing structural genes (protein coding regions or functional RNA products).
The GeneViTo working environment attempts to unify the representation of data from genome-proteome resources and bioinformatics meta-data scattered around the Internet. The scope is to achieve a holistic yet detailed image of an entire organism and help with genome annotation, phylogenetic studies and comparative genomics efforts.
The open system architecture combined with object oriented JAVA programming allows future incorporation of numerous bioinformatics tools with various outputs, thus alleviating the need to develop program-specific visualization software for a given biological task.
Java-based Object Oriented Design
GeneViTo has been fully implemented in the Java programming language, exploiting its inherent object oriented design and graphical representation capabilities. Two main Java classes (MyFrame and MyPanel) serve as the core modules for data depiction and user interaction respectively, whereas additional classes provide data handling utilities (such as required transformations, raw data processing and analysis). Object oriented software design eases code reuse and provides interfaces for linking new software (or software modules) to build systems of extended capabilities.
Furthermore, the Java programming language has become very popular among programmers for its expressive power and elegance and, at the same time, very attractive to users for reasons of portability. Standard Java compilers do not build architecture-specific binaries but bytecode, which is executed by the Java Virtual Machine; this makes Java applications, in principle, cross-platform.
Performance Benchmark Tests
Table 1: Performance Benchmark Tests. Reported quantity: Main Memory Usage (Mbytes). Rows: Preparing a Genome (see Figure 1); Loading a Genome.
Concerning the visualization process, GeneViTo uses true color (a minimum of 256 colors suffices for old graphics cards) and genome browsing is very fast even on typical desktop computers. Graphics displayed on screen can be saved in high-quality JPEG graphics files. The resolution of these snapshots is only limited by the screen resolution.
Availability of GeneViTo input data.
Type of Data:
- FASTA nucleotide coding regions file
- FASTA Nucleic Acid file
- Cluster of Orthologous Groups (COGs)
- Proteome Analysis Server (EBI)
- Restriction Enzymes Sites
- Protein Structural Classification
- PRED-TMR2 and orienTM *: Prediction of TM region location and orientation
The original data files are necessary only for the initial formatting step and all essential information is included in the special files created by this procedure. More details about supported file formats, data gathering and organization, are provided through the application's detailed "Help" utility.
Results and Discussion
The sequence of the selected element is displayed in the "Sequence Features Panel", along with all the relevant information available from nucleotide or protein sequence databases (GenBank, SWISS-PROT) and data available from prediction algorithms (PRED-CLASS, PRED-TMR2, orienTM, SIGNALP). In the "General Features Panel", information about the selected item is displayed in red, such as: gene name, position on the genome (bp), length (bp) and coding strand (5'-3', 3'-5'). Information about the respective protein product (in case of protein coding genes) is available in blue: protein identification, subcellular location, enzyme class, predicted structural class (PRED-CLASS), number of annotated (SWISS-PROT) and predicted transmembrane segments (PRED-TMR2).
The "Color panel" is a tree-like structure indexing the color choices used in the "Central Graphics Panel", so that users can consult the color scheme at any time. Extra information available from SWISS-PROT and user-defined annotation on each protein or gene is displayed in more detail in the "Annotation Features Panel". The "Central Graphics Panel" is dynamically connected to all other panels and to the circular map, and the information they display is updated in real time on each selection.
A genome of interest may be loaded by selecting the option "Open a project" and choosing the folder where the specially formatted files were initially stored. Genomic elements are selected by simple mouse-clicks on the respective boxes. All available information is then instantly displayed in the relevant panels according to the already defined options, while at the same time the red pointer on the circular map moves to the position of the selected gene. In addition, an arrow displayed near the selected gene indicates the DNA strand on which the gene lies (a red left-to-right arrow for the 5'-3' strand and a blue right-to-left arrow for the 3'-5' strand).
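The real-time coupling between panels described above resembles a listener (observer) arrangement. The sketch below is illustrative only: `GeneSelection`, `SelectionListener` and `SelectionHub` are hypothetical names, not GeneViTo classes, and the actual application may wire its panels differently.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical value object describing the element the user clicked. */
final class GeneSelection {
    final String geneName;
    final int startBp;
    final int lengthBp;

    GeneSelection(String geneName, int startBp, int lengthBp) {
        this.geneName = geneName;
        this.startBp = startBp;
        this.lengthBp = lengthBp;
    }
}

/** Implemented by every panel that must refresh when the selection changes. */
interface SelectionListener {
    void selectionChanged(GeneSelection selection);
}

/** Stands in for the central panel: it forwards each selection to all listeners. */
final class SelectionHub {
    private final List<SelectionListener> listeners = new ArrayList<>();

    void addListener(SelectionListener listener) {
        listeners.add(listener);
    }

    /** Notifies listeners in registration order. */
    void select(GeneSelection selection) {
        for (SelectionListener listener : listeners) {
            listener.selectionChanged(selection);
        }
    }
}

final class SelectionDemo {
    public static void main(String[] args) {
        SelectionHub hub = new SelectionHub();
        // Two stand-in panels; real panels would repaint instead of printing.
        hub.addListener(s -> System.out.println("General features: " + s.geneName));
        hub.addListener(s -> System.out.println("Circular map pointer at " + s.startBp + " bp"));
        hub.select(new GeneSelection("exampleGene", 1200, 900));
    }
}
```

With this arrangement, adding a further panel only requires registering one more listener; the central panel does not need to know which panels exist.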
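To make the pointer and arrow behaviour concrete, the following sketch shows how a genome position could be mapped to an angle on a circular map and how the strand could select the arrow colour. The class and method names are assumptions made for illustration, as is the convention that position 0 sits at the top of the circle and angles grow clockwise; the article does not state how GeneViTo performs these computations.

```java
import java.awt.Color;

/** Hypothetical helper; not part of GeneViTo's published code. */
final class CircularMapSketch {

    /** FORWARD is the 5'-3' strand, REVERSE the 3'-5' strand. */
    enum Strand { FORWARD, REVERSE }

    /** Angle in degrees, clockwise from the top of the circle, for a 0-based position. */
    static double angleDegrees(long positionBp, long genomeLengthBp) {
        if (genomeLengthBp <= 0 || positionBp < 0 || positionBp >= genomeLengthBp) {
            throw new IllegalArgumentException("position outside genome");
        }
        return 360.0 * positionBp / genomeLengthBp;
    }

    /** Screen coordinates {x, y} of the pointer tip on a circle with the given centre and radius. */
    static double[] pointerTip(double angleDeg, double cx, double cy, double radius) {
        double rad = Math.toRadians(angleDeg);
        // Screen y grows downwards, so the top of the circle is at cy - radius.
        return new double[] { cx + radius * Math.sin(rad), cy - radius * Math.cos(rad) };
    }

    /** Red for the 5'-3' strand and blue for the 3'-5' strand, as described in the text. */
    static Color arrowColor(Strand strand) {
        return strand == Strand.FORWARD ? Color.RED : Color.BLUE;
    }
}
```

Under these assumptions, a gene starting a quarter of the way along the genome places the pointer at the rightmost point of the circle.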
"Browsing" a genome
Structural RNA coding genes
Apart from protein coding genes (default option), structural RNA coding genes (tRNAs, rRNAs) can also be viewed. Once this option is activated, extra genes are depicted on the genome in the form of "sticks", colored according to the amino acid residue they carry (in the case of tRNAs) or the corresponding ribosomal subunit (in the case of rRNAs). Several features are available for each RNA-coding gene, such as genome position, length, sequence and RNA type, all displayed in the "Sequence Features Panel".
Protein sequence related features
Type-I Membrane Protein
Type-II Membrane Protein
Predictions for protein sequences
Prediction algorithms provide information on protein sequences concerning: possible transmembrane domains (PRED-TMR2) along with their topology (orienTM), structural classification into four distinct classes (PRED-CLASS; Membrane, Globular, Fibrous, Mixed), and signal peptide and cleavage site prediction (SIGNALP). These features are a valuable means of annotating genomic Open Reading Frames that have no structural or functional assignment, or of evaluating existing annotation.
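One way to use such predictions for evaluation is to compare them with database annotation protein by protein. The sketch below assumes a hypothetical `ProteinRecord` holding the annotated (SWISS-PROT) and predicted (PRED-TMR2) transmembrane segment counts; the field names and the use of -1 for "no annotation" are illustrative choices and do not reflect GeneViTo's internal data model.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical record of annotated versus predicted transmembrane (TM) segments. */
final class ProteinRecord {
    final String id;
    final int annotatedTmSegments;   // -1 means no annotation available (assumption)
    final int predictedTmSegments;

    ProteinRecord(String id, int annotatedTmSegments, int predictedTmSegments) {
        this.id = id;
        this.annotatedTmSegments = annotatedTmSegments;
        this.predictedTmSegments = predictedTmSegments;
    }
}

final class TmComparison {

    /** Ids of proteins whose annotation exists but disagrees with the prediction. */
    static List<String> disagreements(List<ProteinRecord> records) {
        List<String> ids = new ArrayList<>();
        for (ProteinRecord r : records) {
            if (r.annotatedTmSegments >= 0 && r.annotatedTmSegments != r.predictedTmSegments) {
                ids.add(r.id);
            }
        }
        return ids;
    }

    /** Ids of unannotated proteins for which the prediction suggests at least one TM segment. */
    static List<String> predictionOnly(List<ProteinRecord> records) {
        List<String> ids = new ArrayList<>();
        for (ProteinRecord r : records) {
            if (r.annotatedTmSegments < 0 && r.predictedTmSegments > 0) {
                ids.add(r.id);
            }
        }
        return ids;
    }
}
```

The first list points to entries worth re-examining; the second corresponds to the "poor annotation" case, where a prediction supplements missing information.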
View Clusters of Orthologous groups (COGs)
View restriction sites on the genome
User defined annotation
Adding personal annotation for a specific gene or protein is supported. By doing so, each time the genome is loaded this user-defined information will also be available, with the option of modifying (erasing or updating) it. This feature allows storage of personal proteomic-genomic research data that can easily be retrieved.
View the whole genome in a circular map
Saving a Jpeg snapshot
An additional feature provided by GeneViTo is the ability to save the current display in a high-resolution color image file in JPEG format, suitable for direct incorporation into scientific publications or further inspection. Additionally, a color palette enables the modification of the graphics panel background color, according to the user's personal taste.
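For readers interested in how such a snapshot can be produced in Java, the following is a minimal sketch using the standard `BufferedImage` and `ImageIO` classes. It is not taken from GeneViTo's source; `SnapshotSketch` and `saveJpeg` are hypothetical names, and the drawing is delegated to a caller-supplied painter.

```java
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.util.function.Consumer;
import javax.imageio.ImageIO;

/** Hypothetical helper that renders a drawing routine into a JPEG file. */
final class SnapshotSketch {

    static void saveJpeg(int width, int height, Consumer<Graphics2D> painter, File target)
            throws IOException {
        // JPEG has no alpha channel, so an RGB image type is used.
        BufferedImage image = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = image.createGraphics();
        try {
            painter.accept(g);   // the caller draws the current display here
        } finally {
            g.dispose();         // always release the graphics context
        }
        // ImageIO.write returns false when no suitable writer is found.
        if (!ImageIO.write(image, "jpg", target)) {
            throw new IOException("No JPEG writer available");
        }
    }
}
```

For a Swing component, a method reference such as `panel::paint` could serve as the painter, with the component's width and height as the image size; this would be consistent with the snapshot resolution being bounded by what is shown on screen.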
Searching for similarities
Local sequence similarities can be searched for any selected element by running the BLAST algorithm (BlastN, BlastP, BlastX) on a particular sequence (DNA or protein) against a set of default databases. Any database of choice can also be used, provided that it follows the FASTA format. In this case, GeneViTo offers a utility to format the given file according to the BLAST standards. BLAST results are automatically displayed in a new window once the search has completed.
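Since a user-supplied database must follow the FASTA format, a quick check before formatting can be useful. The sketch below reads FASTA records with plain Java I/O and rejects input whose sequence data precede the first header line. `FastaSketch` is a hypothetical helper written for illustration; it is not GeneViTo's formatting utility and it does not invoke BLAST.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical minimal FASTA reader. */
final class FastaSketch {

    /**
     * Reads FASTA records into an insertion-ordered map of header to sequence.
     * A repeated header overwrites the earlier record.
     */
    static Map<String, String> read(Reader source) throws IOException {
        Map<String, String> records = new LinkedHashMap<>();
        BufferedReader in = new BufferedReader(source);
        String header = null;
        StringBuilder sequence = new StringBuilder();
        String line;
        while ((line = in.readLine()) != null) {
            line = line.trim();
            if (line.isEmpty()) {
                continue;                      // blank lines are ignored
            }
            if (line.startsWith(">")) {
                if (header != null) {
                    records.put(header, sequence.toString());
                }
                header = line.substring(1).trim();
                sequence.setLength(0);         // start collecting a new sequence
            } else {
                if (header == null) {
                    throw new IOException("Sequence data before first '>' header");
                }
                sequence.append(line);         // multi-line sequences are concatenated
            }
        }
        if (header != null) {
            records.put(header, sequence.toString());
        }
        return records;
    }
}
```

A file that parses without error and yields at least one record is at least structurally FASTA; checking the residue alphabet (nucleotide versus protein) would be a natural next step.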
Searching for a protein sequence motif
Detailed help for all options and tools supported by GeneViTo is organized in a tree-like structure, available through the "Help" menu.
Comparison to other genome browsers
The availability of huge amounts of genomic data during the last decade has spurred the development of computer applications for the analysis and visualization of this information. Consequently, several approaches have been taken to the problem of genome visualization, resulting in genome browsers that differ in both capabilities and overall philosophy. Each system seems to have been developed primarily to serve investigators' needs in a 'project' environment, with most attention paid to the features needed to accomplish specific tasks. This fact makes direct comparisons of such software difficult and, to some extent, arbitrary. In this section, we compare GeneViTo to three well known and widely used browsers: the TIGR genome browser, the ENSEMBL project viewer and ERGO™ (actually its freely available version, ERGO Light).
The TIGR genome browser is freely available through the TIGR web server, providing access to all TIGR in-house data. This genome browser comes with some very useful features, such as the graphical display of alternative sources of annotations (e.g. multiple gene-finder predictions), matches of gene products to characteristic sequence motifs and so on. However, the software system cannot be installed locally and users are not able to upload their own genomic data along with annotations or predictions, so it has to be considered as an interface to TIGR genomic data.
The ENSEMBL genome browser is a valuable set of genome analysis and visualization tools, offering a functional working environment for web-based data mining and information viewing. Individual gene products are linked to automatically created annotation, while users can save and alter annotations when new experimental evidence becomes available. Moreover, the entire package can be downloaded and used for individual research purposes.
ERGO is a commercial software suite with excellent capabilities, including metabolic pathway information, but it is mostly data-centric. More important than its great computational power is the underlying data annotation, which is handled by a team of experts. The freely available ERGO Light version offers access to a smaller amount of data with the same computational tools, but users cannot upload their own data to the server.
GeneViTo, like all the above software resources, comes with an intuitive, user-friendly GUI, allowing easy navigation through the vast amount of genomic data. As a stand-alone application, GeneViTo enables the incorporation of user-defined data: genomes, annotations and/or computational predictions. Its main advantage is the clear presentation of sequence feature distribution along genomic regions. At the input layer, GeneViTo uses data from public databases and free bioinformatics tools, so in most cases users will be able to visualize available data through a simple mouse-click. Simultaneous display of computationally predicted features along with available annotations (often derived by computational means as well) provides a useful environment, which may complement the already existing tools for genome annotation and visualization.
Genome-wide analysis is a difficult task due to the large amount of primary data and the "non-productive" way in which they are stored and displayed. Several genome viewers already exist, each one of them serving different needs and research interests. GeneViTo offers an easy to use computer environment that combines experimental data with the results of prediction algorithms. Primary data and meta-data are brought together in a clear display that instantly offers an intuitive view of a genome, with a large amount of biological information at hand. The information offered can lead to valuable conclusions and cover a wide variety of biological questions concerning entire organisms.
GeneViTo has already been applied to visualize the genomes of two microbial organisms: the bacterium C. trachomatis and the archaeon M. jannaschii. Future plans to extend the software platform include the ability to handle multichromosomal genomes as unique sets and the simultaneous display and analysis of complete genomes. In order to achieve that goal, some modifications will be necessary, as we have to efficiently handle the differences in eukaryotic genome organization (e.g. exon-intron structure). Such issues should be taken into account both in the data storage and handling processes and in the visualization philosophy. Computational issues, such as memory requirements, are not expected to be a limiting factor, as illustrated in Table 1.
GeneViTo is available for download from the authors upon request. Download instructions, along with the "ready to run" microbial genome files and a detailed online manual for preparing and viewing genomic data, are freely available at the URL http://bioinformatics.biol.uoa.gr/GENEVITO/index.html.
Availability and Requirements
Project Name: GeneViTo
Project Home Page: http://bioinformatics.biol.uoa.gr/GENEVITO/index.html
Operating System(s): Extensively tested on Windows, Linux (Intel). Theoretically, GeneViTo should work on any other platform with Java Runtime Environment (JRE) 1.4.1 installed.
Programming Language: JAVA
Other requirements: Installation of JRE 1.4.1.
License: Free for Academic use.
Any restrictions: None.
GUI: Graphical User Interface
COGs: Clusters of Orthologous Groups
The authors wish to thank the two anonymous referees for their valuable comments and suggestions that substantially helped to improve the final manuscript. Pantelis Bagos MSc kindly provided help on performing and presenting the Benchmark tests.
- Bernal A, Ear U, Kyrpides N: Genomes OnLine Database (GOLD): a monitor of genome projects world-wide. Nucleic Acids Res 2001, 29: 126–7. doi:10.1093/nar/29.1.126
- GOLD: Genomes OnLine Database [http://wit.integratedgenomics.com/GOLD/]
- Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL: GenBank. Nucleic Acids Res 2003, 31: 23–7. doi:10.1093/nar/gkg057
- Boeckmann B, Bairoch A, Apweiler R, Blatter MC, Estreicher A, Gasteiger E, Martin MJ, Michoud K, O'Donovan C, Phan I, Pilbout S, Schneider M: The SWISS-PROT protein sequence database and its supplement TrEMBL in 2003. Nucleic Acids Res 2003, 31: 365–70. doi:10.1093/nar/gkg095
- Zdobnov EM, Lopez R, Apweiler R, Etzold T: The EBI SRS server – recent developments. Bioinformatics 2002, 18: 368–73. doi:10.1093/bioinformatics/18.2.368
- Schuler GD, Epstein JA, Ohkawa H, Kans JA: Entrez: molecular biology database and retrieval system. In: Methods in Enzymology (Edited by: Doolittle RF). San Diego: Academic Press 1996, 266: 141–62.
- Abril JF, Guigó R: gff2ps: visualizing genomic annotations. Bioinformatics 2000, 16: 743–4. doi:10.1093/bioinformatics/16.8.743
- Rutherford K, Parkhill J, Crook J, Horsnell T, Rice P, Rajandream M-A, Barrell B: Artemis: sequence visualization and annotation. Bioinformatics 2000, 16: 944–5. doi:10.1093/bioinformatics/16.10.944
- Hu Z, Frith M, Niu T, Weng Z: SeqVISTA: a graphical tool for sequence feature visualization and comparison. BMC Bioinformatics 2003, 4: 1. doi:10.1186/1471-2105-4-1
- NCBI Map Viewer [http://www.ncbi.nih.gov/mapview/]
- TIGR Genome Browser [http://www.tigr.org/tigr-scripts/CMR2/choose_genome.spl]
- Hubbard T, Barker D, Birney E, Cameron G, Chen Y, Clark L, Cox T, Cuff J, Curwen V, Down T, Durbin R, Eyras E, Gilbert J, Hammond M, Huminiecki L, Kasprzyk A, Lehvaslaiho H, Lijnzaad P, Melsopp C, Mongin E, Pettett R, Pocock M, Potter S, Rust A, Schmidt E, Searle S, Slater G, Smith J, Spooner W, Stabenau A, Stalker J, Stupka E, Ureta-Vidal A, Vastrik I, Clamp M: The Ensembl genome database project. Nucleic Acids Res 2002, 30: 38–42. doi:10.1093/nar/30.1.38
- Overbeek R, Larsen N, Walunas T, D'Souza M, Pusch G, Selkov E Jr, Liolios K, Joukov V, Kaznadzey D, Anderson I, Bhattacharyya A, Burd H, Gardner W, Hanke P, Kapatral V, Mikhailova N, Vasieva O, Osterman A, Vonstein V, Fonstein M, Ivanova N, Kyrpides N: The ERGO™ genome analysis and discovery system. Nucleic Acids Res 2003, 31: 164–79. doi:10.1093/nar/gkg148
- Pasquier C, Promponas VJ, Hamodrakas SJ: PRED-CLASS: cascading neural networks for generalized protein classification and genome-wide applications. Proteins: Structure, Function, and Genetics 2001, 44: 361–9. doi:10.1002/prot.1101
- Pasquier C, Hamodrakas SJ: An hierarchical artificial neural network system for the classification of transmembrane proteins. Protein Eng 1999, 12: 631–4. doi:10.1093/protein/12.8.631
- Liakopoulos TD, Pasquier C, Hamodrakas SJ: A novel tool for the prediction of transmembrane protein topology based on a statistical analysis of the SwissProt database: the orienTM algorithm. Protein Eng 2001, 14: 387–90. doi:10.1093/protein/14.6.387
- Nielsen H, Engelbrecht J, Brunak S, von Heijne G: Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites. Protein Eng 1997, 10: 1–6. doi:10.1093/protein/10.1.1
- EBI Proteome Analysis Server [http://www.ebi.ac.uk/proteome/index.html]
- Tatusov RL, Natale DA, Garkavtsev IV, Tatusova TA, Shankavaram UT, Rao BS, Kiryutin B, Galperin MY, Fedorova ND, Koonin EV: The COG database: new developments in phylogenetic classification of proteins from complete genomes. Nucleic Acids Res 2001, 29: 22–28. doi:10.1093/nar/29.1.22
- TIGR, Gene pairs for Methanococcus jannaschii [http://www.tigr.org/tigr-scripts/operons/pairs.cgi?taxon_id=57]
- Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25: 3389–402. doi:10.1093/nar/25.17.3389
- Krogh A, Larsson B, von Heijne G, Sonnhammer ELL: Predicting Transmembrane Protein Topology with a Hidden Markov Model: Application to Complete Genomes. J Mol Biol 2001, 305: 567–80. doi:10.1006/jmbi.2000.4315
This article is published under license to BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.