ProGenExpress: Visualization of quantitative data on prokaryotic genomes
© Watson; licensee BioMed Central Ltd. 2005
Received: 09 February 2005
Accepted: 13 April 2005
Published: 13 April 2005
The integration of genomic information with quantitative experimental data is a key component of systems biology. An increasing number of microbial genomes are being sequenced, leading to an increasing amount of data from post-genomics technologies. The genomes of prokaryotes contain many structures of interest, such as operons, pathogenicity islands and prophage sequences, whose behaviour is of interest during infection and disease. There is a need for simple and novel tools to display and analyse data from these integrated datasets, and we have developed ProGenExpress as a tool for visualising arbitrarily complex numerical data in the context of prokaryotic genomes.
Here we describe ProGenExpress, an R package that allows researchers to easily and quickly visualize quantitative measurements, such as those produced by microarray experiments, in the context of the genome organization of sequenced prokaryotes. Data from microarrays, proteomics or other whole-genome technologies can be accurately displayed on the genome. ProGenExpress can also search for novel regions of interest that consist of groups of adjacent genes that show similar patterns across the experimental data set. We demonstrate ProGenExpress with microarray data from a time-course experiment involving Salmonella typhimurium.
ProGenExpress can be used to visualize quantitative data from complex experiments in the context of the genome of sequenced prokaryotes, and to find novel regions of interest.
The genomes of prokaryotic organisms contain many structures that may be involved in pathogenicity, including a variety of operons, pathogenicity islands and prophage sequences. Operons are sets of adjacent genes in bacteria that form a single transcriptional unit, and many, such as those coding for flagella  or fimbriae , have been implicated in pathogenicity. Pathogenicity islands are distinct regions of the genome that confer virulence upon the host, and are found in many pathogens of humans, animals and plants, and at least ten pathogenicity islands have been identified in Salmonella alone . Prophage sequences represent the chromosomes of bacteriophage integrated as part of the genome of the bacterial host, and have also been implicated in pathogenicity in several species .
In order to study the behaviour of these elements, it is essential to integrate information about the genome structure of an organism with quantitative measurements produced by post-genomic technologies, such as those from microarray or proteomics experiments. This integrative biology approach is a key feature of systems biology. Studying the behaviour of these genomic elements, and other groups of adjacent genes, during infection and disease may reveal important information about the molecular mechanisms underlying pathogenicity.
Several microbial genome viewers have been developed which allow quantitative data to be displayed on the genome. The Microbial Genomes Viewer  offers a good online solution, however users must install a browser plug-in and may not be comfortable transmitting data over the internet. GenoMap  can be used to create plots of microarray data on microbial genomes, and is available as Tcl/TK source code. Genome2D  also offers good visualisation of quantitative data on microbial genomes, but is limited to the Windows operating system. Finally, GenomeViz  has recently been released, which offers much functionality, including visualisation of quantitative data, genome alignments and GC content. However this software is currently limited to unix-based systems. All of the above solutions are limited in two respects. Firstly, the quantitative values are represented as a colour-scale, which reduces the accuracy of the data and which may present problems in comparing one colour to the next. Secondly, the above tools can only display a single value for each gene, which precludes the visualisation of more complex data, such as a time-course experiment.
ProGenExpress is released as a package for R. R is a freely available, open-source statistical package  that is widely used in the biological community. R has very powerful statistical and graphical capabilities, and many add-on packages are freely available. The bioconductor project [10, 11] provides a huge number of add-on packages for R, covering a wide range of biological data analysis applications, and the implementation of ProGenExpress in R provides seamless integration with many of these packages. ProGenExpress is written in the native R language and has been fully tested on both windows and linux. R is available for windows, linux, unix and MacOS (including MacOS X).
Results and discussion
ProGenExpress has been written to allow researchers to quickly and simply visualize the behaviour of bacterial genomic regions of any size during experiments using whole genome technologies, such as microarray or proteomics experiments. For information relating to the genome organisation of prokaryotes, ProGenExpress includes functions for downloading and reading both NCBI .ptt files, which describe the location of protein coding genes in bacteria in a tabular format, and include links to the COGs database , and whole genome RefSeq entries . For the quantitative experimental data, ProGenExpress can use the objects created by many of the packages from the bioconductor project [10, 11], or data imported into R from text files, SQL databases and Excel.
There are currently 225 completed prokaryotic genomes in RefSeq  that ProGenExpress can read, and though the utility of ProGenExpress is demonstrated here using microarray data, any kind of numerical data that can be linked to the genes of prokaryotic organisms can be displayed using ProGenExpress. Where measures of the statistical significance of the data points for each gene are available, these can be passed to the plotting functions of ProGenExpress, with the result that those genes that are not significant will be plotted in white and those that are significant will be plotted in their normal plotting colour.
The genome is represented as two barplots, one for each strand. Each gene has a number of bars equal to the number of experimental data sets included, allowing time-course or complex strain/treatment experiments to be plotted. Distance between the bars for each gene is representative of intergenic distance. Slices of the genome can be selected either by base range, gene synonym or gene name. Both horizontal and vertical plots are possible, and bars can be coloured either by numerical value or by COGs  functional category.
The software is demonstrated here using microarray data from Eriksson et al . This data set consists of gene expression measurements from intracellular Salmonella typhimurium at 4, 8 and 12 hours post murine macrophage infection. Gene expression values were calculated as the relative expression level of test RNA to that of RNA from bacteria grown in vitro, and the data has been centred and normalised according to Eriksson et al . Data from Erikson et al is available as a spreadsheet . This spreadsheet was pre-processed to contain only columns for gene synonym, gene name and relative expression level of test RNA to control RNA on a log 2 scale for each of the three time points. The spreadsheet was saved as tab-delimited text and read in to R using the read.table() function. The S typhimurium genome and plasmid sequences were read in to R using the read.ptt() function, with RefSeq files NC_003197.ptt and NC_003277.ptt respectively. The microarray data was linked to the gene location data using the linkem.avg() function. Images of the microarray data on both the entire genome and the plasmid were then generated using the plotrange() and plotrange.vertical() functions in conjunction with jpeg(), an internal R function. The results were viewed in Internet Explorer. Finally, the find.region() function was used to find regions of interest as described below.
ProGenExpress has several advantages over existing software. The package seamlessly integrates with the bioconductor project and the many packages available in R for microarray analysis, including limma, marray and affy, and is available for both Windows and Linux, amongst others. Both horizontal and vertical plots are possible, and an unlimited number of data points for each gene can be plotted, allowing for the visualization and analysis of complex time course or strain/treatment experiments. Furthermore, the bar-plots display numerical data accurately, and do not rely on a colour-scale to depict values. Finally, the ability to search integrated genomic and post-genomic data sets for clusters of genes which behave similarly represents an opportunity for the discovery of novel genomic elements involved in pathogenicity.
We describe ProGenExpress, an open-source R package which allows researchers to quickly and easily visualise quantitative data from arbitrarily complex experiments in the context of the genome of sequenced prokaryotes. ProGenExpress can also be used to search for genomic regions which may represent coherent functional units. We show how ProGenExpress can be used to visualise microarray data from a time-course experiment on the genome of Salmonella typhimurium, and to find unannotated genomic regions that may be involved in pathogenicity. Future plans for the software include the ability to read data from ensembl databases, and the development of visualisation tools for eukaryotic genomes. Software updates and new releases will be available from the project home page.
Availability and requirements
List of abbreviations
Cluster of Orthologous Groups
Salmonella pathogenicity island II
This work was funded by the core strategic grant of the Institute for Animal Health, provided by the BBSRC.
- Robertson JMC, McKenzie NH, Duncan M, Allen-Vercoe E, Woodward MJ, Flint HJ, Grant G: Lack of flagella disadvantages Salmonella enterica serovar Enteritidis during the early stages of infection in the rat. J Med Micro 2003, 52: 91–99. 10.1099/jmm.0.04901-0View ArticleGoogle Scholar
- Baumler AJ, Tsolis RM, Heffron F: Contribution of fimbrial operons to attachment to and invasion of epithelial cell lines by Salmonella typhimurium. Infect Immun 1996, 64: 1862–65.PubMed CentralPubMedGoogle Scholar
- Parkhill J, Dougan G, James KD, Thomson NR, Pickard D, Wain J, Churcher C, Mungall KL, Bentley SD, Holden MT, Sebaihia M, Baker S, Basham D, Brooks K, Chillingworth T, Connerton P, Cronin A, Davis P, Davies RM, Dowd L, White N, Farrar J, Feltwell T, Hamlin N, Haque A, Hien TT, Holroyd S, Jagels K, Krogh A, Larsen TS, Leather S, Moule S, O'Gaora P, Parry C, Quail M, Rutherford K, Simmonds M, Skelton J, Stevens K, Whitehead S, Barrell BG: Complete genome sequence of a multiple drug resistant Salmonella enterica serovar Typhi CT18. Nature 2001, 413(6858):848–52. 10.1038/35101607View ArticlePubMedGoogle Scholar
- Banks DJ, Lei B, Musser JM: Prophage induction and expression of prophage-encoded virulence factors in group A Streptococcus serotype M3 strain MGAS315. Infect Immun 2003, 71(12):7079–86. 10.1128/IAI.71.12.7079-7086.2003PubMed CentralView ArticlePubMedGoogle Scholar
- Kerkhoven R, van Enckevort FH, Boekhorst J, Molenaar D, Siezen RJ: Visualization for genomics: the Microbial Genome Viewer. Bioinformatics 2004, 20(11):1812–14. 10.1093/bioinformatics/bth159View ArticlePubMedGoogle Scholar
- Sato N, Ehira S: GenoMap, a circular genome data viewer. Bioinformatics 2003, 19(12):1583–84. 10.1093/bioinformatics/btg195View ArticlePubMedGoogle Scholar
- Baerends RJ, Smits WK, de Jong A, Hamoen LW, Kok J, Kuipers OP: Genome2D: a visualization tool for the rapid analysis of bacterial transcriptome data. Genome Biol 2004, 5(5):R37. 10.1186/gb-2004-5-5-r37PubMed CentralView ArticlePubMedGoogle Scholar
- Ghai R, Hain T, Chakraborty T: GenomeViz: visualizing microbial genomes. BMC Bioinformatics 5(1):198. 10.1186/1471-2105-5-198
- Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, Hornik K, Hothorn T, Huber W, Iacus S, Irizarry R, Leisch F, Li C, Maechler M, Rossini AJ, Sawitzki G, Smith C, Smyth G, Tierney L, Yang JY, Zhang J: Bioconductor: open software development for computational biology and bioinformatics. Genome Biol 2004, 5(10):R80. 10.1186/gb-2004-5-10-r80PubMed CentralView ArticlePubMedGoogle Scholar
- Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR, Kiryutin B, Koonin EV, Krylov DM, Mazumder R, Mekhedov SL, Nikolskaya AN, Rao BS, Smirnov S, Sverdlov AV, Vasudevan S, Wolf YI, Yin JJ, Natale DA: The COG database: an updated version includes eukaryotes. BMC Bioinformatics 2003, 4(1):41. 10.1186/1471-2105-4-41PubMed CentralView ArticlePubMedGoogle Scholar
- Pruitt KD, Tatusova T, Maglott DR: NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res 2005, 33: D501–4. 10.1093/nar/gki025PubMed CentralView ArticlePubMedGoogle Scholar
- Eriksson S, Lucchini S, Thompson A, Rhen M, Hinton JCD: Unravelling the biology of macrophage infection by gene expression profiling of intracellular Salmonella enterica. Molecular Microbiology 2003, 47(1):103–118. 10.1046/j.1365-2958.2003.03313.xView ArticlePubMedGoogle Scholar
- NCBI Refseq completed microbial genomes[http://www.ncbi.nlm.nih.gov/genomes/MICROBES/Complete.html]
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.