CoreGenes: A computational tool for identifying and cataloging "core" genes in a set of small genomes
© Zafar et al 2002
Received: 27 November 2001
Accepted: 24 April 2002
Published: 24 April 2002
Improvements in DNA sequencing technology and methodology have led to the rapid expansion of databases comprising DNA sequence, gene and genome data. Lower operational costs and heightened interest resulting from initial intriguing novel discoveries from genomics are also contributing to the accumulation of these data sets. A major challenge is to analyze and to mine data from these databases, especially whole genomes. There is a need for computational tools that look globally at genomes for data mining.
CoreGenes is a global JAVA-based interactive data mining tool that identifies and catalogs a "core" set of genes from two to five small whole genomes simultaneously. CoreGenes performs hierarchical and iterative BLASTP analyses using one genome as a reference and another as a query. Subsequent query genomes are compared against each newly generated "consensus." These iterations lead to a matrix comprising related genes from this set of genomes, e.g., viruses, mitochondria and chloroplasts. Currently the software is limited to small genomes on the order of 330 kilobases or less.
A computational tool, CoreGenes, has been developed to analyze small whole genomes globally. BLAST score-related and putatively essential "core" gene data are displayed as a table with links to GenBank for further data on the genes of interest. This web resource is available at http://pumpkins.ib3.gmu.edu:8080/CoreGenes or http://www.bif.atcc.org/CoreGenes.
The development of genomics instrumentation, technology and methodology, as well as their integration and deployment in many fields of research, has evolved from producing manageable small streams of DNA sequence data to generating an inundating amount of DNA sequence and whole genome data. This massive amount of raw DNA sequence data can be described simply and aptly as a "tsunami" – a tremendous and unexpected wave. An unprecedented wave can be either overwhelming or overwhelmed, depending upon the preparedness of investigators. Preparations include having available or developing appropriate computational tools. One particular area of continuing concern is the ability to separate interesting and relevant data from "noise." This process is known as data mining and is enhanced by the development of effective and "user-friendly" bioinformatics tools and computational methods [1–10].
Many researchers have been interested in studying individual proteins, identifying single genes and characterizing putative genes, i.e., "open reading frames." There are several sets of bioinformatics tools that allow researchers to characterize DNA sequences, in particular, to locate and define the above single or few gene entries. Examples include the GCG Wisconsin package and the Staden package [9, 10]. However, there is also a growing subset of researchers who are interested in "whole genome" analyses [3, 7, 8]. For these researchers, there are still gaps in the data mining tool set available for the global study of genomes. This paper addresses the critical need for more "whole genome" tools by describing CoreGenes, which is intended to expedite the global analysis of genomes in order to locate, identify and catalog related genes that may constitute a "core" of essential genes common to these related organisms. A matrix of closely related and putatively homologous genes allows a better understanding of the relationships and organization of genomes and their host organisms.
JAVA-based software allows distributed programming over the Web, providing an interactive graphical user interface (GUI). Earlier, we took advantage of this resource to develop GeneOrder2.0, a tool that allows the determination of gene co-linearity. We have again used this public domain JAVA software resource (Apache Software Foundation) to develop another computational tool. CoreGenes allows for "user-friendly" visualizations of putative gene identity in sets of two to five small whole genomes, related or unrelated, in a GUI environment. The URL address for this software is http://pumpkins.ib3.gmu.edu:8080/CoreGenes or http://www.bif.atcc.org/CoreGenes.
Comparative genomics and more specialized fields such as comparative virology involve the comparison of DNA sequences, genes and genomes [12–14]. Recent rapid data acquisition is allowing the analyses of whole genome sequences, especially the smaller genomes such as mitochondria and chloroplasts [15–17], as well as the larger bacterial genomes [18, 19] and large tracts of eukaryotic chromosomes, especially from related organisms [12–14, 20–23]. These studies include the determination of the order of genes, i.e., co-linearity [24, 25], the location of synteny [26–28] and the identification of clusters of orthologous genes [cog] between two genomes [21–23]. Along similar lines of thought, it should be extremely useful to locate, identify and catalog the sets of "core" genes common to these genomes, which otherwise may be related, semi-related or unrelated in other respects. These global views allow for a deeper understanding of one organism in the context of another, especially with regard to their genomic contents. In addition, the comparison of multiple genomes and the identification of related genes and "core" genes can lead to insight into the structure and function of genes and genomes. This is very useful in genome annotations and also in the identification and characterization of functions for "newly found" putative genes.
Identification of "core" genes from small whole genomes is useful and complements other data derived from these genomes. Small genomes include those from viruses, mitochondria [14, 15] and chloroplasts. The increasing importance of the large amount of DNA sequence data recently collected from these small genomes is reflected in the better understanding of their biology [3–4, 12–14] and in the upsurge of publications analyzing these genomes and the organisms to which they belong [15–28]. Genome co-linearity, gene clustering and homolog identification are three global genome analyses which are important in many fields of research, including resolving phylogenetic and evolutionary relationships [15–17].
One obvious application is to use this tool as a step in the characterization of an "alphabet" of putatively essential "core" genes in a set of closely related genomes such as from a collection of poxvirus genomes (unpublished data).
Once the program is initiated, the respective genome data are downloaded from the GenBank database (Figure 1). These genome sequence data are subsequently parsed into protein-coding sequences [as annotated in the GenBank database] and are converted by CoreGenes into "GeneOrder2.0-FASTA" format [7, 8, 29, 31]. Comparisons are performed and the results are presented in a tabular format in the subsequent GUI. Each gene has a hyperlink to its entry in the NCBI database.
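As an illustration of the conversion step, the following is a minimal Python sketch that writes annotated protein-coding sequences out as plain FASTA text. The exact layout of the "GeneOrder2.0-FASTA" format is not specified here, so the header layout (accession followed by a brief description), the function name `to_fasta` and the example protein identifiers are assumptions made purely for illustration; this is not the CoreGenes implementation.

```python
from typing import Iterable, Tuple

# (accession, brief description, amino-acid sequence); all values are hypothetical.
Protein = Tuple[str, str, str]


def to_fasta(proteins: Iterable[Protein], width: int = 60) -> str:
    """Render proteins as FASTA text, wrapping sequences at `width` residues."""
    lines = []
    for accession, description, sequence in proteins:
        # Assumed header layout: ">accession description"
        lines.append(f">{accession} {description}".rstrip())
        for start in range(0, len(sequence), width):
            lines.append(sequence[start:start + width])
    return "".join(line + "\n" for line in lines)


# Hypothetical records standing in for CDS entries parsed from a GenBank file.
example = [
    ("PROT_0001", "hypothetical protein 1", "MKTAYIAKQRQISFVKSHFSRQ"),
    ("PROT_0002", "hypothetical protein 2", "MSLNE"),
]
fasta_text = to_fasta(example, width=10)
```

Each record becomes one header line followed by the wrapped sequence, which is the shape most sequence comparison programs accept as a protein database.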
BLASTP protein similarity analyses between the reference sequence and the first query sequence are performed sequentially, with each query protein compared individually to the entire protein database of the reference genome. This is similar to the algorithm for the GeneOrder analyses [7, 8]. If the alignment score between the reference protein and a query protein meets or exceeds a defined similarity threshold number, then the proteins are paired and their accession numbers stored. A consensus map of related genes is generated and stored. Hierarchical comparisons with additionally entered query genomes, up to four in total, are performed in each session.
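The hierarchical comparison described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the CoreGenes source: the names `core_genes` and `toy_score` are hypothetical, the scoring function is a crude stand-in for a BLASTP alignment score, and keeping only the best-scoring query protein per reference protein is an assumption about how ties and multiple hits are resolved.

```python
from typing import Callable, Dict, List

Genome = Dict[str, str]  # protein accession -> amino-acid sequence


def toy_score(a: str, b: str) -> float:
    """Placeholder for a BLASTP score: 10 points per identical aligned position."""
    return 10.0 * sum(x == y for x, y in zip(a, b))


def core_genes(reference: Genome, queries: List[Genome],
               score: Callable[[str, str], float],
               threshold: float = 75.0) -> List[List[str]]:
    """Return rows of accessions, one row per gene found in every genome."""
    # Every reference protein starts as a candidate "core" gene.
    consensus = [[ref_id] for ref_id in reference]
    for query in queries:
        next_consensus = []
        for row in consensus:
            ref_seq = reference[row[0]]
            best_id, best_score = None, threshold
            for q_id, q_seq in query.items():
                s = score(ref_seq, q_seq)
                if s >= best_score:  # meets or exceeds the threshold
                    best_id, best_score = q_id, s
            if best_id is not None:
                next_consensus.append(row + [best_id])
        # Only rows matched in this query survive into the next consensus.
        consensus = next_consensus
    return consensus
```

Each pass shrinks the consensus to the genes that also have a match in the newest query genome, so the final rows correspond to the "core" set shared by all genomes entered.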
However, permutations of the five genomes must be analyzed in order to collect the comprehensive set of putatively related core genes. Given the five genomes to be queried, this task is daunting manually. Of course it would be useful to generate a table of genes that bin across only 2, 3 or 4 genomes. This is being addressed actively. It is anticipated that this comprehensive table of genes including rows with matches across only two, three or four genomes will be made available in the near future. Meanwhile, upon the completion of the above algorithm, a table containing the extracted GenBank data and summarizing the "core" genes within the queried genomes is generated (Figure 3 bottom panel). The columns of this table can also be exported via "cut and paste" into Microsoft Excel and Word programs to generate publication quality figures.
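To give a sense of the scale of this bookkeeping, the short Python sketch below enumerates the orderings and partial subsets for the five mitochondrion accession numbers used in this study. Whether every ordering of reference and query genomes must be run, or only each choice of reference, is not specified here, so treating all orderings as distinct runs is an assumption; the function names are hypothetical.

```python
from itertools import combinations, permutations
from typing import List, Sequence, Tuple


def all_orderings(accessions: Sequence[str]) -> List[Tuple[str, ...]]:
    """Every ordering of the genomes; the first entry acts as the reference."""
    return list(permutations(accessions))


def partial_subsets(accessions: Sequence[str]) -> List[Tuple[str, ...]]:
    """Subsets covering at least two genomes but not the full set."""
    return [subset
            for size in range(2, len(accessions))
            for subset in combinations(accessions, size)]


mitochondria = ["NC_001807", "NC_001323", "NC_001328", "NC_001709", "NC_001326"]
orderings = all_orderings(mitochondria)  # 5! = 120 orderings
subsets = partial_subsets(mitochondria)  # 10 + 10 + 5 = 25 subsets
```

With five genomes there are 120 possible orderings and 25 subsets of two, three or four genomes, which illustrates why a manual survey is impractical.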
Accession numbers of each gene and very brief descriptions are presented in each individual block within this matrix, as extracted directly from the GenBank database. Each individual gene is hyperlinked from this table to the NCBI website to allow the investigator an opportunity to view the unique GenBank file for the gene of interest.
CoreGenes generates a matrix of "core" genes for two to five genomes analyzed simultaneously. Currently, it is limited to small whole genomes, ca. 330 kb or smaller, as long as the data have been annotated in GenBank. Genomes of 35 kb, 150 kb and 250 kb have also been analyzed successfully (data not shown). The upper limit has not been explored in great detail. One main drawback is the time it takes to do each analysis, as the GenBank server needs to be accessed for each paired genome analysis. An alternative being developed is an option to run CoreGenes in a "batch processing" mode where the analyzed data are e-mailed back to the user after a request submission.
The top-level window contains a field to define the minimum protein similarity score (i.e., the "BLASTP" threshold score). This can be either the default ("75") or a user-defined value. Score ranges are related to the similarities of the proteins being queried [7, 8]. For reference, the three similarity ranges that can be defined for running GeneOrder2.0 are highest ("A"), high ("B") and low ("C") [7, 8]. The BLASTP threshold score ranges for each are as follows: "A" is defined from [200-∞), "B" is defined from [100–200) and "C" is defined from [75–100). Genes with matches in the "A" range are true homologs, while those in the "B" range are likely related and those in the "C" range require visual validation of the level of identity in order to ensure a true match. Related gene matching values for CoreGenes are also defined in this manner. Caveat: it is always recommended that the results between two BLAST matches be scrutinized, as reports have suggested that the closest BLAST match is often not the nearest neighbor.
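The half-open score ranges above translate directly into a small classifier. The following Python sketch is illustrative only; the function name `similarity_range` is hypothetical, and returning `None` for scores below the default threshold of 75 is an assumption about how unmatched pairs are treated.

```python
from typing import Optional


def similarity_range(score: float) -> Optional[str]:
    """Map a BLASTP score to range "A", "B" or "C"; None if below 75."""
    if score >= 200:
        return "A"  # [200, infinity): highest similarity
    if score >= 100:
        return "B"  # [100, 200): high similarity
    if score >= 75:
        return "C"  # [75, 100): low similarity, needs visual validation
    return None
```

Because each range includes its lower bound and excludes its upper bound, a score of exactly 100 falls in "B" and a score of exactly 200 falls in "A".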
This tool has been validated with analyses of several diverse virus, chloroplast and mitochondrion genomes. For example, a set of four chloroplast genomes (Figures 2 and 3) and a set of five mitochondrion genomes (data not shown) from evolutionary divergent sets of organisms were run independently to demonstrate the power and capabilities of CoreGenes. Shown in Figure 3 is an output from one of these analyses. With the BLASTP threshold score set at "75," the "core" genes are cataloged and displayed with brief identifying information from the GenBank database. Sixty-one "core" genes were cataloged from the set of chloroplast genomes (data not shown). The genomes are as follows: Arabidopsis thaliana, NC_000932; Nicotiana tabacum, NC_001879; Oryza sativa, NC_001320; and Chlorella vulgaris, NC_001865. Mitochondrion genomes are as follows: Homo sapiens (NC_001807), Gallus gallus (NC_001323), Caenorhabditis elegans (NC_001328), Drosophila melanogaster (NC_001709) and Schizosaccharomyces pombe (NC_001326). An analysis was also performed with a mixture of mitochondrion and chloroplast genomes. Interestingly, several putatively related genes were detected in this particular analysis (data not shown).
In addition to the aforementioned chloroplast and mitochondrion genomes, and of more interest to our research group, CoreGenes has been validated with virus genomes ranging in size from 35 kb to 330 kb (data not shown). Specifically, it has been run with combinations and permutations of adenovirus genomes, ca. 35 kb (NC_001405, NC_001406, NC_002067, NC_001454, NC_001460, NC_000942, NC_001813 and NC_002501), poxvirus genomes, ca. 250 kb (NC_001559, NC_001266, NC_001266, NC_003027, NC_001132, NC_001731 and NC_002642), and other viruses of varying sizes: ca. 150 kb (e.g., baculoviruses: Helicoverpa armigera nucleopolyhedrovirus G4, NC_002654 and Lymantria dispar nucleopolyhedrovirus NC_001973) and ca. 330 kb (Paramecium bursaria Chlorella virus 1, NC_000852).
A group of three chordopox viruses (vaccinia NC_001559, Molluscum contagiosum virus NC_001731, and fowlpox virus NC_001266) and two entomopox viruses (Melanoplus sanguinipes entomopoxvirus NC_001993 and Amsacta moorei entomopoxvirus NC_002520) was analyzed with CoreGenes. With related genomes such as these, the data can also be used as a predictive tool for the elucidation of an "alphabet" of essential genes especially in collaboration with "wet bench" analyses such as the characterization of temperature sensitive mutants, for example, poxviruses (data not shown).
CoreGenes run time is a function of the network connections. If one party, such as the NCBI server, is experiencing heavy traffic or is down due to technical difficulties, then the application will stall and be unsuccessful. Sets of orthopoxviruses, ca. 250 kb, take approximately 25 minutes to run on a PowerMac G3 running Mac OS 9.0 and Netscape Communicator 6.1. Larger genomes are currently problematic due to the computational speed, the NCBI server and/or the user's connection timing out. This issue is being addressed.
Some network "firewalls" may be incompatible with this software, causing the connections to terminate prematurely. An error message "An internal error has occurred. Please try again later java.lang.NullPointerException." will be displayed. Entering incorrect accession numbers may also give this same message. In contrast, CoreGenes has been run successfully on university and public library terminals with internet access. These organizations do not seem to have the same "firewall" needs or concerns as other organizations.
CoreGenes has been validated with several different platforms and also with different web browsers: Macintosh (Explorer 4.5 and Netscape 6.1), PC (Explorer 5.0 and Netscape 4.08), SGI (Netscape) and SUN (Netscape) workstations. There are compatibility issues between CoreGenes and Macintosh (Netscape 4.7 and below). Using Netscape 6.1 surmounts these problems. This problem appears to lie in the JAVA applet included with the earlier version of Netscape for Macintosh. Moving an Apple-supplied "JAVA Accelerator for PowerPC" into the "extensions" folder may allow earlier versions of Netscape to run this program. Printing the CoreGenes applet-generated graph may be problematic due to an applet incompatibility; capturing the graph as a "screenshot" via the PC and the Mac platforms and printing independently circumvents this.
Run times vary from 1 minute and 21 seconds for a set of five adenovirus genomes (ca. 35 kb) to 40 minutes for a set of five poxvirus genomes (ca. 250 kb). Currently, if there are multiple requests, the computation may take much longer as the requests are queued. This inconvenience is being addressed and is due to the server hosting the software. Depending on the hardware, some local servers may time out during this period while waiting for this request to be processed, which will result in an error message stating that "The attempt to load 'servlet' failed." Adjusting "preference" settings on the local web browser may rectify this problem. Immediate goals of improvement include an option to have results e-mailed back to the user. We expect that there will be additional improvements in both speed and response issues when we upgrade our server hardware and rewrite some of the CoreGenes software to accommodate the larger megabase-sized genomes.
Only the NCBI database can be searched at this time; in other words, only GenBank accession numbers can be used. If there is an operator error in entering the number correctly, then an error message will be displayed, e.g., "The attempt to load 'servlet' failed." Improvements to this software will include providing an additional field to enter proprietary and non-GenBank genome data, similar to an option developed for GeneOrder2.0 .
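Because a mistyped accession number only surfaces as a generic error after submission, a simple check before submission can be useful. The Python sketch below is an assumption-laden illustration: the pattern is inferred solely from the accession numbers quoted in this article ("NC_" followed by six digits, with an optional version suffix), other valid GenBank accession formats exist and would be rejected by it, and the function names are hypothetical.

```python
import re
from typing import List, Sequence

# Pattern inferred from the accessions quoted in this article, e.g. NC_000932.
GENOME_ACCESSION = re.compile(r"NC_\d{6}(\.\d+)?")


def looks_valid(accession: str) -> bool:
    """True if the text matches the assumed accession pattern."""
    return GENOME_ACCESSION.fullmatch(accession.strip()) is not None


def check_request(accessions: Sequence[str]) -> List[str]:
    """Return a list of problems; an empty list means the request looks fine."""
    problems = []
    if not 2 <= len(accessions) <= 5:
        problems.append("CoreGenes accepts two to five genomes per run")
    problems.extend(f"unexpected accession format: {a!r}"
                    for a in accessions if not looks_valid(a))
    return problems
```

For example, `check_request(["NC_000932", "NC_00187"])` reports the second entry as malformed, which catches a dropped digit before any request is sent to the server.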
CoreGenes fits into the niche for GUI-based interactive computational tools [1–10] that enhance the visualization of DNA sequence data, especially in the context of genome comparisons. It meets a critical need for tool sets containing global "whole genome" analysis tools. As noted earlier, small genomes are still of great interest to many researchers. This tool is a base to expand upon, for example, to build more robust, elegant and complementary "whole genome" computational tools. Although CoreGenes successfully expedites the determination of "core" genes during the comparisons of several small whole genomes simultaneously, it will likely be succeeded by improved software to compare and analyze much larger genomes, especially in the megabase range. This feature is being pursued with urgency. One known current limitation in analyzing larger genomes is computational, e.g., hardware; this will be addressed shortly. Increasingly powerful workstations acting as servers will allow the much more computationally intensive comparisons of megabase-sized genomes. Nevertheless, this version of CoreGenes is very useful and fills a current unmet need in genome analyses, that of collecting related genes in a family of genomes. In addition to stimulating the development of similar tools, CoreGenes itself remains open to continuing improvement. We plan to support this version of CoreGenes actively, updating it with improvements and additional features, and to work on a more robust, faster version.
This article is published under license to BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.