ComPath: comparative enzyme analysis and annotation in pathway/subsystem contexts
© Choi and Kim; licensee BioMed Central Ltd. 2008
Received: 08 April 2007
Accepted: 06 March 2008
Published: 06 March 2008
Once a new genome is sequenced, one of the important questions is to determine the presence and absence of biological pathways. Analysis of biological pathways in a genome is a complicated task since a number of biological entities are involved in pathways and biological pathways in different organisms are not identical. Computational pathway identification and analysis thus involves a number of computational tools and databases and typically done in comparison with pathways in other organisms. This computational requirement is much beyond the capability of biologists, so information systems for reconstructing, annotating, and analyzing biological pathways are much needed. We introduce a new comparative pathway analysis workbench, ComPath, which integrates various resources and computational tools using an interactive spreadsheet-style web interface for reliable pathway analyses.
ComPath allows users to compare biological pathways in multiple genomes using a spreadsheet style web interface where various sequence-based analysis can be performed either to compare enzymes (e.g. sequence clustering) and pathways (e.g. pathway hole identification), to search a genome for de novo prediction of enzymes, or to annotate a genome in comparison with reference genomes of choice. To fill in pathway holes or make de novo enzyme predictions, multiple computational methods such as FASTA, Whole-HMM, CSR-HMM (a method of our own introduced in this paper), and PDB-domain search are integrated in ComPath. Our experiments show that FASTA and CSR-HMM search methods generally outperform Whole-HMM and PDB-domain search methods in terms of sensitivity, but FASTA search performs poorly in terms of specificity, detecting more false positive as E-value cutoff increases. Overall, CSR-HMM search method performs best in terms of both sensitivity and specificity. Gene neighborhood and pathway neighborhood (global network) visualization tools can be used to get context information that is complementary to conventional KEGG map representation.
ComPath is an interactive workbench for pathway reconstruction, annotation, and analysis where experts can perform various sequence, domain, context analysis, using an intuitive and interactive spreadsheet-style interface.
Comparative analysis of multiple genomes has become a very important research method as more genome sequences become available. Biological pathway analysis is also typically performed in comparison with multiple genomes of choice, using a number of computational tools and databases. A biological pathway involves a number of enzymes and its substrates and products. In addition, pathways interact with each other. Thus comparative analysis of pathways is quite complicated and can hardly be done without well designed pathway analysis software systems. A number of automated pathway comparison systems have been developed: The SEED , Pathway Tools , KAAS (KEGG Automatic Annotation Server) , KOBAS (KEGG Orthology-Based Annotation System) , Path-A (Pathway Analyst) , TIGR Comprehensive Microbial Resources , and JGI Integrated Microbial Genomes .
The SEED provides a web-based comparative genome annotation environment based on 'subsystems.' Subsystems are a set of functional roles found in any common biological process including metabolic pathways, phenotypes, or multi-subunit complex structures. Subsystems in multiple genomes are conveniently represented in a table format: functional roles (i.e. subsystems) are represented in columns, genomes represented in rows, and cells are populated with the genes having corresponding function. A populated subsystem in the table specifies which genomes include gene variants of the subsystem. This populated subsystem approach is an intuitive way to visualize pathway components of multiple genomes. Detection of subsystems in a large number of genomes is not trivial, thus a computational method is developed to detect subsystems automatically .
Pathway Tools uses the PathoLogic algorithm to determine enzymatic reactions catalyzed by each gene product in a query genome and then match detected reaction list against all available pathways from a reference database. PathoLogic accepts sequences of a fully annotated genome from GenBank  and MetaCyc  is used as a reference pathway database. PathoLogic uses the annotation information from GenBank, as opposed to sequence similarity information used in other systems, and the EC assignment as evidence for the presence of each pathway in the reference database for the genome of interest. Once the matching step is complete, PathoLogic infers a set of reactions expected to occur in the target genome and determines which of those pathways are likely to exist in the target genome.
KAAS and KOBAS are systems to annotate input protein sequences with KO (KEGG Ontology). KAAS provides functional annotation of genes by BLAST comparisons, single best hit (SBH) and bidirectional best hits (BBH), against the manually curated KEGG/GENES database [11, 12]. KO assignments to genes and predicted KEGG pathways are generated as output. KOBAS provides statistical significance tests for predicted pathways. KO terms are assigned based either on sequence similarity to entries in KEGG/GENES or on cross-database links in KEGG/GENES when a list of sequence identifiers is available in the databases. After KO assignment, frequently occurring or statistically significantly enriched pathways of the query sequences are identified in comparison with the background model.
Path-A (Pathway Analyst) takes a set of query protein sequences from a genome and identifies which sequences are likely to exist in any of its supported model pathways using several sequence analysis techniques (e.g. SVM, BLAST and HMM). The model pathway approach enables the pathway prediction algorithm to predict instances of a pathway with variations in pathway structure that were never observed in the training pathway set. Path-A currently provides abstract models for 10 pathways, spanning 125 genome-specific pathway instances.
Comprehensive Microbial Resources (CMR) at TIGR allows users to access all bacterial genome sequences completed to date. CMR provides two types of annotation resources: primary annotation from the genome sequencing center and TIGR annotation that is generated by an automated annotation process at TIGR. The CMR Pathway Tool kit consists of three categories of pathway analysis tools: "Genome Properties" provides information on the characteristics of organisms derived from genomic data and literature sources; "Genome Properties Detailed Comparison" provides detailed step information for a set of genomes users select; and "KEGG Pathway Display" highlights KEGG's pathway steps based on the presence or absence of EC evidence in the CMR.
Integrated Microbial Genomes (IMG) at JGI provides comparative analysis of microbial genomes in an integrated genome context. The data model underlying the IMG system incorporates primary genomic sequence information, computationally predicted and curated gene models, pre-computed sequence similarity data, functional annotation, and pathway information. Microbial genome data analysis in IMG is performed in the comparative context of multiple microbial genomes where a number of tools can be used to compare genomes in terms of organism-specific statistics, genes and sequence conservation.
Comparison to related works. ComPath is compared with five existing pathway analysis systems: The SEED, Pathway Tools, KAAS, KOBAS, and Path-A.
More interactive interface, more data integration
Annotation by EC, RC, and GO
Genomic data integration, pathway prediction and annotation
Simple gene annotation using KO
Annotation by KO with statistical evaluation
Annotation against model pathway
Reference Pathway Database
De novo generation by PathoLogic algorithm
10 model pathways – spanning 125 organism-specific pathway instances
FASTA Whole-HMM CSR-HMM PDB-domain search
Opt HMM BLAST-HMM BLAST Motif SVM HMM
EC, RC, GO
Pathway Tools Ontology, EC
CDD search, Prosite pattern search, phylogenetic tree analysis, BAG sequence clustering, Gibbs motif sampling, iGibbs
Gene cluster search, Phylogenetic tree analysis
Automatic generation based on information in KEGG/KGML
Automatic generation based on MetaCyc database
CGView, Gene cluster search
GBrowse, Gene cluster search
Genome browser at BioCyc site
▪ Easy-to-use interface: ComPath provides an easy-to-use, interactive pathway analysis and annotation environment/workbench by integrating multiple data sources and analysis tools into a single framework. ComPath represents biological pathways on a spreadsheet-style interface as the SEED system does, but our spreadsheet is designed to be more interactive so that users can directly perform sequence, motif, and context analyses on this spreadsheet.
▪ Flexible pathway assignment: ComPath provides a total of 327 model pathways combining both KEGG and The SEED: 205 pathways from KEGG database and 122 subsystems from The SEED.
▪ Sequence/motif analysis-oriented: ComPath uses structural domain and context information for pathway reconstruction, in addition to sequence homology information.
▪ Prediction and evaluation: ComPath, KAAS and KOBAS allow annotation of a whole genome in the pathway context. ComPath provides a suite of computational methods that can be used to verify predicted pathway components using sequence/motif/phylogeny analysis tools while KAAS and KOBAS allow use of standard sequence alignment method (BLAST) only.
▪ Genome context analysis: ComPath provides tools for gene cluster detection and visualization based on sequence similarity and position information.
▪ Pathway context analysis: Users can explore enzyme relationships in terms of global pathway networks of EC, GO (Gene Onology), and RxID (reaction ID), not just in terms of predefined pathways such as those in KEGG. This allows users to explore relationships among different predefined pathways with respect to an enzyme of choice.
Upon selection of a biological pathway and a set of genomes, users can perform the following tasks using the interactive enzyme-genome spreadsheet:
1. Enzyme sequences with the same EC category can be compared using various sequence- and structure-based computational tools.
2. Candidates for missing enzymes in particular genomes can be predicted and they can be further verified using computational tools. In this way, pathway holes may be filled in.
3. Un-annotated genomes can be easily compared against enzyme sequences in already well-annotated genomes in terms of a pathway of choice.
Detailed description of each step can be found in "The workflow" section. Below, we summarize important features of ComPath.
Feature 1: Interactive Spreadsheet for pathway data manipulation and analysis (see Figure 1 and Figure 2)
Users interact with the ComPath system using a spreadsheet style interface called "Table Data Type" . Unlike many other systems, users are able to perform various sequence and network analyses on the web simply by manipulating the interactive spreadsheet (See Figure 2). Computational analysis functions are grouped into five categories, denoted by buttons on the web: (i) "Enzyme Prediction" for predicting de novo enzymes, (ii) "Enzyme analysis" for comparing/verifying enzyme predictions, (iii) "Genome context" for searching and visualizing genomic context of a selected enzyme (iv) "Pathway context" for visualizing network context of enzymes, reactions, and GO terms and (v) "Annotation" for automatically annotating genes in a newly uploaded genome in terms of KEGG pathways. Users can either download the current spreadsheet (table) into a local computer or store it into a MySQL server at any time by clicking "Table download" or "My Account" buttons. Tables that are stored or downloaded can be uploaded to the server anytime to resume pathway analysis. Each function will be explained in "Workflow" section in detail.
Feature 2: Use of multiple enzyme prediction methods (see Figure 2)
One major goal of ComPath is to provide biologists with an exploratory computational environment/workbench to search for missing enzymes involved in a given biological pathway. To fill in pathway-holes, biologists can use sophisticated analysis methods in addition to standard sequence similarity methods such as BLAST  and FASTA . Sequence similarity-based comparison methods (e.g. BLAST and FASTA) generally work well, but they may detect many false positives and often fail to determine functions for many genes. As a result, 20–60% of the proteins in most sequenced genomes still remain as "hypothetical" .
To handle this problem, it is necessary to use computational tools other than sequence similarity-based methods. ComPath provides motif- and structural-domain-based search tools in addition to standard similarity-based gene search. Enzyme search methods used in ComPath are (i) Whole-HMM search, (ii) CSR-HMM search, a novel method of our own, (iii) PDB-domain search, and (iv) FASTA search. The Whole-HMM method builds a profile HMM model using "whole" enzyme sequences that belong to the same EC group and searches query genomes with the HMM model. In contrast, the CSR-HMM method uses "common shared region" that is automatically computed by the BAG clustering algorithm  to build profile HMM models. The PDB-domain search method first searches SCOPEC database to retrieve SCOP-to-EC mapping information. SCOPEC is a database of catalytic domains that combines structural domain information from SCOP, full-length sequence information from SwissProt, and verified functional information from the Enzyme Classification (EC) database. Once SCOP IDs are collected, the second step is searching ASTRAL SCOP domain sequence database (based on PDB SEQRES records) or SUPERFAMILY sequences (default setting). SUPERFAMILY sequences are those with longer than 30 residues (20 in ASTRAL) and shorter sequences that are parts of other sequences are removed when filtering on sequence identity. If multiple sequences are collected, CSR-HMM search is done against target genome(s). If only one sequence is retrieved, FASTA search is performed instead. The FASTA search with whole enzyme sequences is also available, but this is not recommended for enzyme search due to its low specificity; see "Empirical evaluation of enzyme prediction methods" section.
Users can evaluate predicted enzyme candidates (see Step3 in "Workflow" section). This step is necessary since predicted candidates are often false positives. For example, inspection on the existence of common CDD  domains or PROSITE  patterns of a given EC group can be used as an evidence to suspect enzyme candidates as false positives. This type of dynamic analysis can be performed easily in ComPath.
Feature 3: Physical context/gene neighborhoods analysis (see Figure 3)
ComPath provides two tools for visualizing gene neighborhood: an in-house physical context viewer (see Figure 3) and CGView for detecting and visualizing gene neighborhoods. The gene neighborhood, or physical context, is defined by setting a sequence similarity score cutoff for matching genes in different genomes and intergenic distance as a constraint. In a visualization of gene context using physical context viewer, genes are visualized with color scheme according to the EC category. Only two genomes can be compared in the current implementation. In addition, users can use the CGView genome browser  to navigate/visualize all enzymes in a genome and their positions. Physical context analysis is currently available only for prokaryotic genomes.
Feature 4: Global pathway context/network neighborhood analysis (see Figure 4)
The KEGG pathway map that we are utilizing in ComPath is a useful and convenient dissect of complex biological networks. However, there are two limitations in the pathway representation. First, each KEGG pathway is a union of all enzymes from all species participating in a given pathway theme (e.g. "Glycolysis") rather than "pathway" in a specific genome. Because of this, many reactions are often not coupled or connected in KEGG maps of a given genome. In addition, each KEGG pathway map is only a subset of global pathway network where numerous cross-talks among sub-networks exist. When the relationship is visualized, we can observe genes involved in the same functional theme are grouped as a module in a network of genes . In general, these modules are not consistent with pre-defined pathways maps (e.g. KEGG maps) and enzymes are often involved in more than one pathway.
The table-type pathway representation in ComPath that is based on KEGG map is not very effective in showing pathway interactions in the global network. Thus ComPath dynamically generates a genome-specific directed/undirected graph of a given biological network by parsing KEGG/KGML database that provides catalytic reaction information including substrate(s), product(s), reaction ID(s), and directionality (reversible, irreversible) of a enzyme reaction. The original representation of a given catalytic network in KEGG/KGML is a "directed metabolite graph." Its nodes are metabolites and edges are reaction IDs (i.e. RxID). Starting from this metabolite graph, ComPath creates Rx graph, EC graph, and GO graph, where nodes are reaction IDs, EC numbers, and GO terms, respectively. The conversion from the metabolite graph to the Rx graph is the same as described in a recent paper . EC graph is generated based on the Rx graph and RxID-EC relationship in KEGG/LIGAND database and the GO graph is derived from the EC graph by referring to EC2GO database. Due to the limitation of computational resources, ComPath visualizes only a sub-network of a given radius, a parameter the user can set a value to, from an enzyme which users are interested in. We recommend users to use Pajek  for detailed and faster global network analysis. ComPath generates Pajek input files of directed/undirected graph. In the current implementation, this feature only focuses on metabolic networks, but will be expanded to other biological networks.
Feature 5: Genome annotation
ComPath can be used to automatically generate a tentative annotation of a whole genome in terms of model pathways. Candidate enzymes in a new genome that are not in the KEGG database are matched to enzymes in model pathways and a new table is created by filling entries corresponding to known enzymes in the model pathway context. In this way, the new table can help users understand how likely the pathway or subsystem exists in the new genome since known KEGG pathways or subsystems in SEED provide meaningful context for potential gene matches. Users can use three different ways to define pathways and subsystems: KEGG pathways, SEED subsystems, and user-defined EC sets.
When an un-annotated or poorly annotated genome in FASTA format is uploaded onto the interactive spreadsheet, ComPath performs automatic enzyme candidate searches using FASTA or CSR-HMM methods. Once cells in the spreadsheet are populated with candidate enzymes, users can start to perform enzyme comparison analyses as described in the previous sections and the result can be used as evidence for gene function determination. Users should carefully select reference genomes that are phylogenetically close to the genome being annotated. In the current web implementation, only one genome is accepted for genome annotation.
Feature 6: Data management
The table can be easily edited, updated, and reloaded. At any time of pathway analysis, the table can be saved into either a local computer by downloading it or the MySQL data management server by clicking "My Account" button. To use the MySQL server, users need to register and get a user ID and password. Stored tables can be easily uploaded to the ComPath server and then the table at the time of the previous analysis can be re-generated by a single click on the web. In addition to the table, FASTA-format protein sequences can be stored in the MySQL database and retrieved later for annotation.
Step 1: EC-based pathway reconstruction (a spreadsheet table generation)
In case a new genome that is not among the genomes in KEGG is uploaded, ComPath searches the genome for candidate enzyme matches against selected genomes and fills in entries in the table with matches found. This step is called "EC-based pathway reconstruction." Detected genes are considered as candidates for pathway components. These candidates may need to be further verified (see Step 3 "Evaluation of enzyme classification"). Upon selection of a pathway and genomes, an interactive enzyme-genome spreadsheet, i.e. a table, is created on the web.
Step 2: Identification of missing enzyme candidates
Users can perform the missing gene search as follows. (i) Select an EC number, a genome, and query genes in the table based on BAG clustering (see Figure 5C/D/E) or other prior knowledge, (ii) Select computational methods for enzyme search (see Figure 5F), and (iii) Select which candidate matches from the search result will be added to the table (see Figure 5G and 5H). Figure 5I/J/K/L show prediction refinement step using Gibbs motif sampler (see Figure 5J), BAG clustering (see Figure 5K), and phylogenetic tree analysis (see Figure 5L).
Although the EC system effectively describes biochemical functions for most of known enzymes, it has several problems. For example, a simple EC-based pathway reconstruction tends to miss true enzymes because the number of "hypothetical" proteins without EC assignment continues to increase. In addition, the EC system was developed before sequence and structural information of enzymes was available, so it is not designed to match catalytic function to enzyme structure in terms of family and superfamily of homologous proteins . Thus, decisions on the pathway reconstruction and enzyme prediction using selected query sequences should be often guided by experts. We found that BAG clustering or other sequence classification methods (e.g. phylogenetic tree analysis) play an important role as shown in Figure 5D and the following two cases.
Case 1: Multi-subunits of enzyme complex
Case 2: Isozymes
Step 3: Evaluation of enzyme classification
Candidate enzymes predicted by sequence similarity often need to be examined further. ComPath provides several computational methods for further evaluation of candidate matches. Phylogenetic tree analysis is probably the most powerful method that visualizes the relationship among candidates and known enzyme sequences. ComPath uses PHYLIP package  to generate a phylogenetic tree by using the neighbor-joining algorithm. The phylogenetic tree can be viewed by a web browser (in PNG format) or ATV trees viewer program (via Java applet) . Multiple sequence alignment is also available to users while generating the phylogenetic tree. BAG sequence clustering program  is used for both the enzyme match and refinement step. Finally, Gibbs motif sampler  and iGibbs , our in-house motif detection algorithm, can be used to predict conserved regions and motifs in a set of enzyme sequences. Motif information from CDD  and PROSITE  are also provided.
Context-based sequence analysis techniques generally improve accuracy of gene function prediction made by using traditional sequence similarity-based search methods. The context involves a group of genes physically (e.g., gene neighbors on the chromosomes) or functionally (e.g., interacting proteins or those involved in the same pathway) related to the gene being analyzed. In particular, there is growing evidence that evolutionary relationships of multiple genomes can be understood better when metabolic pathways and functional context can be enforced by detailed reconstruction of relevant metabolic or other cellular pathways [35, 36].
Results and Discussion
Empirical evaluation of enzyme prediction methods
Sensitivity and specificity data
Then, three well-annotated query genomes (Salmonella enterica subsp. enterica serovar Typhi Ty2 (NC_004631), Vibrio cholerae O1 biovar eltor str. N16961 chromosome II (NC_002506), and Yersinia pestis CO92 (NC_003134) were selected as query genomes to test how correctly each search method detects metabolic pathway components (enzymes) using the "enzyme profile" of reference genomes. Sensitivity and specificity of each method were calculated after the enzyme profiles of each EC group were searched against these genomes.
If the EC number of a detected enzyme is the same as that of the enzyme profile, this enzyme was considered "true positive (TP)". "False positive (FP)" was the case that the EC numbers of the enzyme profile and detected enzyme were different (or no EC number was assigned to the enzyme). If an enzyme has a correct EC identifier, but was not detected, this enzyme was considered "false negative (FN)". Finally an enzyme is considered "true negative (TN)  if an enzyme was not detected and also has different (or no) EC number from the enzyme profile. Sensitivity (TP/(TP+FN)) and Specificity (TN/(TN+FP)) were calculated and the results plotted using GnuPlot 4.0.
According to our experiment, FASTA and CSR-HMM searches outperformed Whole-HMM and PDB-domain searches in terms of sensitivity, but FASTA search showed poor performance in terms of specificity, detecting more false positive as E-value increases. Overall, our CSR-HMM search method performed best in terms of sensitivity and specificity.
We have developed a web-based comparative pathway analysis workbench for biologists or medical scientists. Using the spreadsheet-style interface, users can freely compare pathways in multiple genomes, predict new enzyme candidates using various sequence analysis techniques including our own CSR-HMM method, and refine the prediction result. In case a given pathway/subsystem is unique only to the query genome, comparative and pathway/subsystems-based genome annotation may not work. Fortunately, pathways/subsystems are partially or fully conserved among close species in general, so pathway/subsystems-based genome annotation can take advantage of co-evolution, co-occurrence, and co-regulation relationship of pathway/subsystem component genes of query and reference genomes. Once a 'model' pathway/subsystem can be correctly defined, it is relatively easy to reconstruct it in query genome by referring to reference genomes. It worth noting that 'model' pathway representation (e.g. KEGG map) is artificially defined. It is always possible that several alternative pathways or hidden hierarchical networks can be found when we see the whole network from many different viewpoints and this is why the global pathway network analysis is also required for pathway analysis. ComPath plans to include more resources and computational tools. Among them are Structure-Function Linkage Database (SFLD)  and ProRule . We will include more genome context search methods, including in-house tools such as MCGS [22, 39] and OperonViz . These tools will be useful to explore which enzymes in a pathway appear in a physically clustered form .
Availability and requirements
Project home page: http://compath.org
Operating system(s): Platform independent.
Other requirements: Mozilla-based web browser and Java Runtime Environment (JRE v1.4.0 or higher) are recommended.
Licence: ComPath is freely available to academic and non-academic users
Any restrictions to use by non-academics: None
This research was partially funded by NSF CAREER DBI-0237901, NSF MCB-0731950, and INGEN (Indiana Genomics Initiatives).
- Overbeek R, Begley T, Butler RM, Choudhuri JV, Chuang HY, Cohoon M, de Crecy-Lagard V, Diaz N, Glass EM, Goesmann A, Hanson A, Iwata-Reuyl D, Jensen R, Jamshidi N: The Subsystems Approach to Genome Annotation and its Use in the Project to Annotate 1000 Genomes. Nucl Acids Res 2005, 33(17):5691–5702. 10.1093/nar/gki866PubMed CentralView ArticlePubMed
- Karp PD, Paley S, Romero P: The Pathway Tools software. Bioinformatics 2002, 18(suppl_1):S225–232.View ArticlePubMed
- Thompson W, Rouchka EC, Lawrence CE: Gibbs Recursive Sampler: finding transcription factor binding sites. Nucl Acids Res 2003, 31(13):3580–3585. 10.1093/nar/gkg608PubMed CentralView ArticlePubMed
- Wu J, Mao X, Cai T, Luo J, Wei L: KOBAS server: a web-based platform for automated annotation and pathway identification. Nucl Acids Res 2006, 34: W720–724. 10.1093/nar/gkl167PubMed CentralView ArticlePubMed
- Pireddu L, Szafron D, Lu P, Greiner R: The Path-A metabolic pathway prediction web server. Nucl Acids Res 2006, 34: W714–719. 10.1093/nar/gkl228PubMed CentralView ArticlePubMed
- Haft DH, Selengut JD, Brinkac LM, Zafar N, White O: Genome Properties: a system for the investigation of prokaryotic genetic content for microbiology, genome annotation and comparative genomics. Bioinformatics 2005, 21(3):293–306. Epub 2004 Sep 3. 10.1093/bioinformatics/bti015View ArticlePubMed
- Markowitz VM, Korzeniewski F, Palaniappan K, Szeto E, Werner G, Padki A, Zhao X, Dubchak I, Hugenholtz P, Anderson I, Lykidis A, Mavromatis K, Ivanova N, Kyrpides NC: The integrated microbial genomes (IMG) system. Nucl Acids Res 2006, 34: D344–348. 10.1093/nar/gkj024PubMed CentralView ArticlePubMed
- Ye Y, Osterman A, Overbeek R, Godzik A: Automatic detection of subsystem/pathway variants in genome analysis. Bioinformatics 2005, 21(suppl_1):i478–486. 10.1093/bioinformatics/bti1052View ArticlePubMed
- Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL: GenBank. Nucl Acids Res 2006, 34(suppl_1):D16–20. 10.1093/nar/gkj157PubMed CentralView ArticlePubMed
- Caspi R, Foerster H, Fulcher CA, Hopkinson R, Ingraham J, Kaipa P, Krummenacker M, Paley S, Pick J, Rhee SY, Tissier C, Zhang P, Karp PD: MetaCyc: a multiorganism database of metabolic pathways and enzymes. Nucl Acids Res 2006, 34: D511–516. 10.1093/nar/gkj128PubMed CentralView ArticlePubMed
- Kanehisa M, Goto S, Kawashima S, Nakaya A: The KEGG databases at GenomeNet. Nucl Acids Res 2002, 30(1):42–46. 10.1093/nar/30.1.42PubMed CentralView ArticlePubMed
- Kanehisa M, Goto S, Kawashima S, Okuno Y, Hattori M: The KEGG resource for deciphering the genome. Nucl Acids Res 2004, 32(90001):D277–280. 10.1093/nar/gkh063PubMed CentralView ArticlePubMed
- Bateman A, Coin L, Durbin R, Finn RD, Hollich V, Griffiths-Jones S, Khanna A, Marshall M, Moxon S, Sonnhammer ELL, Studholme DJ, Yeats C, Eddy SR: The Pfam protein families database. Nucl Acids Res 2004, 32: D138–141. 10.1093/nar/gkh121PubMed CentralView ArticlePubMed
- Hulo N, Bairoch A, Bulliard V, Cerutti L, De Castro E, Langendijk-Genevaux PS, Pagni M, Sigrist CJA: The PROSITE database. Nucl Acids Res 2006, 34: D227–230. 10.1093/nar/gkj063PubMed CentralView ArticlePubMed
- Andreeva A, Howorth D, Brenner SE, Hubbard TJP, Chothia C, Murzin AG: SCOP database in 2004: refinements integrate structure and sequence family data. Nucl Acids Res 2004, 32(90001):D226–229. 10.1093/nar/gkh039PubMed CentralView ArticlePubMed
- George RA, Spriggs RV, Thornton JM, Al-Lazikani B, Swindells MB: SCOPEC: a database of protein catalytic domains. Bioinformatics 2004, 20(suppl_1):i130–136. 10.1093/bioinformatics/bth948View ArticlePubMed
- Madera M, Vogel C, Kummerfeld SK, Chothia C, Gough J: The SUPERFAMILY database in 2004: additions and improvements. Nucl Acids Res 2004, 32(90001):D235–239. 10.1093/nar/gkh117PubMed CentralView ArticlePubMed
- Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE: The Protein Data Bank. Nucl Acids Res 2000, 28(1):235–242. 10.1093/nar/28.1.235PubMed CentralView ArticlePubMed
- Gene Ontology C: The Gene Ontology (GO) database and informatics resource. Nucl Acids Res 2004, 32(suppl_1):D258–261. 10.1093/nar/gkh036View Article
- Marchler-Bauer A, Anderson JB, Derbyshire MK, DeWeese-Scott C, Gonzales NR, Gwadz M, Hao L, He S, Hurwitz DI, Jackson JD, Ke Z, Krylov D, Lanczycki CJ, Liebert CA, Liu C, Lu F, Lu S, Marchler GH, Mullokandov M, Song JS, Thanki N, Yamashita RA, Yin JJ, Zhang D, Bryant SH: CDD: a conserved domain database for interactive domain family analysis. Nucl Acids Res 2007, 35: D237–240. 10.1093/nar/gkl951PubMed CentralView ArticlePubMed
- The UniProt C: The Universal Protein Resource (UniProt). Nucl Acids Res 2007, 35: D193–197. 10.1093/nar/gkl929View Article
- Kim S, Choi JH, Saple A, Yang J: A Hybrid Gene Team Model and Its Application to Genome Analysis. Journal of Bioinformatics and Computational Biology (JBCB) 2006, 4(2):171–196. 10.1142/S0219720006001850View Article
- Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucl Acids Res 1997, 25(17):3389–3402. 10.1093/nar/25.17.3389PubMed CentralView ArticlePubMed
- Pearson WR, Lipman DJ: Improved Tools for Biological Sequence Comparison. PNAS 1988, 85(8):2444–2448. 10.1073/pnas.85.8.2444PubMed CentralView ArticlePubMed
- Osterman A, Overbeek R: Missing genes in metabolic pathways: a comparative genomics approach. Current Opinion in Chemical Biology 2003, 7(2):238–251. 10.1016/S1367-5931(03)00027-9View ArticlePubMed
- Kim S, Lee J: BAG: a graph theoretic sequence clustering algorithm. International Journal of Data Mining and Bioinformatics 2006, 1(2):178 -1200. 10.1504/IJDMB.2006.010855View ArticlePubMed
- Stothard P, Wishart DS: Circular genome visualization and exploration using CGView. Bioinformatics 2005, 21(4):537–539. 10.1093/bioinformatics/bti054View ArticlePubMed
- Barabási AL, Oltva ZN: Network biology: understanding the cell's functional organization. Nat Rev Genet 2004, 5(2):101–113. 10.1038/nrg1272View ArticlePubMed
- Ma HW, Zhao XM, Yuan YJ, Zeng AP: Decomposition of metabolic network into functional modules based on the global connectivity structure of reaction graph. Bioinformatics 2004, 20(12):1870–1876. 10.1093/bioinformatics/bth167View ArticlePubMed
- BATAGELJL V, MRVAR A, Mutzel P, Jünger M, Leipert S: Pajek - Analysis and Visualization of Large Networks. Lecture notes in computer science 2003, 77–103.
- Babbitt PC: Definitions of enzyme function for the structural genomics era. Current Opinion in Chemical Biology 2003, 7(2):230–237. 10.1016/S1367-5931(03)00028-0View ArticlePubMed
- Zmasek CM, Eddy SR: ATV: display and manipulation of annotated phylogenetic trees. Bioinformatics 2001, 17(4):383–384. 10.1093/bioinformatics/17.4.383View ArticlePubMed
- Kim S, Wang Z, Dalkilic M: iGibbs: Improving Gibbs motif sampler for proteins by sequence clustering and iterative pattern sampling. Proteins: Structure, Function, and Bioinformatics 2007, 66(3):671–681. 10.1002/prot.21153View Article
- Wolf YI, Rogozin IB, Kondrashov AS, Koonin EV: Genome Alignment, Evolution of Prokaryotic Genome Organization, and Prediction of Gene Function Using Genomic Context. Genome Res 2001, 11(3):356–372. 10.1101/gr.GR-1619RView ArticlePubMed
- Zheng Y, Szustakowski JD, Fortnow L, Roberts RJ, Kasif S: Computational Identification of Operons in Microbial Genomes. Genome Res 2002, 12(8):1221–1230. 10.1101/gr.200601PubMed CentralView ArticlePubMed
- Pegg SCH, Brown SD, Ojha S, Seffernick J, Meng EC, Morris JH, Chang PJ, Huang CC, Ferrin TE, Babbitt PC: Leveraging Enzyme Structure-Function Relationships for Functional Inference and Experimental Design: The Structure-Function Linkage Database. Biochemistry 2006, 45(8):2545–2555. 10.1021/bi052101lView ArticlePubMed
- Sigrist CJA, De Castro E, Langendijk-Genevaux PS, Le Saux V, Bairoch A, Hulo N: ProRule: a new database containing functional and structural information on PROSITE profiles. Bioinformatics 2005, 21(21):4060–4066. 10.1093/bioinformatics/bti614View ArticlePubMed
- Kim S, Choi JH, Yang J: Gene Teams with Relaxed Proximity Constraint. IEEE Computational Systems Bioinformatics 2005, 44–55.
- Choi K, Ma Y, Choi JH, Kim S: PLATCOM: a Platform for Computational Comparative Genomics. Bioinformatics 2005, 21(10):2514–2516. 10.1093/bioinformatics/bti350View ArticlePubMed
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.