IdentiCS – Identification of coding sequence and in silico reconstruction of the metabolic network directly from unannotated low-coverage bacterial genome sequence
© Sun and Zeng; licensee BioMed Central Ltd. 2004
Received: 17 May 2004
Accepted: 16 August 2004
Published: 16 August 2004
A necessary step for a genome level analysis of the cellular metabolism is the in silico reconstruction of the metabolic network from genome sequences. The available methods are mainly based on the annotation of genome sequences including two successive steps, the prediction of coding sequences (CDS) and their function assignment. The annotation process takes time. The available methods often encounter difficulties when dealing with unfinished error-containing genomic sequence.
In this work a fast method is proposed that uses unannotated genome sequence for predicting CDSs and for an in silico reconstruction of metabolic networks. Instead of using predicted genes or CDSs to query public databases, entries from public DNA or protein databases are used as queries to search a local database of the unannotated genome sequence to predict CDSs. Functions are assigned to the predicted CDSs simultaneously. The well-annotated genome of Salmonella typhimurium LT2 is used as an example to demonstrate the applicability of the method. 97.7% of the CDSs in the original annotation are correctly identified. The use of SWISS-PROT-TrEMBL databases resulted in an identification of 98.9% of CDSs that have EC-numbers in the published annotation. Furthermore, two versions of sequences of the bacterium Klebsiella pneumoniae with different genome coverage (3.9 and 7.9 fold, respectively) are examined. The results suggest that a 3.9-fold coverage of the bacterial genome can be sufficient for the in silico reconstruction of the metabolic network. Compared to other gene finding methods such as CRITICA, our method is more suitable for exploiting sequences of low genome coverage. Based on the new method, a program called IdentiCS (Identification of Coding Sequences from Unfinished Genome Sequences) is delivered that combines the identification of CDSs with the reconstruction, comparison and visualization of metabolic networks (free to download at http://genome.gbf.de/bioinformatics/index.html).
The reversed querying process and the program IdentiCS allow a fast and adequate prediction of protein coding sequences and reconstruction of the potential metabolic network from low coverage genome sequences of bacteria. The new method can accelerate the use of genomic data for studying cellular metabolism.
Keywords: low-coverage, unfinished genome sequence, annotation, coding sequence, in silico reconstruction, visualization, comparison, metabolic network, Salmonella typhimurium, Klebsiella pneumoniae
Knowledge about the metabolic network of an organism is essential for understanding its physiology and phenotypic behavior. A comprehensive understanding of the metabolic network at the system level is particularly important for both biotechnological and biomedical research and is now made possible by rapid advances in genome sequencing and functional genomics. In silico reconstruction of metabolic networks from genome sequences of organisms represents a starting point for a systematic analysis of metabolism [1–3]. The functionality of the potential metabolic network of a given organism can then be further experimentally studied by system perturbations at both physiological and genetic levels.
The three-step method starting from gene finding has several drawbacks for the reconstruction of the metabolic network from incomplete or unfinished genome sequences. In unfinished sequences of a genome, especially in sequences with low genome coverage (e.g. less than 4 fold), there may be many sequencing errors that prevent an accurate prediction of genes. For example, the start or stop positions of CDSs may not be accurately predicted. Protein sequences translated from these CDSs may be completely wrong because of coding frame shifts. Fusion CDSs may be predicted, to which function assignment is difficult. Moreover, a CDS that normally appears as one CDS in other organisms may be predicted as several smaller fragmented CDSs. On the other hand, existing CDSs may not be found at all, either because of sequencing errors or because of limitations of the gene finding software. For eukaryotes, the prediction of CDSs is even more difficult because of the existence of introns.
To avoid these problems, alternative methods are required for directly reconstructing metabolic networks from unfinished genome data. Sequencing and annotation are still time and resource consuming, so exploiting the genome data as early as possible is important for functional genome research. In this work, we propose a method to identify coding sequences for proteins (particularly for metabolic enzymes) directly from unannotated low-coverage genomic data for in silico reconstruction of the metabolic network. The method is demonstrated with genome data from two organisms. A program combining automatic prediction and function assignment of CDSs with a visualization and comparison of metabolic networks of different organisms is also delivered.
Principle of the new method
The principle of the new method is schematically shown in Fig. 1B. In comparison to the conventional three-step method (Fig. 1A), our method can be called a two-step approach. To avoid the separate step of gene finding in the conventional methods, we propose to reverse the searching relationship between public databases and the query sequence: gene or protein sequences from public databases are taken as queries, while the sequences in the unannotated genome of a given organism are treated as a local database that can be searched using a standalone version of BLAST. This results in the prediction of possible CDSs in the genome and simultaneously their functions. Functional information about these CDSs is then used to reconstruct the metabolic network. Thus, our method can significantly simplify the process of CDS prediction and metabolic network reconstruction. By skipping the separate steps of gene finding and function assignment, our method can avoid or relax some of the problems of the traditional methods mentioned above.
Results and Discussion
Evaluation of IdentiCS for identifying protein coding sequences from genome sequences of S. typhimurium LT2
Evaluation of the method IdentiCS for identification of CDSs in the genome of Salmonella typhimurium. KEGG: KEGG genome database; SW: SWISS-PROT + TrEMBL + TrEMBL updates. [Table body not preserved; surviving column heading: Inconsistence rate (%) in TP.]
The specificity of the method is about 81–82% on the CDS level and 87.2–94.9% on the nucleotide level for the KEGG genome database and the whole protein database SWISS-PROT and TrEMBL. The moderate specificity on the CDS level is due to the relatively high number of additionally predicted CDSs (false positives). It should be mentioned that all the additionally predicted CDSs have quite strong statistical significance (most of them with an E-value between 1E-20 and 1E-40). These additional CDSs may have been missed in the original annotation and could in fact represent good candidates for an improved annotation of the genome.
The inconsistence rate by IdentiCS is as low as 0.35% for the KEGG genome database and 0.64% for the SWISS-PROT and TrEMBL protein database, indicating the reliability of our method.
Effects of different scoring criteria on CDS identification in the genome of S. typhimurium using IdentiCS and the database SWISS-PROT and TrEMBL. Criteria 1: E-value ≤ 1E-10; Criteria 2: E-value ≤ 1E-10 and bits score ≥ 75; Criteria 3: E-value ≤ 1E-10, bits score ≥ 75 and identities ≥ 25%.
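The three scoring criteria in this caption can be expressed as a small filter over alignment hits. The following is a minimal sketch, not the IdentiCS implementation; the `Hit` record and the `passes` function are hypothetical names introduced only for illustration, and the thresholds are the ones quoted in the caption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One alignment of a database query against the raw genome (hypothetical record)."""
    query_id: str
    evalue: float
    bit_score: float
    identity_pct: float


def passes(hit: Hit, criteria: int) -> bool:
    """Apply criteria 1, 2 or 3 from the caption; each level adds one threshold."""
    if criteria not in (1, 2, 3):
        raise ValueError("criteria must be 1, 2 or 3")
    ok = hit.evalue <= 1e-10                  # criteria 1, 2 and 3
    if criteria >= 2:
        ok = ok and hit.bit_score >= 75       # criteria 2 and 3
    if criteria == 3:
        ok = ok and hit.identity_pct >= 25    # criteria 3 only
    return ok


# Invented example hits, only to show how the levels differ.
strong = Hit("q1", 1e-40, 160.0, 55.0)
short_match = Hit("q2", 1e-12, 60.0, 70.0)
diverged = Hit("q3", 1e-15, 90.0, 22.0)
```

With these invented hits, `short_match` passes only criteria 1, `diverged` passes criteria 1 and 2, and `strong` passes all three.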
Identification of enzyme-coding sequences in S. typhimurium LT2
Evaluation of the performance of IdentiCS for the prediction of EC number-containing CDSs (EC-CDSs) with the EC-number containing subset of the protein database SWISS-PROT and TrEMBL. [Table body not preserved; surviving headings: Compared to originally annotated EC-CDSs; Compared to all originally annotated CDSs; Inconsistence rate in TP.]
The KEGG genomes based prediction is also evaluated for its ability to predict the enzyme-coding sequences. 95.4% of the CDSs originally annotated to have an EC-number are correctly predicted and assigned with EC numbers. This value is slightly lower than the one based on the SWISS-PROT+TrEMBL database. The more complete protein databases are therefore more suitable for EC-CDS identification as well.
Identification of enzyme coding sequences with different coverage of genome sequences of K. pneumoniae
Both the KEGG genome and SWISS-PROT-TrEMBL databases are used to identify enzyme-coding sequences from the 3.9-fold and 7.9-fold coverage genome sequences of K. pneumoniae. From the 3.9-fold coverage genome, IdentiCS identified 1169 and 1342 EC-CDSs by applying the KEGG genome database and SWISS-PROT-TrEMBL databases, respectively, whereas from the 7.9-fold genome sequences it identified 1158 and 1495 EC-CDSs, respectively. As in the case of S. typhimurium, IdentiCS identified 15% to 30% more EC-CDSs with queries from SWISS-PROT-TrEMBL than with queries from KEGG for the two versions of K. pneumoniae genome sequences. The number of EC-CDSs identified for K. pneumoniae is comparable to that identified for S. typhimurium with the respective databases. It is also comparable to the number (1156) of annotated EC-CDSs of E. coli based on the KEGG genome database. With the method proposed by Ma and Zeng, the structure and evolutionary distance of the metabolic networks of these three organisms and 47 other bacteria are compared. The metabolic network of K. pneumoniae is found to be most similar to those of E. coli and S. typhimurium (data not shown). Thus, the predicted number of enzyme-encoding sequences for K. pneumoniae appears to be reasonable. With the same 3.9-fold coverage genome sequences of K. pneumoniae, the method of WIT predicted 2650 EC-CDSs, which is about twice the number of EC-CDSs in E. coli and S. typhimurium. The EC-CDSs predicted by WIT are significantly smaller and fragmented, possibly because of the presence of too many errors in the unfinished genome sequences. The fragmentation problem is overcome in our method, which leads to a significant reduction in the number of identified EC-CDSs. Fewer false-positive EC-CDSs will further simplify experimental design, for example of microarrays, to examine the metabolic network.
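The relative differences quoted in this paragraph follow directly from the reported EC-CDS counts. The short calculation below is only an arithmetic check on the numbers given in the text; the dictionary keys and the helper name `percent_more` are illustrative.

```python
def percent_more(a: int, b: int) -> float:
    """Return how many percent a exceeds b."""
    return (a / b - 1.0) * 100.0


# EC-CDS counts for K. pneumoniae as reported in the text.
counts = {
    "3.9-fold": {"KEGG": 1169, "SW": 1342},
    "7.9-fold": {"KEGG": 1158, "SW": 1495},
}

# Extra EC-CDSs found with SWISS-PROT-TrEMBL queries relative to KEGG queries.
increase = {cov: round(percent_more(c["SW"], c["KEGG"]), 1) for cov, c in counts.items()}

# WIT prediction (2650) relative to the annotated EC-CDSs of E. coli (1156).
wit_ratio = round(2650 / 1156, 2)

print(increase)   # {'3.9-fold': 14.8, '7.9-fold': 29.1}
print(wit_ratio)  # 2.29
```

The two increases (about 15% and 29%) match the "15% to 30%" range stated above, and the WIT count is a little more than twice the E. coli figure.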
Comparison of EC numbers identified with different methods and different versions of the genome sequence of Klebsiella pneumoniae. WIT: WIT version of annotation by gene prediction from the 3.9-fold genome sequences; KEGG3.9 and KEGG7.9: annotations of the 3.9-fold and 7.9-fold genome sequences by applying the KEGG genome database; SW3.9 and SW7.9: annotations of the 3.9-fold and 7.9-fold genome sequences by applying SWISS-PROT and TrEMBL protein databases. [Table body not preserved; surviving headings: Number of unique ECs identified; Version-specific EC numbers* compared to.]
Distribution of the unique EC numbers of K. pneumoniae in different function categories compared to other organisms (see legend of Fig. 3 for name abbreviations of organisms). The EC numbers for K. pneumoniae were identified from the unannotated 7.9-fold coverage genome sequences by SWISS-PROT-TrEMBL based IdentiCS. The EC numbers for other organisms are taken from the KEGG genome annotations. The total number of strain-specific ECs is shown in parentheses under the strain name. [Table body not preserved; surviving row labels: Amino Acid metabolism; Other Amino Acids; Complex Lipids metabolism; Cofactors and Vitamins; Sum of unique EC numbers.]
Comparison of IdentiCS and CRITICA for identifying coding sequence from low coverage genome sequences
Comparison of coding sequence (CDS) prediction by IdentiCS and CRITICA from unfinished genome sequences of K. pneumoniae with different genome sequence coverage. [Table body not preserved; surviving column headings: 3.9 × genome data; 7.9 × genome data; surviving row labels: Number of all CDSs; CDSs shared by both programs; CDSs merely identified by the respective program.]
From the 3.9-fold coverage genome data, CRITICA predicts 6734 CDSs with a cut-off p-value of 1E-4 as suggested by Badger and Olsen, while IdentiCS predicts 5650 CDSs (with a cut-off E-value of 1E-10). 94.0% of the CDSs predicted by CRITICA are covered by the prediction of IdentiCS. In many cases two or more smaller CDSs predicted by CRITICA are covered by a single CDS predicted by IdentiCS, evidently because of the relatively high rate of sequencing errors in the 3.9-fold coverage genome data. CRITICA predicts 29 fusion coding sequences. Since they have similarities to two different functions, function assignment to this kind of fusion CDS is uncertain. Half of the CRITICA-specific CDSs have p-values between 1E-4 and 1E-10. In comparison, of the 1348 CDSs predicted only by IdentiCS, all have E-values less than 1E-10 and 27% have E-values less than 1E-40; all of these CDSs have an identity greater than 20% and 60% have an identity greater than 50%, indicating that the predictions by IdentiCS have a high confidence.
From the 7.9-fold coverage genome data, CRITICA predicts 5135 CDSs. This number is much lower than the number of CDSs predicted from the 3.9-fold coverage, which may be explained by the significant decrease of sequencing errors in the 7.9-fold genome data. In contrast, the number of CDSs predicted by IdentiCS is only 389 lower than that predicted from the 3.9-fold genome data. 93.9% of the CRITICA predictions are covered by the 4512 CDSs predicted by IdentiCS. Only 8 CDSs predicted by CRITICA span two or more CDSs. This shows that the increase of sequence quality increases the precision of the prediction of CRITICA. Again, the IdentiCS-specific predictions have a high confidence: all have E-values less than 1E-10 and amino acid sequence identities greater than 20%, and more than 50% have E-values less than 1E-20 and identities greater than 50%. The fact that in some cases fusion CDSs are predicted by CRITICA and in other cases many highly probable coding regions are not predicted as CDSs indicates a shortcoming of this algorithm for low quality contigs. When CRITICA finds a coding region with a high score, it tries to find the start and stop codons by extending this region both upstream and downstream under the condition that the total score does not decrease after extension. Sequencing errors, especially frame shifts, make it difficult for CRITICA to calculate the extension score correctly. In contrast, the algorithm used by IdentiCS does not need to locate the start and stop codons. Frame shifts also interfere less with IdentiCS because it does not use predicted coding sequences as queries but uses entries from public databases to search for coding sequences in the raw genome sequences of an organism. These features make IdentiCS more suitable than other available approaches for identifying possible protein-coding regions from low-coverage error-containing raw genome sequences.
Reconstruction and visualization of metabolic networks for comparison
With the identified enzyme-encoding sequences discussed above, the potential metabolic networks of S. typhimurium and K. pneumoniae can be reconstructed and compared to other organisms. The reconstruction of metabolic networks can be done in a similar way as for CDSs from annotated genome sequences, as recently described by Ma and Zeng. Briefly, from the identified EC numbers of CDSs, the set of biochemical reactions involved in the organism can be established with the help of a reaction database (i.e. a revised version of LIGAND or BRENDA). From the reaction set, a connection matrix is obtained that can be used to represent the metabolic network as a directed graph for computational analysis.
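The step from EC numbers to a connection matrix can be sketched as follows. This is a simplified illustration and not the procedure of Ma and Zeng: the function name `build_connection_matrix` and the toy `reaction_db` dictionary are assumptions made for this example (a real reaction database such as LIGAND or BRENDA has a different structure), and the two toy reactions omit cofactors such as ATP.

```python
def build_connection_matrix(ec_numbers, reaction_db):
    """Map EC numbers to reactions and return (metabolites, adjacency matrix).

    reaction_db maps an EC number to a list of
    (substrates, products, reversible) tuples; matrix[i][j] == 1 means
    metabolite i can be converted into metabolite j.
    """
    reactions = []
    for ec in sorted(set(ec_numbers)):
        reactions.extend(reaction_db.get(ec, []))  # unknown EC numbers are skipped

    metabolites = sorted({m for subs, prods, _ in reactions for m in subs + prods})
    index = {m: i for i, m in enumerate(metabolites)}
    size = len(metabolites)
    matrix = [[0] * size for _ in range(size)]

    for subs, prods, reversible in reactions:
        for s in subs:
            for p in prods:
                matrix[index[s]][index[p]] = 1      # forward edge
                if reversible:
                    matrix[index[p]][index[s]] = 1  # backward edge
    return metabolites, matrix


# Toy reaction database (cofactors omitted, illustrative only).
toy_db = {
    "2.7.1.1": [(("glucose",), ("glucose-6-phosphate",), False)],
    "5.3.1.9": [(("glucose-6-phosphate",), ("fructose-6-phosphate",), True)],
}
metabolites, matrix = build_connection_matrix(["2.7.1.1", "5.3.1.9", "9.9.9.9"], toy_db)
```

Treating each substrate-product pair as a directed edge is one common way to obtain a metabolite graph; other representations (for example reaction nodes) would lead to a different matrix.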
The use of genome sequences from S. typhimurium and K. pneumoniae demonstrated the applicability and reliability of the new method proposed for in silico identification of protein coding sequences from unannotated genome sequences. The use of protein sequence databases SWISS-PROT and TrEMBL is more favorable than the use of KEGG genome database for identifying coding sequences and thus for metabolic network reconstruction. Furthermore, the method allows an adequate reconstruction of the potential metabolic network from sequence data with low coverage (e.g. < 4 fold) of the bacterial genome as shown for K. pneumoniae. Together with the algorithms for the automatic annotation of sequences, the visualization and comparison of metabolic networks, the method and program developed in this work can accelerate the use of genomic data for studying cellular metabolism.
The applicability of the method proposed above was examined with the genome sequences of two organisms, namely Salmonella typhimurium LT2 and Klebsiella pneumoniae. The genome of S. typhimurium LT2 has been completely sequenced and well annotated. Thus, the annotated genome sequences of S. typhimurium LT2 serve as a reference to evaluate the accuracy of the proposed method. The sequences and annotation for S. typhimurium LT2 were downloaded from KEGG (version of Dec. 18, 2003). The genome of K. pneumoniae has been recently sequenced and the annotation is still in progress. Two different versions of the raw genome data of K. pneumoniae (3.9-fold whole genome shotgun coverage in 920 contigs and 7.9-fold coverage in 341 contigs) obtained from the Genome Sequencing Center of Washington University were examined in this study. Each version of the raw genome data was formatted as a local database for BLAST.
Two types of databases are used in this work for the prediction and function assignment of CDSs for a given organism, namely the nucleic acid database from KEGG and the non-redundant protein sequence databases from SWISS-PROT, TrEMBL and TrEMBL updates. The reason for choosing the genome database from KEGG as the query source rather than other nucleic acid databases such as GenBank or EMBL is that KEGG contains the most extensive EC numbers for enzymes, which are needed for reconstructing metabolic networks. Therefore, the KEGG genome database can serve as an EC number source and be used for the purpose of comparative analysis of genome-based metabolism. In contrast, the flat data files from GenBank and EMBL do not contain the necessary enzyme index information in many cases. SWISS-PROT is human-curated and therefore preferred. SWISS-PROT and its sister database TrEMBL (SWISS-PROT Release 42.7, TrEMBL Release 25.7, released on 15 Dec. 2003) were obtained from the Swiss Institute of Bioinformatics. SWISS-PROT flat files rather than "fasta" format files were used because the enzyme EC numbers may not be included in the fasta format files available on the FTP site. Entries in the databases that do not contain EC numbers can be filtered out before the sequence alignment step to shorten the computational time if the purpose is merely to identify metabolic enzymes and to reconstruct the metabolic network. For identifying all possible CDSs, the complete SWISS-PROT and TrEMBL databases are used.
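Filtering the flat file down to EC-containing entries, as described above, might look like the sketch below. It is an illustration under stated assumptions, not the authors' code: it assumes the classic SWISS-PROT flat-file layout (two-letter line codes `AC`, `DE`, `SQ`, records terminated by `//`, EC numbers written in the `DE` lines as `EC 1.1.1.1` or `EC=1.1.1.1`), and the function names, the FASTA header layout and the two sample records with their accessions are invented for the example.

```python
import re

# EC number such as 1.1.1.1 or 4.2.1.-, preceded by "EC " or "EC=".
EC_PATTERN = re.compile(r"EC[ =](\d+(?:\.(?:\d+|-)){3})")


def ec_entries(flat_text):
    """Yield (accession, ec_numbers, sequence) for entries that carry an EC number."""
    accession, description, sequence, in_seq = None, [], [], False
    for line in flat_text.splitlines():
        if line.startswith("//"):                      # end of one record
            ecs = sorted(set(EC_PATTERN.findall(" ".join(description))))
            if accession and ecs:
                yield accession, ecs, "".join(sequence)
            accession, description, sequence, in_seq = None, [], [], False
        elif in_seq:                                   # sequence lines follow SQ
            sequence.append(line.replace(" ", ""))
        elif line.startswith("AC   ") and accession is None:
            accession = line[5:].split(";")[0].strip() # primary accession only
        elif line.startswith("DE   "):
            description.append(line[5:].strip())
        elif line.startswith("SQ   "):
            in_seq = True


def to_fasta(entries):
    """Write entries as FASTA text, keeping the EC numbers in the header."""
    return "".join(f">{acc} EC={','.join(ecs)}\n{seq}\n" for acc, ecs, seq in entries)


# Two invented records: only the first has an EC number.
sample = "\n".join([
    "ID   ENZ1_TEST      STANDARD;      PRT;    10 AA.",
    "AC   X00001;",
    "DE   Alcohol dehydrogenase (EC 1.1.1.1).",
    "SQ   SEQUENCE   10 AA;",
    "     MSTAGKVIKC",
    "//",
    "ID   HYP1_TEST      STANDARD;      PRT;    10 AA.",
    "AC   X00002;",
    "DE   Hypothetical protein.",
    "SQ   SEQUENCE   10 AA;",
    "     MKKLLPTAAA",
    "//",
])
fasta = to_fasta(ec_entries(sample))
```

Keeping the EC numbers in the FASTA header means that a later alignment hit can be mapped back to an enzyme function without a second lookup in the flat file.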
Automatic prediction and annotation of protein-coding sequences
The annotation process is based on similarity comparison as normally used in other annotation processes. The difference is that in our approach the gene or protein sequences from public databases are used as queries to search and locate similar ones in the raw genome sequences. When proteins from public database are used as queries, the tblastn algorithm in the BLAST program is applied that compares the query to all six translation frames of the unannotated DNA sequences. The dynamic translation of a small genomic database takes much less system resource than the translation of a large public database as in the conventional methods. Our method can thus be realized on a common PC system, especially when merely a subset of the public database is considered, for example for the purpose of identifying metabolic enzymes for metabolic network reconstruction.
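The reversed search described above can be outlined with a current BLAST installation. The sketch below is an assumption-laden illustration: it uses the NCBI BLAST+ programs `makeblastdb` and `tblastn` in place of the 2004-era standalone BLAST used in this work, the file and database names are placeholders, and the commands are only assembled, not executed. The parser assumes the standard 12-column tabular output (`-outfmt 6`) and the convention that subject start exceeds subject end for hits on the negative strand.

```python
def reverse_search_commands(genome_fasta, query_fasta, db_name="genome_db", evalue=1e-10):
    """Build the two command lines: format the raw genome, then search it with proteins."""
    make_db = ["makeblastdb", "-in", genome_fasta, "-dbtype", "nucl", "-out", db_name]
    search = ["tblastn", "-query", query_fasta, "-db", db_name,
              "-evalue", str(evalue), "-outfmt", "6", "-out", "hits.tsv"]
    return make_db, search


def parse_tabular_line(line):
    """Parse one line of 12-column tabular BLAST output into a dictionary."""
    f = line.rstrip("\n").split("\t")
    sstart, send = int(f[8]), int(f[9])
    return {
        "query": f[0],
        "contig": f[1],
        "identity_pct": float(f[2]),
        "start": min(sstart, send),              # genome coordinates, start <= end
        "end": max(sstart, send),
        "strand": "+" if sstart <= send else "-",
        "evalue": float(f[10]),
        "bit_score": float(f[11]),
    }


make_db, search = reverse_search_commands("contigs.fna", "ec_proteins.faa")
# Invented output line for a hit on the negative strand of a contig.
hit = parse_tabular_line("X00001\tcontig_12\t87.5\t200\t25\t0\t1\t200\t1600\t1001\t2e-50\t310")
```

The command lists could be passed to `subprocess.run` on a machine where BLAST+ is installed; they are kept as plain lists here so the sketch does not depend on external programs.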
Fragments aligned to the same query are combined with the highest score fragment into a larger sequence region if:
- they are coded on the same strand of the same genomic contig as the highest score fragment. In other words, all of these fragments must be translated either in positive or in negative frames.
- the alignments have an identity level not lower than 80% of the identity level of the highest score alignment.
- the generated larger sequence region has alignment gaps or extensions not more than 20% of its length.
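The three conditions above can be illustrated with a greedy merge around the highest-scoring fragment. This is one possible reading and not the IdentiCS algorithm: the `Fragment` record, the function names and the greedy order are assumptions, and the third condition is interpreted here as "the part of the merged region not covered by any fragment is at most 20% of its length".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    contig: str
    strand: str        # "+" or "-"
    start: int         # 1-based, inclusive, start <= end
    end: int
    bit_score: float
    identity_pct: float


def gap_fraction(frags):
    """Fraction of the spanned region that no fragment covers."""
    lo = min(f.start for f in frags)
    hi = max(f.end for f in frags)
    covered, cursor = 0, lo - 1
    for f in sorted(frags, key=lambda f: f.start):
        if f.end > cursor:
            covered += f.end - max(f.start - 1, cursor)
            cursor = f.end
    return 1.0 - covered / (hi - lo + 1)


def merge_fragments(frags, min_identity_ratio=0.8, max_gap_fraction=0.2):
    """Grow a region from the best fragment, adding others that meet all three conditions."""
    seed = max(frags, key=lambda f: f.bit_score)
    merged = [seed]
    rest = sorted((f for f in frags if f is not seed), key=lambda f: f.bit_score, reverse=True)
    for f in rest:
        same_strand = (f.contig, f.strand) == (seed.contig, seed.strand)   # condition 1
        similar = f.identity_pct >= min_identity_ratio * seed.identity_pct # condition 2
        if same_strand and similar and gap_fraction(merged + [f]) <= max_gap_fraction:  # condition 3
            merged.append(f)
    return min(f.start for f in merged), max(f.end for f in merged), merged


# Invented fragments: only the second one can be merged with the seed.
frags = [
    Fragment("c1", "+", 100, 400, 300.0, 90.0),    # seed
    Fragment("c1", "+", 420, 600, 150.0, 80.0),    # merged
    Fragment("c1", "-", 650, 800, 100.0, 85.0),    # other strand
    Fragment("c1", "+", 2000, 2100, 90.0, 88.0),   # gap too large
    Fragment("c1", "+", 610, 700, 80.0, 60.0),     # identity too low
]
region_start, region_end, members = merge_fragments(frags)
```

In this example the merged region spans positions 100 to 600; merging across a frame shift is possible because the condition only requires the same strand, not the same reading frame.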
Each region normally has only one function. Here a region represents a stretch of nucleotides either on the positive strand or on the negative strand of a DNA molecule. The same physical position on different strands of a DNA molecule can belong to different regions, and can therefore have a different function assignment. Although there are examples in some viruses where a region can code for different proteins depending on the reading frame, this happens very rarely in other organisms. The user can assign a tolerance value (e.g. 60 bp) to allow two successive regions to overlap each other to some extent.
Highest similarity principle. If a query gene or protein has a higher similarity to a CDS than all other queries, then the function of this query gene or protein is assigned as the annotation of the CDS. The bits score is used as the first measure of similarity. If two queries have the same bits score, then the identity level in percent is taken as a second measure of similarity. If both bits score and identity are the same and the two entries have different function annotations (which rarely occurs), then both of their functions are assigned to that region.
Closest evolutionary relationship. If two or more query genes or proteins are comparably similar (e.g. the difference between their identity levels is lower than 5%) to a CDS but have different functions, the evolutionary relationship between these organisms is further considered. The annotation of the organism that is most closely related to the studied organism from the viewpoint of metabolic evolution is transferred to the unknown CDS. The evolutionary relationship between different organisms and the one studied is established with the method of Ma and Zeng after the initial function assignment for the CDSs with the highest similarity criteria.
In this way, the coding sequences of a genome are identified and annotated at the same time. No second large-scale sequence alignment is needed. Once all the software and databases are prepared, our program, which is called IdentiCS (Identification of Coding Sequences from Raw Genome Sequences), can reconstruct the metabolic network of an organism with about 5 million base pairs of raw genome data. The computing time is less than 8 hours on a PC with a 2.8 GHz Pentium 4 CPU and 512 MB memory. This program works together with Microsoft Excel under the Windows environment.
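The two assignment rules can be combined in a short decision function. The sketch below is a hedged illustration rather than the IdentiCS code: `Candidate`, `assign_function` and the `distance_to_target` mapping are hypothetical names, "comparably similar" is interpreted as an identity level within 5 percentage points of the best candidate, and the organism codes, function names and distances in the example are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    organism: str
    function: str
    bit_score: float
    identity_pct: float


def assign_function(candidates, distance_to_target=None, identity_window=5.0):
    """Return the set of functions assigned to one region.

    distance_to_target maps an organism to its (metabolic) evolutionary
    distance from the studied organism; smaller means more closely related.
    """
    # Highest similarity: bits score first, identity as tie-breaker.
    best = max(candidates, key=lambda c: (c.bit_score, c.identity_pct))

    # Closest evolutionary relationship, used only when comparably similar
    # candidates disagree and distances are already available.
    close = [c for c in candidates
             if abs(best.identity_pct - c.identity_pct) < identity_window]
    if len({c.function for c in close}) > 1 and distance_to_target:
        nearest = min(close, key=lambda c: distance_to_target.get(c.organism, float("inf")))
        return {nearest.function}

    # Exact ties in both measures keep all of their functions.
    top = [c for c in candidates
           if (c.bit_score, c.identity_pct) == (best.bit_score, best.identity_pct)]
    return {c.function for c in top}


a = Candidate("org_a", "glycerol dehydratase", 500.0, 92.0)
b = Candidate("org_b", "diol dehydratase", 480.0, 90.0)
first_pass = assign_function([a, b])
second_pass = assign_function([a, b], {"org_a": 0.3, "org_b": 0.1})
```

Calling the function without distances corresponds to the initial assignment by highest similarity; calling it again with distances mimics the refinement that becomes possible once the evolutionary relationship has been established.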
For a more detailed examination of our method, the results are evaluated separately for the prediction of CDSs and their function assignment, although our method integrates these two aspects into one step. The terms true positive (TP), false negative (FN) and false positive (FP) are used to calculate the sensitivity and specificity of CDS prediction in comparison with CDSs in the original annotation. The terms "sensitivity" and "specificity" are defined according to Burset and Guigo: Sensitivity = TP / (TP + FN); Specificity = TP / (TP + FP).
We also evaluated the terms TP, FN and FP on the nucleotide level according to Burset and Guigo and calculated the corresponding sensitivity and specificity as above. It should be mentioned that a true positive CDS does not necessarily mean that its function assignment is also correct. The terms consistence and inconsistence are used to describe whether a true positive CDS has the same function assignment as in the original annotation or not. Correspondingly, an "inconsistence rate" is used and defined as: Inconsistence rate (%) = (number of TP CDSs with a function assignment different from the original annotation / number of TP CDSs) × 100.
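The evaluation measures can be computed with a few lines of code. The sketch assumes the Burset and Guigo definitions, sensitivity = TP / (TP + FN) and specificity = TP / (TP + FP), and takes the inconsistence rate as the percentage of true positive CDSs whose function assignment differs from the original annotation. The counts in the example are invented and are not results from this work.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Fraction of originally annotated CDSs that were predicted."""
    return tp / (tp + fn)


def specificity(tp: int, fp: int) -> float:
    """Fraction of predicted CDSs that match the original annotation."""
    return tp / (tp + fp)


def inconsistence_rate(inconsistent_tp: int, tp: int) -> float:
    """Percentage of true positives with a differing function assignment."""
    return 100.0 * inconsistent_tp / tp


# Invented counts for illustration only.
tp, fn, fp, inconsistent = 90, 10, 20, 3
print(f"sensitivity = {sensitivity(tp, fn):.3f}")
print(f"specificity = {specificity(tp, fp):.3f}")
print(f"inconsistence rate = {inconsistence_rate(inconsistent, tp):.2f}%")
```

Note that specificity in this sense is what is often called precision elsewhere, which is why a large number of additionally predicted CDSs lowers it even when sensitivity is high.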
One of the authors (J. Sun) gratefully acknowledges the PhD Scholarship of the German Academic Exchange Service (DAAD). This work is supported by grant 031U110A/031U210A from the Bundesministerium für Bildung und Forschung (BMBF) of Germany. We also thank Dr. H. Ma for his help in the structure and evolution analysis for the metabolic network of K. pneumoniae.
- Ma HW, Zeng AP: Reconstruction of metabolic networks from genome data and analysis of their global structure for various organisms. Bioinformatics 2003, 19: 270–277. doi:10.1093/bioinformatics/19.2.270
- Ma HW, Zeng AP: The connectivity structure, giant strong component and centrality of metabolic networks. Bioinformatics 2003, 19: 1423–1430. doi:10.1093/bioinformatics/btg177
- Overbeek R, Larsen N, Pusch GD, D'Souza M, Selkov E Jr, Kyrpides N, Fonstein M, Maltsev N, Selkov E: WIT: integrated system for high-throughput genome sequence analysis and metabolic reconstruction. Nucleic Acids Res 2000, 28: 123–125. doi:10.1093/nar/28.1.123
- Ideker T, Thorsson V, Ranish JA, Christmas R, Buhler J, Eng JK, Bumgarner R, Goodlett DR, Aebersold R, Hood L: Integrated genomic and proteomic analyses of a systematically perturbed metabolic network. Science 2001, 292: 929–934. doi:10.1126/science.292.5518.929
- Kanehisa M, Goto S: KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res 2000, 28: 27–30. doi:10.1093/nar/28.1.27
- Michal G: Biochemical Pathways. 3rd edition. Boehringer Mannheim, Germany; 1992.
- Michal G: Biochemical Pathways. Heidelberg; Berlin: Spektrum Akademischer Verlag; 1999.
- Selkov E Jr, Grechkin Y, Mikhailova N, Selkov E: MPW: the Metabolic Pathways Database. Nucleic Acids Res 1998, 26: 43–45. doi:10.1093/nar/26.1.43
- Karp PD, Riley M, Saier M, Paulsen IT, Paley SM, Pellegrini-Toole A: The EcoCyc and MetaCyc databases. Nucleic Acids Res 2000, 28: 56–59. doi:10.1093/nar/28.1.56
- Karp PD, Riley M, Saier M, Paulsen IT, Collado-Vides J, Paley SM, Pellegrini-Toole A, Bonavides C, Gama-Castro S: The EcoCyc Database. Nucleic Acids Res 2002, 30: 56–58. doi:10.1093/nar/30.1.56
- Goesmann A, Haubrock M, Meyer F, Kalinowski J, Giegerich R: PathFinder: reconstruction and dynamic visualization of metabolic pathways. Bioinformatics 2002, 18: 124–129. doi:10.1093/bioinformatics/18.1.124
- Delcher AL, Harmon D, Kasif S, White O, Salzberg SL: Improved microbial gene identification with GLIMMER. Nucleic Acids Res 1999, 27: 4636–4641. doi:10.1093/nar/27.23.4636
- Besemer J, Lomsadze A, Borodovsky M: GeneMarkS: a self-training method for prediction of gene starts in microbial genomes. Implications for finding sequence motifs in regulatory regions. Nucleic Acids Res 2001, 29: 2607–2618. doi:10.1093/nar/29.12.2607
- Guo FB, Ou HY, Zhang CT: ZCURVE: a new system for recognizing protein-coding genes in bacterial and archaeal genomes. Nucleic Acids Res 2003, 31: 1780–1789. doi:10.1093/nar/gkg254
- Badger JH, Olsen GJ: CRITICA: coding region identification tool invoking comparative analysis. Mol Biol Evol 1999, 16: 512–524.
- Mulder NJ, Apweiler R, Attwood TK, Bairoch A, Barrell D, Bateman A, Binns D, Biswas M, Bradley P, Bork P, Bucher P, Copley RR, Courcelle E, Das U, Durbin R, Falquet L, Fleischmann W, Griffiths-Jones S, Haft D, Harte N, Hulo N, Kahn D, Kanapin A, Krestyaninova M, Lopez R, Letunic I, Lonsdale D, Silventoinen V, Orchard SE, Pagni M, Peyruc D, Ponting CP, Selengut JD, Servant F, Sigrist CJ, Vaughan R, Zdobnov EM: The InterPro Database, 2003 brings increased coverage and new features. Nucleic Acids Res 2003, 31: 315–318. doi:10.1093/nar/gkg046
- Falquet L, Pagni M, Bucher P, Hulo N, Sigrist CJ, Hofmann K, Bairoch A: The PROSITE database, its status in 2002. Nucleic Acids Res 2002, 30: 235–238. doi:10.1093/nar/30.1.235
- Bateman A, Birney E, Cerruti L, Durbin R, Etwiller L, Eddy SR, Griffiths-Jones S, Howe KL, Marshall M, Sonnhammer EL: The Pfam protein families database. Nucleic Acids Res 2002, 30: 276–280. doi:10.1093/nar/30.1.276
- The Genomic Database at Integrated Genomics, Inc [http://www.integratedgenomics.com/genomic.html]
- The Academic Site of WIT [http://www-wit.mcs.anl.gov/]
- Mount DW: Bioinformatics: Sequence and genome analysis. Cold Spring Harbor Laboratory Press; 2001.
- Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25: 3389–3402. doi:10.1093/nar/25.17.3389
- McClelland M, Sanderson KE, Spieth J, Clifton SW, Latreille P, Courtney L, Porwollik S, Ali J, Dante M, Du F, Hou S, Layman D, Leonard S, Nguyen C, Scott K, Holmes A, Grewal N, Mulvaney E, Ryan E, Sun H, Florea L, Miller W, Stoneking T, Nhan M, Waterston R, Wilson RK: Complete genome sequence of Salmonella enterica serovar Typhimurium LT2. Nature 2001, 413: 852–856. doi:10.1038/35101614
- Ma HW, Zeng AP: Phylogenetic comparison of metabolic capacities of organisms at genome level. Mol Phylogenet Evol 2004, 31: 204–213. doi:10.1016/j.ympev.2003.08.011
- Schomburg I, Chang A, Ebeling C, Gremse M, Heldt C, Huhn G, Schomburg D: BRENDA, the enzyme database: updates and major new developments. Nucleic Acids Res 2004, 32: D431–3. doi:10.1093/nar/gkh081
- International Union of Biochemistry and Molecular Biology (IUBMB) [http://www.iubmb.unibe.ch]
- The ExPASy (Expert Protein Analysis System) proteomics server of the Swiss Institute of Bioinformatics (SIB) [http://www.expasy.org]
- The FTP Site of KEGG Genomes [ftp://ftp.genome.ad.jp/pub/kegg/genomes]
- The Genome Sequencing Center at Washington University Medical School [http://genome.wustl.edu]
- The Non-Redundant Protein Sequence Database [ftp://ftp.expasy.org/databases/sp_tr_nrdb]
- Pearson WR: Flexible similarity searching with the FASTA3 program package. In Bioinformatics Methods and Protocols (Edited by: Misener S, Krawetz SA). Totowa, NJ: Humana Press 1999, 185–219.
- Burset M, Guigo R: Evaluation of gene structure prediction programs. Genomics 1996, 34: 353–367. doi:10.1006/geno.1996.0298
This article is published under license to BioMed Central Ltd. This is an open-access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.