A Bayesian method for identifying missing enzymes in predicted metabolic pathway databases
© Green and Karp; licensee BioMed Central Ltd. 2004
Received: 26 January 2004
Accepted: 09 June 2004
Published: 09 June 2004
The PathoLogic program constructs Pathway/Genome databases by using a genome's annotation to predict the set of metabolic pathways present in an organism. PathoLogic determines the set of reactions composing those pathways from the enzymes annotated in the organism's genome. Most annotation efforts fail to assign function to 40–60% of sequences. In addition, large numbers of sequences may have non-specific annotations (e.g., thiolase family protein). Pathway holes occur when a genome appears to lack the enzymes needed to catalyze reactions in a pathway. If a protein has not been assigned a specific function during the annotation process, any reaction catalyzed by that protein will appear as a missing enzyme or pathway hole in a Pathway/Genome database.
We have developed a method that efficiently combines homology and pathway-based evidence to identify candidates for filling pathway holes in Pathway/Genome databases. Our program not only identifies potential candidate sequences for pathway holes, but combines data from multiple, heterogeneous sources to assess the likelihood that a candidate has the required function. Our algorithm emulates the manual sequence annotation process, considering not only evidence from homology searches, but also considering evidence from genomic context (i.e., is the gene part of an operon?) and functional context (e.g., are there functionally-related genes nearby in the genome?) to determine the posterior belief that a candidate has the required function. The method can be applied across an entire metabolic pathway network and is generally applicable to any pathway database. The program uses a set of sequences encoding the required activity in other genomes to identify candidate proteins in the genome of interest, and then evaluates each candidate by using a simple Bayes classifier to determine the probability that the candidate has the desired function. We achieved 71% precision at a probability threshold of 0.9 during cross-validation using known reactions in computationally-predicted pathway databases. After applying our method to 513 pathway holes in 333 pathways from three Pathway/Genome databases, we increased the number of complete pathways by 42%. We made putative assignments to 46% of the holes, including annotation of 17 sequences of previously unknown function.
Our pathway hole filler can be used not only to increase the utility of Pathway/Genome databases to both experimental and computational researchers, but also to improve predictions of protein function.
Genome sequencing projects generate large numbers of nucleotide sequences each year . Once the sequences are obtained, functions must be assigned to these new sequences. This is typically accomplished by searching large, public databases for similar sequences. Assessment methods include hidden Markov models and identification of functional motifs [2–4]. These methods assign functional annotations based on sequence alone and do not incorporate non-sequence-based information.
Most genome annotation efforts fail to assign function to 40 – 60% of the new sequences. Even when annotated, many functions remain incomplete (only one function of a multidomain protein) or nonspecific (e.g., "thiolase family protein"). Operon- and pathway-based information can provide additional clues about protein function, and can clarify incomplete or nonspecific annotations. For instance, the annotation of CC1617 (a Caulobacter cresentus gene) as guaB (E.C.# 184.108.40.206) is supported by the fact that CC1617 is part of a predicted operon with guaA (E.C.# 220.127.116.11) and the pathway for de novo biosynthesis of purine nucleotides (I) includes both of these reactions.
Previous computational efforts to identify missing enzymes in metabolic pathways have focused solely on sequence similarity as a means to find candidate enzymes. Other methods have proposed the use of additional types of evidence, but have not provided a computational algorithm for combining these data [6–8]. Rather than merely providing a score of the similarity of a sequence from the genome of interest to sequences that catalyze the same reaction, we use a Bayes classifier to determine the probability that the candidate protein has the function required to fill the pathway hole. Our computational method focuses on identification of potential candidates to fill pathway holes and evaluation of these candidates by combining homology-, operon-, and pathway-based data. This approach will refine metabolic maps available in PGDBs and improve annotations for previously unannotated or incompletely annotated proteins.
1. Sequence retrieval – Retrieve from Swiss-Prot  and PIR  sequences for enzymes that catalyze the desired reaction in other organisms. Because these sequences are not necessarily homologs, we will refer to enzymes with the same function in a variety of organisms as isozymes. For Swiss-Prot, the program retrieves Swiss-Prot IDs directly from the ENZYME database. For PIR sequences, the program retrieves IDs from the MetaCyc PGDB. Sequences are then retrieved directly from the most recent version of each database.
3. Data consolidation – Congruence analysis of the resulting BLAST hits to consolidate data reported for sequences that align with one or more query isozymes.
4. Candidate evaluation – Determine the probability that each candidate protein has the activity required by the missing reaction.
Steps 1 through 3 compose the "candidate identification" phase of the process and generate the evidence used in the fourth step, "candidate evaluation". Our implementation of the homology search and data consolidation steps is based on the Shotgun congruence analysis algorithm . We applied our method to PGDBs developed by SRI's Bioinformatics Research Group for three organisms – CauloCyc (Caulobacter cresentus ), MtbRvCyc (Mycobacterium tuberculosis ), and VchoCyc (Vibrio cholerae ). All three PGDBs are available through the BioCyc Web site at http://BioCyc.org
In this report, we will describe both the candidate identification and evaluation steps of our algorithm, including the Bayesian network used for evaluation and the calculation of the conditional probability distributions used for the network. We will also present the results of a five-fold cross-validation study completed to evaluate the predictive value of various network structures. Our prediction of putative hole-fillers for three PGDBs will demonstrate the utility of the method for identifying functions for unannotated genes and for increasing the completeness of the metabolic map provided by a PGDB.
To determine the function of a sequence, researchers typically use a single sequence of unknown function to search for potential homologs in a large, public database. We have, in effect, reversed this search process to fill pathway holes in a PGDB. The activity of the missing reaction is known from the inferred pathway; we search the genome for a sequence that will provide that function. Our method uses multiple isozymes from other organisms to search for similar sequences in the genome used to build the PGDB. Searching the smaller genome database greatly increases search sensitivity by reducing the probability of finding a match by chance [18, 19]. Also, if a sequence with the desired function exists in the genome, its alignment with multiple isozymes will be more credible than an alignment between a single sequence queried against a large sequence database [4, 13, 14].
For each pathway hole, we retrieve isozyme sequences from the Swiss-Prot and PIR databases. Next, we use BLAST (version 2.0.10) to search the genome of interest with each of the query sequences. The BLAST queries use the default search parameters (e.g., filtering on, gapped) with the E-value cutoff reduced to 1.0 (from the default of 10) to reduce the number of false hits returned but maintain the ability to identify hits with remote similarity.
Node names and descriptions for Bayes classifier.
true if the protein has the function required to fill the pathway hole, false if it does not
the number of query sequences whose BLAST output included the candidate sequence
negative log of the E-value for the best alignment of the candidate with a query sequence
the average rank of the candidate sequence in the BLAST output lists (e.g., if a candidate is the best hit in each search, the average rank for the candidate is 1)
the average of each alignment length normalized by the length of the query sequence
true if the hit is in the same directon as another gene in the same pathway; a directon is a contiguous series of genes transcribed in the same direction
true if the hit is adjacent to one of the genes coding the enzyme for an adjacent reaction in the pathway
Example of consolidation of BLAST output data for glutamine-hydrolyzing NAD(+) synthase.
Sequences producing significant alignments
rank in output
Hits found by Query Isozyme Q9CBZ6
CC3619 NAD(+) synthetase, putative
CC2963 hydrolase, carbon-nitrogen family
CC3013 TonB-dependent receptor
Hits found by Query Isozyme Q58747
CC3619 NAD(+) synthetase, putative
CC1620 GMP synthase
Hits found by Query Isozyme F86762
CC3619 NAD(+) synthetase, putative
CC1541 2-isopropylmalate synthase
Hits found by Query Isozyme P18843
CC3619 NAD(+) synthetase, putative
CC1541 2-isopropylmalate synthase
Hits found by Query Isozyme P47623
CC3619 NAD(+) synthetase, putative
CC1620 GMP synthase
Results of data consolidation
consolidated data for hits from target genome
Bayesian network structure
Each of the candidate hits is evaluated by calculating the probability that the sequence encodes the desired function based on operon-, homology- and pathway-based data. We use a Bayesian network to calculate this probability.
While the N0 network may more accurately capture the relationship between these variables, large amounts of data would be required to determine the probability distribution for each node in the network. Thus, we instead chose to assess the adequacy of a drastically simpler network. Figure 4 shows both N0 and the static naïve Bayes classifier we evaluated, N1. We eliminated all conditional dependencies between the child nodes (shown as dashed lines in Figure 4). We also added pathway-based data to the network (shown as double-lined nodes in Figure 4). Each node in our model is described in Table 1. Since the naïve classifier has only one parent node (has-function), the probability distributions can be obtained from a much smaller dataset.
Calculating the probability that a candidate catalyzes the missing reaction
In evaluating the candidate sequences identified by our searches, we are primarily concerned with the network's parent node, has-function. The probability that this state is true, P(has-function), or P, is the probability that a candidate protein has the function needed for the missing reaction. The probability that the protein does not have the needed function is 1-P. Each of the other nodes in the network will have probability distributions conditioned on the state of the has-function node. For example, suppose our network structure included only the parent node, has-function, and two child nodes, average-rank and pathway-directon, as defined in Table 1. To calculate the probability that a candidate has the desired function given the evidence for that candidate, we need the following data: the evidence for the particular candidate (i.e., the candidate's values for average-rank and pathway-directon), the probability of finding that evidence if the candidate has the desired function or if it does not, and the prior probability that any candidate has the desired function. For our example network and candidate we need
I – The average-rank of the candidate and the value of pathway-directon. Let's assume that the values of average-rank and pathway-directon for our example candidate are 1.5 and "true", respectively.
II – The probability that a protein has a particular value for each child node given that it has or does not have the desired function. For our example candidate, P(average-rank = 1.5 | has-function) = 0.40 and P(average-rank = 1.5 | ¬has-function) = 0.03. Also, P(pathway-directon = true | has-function) = 0.24 and P(pathway-directon = true | ¬has-function) = 0.04.
is the evidence collected from the BLAST searches for each candidate protein. The values for (II) are taken from the conditional probability distributions. Forty percent of the candidates have an average-rank between 1.0 and 2.0 when the candidate is a true hit. Likewise, pathway-directon = "true" for 24% of the candidates that are true hits (conversely, pathway-directon is "false" for 76% of the candidates when the candidate is a true hit). Since our network assumes conditional independence among the child nodes, we use Bayes' Rule to determine P(has-function | average-rank = 1.5 and pathway-directon = "true"):
Calculation of conditional distributions
To perform this calculation for each candidate, we must determine a conditional probability distribution for each node in the network (conditioned on the state of the has-function node). We derived the data used to calculate these distributions from the known reactions in each PGDB.
Cross-validation and statistical evaluation
To evaluate our model structure, we divided the set of known reactions in a PGDB and their candidate proteins randomly into five separate sets. Given the network and the associated distributions, we then calculated P for each candidate sequence from the data collected for the sequence. Using these true hits, false hits, and their associated P's, we evaluated various network structures. For each set, the probability distributions were derived from the other four.
Since classifying each protein (has-function or ¬has-function) depends on the chosen probability threshold, a single assessment of specificity and sensitivity will not completely describe the predictive power of the model. Hence, we compared models by plotting the number of true positives as a function of the number of false positives. We used McNemar's test  to determine statistically significant differences between the models at various false positive rates.
PGDB and cross-validation statistics.
pathways in PGDB
# true hits
# false hits
positive predictive value at P > 0.9
true-positives at P > 0.9
Cross-validation and model exploration
To explore the adequacy of the model proposed for the Bayes classifier, we performed fivefold cross-validation on the known reactions from the three PGDBs.
The candidate identification step of our program generates an average of 27 candidate proteins per pathway hole. Although, we identify one true hit for most reactions, some reactions (e.g., 101 for VchoCyc) have multiple true hits (when multiple genes are assigned to a reaction either as part of a multimeric complex or as separate polypeptides that catalyze the same reaction), while other reactions (e.g., 44 for VchoCyc) have no true hits identified (i.e., either no isozyme sequences were available or all candidate proteins for that reaction are false hits).
Using data from the CauloCyc PGDB, we explored various network configurations to determine their effect on the predictive value of the classifier. The following three models were included in our investigation:
model 1 – classification of each hit by the Bayes classifier shown in Figure 4, N1
model 2 – classification of each hit by the Bayes classifier excluding the best-E-value node
model 3 – classification of each hit by E-value of the best alignment alone (i.e., choosing an E-value cutoff to classify hits as positive or negative)
Candidate identification for missing reactions
We used model 1 to identify enzymes for the missing reactions in MtbRvCyc, CauloCyc, and VchoCyc.
Summary of hole-filling results.
missing reactions w/ isozyme seqs
putative enzymes found at P > 0.9
fillers with borderline P (0.5 to 0.9)
holes with no BLAST hits
holes filled with ORFs at P > 0.8
newly completed pathways at P > 0.9
newly completed pathways at P > 0.8
The function of the protein was correctly annotated; however, due to inconsistent and nonstandardized annotations, the PathoLogic name-matcher did not recognize or understand the annotation, leading to a "missing enzyme" when the enzyme had, in fact, been identified in the genome already.
The annotated activity is nonspecific (e.g., "thiolase family protein" or "probable phosphoribosyltransferase"). Our program provides a more exact specification of the enzyme's function than the original annotation.
The annotated activity is incomplete (e.g., a multifunctional protein annotated with only one function). Our program can identify additional functions that were missed by the original annotation process.
The functional annotation of the protein is inconsistent with the activity required for the reaction. These instances may represent either false positives classified by our program or, perhaps, incorrectly annotated proteins.
Example pathway: pyridine nucleotide biosynthesis
The first missing enzyme, nadB (CC2913), was actually annotated as L-aspartate oxidase in the Caulobacter genome. Because the reaction (and the E.C. number of the reaction) catalyzed by nadB changes depending on whether it acts as a monomer (1.4.3.-) or in complex with nadA (18.104.22.168) [22, 23], CauloCyc includes two different reactions for L-aspartate oxidase. During the creation of CauloCyc, PathoLogic matched CC2913 to the monomeric reaction, but did not match the enzyme to the reaction catalyzed by the complex (complexes must be created by the user), leaving a hole in the pathway.
Our program assigns putative nicotinate-nucleotide adenylyltransferase activity to CC3431 for the second missing reaction in the pathway. CC3431 lacked any functional description in the original annotation. When queried against GenBank, the highest scoring hit to CC3431 is a nicotinate-nucleotide adenylyltransferase with an E-value of 3e-38. Against the nicotinate (nicotinamide) nucleotide adenylyltransferase TIGRFAM , CC3431 scores just above the trusted cutoff score. The sequence has been assigned to the NadD COG (COG1057) .
The third missing reaction, glutamine-dependent NAD+ synthetase, is catalyzed by the same enzyme catalyzing the NH3-dependent reaction. CC3619 includes an N-terminal domain that enables the reaction to proceed with either substrate . Each of the two domains of CC3619 has been assigned to COGs (COG0388, predicted amidohydrolase and COG0171, NAD synthase). Additional files 5, 6 and 7 contain alignments  of the set of sequences used to query the Caulobacter genome and the putative enzyme assigned to each pathway hole in the example pathway.
We have shown that our hole-filling program can improve the quality of PGDBs and the quality of genome annotations by clarifying nonspecific and incomplete annotations and providing annotations for sequences of previously unknown function. For the three PGDBs used in this work, an average of 27% of the pathways were complete before running our program. After filling pathway holes, an average of 38% of the pathways are complete, and an average of 28% of the previously missing reactions have putative enzymes assigned with probabilities above 0.9.
In the course of this work, we have identified additional uses for the pathway hole-filler in improving the quality of a PGDB. For example, in cases where several enzymes have been assigned to a reaction by PathoLogic, our predictor may determine which enzyme is most likely to be the correct one. When the set of enzymes all have the same general functional annotation (e.g., aldehyde dehydrogenase), we can provide a more specific function for the enzyme assigned by our predictor when the activity of the missing reaction is more specific.
Osterman and Overbeek have provided a review of comparative genomics techniques that can be applied to the identification of missing enzymes in metabolic pathways . The authors provide detailed descriptions and examples of cases where homology alone was insufficient for finding a missing gene to fill a pathway hole. Their approach encompasses three phases: identifying holes in pathways, identifying and ranking candidate genes, and experimental verification. Identification and ranking of candidates includes the use of comparative genomics techniques to gather potential candidates and the evidence needed to support or refute the assignment of the missing activity to that gene. The review proposes the approach to be used by experimentalists in identifying missing genes, but does not propose a computational method. Our method employs a similar approach for gathering candidates and evidence, but then combines these elements in a way that can be efficiently applied to large-scale predictions.
Reed et al.  have completed a "network gap analysis" for E. coli similar to our search for missing enzymes to fill pathway holes. As a result of this analysis, they made putative functional assignments to 55 ORFs in the E. coli K-12 genome, using an approach similar to the candidate identification step of our algorithm. For a particular hole of interest, a set of orthologous sequences was used to find sequences with that function in the E. coli genome. Their approach differs from ours in that they used MEME  and ClustalW  to generate profiles for the set of orthologous sequences. These profiles were then used with MAST  and HMMER  to query the genome rather than searching the genome with each sequence individually. Search results were inspected manually to confirm the putative functional assignments, and the three top-scoring hits were reported for putative reannotations. The study provided no specific evaluation criteria for recommended reannotations (e.g., E-value cutoff) besides the appearance of the sequence within the top three hits for a search.
Our algorithm provides several improvements over this approach. Our Bayesian network can be used programmatically to rigorously combine disparate sources of evidence including sequence-based and non-sequence-based sources. MAST and HMMER consider only sequence-based information, and Reed et al. do not provide an automated means of combining the evidence from their searches. Following the identification of candidate sequences, our program calculates a probability from homology-, operon-, and pathway-based data. The probability P(has-function) gives a clear indication of how likely it is that each candidate provides a particular function; thus, for each reaction, the set of potential candidates can be directly compared to each other based on all of the evidence included in our network. Comparing E-values from MAST and HMMER excludes valuable contextual information. Instead of gauging the likelihood that a sequence has the desired function, E-values reflect the likelihood that each match might be found by chance in the database, which is not the question most searches intend to ask. Also, our method does not restrict the search to only the elements common to all sequences used to build the profile (i.e., the query sequences are not assumed to be evolutionarily related), nor does it require multiple sequences. The program evaluates candidates found by a single query sequence or multiple query sequences in the same way. The Shotgun-score node appropriately adjusts the P(has-function) based on the number of query sequences identifying a candidate.
Probabilities calculated for putative functional assignments made by Reed et al. Bnumbers (unique IDs for E. coli genes) shown in italics are below the P cutoff (P > 0.8). Those in bold correspond to the gene assigned to the reaction in EcoCyc when one is assigned.
Comparison of selected examples from Reed et al. method.
Limitations of the predictor
The distributions used to determine P(has-function | evidence) for each candidate protein are calculated from the data collected for the known reactions in the PGDB. Proteins linked to reactions by PathoLogic may have stronger similarity to the set of query isozymes than the proteins that should be assigned to missing reactions. Hence, the probability distributions calculated from known reactions may not accurately reflect the distributions for missing reactions.
The classifier structure in Figure 4 assumes conditional independence among the evidence nodes. Obviously, the average-rank of a hit and the percent of the query sequence aligned are likely to correlate with the best-E-value of the hit. Like several other Bayesian models applied to biological problems, the violation of the conditional independence assumption does not preclude our model from adequately classifying potential enzymes. Future work will include learning the structure of the model, thereby incorporating the appropriate dependencies among the evidence nodes.
Our predictor is generally applicable to any organism's PGDB and is useful not only for developing a more complete picture of the organism's metabolism, but also for identifying the function of ORFs and clarifying incomplete or nonspecific annotations. We currently incorporate evidence from homology, operon, and metabolic pathway relationships into our candidate evaluation step, but those candidates are identified based solely on homology to known proteins in SWISS-PROT or PIR. With this limitation, our program will not identify enzymes for any reaction catalyzed by an enzyme with an extremely divergent sequence or an enzyme whose activity is the result of convergent evolution. Also, reactions catalyzed by enzymes lacking sequenced isozymes in other organisms will not be identified by our program. Incorporating non-homology-based data will help to address these limitations. Fortunately, given the simple structure of the classifier, additional evidence nodes can be easily integrated. Expression data and phylogenetic profiles  might enhance the accuracy of the predictions and allow identification of candidate proteins with little to no homology to known sequences.
Our pathway hole filler provides an effective, computational method for combining evidence from homology data, operon-based data, and pathway context to identify missing reactions in a Pathway/Genome database. By identifying missing enzymes in a genome, we can not only increase the completeness of the PGDB, but we can also improve existing genome annotations by identifying functions for previously unannotated proteins and clarifying non-specific annotations. With the completion of additional pathways in PGDBs and improvements in functional annotation provided by our approach, experimental and computational researchers will benefit from PGDBs that include more accurate and relevant information.
Availability and requirements
The pathway hole filler is available as part of the Pathway Tools software which is freely available freely to academics. Contact (email@example.com) for information on obtaining the software.
This work was supported in part by grant No. DE-FG03-01ER63219 from the Department of Energy, DARPA contract N66001-01-C-8011, and NLM National Institutes of Health Grant NLM 07033. This financial support does not constitute an endorsement of the views expressed herein.
- Benson Dennis A., Karsch-Mizrachi Ilene, Lipman David J., Ostell James, Wheeler David L.: GenBank. Nucl Acids Res 2003, 31: 23–27. 10.1093/nar/gkg057PubMed CentralView ArticlePubMedGoogle Scholar
- Hughey R, Krogh A: Hidden Markov models for sequence analysis: extension and analysis of the basic method. Comput Appl Biosci 1996, 12: 95–107.PubMedGoogle Scholar
- Karplus K, Barrett C, Hughey R: Hidden Markov models for detecting remote protein homologies. Bioinformatics 1998, 14: 846–856. 10.1093/bioinformatics/14.10.846View ArticlePubMedGoogle Scholar
- Krogh A, Brown M, Mian IS, Sjolander K, Haussler D: Hidden Markov models in computational biology. Applications to protein modeling. J Mol Biol 1994, 235: 1501–1531. 10.1006/jmbi.1994.1104View ArticlePubMedGoogle Scholar
- Karp PD, Paley S, Romero P: The Pathway Tools software. Bioinformatics 2002, 18: S225–32.View ArticlePubMedGoogle Scholar
- Claudel-Renard Clotilde, Chevalet Claude, Faraut Thomas, Kahn Daniel: Enzyme-specific profiles for genome annotation: PRIAM. Nucl Acids Res 2003, 31: 6633–6639. 10.1093/nar/gkg847PubMed CentralView ArticlePubMedGoogle Scholar
- Osterman A, Overbeek R: Missing genes in metabolic pathways: a comparative genomics approach. Curr Opin Chem Biol 2003, 7: 238–251. 10.1016/S1367-5931(03)00027-9View ArticlePubMedGoogle Scholar
- Cordwell SJ: Microbial genomes and "missing" enzymes: redefining biochemical pathways. Archives Microbiology 1999, 172: 269–279. 10.1007/s002030050780View ArticleGoogle Scholar
- The BioCyc Knowledge Library[http://BioCyc.org]
- Boeckmann B, Bairoch A, Apweiler R, Blatter MC, Estreicher A, Gasteiger E, Martin MJ, Michoud K, O'Donovan C, Phan I, Pilbout S, Schneider M: The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Research 2003, 31: 365–370. 10.1093/nar/gkg095PubMed CentralView ArticlePubMedGoogle Scholar
- Wu CH, Yeh LS, Huang H, Arminski L, Castro-Alvear J, Chen Y, Hu Z, Kourtesis P, Ledley RS, Suzek BE, Vinayaka CR, Zhang J, Barker WC: The Protein Information Resource. Nucleic Acids Research 2003, 31: 345–347. 10.1093/nar/gkg040PubMed CentralView ArticlePubMedGoogle Scholar
- Altschul SF, Gish Warren, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. Journal of Molecular Biology 1990, 215: 403–410. 10.1006/jmbi.1990.9999View ArticlePubMedGoogle Scholar
- Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research 1997, 25: 3389–3402. 10.1093/nar/25.17.3389PubMed CentralView ArticlePubMedGoogle Scholar
- Pegg SC, Babbitt PC: Shotgun: getting more from sequence similarity searches. Bioinformatics 1999, 15: 729–740. 10.1093/bioinformatics/15.9.729View ArticlePubMedGoogle Scholar
- Nierman WC, Feldblyum TV, Laub MT, Paulsen IT, Nelson KE, Eisen JA, Heidelberg JF, Alley MR, Ohta N, Maddock JR, Potocka I, Nelson WC, Newton A, Stephens C, Phadke ND, Ely B, DeBoy RT, Dodson RJ, Durkin AS, Gwinn ML, Haft DH, Kolonay JF, Smit J, Craven MB, Khouri H, Shetty J, Berry K, Utterback T, Tran K, Wolf A, Vamathevan J, Ermolaeva M, White O, Salzberg SL, Venter JC, Shapiro L, Fraser CM, Eisen J: Complete genome sequence of Caulobacter crescentus. Proc Natl Acad Sci U S A 2001, 98: 4136–4141. 10.1073/pnas.061029298PubMed CentralView ArticlePubMedGoogle Scholar
- Cole ST, Brosch R, Parkhill J, Garnier T, Churcher C, Harris D, Gordon SV, Eiglmeier K, Gas S, Barry C. E., 3rd, Tekaia F, Badcock K, Basham D, Brown D, Chillingworth T, Connor R, Davies R, Devlin K, Feltwell T, Gentles S, Hamlin N, Holroyd S, Hornsby T, Jagels K, Barrell BG, al. et: Deciphering the biology of Mycobacterium tuberculosis from the complete genome sequence. Nature 1998, 393: 537–544. 10.1038/31159View ArticlePubMedGoogle Scholar
- Heidelberg JF, Eisen JA, Nelson WC, Clayton RA, Gwinn ML, Dodson RJ, Haft DH, Hickey EK, Peterson JD, Umayam L, Gill SR, Nelson KE, Read TD, Tettelin H, Richardson D, Ermolaeva MD, Vamathevan J, Bass S, Qin H, Dragoi I, Sellers P, McDonald L, Utterback T, Fleishmann RD, Nierman WC, White O: DNA sequence of both chromosomes of the cholera pathogen Vibrio cholerae. Nature 2000, 406: 477–483. 10.1038/35020000View ArticlePubMedGoogle Scholar
- Brenner SE, Chothia C, Hubbard TJ: Assessing sequence comparison methods with reliable structurally identified distant evolutionary relationships. Proc Natl Acad Sci U S A 1998, 95: 6073–6078. 10.1073/pnas.95.11.6073PubMed CentralView ArticlePubMedGoogle Scholar
- Spang R, Vingron M: Statistics of large-scale sequence searching. Bioinformatics 1998, 14: 279–284. 10.1093/bioinformatics/14.3.279View ArticlePubMedGoogle Scholar
- Pearl Judea: Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. San Mateo, CA, Morgan Kaufmann 1988.Google Scholar
- Pagano Marcello, Gauvreau Kimberlee: Principles of Biostatistics. 1st Edition Belmont, CA, Wadsworth Publishing Company 1993.Google Scholar
- Flachmann R, Kunz N, Seifert J, Gutlich M, Wientjes FJ, Laufer A, Gassen HG: Molecular biology of pyridine nucleotide biosynthesis in Escherichia coli. Cloning and characterization of quinolinate synthesis genes nadA and nadB. Eur J Biochem 1988, 175: 221–228.View ArticlePubMedGoogle Scholar
- Nasu S, Wicks FD, Gholson RK: L-Aspartate oxidase, a newly discovered enzyme of Escherichia coli, is the B protein of quinolinate synthetase. J Biol Chem 1982, 257: 626–632.PubMedGoogle Scholar
- Haft Daniel H., Selengut Jeremy D., White Owen: The TIGRFAMs database of protein families. Nucl Acids Res 2003, 31: 371–373. 10.1093/nar/gkg128PubMed CentralView ArticlePubMedGoogle Scholar
- Tatusov RL, Natale DA, Garkavtsev IV, Tatusova TA, Shankavaram UT, Rao BS, Kiryutin B, Galperin MY, Fedorova ND, Koonin EV: The COG database: new developments in phylogenetic classification of proteins from complete genomes. Nucleic Acids Res 2001, 29: 22–28. 10.1093/nar/29.1.22PubMed CentralView ArticlePubMedGoogle Scholar
- Cantoni R, Branzoni M, Labo M, Rizzi M, Riccardi G: The MTCY428.08 gene of Mycobacterium tuberculosis codes for NAD+ synthetase. J Bacteriol 1998, 180: 3218–3221.PubMed CentralPubMedGoogle Scholar
- Thompson JD, Higgins DG, Gibson TJ: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucl Acids Res 1994, 22: 4673–4680.PubMed CentralView ArticlePubMedGoogle Scholar
- Reed JL, Vo TD, Schilling CH, Palsson BO: An expanded genome-scale model of Escherichia coli K-12 (iJR904 GSM/GPR). Genome Biology 2003, 4: R54.. 10.1186/gb-2003-4-9-r54PubMed CentralView ArticlePubMedGoogle Scholar
- Bailey TL, Elkan C: Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc Int Conf Intell Syst Mol Biol 1994, 2: 28–36.PubMedGoogle Scholar
- Bailey TL, Gribskov M: Combining evidence using p-values: application to sequence homology searches. Bioinformatics 1998, 14: 48–54. 10.1093/bioinformatics/14.1.48View ArticlePubMedGoogle Scholar
- Eddy SR: Hidden Markov models. Curr Opin Struct Biol 1996, 6: 361–365. 10.1016/S0959-440X(96)80056-XView ArticlePubMedGoogle Scholar
- Karp PD, Riley M, Saier M, Paulsen IT, Collado-Vides J, Paley SM, Pellegrini-Toole A, Bonavides C, Gama-Castro S: The EcoCyc Database. Nucleic Acids Research 2002, 30: 56–58. 10.1093/nar/30.1.56PubMed CentralView ArticlePubMedGoogle Scholar
- Pellegrini M, Marcotte EM, Thompson MJ, Eisenberg D, Yeates TO: Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc Natl Acad Sci U S A 1999, 96: 4285–4288. 10.1073/pnas.96.8.4285PubMed CentralView ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.