Genome-wide discovery of missing genes in biological pathways of prokaryotes
© Chen et al; licensee BioMed Central Ltd. 2011
Published: 15 February 2011
Reconstruction of biological pathways is typically done through mapping well-characterized pathways of model organisms to a target genome, through orthologous gene mapping. A limitation of such pathway-mapping approaches is that the mapped pathway models are constrained by the composition of the template pathways, e.g., some genes in a target pathway may not have corresponding genes in the template pathways, the so-called “missing gene” problem.
We present a novel pathway-expansion method for identifying additional genes that are possibly involved in a target pathway after pathway mapping, to fill holes caused by missing genes as well as to expand the mapped pathway model. The basic idea of the algorithm is to identify genes in the target genome whose homologous genes share common operons with homologs of any mapped pathway genes in some reference genome, and to add such genes to the target pathway if their functions are consistent with the cellular function of the target pathway.
We have implemented this idea using a graph-theoretic approach and demonstrated the effectiveness of the algorithm on known pathways of E. coli in the KEGG database. On all KEGG pathways containing at least 5 genes, our method achieves an average of 60% positive predictive value (PPV) and the performance is increased with more seed genes added. Analysis shows that our method is highly robust.
An effective method is presented to find missing genes in biological pathways of prokaryotes, which achieves high prediction reliability on E. coli at a genome level. Numerous missing genes are found to be related to knwon E. coli pathways, which can be further validated through biological experiments. Overall this method is robust and can be used for functional inference.
Reconstruction of biological pathways is a fundamental problem in understanding the functional mechanisms of cellular organisms. Substantial efforts have been put into the elucidation of biological pathways, particularly for prokaryotic organisms, in a systematic manner based on high-throughput omic data and computational prediction. As a result, a number of pathway databases have been developed and are being widely used, such as KEGG and BioCyc [1–5]. These databases not only serve as an information resource for retrieving well-characterized pathways for specific organisms but also provide a set of pathway templates for reconstructing pathways for organisms that are not directly covered by the databases, as substantial portions of homologous pathways may be conserved across different organisms, particularly related organisms.
A number of computer programs have been developed for pathway reconstruction through mapping known pathways from one organism to another. While some success has been reported on these programs, there has been a general issue associated with such homologous pathway mapping-based approaches, which is that homologous pathways are generally not identical and hence the mapped pathways could miss some parts not covered by their well-characterized homologous template pathways. This problem, called pathway holes or missing genes, has been widely recognized [6–9]. A number of methods have been developed to find such missing genes, based mainly on the idea of finding genes that are functionally associated with genes already in the mapped pathways. One class of such methods attempts to find enzyme-encoding genes missing in a mapped metabolic pathway based on multiple types of gene association information [8–10], taking advantage of the fact that genes encoding a metabolic pathway tend to group into clusters (e.g., operons). Another class of methods attempt to identify functional modules from some large gene association networks or groups [11–15], and then to suggest possible candidates for missing genes based on genes found in the same functional modules of genes already in mapped pathways. While these methods have provided useful information for searching for missing genes, there is clearly substantial room for improvement in terms of the functional specificity of their predicted candidates and the scope of applicability of the existing methods . Among the various areas for further improvements, we identified a few we can possibly improve on using the currently available information: (i) there have not been reliable methods for consideration and inclusion of functionally uncharacterized genes (often referred to as hypothetical and conserved genes) into partially predicted pathway models (e.g, mapped pathways); (ii) while (conserved) genomic synteny has been utilized for prediction of functionally associated genes, its true usefulness, other than operon information, is yet to be well documented. Previous studies have shown that there is a strong link between genes in the same operons and genes working in the same biological pathways . So full utilization of operon information should be a key direction for improving biological pathways, particular now as the state of the art prediction methods for operons have reached high accuracy (~90%) [17–19].
We present, in this paper, a novel computational method for identification and functional annotation of missing genes in a predicted pathway model, either through homologous pathway mapping or using other methods. The basic idea of the method can be outlined as follows. For any specified target genome, we define a distance between any pair of genes in the genome to measure the level of their functional relatedness in terms of a set of reference genomes. Specifically, two genes are functionally related if they (i) are homologous, (ii) share a common operon directly or through their homologs in a reference genome, (iii) are phylogenetically related, or (iv) deemed to be functionally related through combinations of the first three criteria. For any pair of functionally related genes in the target genome, their distance is defined essentially as the minimum number of applications of this recursive definition. Our algorithm identifies genes possibly involved in a target pathway based on their distances to genes already in the pathway. We have tested the algorithm on all characterized pathways of E. coli, using portions of the pathways as the initial pathway genes (called seed s), and found that the vast majority of the remaining genes of these known pathways are all within short distances to the seeds, confirming the effectiveness of our distance measure. Our study has also identified numerous genes with short distances to the known pathway genes, which we believe are highly promising candidates for addition to these known E. coli pathways. Limited analyses of the potential functional roles of these genes have been carried out, and reported in this paper.
High-level description of our algorithm
Selection of reference genomes
Currently over 1,000 bacterial and archaean genomes have been sequenced and are publicly available (NCBI release of September 2009). From this set, we have selected 185 strains (non-redundant genomes and plasmids) (see Additional File 1) from 185 different genera using the following rule: for each genus, select the genome with the longest sequence.
Calculation of homology-based distance
where p s (x i , x j ) is the BLAST E-value for genes x i , x j , and 185 is a normalization factor since when the E-value is smaller than 1e-185, it is set as 0 in the BLAST program. Clearly d s (x i , x j ) is between 0 and 1; and the more similar two genes are, the smaller the d s (x i , x j ) value is.
Calculation of operon-based distance
where p o (x i , x j ) represents the probability that x i , x j are in the same operon as given in .
Reference graph and linkage graph
where k is the number of edges in the path, and α is a scaling factor. In our current implementation, we set α = 380 and system(error) = 0.06 based on a ten-fold cross-validation method (see Parameter Selection). E(operon) and E(similarity) are the set of operon edges and the set of similarity edges, respectively.
with p being the frequency of 1’s in common positions between the two phylogenetic profiles. Note that the more similar two phylogenetic profiles are, the smaller their distance is.
Rank functional relatedness of candidate genes
where β is a scaling factor and set to 5, based on the ten-fold cross-validation method (see Parameter Selection); and T is set to be 2 if gene x i has both the path distance and phylogenetic distance, and as 1 if it has only one distance defined. The candidate genes are ranked by their combined distance and the final top γ genes are output (γ = 10 in this study).
Parameter Selection and Validation Method
where SP i , SE i and PPV i are SP, SE and PPV for the i th pathway, respectively, and N is the number of pathways considered.
For each to-be-determined parameter in our program, a ten-fold cross-validation procedure is used to derive the optimal value. Specifically, all the pathways are divided randomly into ten parts, nine for training and one for testing each time. The value with the best average is finally selected. The leave-one-out cross-validation procedure is used to assess the performance. For each pathway, its known genes are used as the seed-gene set. The procedure removes each gene from the pathway seed set one at a time, and then calculates the final combined distance from the remaining genes to the removed gene and all the left genes of the target genome. If the removed gene is output in the final top γ genes, it is counted as a successful prediction.
Performance measure calculation
While the major contribution to the prediction accuracy by our method is from operon and homology information, we have also assessed the contribution from phylogentic profiles. We noted that the phylogenetic profile gives a small increase for PPV (~ 4% for K = 5). When K increases, the contribution also increases (Figure 3). It shows that genes confirmed by the phylogenetic profile can reduce the mis-predicted genes from the graph-based prediction results, and increase the PPV value. This result suggests that phylogenetic profile can detect some genes which cannot be found by operon or sequence similarity alone.
One interesting observation we made is that our method gives rise to different performance levels for pathways in different functional categories. To fully investigate this observation, we have tested our algorithm on 18 different functional categories of KEGG pathways where each has at least 5 (assigned) genes. One special care needs to be taken when assessing the prediction performance as some KEGG pathways are predicted to form one “combined” pathway by our prediction. For example, all the pathways in Amino Acid Metabolism are put together into one combined “pathway”. Hence we need to evaluate the performance of our method on this combined “pathway”. The performance on the 18 categories of KEGG pathways is generally good except for the category of Biosynthesis of Secondary Metabolism, Metabolism of Other Amino Acids, Transcription and Xenobiotics Biodegradation and Metabolism (see Additional File 4). The reduced performance may be due to two reasons: (i) some correctly predicted genes are regarded as false positives since the combined pathway is incomplete; and (ii) the combined pathways may not be conserved across different genomes; and hence cannot be inferred by our method. We also calculated the PPV values of individual pathways whose number of genes is at least 30. They all have high prediction accuracy except for the Pyruvate Metabolism Pathway, which only gets 40% prediction accuracy (see Additional File 5).
Case study of the predicted pyruvate metabolism pathway
We have carefully analyzed our prediction results on the pyruvate metabolism pathway (eco00620) since it has the worst prediction performance among all the 21 E. coli pathways, each of which has at least 30 (assigned) genes. This KEGG pathway currently consists of 41 annotated genes (released in September 2009); five (pflD, tdcE, pflB, accC, ybiW) of them are correctly predicted in the top 10 by our method. Among the “incorrect” top 10 predictions (tdcD, eutD, ybiY, prpE, yiaY), some have been reported as correct genes involved in the pathway by a number of published papers. For example, gene ybiY is predicted as a “pyruvate formate lyase activating enzyme” in the NCBI and KEGG databases. Furthermore, we find three genes (tdcD, pflD, prpE) are all in the same “Propanoate Metabolism” pathway (eco00640), which is directly related to the pyruvate metabolism pathway. Actually, there are 10 genes that are common in both pathways.
Robustness analysis of parameters
We present a method to find pathway genes at a genome level, which can be used to fill pathway holes or recruit new genes into existing pathways. The results show that our method can achieve higher prediction accuracy and is very robust. The main advantage of our method is that by introducing the reference graph, we get a natural way to integrate different types of information such as genomic structure information and sequence similarity information. More information could possibly be added in future studies. For example, we can use information like regulons  and gene fusion events  to provide a more general framework for integrating different information, which can be easily included into our current program. Besides finding new genes for pathways, our method can also be used for functional module inference, as some functional modules may be the union of existing pathways.
This work is supported by National Science Foundation (NSF/DBI-0542119, NSF/DBI-0542119004, NSF/DEB-0830024, NSF/DBI-0821263, DOE/4000063512), also National Institutes of Health (1R01GM075331 and 1R01GM081682) and a Distinguished Scholar grant from the Georgia Cancer Coalition. This work is also supported in part by grants (60673059, 60373025 and 10926027) from the National Science Foundation of China, the Taishan Scholar Fund from Shandong Province, and the State Scholarship Fund of China (20073020). We also thank the financial support from the China Postdoctoral Science Foundation (20090450396), the Scientist Research Fund of Shandong Province (BS2009SW044), and the Doctoral Research Fund from the University of Jinan (XBS0914). Finally, we thank all the CSBL colleagues for their comments on this work; and thank Greg Vatcher for helps in correcting language errors.
The BioEnergy Science Center is a U.S. Department of Energy Bioenergy.
Research Center supported by the Office of Biological and Environmental Research in the DOE Office of Science. This study was supported in part by (e.g., funds, resources, technical expertise) provided by the University of Georgia Research Computing Center, a partnership between the Office of the Vice President for Research and the Office of the Chief Information Officer. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
This article has been published as part of BMC Bioinformatics Volume 12 Supplement 1, 2011: Selected articles from the Ninth Asia Pacific Bioinformatics Conference (APBC 2011). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/12?issue=S1.
- Harris MA, Clark J, Ireland A, Lomax J, Ashburner M, Foulger R, Eilbeck K, Lewis S, Marshall B, Mungall C, et al.: The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res 2004, 32(Databaseissue):D258–261.PubMedGoogle Scholar
- Wierling C, Herwig R, Lehrach H: Resources, standards and tools for systems biology. Brief Funct Genomic Proteomic 2007, 6(3):240–251. 10.1093/bfgp/elm027View ArticlePubMedGoogle Scholar
- Kanehisa M, Goto S, Hattori M, Aoki-Kinoshita KF, Itoh M, Kawashima S, Katayama T, Araki M, Hirakawa M: From genomics to chemical genomics: new developments in KEGG. Nucleic Acids Res 2006, 34(Databaseissue):D354–357. 10.1093/nar/gkj102PubMed CentralView ArticlePubMedGoogle Scholar
- Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR, Kiryutin B, Koonin EV, Krylov DM, Mazumder R, Mekhedov SL, Nikolskaya AN, et al.: The COG database: an updated version includes eukaryotes. BMC Bioinformatics 2003, 4: 41. 10.1186/1471-2105-4-41PubMed CentralView ArticlePubMedGoogle Scholar
- Keseler IM, Bonavides-Martinez C, Collado-Vides J, Gama-Castro S, Gunsalus RP, Johnson DA, Krummenacker M, Nolan LM, Paley S, Paulsen IT, et al.: EcoCyc: a comprehensive view of Escherichia coli biology. Nucleic Acids Res 2009, 37(Databaseissue):D464–470. 10.1093/nar/gkn751PubMed CentralView ArticlePubMedGoogle Scholar
- Osterman A, Overbeek R: Missing genes in metabolic pathways: a comparative genomics approach. Curr Opin Chem Biol 2003, 7(2):238–251. 10.1016/S1367-5931(03)00027-9View ArticlePubMedGoogle Scholar
- Cordwell SJ: Microbial genomes and "missing" enzymes: redefining biochemical pathways. Arch Microbiol 1999, 172(5):269–279. 10.1007/s002030050780View ArticlePubMedGoogle Scholar
- Green ML, Karp PD: A Bayesian method for identifying missing enzymes in predicted metabolic pathway databases. BMC Bioinformatics 2004, 5: 76. 10.1186/1471-2105-5-76PubMed CentralView ArticlePubMedGoogle Scholar
- Kharchenko P, Chen L, Freund Y, Vitkup D, Church GM: Identifying metabolic enzymes with multiple types of association evidence. BMC Bioinformatics 2006, 7: 177. 10.1186/1471-2105-7-177PubMed CentralView ArticlePubMedGoogle Scholar
- DeJongh M, Formsma K, Boillot P, Gould J, Rycenga M, Best A: Toward the automated generation of genome-scale metabolic networks in the SEED. BMC Bioinformatics 2007, 8: 139. 10.1186/1471-2105-8-139PubMed CentralView ArticlePubMedGoogle Scholar
- Kolesov G, Mewes HW, Frishman D: SNAPping up functionally related genes based on context information: a colinearity-free approach. J Mol Biol 2001, 311(4):639–656. 10.1006/jmbi.2001.4701View ArticlePubMedGoogle Scholar
- Sanguinetti G, Noirel J, Wright PC: MMG: a probabilistic tool to identify submodules of metabolic pathways. Bioinformatics 2008, 24(8):1078–1084. 10.1093/bioinformatics/btn066View ArticlePubMedGoogle Scholar
- Ulitsky I, Shamir R: Identification of functional modules using network topology and high-throughput data. BMC Syst Biol 2007, 1: 8. 10.1186/1752-0509-1-8PubMed CentralView ArticlePubMedGoogle Scholar
- Yan X, Mehan MR, Huang Y, Waterman MS, Yu PS, Zhou XJ: A graph-based approach to systematically reconstruct human transcriptional regulatory modules. Bioinformatics 2007, 23(13):i577–586. 10.1093/bioinformatics/btm227View ArticlePubMedGoogle Scholar
- Huang Y, Li H, Hu H, Yan X, Waterman MS, Huang H, Zhou XJ: Systematic discovery of functional modules and context-specific functional annotation of human genome. Bioinformatics 2007, 23(13):i222–229. 10.1093/bioinformatics/btm222View ArticlePubMedGoogle Scholar
- Cakmak A, Ozsoyoglu G: Mining biological networks for unknown pathways. Bioinformatics 2007, 23(20):2775–2783. 10.1093/bioinformatics/btm409View ArticlePubMedGoogle Scholar
- Brouwer RW, Kuipers OP, Hijum SA: The relative value of operon predictions. Brief Bioinform 2008.Google Scholar
- Dam P, Olman V, Harris K, Su Z, Xu Y: Operon prediction using both genome-specific and general genomic information. Nucleic Acids Res 2007, 35(1):288–298. 10.1093/nar/gkl1018PubMed CentralView ArticlePubMedGoogle Scholar
- Mao F, Dam P, Chou J, Olman V, Xu Y: DOOR: a database for prokaryotic operons. Nucleic Acids Res 2009, 37(Databaseissue):D459–463. 10.1093/nar/gkn757PubMed CentralView ArticlePubMedGoogle Scholar
- Korbel JO, Jensen LJ, von Mering C, Bork P: Analysis of genomic context: prediction of functional associations from conserved bidirectionally transcribed gene pairs. Nat Biotechnol 2004, 22(7):911–917. 10.1038/nbt988View ArticlePubMedGoogle Scholar
- Pellegrini M, Marcotte EM, Thompson MJ, Eisenberg D, Yeates TO: Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc Natl Acad Sci U S A 1999, 96(8):4285–4288. 10.1073/pnas.96.8.4285PubMed CentralView ArticlePubMedGoogle Scholar
- Sun J, Xu J, Liu Z, Liu Q, Zhao A, Shi T, Li Y: Refined phylogenetic profiles method for predicting protein-protein interactions. Bioinformatics 2005, 21(16):3409–3415. 10.1093/bioinformatics/bti532View ArticlePubMedGoogle Scholar
- Wu H, Su Z, Mao F, Olman V, Xu Y: Prediction of functional modules based on comparative genome analysis and Gene Ontology application. Nucleic Acids Res 2005, 33(9):2822–2837. 10.1093/nar/gki573PubMed CentralView ArticlePubMedGoogle Scholar
- Spirin V, Gelfand MS, Mironov AA, Mirny LA: A metabolic network in the evolutionary context: multiscale structure and modularity. Proc Natl Acad Sci U S A 2006, 103(23):8774–8779. 10.1073/pnas.0510258103PubMed CentralView ArticlePubMedGoogle Scholar
- Ravasz E, Somera AL, Mongru DA, Oltvai ZN, Barabasi AL: Hierarchical organization of modularity in metabolic networks. Science 2002, 297(5586):1551–1555. 10.1126/science.1073374View ArticlePubMedGoogle Scholar
- Clauset A, Moore C, Newman ME: Hierarchical structure and the prediction of missing links in networks. Nature 2008, 453(7191):98–101. 10.1038/nature06830View ArticlePubMedGoogle Scholar
- Salgado H, Gama-Castro S, Peralta-Gil M, Diaz-Peredo E, Sanchez-Solano F, Santos-Zavaleta A, Martinez-Flores I, Jimenez-Jacinto V, Bonavides-Martinez C, Segura-Salazar J, et al.: RegulonDB (version 5.0): Escherichia coli K-12 transcriptional regulatory network, operon organization, and growth conditions. Nucleic Acids Res 2006, 34(Databaseissue):D394–397. 10.1093/nar/gkj156PubMed CentralView ArticlePubMedGoogle Scholar
- Suhre K, Claverie JM: FusionDB: a database for in-depth analysis of prokaryotic gene fusion events. Nucleic Acids Res 2004, 32(Databaseissue):D273–276. 10.1093/nar/gkh053PubMed CentralView ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.