- Research article
- Open Access
In silico prioritisation of candidate genes for prokaryotic gene function discovery: an application of phylogenetic profiles
© Lin et al; licensee BioMed Central Ltd. 2009
- Received: 27 October 2008
- Accepted: 17 March 2009
- Published: 17 March 2009
In silico candidate gene prioritisation (CGP) aids the discovery of gene functions by ranking genes according to an objective relevance score. While several CGP methods have been described for identifying human disease genes, corresponding methods for prokaryotic gene function discovery are lacking. Here we present two prokaryotic CGP methods, based on phylogenetic profiles, to assist with this task.
Using gene occurrence patterns in sample genomes, we developed two CGP methods (statistical and inductive CGP) to assist with the discovery of bacterial gene functions. Statistical CGP exploits the differences in gene frequency against phenotypic groups, while inductive CGP applies supervised machine learning to identify gene occurrence pattern across genomes. Three rediscovery experiments were designed to evaluate the CGP frameworks. The first experiment attempted to rediscover peptidoglycan genes with 417 published genome sequences. Both CGP methods achieved best areas under receiver operating characteristic curve (AUC) of 0.911 in Escherichia coli K-12 (EC-K12) and 0.978 Streptococcus agalactiae 2603 (SA-2603) genomes, with an average improvement in precision of >3.2-fold and a maximum of >27-fold using statistical CGP. A median AUC of >0.95 could still be achieved with as few as 10 genome examples in each group of genome examples in the rediscovery of the peptidoglycan metabolism genes. In the second experiment, a maximum of 109-fold improvement in precision was achieved in the rediscovery of anaerobic fermentation genes in EC-K12. The last experiment attempted to rediscover genes from 31 metabolic pathways in SA-2603, where 14 pathways achieved AUC >0.9 and 28 pathways achieved AUC >0.8 with the best inductive CGP algorithms.
Our results demonstrate that the two CGP methods can assist with the study of functionally uncategorised genomic regions and discovery of bacterial gene-function relationships. Our rediscovery experiments also provide a set of standard tasks against which future methods may be compared.
- Phylogenetic Profile
- Area Under Receiver Operate Characteristic Curve
- Gene Prioritisation
- Functional Discovery
- Candidate Gene Prioritisation
Identifying gene functions is an important task in biology. The exponential growth of genome sequences has placed greater importance on the use of computational approaches for sequence analysis and annotation. With the development of high-throughput technology, methods of comparative genomics are increasingly used to assist with the identification of gene functions , as conventional methods of gene screening using transgenic organisms are resource intensive and time consuming. In practice, bench-side researchers frequently encounter extensive lists of genes that require further pruning and experimental validation. Accurate prioritisation of candidate genes, therefore, constitutes a key step in accelerating the discovery of gene functions.
In silico candidate gene prioritisation (CGP) ranks genes based upon the features associated with genes and the function of interest. A variety of gene features have been suggested for the prioritisation of causal genes in human diseases, including the co-occurrence of gene name and disease terminology in biomedical texts [2–5], sharing of terms in annotation or gene ontology databases [2, 4, 6–9], gene expression in different tissues [2, 4, 6], protein-protein interactions , similarity of gene or protein sequences [8, 9], presence of genes within a phenotype or diseases database , phylogenetic relationships , or a combination of the above [2, 4]. However, to construct a CGP system for prokaryotes, different forms of gene features are needed, as current CGP algorithms are skewed towards eukaryotic genomes and the systematic curation of annotation or genotype-phenotype databases are less complete than for eukaryotes. Hundreds of whole genome sequences of bacteria and thousands of partial genome sequences are available in public databases, yet prokaryotic genomes display a higher proportion of genes with unknown function than eukaryotes . In contrast, several methods for computational protein function discovery have been studied, including chromosomal proximity method, domain fusion analysis, analysis of gene expression patterns, and phylogenetic profiles . In particular, the phylogenetic profile method exploits knowledge of gene occurrences across a range of sequenced genomes and postulates that genes involved in the same metabolic pathway are frequently co-inherited. Phylogenetic profiles have been applied to unsupervised clustering of proteins to discover their functional linkages  and to discover conserved gene clusters in microbes (with probabilistic phylogenetic tree models) . Supervised approaches of phylogenetic profiles have also been applied to infer protein networks (with canonical correlation analysis ) and predicting protein functional class in Saccharomyces cerevisiae (with tree-based kernels ), in the discovery of protein localisation in eukaryotes , in functional annotation of genes (by correlation enrichments ). These studies suggested that the concept of phylogenetic profiles provides a valuable tool for predicting gene-function linkage. It was thus hypothesised that such concept can also be exploited as gene features for prioritising genes contributing to a particular phenotypic trait of interest, thus providing a practical and generalisable tool to guide microbiologists in gene selection.
This paper examines the practical application of the phylogenetic profile method for gene prioritisation to investigate its generalisability and applicability on both simple and complex traits in prokaryotes.
Phylogenetic profiles form an indirect connection between gene and function in two conceptual steps. The first step establishes the gene-genome relationship, by examining the occurrence (presence or absence) of a candidate gene (or its homolog) in a given genome. The second step groups genomes according to their known phenotypes. We investigate two scenarios in which CGP can be useful in assisting with functional discovery of uncharacterised genes in prokaryotes. The method of statistical CGP is used when the occurrence profile can be directly inferred from the study phenotype, whereas inductive CGP is used when the profile is obscure but a small number of genes known to contribute to the study phenotype are available. Candidate genes are then prioritised by either statistical scoring functions or supervised machine learning algorithms.
In addition, at present there are no clear benchmarks to allow comparison between these different approaches to gene prioritisation, and the extent to which such algorithms are capable of identifying target genes in bacteria remains unexplored. This paper takes advantage of selected metabolic processes with a well-understood genetic basis to craft gold standard prioritisation tasks. The two CGP approaches are evaluated by rediscovering genes participating in well-characterised biochemical pathways – the metabolism of peptidoglycan, fermentation in anaerobes, and selected metabolic pathways curated in Kyoto Encyclopaedia of Genes and Genomes (KEGG) . We ultimately aim to develop metrics that will provide an indication of the likelihood that highly prioritised genes are strong biological candidates, and the degree to all potential candidates have been identified for tasks such as the selection of biomarkers, the discovery of virulence genes, and the formulation of new hypotheses about uncharacterised genes.
Determination of genomic occurrences of candidate genes
To evaluate the performance of CGP methods, three case studies were selected for rediscovery experiments using well-known pathway genes as gold-standards. For each case study, the polypeptide sequences of n candidate genes were compared with all open reading frames (orf) of the k genome sequences from the National Centre for Biotechnology Information (NCBI, accessed April 2007) by Basic Local Alignment and Search Tool (BLASTP) http://blast.ncbi.nlm.nih.gov/Blast.cgi. If a candidate gene reached the critical E-value of < 10-5 in a given genome, a gene or gene homolog was defined as present in the genome. If a gene did not reach the critical E-value in a genome, the gene was recorded as absent from the genome. The binary states of gene occurrence were recorded in an n × k homolog matrix.
From the k genomes, k p genomes known to display the phenotype of interest (p) were selected as positive genome examples, and k n genomes not displaying p were chosen as negative genome examples. For each of the n candidate genes, the number of co-presence (homologs present in positive genome examples) and co-absence (homologs absent in negative genome examples) were counted and presented into a 2 × 2 contingency table, from which a number of statistical scoring functions was calculated. The scoring functions included: a) sensitivity (sens, the proportion of genes present in the positive genome examples), b) specificity (spec, proportion of genes absent in the negative genome examples), c) positive and negative predictive values (ppv/npv, the proportion of positive/negative genomes were present/absent when the gene was present/absent), d) arithmetic (amss) and harmonic (hmss) mean of sensitivity and specificity, e) odds ratio (OR, the odds of a gene existed in the positive example versus the odds of a gene was absent in the negative examples), f) chi-square scoring function (chisq, the deviation of the observed frequency from the expected proportion), g) directional chi-square function (bchisq, the chisq function with genes that displayed inverse associations be reversed to the bottom of the rank), and h) F-measure (F, the harmonic mean of the sensitivity and precision). The mathematical definitions of these scoring functions are listed in the Additionl file 1.
Inductive CGP ranks genes by finding genes with similar occurrence pattern across a number of bacterial genomes using supervised machine learning. A number of genes known to display a target phenotype or function p were selected as positive examples for the training set. Similarly, genes that did not contribute to p were selected as negative gene examples. The occurrences of genes in k genome examples were used as features for model training. Candidate genes were ranked by the score or posterior probability from the output of the machine learning classifiers. The machine learning classifiers included naïve Bayes (NB), logistic regression (LR; ridge = 10-5), J48 decision tree (J48, pruning confidence = 0.25), nearest neighbour classifier (IBk, with inverse distance weighing; k was determined by leave-one-out cross-validation), alternating decision tree (ADTree; boosting iteration = 10), support vector machines (SVM) with polynomial (SVM/Poly; linear kernel trained by sequential minimal optimisation algorithm, SMO) and radial basis function (SVM/RBF; trained by SMO; γ = 0.01) kernels. The Waikato Environment for Knowledge Analysis (WEKA) 3.5.6 was used for classifier training . For the purpose of benchmarking, the generalisation performance of inductive CGP was evaluated by stratified 10-fold cross-validation: for the n genes used as candidate genes for prioritisation, all n + genes from the validation set and the rest of n - genes not in the validation set were each randomly divided into 10 subsets. One-tenth of the the genes from each group ( of n + and n - genes) were sequentially selected as test set, whereas the rest of the genes were selected as training set to train inductive models. The performance of each inductive CGP algorithm was obtained by averaging areas under receiver operating characteristic curves (AUC) over the 10 runs.
Evaluation of CGP performance
where t was the rank fraction at threshold τ, ppv was the overall precision, pppv n (t) was the partial precision at rank fraction t, and η(t) was at its maximum at t*. Both AUC and measure the overall performance of a CGP task. The rank fraction t* indicates the point above which correct genes are likely to be found η max -times more likely than compared to a random gene list. Evaluation with η max is useful to identify cases where a small proportion of genes is ranked highly but the overall performance is poor.
Effect of number of genome examples on CGP performance
Two simulation experiments were performed to investigate the effect of the number of genome examples on statistical CGP performance (Case study 1, see below). For the first simulation, the amss scoring function was repeatedly applied on randomly selected subsets of 417 positive and negative genome examples, using genes from set M (Case study 1). The number of positive genome examples (N p ) and negative genome examples (N n ) were gradually increased in each subset. For each combination of N p and N n , 25 runs were performed and the median AUC was obtained. A second simulation was performed to determine the variability of performance. Here the proportion of positive and negative genome examples was kept the same (400:17) and the median and the range of AUC were then obtained over 1000 runs for each N p and N n .
A similar simulation was also performed to determine the effect of genome example sizes on inductive CGP performance. Twenty five subsets of N genomes (from 417 genomes) were randomly selected as features with N increased from 1 to 417. For each N, stratified 10-fold cross-validations were performed with SVM/Poly using all genes from SA-2603 genome as candidates. Median AUCs from 25 random subset of N genomes were obtained.
The case studies
Case study 1: Identification of genes involved in bacterial cell wall synthesis
Well-characterised genes responsible for peptidoglycan biosynthesis and metabolism in bacteria were used for testing and were grouped into three nested validation sets [see Additional file 2]. The C (core) validation set consisted of genes responsible for the synthesis of N-acetylmuramate-pentapeptide from UDP-N-acetylglucosamine (mur A to mur G, and mra Y). The B (biosynthesis) validation set, extended the C set with genes involved in precursor pathways including N-acetyl-D-glucosamine, meso-diaminopelamate and D-alanyl-D-alanine, as well as genes responsible for undecaprenyl phosphate biosynthesis and recycling. The M (metabolism) validation set further extended the B set by including genes responsible for the modification, recycling, and cross-linking of the peptidoglycan such as penicillin-binding proteins and N-acetylmuramoyl-L-alanine amidases [see Additional file 3]. Genome examples were selected from the NCBI bacterial genomes catalogue file  and manually verified by one of the authors (RL). Genes in the validation sets were identified using KEGG  and EcoCyc . Genomes of one Gram positive bacterium (S. agalactiae 2603 V/R, SA-2603, 2124 genes, GenBank ID: AE009948) and one Gram negative bacterium (E. coli K-12, EC-K12, 4134 genes, GenBank ID: U00096) were selected for prioritisation.
For statistical CGP, 400 genomes of bacteria known to produce peptidoglycan were selected as positive examples. Genomes of 17 bacterial species lacking cell wall, including Mycoplasma spp., Ureaplasma spp., Anaplasma spp., and Phytoplasma spp., were selected as negative examples [see Additional file 4]. For inductive CGP, the occurrence of the candidate genes in the same 417 genomes was used as features for machine learning training. Candidate genes were labelled according to whether they belong to C, B, and M validation sets. To compare the effectiveness of statistical CGP, the relative positions of the peptidoglycan genes were compared with an unrelated metabolic pathway (glycolysis genes) acting as the control validation set [see Additional file 5].
Case study 2: Anaerobic mixed acid fermentation genes
Enzymes responsible for anaerobic respiration and fermentation were identified from pathway databases [20, 23] and literature searches [24, 25]. Statistical and inductive CGP methods were used to derive the occurrence matrix and to rank candidate genes for anaerobic mixed-acid fermentation in EC-K12. All genes in EC-K12 were used as candidates for prioritisation. For statistical CGP, 200 bacterial genomes of known obligatory and facultative anaerobes capable of performing anaerobic metabolism were selected as positive genome examples, and 142 genomes of obligatory aerobes that do not perform anaerobic respiration were applied as negative examples [see Additional file 6]. Methods for genome example selection were identical to Case study 1. For inductive CGP, the occurrence patterns of 4134 candidate genes in 342 genomes were obtained by the methods described above.
Case study 3: KEGG Pathways
To evaluate the generalisability of inductive CGP, a large-scale rediscovery experiment based on the curated KEGG metabolic pathways was performed . Thirty-one metabolic pathways with at least 10 genes involved in each pathway were selected for evaluation from the 81 known pathways available for the SA-2603 genome in KEGG. All seven inductive CGP algorithms were tested, and the generalisation performance of the algorithms evaluated by stratified 10-fold cross-validation.
Case study 1: Peptidoglycan-related genes
CGP performance on peptidoglycan-related genes (Streptococcus agalactiae 2603 V/R, 2124 genes).
C (9 genes)
B (18 genes)
M (25 genes)
Statistical CGP (scoring functions)
Inductive CGP (machine learning algorithms)
CGP performance on peptidoglycan-related genes (Escherichia coli K-12, 4131 genes).
C (8 genes)
B (28 genes)
M (51 genes)
Statistical CGP (scoring functions)
Inductive CGP (machine learning algorithms)
The performance of statistical CGP was also measured by folds-increase in precision (probability enrichments) compared to the non-prioritised rank. With the chisq scoring function, the ranked gene list achieved an average enrichment of 3.65 folds (maximum 28.3 folds) for SA-2603 and 3.16 folds for EC-K12 (maximum 27 folds). The probability enrichments of other validation sets are listed in Tables 1 and 2. High AUC values were obtained from the stratified cross-validations of inductive CGP experiments. In particular, SVM achieved near-perfect AUCs in both SA-2603 C and B validation sets, whereas ADTree had the best AUC of 0.975 in M set genes (the trained ADTree model is shown in Additional file 8). Similarly, the best AUC was achieved by SVM/RBF in the EC-K12 M validation set (0.964). The best AUCs of 0.998 and 0.981 were also achieved in the rediscovery of C and B set genes (by IBk and ADTree respectively).
Case study 2: Anaerobic mixed-acid fermentation genes
CGP performance on anaerobic mixed-acid fermentation genes (Escherichia coli K-12, 4131 genes). Prioritisation (AUC) of anaerobic mixed-acid fermentation genes in Escherichia coli K-12
Case study 3: Inductive prioritisation of KEGG pathway genes
Successful prioritisation of bacterial genes by occurrence-based CGP methods
In this paper, we applied two approaches (statistical and inductive CGP) to prioritise candidate genes for functional discovery, based on the occurrence patterns of candidate genes in a selected set of bacterial genomes (phylogenetic profiles). Our findings demonstrate that both CGP methods can rediscover genes with high accuracy in two selected genomes of E. coli K-12 and S. agalactiae 2603 (Figure 3).
Interestingly, these methods seem relatively insensitive to the number of genome examples. In the peptidoglycan example with statistical CGP (case study 1), we were able to identify peptidoglycan genes with high accuracy, despite only a limited number of sequenced genomes among negative examples. For inductive CGP, increasing profile dimension beyond 200 genome examples apparently resulted in a decrease in subsequent median AUCs, implying that a proportion of genome examples was less informative and might not have contributed to the identification of genes of interest. This finding coincides with the observation by Johti et al., where increasing the phylogenetic profile dimension with redundant genomes did not necessarily improve the accuracy in eukaryotic gene function prediction .
In statistical CGP, we found that the scoring functions measuring gene occurrence in both positive and negative genome example groups (amss, hmss, chisq, and bchisq) consistently outperform the scoring functions that measure only positive (sens and ppv), negative (spec and npv), or partial (F-measure) frequencies of the groups. This finding highlights the importance of including both positive and negative examples in comparative genomic studies. In inductive CGP, the rediscovery experiments favoured ADTree, IBk, and SVM s when compared with other algorithms. In case study 1, the performance of the best inductive and statistical CGP methods are comparable, suggesting that both approaches are capable of producing robust results.
Statistical CGP rediscovers genes specific to the function of interest
Our results demonstrated that statistical CGP can discover genes specific to a particular function or pathway. For example, by comparing the cell-walled bacteria with cell wall-less Mycoplasma spp. (case study 1), it was expected that the peptidoglycan genes, which are specific for the phenotypic trait, were among the genes lost in this evolutionary lineage  (See Additional file 12). Such genes were ranked very highly and yielded favourable aggregated performance (with AUC >0.95). In contrast, results from our anaerobic fermentation experiment (case study 2) suggested that genes specific to anaerobic respiration were placed very highly on the rank (η max > 95-fold), whereas genes sharing with the obligatory aerobic bacteria (negative examples) were ranked much lower. As these shared genes are present in both phenotypic groups, finding these "shared" genes by applying only statistical CGP is a challenging task. Alternative methods are needed to aid in the discovery of such non-specific genes.
Evolutionary pressure may contribute to specific gene occurrence patterns
Both occurrence-based CGP methods performed well, suggesting the genes encoding for a complex phenotype are frequently co-present and co-absent across the genomes, forming specific occurrence patterns, thus allowing the functional predictions. This co-occurrence phenomenon may reflect the process of natural selection and the adaptation of microorgranisms into different evolutionary niches.
For bacteria undergoing positive selection, the acquisition of a particular gene group may result in phenotypes conferring survival advantage for the microorganism to adapt to a new environment. It has been known that genes contributing to symbiosis or pathogenesis are frequently organised into genomic islands, in which gene mobility is facilitated by horizontal gene transfer, conferring the ability to form a new relationship with the host . The good AUC achieved by inductive CGP in KEGG pathways (case study 3) suggests specific functional co-occurrence patterns of genes do exist, regardless of the physical proximity of the genes or the presence of a mobile genetic structure.
Similarly, negative selection can also contribute to the co-absence of functional gene units across multiple genomes. For a complex phenotype encoded by multiple genes, the deletion of a critical gene could result in the non-expression of phenotype, leading to the subsequent loss of other non-functional genes over time. Thus, the differential co-occurrence patterns in genes can be exploited for comparative genomics studies, as demonstrated by our methods, in assisting our understanding of gene functions.
Factors affecting CGP performance
Prioritising candidate genes with reliance on gene features from literature, ontology, or annotation as background knowledge may introduce literature or annotation biases [2, 4, 9]. While we minimised such biases, several sources of sampling bias could have limited the performance of our gene prioritisation methods. With statistical CGP, accuracy can be affected by ad hoc selection of examples, especially by the choice of positive and negative genome examples representing the variations in the study phenotype. With inductive CGP, performance may be impeded by incorporating inconsistent (or wrong) genes in the training set. Increasing the heterogeneity of the training genes may adversely influence prioritisation performance. An example can be found in a slight decrease of performance (in median AUC) from C to M validation sets in the peptidoglycan experiments in EC-K12 candidates, as shown in Figure 1.
Using KEGG as a validation data source
There were considerable variations in inductive CGP performances across different KEGG categories (Case study 3). By manually inspecting the worst-performing functional category (phenylalanine, tyrosine, and tryptophan biosynthesis), we found the phenylalanine and tyrosine tRNA synthases genes were also included in the validation set. The tRNA synthases have roles downstream of the biosynthesis pathways and thus are not involved in the anabolism of these essential amino acids. Removal of the unrelated genes improved overall performance (with best AUC of 0.852 achieved by SVM/Poly, see Additional file 13). This contrasts with the best-performing pathways (for example, fatty acid biosynthesis and peptidoglycan synthesis pathways) where only function-specific genes were included in the validation set. Since KEGG is a commonly-used resource for benchmarking computational methods of functional discovery [15, 16, 26], our finding suggests that careful selection must be practised in constructing validation sets, as mixture of distinct functional groups could lead to inconsistent results. This specific sampling bias needs to be considered when explaining variations in the predicting of gene functions by in silico methods.
The inclusion of paralogs in the occurrence matrix
Reciprocal best BLAST matches are frequently used in the search for orthologous genes. In our experiments, we applied non-reciprocal BLAST E-value < 10-5 as the criterion for determining the sequence similarity between genes. While our results supported its use in functional discovery at the gene level, the use of such criterion may include many paralogs and may affect prioritisation performance of large gene families with diverse functions. Detecting and excluding paralogs may be required to refine the gene ranking and warrant further studies.
We developed a statistical and an inductive computational gene prioritisation methods, based on the concept of gene occurrence across a range of genomes, to improve the search efficiency in the functional discovery of bacterial genes. We designed a range of rediscovery experiments for benchmarking different CGP approaches. Promising results were yielded from the testing on the rediscovery of peptidoglycan-related genes, mixed-acid fermentation genes, and a diverse range of bacterial metabolic pathways. These CGP methods could be generalised to other functional discovery tasks when a pair of positive and negative datasets are available (statistical CGP) or when a subset of genes with known functions can be used for training machine learning models (inductive CGP). With more genome sequences become available, we anticipate the demand of such methods will grow as many different scenarios can be formulated and analysed. In summary, occurrence-based gene prioritisation method offers a simple yet effective framework for ranking candidate genes for functional discovery in prokaryotes. In addition, our experimental framework should provide a standardised benchmark for evaluating future CGP methods and algorithms when prioritising bacterial candidate genes.
Authors are grateful to Mike Bain and Guy Tsafnat for their critical comments on the design of CGP experiments. We thank Fanrong Kong for his review and comments on peptidoglycan-related genes. This research is funded by Australian National Health and Medical Research Council (NH&MRC) project grant No. 358351.
- Raskin DM, Seshadri R, Pukatzki SU, Mekalanos JJ: Bacterial genomics and pathogen evolution. Cell 2006, 124(4):703–714. 10.1016/j.cell.2006.02.002View ArticlePubMedGoogle Scholar
- Gaulton KJ, Mohlke KL, Vision TJ: A computational system to select candidate genes for complex human traits. Bioinformatics 2007, 23(9):1132–1140. 10.1093/bioinformatics/btm001View ArticlePubMedGoogle Scholar
- Perez-Iratxeta C, Bork P, Andrade MA: Association of genes to genetically inherited diseases using data mining. Nat Genet 2002, 31(3):316–319.PubMedGoogle Scholar
- Aerts S, Lambrechts D, Maity S, Van Loo P, Coessens B, De Smet F, Tranchevent LC, De Moor B, Marynen P, Hassan B, Carmeliet P, Moreau Y: Gene prioritization through genomic data fusion. Nat Biotechnol 2006, 24(5):537–544. 10.1038/nbt1203View ArticlePubMedGoogle Scholar
- Tiffin N, Kelso JF, Powell AR, Pan H, Bajic VB, Hide WA: Integration of text- and data-mining using ontologies successfully selects disease gene candidates. Nucleic Acids Res 2005, 33(5):1544–1552. 10.1093/nar/gki296PubMed CentralView ArticlePubMedGoogle Scholar
- Kent WJ, Hsu F, Karolchik D, Kuhn RM, Clawson H, Trumbower H, Haussler D: Exploring relationships and mining data with the UCSC Gene Sorter. Genome Res 2005, 15(5):737–741. 10.1101/gr.3694705PubMed CentralView ArticlePubMedGoogle Scholar
- Turner FS, Clutterbuck DR, Semple CAM: POCUS: mining genomic sequence annotation to predict disease genes. Genome Biol 2003, 4(11):R75. 10.1186/gb-2003-4-11-r75PubMed CentralView ArticlePubMedGoogle Scholar
- Adie EA, Adams RR, Evans KL, Porteous DJ, Pickard BS: Speeding disease gene discovery by sequence based candidate prioritization. BMC Bioinformatics 2005, 6: 55. 10.1186/1471-2105-6-55PubMed CentralView ArticlePubMedGoogle Scholar
- Adie EA, Adams RR, Evans KL, Porteous DJ, Pickard BS: SUSPECTS: enabling fast and effective prioritization of positional candidates. Bioinformatics 2006, 22(6):773–774. 10.1093/bioinformatics/btk031View ArticlePubMedGoogle Scholar
- Freudenberg J, Propping P: A similarity-based method for genome-wide prediction of disease-relevant human genes. Bioinformatics 2002, 18(Suppl 2):S110–115.View ArticlePubMedGoogle Scholar
- López-Bigas N, Ouzounis CA: Genome-wide identification of genes likely to be involved in human genetic disease. Nucleic Acids Res 2004, 32(10):3108–3114. 10.1093/nar/gkh605PubMed CentralView ArticlePubMedGoogle Scholar
- Médigue C, Moszer I: Annotation, comparison and databases for hundreds of bacterial genomes. Res Microbiol 2007, 158(10):724–736. 10.1016/j.resmic.2007.09.009View ArticlePubMedGoogle Scholar
- Galperin MY, Koonin EV: Who's your neighbor? New computational approaches for functional genomics. Nat Biotechnol 2000, 18(6):609–13. 10.1038/76443View ArticlePubMedGoogle Scholar
- Pellegrini M, Marcotte EM, Thompson MJ, Eisenberg D, Yeates TO: Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc Natl Acad Sci USA 1999, 96(8):4285–8. 10.1073/pnas.96.8.4285PubMed CentralView ArticlePubMedGoogle Scholar
- Zheng Y, Anton BP, Roberts RJ, Kasif S: Phylogenetic detection of conserved gene clusters in microbial genomes. BMC Bioinformatics 2005, 6: 243. 10.1186/1471-2105-6-243PubMed CentralView ArticlePubMedGoogle Scholar
- Yamanishi Y, Vert JP, Kanehisa M: Protein network inference from multiple genomic data: a supervised approach. Bioinformatics 2004, 20(Suppl 1):i363–70. 10.1093/bioinformatics/bth910View ArticlePubMedGoogle Scholar
- Vert JP: A tree kernel to analyse phylogenetic profiles. Bioinformatics 2002, 18(Suppl 1):S276–84.View ArticlePubMedGoogle Scholar
- Marcotte EM, Xenarios I, Bliek AM, Eisenberg D: Localizing proteins in the cell from their phylogenetic profiles. Proc Natl Acad Sci USA 2000, 97(22):12115–20. 10.1073/pnas.220399497PubMed CentralView ArticlePubMedGoogle Scholar
- Wu J, Hu Z, DeLisi C: Gene annotation and network inference by phylogenetic profiling. BMC Bioinformatics 2006, 7: 80. 10.1186/1471-2105-7-80PubMed CentralView ArticlePubMedGoogle Scholar
- Kanehisa M, Goto S: KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res 2000, 28: 27–30. 10.1093/nar/28.1.27PubMed CentralView ArticlePubMedGoogle Scholar
- Witten IH, Frank E: Data Mining: Practical machine learning tools and techniques. 2nd edition. San Francisco: Morgan Kaufmann; 2005.Google Scholar
- NCBI bacterial genomes catalogue file[ftp://ftp.ncbi.nih.gov/genomes/Bacteria/]
- Karp PD, Keseler IM, Shearer A, Latendresse M, Krummenacker M, Paley SM, Paulsen I, Collado-Vides J, Gama-Castro S, Peralta-Gil M, Santos-Zavaleta A, Peñaloza-Spínola MI, Bonavides-Martinez C, Ingraham J: Multidimensional annotation of the Escherichia coli K-12 genome. Nucleic Acids Res 2007, 35(22):7577–7590. 10.1093/nar/gkm740PubMed CentralView ArticlePubMedGoogle Scholar
- Lin EC, Iuchi S: Regulation of gene expression in fermentative and respiratory systems in Escherichia coli and related bacteria. Annu Rev Genet 1991, 25: 361–387. 10.1146/annurev.ge.25.120191.002045View ArticlePubMedGoogle Scholar
- Michal G: Biochemical pathways: An Atlas of Biochemistry and Molecular Biology. Hoboken, NJ; Wiley-Spektrum; 1999.Google Scholar
- Jothi R, Przytycka TM, Aravind L: Discovering functional linkages and uncharacterized cellular pathways using phylogenetic profile comparisons: a comprehensive assessment. BMC Bioinformatics 2007, 8: 173. 10.1186/1471-2105-8-173PubMed CentralView ArticlePubMedGoogle Scholar
- Fraser CM, Gocayne JD, White O, Adams MD, Clayton RA, Fleischmann RD, Bult CJ, Kerlavage AR, Sutton G, Kelley JM, Fritchman RD, Weidman JF, Small KV, Sandusky M, Fuhrmann J, Nguyen D, Utterback TR, Saudek DM, Phillips CA, Merrick JM, Tomb JF, Dougherty BA, Bott KF, Hu PC, Lucier TS, Peterson SN, Smith HO, Hutchison CA, Venter JC: The minimal gene complement of Mycoplasma genitalium. Science 1995, 270(5235):397–403. 10.1126/science.270.5235.397View ArticlePubMedGoogle Scholar
- Gogarten JP, Townsend JP: Horizontal gene transfer, genome innovation and evolution. Nat Rev Microbiol 2005, 3(9):679–687. 10.1038/nrmicro1204View ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.