Gene function prediction based on genomic context clustering and discriminative learning: an application to bacteriophages



Existing methods for whole-genome comparisons require prior knowledge of related species and provide little automation in the function prediction process. Bacteriophage genomes are an example that cannot be easily analyzed by these methods. This work addresses these shortcomings and aims to provide an automated prediction system of gene function.


We have developed a novel system called SynFPS to perform gene function prediction over completed genomes. The prediction system is initialized by clustering a large collection of weakly related genomes into groups based on their resemblance in gene distribution. From each individual group, data are then extracted and used to train a Support Vector Machine that makes gene function predictions. Experiments were conducted with 9 different gene functions over 296 bacteriophage genomes. Cross validation results gave an average prediction accuracy of ~80%, which is comparable to other genomic-context based prediction methods. Functional predictions are also made on 3 uncharacterized genes and 12 genes that cannot be identified by sequence alignment. The software is publicly available at


The proposed system employs genomic context to predict gene function and detect gene correspondence in whole-genome comparisons. Although our experimental focus is on bacteriophages, the method may be extended to other microbial genomes as they share a number of similar characteristics with phage genomes such as gene order conservation.


The increasing number of completely sequenced genomes has enabled gene function predictions by means of whole genome comparison. Existing methods such as SynBrowse [1], Vista [2], LAGAN [3], PipMaker [4] and Ensembl SyntenyView [5] provide visualization of conserved regions between two or more genome sequences for comparative analysis. Such visualization facilitates the prediction of gene function based on comparison of genomic context information such as co-occurrence of genes [6, 7] and conservation of gene order [8, 9].

However, these methods have two major limitations. First, they rely on sequence alignment to identify corresponding genes or regions between genomes [1–5, 10–12]. Consequently, they cannot automatically detect homologous or functionally similar genes that share no sequence similarity, so those genes must be predicted manually. Second, these methods require the genomes being compared to be closely related. This hinders the automatic analysis of a large collection of weakly related genomes and makes it impossible to inspect a genome for which no related species have been identified.

Bacteriophage genomes are one example that suffers from the above limitations. First, methods based on sequence alignment are not fully reliable in detecting functionally similar genes within phages, because homologous phage genes have often diverged beyond the point at which sequence similarity can be recognized [13–15]. A key argument to explain such divergence is that the genes have a very distant common ancestry [15]. Second, restricting the comparison to a few related phages and ignoring the remainder can hinder the genomic analysis of the target phage. The reason is that global phage relationships are not clearly defined phylogenetically, owing to extensive horizontal gene transfer (HGT) [14, 16], so relatedness between phages often cannot be established. Consequently, it is desirable to have an objective measure that automatically identifies closely related genomes from the genetic data, rather than depending on the user to define a set of "related species".

This work addresses the shortcomings of the existing methods and aims to provide a highly automated gene function prediction system based on whole-genome comparison. The system, named SynFPS, contains two automated learning units with distinct roles: a clustering technique that utilizes gene-to-gene distances to identify closely related genomes and a Support Vector Machine (SVM) for discriminative classification on gene functions. The algorithm of SynFPS and the results of function prediction on phage genes will be presented in the remainder of this paper.

Results and discussion

Evaluation of prediction results by leave-one-out cross validation

We have attempted to perform predictions over nine common phage genes using SynFPS. These are major head, major tail, tape measure, prohead protease, integrase, terminase, portal, holin and lysin genes. They were selected on the basis of regular existence – they encode necessary functions not provided by their hosts, including structural and assembly genes, as well as lysis genes [16]. These genes were searched against the annotation database using regular expression patterns defined in Table 1. Manual modifications of the search results have been conducted to remove ambiguous entries.

Table 1 Regular expression patterns used for the nine selected genes.

Table 2 indicates the number of genes that can be detected if sequence alignment (BLAST) alone is used. The K-Means clustering result based on these genes can be found in the Supplementary Material (see Additional file 1).

Table 2 Percentage of genes detected using sequence alignment.

We perform leave-one-out (LOO) cross validation to evaluate the prediction performances for these genes. For each gene function, we run the cross validation in each cluster individually over a discrete range of values of the kernel parameter – σ for Gaussian RBF kernel [17]. The σ value that gives the best accuracy is chosen and is used for all future predictions for that function. The prediction accuracies shown in Table 3 are the averages of cross validation results across all the clusters.
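The parameter selection described above can be sketched as follows. This is only an illustration under stated assumptions: a simple Gaussian-kernel scoring rule stands in for the SVM, the kernel is assumed to take the common form exp(-||x - y||^2 / (2σ^2)), single examples rather than whole genomes are left out, and the names `rbf`, `kernel_score`, `loo_accuracy` and `select_sigma` are hypothetical.

```python
import math

def rbf(x, y, sigma):
    """Gaussian RBF kernel (assumed form: exp(-||x - y||^2 / (2 * sigma^2)))."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2 * sigma ** 2))

def kernel_score(x, train, sigma):
    """Stand-in for an SVM decision value: mean kernel similarity to the
    positive examples minus mean similarity to the negative examples."""
    pos = [rbf(x, v, sigma) for v, label in train if label == 1]
    neg = [rbf(x, v, sigma) for v, label in train if label == 0]
    return sum(pos) / len(pos) - sum(neg) / len(neg)

def loo_accuracy(data, sigma):
    """Leave-one-out accuracy: hold out each example once, train on the rest."""
    correct = 0
    for i, (x, label) in enumerate(data):
        train = data[:i] + data[i + 1:]
        predicted = 1 if kernel_score(x, train, sigma) > 0 else 0
        correct += predicted == label
    return correct / len(data)

def select_sigma(data, grid):
    """Pick the sigma from a discrete grid with the best LOO accuracy."""
    return max(grid, key=lambda s: loo_accuracy(data, s))

# Toy one-dimensional feature vectors (invented for illustration).
data = [([0.0], 1), ([0.2], 1), ([0.1], 1), ([5.0], 0), ([5.2], 0), ([5.1], 0)]
best_sigma = select_sigma(data, [0.001, 1.0])
```

With a very small σ every kernel value underflows to zero and the stand-in classifier cannot separate the classes, which is why the grid search prefers the larger value in this toy case.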

Table 3 Prediction settings and results for the nine gene functions.

K-fold cross validation may also be used to evaluate the prediction performances and it is expected that accuracies are lower with a smaller K value. For instance, the prediction accuracy for Terminase is 79.8% for K = 4 and 62.3% for K = 2. However, LOO is more suited to our overall purpose – one primary objective of the cross validation is to find out the near optimal σ value for the gene class to perform future predictions. Since most clusters contain only a very small portion of genomes that require genuine prediction, they are best simulated by LOO, where only one genome is taken out for prediction testing at a time.

The prediction accuracies average ~80%. The 100% prediction accuracy of lysin can be explained by the strong context relationship between lysin and holin. Since a lysin is always accompanied by a holin immediately beside it [18], SynFPS can easily identify the lysin gene if it already knows the position of the holin. However, the converse is not true: the identification of holin genes may not depend upon the presence of lysin. Consequently, the prediction accuracy for holin is not as high.

These prediction accuracies reflect the sensitivity of the system (true positives/(true positives + false negatives)). The specificity of the system (true negatives/(true negatives + false positives)), on the other hand, is always larger than the sensitivity because of two system features. First, we allow only a single positive prediction for each genome (see Methods). Thus, the number of false negatives is always the same as the number of false positives, implying that the specificities always scale together with the sensitivities. Second, the number of negative training data (hence true negatives) is always larger than the number of positive training data (hence true positives), and consequently Specificity > Sensitivity. One reason for using LOO cross validation accuracies to evaluate the system is the lack of a benchmark for our problem. However, it may be noteworthy that other genomic-context based methods for the prediction of functional elements have reported similar accuracies, ranging from 72% to 80% [6].
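The relationship between the two measures can be checked with a small worked example. The numbers below (100 genomes, 10 candidate genes per genome, 80 correct predictions) are hypothetical and chosen only to illustrate the argument; the function name is likewise an assumption.

```python
def sensitivity_specificity(n_genomes, candidates_per_genome, n_correct):
    """Confusion counts when exactly one positive prediction is forced per
    genome and each genome contains exactly one true gene of the class."""
    tp = n_correct
    fn = n_genomes - n_correct      # true genes that were missed
    fp = fn                         # each miss places the single positive on a wrong gene
    negatives = n_genomes * (candidates_per_genome - 1)
    tn = negatives - fp
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical setting: 100 genomes, 10 candidates each, 80 correct calls.
sens, spec = sensitivity_specificity(100, 10, 80)
```

Because the false negatives and false positives are equal and the negatives outnumber the positives, the specificity (880/900) exceeds the sensitivity (80/100) in this example.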

Trade-off between prediction coverage and prediction accuracy

We have examined the effect of the K-Means adaptive threshold t on the prediction accuracies. The value of t ∈ (0,1] implicitly specifies the maximum tolerable distance between any two genomes within a cluster. As a result, as t → 0, there are as many clusters as genomes, and as t → 1, there is only one cluster. Neither case provides useful information for prediction. Since there is no analytical method for finding a good value of t, we ran SynFPS over a range of values from t = 0.05 to t = 0.3. Values outside this range generate either too many or too few clusters (average number of genomes per cluster < 2 or number of clusters < 3, respectively). Different t values lead to different numbers of genomes being covered by the automated prediction (the prediction coverage). Genomes within the "coverage" are those for which SynFPS has made a classification decision; the remaining genomes are discarded or ignored by SynFPS. Examples of genomes not in coverage are:

  • genomes not containing the gene being predicted (discarded during cross validation only)

  • genomes that are in a cluster on their own

  • within a cluster, if there are fewer than two genomes that contain the gene being predicted, then all the genomes are discarded

  • genomes with genomic context different to the consensus of the group may be discarded
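The first three exclusion rules listed above can be sketched as a small filter. This is a minimal illustration, not the SynFPS code: the function name `coverage`, the input dictionaries and the phage identifiers are hypothetical, and the fourth rule (context differing from the group consensus) is omitted because the text does not specify how it is decided.

```python
from collections import defaultdict

def coverage(cluster_of, has_gene):
    """Return (fraction of genomes covered, sorted list of covered genomes).

    cluster_of: {genome_id: cluster_id}
    has_gene:   {genome_id: True if the gene being predicted was detected}
    """
    clusters = defaultdict(list)
    for genome, cluster in cluster_of.items():
        clusters[cluster].append(genome)

    covered = []
    for members in clusters.values():
        carriers = [g for g in members if has_gene[g]]
        # Skip singleton clusters and clusters with fewer than two carriers.
        if len(members) < 2 or len(carriers) < 2:
            continue
        # During cross validation, genomes lacking the gene are discarded.
        covered.extend(carriers)
    return len(covered) / len(cluster_of), sorted(covered)

cluster_of = {"p1": 0, "p2": 0, "p3": 0, "p4": 1, "p5": 2, "p6": 2}
has_gene = {"p1": True, "p2": True, "p3": False, "p4": True, "p5": True, "p6": False}
fraction, covered = coverage(cluster_of, has_gene)
```

In this toy input only the first cluster contributes: the second is a singleton and the third has a single carrier of the gene.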

Figure 3 shows the plot of prediction accuracies versus prediction coverage. The highest coverage values for all gene functions are about 20–25%, achieved by using a t value of ~0.1. The results indicate that a higher accuracy can be obtained by lowering the coverage. However, the ultimate purpose of the system is to make genuine predictions over the genomes in which the genes being predicted have not yet been identified. Lowering the coverage can cause many of these genomes to be ignored. Consequently, one must find a balance between accuracy and coverage according to the intended task.

Functions predicted for 3 uncharacterised genes and 12 sequence-dissimilar genes

Using the maximum coverage and the σ values optimized by LOO cross validation, we have generated predictions over genomes within which certain gene functions were not already detected. The outcome of SynFPS is to identify which genes within those genomes correspond to the functions of our interest. The prediction outcomes are listed in Table 4.

Table 4 Gene function prediction results for bacteriophage genomes.

Three genes for which we have predicted functions have no existing functional annotation in the database (marked uncharacterised in Table 4). Seven genes in Table 4 exhibit sequence similarity to their reference genes, suggesting that their predicted functions are supported by both sequence similarity and the genomic context information embedded in our system, such as gene order conservation and positional coupling. For the other genes, which show no sequence similarity (12 in total in Table 4), the predicted functions are supported only by the genomic context. It is noteworthy that sequence alignment based methods would have failed to find correspondences for these genes. Other prediction results complement existing annotations in the database where these exist, and therefore support the validity of our approach.


We presented a novel genomic-context based method capable of predicting gene functions from a large collection of genomes. An adaptive K-Means clustering is used to distinguish groups of related genomes based on the conservation of gene order and the conservation of gene-to-gene distances. The clustering results serve as a platform from which the SVM extracts training data to perform classification-based predictions. Nine common gene functions of bacteriophages were tested and the LOO cross-validated prediction accuracies average ~80%. Functional predictions are also made on 3 uncharacterized genes and 12 genes that cannot be identified by sequence alignment.

Although our experimental focus is on bacteriophages, the method may be extended to other microbial genomes. For example, bacterial genomes have been observed with conserved gene order [8, 19, 20] and conserved gene-to-gene distances (positional coupling) [21, 22]. These properties satisfy the underlying assumptions of our approach and suggest potential application of the method.


Strategy overview – SynFPS

We present a novel method called Synteny-based Function Prediction System (SynFPS) capable of predicting gene functions among completed genomes based on the conservation of gene order (synteny) and the conservation of gene-to-gene distance. An overview of SynFPS is shown in Figure 1. The genome annotation database as shown in the figure defines the scope of analysis for the system. In our work, it consists of 296 phage genomes retrieved from GenBank (see Additional file 1).

Figure 1

Structure of the Synteny-based Function Prediction System (SynFPS). The dotted line represents the system boundary, outside of which lie the system inputs and outputs. A set of gene functions (A), specified in the form of regular expressions, is matched against the genome database (B) via the text processing unit (D), and the result may then be refined (C). A clustering system (E) based on the synteny scores of the matching genes brings together genomes that show conservation of gene order and position (G). This information is used to generate a set of positive and negative data (genes) to train the classification system (F), which produces the function prediction results (H).

SynFPS runs on Windows and is publicly available. It was developed in C# and requires the free Microsoft .NET Framework 2.0 to run. Bioperl 1.4 [23] is needed for data retrieval from public databases. Workstations with a single CPU of ~3.0 GHz and 1 GB of RAM are sufficient for reasonable performance over a collection of ~300 phages.

Identification of functionally similar genes using regular expression

The system begins by identifying in the database a collection of genes that correspond to a set of user-specified gene functions. Instead of using sequence similarity as in many other methods [1, 2, 4, 5, 12], SynFPS identifies functionally similar genes using regular expressions [24]. For example, to search for genes that encode the major head proteins of phages, one possible regular expression pattern is "(?<!minor)\b(head|capsid) protein". With this pattern, we are including genes that have been annotated with "head protein" or "capsid protein" except those with the prefix term "minor". The use of regular expression is aimed at tackling annotation discrepancies among coding sequences in databases that do not have vocabulary control. The regular expression syntax used in SynFPS follows the syntax defined for the .NET Framework [25].
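As an illustrative sketch (not the SynFPS implementation, which uses the .NET regular expression engine), the annotation search can be mimicked with Python's `re` module. The annotation strings below are invented. One caveat: read literally, the lookbehind `(?<!minor)` in the pattern above is tested immediately before the word "head", where the preceding characters are "inor " (including the space), so the literal pattern still matches "minor head protein". The sketch therefore also shows a variant, assumed here, with the space included in the lookbehind, and it assumes case-insensitive matching.

```python
import re

# Pattern exactly as quoted in the text.
LITERAL = re.compile(r"(?<!minor)\b(head|capsid) protein", re.IGNORECASE)
# Assumed variant: the lookbehind includes the space that follows "minor".
VARIANT = re.compile(r"(?<!minor )\b(head|capsid) protein", re.IGNORECASE)

# Invented annotation strings, standing in for "product"/"function"/"note" fields.
annotations = [
    "major head protein",
    "minor head protein",
    "Capsid protein",
    "putative minor capsid protein",
    "tail fiber protein",
]

def matching(pattern, texts):
    """Return the annotation strings in which the pattern occurs."""
    return [t for t in texts if pattern.search(t)]

literal_hits = matching(LITERAL, annotations)
variant_hits = matching(VARIANT, annotations)
```

Under these assumptions the variant keeps "major head protein" and "Capsid protein" and drops both "minor" entries, whereas the literal pattern retains them.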

Once a regular expression pattern is given, the system searches the annotation data of all the genomes that have been supplied to the program. By default, it identifies the coding sequence (CDS) regions in each genome and then tries to match the pattern against their annotated features such as "product", "function" and "note". The set of annotated features that is searched can be customised by the user. The search results can be displayed visually, with the genomes and matching genes illustrated. The display is interactive: annotations can be viewed, and search results can be modified by manually adding and removing genes.

Although genome annotation processes are often assisted by sequence alignment, many annotations are prepared manually by biologists who conducted research on the genomes. Therefore, the set of sequences found by annotation search could embrace functionally similar genes that show no sequence similarity. In the results section, we provide an assessment on sequence alignment in relation to regular expression search.

K-Means clustering to identify similar genomic context

The annotation search process leads to a mapping of genes across the genomes. This mapping provides the necessary information for a context based clustering. Let $G = \{g_1, g_2, \ldots, g_n\}$ be the set of all gene functions, where g is a symbol representing a function and n is the total number of functional classes identified. Let m be the number of genomes in the database. We define $X_k \subseteq G$, $k = 1, 2, \ldots, m$, to be the set of genes detected in genome k and $C_{kl} = C(X_k, X_l) = X_k \cap X_l$ to be the common set of genes between genomes k and l. The genomic-context distance between two genomes k and l is defined as:

$$D_{kl} = \frac{\sum_{g_i, g_j \in C_{kl};\; i<j} \left[ d_k(g_i, g_j) - d_l(g_i, g_j) \right]}{|C_{kl}|} + p \left( |X_k \cup X_l| - |C_{kl}| \right) \qquad (1)$$

where $d_k(g_i, g_j)$ = (location of $g_j$) - (location of $g_i$) in genome k, $|s|$ denotes the size of a set s, and p is a parameter that penalizes genomes not sharing the same set of genes. The summation term reflects the conservation of gene order as well as the conservation of gene-to-gene distances between the two genomes. The second term reflects gene co-occurrence.
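Equation (1) can be sketched in a few lines of Python. This is an illustration under stated assumptions: each genome is represented as a hypothetical dictionary mapping a gene function to its location, the index order i < j is taken to be the alphabetical order of the function names, and a pair of genomes with no common genes contributes only the penalty term. The bracketed difference is summed as written (signed); the optional `absolute` flag, which is not part of the text, sums magnitudes instead.

```python
from itertools import combinations

def context_distance(loc_k, loc_l, p, absolute=False):
    """Genomic-context distance between two genomes, following Equation (1).

    loc_k, loc_l: {gene_function: location} for genomes k and l
    p:            penalty weight for genes that are not shared
    absolute:     assumption, not from the text; sum |difference| if True
    """
    common = sorted(set(loc_k) & set(loc_l))          # C_kl
    union_size = len(set(loc_k) | set(loc_l))         # |X_k ∪ X_l|
    total = 0.0
    for gi, gj in combinations(common, 2):            # all pairs with i < j
        diff = (loc_k[gj] - loc_k[gi]) - (loc_l[gj] - loc_l[gi])
        total += abs(diff) if absolute else diff
    order_term = total / len(common) if common else 0.0
    return order_term + p * (union_size - len(common))

# Toy genomes (locations invented for illustration).
genome_k = {"A": 0, "B": 100, "C": 250}
genome_l = {"A": 10, "B": 110, "D": 500}
d_kl = context_distance(genome_k, genome_l, p=10)
```

In the toy pair, the A-to-B spacing is identical in both genomes, so the summation term vanishes and the distance comes entirely from the two unshared genes (C and D).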

We represent each genome k by a vector of distance values, $F_k = [D_{k1}, D_{k2}, \ldots, D_{km}]$, and then perform K-Means clustering over the set $S = \{F_k \mid k = 1, \ldots, m\}$. We implemented an adaptive technique in which the number of clusters grows incrementally until the size of the largest cluster is smaller than a specified threshold. The threshold $t \in (0,1]$ describes the fractional size of the Euclidean space spanned by S. Each resulting cluster contains genomes with high resemblance in gene distribution. Alternative adaptive clustering methods include dynamic self-organizing maps [26, 27].
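The adaptive scheme described above can be sketched as follows. The details are assumptions, since the text does not specify them: cluster "size" is taken to be the largest pairwise Euclidean distance within a cluster, the "space spanned by S" is taken to be the largest pairwise distance over all points, the stopping test uses "less than or equal" so that t = 1 yields a single cluster, and the K-Means step uses a deterministic farthest-point initialisation. All function names are hypothetical.

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def diameter(points):
    """Largest pairwise distance in a set of points (0.0 if empty)."""
    return max((dist(a, b) for a in points for b in points), default=0.0)

def kmeans(points, k, iters=50):
    """Plain K-Means with farthest-point initialisation (an assumption)."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: dist(p, centers[c])) for p in points]
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[c])   # keep an empty cluster's centre
        if new_centers == centers:
            break
        centers = new_centers
    return labels

def adaptive_kmeans(points, t):
    """Increase k until every cluster's diameter is at most t times the data span."""
    span = diameter(points)
    for k in range(1, len(points) + 1):
        labels = kmeans(points, k)
        clusters = [[p for p, lab in zip(points, labels) if lab == c] for c in range(k)]
        if max(diameter(c) for c in clusters) <= t * span:
            return labels
    return list(range(len(points)))

# Toy vectors standing in for the rows F_k of the distance matrix.
points = [[0, 0], [0, 1], [10, 10], [10, 11], [50, 50]]
labels = adaptive_kmeans(points, t=0.1)
```

A small t forces tight clusters and therefore more of them, while t = 1 accepts the single cluster containing every genome, mirroring the limiting behaviour described in the text.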

Support Vector Machines for function prediction

The clusters of genomes are analysed separately and individually in the last stage of the system. For each cluster, we use the information of the previously identified genes to predict the functions of other genes that exhibit similar context. This is achieved by extracting a set of genes from the cluster and converting them into positive and negative training data for a discriminative classification. Positive data are formed by the group of genes previously identified by the system during the regular expression match, plus any manually added genes, with each gene function representing one class. Negative data comprise the genes that are neighbours of the positive genes. The size of the neighbourhood is determined by the statistics of the gene locations in that particular cluster. We use a 99% confidence interval on the gene locations of each class to determine the range in which neighbour genes are to be included. This interval also determines the set of candidate genes on which function predictions are performed (see Figure 2). The discriminative classification is carried out by a Support Vector Machine (SVM) [28], which has been reported to give superior results in a variety of biological applications [29–31]. For each gene function, the SVM produces a binary result on each candidate gene indicating whether or not the gene belongs to that function class. Since the number of gene functions is specified by the user and is not likely to cover every possible function, only a subset of the candidate genes (those with positive results) will eventually be assigned predicted functions.
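One possible reading of the neighbourhood rule is sketched below. The text does not say how the 99% interval is computed, so this sketch assumes the gene locations of a class are roughly normally distributed within a cluster and takes the interval as mean ± z × standard deviation with z ≈ 2.576. The function names and all locations are hypothetical.

```python
from statistics import NormalDist, mean, stdev

def location_interval(locations, level=0.99):
    """Interval expected to contain `level` of the gene locations,
    assuming (as a simplification) that they are normally distributed."""
    z = NormalDist().inv_cdf(0.5 + level / 2)   # about 2.576 for 99%
    m, s = mean(locations), stdev(locations)
    return m - z * s, m + z * s

def candidates(genes, interval):
    """Names of the genes whose location falls inside the interval."""
    low, high = interval
    return [name for name, loc in genes if low <= loc <= high]

# Locations of one gene class across the genomes of a cluster (invented).
known_locations = [1000, 1100, 900, 1050, 950]
interval = location_interval(known_locations)

# Genes of a genome in which the class has not yet been identified (invented).
genome_genes = [("g1", 500), ("g2", 850), ("g3", 1200), ("g4", 1500)]
candidate_genes = candidates(genome_genes, interval)
```

Genes inside the interval would serve both as negative neighbours during training and as candidates during prediction; genes outside it are never considered for that class.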

Figure 2

An illustration of a cluster containing four genomes. Performing function prediction over gene class "A" consists of two steps: i) perform Leave-One-Out cross validation over the first three genomes and hence adapt to the optimal kernel parameters, ii) find A in the bottom genome within the confidence interval. Since the distances between A and B genes are the most conserved, class B will act as the reference genes for computing relative positions for class A genes for use as one of the training features.

Figure 3

A plot of cross-validated prediction accuracy versus prediction coverage of the genomes in the database (296). Prediction coverage indicates the percentage of genomes that have been included in the leave-one-out cross validations using SynFPS. The maximum coverage of each gene function is limited by the number of its occurrences detected in the database. The coverage is varied by using different adaptive thresholds for the K-Means clustering.

To enhance prediction accuracy, we force a unique positive prediction in every genome within a cluster. This is based on an assumption that all pairs of genomes within a cluster would have a one-to-one mapping of genes (gene correspondence). The decision values generated by SVM depict the relative positiveness of each candidate gene. Consequently, the gene with the strongest decision value will be chosen as the positive prediction.
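The selection rule above amounts to taking, for each genome, the candidate with the largest decision value. A minimal sketch, assuming the decision values are already available as a nested dictionary (the function name, genome identifiers, gene names and values are all hypothetical):

```python
def pick_unique_positive(decision_values):
    """For each genome, return the single candidate gene with the largest
    decision value; genomes without candidates are skipped.

    decision_values: {genome_id: {gene_id: decision_value}}
    """
    return {
        genome: max(scores, key=scores.get)
        for genome, scores in decision_values.items()
        if scores
    }

decision_values = {
    "phage1": {"gp12": -0.8, "gp13": 0.4, "gp14": 1.3},
    "phage2": {"gp07": -1.1, "gp08": -0.2},   # no positive value, best one still chosen
    "phage3": {},
}
predictions = pick_unique_positive(decision_values)
```

Note that a gene is selected even when every decision value in the genome is negative, which is what forcing exactly one positive prediction per genome implies.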

In order to apply SVM, each gene is converted into a numeric vector capturing the following features: composition, normalized van der Waals volume, hydrophobicity, polarity [30, 32], pairwise similarity scores against other genes in the database [29], relative position and gene size. To compute the "relative position", the system first finds the gene class which has the most conserved distance to the gene under current prediction. For example, as demonstrated in Figure 2, if we are making predictions over class A, then class B will be chosen as the reference for computing the relative positions because the distances between class B genes and class A genes are the most conserved. The relative position of a gene in class A is then computed as the distance between itself and the class B gene in the corresponding genome.
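The choice of reference class for the "relative position" feature can be sketched as follows. The text does not define how conservation of distance is measured; this sketch assumes it is the variance of the gene-to-gene distance across the genomes of a cluster, with lower variance meaning more conserved. The function names and locations are hypothetical.

```python
from statistics import pvariance

def most_conserved_reference(cluster, target):
    """Pick the gene class whose distance to `target` varies least across
    the genomes of a cluster (variance as the conservation measure is an
    assumption).

    cluster: list of {gene_class: location} dictionaries, one per genome
    """
    best, best_var = None, None
    classes = set().union(*cluster) - {target}
    for ref in sorted(classes):
        distances = [g[target] - g[ref] for g in cluster if target in g and ref in g]
        if len(distances) < 2:
            continue                     # not enough genomes to judge conservation
        var = pvariance(distances)
        if best_var is None or var < best_var:
            best, best_var = ref, var
    return best

def relative_position(genome, gene_location, reference):
    """Distance between a gene and the reference-class gene of the same genome."""
    return gene_location - genome[reference]

# Invented locations for three genomes of one cluster.
cluster = [
    {"A": 100, "B": 150, "C": 900},
    {"A": 400, "B": 452, "C": 700},
    {"A": 50, "B": 99, "C": 2000},
]
reference = most_conserved_reference(cluster, target="A")
feature = relative_position(cluster[0], cluster[0]["A"], reference)
```

Here class B sits at a nearly constant offset from class A, so it is selected as the reference, in line with the example of Figure 2.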

The pairwise similarity scores have been observed to improve classification accuracies. These scores represent the distance between a gene and every other gene in the database [29]. However, it should be emphasized that while these sequence similarity scores enhance the strength of the feature vectors, the system does not rely upon similarity significances to detect gene correspondence.

Availability and requirements

Project name: SynFPS

Project website:

Operating system: Microsoft Windows family

Other requirements: Microsoft .NET Framework 2.0 (free), Bioperl 1.4 (optional)

Any restrictions to use by non-academics: None



CDS: Coding Sequence

HGT: Horizontal gene transfers

SVM: Support Vector Machines

SynFPS: Synteny-based Function Prediction System


  1. Pan X, Stein L, Brendel V: SynBrowse: a synteny browser for comparative sequence analysis. Bioinformatics 2005, 21(17):3461–3468. 10.1093/bioinformatics/bti555

  2. Frazer KA, Pachter L, Poliakov A, Rubin EM, Dubchak I: VISTA: computational tools for comparative genomics. Nucleic Acids Res 2004, (32 Web Server):W273–279. 10.1093/nar/gkh458

  3. Brudno M, Do CB, Cooper GM, Kim MF, Davydov E, Program NCS, Green ED, Sidow A, Batzoglou S: LAGAN and Multi-LAGAN: Efficient Tools for Large-Scale Multiple Alignment of Genomic DNA. Genome Res 2003, 13(4):721–731. 10.1101/gr.926603

  4. Schwartz S, Zhang Z, Frazer KA, Smit A, Riemer C, Bouck J, Gibbs R, Hardison R, Miller W: PipMaker – a web server for aligning two genomic DNA sequences. Genome Res 2000, 10(4):577–586. 10.1101/gr.10.4.577

  5. Clamp M, Andrews D, Barker D, Bevan P, Cameron G, Chen Y, Clark L, Cox T, Cuff J, Curwen V, et al.: Ensembl 2002: accommodating comparative genomics. Nucleic Acids Res 2003, 31(1):38–42. 10.1093/nar/gkg083

  6. Huynen MA, Snel B, von Mering C, Bork P: Function prediction and protein networks. Curr Opin Cell Biol 2003, 15(2):191–198. 10.1016/S0955-0674(03)00009-7

  7. von Mering C, Jensen LJ, Snel B, Hooper SD, Krupp M, Foglierini M, Jouffre N, Huynen MA, Bork P: STRING: known and predicted protein-protein associations, integrated and transferred across organisms. Nucleic Acids Res 2005, 33(Database issue):D433-D437. 10.1093/nar/gki005

  8. Tamames J: Evolution of gene order conservation in prokaryotes. Genome Biol 2001, 2(6):RESEARCH0020. 10.1186/gb-2001-2-6-research0020

  9. Yanai I, Mellor JC, DeLisi C: Identifying functional links between genes using conserved chromosomal proximity. Trends in Genetics 2002, 18(4):176–179. 10.1016/S0168-9525(01)02621-X

  10. Bray N, Dubchak I, Pachter L: AVID: A Global Alignment Program. Genome Res 2003, 13: 97–102. 10.1101/gr.789803

  11. Brudno M, Malde S, Poliakov A, Do CB, Couronne O, Dubchak I, Batzoglou S: Glocal alignment: finding rearrangements during alignment. Bioinformatics 2003, 19(suppl_1):i54–62. 10.1093/bioinformatics/btg1005

  12. Korbel JO, Snel B, Huynen MA, Bork P: SHOT: a web server for the construction of genome phylogenies. Trends Genet 2002, 18(3):158–162. 10.1016/S0168-9525(01)02597-5

  13. Brussow H, Hendrix RW: Phage Genomics: Small Is Beautiful. Cell 2002, 108: 13–16. 10.1016/S0092-8674(01)00637-7

  14. Hendrix RW: Bacteriophage genomics. Curr Opin Microbiol 2003, 6(5):506–511. 10.1016/j.mib.2003.09.004

  15. Jiang W, Li Z, Zhang Z, Baker ML, Prevelige PE Jr, Chiu W: Coat protein fold and maturation transition of bacteriophage P22 seen at subnanometer resolutions. Nat Struct Biol 2003, 10(2):131–135. 10.1038/nsb891

  16. Hatfull GF, Pedulla ML, Jacobs-Sera D, Cichon PM, Foley A, Ford ME, Gonda RM, Houtz JM, Hryckowian AJ, Kelchner VA, et al.: Exploring the mycobacteriophage metaproteome: phage genomics as an educational platform. PLoS Genet 2006, 2(6):e92. 10.1371/journal.pgen.0020092

  17. Cristianini N, Shawe-Taylor J: An introduction to support vector machines: And other kernel-based learning methods. Cambridge, England: Cambridge Press; 2000.

  18. Wang IN, Smith DL, Young R: Holins: the protein clocks of bacteriophage infections. Annu Rev Microbiol 2000, 54: 799–825. 10.1146/annurev.micro.54.1.799

  19. Tamames J, Gonzalez-Moreno M, Mingorance J, Valencia A, Vicente M: Bringing gene order into bacterial shape. Trends in Genetics 2001, 17(3):124–126. 10.1016/S0168-9525(00)02212-5

  20. Wolf YI, Rogozin IB, Kondrashov AS, Koonin EV: Genome alignment, evolution of prokaryotic genome organization, and prediction of gene function using genomic context. Genome Res 2001, 11(3):356–372. 10.1101/gr.GR-1619R

  21. Fujibuchi W, Ogata H, Matsuda H, Kanehisa M: Automatic detection of conserved gene clusters in multiple genomes by graph comparison and P-quasi grouping. Nucleic Acids Res 2000, 28(20):4029–4036. 10.1093/nar/28.20.4029

  22. Kanehisa M, Goto S, Kawashima S, Nakaya A: The KEGG databases at GenomeNet. Nucl Acids Res 2002, 30(1):42–46. 10.1093/nar/30.1.42

  23. Stajich JE, Block D, Boulez K, Brenner SE, Chervitz SA, Dagdigian C, Fuellen G, Gilbert JG, Korf I, Lapp H, et al.: The Bioperl toolkit: Perl modules for the life sciences. Genome Res 2002, 12(10):1611–1618. 10.1101/gr.361602

  24. Sipser M: Chapter 1: Regular languages. In Introduction to the theory of computation. 2nd edition. Boston: Thomson Course Technology; 2006:31–90.

  25. Microsoft: Regular Expression Language Elements. MSDN Library: .NET Framework General Reference, Microsoft Corporation; 2006.

  26. Hsu AL, Halgamuge SK: Enhancement of topology preservation and hierarchical dynamic self-organising maps for data visualisation. International Journal of Approximate Reasoning 2003, 32(2–3):259–279. 10.1016/S0888-613X(02)00086-5

  27. Hsu AL, Tang SL, Halgamuge SK: An unsupervised hierarchical dynamic self-organizing approach to cancer class discovery and marker gene identification in microarray data. Bioinformatics 2003, 19(16):2131–2140. 10.1093/bioinformatics/btg296

  28. Keerthi SS, Shevade SK, Bhattacharyya C, Murthy KRK: Improvements to Platt's SMO Algorithm for SVM Classifier Design. Neural Comp 2001, 13(3):637–649. 10.1162/089976601300014493

  29. Liao L, Noble WS: Combining pairwise sequence similarity and support vector machines for detecting remote protein evolutionary and structural relationships. J Comput Biol 2003, 10(6):857–868. 10.1089/106652703322756113

  30. Cai CZ, Han LY, Ji ZL, Chen YZ: Enzyme family classification by support vector machines. Proteins 2004, 55(1):66–76. 10.1002/prot.20045

  31. Baten A, Chang BCH, Halgamuge SK, Li J: Splice site identification using probabilistic parameters and SVM classification. BMC Bioinformatics 2006, 7(Suppl 5):S15. 10.1186/1471-2105-7-S5-S15

  32. Dubchak I, Muchnik I, Holbrook SR, Kim SH: Prediction of protein folding class using global description of amino acid sequence. Proc Natl Acad Sci USA 1995, 92(19):8700–8704. 10.1073/pnas.92.19.8700

  33. Tatusova TA, Madden TL: BLAST 2 Sequences, a new tool for comparing protein and nucleotide sequences. FEMS Microbiol Lett 1999, 174(2):247–250. 10.1111/j.1574-6968.1999.tb13575.x



We thank Bill Chang and Arthur Hsu for their advice on this work and Zhi Feng Zhu for his assistance in software implementation.

This article has been published as part of BMC Bioinformatics Volume 8, Supplement 4, 2007: The Second Automated Function Prediction Meeting. The full contents of the supplement are available online at

Author information

Corresponding author

Correspondence to Sen-Lin Tang.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

JL conceived of the study, designed the software and drafted the manuscript. SKH supervised the work and participated in results evaluation. ST conceived of the clustering design and gave expertise in bacteriophage analysis. CIK participated in the SVM predictions. All authors have participated in preparing the manuscript, have read and approved the final manuscript.

Rights and permissions

Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Li, J., Halgamuge, S.K., Kells, C.I. et al. Gene function prediction based on genomic context clustering and discriminative learning: an application to bacteriophages. BMC Bioinformatics 8 (Suppl 4), S6 (2007).

  • Support Vector Machine
  • Prediction Accuracy
  • Regular Expression
  • Genomic Context
  • Related Genome