High-level description of our algorithm
We first represent genes in the target genome or in a set of specified reference genomes, and their functional relatedness as a graph, called a reference graph, where each gene in any of the genomes is represented as a vertex, and two genes have an edge linking them if they are in the same operon or they are homologous. We then define a linkage graph for the target genome such that each gene is represented as a vertex and two genes have an edge if and only if there is a path linking the two genes in the reference graph, and the distance of the edge is defined as the distance of the shortest path between the two genes in the reference graph. We have augmented this distance by including two additive terms, one penalty factor (system(error)) used to model the reliability of a predicted functional relationship, and a phylogeny-based distance used to capture co-evolutionary relationships, more general than homology relationships among genes, between two genes. Our goal here is to find genes that have short distances, defined above, to genes in a known pathway, and predict that they are involved in this pathway if their distances are ranked among the top such genes. The whole procedure is summarized in Figure 1, with the detailed steps explained as follows.
Selection of reference genomes
Currently over 1,000 bacterial and archaean genomes have been sequenced and are publicly available (NCBI release of September 2009). From this set, we have selected 185 strains (non-redundant genomes and plasmids) (see Additional File 1) from 185 different genera using the following rule: for each genus, select the genome with the longest sequence.
Calculation of homology-based distance
For each pair of genes x
i
, x
j
, in the target genome and the 185 reference genomes, we use the E-value of BLAST (with default parameters) to define their homology-based distance d
s
(x
i
, x
j
) as follows:
where p
s
(x
i
, x
j
) is the BLAST E-value for genes x
i
, x
j
, and 185 is a normalization factor since when the E-value is smaller than 1e-185, it is set as 0 in the BLAST program. Clearly d
s
(x
i
, x
j
) is between 0 and 1; and the more similar two genes are, the smaller the d
s
(x
i
, x
j
) value is.
Calculation of operon-based distance
We have used the operons predicted using our own program [18], which is considered the most reliable operon prediction method in the public domain [17]. A probability calculated by this method represents the likelihood that two neighbouring genes are in the same operon. We apply this program to all of the 185 reference genomes and get the probability p
o
(x
i
, x
j
) between two genes x
i
, x
j
in each genome. For any pair of neighbouring genes x
i
, x
j
in the same genome (target or reference), we define their operon-based distance d
o
(x
i
, x
j
) as follows:
where p
o
(x
i
, x
j
) represents the probability that x
i
, x
j
are in the same operon as given in [18].
Reference graph and linkage graph
We define a reference graph over all genes in the target as well as the reference genomes as follows. Each gene is represented as a vertex, and an edge between two genes is created if (i) the two genes are in the same operon, with their edge distance defined to be the operon-based distance between the two genes; or (ii) the two genes in different genomes are homologous, with their edge distance defined to be their homology-based distance. Based on the reference graph, we define a linkage graph on genes in the target genome. For any pair of genes, x
i
, x
j
, we define an edge between them if and only if there is a path x
i
, x1,x2, … x
j
in the reference graph, with its edge distance set to be the distance of the shortest path between the two genes (Figure 2). We intend to use an edge in this graph to capture a functional linkage relationship possibly through multiple steps of co-operon and homologous relationship. We recognize that the reliability of such defined edges could go down (largely independent of the reliability of individual operon and homology predictions) as the number of edges in the above path goes up. Hence we included a penalty factor, system (error), which is proportional to the number of edges in the path, and redefined the path distance of a gene pair as follows:
where k is the number of edges in the path, and α is a scaling factor. In our current implementation, we set α = 380 and system(error) = 0.06 based on a ten-fold cross-validation method (see Parameter Selection). E(operon) and E(similarity) are the set of operon edges and the set of similarity edges, respectively.
Phylogeny-based distance
We also considered a more general class of functional relationship defined in terms of the phylogenetic profiles of genes, which measures their co-evolutionary relationship [20, 21]. Basically, the phylogenetic profile X of a gene against a set of n reference genomes is a binary string of length n, with the i th position being 1, if the gene has a homolog in the i th reference genome, and 0 otherwise. It has been found that two genes (of the same genome) are generally functionally related if their phylogenetic profiles are highly similar [20]. We have used a BLAST E-value e-3 as the cutoff for determining the presence of a homolog in another genome [22]. We use the following to measure the similarity between two phylogenetic profiles, similar to that reported in [23]. Given the phylogenetic profiles Xi and Y
j
for genes x
i
and y
j
, their phylogeny-based distance is defined as follows:
where, d
hamming
(X
i
, X
j
) is the Hamming distance between X
i
and X
j
, and Entropy (X
i
, X
j
) is the entropy of the common part of X
i
and X
j
, defined as follows:
with p being the frequency of 1’s in common positions between the two phylogenetic profiles. Note that the more similar two phylogenetic profiles are, the smaller their distance is.
Rank functional relatedness of candidate genes
Our goal here is to rank all the genes in a target genome in terms of a possible relationship with a set of seed genes (known genes in a pathway), by fusing the path distance and the phylogeny-based distance. For a given pathway P, let its known gene set be G(P) and |G(P)| be the number of its genes. We define a distance from P to a candidate gene x
i
as
Similarly, we define a phylogenetic distance from P to x
i
as
Our experience has been that for both the path distance and the phylogenetic distance, the distance for the top ranked genes tend to be more reliable. Hence only the top K candidate genes to each gene x
j
ε G(P) are considered and the remaining is ignored. To a seed gene, we only take the K shortest genes measured by reference distance, where the K is ranged from 5 to 30. Similarly, only the top K( = 50) genes closed to a seed gene is considered for phylogenetic distance [20]. So some candidate genes may not have a path distance or phylogenetic distance, due to their ranking. The final combined distance from any gene x
i
to pathway P is defined as
where β is a scaling factor and set to 5, based on the ten-fold cross-validation method (see Parameter Selection); and T is set to be 2 if gene x
i
has both the path distance and phylogenetic distance, and as 1 if it has only one distance defined. The candidate genes are ranked by their combined distance and the final top γ genes are output (γ = 10 in this study).
Parameter Selection and Validation Method
For a predicted target gene and a target pathway, the gene is considered a positive prediction (based on a partial gene list of the pathway) if it is part of the pathway. For any of the following assessments of our prediction, we use the following (standard) notations: TP for true positive predictions; TN for true negative predictions; FP for false positive predictions and FN for false negative predictions (FN); and we use the following standard measures of sensitivity (SE), specificity (SP) and positive predictive value (PPV) to assess the performance of our prediction method of missing genes:
To assess the prediction performance against a set of pathways, we use the average of the above three measures across all the pathways as follows:
where SP
i
, SE
i
and PPV
i
are SP, SE and PPV for the i th pathway, respectively, and N is the number of pathways considered.
For each to-be-determined parameter in our program, a ten-fold cross-validation procedure is used to derive the optimal value. Specifically, all the pathways are divided randomly into ten parts, nine for training and one for testing each time. The value with the best average is finally selected. The leave-one-out cross-validation procedure is used to assess the performance. For each pathway, its known genes are used as the seed-gene set. The procedure removes each gene from the pathway seed set one at a time, and then calculates the final combined distance from the remaining genes to the removed gene and all the left genes of the target genome. If the removed gene is output in the final top γ genes, it is counted as a successful prediction.