In this section, we describe our algorithm for solving the BBH r-window clustering problem described in the previous section. We start off with a simple quadratic time algorithm and show how it can be improved using a sliding window technique. Finally, we present an efficient data structure that allows us to obtain a subquadratic time algorithm.
A simple quadratic time algorithm
A straightforward algorithm is to generate both sets of windows W
G
and W
H
, then for each window in W
G
, compute its best hit in W
H
by going through each window in W
H
and vice versa. For simplicity, we assume that there are at most r genes in a window of length r. Therefore, the size of W
G
and W
H
is O(nr) and O(mr) in the worst case and comparing two windows take O(r) time. This simple algorithm has a time complexity of O(nmr3).
A sliding window algorithm
We first show how we can find the best hits for each window in W
G
efficiently. Finding the best hits for windows in W
H
is done in the same way. After that, we go through the hits and keep only the bidirectional best hits.
One problem in the previous algorithm is that many of the comparisons between two windows would result in a shared gene count of zero. Therefore, instead of storing all the r-windows, we generate them one-the-fly to avoid comparing two windows with no common homology family.
We enumerate the windows in W
G
by starting from each gene and incrementally add genes in increasing order of their position as long as the window length is less than or equal to r. We use a data structure T to maintain the set of windows W
H
that have a non zero shared gene count with respect to the current window in W
G
.
Each time we consider a different window w
G
, we need to update our data structure by adding the corresponding genes in H from the same family to our data structure. To determine the list of genes to be added, we preprocess H to compute the list of genes for each homology family. Finally, for each window W
G
, we make use of our data structure to determine the best hit in H.
The pseudo code for this algorithm is shown in Algorithm 1.
Putting it all together, we first find the best hits from G to H and vice versa, then filter the results to only retain the bidirectional best hits. We store the best hits from H to G in a hash table and for each best hit from G to H, we access the table to check if it is also a best hit from H to G.
The pseudocode for the algorithm is shown in Algorithm 2.
Data structure for W
H
Observe that for a given window W
G
in W
G
, most of the windows in W
H
do not have any genes in common with W
G
. Hence, instead of finding the best hit by checking against all windows in W
H
, we only represent
Algorithm 1 BestHitWindows(G, H, r)
Ensure: Determine for each window in W
G
the best hit in W
H
BH := ∅ {set of best hits from G to H}
{Determine list of genes for each family in H and store in gs}
for i from 1 to n
H
do
h
i
:= i th gene in H
f
i
:= family of h
i
gs [f
i
] := gs [f
i
] ∪ {h
i
}
end for
{Enumerate r-windows in G and compute best hits}
for i from 1 to n
G
do
{g
i
is the i th gene in G}
e := i - 1
w
g
:= ∅
initialize T
while Δ (g
i
, ge+1) ≤ r do
e := e + 1
w
g
= w
g
∪ {g
e
}
f
e
:= family of g
e
for each gene g ∈ gs [f
e
] do
insert(T, g)
end for
w
h
:= besthit(T)
BH := BH ∪ {(W
G
, W
H
)}
end while
end for
return BH
Algorithm 2 BBHWindows(G, H, r)
Ensure: Compute the set of BBH r-window gene clusters between G and H
BBH := ∅ {set of bidirectional best hits}
BH G, H:= BestHitWindows(G, H, r)
BH H, G:= BestHitWindows(H, G, r)
{Store the best hits from H to G in a hash table M}
for each (W
H
, W
G
) in BH H, Gdo
M [w
h
] := W
G
end for
{Compute the bidirectional best hits}
for each (W
G
, W
H
) in BH G, Hdo
if M [W
H
] = W
G
then
BBH := BBH ∪ {(W
G
, W
H
)}
end if
end for
return BBH
the windows that have at least one gene in common with W
G
.
This is achieved by storing the genes in H that share a family with w
g
in a balanced binary search tree, T, using the position of the gene as the key. Each gene represents the start of a window, thus each node in the data structure represents a window of length r in H.
We need to be able to insert/delete genes in this structure and find the largest window. To find the largest window efficiently, we maintain the shared gene count, s, of each window as an additional attribute of each node.
Insertion and deletion of a gene follows from the algorithm for standard binary search tree. Unfortunately these two operations cause the shared gene count of up to r contiguous windows to change. Instead of updating these windows one by one, we make an update to the roots of the subtrees that contains only these windows to indicate the change in shared gene count to all the windows in the subtrees. For this to work, we need to keep track of the range of windows in a subtree by storing the minimum and maximum position of genes, (min
p
, max
p
), and the adjustment to the shared gene count, Δ
s
. This is similar to the canonical decomposition technique used in segment trees [16]. Hence, the number of nodes affected is at most O(lg |T|), where |T| is the number of nodes in the tree.
To find the window with the highest shared gene count, we need to keep store the maximum shared gene count in each subtree. Then the maximum shared gene count in the whole tree is found in the root. Finding the best hit is done by traversing only those nodes whose maximum shared gene count is equal to the maximum in the whole tree. The complexity of this step is therefore O(lg |T|).
In summary, to make the three operations efficient, we augment each node of the tree with the following attributes:
s -- shared gene count for the window of length r starting at this gene
max
s
-- maximum shared gene count of windows in this subtree
(min
p
, max
p
) -- minimum and maximum position of genes in this subtree; used to determine the windows under this subtree
Δ
s
-- adjustment in shared gene count made to all windows in this subtree
When rotations are necessary to maintain the balance of the tree, the additional attributes in the nodes can be updated in constant time as they can be computed from the attributes in the left and right subtrees.
Time complexity
The first part of the algorithm determine the list of genes in H for each homology family. This has a worst case time complexity of O(m). The complexity of the operations on the data structure T, depends on its size, which is O(r). Hence, the complexity of determining the best hits for each window in W
G
is O(m + nr lg r) and the complexity for determining the best hits in both directions is O((n + m)r lg r). The number of results for BestHitWindows(G, H, r) and BestHitWindows(H, G, r) is O(nr) and O(mr) respectively. Creating the associative array to index the best hits from H to G takes O(mr) time on average using a hash table. Going through the best hits from G to H and keeping only the bidirectional best hits takes O(nr) time on average, assuming expected O(1) time to access the hash table.
Therefore, the time complexity of the whole algorithm is dominated by the time taken to compute the best hits which is O((n + m)r lg r).