CB-grid layout algorithm: Introduction of the grid layout algorithm
Given a graph G = (V, E) with nodes V and edges E, a layout L = (V, E, U, P) of G consists of the underlying graph G, grid points U, and a function P : V → U such that $P(v_\alpha) \neq P(v_\beta)$ for any two distinct nodes $v_\alpha, v_\beta \in V$. This definition does not allow overlaps between nodes in the layout. For a layout L, this paper uses the following notation.
- $W_L$: the set of vacant points of L.
- $E_v$: the set of all edges connected to node v.
- |V|: the number of nodes in V.
- |W|: the number of vacant points in L, used instead of $|W_L|$ when no confusion is possible.
We define the following operations.
- $T_{v \to p}L$: the layout generated by moving a node v to a vacant point $p \in W_L$.
- (1) L: the layout generated by swapping nodes $v_\alpha$ and $v_\beta$.
- $D_v L$: the layout generated by removing a node v and all edges connected to v.
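The definitions and operations above can be sketched as a small data structure. This is only an illustration under stated assumptions: the class name `Layout` and the method names `vacant`, `edges_of`, `move`, `swap`, and `remove` are hypothetical and do not come from the paper, and each operation returns a new layout rather than modifying the old one.

```python
from dataclasses import dataclass, field

Point = tuple[int, int]  # (column, row) of a grid point


@dataclass
class Layout:
    """Hypothetical container for a layout L = (V, E, U, P)."""
    grid: set[Point]                                         # U: all grid points
    pos: dict[str, Point]                                    # P: node -> grid point
    edges: set[frozenset[str]] = field(default_factory=set)  # E: undirected edges

    def __post_init__(self):
        # Enforce the no-overlap condition P(v_a) != P(v_b).
        if len(set(self.pos.values())) != len(self.pos):
            raise ValueError("two nodes share a grid point")
        if not set(self.pos.values()) <= self.grid:
            raise ValueError("a node is placed off the grid")

    def vacant(self) -> set[Point]:
        """W_L: grid points not occupied by any node."""
        return self.grid - set(self.pos.values())

    def edges_of(self, v: str) -> set[frozenset[str]]:
        """E_v: all edges connected to node v."""
        return {e for e in self.edges if v in e}

    def move(self, v: str, p: Point) -> "Layout":
        """Move node v to a vacant point p."""
        if p not in self.vacant():
            raise ValueError("target point is not vacant")
        return Layout(self.grid, {**self.pos, v: p}, self.edges)

    def swap(self, a: str, b: str) -> "Layout":
        """Exchange the positions of nodes a and b."""
        return Layout(self.grid, {**self.pos, a: self.pos[b], b: self.pos[a]}, self.edges)

    def remove(self, v: str) -> "Layout":
        """Remove node v and all edges connected to v."""
        rest = {u: p for u, p in self.pos.items() if u != v}
        return Layout(self.grid, rest, self.edges - self.edges_of(v))
```

Returning fresh layouts keeps the sketch easy to reason about; an implementation tuned for speed would more likely update positions in place.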
In addition, we define the following functions.
- (2) (L): a binary function that returns 1 if an edge $e_i$ crosses an edge $e_j$ and 0 otherwise.
- (3) (L): a binary function that returns 1 if an edge $e_j$ crosses a node $v_i$ and 0 otherwise.
- (4) (L): a function that returns the distance cost of the pair of nodes $v_i$ and $v_j$, computed from the weight assigned to that pair and $md(v_i, v_j)$, the Manhattan distance between $v_i$ and $v_j$.
In our previous approach [15] (mainly referred to as CB-grid layout algorithm), the layout cost C(L) of L was defined as a weighted sum of the edge-edge crossing, node-edge crossing, and distance terms, where the weights $W_{ee}$, $W_{ne}$, and $W_d$ are called respectively the edge-edge crossing weight, the node-edge crossing weight, and the distance cost weight.
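A minimal sketch of such a cost function follows. Several parts are assumptions rather than the definitions of [15]: the function names, the geometric crossing tests (a proper segment intersection for edge-edge crossings, and a node lying strictly inside a segment for node-edge crossings), the exclusion of edge pairs that share a node, and the form of the distance term as the pair weight times the Manhattan distance.

```python
from itertools import combinations


def md(p, q):
    """Manhattan distance between two grid points."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def orient(a, b, c):
    """Sign of the cross product (b - a) x (c - a): 1, 0, or -1."""
    v = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
    return (v > 0) - (v < 0)


def edges_cross(p1, p2, q1, q2):
    """True if segments p1p2 and q1q2 meet at a point interior to both.

    Collinear overlaps are not detected; this is a simplification.
    """
    return (orient(p1, p2, q1) * orient(p1, p2, q2) < 0
            and orient(q1, q2, p1) * orient(q1, q2, p2) < 0)


def node_on_edge(n, p1, p2):
    """True if point n lies strictly inside segment p1p2."""
    if n in (p1, p2) or orient(p1, p2, n) != 0:
        return False
    return (min(p1[0], p2[0]) <= n[0] <= max(p1[0], p2[0])
            and min(p1[1], p2[1]) <= n[1] <= max(p1[1], p2[1]))


def layout_cost(pos, edges, weight, w_ee=1.0, w_ne=1.0, w_d=1.0):
    """Weighted sum of crossing counts and weighted distances (assumed form)."""
    ee = sum(
        edges_cross(pos[a], pos[b], pos[c], pos[d])
        for (a, b), (c, d) in combinations(edges, 2)
        if len({a, b, c, d}) == 4  # assumption: edges sharing a node never cross
    )
    ne = sum(
        node_on_edge(pos[v], pos[a], pos[b])
        for v in pos for (a, b) in edges if v not in (a, b)
    )
    dist = sum(
        weight.get(frozenset((u, v)), 0.0) * md(pos[u], pos[v])
        for u, v in combinations(pos, 2)
    )
    return w_ee * ee + w_ne * ne + w_d * dist
```

For two diagonal edges that cross in the middle of a 3 x 3 grid, with unit weights on the two adjacent pairs, the sketch counts one edge-edge crossing and a distance term of 8.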
The CB-grid layout algorithm repeatedly moves a single node to a vacant point until it reaches a locally optimal layout. At each step, the algorithm calculates the costs of all layouts that can be generated by moving any one node to any one vacant point. The layout with the lowest cost is selected as the starting layout for the next step. After convergence, the algorithm outputs a locally optimal layout. If the cost calculation of all possible adjacent layouts is implemented in a naïve way, the time complexity is high. To overcome this problem, the previous method [15] introduced a Δ matrix that stores each possible cost difference at the previous step, and succeeded in reducing the time complexity at each step from $O(|W|(|V|^2 + |E|^2))$ to $O(|V|^2 + |E|^2 + |W||E_{v_\beta}|(|V| + |E|))$, where $v_\beta$ is the node moved at the previous step.
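The search loop described above can be sketched in its naïve form, without the Δ matrix. The function names are hypothetical, and `edge_length_cost` is a stand-in cost (total Manhattan edge length) used only to keep the example self-contained; it is not the layout cost C(L).

```python
def md(p, q):
    """Manhattan distance between two grid points."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def edge_length_cost(pos, edges):
    """Stand-in cost: total Manhattan length of all edges (not the paper's C)."""
    return sum(md(pos[a], pos[b]) for a, b in edges)


def greedy_grid_layout(grid, pos, edges, cost=edge_length_cost):
    """Repeat the best single-node move until no move lowers the cost."""
    pos = dict(pos)
    current = cost(pos, edges)
    while True:
        vacant = set(grid) - set(pos.values())
        best = None  # (cost, node, target point) of the best improving move
        for v in pos:
            old = pos[v]
            for p in vacant:
                pos[v] = p                    # try the move
                c = cost(pos, edges)
                if c < current and (best is None or c < best[0]):
                    best = (c, v, p)
            pos[v] = old                      # undo the trial moves
        if best is None:                      # local optimum reached
            return pos, current
        current, v, p = best
        pos[v] = p
```

Every candidate is re-evaluated from scratch here, which is exactly the cost that the Δ matrix is designed to avoid.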
When CB-grid layout algorithm was applied to several biopathways, we encountered three problems. Thus, we propose new grid layout algorithms that solve these problems. Problems and solutions are summarized as follows:
1. Improving the choice of the initial layout: since a locally optimal layout depends noticeably on the initial layout, we first apply Eades initial layout algorithm to a random layout, and use its output as the initial layout. In the previous approach, a random layout was directly used as the initial layout.
2. Improving the cost function: we introduce the concept of a combo score that gives a good score, i.e., a negative cost, when nodes with the same biological attribute are aligned (CCB-grid layout algorithm). In CB-grid layout algorithm, the biological attributes, except subcellular localization, were ignored.
3. Improving the search strategy: we propose a better search strategy, which allows us to obtain improved results while keeping the time complexity. To obtain a better layout, the search space is extended by adding the swap operation. At each step, all layouts obtained by swapping two nodes are also considered (SCCB-grid layout algorithm).
In the remainder of this section, we describe these three new algorithms mentioned above.
Eades initial layout algorithm: generating a new initial layout for grid layout algorithms
In the previous paper [15], a random layout was used as the initial layout for CB-grid layout algorithm. When the initial layout is far from the global optimum, the local optimum obtained tends to be unacceptable. Therefore, we decided to adapt Eades algorithm [18] and use its output as the initial layout. Eades algorithm is a force-directed algorithm consisting of the following two steps.
1. Two types of forces are defined for each pair of nodes. If two nodes are adjacent, there exists an attractive force $a_{c1}\log(d/a_{c2})$ between them, where $a_{c1}$ and $a_{c2}$ are constants and d is the distance between the two nodes. On the other hand, if two nodes are not adjacent, there exists a repulsive force between them that is scaled by a constant $r_c$ and weakens as d grows. At each step, the positions of all the nodes are updated according to the sum of the repulsive and attractive forces acting on them.
2. The above step is iterated a predetermined number of times, and the final result is obtained.
We have customized two points in Eades algorithm. First, nodes in Eades algorithm can be placed anywhere. All the nodes in the initial layout for CB-grid layout algorithm, however, should be placed on the grid points that satisfy the subcellular localization. Thus, the output of Eades algorithm cannot be used directly as an input for CB-grid layout algorithm.
To handle this problem, we propose to move each node to the closest vacant point that satisfies the subcellular localization after moving nodes at each step.
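One force step followed by the snapping rule can be sketched as follows. The names `eades_step` and `snap_to_grid`, the step size, and the default constants are hypothetical. The repulsive force is written as $r_c/d^2$ purely as an assumption for illustration, and `allowed` is an assumed mapping from each node to the grid points that satisfy its subcellular localization.

```python
import math


def eades_step(pos, edges, a_c1=2.0, a_c2=1.0, r_c=1.0, step=0.1):
    """Apply one force-directed update to continuous node positions."""
    adjacent = {frozenset(e) for e in edges}
    new_pos = {}
    for u, (ux, uy) in pos.items():
        fx = fy = 0.0
        for v, (vx, vy) in pos.items():
            if u == v:
                continue
            dx, dy = vx - ux, vy - uy
            d = math.hypot(dx, dy) or 1e-9        # guard against coincident nodes
            if frozenset((u, v)) in adjacent:
                f = a_c1 * math.log(d / a_c2)     # pulls u toward v when d > a_c2
            else:
                f = -r_c / d ** 2                 # ASSUMED repulsion law
            fx += f * dx / d
            fy += f * dy / d
        new_pos[u] = (ux + step * fx, uy + step * fy)
    return new_pos


def snap_to_grid(pos, allowed):
    """Place each node on the closest still-vacant grid point it may occupy.

    Nodes are processed in dictionary order, a simplification of this sketch.
    Raises ValueError if a node has no vacant allowed point left.
    """
    taken, snapped = set(), {}
    for v, (x, y) in pos.items():
        candidates = [p for p in allowed[v] if p not in taken]
        best = min(candidates, key=lambda p: (p[0] - x) ** 2 + (p[1] - y) ** 2)
        taken.add(best)
        snapped[v] = best
    return snapped
```

Calling `snap_to_grid(eades_step(pos, edges), allowed)` once per iteration mirrors the idea of snapping after every step rather than only at the end.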
The second customization is as follows. Since Eades algorithm does not consider edge-edge crossings and node-edge crossings, the resulting layout could contain many such crossings. For example, consider a biological pathway with a subcellular localization, such as a membrane, that forms a thin band surrounding other subcellular localizations, as shown in Figure 1(a); the graph in (a) could be a layout resulting from Eades algorithm. In this case, the layout might contain a large number of edge-edge crossings and node-edge crossings because edges cross over other subcellular localizations. To avoid this problem, we propose to gather nodes around a particular grid point for each subcellular localization, as shown in Figure 1(b). Eades algorithm with the above two customizations is called Eades initial layout algorithm.
CCB-grid layout algorithm: utilizing various biological attributes
When humans draw biopathway models, nodes with the same attribute are usually arranged according to a rule. In CB-grid layout algorithm, this type of information is completely ignored. To exploit this property, we introduce the concept of combo scores, called combo1 and combo2 (see Figure 2). Note that a combo score is applied only to nodes having an attribute, since some nodes do not have any attributes. We denote the set of nodes having an attribute by V' ⊆ V. In this algorithm, (i) upperGrid(p, i)/lowerGrid(p, i) returns the i-th grid point above/below a grid point p ∈ U, (ii) Attr(v) is the attribute of a node v ∈ V', and (iii) $CW_a = 1 + C/|V_a|$, where C is a constant normally set to |V| and $V_a$ is the set of nodes having attribute a.
The combo score is designed such that the more nodes with the same attribute are aligned vertically, the higher the score is. The combo score is defined between two nodes, and the combo score of a layout L is defined to be the sum of all the combo scores occurring in L. We say that two nodes have a combo relation when a combo score occurs between them. Note that a horizontal alignment score is not implemented, because if the combo score supported both the vertical and horizontal directions, the numbers of edge-edge crossings and node-edge crossings would increase considerably. Therefore, we should choose only one direction for combo scores. In this paper, we define combo scores in the vertical direction. We have considered two types of combo scores, i.e., combo1 and combo2, for the layouts in Figure 3(a) and 3(b), respectively. Let nodes $v_a$ to $v_f$ in Figure 3 have the same attribute. The combo1 score considers only the nodes at a vertical grid distance of one from the target node. In contrast, combo2 considers the nodes at a vertical grid distance of up to two from the target node. For the layout in Figure 3(a), the numbers of combo relations with combo1 and combo2 are 8 and 12, respectively. If node $v_f$ is moved as shown in Figure 3(b), the number of combo relations with combo1 is the same as before, whereas that with combo2 is 14. Thus, only by using combo2 can we improve the combo score when node $v_f$ is moved as shown in Figure 3(a) and 3(b). As shown in the dotted rectangle in Figure 3(a), a pair of vertically aligned nodes often occurs during the process of updating a layout. In this case, Figure 3(b) should be a better layout than Figure 3(a). For this reason, we decided to employ combo2. Henceforth, for a node v ∈ V in a layout L, $\mathrm{Combo}_v(L)$ denotes the same combo score as combo2(v, L). The total score for L is denoted by Combo(L).
If $CW_a$ returned the same value for every attribute a, attributes shared by many nodes would be favored, since their nodes have a greater chance of neighboring one another and would easily become vertically aligned. To reduce this bias among the attributes, we define $CW_a$ to be inversely related to the total number of nodes whose attribute is a.
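The combo2 idea can be sketched as follows. The exact rule is given in Figure 2, which is not reproduced here, so this sketch rests on assumptions: each combo relation contributes $CW_a$, a node two steps away is counted only when the node one step away also shares the attribute, and the layout total is half the sum of the per-node scores. The function names are hypothetical.

```python
from collections import Counter


def combo_weights(attr, n_nodes):
    """CW_a = 1 + C / |V_a| with C = |V| (passed in as n_nodes)."""
    counts = Counter(attr.values())          # |V_a| for each attribute a
    return {a: 1 + n_nodes / k for a, k in counts.items()}


def combo2(v, pos, attr, cw):
    """Combo score of node v over at most two grid points above and below."""
    if v not in attr:                        # nodes without an attribute score 0
        return 0.0
    at = {p: u for u, p in pos.items()}      # grid point -> node
    x, y = pos[v]
    score = 0.0
    for direction in (1, -1):                # upper side, then lower side
        for i in (1, 2):
            u = at.get((x, y + direction * i))
            if u is None or attr.get(u) != attr[v]:
                break                        # assumed: a gap ends the chain
            score += cw[attr[v]]
    return score


def combo_total(pos, attr, cw):
    """Sum over combo relations; each relation is seen from both of its nodes."""
    return sum(combo2(v, pos, attr, cw) for v in pos) / 2
```

With three same-attribute nodes stacked in one column and |V| = 3, each node sees two relations, and the total corresponds to three relations of weight 2 each.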
By modifying the layout cost of CB-grid layout algorithm, we define the layout cost C(L) of a layout L with the new concept of the combo score by extending the CB-grid cost with a combo term: one half of the sum of the combo scores $\mathrm{Combo}_v(L)$ over all nodes, weighted by $W_{cs}$, which is called the combo score weight. CB-grid layout algorithm improved by the above modification is named Combo score, Cross cost and Biological information grid layout algorithm (CCB-grid layout algorithm). The reason for multiplying the sum of the combo scores by 1/2 is that combo scores are counted twice, since a combo score between nodes $v_\alpha$ and $v_\beta$ is included in both $\mathrm{Combo}_{v_\alpha}(L)$ and $\mathrm{Combo}_{v_\beta}(L)$. The algorithm is the same as the C-optimization (L) step in [15] except for the use of the above layout cost C(L); in particular, the algorithm for calculating Δ matrix is also the same.
For calculating the combo score for each node, only four nodes need to be checked at most, i.e., its time complexity is constant, while for calculating the edge-edge crossing cost, the node-edge crossing cost, and the distance cost for each node, these time complexities depend on |E|, |V|, and |W|, respectively. Thus, without using Δ matrix, the time complexity related to combo scores is O (|V||W|) at each step.
At each step, we need to calculate the difference between the combo score of the previous layout L and that of the current layout generated by moving a node v to a vacant point p, i.e., $\mathrm{Combo}(T_{v \to p}L) - \mathrm{Combo}(L)$. This difference can be calculated efficiently from per-node combo scores together with a correction term $\mathrm{Adj}_v(L)$.
We introduced $\mathrm{Adj}_v(L)$ for the following reason. Suppose that three nodes with the same attribute are aligned vertically. We call them $v_\alpha$, $v_\beta$, and $v_\gamma$, beginning from the bottom. There are three combo relations among the three nodes: one between $v_\alpha$ and $v_\beta$, another between $v_\beta$ and $v_\gamma$, and the third between $v_\alpha$ and $v_\gamma$. Although $v_\beta$ is involved in all three combo relations, the combo relation between $v_\alpha$ and $v_\gamma$ is not counted in $\mathrm{Combo}_{v_\beta}(L)$. Therefore, $\mathrm{Adj}_v(L)$ is needed to correct this type of undercount.
SCCB-grid layout algorithm: extension of the search space due to the swap operation
Another drawback of CB-grid layout algorithm is that only one node can be moved to a vacant point at each step. For example, the layout shown in Figure 4(a) is optimal for CB-grid layout algorithm despite the fact the layout in Figure 4(b) should be selected as the better layout. This limitation is due to the strategy of CB-grid layout algorithm. Thus, we have devised a new algorithm by allowing the swap operations between two nodes while keeping the time complexity. With this improvement, the layout in Figure 4(a) will be arranged as shown in Figure 4(b). The new algorithm is named CCB-grid layout with the swap operation (SCCB-grid layout algorithm). The layout cost function is the same as in CCB-grid layout algorithm. However, a naïve implementation would increase the time complexity to calculate the layout cost for swapped layouts.
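The enlarged search space can be illustrated with a naïve sketch that treats both single-node moves and two-node swaps as neighbors. The function names are hypothetical, `edge_length_cost` is a stand-in cost rather than the layout cost C(L), and every neighbor is evaluated from scratch, which is precisely the overhead that an incremental update is meant to remove.

```python
from itertools import combinations


def md(p, q):
    """Manhattan distance between two grid points."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def edge_length_cost(pos, edges):
    """Stand-in cost: total Manhattan length of all edges (not the paper's C)."""
    return sum(md(pos[a], pos[b]) for a, b in edges)


def neighbours(grid, pos):
    """Yield every layout reachable by one move or one swap."""
    vacant = set(grid) - set(pos.values())
    for v in pos:                                  # move operations
        for p in vacant:
            yield {**pos, v: p}
    for a, b in combinations(pos, 2):              # swap operations
        yield {**pos, a: pos[b], b: pos[a]}


def local_search_with_swaps(grid, pos, edges, cost=edge_length_cost):
    """Take the best neighbor repeatedly until none lowers the cost."""
    pos, current = dict(pos), cost(pos, edges)
    while True:
        best = min(neighbours(grid, pos), key=lambda q: cost(q, edges), default=None)
        if best is None or cost(best, edges) >= current:
            return pos, current
        pos, current = best, cost(best, edges)
```

On a fully occupied 1 x 4 grid with edges a-c and b-d, no move is possible at all, yet a single swap halves the stand-in cost, which is the kind of situation the swap operation is meant to handle.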
In the previous approach [15], Δ matrix stores cost differences that are induced only by moving nodes to vacant points. As a result, if a grid point of interest was occupied at the previous step, we cannot exploit Δ matrix to calculate cost differences corresponding to that grid point. Since grid points of interest on the swap operation are obviously occupied at the previous step, Δ matrix cannot be used. However, if Δ matrix also stores cost differences related to occupied points, Δ matrix can be exploited for this problematic case, too. We then propose an extended Δ matrix, which considers occupied points as well as vacant points. Since the definition of the cost differences for vacant points cannot be applied directly to occupied points, we decide to calculate the cost differences for the occupied points by calculating it without taking into account the node occupying that grid point and all edges connected to it. In the remainder of this section, we will show how to calculate the extended Δ matrix and then compare the time complexity of the extended Δ matrix and the original Δ matrix.
Henceforth, let us refer to the extended Δ matrix as Δ matrix. Given a layout L, at the first step, we update Δ (L) matrix as follows:
(5)
is the following function:
If the previous layout is updated by moving node $v_\beta$ to vacant point q, Δ (L) can be updated efficiently by using Δ (L) as follows:
where DIFF0 to DIFF4 are defined in the following way:
where Q shall be defined below.
If the previous layout is updated by swapping two nodes $v_\beta$ and $v_{\beta'}$, Δ (L) is then updated efficiently by using Δ (L) as follows:
where DIFF5 to DIFF9 are defined in the following way:
The case of $v_\alpha = v_{\beta'}$ is not considered in Equation (13) because the equations for this case can be obtained by simply replacing $v_\beta$ with $v_{\beta'}$ in cases 1 and 3.
(6)
(·) and (·) in DIFF0 to DIFF9 are partial cost functions depending on the two nodes $v_a$ and $v_b$ and on the three nodes $v_a$, $v_b$, and $v_c$, respectively; they are the sums of the corresponding partial edge-edge crossing costs, node-edge crossing costs, and distance costs as follows:
where (·) and (·) are related to edge-edge crossings, while (·) and (·) are related to node-edge crossings, and (·) and (·) are related to the distance cost. The details are described as below.
-
(a)
(7)
(·) is a partial edge-edge crossing cost function of and , and is defined as follows:
Similarly, (·) is a partial edge-edge crossing cost function of , , and , and is defined as follows:
-
(b)
(8)
is a partial node-edge crossing cost function of $v_a$, $v_b$, , and , and is defined as follows:
Similarly, (·) is a partial node-edge crossing cost function of $v_a$, $v_b$, $v_c$, , , and , and is defined as follows:
-
(c)
(9)
is a partial distance cost function of $v_a$ and $v_b$, and is defined as follows:
Similarly, (·) is a partial distance cost function of $v_a$, $v_b$, and $v_c$, and is defined as follows:
Thus far, we found out a method to efficiently calculate Δ matrix. The purpose of extending Δ matrix is to calculate the cost difference of the swap operation. When nodes and are swapped, we can calculate using these Δ costs as follows:
where
In SCCB-grid layout algorithm, the combo score also needs to be considered. When a node $v_\alpha$ is moved to a vacant point p, the difference of combo scores can be calculated as shown in Equation (3). In contrast, if two nodes are swapped, the difference of combo scores between the swapped layout and L is efficiently calculated as follows:
where
A pseudo code of SCCB-grid layout algorithm is described in Figure 5.
If node $v_\beta$ is moved at the previous step, the time complexity of calculating Δ matrix is $O((|V| + |E|)|E_{v_\beta}||U|)$. If two nodes $v_\beta$ and $v_{\beta'}$ are swapped at the previous step, the time complexity of calculating Δ matrix is $O((|V| + |E|)(|E_{v_\beta}| + |E_{v_{\beta'}}|)|U|)$, which is of the same order as the single-node case with $|E_{v_\beta}|$ replaced by the average $(|E_{v_\beta}| + |E_{v_{\beta'}}|)/2$. In addition, the time complexity of all the swap operations considered at each step is $O(|E|^2)$. Therefore, the time complexity of SCCB-grid layout algorithm is $O(|E|^2 + |U||E_{v_\beta}|(|V| + |E|))$ at each step.
Since the time complexity of CB-grid layout algorithm is $O(|V|^2 + |E|^2 + |W||E_{v_\beta}|(|V| + |E|))$ at each step [15] and |U| = |V| + |W|, the time complexity of SCCB-grid layout algorithm is larger than that of CB-grid layout algorithm by $O(|V||E_{v_\beta}|(|V| + |E|))$ (note that $v_\beta$ and $v_{\beta'}$ are not distinguished here). Here, we consider two cases, |V| ≤ |W| (case 1) and |V| > |W| (case 2), and show that the two algorithms have the same time complexity with high probability. For case 1, the above difference is negligible since $O(|V||E_{v_\beta}|(|V| + |E|)) \le O(|W||E_{v_\beta}|(|V| + |E|))$. In contrast, the $O(|V||E_{v_\beta}|(|V| + |E|))$ difference cannot be neglected in case 2. However, if we assume that all nodes are equally likely to be moved to form the next layout, then $|V||E_{v_\beta}| = 2|E|$ in expectation, and consequently $O(|V||E_{v_\beta}|(|V| + |E|)) = O(|V|^2 + |E|^2)$. Therefore, the time complexity of SCCB-grid layout algorithm is the same as that of CB-grid layout algorithm even in case 2. For the above reasons, the time complexities of SCCB-grid and CB-grid layout algorithms are the same in practice.
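The step from the equal-probability assumption to the final bound can be spelled out as a short derivation. It assumes a graph without self-loops, so that every edge is counted at exactly two nodes, and it uses the expectation $\mathbb{E}[\cdot]$ over a uniformly chosen node $v_\beta$, which is notation introduced here for illustration.

```latex
\begin{align*}
\mathbb{E}\bigl[|E_{v_\beta}|\bigr]
  &= \frac{1}{|V|}\sum_{v \in V} |E_v| = \frac{2|E|}{|V|}, \\
|V|\,\mathbb{E}\bigl[|E_{v_\beta}|\bigr]\,(|V| + |E|)
  &= 2|E|\,(|V| + |E|) = 2|V||E| + 2|E|^2 \\
  &\le |V|^2 + 3|E|^2 = O(|V|^2 + |E|^2).
\end{align*}
```

The inequality in the last line uses $2|V||E| \le |V|^2 + |E|^2$, so the extra term is absorbed into the $O(|V|^2 + |E|^2)$ part of the CB-grid bound.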