The cost function and the weight matrix
A network layout is a configuration of the nodes and edges properly placed on a 2D plane. Generally, all nodes are represented as points without regard to their sizes and all edges are drawn as straight lines. Under such a drawing convention, a layout is fully described by the nodes' coordinates, denoted by R = (r_1, r_2, ..., r_n), where n is the number of nodes and r_i = (x_i, y_i) are the coordinates of node i. Because nodes are placed on grid points, all x_i and y_i are forced to be integers.
To determine the coordinates, we use a widely adopted approach that treats nodes as interacting particles. Layout quality is evaluated by a cost function defined as the total interaction energy over all pairs of nodes, with lower costs corresponding to better layouts. Following Li and Kurata [7], we use the cost function given by
where w_ij is the interaction weight of nodes i and j, which describes how the two nodes interact. The weights of all node pairs constitute the weight matrix. The term d_ij is the Manhattan distance between nodes i and j. For a detailed explanation of the design principles of the cost function, please refer to Ref. [7].
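As an illustration of how such a cost can be evaluated, the following minimal sketch assumes that the total interaction energy is the sum of w_ij * d_ij over all unordered node pairs. This form is an assumption made here for illustration; the exact expression is given in Ref. [7]. The names manhattan and layout_cost, and the example coordinates and weights, are hypothetical.

```python
from itertools import combinations

def manhattan(a, b):
    """Manhattan distance between two grid points a = (x, y) and b = (x, y)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def layout_cost(coords, weights):
    """Assumed cost: sum of w_ij * d_ij over all unordered pairs i < j.

    coords  -- list of integer (x, y) tuples, one per node
    weights -- symmetric n x n matrix (list of lists) of interaction weights
    """
    return sum(weights[i][j] * manhattan(coords[i], coords[j])
               for i, j in combinations(range(len(coords)), 2))

# Illustrative three-node layout with made-up weights.
coords = [(0, 0), (1, 0), (3, 2)]
weights = [[0, 2, -1],
           [2, 0, 1],
           [-1, 1, 0]]
total = layout_cost(coords, weights)  # 2*1 + (-1)*5 + 1*4
```

Under this assumed form, a positive weight rewards placing two nodes close together and a negative weight rewards placing them far apart.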
The weight matrix can be chosen in many ways. A convenient choice is to derive it from graph distances (i.e., shortest-path lengths). Denoting by L_ij the graph distance between nodes i and j, we set w_ij = χ(L_ij), where χ is an integer-valued function. Through extensive experiments, we found three χ functions that are suitable for typical biochemical networks. The corresponding layout styles are called the common, compact, and stretched styles (Figure 1). The layout algorithm itself places no restriction on the weight matrix. Even when a predefined weight matrix is chosen, users can still modify individual weights as they wish, which makes the method flexible. For example, if two nodes are known a priori to belong to the same module and should therefore be placed close together, one may add an extra positive value to the corresponding weight. See the Results section for an example.
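The construction w_ij = χ(L_ij) can be sketched as follows. The sketch assumes an undirected, unweighted graph and uses breadth-first search for the graph distances. The function chi_example is a hypothetical placeholder and is not one of the three χ functions used for the common, compact, and stretched styles; all function names are illustrative.

```python
from collections import deque

def graph_distances(n, edges):
    """All-pairs shortest-path lengths of an unweighted, undirected graph (BFS)."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    dist = [[None] * n for _ in range(n)]  # None marks unreachable pairs
    for s in range(n):
        dist[s][s] = 0
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if dist[s][v] is None:
                    dist[s][v] = dist[s][u] + 1
                    queue.append(v)
    return dist

def chi_example(L):
    """Hypothetical integer function, NOT one of the paper's three styles."""
    if L == 1:
        return 3
    if L == 2:
        return 1
    return -1  # distant or unreachable pairs

def weight_matrix(n, edges, chi):
    """Weight matrix with w_ij = chi(L_ij) and a zero diagonal."""
    dist = graph_distances(n, edges)
    return [[0 if i == j else chi(dist[i][j]) for j in range(n)]
            for i in range(n)]

# Path graph 0-1-2-3.
w = weight_matrix(4, [(0, 1), (1, 2), (2, 3)], chi_example)

# User modification: nodes 0 and 3 are known to share a module, so an extra
# positive value is added to their weight (symmetrically).
w[0][3] += 2
w[3][0] += 2
```

Keeping χ as a separate function argument reflects the point made above: the layout algorithm only consumes the final matrix, so a different χ or a manual adjustment requires no change elsewhere.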
The layout algorithm
The layout algorithm seeks the best layout by minimizing the cost function. It can be described as follows:
Set R to a random layout
Repeat the following steps niter times
    Generate R' by perturbing R
    Locally optimize R'
    If cost(R') < cost(R), set R = R'
    (Otherwise, R remains unchanged.)
End repeat
Output R as the final result
At the beginning, a random layout R is set as the initial state. The algorithm then optimizes R through a neighborhood-test procedure that repeatedly tries to move each node to an adjacent vacant site in order to lower the cost. As the neighborhood test proceeds, the layout eventually reaches a state whose quality cannot be improved by moving any single node, i.e., the cost function attains a local minimum. To optimize R further, the layout must escape from this local minimum. For this reason, the algorithm perturbs the layout by moving each node, with a given probability p, to a randomly chosen neighboring location. The perturbed layout is then passed to the neighborhood-test procedure. After this re-optimization-after-perturbation process has been repeated sufficiently many times, the layout is expected to be satisfactory and the whole computation ends.
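The procedure described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the LucidDraw implementation: the cost is assumed to be the sum of w_ij * d_ij over node pairs, "adjacent" is taken to mean the four horizontal and vertical neighbors, the grid is a square of hypothetical side length size, perturbation moves a node only to a vacant neighboring site, and the initial random layout is locally optimized before the loop starts. All function names are hypothetical.

```python
import random
from itertools import combinations

NEIGHBOR_STEPS = [(1, 0), (-1, 0), (0, 1), (0, -1)]  # assumption: 4-neighborhood

def cost(R, w):
    """Assumed cost: sum of w_ij * Manhattan distance over pairs i < j."""
    return sum(w[i][j] * (abs(R[i][0] - R[j][0]) + abs(R[i][1] - R[j][1]))
               for i, j in combinations(range(len(R)), 2))

def vacant_neighbors(R, i, size):
    """Unoccupied grid sites adjacent to node i inside a size x size grid."""
    occupied = set(R)
    x, y = R[i]
    return [(x + dx, y + dy) for dx, dy in NEIGHBOR_STEPS
            if 0 <= x + dx < size and 0 <= y + dy < size
            and (x + dx, y + dy) not in occupied]

def local_optimize(R, w, size):
    """Neighborhood test: move single nodes until no move lowers the cost."""
    R = list(R)
    improved = True
    while improved:
        improved = False
        for i in range(len(R)):
            current = cost(R, w)
            for site in vacant_neighbors(R, i, size):
                trial = R[:i] + [site] + R[i + 1:]
                if cost(trial, w) < current:
                    R, improved = trial, True
                    break
    return R

def perturb(R, p, size, rng):
    """Move each node with probability p to a random vacant neighboring site."""
    R = list(R)
    for i in range(len(R)):
        if rng.random() < p:
            sites = vacant_neighbors(R, i, size)
            if sites:
                R[i] = rng.choice(sites)
    return R

def layout(w, size, niter=60, p=0.7, seed=0):
    """Re-optimization after perturbation, keeping only improved layouts."""
    rng = random.Random(seed)
    all_sites = [(x, y) for x in range(size) for y in range(size)]
    R = local_optimize(rng.sample(all_sites, len(w)), w, size)
    for _ in range(niter):
        candidate = local_optimize(perturb(R, p, size, rng), w, size)
        if cost(candidate, w) < cost(R, w):
            R = candidate
    return R

# Four nodes on a path: neighbors attract, the two end nodes repel.
w = [[0, 3, 1, -1], [3, 0, 3, 1], [1, 3, 0, 3], [-1, 1, 3, 0]]
R = layout(w, size=5, niter=10)
```

Because every accepted move strictly lowers the cost and the grid is finite, the neighborhood test in this sketch always terminates. The sketch recomputes the full cost for every trial move for clarity; an efficient implementation would update only the terms involving the moved node.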
An important feature of the algorithm is that it uses a simple global search strategy that relies on the perturbation probability p. When p = 0, no node is perturbed and the output of the perturbation is identical to its input. When p = 1, all nodes change their positions and the output bears little relation to the input. For 0 < p < 1, some parts of the input layout are unchanged, or "memorized". Heavy perturbations (i.e., perturbations with large p) lead to significant losses of previous optimization effort, and consequently the re-optimization demands relatively high computational expense. In practice, however, the performance is not very sensitive to p; moderate values, say 0.3-0.7, usually work well. In LucidDraw, the default value of p is 0.7.
Generally, computation speed and layout quality are largely controlled by niter, the number of iterations. A small niter is preferable for computation speed but usually results in layouts of relatively low quality. Although layout quality benefits from more iterations, a very large niter is usually unnecessary because, as the optimization proceeds, better layouts become harder and harder to obtain by re-optimization after perturbation. To balance effort and gain, the layout process should stop when the search efficiency becomes very low. In practice, a moderate value of niter = 60 is usually enough to generate satisfactory layouts.
Computational complexity
The exact complexity of the whole layout process is difficult to estimate analytically. We therefore used a set of example networks to measure the time complexity empirically under the default parameter settings of the algorithm. The results are shown in Figure 2, where the fitted curve is quadratic in the number of nodes, i.e., the required time scales as O(n^2).
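One simple way to check such a scaling from timing data is to estimate the slope of log(time) against log(n); a slope close to 2 is consistent with quadratic growth. The sketch below uses this log-log regression rather than the quadratic curve fit of Figure 2, and the timings in it are synthetic placeholders generated from a formula, not measurements from this work. The function name loglog_slope is hypothetical.

```python
import math

def loglog_slope(sizes, times):
    """Least-squares slope of log(time) versus log(size)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in times]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# Synthetic placeholder timings that grow exactly quadratically with n.
sizes = [50, 100, 200, 400]
times = [2e-4 * n ** 2 for n in sizes]
slope = loglog_slope(sizes, times)  # close to 2 for quadratic growth
```

With real measurements, the slope would only approximate the exponent, and lower-order terms can bias it for small networks.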
The graphical user interface
The GUI of LucidDraw (Figure 3) is built on JGraph (http://www.jgraph.com/jgraph.html), an open source graph visualization library written in Java. Using the graphical functionality provided by JGraph, LucidDraw supports interactive operations on network drawings such as moving nodes, zooming in and out, and showing or hiding labels and edge arrows. Editing functions such as undo and redo are also available to make LucidDraw more user-friendly. To make LucidDraw easy to use in the MATLAB environment, we developed a second, simple GUI (Figure 4) that gives users an intuitive way to manipulate the input network data and to change the parameters of the layout algorithm.
Treatment of node labels
Node labels are necessary for understanding network structures shown graphically. Displaying labels appropriately is not trivial: in drawings of large biochemical networks, room for labels is limited, and careless label placement usually adds visual complexity. Showing all labels simultaneously is usually unsatisfactory because labels overlap with one another and with nodes. Barsky et al. [9] use a greedy method to select as many labels as possible for display without label overlaps, with the advantage that more labels are shown at higher zoom levels.
In this work we use three kinds of labels to keep the desired node information readable without adding much visual complexity. The first kind is the engraved label, which is shown within the node symbol if the space is large enough. The second kind is the floating label, which is shown automatically when the mouse pointer hovers over a node and disappears when the pointer moves away. The third kind is the mandatory label, which is shown statically for a right-clicked node and stays displayed until the zoom level is changed or the "clear labels" button is pressed.
The display of engraved labels depends on the zoom level. At higher zoom levels, node symbols become larger and offer more interior space for longer node names, so more node names appear as engraved labels. Engraved labels save space but are limited by the node size, so nodes with long names cannot be labeled this way at relatively low zoom levels. Floating labels make up for this deficiency, and they do not overlap with other nodes. Mandatory labels are useful when several nodes of interest have long names that cannot be displayed simultaneously at the current zoom level by engraved or floating labels. Please see Figure 5 for examples of the three kinds of labels.
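The zoom dependence of engraved labels can be illustrated with a simple fit test. The sketch below assumes that the node symbol width scales linearly with the zoom level while the label font keeps a fixed on-screen size, and it approximates the label width by a constant width per character. The function name and the values of base_node_width and char_width are hypothetical and are not LucidDraw parameters.

```python
def engraved_labels(names, zoom, base_node_width=20.0, char_width=7.0):
    """Return the node names that fit inside their node symbol at this zoom.

    Assumptions: node width = base_node_width * zoom (in pixels), and each
    character of a label occupies char_width pixels regardless of zoom.
    """
    node_width = base_node_width * zoom
    return [name for name in names if len(name) * char_width <= node_width]

names = ["ATP", "NADH", "Glucose-6-phosphate"]
low = engraved_labels(names, zoom=1.0)   # no name fits a 20-pixel node
mid = engraved_labels(names, zoom=1.5)   # the two short names fit
high = engraved_labels(names, zoom=7.0)  # the long name fits as well
```

Names rejected by such a test at the current zoom level are the ones for which floating or mandatory labels are needed.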