FragViz: visualization of fragmented networks
© Štajdohar et al. 2010
Received: 26 February 2010
Accepted: 22 September 2010
Published: 22 September 2010
Researchers in systems biology use network visualization to summarize the results of their analysis. Such networks often include unconnected components, which popular network alignment algorithms place arbitrarily with respect to the rest of the network. This can lead to misinterpretations due to the proximity of otherwise unrelated elements.
We propose a new network layout optimization technique called FragViz which can incorporate additional information on relations between unconnected network components. It uses a two-step approach by first arranging the nodes within each of the components and then placing the components so that their proximity in the network corresponds to their relatedness. In the experimental study with the leukemia gene networks we demonstrate that FragViz can obtain network layouts which are more interpretable and hold additional information that could not be exposed using classical network layout optimization algorithms.
Network visualization relies on computational techniques for proper placement of objects under consideration. These algorithms need to be fast so that they can be incorporated in responsive interfaces required by the explorative data analysis environments. Our layout optimization technique FragViz meets these requirements and specifically addresses the visualization of fragmented networks, for which standard algorithms do not consider similarities between unconnected components. The experiments confirmed the claims on speed and accuracy of the proposed solution.
From the onset of systems biology, visualization of networks has played a key role in communicating the relations between objects of interest and the structure of the problem domain. Gene networks, protein interactions [2, 3], synergistic relations between SNPs, gene-based disease similarities, enzymatic relations and metabolic processes are just a few examples of domains where visualization of networks can aid in understanding the layout of the biological systems. The interest in this area has sparked the development of a large variety of software tools and approaches that deal with network layout optimization, data integration, interactive exploration of the networks and data analytics.
Formally, a network is a graph which consists of vertices (nodes) linked by edges. In systems biology, vertices can represent genes, proteins, metabolites, diseases, or other objects of interest. Edges abstract the relations between these objects.
The network often consists of a large number of unconnected components, like the recently published yeast protein interaction network and a drug similarity network with 160 and 240 unconnected components, respectively. Classical network layout techniques such as the Fruchterman-Reingold, Kamada-Kawai and Frick et al. algorithms arrange unconnected components arbitrarily, which can wrongly suggest a relation between otherwise unrelated components.
In this paper we introduce a generally applicable technique called FragViz for placing the components according to the background data on their similarity. For example, rendering the network from Figure 1a with our algorithm yields the layout in Figure 1b, from which we can infer that there is a relation between blvra-hmox-blvrb and alas1-urod. These are indeed correct relations, as all of the mentioned genes have a function in heme metabolism. Notice that in the same visualization the components dars-aars and eif2s2-eif2s1 are close to each other, and all genes comprising them participate in protein translation. To render this network FragViz used additional information on mutual similarity between network nodes. Considering this additional information can improve the placement of unconnected components and expose additional information, thus avoiding misinterpretations based on the proximity of arbitrarily placed components.
FragViz uses a two-step network layout optimization procedure. It first applies the standard Fruchterman-Reingold algorithm separately on each unconnected component to optimize the layout of its vertices. Then it optimizes the global placement and orientation of components using a semi-physical model where the forces between components are inferred from similarities between the corresponding vertices in these components.
The data on similarity of the network nodes can either come from the same data source used to infer the structure of the network, or can be provided by supplying any additional information. Most often, the network's structure itself is derived from the scored relations between objects (e.g., the correlation in expression of two genes, the degree of SNP synergy in phenotype prediction, the number of disorder-specific genes shared by two diseases). Edges then connect pairs of vertices for which the corresponding score exceeds some user-defined threshold. In such cases, the node pair similarity scores can be used as additional data for our procedure. If relations in the graph are not obtained by imposing thresholds on numerical data, other data sources can be used to describe the vertex similarities. For instance, in the experimental study reported in the paper we show a protein-protein interaction network in which the vertex similarities are computed based on the biological function of the proteins.
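The thresholding step can be illustrated with a minimal sketch. It assumes the scores are stored in a symmetric matrix; the function names and the toy similarity values are hypothetical and serve only to show how edges and unconnected components arise from a threshold.

```python
from itertools import combinations

def threshold_graph(similarity, threshold):
    """Return edges (i, j) whose similarity score exceeds the threshold."""
    n = len(similarity)
    return [(i, j) for i, j in combinations(range(n), 2)
            if similarity[i][j] > threshold]

def connected_components(n, edges):
    """Group vertices 0..n-1 into connected components with union-find."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in edges:
        parent[find(i)] = find(j)

    groups = {}
    for v in range(n):
        groups.setdefault(find(v), []).append(v)
    return sorted(groups.values())

# Hypothetical similarity scores for five genes.
S = [
    [1.0, 0.9, 0.2, 0.1, 0.3],
    [0.9, 1.0, 0.1, 0.2, 0.4],
    [0.2, 0.1, 1.0, 0.8, 0.1],
    [0.1, 0.2, 0.8, 1.0, 0.2],
    [0.3, 0.4, 0.1, 0.2, 1.0],
]
edges = threshold_graph(S, 0.7)
components = connected_components(len(S), edges)
```

The scores below the threshold are not discarded: the same matrix `S` can later serve as the vertex similarity data used to place the components.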
The proposed approach belongs to the family of algorithms for force-directed placement of objects into two-dimensional projections, and is strongly related to two kinds of algorithms: the optimization of network layout and multidimensional scaling (MDS) [15, 16]. Network layout algorithms typically consider undirected graphs and optimize their layouts so that the pairs of connected vertices are placed closer to each other than to other vertices. If graph edges are weighted, shorter distances in the layout indicate stronger relations between the objects represented by vertices. Multidimensional scaling considers an input matrix of object dissimilarities. It represents objects with points in the Euclidean plane, and optimizes their placement so that the plotted distances match the dissimilarities as accurately as possible.
The two kinds of algorithms are related. It is possible to lay out a network by representing it with a distance matrix and performing MDS-based optimization. Or, vice versa, we can convert a distance matrix into a weighted complete graph and use a graph layout optimization in place of MDS. The optimizations would yield different results, as each of the methods optimizes its own stress function, designed to match the goals of the particular projection. For instance, in network layout optimization, the projected distance between unconnected vertices has no effect as long as it is large in comparison with the distances between the connected vertices. In contrast, MDS optimizes distances between all pairs of objects, including the most distant.
With regard to the optimization procedure, algorithms make assumptions about the structure of the data. Network layout algorithms work best for graphs in which most vertices have only a small number of neighbors. MDS, on the contrary, considers distances between all pairs of objects, a data structure that can be represented with a complete weighted graph. Force-directed network layout optimization algorithms do not work well on densely connected graphs. The time complexity becomes prohibitive, and the optimization may get trapped in a local optimum. In contrast, MDS is inapplicable to data with a large number of objects due to space complexity (prohibitively large distance matrix), whereas the Fruchterman-Reingold algorithm might still be useful if the number of edges is small enough.
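To make the contrast concrete, the following sketch computes one common form of the MDS objective, the raw stress, which sums the squared mismatch over all pairs of objects. This particular form is an assumption chosen for illustration; individual MDS variants normalize or weight the terms differently.

```python
import math

def raw_stress(points, dissimilarity):
    """Sum of squared differences between plotted and given distances, over all pairs."""
    total = 0.0
    n = len(points)
    for i in range(n):
        for j in range(i + 1, n):
            plotted = math.dist(points[i], points[j])
            total += (plotted - dissimilarity[i][j]) ** 2
    return total

# A 3-4-5 triangle reproduces these dissimilarities exactly, so its stress is zero.
delta = [[0, 3, 4], [3, 0, 5], [4, 5, 0]]
perfect = raw_stress([(0, 0), (3, 0), (0, 4)], delta)
# Moving the third point changes two plotted distances and raises the stress.
worse = raw_stress([(0, 0), (3, 0), (0, 8)], delta)
```

Every pair contributes to the sum, including the most distant ones, whereas a network layout objective stops penalizing unconnected vertices once they are far enough apart.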
There are a number of algorithms that use the metaphors from either network layout algorithms or MDS or both, trying to adapt each one for a particular data structure or heuristically improve runtime performance. Clustered graphs, for one, include groups of vertices that are related to each other. Clusters can be determined by observing the density of mutual connections between vertices or they can be based on data describing the vertices. Various algorithms have been designed that can detect such clusters [18–20]. Eades et al.  proposed a method for plotting clustered graphs, which models them in terms of four layers representing the entire graph, clusters, abridgments and pictures (groups of points shown in a particular projection). A corresponding model includes forces between connected vertices, between all vertices in each cluster, and between meta-vertices representing entire clusters. The performance of MDS can be improved by various heuristics. Morrison et al. , for instance, propose an algorithm which first projects a sample of points, then interpolates the remaining points between their positions, and finally fine-tunes the projection using a force model. These and similar methods can be used to speed up the layout optimization, increase the readability of the graph and construct user interfaces for interactive graph exploration. A complete survey of information visualization methods that focus on graph visualization techniques, can be found in .
The method described in this paper, FragViz, is a representative of context-specific methods for layout optimization. Unlike other methods we have reviewed in this section, it specifically addresses the layout optimization for graphs consisting of isolated components, which are given in advance and represent meaningful entities, such as groups of genes related to a particular process. The components, in turn, need to be considered jointly, based on their mutual relations which may stem from individual relations between member vertices. The natural approach that deals with this particular data structure is to first optimize the layout of each component independently, and then optimize the position and rotation of the components. We achieve this by combination of network layout and MDS-based algorithms. Notice that, as further addressed in the Discussion, other, perhaps more straightforward adaptations of existing approaches could address such data, but perform worse both in terms of runtime and quality of the resulting layout.
The input to FragViz is a list of network components and a matrix of (dis)similarities between the network's vertices. FragViz first uses a network layout optimization technique, such as the Fruchterman-Reingold algorithm, to determine the placement of vertices within each of the connected components. Then, it finds a placement of components which reflects their mutual similarities. It is this second step that is the original contribution of our method, and we describe it in detail below.
Formally, we are given a graph $G = (V, E)$ that consists of $p$ disjoint components and a $|V| \times |V|$ dissimilarity matrix $D$. The internal layout of each component $V_k$ is fixed and given by the positions of its vertices inside its own fixed coordinate system. We will denote the position of vertex $v_i$ by $\mathbf{v}_i$. We also assume that the internal coordinate systems are centered, i.e. $\sum_{v_i \in V_k} \mathbf{v}_i = \mathbf{0}$ for each component $V_k$. The task is to find the placement $\mathbf{c}_k$ and orientation $\phi_k$ of the coordinate systems for all components, which reflect the given dissimilarities $D$.
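The role of the placement and the orientation can be sketched as a rigid motion applied to a component's centered internal coordinates. The sketch assumes the usual convention of rotating by the orientation angle about the component center and then translating; the function names `center` and `place` are hypothetical.

```python
import math

def center(positions):
    """Shift a component's local coordinates so that their mean is the origin."""
    n = len(positions)
    mean_x = sum(x for x, _ in positions) / n
    mean_y = sum(y for _, y in positions) / n
    return [(x - mean_x, y - mean_y) for x, y in positions]

def place(positions, c, phi):
    """Rotate centered local coordinates by phi and translate them to c."""
    cos_p, sin_p = math.cos(phi), math.sin(phi)
    return [(c[0] + cos_p * x - sin_p * y,
             c[1] + sin_p * x + cos_p * y) for x, y in positions]

# A two-vertex component, centered and then placed at (5, 5) after a quarter turn.
local = center([(0.0, 0.0), (2.0, 0.0)])
placed = place(local, (5.0, 5.0), math.pi / 2)
```

Because the internal layout is fixed, each component contributes only three free parameters to the optimization (two for its center and one for its angle), regardless of how many vertices it contains.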
We will base the method on a physical metaphor. Imagine each component as a board with vertices as pegs. Pegs from different components are connected with springs of different lengths corresponding to the given dissimilarities D. The nature (or, in our case, a computer simulation) "optimizes" the system by finding the lowest energy configuration of the boards (components).
where $k$ is such that $v_i \in V_k$.
These equations allow for a computer simulation of the physical process. Starting from a random placement of components, we iteratively compute the forces $F_i$, and move and rotate the components accordingly until the system reaches an optimum in which all $F_i$ are negligible.
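One iteration of such a simulation might look like the sketch below. It assumes a linear (Hooke) spring whose rest length is the given dissimilarity, a net force averaged over the pegs of a board, and a torque divided by the board's moment of inertia; these choices, the step size and all names are illustrative assumptions, and the force law used by FragViz may differ.

```python
import math

def to_global(local, centers, angles):
    """Map each component's centered local coordinates into the shared plane."""
    placed = []
    for points, (cx, cy), phi in zip(local, centers, angles):
        c, s = math.cos(phi), math.sin(phi)
        placed.append([(cx + c * x - s * y, cy + s * x + c * y) for x, y in points])
    return placed

def simulation_step(local, ids, centers, angles, D, step=0.05):
    """Move every component along its net force and turn it by its net torque."""
    placed = to_global(local, centers, angles)
    new_centers, new_angles = [], []
    for k, pegs in enumerate(placed):
        fx = fy = torque = inertia = 0.0
        for a, (px, py) in enumerate(pegs):
            rx, ry = px - centers[k][0], py - centers[k][1]
            inertia += rx * rx + ry * ry
            for l, others in enumerate(placed):
                if l == k:
                    continue  # pegs on the same board are not connected by springs
                for b, (qx, qy) in enumerate(others):
                    dx, dy = qx - px, qy - py
                    dist = math.hypot(dx, dy) or 1e-9
                    # Assumed Hooke spring: pulls when stretched, pushes when compressed.
                    pull = dist - D[ids[k][a]][ids[l][b]]
                    gx, gy = pull * dx / dist, pull * dy / dist
                    fx += gx
                    fy += gy
                    torque += rx * gy - ry * gx
        n = len(pegs)
        new_centers.append((centers[k][0] + step * fx / n,
                            centers[k][1] + step * fy / n))
        turn = step * torque / inertia if inertia > 0 else 0.0
        new_angles.append(angles[k] + turn)
    return new_centers, new_angles

# Two single-vertex components that should end up 2 units apart.
local = [[(0.0, 0.0)], [(0.0, 0.0)]]
ids = [[0], [1]]                      # ids[k][a] is the row of D for peg a of board k
D = [[0.0, 2.0], [2.0, 0.0]]
centers, angles = [(0.0, 0.0), (5.0, 0.0)], [0.0, 0.0]
for _ in range(200):
    centers, angles = simulation_step(local, ids, centers, angles, D)
```

Each step visits every pair of vertices from different components, which is the source of the cost discussed next.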
Computer simulation of the system described above is rather slow. We can speed it up by first computing the positions of components and then rotating them in place. The result is only approximately optimal with regard to the total stress (3), yet we will experimentally show that the difference is negligible.
The task is then to find the positions in a two-dimensional plane in which the distance between every pair of component centers $\mathbf{c}_k$ and $\mathbf{c}_l$ matches the given $\delta_{kl}$ as closely as possible. This approach is much faster than the simulation from the previous section, since the computation of all pairwise distances at each step of the optimization is replaced by a single such computation in (10). This translates the problem of placing the components into the familiar multidimensional scaling (MDS) problem. There exist many efficient solutions of MDS, such as, for instance, SMACOF, which optimizes the overall energy of the system without computing its gradient, the force (3).
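The placement of component centers can be sketched with a plain, unweighted SMACOF iteration (the Guttman transform). The component dissimilarity is assumed here to be the mean dissimilarity over all cross-component vertex pairs, which is an illustrative choice rather than a restatement of (10); the function names are hypothetical.

```python
import math
import random

def component_dissimilarity(D, comp_a, comp_b):
    """Assumed definition: mean dissimilarity over all cross-component vertex pairs."""
    total = sum(D[i][j] for i in comp_a for j in comp_b)
    return total / (len(comp_a) * len(comp_b))

def smacof(delta, iterations=300, seed=0):
    """Place n points in the plane by repeated unweighted Guttman transforms."""
    rng = random.Random(seed)
    n = len(delta)
    X = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(n)]
    for _ in range(iterations):
        new_X = []
        for i in range(n):
            sx = sy = 0.0
            for j in range(n):
                if j == i:
                    continue
                dx, dy = X[i][0] - X[j][0], X[i][1] - X[j][1]
                d = math.hypot(dx, dy)
                if d > 0:
                    # Unit vector from j to i, scaled by the target distance.
                    sx += delta[i][j] * dx / d
                    sy += delta[i][j] * dy / d
            new_X.append((sx / n, sy / n))
        X = new_X
    return X

# Three components whose target distances form a 3-4-5 triangle.
delta = [[0, 3, 4], [3, 0, 5], [4, 5, 0]]
centers = smacof(delta)
```

Each update needs only the current distances between the component centers, so its cost depends on the number of components rather than on the number of vertices.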
For rotation of components we use the original vertex-wise definition of force (3) computed in the scaled coordinate system (11). We apply the same procedure as in the exact simulation, except that we only compute the rotation without the translation. To avoid ending up in local minima, we use simulated annealing where the component can also rotate in the "wrong direction", with the probability of doing so decreasing with time. Although this optimization recomputes the pairwise distances between all vertices at each step, it is not overly time consuming since it requires only a small number of iterations.
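The annealed choice of rotation direction can be sketched as follows. The linear cooling schedule, the shrinking step and the parameter names (`max_turn`, `p_start`) are assumptions made for illustration; the text specifies only that the probability of a move in the "wrong direction" decreases with time.

```python
import random

def annealed_turn(torque, iteration, n_iterations, rng, max_turn=0.1, p_start=0.3):
    """Return a signed rotation increment for one component."""
    progress = iteration / n_iterations
    direction = (torque > 0) - (torque < 0)   # sign of the torque: -1, 0 or 1
    p_wrong = p_start * (1.0 - progress)      # assumed linear cooling schedule
    if rng.random() < p_wrong:
        direction = -direction                # occasional move in the "wrong direction"
    return direction * max_turn * (1.0 - progress)

rng = random.Random(0)
turns = [annealed_turn(1.0, t, 100, rng) for t in range(100)]
```

Early iterations therefore mix in some turns against the torque, while late iterations follow it almost deterministically with ever smaller steps.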
In the remainder of the paper we only show layouts optimized by the approximate method.
The performance of the proposed algorithm was assessed on four different networks (N1, N2.1, N2.2 and N3) showing relations between genes which were most differentially expressed in the leukemia gene expression data set. The original data set includes 4,860 genes whose expression was measured using DNA microarrays in 72 tissue samples classified either as acute lymphoblastic leukemia (ALL, 47 samples) or acute myeloid leukemia (AML, 25 samples). For N1, N2.1 and N2.2 we selected 1,025 differentially expressed genes with expression levels significantly smaller or larger (p-value < 0.01) according to Student's t-statistic with respect to the null distribution of the statistic. The null distribution was obtained by randomly permuting the class labels and calculating the t-statistic for all the genes. Network N3 was built with 131 out of the 4,860 originally measured genes for which information on their protein interactions was available in the MIPS mammalian protein-protein interaction database. In the visualizations in the paper, genes represented with solid circles were significantly over-expressed in the ALL samples and genes shown as hollow circles had higher expression in the AML samples.
N1 - biological function similarity score: the similarity of genes relates to their biological functions and was calculated based on their membership in canonical biological pathways using the Jaccard index. The information on the membership of genes in biological pathways was acquired from the Molecular Signature Database (C2 collection, canonical pathways). Figure 2 shows the network where the similarity threshold was set to 0.7 and all the unconnected genes were ignored.
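The permutation-based selection can be illustrated with a simplified sketch. It assumes a pooled-variance two-sample t-statistic and builds the null distribution for a single gene, whereas the procedure described above pools the permuted statistics over all genes; the function names and the toy expression values are hypothetical.

```python
import math
import random
import statistics

def t_statistic(a, b):
    """Student's two-sample t-statistic with pooled variance."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * statistics.variance(a)
              + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))

def permutation_p_value(values, labels, n_permutations=1000, seed=0):
    """Two-sided p-value of |t| against a null built by shuffling class labels."""
    rng = random.Random(seed)

    def abs_t(current_labels):
        all_group = [v for v, lab in zip(values, current_labels) if lab == "ALL"]
        aml_group = [v for v, lab in zip(values, current_labels) if lab == "AML"]
        return abs(t_statistic(all_group, aml_group))

    observed = abs_t(labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if abs_t(shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_permutations + 1)

# Hypothetical expression values of one gene in ten samples.
values = [1.0, 1.1, 0.9, 1.2, 0.8, 5.0, 5.1, 4.9, 5.2, 4.8]
labels = ["ALL"] * 5 + ["AML"] * 5
p = permutation_p_value(values, labels)
```

A gene would then be kept when its p-value falls below the chosen cutoff, here 0.01.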
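For reference, the Jaccard index of two genes' pathway memberships is the size of the intersection divided by the size of the union. The sketch below uses hypothetical pathway identifiers.

```python
def jaccard(a, b):
    """Jaccard index of two collections, treated as sets; 0.0 if both are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical pathway memberships of two genes.
gene_x = {"pathway_1", "pathway_2", "pathway_3"}
gene_y = {"pathway_2", "pathway_3", "pathway_4"}
score = jaccard(gene_x, gene_y)  # 2 shared pathways out of 4 distinct ones
```

With the threshold of 0.7 used for N1, this pair would not be connected by an edge, but its score could still inform the placement of the components containing the two genes.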
Basic characteristics of the networks used in experiments, describing the average local clustering coefficient and the number of vertices, edges and components
The goal of FragViz is to find the network layout in which the arrangement of components uncovers new insights on relations between them and their constituents. We evaluated the method in an experimental study that considered FragViz visualization of the leukemia gene networks N1, N2.1, N2.2 and N3. For additional assistance to the domain expert, the network components were named according to their most specific term from the biological process or molecular function aspect of Gene Ontology.
Our goal was to obtain a clear visualization relating the most important genes and their biological functions for two major types of acute leukemia, yielding insight and valuable clues about the disrupted biological processes and pathways in leukemic cells. Solid vertices in Figure 2 represent genes significantly over-expressed in the ALL samples while empty circles are genes that had higher expression in the AML samples.
FragViz allows for the exploration of biological processes related to acute myeloid and acute lymphoblastic leukemia on different levels, from specific to more general ones. In Figure 2, additional Gene Ontology terms were assigned to groups of clusters which were determined manually by the expert to elucidate the disrupted biological pathways on a more general level, as they cover a higher number of differentially expressed genes. These ontological terms apply to all the genes in the marked areas and are significantly enriched with a p-value < 0.01. The components of the graph that are close to each other have similar biological and/or molecular functions according to Gene Ontology, demonstrating similarity between the genes constituting them.
For example, the "guanylate cyclase activity", "nucleotide metabolic process", "RNA polymerase activity", and "DNA replication" components in Figure 2 all connect genes significantly over-expressed in acute lymphoblastic leukemia. All of these genes have a function in nucleotide metabolism and DNA biosynthesis. It is well known that lymphoblastic cells typically have severalfold higher activity of enzymes responsible for nucleotide metabolism, enabling excessive proliferation of transformed cells. Moreover, some of the pathways active in nucleotide metabolism, for example de novo purine synthesis (DNPS), have been recognized as important targets of antileukemic agents (e.g., methotrexate, mercaptopurine). In combination with other therapeutic agents, these drugs have improved survival of children with ALL to an overall cure rate of approximately 80 percent. The network shown in Figure 2 clearly demonstrates this characteristic of acute lymphoblastic leukemia.
The N1 and both N2 networks contain the same 1,025 differentially expressed genes from the leukemia data set. However, in N2.1 and N2.2 a combined gene distance score was used, computed from multiple biological data sources (e.g., gene expression, protein-protein interactions, biological function, ...) as proposed by Huttenhower et al. . N2.1 shows only vertices with at least one edge. N2.2 also includes isolated vertices (genes not connected to any other gene), in order to observe the similarity of all the differentially expressed genes.
As in the N1 network, most of the graph components in the N2 networks (Figures 3, 4) connect genes that are over-expressed in one of the two investigated kinds of leukemia (all genes in the component are the same color). One can observe that the genes significantly differentially expressed in the two investigated leukemias cluster together (Figures 3, 4). This reflects the well-known phenomenon that not only individual genes, but whole processes and pathways are disrupted in cancer cells. In Figure 4, the empty circles (AML) are clustered in the right part of the graph and the solid ones (ALL) in the left part, again demonstrating that gene expression in cancer tissues is disrupted at the level of pathways and processes.
For example, the genes in components "spliceosomal snRNP biogenesis", "tRNA aminoacylation for protein translation", "sequence-specific DNA binding" and the nearest genes in the component "protein binding" participate in processes of cell proliferation. All these genes have higher expression in ALL samples. Excessive cell proliferation is a characteristic of all leukemic cells. However, previous studies [36, 37] have shown that the proliferative index of ALL cells is significantly higher compared to AML cells.
Since the distance information is used to adjust the position of unconnected components, the layout allows for the exploration of the data on different levels, using genes from a single component or from clusters of biologically related components.
The placement of unconnected components in a fragmented network can be optimized using the vertex distance information from a source other than that used in the inference of network structure. For example, the N3 network (Figure 5) shows the protein-protein interactions for the leukemia genes from the MIPS database. The network is fragmented into many smaller unconnected components. We used the biological function similarity score for calculating the similarities between the components and optimizing the network layout.
Several gene products (proteins) that lie close to each other in the FragViz optimized network (Figure 5) do in fact interact, as reported in the Human Protein Reference Database (HPRD), a different public repository that stores experimentally identified protein-protein interactions. For example, in HPRD, the protein Integrin beta 3 (itgb3) interacts with the protein Integrin beta 1 (itgb1). Also, the proteins Poly A polymerase alpha (papola) and smad3 both interact with the protein smad2. According to HPRD, protein interactions also exist among proteins in the component il4r-htatip and the near-lying component in the optimized layout. To outline them in the network, the vertices that correspond to these proteins (in Figure 5) are labeled accordingly. While our goal was not to use network layout optimization for protein interaction prediction, the cases mentioned here demonstrate the potential utility of different data sources in network layout optimization.
We added an optional component similarity visualization to the network. The similarity between network components is visualized by blue lines in Figure 5. Each component is connected to its two most similar components, and the line width represents the magnitude of the similarity. In Figure 5, most connected components are placed close to each other. However, in a few cases similar components are positioned apart. Besides the technical problem - the optimization getting stuck in a local optimum - this may happen when two components belonging to different clusters of components nevertheless share a common function, or when some component essentially belongs to two clusters. For example, genes in the component "G1/S transition of mitotic cell cycle" influence gene expression, as do most of the genes in the nearest cluster of components. The same component, however, also participates in the apoptotic pathway, which is reflected in its connection with the "phosphoinositide 3-kinase cascade" component, a representative of components related to the apoptotic processes.
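Selecting the lines to draw amounts to a top-2 query per component, as in the following sketch. The component similarity matrix and the function name are hypothetical; the returned similarity value is what would be mapped to line width.

```python
def most_similar_links(similarity, k=2):
    """For each component, list its k most similar other components with their scores."""
    links = {}
    for a, row in enumerate(similarity):
        others = [b for b in range(len(row)) if b != a]
        others.sort(key=lambda b: row[b], reverse=True)
        links[a] = [(b, row[b]) for b in others[:k]]
    return links

# Hypothetical similarities between four components.
C = [
    [1.0, 0.8, 0.1, 0.5],
    [0.8, 1.0, 0.3, 0.2],
    [0.1, 0.3, 1.0, 0.9],
    [0.5, 0.2, 0.9, 1.0],
]
links = most_similar_links(C)
```

The relation is not symmetric: a component may appear among the two nearest neighbors of another one without the reverse being true, so some components end up with more than two lines.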
Average layout optimization time in seconds for all four networks
All measurements have been conducted on a desktop PC, with Intel Core 2 Duo 2.20GHz processor and 4 GB of RAM, using the 64-bit Windows 7 OS. The results represent an average over 10,000 runs of the algorithms on the N1-N3 networks, starting from random positions of vertices.
The Fruchterman-Reingold algorithm is by far the fastest, but it uses less data than the others and the resulting projections are much less informative. Running times of Eades 2 are comparable to those of Fruchterman-Reingold. This was expected, as both approaches run on a similar graph. Eades 1 employs a complete graph, which makes it much slower. On large networks, Eades 1 (N2.2) is even slower than MDS. The running times of FragViz simulation are similar to those of MDS, which is also expected. The approximate method runs much faster, except for the large network N2.2, where most vertices are unconnected, which essentially translates the visualization problem to MDS.
Pearson's correlation between elements of the gene distance matrix and the Euclidean distance between the corresponding vertices in the two-dimensional network layout
For all four networks, the correlation coefficients of the two FragViz algorithms are very similar. The correlation was always lower with the FR algorithm and, for three out of four networks, the highest correlation was obtained with MDS. In one of the compared networks (N2.2) MDS performed slightly worse than the approximate method, suggesting that MDS got trapped in a local minimum. As expected, when the vertices were placed arbitrarily in the graph, the correlation between the positions of vertices in the graph and their actual distances was close to 0.
Clustered graph approaches (Eades 1 and Eades 2) are in general faster than FragViz, but performed worse in terms of layout quality. Eades 2 performed better than Eades 1 on smaller graphs (N1, N2.1 and N3), whereas Eades 1 had a high correlation for the large network (N2.2). However, the Eades 1 approach is not appropriate for analyzing large fragmented networks, as it is prohibitively slow.
Note that the compared algorithms pursue different goals. The tests were run on data suitable for the method presented in this paper, while in other contexts another method could give better results. In particular, clustered graph methods could not be directly applied to the original data, so their results depend on the proposed transformation of the original problem.
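The quality score can be computed as in the sketch below, which correlates the upper-triangle elements of the distance matrix with the Euclidean distances between the corresponding vertices in the layout. The function names and the toy data are hypothetical.

```python
import math
from itertools import combinations

def pearson(xs, ys):
    """Pearson's correlation coefficient of two equally long sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def layout_correlation(D, positions):
    """Correlate given distances with plotted distances over all vertex pairs."""
    pairs = list(combinations(range(len(positions)), 2))
    given = [D[i][j] for i, j in pairs]
    plotted = [math.dist(positions[i], positions[j]) for i, j in pairs]
    return pearson(given, plotted)

# A layout whose plotted distances (3, 4, 5) are exactly half the given ones.
D = [[0, 6, 8], [6, 0, 10], [8, 10, 0]]
r = layout_correlation(D, [(0, 0), (3, 0), (0, 4)])
```

Because Pearson's correlation is invariant to scaling, the score rewards layouts that preserve the relative distances even when the overall scale of the drawing differs from that of the distance matrix.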
FragViz and FR algorithms are equivalent when the network consists of only one component (threshold values lower than 0.1). For the FR algorithm, the correlation decreases when the network gets more fragmented. However, when the fragmentation increases (threshold value greater than 0.2), the correlation score of the FragViz algorithm increases and rises above the best score obtained by the FR algorithm. Correlation for MDS does not depend on the threshold.
Besides the projection quality issues, FragViz is also faster than the above approaches since it splits the optimization problem into a set of much smaller problems, laying out small individual components and then arranging a small number of components instead of all vertices at once. Using the graph layout optimization algorithms instead of FragViz, as described above, would be slower since these algorithms do not perform well on complete graphs. For MDS, to get similar running times as FragViz, one needs to employ fast heuristic MDS algorithms, which gain speed by somewhat compromising the quality of the projection .
Figures 7d and 7e show some shortcomings of cluster based approaches on this particular data. When the problem is transformed so that the cluster structure is defined by graph components and applied over the complete graph (Eades 1), the optimization is more likely to end up in a local minimum due to a higher number of forces involved. In Figure 7d, we cannot spot any regions containing mostly solid or empty vertices, as opposed to Figures 7a and 7b by FragViz and by Fruchterman-Reingold algorithms. This may also be one of the reasons behind the worse Pearson correlations of this approach in general (Table 2). The second way in which we used cluster based layout optimization, Eades 2, gives better correlations and running times, yet the resulting layouts are visually unsatisfactory: the cluster structure does not correspond exactly to the graph components, so the vertices belonging to the same component may be pulled apart since they ended up in different clusters. We were unable to alleviate this problem by tweaking the parameters of the method.
We have recently witnessed the emergence of large repositories of biomedical research and clinical data. Methods are needed that would allow the domain experts to sieve through these data sets to gather information, reason on the hidden patterns and form plausible hypotheses to be tested in subsequent studies. Here, visualization combined with visual data analytics plays a major role, as it can reveal the data patterns and allow the experts to explore the data.
Visualizations require the development of dedicated algorithms that craft the proper placement of the objects under consideration. Explorative data analysis requires these to be fast, so that responsive interfaces can be constructed. We have developed a layout optimization technique, FragViz, that meets these requirements and specifically addresses the visualization of fragmented networks, where standard algorithms do not consider similarities between unconnected components.
FragViz is neither faster than all existing algorithms nor more accurate in terms of the match between the given and the projected distances. FragViz is slower than the Fruchterman-Reingold algorithm, which is a direct consequence of considering more information. The resulting vertex distances may match the given distance matrix worse than in multidimensional scaling, a consequence of fixing the layout of the components. This is a matter of design decision: the goal of FragViz is to provide a sensible local picture and a global overview, hence the two-level optimization. It can happen, for instance, that in a chain-like component the two vertices at its ends are weakly related to a common third vertex not belonging to the component. While other layout optimization algorithms would bend the chain, FragViz keeps it straight. Our experiments confirmed the usefulness of the proposed solution. The case study on the leukemia gene networks shows that the derived visualizations may be helpful in uncovering the relations between the components.
The data, networks, their visualizations, and the implementation of the described methods in the open-source analytics framework Orange are available on the supplementary web page at http://www.ailab.si/supp/fragviz. An online network optimization web application is available at http://www.ailab.si/fragviz.
Project home page: http://www.ailab.si/orange
Operating system: Platform independent
Programming language: Python, C++
Other requirements: PyQt, PyQwt, Numpy
License: GNU GPL
Any restrictions on use by non-academics: none
This work was supported by grants from the Slovenian Research Agency (P2-0209, J2-9699, L2-1112).
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.