Extracting the abstraction pyramid from complex networks
© Cheng and Hu. 2010
Received: 26 February 2010
Accepted: 3 August 2010
Published: 3 August 2010
At present, the organization of system modules is typically limited to either a multilevel hierarchy that describes the "vertical" relationships between modules at different levels (e.g., module A at level two is included in module B at level one), or a single-level graph that represents the "horizontal" relationships among modules (e.g., genetic interactions between module A and module B). Both types of organizations fail to provide a broader and deeper view of the complex systems that arise from an integration of vertical and horizontal relationships.
We propose a complex network analysis tool, Pyramabs, which was developed to integrate vertical and horizontal relationships and extract information at various granularities to create a pyramid from a complex system of interacting objects. The pyramid depicts the nested structure implied in a complex system, and shows the vertical relationships between abstract networks at different levels. In addition, at each level the abstract network of modules, which are connected by weighted links, represents the modules' horizontal relationships. We first tested Pyramabs on hierarchical random networks to verify its ability to find the module organization pre-embedded in the networks. We later tested it on a protein-protein interaction (PPI) network and a metabolic network. According to Gene Ontology (GO) and the Kyoto Encyclopedia of Genes and Genomes (KEGG), the vertical relationships identified from the PPI and metabolic pathways correctly characterized the inclusion (i.e., part-of) relationship, and the horizontal relationships provided a good indication of the functional closeness between modules. Our experiments with Pyramabs demonstrated its ability to perform knowledge mining in complex systems.
Networks are a flexible and convenient method of representing interactions in a complex system, and an increasing amount of information in real-world situations is described by complex networks. We considered the analysis of a complex network as an iterative process for extracting meaningful information at multiple granularities from a system of interacting objects. The quality of the interpretation of the networks depends on the completeness and expressiveness of the extracted knowledge representations. Pyramabs was designed to interpret a complex network through a disclosure of a pyramid of abstractions. The abstraction pyramid is a new knowledge representation that combines vertical and horizontal viewpoints at different degrees of abstraction. Interpretations in this form are more accurate and more meaningful than multilevel dendrograms or single-level graphs. Pyramabs can be accessed at http://188.8.131.52/pyramabs.php/.
Networks provide a natural representation for the complex interactions of heterogeneous entities in complex systems. Many complex networks have been studied in recent years, for example in the fields of biology, sociology, and ecology [1–4]. As high-throughput techniques have advanced, biological networks have become increasingly complex, and it has become more challenging to interpret them accurately and clearly by extracting and representing the knowledge embedded in the networks.
Our approach, named Pyramabs (Pyramid of abstractions), identifies the modules and simultaneously constructs the pyramid based on the network topology. Prior domain knowledge is not used. We tested Pyramabs on artificial random networks, a protein-protein interaction network, and a metabolic network. We compared Pyramabs with other methods and verified our results based on those published in the literature and public databases.
The two overarching goals of our work are to (1) propose an alternative knowledge representation for improved network interpretations, and (2) introduce a novel approach for extracting knowledge from networks and describing it using the new representation. The abstraction pyramid discovered by Pyramabs does not replace the known structure of ontology (e.g., the Gene Ontology (GO)), but instead provides other information that may be missing. For example, an abstraction pyramid identified from a protein-protein interaction network could illuminate the protein interactions at various levels. Some vertical or horizontal relationships can provide additional biological meaning that may not be characterized in the GO's Directed Acyclic Graph (DAG) structure.
Step 1. Input a given network of nodes to Pyramabs.
Step 2. Calculate the proximity between all pairs of nodes and use as the link weights.
Step 3. Normalize the proximity by computing the z-scores; then discard the links with a z-score below a specified threshold to reduce the search space of the network.
Step 4. Obtain the maximum-weight spanning tree from the network and use as the backbone.
Step 5. Partition the network into modules based on its backbone.
Step 6. Construct a network of the modules found in Step 5. This network forms a hierarchical level of a pyramid.
Step 7. If the network produced in Step 6 contains more than one node (i.e., one module), go to Step 2 to find a higher hierarchical level.
Step 8. Otherwise, return the pyramid.
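The control flow of Steps 1 to 8 can be sketched as a short loop. This is only an illustrative skeleton, not the Pyramabs implementation: the names `build_pyramid`, `partition`, `abstract`, and `max_levels` are hypothetical, and the two callables stand in for Steps 2 to 5 and Step 6 respectively. The early exit when no modules merge is an added safeguard that the steps above do not mention.

```python
from typing import Callable, Dict, Hashable, List, Tuple

Edge = Tuple[Hashable, Hashable]
Weights = Dict[Edge, float]
Level = Tuple[List[Hashable], Weights]


def build_pyramid(
    nodes: List[Hashable],
    weights: Weights,
    partition: Callable[[List[Hashable], Weights], List[Hashable]],
    abstract: Callable[[List[Hashable], Weights], Level],
    max_levels: int = 50,  # hypothetical safety cap, not part of the steps
) -> List[Level]:
    """Return the list of levels, bottom level first."""
    levels: List[Level] = [(list(nodes), dict(weights))]  # Step 1
    while len(levels[-1][0]) > 1 and len(levels) < max_levels:  # Step 7
        cur_nodes, cur_weights = levels[-1]
        modules = partition(cur_nodes, cur_weights)  # Steps 2-5 (caller-supplied)
        if len(modules) >= len(cur_nodes):
            break  # nothing merged, so a higher level cannot be formed
        levels.append(abstract(modules, cur_weights))  # Step 6 (caller-supplied)
    return levels  # Step 8
```

Passing the partitioning and abstraction steps in as functions keeps the loop independent of any particular proximity measure or module criterion.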
We conducted a series of experiments that applied our approach to datasets from various domains, including artificial and real-world data. Following Sales-Pardo et al., we first tested our approach on hierarchically nested random networks. Since the theoretical partition is known in these networks, we can use the results to validate our method's ability to identify the inclusion hierarchy implied in the network. Furthermore, to evaluate the method's generality and its applicability to real-world problems, we tested it on several real-world datasets with different characteristics: protein-protein interactions, metabolic pathways, and social networks [see Additional file 1]. The experimental results indicated that this new method could not only uncover the inherent hierarchy and the significant modules in a complex network, but could also provide different degrees of abstraction of the network.
To increase efficiency, Pyramabs reduces the search space using a z-score threshold to filter out "weak" links; the tradeoff is a loss of information. We conducted a series of experiments on random networks, using z-score thresholds ranging from -2 to 2, to evaluate their effect on the results. Pyramabs identified the correct hierarchy with threshold values of -2, -1, -0.5, and 0 (Figure 3(E)). As we further increased the threshold, some supernodes became isolated due to their limited number of links. These experiments illustrated a limitation of Pyramabs: the information loss caused by a low network density or a high z-score threshold has a greater influence at the higher levels of the hierarchy, as seen in Figure 3(D) and 3(E). Based on these test results, we set the z-score threshold to zero for the remaining experiments. Because the optimum threshold balancing efficiency and accuracy will vary depending on the network and may not be known beforehand, we made the threshold a user-specified parameter in Pyramabs, with a default value of zero.
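The thresholding described above can be illustrated with a small sketch. It assumes that link proximities are held in a dictionary keyed by node pair and that z-scores are computed over all links with the population standard deviation; both choices, and the name `filter_by_zscore`, are assumptions made for illustration rather than details taken from Pyramabs.

```python
from statistics import mean, pstdev
from typing import Dict, Hashable, Tuple

Edge = Tuple[Hashable, Hashable]


def filter_by_zscore(weights: Dict[Edge, float], threshold: float = 0.0) -> Dict[Edge, float]:
    """Keep only the links whose proximity z-score is at or above `threshold`."""
    values = list(weights.values())
    if not values:
        return {}
    mu = mean(values)
    sigma = pstdev(values)  # ASSUMPTION: population standard deviation
    if sigma == 0:
        # All proximities are equal, so z-scores are undefined; keep every link.
        return dict(weights)
    return {edge: w for edge, w in weights.items() if (w - mu) / sigma >= threshold}


if __name__ == "__main__":
    links = {("a", "b"): 1.0, ("b", "c"): 2.0, ("c", "d"): 3.0}
    print(filter_by_zscore(links))        # default threshold of zero
    print(filter_by_zscore(links, -2.0))  # a permissive threshold
```

With the default of zero, links whose proximity is below the mean are discarded, which is why a higher threshold can leave sparsely connected supernodes isolated.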
Table 1. Summary of biological significance of modules based on GO biological process annotations. Row labels: Luo et al. (a), Radicchi et al. (a), Level 2 (b), Level 3 (b), Level 4 (b), Level 5 (b).
Table 1 shows that the average p-value decreased at higher levels. This suggests that the vertical relationships in the hierarchy identified by Pyramabs correspond to the GO hierarchy, since the modules at lower levels merged into larger modules at higher levels. Note that the average cluster sizes at levels 2 and 3 (723 and 152) in our pyramid are much greater than the average level 2 cluster size using box clustering (26). Comparing our level 4 with level 3 in box clustering, we found that the total number of clusters was similar (72 vs. 77), as was the average cluster size (34 vs. 26). We further compared these 72 modules with the 77 modules and found that they shared a large number of member proteins; the average overlap was over 80%. These findings indicate that Pyramabs is more useful than box clustering for disclosing higher-level module organizations. On the other hand, when comparing the bottom level in both hierarchies, our average cluster size was larger (10 vs. 5). This suggests that box clustering has a greater tendency than Pyramabs to partition modules into smaller ones.
where pv_a and pv_b are the p-values of nodes a and b calculated by the GO Term Finder, and pv_ab is the p-value of the new node consisting of a merged with b. We used min(pv_a, pv_b) in the definition, so a positive p-DecreaseRatio indicates that pv_ab is smaller than both pv_a and pv_b, i.e., the merge of a and b is more biologically significant than either a or b.
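As a hedged illustration of the sign property described here, the sketch below uses one normalized form that is consistent with the text: the decrease of the merged p-value relative to the smaller of the two original p-values. The exact formula is an assumption, since only the use of the minimum and the meaning of a positive value are stated in the text; the function name `p_decrease_ratio` is hypothetical.

```python
def p_decrease_ratio(pv_a: float, pv_b: float, pv_ab: float) -> float:
    """Relative decrease of the merged p-value with respect to min(pv_a, pv_b).

    ASSUMPTION: the normalized form (min - merged) / min is illustrative only.
    The property it preserves is the one stated in the text: the value is
    positive exactly when pv_ab is smaller than both pv_a and pv_b.
    """
    best = min(pv_a, pv_b)
    if best <= 0:
        raise ValueError("p-values must be positive")
    return (best - pv_ab) / best


if __name__ == "__main__":
    # The merge is more significant than either part: positive ratio.
    print(p_decrease_ratio(1e-4, 1e-3, 1e-6))
    # The merge is less significant than the better part: negative ratio.
    print(p_decrease_ratio(1e-4, 1e-3, 1e-2))
```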
In our analysis of protein-protein interactions, we verified whether a and b actually had a closer biological relationship than c and d when P_ab > P_cd; this was accomplished by evaluating the change in p-value calculated by the GO Term Finder before and after the node (module) merging. We ran a sign test on the abstract network at each level in the hierarchy, and we found a significantly greater number of positive cases in which the ratio of the p-value decrease after merging a and b was larger than that after merging c and d, when P_ab > P_cd (at the significance level 0.01). These results demonstrated the feasibility of applying a horizontal relationship measured by proximity to the characterization of closeness in biological functions.
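For readers unfamiliar with the sign test, a minimal sketch follows. It assumes a one-sided test in which ties are dropped and the count of positive cases is compared with a Binomial(n, 0.5) null distribution; the text does not say whether the test used was one-sided or two-sided, so that choice and the name `sign_test_p_value` are assumptions.

```python
from math import comb


def sign_test_p_value(n_positive: int, n_negative: int) -> float:
    """One-sided sign test: P(X >= n_positive) for X ~ Binomial(n, 0.5).

    Ties are assumed to have been removed before counting.
    """
    n = n_positive + n_negative
    if n == 0:
        return 1.0  # no informative comparisons
    tail = sum(comb(n, k) for k in range(n_positive, n + 1))
    return tail / 2 ** n


if __name__ == "__main__":
    # 9 positive cases out of 10 comparisons.
    print(sign_test_p_value(9, 1))
```

A p-value below 0.01 from such a test would correspond to the significance level quoted in the text.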
Thousands of components in a living cell are dynamically interconnected within a complex network that determines the cell's functional properties [5, 16]. One of the primary examples is cellular metabolism arising from sophisticated biochemical networks, in which numerous metabolites are integrated through biochemical reactions. To facilitate the identification and characterization of system-level features in biological organizations, we can partition cellular functionality into a collection of modules and organize them in a hierarchy. We tested Pyramabs on a previously used metabolic network of E. coli. This network contained 507 nodes and 947 links, where each node represented a metabolic substrate, and each link described a reaction.
Summary of the within-module consistency of metabolic pathway classification by Pyramabs based on KEGG
Summary of the within-module consistency of metabolic pathway classification by Sales-Pardo et al.'s box clustering based on KEGG
One widely used method for finding the organization within data is hierarchical clustering [11, 21, 22]. Hierarchical clustering techniques group data into a sequence of nested clusters, either by treating each singleton as a cluster and merging them into larger clusters (agglomerative or "bottom-up"), or by dividing an initial single cluster into successively smaller clusters (divisive or "top-down"). Both techniques organize data into a hierarchical structure, typically depicted as a dendrogram, which allows the visualization of the internal hierarchical structure within data, regardless of whether or not the data are actually organized hierarchically. It can be argued that a "height threshold" in a dendrogram can be judiciously selected according to some metric, above which any clusters and their hierarchical relationships are regarded as genuine. Nevertheless, it is debatable whether any post-clustering analysis that is independent of the clustering process will be effective. Box clustering has been proposed as a variant of divisive unsupervised clustering. This method iteratively identifies the modules at each level in the hierarchy until no further hierarchical levels can be found through module division. Although it visualizes the final clustering result as a box-model clustering tree, it only shows vertical relationships between different hierarchical levels.
The quality of the network interpretation depends on the completeness of the knowledge extracted and the expressiveness of the knowledge representations. The present paper provides two contributions to this area. First, we proposed the abstraction pyramid, a new representation that combines vertical and horizontal viewpoints and is capable of interpreting a complex biological network at different degrees of abstraction. Interpretations in this form are more accurate and more meaningful than multilevel dendrograms or single-level graphs. Second, we developed Pyramabs, a two-way approach combining top-down and bottom-up clustering techniques to detect modules and organize them into a multilevel pyramid. The abstraction pyramid gives us the opportunity to achieve a new perspective on cellular organization, by traversing the pyramid freely through the links vertically and horizontally. For example, in one pyramid, we can learn how the metabolites in the metabolic pathways at the bottom level are merged into functional modules through vertical links. We can also verify whether the higher-level modules connected by the horizontal links show any topological property, e.g., scale-free connectivity, that is shared by natural and social networks [11, 23]. With a macro view, we can investigate the changes in topological properties and the biological meanings from one abstraction level to another. In contrast, with a micro view, we can analyze all possible routes going through the modules across levels, to identify interesting attributes or patterns.
The modularity and hierarchy concepts have long been popular in various fields, e.g., biology, psychology, sociology, and digital system design [21, 24–26]. Our abstraction pyramid combines these two concepts. It allows the details of each module to be dealt with in isolation, or the overall characteristics of a coherent system to be dealt with at different levels. This integrated concept is similar to a computer architecture design or system engineering, in which computing modules are organized in a hierarchy according to functionality and implementation details. We expect that future studies in these directions will shed light on new research topics within these fields.
To evaluate the interpretations made on complex biological networks by Pyramabs, we experimented on PPI and metabolic networks. The experiments showed that the abstraction pyramids were biologically meaningful. The vertical relationships successfully characterized the inclusion relationship according to the GO and KEGG category hierarchy, and the strength of the horizontal relationships correctly reflected the functional closeness according to the GO and KEGG annotations. In addition, we tested Pyramabs on two social networks to demonstrate its generality: Zachary's karate club network and an NCAA college football network [see Additional file 1]. These results were encouraging.
We can extend this work in several directions. One future improvement of Pyramabs is to identify overlapping modules. Currently the modules at the same level are not allowed to overlap, although overlapping modules exist in some real-world domains. Second, although the performance of Pyramabs has been demonstrated in real-world domains, we can refine the proximity measure and utilize domain knowledge to improve its robustness for situations in which networks may contain spurious links and nodes, or may be missing crucial links or nodes. Pyramabs currently assumes that the given network is correct when it extracts the abstraction pyramid from the complex network. Third, we can characterize an algorithm for network community analysis, using the proximity measure applied to evaluate the association between nodes, and using the construction procedure it takes to organize the communities. A more thorough comparative study of Pyramabs with other methods provides the opportunity to integrate various complementary algorithms to increase its applicability to various domains and its accuracy in interpreting the networks.
There is great flexibility in how we define the proximity between a pair of nodes, and the selection of an appropriate proximity function is crucial since it will affect the formation of the resulting modules. Several measures are commonly used, including Euclidean distance, correlation coefficient, and cosine similarity [22, 27]. Here, we investigated the use of clustering based on network topology alone. Conventional proximity measures are not applicable to clustering problems if the network topology is the only information given (e.g., we cannot calculate Euclidean distance without the node coordinates). Other proximity measures, such as edge betweenness [12, 13, 21] and topological overlap [11, 28, 29], were recently proposed and used in the study of social, metabolic, protein-protein interaction, and gene networks. In spite of having some successful applications, they have limitations. The edge betweenness of a pair of nodes reflects the global characteristics of a network, but suffers from high computational cost [13, 15, 21] and the effects of incompleteness and noise in the network [14, 30]. The topological overlap is a local measure, and may fail to identify any module beyond a locally dense connectivity pattern.
where the normalizing term for node i is the sum of the weights of all outgoing links of node i. Our proximity function considers not only the effects of common neighbors (i.e., node k), but also the link direction and the link weight. According to studies of protein-protein interaction [31–33], often two interacting proteins share no functional pathways, but reveal substantial functional similarity to their common neighbors. These observations suggest that we treat direct links and indirect paths differently. We assume that the weight of the direct link, A_ij, directly contributes to the proximity, prox(i,j), as indicated by the first term of the proximity equation.
On the other hand, to calculate the proximity between i and j based on an indirect path from i to j by way of k, we divide the path into two sub-paths, i to k and k to j. Unlike direct links, we hypothesize that on an indirect path, one node does not always affect all its neighbors; rather, it acts probabilistically. For an indirect path from i to j by way of k, the probability that node i affects node k is defined as the ratio of the link weight between i and k to the sum of the weights of all outgoing links of node i, except the direct link from i to j. The probability that node k affects node j is defined as the ratio of the link weight between k and j to the sum of the weights of all outgoing links of node k. The probability of the complete indirect path from i to j by way of k is then the product of the probabilities of the path from i to k and the path from k to j. The proximity contributed by the indirect path from i to j by way of k is determined by both the probability of the indirect path and the link weights A_ik and A_kj. If there is more than one common neighbor of i and j, we sum the proximity of each indirect path, as shown in the second term of the proximity equation. Although the proposed proximity function is a local measure, like topological overlap, it has better discrimination in network topology [see Additional file 2], and requires less computational effort than a global measure (e.g., edge betweenness). Incorporating the proximity function into a two-way module-finding-hierarchy-building strategy, we can gather the local and global characteristics, and detect the hierarchical structure of the network.
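The verbal description above can be turned into a small sketch. The direct term and the two path probabilities follow the text; how the path probability is combined with the link weights A_ik and A_kj is not spelled out here, so the sketch assumes a simple product, which is labeled as an assumption in the code. For unweighted networks (all weights equal to 1) this choice has no effect. The adjacency layout and the name `proximity` are likewise hypothetical.

```python
from typing import Dict, Hashable

Adjacency = Dict[Hashable, Dict[Hashable, float]]


def proximity(adj: Adjacency, i: Hashable, j: Hashable) -> float:
    """Proximity of i to j from the direct link plus two-step paths i -> k -> j.

    adj[u][v] is the positive weight of the link u -> v. For an undirected
    network, store each link in both directions.
    """
    out_links_i = adj.get(i, {})
    a_ij = out_links_i.get(j, 0.0)
    total = a_ij  # first term: the direct link contributes its weight
    # Outgoing weight of i, excluding the direct link from i to j.
    out_i = sum(out_links_i.values()) - a_ij
    if out_i <= 0:
        return total  # i has no other neighbors, so no indirect paths exist
    for k, a_ik in out_links_i.items():
        if k == i or k == j:
            continue
        a_kj = adj.get(k, {}).get(j, 0.0)
        if a_kj == 0:
            continue  # k is not a common neighbor
        out_k = sum(adj[k].values())
        p_path = (a_ik / out_i) * (a_kj / out_k)  # P(i affects k) * P(k affects j)
        # ASSUMPTION: the path probability is scaled by the product of the weights.
        total += p_path * a_ik * a_kj
    return total


if __name__ == "__main__":
    # Triangle a-b-c with an extra node d attached to c (unweighted, undirected).
    g = {
        "a": {"b": 1.0, "c": 1.0},
        "b": {"a": 1.0, "c": 1.0},
        "c": {"a": 1.0, "b": 1.0, "d": 1.0},
        "d": {"c": 1.0},
    }
    print(proximity(g, "a", "b"))  # direct link plus one path through c
    print(proximity(g, "a", "d"))  # no direct link, one path through c
```

Because only the neighbors of i are visited, the cost per pair depends on the degree of i rather than on the size of the network, which is the sense in which the measure is local.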
The optimal solution to the partition of a network, based on some criterion function, can be found by enumerating all possibilities. However, this is computationally prohibitive for large practical networks. To reduce the problem space, we adopted a graph-theoretic approach to partitioning. After computing the proximity between all pairs of nodes, we build a maximum spanning tree that includes all the nodes of the network and connects them with the maximum total link proximity. We view the maximum spanning tree as the backbone of the network, and discard the links with less significant proximities. We perform partitioning based on the maximum spanning tree, rather than the original network, in order to reduce computational cost.
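A maximum spanning tree can be obtained with any standard spanning-tree algorithm run on descending weights. The sketch below uses Kruskal's algorithm with a union-find structure; the text does not say which algorithm Pyramabs uses, so this is a stand-in, and the name `maximum_spanning_tree` is hypothetical.

```python
from typing import Dict, Hashable, Iterable, List, Tuple

Edge = Tuple[Hashable, Hashable]


def maximum_spanning_tree(
    nodes: Iterable[Hashable], weights: Dict[Edge, float]
) -> List[Tuple[Hashable, Hashable, float]]:
    """Kruskal's algorithm on links sorted by descending proximity.

    Returns the tree links as (u, v, weight). If the network is disconnected,
    the result is a maximum spanning forest.
    """
    parent = {n: n for n in nodes}

    def find(x: Hashable) -> Hashable:
        # Find the component root, halving the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for (u, v), w in sorted(weights.items(), key=lambda kv: kv[1], reverse=True):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # the link joins two components, so keep it
            parent[root_u] = root_v
            tree.append((u, v, w))
    return tree


if __name__ == "__main__":
    prox = {("a", "b"): 3.0, ("b", "c"): 1.0, ("a", "c"): 2.0, ("c", "d"): 4.0}
    print(maximum_spanning_tree("abcd", prox))
```

The tree has one link fewer than the number of nodes in a connected network, which is what makes it a much smaller search space than the original network.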
where the intra-module term is the sum of the proximity of each intralink within M_k, and the inter-module term is the sum of the proximity of each interlink between M_a and M_b.
Our criteria for modules are similar to those previously proposed [14, 15], but with a focus on the link weight (proximity) instead of on the degrees of nodes. Note that the sum of proximity in the module criteria is calculated on the network rather than on the tree, to avoid information loss. We use the tree only to decide which nodes are clustered together, which reduces the search space of the original network. We provide the pseudocode for the top-down network partitioning procedure [see Additional file 3].
where |M_a| is the number of nodes in module M_a. We first compute the proximity between all possible pairs of supernodes, and then normalize the proximity to a z-score. Those links with a z-score below a threshold (currently set to zero) are considered insignificant, and thus discarded from the network. The resulting network of supernodes is the abstraction of the original network, and is placed one level higher than the original network in the hierarchy. By repeating this process on the networks in the hierarchy, we can generate additional abstract networks and continue building the pyramid of abstraction consistently and systematically from the bottom up, as illustrated in Figure 1.
We thank F. Luo and M. Sales-Pardo for providing the datasets and the source code for the tools used in the experiments. This work was partially supported by National Science Council (NSC) of Taiwan, NSC 98-2221-E-009-150.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.