SING: Subgraph search In Nonhomogeneous Graphs
 Raffaele Di Natale^{1},
 Alfredo Ferro^{1},
 Rosalba Giugno^{1},
 Misael Mongiovì^{1},
 Alfredo Pulvirenti^{1} and
 Dennis Shasha^{2}
DOI: 10.1186/1471-2105-11-96
© Di Natale et al; licensee BioMed Central Ltd. 2010
Received: 29 July 2009
Accepted: 19 February 2010
Published: 19 February 2010
Abstract
Background
Finding the subgraphs of a graph database that are isomorphic to a given query graph has practical applications in several fields, from cheminformatics to image understanding. Since subgraph isomorphism is a computationally hard problem, indexing techniques have been intensively exploited to speed up the process. Such systems filter out those graphs which cannot contain the query, and apply a subgraph isomorphism algorithm to each residual candidate graph. The applicability of such systems is limited to databases of small graphs, because their filtering power degrades on large graphs.
Results
In this paper, SING (Subgraph search In Nonhomogeneous Graphs), a novel indexing system able to cope with large graphs, is presented. The method uses the notion of feature, which can be a small subgraph, subtree or path. Each graph in the database is annotated with the set of all its features. The key point is to make use of feature locality information. This idea is used to both improve the filtering performance and speed up the subgraph isomorphism task.
Conclusions
Extensive tests on chemical compounds, biological networks and synthetic graphs show that the proposed system outperforms the most popular systems in query time over databases of medium and large graphs. Other specific tests show that the proposed system is effective for single large graphs.
Background
Graphs naturally model a multitude of complex objects in the real world. A chemical compound can be represented by a graph where atoms are vertices and bonds are edges. Biological networks model the complex of interactions among components in cells (e.g. proteins, genes, metabolites). Social networks, the web, the water system and the power grid are all represented by graphs. A basic operation is the search of a query graph in a target graph or, more generally, in a database of graphs. Searching a molecular structure in a database of molecular compounds is useful to detect molecules that preserve chemical properties associated with a well known molecular structure. This can be used in screening and drug design. Searching subnetworks in biological networks helps to identify conserved complexes, pathways and motifs among species, and assists in the functional annotation of proteins and other cell components. The problem of searching for a query graph in a target graph is called subgraph isomorphism and is known to be NP-complete. Since the subgraph isomorphism test is expensive, screening all graphs of a large database can be infeasible. Recently, indexing techniques for databases of graphs have been developed with the purpose of reducing the number of subgraph isomorphism tests involved in the query process. In a preprocessing phase the database of graphs is analyzed and an index is built. A query is processed in two phases. In the filtering step the index is used to discard the graphs of the database which cannot contain the query, producing a small set of candidate graphs. The set of candidates is then verified (verification step) by a subgraph isomorphism algorithm and all the resulting matches are reported.
Most graph indexing tools are based on the concept of feature. Depending on the particular system, a feature can be either a small graph [1–3], a tree [4] or a path [5, 6]. The filtering property is based on checking whether the features of the query are contained in each target graph. In the preprocessing phase the database of graphs is scanned, the features are extracted from each graph and stored in the index data structure. During the filtering phase, the features are extracted from the query and the index is probed in order to discard all graphs which do not contain some feature of the query.
Existing indexing techniques are effective on databases of small graphs but they become unfeasible when applied to huge graphs [1, 2]. The reason is that features that may be rare in small graphs are likely to be found in enormous graphs just by chance. This implies that filtering systems based only on the presence or number of features are not effective for large graphs. Moreover the subgraph isomorphism test over a large graph is extremely expensive. Unfortunately, alternative indexing systems which do not make use of features [4, 7] show similar problems on large graphs.
To make the verification phase faster, GraphGrep [5] stores all the feature occurrences of each graph, and discards the part of the graph which does not contain features of the query thus restricting the search to small portions of the target graph. However, this produces a large index which is more difficult to manage and can lead to a reduction in filtering performance. Furthermore, the features of the query often occur in many parts of the graphs, reducing the filtering power.
In this paper, a novel approach to cope with large graphs is proposed. The present approach makes use of paths as features. In contrast to systems that use more complex features such as subgraphs or subtrees, our index includes all paths of bounded length. The position of a feature within the graph is considered. This additional information is used to both improve the filtering power and guide the verification phase allowing an effective pruning of the search tree. In contrast to GraphGrep, only the starting point of a feature is stored and bit arrays are used to reduce the index size. Furthermore this information is used to optimize the verification phase. Notice that this approach cannot be used for graph features since graphs have no starting points. Although a similar approach could be used for tree features (using the roots as starting points), the resulting preprocessing time would be higher since enumerating subtrees is much more expensive than enumerating paths. Despite using path features, our system is effective in capturing the topology of graphs and it is shown to perform better than existing systems in terms of query processing time, while keeping the size of the index comparable. An extensive experimental analysis on real and synthetic data shows that the proposed system is efficient and effective on both databases of small graphs and single large graphs.
Preliminaries
This paper considers undirected node-labeled graphs. However, the concepts introduced in what follows can be easily extended to edge-labeled and directed graphs. An undirected labeled graph (in what follows simply a graph) is a 4-tuple g = (V, E, Σ, l) where V is the set of vertices, E ⊆ V × V is the set of edges (a symmetric binary relation on V), Σ is the alphabet of labels and l: V → Σ is a function which maps each vertex onto a label. If e = (v_{1}, v_{2}) is an edge, then v_{1} and v_{2} are called its endpoints. We set size(g) = |E| and denote by 𝒢 the set of all possible graphs. A graph g_{1} = (V_{1}, E_{1}, Σ, l_{1}) is said to be a subgraph of another graph g_{2} = (V_{2}, E_{2}, Σ, l_{2}) iff V_{1} ⊆ V_{2} and E_{1} ⊆ E_{2}.
Given two graphs g_{1} = (V_{1}, E_{1}, Σ, l_{1}), g_{2} = (V_{2}, E_{2}, Σ, l_{2}) an isomorphism between g_{1} and g_{2} is a bijection ϕ: V_{1} → V_{2} so that:

(u, v) ∈ E_{1} ⇔ (ϕ(u), ϕ(v)) ∈ E_{2}

l_{1}(u) = l_{2}(ϕ(u)) ∀ u ∈ V_{1}
A subgraph isomorphism between g_{1} and g_{2} is an isomorphism between g_{1} and a subgraph of g_{2}. A graph g_{1} is said to be isomorphic to another graph g_{2} if there exists an isomorphism between g_{1} and g_{2}. For the sake of simplicity we also say that g_{1} is equivalent to g_{2} and write g_{1} ≈ g_{2}. Notice that ≈ is an equivalence relation on the set 𝒢 of all possible graphs. A graph g_{1} is said to be subgraph isomorphic to another graph g_{2} if there exists a subgraph isomorphism between g_{1} and g_{2}. In this case we say that g_{1} is contained in g_{2} and write g_{1} ≾ g_{2}.
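The containment relation q ≾ g defined above can be made concrete with a naive backtracking test. This is only an illustrative sketch, not the matching algorithm used by SING: the graph encoding (a dict of vertex labels plus a list of undirected edges), the function name `is_contained` and the example graphs are assumptions introduced here.

```python
def is_contained(q_labels, q_edges, g_labels, g_edges):
    """Naive test of q ≾ g: look for an injective map that preserves labels and edges."""
    def adjacency(labels, edges):
        adj = {v: set() for v in labels}
        for a, b in edges:              # undirected: store both directions
            adj[a].add(b)
            adj[b].add(a)
        return adj

    q_adj = adjacency(q_labels, q_edges)
    g_adj = adjacency(g_labels, g_edges)
    order = list(q_labels)              # query vertices in a fixed matching order

    def extend(mapping):
        if len(mapping) == len(order):
            return True                 # every query vertex has an image
        v = order[len(mapping)]
        used = set(mapping.values())
        for u in g_labels:
            if u in used or g_labels[u] != q_labels[v]:
                continue
            # every already-mapped neighbour of v must be adjacent to u in g
            if all(mapping[w] in g_adj[u] for w in q_adj[v] if w in mapping):
                mapping[v] = u
                if extend(mapping):
                    return True
                del mapping[v]
        return False

    return extend({})


# Hypothetical example: the query is a vertex labeled A adjacent to a B and a C.
q_labels, q_edges = {0: "A", 1: "B", 2: "C"}, [(0, 1), (0, 2)]
g1_labels, g1_edges = {0: "A", 1: "B", 2: "C", 3: "B"}, [(0, 1), (0, 2), (2, 3)]
g2_labels, g2_edges = {0: "A", 1: "B", 2: "C", 3: "A"}, [(0, 1), (1, 2), (2, 3)]
print(is_contained(q_labels, q_edges, g1_labels, g1_edges))
print(is_contained(q_labels, q_edges, g2_labels, g2_edges))
```

The search is exponential in the worst case, which is why the indexing techniques discussed in this paper try to avoid calling such a test on most graphs of the database.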
In this paper, the following two problems will be discussed:
First_query_occurrence problem: Given a database of n graphs D = {g_{1}, g_{2}, ..., g_{n}} and a query graph q, executing the query q on D is equivalent to finding all graphs g of D such that q is subgraph isomorphic to g. In the following we assume, without loss of generality, that all graphs of D and the query graph share the same alphabet Σ.
All_query_occurrences problem: Given a database of n graphs D = {g_{1}, g_{2}, ..., g_{n}} and a query graph q, executing the query q on D is equivalent to finding all subgraph isomorphisms between q and elements of D.
We will make extensive use of the notion of feature. Features are formally introduced by the following definition.
Definition 1 Let 𝒢 be the set of all possible graphs in a given alphabet of labels. A set ℱ is a set of features on 𝒢 iff there exists a binary relation is_a_feature ⊆ ℱ × 𝒢 such that the following property holds (graph upward monotonicity) for all f ∈ ℱ and q, g ∈ 𝒢:
is_a_feature(f, q) ∧ q ≾ g → is_a_feature(f, g)
In what follows, is_a_feature(f, g) is expressed by saying that g contains f.
Every set of features defines a pruning rule for the subgraph isomorphism problem:
Pruning rule 1 If is_a_feature(f, q) and ¬ is_a_feature(f, g) then q cannot be subgraph isomorphic to g.
Examples of sets of features are:

The set Paths_{≤ k} of all labeled paths of length ≤ k. Here a labeled path is the sequence of labels of the vertices along a path.

The set Subtrees_{≤ k} of all labeled subtrees of depth ≤ k.

The set Subgraphs_{≤ k} of all labeled subgraphs of size ≤ k.
This paper considers the set of features Paths_occ_{≤ k} of pairs (p, n), where p is a labeled path of length ≤ k and n is a lower bound on the number of occurrences of p in the given graph. The corresponding pruning property asserts that if the query graph q contains at least n occurrences of a given labeled path p and g does not contain at least n occurrences of p, then q cannot be subgraph isomorphic to g and g can be pruned.
Notice that in all the above examples, if a feature f is a subfeature of a given feature f' of g, then f is also a feature of g. The following definition formalizes this notion.
A downward monotonic set of features is a partially ordered set of features (ℱ, ⪯) such that,
f ⪯ f' ∧ is_a_feature(f', g) → is_a_feature(f, g)
A downward monotonic set of features allows an additional optimization in the pruning process: the pruning rule can be restricted to the maximal features of the query, that is, those features f for which no other feature f' in the query is strictly greater than f in the partial order of features.
Related work
Feature-based graph indexing systems
All feature-based graph indexing systems are characterized by choosing a set of features ℱ and applying Pruning rule 1 to the features of ℱ. To prune as many graphs as possible, the graph indexing systems consider a set of features F_{q} ⊆ ℱ such that each feature f ∈ F_{q} is contained in q, and prune all graphs g ∈ D which do not contain some feature in F_{q}. The filter-and-verification scheme is performed in the following way:

Preprocessing: each graph of the database is examined offline in order to extract all features of ℱ which are contained in the graph. An inverted index is generated, which maps each feature f ∈ ℱ into the set graph_set(f) of all graphs containing f.

Query processing:
Filtering: The given query q is examined in order to extract a suitable set F_{q} ⊆ ℱ of features contained in q. A set of candidate graphs C is then computed by C = ⋂_{f ∈ F_{q}} graph_set(f).
Matching: Each candidate graph is examined in order to verify that the given query is subgraph isomorphic to it. If the All_query_occurrences problem must be solved, then an exhaustive enumeration of all distinct subgraph matches is executed.
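The preprocessing and filtering steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions: features are taken as already extracted (here, tuples of labels), and the names `build_inverted_index` and `candidates` are hypothetical, not part of any of the systems discussed.

```python
from collections import defaultdict


def build_inverted_index(features_by_graph):
    """Map each feature f to graph_set(f), the ids of the graphs containing f."""
    graph_set = defaultdict(set)
    for graph_id, features in features_by_graph.items():
        for f in features:
            graph_set[f].add(graph_id)
    return graph_set


def candidates(graph_set, query_features, all_graph_ids):
    """Intersect graph_set(f) over the query features (Pruning rule 1)."""
    result = set(all_graph_ids)
    for f in query_features:
        result &= graph_set.get(f, set())   # a feature absent from the index prunes everything
    return result


# Hypothetical toy database: each graph is described only by its feature set.
features_by_graph = {
    "g1": {("A", "B"), ("A", "C"), ("B", "C")},
    "g2": {("A", "B"), ("B", "C")},
    "g3": {("A", "C")},
}
index = build_inverted_index(features_by_graph)
print(sorted(candidates(index, {("A", "B"), ("A", "C")}, features_by_graph)))  # ['g1']
```

Only the graphs returned by `candidates` need to be passed to the (expensive) matching step.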
The differences among the various graph indexing systems lie mainly in the choice of the sets ℱ and F_{q}. ℱ can be a set of bounded-size graphs, trees or paths. Since the number of features can be very high, some graph indexing systems select a restricted feature set from the database. For example gIndex [1] selects frequent subgraphs of bounded size. This operation requires an expensive graph data mining step during the preprocessing phase. A possible choice for F_{q} is F_{all} = {f ∈ ℱ | is_a_feature(f, q)}. If ℱ is an ordered feature set, F_{q} can be chosen, without loss of pruning power, to be the set F_{max} of all maximal features in F_{all}. It is also possible to choose any set F_{q} such that F_{max} ⊆ F_{q} ⊆ F_{all}. This is the choice made in SING.
Some indexing systems consider also more effective pruning rules based on the number of feature occurrences [5, 6] and the distances between features [8]. Some systems define compact representations of the index [2, 6]. A description of the various indexing systems follows, with a discussion of the positive and negative aspects of the various choices.
Graph features. Some systems such as gIndex [1], GDIndex [3] and FGIndex [2] use graphs as features. They consider a set of features ℱ = 𝒢/≈, where 𝒢 is the universe of graphs and 𝒢/≈ is the partition of 𝒢 induced by graph isomorphism. All isomorphic graphs are considered as a single feature represented by their equivalence class. The main advantage of using graph features is that they are more suitable to capture the topological structure of graphs. Consequently they tend to produce fewer candidates.
Unfortunately, the number of graph features grows exponentially with the graph size, leading to a large index which degrades the performance of the preprocessing and filtering phases. To solve this problem, gIndex [1] and GDIndex [3] choose as features a set of frequent subgraphs. gIndex [1] also considers the concept of discriminative subgraphs to further reduce the number of features. All these approaches require an expensive data mining step in the preprocessing phase, leading to a loss of efficiency. Moreover, when coping with large graphs the mining step may become impractical. FGIndex uses a small index resident in main memory, and stores the remaining index in secondary storage. The authors of FGIndex use a novel concept of δ-tolerance closed frequent subgraph to distinguish main-memory-resident features from secondary-memory-resident ones. When the query cannot be performed using only the main-memory-resident index, the main-memory index is used to identify the blocks of the secondary memory index to be loaded. To avoid expensive disk accesses, a small set of maximal features which cover the whole query graph is selected.
GDIndex enumerates all induced subgraphs contained in each graph of the database. It organizes all the features in a DAG representing the partial order relation among features. The size of the index is reduced by avoiding redundancy. Each feature is associated with the set of graphs containing it and not containing any ancestor-feature in the DAG. During the filtering phase, the set of graphs containing a feature can be deduced from the feature-DAG. Enumerating all subgraphs of a graph is very expensive, therefore this approach can be used only on databases of very small graphs.
Tree features. Tree features are easier to manage since the tree-isomorphism problem can be solved in polynomial time. TreePi [8] is the first attempt to use trees as features. The authors describe a linear-time algorithm for computing the canonical labeling of a tree. They experimentally show that tree features capture the topological structure well enough. Therefore, using them may result in a good compromise between efficiency and effectiveness of filtering. As shown by the authors, a unique center can be defined for a tree. Consequently the distance (shortest path) between pairs of features in a graph can be computed. TreePi uses an additional pruning rule based on distances between features to improve the quality of the match. More precisely, this pruning rule is based on the observation that for a query graph to be subgraph isomorphic to a target graph, the distance between each pair of query vertices cannot be less than the distance between corresponding vertices in the target graph. Tree+δ [9] uses as features both trees and a restricted class of small graphs to improve the filtering performance. As for graphs, enumerating all trees of bounded size still produces a large number of features. Consequently, a restricted set of features needs to be selected by an expensive data mining step.
Path features. GraphGrep [5] and GraphFind [6] consider as features all paths of length up to l_{p} (usually 4). Formally a k-length path of a graph G = (V, E, Σ, l) is an ordered sequence of vertices (v_{1}, v_{2}, ..., v_{k}) ∈ V^{k} such that (v_{i}, v_{i+1}) ∈ E for 1 ≤ i ≤ k − 1. We say that a path is simple if all of its vertices are distinct. A path feature on Σ is an ordered sequence of labels (a_{1}, a_{2}, ..., a_{k}) where a_{1}, a_{2}, ..., a_{k} ∈ Σ.
Given a graph g = (V, E, Σ, l) and a path feature f = (a_{1}, a_{2}, ..., a_{k}) ∈ Σ^{k}, f is said to be contained in g, in symbols is_a_feature(f, g), if there is a simple path (v_{1}, v_{2}, ..., v_{k}) ∈ V^{k} such that l(v_{i}) = a_{i} for 1 ≤ i ≤ k. In this case (v_{1}, v_{2}, ..., v_{k}) is called a path occurrence of f starting from v_{1}.
For a fixed integer l_{p} > 0, the set ℱ = ⋃_{k=1}^{l_{p}} Σ^{k} of path features of length at most l_{p} is a partially ordered feature set with respect to the prefix relation ⪯ defined by:
(a_{1}, ..., a_{n}) ⪯ (b_{1}, ..., b_{m}) if n ≤ m and a_{i} = b_{i} ∀ i = 1 ... n.
Therefore, Pruning rule 1 can be used to select candidate graphs.
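Because a prefix of a contained path feature is itself contained, only the maximal query features need to be tested for presence. The following sketch illustrates the prefix order and the selection of maximal features; the function names and the sample feature set are hypothetical and chosen only for illustration.

```python
def is_prefix(f, other):
    """True if f precedes `other` in the order above, i.e. f is a prefix of `other`."""
    return len(f) <= len(other) and other[:len(f)] == f


def maximal_features(features):
    """Keep only the features that are not a proper prefix of another feature."""
    features = set(features)
    return {
        f for f in features
        if not any(f != other and is_prefix(f, other) for other in features)
    }


# Hypothetical query feature set (tuples of labels).
fq = {("A",), ("A", "B"), ("A", "B", "C"), ("B",), ("B", "C"), ("C",)}
print(sorted(maximal_features(fq)))
```

Here `("A",)` and `("A", "B")` are dropped because both are prefixes of `("A", "B", "C")`, while `("C",)` is kept since no other feature extends it.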
To improve the quality of filtering, the number of path occurrences of each path feature is stored in an inverted index. When a query q is processed, a set F_{q} of path features is extracted from it and the number of occurrences of each path feature in the query is compared to the corresponding number of occurrences of the same path feature in each graph of the database. A graph is a candidate if the number of occurrences of each path feature in it is greater than or equal to the corresponding number in the query.
GraphGrep also stores the location of each path occurrence in the graphs of the database. Moreover, for each candidate graph, it prunes all parts of the graph which do not contain any path feature of the query. This choice produces an improvement of the matching phase. However the resulting index size is quite large.
The choice of using path features in GraphGrep leads to a very efficient preprocessing phase. On the other hand, it limits the filtering power since paths cannot fully synthesize the topology of graphs.
Non-feature-based graph indexing systems
Recently, two non-feature-based graph indexing systems have been proposed. They have been shown to outperform many feature-based indexing systems, probably because they are able to better capture the structure of graphs.
CTree [4] organizes the graphs of the database in an R-tree-like data structure. The leaves represent single graphs while internal nodes represent sets of graphs synthesized in graph structures called closure graphs. The closure graph of a set of graphs is obtained in the following way. All graphs in the set are aligned by a fast approximate algorithm, called Neighbor Biased Mapping. The vertices of the closure graph are labeled by the sets of labels of the corresponding aligned vertices. Similarly, the edges of the closure graphs are the union of aligned edges. When a query is given, an approximate matching algorithm with no false negatives is executed on the closure graphs of the tree in a top-down fashion. When the closure graph of a node has a negative response, all the subtrees rooted at that node are pruned and all its leaf graphs are discarded. The remaining graphs are the candidates, and they can be verified by an exact matching algorithm.
Despite the flexibility and filtering power of CTree [4], its filtering efficiency is limited since the execution of the approximate matching algorithm is expensive and needs to be applied to many closure graphs.
GCoding [7] uses the eigenvalue properties of the adjacency matrix for pruning graphs. In particular, it makes use of the Interlacing theorem which bounds the eigenvalues of the adjacency matrices of matching graphs. In the preprocessing phase, all the graphs of the database are scanned. For each vertex v of a given graph, a vertex signature is computed. This computation involves its label, its neighbors' labels together with the higher eigenvalues of the adjacency matrix of the tree rooted on v and representing all n-length paths starting from v. The vertex signatures of a graph are then merged to form the graph signature. Finally the graph signatures are organized in a B-tree-like structure for efficient search. When a query q is given, the vertex and the graph signatures of q are computed. The graph signature is used to identify in the B-tree a first set of candidate graphs. Then, a second set of candidate graphs is selected from the first one by discarding all graphs whose vertex signatures do not match the vertex signatures of the query. The correspondence between graph signatures and vertex signatures is defined by applying the Interlacing theorem.
Thanks to its coding strategy based on eigenvalues, GCoding [7] allows a compact representation of the index. However the computation of eigenvalues is expensive, leading to a slower preprocessing phase. Finally, the loss of information introduced by the chosen coding produces a less effective pruning compared to CTree [4].
Results and Discussion
Approach
The proposed approach is based on a new feature-locality-based pruning rule that reduces the set of candidates resulting from the application of Pruning rule 1. The new pruning rule captures the structure of the graphs much better, leading to a strong reduction of candidates. Locality information is also used to reduce the search space of the verification phase. Our concept has been inspired by TreePi [8], which uses the concept of distance between features, requiring the computation of all-pair distances. In certain cases, especially when dealing with large graphs, the approach of TreePi is computationally expensive. Enumerating all trees produces an explosion of the number of features, which must be reduced by a data mining step. This increases the preprocessing time and limits the filtering performance, because only a small number of features is selected. Moreover, TreePi requires the computation of the pairwise distances between features. To limit the preprocessing and filtering time, only a small number of features can be selected. Consequently, a high number of candidates is produced. In contrast, SING considers all paths starting from a node. It requires much less computation, resulting in low preprocessing and filtering times. Moreover, SING is able to capture the topology of the tree induced by a node, using simple paths. Consequently it requires a lower number of features and avoids the expensive feature selection process.
Consider the graphs in Figure 1. It is easily verifiable that q is subgraph isomorphic to g_{1} but not to g_{2}. q contains the features (A, B) and (A, C) and they are also contained in both g_{1} and g_{2}. Based on these features alone, the graph g_{2} cannot be pruned. Note that the occurrences of both features in q start from the same vertex. The same situation holds in g_{1} but not in g_{2}. More precisely, in g_{2} there is no vertex from which occurrences of both features start. Consequently the vertex labeled A of q cannot match any vertex of g_{2}, which can therefore be pruned. The following statements formalize this concept. They are immediate consequences of the definition of subgraph isomorphism. Let start(f, g) be the set of vertices v such that an occurrence of f starts from v in g.
Statement 1 Given two graphs q = (V_{q}, E_{q}, Σ, l_{q}) and g = (V_{g}, E_{g}, Σ, l_{g}), let ϕ: V_{q} → V_{g} be a subgraph isomorphism between q and g. For each vertex v ∈ V_{q} the following holds:
{f ∈ ℱ | v ∈ start(f, q)} ⊆ {f ∈ ℱ | ϕ(v) ∈ start(f, g)}.
Statement 2 Given two graphs q, g. If q ≾ g then for each vertex v of q there must exist at least one vertex u in g such that
{f ∈ ℱ | v ∈ start(f, q)} ⊆ {f ∈ ℱ | u ∈ start(f, g)}.
Statement 2 suggests a more effective way to prune the graph database. Given a candidate graph g, if q is contained in g then for each vertex v of the query graph q there exists a vertex u of g such that each feature starting from v also starts from u. Consequently, if for some vertex of q there is no such corresponding vertex u, g can be pruned. Statement 1 gives a method to reduce the search space of the matching algorithm: it introduces a more restrictive condition on the matching pairs of vertices. A detailed description of each phase of the proposed graph indexing system is given in Section Methods.
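The pruning condition of Statement 2 can be illustrated on a small example. The sketch below is written under assumptions: the graphs are hypothetical (in the spirit of the discussion above, not a reproduction of Figure 1), path features are limited to `max_vertices` vertices, and the helper names are introduced here for illustration only.

```python
def features_by_start(labels, edges, max_vertices):
    """For each vertex v, the label sequences of the simple paths starting at v."""
    adj = {v: set() for v in labels}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    started = {v: set() for v in labels}

    def walk(start, path):
        started[start].add(tuple(labels[x] for x in path))
        if len(path) < max_vertices:
            for nxt in adj[path[-1]]:
                if nxt not in path:          # keep the path simple
                    walk(start, path + [nxt])

    for v in labels:
        walk(v, [v])
    return started


def passes_locality_check(q, g, max_vertices=2):
    """Statement 2: each query vertex needs a vertex of g that starts all of its features."""
    q_feats = features_by_start(*q, max_vertices)
    g_feats = features_by_start(*g, max_vertices)
    return all(any(fq <= fg for fg in g_feats.values()) for fq in q_feats.values())


# Hypothetical graphs, each given as (vertex labels, undirected edges).
q = ({0: "A", 1: "B", 2: "C"}, [(0, 1), (0, 2)])
g1 = ({0: "A", 1: "B", 2: "C", 3: "B"}, [(0, 1), (0, 2), (2, 3)])
g2 = ({0: "A", 1: "B", 2: "C", 3: "A"}, [(0, 1), (1, 2), (2, 3)])
print(passes_locality_check(q, g1), passes_locality_check(q, g2))
```

In this example every feature of q also occurs somewhere in g2, so a presence-only test keeps g2 as a candidate; the locality check discards it because no single vertex of g2 starts both (A, B) and (A, C).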
Experimental Analysis
This section compares the proposed system to the most popular tools. Three different dataset classes are used. Tests on real data coming from a database of small molecules (DTP AIDS Antiviral Screen) and a database of biological networks labeled with gene expressions are performed. We evaluate SING on large graphs by generating a synthetic scale-free network of 2000 nodes and executing several queries of sizes ranging from 4 to 16.
The proposed system was implemented in C++ and compiled with the GNU compiler gcc 3.3. Experimental analysis was performed on an Intel Xeon with 2 GB of memory using Linux OS. The executable used to perform all the experiments here reported is available as Additional File 1. The other tools used for the comparison are: CTree [4], GCoding [7], gIndex [1] and Tree+Delta [9].
Molecular data
Experiments on molecular data were performed over the DTP AIDS Antiviral Screen dataset published by the National Cancer Institute [10]. The complete dataset contains about 42000 chemical compounds. The experiment took three subsets containing respectively 8000, 24000 and 40000 graphs. Each compound corresponds naturally to a graph whose nodes are the atoms labeled with their atomic symbol. Each simple or multiple chemical bond between two atoms is represented by a single edge.
For each database, a set of queries is generated in the following way. Randomly choose a graph g of the database and one of its vertices v. Starting from v, proceed randomly in a breadth-first fashion until a fixed total number t of edges is reached. This yields four groups of 100 queries, with 4, 8, 16 and 32 edges respectively.
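The query generation procedure can be sketched as follows. The text does not specify how the random breadth-first expansion is performed, so the details here (shuffling the neighbours of each visited vertex, stopping as soon as t edges are collected) are assumptions, and `extract_query` is a hypothetical name.

```python
import random
from collections import deque


def extract_query(adj, t, rng):
    """Grow a connected query breadth-first from a random vertex, up to t edges."""
    start = rng.choice(sorted(adj))
    visited = {start}
    edges = set()
    queue = deque([start])
    while queue and len(edges) < t:
        v = queue.popleft()
        neighbours = sorted(adj[v])
        rng.shuffle(neighbours)              # random visiting order (assumption)
        for u in neighbours:
            if len(edges) == t:
                break
            edge = frozenset((v, u))
            if edge in edges:
                continue
            edges.add(edge)
            if u not in visited:
                visited.add(u)
                queue.append(u)
    return edges                             # fewer than t edges if the component is small


# Hypothetical target graph: a ring of 10 vertices given as an adjacency dict.
ring = {i: {(i - 1) % 10, (i + 1) % 10} for i in range(10)}
query_edges = extract_query(ring, 4, random.Random(0))
print(len(query_edges))
```

Every collected edge has at least one endpoint that was already visited, so the extracted query is always connected.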
Preprocessing time
Database size (graphs)   CTree   GCoding   SING
8000                     8       642       149
24000                    25      1948      452
40000                    42      2960      755
Index size
Database size   CTree   GCoding   SING
8000            13844   6687      8445
24000           41372   20088     25279
40000           70208   30651     42830
Winner table
# of queries   size 4   size 8   size 16   size 32
≤ 70           CTree    CTree    CTree     CTree
71-147         CTree    CTree    CTree     SING
148-157        CTree    CTree    SING      SING
≥ 158          CTree    SING     SING      SING
With lp = 4, the building time, the index size and the filtering time of gIndex [1] are much lower than those of SING because of the small number of features that gIndex considers. For the same reason, its filtering performance is worse than with lp = 10 (higher number of candidates) and, with the exception of query size 32 with s = 0.1, its number of candidates is always higher than that of SING.
With lp = 10 and s = 0.1, the index sizes are comparable (7.5 MB for SING against 8.5 MB for gIndex), but gIndex [1] takes about three times longer to build its index (391 seconds against 135 seconds for SING). The filtering time of gIndex [1] tends to be constant with respect to the query size. Compared to gIndex [1], SING takes more filtering time on small queries. On the other hand, SING takes less time when the query size increases. The filtering power of SING and gIndex [1] is comparable over small queries. gIndex [1] produces a smaller number of candidates on larger queries at the expense of a longer preprocessing and filtering time. With lp = 10 and s = 0.1, gIndex [1] shows a small improvement in pruning power, but the preprocessing performance drops and the filtering time is higher, mainly due to the high number of features considered. We also performed a comparison with Tree+Delta [9], a recently proposed system which uses as features trees and a small set of selected graphs. Since Tree+Delta [9] performs better than TreePi, we do not compare SING with TreePi.
Transcription networks
To evaluate the performance of SING on large networks, we generated a database of gene networks labeled with discretized gene expressions, based on a transcription regulation network of Escherichia coli annotated with gene expressions. We extracted the largest connected component from the complete network, available with the supplementary material of [13]. Gene expression profiles of 22 samples of the experiment GDS2825 (freely available from NCBI [14]), concerning the analysis of E. coli K12 strains adapted to grow in benzalkonium chloride (a commonly used disinfectant and preservative), were used. We discretized each gene expression value by mapping it into a set of 5 levels: very low, low, medium, high, very high. Those levels became the node labels of the regulatory networks.
Following Alon [13], we attempted to identify groups of nodes connected with a given topology and annotated with a certain gene expression level profile. One can use this approach to understand a gene regulation mechanism, by verifying if a given pattern is present in a set of samples, where it occurs and which genes are involved.
Preprocessing over biological networks
Tool      Index size (KB)   Preprocessing time (sec)
SING      1252              23
CTree     292               1.3
GCoding   85                101
Synthetic data
Protein-protein interaction networks
Protein-protein interaction (PPI) networks are in general very difficult to manage for most graph mining tools [1, 2]. Indeed, PPI networks are scale-free: the degree of their nodes follows a power law. These networks are characterized by the presence of nodes, called hubs, carrying hundreds of connections (edges). In these graphs most of the nodes are connected through a few hubs. The matching phase for this kind of network is very heavy for most of the available tools.
Feature-based tools such as gIndex [1] and Tree+Delta [9] do not perform well on large graphs, since they were not designed for this purpose. Therefore, we compared SING only with VF2 [12].
Query time (in sec.) of matching complexes in the human-yeast experiment
QueryName                                VF2     SING
CAF                                      2,928   2,982
Elg1 RFC like                            2,984   2,996
Piccolo NuA4 histone acetyltransferase   2,959   2,980
SNARE 31201                              2,952   2,983
methionyl glutamyl tRNA synthetase       2,960   3,032
nascent polypeptide associated           2,952   2,976
tubulin                                  3,034   2,994
To compare the algorithms in a more demanding setting, we considered the whole yeast network (downloaded from BioGRID [15]). This network contains 5589 nodes and 92835 edges. Figure 9b describes the degree distribution for this network. Eight different labels were assigned to the nodes at random with uniform distribution.
100 queries of size 4 and 8 and 50 queries of size 16 were randomly extracted from the network. Since a query can have a huge number of matches, we report the running time after at most 100 matches were reached.
Conclusions
Methods
In this section we report a detailed description of the proposed system. Our indexing system is based on the classical Pruning rule 1 and the new pruning rule introduced in Section Results. The three steps of the filter-and-verification scheme are discussed separately.
Preprocessing
The index construction visits all graphs of the database and enumerates, using a depth-first strategy, all feature paths contained in each graph. That is, starting from each vertex it visits all paths of depth at most l_{p}, where l_{p} is a fixed threshold (usually l_{p} ≤ 10, by default l_{p} = 4).
The structures GI and LI can be implemented using hash tables. Therefore the average complexity of the above algorithm in the database D is , where d_{ m }is the maximum degree of the vertices and v_{ a }is the average number of vertices in graphs of D. In our implementation the structures GI and LI are binary trees, so the complexity is in the worst case.
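A rough operation count for the hash-table case can be written with the symbols defined above. It is offered as a sketch under stated assumptions (d_{m} ≥ 2, l_{p} treated as a constant, constant average time per hash-table update), not as a formula quoted from the text.

```latex
\[
\underbrace{\sum_{k=0}^{l_p} d_m^{\,k}}_{\text{paths explored from one vertex}}
\;\le\; \frac{d_m^{\,l_p+1}-1}{d_m-1}
\;=\; O\!\left(d_m^{\,l_p}\right)
\qquad\Longrightarrow\qquad
\text{total work over } D \;=\; O\!\left(|D|\, v_a\, d_m^{\,l_p}\right).
\]
```

The first bound counts the branches of a depth-first visit of depth at most l_{p} from a single vertex; multiplying by the average number of vertices v_{a} and by the number of graphs |D| gives the total.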
In the implementation, LI [g] [p] is represented as a bit array whose size is equal to the number of nodes of g. This choice both reduces the index space and produces faster filtering (Section Second step filtering below gives details).
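The construction described in this section can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the graph encoding (a `(labels, adj)` pair), the names `path_features` and `build_index`, the restriction to simple paths whose depth is counted in edges, and the inclusion of single-vertex paths are all assumptions made for the example. Python integers stand in for the bit arrays LI [g] [p].

```python
from collections import defaultdict


def path_features(labels, adj, max_len):
    """Yield (start_vertex, label_path) for every simple path with at most max_len edges."""
    def dfs(start, v, path_labels, visited):
        yield start, tuple(path_labels)
        if len(path_labels) - 1 == max_len:  # depth threshold l_p reached
            return
        for w in adj[v]:
            if w not in visited:
                visited.add(w)
                path_labels.append(labels[w])
                yield from dfs(start, w, path_labels, visited)
                path_labels.pop()
                visited.remove(w)

    for s in range(len(labels)):
        yield from dfs(s, s, [labels[s]], {s})


def build_index(database, max_len=4):
    """Build GI[f][g] (occurrence counts) and LI[g][f] (bitmask of start vertices)."""
    GI = defaultdict(lambda: defaultdict(int))
    LI = defaultdict(lambda: defaultdict(int))
    for g, (labels, adj) in enumerate(database):
        for start, f in path_features(labels, adj, max_len):
            GI[f][g] += 1            # one more occurrence of feature f in graph g
            LI[g][f] |= 1 << start   # bit `start` set: f starts at that vertex
    return GI, LI
```

For a database holding the single labeled path A-B-C, `build_index([(["A", "B", "C"], [[1], [0, 2], [1]])], max_len=2)` records the feature `("A", "B")` once for graph 0, with only bit 0 set in its start mask.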
Filtering
The first step of the pruning procedure computes the set C_{1} = ⋂_{f∈FQ}{g ∈ GI [f]: GI [f] [g] ≥ FQ [f]}.
This retains graphs only if they have all features in the query. In addition, if a graph does not have at least as many occurrences of each feature as the query does, it is discarded.
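The first filtering step can be sketched as a running set intersection. The function name `first_step_filter`, the dictionary encoding of GI and FQ, and the explicit `all_graphs` argument are assumptions of this example rather than details of the original system.

```python
def first_step_filter(GI, FQ, all_graphs):
    """Return C1: graphs containing every query feature at least as often as the query.

    GI[f][g] is the number of occurrences of feature f in graph g,
    FQ[f] is the number of occurrences of f in the query.
    """
    candidates = set(all_graphs)
    for f, needed in FQ.items():
        having = {g for g, count in GI.get(f, {}).items() if count >= needed}
        candidates &= having
        if not candidates:  # nothing can survive further intersections
            break
    return candidates
```

Stopping as soon as the candidate set is empty avoids scanning the remaining features of a query that no database graph can contain.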
If FQ and FV Q are represented using hash tables, the average complexity of extracting query features is , where d_{ m }is the maximum degree of the vertices and V_{ q }is the set of query vertices. The complexity of the first step filtering is using hash tables. In practice the complexity is lower since the set of graphs associated to a feature is lower than D and not all possible path features occur in the database.
Second step filtering
The first step of filtering takes into account only the occurrences of a feature in the database graphs. The second step filtering uses locality information to further prune the database. For each graph g which passes the first step filtering test, a mapping between query vertices and vertices of g is computed by the following procedure. Let v be a vertex of the query and FVQ [v] be the previously computed set of features starting from v. The algorithm computes M [g] [v] = ⋂_{f∈FVQ [v]}start(f, g) as the set of vertices of the graph g compatible with v. If M [g] [v] = ∅ for some vertex v, then the graph g is discarded. Statement 2 guarantees the correctness of this second filtering step.
Since LI is implemented using bit arrays, the set M [g] [v] can be efficiently computed by the logical AND operation. The complexity is if M is implemented by either vectors or hash tables.
In practice, this is an extremely fast operation on modern hardware.
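The bitwise computation of M [g] [v] can be sketched as below. Python integers are used as bit arrays; the function name `compatible_vertices` and the dictionary layout of the inputs are assumptions made for illustration.

```python
def compatible_vertices(LI_g, FVQ, n_vertices):
    """Return {query vertex: bitmask of compatible vertices of g}, or None if g is pruned.

    LI_g[f] is the bitmask of vertices of g where feature f starts,
    FVQ[v] is the set of features starting at query vertex v.
    """
    full = (1 << n_vertices) - 1  # all vertices of g are initially allowed
    M = {}
    for v, features in FVQ.items():
        mask = full
        for f in features:
            mask &= LI_g.get(f, 0)  # logical AND with start(f, g)
            if mask == 0:
                return None  # M[g][v] is empty: discard the graph
        M[v] = mask
    return M
```

Each AND processes a whole machine word of vertices at once, which is the reason the bit-array representation makes this step cheap.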
Matching
To each candidate graph, the VF2 [12] subgraph matching algorithm is applied. VF2 is a combinatorial search algorithm which spawns a search tree by branching states and makes use of a set of feasibility rules to prune the search. Every state of VF2 consists of a partial match between the query and the target graph. Starting from an initial state consisting of an empty match, VF2 produces a sequence of states by increasing the size of the partial match. Each step generates a new state by adding a pair of corresponding vertices to the partial match. When the partial match cannot be extended, the algorithm backtracks. To decide if two vertices can be matched, VF2 uses a set of topological feasibility rules and a semantic feasibility rule (largely based on label comparison).
SING replaces the semantic compatibility criterion between nodes with a more efficient test to reduce the breadth of the search tree. As described in Section Second step filtering, for each vertex v of the query, a set M [g] [v] of compatible vertices of the graph g is computed. M [g] [v] represents the set of vertices of g which can be matched to v. In fact, by definition, if a vertex v' of the graph g does not belong to M [g] [v], there is at least one feature f such that v ∈ start(f, q) and v' ∉ start(f, g). It follows by Statement 1 that no subgraph isomorphism ϕ between q and g can map v into v'. This means that the pair (v, v') cannot be involved in any match.
SING solves both the First_query_occurrence and the All_query_occurrences problems. In the First_query_occurrence case, the matching algorithm stops when the first match of the graph is found.
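The role of the candidate sets M [g] [v] in the search can be illustrated with a plain backtracking matcher. This is deliberately not VF2: it keeps only the partial-match extension and backtracking, and omits VF2's look-ahead feasibility rules. The name `matches`, the adjacency encoding, and the assumption that M already accounts for vertex labels are ours.

```python
def matches(q_adj, g_adj, M):
    """Yield dicts mapping query vertices to graph vertices.

    q_adj[v] lists the neighbours of query vertex v, g_adj[w] is the set of
    neighbours of graph vertex w, and M[v] is the set of graph vertices
    allowed for v (the candidate set from the second filtering step).
    """
    # Match the most constrained query vertices first.
    order = sorted(range(len(q_adj)), key=lambda v: len(M[v]))
    mapping, used = {}, set()

    def extend(i):
        if i == len(order):
            yield dict(mapping)
            return
        v = order[i]
        for w in M[v]:
            if w in used:
                continue
            # Every already matched neighbour of v must map to a neighbour of w.
            if all(mapping[u] in g_adj[w] for u in q_adj[v] if u in mapping):
                mapping[v] = w
                used.add(w)
                yield from extend(i + 1)
                del mapping[v]  # backtrack
                used.remove(w)

    yield from extend(0)
```

Because the function is a generator, `next(matches(q_adj, g_adj, M), None)` corresponds to the First_query_occurrence case, while `list(matches(q_adj, g_adj, M))` enumerates all occurrences.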
Declarations
Acknowledgements
We would like to thank the authors of gIndex, CTree and GCoding for having kindly provided their tools for comparison purposes. The authors were in part supported by PROGETTO FIRB ITALY-ISRAEL grant n. RBIN04BYZ7: "Algorithms for Patterns Discovery and Retrieval in discrete structures with applications to Bioinformatics" and by the Sicily Region grants PROGETTO POR 3.14: "Ricerca e Sviluppo suite di programmi per l'analisi biologica, denominata: BIOWARE". D. Shasha's work was partly supported by the US National Science Foundation grants GM 3287721/22, IIS0414763, DBI0445666, IOB0519985, DBI0519984, DBI0421604, and N20100115586.
Authors’ Affiliations
References
1. Yan X, Yu PS, Han J: Graph Indexing Based on Discriminative Frequent Structure Analysis. ACM Transactions on Database Systems 2005, 30(4):960–993. 10.1145/1114244.1114248
2. Cheng J, Ke Y, Ng W, Lu A: Fgindex: towards verification-free query processing on graph databases. Proceedings of ACM SIGMOD international conference on Management of data 2007, 857–872.
3. Williams DW, Huan J, Wang W: Graph Database Indexing Using Structured Graph Decomposition. Data Engineering, 2007. ICDE 2007. IEEE 23rd International Conference on 2007, 976–985.
4. He H, Singh AK: Closure-Tree: An Index Structure for Graph Queries. In ICDE '06: Proceedings of the 22nd International Conference on Data Engineering. Washington, DC, USA: IEEE Computer Society; 2006:38.
5. Giugno R, Shasha D: Graphgrep: A fast and universal method for querying graphs. Pattern Recognition, 2002. Proceedings. 16th International Conference on 2002, 2: 112–115.
6. Ferro A, Giugno R, Mongiovi M, Pulvirenti A, Skripin D, Shasha D: GraphFind: enhancing graph searching by low support data mining techniques. BMC Bioinformatics 2008, 9.
7. Zou L, Chen L, Yu JX, Lu Y: A novel spectral coding in a large graph database. In EDBT '08: Proceedings of the 11th international conference on Extending database technology. New York, NY, USA: ACM; 2008:181–192.
8. Zhang S, Hu M, Yang J: TreePi: A Novel Graph Indexing Method. Proceedings of IEEE 23rd International Conference on Data Engineering 2007, 181–192.
9. Zhao P, Yu JX, Yu PS: Graph indexing: tree + delta ≤ graph. In VLDB '07: Proceedings of the 33rd international conference on Very large data bases. VLDB Endowment; 2007:938–949.
10. NCI DTP Antiviral Screen data [http://dtp.nci.nih.gov/docs/aids/aids_data.html]
11. Ullmann J: An algorithm for Subgraph Isomorphism. Journal of the Association for Computing Machinery 1976, 23: 31–42.
12. Cordella L, Foggia P, Sansone C, Vento M: A (Sub)Graph Isomorphism Algorithm for Matching Large Graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence 2004, 26(10):1367–1372. 10.1109/TPAMI.2004.75
13. Kashtan N, Itzkovitz S, Milo R, Alon U: Efficient sampling algorithm for estimating subgraph concentrations and detecting network motifs. Bioinformatics 2004, 20(11):1746–1758. 10.1093/bioinformatics/bth163
14. National Center for Biotechnology Information [http://www.ncbi.nlm.nih.gov/]
15. Stark C, Breitkreutz BJ, Reguly T, Boucher L, Breitkreutz A, Tyers M: BioGRID: a general repository for interaction datasets. Nucleic Acids Res 2006, 34(Database):D535–9. 10.1093/nar/gkj109
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.