Volume 9 Supplement 4
A Semantic Web for Bioinformatics: Goals, Tools, Systems, Applications
GraphFind: enhancing graph searching by low support data mining techniques
 Alfredo Ferro†^{1, 2},
 Rosalba Giugno^{1},
 Misael Mongiovì^{1},
 Alfredo Pulvirenti^{1},
 Dmitry Skripin^{1} and
 Dennis Shasha^{3}
DOI: 10.1186/1471-2105-9-S4-S10
© Ferro et al.; licensee BioMed Central Ltd. 2008
Published: 25 April 2008
Abstract
Background
Biomedical and chemical databases are large and rapidly growing in size. Graphs naturally model such kinds of data. To fully exploit the wealth of information in these graph databases, a key role is played by systems that search for all exact or approximate occurrences of a query graph. To deal efficiently with graph searching, advanced methods for indexing, representation and matching of graphs have been proposed.
Results
This paper presents GraphFind. The system implements efficient graph searching algorithms together with advanced filtering techniques that allow approximate search. It allows users to select candidate subgraphs rather than entire graphs. It also implements an effective data storage scheme that is based, in part, on low-support data mining.
Conclusions
GraphFind is compared with Frowns, GraphGrep and gIndex. Experiments show that GraphFind outperforms the compared systems on a very large collection of small graphs. The proposed low-support mining technique, which applies to any searching system, also allows a significant reduction in index space.
Background
1. Reduce the search space by filtering. For a database of graphs, a filter limits the search to only the possible candidate graphs. For a single-graph database, only the possible candidate subgraphs are identified. The common idea is to extract structural features of graphs and store them in a global index. When a query graph is presented, its own structural features are extracted and compared with the features stored in the index to check compatibility [3–6]. Most existing systems use subgraphs of small size (typically not larger than 10 nodes). However, even though small subgraphs are used, the size of the index and its construction time may be high. Therefore, high-support/high-confidence mining rules are used to index only frequent and non-redundant subgraphs (a subgraph is redundant when its presence in a graph can be predicted by the presence of its subgraphs) [7–9].
2. Store data. In order to scale to very large databases of graphs, indexing structures and data must be stored in secondary memory. Applications make use of advanced database management systems [4].
3. Match. After candidate graphs have been selected, an exhaustive search on these graphs must be performed. This step is implemented either by traditional (sub)graph-to-graph matching techniques [10, 11] or by an implementation on an extension of the SQL algebra [12].
In this paper, GraphFind, an enhancement of the application-independent graph searching system GraphGrep [5, 12], is presented. Experiments show that GraphFind outperforms the compared systems on a very large collection of small molecules, available at the web site of the National Cancer Institute [13]. A key feature of GraphFind is the use of a low-support data mining technique (MinHashing [14]) to reduce the index size. It is shown that such a mining technique can be successfully applied to enhance other systems such as gIndex [7].
Results and discussion
Approach
GraphFind locates all exact and approximate occurrences of a query graph in collections of graphs. It combines filtering techniques described in [5] with a recent matching algorithm [11]. Each graph is stored as a set of small subgraphs. At query time, such a representation allows the selection of candidate subgraphs. GraphFind is implemented on top of Berkeley DB [15] to store both indexing structures and data graphs. A low-support data mining technique (MinHashing [14]) is applied to reduce the index size of GraphFind and gIndex [7].
Related compared systems
GraphGrep [5, 12] finds all exact and approximate occurrences of a query graph in collections of graphs. Approximate queries are special subgraphs that may contain: (a) nodes with a special wildcard symbol “?”, which can match any node; (b) approximate paths (represented by a wildcard symbol “*”), which are paths of any length that can connect two nodes. GraphGrep enumerates all small subgraphs (for example, paths with no more than 4 nodes) in the database together with all their occurrence sites and the number of such occurrences. Matching is performed by combining such occurrences using an extension of the classical SQL algebra [12].
Daylight [3] is a commercial system for searching molecular databases. The index of each graph is a fixed-size bit vector. It enumerates all existing small paths in a graph, hashes them, and adds them to the vector. A disadvantage of such an approach is that different and unrelated paths may “collide” at the same bit position. A freely available academic emulation of Daylight, called Frowns [10], makes use of an efficient subgraph matching algorithm [11]. All of the above systems are designed to optimize the query time, at the cost of a large preprocessing time.
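The hashed bit-vector idea described above can be illustrated with a minimal Python sketch. This is not Daylight's or Frowns' actual scheme: the function names, the 64-bit default width, and the use of `zlib.crc32` as the hash function are all assumptions made for illustration.

```python
import zlib

def path_fingerprint(label_paths, n_bits=64):
    """Fold hashed label paths into an n_bits-wide bit vector (stored as an int).

    label_paths: iterable of tuples of node labels, e.g. ("C", "O").
    n_bits and the crc32 hash are illustrative assumptions.
    """
    bits = 0
    for path in label_paths:
        # crc32 is deterministic across runs, unlike Python's built-in hash() on str
        position = zlib.crc32("-".join(path).encode("utf-8")) % n_bits
        bits |= 1 << position
    return bits

def may_contain(graph_bits, query_bits):
    """A graph can contain the query only if every query bit is set in the graph."""
    return (graph_bits & query_bits) == query_bits

graph_bits = path_fingerprint([("C", "O"), ("C", "C"), ("C", "C", "O")])
query_bits = path_fingerprint([("C", "O")])
print(may_contain(graph_bits, query_bits))
```

Because unrelated paths can hash to the same position, `may_contain` may return true for a graph that does not contain the query; it can only rule graphs out, which is why a matching step is still required afterwards.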
Data mining techniques have been applied to reduce index construction (space and time) complexity. Related recent work includes [8, 9] (stable releases of such software are upcoming). gIndex [7] represents the state of the art in this area. The key ideas of gIndex are: (i) index frequent subgraphs with a size-increasing support function; (ii) represent them in a canonical form (strings); and (iii) store such strings in a prefix tree.
Although there is a long history of research on indexing for exact searching in databases of graphs, only recently have indexing structures for approximate search been proposed [16–18]. SAGA [18] appears to be the most flexible system. It finds subgraphs of a query which are similar (allowing node gaps, node mismatches and graph structural differences) to subgraphs in the database. Algorithms for network alignment [19] such as NetworkBlast [20] may be used to find approximate occurrences of a query path in a single graph. The main difference between those systems and GraphFind is that GraphFind users may specify precisely, at query construction time, which nodes or paths are approximate. Thus, GraphFind cannot be compared with those systems because it controls the semantics of the output precisely.
Results
In order to evaluate the performance of GraphFind, we compared it with the main graph search systems (GraphGrep [12], GFrowns, and gIndex [7]). GFrowns is an implementation of the Frowns system [10] that deals with general graphs. Experiments show that GraphFind scales better than gIndex on the tested databases. In addition, GraphFind improves on our previous system GraphGrep, which is commonly used in the literature as a test system. Experimental analysis was performed on a Pentium IV with 1 GB of memory running Linux. All algorithms were implemented in C++.
Test sets
To test the proposed system, a database of 40000 molecules, available at the web site of the National Cancer Institute [13], was used. It contains sparse graphs having from 20 to 270 nodes. The database was divided into subsets of size ranging from 1000 to 40000 molecules.
Systems were tested using a set of 40 queries drawn from the molecules database. The number of nodes, for each query, ranges from 4 to 32. Query time is given as the sum of filtering time and matching time.
Experiments on a single-graph database were performed using the synthetic data described in [21]. The MinHashing technique was analyzed using both the synthetic and the molecule databases.
Comparisons
The preprocessing time and index size of GraphFind using l_{ p } = 10 are considerably higher than those obtained using l_{ p } = 4 (see Figure 2). Results of GraphGrep and GFrowns with l_{ p } = 10 are not reported since they are clearly outperformed by GraphFind and gIndex. The querying time of GraphFind (l_{ p } = 10) is not shown since it does not yield any speedup with respect to the case of l_{ p } = 4.
gIndex filtering with l_{ p } = 10 compared to l_{ p } = 4 discards more graphs, but is slower.
Conclusions
This paper has presented GraphFind, an application-independent graph searching system that enhances GraphGrep. The system allows exact and approximate graph searching where the approximations can be precisely specified. Comparisons with competitive systems show that GraphFind performs well and scales better. GraphFind significantly reduces the data storage overhead of GraphGrep thanks to low-support data mining. The proposed low-support mining technique, which also applies to other searching methods, reduces indexing space significantly.
GraphFind can be easily implemented in a distributed environment. The database of graphs may be distributed among several servers according to a graph similarity criterion. When graph searching is applied to a huge graph (network), the graph may be partitioned into components based on a minimum cut strategy (e.g. locate hubs and cut at them). Future work will include the design and experimental analysis of a distributed version of GraphFind on web-scale databases. Moreover, methods to rank outputs will be added for specific application domains as a domain-specific extension. Datasets, software and results are freely available at [22].
Methods
GraphFind models the nodes of data graphs as having an identification number (node-id) and a label (node-label). An id-path of length n is a list of n + 1 node-ids with an unlabeled edge between any two consecutive nodes. A label-path of length n is a list of n + 1 node-labels. The label-paths and id-paths of the graphs in a database are used to construct the index of the database and to store the data graphs.
Index construction
Data storage
Since several paths may contain the same label sequence, the id-paths of all the paths representing a label sequence are grouped into a label-path-set. GraphFind uses Berkeley DB [15] as the underlying database to store the data graph representation and the index. GraphFind stores each fingerprint as a dynamic Berkeley DB hash table of linked lists (whose keys and values are described above). Each graph is stored in a set of Berkeley DB tables, each corresponding to a label-path-set (see Figure 7).
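The grouping of id-paths into label-path-sets can be sketched in a few lines of Python. This is an in-memory illustration only, not the Berkeley DB layout used by GraphFind. The function names and the graph encoding are hypothetical, and restricting the enumeration to simple paths (no repeated node) is an assumption of this sketch.

```python
from collections import defaultdict

def label_path_sets(labels, adjacency, max_len):
    """Group the id-paths of one graph by their label-path.

    labels:    dict node-id -> node-label
    adjacency: dict node-id -> list of neighbouring node-ids (undirected graph)
    max_len:   maximum path length in edges (l_p in the text)
    """
    sets = defaultdict(list)

    def extend(id_path):
        # the label-path of this id-path is the key of its label-path-set
        key = tuple(labels[v] for v in id_path)
        sets[key].append(tuple(id_path))
        if len(id_path) - 1 == max_len:
            return
        for nxt in adjacency[id_path[-1]]:
            if nxt not in id_path:  # simple paths only (assumption of this sketch)
                extend(id_path + [nxt])

    for start in labels:
        extend([start])
    return dict(sets)

def fingerprint(sets):
    """Number of id-paths per label-path, i.e. one row of occurrence counts."""
    return {label_path: len(id_paths) for label_path, id_paths in sets.items()}

# A three-node chain C - C - O with node-ids 0, 1, 2.
labels = {0: "C", 1: "C", 2: "O"}
adjacency = {0: [1], 1: [0, 2], 2: [1]}
sets = label_path_sets(labels, adjacency, max_len=2)
print(sets[("C", "C")])   # [(0, 1), (1, 0)]
print(fingerprint(sets)[("C",)])   # 2
```

Note that each undirected path is enumerated once per direction, so the label-path ("C", "C") collects both (0, 1) and (1, 0).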
Queries
A query is an undirected labeled graph. Approximate queries are special subgraphs that may contain: (a) nodes labeled with a special wildcard symbol “?”, which can match any label; (b) approximate paths (represented by a wildcard symbol “*”) which are paths of any length that can connect two nodes.
Database filtering
The database is filtered by comparing the fingerprint of the query with the fingerprints of the graphs in the database. A database graph for which at least one value in its fingerprint is less than the corresponding value in the fingerprint of the query is filtered out. The remaining graphs are candidates for matching (see Figure 7 (Filtered Database(1))). Next, parts of the candidate graphs are filtered out as follows: (i) decompose the query into patterns and (ii) select only those id-path sets associated with patterns in the query (see Figure 7 (Filtered Database(2))). The selected id-path sets correspond to one or several subgraphs of candidate graphs. Those subgraphs are the only ones that may match the query.
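The first filtering step, the fingerprint comparison, amounts to a per-pattern count check. A minimal Python sketch follows, assuming fingerprints are held as dictionaries from label-path to occurrence count; the function names and the toy data are hypothetical.

```python
def passes_filter(graph_fp, query_fp):
    """Keep a graph unless some query pattern occurs fewer times in the graph
    than in the query (a missing pattern counts as zero occurrences)."""
    return all(graph_fp.get(pattern, 0) >= count
               for pattern, count in query_fp.items())

def filter_database(database_fps, query_fp):
    """Return the ids of the graphs that remain candidates for matching."""
    return [graph_id for graph_id, fp in database_fps.items()
            if passes_filter(fp, query_fp)]

# Toy fingerprints: label-path -> number of occurrences.
database_fps = {
    "g1": {("C",): 3, ("C", "O"): 2},
    "g2": {("C",): 1},
}
query_fp = {("C",): 2, ("C", "O"): 1}
print(filter_database(database_fps, query_fp))   # ['g1']
```

Here g2 is discarded because it contains the pattern ("C",) only once while the query requires two occurrences, and it lacks ("C", "O") entirely.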
Subgraph exact and approximate matching
After filtering, subgraph matching on the possible matching candidates is performed by applying the VF2 algorithm [11] to each candidate. This is a refinement of Ullmann's subgraph isomorphism algorithm that uses more selective feasibility rules to prune the state search space. Approximate queries are handled by independently processing, as described above, all maximal exact (completely specified) subqueries. The resulting subgraph matchings are then “joined” by checking, for each pair of query nodes connected by an approximate path, whether there is a path in the data graph (of length equal to the wildcards' values) between the corresponding matched nodes. This check is performed using depth-first search. As shown in [11], the worst-case computational complexity of the VF2 algorithm is Θ(N!N), where N is the number of nodes in the query.
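The join step's path check can be sketched as a bounded depth-first search. The Python below is an illustration under stated assumptions: the wildcard is taken to fix an exact number of edges, only simple paths are considered, and the function name and adjacency-list encoding are hypothetical rather than taken from the GraphFind code.

```python
def has_path_of_length(adjacency, source, target, length):
    """Depth-first search for a simple path with exactly `length` edges
    from `source` to `target`.

    adjacency: dict node-id -> list of neighbouring node-ids
    """
    def dfs(node, remaining, visited):
        if remaining == 0:
            return node == target
        for nxt in adjacency[node]:
            # do not revisit nodes: simple paths only (assumption of this sketch)
            if nxt not in visited and dfs(nxt, remaining - 1, visited | {nxt}):
                return True
        return False

    return dfs(source, length, {source})

# Chain 0 - 1 - 2 - 3: the matched nodes 0 and 3 are joined by a 3-edge path only.
adjacency = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(has_path_of_length(adjacency, 0, 3, 3))   # True
print(has_path_of_length(adjacency, 0, 3, 2))   # False
```

Two partial matchings would be joined only when such a check succeeds for every approximate path connecting them.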
Indexing by low support data mining techniques
Let M(m,n) be the fingerprint of a graph database. Rows correspond to graphs, columns to patterns, and each entry is the number of occurrences of the pattern in the graph. Two patterns are similar if a large number of graphs contain the same number of occurrences of both. More precisely, let the similarity Sim(C_{ i }, C_{ j }) of two columns be the percentage of non-null rows in which the two columns have the same value. The aim of the MinHashing algorithm [14] is to quickly find pairs of columns (indexed patterns) whose similarity is greater than a given threshold s*. It generates k random permutations, say p^{ j } : {1,…,m} → {1,…,m} for j = 1,…,k, of the row indices of M. ${p}_{i}^{j}$ denotes the ith element of the permutation p^{ j }. Let $\overline{M}\left(k,n\right)$ be the corresponding signature matrix of M. Each entry $\overline{M}\left[i,j\right]$ is the smallest index t for which $M\left[{p}_{t}^{i},j\right]\ne 0$, that is, the position of the first non-null row of column j under the permutation p^{ i }. Formally, $\overline{M}\left[i,j\right]=t$ if and only if $M\left[{p}_{t}^{i},j\right]\ne 0$ and, ∀ s < t, $M\left[{p}_{s}^{i},j\right]=0$. Let the similarity Sim(C_{ i }, C_{ j }) of two columns C_{ i } and C_{ j } be defined as $\frac{\left|{C}_{i}\cap {C}_{j}\right|}{\left|{C}_{i}\cup {C}_{j}\right|}$. In [14] the authors show that the similarity of two columns is well approximated by the similarity of the corresponding columns in the signature matrix. Consequently, finding similar columns in the matrix becomes a lightweight computation. This allows the deletion of columns which are similar to others. Such a technique can be applied to any indexing system. In GraphFind it is applied to the transposed database fingerprint matrix (see Figure 7). Moreover, in GraphFind s* is not a user parameter. The system is designed to find pairs of columns (patterns) with similarity s* = 100% in the fingerprint database.
Therefore, two patterns that have the same number of occurrences in each graph will be represented in the matrix using only one column indexed by both patterns. Notice that, by reducing the similarity threshold s*, correctness is maintained and the compression ratio may be higher. However, this implies a loss in filtering efficiency and therefore a greater searching time. Figure 6 reports the compression ratio of index size in both GraphFind and gIndex after MinHashing.
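The signature-matrix construction and the s* = 100% case can be sketched in Python as follows. This is a simplified illustration, not the GraphFind implementation: the function names, the default k = 20, the seeded random generator, and the final exact comparison of candidate columns are assumptions of this sketch.

```python
import random
from itertools import combinations

def signature_matrix(M, k, seed=0):
    """k x n MinHash signatures of an m x n occurrence matrix M (list of rows).

    Entry [i][j] is the position t of the first row, in the order given by
    the i-th random permutation, whose value in column j is non-zero
    (None if the column is entirely zero).
    """
    m, n = len(M), len(M[0])
    rng = random.Random(seed)
    signatures = []
    for _ in range(k):
        perm = list(range(m))
        rng.shuffle(perm)
        signatures.append([
            next((t for t, r in enumerate(perm) if M[r][j] != 0), None)
            for j in range(n)
        ])
    return signatures

def identical_columns(M, k=20, seed=0):
    """Pairs of columns with similarity 100%: signatures select candidates,
    an exact comparison (an assumption of this sketch) confirms them."""
    sig = signature_matrix(M, k, seed)
    pairs = []
    for a, b in combinations(range(len(M[0])), 2):
        if all(row[a] == row[b] for row in sig) and all(r[a] == r[b] for r in M):
            pairs.append((a, b))
    return pairs

# Rows are graphs, columns are patterns; patterns 0 and 1 always co-occur.
M = [[2, 2, 0],
     [0, 0, 1],
     [1, 1, 1],
     [3, 3, 0]]
print(identical_columns(M))   # [(0, 1)]
```

In this toy matrix, columns 0 and 1 could be merged into a single column indexed by both patterns, which is the compression described above.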
Notes
List of abbreviations used
3D: Three-Dimensional
DB: Database
G: Number of Graphs
GB: Gigabyte
GraphFind_{NT}: GraphFind Fingerprint
GFrowns: Graph Frowns, Implementation of Frowns for General Graphs
L: Number of Different Node Labels
l_{ p }: Length of Label Path
OS: Operating System
N: Number of Nodes
SQL: Structured Query Language
SAGA: Substructure Index-based Approximate Graph Alignment
VF2: Graph Matching Algorithm by Vento, Foggia et al.
Declarations
Acknowledgements
We thank all the users who have downloaded our software and contributed to its improvement. We would like to thank Xifeng Yan, Philip S. Yu and Jiawei Han for providing gIndex. Some of the authors were supported in part by PROGETTO FIRB ITALY-ISRAEL grant n. RBIN04BYZ7: Algorithms for Patterns Discovery and Retrieval in discrete structures with applications to Bioinformatics.
This article has been published as part of BMC Bioinformatics Volume 9 Supplement 4, 2008: A Semantic Web for Bioinformatics: Goals, Tools, Systems, Applications. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/9?issue=S4.
Authors’ Affiliations
References
1. Cook DJ, Holder LB: Substructure Discovery Using Minimum Description Length and Background Knowledge. Artificial Intelligence Research 1994, 1: 231–255.
2. Ferro A, Giugno R, Pigola G, Pulvirenti A, Skripin D, Bader GD, Shasha D: NetMatch: a Cytoscape plugin for searching biological networks. Bioinformatics 2007, 23(7):910–912.
3. Daylight Chemical Information Systems. [http://www.daylight.com/]
4. Kumar S, Srinivasa S: A Database for Storage and Fast Retrieval of Structure Data. In Proceedings of the 19th International Conference on Data Engineering: 5–8 March 2003; Bangalore. Edited by: Dayal U, Ramamritham K, Vijayaraman TM. IEEE Computer Society; 2003, 789–791.
5. Shasha D, Wang JTL, Giugno R: Algorithmics and Applications of Tree and Graph Searching. In Proceedings of the 21st ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems: 3–5 June 2002; Madison. Edited by: Popa L. ACM; 2002, 39–52.
6. Messmer BT, Bunke H: Subgraph Isomorphism Detection in Polynomial Time on Preprocessed Model Graphs. In Recent Developments in Computer Vision, Second Asian Conference on Computer Vision: 5–8 December 1995; Singapore, Volume 1035 of Lecture Notes in Computer Science. Edited by: Li SZ, Mital DP, Teoh EK, Wang H. Springer; 1995, 373–382.
7. Yan X, Yu PS, Han J: Graph Indexing Based on Discriminative Frequent Structure Analysis. ACM Transactions on Database Systems 2005, 30(4):960–993.
8. Cheng J, Ke Y, Ng W, Lu A: Fg-index: towards verification-free query processing on graph databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data: 12–14 June 2007; Beijing. Edited by: Chan CY, Ooi BC, Zhou A. ACM; 2007, 857–872.
9. Zhang S, Hu M, Yang J: TreePi: A Novel Graph Indexing Method. In Proceedings of the 23rd International Conference on Data Engineering: 15–20 April 2007; Istanbul. Edited by: Chirkova R, Dogac A, Ozsu T, Sellis T. IEEE Computer Society; 2007, 966–975.
10. Frowns. [http://frowns.sourceforge.net/]
11. Cordella L, Foggia P, Sansone C, Vento M: A (Sub)Graph Isomorphism Algorithm for Matching Large Graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence 2004, 26(10):1367–1372.
12. Giugno R, Shasha D: GraphGrep: A Fast and Universal Method for Querying Graphs. In Proceedings of the 16th International Conference on Pattern Recognition: 11–15 August 2002; Quebec. Edited by: Kasturi R, Suen DLC. IEEE Computer Society; 2002, 112–115.
13. National Cancer Institute. U.S. National Institute of Health. [http://www.cancer.gov/]
14. Cohen E, Datar M, Fujiwara S, Gionis A, Indyk P, Motwani R, Ullman JD, Yang C: Finding interesting associations without support pruning. IEEE Transactions on Knowledge and Data Engineering 2001, 13: 64–78.
15. Berkeley DB. [http://www.sleepycat.com/]
16. Yan X, Yu PS, Han J: Substructure Similarity Search in Graph Databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data: 14–16 June 2005; Baltimore. Edited by: Özcan F. ACM; 2005, 766–777.
17. Yan X, Zhu F, Han J, Yu PS: Searching Substructures with Superimposed Distance. In Proceedings of the 22nd International Conference on Data Engineering: 3–8 April 2006; Atlanta. Edited by: Liu L, Reuter A, Whang K, Zhang J. IEEE Computer Society; 2006, 88–98.
18. Tian Y, McEachin RC, Santos C, States DJ, Patel JM: SAGA: a subgraph matching tool for biological graphs. Bioinformatics 2007, 23(2):232–239.
19. Sharan R, Ideker T: Modeling cellular machinery through biological network comparison. Nature Biotechnology 2006, 24: 427–433.
20. Sharan R, Suthram S, Kelley RM, Kuhn T, McCuine S, Uetz P, Sittler T, Karp RM, Ideker T: Conserved patterns of protein interaction in multiple species. Proc Natl Acad Sci U S A 2005, 102(6):1974–1979.
21. Foggia P, Sansone C, Vento M: A Database of Graphs for Isomorphism and Sub-Graph Isomorphism Benchmarking. Proceedings of the 3rd IAPR TC15 Workshop on Graph-based Representations in Pattern Recognition: 23–25 May 2001; Ischia, CUEN 2001, 176–188.
22. CTNYU Research Lab. [http://alpha.dmi.unict.it/~ctnyu/]
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.