GraphFind: enhancing graph searching by low support data mining techniques

Background Biomedical and chemical databases are large and rapidly growing in size. Graphs naturally model such kinds of data. To fully exploit the wealth of information in these graph databases, a key role is played by systems that search for all exact or approximate occurrences of a query graph. To deal efficiently with graph searching, advanced methods for indexing, representation and matching of graphs have been proposed. Results This paper presents GraphFind. The system implements efficient graph searching algorithms together with advanced filtering techniques that allow approximate search. It allows users to select candidate subgraphs rather than entire graphs. It implements an effective data storage based also on low-support data mining. Conclusions GraphFind is compared with Frowns, GraphGrep and gIndex. Experiments show that GraphFind outperforms the compared systems on a very large collection of small graphs. The proposed low-support mining technique which applies to any searching system also allows a significant index space reduction.


Background
Application domains such as bioinformatics and cheminformatics represent data as graphs where nodes are basic elements (i.e. proteins, atoms, etc…) and edges model relations among them. In these domains, graph searching plays a key role. For example, in computational biology locating subgraphs matching a specific topology is useful to find motifs of networks that may have functional relevance. In drug discovery, the main task is to find novel bioactive molecules, i.e., chemical compounds that, for example, protect human cells against a virus. One way to support the solution of this task is to analyze a database of known and tested molecules with the aim of building a classifier which predicts whether a novel molecule will be active or not. Future chemical tests can focus on the most promising candidates. Users may ask to find molecules containing the query graph (exact search) or subgraphs similar to the one described by the query (approximate querying) (see Figure 1 for an example).
The graph searching problem can be formalized as follows. Given a database of graphs D = {G 1 , G 2 ,…, G n } (e.g. collection of molecules, etc.) and a query graph Q (e.g pattern), find all graphs in D containing Q as a subgraph. Moreover, all occurrences of Q in those graphs should be detected. In many important application D consists of a single huge graph. Since most of these problems involve solutions of the graph isomorphism problem, an efficient exact solution can not exists. In order to make searching time acceptable research efforts have tried to improve the following steps [1,2].
1. Reduce the search space by filtering. For a database of graphs a filter limits the search to only possible candidate graphs. For a single-graph database only the possible candidate subgraphs are identified. The common idea is to extract structural features of graphs and store them in a global index. When a query graph is presented, its own structural features are extracted and compared with the features stored in the index to check compatibility [3][4][5][6]. Most existing systems use subgraphs of small size (typically not larger than 10 nodes). However, even though small subgraphs are used, the size of the index and its time construction may be high. Therefore, high-support/highconfidence mining rules are used to index only frequent and not redundant subgraphs (i.e. a subgraph is redundant when its presence in a graph can be predicted by the presence of its subgraphs) [7][8][9].
2. Store Data. In order to scale to very large databases of graphs indexing structures and data must be stored in secondary memory. Applications make use of advanced database management systems [4]. 3. Match. After candidate graphs have been selected, an exhaustive search on these graphs must be performed. This step is implemented either by traditional (sub)graphto-graph matching techniques [10,11] or by an implementation on an extension of the SQL algebra [12].
In this paper, GraphFind, an enhancement of the application-independent graph searching system GraphGrep [5,12], is presented. Experiments show that GraphFind outperforms the compared systems on a very large collection of small molecules, available at the web site of the National Cancer Institute [13]. A key feature of GraphFind is the use of low-support data mining technique (Min-Hashing [14]) to reduce the index size. It is shown that such a mining technique can be successfully applied to enhance other systems such as gIndex [7].

Approach
GraphFind locates all exact and approximate occurrences of a query graph in collections of graphs. It combines filtering techniques described in [5] with a recent matching algorithm [11]. Each graph is stored as a set of small subgraphs. At query time, such a representation allows the selection of candidate subgraphs. GraphFind is implemented on top of Berkeley DB [15] to store both indexing structures and data graphs. A low-support data mining technique (Min-Hashing [14]) is applied to reduce the index size of GraphFind and gIndex [7].

Related compared systems
GraphGrep [5,12] finds all exact and approximate occurrences of a query graph in collections of graphs. Approximate queries are special subgraphs that may contain: (a) nodes with a special wildcard symbol "?", that can match any node; (b) approximate paths (represented by a wildcard symbol "*") which are paths of any length that can connect two nodes. GraphGrep enumerates all small subgraphs (say paths with no more than 4 nodes) in the database together with all occurrences sites and the number of such occurrences. Matching is performed by combining such occurrences making use of an extension of the classical SQL algebra [12].
Daylight [3] is a commercial system to search in molecules databases. The index of each graph is a fixed-size bit vector. It enumerates all existing small paths in a graph, hashes them, and adds them to the vector. A disadvantage of such an approach is that different and unrelated paths may "collide" at the same bit position. An academic freely available emulation of Daylight, called Frowns [10], makes use of an efficient matching subgraph algorithm [11]. All above systems are designed to optimize the query time, at the cost of large preprocessing time.
Data mining techniques have been applied to reduce index construction (space and time) complexity. Related recent work includes [8,9] (stable release of such software are upcoming). gIndex [7] represents the state of art in this area. The key ideas of gIndex are: (i) index frequent subgraphs with a size-increasing support function; (ii) represent them in a canonical form (strings); and (iii) store such strings in a prefix tree.
Although there is a long history of research on indexing for exact searching in database of graphs, only recently have indexing structures for approximate search been proposed [16][17][18]. SAGA [18] appears to be the most flexible system. It finds subgraphs of a query which are similar Querying a database of molecules Figure 1 Querying a database of molecules. Graphs represent molecules. During the match process, edge information is ignored. Query occurrences are shown in bold. For Q 2 since query matches overlap, only one occurrence in each molecule is depicted. The number of occurrences is also given. Molecular descriptions include hydrogen atoms for search accuracy. In a context where hydrogen atoms are not considered, query Q 2 is present 11 times in G 1 , 6 in G 2 and 10 in G 3 . The approximate query specifies any path of an unspecified length between atoms C and N. Approximate queries may also contain atoms with unknown label (they match any atom). (allowing node gaps, node mismatches and graph structural differences) to subgraphs in the database. Algorithms for networks alignment [19] such as NetworkBlast [20] may be used to find approximate occurrences of a query path in a single graph. The main difference between those systems and GraphFind is that GraphFind users may specify precisely at query construction time which nodes or paths are approximate. Thus, GraphFind can not be compared with those systems because it controls the semantics of the output precisely.

Results
In order to evaluate the performance of GraphFind, we have compared it with the main graph search systems (GraphGrep [12], GFrowns, and gIndex [7]). GFrowns is an implementation of the system Frowns [10] to deal with general graphs. Experiments show that GraphFind compared had better behavior than gIndex in terms of scalability on the tested databases. In addition, GraphFind improves our previous system GraphGrep which is commonly used in the literature as a test system. Experimental analysis was performed on a Pentium IV with 1GB of memory using Linux OS. All algorithms were implemented in C++.

Test sets
To test the proposed system, a database of 40000 molecules, available at the web site of the National Cancer Institute [13], was used. It contains sparse graphs having from 20 to 270 nodes. The database was divided into subsets of size ranging from 1000 to 40000 molecules.
Systems were tested using a set of 40 queries drawn from the molecules database. The number of nodes, for each query, ranges from 4 to 32. Query time is given as the sum of filtering time and matching time.
Experiments on a single graph database were performed using synthetic data described in [21]. The Min-Hashing technique was analyzed using both synthetic and molecules database. Figure 2 reports preprocessing time and index size of GraphFind, GraphGrep, GFrowns, and gIndex on the molecule databases. Concerning l p = 4, GraphFind and gIndex were comparable and faster than the other systems. However, GFrowns and gIndex outperformed the others on index space. Notice that, the maximum database size treatable by gIndex was 16000. Figure 3 reports querying time. gIndex and GraphFind showed comparable behavior.

Comparisons
Preprocessing time and index size of GraphFind using l p = 10 is considerable higher than the ones obtained using l p = 4 (see Figure 2). Results of GraphGrep and GFrowns with l p =10 are not reported since they are clearly outperformed by GraphFind and gIndex. The querying time of GraphFind (l p = 10) is not shown since it does not yield any speed-up with respect to the case of l p = 4.
gIndex filtering with l p = 10 compared to l p = 4 discards more graphs, but is slower.
In Figure 4, the performances of GraphFind, GraphGrep and GFrowns on a single large graph are shown. Although the amount of space required by GraphFind is high, the querying results very efficient. gIndex is not reported since it does not treat graphs with thousands of nodes. Figure 5 reports the performance of GraphFind on approximate queries on a database of 8000 molecules available at [13]. As expected, by increasing the allowed degree of approximation in a query, the execution time and the number of matching subgraphs returned by such a query grow.
Finally, the Min-Hashing algorithm was applied to reduce the index size of GraphFind and gIndex (see Figure 6). The running time of Min-Hashing does not affect the preprocessing performance (less than one percent of the total time in all tests). However, the index size is considerably reduced in both GraphFind and gIndex.

Conclusions
This paper has presented GraphFind, an application-independent graph searching system that enhances Graph-Grep. The system allows exact and approximate graph searching where the approximations can be precisely specified. Comparisons with competitive systems show that GraphFind performs well and scales better. GraphFind significantly reduces data storage with respect to Graph-Grep overhead thanks to low-support data mining. The proposed low-support mining technique, which applies to other searching methods also, reduces indexing space significantly.
GraphFind can be easily implemented in a distributed environment. The database of graphs may be distributed among several servers according to a graph similarity criterion. When graph searching is applied to a huge graph (network), the graph may be partitioned into components based on a minimum cut strategy (e.g. locate hubs and cut at them). Future work will include the design and the experimental analysis of a GraphFind distributed version on web-scale databases. Moreover, methods to rank outputs will be added on specific domains of application. This will be a domain-specific extension. Datasets, software and results are freely available at [22].

Methods
GraphFind models the nodes of data graphs as having an identification number (node-id) and a label (node-label).
An id-path of length n is a list of n + 1 node-ids with an unlabeled edge between any two consecutive nodes. A label-path of length n is a list of n + 1 node-labels. Labelpaths and the id-paths of the graphs in a database are used to construct the index of the database and to store the data graphs.

Index construction
Let l p be a fixed positive integer. For each graph in the database and for each node, all paths that start at this node and have length from one up to l p are collected. The index is implemented using a hash  Figure 7).

Data storage
Since several paths may contain the same label sequence, the id-paths of all the paths representing a label sequence are grouped into a label-path-set. GraphFind uses Berkeley DB [15] as the underlying database to store data graph representation and index. GraphFind stores each fingerprint as a dynamic Berkeley DB hash table of linked lists Preprocessing time and index size for molecular databases Figure 2 Preprocessing time and index size for molecular databases. For all systems the value of l p , unless specified, is 4. Index size refers to the space needed by the fingerprint database and the data storage. For GraphFind the size of its fingerprint is explicitly indicated (GraphFind NT ). GraphGrep requires the same space as GraphFind. For gIndex and GFrowns data storage is very small compared to its fingerprint size. Preprocessing represents the time to compute the fingerprint and to store all data in secondary memory. Preprocessing of GraphFindNT includes the time needed to compute and write the fingerprint only. (whose keys and values are described above). Each graph is stored in a set of Berkeley DB tables each corresponding to a label-path-set (see Figure 7).

Queries
A query is an undirected labeled graph. Approximate queries are special subgraphs that may contain: (a) nodes labeled with a special wildcard symbol "?", which can match any label; (b) approximate paths (represented by a wildcard symbol "*") which are paths of any length that can connect two nodes.

Database filtering
The database is filtered by comparing the fingerprint of the query with the fingerprints of the graph in the database. A database graph, for which at least one value in its fingerprint is less than the corresponding value in the fingerprint of the query, is filtered out. The remaining graphs are candidates for matching (see Figure 7 (Filtered Database(1))). Next, parts of the candidate graphs are filtered out as follows: (i) decompose the query into patterns and (ii) select only those id-path sets associated with patterns in the query (see Figure 7 (Filtered Database(2))). The selected id-path sets Comparisons on a single graph database Figure 4 Comparisons on a single graph database. Preprocessing and Querying performances of GraphFind, GraphGrep and GFrowns on a single graph database. The graph is a Irregular 3D with 10000 nodes and 5 labels [21]. Index Size refers to the fingerprint database matrix and the graph representation. N is the number of nodes in the query. The query time is the average obtained by 10 different runs. gIndex is not reported since it does not treat graphs with thousands of nodes. Running time of approximate queries on a molecules database Figure 5 Running time of approximate queries on a molecules database. The database contains 8000 molecules, it is available at [13]. Nodes labeled with a special wildcard symbol "?" can match any label; a wavy segment indicates an approximate path of any length (here the wildcard "*" representing an approximate path has been substituted with a wavy segment only for a graphical purpose). Query Q 1 searches for a node (atom) C and a node N connected through a path. Query Q 2 searches for a node (atom) C and a node N connected through a single node. As expected, the number of occurrences of Q 1 is greater than those of Q 2 . In each graph, a query may have a large number of occurrences (in 27 candidate graphs, Q 3 has 5372 matches).

Subgraph exact and approximate matching
After filtering, subgraph matching on the possible matching candidates is performed by applying the VF2 algorithm [11] to each candidate. This is a refinement of Ullmann's subgraph isomorphism algorithm that uses more selective feasibility rules to prune the state search space. Approximate queries are handled by independently processing, as described above, all maximal exact (completely specified) subqueries. The resulting subgraph matchings are then "joined" by checking, for each pair of query nodes connected by an approximate path, if there is a path in the data graph (of length equal to the wildcards' values) between the corresponding matched nodes. This is performed by using depth-first search. As shown in [11], the computational complexity in the worst case of the VF2 algorithm is Θ(N!N), where N is the number of nodes in the query.

Indexing by low support data mining techniques
Let M(m,n) be the fingerprint of a graph database. Rows correspond to graphs, columns are patterns and each entry is the number of occurrences of each pattern in that graph. Two patterns are similar if a large number of graphs have the same number of occurrences of it. More precisely, let the similarity Sim(C i , C j ) of two columns be the percentage of non null rows in which the two columns have the same value. The aim of the Min-Hashing algorithm [14] is to quickly find pairs of columns (indexed patterns) that have a similarity greater than a given threshold s*. It generates k random permutations, say C j ) of two columns C i and C j be defined as . In [14] the authors show that the similarity of two columns is well approximated by the similarity of the corresponding columns in the signature matrix. Consequently, finding similar columns in the matrix becomes a lightweight computation. This allows deletion of columns which are similar to others. Such a technique can be applied to any indexing system. In GraphFind it is applied to the transposed database fingerprint matrix (see Figure 7). Moreover, in GraphFind s* is not a user parameter. The system is designed to find pairs of columns (patterns) with similarity s* = 100% in the fingerprint database. Therefore, two patterns that have the same occurrence in each graph will be represented in the matrix using only one column indexed by both patterns. Notice that, by reducing the similarity threshold s*, correctness is maintained and the compression ratio may be higher. However, this implies a loss in filtering efficiency and therefore greater searching time. Figure 6 reports the compression ratio of index size in both GraphFind and gIndex after Min-Hashing.