Application of kernel functions for accurate similarity search in large chemical databases
© Huan et al; licensee BioMed Central Ltd. 2010
Published: 29 April 2010
Similarity search in chemical structure databases is an important problem with many applications in chemical genomics, drug design, and efficient chemical probe screening, among others. It is widely believed that structure-based methods provide an efficient way to do the query. Recently, various graph kernel functions have been designed to capture the intrinsic similarity of graphs. Though successful in constructing accurate predictive and classification models, graph kernel functions cannot be applied to large chemical compound databases due to their high computational complexity and the difficulty of indexing similarity search for large databases.
To bridge graph kernel functions and similarity search in chemical databases, we applied a novel kernel-based similarity measurement, developed in our team, to measure the similarity of graph-represented chemicals. In our method, we utilize a hash table to support a new graph kernel function definition, efficient storage, and fast search. We have applied our method, named G-hash, to large chemical databases. Our results show that the G-hash method achieves state-of-the-art performance for k-nearest neighbor (k-NN) classification. Moreover, the similarity measurement and the index structure are scalable to large chemical databases, with smaller index size and faster query processing time compared to state-of-the-art indexing methods such as Daylight fingerprints, C-tree and GraphGrep.
Efficient similarity query processing for large chemical databases is challenging since we need to balance running time efficiency and similarity search accuracy. Our previous similarity search method, G-hash, provides a new way to perform similarity search in chemical databases. An experimental study validates the utility of G-hash in chemical databases.
Elucidating the roles of small organic molecules in biological systems, as studied in chemical genomics, is an emergent and challenging task. Traditionally, the analysis of chemical genomics data was done mainly within pharmaceutical companies for therapeutics discovery, and it was estimated that only 1% of chemical information was in the public domain. The landscape of publicly available chemical genomics data, however, has changed dramatically in the last few years. With the Chemical Genetics Initiative and the Molecular Library Initiative (started by NIH in 2002 and 2004, respectively), publicly available, digitized data have grown exponentially. The PubChem database, to name one example, contains structures of more than 18 million chemical compounds. With the rapid growth of public chemical databases, fast similarity search in large chemical databases has started to attract intensive research attention. Similarity search approaches fall into two groups, depending on whether they use the 2D or the 3D structure of biomolecules. Most 3D structure-based approaches compare three-dimensional shapes using a range of molecular descriptors. Such methods provide fast query processing in large chemical databases but relatively poor accuracy, since they may lose much of the structural information when compressing the three-dimensional shapes. In 2D-based similarity search, we focus on the 2D connectivity of chemical structures. Current 2D similarity measurements may be roughly divided into two categories. In the fragment-based method, scientists embed chemical structures in a high-dimensional feature space, e.g. through Daylight fingerprints with the Tanimoto distance. Fragment-based similarity measurement is by far the most widely used method and is adopted as the default choice in databases such as PubChem. Graph-based similarity measurements, in contrast, do not break the chemical structures into fragments and have started to gain popularity.
In the graph-based method, we use graphs to model chemical structures and use different graph similarity measurements, such as the largest common subgraph approach, graph edit distance, or graph alignment algorithms, to measure the similarity of chemical structures. Though graph methods have been successfully applied in cheminformatics research, as evaluated in our experimental study, none of them has achieved the goal of fast and effective similarity search in chemical databases, i.e. having computational efficiency in scaling to large chemical databases and effectiveness in capturing the intrinsic similarity of graphs. With the fast growth of chemical databases, fast, effective, and indexable approaches are needed.
Our goal in this paper is to bridge the gap between graph kernel functions and similarity search, enabling efficient and accurate similarity search in large chemical databases by applying our previous method, named G-hash. In our method, we model a chemical structure by its two-dimensional (2D) connectivity graph, where nodes represent atoms and edges represent chemical bonds between atoms. We extract local features for each node and its neighboring nodes in the graphs. Using a hash table, a graph kernel function is defined to capture the intrinsic similarity of graphs and to support fast similarity query processing. Our experimental results show that the G-hash method achieves state-of-the-art performance for similarity search in chemical databases. The k nearest neighbors retrieved by G-hash are more likely to be similar to the query chemical than those retrieved by state-of-the-art indexing methods such as Daylight fingerprints and C-tree. Most importantly, the similarity measurement and the index structure are scalable to large databases, with smaller index size, faster index construction time, and faster query processing time compared with other indexing methods.
The rest of the paper is organized as follows. In the Related Work section, we will give an overview of related work on subcomponent search and chemical similarity search. In the Background Section, we will introduce the concept of graphs and graph modeling of chemical structures. In the Methods section, we discuss the details of our algorithm including our index structure and kernel function. In the Results section, we show a comprehensive experimental study using our method and competing methods, and discuss the influence of feature sets. Finally, in the Conclusions section, we conclude with a few remarks on the study.
In this section we discuss two types of related work, i.e. subcomponent search and chemical similarity search. We work exclusively on the 2D connectivity graph of chemical structures and treat the following terms interchangeably: graphs and chemical structures, nodes and atoms, edges and chemical bonds.
Many of the recent subcomponent (subgraph) search methods adopt a similar framework: decomposing chemical structures into a set of smaller pieces, treating each piece as a descriptor, and building a descriptor-based index structure for subgraph queries. The best-known algorithm in this category is Daylight fingerprints. In Daylight fingerprints, all paths up to a fixed length (e.g. 7) are retrieved as descriptors. A molecule is represented as a bit-string, indexed by the descriptors. A similar approach of using paths as descriptors is adopted by GraphGrep. Though paths are easy to retrieve and easy to work with, the simplicity of paths limits their power in capturing the intrinsic similarity of chemicals.
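To make the path-descriptor idea concrete, the following is a minimal sketch and not the Daylight implementation. It enumerates the label strings of all simple paths up to a fixed number of atoms in a small labeled graph and sets one bit per path in a fixed-width bit-string. The graph encoding, the function names `label_paths` and `path_fingerprint`, the 64-bit width, the interpretation of path length as a number of atoms, and the use of CRC32 as the hash are all assumptions made for illustration.

```python
import zlib


def label_paths(labels, adj, max_atoms):
    """Return the label strings of all simple paths with at most max_atoms atoms."""
    found = set()

    def extend(path):
        found.add("-".join(labels[n] for n in path))
        if len(path) == max_atoms:
            return
        for nxt in adj[path[-1]]:
            if nxt not in path:  # simple paths only: never revisit an atom
                extend(path + [nxt])

    for start in adj:
        extend([start])
    return found


def path_fingerprint(labels, adj, max_atoms=7, n_bits=64):
    """Fold every path descriptor into an n_bits-wide bit-string (a Python int)."""
    bits = 0
    for descriptor in label_paths(labels, adj, max_atoms):
        # Illustrative hash choice: CRC32 modulo the bit-string width.
        bits |= 1 << (zlib.crc32(descriptor.encode("utf-8")) % n_bits)
    return bits


# Heavy-atom skeleton of ethanol: C-C-O (hydrogens omitted for brevity)
labels = {0: "C", 1: "C", 2: "O"}
adj = {0: [1], 1: [0, 2], 2: [1]}
paths = label_paths(labels, adj, 3)
fingerprint = path_fingerprint(labels, adj)
```

In this sketch a path and its reverse (for example `C-O` and `O-C`) are recorded as separate descriptors; a production system would typically canonicalize them.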
Recognizing the limitation of paths, gIndex, FG-Index and GDIndex build indices using general subgraphs. GDIndex also incorporates a hash table of subgraphs for fast subgraph isomorphism lookup. The main drawback of subcomponent search is that no quantitative similarity measurement is provided, which makes it difficult to rank the search results in a meaningful manner.
Chemical similarity search
Extending the subcomponent strategy to similarity search in large compound databases is non-trivial. The most widely used strategy is the previously mentioned Daylight fingerprints approach, which treats a chemical compound as a bit-string and uses various similarity metrics for bit-strings, such as the Tanimoto index, to measure the similarity of chemicals. Though fast, Daylight fingerprints provide only a coarse measurement of the true similarity of chemicals, since the majority of the features (i.e. paths) may not contribute to the chemical activity of the compounds and there is no feature selection step in the Daylight system.
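As a small illustration of bit-string similarity, the Tanimoto index of two fingerprints is the number of bits set in both divided by the number of bits set in either. The sketch below stores a fingerprint as a Python integer; the function name `tanimoto` and the convention of returning 1.0 for two empty fingerprints are assumptions for this example.

```python
def tanimoto(a: int, b: int) -> float:
    """Tanimoto index of two bit-strings stored as non-negative integers."""
    union = bin(a | b).count("1")
    if union == 0:
        return 1.0  # assumed convention: two empty fingerprints are identical
    return bin(a & b).count("1") / union


# Two 4-bit fingerprints sharing 2 of the 4 bits set in either of them
similarity = tanimoto(0b1101, 0b1011)
```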
Besides the fragment-based method, maximal common subgraph (MCS) was also utilized in measuring the similarity of graphs. Several heuristic strategies, based on specific properties of chemical structures, were proposed to improve the efficiency of MCS-based similarity search algorithms. Recently, a new backtracking algorithm was presented to compute MCS. Although such a method shows better accuracy, the MCS computation is still time-consuming.
In addition, graph edit distance and graph alignment were also used in cheminformatics to measure graph similarity. Unfortunately, there is no easy way to index either measurement for large chemical structure databases.
Before we proceed to discuss the algorithmic details, we present some general background materials which include the introduction of the concept of graphs and chemical structures as graphs.
A labeled graph G is described by a finite set of nodes V and a finite set of edges E ⊂ V × V. In most applications, a graph is labeled, where labels are drawn from a label set Σ. A labeling function λ: V ∪ E → Σ assigns labels to nodes and edges. We do not assume any structure on the label set Σ for now; it may be a field, a vector space, or simply a set. Following convention, we denote a graph as a quadruple G = (V, E, Σ, λ) with the aforementioned V, E, Σ, and λ. A graph G = (V, E, Σ, λ) is a subgraph of another graph G′ = (V′, E′, Σ′, λ′), denoted by G ⊆ G′, if there exists a 1-1 mapping f : V → V′ such that
for all v ∈ V, λ(v) = λ′(f(v))
for all (u, v) ∈ E, (f(u), f(v)) ∈ E′
for all (u, v) ∈ E, λ(u, v) = λ′(f(u), f(v))
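The three conditions above can be checked mechanically for a given candidate mapping f. The following sketch does so for undirected graphs without self-loops; the representation of a graph as a pair (node-label dict, edge-label dict) and the names `edge` and `is_subgraph_mapping` are assumptions for this example. It verifies one mapping and does not search for a mapping, which is the hard part of subgraph isomorphism.

```python
def edge(u, v):
    """Key for an undirected edge between two distinct nodes."""
    return frozenset((u, v))


def is_subgraph_mapping(g, h, f):
    """Check whether f: V -> V' satisfies the three subgraph conditions."""
    nodes_g, edges_g = g
    nodes_h, edges_h = h
    # f must cover every node of G and be 1-1 (injective).
    if set(f) != set(nodes_g) or len(set(f.values())) != len(f):
        return False
    # Condition 1: node labels are preserved.
    for v, label in nodes_g.items():
        if nodes_h.get(f[v]) != label:
            return False
    # Conditions 2 and 3: every edge of G maps to an edge of G' with the same label.
    for key, label in edges_g.items():
        u, v = tuple(key)
        image = edge(f[u], f[v])
        if image not in edges_h or edges_h[image] != label:
            return False
    return True


# G: a C-O fragment; H: the C-C-O skeleton of ethanol
G = ({"a": "C", "b": "O"}, {edge("a", "b"): "single"})
H = ({1: "C", 2: "C", 3: "O"}, {edge(1, 2): "single", edge(2, 3): "single"})
valid = is_subgraph_mapping(G, H, {"a": 2, "b": 3})
```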
Graph modeling of chemical structures
Here we investigate the utility of graph kernels for chemical similarity measurement. Towards that end, we first give an overview of G-hash. We then briefly outline the graph kernel that we use to define the similarity of chemical structures. In particular, below we introduce details of the feature extraction process, the index structure for fast similarity queries, and the kernel function for similarity measurement.
The application of G-hash to chemical databases proceeds as follows. In index construction, we utilize the following steps:
For each chemical in the chemical database, extract node features for each atom in the chemical
Using graph wavelet analysis, extract local features for each atom in the chemical
Discretize the combined features and hash the atoms in a hash table.
In the query processing phase, we utilize the following steps:
For the query chemical, extract node and local features for each atom in the chemical
Discretize the combined features and hash the atoms in a hash table using the same procedure as in index construction
Compute distances from the query chemical to the chemicals in the database
Report the k nearest neighbors.
Node feature extraction
To derive an efficient algorithm scalable to large graphs, our idea is to use a function Γ: V → ℝⁿ to map nodes in a graph representing a chemical compound to an n-dimensional feature space that captures not only the node label information but also the neighborhood topological information around the node. This process involves two steps: first, node feature extraction, through which we extract features associated with a node; and second, local feature extraction, through which we extract features in a local region centered at the specific node.
We use the following node (atom) features: the atomic number, the histogram of atom types of the immediate neighbors of the node, the local functional group information, and the histogram of the (immediate) chemical bond information. The atom type of the node is a single number. For the histogram of neighboring atom types, we collect information for C, N, O, and S, and group the remaining atom types into "others" to save space, giving a total of five numbers in the histogram. For local functional group information, we record whether the node is part of a 5-node ring, a 6-node ring, a higher-order ring, a branch, or a path, as done in prior work. We have a single number for this feature. For the histogram of the (immediate) chemical bond information, we have three numbers corresponding to single, double, and aromatic bonds. Together these give 10 node features.
The node feature extraction described above focuses on atom physical and chemical properties and ignores the neighborhood topology of the chemical compound. To add neighborhood topology information, we utilize a technique called graph wavelet analysis. The output of the wavelet analysis is a vector of local feature averages, with the size of the vector controlled by a diffusion parameter d. Further details of the analysis can be found in the original presentation of the technique.
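The 10 node features listed above can be assembled as in the sketch below. It is an illustration under stated assumptions rather than the G-hash implementation: the molecule encoding, the helper name `node_features`, the small `ATOMIC_NUMBER` table, and the integer codes for the functional group (supplied precomputed, with 4 standing for "path") are all hypothetical, and hydrogens are omitted.

```python
NEIGHBOR_BINS = ("C", "N", "O", "S")           # fifth bin collects all other elements
BOND_BINS = ("single", "double", "aromatic")
ATOMIC_NUMBER = {"C": 6, "N": 7, "O": 8, "S": 16}  # only what the example needs


def node_features(atom, elements, bonds, group_code):
    """Return the 10-number node feature vector of one atom.

    elements:   atom id -> element symbol
    bonds:      atom id -> list of (neighbor id, bond type)
    group_code: atom id -> precomputed functional-group code (hypothetical encoding)
    """
    neighbor_hist = [0] * 5
    bond_hist = [0] * 3
    for neighbor, bond_type in bonds[atom]:
        symbol = elements[neighbor]
        index = NEIGHBOR_BINS.index(symbol) if symbol in NEIGHBOR_BINS else 4
        neighbor_hist[index] += 1
        bond_hist[BOND_BINS.index(bond_type)] += 1
    # 1 (atomic number) + 5 (neighbor histogram) + 1 (group) + 3 (bond histogram)
    return [ATOMIC_NUMBER[elements[atom]]] + neighbor_hist + [group_code[atom]] + bond_hist


# Heavy atoms of acetaldehyde: C(0)-C(1)=O(2)
elements = {0: "C", 1: "C", 2: "O"}
bonds = {
    0: [(1, "single")],
    1: [(0, "single"), (2, "double")],
    2: [(1, "double")],
}
group_code = {0: 4, 1: 4, 2: 4}  # every atom lies on a simple path in this example
carbonyl_carbon = node_features(1, elements, bonds, group_code)
```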
Structure matching kernel
where ||X||₂ is the L₂ norm of the vector X.
Similarity search with hash functions
To support efficient indexing, we use a hash table where the key is the node's feature vector and the value is the node. Two chemicals are similar if they share many nodes that are hashed to the same cell, since each node is represented by a feature vector that contains the local atomic and topological information. Since node features and local features may contain numeric values, we discretize each feature vector and map each feature value to an integer. After discretization, we hash all nodes in a chemical to the hash table. We show an example of such a hash table below.
Here simi(v) is the set of nodes from graph G that are hashed to the same cell as the node v, and |simi(v)| is the number of nodes in that set. In other words, in this version we only count the number of common nodes belonging to both graph G and graph G′.
Finally, we compute the distances between the query chemical and the chemicals in a chemical database to obtain the k nearest neighbors of the query chemical. The idea is to compute the Euclidean distance between the embeddings of two objects in the Hilbert space induced by the kernel function.
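A minimal sketch of the hashing and counting steps is given below. It assumes that the counting version of the kernel sums |simi(v)| over the nodes v of the query graph, which is one reading of the description above; the fixed-width binning in `discretize`, the per-graph counters stored in each cell (instead of the nodes themselves), and all function names are likewise assumptions for illustration.

```python
from collections import defaultdict


def discretize(features, bin_width=1.0):
    """Map each numeric feature value to an integer bin; the tuple is the hash key."""
    return tuple(int(x // bin_width) for x in features)


def build_index(database):
    """database: graph id -> list of node feature vectors.

    Returns a hash table: key -> {graph id: number of that graph's nodes in the cell}.
    """
    table = defaultdict(lambda: defaultdict(int))
    for graph_id, nodes in database.items():
        for features in nodes:
            table[discretize(features)][graph_id] += 1
    return table


def counting_kernel(table, query_nodes):
    """For every indexed graph G, sum |simi(v)| over the query nodes v."""
    scores = defaultdict(int)
    for features in query_nodes:
        cell = table.get(discretize(features), {})
        for graph_id, count in cell.items():
            scores[graph_id] += count
    return dict(scores)


# Toy database with two-number feature vectors (real vectors would be longer)
database = {
    "g1": [[6, 1.2], [8, 0.4]],
    "g2": [[6, 1.7], [6, 1.1], [7, 0.0]],
}
index = build_index(database)
scores = counting_kernel(index, [[6, 1.5], [8, 0.9]])
```

Because a query only visits the cells its own nodes hash to, the cost of scoring depends on the number of query nodes and the occupancy of those cells rather than on a pairwise comparison with every node in the database.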
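For any kernel K, the Euclidean distance between the embeddings of G and G′ in the induced Hilbert space can be computed from kernel values alone as d(G, G′) = sqrt(K(G, G) + K(G′, G′) − 2 K(G, G′)). The sketch below applies this identity and ranks database entries by distance; the function names and the dictionary-based inputs are assumptions for illustration.

```python
import math


def kernel_distance(k_xx, k_yy, k_xy):
    """Distance in the kernel-induced space from the three kernel values."""
    squared = k_xx + k_yy - 2.0 * k_xy
    # Clamp guards against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(squared, 0.0))


def k_nearest(k_qq, k_self, k_cross, k):
    """Return the ids of the k database graphs closest to the query.

    k_qq:    K(query, query)
    k_self:  graph id -> K(G, G)
    k_cross: graph id -> K(query, G)
    """
    distances = {
        graph_id: kernel_distance(k_qq, k_self[graph_id], k_cross[graph_id])
        for graph_id in k_self
    }
    return sorted(distances, key=distances.get)[:k]


# Sanity check with the linear kernel on x = (1, 2) and y = (4, 6):
# K(x, x) = 5, K(y, y) = 52, K(x, y) = 16, and the ordinary distance is 5.
d = kernel_distance(5, 52, 16)
nearest = k_nearest(5, {"a": 52, "b": 5}, {"a": 16, "b": 5}, 1)
```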
We have performed a comprehensive evaluation of our method by evaluating its classification effectiveness and its scalability for large chemical databases. We have compared our method with other similarity measurements including Daylight fingerprints, the Wavelet Alignment Kernel (WA), C-tree, GraphGrep, and gIndex. For G-hash, we extract 30 features for each node. We used the OpenBabel software package to compute Daylight fingerprints and k-nearest neighbors. For WA, we set the diameter d = 2 and use the Haar wavelet function. For C-tree, GraphGrep and gIndex, we use default parameters. All methods, except C-tree, were implemented using the C++ programming language and compiled using g++ with -O3 optimization. C-tree was developed in Java and compiled using SUN JDK1.5.0. We performed our experiments on a Linux cluster where each node has a dual-core Intel Xeon EM64T 3.2GHz processor and 4 GB of memory running CentOS 4.
Data set characteristics.
We use the NCI/NIH AIDS Antiviral Screen data set, which contains 42,390 chemical compounds retrieved from DTP's Drug Information System, as a large chemical database. There are a total of 63 types of atoms in this data set; the most frequent ones are C, O, N and S. The data set contains three types of bonds: single, double and aromatic. We randomly sampled 1000 chemicals as the query data set.
Similarity measurement evaluation with classification
We have compared classification accuracy using a k-NN classifier on the 12 Jorissen data sets and the BindingDB data sets with different similarity measurements. For the WA method, we first obtain the kernel matrix and then calculate the distance matrix to obtain the k nearest neighbors. For subgraph indexing methods such as gIndex and GraphGrep, we sketch one way to use them for similarity search. This method contains three steps: (i) randomly sample subgraphs from a query, (ii) use those subgraphs as features and compute the occurrences of the subgraphs in graph databases, and (iii) search for nearest neighbors in the obtained feature space. Clearly, the accuracy depends on the number of features. Here we pick 20 features for gIndex. We use standard 5-fold cross validation to obtain classification accuracy. We have tested different k values ranging from 3 to 11 in classification. The quality of the results is similar across these values, so we only report results with k = 5.
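Once distances from a query to the labeled training chemicals are available, k-NN classification reduces to a majority vote among the k closest entries. The sketch below shows that step with k = 5; the function name `knn_predict`, the "active"/"inactive" labels, and the toy distances are assumptions for illustration and are not taken from the experiments.

```python
from collections import Counter


def knn_predict(distances, labels, k=5):
    """Predict a label by majority vote among the k nearest training items.

    distances: item id -> distance from the query
    labels:    item id -> class label
    """
    nearest = sorted(distances, key=distances.get)[:k]
    votes = Counter(labels[item] for item in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical distances and labels for six training compounds
distances = {"a": 0.1, "b": 0.2, "c": 0.3, "d": 0.9, "e": 1.0, "f": 2.0}
labels = {
    "a": "active", "b": "inactive", "c": "active",
    "d": "active", "e": "inactive", "f": "inactive",
}
prediction = knn_predict(distances, labels, k=5)
```

Using an odd k, as in the range 3 to 11 tested above, avoids tied votes in two-class problems.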
Average Precision for different data sets.
Average recall for different data sets.
Statistical summary of accuracy results for G-hash, C-tree, WA, gIndex and Daylight fingerprints on all data sets.
Chemical enrichment study
Feature set influences
One of the key factors determining both the accuracy and the efficiency of the G-hash method is the feature extraction method Γ that maps nodes to a high-dimensional feature space. To evaluate its influence, we have compared five sets of features. In the first set, we use two features (the atom type and one feature from wavelet analysis with d = 1), as discussed in the Methods section. In the second set, we use the 10 features described in the Node feature extraction subsection. In the third set, we dropped the immediate chemical bond information from the second set and obtained seven features. In addition, we use wavelet analysis to extract features from the local region centered at the particular node. We use d = 1 with 10 additional features and d = 2 with 20 additional features. So we have 5 feature sets with sizes 2, 7, 10, 20, and 30.
Table: Average query running time for different numbers of features with different database sizes (columns: number of features; average running time in ms).
Therefore, to obtain both good classification performance and fast query processing, a relatively large number of features should be used. In our case, the feature set with 30 features is the best choice.
In summary, similarity search plays a critical role in cheminformatics. Efficient similarity query processing for large chemical databases is challenging since we need to balance running time efficiency and similarity search accuracy. Here we applied our previous similarity search method, G-hash, which combines hash-based indexing and a graph kernel function, to similarity search in large chemical databases. The key features of G-hash are that its k-NN query time scales to large databases and that it achieves better classification accuracy. We have compared our method with commonly used methods such as Daylight fingerprints and C-tree and have demonstrated the utility of our method.
XW developed methods, implemented the software, and drafted the manuscript. JH offered the research idea, participated in the discussion during research, and wrote part of the paper. AS was involved in testing the data set and helped revise the manuscript. GL provided advice on the chemical aspects of the work.
This work has been partially supported by the KU Specialized Chemistry Center (NIH award U54 HG005031) and NSF Award IIS: 0845951 CAREER: Mining Genome-wide Chemical-Structure Activity Relationships in Emergent Chemical Genomics Databases.
This article has been published as part of BMC Bioinformatics Volume 11 Supplement 3, 2010: Selected articles from the 2009 IEEE International Conference on Bioinformatics and Biomedicine. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/11?issue=S3.
- Dobson C: Chemical space and biology. Nature 2004, 432(7019):824–8. doi:10.1038/nature03192
- Tolliday N, Clemons PA, Ferraiolo P, Koehler AN, Lewis TA, Li X, Schreiber SL, Gerhard DS, Eliasof S: Small Molecules, Big Players: the National Cancer Institute's Initiative for Chemical Genetics. Cancer Research 2006, 66: 8935–42. doi:10.1158/0008-5472.CAN-06-2552
- Austin C, Brady L, Insel T, Collins F: NIH Molecular Libraries Initiative. Science 2004, 306(5699):1138–9. doi:10.1126/science.1105511
- Ballester PJ, Richards WG: Ultrafast shape recognition for similarity search in molecular databases. Proceedings of the Royal Society A 2007.
- Rush TS, Grant JA, Mosyak L, Nicholls A: A shape-based 3-D scaffold hopping method and its application to a bacterial protein-protein interaction. J. Med. Chem. 2005, 48: 1489–1495. doi:10.1021/jm040163o
- Daylight Fingerprints. 2008. Software available at http://www.daylight.com
- Girke T, Cheng LC, Raikhel N: ChemMine. A Compound Mining Database for Chemical Genomics. Plant Physiology 2005, 138: 573–577. doi:10.1104/pp.105.062687
- Cao Y, Jiang T, Girke T: A maximum common substructure-based algorithm for searching and predicting drug-like compounds. Bioinformatics 2008, 24(13):i366–74. doi:10.1093/bioinformatics/btn186
- He H, Singh AK: Closure-tree: an index structure for graph queries. Proc. International Conference on Data Engineering'06 (ICDE) 2006.
- Vert JP: The optimal assignment kernel is not positive definite. Tech. Rep. HAL-00218278, French Center for Computational Biology 2008.
- Wang XH, Smalter A, Huan J, Lushington GH: G-hash: towards fast kernel-based similarity search in large graph databases. Proc. 12th Int. Conf. EDBT. 2009, 472–480.
- Giugno R, Shasha D: GraphGrep: a fast and universal method for querying Graphs. Proceedings of the International Conference in Pattern Recognition (ICPR) 2002.
- Yan X, Yu PS, Han J: Graph indexing: a frequent structure-based approach. SIGMOD 2004.
- Cheng H, Yan X, Han J, Hsu CW: Discriminative Frequent Pattern Analysis for Effective Classification. Proceedings of the 23rd IEEE International Conference on Data Engineering (ICDE) 2007.
- Williams D, Huan J, Wang W: Graph Database Indexing Using Structured Graph Decomposition. Proceedings of the 23rd IEEE International Conference on Data Engineering (ICDE) 2007.
- Jacob L, Hoffmann B, Stoven V, Vert JP: Virtual screening of GPCRs: an in silico chemogenomics approach. Tech. Rep. HAL-00220396, French Center for Computational Biology 2008.
- Cone MM, Venkataraghavan R, McLafferty FW: Molecular Structure Comparison Program for the Identification of Maximal Common Substructures. J. Am. Chem. Soc. 1977, 99: 7668–7671. doi:10.1021/ja00465a041
- J Raymond ea: Heuristics for similarity searching of chemical graphs using a maximum common edge subgraph algorithm. J. Chem. Inf. Comput. Sci. 2002, 42: 305–316.
- Cheng J, Yu JX, Ding B, Yu PS, Wang H: Fast Graph Pattern Matching. 23rd International Conference on Data Engineering 2008.
- Smalter A, Huan J, Lushington G: Graph Wavelet Alignment Kernels for Drug Virtual Screening. Proceedings of the 7th Annual International Conference on Computational Systems Bioinformatics 2008.
- Shasha D, Wang JTL, Giugno R: Algorithmics and Applications of Tree and Graph Searching. Proceeding of the ACM Symposium on Principles of Database Systems (PODS) 2002.
- Yan X, Yu PS, Han J: Graph Indexing Based on Discriminative Frequent Structure Analysis. ACM Transactions on Database Systems (TODS) 2005.
- Jorissen R, Gilson M: Virtual Screening of Molecular Databases Using a Support Vector Machine. J. Chem. Inf. Model 2005, 45(3):549–561. doi:10.1021/ci049641u
- Liu T, Lin Y, Wen X, Jorissen RN, Gilson M: BindingDB: a web-accessible database of experimentally determined protein-ligand binding affinities. Nucleic Acids Research 2007, 35: D198-D201. doi:10.1093/nar/gkl999
- NCI/NIH AIDS Antiviral Screen data set. 1999. Data set available at http://dtp.nci.nih.gov
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.