- Open Access
A molecular structure matching approach to efficient identification of endogenous mammalian biochemical structures
© Hamdalla et al.; licensee BioMed Central Ltd. 2015
- Published: 18 March 2015
Metabolomics is the study of small molecules, called metabolites, of a cell, tissue or organism. It is of particular interest as endogenous metabolites represent the phenotype resulting from gene expression. A major challenge in metabolomics research is the structural identification of unknown biochemical compounds in complex biofluids. In this paper we present an efficient cheminformatics tool, BioSMXpress that uses known endogenous mammalian biochemicals and graph matching methods to identify endogenous mammalian biochemical structures in chemical structure space. The results of a comprehensive set of empirical experiments suggest that BioSMXpress identifies endogenous mammalian biochemical structures with high accuracy. BioSMXpress is 8 times faster than our previous work BioSM without compromising the accuracy of the predictions made. BioSMXpress is freely available at http://engr.uconn.edu/~rajasek/BioSMXpress.zip
- Metabolite Identification
- Chemical Structure Matching
- Biochemical Structures
Metabolomics is the comprehensive, qualitative, and quantitative study of all the small molecules, called metabolites, in an organism . A major challenge in metabolomics is the interpretation of the vast amount of data produced by the high-throughput techniques used for information extraction and data interpretation [2, 3]. The existence of several on-line chemical structure databases has provided a vital support for molecular identification by allowing the search for candidate compounds using experimentally determined features with computationally simulated features. Such searches often result in a large number of false positives, making identification of the compound under investigation extremely difficult. Hence, cheminformatics methods are needed to efficiently search such large chemical databases and potentially identify unknown endogenous biochemical compounds. Several methods have been developed with the objective of discriminating between candidate structures that are synthetic and those that are biochemical. The first attempt to solve this problem was reported by Nobeli et al.  who used two-dimensional molecular structures to manually derive a library of 57 structural fragments commonly found in 745 E. coli metabolites. Such fragments were used to reveal the main constituents of metabolites and to assist in the classification of the metabolome into biochemically relevant classes. In related work, Gupta and Aires-de-Sousa  and Peironcely et al.  employed fingerprints and random forest classifiers  to classify endogenous biochemical compounds. Molecular fingerprints represent the structure of a molecule as a list of binary values (0 or 1) that indicate the presence or absence of structural features in the molecule . Gupta and Aires-de-Sousa's model correctly annotates 95% of the 1,811 compounds downloaded from the Kyoto Encyclopedia of Genes and Genomes (KEGG)  used for training their model. While Peironcely et al. reported that 96% of 457 compounds downloaded from the Human Metabolome Database (HMDB) , not used for training the model, were classified as endogenous. In our previous work , we developed BioSM, a molecular classifier that can identify endogenous mammalian biochemical structures contained within chemical structure space. BioSM uses the structures of known endogenous mammalian biochemical compounds to aid in the classification process, as opposed to other works that use fragments of known structures. BioSM correctly predicted 95% of 1,388 (KEGG) compounds as endogenous mammalian biochemical in a set of leave-one-out cross validation experiment. Additionally, 89% of 2,330 compounds downloaded from HMDB were identified as endogenous metabolites. One of BioSM's limitations, granting its encouraging results, was its need to exhaustively search all known biochemical structures to be able to make a decision about the candidate compound under investigation which resulted in an undesirably high run time. In this paper we propose an efficient cheminformatics tool, which can be used to identify biological compounds based on their molecular structures, called BioSMXpress. Similar to our previous work, the prediction method applied by BioSMXpress relies on a set of endogenous mammalian biochemical compounds obtained from the KEGG database, hereafter referred to as scaffolds. In a nutshell, BioSMXpress is designed to predict that a given query structure is biological after encountering exactly one scaffold match satisfying a given threshold. This threshold is based on the number of atoms in both the query structure and the scaffold being examined. Knowing this gives us the opportunity to avoid the need to exhaustively search the entire scaffolds list before making a decision about the query compound with confidence. To do this efficiently, BioSMXpress selects the scaffolds that have the potential to promote the candidate in the least possible time. Then, only those scaffolds, with enough atoms to satisfy the given threshold, are checked against the candidate compound for similarity.
BioSMXpress was designed as an enhancement to BioSM with the aim of making the least possible number of structure comparisons to efficiently identify biochemical structures with the aid of a scaffolds list. BioSMXpress decides if a candidate structure is biochemical based upon how similar that structure is to any of the structures in the scaffolds list. In this work, two molecular structures are considered to be a "match", if the smaller structure is an exact substructure (atom and bond types) of the larger structure being compared. The underlying cheminformatics functionality of BioSMXpress is based on an open source Java based toolkit called the Small Molecule Subgraph Detector (SMSD) . This toolkit is used to find the maximum common sub-graph between small molecules using atom type matches and bond sensitivity information. In addition to SMSD, BioSMXpress uses Marvin, a chemical structure processing software, to generate both the canonical SMILES and atom counts from structure data files (.sdf) for all the compounds described in this work.
Finally, let be the scaffold list assigned to c q where . A scaffold is assigned to if and only if . Please note that each candidate structure with a different atom count is provided with a different set of scaffolds in . Once is populated with the appropriate scaffolds for c q , BioSMXpress examines c q against each of those scaffolds. As soon as a match (substructure or superstructure) is found, BioSMXpress predicts that c q is biological and terminates. Otherwise, it's predicted to be nonbiological. Values for subThr and superThr are determined by cross validation as described in the following section.
We will briefly describe the source and nature of the datasets selected to train and validate BioSMXpress. Since these datasets will be used to compare between BioSMXpress and BioSM in terms of prediction accuracy, we utilized the same datasets and followed the same curation steps in . In each dataset, compounds with any of the following characteristics were eliminated: (1) compounds with elements other than C, H, N, O, P and S; (2) compounds with less than 4 atoms and more than 53 atoms (explained below); (3) compounds that were polymers; (4) charged structures except those in which the charge was due to quaternary amines or sulfonium ions; (5) compounds with duplicate structures; and (6) compounds with disjoint structures. We start by defining compounds used to define biological versus non-biological in chemical structure space in this study.
Biological Dataset (Scaffolds list)
The KEGG database was chosen as the source of endogenous mammalian compounds. The list of 1,564 mammalian scaffolds (KEGGscafs) defined in  were used to represent the biochemical structure space in BioSMXpress. Each compound in the scaffolds list comprises of a number of atoms from 4 to 80 atoms per compound.
Non-Biological Dataset (Synthetic compounds list)
The Chembridge http://www.chembridge.com and Chemsynthesis http://www.chemsynthesis.com databases served as the sources of compounds representing the non-biological chemical space. These databases were selected because they comprise synthetic compounds for chemical synthesis and drug screening and design. After curation, a set of 375,930 structures represented the synthetic compounds list. Chemsynthesis and Chembridge databases mainly contain compounds with low molecular weights (a maximum atom count of 53 atoms per compound). Accordingly, 143 of the 1,564 KEGGscafs (with atom count between 54 and 80) were eliminated from any testing set throughout this study and were only used for superstructure scaffold matching. This restriction was enforced to ensure that the sole discrimination between a compound being biological or non-biological is based on the structure of a compound and not on the number of atoms in that compound.
A total of 2,842 compounds, with at least 4 atoms and at most 53, were used to train and test our predictive model. Half of those compounds were obtained from the scaffolds list (representing the endogenous mammalian chemical space) and the other half from the synthetic compounds list (representing the non-biological chemical space). The later molecules were randomly selected from the synthetic dataset to match the atom count distribution of the 1,421 biological set.
To estimate the performance of our predictive model and compare it with that of BioSM, four external validation sets were used: one set of putative human metabolites, one set of plant secondary metabolites, one set of drugs, and one set of synthetic compounds. For each dataset, any compound with a structure identical to any of those in the scaffolds list was removed. Also, structures found in more than one dataset were removed from all datasets except one. Molecules in each dataset had to satisfy both mass (50 - 700 Da) and atom count (4 - 53 atoms) constraints to allow for a fair comparison between BioSMXpress and BioSM. Additionally, compounds with at least one non-biological substructure (NBS) were eliminated. NBSs are substructures that are not commonly found in mammalian biochemical compounds. This decision was based on our interest in comparing the core predictive models of BioSM and BioSMXpress since in reality, NBS filters will be applied to both models before any scaffold comparisons are involved. For more details on the curation process followed please refer to .
The following is a brief description of the five datasets. Please note that the numbers of compounds reported below refer to the datasets after curation. The first dataset consisted of 2,329 compounds and was obtained from HMDB version 2.5 representing putative human metabolites. The second dataset consists of 2,416 secondary plant metabolites, as specified by KEGG, representing plant structures. The drug dataset was represented by 3,282 compounds and was obtained from DrugBank  version 3.0 and from the 1989 USAN and the USP Dictionary of Drug Names. A randomly chosen set of approximately 46,203 molecules was from the National Center for Biotechnology Information's (NCBI) PubChem database . PubChem is the largest freely accessible compound database currently available. Finally, a set of 374,509 compounds from the Chembridge and Chemsynthesis databases, not used in training the model, were used as a synthetic compound test set.
Classification methods selection
Four classification methods were proposed for BioSMXpress specifically, the SSF method which refers to finding a substructure scaffold match or a superstructure scaffold match in the sorted scaffolds list was utilized. SSSF refers to searching for the best substructure scaffold similarity score (Sc sub ), if existent, and the best superstructure scaffold similarity score (Sc super ), if existent. It declares the candidate as biological if Sc sub + Sc super ≥ sumThr. From my experience with BioSM, I found that distributing candidates into mass bins, with each bin having its own threshold values, showed an improvement in the prediction quality. Thus, we decided to test if the same concept applies here. I split the set of test compounds into five bins based on the number of atoms per compound and used CV to evaluate the model. This introduced two more methods, SSB and SSSB, similar to SSF and SSSF respectively, except that there are independent thresholds assigned to each bin.
Cross Validation (CV) is a method used for estimating how accurately a predictive algorithm will perform in practice while avoiding overfitting as well as tuning meta-parameters . In this study, we used a nested CV framework where parameter tuning was performed by executing 5-fold CV on the training data and the classification accuracy was empirically assessed using 2-fold CV on the testing data. For the results of each training fold, the score where SENS = SPEC was recorded as the cutoff threshold of that fold. This process was repeated 5 times to insure that each of the 5 parts was evaluated. Then the average thresholds of all five training sets were used as the cutoff values when evaluating the CV testing data as explained in.
Mean and standard deviation of accuracy measures obtained for 15 cross validation experiments using 4 different scoring.
Leave-One-Out cross validation analysis
As an additional method to evaluate how well BioSMXpress can identify endogenous mammalian biochemical structures using the SSB classifier, we carried out a set of LOOCV experiments using the SSB method with the averaged subThr, superThr and bin boundaries determined by CV as explained in the Methods section. Here, 1,421 experiments representing KEGG structures (with atom count between 4 and 53 atoms/compound) were carried out. Please note that in every experiment, the scaffolds list was composed of 1,420 compounds plus 143 compounds (those with atom count between 54 and 80 atoms/compound) as the scaffolds list. As a result, BioSMXpress annotated 94% of the 1,421 compounds as being biological structures.
BioSMXpress performed best when identifying compounds in the first bin (99% positive identification of biochemical compounds) while BioSM was able to predict only 92% of those compounds. In general, Figure 2 indicates that BioSMXpress is better at identifying biochemical compounds in the first and second bins (99% and 97%, respectively), while BioSM is better at identifying biochemical compounds in the fourth and fifth bins (96% and 94%, respectively).
A note worth mentioning is that in addition to the 1,387 scaffolds annotated, BioSMXpress was able to positively identify 34 compounds in the scaffolds list that were rejected by BioSM without classification due to mass restrictions (masses were greater than 700 Daltons). This indicates that BioSMXpress has broadened the range of compounds examined just by restricting the number of atoms in a candidate compound versus its molecular mass.
Predictive results using the SSB classifier for 6 different datasets.
Number of Compounds
In addition to these four prospective datasets, a set of 374,509 synthetic compounds (represented by Chembridge and Chemsynthesis compounds) were evaluated by BioSMXpress and BioSM with 36% and 33% of these being predicted to be biochemical, respectively. Overall, the comparison in Table 2 shows that the biochemical prediction percentages made by BioSM and BioSMXpress are practically comparable except that BioSMXpress predicted 10% lesser compounds of the PubChem compounds as biological.
Execution and CPU time comparison
Obviously, BioSMXpress impressively outperformed BioSM across all datasets. BioSMXpress was 10 times faster than BioSM when analyzing 10,000 compounds. Across all datasets examined, BioSMXpress provided an 8 times average speed up over BioSM. Another interesting observation is that it takes BioSM an average of 6 minutes and 51 seconds to evaluate 1,000 compounds while it takes BioSMXpress an average of 5 minutes and 8 seconds to evaluate 10,000 compounds (10 times more compounds in less time).
This drastic difference in run time and CPU time can be explained by observing the number of scaffold comparisons required by each predictive model to make a prediction about any given candidate compound. BioSM needs to compare the candidate structure with each and every structure in the scaffolds list accumulating scores and then finally comparing that score with a threshold to make a prediction. BioSMXpress intelligently selects and sorts the scaffolds that would produce the highest match scores, based on the thresholds, if they were to match the candidate. Only a portion of the scaffolds are added to the list that is actually used by BioSMXpress as the scaffolds list and once a match is found the candidate is predicted to be biological with no other computational steps further needed.
In this paper, we describe the development and validation of BioSMXpress, an efficient supervised cheminformatics tool that uses endogenous mammalian biochemical scaffolds to predict whether a candidate chemical structure is biochemical or synthetic. BioSMXpress is at average 8 times faster than BioSM without compromising the accuracy of the predictions. BioSMXpress was able to correctly classify 94% of 1,421 biochemical compounds in a set of leave-one-out cross validation experiment. Thus BioSMXpress may be useful for searching large chemical databases in metabolomics applications where the number candidates is extremely large as well as the number of potential false positives in an efficient manner.
This work has been supported in part by the NIH Grant R01-LM010101 and the Booth Engineering Center for Advance Technology (BECAT) at the University of Connecticut.
This article has been published as part of BMC Bioinformatics Volume 16 Supplement 5, 2015: Selected articles from the 10th International Symposium on Bioinformatics Research and Applications (ISBRA-14): Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/16/S5.
- Villas-Bôas SG, Bruheim P: The potential of metabolomics tools in bioremediation studies. Omics : a journal of integrative biology. 2007, 11 (3): 305-13. 10.1089/omi.2007.0005.View ArticlePubMedGoogle Scholar
- Kertesz T, Hill DW, Albaugh D, Hall L, Hall L, Grant DF: Database searching for structural identification of metabolites in complex biofluids for mass spectrometry-based metabonomics. Bioanalysis. 2009, 1 (9): 1627-1643. 10.4155/bio.09.145.View ArticlePubMedGoogle Scholar
- Roessner U, Bowne J: What is metabolomics all about?. BioTechniques. 2009, 46 (5): 363-365. 10.2144/000113133.View ArticlePubMedGoogle Scholar
- Nobeli I, Ponstingl H, Krissinel EB, Thornton JM: A structure-based anatomy of the E.coli metabolome. Journal of Molecular Biology. 2003, 334 (4): 697-719. 10.1016/j.jmb.2003.10.008.View ArticlePubMedGoogle Scholar
- Gupta S, Aires-de-Sousa Ja: Comparing the chemical spaces of metabolites and available chemicals: models of metabolite-likeness. Molecular Diversity. 2007, 11 (1): 23-36. 10.1007/s11030-006-9054-0.View ArticlePubMedGoogle Scholar
- Peironcely JE, Reijmers T, Coulier L, Bender A, Hankemeier T: Understanding and classifying metabolite space and metabolite-likeness. PloS One. 2011, 6 (12): doi:10.1371/journal.pone.0028966Google Scholar
- Breiman L: Random forests. Machine Learning. 2001, Kluwer Academic Publishers, 45: 5-32. 10.1023/A:1010933404324. [http://onlinelibrary.wiley.com/doi/10.1002/cbdv.200490137/abstract]Google Scholar
- James CA, Weininger D, Delany J: Fingerprints - Screening and Similarity. Daylight Theory Manual. Edited by: Irvine, CA and Santa Fe, NM. 2000, Daylight Chemical Information Systems, Inc, doi:10.1002/mus.23684., [http://www.ncbi.nlm.nih.gov/pubmed/23055325]Google Scholar
- Kanehisa M, Goto S, Kawashima S, Nakaya A: The KEGG databases at GenomeNet. Nucleic Acids Research. 2002, 30 (1): 42-46. 10.1093/nar/30.1.42.PubMed CentralView ArticlePubMedGoogle Scholar
- Wishart DS, Knox C, Guo AC, Eisner R, Young N, Gautam B, Hau DD, Psychogios N, Dong E, Bouatra S, Mandal R, Sinelnikov I, Xia J, Jia L, Cruz Ja, Lim E, Sobsey Ca, Shrivastava S, Huang P, Liu P, Fang L, Peng J, Fradette R, Cheng D, Tzur D, Clements M, Lewis A, De Souza A, Zuniga A, Dawe M, Xiong Y, Clive D, Greiner R, Nazyrova A, Shaykhutdinov R, Li L, Vogel HJ, Forsythe I: HMDB: a knowledgebase for the human metabolome. Nucleic Acids Research. 2009, 603-610. doi:10.1093/nar/gkn810, 37 DatabaseGoogle Scholar
- Hamdalla MA, Mandoiu II, Hill DW, Rajasekaran S, Grant DF: BioSM: A chemoinformatics tool for identifying biochemical structures in chemical structure space. Journal of Chemical Information and Modeling. 2012Google Scholar
- Rahman SA, Bashton M, Holliday GL, Schrader R, Thornton JM: Small Molecule Subgraph Detector (SMSD) toolkit. Journal of Cheminformatics. 2009, 1 (12): 10.1186/1758-2946-1-12Google Scholar
- Knox C, Law V, Jewison T, Liu P, Ly S, Frolkis A, Pon A, Banco K, Mak C, Neveu V, Djoumbou Y, Eisner R, Guo AC, Wishart DS: DrugBank 3.0: a comprehensive resource for 'omics' research on drugs. Nucleic Acids Research. 2011, 1035-41. doi:10.1093/nar/gkq1126, 39 DatabaseGoogle Scholar
- Bolton EE, Wang Y, Thiessen PA, Bryant SH: PubChem: Integrated Platform of Small Molecules and Biological Activities. Annual Reports in Computational Chemistry. 2008, American Chemical Society, Washington, 4 (Chap 12): 217-241. 4Google Scholar
- Hastie T, Tibshirani R, Friedman J: The Elements of Statistical Learning. 2009, Springer, [http://www-stat.stanford.edu/texttildelowtibs/ElemStatLearn/]2View ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.