Natural product-likeness score revisited: an open-source, open-data implementation
© Jayaseelan et al.; licensee BioMed Central Ltd. 2012
Received: 26 October 2011
Accepted: 20 May 2012
Published: 20 May 2012
Natural product-likeness of a molecule, i.e. the similarity of this molecule to the structure space covered by natural products, is a useful criterion in screening compound libraries and in designing new lead compounds. A closed-source implementation of a natural product-likeness score, which finds application in virtual screening, library design and compound selection, has previously been reported by one of us. In this note, we report an open-source and open-data re-implementation of this scoring system, illustrate its efficiency in ranking small molecules for natural product-likeness and discuss its potential applications.
The Natural-Product-Likeness scoring system is implemented as Taverna 2.2 workflows and is available under the Creative Commons Attribution-Share Alike 3.0 Unported License at http://www.myexperiment.org/packs/183.html. It is also available for download as an executable standalone Java package from http://sourceforge.net/projects/np-likeness/ under the Academic Free License.
Our open-source, open-data Natural-Product-Likeness scoring system can be used as a filter for metabolites in Computer Assisted Structure Elucidation or to select natural-product-like molecules from molecular libraries for the use as leads in drug discovery.
Natural products (NPs) are small molecules synthesised by living organisms. In drug discovery, the class of NPs termed secondary metabolites, which are involved in defence or signalling, is of particular importance because these molecules were optimised during evolution to interact effectively with biological receptors. They are therefore good starting points for designing new drugs. Hence, the Natural Product-likeness (NP-likeness) of a chemical structure can serve as a criterion in lead compound selection and in designing novel drugs. To estimate the NP-likeness of a molecule, prior knowledge such as the physicochemical and structural properties of existing natural products has to be captured. In this work, we focus only on identifying structural features typical of natural products and, based on their presence, rank molecules of interest according to their NP-likeness.
CDK-Taverna version 2 [2, 3] is an open-source Java toolkit for cheminformatics tasks that makes use of the pipelining technology offered by Taverna version 2.2, an open-source workflow management system. The CDK-Taverna 2 plug-in is based on the Chemistry Development Kit (CDK) [5, 6] and a few other open-source Java libraries. The individual components required to score a small molecule for NP-likeness are implemented as CDK-Taverna workflows so that users without a programming background can use them intuitively. Source code for the CDK-Taverna 2 workers is freely available at https://sourceforge.net/projects/cdktaverna2/.
The scorer is also available as a standalone Java ARchive (JAR) package to be used as a library component in stand-alone or web applications. The standalone JAR and the source code are freely available for download at http://sourceforge.net/projects/np-likeness/.
Integration of NP-Likeness scorer components with CDK-Taverna 2.2
CDK-Taverna 2 [2, 3] has drag-and-drop components (workers) to build cheminformatics workflows ranging from parsing a molecule file via fingerprinting and clustering to more advanced tasks such as reaction enumeration. The full features of the CDK-Taverna 2.0 plug-in, its installation procedure and example workflows are available at http://cdk-taverna-2.ts-concepts.de/wiki/. CDK-Taverna 2 provides a set of workers commonly used in cheminformatics workflows. To provide additional functionality, individual components such as those required to score a small molecule for NP-likeness are bundled as sub-packages within the existing CDK-Taverna 2 plug-in. The NP-likeness sub-packages comprise workers for molecule curation, fragment generation and fragment scoring, all of which can readily be integrated into other data analysis workflows.
Components for molecule curation
Before being evaluated for NP-likeness, molecules have to be pre-processed to remove small disconnected fragments such as counter-ions and fragments containing metallic elements. In the previous study, commercial tools such as PipelinePilot and Molinspiration [7, 8] were used to standardise molecules. These curation workers are now implemented in an open manner within the CDK-Taverna 2.0 plug-in and are available under the folder “Molecule curation”. To start with, the Molecule Connectivity Checker worker checks for disconnected parts in the molecule. If any are found, the user has the option of configuring the minimum atom count for a fragment to be retained. As suggested by Ertl et al., the default minimum atom-count cut-off is set to 6, so, unless modified, disconnected fragments with fewer than 6 atoms are removed from the molecule. The Curate Strange Elements worker filters molecules, removing those that contain elements other than C, H, N, O, P, S, F, Cl, Br, I, As, Se or B. As another standardisation step, deglycosylation is needed to remove sugar moieties from the molecules. The Remove Sugar Group worker identifies all sugar moieties in the structure and removes those that are linked to the scaffold by a glycosidic bond. This is done in order to retain core structural features that are more typical of natural products and to omit features such as sugar moieties that are less distinctive, albeit commonly present in natural products. Removal of sugars is not expected to improve the score but to facilitate classifications based only on chemically interesting structural features.
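The first two curation steps described above can be sketched in a few lines of Python. This is only an illustrative sketch and not the CDK-Taverna workers themselves: the molecule representation (a list of element symbols plus a list of bonded atom-index pairs) and the function names `connected_fragments` and `curate` are assumptions made for this example, atoms are counted exactly as supplied (the text does not say whether hydrogens count towards the cut-off), and sugar removal is not covered.

```python
from collections import deque

# Elements retained by the element filter, as listed in the text.
ALLOWED_ELEMENTS = {"C", "H", "N", "O", "P", "S", "F", "Cl", "Br", "I", "As", "Se", "B"}


def connected_fragments(bonds, n_atoms):
    """Return one sorted list of atom indices per connected fragment."""
    neighbours = {i: set() for i in range(n_atoms)}
    for a, b in bonds:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen, fragments = set(), []
    for start in range(n_atoms):
        if start in seen:
            continue
        # Breadth-first search collects every atom reachable from `start`.
        queue, fragment = deque([start]), []
        seen.add(start)
        while queue:
            atom = queue.popleft()
            fragment.append(atom)
            for nb in neighbours[atom]:
                if nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        fragments.append(sorted(fragment))
    return fragments


def curate(elements, bonds, min_atoms=6):
    """Drop fragments below the atom-count cut-off, then apply the element filter.

    Returns the retained fragments, or None when a retained fragment
    contains an element outside ALLOWED_ELEMENTS (molecule rejected).
    """
    kept = [f for f in connected_fragments(bonds, len(elements)) if len(f) >= min_atoms]
    for fragment in kept:
        if any(elements[i] not in ALLOWED_ELEMENTS for i in fragment):
            return None
    return kept


# Hypothetical input: heavy atoms of a benzoate with a disconnected sodium ion.
benzoate_elements = ["C"] * 7 + ["O", "O", "Na"]
benzoate_bonds = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (0, 6), (6, 7), (6, 8)]
curated = curate(benzoate_elements, benzoate_bonds)

# Hypothetical input: a six-atom chain ending in silicon, which the element filter rejects.
silane_elements = ["C"] * 5 + ["Si"]
silane_bonds = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
rejected = curate(silane_elements, silane_bonds)
```

In this sketch the sodium ion is discarded as a one-atom fragment before the element filter runs, so the benzoate is retained, whereas the silicon-containing chain is rejected as a whole.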
Component for atom signature generation
After standardisation, the molecule curation workers pass on the curated structures. Further down the workflow, these are consumed by another worker that generates their atom signatures. Atom signatures are structural descriptors: canonical, circular descriptions of an atom’s environment in a molecule. The atom signature of a given atom in a molecule is a directed acyclic graph of its connected atoms, where every node in the graph is an atom and the edges are the bonds between the atoms. The number of neighbourhood levels included around an atom is the signature height of that atom. A molecular signature is the summation of all atom signatures of a molecule. The successful use of molecular signatures has been reported in various studies, ranging from QSAR calculations to the prediction of enzyme-metabolite and target-drug interactions [9, 10]. In their original implementation, Ertl et al. used HOSE codes, an earlier circular description of atom environments suggested by Bremser for use in NMR spectrum prediction. Atom signatures and HOSE codes capture the same circular description of an atom environment and differ only in their string representation. Since we had a well-tested, efficient implementation of signatures in the CDK, provided by Torrance, we decided to test whether it would give the expected identical results as the HOSE code-based implementation of the original work by Ertl et al. The Generate Atom Signatures worker in the “Signature Scoring” folder takes a structure as input, generates its atom signatures and tags them with the molecule’s UUID to keep track of each signature’s origin. The signature height (the number of spheres in the atom environment used for signature generation) is configurable; we used atom signatures of height 2 (the default), as this was sufficient to capture the relevant structural features in small molecules.
The atom signatures generated for huge training datasets are usually written out to a text file and stored for re-use. This feature is shown in Figure 1.
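The idea of a height-limited, canonical description of an atom environment can be illustrated with a deliberately simplified sketch. It is not Faulon’s signature algorithm and not the CDK implementation: it unfolds the neighbourhood as a tree rather than a directed acyclic graph, ignores bond orders, and canonicalises only by sorting branch strings. The function names, the bracketed string format and the adjacency-list input are assumptions made for this example.

```python
import uuid


def atom_signature(elements, neighbours, root, height=2):
    """Simplified circular descriptor of `root`, expanded to `height` spheres."""

    def expand(atom, parent, depth):
        label = "[" + elements[atom] + "]"
        if depth == 0:
            return label
        # Sorting the branch strings makes the result independent of atom order.
        branches = sorted(
            expand(nb, atom, depth - 1) for nb in neighbours[atom] if nb != parent
        )
        return label + "".join("(" + branch + ")" for branch in branches)

    return expand(root, None, height)


def molecule_signatures(elements, neighbours, height=2):
    """Return (molecule id, atom signature) pairs, one per atom."""
    molecule_id = str(uuid.uuid4())  # tag so each signature can be traced to its molecule
    return [
        (molecule_id, atom_signature(elements, neighbours, atom, height))
        for atom in range(len(elements))
    ]


# Hypothetical input: heavy atoms of ethanol, C0-C1-O2.
ethanol_elements = ["C", "C", "O"]
ethanol_neighbours = {0: [1], 1: [0, 2], 2: [1]}
tagged = molecule_signatures(ethanol_elements, ethanol_neighbours)
central = atom_signature(ethanol_elements, ethanol_neighbours, 1)  # "[C]([C])([O])"
```

Collecting all atom signatures of a molecule, as `molecule_signatures` does, corresponds to the molecular signature described above; the shared identifier plays the role of the UUID tag.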
Component for NP-likeness score calculation
The performance of the NP-likeness score depends, of course, on the choice of natural products and synthetic molecules in the training dataset. For the analysis of our engine’s performance, natural products, synthetic molecules and query compound collections were all obtained exclusively from open access databases. Our first subset of natural products (22,876 molecules) originates from the ChEMBL database, from which we selected molecules extracted from the Journal of Natural Products. The second subset of natural products (39,162 molecules) comes from the Traditional Chinese Medicine Database @ Taiwan (TCM). Together, the natural product training set comprised 58,018 non-redundant structures. The training set of synthetic molecules comprised 113,425 clean lead-like compounds selected from the ZINC database. Small molecules from DrugBank and the Human Metabolome Database (HMDB) were treated as our test sets. In addition, PubMed abstracts reporting the isolation of new NPs were text-mined for natural product names, and the names were converted into SMILES using the Chemical Identifier Resolver. The resulting set of 3610 non-redundant NPs was used as a further test set.
To validate our scoring system, the 3610 text-mined NPs together with an additional 5000 synthetic molecules were scored using both our system and the original implementation by Ertl et al. Despite the much larger training set of the original system, the scores obtained showed a good correlation (r = 0.94). Furthermore, the scores obtained for the test set after replacing the training data in the original system with our open data showed a very good correlation (r = 0.97). Considering that the two cheminformatics toolkits used to calculate the values differ slightly in their handling of aromaticity, tautomerism and molecule normalisation, and that they use slightly different types of substructure fragments, we consider this agreement very good and a full validation of the new implementation of NP-likeness.
We have presented an open-source, open-data implementation of the Natural-Product-likeness scorer originally described by Ertl et al. Workflows for curation, training and scoring are implemented in the open-source workflow tool CDK-Taverna and published at myexperiment.org. A version of the scorer is available as a command-line executable and as a library for inclusion in stand-alone or web applications. Training and test sets were extracted from open access databases such as ChEMBL, TCM, ZINC, DrugBank and HMDB. We replaced HOSE codes with Faulon’s atom signatures as our circular fingerprint implementation, which showed similar performance. With the available open data and open-source toolkits, we have implemented an NP-likeness scoring engine and successfully demonstrated its capability to differentiate the natural product compound collection from synthetic and drug compound collections, in line with what was reported in the original paper. The engine can be used as a filter to remove improbable metabolite structures from chemical spaces generated in Computer Assisted Structure Elucidation (CASE) or to select natural-product-like molecules from molecular libraries for use as leads in drug discovery. The open-source, open-data implementation allows other researchers to modify the workflows or to use larger collections of training molecules once they become available.
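Why the score depends so strongly on the two training sets can be seen from a small sketch. This section does not state the scoring formula, so the sketch assumes the log-ratio form described by Ertl et al.: each fragment contributes the logarithm of its frequency among natural products relative to its frequency among synthetic molecules, and the sum is divided by the number of atoms. The pseudo-count used to avoid division by zero, the function names and the toy fragment labels are all assumptions of this example and are not taken from the implementation.

```python
import math
from collections import Counter


def fragment_counts(molecules):
    """Count, for each fragment, the number of molecules that contain it."""
    counts = Counter()
    for fragments in molecules:
        counts.update(set(fragments))  # a fragment counts once per molecule
    return counts


def np_likeness(query, np_counts, sm_counts, np_total, sm_total, pseudo=1.0):
    """Mean log-ratio of NP versus synthetic fragment frequencies for one molecule.

    `query` holds one fragment (atom signature) per atom, so its length is
    used as the atom count for normalisation. `pseudo` is an assumed
    smoothing term that keeps unseen fragments from producing log(0).
    """
    if not query:
        return 0.0
    total = 0.0
    for fragment in query:
        np_freq = (np_counts.get(fragment, 0) + pseudo) / (np_total + pseudo)
        sm_freq = (sm_counts.get(fragment, 0) + pseudo) / (sm_total + pseudo)
        total += math.log(np_freq / sm_freq)
    return total / len(query)


# Toy training sets with made-up fragment labels (not real signatures).
np_training = [["a", "b"], ["a", "c"], ["a", "b"]]
sm_training = [["d", "b"], ["d", "e"], ["d", "d"]]
np_counts = fragment_counts(np_training)
sm_counts = fragment_counts(sm_training)

np_like = np_likeness(["a", "a"], np_counts, sm_counts, len(np_training), len(sm_training))
synthetic_like = np_likeness(["d", "d"], np_counts, sm_counts, len(np_training), len(sm_training))
```

Under these assumptions a molecule built from fragments seen only in the natural product set scores positive, and one built from fragments seen only in the synthetic set scores negative; changing either training set changes the fragment counts and therefore every score.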
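The comparison between two scoring implementations can be reproduced in outline as follows. The sketch assumes that the reported r-value is a Pearson correlation coefficient, which the text does not state explicitly; the function name and the score lists are hypothetical and are not data from this study.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least two values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Hypothetical scores for the same five molecules from two implementations.
scores_a = [1.8, 0.4, -0.9, -1.5, 2.3]
scores_b = [1.6, 0.7, -1.1, -1.2, 2.0]
r_value = pearson_r(scores_a, scores_b)
```

Because the coefficient is insensitive to a constant offset or scale factor between the two score sets, it measures agreement in ranking and relative spacing rather than identical absolute values.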
KJ thanks PE and CS for their valuable suggestions and advice in implementing the scoring system. KJ also thanks her colleagues from Chemoinformatics and metabolism group at EBI for their active support and critical comments. All authors are very grateful to the open-source communities of CDK, Taverna and CDK-Taverna. This work was supported by the funds from the EMBL-EBI.
1Chemoinformatics and Metabolism, European Bioinformatics Institute (EBI), Cambridge, UK. 2Institute for Bioinformatics and Cheminformatics, University of Applied Sciences of Gelsenkirchen, Recklinghausen, Germany. 3Novartis Institutes for BioMedical Research, CH-4056 Basel, Switzerland.
- Ertl P, Roggo S, Schuffenhauer A: Natural Product-likeness Score and Its Application for Prioritization of Compound Libraries. J Chem Inf Model 2008, 48(1):68–74. 10.1021/ci700286x
- Steinbeck C, Hoppe C, Kuhn S, Guha R, Willighagen EL: CDK-Taverna: an open workflow environment for cheminformatics. BMC Bioinformatics 2010, 11: 159. 10.1186/1471-2105-11-159
- Truszkowski A, Jayaseelan KV, Neumann S, Willighagen EL, Zielesny A, Steinbeck C: New developments on the cheminformatics open workflow environment CDK-Taverna. J Cheminform 2011, 3: 54. 10.1186/1758-2946-3-54
- Oinn T, Addis M, Ferris J, Marvin D, Senger M, Greenwood M, Carver T, Glover K, Pocock M, Wipat A, Li P: Taverna: a tool for the composition and enactment of bioinformatics workflows. Bioinformatics 2004, 20(17):3045–3054. 10.1093/bioinformatics/bth361
- Steinbeck C, Han YQ, Kuhn S, Horlacher O, Luttmann E, Willighagen E: The Chemistry Development Kit (CDK): An open-source Java library for chemo- and bioinformatics. Journal of Chemical Information and Computer Sciences 2003, 43(2):493–500.
- Steinbeck C, Hoppe C, Kuhn S, Guha R, Willighagen EL: Recent Developments of The Chemistry Development Kit (CDK) - An Open-Source Java Library for Chemo- and Bioinformatics. Current Pharmaceutical Design 2006, 12(17):2111–2120. 10.2174/138161206777585274
- Pipeline Pilot, Version 6.0; Scitegic Inc.: San Diego, CA 2007. http://www.scitegic.com
- Molinspiration Cheminformatics mib package, Version 2007.03; Molinspiration Cheminformatics: Slovensky Grob, Slovak Republic 2007. http://www.molinspiration.com
- Faulon JL, Visco DP, Pophale RS: The Signature Molecular Descriptor. 1. Using Extended Valence Sequences in QSAR and QSPR Studies. J Chem Inf Model 2003, 43(3):707–720. 10.1021/ci020345w
- Faulon JL, Misra M, Martin S, Sale K, Sapra R: Genome scale enzyme-metabolite and drug-target interaction predictions using the signature molecular descriptor. Bioinformatics 2008, 24(2):225–233. 10.1093/bioinformatics/btm580
- Bremser W: HOSE - A Novel Substructure Code. Anal Chim Acta 1978, 103: 355–365. 10.1016/S0003-2670(01)83100-7
- Torrance G: Implementation of Faulon’s atom signatures in Chemistry Development Kit. Internal communication
- Gaulton A, Bellis LJ, Bento AP, Chambers J, Davies M, Hersey A, Light Y, McGlinchey S, Michalovich D, Al-Lazikani B, Overington JP: ChEMBL: a large-scale bioactivity database for drug discovery. Nucl Acids Res 2011, 1–8. 10.1093/nar/gkr777
- Chen CYC: TCM Database @ Taiwan: The World Largest Traditional Chinese Medicine Database for Drug Screening In Silico. PLoS ONE 2011, 6(1):e15939. 10.1371/journal.pone.0015939
- Irwin JJ, Shoichet BK: ZINC - A free database of commercially available compounds for virtual screening. Journal of Chemical Information and Modeling 2005, 45(1):177–182. 10.1021/ci049714+
- Knox C, Law V, Jewison T, Liu P, Ly S, Frolkis A, Pon A, Banco K, Mak C, Neveu V, Djoumbou Y, Eisner R, Guo AC, Wishart DS: DrugBank 3.0, a comprehensive resource for omics research on drugs. Nucleic Acids Res 2011, 39(Database issue):D103.
- Wishart DS, Knox C, Guo AC, Eisner R, Young N, Gautam B, Hau DD, Psychogios N, Dong E, Bouatra S, Mandal R, Sinelnikov I, Xia J, Jia L, Cruz AJ, Lim E, Sobsey CA, Shrivastava S, Huang P, Liu P, Fang L, Peng J, Fradette R, Cheng D, Tzur D, Clements M, Lewis A, De Souza A, Zuniga A, Dawe M, Xiong Y, Clive D, Greiner R, Nazyrova A, Shaykhutdinov R, Li L, Vogel HJ, Forsythe L: HMDB - a knowledgebase for the human metabolome. Nucleic Acids Res 2009, 37(Database issue):D603–10.
- NCI/CADD Chemical Identifier Resolver [http://cactus.nci.nih.gov/chemical/structure]
- Dobson PD, Patel Y, Kell DB: ‘Metabolite-likeness’ as a criterion in the design and selection of pharmaceutical drug libraries. Drug Discovery Today 2009, 14(1–2):31–40. 10.1016/j.drudis.2008.10.011
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.