- Methodology article
- Open Access
Predictive modeling of anti-malarial molecules inhibiting apicoplast formation
© Jamal et al; licensee BioMed Central Ltd. 2013
- Received: 23 July 2012
- Accepted: 4 February 2013
- Published: 15 February 2013
Malaria is a major healthcare problem worldwide resulting in an estimated 0.65 million deaths every year. It is caused by the members of the parasite genus Plasmodium. The current therapeutic options for malaria are limited to a few classes of molecules, and are fast shrinking due to the emergence of widespread resistance to drugs in the pathogen. The recent availability of high-throughput phenotypic screen datasets for antimalarial activity offers a possibility to create computational models for bioactivity based on chemical descriptors of molecules with potential to accelerate drug discovery for malaria.
In the present study, we used high-throughput screen datasets for the discovery of apicoplast inhibitors of the malarial pathogen, as assayed by the delayed death response. We employed a machine learning approach and developed computational models to predict the biological activity of new antimalarial compounds. The molecules were further evaluated for common substructures using a Maximum Common Substructure (MCS) based approach.
We created computational models using state-of-the-art machine learning algorithms and evaluated them on multiple statistical criteria. We found that the Random Forest based approach provided better accuracy, as assessed by ROC curve analysis. We further evaluated the active molecules using a substructure based approach to identify common substructures enriched in the active set. We argue that the computational models generated could be used effectively to screen large molecular datasets and prioritize molecules for phenotypic screens, drastically reducing cost while improving the hit rate.
- Random Forest
- Area Under Curve
- Misclassification Cost
- Receiver Operating Characteristic Curve
Malaria is a major health problem across the world, more so in the tropics and especially in developing nations [1]. According to the World Malaria Report released by the World Health Organization (WHO) in 2011, there were about 216 million cases of malaria across the globe and 0.65 million deaths in 2010, with the highest mortality among children living in Africa [2, 3]. Malaria is a mosquito-borne disease caused by protozoan parasites belonging to the genus Plasmodium. P. falciparum, P. vivax, P. ovale and P. malariae are the four species routinely implicated as causative agents in humans; P. falciparum is the most commonly encountered and the deadliest of them, and is associated with 90% of the fatalities in Africa [4, 5]. Endemic to the tropical and subtropical regions of Africa, Asia, and South and Central America, where hot and humid climatic conditions prevail, malaria has been identified as a major constraint to economic development [6-8].
One of the major roadblocks in the adequate control of malaria has been the limited therapeutic options available for its treatment. The commonly used classes of drugs are limited to aminoquinolines and their derivatives such as arylamino alcohols, methanols, biguanides, diaminopyrimidines and antimalarial endoperoxides. Chloroquine and primaquine have been used extensively for the treatment and prophylaxis of malaria [9, 10]. However, widespread resistance to available therapeutic agents and the emergence of multi-drug resistant strains have left few treatment options [11-14]. The current pipeline for anti-malarial drug discovery is also limited, with just 13 products in clinical trials and 8 in preclinical stages of development [15]. Large scale collaborative initiatives have made it possible to assemble large datasets of chemical structure information online [16]. This has been complemented by the annotation of the biological activities of these molecules, many of which have been derived from high-throughput bioassays made possible by recent advances in assay automation. The availability of chemical structure and bioactivity information in standardized forms provides an opportunity to understand the correlation between chemical properties and activities, and to create predictive computational models for bioactivity [17, 18]. Such models make it possible to screen large molecular datasets computationally, which can improve the hit rate and thereby reduce the overall cost of drug discovery. We have previously generated such predictive models for anti-tubercular molecules [19, 20] and for small molecule modulators of miRNA [21].
In the present study, we applied machine learning techniques to create classification models from high-throughput screens of anti-malarial agents that inhibit the development of the apicoplast in the malaria parasite, P. falciparum. In addition, we used a Maximum Common Substructure (MCS) based approach to identify substructures enriched in the bioactive molecules. Our results suggest that efficient and accurate computational predictive models can be built to screen large datasets in silico and could be used to prioritize molecules for high-throughput screens.
Descriptor generation and model construction
Evaluation of substructures
Significantly enriched scaffolds in the active dataset
Malaria is a neglected tropical disease. Widespread resistance to commonly used anti-malarials has limited the available therapeutic options and warrants a search for novel molecules with anti-malarial activity. The availability of high-throughput chemical screens in the public domain provides an excellent opportunity to create predictive computational models that prioritize molecules using a virtual screening approach. Such an approach would not only aid the rapid screening of compounds but also enhance the identification of true hits, thereby reducing the cost of biological screens. Our analysis shows that a systematically designed computational model for activity based on chemical descriptors could potentially be used for virtual screening. The work encompasses a machine learning based framework for building in silico predictive models from high-throughput screen datasets for apicoplast inhibitors of the malaria parasite. Comparative analysis of the classifiers revealed that Random Forest performed better than both Naive Bayes and J48. The study was extended to explore potentially enriched substructures in the bioactive molecules, which resulted in the identification of 20 significantly enriched scaffolds. Predictive models, in conjunction with the enriched scaffold information, could be used as a molecular filtering criterion for prioritizing molecules for biological screens for anti-malarial activity.
Source of bioassay data
The cell based assay used in the current study [AID: 504834] consists of antimalarial compounds and was obtained from the PubChem database maintained by the National Center for Biotechnology Information (NCBI) [16]. Briefly, the bioassay contained compounds with the potential to inhibit apicoplast formation in Plasmodium. The assay was based on a luciferase reporter, and inhibition of apicoplast formation was assayed by a delayed death response at 96 hours. The dataset AID: 504834 contained a total of 323,201 tested compounds. Compounds with a PubChem activity score between 40 and 100 were considered active (N = 22,335), and all compounds with a score of 0 were considered inactive (N = 197,373). Besides the active and inactive sets, the assay depositor also reported two other sets, consisting of inconclusive and unspecified compounds, which were excluded from our study because of the uncertainty in their bioactivities. The compounds from the active and inactive datasets were downloaded in Structural Data Format (SDF).
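The score-based labelling described above can be illustrated with a minimal Python sketch. The function names and the dictionary of compound identifiers are hypothetical and are not part of the original workflow; the sketch also simply drops any score that falls outside the two bins, which is an assumption, since the depositor's own rule for marking compounds as inconclusive or unspecified is not given here.

```python
from typing import Dict, List, Optional, Tuple

ACTIVE_MIN, ACTIVE_MAX = 40, 100  # activity-score range treated as active in the text
INACTIVE_SCORE = 0                # activity score treated as inactive in the text


def label_compound(activity_score: int) -> Optional[str]:
    """Return 'active', 'inactive', or None for a score outside both bins."""
    if ACTIVE_MIN <= activity_score <= ACTIVE_MAX:
        return "active"
    if activity_score == INACTIVE_SCORE:
        return "inactive"
    return None  # assumption: anything else is excluded from modelling


def split_by_label(scores: Dict[str, int]) -> Tuple[List[str], List[str]]:
    """Partition compound IDs (hypothetical keys) into actives and inactives."""
    actives: List[str] = []
    inactives: List[str] = []
    for compound_id, score in scores.items():
        label = label_compound(score)
        if label == "active":
            actives.append(compound_id)
        elif label == "inactive":
            inactives.append(compound_id)
    return actives, inactives
```

For example, `split_by_label({"cid_a": 85, "cid_b": 0, "cid_c": 12})` would place `cid_a` among the actives and `cid_b` among the inactives, and would leave `cid_c` out of both sets.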
Descriptor generation and data pre-processing
2D molecular descriptors were generated for the molecules in the active and inactive datasets using PowerMV [22]. PowerMV is a popular software package, used extensively in the field, for descriptor generation, statistical analysis and molecular similarity search. The datasets contained too many compounds to be processed in a single run, so they were first split into smaller SDF files using the SplitSDFiles Perl script available from MayaChemTools [23]. A total of 179 descriptors were generated using PowerMV. Of these, 147 were pharmacophore fingerprints, 24 were weighted Burden numbers and 8 were property descriptors (Additional file 1). For the bit string descriptors, attributes having only one value (all 0's or all 1's) throughout the dataset were filtered out to reduce the dimensionality of the dataset. Using a custom script, the dataset was split randomly into an 80% train-cum-validation set and a 20% independent test set. Five-fold cross validation was employed on the train-cum-validation set.
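The pre-processing steps above (removal of constant attributes, a random 80/20 split and 5-fold cross validation) can be sketched in plain Python as follows. This is an illustrative sketch and not the custom script used in the study: the function names, the fixed random seed and the unstratified split are all assumptions, and the constant-column filter is applied to every column passed in, whereas the study applied it to the bit string descriptors only.

```python
import random
from typing import Iterator, List, Sequence, Tuple


def drop_constant_columns(rows: List[List[int]]) -> Tuple[List[List[int]], List[int]]:
    """Remove columns holding a single value in every row; return rows and kept indices."""
    if not rows:
        return rows, []
    n_cols = len(rows[0])
    keep = [j for j in range(n_cols) if len({row[j] for row in rows}) > 1]
    return [[row[j] for j in keep] for row in rows], keep


def train_test_split(indices: Sequence[int], test_fraction: float = 0.2,
                     seed: int = 0) -> Tuple[List[int], List[int]]:
    """Shuffle indices and hold out `test_fraction` of them as an independent test set."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)  # seed is an assumption, for repeatability
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def k_folds(indices: Sequence[int], k: int = 5) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, validation) index lists for k-fold cross validation."""
    folds = [list(indices[i::k]) for i in range(k)]
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]
```

With a class ratio as skewed as the one in this dataset, a stratified split that preserves the proportion of actives in each fold may be preferable; the sketch omits this for brevity.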
Cost sensitive classifiers
Machine Learning (ML) is a scientific discipline that deals with the generation of predictive models from known properties learned from training datasets. Here, ML was employed to create binary classifiers that assign molecules to one of two bioactivity classes, active or inactive. One issue to consider when using standard classifiers for model building is the imbalanced nature of the dataset, i.e. the class imbalance problem. Class imbalance arises because in most unbiased high-throughput screens the number of inactive molecules far exceeds the number of actives; in our study the ratio of actives to inactives was about 11% (22,335 to 197,373). Standard classifiers weight all classes equally and so assume that all misclassification errors cost the same, which makes them poorly suited to such highly imbalanced data. One alternative is to use cost sensitive classifiers, in which misclassification costs are assigned explicitly [24]. We used Weka (Waikato Environment for Knowledge Analysis) [25], a popular suite of machine learning algorithms, in our study. Weka supports algorithms for data pre-processing, analysis, classification, clustering and feature selection, as well as visualization tools. Weka introduces cost sensitivity in the base classifiers by means of a cost matrix laid out like the confusion matrix, which for a binary classification scheme consists of four sections: True Positives (TP), actives correctly classified as actives; False Positives (FP), inactives incorrectly classified as actives; True Negatives (TN), inactives correctly classified as inactives; and False Negatives (FN), actives incorrectly classified as inactives. Because False Negatives are considered more costly in an experiment for compound selection, we raised the misclassification cost for False Negatives to reduce their number at the expense of more False Positives.
However, increasing the cost for False Negatives increases both the False Positives and the True Positives. We therefore set an empirical upper limit of 20% on the False Positive rate. The choice of misclassification cost is empirical; no general rule exists for it, and it depends largely on the base classifier used.
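The trade-off described above can be made concrete with a small Python sketch of a minimum expected cost decision rule, together with a search for the largest False Negative cost that keeps the False Positive rate within the 20% limit. This follows the general formulation in the cost-sensitive learning literature and is only an illustration: the function names, the unit cost for False Positives and the grid of candidate costs are assumptions, and the text does not state whether the Weka runs applied this rule or reweighted the training instances instead.

```python
from typing import List, Optional, Sequence


def min_expected_cost_label(p_active: float, cost_fn: float, cost_fp: float = 1.0) -> str:
    """Pick the label with the lower expected misclassification cost."""
    expected_cost_if_inactive = p_active * cost_fn        # risk of a False Negative
    expected_cost_if_active = (1.0 - p_active) * cost_fp  # risk of a False Positive
    if expected_cost_if_active <= expected_cost_if_inactive:
        return "active"
    return "inactive"


def false_positive_rate(probabilities: Sequence[float], labels: Sequence[str],
                        cost_fn: float, cost_fp: float = 1.0) -> float:
    """Fraction of true inactives predicted active under the given costs."""
    predictions = [min_expected_cost_label(p, cost_fn, cost_fp) for p in probabilities]
    false_positives = sum(1 for pred, truth in zip(predictions, labels)
                          if pred == "active" and truth == "inactive")
    negatives = sum(1 for truth in labels if truth == "inactive")
    return false_positives / negatives if negatives else 0.0


def largest_cost_within_fp_limit(probabilities: Sequence[float], labels: Sequence[str],
                                 candidate_costs: List[float],
                                 fp_limit: float = 0.20) -> Optional[float]:
    """Return the largest candidate FN cost whose FP rate stays within the limit."""
    allowed = [c for c in candidate_costs
               if false_positive_rate(probabilities, labels, c) <= fp_limit]
    return max(allowed) if allowed else None
```

Under this rule a compound is called active once its predicted probability of activity reaches cost_fp / (cost_fp + cost_fn), so raising the False Negative cost lowers the decision threshold, which is why both True Positives and False Positives increase together.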
Machine learning encompasses a wide variety of methods and algorithms that extract rules and functions from large datasets. In our study, we used three different classifiers: Naive Bayes, Random Forest and J48. The Naive Bayes classifier is based on Bayes' theorem and assumes that each predictor is conditionally independent of the others given the class [26]. Random Forest (RF), an ensemble of decision trees, was developed by Leo Breiman [27]. J48, an implementation of the popular C4.5 algorithm developed by J. Ross Quinlan, builds decision trees from a set of labelled training data, using the fact that each attribute of the data can be used to make a decision by splitting the data into smaller subsets [28].
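As an illustration of how a decision tree learner of the C4.5 family scores a candidate attribute for splitting, the following Python sketch computes the gain ratio of a categorical attribute. It is a simplified sketch under stated assumptions: the function names are hypothetical, only categorical attributes are handled, and the handling of numeric thresholds, missing values and pruning in J48 is not shown.

```python
import math
from collections import Counter
from typing import Dict, Hashable, List, Sequence


def entropy(values: Sequence[Hashable]) -> float:
    """Shannon entropy (in bits) of the value distribution."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total) for n in Counter(values).values())


def gain_ratio(attribute_values: Sequence[Hashable], labels: Sequence[Hashable]) -> float:
    """Information gain of the split divided by the split information."""
    total = len(labels)
    subsets: Dict[Hashable, List[Hashable]] = {}
    for value, label in zip(attribute_values, labels):
        subsets.setdefault(value, []).append(label)
    # Weighted entropy of the class labels after splitting on the attribute.
    remainder = sum(len(subset) / total * entropy(subset) for subset in subsets.values())
    info_gain = entropy(labels) - remainder
    split_info = entropy(attribute_values)
    return info_gain / split_info if split_info > 0 else 0.0
```

A fingerprint bit that separates actives from inactives perfectly yields the highest gain ratio, a bit that is distributed identically in both classes yields zero, and a constant bit is never selected, which is consistent with filtering such attributes out before training.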
Cost sensitivity was introduced by means of meta-learners. The two meta-learners employed in this study were MetaCost for J48, and CostSensitiveClassifier for Naive Bayes and Random Forest [29].
Standard ML statistical measures, namely Accuracy, Sensitivity, Specificity, Balanced Classification Rate (BCR) and the Receiver Operating Characteristic (ROC) curve, were used to evaluate the performance of the classifiers. Accuracy is the percentage of predictions that are correct ((TP + TN)/(TP + TN + FP + FN)). Sensitivity is the percentage of positive labelled instances that are predicted as positive (TP/(TP + FN)). Specificity is the percentage of negative labelled instances that are predicted as negative (TN/(TN + FP)). BCR is the average of sensitivity and specificity and enforces balance in the correct classification rate between the two classes. A ROC curve is a plot of the True Positive rate against the False Positive rate; the area under the curve (AUC) summarizes a binary classifier's performance.
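The measures defined above translate directly into code. The Python sketch below computes them from the four confusion matrix counts and estimates the AUC from its rank interpretation, i.e. the probability that a randomly chosen active receives a higher score than a randomly chosen inactive. The function names are hypothetical, values are returned as fractions rather than percentages, and the pairwise AUC computation is quadratic in the number of compounds, so it is meant for illustration rather than for datasets of the size used here.

```python
from typing import Dict, Sequence


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Accuracy, sensitivity, specificity and BCR from confusion matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "bcr": (sensitivity + specificity) / 2,
    }


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as the fraction of (active, inactive) pairs ranked correctly; ties count half."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one active and one inactive")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))
```

On imbalanced data, accuracy alone can be misleading because a classifier that labels every compound inactive still scores highly, which is why BCR and AUC are reported alongside it.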
Maximum common substructure search
To identify potentially enriched substructures in the bioactive molecules, we employed a Maximum Common Substructure (MCS) based approach. We used an MCS based hierarchical clustering algorithm, 'LibMCS', available from ChemAxon [30]. The minimal MCS size was empirically set to 8 atoms owing to the size and structural complexity of the molecules.
The molecular scaffolds generated by MCS clustering were then used for similarity searching in the active and inactive datasets using the 'jcsearch' algorithm available from ChemAxon [31]. The substructures were evaluated using the chi-square test, and the associated p-value was used to test the significance of enrichment. Using vROCS (release 3.1.2) [32], we performed a molecular alignment of the selected scaffolds with molecules of the active dataset and visualized the alignment in VIDA (4.1.1) [33], both available from OpenEye Scientific Software, Inc. [34].
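The enrichment test for a single scaffold can be sketched as a chi-square test on a 2x2 contingency table of scaffold matches against activity class. The Python sketch below uses the Pearson statistic with one degree of freedom and no continuity correction; these choices, the function name and the argument names are assumptions, since the exact variant of the test used in the study is not specified.

```python
import math
from typing import Tuple


def chi_square_enrichment(hits_active: int, n_active: int,
                          hits_inactive: int, n_inactive: int) -> Tuple[float, float]:
    """Pearson chi-square statistic and p-value (1 d.o.f.) for a 2x2 table."""
    a, b = hits_active, n_active - hits_active        # actives with / without scaffold
    c, d = hits_inactive, n_inactive - hits_inactive  # inactives with / without scaffold
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        raise ValueError("each row and column of the table needs a non-zero total")
    chi2 = n * (a * d - b * c) ** 2 / denominator
    # For one degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value
```

In this study the table totals would be the 22,335 actives and 197,373 inactives, with the hit counts taken from the substructure search for each scaffold. Because the test is two-sided, the direction of the effect (enrichment rather than depletion in the actives) should be checked separately, for example by comparing the two match proportions, and with many scaffolds tested a correction for multiple testing may be appropriate.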
The authors thank Dr Chetana Sachidanandan and Dr Souvik Maiti for reviewing the manuscript and for scientific suggestions. The authors also thank the Open Source Drug Discovery (OSDD) community for support and discussions. The computation was supported by CDAC India through the Garuda grid, and the authors acknowledge help and support from the CDAC Garuda grid team members. This work was funded by the Council of Scientific and Industrial Research (CSIR), India, through the Open Source Drug Discovery Project (HCP001).
1. Hay SI, Guerra CA, Tatem AJ, Noor AM, Snow RW: The global distribution and population at risk of malaria: past, present, and future. Lancet Infect Dis. 2004, 4: 327-336. 10.1016/S1473-3099(04)01043-6.
2. World Health Organization: 2012, http://www.who.int/mediacentre/factsheets/fs094/en/index.html
3. World Health Organization: 2012, http://www.who.int/malaria/world_malaria_report_2011/9789241564403_eng.pdf
4. Newton CR, Taylor TE, Whitten RO: Pathophysiology of fatal falciparum malaria in African children. Am J Trop Med Hyg. 1998, 58: 673-683.
5. World malaria situation 1990: Division of Control of Tropical Diseases. World Health Organization, Geneva. World Health Stat Q. 1992, 45: 257-266.
6. Ruiz W, Kroeger A: The socioeconomic impact of malaria in Colombia and Ecuador. Health Policy Plan. 1994, 9: 144-154. 10.1093/heapol/9.2.144.
7. Kidson C, Indaratna K: Ecology, economics and political will: the vicissitudes of malaria strategies in Asia. Parassitologia. 1998, 40: 39-46.
8. Breman JG, Alilio MS, Mills A: Conquering the intolerable burden of malaria: what's new, what's needed: a summary. Am J Trop Med Hyg. 2004, 71: 1-15.
9. Trenholme GH, Carson PE: Therapy and prophylaxis of malaria. JAMA. 1978, 240: 2293-2295. 10.1001/jama.1978.03290210075039.
10. Mehta SR, Das S: Management of malaria: recent trends. J Commun Dis. 2006, 38: 130-138.
11. Wongsrichanalai C, Webster HK, Wimonwattrawatee T, Sookto P, Chuanak N, Thimasarn K: Emergence of multidrug-resistant Plasmodium falciparum in Thailand: in vitro tracking. Am J Trop Med Hyg. 1992, 47: 112-116.
12. Wongsrichanalai C, Pickard AL, Wernsdorfer WH, Meshnick SR: Epidemiology of drug-resistant malaria. Lancet Infect Dis. 2002, 2: 209-218. 10.1016/S1473-3099(02)00239-6.
13. Dua VK, Dev V, Phookan S, Gupta NC, Sharma VP, Subbarao SK: Multi-drug resistant Plasmodium falciparum malaria in Assam, India: timing of recurrence and anti-malarial drug concentrations in whole blood. Am J Trop Med Hyg. 2003, 69: 555-557.
14. Yang Z, Li C, Miao M, Zhang Z, Sun X, Meng H: Multidrug-resistant genotypes of Plasmodium falciparum, Myanmar. Emerg Infect Dis. 2011, 17: 498-501. 10.3201/eid1703.100870.
15. Moran M, Guzman J, Ropars A, Jorgensen M, McDonald A, Potter S: The malaria product pipeline: planning for the future. 2007, The George Institute for International Health, http://www.policycures.org/downloads/The_malaria_product_pipeline_planning_for_the_future.pdf
16. Wang Y, Xiao J, Suzek TO, Zhang J, Wang J, Bryant SH: PubChem: a public information system for analyzing bioactivities of small molecules. Nucleic Acids Res. 2009, 37: W623-W633. 10.1093/nar/gkp456.
17. Schierz AC: Virtual screening of bioassay data. J Cheminform. 2009, 1: 21. 10.1186/1758-2946-1-21.
18. Melville JL, Burke EK, Hirst JD: Machine learning in virtual screening. Comb Chem High Throughput Screen. 2009, 12: 332-343. 10.2174/138620709788167980.
19. Periwal V, Rajappan JK, Jaleel AU, Scaria V: Predictive models for anti-tubercular molecules using machine learning on high-throughput biological screening datasets. BMC Res Notes. 2011, 4: 504. 10.1186/1756-0500-4-504.
20. Periwal V, Kishtapuram S, Scaria V: Computational models for in-vitro anti-tubercular activity of molecules based on high-throughput chemical biology screening datasets. BMC Pharmacol. 2012, 12: 1.
21. Jamal S, Periwal V, Open Source Drug Discovery Consortium, Scaria V: Computational analysis and predictive modeling of small molecule modulators of microRNA. J Cheminform. 2012, 4: 16. 10.1186/1758-2946-4-16.
22. Liu K, Feng J, Young SS: PowerMV: a software environment for molecular viewing, descriptor generation, data analysis and hit evaluation. J Chem Inf Model. 2005, 45: 515-522. 10.1021/ci049847v.
23. Sud M: MayaChemTools. 2010, http://www.mayachemtools.org/
24. Elkan C: The Foundations of Cost-Sensitive Learning. Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence. 2001, 2: 973-978.
25. Bouckaert RR, Frank E, Hall MA, Holmes G, Pfahringer B, Reutemann P: Weka - Experiences with a Java Open-Source Project. Journal of Machine Learning Research. 2010, 2533-2541.
26. Friedman N, Geiger D, Goldszmidt M: Bayesian Network Classifiers. Machine Learning. 1997, 29: 131-163. 10.1023/A:1007465528199.
27. Breiman L: Random Forests. Machine Learning. 2001, 45: 5-32. 10.1023/A:1010933404324.
28. Quinlan JR: C4.5: programs for machine learning. 1993, San Francisco: Morgan Kaufmann Publishers
29. Domingos P: MetaCost: a general method for making classifiers cost sensitive. The First Annual International Conference on Knowledge Discovery in Data. 1999, 155-164.
30. ChemAxon, Budapest, Hungary: Library MCS, version 0.7. 2008
31. ChemAxon, Budapest, Hungary: Jcsearch, version 5.8.2.
32. vROCS: release 3.1.2, OpenEye Scientific Software, Inc. 2010, Santa Fe, NM, USA, http://www.eyesopen.com
33. VIDA: version 4.1.1, OpenEye Scientific Software, Inc. 2010, Santa Fe, NM, USA, http://www.eyesopen.com
34. OpenEye Scientific Software, Inc. 2010, Santa Fe, NM, USA, http://www.eyesopen.com
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.