Discrimination of approved drugs from experimental drugs by learning methods
© Tang et al; licensee BioMed Central Ltd. 2011
Received: 8 December 2010
Accepted: 14 May 2011
Published: 14 May 2011
Assessing whether a compound is druglike as early as possible is critical in the drug discovery process. Many efforts have been made to create sets of 'rules' or 'filters' which, it is hoped, will help chemists to distinguish 'drug-like' molecules from 'non-drug' molecules. However, within the chemical space of druglike molecules, only a minority will become approved drugs. Classifying approved drugs from experimental drugs may therefore be more helpful for obtaining future approved drugs. In this paper, approved drugs are discriminated from experimental ones by analyzing the compounds in terms of existing drug features and machine learning methods.
Four methodologies were compared by their performance in classifying approved drugs from experimental ones. The best results were obtained with SVM, with an accuracy of 0.7911, a sensitivity of 0.5929, and a specificity of 0.8743. Based on these results, a consensus model was developed to discriminate drugs more effectively, which pushed the correct classification rate up to 0.8517, the sensitivity up to 0.7242, and the specificity up to 0.9352. The methods were tested by application to the Traditional Chinese Medicine Ingredients Database (TCM-ID). The model has therefore been shown to be a potent tool for identifying drug molecules.
These studies have potential applications in combinatorial library design and virtual high-throughput screening for drug discovery.
In the early 1990s, the advent of high-throughput screening (HTS) and combinatorial chemistry methodologies was widely seen as having great potential to revolutionize modern drug discovery. However, the quality of the output from these technologies was lower than expected. Despite advances in technology and in the understanding of biological systems, drug discovery is still a "lengthy, expensive, difficult, and inefficient process" with a low rate of new therapeutic discovery. Drugs and drug-like compounds are distributed extremely sparsely through chemical space, which is estimated to contain ~10^40 to ~10^100 molecules. Within the whole chemical space, the majority of molecules are nondrug molecules and only a minority are druglike. Assessing whether a compound is druglike as early as possible in the drug discovery process is therefore extremely valuable. The term druglike generally refers to molecules that contain functional groups and/or have physical properties consistent with the majority of known drugs, and that can hence be inferred to be biologically active or to show therapeutic potential. For a drug, properties such as synthetic ease, stability, oral availability, good pharmacokinetic properties, lack of toxicity and minimal addictive potential are of utmost importance. Many of these properties depend on the inherent biological and physicochemical parameters of the molecule; however, the complex structure of a whole drug molecule makes such correlations difficult to exploit when screening so large a chemical space. Meanwhile, more than 80% of all failures of commercial drugs can be attributed to inappropriate absorption, distribution, metabolism, elimination, and toxicity (ADMET) properties, despite in vitro and in vivo testing [4–6]. Only a small portion of druglike molecules survive the rigorous evaluation process and are finally approved; these are defined as approved drugs. The other compounds are regarded as experimental drugs, which are still in the clinical process or have not yet been approved for safety and effectiveness.
Commonly used datasets | Number of compounds
Comprehensive Medicinal Chemistry (CMC) | > 8,000 compounds used or studied as medicinal agents
World Drug Index (WDI) | > 80,000 marketed and development drugs worldwide
MACCS-II Drug Data Report (MDDR) | > 100,000 drugs launched or under development
Available Chemicals Directory (ACD) [B35] | > 1,160,000 unique chemicals
The studies above focused on classifying druglike and nondrug molecules. Only a few druglike molecules survive clinical trials, so discriminating druglike compounds from non-drug molecules is just the first step in a long march. Within the chemical space of druglike molecules, only a minority will become approved drugs. Classifying approved drugs from experimental drugs may be more helpful for obtaining future approved drugs. However, the discrimination of approved drugs from other molecules has not yet been reported. Can approved drugs be differentiated from experimental drugs? Do the existing 'rules', features and modeling methods still work for discriminating approved drugs? In this paper, further work has been done to identify molecules that could become marketed drugs rather than remain experimental drugs. Commonly used descriptors and classification methods were used to discriminate approved drugs from experimental drugs. To evaluate the classification models, they were applied to TCM-ID (Traditional Chinese Medicine Ingredient Database), a database whose compounds are highly likely to be drug-like.
The number of compounds per dataset
Various sets of molecular descriptors are currently available. To classify compounds as approved or experimental, the molecules are typically represented by n-dimensional vectors. As preprocessing, hydrogens were added, charges were assigned, and the energy of each compound was minimized with the MMFF94x force field. The descriptors were calculated with the MOE software (Molecular Operating Environment, version 2008.10). Four sets of descriptors were calculated: 28 druglike index (DLI) descriptors; 32 widely applicable descriptors (WD); 257 standard MOE descriptors (MOE); and 76 surface area, volume and shape descriptors (SURF). The WD descriptors are based on atomic contributions to van der Waals surface area, log P (octanol/water), molar refractivity and partial charge. The SURF descriptors depend on structure connectivity and conformation and have been shown to be useful in pharmacokinetic property prediction. All descriptor columns were individually normalized to zero mean and unit variance before the classification models were generated.
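The column-wise normalization described above can be illustrated with a minimal sketch. The descriptor values and the function name `zscore_columns` are hypothetical, and the use of the population standard deviation (dividing by n) is an assumption, since the text does not say which variant was used.

```python
import math


def zscore_columns(rows):
    """Scale every column of a descriptor matrix to zero mean and unit variance.

    rows: list of equal-length lists (one row per compound, one column per descriptor).
    Assumption: population standard deviation (divide by n); constant columns map to 0.0.
    """
    n = len(rows)
    scaled_cols = []
    for col in zip(*rows):  # iterate over descriptor columns
        mean = sum(col) / n
        std = math.sqrt(sum((x - mean) ** 2 for x in col) / n)
        if std == 0.0:
            # A constant descriptor carries no information; avoid dividing by zero.
            scaled_cols.append([0.0] * n)
        else:
            scaled_cols.append([(x - mean) / std for x in col])
    # Transpose back to one row per compound.
    return [list(row) for row in zip(*scaled_cols)]


# Hypothetical descriptor matrix: 3 compounds x 3 descriptors (made-up values).
descriptors = [
    [320.4, 2.1, 5.0],
    [150.2, -0.3, 5.0],
    [480.9, 4.7, 5.0],
]
scaled = zscore_columns(descriptors)
```

Scaling each column separately keeps descriptors with large numerical ranges from dominating distance-based or kernel-based classifiers.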
The reported algorithms can be formulated in terms of machine learning methods. The standard scenario for classifier development consists of two stages: training and testing. During the first stage, the learning machine is presented with labeled samples, which are n-dimensional vectors with a class membership label attached, and it generates a classifier that predicts the class label from the input coordinates. During the second stage, the generalization ability of the model is tested. Here, four different methods are applied.
Partial least squares (PLS) is a technique that generalizes and combines features from principal component analysis and multiple regression. Its goal is to predict or analyze a set of dependent variables from a set of independent variables or predictors [20, 21]. This prediction is achieved by extracting from the predictors a set of orthogonal factors, called latent variables, which have the best predictive power. PLS regression is one of the most powerful data mining tools for large data sets with many highly collinear variables.
KPLS was first described by S. Wold and applied to spectral analysis in the late nineties. Rosipal introduced KPLS in 2001 as a nonlinear extension to the linear PLS method. This nonlinear extension of PLS makes KPLS a powerful machine learning tool for classification as well as regression. KPLS can also be formulated as a paradigm closely related (and almost identical) to Support Vector Machines (SVM).
An artificial neural network (ANN), often simply called a "neural network" (NN), is a mathematical or computational model based on biological neural networks. It consists of an interconnected group of artificial neurons and processes information using a connectionist approach to computation. In most cases an ANN is an adaptive system that changes its structure based on external or internal information that flows through the network during the learning phase.
Support Vector Machines work by mapping the training data into a feature space with the aid of a so-called kernel function and then separating the data using a large-margin hyperplane. Intuitively, the kernel computes a similarity between two given examples. The most commonly used kernel functions are radial basis function (RBF) kernels. More details on SVMs have been provided in the literature numerous times [26, 27].
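To make the idea of a kernel as a similarity measure concrete, the following minimal sketch computes an RBF kernel, k(x, y) = exp(-gamma * ||x - y||^2), between descriptor vectors. The vectors and the value of `gamma` are hypothetical; the kernel parameters used in the study are not given in the text.

```python
import math


def rbf_kernel(x, y, gamma=0.5):
    """RBF kernel: exp(-gamma * squared Euclidean distance between x and y)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def gram_matrix(vectors, gamma=0.5):
    """Pairwise kernel values for a list of vectors (the matrix an SVM works with)."""
    return [[rbf_kernel(u, v, gamma) for v in vectors] for u in vectors]


# Hypothetical normalized descriptor vectors for three compounds.
compounds = [
    [0.0, 0.0],
    [1.0, 1.0],
    [0.1, -0.1],
]
K = gram_matrix(compounds)
```

Identical vectors give a kernel value of 1, and the value decays towards 0 as the vectors move apart, which is why the kernel can be read as a similarity.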
Owing to its robust convergence behavior, SVM seems to be well suited to solving binary classification problems, especially those with large numbers of variables. In previous studies, SVM performed better than ANN when large numbers of features or descriptors were used. However, this was not observed in the present study.
The MOE descriptor set contains both 2D and 3D descriptors. 2D molecular descriptors are numerical properties that can be calculated from the connection table representation of a molecule (e.g., elements, formal charges and bonds, but not atomic coordinates). There are two types of 3D molecular descriptors: those that depend on internal coordinates only and those that depend on absolute orientation. The MOE set contains far more descriptors than the other sets, and in our study, the more comprehensive the descriptors, the better the results. Simple descriptors are nevertheless popular in research, because descriptors are often chosen for simplicity, ease of calculation, and diverse representation of chemical properties. Among the descriptors used in this study, the DLI set may have contributed so much that additional descriptors were unlikely to improve prediction accuracy significantly. Considering the complexity of hundreds of thousands of descriptors, it is notable that such generic and simple chemical properties are so predictive. These simple descriptors have been used successfully in the past for predictions on diverse datasets [31, 32]. The WD descriptors are widely applicable, and their results lie midway between the best and the worst. The SURF descriptors gave approximately 10% lower accuracy than the best set; although they have been shown to be useful in pharmacokinetic property prediction, they were not effective in this case.
The fundamental problem of the method is how to characterize samples with precise and informative features. From the above results, the MOE descriptors performed well in classifying approved and experimental drugs. The DLI descriptors, which made a fairly important contribution, were employed to characterize molecules because their calculation is based on knowledge derived from known drugs.
In this study, the classification results obtained with different descriptors and different models were found to vary widely. For example, on average, 38.5% of the approved drugs and 79.1% of the experimental drugs were correctly classified by more than 60% of the methods. Some molecules are correctly classified by some models but misclassified by others, which indicates that the models are complementary to each other.
We therefore propose the joint application of all the predictive systems. One way to combine several estimators is voting, e.g. calculating an ensemble average. The other is to construct a consensus model. The central idea of the consensus model is to use the results of multiple heterogeneous classifiers, built on variables chosen to maximize the diversity of the classifiers, as the input variables of a new classification layer.
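The difference between the two combination strategies can be sketched as follows. This is only an illustration under stated assumptions: the model names and predictions are made up, ties in the vote are resolved towards the experimental class, and the second-layer classifier itself is not trained here; the sketch only shows how first-level outputs become its input vectors.

```python
def majority_vote(predictions):
    """Combine 0/1 labels from several classifiers for one compound.

    1 = approved, 0 = experimental. Assumption: a tie counts as experimental (0).
    """
    return 1 if 2 * sum(predictions) > len(predictions) else 0


def stack_features(per_model_outputs):
    """Turn per-model prediction lists into per-compound input vectors.

    per_model_outputs: one list per first-level model, each holding that model's
    output for every compound. The result has one row per compound and can be
    fed to a second-layer classifier (a consensus model).
    """
    return [list(row) for row in zip(*per_model_outputs)]


# Hypothetical first-level predictions for four compounds (names are illustrative).
first_level = {
    "svm_moe": [1, 0, 1, 0],
    "ann_dli": [1, 0, 0, 0],
    "kpls_wd": [0, 1, 1, 0],
}

meta_inputs = stack_features(list(first_level.values()))
votes = [majority_vote(row) for row in meta_inputs]
```

With voting, each first-level model counts equally; a second-layer classifier trained on `meta_inputs` can instead learn which models to trust for which pattern of outputs.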
As shown in Table 3, the consensus model improved substantially on, and outperformed, the other methods, such as the best SVM and the voting model. In order of decreasing accuracy, the models ranked as follows: consensus model, best SVM, voting model. Compared with the best SVM, the consensus model increased the sensitivity by 13%, the specificity by 6% and the accuracy by 6%. Compared with the voting model, it increased the sensitivity by 17% and the accuracy by 4%. Sensitivity is the true positive rate, that is, the proportion of approved drugs that are correctly classified. For example, the approved drug Sulfinpyrazone is misclassified as an experimental drug by the best SVM and the voting model but is classified correctly by the consensus model. Specificity is the true negative rate, here the proportion of experimental drugs that are correctly classified. The experimental drug Adenosine-5-Diphosphoribose, which is misclassified as an approved drug by the SVM and the voting model, is correctly classified by the consensus model. A voting scheme tends to accept the prediction with the most votes, which may ignore unusual samples and so limits the prediction accuracy. In the consensus model, by contrast, the results of the first-level classifiers are used as the input of a second-layer SVM, which avoids unnecessary voting and can combine the results of the different methods. The consensus model thus further improves the prediction accuracy and robustness of the predictor.
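The three performance measures used above follow directly from the counts of true and false positives and negatives. A minimal sketch, with approved drugs coded as 1 and experimental drugs as 0, and with made-up labels rather than data from the study:

```python
def classification_metrics(y_true, y_pred):
    """Sensitivity, specificity and accuracy for binary labels.

    1 = approved (positive class), 0 = experimental (negative class).
    Assumes both classes occur in y_true, so neither denominator is zero.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # approved, predicted approved
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # approved, predicted experimental
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # experimental, predicted experimental
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # experimental, predicted approved
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / len(pairs),
    }


# Hypothetical labels: 3 approved and 5 experimental compounds.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
metrics = classification_metrics(y_true, y_pred)
```

In this toy example two of the three approved compounds and four of the five experimental compounds are recovered, so sensitivity is 2/3, specificity is 0.8 and accuracy is 0.75.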
Herbal ingredients have long been regarded as a potential source of druglike compounds. The utility of natural products as sources of novel structures is still alive and well. In the area of cancer, of the 155 small molecules considered over the time frame from around the 1940s to date, 47% are either natural products or directly derived therefrom. A comparison by Feher and Schmidt showed that, overall, natural products are more similar to drugs than are compounds obtained from combinatorial synthesis. A large proportion of natural products are biologically active and have favorable ADME/T properties (absorption, distribution, metabolism, excretion, and toxicology).
Since the major properties were similar, we used the model constructed from approved drugs and experimental drugs to test herbal ingredients. The final model was applied to TCM-ID. Of the 10370 molecules, 3726 (about 36%) were classified as potential drugs. By comparison, about 58% and 73% of the ingredients passed the Lipinski rule-of-five filter and the Oprea filter, respectively.
Whether the compounds of type III are candidate drugs is unknown to us. The compounds passed by the different filter rules differ. For example, Aristolochic acid has been shown to be carcinogenic and highly nephrotoxic and may be a causative agent in Balkan nephropathy. It passed the Lipinski 5 rules and Oprea 3 rules filters, whereas our model classified it as an experimental drug. Astragaloside IV, a main ingredient in many Chinese patent medicines, is predicted to be a candidate drug by our model but does not pass the Lipinski 5 rules and Oprea 3 rules filters. Whether it is a potential drug will have to be tested by further experiments.
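For readers unfamiliar with the Lipinski filter mentioned above, the following sketch applies the commonly cited rule-of-five thresholds (molecular weight at most 500, log P at most 5, at most 5 hydrogen-bond donors, at most 10 hydrogen-bond acceptors) to precomputed properties. The compounds and their property values are hypothetical, and the `max_violations` setting is an assumption; the exact filter settings used in the study are not stated in the text.

```python
def lipinski_violations(props):
    """Count rule-of-five violations from precomputed properties.

    props: dict with keys 'mw' (molecular weight), 'logp', 'hbd' (H-bond donors)
    and 'hba' (H-bond acceptors). Thresholds are the commonly cited ones.
    """
    checks = [
        props["mw"] > 500,
        props["logp"] > 5,
        props["hbd"] > 5,
        props["hba"] > 10,
    ]
    return sum(checks)


def passes_lipinski(props, max_violations=1):
    """Assumption: a compound passes if it has at most `max_violations` violations."""
    return lipinski_violations(props) <= max_violations


# Hypothetical compounds with made-up property values.
compound_a = {"mw": 300.0, "logp": 2.0, "hbd": 2, "hba": 4}
compound_b = {"mw": 785.0, "logp": 1.3, "hbd": 9, "hba": 14}

result_a = passes_lipinski(compound_a)
result_b = passes_lipinski(compound_b)
```

A rule-based filter of this kind looks only at a handful of bulk properties, which is one reason its verdicts can differ from those of a model trained on approved and experimental drugs.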
This work shows that discriminating approved drugs from experimental drugs is practicable. A comparison of four widely used classification methods has shown that, on average, the SVM generates the most predictive models for discriminating approved and experimental drugs, followed by ANN, KPLS and then PLSDA. Seven sets of molecular descriptors were applied to the discrimination of approved drugs and experimental drugs. Notably, these descriptors have comprehensible definitions and physicochemical meanings for drug properties. The classifiers complement each other, and the correct classification rate reaches 85.17% with the consensus model. The herbal ingredients dataset was tested, and the results indicate that this database is a good source for drug discovery. Furthermore, implementing the discrimination of approved and experimental drugs at an early stage of drug discovery will not only narrow the space of drug prediction and screening but also facilitate drug discovery by discarding compounds that are likely to fail further down the line.
This work was supported in part by grants from Ministry of Science and Technology China (2009ZX10004-601, 2010CB833601), National Natural Science Foundation of China (30900832), Program for New Century Excellent Talents in University (NCET-08-0399), "Shu Guang" project by Shanghai Municipal Education Commission and Shanghai Education Development Foundation (07SG22), TCM modernization of Shanghai (09dZ1972800) and Open Project Program Foundation of Key Laboratory of Liver and Kidney Diseases (Shanghai University of Traditional Chinese Medicine), Ministry of Education.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.