Classifying and scoring of molecules with the NGN: new datasets, significance tests, and generalization

This paper demonstrates how a Neural Grammar Network (NGN) learns to classify and score molecules for a variety of tasks in chemistry and toxicology. In addition to a more detailed analysis of previously studied datasets, we introduce three new datasets (BBB, FXa, and toxicology) to show the generality of the approach. A new experimental methodology is developed and applied to both the new and the previously studied datasets. This methodology is rigorous and statistically grounded, and culminates in a Wilcoxon significance test that supports the effectiveness of the system. We further include a complete generalization of the specific technique to arbitrary grammars and datasets, using a mathematical abstraction that allows researchers in different domains to apply the method to their own work.
Background Our work can be viewed as an alternative to existing methods for solving the quantitative structure-activity relationship (QSAR) problem. To this end, we review a number of approaches from both a methodological and a performance perspective. In addition to these approaches, we also examined a number of chemical properties that can be used by generic classifier systems, such as feed-forward artificial neural networks. In studying these approaches, we identified a set of interesting benchmark problem sets to which many of the above approaches had been applied. These included: ACE, AChE, AR, BBB, BZR, Cox2, DHFR, ER, FXa, GPB, Therm, and Thr. Finally, we developed our own benchmark set by collecting data on toxicology.
Results Our results show that our system performs better than, or comparably to, the existing methods over a broad range of problem types. Our method does not require the expert knowledge that is necessary to apply the other methods to novel problems.
Conclusions We conclude that our success is due to the ability of our system to: 1) encode molecules losslessly before presentation to the learning system, and 2) leverage the design of molecular description languages to facilitate the identification of relevant structural attributes of the molecules over different problem domains.

graphs, every single response of the graph is sorted from lowest to highest. Each response value is used in turn as the threshold value. The vertical axis of the graph gives the number of correct predictions obtained when that value is used as the threshold. A bulge upward indicates a system that performs favourably; a straight line indicates random guessing; and a bulge downward indicates a poorly performing system. The bounding parallelogram indicates the possible range of answers.
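
The threshold sweep described above can be sketched as follows. This is a minimal illustration, not the code used to produce the graphs: the function name `threshold_curve`, the boolean label encoding, and the rule that a molecule is predicted active when its response is at or above the threshold are all assumptions made for the example.

```python
def threshold_curve(responses, labels):
    """Count correct predictions for every candidate threshold.

    responses: model outputs, one float per molecule.
    labels:    True for an active molecule, False for an inactive one.
    Assumption: a molecule is predicted active when response >= threshold.
    """
    # Sort the responses from lowest to highest, keeping labels attached.
    pairs = sorted(zip(responses, labels))
    curve = []
    for threshold, _ in pairs:
        # Use each response value in turn as the threshold.
        correct = sum((r >= threshold) == y for r, y in pairs)
        curve.append(correct)
    return curve


if __name__ == "__main__":
    # Hypothetical responses and labels, for illustration only.
    print(threshold_curve([0.1, 0.4, 0.35, 0.8], [False, False, True, True]))
    # prints [2, 3, 2, 3]
```

Under this rule, raising the threshold past one molecule changes the count of correct predictions by exactly one, which is consistent with the curve being confined to a bounding parallelogram.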

SMILES-NGN toxicity prediction
We provide the community with the complete dataset used for the SMILES-NGN toxicity prediction experiment. This document, supplementary2.pdf, contains a list of all molecules and LD50s with their respective sources. Table 1 shows the results obtained when applying the ANN to the random training datasets. The first column represents the test subject. The second column contains the name of the organ implicated in the CPDB [1]. The third column represents one of 35 feature vectors that are implemented in the CDK [2].
Note that many feature vectors were unable to achieve convergence in the pre-testing phase and are therefore not included in the results. We show only the best feature vectors used with the ANN. The fourth column (Average/Epsilon) represents the average error between the ANN's predicted LD50 and the actual LD50 value over 100 trials, together with the standard deviation of that error over the same 100 trials.
The fifth column measures the standard deviation of the Epsilon value on a per-trial basis, along with the standard deviation of that standard deviation. The final column shows the correlation coefficient between the known LD50 values and the estimates provided by the ANN. Table 2 contains the same information as Table 1, but for the per-trial randomized training data. Tables 3 and 4 show the data produced from this experiment with the use of the SMILES-NGN. Note that the NGN does not use any descriptors, instead processing the entire molecule.
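
The summary statistics reported in these tables can be sketched roughly as below. The exact definition of Epsilon is not spelled out here, so the sketch assumes it is the mean absolute difference between predicted and actual LD50 within a trial; the function names `trial_epsilon`, `pearson`, and `summarize` are hypothetical and chosen only for illustration.

```python
import math
import statistics


def trial_epsilon(predicted, actual):
    """Mean absolute error for one trial (assumed definition of Epsilon)."""
    return statistics.mean(abs(p - a) for p, a in zip(predicted, actual))


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def summarize(trials):
    """Average Epsilon and its standard deviation across trials.

    trials: list of (predicted, actual) pairs, one pair per trial.
    """
    eps = [trial_epsilon(p, a) for p, a in trials]
    return statistics.mean(eps), statistics.stdev(eps)
```

With 100 trials, `summarize` would yield the two numbers of the Average/Epsilon column under the assumed definition, and `pearson` applied to the known and estimated LD50 values would yield the correlation column.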
The final tables show how statistically significant the data generated from this experiment are. Each method (the ANN with all descriptor combinations, and the SMILES-NGN) was compared using a Wilcoxon signed-rank test. From the results, it is determined that each method belongs to a separate population in which the average Epsilon and standard deviation are different. Comparison of those populations leads to the conclusion that one population is closer to zero than another. Table 5 shows the results for the grouped data, while Table 6 and Table 7 show the results from the random data. In each table, the animal, organ and descriptor used with the ANN are shown. The Wilcoxon test statistic is shown for both the hypothesis that "Epsilon is lower for the NGN" and the hypothesis that "the standard deviation is lower for the NGN". A negative value indicates the degree of support for the hypothesis, and a positive value the degree of lack of support. A p-value below 0.05 indicates statistical significance at the 95% confidence level, and a p-value below 0.10 at the 90% level.
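
A one-sided Wilcoxon signed-rank comparison of this kind can be sketched as follows. This is an illustrative implementation using the normal approximation, not the software used for Tables 5 to 7: the function name `wilcoxon_signed_rank` is hypothetical, zero differences are dropped, tied magnitudes receive average ranks, and no tie or continuity correction is applied. Reporting the statistic as a z-score, so that negative values support "the first sample is lower", is an assumption chosen to match the sign convention described above.

```python
import math


def wilcoxon_signed_rank(x, y):
    """One-sided Wilcoxon signed-rank test for paired samples.

    Returns (z, p) for the alternative hypothesis "x tends to be lower
    than y". Normal approximation, so rough for small samples.
    """
    # Paired differences; zero differences carry no sign and are dropped.
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank the absolute differences, averaging ranks over ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1  # average of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1

    # Sum of ranks of the positive differences and its null distribution.
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    p = 0.5 * (1 + math.erf(z / math.sqrt(2)))  # P(Z <= z)
    return z, p
```

Calling `wilcoxon_signed_rank(ngn_eps, ann_eps)` on hypothetical per-trial Epsilon lists would return a negative `z` and a small `p` when the NGN errors are consistently lower, and a positive `z` with `p` near 1 in the opposite case.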
The grouped Wilcoxon test table shows that the SMILES-NGN statistics may be lower than those of the ANN, but the difference is not significant at either the 95% or the 90% confidence level. By contrast, for the random datasets, the