Comparison of seven in silico tools for evaluating daphnia and fish acute toxicity: case study on Chinese Priority Controlled Chemicals and new chemicals

Background A number of predictive models for aquatic toxicity are available; however, the accuracy and ease of use of these in silico tools in risk assessment still need further study. This study evaluated the performance of seven in silico tools for predicting acute toxicity to daphnia and fish: ECOSAR, T.E.S.T., Danish QSAR Database, VEGA, KATE, Read Across and Trend Analysis. Thirty-seven Priority Controlled Chemicals in China (PCCs) and 92 New Chemicals (NCs) were used as the validation dataset.
Results In the quantitative evaluation of PCCs, using the criterion of a 10-fold difference between experimental and estimated values, VEGA had the highest accuracy among all models for both daphnia and fish acute toxicity, with accuracies of 100% and 90%, respectively, after considering the applicability domain (AD). KATE, ECOSAR and T.E.S.T. performed similarly, with accuracies slightly lower than that of VEGA. Danish Q.D. had the lowest accuracy among the tools that rely mainly on QSAR. Read Across and Trend Analysis performed worst among all tested in silico tools. The predictive ability of the models was lower for NCs than for PCCs, possibly because the NCs never appeared in the training sets of the models, and ECOSAR performed better than the other in silico tools.
Conclusion QSAR-based in silico tools had greater prediction accuracy than the category approach (Read Across and Trend Analysis) in predicting acute toxicity to daphnia and fish. The category approach requires expert knowledge to be used effectively. ECOSAR performs well for both PCCs and NCs, and its application should be promoted in both risk assessment and prioritization activities. We suggest that the distribution of multiple data and water solubility should be considered when developing in silico models. Both more intelligent in silico tools and testing are necessary to identify the hazards of chemicals.


Background
Global regulations have called for systematic testing of potential environmental contaminants to protect human health and the environment from exposure to anthropogenic chemicals, such as industrial chemicals and pharmaceuticals. The ever-increasing number of chemicals, with more than 350,000 chemicals and mixtures of chemicals currently registered for production and use [1], presents challenges to traditional ecotoxicity testing strategies based on in vivo experiments, which are expensive, time-consuming, and reliant on large numbers of animal subjects. It is therefore virtually impossible to test the acute toxicity of all chemicals used globally.
To mitigate the challenges associated with in vitro and in vivo toxicity testing, global regulations, including the European Chemicals Agency (ECHA) REACH initiative, the U.S. Toxic Substances Control Act and the Canadian Environmental Protection Act, encourage increased reliance on in silico approaches [2][3][4][5]. China is also exploring the possibility of using in silico approaches in chemical risk assessment.
The cost-benefit advantages and regulatory support of in silico methods have led to the development of a number of tools for ecotoxicity assessments [6]. The major in silico methods include (Quantitative) Structure-Activity Relationships (QSAR) and chemical category methods.
QSAR method uses a mathematical model that was derived from a training set of example chemicals. The training set includes the chemicals that were found to be positive and negative in a given toxicological study (e.g., the bacterial reverse mutation assay) or to induce a continuous response (e.g., Lowest Observed Adverse Effect Level in teratogenicity) that the model will predict. As part of the process to generate the model, physicochemical property based descriptors (e.g., molecular weight, octanol water partition coefficient (K ow )), electronic and topological descriptors (e.g., quantum mechanics calculations), or chemical structure-based descriptors (e.g., the presence or absence of different functional groups) are generated and used to describe the training set compounds. The model encodes the relationship between these descriptors and the (toxicological) response. After the model is built and validated, it can be used to make a prediction. The (physical) chemical descriptors incorporated into the model are then generated for the test compound and are used by the model to generate a prediction. This prediction is only accepted when the test compound is sufficiently similar to the training set compounds (i.e., it is considered within the applicability domain of the QSAR model, often considering the significance of descriptors). This applicability domain analysis may be performed automatically by some software to determine whether the training set compounds share similar chemical and/or biological properties with the test chemical [7].
Chemicals whose physical-chemical, toxicological and ecotoxicological properties are likely to be similar or follow a regular pattern as a result of structural similarity may be considered as a group, or 'category' of chemicals. The assessment of chemicals by using this category approach differs from the approach of assessing them on an individual basis, since the properties of the individual chemicals within a category are assessed on the basis of the evaluation of the category as a whole, rather than based on measured data for any one particular chemical alone. For (a) category member(s) that lacks data for one or more endpoints, the data gap can be filled in a number of ways, including by read-across from one or more other category members. Within a chemical category, the members are often related by a trend in an effect for a given endpoint, and a trend analysis can be carried out through deriving a model based on the data for the members of the category [8].
In 2007, the Organization for Economic Co-operation and Development (OECD) guidelines on the development and validation of QSAR models were issued [9]. They proposed that a QSAR model for practical application should be associated with an unambiguous algorithm [10], a defined endpoint, a defined applicability domain (AD), appropriate measures of goodness-of-fit, robustness and predictive ability, and a mechanistic interpretation, if possible [9,11]. Despite these guidelines, the lack of external validation and of reported model performance on test sets, model overfitting, and poor AD definitions remain major concerns [12][13][14][15]. A clear AD definition would ensure that the model assumptions are met [16,17].
In view of the possible uses of in silico tools, regulators often use predictions from multiple in silico tools to arrive at a decision, for example in persistence, bioaccumulation, and toxicity/very persistent and very bioaccumulative (PBT/vPvB) assessment and prioritization [29]. For regulatory purposes, in silico tools need to be not only accurate but also easy to use, and they should serve different purposes, such as qualitative risk assessment, quantitative risk assessment, and even high-throughput screening [30].
Some studies have compared the performance of QSAR models for acute toxicity, based on models for specific chemical classes and for different classes of substances. Moore et al. [31] evaluated the performance of six QSAR modeling packages that predict acute toxicity to fish: ECOSAR, TOPKAT, a Probabilistic Neural Network, a Computational Neural Network, the QSAR components of the Assessment Tools for the Evaluation of Risk (ASTER) system, and the Optimized Approach Based on Structural Indices Set (OASIS) system. Golbamaki et al. [32] evaluated and compared eight in silico modelling packages that predict daphnia acute toxicity: TOPKAT, ACD/Tox Suite, ADMET Predictor™, ECOSAR, TerraQSAR™, T.E.S.T. and two models implemented in VEGA. Cassotti et al. [33] evaluated the accuracy, stability and reliability of two acute toxicity models for daphnia (MICHEM and ChemProp).
However, some of the evaluated tools were not easy to use and were not developed for regulatory purposes. These evaluation studies did not include recently developed tools, such as QSAR Toolbox, Danish Q.D. and KATE, or the latest versions of prediction tools such as VEGA. Finally, the performance of the chemical category approach for predicting acute toxicity to fish and daphnia has not been evaluated.
To implement the regulatory requirements of the "Action Plan for Prevention and Control of Water Pollution", the Ministry of Ecological Environment of China issued the List of Priority Controlled Chemicals (PCCs) (the first batch) at the end of 2017 [34]. The List of PCCs (the second batch) has been compiled and is open for comment [35]. Most of these PCCs have been assessed and shown to have PBT/vPvB characteristics, especially hazards to aquatic ecosystems. A model that can identify chemicals that are eventually determined to be hazardous has great prospects for regulatory application. In addition, in silico tools should also be able to predict the hazards of emerging chemical substances in order to support premanufacture notification of new chemical substances.
In this study, we selected seven in silico tools, namely ECOSAR, T.E.S.T., Danish Q.D., VEGA, KATE, Read Across and Trend Analysis, to predict acute aquatic toxicity to daphnia and fish, in order to provide insight into the applicability, accuracy and ease of use (convenience and the level of expert knowledge required) of these tools. The test sets used in this evaluation were PCCs, which represent chemicals at the final stage of the regulatory management process, and NCs, which represent emerging substances.

Validation datasets
Systematic and rigorous model evaluation requires reliable experimental data. Accordingly, experimental acute aquatic toxicity data (48-h LC50 for daphnia and 96-h LC50 for fish) for the PCCs were taken preferentially from highly reliable sources such as ECHA risk assessment reports, Good Laboratory Practice (GLP) reports, or studies using standard test methods. Other sources, such as ECHA, the OECD eChemPortal database and QSAR Toolbox, were also considered. If more than one value existed, the lowest reasonable value was used. Daphnia  , and fish species were zebra fish. As these NCs came from chemical companies, the test data were generated for registration as required by the Measures for Environmental Management of New Chemical Substances in China. Because of confidentiality requirements, identifying information for these NCs, such as chemical structures, cannot be provided. The functional groups they contain were used for the analysis and were obtained with the "Organic functional groups (nested)" module in QSAR Toolbox.

Predictive tools
The following seven in silico methods were evaluated for predicting acute aquatic toxicity to daphnia and fish: ECOSAR, T.E.S.T., Danish Q.D., VEGA, KATE, Read Across in QSAR Toolbox, and Trend Analysis in QSAR Toolbox. All seven in silico tools were evaluated with the PCCs dataset. Five tools (ECOSAR, T.E.S.T., Danish Q.D., VEGA and KATE) were evaluated with the NCs dataset.
The Simplified Molecular Input Line Entry System (SMILES) notation of each chemical was used as input to the models. A brief description of each program is provided below, and the pertinent details are summarized in Table 1.

ECOSAR
ECOSAR estimates acute aquatic toxicity via the Meyer-Overton relationship for chemicals within structurally similar classes. ECOSAR is trained on a large data set of ecotoxicity studies from the ECOTOX database that follow the U.S. EPA Office of Chemical Safety and Pollution Prevention guidelines, covering 130 structural classes. The log10 Kow value of each training set chemical is predicted using the KOWWIN program from the U.S. EPA Estimation Programs Interface Suite (EPI Suite). Linear regression models between the LC50 toxicity values and log10 Kow were developed for the substances in each class. Predictions for freshwater rather than saltwater species were selected for validation. Chemicals that fall outside the log10 Kow range are considered to lie outside the AD.
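The class-specific regression described above can be sketched as follows. This is a minimal illustration, not ECOSAR's actual implementation: the training points are hypothetical, the function names are invented for this example, and the AD check is assumed to be a simple comparison against the log10 Kow range of the training data.

```python
# Hypothetical training data for one structural class:
# (log10 Kow, log10 LC50) pairs. These numbers are invented for illustration.
TRAINING = [(1.0, 1.2), (2.0, 0.4), (3.0, -0.5), (4.0, -1.3), (5.0, -2.1)]


def fit_class_model(training):
    """Fit log10 LC50 = slope * log10 Kow + intercept by ordinary least squares."""
    xs = [x for x, _ in training]
    ys = [y for _, y in training]
    n = len(training)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in training)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Assumption: the AD is the log10 Kow range spanned by the training set.
    return {"slope": slope, "intercept": intercept,
            "kow_min": min(xs), "kow_max": max(xs)}


def predict(model, log_kow):
    """Return (predicted log10 LC50, whether log10 Kow lies inside the assumed AD)."""
    log_lc50 = model["slope"] * log_kow + model["intercept"]
    in_ad = model["kow_min"] <= log_kow <= model["kow_max"]
    return log_lc50, in_ad


model = fit_class_model(TRAINING)
print(predict(model, 3.0))  # inside the assumed log10 Kow range
print(predict(model, 7.0))  # extrapolation, flagged as outside the AD
```

The negative slope in this toy data reflects the general pattern that more lipophilic chemicals (higher log10 Kow) tend to have lower LC50 values within a class.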

KATE
KATE estimates acute aquatic toxicity via the Meyer-Overton relationship for chemicals within a total of 40 structural chemical classes [37,38]. KATE is trained on the US EPA fathead minnow (Pimephales promelas) and the Japanese Ministry of Environment Oryzias latipes datasets [25]. The log10 Kow value of the test chemical is obtained from an internal experimental database or is estimated with the alternative forced choice method. The relationship between the LC50 value and log10 Kow is obtained by linear regression. The log10 Kow of the predicted substance is compared with the range of log10 Kow values in the corresponding structural class of the training set, which internally defines the AD. The lowest predicted values were used for validation.

T.E.S.T
T.E.S.T. estimates acute aquatic toxicity using several QSAR methodologies: hierarchical clustering, single model, the Food and Drug Administration method, multilinear regression method, group contribution method, mode of action method, nearest neighbour method and consensus method. In the default consensus method (used for validation), the predicted toxicity is simply the average of the predicted toxicities from the above QSAR methodologies, taking into account the applicability domain of each method. T.E.S.T. is trained on endpoints from the EPA ECOTOX database [39]. T.E.S.T. has an AD for each method and a final AD in which predictions must be made by at least two methods for a consensus value to be used. If only a single QSAR methodology can make a prediction, the predicted value is deemed unreliable and is not used. Therefore, whenever the consensus method returned a predicted value, we considered the chemical to be inside the AD.
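The consensus rule described above can be sketched as follows. This is an illustration of the averaging logic only, not code from T.E.S.T.; the function name, the method labels and the use of None for "outside that method's AD" are assumptions made for this example.

```python
def consensus_prediction(method_predictions, min_methods=2):
    """Average the predictions of the methods that could make one.

    method_predictions maps a method name to a predicted value, or to None
    when the chemical is outside that method's applicability domain.
    Returns None when fewer than min_methods methods gave a prediction,
    which this study treats as "outside the AD".
    """
    valid = [v for v in method_predictions.values() if v is not None]
    if len(valid) < min_methods:
        return None
    return sum(valid) / len(valid)


# Two of three hypothetical methods predict: the consensus is their mean.
print(consensus_prediction({"hierarchical": 1.0, "fda": None, "nearest": 2.0}))
# Only one method predicts: no consensus value is reported.
print(consensus_prediction({"hierarchical": 1.0, "fda": None, "nearest": None}))
```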

VEGA
VEGA provides seven models to predict fish acute toxicity: (1) SarPy/IRFMN (V1.0.2), a QSAR classification model based on fragments built by the SarPy software. (2)   Two sets of fragments are implemented in VEGA and are freely available: Functional Groups, which account for 154 chemical groups, and Atom-Centered Fragments (ACF), which account for 115 fragments, each corresponding to a type of atom with different connectivity. The software that analyses the chemical space checks for the presence of these Functional Groups and ACF and then reports, for each of these chemical features, the total number of matches, the number of matches in each class, and the corresponding percentage. The overall reliability of the prediction is measured by combining statistical values, elements of case-based reasoning, and possibly the presence of active substructures. Possible reasons for concern are highlighted. All these considerations are weighted and summed into an index (between 0 and 1) called the Applicability Domain Index (ADI) [26]. All seven models predicting fish acute toxicity and the two models predicting daphnia acute toxicity were used with an integrated method (Fig. 1), except that experimental values were not used. Predictions with good reliability were deemed to be inside the AD; all others were deemed to be outside the AD.

Danish Q. D
Danish Q. D. includes nearly all organic single constituent substances that were preregistered or registered under REACH (around 80,000). The database was developed by Technical University of Denmark. The endpoints are modelled in two software systems (Leadscope, and SciQSAR), and an overall battery prediction is made to reduce "noise" from the individual model estimates and thereby improve accuracy and broaden the AD [27,40].
Leadscope is a software program for systematic sub-structural analysis of a chemical using predefined structural features stored in a template library, training set-dependent generated structural features (scaffolds) and calculated molecular descriptors. Leadscope has a default automatic descriptor selection procedure. This procedure selects the top 30% of the descriptors (structural features and molecular descriptors) according to a χ²-test for a binary variable, or the top and bottom 15% of the descriptors according to a t-test for a continuous variable. After selection of descriptors, the program performs partial least squares (PLS) regression for a continuous response variable, or partial logistic regression for a binary response variable, to build a predictive model.
The SciQSAR software provides over 400 built-in molecular descriptors such as connectivity indices, electrotopological (atom E and HE-state) indices, and other descriptors. For continuous data, regression analysis is used to build the predictive model, and a number of different regression methods are available such as regression on principal components and PLS.
The Battery results were used first. If no Battery result was given, the lowest toxicity value from Leadscope and SciQSAR was selected for validation.

Trend Analysis and Read Across
OECD QSAR Toolbox finds structurally and mechanistically defined analogues and chemical categories, which serve as sources for Read Across, Trend Analysis and QSAR for filling in data gaps. QSAR Toolbox has multiple functions, such as identifying analogues of a chemical, retrieving the existing experimental results of those analogues, and filling in data gaps through Read Across, Trend Analysis or QSAR.
The Read Across and Trend Analysis predictions were made by collecting a set of test data for chemicals considered to be in the same category as the target PCC. The category was first defined using the "Organic functional groups (nested)" categorization method, and the analogues of each PCC were identified. Then all available experimental 48-h LC50 values for daphnia and 96-h LC50 values for Actinopterygii of the identified analogues were retrieved from the selected databases (Aquatic ECETOC, Aquatic Japan MoE, Aquatic OASIS, ECHA REACH, ECOTOX and Food TOX Hazard EFSA). Finally, Read Across and Trend Analysis were run with the internal standardized workflow. By default, for Read Across, the QSAR Toolbox averages the results of the 5 "nearest" analogues (nearest in log10 Kow in this case) to estimate the result for the target chemical. The AD of each prediction was recorded as automatically assessed by the tool, which combines the log10 Kow range and organic functional group similarity: the log10 Kow of the target must be within the range of all collected analogues, and its organic functional groups must be covered by those of the collected analogues.
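The default Read Across averaging described above can be sketched as follows. This is a simplified illustration, not the QSAR Toolbox workflow: the function name and the analogue data are hypothetical, and only the log10 Kow part of the AD check is shown (the functional group comparison is omitted).

```python
def read_across(target_log_kow, analogues, k=5):
    """Estimate a value for the target from its k nearest analogues.

    analogues is a list of (log10 Kow, experimental value) pairs.
    "Nearest" is assumed to mean the smallest absolute difference in log10 Kow.
    Returns (estimate, in_kow_range); estimate is None without analogues.
    """
    if not analogues:
        return None, False
    nearest = sorted(analogues, key=lambda a: abs(a[0] - target_log_kow))[:k]
    estimate = sum(value for _, value in nearest) / len(nearest)
    kows = [kow for kow, _ in analogues]
    # Only the log10 Kow criterion of the AD is checked in this sketch.
    in_kow_range = min(kows) <= target_log_kow <= max(kows)
    return estimate, in_kow_range


# Hypothetical analogues: (log10 Kow, log10 LC50)
ANALOGUES = [(1.0, 2.0), (2.0, 1.5), (2.5, 1.0), (3.0, 0.8),
             (3.5, 0.7), (4.0, 0.5), (6.0, -1.0)]
print(read_across(3.0, ANALOGUES))  # mean of the five closest analogues
print(read_across(7.0, ANALOGUES))  # target outside the analogues' Kow range
```

Because the estimate is a plain average, a single analogue with an atypical mode of action can shift the result noticeably, which is one reason analogue selection benefits from expert review.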

Statistical analysis
Two types of method were used to quantify the performance of all models for PCCs: qualitative assessment and quantitative assessment. Only qualitative assessment was used to quantify the performance of the five models for NCs, as most NCs were not harmful and only a limit test result of 96-h LC50 > 100 mg/L was given.
Qualitative effect assessment only requires chemicals to be classified according to their toxicity values (Table 2). The classes are related to the toxicity classes described in the Globally Harmonized System of Classification and Labelling of Chemicals (GHS) [41]. These classification criteria are accepted by most countries as regulatory classes. In the qualitative assessment, the experimental data and predicted data were classified into four classes based on the GHS criteria of the United Nations (Table 2). If the predicted value and the experimental value fall in the same regulatory category, the prediction is considered accurate regardless of the specific values.
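The qualitative comparison can be sketched as follows. Table 2 is not reproduced in this section, so the class boundaries below (1, 10 and 100 mg/L, following the GHS acute aquatic hazard cut-offs) and the class labels are assumptions made for illustration; the function names are likewise hypothetical.

```python
def toxicity_class(lc50_mg_per_l):
    """Assign one of four toxicity classes from an LC50 in mg/L.

    Assumed boundaries: <=1 very toxic, <=10 toxic, <=100 hazardous,
    >100 non-toxic. Check these against Table 2 before reuse.
    """
    if lc50_mg_per_l <= 1:
        return "very toxic"
    if lc50_mg_per_l <= 10:
        return "toxic"
    if lc50_mg_per_l <= 100:
        return "hazardous"
    return "non-toxic"


def same_class(experimental_lc50, predicted_lc50):
    """A prediction counts as correct when both values fall in the same class."""
    return toxicity_class(experimental_lc50) == toxicity_class(predicted_lc50)


print(same_class(2.0, 8.0))   # both "toxic"
print(same_class(9.0, 11.0))  # close values on opposite sides of a boundary
```

The second call shows the boundary effect: two values that differ by about 20% are still counted as a misclassification.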
Quantitative assessment requires an exact toxicity value in order to obtain the risk quotient [42]. In the quantitative assessment, the difference between the predicted and measured LC50 values was analysed using difference factors of 10, 100 and 1000.
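The fold-difference criterion can be sketched as follows. This is a minimal illustration under the assumption that "within a factor of 10" means the absolute difference between the log10 values is at most 1; the function name is hypothetical.

```python
import math


def within_factor(experimental_lc50, predicted_lc50, factor=10):
    """True when the predicted LC50 is within the given fold factor of the measured one.

    Both values must be positive and in the same unit (for example mg/L).
    The comparison is symmetric: over- and underestimation are treated alike.
    """
    log_diff = abs(math.log10(predicted_lc50) - math.log10(experimental_lc50))
    return log_diff <= math.log10(factor)


print(within_factor(9.0, 11.0))            # well within a factor of 10
print(within_factor(1.0, 20.0))            # 20-fold difference, fails at factor 10
print(within_factor(1.0, 20.0, factor=100))  # passes at factor 100
```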
A number of summary statistics were calculated to compare model performance. The correlation coefficient (R²), the correlation coefficient within the AD (R²_AD), the root mean square error (RMSE), and the percent accuracy between predicted and measured toxicity were calculated with Microsoft Excel. IBM SPSS Statistics (V19) was used to obtain the frequency distribution of the difference between the log10 experimental LC50 and the log10 estimated LC50.
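The two continuous summary statistics can be sketched as follows. The study computed them in Microsoft Excel; this Python version is only an illustration, and it assumes that R² is the squared Pearson correlation between log10 experimental and log10 predicted LC50 values. The function names and data are hypothetical.

```python
import math


def rmse(experimental_log, predicted_log):
    """Root mean square error between paired log10 LC50 values."""
    n = len(experimental_log)
    squared = sum((p - e) ** 2 for e, p in zip(experimental_log, predicted_log))
    return math.sqrt(squared / n)


def r_squared(experimental_log, predicted_log):
    """Squared Pearson correlation between paired log10 LC50 values."""
    n = len(experimental_log)
    mean_e = sum(experimental_log) / n
    mean_p = sum(predicted_log) / n
    cov = sum((e - mean_e) * (p - mean_p)
              for e, p in zip(experimental_log, predicted_log))
    var_e = sum((e - mean_e) ** 2 for e in experimental_log)
    var_p = sum((p - mean_p) ** 2 for p in predicted_log)
    return cov ** 2 / (var_e * var_p)


# Hypothetical data: every prediction is 0.5 log10 units too high.
experimental = [0.0, 1.0, 2.0, 3.0]
predicted = [0.5, 1.5, 2.5, 3.5]
print(rmse(experimental, predicted))
print(r_squared(experimental, predicted))
```

In this toy data set the correlation is perfect even though every prediction is biased, which is why RMSE and the fold-difference accuracy are reported alongside R².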
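Total accuracy was calculated as the proportion of correct category assignments among the chemicals for which the model gave a prediction:
$$\text{Total accuracy} = \frac{N_{\text{correct}}}{N_{\text{predicted}}} \times 100\%$$
Similar to total accuracy, predictive power measures the total number of correct category assignments. However, lack of prediction was treated as an incorrect assignment:
$$\text{Predictive power} = \frac{N_{\text{correct}}}{N_{\text{total}}} \times 100\%$$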
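The difference between the two metrics can be sketched as follows. The sketch assumes that total accuracy is computed over the chemicals that received a prediction, while predictive power is computed over all chemicals; the function names and the example counts are hypothetical.

```python
def total_accuracy(results):
    """Share of correct assignments among chemicals that received a prediction.

    results holds True (correct class), False (wrong class) or
    None (the model gave no prediction) for each chemical.
    """
    predicted = [r for r in results if r is not None]
    if not predicted:
        return None
    return sum(1 for r in predicted if r) / len(predicted)


def predictive_power(results):
    """Share of correct assignments among all chemicals; missing counts as wrong."""
    if not results:
        return None
    return sum(1 for r in results if r is True) / len(results)


# Hypothetical outcome for 10 chemicals: 6 correct, 2 wrong, 2 not predicted.
results = [True] * 6 + [False] * 2 + [None] * 2
print(total_accuracy(results))    # 6 of 8 predicted chemicals
print(predictive_power(results))  # 6 of 10 chemicals overall
```

A tool that declines to predict many chemicals can therefore show a high total accuracy together with a much lower predictive power.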

Statistical distribution of experimental values
The 37 PCCs assessed in this study represent a diverse array of commercial substances. They include olefins, nitrobenzene, perfluorinated and polyfluorinated compounds, halogenated hydrocarbons, halogenated benzenes, organophosphates, phenols, aldehydes, phthalates and polycyclic aromatic hydrocarbons. The experimental LC50 values of the 37 chemicals cover all regulatory categories (Fig. 2 (a) and (b)). Very toxic chemicals make up 43% of the PCCs. Very toxic, toxic and hazardous chemicals together account for 92 and 86% of all the chemicals for daphnia and fish acute toxicity, respectively. The NCs assessed in this study include almost all of the organic functional groups. They are much more complex, as many of them have two or more functional groups, and the most complex NC has 12 functional groups. The overall toxicity of the NCs is lower than that of the PCCs, as shown in Fig. 2 (c) and (d). Non-toxic NCs account for 57 and 65% of all NCs for daphnia and fish, respectively.

Acute toxicity of daphnia
Experimental and predicted toxicity values to daphnia for the 37 PCCs are shown in Table 3; the results for the NCs can be found in the section "Availability of data and materials".

Model performance across the entire data set
Model performance was evaluated on the entire set of 37 PCCs and 42 NCs. The performance metrics for acute toxicity to daphnia of all models tested in this evaluation are summarized in Table 4.

Prediction to 37 PCCs
In the qualitative assessment, based on classification of the entire 37-PCC data set into the four toxicity classes, KATE has a total accuracy of 84%, the highest among all tested models. However, the predictive power of KATE decreases to 57% because it gave no prediction for 12 PCCs, the largest number among all tested models. ECOSAR predicted all of the PCCs, so both its total accuracy and its predictive power are 65%. Based on total accuracy, the tested tools can be ranked in the following order, from highest to lowest performers: KATE > ECOSAR > T.E.S.T. > Danish Q.D. > VEGA > Read Across > Trend Analysis. KATE shows excellent performance, as only five PCCs were predicted incorrectly.
In the quantitative assessment, based on comparison of the LC50 values of PCCs provided by the models, KATE and ECOSAR show better performance, with accuracies of 80 and 76%, respectively, when predictions fall within a factor of 10 of the measured LC50. All models reach an accuracy of 80% when the difference between measured and predicted toxicity is within a factor of 100, except Trend Analysis, which reaches only 55%. The correlation coefficient (R²) in both the qualitative and the quantitative assessment further indicates that KATE has the best performance.

Prediction to 42 NCs
In the qualitative assessment, based on classification of the entire 42-NC dataset into the four toxicity classes, total accuracy and predictive power decrease dramatically compared with those for PCCs. Danish Q.D. and KATE could not predict 18 and 22 chemicals, respectively, which is relatively more than the other models. These results indicate that the models perform poorly for NCs and that their predictive power for NCs is limited.

Model performance within AD
A robust and relevant AD definition is essential for model performance. Model performance within the ADs is shown in Table 5.
Prediction to 37 PCCs
ECOSAR has the most chemicals inside the AD, with 27 of the 37 PCCs. VEGA has the fewest chemicals inside the AD, with 10 of the 37 tested chemicals, reflecting a rigorous AD assessment mechanism. In the qualitative assessment, the accuracy of VEGA increased slightly from 51 to 60% after considering the AD, and that of T.E.S.T. remained at 64%. The accuracies of the other five tools did not increase inside the AD. The accuracies and R²_AD of Danish Q.D., Read Across and KATE decreased after considering the AD, because some correctly predicted PCCs were excluded as being outside the AD. Danish Q.D., Read Across and KATE assess the AD by the range of log10 Kow and by structural classes, and these methods are not as rigorous as the one used by VEGA. A similar phenomenon was reported by Melnikov et al. [43], who found that the total accuracy of KATE decreased from 58 to 46% when the analysis was limited to the compounds within its AD.
In the quantitative assessment, the performance of all tools increases inside the AD. VEGA shows the best performance, with 100% accuracy when predictions fall within a factor of 10 of the measured LC50. VEGA also has the lowest RMSE (0.48 log10 units) and the highest R²_AD (0.82). Read Across and Trend Analysis have the worst predictive ability according to all of the indicators: accuracy, RMSE and R²_AD.
In general, based on the accuracies of the quantitative assessment, the tested tools for daphnia can be ranked in the following order, from the highest to the lowest performers:  Trend Analysis are underestimated significantly. Underestimated toxicity does not meet the principle of the reasonable worst case.

Acute toxicity of fish
Experimental and predicted toxicity results to fish for the 37 PCCs are shown in Table 6; the results for the 86 NCs can be found in the section "Availability of data and materials".

Model performance across the entire test set
Model performance was first evaluated on the entire dataset, regardless of the AD, to assess the utility of each tool for any new or existing chemical. The performance metrics for acute toxicity to fish of all models tested in this evaluation are summarized in Table 7.

Prediction to 37 PCCs
In the qualitative assessment, based on the predictive power for classification of the entire dataset into the four toxicity categories, all models except ECOSAR perform poorly, with accuracies of no more than 50%. ECOSAR has the highest predictive power, with an accuracy of 54% and all 37 chemicals predicted. The performance of ECOSAR for fish is similar to that for daphnia. In terms of total accuracy, the next tools are Danish Q.D., T.E.S.T. and VEGA, with accuracies of 50, 49 and 47%, respectively. Read Across and Trend Analysis have the lowest total accuracies, as was the case for daphnia. The total accuracy of KATE is only 36%, so its performance in predicting toxicity to fish is far lower than in predicting toxicity to daphnia.
In the quantitative assessment, which compares the log10 LC50 of the experimental value with that of the predicted value, VEGA and T.E.S.T. show excellent predictive ability, achieving an accuracy of 80% when the absolute deviation between predicted and experimental values is limited to a factor of 10. They are followed by KATE and ECOSAR, with accuracies of 71 and 68%, respectively, at the same factor of 10. The correlation coefficient shows the same tendency as the accuracy.
Prediction to 86 NCs
In the qualitative assessment, based on classification of the entire set of 86 NCs into the four toxicity classes, total accuracies decreased compared with the predictions for PCCs. As T.E.S.T., Danish Q.D. and KATE could not predict 25, 45 and 49 NCs, respectively, the predictive power of these three tools is the lowest. Both the total accuracy and the predictive power of VEGA are about 20%, a dramatic decrease compared with the predictions for PCCs. ECOSAR has the highest total accuracy and predictive power among the tools; however, its accuracy is still not high, at about 40%.

Model performance within the AD
Model performance within AD to fish toxicity is shown in Table 8.

Prediction to 37 PCCs
The number of PCCs inside the AD is largest for VEGA, Read Across and Trend Analysis, with 29, 31 and 30 tested chemicals, respectively. T.E.S.T. and KATE have the smallest number of chemicals inside the AD.
In the quantitative assessment, four models (VEGA, KATE, ECOSAR and T.E.S.T.) have prediction accuracies greater than 80% when the absolute error is limited to a factor of 10. VEGA reaches the highest accuracy of 90%, a significant increase after considering the AD. RMSE is a measure of accuracy: the lower the RMSE, the higher the prediction accuracy. ECOSAR has the best RMSE (0.71 log10 units).
In general, based on the predictive power of the quantitative assessment, the tested tools for fish can be ranked in the following order, from the highest to the lowest performers: VEGA > ECOSAR = KATE = T.E.S.T. > Danish Q.D. > Read Across > Trend Analysis.
Prediction to 86 NCs
The accuracies inside the AD of ECOSAR, T.E.S.T., Danish Q.D. and KATE are the same as for the predictions for PCCs, whereas the accuracy inside the AD of VEGA decreased from 55% for PCCs to 36% for NCs. The lower accuracy of VEGA for NCs is probably because most of the measured results for NCs were non-toxic (LC50 > 100 mg/L), whereas the lowest value of the seven models included in VEGA was used as the VEGA prediction, which increased the probability of a chemical being assigned to a toxic category.
Figure 4 shows the distribution of the 96-h LC50 fish toxicity predictions with respect to under- and overestimation. Positive errors indicate that the predicted LC50 is above the experimental LC50 and that toxicity is underestimated. Considering the prediction error between the log10 LC50 of the experimental value and the log10 LC50 of the estimated toxicity value provided by the model, over- and underestimation of fish toxicity by Danish Q.D. are more or less similarly distributed.
Fish toxicity predicted by ECOSAR, T.E.S.T., VEGA and KATE appears to be overestimated more often than underestimated, which meets the principle of the reasonable worst case.

Methods to assess AD
All models provide AD assessments that indicate whether predictions fall inside or outside the AD of the model. Most of these models (ECOSAR, KATE, Read Across and Trend Analysis) assess the AD directly from the range of log10 Kow. In addition to log10 Kow, these models also consider structural similarity. The ECOSAR package provides warnings when the model prediction is above the solubility limit of the substance or when the log10 Kow of the substance is outside the AD, which is helpful for non-expert users.
T.E.S.T. does not report the AD of its results directly. However, T.E.S.T. has an AD for each method and a final AD in which predictions must be made by at least two methods for a consensus value to be used.
Although there is no criterion for judging the validity of predicted data, predictions within the AD are preferred. In the qualitative assessment, the prediction accuracy inside the AD is not obviously improved compared with the total accuracy without considering the AD, but in the quantitative assessment it is improved significantly.
There is no single, absolute AD assessment method for a given model. Generally, the broader the definition of the AD, the lower the accuracies. This principle is confirmed in the predictions for daphnia: VEGA has the largest number of PCCs outside the AD or without a prediction, yet its performance is the best. In the quantitative evaluation within the AD with the 10-fold factor, the accuracy of VEGA is the highest among all models for both daphnia and fish toxicity, with accuracies of 100 and 90%, respectively. The high accuracy of the VEGA predictions may be attributed to its detailed definition of the AD.
VEGA assesses the AD through an overall reliability, which is a relatively complex mechanism. The overall reliability of a prediction is measured quantitatively, with a value ranging from 0 to 1, by considering five factors: the Global AD Index, the similarity index of molecules with known experimental values, the accuracy index of predictions for similar molecules, the concordance index for similar molecules, and the index of the Atom Centered Fragments similarity check. All these considerations are weighted and summed into the reliability of a model.

Difference between classification and quantitative assessment
The qualitative method has a certain randomness for substances near a classification boundary. A substance whose experimental and predicted values lie close to a boundary can easily be assigned to two different toxicity classes. Therefore, assessing accuracy qualitatively with toxicity classes is less scientifically meaningful than assessing it quantitatively. The current acute aquatic classification method is based on 10-fold steps in toxicity values. The quantitative method with a 10-fold factor is similar to the toxicity classification method, but it removes the uncertainty at the boundary points and is more meaningful for accuracy evaluation. The results also show that the accuracy obtained with the quantitative method is higher than that obtained with the qualitative method. Therefore, the results of the quantitative method are a good indicator of the performance of the tested tools.
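The boundary effect can be illustrated with a minimal Python sketch. The class cut-offs of 1, 10 and 100 mg·L−1 and all function names are assumptions made for illustration, not the exact scheme used in this study; the example values are hypothetical.

```python
import math

# Upper bounds (mg/L) of illustrative acute toxicity classes. The 1/10/100
# cut-offs are an assumption based on 10-fold class steps.
CLASS_BOUNDS = [1.0, 10.0, 100.0]


def toxicity_class(lc50_mg_l: float) -> int:
    """Return 1 (most toxic) to 4 (least toxic) for an LC50/EC50 in mg/L."""
    for idx, bound in enumerate(CLASS_BOUNDS, start=1):
        if lc50_mg_l <= bound:
            return idx
    return len(CLASS_BOUNDS) + 1


def same_class(experimental: float, predicted: float) -> bool:
    """Qualitative criterion: both values fall in the same class."""
    return toxicity_class(experimental) == toxicity_class(predicted)


def within_tenfold(experimental: float, predicted: float,
                   factor: float = 10.0) -> bool:
    """Quantitative criterion: values differ by at most `factor` either way."""
    return abs(math.log10(predicted / experimental)) <= math.log10(factor)


# Hypothetical pair straddling the 10 mg/L boundary.
experimental, predicted = 9.0, 11.0
print(same_class(experimental, predicted))      # classes 2 and 3 differ
print(within_tenfold(experimental, predicted))  # ratio is only about 1.2
```

A pair such as 9 and 11 mg·L−1 is counted as wrong by the class comparison although the two values differ by a factor of about 1.2, which is why the 10-fold criterion is less sensitive to boundary points.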

Integrated assessment strategy when predicting the fish acute toxicity using VEGA
In the quantitative evaluation of predictions inside the AD for both daphnia and fish toxicity, VEGA performs very well, with the highest accuracy. However, VEGA offers seven models for predicting fish acute toxicity, and some confusion remains even though an internal reliability is given. For example, several models may give the same reliability with different AD indices. In addition, the SarPy/IRFMN model is a classification model, which gives a toxicity class instead of a toxicity value. Therefore, it is crucial to choose the most rational value among the different models and to make the toxicity class provided by the SarPy/IRFMN model usable in the quantitative effect assessment.
To take full advantage of VEGA, we propose an integrated assessment strategy for fish acute toxicity, as shown in Fig. 1. This integrated assessment strategy was used in this study, except that experimental values were not used, and it proved to be useful.
Step 1: if an experimental value exists, it should be used; otherwise go to step 2.
Step 2: if a model shows a reliability of 3 stars with all ADIs = 1, its value should be used; otherwise go to step 3 in the following cases:
- If more than one model has 3 stars, or
- If the models have only 2 stars or 1 star.
Step 3: the value from the model with the highest global ADI should be used preferentially; otherwise go to step 4.
Step 4: the value from the model whose other ADIs outperform those of the other models should be used preferentially.
Notes: (1) The lowest toxicity value should be used when all ADIs are the same; (2) If needed, the toxicity class given by the SarPy/IRFMN model is transformed to a value at its lower limit, e.g. toxic-3 (between 10 and 100 mg·L−1) is transformed to 10.1 mg·L−1.
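The steps above can be sketched in Python as follows. This is one possible reading of the strategy, not the implementation used in the study. The `ModelResult` structure and all names are hypothetical. The sketch assumes that only models with the highest available star rating compete in steps 3 and 4, that "outperforms" in step 4 means a higher mean of the remaining ADIs, and that the "lowest toxicity value" in note (1) is the lowest LC50.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# Note (2): lower-limit value for a SarPy/IRFMN class. Only toxic-3 is given
# in the text; other classes would need analogous, user-supplied entries.
CLASS_LOWER_LIMIT_MG_L = {"toxic-3": 10.1}


@dataclass
class ModelResult:
    name: str                        # hypothetical label for a VEGA fish model
    lc50_mg_l: float                 # predicted 96-h LC50 (mg/L)
    stars: int                       # reliability, 1 to 3 stars
    global_adi: float                # global AD index, 0 to 1
    other_adi: Sequence[float] = ()  # remaining AD sub-indices, 0 to 1


def _mean(values: Sequence[float]) -> float:
    return sum(values) / len(values) if values else 0.0


def select_lc50(results: Sequence[ModelResult],
                experimental: Optional[float] = None) -> float:
    # Step 1: an experimental value always takes precedence.
    if experimental is not None:
        return experimental
    if not results:
        raise ValueError("no experimental value and no model results")

    # Step 2: a single 3-star model with all ADIs equal to 1 is used directly.
    best_stars = max(r.stars for r in results)
    pool = [r for r in results if r.stars == best_stars]
    if best_stars == 3 and len(pool) == 1:
        only = pool[0]
        if only.global_adi == 1 and all(a == 1 for a in only.other_adi):
            return only.lc50_mg_l

    # Step 3: highest global ADI; step 4: best remaining ADIs (assumed: mean);
    # note (1): lowest LC50 when all ADIs are tied.
    ranked = sorted(
        pool,
        key=lambda r: (-r.global_adi, -_mean(r.other_adi), r.lc50_mg_l),
    )
    return ranked[0].lc50_mg_l
```

A class-only result from SarPy/IRFMN would first be converted with `CLASS_LOWER_LIMIT_MG_L` and then passed in as `lc50_mg_l` like any other model value.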

QSAR vs Chemical category approach
ECOSAR, KATE, T.E.S.T., Danish Q.D. and some of the models in VEGA are QSAR methods. Both Read Across and Trend Analysis are category approaches. QSAR models and the category approach have similarities and differences.
The QSAR Toolbox addresses the application strategy for Read Across, Trend Analysis and QSAR models. Read Across is recommended for "qualitative endpoints" (e.g. skin sensitisation or mutagenicity), or for "quantitative endpoints" (e.g. 96 h-LC50 for fish) if only a small number of analogues with experimental results are identified. Trend Analysis is the appropriate data-gap filling method for "quantitative endpoints" (e.g. 96 h-LC50 for fish) if a large number of analogues with experimental results are identified. QSAR models can be used to fill a data gap if no adequate analogues are found for a target chemical.
Chemical-to-chemical similarity is not addressed directly in QSAR models. Instead, the target chemical is compared in some way with the whole population of chemicals underlying the model, and this is handled within the AD of the model. Thus, the comparison is made not between one chemical and another, or a few others, as in the category approach, but with the whole set of compounds used to build the model.
The overall structure of a QSAR model resembles a collection of read-across models, in which similar structures or fragments are collected and treated statistically. The identification of similar structures in QSAR models is completed automatically. In the category approach, by contrast, the evaluation of similar compounds is often done manually, typically by an expert, which is quite subjective.
The accuracies of the Read Across and Trend Analysis methods are the lowest among the tested tools. Read Across may be used when experimental data from high-quality databases are available for one or more substances that are sufficiently similar to the target chemical, but the quality of experimental data is difficult to assess. The predictions in this research were based on categories defined by organic functional groups and on the standardized workflow in QSAR Toolbox. Trend Analysis can, however, be further refined by subcategorization, such as the elimination of analogues that are dissimilar to the target chemical with respect to mode of action or constituent elements. Expert judgement is always used when removing outliers. Each expert is guided by his or her past experience; pieces of information may escape his or her knowledge; and the weight and value assigned to each element of evidence may differ and be expressed subjectively (likely, plausible, reasonable, level of concern, etc.), so the result is often difficult to replicate. Besides, the category approach is typically not strictly formalized and depends on the data for similar chemicals available in the internal database [44].
A case study is shown in Fig. 5, in which the fish 96-h LC50 of 2,4,6-tri-tert-butylphenol was predicted using Trend Analysis. Figure 5a shows the result of the standardized workflow in QSAR Toolbox without any manual intervention. An outlier can be identified easily. However, after that obvious outlier is deleted, it remains unclear how to refine the result further, as shown in Fig. 5b. Thus, the professional judgement required by chemical category methods limits their application for regulatory purposes, especially in high-throughput screening in risk assessment. QSAR Toolbox also provides several other category methods, such as acute aquatic toxicity classification by ECOSAR, acute aquatic toxicity Mode of Action by OASIS, and acute aquatic toxicity classification by Verhaar (Modified). The performance of these category methods needs further assessment, and their use should be limited to experts. At the same time, more intelligent technologies, such as artificial intelligence, should be applied in the category approach.
Fig. 5 Case study on predicting the fish 96-h LC50 of 2,4,6-tri-tert-butylphenol using Trend Analysis. (a) Using the standardized workflow in QSAR Toolbox without any manual intervention; (b) using the standardized workflow after deleting an obvious outlier substance.

PCCs that were frequently predicted incorrectly
There are two PCCs whose daphnia toxicity was predicted incorrectly by more than two models (Table 9). The water solubility of anthracene is 0.047 mg·L−1, which is close to the experimental LC50 value of 0.0356 mg·L−1, indicating that the test was conducted near the solubility limit and that the experimental LC50 value may be unreliable. Only one experimental value was available for anthracene, so its acute toxicity to daphnia needs further testing.
The experimental LC50 value for daphnia used to validate the predictions for dibutyl phthalate is 0.5 mg·L−1, which was evaluated and accepted by ECHA. However, the values gathered from the databases of these models range from 1.4 to 3.7 mg·L−1. The predicted LC50 values of dibutyl phthalate from T.E.S.T., Danish Q.D., Read Across and Trend Analysis are 6.61, 17.5, 6.68 and 73.6 mg·L−1, respectively. Therefore, the "incorrect predictions" for dibutyl phthalate by T.E.S.T., Danish Q.D. and Read Across are caused by the difference in experimental values. Trend Analysis, in contrast, still gives a value that differs more than 10-fold from the experimental values, so it does not perform well.
For the acute toxicity to fish, under the evaluation criterion of a 10-fold difference between the experimental value and the predicted value, six substances were predicted incorrectly by more than three models, as shown in Table 10. Five of them have a low water solubility of below 1 mg·L−1. In principle, the experimental LC50 value of a substance should be lower than its water solubility. The water solubilities of musk xylene, 2,4,6-tri-tert-butylphenol and bis(2-ethylhexyl) phthalate show no significant difference from their experimental LC50 values. The water solubilities of heptadecafluorooctanesulfonic acid and pentadecafluorooctanoic acid are much lower than their experimental LC50 values, indicating incorrect experimental data. In fact, substances with low water solubility are classed as "difficult to test", and the aquatic toxicity of these difficult substances has often been tested improperly, even under GLP conditions. Hence, special caution should be given to substances with low water solubility when developing models. Meanwhile, the validation and comparison of models on these PCCs with low water solubility carry additional uncertainty. As a result, some of the differences between model predictions and measured toxicity values can be partially attributed to the measured toxicity values themselves being less-than-perfect indicators of true toxicity. The errors associated with the measured toxicity values, however, should not affect our conclusions regarding the relative performance of the tested models (their rank orders), particularly in the comparison on common PCCs, because all models are evaluated against the same measured toxicity values.
Danish Q.D. produced large prediction errors for heptadecafluorooctanesulfonic acid, perfluoro-1-octanesulfonyl fluoride, potassium perfluorooctane sulfonate and pentadecafluorooctanoic acid, for all of which the predicted LC50 values are above 100,000 mg·L−1. There are two models in Danish Q.D.: Leadscope and SciQSAR. Taking heptadecafluorooctanesulfonic acid as an example, Leadscope predicts 0.00636 mg·L−1, which is much closer to its water solubility of 0.10 mg·L−1 than the SciQSAR prediction of 354,065 mg·L−1. The situation is similar for perfluoro-1-octanesulfonyl fluoride, potassium perfluorooctane sulfonate and pentadecafluorooctanoic acid. Therefore, the SciQSAR model in Danish Q.D. is not suitable for estimating the fish acute toxicity of perfluorinated compounds.
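A simple plausibility screen based on the principle that an LC50 should not exceed water solubility can be sketched as follows. The function name and the `margin` parameter are assumptions for illustration; the values are those quoted above for heptadecafluorooctanesulfonic acid.

```python
def exceeds_solubility(lc50_mg_l: float, water_solubility_mg_l: float,
                       margin: float = 1.0) -> bool:
    """Flag an LC50 (measured or predicted) above `margin` x water solubility."""
    return lc50_mg_l > margin * water_solubility_mg_l


# Values quoted in the text for heptadecafluorooctanesulfonic acid (mg/L).
water_solubility = 0.10
predictions = {"Leadscope": 0.00636, "SciQSAR": 354065.0}

flags = {model: exceeds_solubility(value, water_solubility)
         for model, value in predictions.items()}
print(flags)
```

Such a screen only marks values that are physically implausible for a dissolved substance; it says nothing about how accurate an unflagged prediction is.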
QSAR Toolbox contains 54 experimental 96 h-LC50 fish values for benzene, ranging from 5.3 mg·L−1 to 542 mg·L−1 and covering 21 fish species within the class Actinopterygii. Many factors affect experimental results, such as the test method, test conditions, species, and even the tester's experience in dealing with difficult substances.
It is difficult to select a fish species for comparing model performance, because the fish species in the training data of some models are not specified. Hence, the single-point comparison method has limitations when more than one experimental value exists. Therefore, we suggest that the distribution of multiple data, rather than a single value, should be considered when developing in silico models.
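The limitation of a single-point comparison can be illustrated with the benzene range quoted above. In this minimal sketch the function names and the prediction of 50 mg·L−1 are hypothetical; only the two range endpoints are taken from the text, because the individual 54 values are not listed here.

```python
from typing import Sequence


def fold_difference(a: float, b: float) -> float:
    """Ratio of the larger to the smaller value (always >= 1)."""
    return max(a, b) / min(a, b)


def fraction_within(prediction: float, experimental: Sequence[float],
                    factor: float = 10.0) -> float:
    """Share of experimental values within `factor`-fold of the prediction."""
    hits = sum(fold_difference(prediction, x) <= factor for x in experimental)
    return hits / len(experimental)


# Endpoints of the reported benzene 96 h-LC50 range (mg/L).
low, high = 5.3, 542.0
spread = fold_difference(low, high)
print(round(spread, 1))  # the range itself spans roughly 100-fold

# Hypothetical prediction: "correct" against one endpoint, "wrong" against the other.
prediction = 50.0
print(fraction_within(prediction, [low, high]))
```

Because the experimental range alone spans more than 10-fold, the same prediction can pass or fail the 10-fold criterion depending on which single experimental value is chosen; scoring against the full distribution avoids this arbitrariness.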

Analysis of groups of NCs that were frequently predicted incorrectly
The functional groups of the NCs that were predicted incorrectly by more than three models were analyzed. Among them, the functional groups with more than two occurrences are shown in Table 11.
Of the 42 NCs in the daphnia toxicity prediction, 14 substances were predicted incorrectly by more than three models simultaneously. The functional groups occurring most frequently among these substances are aryl, aryl halide and aromatic amine.
Of the 86 NCs in the fish toxicity prediction, 40 substances were predicted incorrectly by more than three models simultaneously. The functional groups occurring most frequently among these substances are aryl, aromatic amine, organic amide and thioamide, alkyl (hetero)arenes, ketone, diketone, aryl halide, ether moiety, and alkane branched with secondary carbon.
These functional groups should therefore receive more attention when developing in silico tools.
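The counting procedure described in this section can be sketched as follows. The record layout, the substance identifiers and the example data are entirely hypothetical; only the two thresholds (more than three models, more than two occurrences) come from the text.

```python
from collections import Counter
from typing import Dict, List


def frequent_groups(records: List[dict],
                    min_wrong_models: int = 4,
                    min_occurrences: int = 3) -> Dict[str, int]:
    """Count functional groups among substances mispredicted by many models."""
    counts: Counter = Counter()
    for rec in records:
        # "More than three models" is taken as at least four.
        if rec["wrong_models"] >= min_wrong_models:
            counts.update(set(rec["groups"]))  # count each group once per substance
    # "More than two occurrences" is taken as at least three.
    return {g: n for g, n in counts.items() if n >= min_occurrences}


# Hypothetical records, for illustration only.
records = [
    {"id": "NC-1", "wrong_models": 4, "groups": ["aryl", "aryl halide"]},
    {"id": "NC-2", "wrong_models": 5, "groups": ["aryl", "aromatic amine"]},
    {"id": "NC-3", "wrong_models": 4, "groups": ["aryl", "ketone"]},
    {"id": "NC-4", "wrong_models": 2, "groups": ["aryl", "ketone"]},
]
print(frequent_groups(records))
```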

Outlook
In silico tools are developed from existing hazard information. However, over 350,000 chemicals and mixtures of chemicals have been registered for production and use [1], and they comprise many types of chemicals. As science and technology advance, synthesized or prepared chemicals are becoming more and more complicated. Existing in silico tools do not cover all types of chemicals. Most registered or used chemicals are expected not to have been tested for their hazards, so abundant data are not available to support the development of in silico tools. Besides, in silico tools mostly focus on individual compounds, and it is difficult to identify the hazards of the many mixtures, polymers and UVCBs, which number over 75,000 [1]. Testing is therefore still needed, whether to identify chemical hazards or to provide more information for developing in silico tools. In silico tools also need continuous development to improve their accuracy and to expand their AD to various substances, such as mixtures, polymers and UVCBs.

Conclusion
In this study, the performance of seven in silico methods (ECOSAR, T.E.S.T., Danish Q.D., VEGA, KATE, Read Across and Trend Analysis) for predicting acute aquatic toxicity to daphnia and fish was evaluated and compared using the PCC and NC datasets.
In the quantitative evaluation of PCCs with the criterion of a 10-fold difference between the experimental value and the estimated value, the accuracy of VEGA is the highest among all of the models for both daphnia and fish acute toxicity, with accuracies of 100% and 90%, respectively, after considering the AD. The performance of KATE, ECOSAR and T.E.S.T. is at a similar level, with accuracies slightly lower than that of VEGA. The accuracy of Danish Q.D. is the lowest among the tools in which QSAR is the main mechanism. The performance of Read Across and Trend Analysis, run with the standardized workflow of QSAR Toolbox, is the lowest among all of the tested in silico tools, indicating that the chemical category approach should be limited to expert use at this stage. The main factors affecting the accuracies of in silico tools may be the distribution of multiple experimental data and the accuracy of the experimental values for PCCs with poor water solubility.
The performance of the models on NCs, which are much more complex, is not as good as on PCCs, indicating that in silico tools need continuous development. Testing is still needed, whether to identify the hazards of NCs or to provide more information for developing in silico tools.