Supervised prediction of drug-induced nephrotoxicity based on interleukin-6 and -8 expression levels

Background Drug-induced nephrotoxicity causes acute kidney injury and chronic kidney diseases, and is a major reason for late-stage failures in the clinical trials of new drugs. Therefore, early, pre-clinical prediction of nephrotoxicity could help to prioritize drug candidates for further evaluations, and increase the success rates of clinical trials. Recently, an in vitro model for predicting renal-proximal-tubular-cell (PTC) toxicity based on the expression levels of two inflammatory markers, interleukin (IL)-6 and -8, has been described. However, this and other existing models usually use linear and manually determined thresholds to predict nephrotoxicity. Automated machine learning algorithms may improve these models, and produce more accurate and unbiased predictions. Results Here, we report a systematic comparison of the performances of four supervised classifiers, namely random forest, support vector machine, k-nearest-neighbor and naive Bayes classifiers, in predicting PTC toxicity based on IL-6 and -8 expression levels. Using a dataset of human primary PTCs treated with 41 well-characterized compounds that are toxic or not toxic to PTC, we found that random forest classifiers have the highest cross-validated classification performance (mean balanced accuracy = 87.8%, sensitivity = 89.4%, and specificity = 85.9%). Furthermore, we also found that IL-8 is more predictive than IL-6, but a combination of both markers gives higher classification accuracy. Finally, we also show that random forest classifiers trained automatically on the whole dataset have higher mean balanced accuracy than a previous threshold-based classifier constructed for the same dataset (99.3% vs. 80.7%). Conclusions Our results suggest that a random forest classifier can be used to automatically predict drug-induced PTC toxicity based on the expression levels of IL-6 and -8.


Background
The kidney plays an important role in the maintenance of water and electrolyte balance, and the filtration and elimination of metabolic wastes and drugs from the plasma [1]. Due to drug exposure and active transport and metabolism of drugs, the kidney is susceptible to drug-induced toxicity [2][3][4][5]. Nephrotoxic drugs may perturb renal perfusion, induce loss of filtration capacity, and cause damage to the vascular, tubular, glomerular and interstitial cells in the kidney [6]. Drug-induced nephrotoxicity can lead to acute kidney injury, or chronic kidney disease that may process to end-stage kidney disease [6][7][8]. However, nephrotoxicity of drug candidates is often detected only during the late phases of drug development, and accounts for 19% of drug attrition in phase 3 of clinical trials [9]. Therefore, early, pre-clinical prediction of nephrotoxicity could help to prioritize drug candidates for further evaluations, increase the success rates of clinical trials, and reduce the overall time and cost of drug development.
Renal proximal tubular cells (PTCs) are a major target for drug-induced toxicity because they are involved in the regulation of filtrate concentration and drug transportation and metabolism [1]. Current pre-clinical, in vitro nephrotoxicity predictors are usually based on protein-or gene-expression markers of immortalized renal proximal tubular cell lines [2,[10][11][12][13]. Most of these predictors have only been tested in around ten or less compounds [2]. Recently, Li et al. have developed an in vitro predictor based on the expression levels of two inflammatory markers, interleukin (IL)-6 and -8, in human primary renal proximal tubular cells (HPTCs) [14] and human embryonic stem cell-derived HPTC-like cells [15]. These markers were tested in a larger number of 41 compounds, and gave higher prediction accuracy than many previous predictors [2]. However, most of these existing predictors use simple linear thresholds to distinguish between the effects of nephrotoxic and non-nephrotoxic compounds, even though more than one markers (or "features") are measured from the cells. These manually-determined thresholds may be subject to human biases, and have difficulties in distinguishing features that are non-linearly separable [16]. Therefore, we wonder if non-linear decision boundaries identified automatically using supervised classifiers can further improve the accuracy of nephrotoxicity predictors based on the IL-6 and -8 markers.
Supervised classifier is a computational algorithm that maps, or classifies, input data into different pre-defined categories based on a set of training data whose category membership is known. The support vector machine (SVM) algorithm is one of the most commonly used classifiers. It constructs classification boundaries based on soft margins that allow mis-classified data points, and is especially useful when the data is not linearly separable and/or the number of features is high [17]. The k-nearest-neighbor (k-NN) and naive Bayes classifiers are two other commonly used classifiers. In a k-NN classifier, the category membership of a data point is determined by a majority vote of its neighboring training data points [18]. Naive Bayes classifier is based on the Bayes' theorem and assumes that the measured features are independent [19]. These two classifiers have the advantages of being simple and efficient, especially for low numbers of features [20,21]. Finally, the random forest algorithm is a relatively new type of classifier based on ensemble learning of a set of decision trees [22]. In certain datasets, random forest may achieve higher classification accuracy than SVM [23]. Despite the popularity of these classifiers, their performances have not been systematically compared and studied under the context of nephrotoxicity prediction.
Here, we report a systematic comparison of random forest, SVM, k-NN, and naive Bayes classifiers in predicting the PTC toxicity of 41 well-characterized compounds that are toxic or not toxic to PTC. The prediction is based on the IL-6 and -8 expression levels measured by Li et al. in HPTCs [14]. We describe how the parameters of all the tested classifiers can be automatically determined without any manual intervention. We also compare the importance of IL-6 and -8 in predicting PTC toxicity. Finally, we also show that supervised prediction based on the best classifier, random forest, achieves higher accuracy than the threshold-based classifier used by Li et al. [14].

Dataset
We used the gene expression dataset generated by Li et al. [14]. The dataset was collected from HPTCs derived from three different human donors (HPTC1, 2, and 3). The cells were exposed to 41 compounds for 16 hours, and the expression levels of IL-6 and -8 were determined using quantitative polymerase chain reaction (qPCR). The measured values were then averaged across three experimental replicates and divided by the vehicle control values (Additional File 1). For each batch of HPTCs, we obtained a final 41x2 floating point data matrix and presented it to each of the tested classifiers as described below (Figure 1). IL-6 and -8 were selected because their expression levels are substantial increased in injured or diseased kidneys [24][25][26][27]. The 41 compounds can be divided into two categories [2,14]. The 'toxic' category has 22 nephrotoxicants that are known to be directly toxic to the human PTCs. The 'nontoxic' category has 11 nephrotoxicants that are not known to directly damage PTCs, and 8 non-nephrotoxic compounds. A more detailed description of the experimental protocols and compounds can be found in the report of the original study [14] and another more recent study [15]. In these previous studies, simple classifiers based on manually determined thresholds were used, and cross validation was not used to test the performance of these classifiers. The mean balanced accuracy of these classifiers constructed using all the data points (compounds) was reported to be 80.7% [14].

Classifier evaluation
We compared four different supervised classifiers, namely random forest, SVM, k-NN and naive Bayes ( Figure 1). The maximum expression levels of IL-8 and -6 induced by the tested compounds were used as inputs to the classifiers. For each classifier, we used a stratified 3-fold cross validation procedure [28] to estimate its generalized classification performance. This procedure randomly divided the dataset into three roughly equal folds, one of which was used to test a classifier trained on the remaining folds. We repeated the whole cross validation procedure 10 times, each with a different random fold division. The final classification performance was obtained by taking the mean of all the obtained measurements.
We used three different classification performance indicators: sensitivity, specificity, and balanced accuracy. The definitions of these three measurements are as the following: balanced accuracy = sensitivity + specificity 2 . ( where TP is the number of true positives, TN is the number of true negatives, FP is the number of false positives and FN is the number of false negatives. We performed all the analyses using the R statistical environment (v3.0.2) on a personal computer equipped with an Intel Core i7-3770K processor and Windows 7 operating system. The R source code used can be found in Additional File 2.

Random forest
Random forest is an ensemble learning method that constructs a large number of decision trees (B) during training, and predicts the category label of a new data sample by taking the mode of the labels predicted by these trees [22]. During training, a random subset of m rf features is selected, and the best spit of data points based on these features are used to construct a decision tree. We tested B = 50, 100, 150, 200, 250, 300, 400, 500, 600, 700, 800, 900, 1000, 1200, 1400, 1600, 1800, and 2000. Since our dataset only has two features, we set m rf = 1. The "randomForest" library (v4.6-10) under the R environment was used to perform random forest classification.
Binary support vector machine SVM aims to construct a decision hyperplane with the largest margin that distinguishes data points from different categories [29]. Let us denote the training data and category labels to be If the data points are linearly separable, the optimum decision hyperplane is w · x + b = 0 , where w is a weight vector that is normal to the hyperplane, and b is a bias term. For all the input data x, they must satisfy the following constraints: These two equations can be combined as: The hyperplane with the maximum margin can be calculated through solving the following quadratic programming problem: To handle non-separable datasets, these constraints are relaxed with a positive slack variable ξ i , where i = 1, 2, ..., m (called "soft margin") [17]. Then the optimization in Equation 7 becomes: The upper bound for the error in the training dataset meter. This optimization equation allows a trade-off between large margin and small error values.
In the case where a decision function is not linear, the data is mapped into a higher-dimensional space, and a hyperplane is constructed so that the data can be linearly separated in this new space. The projection x' = (x) is done through : R N → F . This hyperplane that separates the two categories has a similar form as the linear case: w · x' + b = 0 . The optimization equation is also similar to Equation 8 except that the input now becomes x' . A Lagrangian is constructed and can be transformed into a dual form: where a 1 , a 2 ,...,a i are the Lagrangian multipliers. K(x x , x j ) is the kernel function with the form: The classifier based on the kernel function for a new input data x u is: More details about SVM can be found in [16] and [17].
In our analyses, we tested four SVM kernels, namely linear, polynomial, sigmoid and radial basis function (RBF). We optimized all the parameters, including C, g, r, and degree of the kernels through an exhaustive grid search [30]. We used the 'e1071' library (v1.6-1) under the R environment to perform SVM classification.

k-NN classifier
A k-NN classifier classifies input data according to the labels of the k-nearest neighbors in the training data { x i , y i }. We calculated the Euclidean distance between a new input data point x u, and each of the training data points x i using: The category label of x u was assigned based on the majority vote of the category labels of its k-nearest training data (kNN): where δ(x i , y j ) ∈ {0, 1} indicates if x i belongs to y i. We tested k = 1, 3, 5, 7 and 9, and used the "class" library (v7.3 -9) under the R environment to perform k-NN classification.

Naive Bayes classifier
A naive Bayes classifier assumes that each of the measured features contribute independently to the probability of a category label given an input data, p(y i |x u ). According to the Bayes theorem, we have: where p(x u |y i ) is the probability that the input data x u belongs to the category y i, p(y i ) is the probability of category y i, and p(x u ) is the probability of the input data x u. The naive Bayes classifier maximizes the probability among all the possible categories: As p(x u ) has the same value for a given problem, we only need to maximize the numerator of Equation 14.
Since we assume all the features are independent, the classifier becomes: where f 1 , f 2 , ..., f n is the set of feature values in the input data point x u. We used the 'e1071' library (v1.6-1) under the R environment to perform naive Bayes classification.

Random forest classification
We found that the performance differences between random forest classifiers using the different tested numbers of trees are very small (Figure 2). Our results agree with previous observations that random forest classifiers do not overfit even for large numbers of trees [22]. Therefore, we fixed the number of trees to be 250, which gives the highest mean balanced accuracy (87.8%), sensitivity (89.4%), and specificity (85.9%).

SVM parameter optimization
The classification performance of a SVM is closely related to its parameter values. A SVM classifier based on the RBF kernel has two important parameters C and g [17]. The C parameter determines the misclassification penalty, and the g parameter determines the width of the RBF kernel. We tested C and g values ranging from 10 -5 to 10 10 . During each trial of the cross validation procedure, we always determined the optimum C and g values based on the training data of the current fold ( Figure 3). These optimum values might slightly change from fold to fold due to the different training data used. Using this optimization procedure on our 41-compound dataset, we found that the mean classification performance across all folds and trials for a RBF-based SVM classifier are 81.6% (balanced accuracy), 78.7% (sensitivity), and 84.2% (specificity). We also used similar optimization procedures to optimize the parameters of SVMs based on the linear, polynomial (Figure 4), and sigmoid kernels.

SVM classification using linear, polynomial, sigmoid and RBF kernels
The performance of a SVM is also closely related to its kernel function. A linear kernel is simple, fast, but may not work well when the dataset is not linearly separable.
Polynomial, sigmoid or RBF kernels can provide complex decision boundaries, but may also lead to the problem of overfitting [31]. To determine the best kernel for our dataset, we compared the classification performance of SVM classifiers based on linear, polynomial, sigmoid and RBF kernels using a stratified 3-fold cross validation with 10 random trials ( Table 1). The parameters of these classifiers were optimized as described in the previous section. We found that the RBF kernel had the highest balanced accuracy (81.6%) and sensitivity (78.7%), and second highest specificity (84.2%). Our results suggest that the IL-6 and -8 expression levels are not linearly separable in the original feature space, and the mapping of these two features into a higher dimensional space using a RBF kernel helps to distinguish the toxic and non-toxic compounds.

k-NN classification
We found that the optimum number of nearest neighbors (k) for k-NN classifiers is three ( Figure 5). Although the mean specificity of the classifiers increases with k, the mean sensitivity starts to decrease after k = 3. At this optimum k value, we found that the mean classification performance across all folds and trials for k-NN classifiers are 74.1% (balanced accuracy), 74.0% (sensitivity), and 74.2% (specificity).

Comparison between random forest, SVM, k-NN and naive Bayes classifiers
After optimizing the parameters of all the classifiers, we performed a systematic comparison of the performances of these classifiers in classifying our 41-compound dataset ( Table 2). We found that random forest and SVM have the highest and second highest, respectively, values of balanced accuracy, sensitivity and specificity among all the classifiers. The k-NN classifier (k = 3) has higher sensitivity, but lower specificity, than the naive Bayes classifier. Based on these results, we conclude that a random forest classifier has the best overall performance, and will be used for all of our subsequent analyses.

Feature comparison
The expression levels of IL-6 and -8 increase in HPTCs in response to compounds that are toxic to human PTCs [14]. Previously, Li et al. found that IL-8 is more discriminative than IL-6 in classifying toxic and non-toxic compounds, but they also concluded that the combination of these two features do not provide additional advantages [14]. However, this previous analysis was performed using a classifier based on manually optimized thresholds, and the two features were thresholded independently. We wonder if multivariate classifiers, which construct decision boundaries in multi-dimensional feature spaces, may give better performance than classifiers based on individual features. Similar to the previous study, we found that random forest classifiers based on IL-8 only have higher balanced accuracy (82.8% vs. 72.8%), sensitivity (86.3% vs. 70.5%), and specificity (79.2% vs. 75.3%) than random forest classifiers based on IL-6 only (Table 3). Interestingly, we also found that the combination of IL-6 and -8 gives better classification performance than individual features (balanced accuracy = 87.8%, sensitivity = 89.4%, specificity = 85.9%). Similar trends were also observed for classifiers based on SVMs (Table 3). Our results are consistent with our findings in the previous section that IL-6 and -8 expression levels are not linearly separable, and therefore they can be better separated if a decision boundary is constructed for both features simultaneously. We conclude that both IL-8 and -6 are good and necessary features in the prediction of PTC toxicity.

Construction of final classifiers using all compounds
Finally, we trained a random forest and a SVM classifier using all the 41 compounds, and compared their classification performances to the threshold-based classifier (TC) used by Li et al. [14] (Table 4). We computed the receiver operating characteristic (ROC) curves [32] for these three classifiers (Figure 6), and measured the areas under the ROC curves (AUC). ROC curves that are closer to the upper left corner have higher AUC values and more desirable classification performances. We found that the random forest classifier has higher mean AUC (1.00 vs. 0.85), accuracy (99.3% vs. 80.7%), sensitivity (98.6% vs. 77.3%) and specificity (100.0% vs. 84.2%) than the threshold-based classifier ( Table 4). The perfect AUC score indicates that the toxic and non-toxic categories can be fully separated by the random forest classifier. The SVM also performs better than the threshold-based classifier, but poorer than the random forest classifier (Table 4). We also noticed that most of the toxic compounds mis-classified by random forest classifiers are usually also mis-classified by thresholdbased classifiers. For example, when using a thresholdbased classifier, two compounds, namely ifosfamide and germanium oxide, were mis-classified in HPTC1 [14]; but when using our random forest classifier, only ifosfamide was mis-classified. Altogether, our results suggest that a random forest classifier based on IL-6 and -8 expression levels can be used to automatically predict drug-induced PTC toxicity.

Conclusions
In summary, we have performed a systematic comparison of the performances of four supervised classifiers, namely random forest, SVM, k-NN and naive Bayes classifiers, in predicting nephrotoxicity based on the IL-6 and -8 expression levels. All parameters of the classifiers were determined automatically without any user intervention. We found that random forest classifiers have the highest overall classification performance (mean balanced accuracy = 87.8%, sensitivity = 89.4%, and specificity = 85.9%). Furthermore, we also found that IL-8 is more predictive than IL-6, but a combination of both markers gives higher classification accuracy. Finally, we also show that a final random forest classifier trained automatically on the whole 41-compound dataset has higher classification accuracy than a previous threshold-based classifier [14] (mean balanced accuracy  (RBF=radial basis function. The degree of polynomial kernel is three. The values were estimated using a 3-fold cross validation procedure with 10 random trials, and averaged across three batches of HPTCs.) = 99.3% vs. 80.7%). This better performance is likely due to the non-linear and multivariate decision boundaries generated by the random forest classifier. Our results suggest that a random forest classifier based on these two markers can be used to automatically predict druginduced nephrotoxicity. Our methods are general and can be easily applied to test and identify other potential nephrotoxicity markers based on gene expression levels, metabolic profiles, or cellular phenotypes. The classification performance of our classifier may also be further increased by combining markers from these different modalities, and also by increasing the number of training compounds. An important application of our automated classifier is to predict nephrotoxicity of novel chemical compounds identified from large-scale screening of small-molecule or natural product libraries. This will allow early selection and prioritization of compound candidates for further drug development, animal tests or clinical trials, which are costly and time-consuming processes. By focusing on smaller numbers of drug candidates that are less likely to induce nephrotoxicity, the drug or compound discovery process will be more efficient, and the chance of successful clinical trials will also be increased.  (NB=Naive Bayes classifier, k -NN=k-nearest neighbour classifier with k = 3, SVM=support vector machine with radial basis function kernel, RF=random forest classifier. The values were estimated using a 3-fold cross validation procedure with 10 random trials, and averaged across three batches of HPTCs.) (SVM=support vector machine, RF=random forest classifier. The values were estimated using a 3-fold cross validation procedure with 10 random trials, and averaged across three batches of HPTCs.)