Application of an efficient Bayesian discretization method to biomedical data

Background Several data mining methods require data that are discrete, and other methods often perform better with discrete data. We introduce an efficient Bayesian discretization (EBD) method for optimal discretization of variables that runs efficiently on high-dimensional biomedical datasets. The EBD method consists of two components, namely, a Bayesian score to evaluate discretizations and a dynamic programming search procedure to efficiently search the space of possible discretizations. We compared the performance of EBD to Fayyad and Irani's (FI) discretization method, which is commonly used for discretization. Results On 24 biomedical datasets obtained from high-throughput transcriptomic and proteomic studies, the classification performances of the C4.5 classifier and the naïve Bayes classifier were statistically significantly better when the predictor variables were discretized using EBD than when they were discretized using FI. EBD was statistically significantly more stable to the variability of the datasets than FI. However, EBD was less robust than FI, though not statistically significantly so, and produced slightly more complex discretizations. Conclusions On a range of biomedical datasets, a Bayesian discretization method (EBD) yielded better classification performance and stability but was less robust than the widely used FI discretization method. The EBD discretization method is easy to implement, permits the incorporation of prior knowledge and belief, and is sufficiently fast for application to high-dimensional data.


Background
With the advent of high-throughput techniques, such as DNA microarrays and mass spectrometry, transcriptomic and proteomic studies are generating an abundance of high-dimensional biomedical data. The analysis of such data presents significant analytical and computational challenges, and increasingly data mining techniques are being applied to these data with promising results [1][2][3][4]. A typical task in such analysis, for example, entails the learning of a mathematical model from gene expression or protein expression data that predicts well a phenotype, such as disease or health. In data mining, such a task is called classification and the model that is learned is termed a classifier. The variable that is predicted is called the target variable (or simply the target), which in statistical terminology is referred to as the response or the dependent variable. The features used in the prediction are called the predictor variables (or simply the predictors), which are referred to as the covariates or the independent variables in statistical terminology.
A large number of data mining methods have been developed for classification; several of these methods are unable to use continuous data and require discrete data [1][2][3]. For example, most rule learning methods that induce sets of IF-THEN rules and several of the popular methods that learn Bayesian networks require data that are discrete. Some methods that accept continuous data, as for example methods that learn classification trees, discretize the data internally during learning. Other methods, such as the naïve Bayes classifier, that accept both continuous and discrete data, may perform better with discrete data [3,4]. A variety of discretization methods have been developed for converting continuous data to discrete data [5][6][7][8][9][10][11], and one that is commonly used is Fayyad and Irani's (FI) discretization method [9].
In this paper, we present an efficient Bayesian discretization method and evaluate its performance on several high-dimensional transcriptomic and proteomic datasets, and we compare its performance to that of the FI discretization method. The remainder of this paper is structured as follows. The next section provides some background on discretization and briefly reviews the FI discretization method. The results section describes the efficient Bayesian discretization (EBD) method and gives the results of an evaluation of EBD and FI on biomedical transcriptomic and proteomic datasets. The final section discusses the results and draws conclusions.

Discretization
Numerical variables may be continuous or discrete. A continuous variable is one that takes an infinite number of possible values within a range or an interval. A discrete variable is one that takes a countable number of distinct values; it may take few values or a large number of values. Discretization is a process that transforms a variable, either discrete or continuous, so that it takes fewer values, by creating a set of contiguous intervals (or equivalently a set of cut points) that spans the range of the variable's values. The set of intervals or the set of cut points produced by a discretization method is called a discretization.
Discretization has several advantages. It broadens the range of classification algorithms that can be applied to datasets since some algorithms cannot handle continuous attributes. In addition to being a necessary pre-processing step for classification methods that require discrete data, discretization has been shown to increase the accuracy of some classifiers, increase the speed of classification methods especially on high-dimensional data, and provide better human interpretability of models such as IF-THEN rule sets [8,10,11]. The impact of discretization on the performance of classifiers is not only due to the conversion of continuous values to discrete ones, but also due to filtering of the predictor variables [4]. Variables that are discretized to a single interval are effectively filtered out and discarded by classification methods since they are not predictive of the target variable. Due to redundancy and noise in the predictor variables in high-dimensional transcriptomic and proteomic data, such filtering of variables has the potential to improve classification performance. Even classification methods like Support Vector Machines and Random Forests that handle continuous variables directly and are robust to high dimensionality of the data may benefit from discretization [4]. The main disadvantage of discretization is the loss of information entailed in the process that has the potential to reduce performance of classifiers if the information loss is relevant for classification. However, this theoretical concern may or may not be a practical one, depending on the particular machine-learning situation.
Discretization methods can be classified as unsupervised or supervised. Unsupervised methods do not use any information about the target variable in the discretization process, while supervised methods do. Examples of unsupervised methods include the Equal-Width method, which partitions the range of the variable's values into a user-specified number of equal-width intervals, and the Equal-Frequency method, which partitions the variable's values into intervals that each contain an approximately equal, user-specified fraction of the instances. Compared to unsupervised methods, supervised methods tend to be more sophisticated and typically yield classifiers that have superior performance [8,10,11]. Most supervised discretization methods consist of a score to measure the goodness of a set of intervals (where goodness is a measure of how well the discretized predictor variable predicts the target variable), and a search method to locate a good-scoring set of intervals in the space of possible discretizations. The commonly used FI method is an example of a supervised method.
A second way to categorize discretization methods is as univariate versus multivariate methods. Univariate methods discretize a continuous-valued variable independently of all other predictor variables in the data, while multivariate methods take into consideration the possible interactions of the variable being discretized with the other predictor variables. Multivariate methods are rarely used in practice since they are computationally more expensive than univariate methods and have been developed for specialized applications [12,13]. The FI discretization method is a typical example of a univariate method.
We now introduce terminology that will be useful for describing discretization. Let D be a dataset of n instances consisting of the list ((X 1 , Z 1 ), (X 2 , Z 2 ), ..., (X k , Z k ), ..., (X n , Z n )) that is sorted in ascending order of X k , where X k is a real value of the predictor variable X and Z k is the associated integer value of the target variable Z. For example, suppose that the predictor variable represents the expression level of a gene that takes real values in the range 0 to 5.0 and the target variable represents the phenotype that takes the values: healthy or diseased (Z = 0 or Z = 1, respectively). Then, an example dataset D is ((1.2, 0), (1.4, 0), (1.6, 0), (3.7, 1), (3.9, 1), (4.1, 1)). Let S a, b be a list of the first elements of D, starting at the a th pair in D and ending at the b th pair. Thus, for the above example, S 4, 6 = (3.7, 3.9, 4.1). For brevity, we denote by S the list S 1, n . Let T b be a set that represents a discretization of S 1, b . For the above example of D, a possible 2-interval discretization is T 6 = {S 1, 3 , S 4, 6 } = {(1.2, 1.4, 1.6), (3.7, 3.9, 4.1)}. Equivalently, this 2-interval discretization denotes a cut point between 1.6 and 3.7, and typically the mid-point is chosen, which is 2.65 in this example. Thus, all values below 2.65 are considered as a single discrete value and all values equal or greater than 2.65 are considered another discrete value. For brevity, we denote by T a discretization T n of S.
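To make this notation concrete, the example above can be coded directly. The following minimal sketch (with variable names of our own choosing) builds the example dataset D, forms the 2-interval discretization T_6, and maps raw values to interval indices via the midpoint cut point:

```python
# Example dataset D: (predictor value, target value) pairs sorted by predictor.
D = [(1.2, 0), (1.4, 0), (1.6, 0), (3.7, 1), (3.9, 1), (4.1, 1)]

S = [x for x, _ in D]  # S = S_{1,n}: the sorted predictor values

# The 2-interval discretization T_6 = {S_{1,3}, S_{4,6}}:
T6 = [S[0:3], S[3:6]]

# The cut point is conventionally the midpoint between the last value of
# the first interval and the first value of the second interval.
cut = (T6[0][-1] + T6[1][0]) / 2  # midpoint of 1.6 and 3.7, i.e. 2.65

def discretize(x, cut_points):
    """Map a raw value to the index of the interval it falls in."""
    return sum(x >= c for c in cut_points)
```

Values below 2.65 map to interval 0 and values at or above it map to interval 1, matching the description in the text.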

Fayyad and Irani's (FI) Discretization Method
Fayyad and Irani's discretization method is a univariate supervised method that is widely used and has been cited over 2000 times according to Google Scholar. The FI method consists of i) a score that is the entropy of the target variable induced by the discretization of the predictor variable, and ii) a greedy search method that recursively splits each interval at the cut point that minimizes the weighted entropy of the two resulting subintervals, until a stopping criterion based on the minimum description length (MDL) principle is met.
For a list S a, b derived from a predictor variable X and a target variable Z that takes J values, the entropy Ent(S a, b) is defined as:

$$Ent(S_{a,b}) = -\sum_{j=1}^{J} P(Z = z_j) \log_2 P(Z = z_j)$$

where P(Z = z j ) is the proportion of instances in S a, b where the target takes the value z j . The entropy of Z can be interpreted as a measure of its uncertainty or disorder. Let a cutpoint C split the list S a, b into the lists S a, c and S c + 1, b to create a 2-interval discretization {S a, c , S c + 1, b }. The entropy Ent(C; S a, b ) induced by C is given by:

$$Ent(C; S_{a,b}) = \frac{|S_{a,c}|}{|S_{a,b}|} Ent(S_{a,c}) + \frac{|S_{c+1,b}|}{|S_{a,b}|} Ent(S_{c+1,b})$$

where |S a, b | is the number of instances in S a, b , |S a, c | is the number of instances in S a, c , and |S c + 1, b | is the number of instances in S c + 1, b . The FI method selects the cut point C from all possible cut points that minimizes Ent(C; S a, b ) and then recursively selects a cut point in each of the newly created intervals in a similar fashion. As partitioning always decreases the entropy of the resulting discretization, the process of introducing cut points is terminated by an MDL-based stopping criterion. Intuitively, minimizing the entropy results in intervals where each interval has a preponderance of one value for the target.
Overall, the FI method is very efficient and runs in O (n log n) time, where n is the number of instances in the dataset. However, since it uses a greedy search method, it does not examine all possible discretizations and hence is not guaranteed to discover the optimal discretization, that is, the discretization with the minimum entropy.
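As an illustration of the entropy calculations at the heart of FI's cut-point selection, the following sketch (our own minimal implementation; the recursion and the MDL stopping criterion are omitted) computes Ent(S) and finds the single cut point that minimizes Ent(C; S):

```python
import math
from collections import Counter

def entropy(labels):
    """Ent(S): -sum_j P(Z = z_j) * log2 P(Z = z_j) over the target values."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_cut(labels):
    """Return (c, e): the cut index c minimizing the induced entropy e.

    A cut after position c splits labels into labels[:c] and labels[c:],
    and e is the instance-weighted average of the two subinterval entropies.
    """
    n = len(labels)
    best = None
    for c in range(1, n):
        e = (c / n) * entropy(labels[:c]) + ((n - c) / n) * entropy(labels[c:])
        if best is None or e < best[1]:
            best = (c, e)
    return best

# Target values of the example dataset: three healthy (0), three diseased (1).
c, e = best_cut([0, 0, 0, 1, 1, 1])
# The best cut separates the two classes perfectly, so the induced entropy is 0.
```

FI would then recurse into each of the two resulting subintervals until the MDL criterion stops further splitting.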

Minimum Optimal Description Length (MODL) Discretization Method
To our knowledge, the closest prior work to the EBD algorithm, which is introduced in this paper, is the MODL algorithm that was developed by Boulle [5]. MODL is a univariate, supervised, discretization algorithm. Both MODL and EBD use dynamic programming to search over discretization models that are scored using a Bayesian measure. EBD differs from MODL in two important ways. First, MODL assumes uniform prior probabilities over the discretization, whereas EBD allows an informative specification of both structure and parameter priors, as discussed in the next section. Thus, although EBD can be used with uniform prior probabilities as a special case, it is not required to do so. If we have background knowledge or beliefs that may influence the discretization process, EBD provides a way to incorporate them into the discretization process.
Second, the MODL optimal discretization algorithm has a run time that is O(n^3), whereas the EBD optimal discretization algorithm has a run time of O(n^2), where n is the number of instances in the dataset. In essence, EBD uses a more efficient form of dynamic programming than does MODL. This difference in computational time complexity can have significant practical consequences in terms of which datasets are feasible to use. A dataset with, for example, 10,000 instances might be practical to discretize using EBD, but not using MODL.
While heuristic versions of MODL have been described [5], which give up optimality guarantees in order to improve computational efficiency, and heuristic versions of EBD could be developed that further decrease its time complexity as well, the focus of the current paper is on optimal discretization.
In the next section, we introduce the EBD algorithm and then describe an evaluation of it on a set of bioinformatics datasets.

An Efficient Bayesian Discretization Method
We now introduce a new supervised univariate discretization method called efficient Bayesian discretization (EBD). EBD consists of i) a Bayesian score to evaluate discretizations, and ii) a dynamic programming search method to locate the optimal discretization in the space of possible discretizations. The dynamic programming method examines all possible discretizations and hence is guaranteed to discover the optimal discretization, that is, the discretization with the highest Bayesian score.

Bayesian Score
We first describe a discretization model and define its parameters. As before, let X and Z denote the predictor and target variables, respectively, let D be a dataset of n instances consisting of the list ((X 1 , Z 1 ), (X 2 , Z 2 ), ..., (X k , Z k ), ..., (X n , Z n )), as described above, and let S denote a list of the first elements of D. A discretization model M is defined as:

$$M = (W, T, \Theta)$$

where W is the number of intervals in the discretization, T is a discretization of S, and Θ is defined as follows. For a specified interval i, the distribution of the target variable P(Z | W = i) is modeled as a multinomial distribution with the parameters {θ i1 , θ i2 , ..., θ ij , ..., θ iJ }, where j indexes the distinct values of Z. Considering all the intervals, Θ = {θ ij } over 1 ≤ i ≤ W and 1 ≤ j ≤ J, and Θ specifies all the multinomial distributions for all the intervals in M. Given data D, EBD computes a Bayesian score for all possible discretizations of S and selects the one with the highest score.
We now derive the Bayesian score used by EBD to evaluate a discretization model M. The posterior probability P(M | D) of M is given by Bayes rule as follows:

$$P(M \mid D) = \frac{P(M) \, P(D \mid M)}{P(D)} \quad (3)$$

where P(M) is the prior probability of M, P(D | M) is the marginal likelihood of the data D given M, and P(D) is the probability of the data. Since P(D) is the same for all discretizations, the Bayesian score evaluates only the numerator on the right-hand side of Equation 3 as follows:

$$score(M) = P(M) \, P(D \mid M) \quad (4)$$

The marginal likelihood P(D | M) in Equation 4 is derived using the following equation:

$$P(D \mid M) = \int_{\Theta} P(D \mid M, \Theta) \, P(\Theta \mid M) \, d\Theta \quad (5)$$

where Θ are the parameters of the multinomial distributions as defined above. Equation 5 has a closed-form solution under the following assumptions: (1) the values of the target variable were generated according to i.i.d. sampling from P(Z | W = i), which is modeled with a multinomial distribution, (2) the distribution P(Z | W = i) is modeled as being independent of the distribution P(Z | W = h) for all values of i and h such that i ≠ h, (3) for all values i, prior belief about the distribution P(Z | W = i) is modeled with a Dirichlet distribution with hyperparameters a ij , and (4) there are no missing data. The closed-form solution to the marginal likelihood is given by the following expression [14,15]:

$$P(D \mid M) = \prod_{i=1}^{W} \frac{\Gamma(a_i)}{\Gamma(a_i + n_i)} \prod_{j=1}^{J} \frac{\Gamma(a_{ij} + n_{ij})}{\Gamma(a_{ij})} \quad (6)$$

where Γ(·) is the gamma function, n i is the number of instances in interval i, n ij is the number of instances in interval i that have target value j, a ij are the hyperparameters of a Dirichlet distribution that define the prior probability over the θ ij parameters, and a i = Σ j a ij . The hyperparameters can be viewed as prior counts, as for example from a previous (or a hypothetical) dataset of instances in interval i that belong to value j. For the experiments described in this paper, we set all the a ij to 1, which can be shown to imply that a priori we assume all possible distributions of P(Z | W = i) to be equally likely, for each interval i. If all a ij = 1, then all a i = J.
With these values for the hyperparameters, and using the fact that Γ(n) = (n-1)!, Equation 6 becomes the following:

$$P(D \mid M) = \prod_{i=1}^{W} \frac{(J-1)!}{(n_i + J - 1)!} \prod_{j=1}^{J} n_{ij}! \quad (7)$$

The term P(M) in Equation 4 specifies the prior probability on the number of intervals and the location of the cut points in the discretization model M; we call these the structure priors. The structure priors may be chosen to penalize complex discretization models with many intervals to prevent overfitting. In addition to the structure priors, the marginal likelihood P(D | M) includes a specification of the prior probabilities on the multinomial distribution of the target variable in each interval; we call these the parameter priors. In Equation 6, the a ij hyperparameters specify the parameter priors.
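In practice the factorials in Equation 7 overflow quickly, so the score is computed in log space. A minimal sketch of the per-interval computation (the function name is our own; lgamma(n + 1) = log n!):

```python
from math import lgamma

def log_marginal_likelihood(interval_counts, J):
    """log P(D | M) per Equation 7: for each interval i with counts
    (n_i1, ..., n_iJ), add log[(J-1)! / (n_i + J - 1)!] + sum_j log(n_ij!)."""
    total = 0.0
    for counts in interval_counts:
        n_i = sum(counts)
        total += lgamma(J) - lgamma(n_i + J)               # log (J-1)!/(n_i+J-1)!
        total += sum(lgamma(n_ij + 1) for n_ij in counts)  # sum_j log n_ij!
    return total

# The example dataset's pure 2-interval discretization (J = 2 target values):
pure = log_marginal_likelihood([(3, 0), (0, 3)], J=2)
# A single interval holding all six instances:
merged = log_marginal_likelihood([(3, 3)], J=2)
```

As expected, the pure two-interval discretization receives a higher marginal likelihood than the single merged interval.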
The prior probability P(M) is modeled as follows. Let X k denote a real value of the predictor variable, as described above, and Z k denote the associated integer value of the target variable. Let Prior(k) be the prior probability of there being at least one cut point between X k and X k + 1 . In the Methods section, we describe the use of a Poisson distribution with mean λ to implement Prior(k), where λ is a structure prior parameter. Consider the prior probability for an interval i that represents the sequence S a i ,b i in a discretization model M. In general, we assume that the prior probability for interval i is independent of the prior probabilities for the other intervals in M. The prior probability for interval i in terms of the Prior function is defined as follows:

$$P(\text{interval } i) = \left[ \prod_{k=a_i}^{b_i - 1} \bigl( 1 - Prior(k) \bigr) \right] Prior(b_i) \quad (8)$$

Expression 8 gives the prior probability that no cut point is present between any consecutive pair of values of X in the sequence S a i ,b i and that at least one cut point is present between the values X b i and X b i +1 . Using the above notation and assumptions, and substituting Equations 7 and 8 into Equation 4, we obtain the specialized EBD score:

$$score(M) = \prod_{i=1}^{W} \left[ \prod_{k=a_i}^{b_i - 1} \bigl( 1 - Prior(k) \bigr) \right] Prior(b_i) \cdot \frac{(J-1)!}{(n_i + J - 1)!} \prod_{j=1}^{J} n_{ij}! \quad (9)$$

The above score assumes that the n values of X in the dataset D are all distinct. However, the implementation described below easily relaxes that assumption.

Dynamic Programming Search
The EBD method finds the discretization that maximizes the score given in Equation 9 using dynamic programming to search the space of possible discretizations. The pseudocode for the EBD search method is given in Figure 1. It is globally optimal in that it is guaranteed to find the discretization with the highest score. Additional details about the search method used by EBD and its time complexity are provided in the Methods section.
The number of possible discretizations for a predictor variable X in a dataset with n instances is 2 n-1 , and this number is typically too large for each discretization to be evaluated in a brute force manner. The EBD method addresses this problem by the use of dynamic programming that at every stage uses previously computed optimal solutions to subproblems. The use of dynamic programming reduces considerably the number of possible discretizations that have to be evaluated explicitly without sacrificing the ability to identify the optimal discretization.
As described in the Methods section, the EBD algorithm runs in O(n 2 ) time, where n is the number of instances of a predictor X. Although EBD is slower than FI, it is still feasible to apply EBD to high-dimensional data with a large number of variables.
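The recurrence behind this search can be sketched as follows: if best[b] is the score of an optimal discretization of the first b sorted values, then best[b] is the maximum, over all possible starts a of the last interval, of best[a-1] plus that interval's score. The toy implementation below (our own code, not the paper's Figure 1 pseudocode) scores intervals with the log marginal likelihood of Equation 7 only, omitting the structure prior of Equation 9 for brevity:

```python
from math import lgamma

def interval_score(counts, J):
    """Log marginal-likelihood contribution of one interval (Equation 7,
    all Dirichlet hyperparameters set to 1)."""
    n_i = sum(counts)
    return (lgamma(J) - lgamma(n_i + J)
            + sum(lgamma(c + 1) for c in counts))

def optimal_discretization(labels, J):
    """Return (best score, intervals as half-open (start, end) index pairs).

    Runs in O(n^2) interval evaluations: for each endpoint b we consider
    every possible start a of the final interval, reusing best[a - 1].
    """
    n = len(labels)
    best = [0.0] * (n + 1)   # best[b]: optimal score of the first b values
    back = [0] * (n + 1)     # back[b]: start position of the last interval
    for b in range(1, n + 1):
        counts = [0] * J
        best[b] = float("-inf")
        for a in range(b, 0, -1):        # grow the last interval leftward
            counts[labels[a - 1]] += 1   # incremental counts: O(1) per step
            s = best[a - 1] + interval_score(counts, J)
            if s > best[b]:
                best[b], back[b] = s, a
    # Recover the intervals by walking the back-pointers.
    intervals, b = [], n
    while b > 0:
        a = back[b]
        intervals.append((a - 1, b))     # covers labels[a-1 : b]
        b = a - 1
    return best[n], intervals[::-1]

score, intervals = optimal_discretization([0, 0, 0, 1, 1, 1], J=2)
```

On the six-instance example this recovers the two pure intervals. Note that without the structure prior there is no penalty against over-segmentation, which is one reason the full EBD score includes P(M).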

Evaluation of the Efficient Bayesian Discretization (EBD) Method
We evaluated the EBD method and compared its performance to the FI method on 24 biomedical datasets (see Table 1) using five measures: accuracy, area under the Receiver Operating Characteristic curve (AUC), robustness, stability, and the mean number of intervals per variable (a measure of model complexity). The last three measures evaluate the discretized predictors directly while the first two measures evaluate the performance of classifiers that are learned from the discretized predictors. We performed this comparison using the FI method, because it is so commonly used (1) in practice and (2) as a standard algorithmic benchmark for discretization methods.
For computing the evaluation measures we performed 10 × 10 cross-validation (10-fold cross-validation done ten times to generate a total of 100 training and test folds). For a pair of training and test folds, we learned a discretization model for each variable (using either FI or EBD) from the training fold only and applied the intervals from the model to both the training and test folds to generate the discretized variables. For the experiments, we set λ, the user-specified parameter introduced in Figure 1 and in Equation 10 (see the Methods section), to 0.5. The parameter λ is the expected number of cut points in the discretization of the variables in the domain. Our previous experience with discretizing some of the datasets used in the experiments with FI indicated that the majority of the variables in these datasets have 1 or 2 intervals (corresponding to 0 or 1 cut points). We chose λ = 0.5 as the average of 0 and 1 cut points.
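A detail worth emphasizing in this protocol is that the cut points are learned from the training fold alone and then applied, unchanged, to both folds. A minimal sketch (the cut point and values are hypothetical):

```python
def apply_cuts(values, cut_points):
    """Map raw values to interval indices using previously learned cut points."""
    cuts = sorted(cut_points)
    return [sum(v >= c for c in cuts) for v in values]

# Cut points learned from the training fold only (hypothetical):
cuts = [2.65]
train_discrete = apply_cuts([1.2, 1.4, 3.7, 4.1], cuts)  # training fold
test_discrete = apply_cuts([1.9, 3.8], cuts)             # held-out test fold
```

Learning the cut points on the combined data would leak information from the test fold into the discretization and bias the evaluation.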
We used two classifiers in our experiments, namely, C4.5 and naïve Bayes (NB). C4.5 is a popular tree classifier that accepts both continuous and discrete predictors and has the advantage that the classifier can be interpreted as a set of rules. The NB classifier is simple, efficient, robust, and accepts both continuous and discrete predictors. It assumes that the predictors are conditionally independent of each other given the target value. Given an instance, it applies Bayes theorem to compute the probability distribution over the target values. This classifier is very effective when the independence assumptions hold in the domain; however, even if these assumptions are violated, the classification performance is often excellent, even when compared to more sophisticated classifiers [16].
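For discrete predictors, the NB computation reduces to a product of per-predictor conditional probabilities. A minimal sketch (our own implementation; the Laplace smoothing and the tiny dataset are illustrative choices, not taken from the paper's experiments):

```python
import math
from collections import Counter, defaultdict

def train_nb(X, y):
    """Estimate P(z) and P(x_f = v | z) from discretized data."""
    classes = Counter(y)                 # class -> count
    cond = defaultdict(Counter)          # (feature, class) -> value counts
    values = defaultdict(set)            # feature -> observed discrete values
    for row, z in zip(X, y):
        for f, v in enumerate(row):
            cond[(f, z)][v] += 1
            values[f].add(v)
    return classes, cond, values

def predict_nb(row, classes, cond, values):
    """Return the class maximizing log P(z) + sum_f log P(x_f | z),
    with Laplace smoothing on the conditional estimates."""
    n = sum(classes.values())
    def log_post(z):
        lp = math.log(classes[z] / n)
        for f, v in enumerate(row):
            k = len(values[f])           # smoothing denominator
            lp += math.log((cond[(f, z)][v] + 1) / (classes[z] + k))
        return lp
    return max(classes, key=log_post)

# Two discretized predictors; target 0 = healthy, 1 = diseased.
X = [(0, 0), (0, 1), (1, 1), (1, 1)]
y = [0, 0, 1, 1]
model = train_nb(X, y)
```

The conditional-independence assumption appears in the sum over features inside log_post: each predictor contributes its own term given the class.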
Accuracy is a widely used measure of predictive performance (see the Methods section). The mean accuracies for EBD and FI for C4.5 and NB are given in Table 2. EBD has higher mean accuracy on 17 datasets for each of C4.5 and NB, respectively. FI has higher mean accuracy on 4 datasets and 3 datasets for C4.5 and NB, respectively. EBD and FI have the same mean accuracy on 4 datasets and 3 datasets for C4.5 and NB, respectively. Overall, EBD shows an increase in accuracy of 2.02% and 0.76% for C4.5 and NB, respectively. This increased performance is statistically significant at the 5% significance level on the Wilcoxon signed rank test for both C4.5 and NB.
[Figure 1. Pseudocode for the efficient Bayesian discretization (EBD) method. The algorithm takes as input the dataset D and the structure prior parameter λ, and outputs an optimal Bayesian discretization of variable X relative to D. It defines the sorted dataset D, the lists S a, b , and the discretizations T b as in the main text; over the n' unique values of X it maintains per-value target counts, an array U holding the target-value distribution of a candidate interval, and an array V of optimal scores, with MarginalLikelihood(U) the per-interval marginal likelihood that follows from Equation 7. The EBD method uses dynamic programming and runs in O(n^2) time, as indicated by its two nested for loops (n is the number of instances in the dataset).]

The AUC is a measure of the discriminative performance of a classifier that accounts for datasets that have a highly skewed distribution over the target variable (see the Methods section). The mean AUCs for EBD and FI for C4.5 and NB are given in Table 3. For C4.5, EBD has higher mean AUC on 17 datasets, FI has higher
mean AUC on 5 datasets, and both discretization methods have the same mean AUC on 2 datasets. For NB, EBD has higher mean AUC than FI on 16 datasets, lower mean AUC on 6 datasets, and the same mean AUC on two datasets. Overall, EBD shows an improvement in AUC of 1.07% and 1.12% for C4.5 and NB, respectively, and both increases in AUC are statistically significant at the 5% level on the Wilcoxon signed rank test.
Robustness is the ratio of the accuracy on the test dataset to that on the training dataset expressed as a percentage (see the Methods section). The mean robustness for EBD and FI for C4.5 and NB are given in Table 4. For C4.5, EBD has higher mean robustness on 10 datasets, FI has higher mean robustness on 11 datasets, and both have equivalent mean robustness on three datasets. For NB, EBD has better performance than FI on 9 datasets, worse performance on 13 datasets, and similar performance on two datasets. Overall, EBD shows a small decrease in mean robustness of 0.26% and 0.68% for C4.5 and NB, respectively, that are not statistically significant at the 5% level on the Wilcoxon signed rank test.
Stability quantifies how different training datasets affect the variables being selected (see the Methods section). The mean stabilities for EBD and FI are given in Table 5. Overall, EBD has higher stability than FI, but only by an overall average of 0.02, which nevertheless is statistically significant at the 5% significance level on the Wilcoxon signed rank test.

[Figure caption: An example of the application of the efficient Bayesian discretization (EBD) method. This example shows the progression of the EBD method when applying the pseudocode given in Figure 1 to the dataset of six instances that is introduced in the main text. An asterisk denotes the discretization with the highest EBD score in a given iteration, as indexed by a. There are 2^5 = 32 possible discretizations for a dataset of six instances; for this dataset EBD explicitly evaluates only the 6 discretizations shown in bold font.]

[Table 1 legend: In the Type column, T denotes transcriptomic and P denotes proteomic. In the P/D column, P denotes prognostic and D denotes diagnostic. #t is the number of values of the target variable and #n is the number of instances in the dataset. #V is the number of predictor variables. M is the proportion of the data that has the majority target value.]

Table 6 gives the mean number of intervals obtained by EBD and FI. The first column gives for each dataset the proportion of predictor variables that were discretized into a single interval, that is, with no cut points. Such predictors are considered uninformative and are not used for learning a classifier. The second column gives for each dataset the mean number of intervals among those predictors that were discretized to more than one interval. The third column reports the mean number of intervals over all predictors, including those with no cut points. Overall, the application of EBD resulted in more predictors with more than one interval, relative to the application of FI, by an overall average of 9%.
Also, the mean number of intervals per predictor was greater for EBD than for FI, but this difference was not statistically significant at the 5% level on the Wilcoxon signed rank test. Thus, although the mean complexity for EBD is slightly greater (1.27 versus 1.16 intervals per predictor), overall EBD and FI are similar in terms of the complexity of the discretizations they produce.
The results of the statistical comparison of the EBD and FI discretization methods using the Wilcoxon paired samples signed rank test are given in Table 7. As shown in the table, the accuracy and AUC of the C4.5 and NB classifiers were statistically significantly better at the 5% level when the predictor variables were discretized using EBD rather than FI. EBD was statistically significantly more stable to the variability of the datasets than FI. However, EBD was less robust than FI, though not statistically significantly so, and produced slightly more complex discretizations.

Running Times
We conducted the experiments on an AMD X2 4400+ 2.2 GHz personal computer with 2 GB of RAM that was running Windows XP. For the 24 datasets included in our study, on average, to discretize all the predictor variables […].

[Table 2 legend: Accuracies for the EBD and FI discretization methods are obtained from the application of the C4.5 and NB classifiers to the discretized variables. The mean and the standard error of the mean (SEM) of the accuracy for each dataset is obtained by 10 × 10 cross-validation. For each dataset, the higher accuracy is shown in bold font and equal accuracies are underlined.]

Discussion
We have developed an efficient Bayesian discretization method that uses a Bayesian score to evaluate a discretization and employs dynamic programming to efficiently search and identify the optimal discretization. We evaluated the performance of EBD on several measures and compared it to the performance of FI. Table 8 shows the number of wins, draws and losses when comparing EBD to FI on accuracy, AUC, stability and robustness.
On both accuracy and AUC, which are measures of discrimination performance, EBD demonstrated statistically significant improvement over FI. EBD was more stable than FI, which indicates that EBD is less sensitive to the variability of the training datasets. FI was moderately better in terms of robustness, but not statistically significantly so. On average, EBD produced slightly more intervals per predictor variable, as well as a greater proportion of predictors that had more than one interval. Thus, EBD produced slightly more complex discretizations than FI.

A distinctive feature of EBD is that it allows the specification of parameter and structure priors. Although we used non-informative parameter priors in the evaluation reported here, EBD readily supports the use of informative prior probabilities, which enables users to specify background knowledge that can influence how a predictor variable is discretized. The a ij hyperparameters in Equation 6 are the parameter priors. Suppose there are two similar biomedical datasets A and B containing the same variables, but different populations of individuals, and we are interested in discretizing the variables. The data in A could provide information for defining the parameter priors in Equation 6 before its application to the data in B. There is a significant amount of flexibility in defining this mapping for using data in a similar (but not identical) biomedical dataset to influence the discretization of another dataset. The λ parameter in Equation 10 (described in the Methods section) allows the user to provide a structure prior. This is where prior knowledge might be particularly helpful, by specifying (probabilistically) the expected number of cut points per predictor variable. Although we have presented a structure prior that is based on a Poisson distribution, the EBD algorithm can be readily adapted to use other distributions. In doing so, the main assumption is that the structure prior of an interval can be composed as a product of the structure priors of its subintervals. The running times show that although EBD runs slower than FI, it is sufficiently fast to be applicable to real-world, high-dimensional datasets. Overall, our results indicate that EBD is easy to implement and is sufficiently fast to be practical. Thus, we believe EBD is an effective discretization method that can be useful when applied to high-dimensional biomedical data.

[Table 4 legend: The mean and the standard error of the mean (SEM) of robustness for each dataset is obtained by 10 × 10 cross-validation. For each dataset, the higher robustness value is shown in bold font and equal robustness values are underlined.]

[Table 5 legend: The mean stability for each dataset is obtained by 10 × 10 cross-validation. For each dataset, the higher stability value is shown in bold font and equal stability values are underlined.]

[Table 6 legend: The mean fraction of predictor variables discretized to one interval (no cut points), the mean number of intervals for predictor variables discretized to more than one interval (at least one cut point), and the mean number of intervals for all predictor variables for each dataset are obtained by 10-fold cross-validation done ten times. For each dataset, the higher value is shown in bold font and equal values are underlined.]

[Table 7 legend: In the first column, the range of a measure is given in square brackets, where n is the number of instances in the dataset. In the last column, the number on top is the Z statistic and the number at the bottom is the corresponding p-value. On all performance measures except the mean number of intervals per predictor, the Z statistic is positive when EBD performs better than FI. Two-tailed p-values of 0.05 or smaller are in bold, indicating that EBD performed statistically significantly better at that level.]
We note that EBD and FI differ both in the score used for evaluating candidate discretizations and in the search method employed. As a result, the differences in performance of the two methods may be due to the score, the search method, or a combination of the two. A version of FI could be developed that uses dynamic programming to minimize its cost function, namely entropy, in a manner directly parallel to the EBD algorithm that we introduce in this paper. Such a comparison, however, is beyond the scope of the current paper. Moreover, since the FI method was developed and is widely implemented using greedy search, we compared EBD to it rather than to a modified version of FI using dynamic programming search. It would be interesting in future research to evaluate the performance of a dynamic programming version of FI.

Conclusions
High-dimensional biomedical data obtained from transcriptomic and proteomic studies are often pre-processed for analysis that may include the discretization of continuous variables. Although discretization of continuous variables may result in loss of information, discretization offers several advantages. It broadens the range of data mining methods that can be applied, can reduce the time taken for the data mining methods to run, and can improve the predictive performance of some data mining methods. In addition, the thresholds and intervals produced by discretization have the potential to assist the investigator in selecting biologically meaningful intervals. For example, the intervals selected by discretization for a transcriptomic variable provide a starting point for defining normal, over-, and underexpression for the corresponding gene.
The FI discretization method is a popular discretization method that is used in a wide range of domains. While it is computationally efficient, it is not guaranteed to find the optimal discretization for a predictor variable. We have developed a Bayesian discretization method called EBD that is guaranteed to find the optimal discretization (i.e., the discretization with the highest Bayesian score) and is also sufficiently computationally efficient to be applicable to high-dimensional biomedical data.

Biomedical Datasets
The performance of EBD was evaluated on a total of 24 datasets that included 21 publicly available transcriptomic datasets and two publicly available proteomic datasets that were acquired on the Surface-Enhanced Laser Desorption/Ionization Time of Flight (SELDI-TOF) mass spectrometry platform. Also included was a University of Pittsburgh proteomic dataset that contains diagnostic data on patients with Amyotrophic Lateral Sclerosis; these data were acquired on the SELDI-TOF platform [17]. The 24 datasets along with their types, number of instances, number of variables, and the majority target value proportions are given in Table 1. The 23 publicly available datasets used in our experiments have been extensively studied in prior investigations [17][18][19][20][21][22][23][24][25][26][27][28][29][30][31][32][33][34].

Additional Details about the EBD Algorithm
In this section, we first provide additional details about the Prior probability function that is used by EBD. Next, we discuss details of the EBD pseudocode that appears in Figure 1.
Let D be a dataset of n instances consisting of the list ((X_1, Z_1), (X_2, Z_2), ..., (X_k, Z_k), ..., (X_n, Z_n)) that is sorted in ascending order of X_k, where X_k is a real value of the predictor variable and Z_k is the associated integer value of the target variable. Let λ be the mean of a Poisson distribution that represents the expected number of cut points between X_1 and X_n in discretizing X to predict Z. Note that zero, one, or more than one cut points can occur between any two consecutive values of X in the training set. Let Prior(k) be the prior probability of there being at least one cut point between values X_k and X_(k+1) in the training set. For k from 1 to n−1, we define the EBD Prior function as follows:

Prior(k) = 1 − exp(−λ · d(k, k+1) / d(1, n)),

where d(a, b) = X_b − X_a represents the distance between the two values X_a and X_b of X, and X_b is greater than X_a. When k = 0 and k = n, boundary conditions occur. We need an interval below the lowest value of X in the training set and above the highest value. Thus, we define Prior(0) = 1, which corresponds to the lowest interval, and Prior(n) = 1, which corresponds to the highest interval.
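To make the boundary conditions and the role of λ concrete, the Prior function can be sketched in a few lines. The exponential form used here, the Poisson-process probability of at least one cut point in a stretch of X proportional to its length, is our reading of the definitions above (λ expected cut points spread over d(1, n)) rather than a verbatim transcription of the paper's formula:

```python
import math

def ebd_prior(xs, lam, k):
    """Prior probability of at least one cut point between X_k and X_(k+1),
    where xs is the sorted list of predictor values and k is 1-indexed as in
    the text.  Assumes the Poisson-process form
    Prior(k) = 1 - exp(-lam * d(k, k+1) / d(1, n))."""
    n = len(xs)
    if k == 0 or k == n:           # boundary intervals below X_1 and above X_n
        return 1.0
    d_total = xs[-1] - xs[0]       # d(1, n): full range of X in the training set
    d_k = xs[k] - xs[k - 1]        # d(k, k+1) with Python's 0-indexed list
    # P(zero cut points in a stretch of length d_k) = exp(-lam * d_k / d_total)
    return 1.0 - math.exp(-lam * d_k / d_total)
```

Note that wider gaps between consecutive values of X receive a higher prior probability of containing a cut point, which matches the distance-weighting described above.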
The EBD pseudocode shown in Figure 1 works as follows. Consider finding the optimal discretization of the subsequence S_1,a for a being some value between 1 and n (see endnote 3). Assume we have already found the highest scoring discretization of X for each of the subsequences S_1,1, S_1,2, ..., S_1,a−1. Let V_1, V_2, ..., V_(a−1) denote the respective scores of these optimal discretizations. Let Score_ba be the score of the subsequence S_b,a when it is considered as a single interval, that is, when it has no internal cut points; this is the variable Score_ba in Figure 1. For all b from a down to 1, EBD computes V_(b−1) × Score_ba, which is the score of the highest scoring discretization of S_1,a that includes S_b,a as a single interval. Since this score is derived from two other scores, we call it a composite score. The fact that this composite score is a product of two scores follows from the decomposition of the scoring measure we are using, as given by Equation 9. In particular, both the prior and the marginal likelihood components of that score are decomposable. Over all b, EBD chooses the maximum composite score, which corresponds to the optimal discretization of S_1,a; this score is stored in V_a. By repeating this process for a from 1 to n, EBD derives the optimal discretization of S_1,n, which is our overall goal.
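The recurrence just described can be sketched as follows. This is an illustrative sketch, not the paper's implementation: the single-interval score below is a hypothetical stand-in of the uniform-Dirichlet form suggested by endnote 4 (which mentions factorials up to (J−1+n)!), and the structure prior is passed in as a caller-supplied function:

```python
import math

def interval_score(counts):
    """Marginal likelihood of one interval under a uniform Dirichlet prior:
    (J-1)! * prod(n_j!) / (n + J - 1)!.  An assumed stand-in for the
    Bayesian score of Equation 6, computed directly for clarity."""
    J, n = len(counts), sum(counts)
    num = math.factorial(J - 1)
    for c in counts:
        num *= math.factorial(c)
    return num / math.factorial(n + J - 1)

def ebd(z, J, prior):
    """Dynamic-programming search over discretizations of a sorted predictor.
    z[k] is the target value (0..J-1) paired with the (k+1)-th smallest
    predictor value; prior(b, a) is a caller-supplied structure-prior weight
    for treating S_b,a as one interval.  Returns the optimal score and the
    1-indexed start positions of the intervals."""
    n = len(z)
    V = [1.0] + [0.0] * n          # V[a]: best score of S_1,a; V[0] = 1
    T = [0] * (n + 1)              # T[a]: start b of the last interval in the
                                   # best discretization of S_1,a (traceback)
    for a in range(1, n + 1):
        counts = [0] * J
        for b in range(a, 0, -1):              # extend S_b,a downward
            counts[z[b - 1]] += 1              # incremental frequency counts
            composite = V[b - 1] * prior(b, a) * interval_score(counts)
            if composite > V[a]:
                V[a], T[a] = composite, b
    cuts, a = [], n                # recover interval starts from the traceback
    while a > 0:
        cuts.append(T[a])
        a = T[a] - 1
    return V[n], list(reversed(cuts))
```

With a flat prior, a sequence of target values 0, 0, 1, 1 is split into two pure intervals, since the product of the two interval scores exceeds the score of any other partition.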
Several lines of the pseudocode in Figure 1 deserve comment. Line 8 incrementally builds a frequency (count) distribution for the target variable as the subsequence S_b,a is extended. Line 11 determines whether a better discretization has been found for the subsequence S_1,a. If so, the new (higher) score and its corresponding discretization are stored in V_a and T_a, respectively. Line 15 incrementally updates P to maintain a prior that is consistent with there being no cut points in the subsequence S_b,a.
We can obtain the time complexity of EBD as follows. The pseudocode in Figure 1 contains two nested loops: the outer loop over a runs n times, and for each a the inner loop over b runs at most n times. With the incremental updates described above, the work performed in each inner iteration is O(1) (see endnote 4), and the factorials used by the MarginalLikelihood function can be precomputed in O(n) time, so EBD runs in O(n^2) time overall. The numbers computed within EBD can become very small. Thus, it is most practical to use logarithmic arithmetic. A logarithmic version of EBD, called lnEBD, is given in Additional file 1.
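As a sketch of the logarithmic arithmetic, the single-interval score can be computed with lgamma rather than raw factorials, so that the tiny probabilities arising on large datasets do not underflow. The functional form (a uniform-Dirichlet marginal likelihood, matching the factorials mentioned in endnote 4) is an assumption:

```python
import math

def ln_interval_score(counts):
    """Log-space single-interval score:
    ln[(J-1)! * prod(n_j!) / (n + J - 1)!], computed with lgamma.
    In the log-space dynamic program, products of scores become sums,
    which leaves the argmax over discretizations unchanged."""
    J, n = len(counts), sum(counts)
    ln = math.lgamma(J)                 # ln (J-1)!
    for c in counts:
        ln += math.lgamma(c + 1)        # ln n_j!
    return ln - math.lgamma(n + J)      # minus ln (n + J - 1)!
```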

Discretization and Classification
For the FI discretization method, we used the implementation in the Waikato Environment for Knowledge Analysis (WEKA) version 3.5.6 [35]. We implemented the EBD discretization method in Java so that it can be used in conjunction with WEKA. For our experiments, we used the J4.8 classifier (which is WEKA's implementation of C4.5) and the naïve Bayes classifier as implemented in WEKA. Given an instance for which the target value is to be predicted, both classifiers compute the probability distribution over the target values. In our evaluation, the distribution over the target values was used directly; if a single target value was required, the target variable was assigned the value that had the highest probability.

Evaluation Measures
We conducted experiments for the EBD and FI discretization methods using 10 × 10 cross-validation. The discretization methods were evaluated on the following five measures: accuracy, area under the Receiver Operating Characteristic curve (AUC), robustness, stability, and the average number of intervals per variable.
Accuracy is a widely used performance measure for evaluating a classifier and is defined as the proportion of correct predictions of the target made by the classifier relative to the number of test instances (samples). The AUC is another commonly used discriminative measure for evaluating classifiers. For a binary classifier, the AUC can be interpreted as the probability that the classifier will assign a higher score to a randomly chosen instance that has a positive target value than it will to a randomly chosen instance with a negative target value. For datasets in which the target takes more than two values, we used the method described by Hand and Till [36] for computing the AUC.
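For the binary case, this probabilistic interpretation of the AUC translates directly into code. Counting ties as one half is a common convention and an assumption here; Hand and Till's multiclass measure averages such pairwise AUCs over all pairs of classes:

```python
from itertools import product

def auc(scores_pos, scores_neg):
    """Binary AUC via its probabilistic interpretation: the probability that
    a randomly chosen positive instance receives a higher classifier score
    than a randomly chosen negative instance, with ties counted as 0.5."""
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p, q in product(scores_pos, scores_neg))
    return wins / (len(scores_pos) * len(scores_neg))
```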
Robustness is defined as the ratio of the accuracy on the test dataset to that on the training dataset expressed as a percentage [5]. It assesses the degree of overfitting of a discretization method.
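As a formula, this definition of robustness is simply:

```python
def robustness(test_accuracy, train_accuracy):
    """Robustness as defined in the text: test-set accuracy as a percentage
    of training-set accuracy.  Values near 100 indicate little overfitting;
    lower values indicate that performance degrades on unseen data."""
    return 100.0 * test_accuracy / train_accuracy
```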
Stability measures the sensitivity of a variable selection method to differences in training datasets, and it quantifies how different training datasets affect the variables being selected. Discretization can be viewed as a variable selection method, in that variables with a non-trivial discretization are selected while variables with a trivial discretization are discarded when the discretized variables are used in learning a classifier. A variable has a trivial discretization if it is discretized to a single interval (i.e., has no cut points), while it has a non-trivial discretization if it is discretized to more than one interval (i.e., has at least one cut point).
We used a stability measure that is an extension of the measure developed by Kuncheva [37]. To compute stability, first a similarity measure is defined for two sets of variables that, for example, would be obtained from the application of a discretization method to two training datasets on the same variables. Given two sets of selected variables, v_i and v_j, the similarity score we used is given by the following equation:

Sim(v_i, v_j) = (r − (k_i · k_j) / n) / (min(k_i, k_j) − (k_i · k_j) / n),

where k_i is the number of variables in v_i, k_j is the number of variables in v_j, r is the number of variables that are present in both v_i and v_j, n is the total number of variables, min(k_i, k_j) is the smaller of k_i and k_j and represents the largest value r can attain, and (k_i · k_j) / n is the expected value of r that is obtained by modeling r as a random variable with a hypergeometric distribution. This similarity measure computes the degree of commonality between two sets with an arbitrary number of variables, and it varies between −1 and 1, with 0 indicating that the number of variables common to the two sets is what would be expected from a simple random selection of k_i and k_j variables from n variables, and 1 indicating that the two sets contain the same variables. When v_i or v_j or both have no variables, or when both v_i and v_j contain all predictor variables, Sim(v_i, v_j) is undefined, and we assume the value of the similarity measure to be 0.
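A sketch of this similarity measure, with the undefined cases mapped to 0 as described in the text (the guard for any remaining zero-denominator case is an added assumption):

```python
def sim(v_i, v_j, n):
    """Kuncheva-style similarity of two selected-variable sets drawn from
    n total variables, reconstructed from the definitions in the text."""
    k_i, k_j = len(v_i), len(v_j)
    # The text defines Sim as 0 when either set is empty or both sets
    # contain all n variables.
    if k_i == 0 or k_j == 0 or (k_i == n and k_j == n):
        return 0.0
    r = len(set(v_i) & set(v_j))       # variables present in both sets
    expected = k_i * k_j / n           # E[r] under the hypergeometric model
    denom = min(k_i, k_j) - expected
    # Assumption: also map any other degenerate (zero-denominator) case to 0.
    return (r - expected) / denom if denom else 0.0
```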

Experimental Methods
In performing cross-validation, each training set (fold) contains a set of variables that are assigned one or more cut points; we can consider these as the selected predictor variables for that fold. We would like to measure how similar the selected variables are across all the training folds. For a single run of 10-fold cross-validation, the similarity scores of all possible pairs of folds are calculated using Equation 11. With 10-fold cross-validation, there are 45 pairs of folds, and stability is computed as the average similarity over all these pairs. For the ten runs of 10-fold cross-validation, we averaged the stability scores obtained from the ten runs to obtain an overall stability score. The stability score varies between −1 and 1; a better discretization method will be more stable and hence have a higher score.

For comparing the performance of the discretization methods, we used the Wilcoxon paired samples signed rank test. This is a non-parametric procedure concerning a set of paired values from two samples that tests the hypothesis that the population medians of the samples are the same [38]. In evaluating discretization methods, it is used to test whether two such methods differ significantly in performance on a specified evaluation measure.
Endnotes

1. This is based on a search with the phrase "Fayyad and Irani's discretization" that we performed on December 24, 2010.

2. However, in general we can use background knowledge and belief to set the values of the α_ij.

3. Technically, we should use the term n' here, as it is defined in Figure 1, but we use n for simplicity of notation.

4. We note that line 13 requires some care in its implementation to achieve O(1) time complexity, but it can be done by using an appropriate data structure. Also, the MarginalLikelihood function requires computing factorials from 1! to as high as (J−1+n)!; these factorials can be precomputed in O(n) time and stored for use in the MarginalLikelihood function.