Using random forest for reliable classification and cost-sensitive learning for medical diagnosis

Yang, Fan; Wang, Hua-zhen; Mi, Hong; Lin, Cheng-de; Cai, Wei-wen

doi:10.1186/1471-2105-10-S1-S22

Volume 10 Supplement 1

Selected papers from the Seventh Asia-Pacific Bioinformatics Conference (APBC 2009)

Research
Open access
Published: 30 January 2009


BMC Bioinformatics volume 10, Article number: S22 (2009)


Abstract

Background

Most machine-learning classifiers output label predictions for new instances without indicating how reliable the predictions are. The applicability of these classifiers is limited in critical domains where incorrect predictions have serious consequences, like medical diagnosis. Further, the default assumption of equal misclassification costs is most likely violated in medical diagnosis.

Results

In this paper, we present a modified random forest classifier that is incorporated into the conformal predictor scheme. A conformal predictor is a transductive learning scheme that uses Kolmogorov complexity to test the randomness of a particular sample with respect to the training set. Our method is well-calibrated: the performance can be set prior to classification, and the accuracy rate is exactly equal to the predefined confidence level. Further, to address the cost-sensitive problem, we extend our method to a label-conditional predictor, which takes into account different costs for misclassifications in different classes and allows a different confidence level to be specified for each class. Intensive experiments on benchmark datasets and real-world applications show that the resultant classifier is well-calibrated and able to control the specific risk of each class.

Conclusion

The method of using the RF outlier measure to design a nonconformity measure benefits the resultant predictor. Further, a label-conditional classifier is developed and turns out to be an alternative approach to the cost-sensitive learning problem that relies on label-wise predefined confidence levels. The target of minimizing the risk of misclassification is achieved by specifying a different confidence level for each class.

Background

Most machine-learning classifiers output predictions for new instances without indicating how reliable the predictions are. The application of these classifiers is limited in domains where incorrect predictions have serious consequences. Medical practitioners need a reliable assessment of the risk of error for individual cases [1]. Thus, given a prediction accompanied by a corresponding confidence value, a system can decide whether it is safe to classify. The recently introduced Conformal Predictor (CP) [2–5] is a promising framework that produces predictions coupled with confidence estimation. Its developers drew on the formal relationship among Kolmogorov complexity, universal Turing machines and strict minimum message length (MML). They treat transductive prediction as a randomness test that returns nonconformity scores closely associated with the iid (independent and identically distributed) assumption governing all of the examples. When classifying a new instance, CP assigns a p-value to each tentative label to approximate the confidence level of the prediction. CP is more than a reliable classifier: its most novel and valuable feature is hedged prediction, i.e., the performance can be set prior to classification and the prediction is well-calibrated, in that the accuracy rate is exactly equal to the predefined confidence level. This is an advantage over the Bayesian approach, which often relies on strong underlying assumptions. In this paper, we use a random forest outlier measure to design the nonconformity score and develop a modified random forest classifier.

Since reports from both academia and practice indicate that the default assumption of equal misclassification costs is most likely violated [6], a natural desideratum is to extend CP to a label-wise CP, which takes into account different costs for misclassification errors of different classes and allows a different confidence level to be specified for each possible classification of an instance. In this paper, we investigate how to extend CP to label-conditional CP, which can handle non-uniform costs of errors in classification.

Consider a classification problem E: reality outputs examples Z^(n-1) = {(x₁, y₁),..., (x_{n-1}, y_{n-1})}, with each (x_i, y_i) ∈ X × Y, and an unlabeled test instance x_n, where X denotes a measurable space of possible instances, x_i ∈ X, i = 1, 2,..., n - 1,...; Y denotes a measurable space of possible labels, y_i ∈ Y, i = 1, 2,..., n - 1,...; and the example space is represented as Z = X × Y. We assume that each example is generated by the same unknown probability distribution P over Z, which satisfies the exchangeability assumption.

Conformal predictor (CP)

CP is designed to add confidence estimation to machine learning algorithms. Its framework generalizes from the iid assumption to exchangeability, under which the order of the examples carries no information.

To construct a prediction set for an unlabeled instance x_n, CP operates in a transductive manner and online setting. Each possible label is tried as a label for x_n. In each try we form an artificial sequence (x₁, y₁),..., (x_n, y), then we measure how likely it is that the resulting sequence is generated by the unknown distribution P and how nonconforming x_n is with respect to other available examples.

Given the classification problem E, the function A_n: Z × Z^(n-1) → R is a nonconformity measure if, for any n ∈ N,

α_i := A_n(z_i, ⟦z₁,..., z_{i-1}, z_{i+1},..., z_n⟧), i = 1,..., n - 1

α_n := A_n(z_n, ⟦z₁,..., z_{n-1}⟧)

(1)

where ⟦·⟧ denotes a "bag", i.e. a collection whose elements have no order. The symbol α denotes a nonconformity score: the larger α_i is, the stranger z_i is with respect to the distribution. In short, a nonconformity measure is a measurable kernel that maps an example and a bag of examples to R, so that the value of α_i does not depend on the order of the other examples in the sequence.

For confidence level 1 - ε (ε is the significance level) and any n ∈ N, a conformal predictor is defined as:

Γ^{ε}(z_{1}, ..., z_{n-1}, x_{n}) = \{ y \in Y : p_{y} = \frac{|\{ i = 1, ..., n : α_{i} \geq α_{n} \}|}{n} > ε \}

(2)

A smoothed conformal predictor (smoothed CP) is defined as:

Γ^{ε, τ_{n}}(z_{1}, ..., z_{n-1}, x_{n}, τ_{n}) = \{ y \in Y : p_{y} = \frac{|\{ i = 1, ..., n : α_{i} > α_{n} \}| + τ_{n} |\{ i = 1, ..., n : α_{i} = α_{n} \}|}{n} > ε \}

(3)

where y is a possible label for x_n; p_y is called the p value, which is the randomness level of z_n = (x_n, y) and also the confidence level of y being the true label; and τ_n, n ∈ N, are random variables distributed uniformly in [0, 1]. Smoothed CP is a more powerful version of CP, which benefits from the p values being distributed uniformly in [0, 1].
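As an illustration of Eqs. (2) and (3), the following minimal Python sketch computes the p value of a test example from a list of nonconformity scores. The function name `p_value` and its arguments are hypothetical and are not taken from the authors' source code; the list is assumed to hold α_1,..., α_n with the test example's score α_n last.

```python
import random

def p_value(scores, smoothed=False, rng=random):
    """p value of the last example, given nonconformity scores alpha_1..alpha_n.

    scores[-1] is alpha_n, the score of the test example under a tentative label.
    """
    alpha_n = scores[-1]
    n = len(scores)
    if not smoothed:
        # Eq. (2): fraction of examples at least as nonconforming as the test example
        return sum(a >= alpha_n for a in scores) / n
    # Eq. (3): ties (which always include the test example itself) are weighted by tau_n
    greater = sum(a > alpha_n for a in scores)
    ties = sum(a == alpha_n for a in scores)
    tau = rng.random()  # tau_n drawn uniformly from [0, 1)
    return (greater + tau * ties) / n
```

For scores `[0.1, 0.5, 0.3, 0.4]`, two of the four scores are at least 0.4, so the plain p value is 0.5, while the smoothed p value falls between 0.25 and 0.5 depending on the draw of τ_n.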

Let Γ^ε = {y ∈ Y: p_y > ε}, and let y_n denote the true label of x_n.

If |Γ^ε| = 1, we define it as a certain prediction.

If |Γ^ε| > 1, it is an uncertain prediction.

If |Γ^ε| = 0, i.e. Γ^ε = ∅, it is an empty prediction.

If y_n ∈ Γ^ε, we define it as a corrective prediction with confidence level 1 - ε. Otherwise, it is defined as an error.

When it comes to forced point prediction, CP selects the label with maximum p value as the prediction.
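The definitions above can be sketched in a few lines of Python. The helper names and the p values in the example are hypothetical and serve only to show how the prediction region changes with the significance level ε.

```python
def prediction_region(p_values, epsilon):
    """Labels whose p value exceeds the significance level epsilon."""
    return {label for label, p in p_values.items() if p > epsilon}

def region_kind(region):
    """Classify a prediction region as empty, certain or uncertain."""
    if len(region) == 0:
        return "empty"
    return "certain" if len(region) == 1 else "uncertain"

def forced_prediction(p_values):
    """Forced point prediction: the label with the maximum p value."""
    return max(p_values, key=p_values.get)

# Hypothetical p values for a three-class problem
p_values = {"A": 0.60, "B": 0.08, "C": 0.02}
for eps in (0.05, 0.10, 0.70):
    region = prediction_region(p_values, eps)
    print(eps, sorted(region), region_kind(region))
print("forced:", forced_prediction(p_values))
```

With these p values the region is uncertain ({A, B}) at ε = 0.05, certain ({A}) at ε = 0.10 and empty at ε = 0.70, while the forced prediction is A in every case.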

CP was originally proposed for online learning, and Vovk [7] offered a theoretical proof that, in the online setting and in the long run, the prediction set Γ^ε contains the true label with probability 1 - ε, so that the rate of wrong predictions is bounded by ε. In particular, the smoothed CP is exactly valid, i.e., the rate of wrong predictions is exactly equal to ε. This is summarized as the well-calibrated property:

\limsup_{n \to \infty} \frac{\mathrm{Err}_{n}^{ε}}{n} = ε

(4)

where $\mathrm{Err}_{n}^{ε}$ is the number of erroneous predictions among the first n at the confidence level 1 - ε (see [7] for a detailed proof). Extensive experiments have demonstrated that CP is also applicable to offline learning, which broadens its range of applications.
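The calibration property can be checked numerically with a small simulation. The sketch below is not the authors' experiment: it assumes, purely for illustration, that the nonconformity score of each new example under its true label is an independent uniform draw, which makes the scores exchangeable. It then runs the smoothed p value of Eq. (3) online and counts how often the true label would be excluded at significance level ε.

```python
import random

def online_error_rate(n_steps, epsilon, seed=0):
    """Empirical error rate of a smoothed CP on synthetic exchangeable scores."""
    rng = random.Random(seed)
    seen = []    # nonconformity scores observed so far
    errors = 0
    for _ in range(n_steps):
        alpha_n = rng.random()  # assumed score of the new example with its true label
        greater = sum(a > alpha_n for a in seen)
        ties = 1 + sum(a == alpha_n for a in seen)  # the new example ties with itself
        p_true = (greater + rng.random() * ties) / (len(seen) + 1)
        if p_true <= epsilon:   # the true label is left out of the prediction region
            errors += 1
        seen.append(alpha_n)
    return errors / n_steps

print(online_error_rate(1000, 0.10))
```

Under these assumptions the printed rate should lie close to 0.10, up to statistical fluctuation, and repeating the run with other values of ε should track ε in the same way.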

Different nonconformity measures have been developed from existing algorithms, such as SVM, KNN and so on [9–11]. All CPs have the calibration property, but the efficiency of a CP depends largely on the design of its nonconformity measure [8]. Efficiency is measured by the proportion of certain and empty predictions among all predictions. Certain predictions are preferable because they are more informative than uncertain ones. CP has been successfully employed to hedge these popular machine learning methods, and this paper shows that CP-RF is more efficient than the others.

Random forest (RF)

Breiman's random forest applies the Bagging [12] and Randomization [13] techniques to grow many classification trees, each to the largest extent possible and without pruning. Random forest is especially attractive in the following cases [14, 15]:

(1) Real-world data is noisy and contains many missing values, and some of the attributes are categorical or semi-continuous.
(2) Different data sources need to be integrated, which raises the issue of weighting them.
(3) RF shows high predictive accuracy and is applicable in high-dimensional problems with highly correlated features, a situation that often occurs in bioinformatics, for example in medical diagnosis.

In this paper, the random forest outlier measure is used to design a nonconformity measure in order to incorporate random forest into the CP and label conditional CP scheme. Our method can be used in both online and offline settings.

Cost-sensitive learning problem

In medical diagnosis, the default assumption of equal misclassification costs underlying machine learning techniques is most likely violated. A false negative prediction may have more serious consequences than a false positive prediction. To address this problem, cost-sensitive classification is developed, which considers the varying costs of different misclassification types [16]. Usually a cost matrix is defined or learned to reflect the penalty of classifying samples from one class as another. A cost-sensitive classification method takes a cost matrix into consideration during the model building process [17]. However, how to get a proper cost matrix remains an open question [18]. The definition or learning of a cost matrix is quite subjective. In this paper, we extend our method to label conditional CP to address the cost sensitive problem, and the risk of misclassification of each class is well controlled.

Results

Experimental setup

The experiments are divided into two parts: First, to show the calibration property and efficiency of our method, we demonstrate our method CP-RF on 8 benchmark datasets and a real-world gene expression dataset. Second, to cope with the cost-sensitive problem, we extend CP-RF to label conditional CP-RF, and test its performance on two public application datasets.

Part I Performance of CP-RF

We employ 8 UCI datasets [19], including satellite, isolet, soybean and covertype. Details are given in Table 1, which lists the number of instances (n), the number of classes (c), the number of attributes (a), and the numbers of numeric (num) and nominal (nom) attributes.

Table 1 Datasets used in the experiments


We run CP-RF with 10-fold cross validation in an online fashion, report the average performance, and compare it with TCM-SVM and TCM-KNN. We use the following key indices at each predefined significance level: (1) percentage of certain predictions; (2) percentage of uncertain predictions; (3) percentage of empty predictions; (4) percentage of corrective predictions. These indices differ from the traditional accuracy rate reported for RF, SVM and other traditional classifiers.

Given a significance level ε, the calibration and efficiency can be laid out. We set the number of trees (denoted ntrees) to 1000 and the number of variables to split on at each node (denoted ntry) to the default value $\sqrt{a}$ (a is the number of attributes). In Figures 1–8, we show performance curves for the significance level ε ranging from 0.01 to 1, with the average experimental results on pima (continuous variables), soybean (categorical variables), covertype (mixed variables) and liver (poor data quality), etc.

Figures 1–8 show that the empirical error line is well-calibrated up to negligible statistical fluctuations, which allows the number of errors to be controlled prior to classification. The percentages of corrective predictions at a predefined level of confidence illustrate the calibration of the new algorithm. Figures 1–8 also show high accuracy over a series of significance levels, and some points of interest are extracted in Table 2 for illustration.

Table 2 Corrective predictions at 5 confidence levels


Table 2 demonstrates that CP-RF ensures relatively high accuracy while keeping the risk of error low. It is important in many domains to measure the risk of misclassification and, if possible, to ensure a low risk of error.

The percentage of certain predictions reflects the efficiency of prediction. Notice that the percentage of uncertain predictions decreases monotonically as the significance level increases. How fast it declines to zero depends on the performance of the classifier plugged into the CP framework. Figures 1–8 show that CP-RF performs well and is applicable at the significance levels of 0.20, 0.15, 0.10 and 0.05. For comparison, we apply the standard TCM-KNN and TCM-SVM algorithms provided by Gammerman and Vovk. The ratios of certain predictions on the 8 datasets are given in Table 3, where the efficiency of CP-RF is much better than that of the others. We then compare their performance under forced point prediction in Table 4.

Table 3 Comparison of certain prediction


Table 4 Comparisons of accuracy


It is clear that CP-RF performs well on most of the datasets, especially on those with categorical and mixed variables. In particular, CP-RF outperforms TCM-KNN on the high-dimensional dataset (isolet) and outperforms TCM-SVM on noisy data (covertype).

To compare with the most popular machine learning methods, we consider Acute Lymphoblastic Leukemia (ALL) data which was previously analyzed with traditional machine learning methods. We choose an ALL dataset from [20] for comparison (see Table 5). There are 327 cases with attributes of 12558 genes. The data has been divided into six diagnostic groups and one that contains diagnostic samples that did not fit into any one of the above groups (labeled as "Others"). Each group of samples has been randomized into training and testing parts.

Table 5 The characteristic of ALL data


We report results using CP-RF without discriminating gene selections, i.e. using all of the genes. In order to compare with traditional machine learning method, we apply CP-RF in an offline fashion, and use results of forced prediction. Table 6 demonstrates the detailed classification performance per class in confusion matrix, and it shows that CP-RF makes only 2 misclassifications. The comparison with a fine-tuned Support Vector Machine is laid out in Table 7.

Table 6 Confusion matrix of CP-RF


Table 7 Comparison of accuracy per class


Tables 6 and 7 show that CP-RF outperforms SVM in subgroups 3, 4 and 6, and that the two are well matched in subgroups 2 and 5. Both misclassifications occur in subgroup 1; because this subgroup has only six cases, the error rate appears very large.

Due to the low sample size, the reliability of classification is not guaranteed [21]. We show the distinct advantage of CP-RF with two measures, corrective predictions and certain predictions, at 5 confidence levels in Table 8. The results show that our method is well-calibrated and makes reliable predictions, even in an offline fashion.

Table 8 Corrective and certain prediction at 5 confidence levels


Part II: Performance of label conditional CP-RF

In this part, we choose two multi-class and unbalanced real-world data sets as examples for cost-sensitive learning. The objective is to control the risk of misclassification within each class, because different misclassifications may carry different penalties in medical diagnostics. The first data set is the Thyroid disease records [22], and the problem is to determine whether a patient referred to the clinic is hypothyroid. Each record has 21 attributes in total (15 Boolean and 6 continuous) corresponding to various symptoms and measurements taken from each patient. The data set contains 7200 examples in total and is highly unbalanced in its representation of the 3 possible classes corresponding to diagnoses. Some details are included in Table 9, which contains information on the name, index and size of each class.

Table 9 Datasets used in the experiments


The other dataset is the Chronic Gastritis dataset [23]. Chronic gastritis is a common disease of the digestive system whose notable feature is gastric inflammation. Compared to Western medicine, Chinese medicine has many advantages in its treatment [24]. According to the "Diagnostic criteria for the diagnosis of chronic gastritis combining traditional and western medicine" set by the Integrated Traditional and Western Medicine Digest Special Committee, chronic gastritis is divided into five subtypes (see Table 9). In our application, we collected 709 cases from the digestion outpatient department of the Affiliated Shuguang Hospital between February and October 2006. All cases were inspected by both gastroscopy and pathology. Each case is associated with 55 kinds of symptoms listed in Table 10.

Table 10 ID of the symptoms of chronic gastritis


When constructing the RF, we set the number of trees to 1000 and the number of variables to split on at each node to $\lfloor \sqrt{55} \rfloor$ (a parameter sensitivity analysis of CP-RF is laid out in the next section). For the experiments on the Thyroid disease dataset, the original dataset is randomly divided into a training set (3772 samples) and a test set (3428 samples). For the Chronic Gastritis dataset, we run our method with 10-fold cross validation. Average performances are reported.

We are interested in comparing the performance of label conditional CP-RF with that of CP-RF. Limited by space, we show only part of the results: Figures 9–16 show experimental results on two classes with relatively small samples from the Thyroid and Chronic Gastritis datasets. Figure 9 shows that CP-RF is not well-calibrated at most confidence levels within the class "primary hyperthyroid" on Thyroid, and Figure 10 shows that the efficiency of CP-RF is very low for confidence levels ranging from 0.01 to 1. In contrast, as shown in Figure 11, label conditional CP-RF is well-calibrated up to negligible statistical fluctuations, and the empirical corrective prediction line can hardly be distinguished from the exact calibration line. Aside from calibration, label conditional CP-RF also shows improved predictive efficiency in Figure 12 compared with CP-RF. These contrasts can also be observed in the experiments on the class "deficiency of spleen and stomach" of the Chronic Gastritis dataset (see Figures 13–16).

It is noticeable that the percentage of certain predictions and the certain & correct ratio increase monotonically with the significance level. How fast the proportion of uncertain predictions declines to zero also depends on the quality of the p value calculation.

Some points of interest are extracted in Tables 11 and 12. They demonstrate that label conditional CP-RF can be used to control the risk of misclassification within each class, so that it can be considered an alternative approach to cost-sensitive learning for unbalanced data.

Table 11 Label conditional empirical corrective predictions at 5 confidence levels within each class on thyroid data


Table 12 Label conditional empirical corrective predictions at 5 confidence levels within each class on chronic gastritis data


Discussion

Part I: CP-RF

Parameter sensitivity analysis

A common way to validate an approach is to check its robustness, that is, the approach should produce consistent results independent of the initial parameter settings. Empirical studies show that parameter adjustments have a great impact on CPs. Normalization of examples affects TCM-KNN greatly. For TCM-SVM, not only the normalization but also the type and parameters of the kernel function are important. Such empirical, non-theoretical tuning hints at potential instability.

To demonstrate the parameter insensitivity of CP-RF, we set up different parameters for CP-RF, with ntrees = 500, 1000, 5000 and ntry = 1,..., $\sqrt{a}$. The mean and standard deviation of the forced accuracy on sonar are reported.

For TCM-KNN, we compare the fluctuation of forced accuracy with and without normalization; for TCM-SVM, the effect of different kernel types is illustrated. The results on sonar are summarized in Table 13. CP-RF shows a comparatively small fluctuation as the parameter settings change. This advantage comes from the nature of RF and will benefit medical diagnosis.

Table 13 Comparison of Parameter Sensitivity


Feature selection

Feature selection is an open question in many applications. Our method involves no feature selection. In gene expression analysis, for example, gene selection is a crucial and still unsolved problem. In Yeoh's study, gene expression profiling accurately identified the known prognostically important leukemia subtypes by means of classification using SVM, KNN and ANN when various selected genes were used. However, the classifications were performed after a process of discriminating gene selection by correlation-based feature selection. This process is labor intensive and requires experiential knowledge. It would be better for automated classification to be made with a level of confidence. Moreover, due to the low sample size, although their research has yielded high predictive accuracies that are comparable with or better than traditional clinical techniques, it remains uncertain how well the results for the selected genes will extrapolate to practice in the future [25]. CP-RF is especially suitable for this situation: it works without discriminating gene selection, i.e. it uses all of the genes, which may meet the need for automated classification. Moreover, no selection bias is introduced.

Part II: Label conditional CP-RF

From the experiments in Part I, we can see that although CP-RF is well calibrated globally, i.e. the error rate equals the predefined significance level on the whole test data, it cannot guarantee the reliability of classification for each class, especially on unbalanced datasets. In contrast, label conditional CP-RF is well calibrated label-wise, whereas CP-RF may not satisfy the calibration property in some classes. Because label conditional CP-RF uses only part of the data set when computing each p value, its computational efficiency is also better.

Conclusion

Most state-of-the-art machine learning algorithms cannot provide a reliable measure of their classifications and predictions. This paper addresses the importance of reliability and confidence for classification, and presents a novel method based on a combination of random forest and conformal predictor. The new algorithm hedges the predictions of RF and gives a well-calibrated region prediction by using the proximity matrix generated with RF to build a nonconformity measure of examples. For medical diagnosis, the most important advantage of CP-RF is its calibration: the risk of error can be well controlled. The new method takes advantage of RF and possesses a more precise and stable nonconformity measure. It can deal with redundant and noisy data with mixed types of variables, and is less sensitive to parameter settings. Furthermore, we extend CP-RF to a label conditional version, so that it can control the risk of prediction for each class independently rather than globally. This modified version provides an alternative way of doing cost-sensitive learning. Experiments on benchmark datasets and real-world applications show the usability and superiority of our method.

Methods

CP-RF algorithm

Operating by transductive inference, CP can hedge the predictions of any popular machine learning method from which a nonconformity measure can be constructed [3, 4]. It is a remarkable fact that error calibration is guaranteed regardless of the particular classifier plugged into CP and of the nonconformity measure constructed. However, the quality of the region predictions, and accordingly CP's efficiency, depends on the nonconformity measure. This issue has been discussed and several types of classifiers have been used, such as support vector machine, k-nearest neighbors, nearest centroid, kernel perceptron, naive Bayes and linear discriminant analysis [9–11]. The implementations of these methods are determined by the nature of these classifiers: TCM-SVM and TCM-KP mainly consider binary classification tasks, TCM-KNN and TCM-KNC are the simplest mathematical realizations, and TCM-NB and TCM-LDC are suitable for transductive regression. These methods have demonstrated their applicability and advantages over inductive learning, but many limitations remain. Non-linear datasets are especially challenging for TCM-LDC. TCM-KNN and TCM-NC have difficulties with dispersed datasets. TCM-SVM is so processing intensive that it struggles with large datasets. TCM-KP is practicable only for relatively noise-free data. In short, there are many restrictions on data quality when applying them to real-world data. The difficulties lie in essence in the nonconformity measure, whose design remains an open question.

Taking the above into account, we propose a new algorithm called CP-RF. A random forest classifier naturally leads to a dissimilarity measure between examples in a "strange" space rather than a Euclidean measure. After an RF is grown, since each individual tree is unpruned, the terminal nodes will contain only a small number of observations. Given a random forest of size k, f = {T₁,..., T_k}, and two examples x_i and x_j, we propagate them down all the trees within f. Let D_i = {T_1i,..., T_ki} and D_j = {T_1j,..., T_kj} be the tree node positions for x_i and x_j on all the k trees respectively. A random forest similarity between the two examples is defined as:

prox(i, j) = \frac{1}{k} \sum_{t = 1}^{k} I(T_{ti}, T_{tj})

where

I(T_{ti}, T_{tj}) = \begin{cases} 1 & \text{if } T_{ti} = T_{tj} \\ 0 & \text{otherwise} \end{cases}

i.e., each time instances i and j land in the same terminal node of a tree, the proximity between i and j is increased by one, and the total is divided by the number of trees k. This forms an N × N matrix ⟦prox(i, j)⟧_{N × N}, which is symmetric, positive definite and bounded above by 1, with diagonal elements equal to 1, where N is the total number of cases [26].
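The proximity can be computed directly from the terminal-node positions. The sketch below assumes a hypothetical input `leaves`, where `leaves[i][t]` is the identifier of the terminal node reached by example i in tree t; such identifiers could be obtained, for instance, from the `apply` method of scikit-learn's `RandomForestClassifier`, although the paper does not state which implementation the authors used.

```python
def proximity_matrix(leaves):
    """Random forest proximity matrix from terminal-node positions.

    leaves[i][t] is the id of the terminal node reached by example i in tree t.
    """
    n = len(leaves)
    k = len(leaves[0])  # number of trees in the forest
    prox = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            # count the trees in which examples i and j share a terminal node
            same = sum(1 for a, b in zip(leaves[i], leaves[j]) if a == b)
            prox[i][j] = prox[j][i] = same / k
    return prox

# Three examples passed down a hypothetical forest of three trees
prox = proximity_matrix([[1, 4, 2], [1, 4, 3], [2, 5, 3]])
```

In this toy forest the first two examples share a terminal node in two of the three trees, so their proximity is 2/3, while the first and third examples never meet and have proximity 0.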

Outliers are generally defined as cases that are removed from the main body of the data. In the framework of random forest, outliers are cases whose proximities to all other cases in the data are generally small. A useful revision is to define outliers relative to their class. Thus, an outlier in class c is a case whose proximities to all other class c cases are small. The raw outlier measure for case i in class c, relative to the rest of the training data in class c, is defined as

out_{raw}(i) = \frac{nsample}{\bar{p}(i)}

(5)

where nsample denotes the number of samples in class c and $\bar{p}(i)$ is the average proximity from case i to the rest of the training data within class c, with the sum running over the other cases j in class c:

\bar{p}(i) = \sum_{j} [prox(i, j)]^{2}

The value of out_raw(i) will be large if the average proximity is small. Within each class, find the median of these raw measures, $\overline{out_{raw}}$, and their absolute deviation σ from the median. The raw measure is scaled to arrive at the final outlier measure as follows:

out(i) = \frac{out_{raw}(i) - \overline{out_{raw}}}{σ}

(6)
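Eqs. (5) and (6) can be sketched as follows. This is an illustration rather than the authors' implementation, and it makes three assumptions that the text leaves open: σ is taken to be the median absolute deviation from the class median, a zero σ is replaced by 1, and a zero proximity sum is floored at a tiny constant to avoid division by zero.

```python
from statistics import median

def outlier_scores(prox, labels):
    """Class-wise random forest outlier measure out(i) for every example."""
    n = len(labels)
    raw = []
    for i in range(n):
        # other cases that share the class of case i
        same = [j for j in range(n) if j != i and labels[j] == labels[i]]
        p_bar = sum(prox[i][j] ** 2 for j in same)
        nsample = len(same) + 1                      # class size, including case i
        raw.append(nsample / max(p_bar, 1e-12))      # Eq. (5), floor is an assumption
    out = []
    for i in range(n):
        cls = [raw[j] for j in range(n) if labels[j] == labels[i]]
        med = median(cls)
        # assumption: sigma is the median absolute deviation within the class
        sigma = median([abs(r - med) for r in cls])
        out.append((raw[i] - med) / (sigma if sigma > 0 else 1.0))  # Eq. (6)
    return out

# Three mutually close cases and one case far from the rest, all in one class
prox = [[1.0, 0.9, 0.9, 0.1],
        [0.9, 1.0, 0.9, 0.1],
        [0.9, 0.9, 1.0, 0.1],
        [0.1, 0.1, 0.1, 1.0]]
scores = outlier_scores(prox, ["a", "a", "a", "a"])
```

In this toy example the fourth case has small proximities to all others and therefore receives by far the largest outlier score, which is the behaviour a nonconformity score should have.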

After a random forest is constructed, the proximity matrix of training dataset and a given test example remains the same regardless of changing the order of input data sequence, so random forest outlier measure can be used as a nonconformity measure.

In our method CP-RF, we define a new nonconformity measure α_i = out(i), and then predict each test sample with Eq. (3). The detailed CP-RF algorithm is summarized in the pseudocode below.

Algorithm: CP-RF

Input: Training set T = {(x₁, y₁),..., (x_l, y_l)} and a new unlabeled example x_{l+1}.

Output: The set of p values $\{p_{l+1}^{1}, ..., p_{l+1}^{m}\}$ when T is an m-class dataset.

1. for i = 1 to m do
2.    Assign label i to x_{l+1};
3.    Construct an RF classifier with training set T, put the test example x_{l+1} down the forest and output the sample proximity matrix ⟦prox(i, j)⟧_{(l+1) × (l+1)};
4.    Compute the nonconformity scores $α_{1}^{i}, ..., α_{l}^{i}, α_{l+1}^{i}$ of all examples using Eq. (6) ($α_{l+1}^{i}$ is the nonconformity score of x_{l+1} when assigned label i);
5.    Compute the p value $p_{l+1}^{i}$ of x_{l+1} with Eq. (3);
6. end for
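The loop structure of the algorithm can be sketched in Python as below. To keep the example self-contained, the random forest is not built here: the hypothetical argument `nonconformity` stands for steps 3 and 4, and the toy function `distance_to_class_mean` is only a stand-in for the RF outlier measure, not the authors' measure.

```python
import random

def cp_p_values(train, x_new, labels, nonconformity, rng=random):
    """Smoothed p value of x_new for every candidate label.

    train is a list of (x, y) pairs; nonconformity maps a list of labelled
    examples to one score per example (stand-in for steps 3 and 4).
    """
    p = {}
    for y in labels:
        extended = train + [(x_new, y)]       # step 2: tentatively assign label y
        alphas = nonconformity(extended)      # steps 3-4: one score per example
        a_new = alphas[-1]
        greater = sum(a > a_new for a in alphas)
        ties = sum(a == a_new for a in alphas)
        p[y] = (greater + rng.random() * ties) / len(alphas)  # step 5, Eq. (3)
    return p

def distance_to_class_mean(examples):
    """Toy score for 1-D inputs: distance of x to the mean of its own class."""
    scores = []
    for x, y in examples:
        same = [xx for xx, yy in examples if yy == y]
        scores.append(abs(x - sum(same) / len(same)))
    return scores

train = [(0.0, 0), (0.5, 0), (1.0, 0), (9.0, 1), (9.5, 1), (10.0, 1)]
p = cp_p_values(train, 0.4, [0, 1], distance_to_class_mean)
```

For the test point 0.4, label 0 obtains a p value above 0.5 and label 1 a p value below 0.15, so label 0 is the forced prediction and the only label kept at common significance levels such as 0.20.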

Label conditional CP-RF algorithm

Given a significance level ε > 0, the goal is to compute predictive regions, ideally consisting of just one label, that contain the true label with probability 1 - ε. In some situations, however, our predictions are well-calibrated globally but not within each class. In cost-sensitive learning problems, we must allow different significance levels to be specified for each possible classification of an object, because the penalty of misclassification is not the same among all classes [27, 28]. This problem can be viewed as conditional inference. We extend our method to label conditional CP to address it, which can also be seen as one version of Mondrian CP (MCP) [3, 29].

An important aspect of MCP is the method of calculating p values. In standard CP, the nonconformity score of a new example is compared against the nonconformity scores of all examples observed up to that point. In contrast, label conditional CPs compare the nonconformity score of a new example only with those of previously observed examples within the same class. In detail, this method applies a function called a Mondrian taxonomy to partition the example space Z into rectangular groups. Given a division of the Cartesian product N × Z into categories, a function k: N × Z → K maps each pair (n, z) (z is an example and n is the ordinal number of this example in the data sequence z₁, z₂,...) to its category; a label conditional nonconformity measure based on k is defined as:

A_n: K^{n-1} × (Z^{(*)})^{K} × K × Z → R

The smoothed Mondrian conformal predictor (smoothed MCP) determined by the Mondrian nonconformity measure A_n produces p values as:

Γ^{ε, τ_{n}, k}(z_{1}, ..., z_{n-1}, x_{n}, τ_{n}, k) = \{ y \in Y : p_{y} = \frac{|\{ i : k_{i} = k_{n} \ \& \ α_{i} > α_{n} \}| + τ_{n} |\{ i : k_{i} = k_{n} \ \& \ α_{i} = α_{n} \}|}{|\{ i : k_{i} = k_{n} \}|} > ε \}

(7)

where α_i denotes a nonconformity score.

In label conditional CP-RF, α_i = out(i). Compared with CP-RF, the only difference is that the p values are computed with Eq. (7) from the part of the training examples that share the tentative label, which gives higher computational efficiency. Limited by space, the detailed label conditional CP-RF algorithm is omitted here.
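Since the detailed algorithm is omitted, the following Python sketch only illustrates Eq. (7) under the assumption that the category of an example is simply its label. The function names and the per-class significance levels in the example are hypothetical.

```python
import random

def label_conditional_p_value(alphas, labels, rng=random):
    """Smoothed p value of Eq. (7) with the label used as the category.

    alphas[-1] and labels[-1] belong to the test example under its tentative label.
    """
    a_new, y_new = alphas[-1], labels[-1]
    # keep only the scores of examples in the same category as the test example
    same = [a for a, y in zip(alphas, labels) if y == y_new]
    greater = sum(a > a_new for a in same)
    ties = sum(a == a_new for a in same)   # includes the test example itself
    return (greater + rng.random() * ties) / len(same)

def label_conditional_region(p_values, eps_by_label):
    """Prediction region with a separate significance level for each label."""
    return {y for y, p in p_values.items() if p > eps_by_label[y]}

# A stricter level for a costly class keeps that label in the region more often
p_values = {"sick": 0.04, "healthy": 0.30}
region = label_conditional_region(p_values, {"sick": 0.01, "healthy": 0.10})
```

Here the label "sick" stays in the region because its significance level is set to 0.01; with a level of 0.05 for that class it would be dropped, which is how label-wise significance levels express different misclassification costs.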

Availability

The chronic gastritis dataset and the core source code of CP-RF and label conditional CP-RF are available at http://59.77.15.238/APBC_paper or http://www.penna.cn/cp-rf

Abbreviations

CP: Conformal predictor
RF: Random forests
KNN: K nearest neighbour classifier
SVM: Support vector machine
KP: Kernel perceptron
NB: Naïve Bayes
NC: Nearest centroid
LDC: Linear discriminant classifier
KNC: Kernel nearest centroid
ANN: Artificial neural network

References

1. Pirooznia M, Yang JY, Yang MQ, Deng YP: A comparative study of different machine learning methods on microarray gene expression data. BMC Genomics. 2008, 9 (Suppl 1): S13.
2. Gammerman A, Vovk V: Prediction algorithms and confidence measures based on algorithmic randomness theory. Theoretical Computer Science. 2002, 287: 209-217.
3. Vovk V, Gammerman A, Shafer G: Algorithmic learning in a random world. 2005, Springer, New York.
4. Gammerman A, Vovk V: Hedging predictions in machine learning. Computer Journal. 2007, 50: 151-177.
5. Shafer G, Vovk V: A tutorial on conformal prediction. J Mach Learn Res. 2007, 9: 371-421.
6. Elkan C: The foundations of cost-sensitive learning. Proceedings of the Seventeenth International Joint Conference of Artificial Intelligence. 2001, Morgan Kaufmann, Seattle, Washington, 973-978.
7. Vovk V: A Universal Well-Calibrated Algorithm for On-line Classification. J Mach Learn Res. 2004, 5: 575-604.
8. Stijn V, Laurens VDM, Ida SK: Off-line learning with transductive confidence machines: an empirical evaluation. Proceedings of the 5th International Conference on Machine Learning and Data Mining in Pattern Recognition. Edited by: Petra Perner, LNAI. 2007, Leipzig, Germany. Springer Press, 4571: 310-323.
9. Tony B, Zhiyuan L, Gammerman A, Frederick VD, Vaskar S: Qualified predictions for microarray and proteomics pattern diagnostics with confidence machines. International Journal of Neural Systems. 2005, 15 (4): 247-258.
10. Bellotti T, Zhiyuan L, Gammerman A: Reliable classification of childhood acute leukaemia from gene expression data using Confidence Machines. Proceedings of IEEE International Conference on Granular Computing, Atlanta, USA. 2006, 148-153.
11. Proedrou K, Nouretdinov I, Vovk V, Gammerman A: Transductive confidence machines for pattern recognition. Proceedings of the 13th European Conference on Machine Learning. 2002, 381-390.
12. Breiman L: Bagging Predictors. Mach Learn. 1996, 24 (2): 123-140.
13. Breiman L: Random forests. Mach Learn. 2001, 45 (1): 5-32.
14. Diaz UR, Alvarez AS: Gene Selection and Classification of Microarray Data Using Random Forest. BMC Bioinformatics. 2006, 7: 3.
15. Strobl C, Boulesteix AL, Kneib T, Augustin T, Zeileis A: Conditional variable importance for random forests. BMC Bioinformatics. 2008, 9: 307.
16. Turney P: Types of cost in inductive concept learning. Workshop on Cost-Sensitive Learning at ICML. 2000, Stanford University, California, 15-21.
17. Zhou ZH, Liu XY: On multi-class cost-sensitive learning. Proceedings of the 21st National Conference on Artificial Intelligence, Boston, MA. 2006, 567-572.
18. Zadrozny B, Elkan C: Learning and making decisions when costs and probabilities are both unknown. Proceedings of the Seventh International Conference on Knowledge Discovery and Data Mining. 2001, ACM Press, 204-213.
19. UCI Machine Learning Repository. [http://archive.ics.uci.edu/ml/]
20. Yeoh EJ, Ross ME, Shurtleff SA: Classification subtype discovery and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling. Cancer Cell. 2002, 1 (2): 133-143.
21. Sorin D: Data Analysis Tools for DNA Microarrays. 2003, Chapman&Hall/CRC, London.
22. Thyroid Disease Database. [ftp://ftp.ics.uci.edu/pub/machine-learning-databases/thyroid-disease/]
23. Chronic Gastritis Dataset. [http://59.77.15.238/APBC_paper]
24. Niu HZ, Wang RX, Lan SM, Xu WL: Thinking and approaches on treatment of chronic gastritis with integration of traditional Chinese and western medicine. Shandong Journal of Traditional Chinese Medicine. 2001, 20 (3): 70-72.
25. Boulesteix AL, Strobl C, Augustin T, Daumer M: Evaluating microarray-based classifiers: an overview. Cancer Informatics. 2008, 6: 77-97.
26. Qi Y, Klein SJ, Bar JZ: Random forest similarity for protein-protein interaction prediction from multiple sources. Pacific Symposium on Biocomputing. 2005, 10: 531-542.
27. Domingos P: MetaCost: A general method for making classifiers cost-sensitive. Proceedings of the Fifth International Conference on Knowledge Discovery and Data Mining. 1999, New York. ACM Press, 155-164.
28. Chris D, Robert CH: Cost curves: An improved method for visualizing classifier performance. Machine Learning. 2006, 65 (1): 95-130.
29. Vovk V, Lindsay D, Nouretdinov I, Gammerman A: Mondrian Confidence Machine. Technical Report. Computer Learning Research Centre, Royal Holloway, University of London.

Acknowledgements

The authors are grateful to Prof. Vladimir Vovk and Dima Devetyarov for valuable suggestions and essential help on conformal predictors. The authors would also like to thank Prof. Chang-le Zhou for assistance in data collection and support. This work was supported by the 985 Innovation Project on Information Technique of Xiamen University under Grant No. 0000-x07204 and the National High Technology Research and Development Program of China (863 Program) under Grant No. 2006AA01Z129.

This article has been published as part of BMC Bioinformatics Volume 10 Supplement 1, 2009: Proceedings of The Seventh Asia Pacific Bioinformatics Conference (APBC) 2009. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/10?issue=S1

Author information

Authors and Affiliations

Automation Department, Xiamen University, Xiamen, 361005, P.R.C
Fan Yang, Hua-zhen Wang, Hong Mi & Cheng-de Lin
Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, 77030, USA
Wei-wen Cai

Corresponding author

Correspondence to Hong Mi.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

FY, HM and HZW conceived the study and research question. FY and HZW designed and implemented the algorithms, set up and performed the experiments and drafted the manuscript. HM, CDL and WWC contributed to the theoretical understanding and presentation of the problem.

Fan Yang and Hua-zhen Wang contributed equally to this work.

Rights and permissions

Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Yang, F., Wang, Hz., Mi, H. et al. Using random forest for reliable classification and cost-sensitive learning for medical diagnosis. BMC Bioinformatics 10 (Suppl 1), S22 (2009). https://doi.org/10.1186/1471-2105-10-S1-S22

Published: 30 January 2009
DOI: https://doi.org/10.1186/1471-2105-10-S1-S22
