Using random forest for reliable classification and cost-sensitive learning for medical diagnosis
BMC Bioinformatics volume 10, Article number: S22 (2009)
Abstract
Background
Most machine-learning classifiers output label predictions for new instances without indicating how reliable the predictions are. The applicability of these classifiers is limited in critical domains where incorrect predictions have serious consequences, like medical diagnosis. Further, the default assumption of equal misclassification costs is most likely violated in medical diagnosis.
Results
In this paper, we present a modified random forest classifier that is incorporated into the conformal predictor scheme. A conformal predictor is a transductive learning scheme that uses Kolmogorov complexity to test the randomness of a particular sample with respect to the training set. Our method is well-calibrated: the performance can be set prior to classification, and the accuracy rate is exactly equal to the predefined confidence level. Further, to address the cost-sensitive problem, we extend our method to a label-conditional predictor, which takes into account different costs for misclassifications in different classes and allows a different confidence level to be specified for each class. Intensive experiments on benchmark datasets and real-world applications show that the resultant classifier is well-calibrated and able to control the specific risk of each class.
Conclusion
Using the RF outlier measure to design a nonconformity measure benefits the resultant predictor. Further, a label-conditional classifier is developed and turns out to be an alternative approach to the cost-sensitive learning problem, relying on label-wise predefined confidence levels. The target of minimizing the risk of misclassification is achieved by specifying a different confidence level for each class.
Background
Most machine-learning classifiers output predictions for new instances without indicating how reliable the predictions are. The application of these classifiers is limited in domains where incorrect predictions have serious consequences. Medical practitioners need a reliable assessment of the risk of error for individual cases [1]. Thus, given a prediction paired with a corresponding confidence value, a system can decide whether it is safe to classify. The recently introduced Conformal Predictor (CP) [2–5] is a promising framework that produces predictions coupled with confidence estimates. Its developers built on the formal relationship among Kolmogorov complexity, universal Turing machines and strict minimum message length (MML). They treat transductive prediction as a randomness test that returns nonconformity scores closely tied to the iid (independent and identically distributed) assumption governing all of the examples. When classifying a new instance, CP assigns a p value to each tentatively assigned label to approximate the confidence level of the prediction. CP is more than a reliable classifier: its most novel and valuable feature is hedged prediction, i.e., the performance can be set prior to classification, and the prediction is well-calibrated in that the accuracy rate is exactly equal to the predefined confidence level. This is an advantage over the Bayesian approach, which often relies on strong underlying assumptions. In this paper, we use a random forest outlier measure to design the nonconformity score and develop a modified random forest classifier.
Since reports from both academia and practice indicate that the default assumption of equal misclassification costs is most likely violated [6], a natural desideratum is to extend CP to a label-wise CP, which takes into account different costs for misclassification errors of different classes and allows a different confidence level to be specified for each possible classification of an instance. In this paper, we investigate how to extend CP to label-conditional CP, which can handle non-uniform costs of errors in classification.
Consider a classification problem E: reality outputs examples Z^{(n-1)} = {(x_{1}, y_{1}),..., (x_{n-1}, y_{n-1})} ∈ X × Y and an unlabeled test instance x_{n}, where X denotes a measurable space of possible instances, x_{i} ∈ X, i = 1, 2,..., n - 1,...; Y denotes a measurable space of possible labels, y_{i} ∈ Y, i = 1, 2,..., n - 1,...; and the example space is represented as Z = X × Y. We assume that each example is generated by the same unknown probability distribution P over Z, which satisfies the exchangeability assumption.
Conformal predictor (CP)
CP is designed to introduce confidence estimation into machine learning algorithms. It generalizes the framework from the iid assumption to exchangeability, which ignores the order in which the examples arrive.
To construct a prediction set for an unlabeled instance x_{n}, CP operates in a transductive manner in the online setting. Each possible label y is tried as a label for x_{n}. In each try we form an artificial sequence (x_{1}, y_{1}),..., (x_{n}, y), and then measure how likely it is that the resulting sequence was generated by the unknown distribution P and how nonconforming x_{n} is with respect to the other available examples.
Given the classification problem E, the function A_{n}: Z^{(n-1)} × Z → R is a nonconformity measure if, for any n ∈ N,
α_{i} := A_{n}(z_{i}, ⟦z_{1},..., z_{i-1}, z_{i+1},..., z_{n}⟧), i = 1,..., n - 1,
α_{n} := A_{n}(z_{n}, ⟦z_{1},..., z_{n-1}⟧)
where ⟦·⟧ denotes a "bag", i.e. a collection in which the order of the elements is irrelevant. The symbol α denotes the nonconformity score of a sample: the larger α_{i} is, the stranger z_{i} is with respect to the distribution. In short, a nonconformity measure is characterized as a measurable kernel that maps Z to R, and the value of α_{i} does not depend on the order of the other examples in the sequence.
For confidence level 1 - ε (ε is the significance level) and any n ∈ N, a conformal predictor is defined as:
$P_{y} = \frac{|\{i = 1,\ldots,n : \alpha_{i} \ge \alpha_{n}\}|}{n}$
A smoothed conformal predictor (smoothed CP) is defined as:
$P_{y} = \frac{|\{i = 1,\ldots,n : \alpha_{i} > \alpha_{n}\}| + \tau_{n}\,|\{i = 1,\ldots,n : \alpha_{i} = \alpha_{n}\}|}{n}$ (3)
where y is a possible label for x_{n}; P_{y} is called the p value, which is the randomness level of z_{n} = (x_{n}, y) and also the confidence level of y being the true label; and τ_{n}, n ∈ N, are random variables distributed uniformly in [0, 1]. Smoothed CP is a more powerful version of CP, which benefits from its p values being distributed uniformly in [0, 1].
Let Γ^{ε} = {y ∈ Y: P_{y} > ε}, and let the true label of x_{n} be denoted y_{n}.
If |Γ^{ε}| = 1, we define it as a certain prediction.
If |Γ^{ε}| > 1, it is an uncertain prediction.
If Γ^{ε} = ∅, it is an empty prediction.
If y_{n} ∈ Γ^{ε}, we define it as a correct prediction at confidence level 1 - ε. Otherwise, it is defined as an error.
When it comes to forced point prediction, CP selects the label with maximum p value as the prediction.
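The definitions above can be illustrated with a minimal Python sketch. It assumes the nonconformity scores have already been computed, with the score of the test example (under its tentative label) stored last; the function names and the `tau` argument are illustrative and are not taken from the authors' implementation.

```python
import random

def p_value(alphas, smoothed=False, tau=None):
    """p value of the last example in `alphas` (the test example with its tentative label)."""
    alpha_n = alphas[-1]
    n = len(alphas)
    if not smoothed:
        # Plain CP: fraction of examples at least as nonconforming as the test example.
        return sum(a >= alpha_n for a in alphas) / n
    if tau is None:
        tau = random.random()  # tau_n is uniform in [0, 1]
    greater = sum(a > alpha_n for a in alphas)
    equal = sum(a == alpha_n for a in alphas)  # always counts the test example itself
    return (greater + tau * equal) / n

def prediction_region(p_values, eps):
    """Labels whose p value exceeds the significance level eps."""
    return {label for label, p in p_values.items() if p > eps}

def region_type(region):
    """Classify a prediction region as certain, uncertain or empty."""
    if len(region) == 1:
        return "certain"
    return "uncertain" if region else "empty"

def forced_prediction(p_values):
    """Forced point prediction: the label with the largest p value."""
    return max(p_values, key=p_values.get)
```

For example, with p values of 0.6 and 0.04 for two candidate labels, the region at ε = 0.05 contains only the first label (a certain prediction), whereas at ε = 0.01 both labels remain (an uncertain prediction).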
CP was originally proposed for online learning, and Vovk [7] proved that in the online setting, in the long run, the prediction set Γ^{ε} contains the true label with probability 1 - ε and the rate of wrong predictions is bounded by ε. In particular, the smoothed CP is exactly valid, i.e., the rate of wrong predictions is exactly equal to ε. This is summarized as the well-calibration property:
$\underset{n\to \infty}{\mathrm{lim\,sup}}\frac{Er{r}_{n}^{\epsilon}}{n}\le \epsilon$
with $Er{r}_{n}^{\epsilon}$ the number of error predictions at the confidence level 1 - ε (see [7] for a detailed proof). Extensive experiments have demonstrated that CP is also applicable to offline learning, which enlarges its range of applications.
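The calibration property can be checked empirically with a small simulation. The sketch below is a simplification under stated assumptions: each trial draws n exchangeable (here iid uniform) nonconformity scores, treats the last one as the score of the test example under its true label, and counts an error when the smoothed p value does not exceed ε. The trials are independent rather than a genuine online run, and all names and default values are illustrative.

```python
import random

def smoothed_p_value(alphas, tau):
    """Smoothed p value of the last score in `alphas`."""
    alpha_n = alphas[-1]
    greater = sum(a > alpha_n for a in alphas)
    equal = sum(a == alpha_n for a in alphas)
    return (greater + tau * equal) / len(alphas)

def empirical_error_rate(eps, trials=2000, n=20, seed=0):
    """Fraction of trials in which the true label would be excluded from the region."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(trials):
        # Exchangeable scores: the true-label example is no stranger than the rest.
        alphas = [rng.random() for _ in range(n)]
        if smoothed_p_value(alphas, rng.random()) <= eps:
            errors += 1
    return errors / trials
```

Under these assumptions the returned rate should lie close to ε for any ε in (0, 1), up to statistical fluctuation.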
Different nonconformity measures have been developed from existing algorithms, such as SVM and KNN [9–11]. All CPs have the calibration property, but the efficiency of a CP depends largely on the design of the nonconformity measure [8]. Efficiency is measured by the ratio of certain and empty predictions among all predictions. Certain predictions are favourable because they are more informative than uncertain predictions. CP has been successfully employed to hedge these popular machine learning methods, and this paper shows that CPRF is more efficient than the others.
Random forest (RF)
Breiman's random forest applies the bagging [12] and randomization [13] techniques to grow many classification trees to the largest extent possible without pruning. Random forest is especially attractive in the following cases [14, 15]:

(1) The real-world data is noisy and contains many missing values, and some of the attributes are categorical or semi-continuous.

(2) Different data sources need to be integrated, which raises the issue of how to weight them.

(3) High predictive accuracy is needed in high-dimensional problems with highly correlated features, a situation that often occurs in bioinformatics, for example in medical diagnosis.
In this paper, the random forest outlier measure is used to design a nonconformity measure in order to incorporate random forest into the CP and label conditional CP scheme. Our method can be used in both online and offline settings.
Cost-sensitive learning problem
In medical diagnosis, the default assumption of equal misclassification costs underlying machine learning techniques is most likely violated. A false negative prediction may have more serious consequences than a false positive prediction. To address this problem, cost-sensitive classification has been developed, which considers the varying costs of different misclassification types [16]. Usually a cost matrix is defined or learned to reflect the penalty of classifying samples from one class as another. A cost-sensitive classification method takes a cost matrix into consideration during the model building process [17]. However, how to obtain a proper cost matrix remains an open question [18]. The definition or learning of a cost matrix is quite subjective. In this paper, we extend our method to label conditional CP to address the cost-sensitive problem, so that the risk of misclassification of each class is well controlled.
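To make the role of a cost matrix concrete, the following sketch shows the conventional cost-matrix decision rule that the label conditional approach is proposed as an alternative to; it is not part of the authors' method. It assumes that class probability estimates are available, and the cost values used in the usage note are invented for illustration.

```python
def min_expected_cost_label(probs, cost):
    """Pick the prediction with the lowest expected cost.

    probs[j]   : estimated probability that the true class is j
    cost[i][j] : cost of predicting class i when the true class is j
    """
    expected = [
        sum(cost[i][j] * probs[j] for j in range(len(probs)))
        for i in range(len(cost))
    ]
    best = min(range(len(expected)), key=expected.__getitem__)
    return best, expected
```

With classes 0 (healthy) and 1 (diseased), probabilities [0.8, 0.2] and a hypothetical cost matrix [[0, 10], [1, 0]] in which a false negative costs ten times as much as a false positive, the rule predicts class 1 even though class 0 is more probable. The outcome depends entirely on the chosen costs, which is the subjectivity noted above.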
Results
Experimental setup
The experiments are divided into two parts. First, to show the calibration property and efficiency of our method, we evaluate CPRF on 8 benchmark datasets and a real-world gene expression dataset. Second, to cope with the cost-sensitive problem, we extend CPRF to label conditional CPRF and test its performance on two public application datasets.
Part I: Performance of CPRF
We employ 8 UCI datasets [19], including satellite, isolet, soybean and covertype. Details are given in Table 1, which lists the number of instances (n), the number of classes (c), the number of attributes (a), and the numbers of numeric (num) and nominal (nom) attributes.
We run CPRF with 10-fold cross-validation in an online fashion, report the average performance, and compare it with TCMSVM and TCMKNN. We use the following key indices at each predefined significance level: (1) percentage of certain predictions; (2) percentage of uncertain predictions; (3) percentage of empty predictions; (4) percentage of correct predictions. These indices differ from the traditional accuracy rate reported for RF, SVM and other traditional classifiers.
Given a significance level ε, the calibration and efficiency can be laid out. We set the number of trees (denoted ntrees) to 1000 and the number of variables to split on at each node (denoted ntry) to the default value $\sqrt{a}$ (a is the number of attributes). In Figures 1, 2, 3, 4, 5, 6, 7, 8, we show performance curves for significance levels ε ranging from 0.01 to 1, giving the average experimental results on pima (continuous variables), soybean (categorical variables), covertype (mixed variables) and liver (poor data quality), etc.
Figures 1, 2, 3, 4, 5, 6, 7, 8 show that the empirical error line is well-calibrated up to negligible statistical fluctuations. This allows the number of errors to be controlled prior to classification. The percentages of correct predictions at each predefined confidence level illustrate the calibration of the new algorithm. Figures 1, 2, 3, 4, 5, 6, 7, 8 also show high accuracy across a series of significance levels; some points of interest are extracted in Table 2 for illustration.
Table 2 demonstrates that CPRF ensures relatively high accuracy while keeping the risk of error low. In many domains it is important to measure the risk of misclassification and, if possible, to ensure a low risk of error.
The percentage of certain predictions reflects the efficiency of prediction. Notice that the percentage of uncertain predictions decreases monotonically as the significance level increases. How fast it declines to zero depends on the performance of the classifier plugged into the CP framework. Figures 1, 2, 3, 4, 5, 6, 7, 8 show that CPRF performs well and is applicable at the significance levels of 0.20, 0.15, 0.10 and 0.05. For comparison, we apply the standard TCMKNN and TCMSVM algorithms provided by Gammerman and Vovk. The ratios of certain predictions on the 8 datasets are given in Table 3; the efficiency of CPRF is much better than that of the others. We then compare their performance in the case of forced point prediction in Table 4.
It is clear that CPRF performs well on most of the datasets, especially on the datasets with categorical and mixed variables. In particular, CPRF outperforms TCMKNN on the high-dimensional dataset (isolet) and outperforms TCMSVM on noisy data (covertype).
To compare with the most popular machine learning methods, we consider Acute Lymphoblastic Leukemia (ALL) data which was previously analyzed with traditional machine learning methods. We choose an ALL dataset from [20] for comparison (see Table 5). There are 327 cases with attributes of 12558 genes. The data has been divided into six diagnostic groups and one that contains diagnostic samples that did not fit into any one of the above groups (labeled as "Others"). Each group of samples has been randomized into training and testing parts.
We report results using CPRF without discriminative gene selection, i.e. using all of the genes. In order to compare with traditional machine learning methods, we apply CPRF in an offline fashion and use the results of forced prediction. Table 6 gives the detailed per-class classification performance as a confusion matrix, and shows that CPRF makes only 2 misclassifications. The comparison with a fine-tuned Support Vector Machine is laid out in Table 7.
Tables 6 and 7 show that CPRF outperforms SVM in subgroups 3, 4 and 6, and that the two are well-matched in subgroups 2 and 5. Both misclassifications occur in subgroup 1; because this subgroup has only six cases, its error rate appears very large.
Due to the low sample size, the reliability of classification is not guaranteed [21]. We show the distinct advantage of CPRF with two measures, correct predictions and certain predictions, under 5 confidence levels in Table 8. The results show that our method is well-calibrated and makes reliable predictions, even in an offline fashion.
Part II: Performance of label conditional CPRF
In this part, we choose two multiclass, unbalanced real-world datasets as examples for cost-sensitive learning. The objective is to control the risk of misclassification within each class, because different misclassifications may carry different penalties in medical diagnostics. The first dataset is the Thyroid disease records [22], where the problem is to determine whether a patient referred to the clinic is hypothyroid. Each record has 21 attributes in total (15 Boolean and 6 continuous) corresponding to various symptoms and measurements taken from each patient. The dataset contains 7200 examples in total and is highly unbalanced in its representation of the 3 possible classes corresponding to diagnoses. Details are given in Table 9, which lists the name, index and size of each class.
The second dataset is the Chronic Gastritis Dataset [23]. Chronic gastritis is a common disease of the digestive system whose notable feature is gastric inflammation. Compared with Western medicine, Chinese medicine has many advantages in its treatment [24]. According to the "Diagnostic criteria for the diagnosis of chronic gastritis combining traditional and western medicine" set by the Integrated Traditional and Western Medicine Digest Special Committee, chronic gastritis is divided into five subtypes (see Table 9). In our application, we collected 709 cases from the digestion outpatient department of the Affiliated Shuguang Hospital between February and October 2006. All cases were inspected by both gastroscopy and pathology. Each case is associated with 55 kinds of symptoms listed in Table 10.
When constructing the RF, we set the number of trees to 1000 and the number of variables to split on at each node to $\lfloor \sqrt{55}\rfloor $ (a parameter sensitivity analysis of CPRF is laid out in the next section). For the experiments on the Thyroid disease dataset, the original dataset is randomly divided into a training set (3772 samples) and a test set (3428 samples). For the Chronic Gastritis dataset, we run our method with 10-fold cross-validation. Average performances are reported.
We are interested in comparing the performance of label conditional CPRF with that of CPRF. Limited by space, we show only part of the results: Figures 9, 10, 11, 12, 13, 14, 15, 16 show experimental results on two classes with relatively few samples from the Thyroid and Chronic Gastritis datasets. Figure 9 shows that CPRF is not well-calibrated at most confidence levels within the class "primary hyperthyroid" on Thyroid, while Figure 10 shows that the efficiency of CPRF is very low for confidence levels ranging from 0.01 to 1. In contrast, as shown in Figure 11, label conditional CPRF is well-calibrated up to negligible statistical fluctuations, and the empirical correct prediction line can hardly be distinguished from the exact calibration line. Aside from calibration, label conditional CPRF also shows improved predictive efficiency in Figure 12 compared with CPRF. These contrasts can also be observed in the experiments on the class "deficiency of spleen and stomach" of the Chronic Gastritis dataset (see Figures 13, 14, 15, 16).
It is noticeable that the percentage of certain predictions and the certain & correct ratio increase monotonically with the significance level. How fast the proportion of uncertain predictions declines to zero also depends on the quality of the p value calculation.
Some points of interest are extracted in Tables 11 and 12. They demonstrate that label conditional CPRF can be used to control the risk of misclassification within each class, so that it can be considered an alternative approach to cost-sensitive learning for unbalanced data.
Discussion
Part I: CPRF
Parameter sensitivity analysis
A common way to validate an approach is to check its robustness, that is, the approach must produce consistent results independent of the initial parameter settings. Empirical studies show that parameter adjustments have a great impact on CPs. Normalization of examples affects TCMKNN greatly. For TCMSVM, not only the normalization but also the type and parameters of the kernel function are important. Such empirical, non-theoretical tuning hints at potential instability.
To demonstrate the parameter insensitivity of CPRF, we run CPRF with different parameter settings, ntrees = 500, 1000, 5000 and ntry = 1,..., $\sqrt{\text{a}}$. The mean and standard deviation of the forced accuracy on sonar are reported.
For TCMKNN, we compare the fluctuation of forced accuracy with and without normalization; for TCMSVM, the effect of different kernel types is illustrated. The results on sonar are summarized in Table 13. CPRF shows comparatively small fluctuation as the parameter settings change. This advantage comes from the nature of RF and will benefit medical diagnosis.
Feature selection
Feature selection remains an open question in many applications. Our method involves no feature selection. Take gene expression analysis as an example: gene selection is a crucial task that remains unsolved. In Yeoh's study, gene expression profiling accurately identified the known prognostically important leukemia subtypes by classification with SVM, KNN and ANN when various selected genes were used. However, the classifications were performed only after discriminative gene selection by correlation-based feature selection, a process that is labor-intensive and requires expert knowledge. It is preferable for automated classification to be made with a level of confidence. Moreover, due to the low sample size, although their research yielded high predictive accuracies that are comparable with or better than traditional clinical techniques, it remains uncertain how well the results for the selected genes will extrapolate to practice in the future [25]. CPRF is especially suitable for this situation: it uses all of the genes, without discriminative gene selection, which may meet the need for automated classification. Moreover, no selection bias is introduced.
Part II: Label conditional CPRF
From the experiments in Part I, we can see that although CPRF is well calibrated globally, i.e. the error rate on the whole test data matches the predefined significance level, it cannot guarantee the reliability of classification for each class, especially on unbalanced datasets. In contrast, label conditional CPRF is label-wise well calibrated, whereas CPRF may not satisfy the calibration property in some classes. Because label conditional CPRF uses only part of the data set when computing each p value, its computational efficiency is also better.
Conclusion
Most state-of-the-art machine learning algorithms cannot provide a reliable measure of their classifications and predictions. This paper addresses the importance of reliability and confidence for classification, and presents a novel method based on a combination of random forest and conformal predictor. The new algorithm hedges the predictions of RF and gives a well-calibrated region prediction by using the proximity matrix generated with RF as the basis of a nonconformity measure of examples. For medical diagnosis, the most important advantage of CPRF is its calibration: the risk of error can be well controlled. The new method takes advantage of RF and possesses a more precise and stable nonconformity measure. It can deal with redundant and noisy data with mixed types of variables, and is less sensitive to parameter settings. Furthermore, we extend CPRF to a label conditional version, so that it can control the risk of prediction for each class independently rather than globally. This modified version provides an alternative approach to cost-sensitive learning. Experiments on benchmark datasets and real-world applications show the usability and superiority of our method.
Methods
CPRF algorithm
Executed by transductive inference learning, CP is able to hedge the predictions of any popular machine learning method that can be used to construct a nonconformity measure [3, 4]. It is a remarkable fact that error calibration is guaranteed regardless of the particular classifier plugged into CP and the nonconformity measure constructed. However, the quality of the region predictions, and accordingly CP's efficiency, depends on the nonconformity measure. This issue has been discussed and several types of classifiers have been used, such as support vector machine, k-nearest neighbors, nearest centroid, kernel perceptron, naive Bayes and linear discriminant analysis [9–11]. The implementations of these methods are determined by the nature of the classifiers: TCMSVM and TCMKP mainly address binary classification tasks, TCMKNN and TCMKNC are the simplest mathematical realizations, and TCMNB and TCMLDC are suitable for transductive regression. These methods have demonstrated their applicability and advantages over inductive learning, but they still have many limitations. Nonlinear datasets are especially challenging for TCMLDC. TCMKNN and TCMNC have difficulties with dispersed datasets. TCMSVM is so computationally intensive that it struggles with large datasets. TCMKP is only practicable for relatively noise-free data. In short, there are many restrictions on data quality when applying these methods to real-world data. The difficulties lie in essence in the nonconformity measure, whose design remains an open question.
Taking the above into account, we propose a new algorithm called CPRF. A random forest classifier naturally leads to a dissimilarity measure between examples in a "strange" space rather than a Euclidean measure. After an RF is grown, since each individual tree is unpruned, the terminal nodes contain only a small number of observations. Given a random forest of size k, f = {T_{1},..., T_{k}}, and two examples x_{i} and x_{j}, we propagate them down all the trees within f. Let D_{i} = {T_{1i},..., T_{ki}} and D_{j} = {T_{1j},..., T_{kj}} be the tree node positions for x_{i} and x_{j} on all the k trees respectively; a random forest similarity between the two examples is defined as:
$prox(i,j)=\frac{1}{k}\sum_{t=1}^{k}I({T}_{ti}={T}_{tj})$
where
$I({T}_{ti}={T}_{tj})=1$ if ${T}_{ti}={T}_{tj}$, and 0 otherwise,
i.e., each time instances i and j land in the same terminal node of a tree, the proximity between i and j is increased by one before normalization by the number of trees. This forms an N × N matrix (prox(i, j))_{N × N}, which is symmetric, positive definite and bounded above by 1, with the diagonal elements equal to 1, where N is the total number of cases [26].
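The proximity computation can be sketched in a few lines of plain Python. The sketch assumes that the terminal-node positions are already available as a table `leaves`, where `leaves[i][t]` is the identifier of the terminal node reached by example i in tree t; this input format and the function name are illustrative assumptions. In scikit-learn, for instance, such a table can be obtained from the `apply` method of a fitted `RandomForestClassifier`.

```python
def proximity_matrix(leaves):
    """Random forest proximity: fraction of trees in which two examples share a leaf.

    leaves[i][t] is the terminal-node id of example i in tree t (assumed input format).
    """
    n = len(leaves)
    k = len(leaves[0])  # number of trees
    prox = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            shared = sum(1 for a, b in zip(leaves[i], leaves[j]) if a == b)
            prox[i][j] = prox[j][i] = shared / k  # symmetric by construction
    return prox
```

The double loop costs O(N²k) time, which is acceptable for a sketch but may need a vectorized implementation for datasets of the size used in the experiments.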
Outliers are generally defined as cases that are removed from the main body of the data. In the framework of random forest, outliers are cases whose proximities to all other cases in the data are generally small. A useful revision is to define outliers relative to their class. Thus, an outlier in class j is a case whose proximities to all other class j cases are small. The raw outlier measure for case i in class j, relative to the rest of the training data in class j, is defined as
$ou{t}_{raw}(i)=\frac{nsample}{\overline{p(i)}}$
where nsample denotes the number of samples in class j and $\overline{p(i)}$ is the average proximity from case i to the rest of the training data within class j:
$\overline{p(i)}=\sum_{cl(k)=j}pro{x}^{2}(i,k)$
with cl(k) denoting the class of case k. The value of out_{raw}(i) will be large if the average proximity is small. Within each class, find the median of these raw measures, $\overline{ou{t}_{raw}}$, and their absolute deviation σ from the median. The raw measure is scaled to arrive at the final outlier measure as follows:
$out(i)=\frac{ou{t}_{raw}(i)-\overline{ou{t}_{raw}}}{\sigma}$ (6)
After a random forest is constructed, the proximity matrix of the training dataset and a given test example remains the same regardless of the order of the input data sequence, so the random forest outlier measure can be used as a nonconformity measure.
In our method CPRF, we define a new nonconformity measure α_{i} = out(i), and then predict each test sample with Eq. (3). The detailed CPRF algorithm is summarized in the pseudocode below.
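A minimal sketch of the class-wise outlier measure follows. Several details are assumptions rather than statements of the authors' implementation: the case itself is excluded from the sum of squared proximities, σ is taken to be the median absolute deviation from the class median, and a small floor guards against division by zero when a case shares no leaf with any other case of its class.

```python
from statistics import median

def outlier_scores(prox, labels):
    """Class-wise random forest outlier measure computed from a proximity matrix."""
    n = len(labels)
    raw = [0.0] * n
    for i in range(n):
        same = [k for k in range(n) if labels[k] == labels[i] and k != i]
        p_bar = sum(prox[i][k] ** 2 for k in same)  # squared proximities within the class
        nsample = len(same) + 1                     # class size, including case i
        raw[i] = nsample / max(p_bar, 1e-12)        # assumed floor to avoid dividing by zero
    out = [0.0] * n
    for c in set(labels):
        idx = [i for i in range(n) if labels[i] == c]
        med = median(raw[i] for i in idx)
        dev = median(abs(raw[i] - med) for i in idx)  # assumed: median absolute deviation
        for i in idx:
            out[i] = (raw[i] - med) / dev if dev > 0 else 0.0
    return out
```

A case with uniformly low proximity to the other members of its class receives a large positive score, and it is this score that plays the role of the nonconformity score in a conformal predictor built on the forest.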
Algorithm: CPRF
Input: Training set T = {(x_{1}, y_{1}),..., (x_{l}, y_{l})} and a new unlabeled example x_{l+1}.
Output: The set of p values $\{{p}_{l+1}^{1},\mathrm{...},{p}_{l+1}^{m}\}$ when T is an m-class dataset.

1. for i = 1 to m do

2. Assign label i to x_{l+1};

3. Construct an RF classifier with training set T, put the test example x_{l+1} down the forest and output the sample proximity matrix (prox(i, j))_{(l+1) × (l+1)};

4. Compute the nonconformity scores ${\alpha}_{1}^{i},\mathrm{...},{\alpha}_{l}^{i},{\alpha}_{l+1}^{i}$ of all examples using Eq. (6) (${\alpha}_{l+1}^{i}$ is the nonconformity score of x_{l+1} when assigned label i);

5. Compute the p value ${p}_{l+1}^{i}$ of x_{l+1} with Eq. (3);

6. end for
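The loop structure of the algorithm can be sketched as follows. To keep the example self-contained, the random forest outlier measure is replaced by a deliberately simple stand-in (distance to the class mean on one-dimensional data), and the plain rather than the smoothed p value is used so that the output is deterministic. Both substitutions, and all function names, are illustrative assumptions and not the authors' implementation.

```python
def cp_p_values(train_x, train_y, x_new, classes, nonconformity):
    """Try each candidate label for x_new and return its p value."""
    p_values = {}
    for label in classes:
        xs = train_x + [x_new]        # extended sequence; the test example comes last
        ys = train_y + [label]        # tentative label for the test example
        alphas = nonconformity(xs, ys)
        alpha_new = alphas[-1]
        p_values[label] = sum(a >= alpha_new for a in alphas) / len(alphas)
    return p_values

def distance_to_class_mean(xs, ys):
    """Stand-in nonconformity measure: distance from each example to its class mean."""
    scores = []
    for x, y in zip(xs, ys):
        members = [v for v, c in zip(xs, ys) if c == y]
        scores.append(abs(x - sum(members) / len(members)))
    return scores
```

For training values 1.0, 1.2, 0.8 in class "a" and 5.0, 5.2, 4.8 in class "b", a new value of 1.1 receives a p value of 5/7 for "a" and 1/7 for "b". Any function that maps the extended, tentatively labeled sequence to one score per example, independently of the order of the examples, can be passed as `nonconformity`.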
Label conditional CPRF algorithm
Given a significance level ε > 0, the goal is to compute predictive regions, ideally consisting of just one label, that contain the true label with probability 1 - ε. In some situations, however, our predictions are well-calibrated globally but not within each class. In the cost-sensitive learning problem, we must allow different significance levels to be specified for each possible classification of an object, because the penalty of misclassification is not the same among all classes [27, 28]. This problem can be viewed as one of conditional inference. We extend our method to label conditional CP to address it, which can also be seen as one version of Mondrian CP (MCP) [3, 29].
An important aspect of MCP is the method of calculating p values. In standard CP, the nonconformity score of a new example is compared against the nonconformity scores of all examples observed up to that point. In contrast, label conditional CP compares the nonconformity score of a new example only with those of the previously observed examples within the same class. In detail, this method applies a function called a Mondrian taxonomy to partition the example space Z into rectangular groups. Given a division of the Cartesian product N × Z into categories, a function k: N × Z → K maps each pair (n, z) (z is an example and n is the ordinal number of this example in the data sequence z_{1}, z_{2},...) to its category; a label conditional nonconformity measure based on k is defined as:
A_{n}: K^{n-1} × (Z^{(*)})^{K} × K × Z → R
The smoothed Mondrian conformal predictor (smoothed MCP) determined by the Mondrian nonconformity measure A_{n} produces p values as:
$P_{y} = \frac{|\{i : k_{i} = k_{n},\ \alpha_{i} > \alpha_{n}\}| + \tau_{n}\,|\{i : k_{i} = k_{n},\ \alpha_{i} = \alpha_{n}\}|}{|\{i : k_{i} = k_{n}\}|}$ (7)
where α_{i} denotes a nonconformity score and k_{i} = k(i, z_{i}) is the category of the i-th example; in the label conditional case, the category of an example is its label.
In label conditional CPRF, α_{i} = out(i). Compared with CPRF, the only difference is that the p values are computed with Eq. (7), using only the training examples of the class in question, which gives higher computational efficiency. Limited by space, the detailed label conditional CPRF algorithm is omitted here.
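The label conditional p value and the use of class-specific significance levels can be sketched as follows. The sketch assumes that the nonconformity scores are already computed and that the test example, with its tentative label, is stored last; the function names and the `tau` argument are illustrative.

```python
import random

def label_conditional_p_value(alphas, labels, tau=None):
    """Smoothed p value computed only against examples sharing the tentative label."""
    if tau is None:
        tau = random.random()  # tau_n is uniform in [0, 1]
    alpha_n, y = alphas[-1], labels[-1]
    same = [a for a, c in zip(alphas, labels) if c == y]  # includes the test example
    greater = sum(a > alpha_n for a in same)
    equal = sum(a == alpha_n for a in same)
    return (greater + tau * equal) / len(same)

def label_conditional_region(p_values, eps_by_label):
    """Prediction region with a separate significance level for each class."""
    return {y for y, p in p_values.items() if p > eps_by_label[y]}
```

Choosing a smaller ε for a class whose misclassification is more costly keeps that class in the prediction region more often, which is how the risk for each class can be controlled separately.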
Availability
The chronic gastritis dataset and the core source code of CPRF and label conditional CPRF are available at http://59.77.15.238/APBC_paper or http://www.penna.cn/cprf
Abbreviations
CP: Conformal predictor
RF: Random forests
KNN: K nearest neighbour classifier
SVM: Support vector machine
KP: Kernel perceptron
NB: Naïve Bayes
NC: Nearest centroid
LDC: Linear discriminant classifier
KNC: Kernel nearest centroid
ANN: Artificial neural network
References
 1.
Pirooznia M, Yang JY, Yang MQ, Deng YP: A comparative study of different machine learning methods on microarray gene expression data. BMC Genomics. 2008, 9 (Suppl 1): S13
 2.
Gammerman A, Vovk V: Prediction algorithms and confidence measures based on algorithmic randomness theory. Theoretical Computer Science. 2002, 287: 209217.
 3.
Vovk V, Gammerman A, Shafer G: Algorithmic learning in a random world. 2005, Springer, New York
 4.
Gammerman A, Vovk V: Hedging predictions in machine learning. Computer Journal. 2007, 50: 151177.
 5.
Shafer G, Vovk V: A tutorial on conformal prediction. J Mach Learn Res. 2007, 9: 371421.
 6.
Elkan C: The foundations of costsensitive learning. Proceedings of the Seventeenth International Joint Conference of Artificial Intelligence. 2001, Morgan Kaufmann, Seattle, Washington, 973978.
 7.
Vovk V: A Universal WellCalibrated Algorithm for Online Classification. J Mach Learn Res. 2004, 5: 575604.
 8.
Stijn V, Laurens VDM, Ida SK: Offline learning with transductive confidence machines: an empirical evaluation. Proceedings of the 5th International Conference on Machine Learning and Data Mining in Pattern Recognition. Edited by: Petra Perner, LNAI. 2007, Leipzig, Germany. Springer Press, 4571: 310323.
 9.
Tony B, Zhiyuan L, Gammerman A, Frederick VD, Vaskar S: Qualified predictions for microarray and proteomics pattern diagnostics with confidence machines. International Journal of Neural Systems. 2005, 15 (4): 247258.
 10.
Bellotti T, Zhiyuan L, Gammerman A: Reliable classification of childhood acute leukaemia from gene expression data using Confidence Machines. Proceedings of IEEE International Conference on Granular Computing, Atlanta, USA. 2006, 148153.
 11.
Proedrou K, Nouretdinov I, Vovk V, Gammerman A: Transductive confidence machines for pattern recognition. Proceedings of the 13th European Conference on Machine Learning. 2002, 381390.
 12.
Breiman L: Bagging Predictors. Mach Learn. 1996, 24 (2): 123-140.
 13.
Breiman L: Random forests. Mach Learn. 2001, 45 (1): 5-32.
 14.
Diaz UR, Alvarez AS: Gene Selection and Classification of Microarray Data Using Random Forest. BMC Bioinformatics. 2006, 7: 3
 15.
Strobl C, Boulesteix AL, Kneib T, Augustin T, Zeileis A: Conditional variable importance for random forests. BMC Bioinformatics. 2008, 9: 307
 16.
Turney P: Types of cost in inductive concept learning. Workshop on Cost-Sensitive Learning at ICML. 2000, Stanford University, California, 15-21.
 17.
Zhou ZH, Liu XY: On multi-class cost-sensitive learning. Proceedings of the 21st National Conference on Artificial Intelligence, Boston, MA. 2006, 567-572.
 18.
Zadrozny B, Elkan C: Learning and making decisions when costs and probabilities are both unknown. Proceedings of the Seventh International Conference on Knowledge Discovery and Data Mining. 2001, ACM Press, 204-213.
 19.
UCI Machine Learning Repository. [http://archive.ics.uci.edu/ml/]
 20.
Yeoh EJ, Ross ME, Shurtleff SA: Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling. Cancer Cell. 2002, 1 (2): 133-143.
 21.
Sorin D: Data Analysis Tools for DNA Microarrays. 2003, Chapman & Hall/CRC, London
 22.
Thyroid Disease Database. [ftp://ftp.ics.uci.edu/pub/machine-learning-databases/thyroid-disease/]
 23.
Chronic Gastritis Dataset. [http://59.77.15.238/APBC_paper]
 24.
Niu HZ, Wang RX, Lan SM, Xu WL: Thinking and approaches on treatment of chronic gastritis with integration of traditional Chinese and western medicine. Shandong Journal of Traditional Chinese Medicine. 2001, 20 (3): 70-72.
 25.
Boulesteix AL, Strobl C, Augustin T, Daumer M: Evaluating microarray-based classifiers: an overview. Cancer Informatics. 2008, 6: 77-97.
 26.
Qi Y, Klein SJ, Bar JZ: Random forest similarity for protein-protein interaction prediction from multiple sources. Pacific Symposium on Biocomputing. 2005, 10: 531-542.
 27.
Domingos P: MetaCost: A general method for making classifiers cost-sensitive. Proceedings of the Fifth International Conference on Knowledge Discovery and Data Mining. 1999, New York. ACM Press, 155-164.
 28.
Chris D, Robert CH: Cost curves: An improved method for visualizing classifier performance. Machine Learning. 2006, 65 (1): 95-130.
 29.
Vovk V, Lindsay D, Nouretdinov I, Gammerman A: Mondrian Confidence Machine. Technical Report. Computer Learning Research Centre, Royal Holloway, University of London
Acknowledgements
The authors are grateful to Prof. Vladimir Vovk and Dima Devetyarov for valuable suggestions and essential help on conformal predictors. The authors would also like to thank Prof. Changle Zhou for assistance in data collection and support. This work was supported by the 985 Innovation Project on Information Technique of Xiamen University under Grant No. 0000x07204 and the National High Technology Research and Development Program of China (863 Program) under Grant No. 2006AA01Z129.
This article has been published as part of BMC Bioinformatics Volume 10 Supplement 1, 2009: Proceedings of The Seventh Asia Pacific Bioinformatics Conference (APBC) 2009. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/10?issue=S1
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
FY, HM and HZW conceived the study and research question. FY and HZW designed and implemented the algorithms, set up and performed the experiments and drafted the manuscript. HM, CDL and WWC contributed to the theoretical understanding and presentation of the problem.
Fan Yang and Huazhen Wang contributed equally to this work.
Cite this article
Yang, F., Wang, H., Mi, H. et al. Using random forest for reliable classification and cost-sensitive learning for medical diagnosis. BMC Bioinformatics 10, S22 (2009). https://doi.org/10.1186/1471-2105-10-S1-S22
Keywords
 Random Forest
 Cost Matrix
 Kolmogorov Complexity
 True Label
 Outlier Measure