An active learning based classification strategy for the minority class problem: application to histopathology annotation
 Scott Doyle^{1},
 James Monaco^{1},
 Michael Feldman^{2},
 John Tomaszewski^{2} and
 Anant Madabhushi^{1}
https://doi.org/10.1186/1471-2105-12-424
© Doyle et al; licensee BioMed Central Ltd. 2011
Received: 9 November 2010
Accepted: 28 October 2011
Published: 28 October 2011
Abstract
Background
Supervised classifiers for digital pathology can improve the ability of physicians to detect and diagnose diseases such as cancer. Generating training data for classifiers is problematic, since only domain experts (e.g. pathologists) can correctly label ground truth data. Additionally, digital pathology datasets suffer from the "minority class problem", in which exemplars from the non-target class outnumber target class exemplars, which can bias the classifier and reduce accuracy. In this paper, we develop a training strategy combining active learning (AL) with class-balancing. AL identifies unlabeled samples that are "informative" (i.e. likely to increase classifier performance) for annotation, avoiding non-informative samples. This yields high accuracy with a smaller training set size compared with random learning (RL). Previous AL methods have not explicitly accounted for the minority class problem in biomedical images. Pre-specifying a target class ratio mitigates the problem of training bias. Finally, we develop a mathematical model to predict the number of annotations (cost) required to achieve balanced training classes. In addition to predicting training cost, the model reveals the theoretical properties of AL in the context of the minority class problem.
Results
Using this class-balanced AL training strategy (CBAL), we build a classifier to distinguish cancer from non-cancer regions on digitized prostate histopathology. Our dataset consists of 12,000 image regions sampled from 100 biopsies (58 prostate cancer patients). We compare CBAL against: (1) unbalanced AL (UBAL), which uses AL but ignores class ratio; (2) class-balanced RL (CBRL), which uses RL with a specific class ratio; and (3) unbalanced RL (UBRL). The CBAL-trained classifier yields 2% greater accuracy and 3% higher area under the receiver operating characteristic curve (AUC) than alternatively-trained classifiers. Our cost model accurately predicts the number of annotations necessary to obtain balanced classes. The accuracy of our prediction is verified by empirically-observed costs. Finally, we find that over-sampling the minority class yields a marginal improvement in classifier accuracy, but the improved performance comes at the expense of greater annotation cost.
Conclusions
We have combined AL with class balancing to yield a general training strategy applicable to most supervised classification problems where the dataset is expensive to obtain and suffers from the minority class problem. An intelligent training strategy is a critical component of supervised classification, and the integration of AL and intelligent choice of class ratios, as well as the application of a general cost model, will help researchers to plan the training process more quickly and effectively.
Background
Motivation
In most supervised classification schemes, a training set of exemplars from each class is used to train a classifier to distinguish between the different object classes. The training exemplars (e.g. images, pixels, regions of interest) usually have a semantic label assigned to them by an expert describing a category of interest or class to which they belong. Each training exemplar serves as an observation of the domain space; as the space is sampled more completely, the resulting classifier should achieve greater classifier accuracy when predicting class labels for new, unlabeled (unseen) data. Thus, typically, the larger the training set, the greater the accuracy of the resulting classifier [1]. In most cases, the training set of labeled data for each of the object categories is generated by a human expert who manually annotates a pool of unlabeled samples by assigning a label to each exemplar.
Informative samples are those which, if annotated and added to the training set, would increase the accuracy of the resulting trained classifier. In this setup, illustrated in Figure 2 (bottom row), the AL algorithm identifies informative samples (those which are difficult to classify) in an unlabeled dataset for annotation and addition to the growing training set. AL generates training sets that yield better classifier performance compared with training sets of the same size obtained via RL. The concept of "informative" samples in this context is related to the idea of margin-based classification in support vector machines (SVMs) [11], where labeled samples close to a decision boundary are used to classify unlabeled samples. In the AL context, informative samples are difficult-to-classify unlabeled data points that improve an existing training set.
Several AL algorithms have been proposed to determine whether an unlabeled sample is informative.
These methods measure the "informativeness" of a sample as the distance to a support-vector hyperplane [12, 13], the disagreement among bagged weak classifiers [9, 10], variation in feature distributions [14, 15], and model-based predictions [16]. In a bioinformatics context, Lee, et al. [17] showed the benefits of using AL in building a naive Bayes classifier to identify disease states for several different datasets. Veeramachaneni, et al. [18] implemented an AL training approach to build a classifier identifying patient status from tissue microarray data. Previously [19], we investigated the performance of different AL algorithms in creating training sets for distinguishing diseased from non-diseased tissue samples.
Among the results of that study, we found that the particular AL algorithm chosen for learning had no significant effect on the performance of the supervised classifier.
Another major issue in supervised training involves the minority class problem, wherein the target class is under-represented in the dataset relative to the non-target classes. A labeled training set comprises two sets of samples: ${S}_{{\omega}_{1}}^{\mathsf{\text{tr}}}$, representing training samples from the target (minority) class, and ${S}_{{\omega}_{2}}^{\mathsf{\text{tr}}}$, the samples from the non-target (majority) class. In the minority class problem, $\left|{S}_{{\omega}_{1}}^{\mathsf{\text{tr}}}\right| \ll \left|{S}_{{\omega}_{2}}^{\mathsf{\text{tr}}}\right|$, where $|\cdot|$ indicates set cardinality. Several researchers [20–24] have shown that this training set will likely yield a classifier with lower accuracy and area under the receiver operating characteristic curve (AUC) compared with training sets where $\left|{S}_{{\omega}_{1}}^{\mathsf{\text{tr}}}\right| = \left|{S}_{{\omega}_{2}}^{\mathsf{\text{tr}}}\right|$ or $\left|{S}_{{\omega}_{1}}^{\mathsf{\text{tr}}}\right| \gg \left|{S}_{{\omega}_{2}}^{\mathsf{\text{tr}}}\right|$. Weiss and Provost [20] showed that for several datasets, varying the percentage of the minority class in the training set alters the accuracy and AUC of the resulting classifiers, and that the optimal class ratio was found to be significantly different from the "natural" ratio. Japkowicz and Stephen [21] found that the effect of the minority class problem depends on a number of factors, including the complexity of the target class and the size of the class disparity. Chawla, et al. [22] proposed mitigating the problem by over-sampling the minority class using synthetic samples; however, this method may simply introduce noise if the target class is too complex.
While some research has addressed the minority class problem in biomedical data [17, 25], there has been little related work in the realm of digital pathology. Cosatto, et al. [26] applied an SVM AL method [12] in training a classifier for grading nuclear pleomorphism on breast tissue histology, while Begelman, et al. [27] employed an AL-trained SVM classifier in building a telepathology system for prostate tissue analysis. However, these studies did not account for the minority class problem in the training set, which is particularly relevant in the context of digital pathology, since the target class (cancer) is often observed far less often than the non-target class (non-cancer) and occupies only a small percentage of the overall tissue area. Ideally, an intelligent training strategy for this domain would combine AL with a mechanism for addressing the minority class problem by maintaining a user-defined class ratio (class balancing). Zhu and Hovey [23] combined an entropy-based AL technique with over- and under-sampling to overcome the minority class problem for text classification, and found that over-sampling the minority class yielded the highest classifier performance. However, they did not investigate different class ratios and did not discuss the increased cost of the sampling techniques. Bloodgood and Vijay-Shanker [28] focused on an AL and classification method based on SVMs for unbalanced text and protein expression data; their approach involves estimating the class balance in the entire dataset, and then selecting samples to overcome this bias (as opposed to overcoming bias in the growing training set generated by AL).
While additional sampling can help to mitigate the minority class problem, this process requires more annotations compared to a training set with unbalanced classes. Because the cost of obtaining each annotation is high, it would be beneficial to be able to predict the number of annotations required to obtain a class-balanced training set of a predefined size. These predictions are critical for determining, a priori, the amount of resources (time, money, manpower) that will be employed in developing a supervised classifier. An analytical cost model will enable us to predict the cost involved in training the supervised classifier. Additionally, such a model will provide some insight into the relationship between (1) the size of a training set, (2) its class balance, and (3) the number of annotations required to achieve a predefined target accuracy.
Contributions and Significance
In this work, we develop an AL-based classifier training strategy that also accounts for the minority class problem. This training strategy is referred to as "Class-Balanced Active Learning" (CBAL). We apply CBAL to the problem of building a supervised classifier to distinguish between prostate cancer (CaP) and non-CaP regions on images of prostate histopathology. For this particular problem, training samples are difficult and expensive to obtain, and the target class (CaP) is relatively sparse in relation to the non-target class; thus, we expect CBAL to yield large benefits in terms of training cost. Our mathematical model is used to predict the cost of building a training set of a predefined size and class ratio. This is, to the best of our knowledge, the first in-depth investigation and modeling of AL-based training for supervised classifiers that also specifically addresses the minority class problem in the context of digital pathology. However, CBAL training can be easily applied to other domains where obtaining annotated training samples is a time-consuming and difficult task, and where the target and non-target class ratios are not balanced. The rest of the paper is organized as follows. In Section 2 we describe the theory behind CBAL, followed by a description of the algorithms and model implementation in Section 3. In Section 4 we describe our experimental design, and in Section 5 we present the results and discussion. Concluding remarks are presented in Section 6.
Methods
Modeling the Annotation Cost of Class Balancing in Training
Notation and Symbols
Symbol  Description  Symbol  Description
r ∈ R  Dataset of image patches  t ∈ {0, ⋯, T}  Iteration of ActiveLearn 
S^{tr}, S^{te}  Unlabeled training, testing pools  Φ  Training methodology 
${S}_{t}^{\mathsf{\text{E}}},{\widehat{S}}_{t}^{\mathsf{\text{E}}}$  Eligible samples, annotated samples  ${S}_{t,\Phi}^{\mathsf{\text{tr}}}$  Samples labeled via Φ at t 
${\mathcal{T}}_{t}$  Fuzzy classifier using ${S}_{t,\Phi}^{\mathsf{\text{tr}}}$  k_{1,t}, k_{2,t}  Number of samples in ${S}_{t}^{E}$ from ω_{1}, ω_{2} 
M  Number of votes used to generate ${\mathcal{T}}_{t}$  ω_{1}, ω_{2}  Possible classes of r 
τ  Confidence margin  r ↪ ω_{1}  Membership of r in class ω_{1}
θ  Classifier-dependent threshold for ${\mathcal{T}}_{t}$  $\hat{{k}_{1}},\hat{{k}_{2}}$  Number of samples in ${\widehat{S}}_{t}^{\mathsf{\text{E}}}$ from ω_{1}, ω_{2}
p_{ t }(r ↪ ω_{1})  Probability of observing r ↪ ω_{1}  N _{ t }  Samples added to training set at t 
P _{Δ}  Model confidence  ${\widehat{P}}_{t}$  Probability of observing $\hat{{k}_{1}}$ samples 
${\mathcal{A}}_{t}$  Accuracy of trained classifier at t  $\mathcal{L}$  Total training cost after T iterations 
Theory of CBAL
In this subsection, we describe the theoretical foundation of the CBAL approach. Our goal in this section is to precisely define an "informative sample," identify the likelihood of observing a sample of a target class, and predict the number of samples that must be annotated before a specified number of target samples is observed and annotated. Our aim is to be able to predict a priori the cost of the system in terms of activelylearned annotations, which in turn represent an expenditure of resources.
Definition 1. The set of informative samples (eligible for annotation), ${S}_{t}^{E}$, at any iteration t is given by the set of samples r ∈ R for which $0.5-\tau \le {\mathcal{T}}_{t}\left(r\right)\le 0.5+\tau $.
The value of ${\mathcal{T}}_{t}\left(r\right)$ denotes the classification confidence, where ${\mathcal{T}}_{t}\left(r\right)=1$ indicates strong confidence that r ↪ ω_{1}, and ${\mathcal{T}}_{t}\left(r\right)=0$ indicates confidence that r ↪ ω_{2}. The numbers of samples $r\in {S}_{t}^{\mathsf{\text{E}}}$ for which r ↪ ω_{1} and r ↪ ω_{2} are denoted k_{1,t} and k_{2,t}, respectively. The likelihood of randomly selecting a sample r ↪ ω_{1} from ${S}_{t}^{E}$ is ${p}_{t}\left(r\hookrightarrow {\omega}_{1}\right)=\frac{{k}_{1,t}}{{k}_{1,t}+{k}_{2,t}}$. The number annotated in class ω_{2} is ${N}_{t}-\hat{{k}_{1}}$.
Proof Revealing the label of a sample $r\in {S}_{t}^{E}$ is an independent event resulting in either observation of class ω_{1} or ω_{2}. The probability of success (i.e. observing a minority class sample) is p_{ t } (r ↪ ω_{1}), and the probability of failure is p_{ t } (r ↪ ω_{2}) = 1 - p_{ t } (r ↪ ω_{1}) in the two-class case. We assume that ${S}_{t}^{E}$ is large, so p_{ t } (r ↪ ω_{1}) is fixed. The annotations continue until $\hat{{k}_{1}}$ successes are achieved. Because of these properties, the number of annotations N_{ t } is a negative binomial random variable, and the probability of observing $\hat{{k}_{1}}$ samples from class ω_{1} in N_{ t } annotations is given by the negative binomial distribution.
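For reference, the standard negative binomial form implied by this proof can be written as follows. This is a sketch in the notation of this section rather than a reproduction of the paper's Equation 1; the summation index n is introduced here for illustration, and $p_t$ abbreviates $p_t(r \hookrightarrow \omega_1)$.

```latex
% Probability that the \hat{k_1}-th minority sample is found on exactly the n-th annotation
P(N = n) = \binom{n-1}{\hat{k_1}-1}\, p_t^{\hat{k_1}}\, (1-p_t)^{\,n-\hat{k_1}}, \qquad n \ge \hat{k_1}

% Probability that \hat{k_1} minority samples have been found within N_t annotations
\widehat{P}_t = \sum_{n=\hat{k_1}}^{N_t} \binom{n-1}{\hat{k_1}-1}\, p_t^{\hat{k_1}}\, (1-p_t)^{\,n-\hat{k_1}}
```

Each term of the sum is non-negative, so the cumulative probability cannot decrease as N_{ t } grows.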
The consequence of Proposition 1 is that as N_{ t } (i.e. the training cost in annotations) increases, ${\widehat{P}}_{t}$ also increases, indicating a greater likelihood of observing $\hat{{k}_{1}}$ samples r ↪ ω_{1}. We denote as P_{Δ} the target probability for the model to represent the degree of certainty that, within N_{ t } annotations, we have achieved our $\hat{{k}_{1}}$ samples r ∈ R for which r ↪ ω_{1}.
Proof We wish to find the value of N_{ t } that causes Equation 1 to match our target probability, P_{Δ}. When that happens, ${\widehat{P}}_{t}={P}_{\mathrm{\Delta}}$ and ${\widehat{P}}_{t}-{P}_{\mathrm{\Delta}}=0$. Using a minimization strategy, we obtain the value of N_{ t } .
Proposition 2 gives us an analytical formulation for N_{ t } . Note that Equation 3 returns the smallest N_{ t } that matches the P_{Δ}. The possible values of N_{ t } range from $\hat{{k}_{1}}$, in which case exactly ${N}_{t}=\hat{{k}_{1}}$ annotations are required, to N_{ t } = |S^{tr}|, in which case the entire dataset is annotated before obtaining $\hat{{k}_{1}}$ samples. Note that we are assuming that there are at least $\hat{{k}_{1}}$ samples in the unlabeled training set from which we are sampling.
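The search for N_{ t } can be sketched numerically. The snippet below is a minimal illustration, assuming the standard negative binomial form and interpreting "matching" P_{Δ} as the first N_{ t } whose cumulative probability reaches it; the function names `prob_within` and `min_annotations` are hypothetical and not taken from the paper.

```python
from math import comb


def prob_within(n_annotations: int, k1_hat: int, p: float) -> float:
    """Probability of seeing k1_hat minority samples within n_annotations draws."""
    return sum(
        comb(n - 1, k1_hat - 1) * p**k1_hat * (1 - p) ** (n - k1_hat)
        for n in range(k1_hat, n_annotations + 1)
    )


def min_annotations(k1_hat: int, p: float, p_delta: float, pool_size: int) -> int:
    """Smallest N_t whose cumulative probability reaches p_delta.

    Assumption: if the target is never reached, the whole pool is annotated,
    so the result is capped at pool_size.
    """
    cumulative = 0.0
    for n in range(k1_hat, pool_size + 1):
        # Add the probability that the k1_hat-th success occurs exactly at draw n
        cumulative += comb(n - 1, k1_hat - 1) * p**k1_hat * (1 - p) ** (n - k1_hat)
        if cumulative >= p_delta:
            return n
    return pool_size
```

The result always lies between $\hat{{k}_{1}}$ and the pool size, mirroring the range of N_{ t } described above.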
Algorithms and Implementation
AL Algorithm for Selecting Informative Samples
The CBAL training strategy consists of two algorithms that work in tandem: ActiveTrainingStrategy, for selecting informative samples, and MinClassQuery, for maintaining class balance. Algorithm ActiveTrainingStrategy, detailed below, requires a pool of unlabeled samples, S^{tr}, from which samples will be drawn for annotation, as well as a parameter for maximum iterations T. This parameter can be chosen according to the available training budget or through a predefined stopping criterion. The output of the algorithm is a fully annotated training set ${S}_{T,\Phi}^{\mathsf{\text{tr}}}$ as well as the classifier ${\mathcal{T}}_{T}$ trained on that set.
Algorithm ActiveTrainingStrategy
Input: S^{tr}, T
Output: ${S}_{T,\Phi}^{\mathsf{\text{tr}}}$, ${\mathcal{T}}_{T}$
0. initialization: create bootstrap training set ${S}_{0,\Phi}^{\mathsf{\text{tr}}}$, set t = 0
1. while t < T do
2.   Create classifier ${\mathcal{T}}_{t}$ from training set ${S}_{t,\Phi}^{\mathsf{\text{tr}}}$;
3.   Find eligible sample set ${S}_{t}^{E}$ where $0.5-\tau \le {\mathcal{T}}_{t}\left(r\right)\le 0.5+\tau $;
4.   Annotate K eligible samples via MinClassQuery() to obtain ${\widehat{S}}_{t}^{\mathsf{\text{E}}}$;
5.   Remove ${\widehat{S}}_{t}^{\mathsf{\text{E}}}$ from S^{tr} and add to ${S}_{t+1,\Phi}^{\mathsf{\text{tr}}}$;
6.   t = t + 1;
7. endwhile
8. return ${\mathcal{T}}_{T}$, ${S}_{T,\Phi}^{\mathsf{\text{tr}}}$;
end
The identification of the informative samples occurs in Step 3, wherein a fuzzy classifier ${\mathcal{T}}_{t}$ is generated from a set of M weak binary decision trees [29] that are combined via bagging [30]. Informative samples are those on which the M weak binary decision trees disagree most strongly, with close to half voting for each class; that is, samples for which $0.5-\tau \le {\mathcal{T}}_{t}\left(r\right)\le 0.5+\tau $. This approach is similar to the Query-by-Committee (QBC) AL algorithm [9, 10]. While there are several alternative algorithms available to perform AL-based training [12, 14–16], we chose the QBC algorithm in this work due to its intuitive description of sample informativeness and its straightforward implementation. It is important to note that poor performance of ${\mathcal{T}}_{t}$ does not degrade the ability of the algorithm to identify informative samples. We expect that at low t, the performance of ${\mathcal{T}}_{t}$ will be low due to the lack of sufficient training, and much of the dataset will be identified as informative.
However, even if ${\mathcal{T}}_{t}$ identifies the majority of unlabeled samples as informative, the strategy is still at least as efficient as RL. In the worst-case scenario, where all unlabeled samples are considered informative, we are forced to choose training samples at random, which is equivalent to traditional supervised training.
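The eligibility test at the heart of the committee-based selection can be sketched in a few lines. This is an illustrative sketch only: it assumes the committee's output is available as a list of M binary votes per sample (1 for ω_{1}, 0 for ω_{2}), and the function names are hypothetical.

```python
def committee_confidence(votes):
    """Fraction of the M committee members voting for class 1 (the fuzzy output)."""
    return sum(votes) / len(votes)


def eligible_indices(all_votes, tau):
    """Indices of samples whose confidence lies in [0.5 - tau, 0.5 + tau]."""
    return [
        i
        for i, votes in enumerate(all_votes)
        if abs(committee_confidence(votes) - 0.5) <= tau
    ]
```

With tau = 0.5 every sample passes the test, which corresponds to the worst case described above where selection degenerates to random sampling.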
Obtaining Annotations While Maintaining Class Balance
Algorithm MinClassQuery is used by ActiveTrainingStrategy to select samples from the set of eligible samples, ${S}_{t}^{E}$, according to a class ratio specified by $\hat{{k}_{1}}$ and $\hat{{k}_{2}}$. Recall that $K=\hat{{k}_{1}}+\hat{{k}_{2}}$, and so K > 0. We expect that there will be many more samples from ω_{2} (the majority class) than from ω_{1}, so surplus majority class samples will often be drawn after the quota $\hat{{k}_{2}}$ has been filled. Because these surplus samples are annotated before being rejected, they are removed from the unlabeled eligible sample pool ${S}_{t}^{E}$ in Step 7; however, since the resources have been expended to annotate them, they can be saved for future iterations.
Algorithm MinClassQuery
Input: ${S}_{t}^{E}$, K > 0, $\hat{{k}_{1}}$, $\hat{{k}_{2}}$
Output: ${\widehat{S}}_{t}^{\mathsf{\text{E}}}$
0. initialization: ${\widehat{S}}_{t}^{\mathsf{\text{E}}}=\mathrm{\varnothing}$, ${k}_{1}^{\prime}=0$, ${k}_{2}^{\prime}=0$
1. while $\left|{\widehat{S}}_{t}^{\mathsf{\text{E}}}\right|\ne K$ do
2.   Find class ω_{ i } of a random sample $r\in {S}_{t}^{\mathsf{\text{E}}}$, i ∈ {1, 2};
3.   if ${k}_{i}^{\prime}<\hat{{k}_{i}}$
4.     Remove r from ${S}_{t}^{E}$ and add to ${\widehat{S}}_{t}^{\mathsf{\text{E}}}$;
5.     ${k}_{i}^{\prime}={k}_{i}^{\prime}+1$;
6.   else
7.     Remove r from ${S}_{t}^{E}$;
8.   endif
9. endwhile
10. return ${\widehat{S}}_{t}^{\mathsf{\text{E}}}$;
end
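The quota logic of MinClassQuery can be sketched as follows. This is a hedged illustration, not the authors' implementation: `oracle` is a hypothetical callable standing in for the expert annotator, the surplus list models the samples "saved for future iterations", and the guard against an empty pool is an addition not present in the listing.

```python
import random


def min_class_query(eligible, oracle, k1_hat, k2_hat, rng=None):
    """Annotate random eligible samples until both class quotas are filled.

    Returns (selected, surplus, cost), where cost counts oracle calls.
    """
    rng = rng or random.Random()
    pool = list(eligible)
    rng.shuffle(pool)                      # popping from a shuffled list = random draws
    quota = {1: k1_hat, 2: k2_hat}
    count = {1: 0, 2: 0}
    selected, surplus = [], []
    cost = 0
    # Assumption: stop early if the pool runs out before the quotas are met
    while len(selected) < k1_hat + k2_hat and pool:
        r = pool.pop()
        label = oracle(r)                  # one expert annotation
        cost += 1
        if count[label] < quota[label]:
            selected.append((r, label))    # Steps 4-5: accept and count
            count[label] += 1
        else:
            surplus.append((r, label))     # Step 7: removed, but label is kept
    return selected, surplus, cost
```

The returned cost corresponds to N_{ t } for the iteration: every drawn sample is paid for, whether or not it is accepted.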
Updating Cost Model and Stopping Criterion Formulation
The total number of iterations T corresponds to the size of the final training set and can be specified manually or found using a stopping criterion. Classifiers that require a large training set will require a large value for T, increasing cost. The stopping criterion (Equation 5) compares the accuracy of successive classifiers against δ, where δ is a similarity threshold and ${\mathcal{A}}_{t}$ is the accuracy of classifier ${\mathcal{T}}_{t}$ (as evaluated on a holdout training set). Thus, when additional training samples no longer increase the resulting classifier's accuracy, the training can cease. An assumption in using this stopping criterion is that adding samples to the training set will not decrease classifier accuracy, and that accuracy will rise asymptotically.
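A stopping check of this kind might look like the sketch below. The exact form of the paper's criterion is not reproduced in this text, so the comparison used here (absolute change between the two most recent accuracies at most δ) is an assumption, and `should_stop` is a hypothetical name.

```python
def should_stop(accuracies, delta):
    """True once the latest accuracy differs from the previous one by at most delta.

    Assumption: the criterion compares only the two most recent iterations.
    """
    if len(accuracies) < 2:
        return False                       # nothing to compare against yet
    return abs(accuracies[-1] - accuracies[-2]) <= delta
```

A smaller δ makes the check harder to satisfy, so training runs for more iterations before it stops.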
Selection of Free Parameters
Our methodology contains a few free parameters that must be selected by the user. The training algorithm employs three parameters: the similarity threshold δ (Equation 5); the confidence margin τ; and the number of samples from each class to add per iteration, $\hat{{k}_{1}}$ and $\hat{{k}_{2}}$. The choice of δ will determine the maximum number of iterations, T, the algorithm is allowed to run. A small value of δ will require a larger final training set (i.e. a larger T) before the algorithm satisfies the stopping criterion. Additionally, if Eq. 5 is never satisfied, then all available training samples will eventually be annotated (S^{tr} will be exhausted).
The confidence margin τ defines the range of values of ${\mathcal{T}}_{t}\left(r\right)$ for which sample r is considered informative (difficult-to-classify). Smaller values of τ define a smaller area on the interval [0, 1], requiring more uncertainty for a region to be selected. τ = 0.0 indicates that only samples for which ${\mathcal{T}}_{t}\left(r\right)=0.5$ (i.e. perfect classifier disagreement) are informative, while τ = 0.5 indicates that all samples are informative (equivalent to random learning). The number of samples to add from each class during an iteration of learning, $\hat{{k}_{1}}$ and $\hat{{k}_{2}}$, determines how many annotations occur before a new round of learning starts. Consider two cases:
1. $\hat{{k}_{1}}=\hat{{k}_{2}}=10$: in this case, 20 samples (10 from each class) are annotated per iteration.
2. $\hat{{k}_{1}}=\hat{{k}_{2}}=1$: in this case, 2 samples (1 per class) are annotated per iteration.
In both cases, the learning algorithm for selecting informative samples is only updated after each iteration.
In the first case, 20 samples are added to ${\widehat{S}}_{t}^{\mathsf{\text{E}}}$ before new learning occurs, while in the second case, the learning algorithm is updated after each pair of samples (one per class) is annotated. Thus, in case 2, we are sure that each additional pair is chosen using the maximum amount of available information, while in case 1, several samples are added before the learning algorithm is updated. Although the second case requires ten iterations before it has the same training set size as the first case, each additional annotation is chosen based on an updated AL model, ensuring that all 20 samples are informative.
Experimental Design
Data Description
We apply the CBAL training methodology to the problem of prostate cancer detection from biopsy samples. Glass slides containing prostate biopsy samples are digitized at 40× magnification (0.25 μm per pixel resolution). The original images are reduced in size using a pyramidal decomposition scheme [31] to 6.25% of their original size (4.0 μm per pixel resolution), matching the resolution of the images used in [3]. Each image is divided into sets of square regions, r ∈ R, such that each region constitutes a 30-by-30 pixel square area (120-by-120 μm area). These image regions constitute the dataset used for training and testing. Ground truth annotation is performed manually by an expert pathologist, who places a contour on tissue regions on the original 40× magnification image. Pathologists annotated both cancer and non-cancer regions of tissue, and only annotated regions were included in the dataset. A total of 100 biopsy images were analyzed from 58 patients, yielding over 12,000 annotated image regions. All of the 58 patients exhibited prostate cancer, although cancer was not present in all 100 images. The square regions were assumed to be independently drawn from the images.
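The resolution arithmetic and the division into square regions can be checked with a short sketch. The tiling below assumes non-overlapping regions and skips partial regions at the image border; neither detail is stated in the text, and `region_origins` is a hypothetical helper.

```python
SCAN_RES_UM = 0.25   # microns per pixel at 40x magnification (from the text)
SCALE = 0.0625       # images reduced to 6.25% of their original size

res_um = SCAN_RES_UM / SCALE   # microns per pixel after reduction
region_um = 30 * res_um        # side length of a 30-pixel region, in microns


def region_origins(height, width, size=30):
    """Top-left (row, col) of each full size-by-size region.

    Assumption: regions do not overlap and partial edge regions are dropped.
    """
    return [
        (row, col)
        for row in range(0, height - size + 1, size)
        for col in range(0, width - size + 1, size)
    ]
```

Under these assumptions a 60-by-90 pixel image yields six regions, and the computed values reproduce the 4.0 μm per pixel and 120 μm figures quoted above.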
Feature Extraction
First-order Statistical Features
First-order features are statistics calculated directly from the pixel values in the image. These include the mean, median, and standard deviation of the pixels within a window size, as well as Sobel filters and directional gradients. Of these features, two were included in the subset: the standard deviation and the range of pixel intensities.
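The two retained first-order features are simple to compute for a window of pixels. The sketch below assumes the window is supplied as a flat list of intensities and uses the population standard deviation; the paper does not state which estimator was used, and `first_order_features` is a hypothetical name.

```python
from statistics import pstdev


def first_order_features(window):
    """Standard deviation and intensity range of a flat list of pixel values."""
    return {"std": pstdev(window), "range": max(window) - min(window)}
```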
Second-order Co-occurrence Features
Co-occurrence image features are based on the adjacency of pixel values in an image. An adjacency matrix is created where the value of the i-th row and the j-th column equals the number of times pixel values i and j appear within a fixed distance of one another. A total of thirteen Haralick texture features [33] are calculated from this co-adjacency matrix, of which five were found to be highly discriminating: information measure, correlation, energy, contrast variance, and entropy.
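To make the construction concrete, the sketch below builds a symmetric co-occurrence matrix for a single pixel offset and derives three texture measures from it. It is an illustration under stated assumptions: a single offset stands in for "within a fixed distance", the feature formulas are the common textbook definitions and may differ in detail from those used in the paper, and the function names are hypothetical.

```python
from math import log2


def cooccurrence(image, levels, dr=0, dc=1):
    """Symmetric co-occurrence counts for pixel pairs offset by (dr, dc)."""
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dr, c + dc
            if 0 <= r2 < rows and 0 <= c2 < cols:
                i, j = image[r][c], image[r2][c2]
                counts[i][j] += 1          # count the pair in both orders
                counts[j][i] += 1
    return counts


def texture_features(counts):
    """Energy, contrast and entropy of the normalized co-occurrence matrix."""
    total = sum(map(sum, counts))
    p = [[v / total for v in row] for row in counts]
    n = len(p)
    energy = sum(v * v for row in p for v in row)
    contrast = sum((i - j) ** 2 * p[i][j] for i in range(n) for j in range(n))
    entropy = -sum(v * log2(v) for row in p for v in row if v > 0)
    return {"energy": energy, "contrast": contrast, "entropy": entropy}
```

For real images the intensities would first be quantized to a small number of gray levels so that the matrix stays compact.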
Steerable Filter Features
To quantify spatial and directional textures in the image, we utilize a steerable Gabor filter bank [34]. The Gabor filter is parameterized by frequency and orientation (phase) components; when convolved with an image, the filter provides a high response for textures that match these components. We compute a total of 40 filter banks, of which 7 were found to be informative, from a variety of frequency and orientation values.
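A Gabor kernel of the kind described can be generated as in the sketch below. The kernel size, envelope width, and the split of 40 filters into 5 frequencies by 8 orientations are all assumptions chosen for illustration; the paper does not list its parameter values, and `gabor_kernel` is a hypothetical helper.

```python
from math import cos, exp, pi, sin


def gabor_kernel(size, frequency, theta, sigma):
    """Real part of a Gabor kernel as a size-by-size list of lists (odd size)."""
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate coordinates so the sinusoid runs along direction theta
            xr = x * cos(theta) + y * sin(theta)
            yr = -x * sin(theta) + y * cos(theta)
            envelope = exp(-(xr * xr + yr * yr) / (2 * sigma * sigma))
            row.append(envelope * cos(2 * pi * frequency * xr))
        kernel.append(row)
    return kernel


# Hypothetical bank: 5 frequencies x 8 orientations = 40 kernels
bank = [
    gabor_kernel(9, f, k * pi / 8, sigma=2.0)
    for f in (0.05, 0.1, 0.2, 0.3, 0.4)
    for k in range(8)
]
```

Convolving an image region with each kernel and summarizing the response (for example by its mean magnitude) would give one feature per filter.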
Evaluation of Training Set Performance via Probabilistic Boosting Trees
We generate receiver operating characteristic (ROC) curves by calculating the classifier's sensitivity and specificity at various decision thresholds θ ∈ {0, ..., 1}. Each value of θ yields a single point on the ROC curve, and the area under the curve (AUC) measures the discrimination between cancer and non-cancer regions. The accuracy can then be calculated by setting θ to the operating point of the ROC curve. Again, it should be noted that it is possible to evaluate the performance of the training set using any supervised classifier in place of PBT. A previous study [36] used both decision trees [29] and SVMs [11] as supervised evaluation algorithms in an AL training experiment, and found that the trend in performance for both algorithms was similar. In this study we implemented PBTs because the algorithm was different from ${\mathcal{T}}_{t}$, which avoids biasing results; however, alternative evaluation algorithms could certainly be employed.
Although the classifier performance values may change, the goal of these experiments is to show that the performance of an actively-learned, class-balanced training set is better than a randomly generated unbalanced set.
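The ROC and AUC computation described in this subsection can be sketched independently of the classifier. The code below assumes fuzzy scores in [0, 1] with label 1 for cancer, and uses the trapezoidal rule for the area; the function names are hypothetical, and the evaluation classifier itself is not modeled.

```python
def roc_points(scores, labels, thresholds):
    """(false positive rate, true positive rate) for each threshold; label 1 = cancer."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    for theta in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= theta and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= theta and y == 0)
        points.append((fp / neg, tp / pos))
    return sorted(points)


def auc(points):
    """Trapezoidal area under (fpr, tpr) points sorted by fpr."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )
```

Including one threshold above the maximum score adds the (0, 0) end point, so the curve spans the full range of false positive rates.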
List of Experiments
We perform three sets of experiments to analyze different facets of the active learning training methodology.
Experiment 1: Comparison of CBAL Performance with Alternate Training Strategies. We compare the performance of CBAL with four alternative training strategies to show that CBAL training will yield a classifier with greater performance versus a training set of the same size trained using an alternative method.
Unbalanced Active Learning (UBAL): The class ratio is not controlled; eligible samples ${S}_{t}^{E}$ determined via AL are randomly annotated and added to ${\widehat{S}}_{t}^{\mathsf{\text{E}}}$.
Class Balanced Random Learning (CBRL): All unlabeled samples in S^{tr} are eligible for annotation, while holding class balance constant as described in MinClassQuery.
Unbalanced Random Learning (UBRL): All unlabeled samples are queried randomly. This is the classic training scenario, wherein neither class ratio nor informative samples are explicitly controlled.
Full Training (Full): All available training samples are used. This represents the performance when the entire training set is annotated and available (an ideal scenario).
In random learning (RL), all samples in the unlabeled pool S^{tr} are "eligible" for annotation; that is, S^{E} = S^{tr}. In unbalanced class experiments, the MinClassQuery algorithm is replaced by simply annotating K random samples (ignoring class) and adding them to ${\widehat{S}}^{\mathsf{\text{E}}}$. The full training strategy represents the scenario when all possible training data is used.
The classifier is tested against an independent testing pool, S^{te}, which (along with the training set) is selected at random from the dataset at the start of each trial. In these experiments, T = 40, the confidence margin was τ = 0.25, and the number of samples added at each iteration was K = 2. In the balanced experiments, $\hat{{k}_{1}}=\hat{{k}_{2}}=1$. A total of 12,588 image regions, drawn from the 100 images, were used in the overall dataset; 1,346 regions were randomly selected for S^{te}, and 11,242 for S^{tr}, in each of 10 trials. The regions are assumed to be independent samples of the overall image space due to the heterogeneity of the tissue and appearance of disease. Because the goal of classification is to distinguish between cancer and non-cancer regions of tissue rather than individual patients, the training and testing sets were drawn randomly from the overall pool of available regions. The true ratio of non-cancer to cancer regions in S^{tr} was approximately 25:1 (4% belonged to the cancer class).
Experiment 2: Effect of Training Set Class Ratio on Accuracy of Resulting Classifier. To explore the effect of training set class ratio on the performance of the resulting classifier, the CBAL methodology was used, setting K = 10 and varying $\hat{{k}_{1}}$ and $\hat{{k}_{2}}$ such that the percentage of the training set consisting of minority samples varies from 20% $\left(\hat{{k}_{1}}=2,\ \hat{{k}_{2}}=8\right)$ to 80% $\left(\hat{{k}_{1}}=8,\ \hat{{k}_{2}}=2\right)$. Each set of parameters was used to build a training set, which in turn was used to build a classifier that was evaluated on the same independent testing set S^{te}.
Experiment 3: Comparison of Cost Model Predictions with Empirical Observations. At each step of the AL algorithm, we estimate N_{ t } for obtaining balanced classes as described in Section 2. The goal of this experiment was to empirically evaluate whether our mathematical model could accurately predict the cost of obtaining balanced classes at each iteration, and could thus be used to predict the cost of classifier training for any problem domain. For these calculations, we set the initial class probability p_{0}(ω_{1}) = 0.04, based on the observations of the labeled data used at the beginning of the AL process. Additionally, we set the desired sample numbers to correspond with the different class ratios listed in Experiment 2, from 20% minority class samples $\left(\hat{{k}_{1}}=2,\ \hat{{k}_{2}}=8\right)$ to 80% $\left(\hat{{k}_{1}}=8,\ \hat{{k}_{2}}=2\right)$. The aim of this experiment was to investigate the relationship between the cost of a specific class ratio and the performance of ${\mathcal{T}}_{T}^{\prime}$.
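A rough feel for how cost scales with the class ratio comes from the mean of the negative binomial distribution, which is the number of required successes divided by the success probability. The sketch below assumes the minority probability stays fixed at the initial value of 0.04 throughout, which the full model does not assume; `expected_annotations` is a hypothetical helper and the intermediate 50% setting is included only for illustration.

```python
def expected_annotations(k1_hat, p):
    """Mean of a negative binomial: expected draws needed to see k1_hat successes."""
    return k1_hat / p


P0 = 0.04  # initial minority-class probability from the text

# Class ratios spanning Experiment 2's range (20% to 80% minority samples)
for k1_hat, k2_hat in [(2, 8), (5, 5), (8, 2)]:
    print(k1_hat, k2_hat, expected_annotations(k1_hat, P0))
```

Because minority samples are rare, the expected cost is driven by $\hat{{k}_{1}}$: under this fixed-probability assumption, quadrupling the minority quota quadruples the expected number of annotations per iteration.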
Results and Discussion
Experiment 1: Comparison of CBAL performance with Alternate Training Strategies
The AUC values for CBAL approach those obtained with the full training set at 60 samples (t = 30), while CBRL, UBRL, and UBAL have approximately 0.05 lower AUC at the same sample size. Accuracy for CBAL remains similar to that of the other methods until t = 30, at which point CBAL outperforms them by approximately 3%. CBRL, UBRL, and UBAL do not perform as well as CBAL for the majority of our experiments, requiring a larger number of samples to match the accuracy and AUC of CBAL.
Experiment 2: Effect of Training Set Class Ratio on Accuracy of Resulting Classifier
Experiment 3: Comparison of Cost Model Predictions with Empirical Observations
While it may seem from Figure 6 that the strategy yielding best performance would be to oversample the minority class as much as possible, we also plotted the empirical cost values N_{ t } for each of the class ratios from Experiment 2 in Figure 7(b). We find that as the percentage of the minority class increases, the cost associated with each iteration of the AL algorithm also increases. This is due to the fact that as the minority class is oversampled, more annotations are required to find additional minority samples. While there is some increase in accuracy by oversampling the dataset, the annotation cost increases by an order of magnitude. Thus, the optimal strategy will need to balance the increase in accuracy with the constraints of the overall annotation budget.
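The growth in cost with the minority percentage can be illustrated with a small simulation. This is a sketch under simplifying assumptions, not the authors' procedure: each annotated sample is taken to be a minority sample independently with a fixed probability of 0.04, annotation continues until both class quotas are filled, and the function names, number of runs, and seed are hypothetical.

```python
import numpy as np

def annotations_until_quota(p_minority, k1, k2, rng):
    """Count annotations needed to collect k1 minority and k2 majority samples."""
    n_minority = n_majority = n_annotated = 0
    while n_minority < k1 or n_majority < k2:
        n_annotated += 1
        if rng.random() < p_minority:
            n_minority += 1   # any surplus beyond k1 is simply discarded
        else:
            n_majority += 1   # any surplus beyond k2 is simply discarded
    return n_annotated

def mean_cost(p_minority, k1, k2, n_runs=1000, seed=0):
    """Average annotation cost over repeated simulated iterations."""
    rng = np.random.default_rng(seed)
    costs = [annotations_until_quota(p_minority, k1, k2, rng) for _ in range(n_runs)]
    return float(np.mean(costs))

# (k1, k2) pairs spanning 20% to 80% minority samples with K = 10.
ratios = [(2, 8), (4, 6), (5, 5), (6, 4), (8, 2)]
costs = {r: mean_cost(0.04, *r) for r in ratios}

for (k1, k2), cost in costs.items():
    print(f"k1={k1}, k2={k2}: mean annotations per iteration = {cost:.1f}")
```

Under these assumptions the mean cost is driven almost entirely by the minority quota, which is consistent with the observation above that oversampling the minority class requires many more annotations per iteration.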
Conclusions
In this work we present a strategy for training a supervised classifier when the costs of training are high, and where the minority class problem exists. Our strategy, Class-Balanced Active Learning (CBAL), has the following characteristics: (1) Active Learning (AL) is used to select informative samples for annotation, thus ensuring that each annotation is highly likely to improve classifier performance. (2) Class ratios are specifically addressed in this training strategy to prevent the training set from being biased toward the majority class. (3) A mathematical model is used to predict the number of annotations required before the specified class balance is reached.
We applied these techniques to the task of quantitatively analyzing digital prostate tissue samples for presence of cancer, where the CBAL training method yielded a classifier with accuracy and AUC values similar to those obtained with the full training set using fewer samples than the unbalanced AL, class-balanced random learning, or unbalanced random learning methods. Our mathematical cost model was able to predict the number of annotations required to build a class-balanced training set within 20 annotations, despite the large amount of variance in the empirically observed costs. This model is critical in determining, a priori, what the cost of training will be in terms of annotations, which in turn translates into the time and effort expended by the human expert in helping to build the supervised classifier. We found that by specifying class ratios for the training set that favor the minority class (i.e. oversampling), the resulting classifier performance increased slightly; however, the cost model predicted a large increase in the cost of training, as a high percentage of minority class samples requires more annotations to build. Thus, an optimal training strategy must take into account the overall training budget and the desired accuracy.
Some of the specific findings in this work, such as the observation that overrepresenting the minority class yields a slightly higher classifier performance, may be specific to the dataset considered here. Additionally, the observation that the AL algorithm has a large amount of variance in the empirically observed costs (particularly at the beginning of training) indicates that the eligible sample set is unpredictable with respect to class composition. This behavior may not necessarily be duplicable with different datasets or AL strategies, both of which will yield eligible sample sets with different class compositions. Additionally, we do not claim that our choice of AL algorithm (QBC), our weak classification algorithm (bagged decision trees), or our evaluation classifier (PBT) will outperform the available alternatives. However, by combining AL and class balancing, we have developed a general training strategy that should be applicable to most supervised classification problems where the dataset is expensive to obtain and which suffers from the minority class problem. These problems are particularly prevalent in medical image analysis and digital pathology, where the costs of classifier training are very high and an intelligent training strategy can help save great amounts of time and money. Training is an essential and difficult part of supervised classification, but the integration of AL and intelligent choice of class ratios, as well as the application of a general cost model, will help researchers to plan the training process more quickly and effectively. Future work will involve extensions of our framework to the multiclass case, where relationships between multiple classes with different distributions must be taken into account.
Declarations
Acknowledgements
Funding for this work was provided by the Wallace H. Coulter Foundation, the New Jersey Commission on Cancer Research, the United States Department of Defense (W81XWH0810145), the National Cancer Institute (R01CA14077201, R01CA13653501, R03CA14399101), and the Cancer Institute of New Jersey.
Authors’ Affiliations
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.