Reverse active learning based atrous DenseNet for pathological image classification

Background: Owing to recent advances in deep learning, researchers have increasingly applied deep learning models to medical image analysis. However, pathological image analysis based on deep learning networks faces a number of challenges, such as the high resolution (gigapixel) of pathological images and the lack of annotation capabilities. To address these challenges, we propose a training strategy called deep-reverse active learning (DRAL) and atrous DenseNet (ADN) for pathological image classification. The proposed DRAL can improve the classification accuracy of widely used deep learning networks such as VGG-16 and ResNet by removing mislabeled patches in the training set. As the size of a cancer area varies widely in pathological images, the proposed ADN integrates atrous convolutions with the dense block for multiscale feature extraction.
Results: The proposed DRAL and ADN are evaluated using the following three pathological datasets: BACH, CCG, and UCSB. The experiment results demonstrate the excellent performance of the proposed DRAL + ADN framework, achieving patch-level average classification accuracies (ACA) of 94.10%, 92.05% and 97.63% on the BACH, CCG, and UCSB validation sets, respectively.
Conclusions: The DRAL + ADN framework is a potential candidate for boosting the performance of deep learning models for partially mislabeled training datasets.


Background
The convolutional neural network (CNN) has attracted the attention of the community since AlexNet [1] won the ILSVRC 2012 competition. The CNN has become one of the most popular classifiers in the area of computer vision. Owing to its outstanding performance, several researchers have started to use it for diagnostic systems. For example, Google Brain [2] proposed a multiscale CNN model for breast cancer metastasis detection in lymph nodes. However, the following challenges arise when employing the CNN for pathological image classification.
First, most pathological images have high resolutions (gigapixels). Figure 1a shows an example of a ThinPrep Cytology Test (TCT) image. In pathological images, there may be normal tissues around tumors. Hence, the patch-level labels may be inconsistent with the slice-level label. Figure 1b shows an example of a breast cancer histology image. The slice label is assigned to the normal patch marked with a red square. Such mislabeled patches may influence the subsequent network training and decrease classification accuracy.
In this paper, we propose a deep learning framework to classify pathological images. The main contributions can be summarized as follows:
1) An active learning strategy is proposed to remove mislabeled patches from the training set for deep learning networks. Compared to typical active learning, which iteratively trains a model with incrementally labeled data, the proposed strategy, deep-reverse active learning (DRAL), can be seen as a reverse of the typical process.
2) An advanced network architecture, atrous DenseNet (ADN), is proposed for classification of the pathological images. We replace the common convolution of DenseNet with the atrous convolution to achieve multiscale feature extraction.
3) Experiments are conducted on three pathological datasets. The results demonstrate the outstanding classification accuracy of the proposed DRAL + ADN framework.

Active Learning
Active learning (AL) aims to decrease the cost of expert labeling without compromising classification performance [4]. This approach first selects the most ambiguous/uncertain samples in the unlabeled pool for annotation and then retrains the machine learning model with the newly labeled data. Consequently, this augmentation increases the size of the training dataset. Wang [4] proposed the first active learning approach for deep learning. The approach used three metrics for data selection: least confidence, margin sampling, and entropy. Rahhal et al. [5] suggested using entropy and Breaking-Ties (BT) as confidence metrics for selection of electrocardiogram signals in the active learning process. Researchers recently began to employ active learning for medical image analysis. Yang [6] proposed an active learning-based framework (a stack of fully convolutional networks (FCNs)) to address the task of segmentation of biomedical images. The framework adopted the FCN results as the metric for uncertainty and similarity. Zhou [7] proposed a method called active incremental fine-tuning (AIFT) to integrate active learning and transfer learning into a single framework. The AIFT was tested on three medical image datasets and achieved satisfactory results. Nan [8] made the first attempt at employing active learning for analysis of pathological images. In this study, an improved active learning-based framework (reiterative learning) was proposed to leverage the requirement of a human prediction.
Although active learning is an extensively studied area, it is not appropriate for the task of patch-level pathological image classification. The aim of data selection for patch-level pathological image classification is to remove the mislabeled patches from the training set, which is different from traditional active learning, i.e., incremental augmentation of the training set. To address this challenge, we propose deep-reverse active learning (DRAL) for patch-level data selection. We acknowledge that the idea of reverse active learning was proposed in 2012 [9]. Therefore, we highlight the differences between the RAL proposed in that study and ours. First, the typical RAL [9] is proposed for clinical language processing, while ours is for 2-D pathological images. Consequently, the criteria for removing mislabeled (negative) samples are totally different. Second, the typical RAL [9] is developed on the LIBSVM software. In contrast, we adopt the deep learning network as the backbone of the machine learning algorithm, and remove the noisy samples by using the data augmentation approach of deep learning.

Deep Learning-based Pathological Image Analysis
The development of the deep convolutional network was inspired by Krizhevsky, who won the ILSVRC 2012 competition with the eight-layer AlexNet [1]. In the following competitions, a number of new networks, such as VGG [10] and GoogLeNet [11], were proposed. He et al. [12], the ILSVRC 2015 winner, proposed a much deeper convolutional network, ResNet, to address the training problem of ultradeep convolutional networks. Recently, the densely connected network (DenseNet) proposed by Huang [13] outperformed the ResNet on various datasets.
In recent years, an increasing number of deep learning-based computer-aided diagnosis (CAD) models for pathological images have been proposed. Albarqouni [14] developed a new deep learning network, AggNet, for mitosis detection in breast cancer histology images. A completely data-driven model that integrated numerous biological salient classifiers was proposed by Shah [15] for invasive breast cancer prognosis. Chen [16] proposed a framework based on FCN for segmentation of glands. Li [17] proposed an ultradeep residual network for segmentation and classification of human epithelial type-2 (HEp-2) specimen images. More recently, Liu [18] developed an end-to-end deep learning system to directly predict the H-Score for breast cancer tissue. All the aforementioned algorithms crop patches from pathological images to augment the training set, and achieve satisfactory performance on specific tasks. However, we noticed that few of the presented CAD systems use the DenseNet state-of-the-art network architecture, which leaves some margin for performance improvement. In this paper, we propose a deep neural network called ADN for analysis of pathological images. The proposed framework significantly outperforms the benchmark models and achieves excellent classification accuracy on two types of pathological datasets: breast and cervical slices.

Atrous Convolution & DenseNet
The proposed atrous DenseNet (ADN) is inspired by atrous convolution (or dilated convolution) and the DenseNet state-of-the-art network architecture [13]. In this section, we first present the definitions of atrous convolution and the original dense block.

Atrous Convolution
The atrous convolution (or dilated convolution) was employed to improve the semantic segmentation performance of deep learning-based models [19]. Compared to the common convolution layer, the convolutional kernels in the atrous convolution layer have "holes" between parameters that enlarge the receptive field without increasing the number of parameters. The size of the "holes" inserted into the parameters is calculated based on the dilation rate (γ). As shown in Fig. 2, a smaller dilation rate results in a more compact kernel (the common convolution can be seen as a special case with dilation rate = 1), while a larger dilation rate produces an expanded kernel. A kernel with a larger dilation rate can capture more context information from the feature maps of the previous layer.
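As a rough illustration of how the dilation rate enlarges the kernel, the sketch below inserts γ − 1 zeros between the taps of a 1-D kernel, so that a kernel with k taps spans k + (k − 1)(γ − 1) positions. The function names and the example kernel are illustrative assumptions and are not taken from the original implementation.

```python
from typing import Dict, List


def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Span of a kernel once (dilation - 1) holes separate neighbouring taps."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def dilate_kernel(kernel: List[float], dilation: int) -> List[float]:
    """Insert zeros ("holes") between the taps of a 1-D kernel."""
    dilated = [0.0] * effective_kernel_size(len(kernel), dilation)
    dilated[::dilation] = kernel  # original weights land every `dilation` steps
    return dilated


# Hypothetical 3-tap kernel; dilation rate 1 is the common convolution.
kernel = [1.0, 2.0, 3.0]
sizes: Dict[int, int] = {rate: effective_kernel_size(3, rate) for rate in (1, 2, 3)}
dilated_2 = dilate_kernel(kernel, 2)
```

The number of weights stays at three in every case; only the span they cover grows with the dilation rate.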

Dense Block
The dense block adopted in the original DenseNet is introduced in [13]. Let $H_l(\cdot)$ be a composite function of operations such as convolution and rectified linear units (ReLU). The output of the $l$-th layer ($x_l$) for a single image $x_0$ can be written as follows:
$$x_l = H_l([x_0, x_1, \ldots, x_{l-1}]) \quad (1)$$
where $[x_0, x_1, \ldots, x_{l-1}]$ refers to the concatenation of the feature maps produced by layers $0, \ldots, l-1$.
If each function $H_l(\cdot)$ produces $k$ feature maps, the $l$-th layer consequently has $k_0 + k \times (l - 1)$ input feature maps, where $k_0$ is the number of channels of the input layer and $k$ is called the growth rate of the DenseNet block.
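The channel count above can be checked with a short worked example. The values k0 = 16 and k = 8 below are hypothetical and chosen only for illustration; the function name is likewise an assumption.

```python
def dense_layer_input_channels(k0: int, growth_rate: int, layer_index: int) -> int:
    """Input feature maps of the l-th layer of a dense block: k0 + k * (l - 1)."""
    if layer_index < 1:
        raise ValueError("layer_index is 1-based")
    return k0 + growth_rate * (layer_index - 1)


# Hypothetical block: 16 input channels, growth rate 8, four layers.
channels = [dense_layer_input_channels(16, 8, layer) for layer in range(1, 5)]
```

Each layer receives the block input plus the k feature maps produced by every preceding layer, so the input width grows linearly with depth.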

Deep-Reverse Active Learning
To detect and remove the mislabeled patches, we propose a reversed process of traditional active learning. As overfitting of deep networks may easily occur, a simple six-layer CNN called RefineNet (RN) is adopted for our DRAL (see the appendix for the architecture). Let M represent the RN model in the CAD system, and let D represent the training set with m patches (x). The deep-reverse active learning (DRAL) process is illustrated in Algorithm 1.
The RN model is first trained and then makes predictions on the original patch-level training set. Patches whose maximum confidence level is lower than 0.5 are removed from the training set. As each patch is augmented to eight patches using data augmentation ("rotation" and "mirror"), if more than four of the augmented patches are removed, then the remaining augmented patches are also removed from the training set. Patch removal and model fine-tuning are performed in alternating sequence. A fixed validation set annotated by pathologists is used to evaluate the performance of the fine-tuned model. Using DRAL results in a decline in the number of mislabeled patches. As a result, the performance of the RN model on the validation set gradually improves. DRAL stops when the validation classification accuracy is satisfactory or stops increasing. The training set filtered by DRAL can be regarded as correctly annotated data and can be used to train deeper networks such as ResNet and DenseNet.
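The patch-removal rule of one DRAL iteration can be sketched as follows. This is a minimal illustration that follows the description literally (threshold 0.5 on the maximum class probability, eight augmented copies per patch); the data layout, function names and example probabilities are assumptions, and the model training, fine-tuning and validation steps are omitted.

```python
from typing import Dict, List, Sequence

CONFIDENCE_THRESHOLD = 0.5  # copies with maximum confidence below this are removed
MAX_REMOVED_COPIES = 4      # more than four removed copies discards the whole group


def filter_augmented_group(probabilities: Sequence[Sequence[float]]) -> List[int]:
    """Return the indices of the augmented copies of one patch that are kept."""
    kept = [
        index
        for index, class_probs in enumerate(probabilities)
        if max(class_probs) >= CONFIDENCE_THRESHOLD
    ]
    removed = len(probabilities) - len(kept)
    if removed > MAX_REMOVED_COPIES:
        return []  # the remaining copies are removed as well
    return kept


def dral_filter(predictions: Dict[str, Sequence[Sequence[float]]]) -> Dict[str, List[int]]:
    """Apply the removal rule to every source patch (keyed by a hypothetical id)."""
    return {patch_id: filter_augmented_group(p) for patch_id, p in predictions.items()}


# Hypothetical four-class predictions for three source patches.
confident = [0.90, 0.05, 0.03, 0.02]
uncertain = [0.40, 0.30, 0.20, 0.10]
predictions = {
    "patch_a": [confident] * 8,
    "patch_b": [uncertain] * 5 + [confident] * 3,
    "patch_c": [uncertain] * 2 + [confident] * 6,
}
kept = dral_filter(predictions)
```

In a full loop, the model would be fine-tuned on the kept patches and evaluated on the fixed validation set after each such filtering pass.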

Atrous DenseNet (ADN)
The size of cancer areas in pathological images varies widely. To better extract multiscale features, we propose a deep learning architecture, atrous DenseNet, for pathological image classification. Compared to common convolution kernels [11], atrous convolutions can extract multiscale features without extra computational cost. The network architecture is presented in Fig. 3.
The blue, red, orange and green rectangles represent the convolutional layer, max pooling layer, average pooling layer and fully connected layers, respectively. The proposed deep learning network has different architectures for shallow layers (atrous dense connection (ADC)) and deep layers (network-in-network module (NIN) [20]). PReLU is used as the nonlinear activation function. The network training is supervised by the softmax loss ($L$), as defined in Eq. 2:
$$L = \frac{1}{N}\sum_{i=1}^{N} L_i = \frac{1}{N}\sum_{i=1}^{N} -\log\left(\frac{e^{f_{y_i}}}{\sum_{j=1}^{K} e^{f_j}}\right) \quad (2)$$
where $f_j$ denotes the $j$-th element ($j \in [1, K]$, $K$ is the number of classes) of the vector of class scores $f$, $y_i$ is the label of the $i$-th input feature and $N$ is the number of training data. Our ADC replaces the common convolution in the original DenseNet blocks with atrous convolution, and a wider DenseNet architecture is designed by using wider densely connected layers.
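For reference, a minimal sketch of the standard softmax (cross-entropy) loss averaged over N samples is given below. It is written in plain Python for clarity rather than taken from the Keras implementation; the function name and the zero-based labels are assumptions.

```python
import math
from typing import Sequence


def softmax_loss(scores: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Mean of -log(exp(f[y]) / sum_j exp(f[j])) over all samples."""
    total = 0.0
    for class_scores, label in zip(scores, labels):
        shift = max(class_scores)  # subtract the maximum for numerical stability
        log_sum_exp = shift + math.log(sum(math.exp(s - shift) for s in class_scores))
        total += log_sum_exp - class_scores[label]
    return total / len(labels)


# With uniform scores over K = 4 classes, the loss equals log(4).
uniform_loss = softmax_loss([[0.0, 0.0, 0.0, 0.0]], [0])
```

A confident, correct prediction drives the loss towards zero, while a uniform prediction over K classes gives log K.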

Atrous Convolution Replacement
The original dense block achieved multiscale feature extraction by stacking 3 × 3 convolutions. As the atrous convolution has a larger receptive field, the proposed atrous dense connection block replaces the common convolutions with the atrous convolution to extract better multiscale features. As shown in Fig. 4, atrous convolutions with two dilation rates (2 and 3) are involved in the proposed ADC block. A common 3 × 3 convolution is placed after each atrous convolution to fuse the extracted feature maps and refine the semantic information.
We note that some studies have already used stacked atrous convolutions for semantic segmentation [21]. The proposed ADC addresses two primary drawbacks of the existing framework. First, the dilation rates used in the existing framework are much larger (2, 4, 8 and 16) than those of the proposed ADC block. As a result, the receptive field of the existing network normally exceeds the patch size and requires multiple zeros as padding for the convolution computation. Second, the architecture of the existing framework has no shortcut connections, which is not appropriate for multiscale feature extraction.
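The effect of the dilation rates on the receptive field can be estimated with a short calculation. The sketch below assumes stride-1 layers with 3 × 3 kernels stacked sequentially; the exact layer order of the ADC block (atrous rate 2, common convolution, atrous rate 3, common convolution) is an assumption read from the description above, not a confirmed detail of Fig. 4.

```python
from typing import Iterable, Tuple


def stacked_receptive_field(layers: Iterable[Tuple[int, int]]) -> int:
    """Receptive field of stride-1 layers given as (kernel_size, dilation) pairs."""
    receptive_field = 1
    for kernel_size, dilation in layers:
        receptive_field += (kernel_size - 1) * dilation
    return receptive_field


# Assumed longest path through one ADC block.
adc_path = [(3, 2), (3, 1), (3, 3), (3, 1)]
# Stacked atrous convolutions with rates 2, 4, 8 and 16 (3 x 3 kernels assumed).
large_rate_path = [(3, 2), (3, 4), (3, 8), (3, 16)]

adc_rf = stacked_receptive_field(adc_path)
large_rf = stacked_receptive_field(large_rate_path)
```

Under these assumptions the larger dilation rates cover a span roughly four times wider on the same feature map, which is why they demand more zero padding once pooling has reduced the feature map size.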

Wider Densely Connected Layer
As the number of pathological images in common datasets is usually small, it is difficult to use them to train an ultradeep network such as the original DenseNet. Zagoruyko [22] showed that a wider network may provide better performance than a deeper network when using small datasets. Hence, the proposed ADC increases the growth rate (k) from 4 to 8, 16 and 32, and decreases the number of layers (l) from 121 to 28. Thus, the proposed dense block is wide and shallow. To reduce the computational complexity and enhance the capacity of feature representation, the growth rate (the numbers in the ADC modules in Fig. 3) increases as the network goes deeper.

Implementation
To implement the proposed ADN, the Keras toolbox is used. The network was trained with a mini-batch of 16 on four GPUs (GeForce GTX TITAN X, 12GB RAM). Due to the use of batch normalization layers, the initial learning rate was set to a large value (0.05) for faster network convergence. Following that, the learning rate was decreased to 0.01, and then further decreased with a rate of 0.1. The label for a whole-slice pathological image (slice-level prediction) is rendered by fusing the patch-level predictions made by ADN (voting).
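The fusion of patch-level predictions into a slice-level label can be sketched as a simple majority vote. The text does not specify how ties are resolved; the version below falls back to the label encountered first, which is an assumption, as are the function name and the example labels.

```python
from collections import Counter
from typing import Hashable, Sequence


def slice_label_by_voting(patch_labels: Sequence[Hashable]) -> Hashable:
    """Return the most frequent patch-level label of one slice."""
    if not patch_labels:
        raise ValueError("at least one patch prediction is required")
    # most_common keeps first-encountered order among equal counts (assumed tie rule)
    return Counter(patch_labels).most_common(1)[0][0]


# Hypothetical patch-level predictions for a single slice.
slice_prediction = slice_label_by_voting(["benign", "malignant", "malignant", "benign", "malignant"])
```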

Datasets
Three datasets are used to evaluate the performance of the proposed model: the BreAst Cancer Histology (BACH), Cervical Carcinoma Grade (CCG), and UCSB breast cancer datasets. While independent test sets are available for BACH and CCG, only a training and validation set are available for UCSB due to the limited number of images. While training and validation sets for the three datasets are first used to evaluate the performance of the proposed DRAL and ADN against popular networks such as AlexNet, VGG, ResNet and DenseNet, the independent test sets are used to evaluate the performance of the proposed approach against the state-of-the-art approach using public testing protocols.

BreAst Cancer Histology dataset (BACH)
The BACH dataset [23] consists of 400 pieces of 2048 × 1536 Hematoxylin and Eosin (H&E) stained breast histology microscopy images, which can be divided into four categories: normal (Nor.), benign (Ben.), in situ carcinoma (C. in situ), and invasive carcinoma (I. car.). Each category has 100 images. The dataset is randomly divided with an 80:20 ratio for training and validation. Examples of slices from the different categories are shown in Fig. 5. The extra 20 H&E stained breast histological images from the Bioimaging dataset [24] are adopted as a testing set for the performance comparison of our framework and benchmarking algorithms.
We slide a window with a 50% overlap over the whole image to crop patches with a size of 512 × 512. The cropping produces 2800 patches for each category. Rotation and mirroring are used to increase the training set size. Each patch is rotated by 90°, 180° and 270° and then reflected vertically, resulting in an augmented training set with 89,600 images. The slice-level labels are assigned to the generated patches.
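The patch counts for BACH can be reproduced with a short calculation. The sketch assumes that the 80:20 split is applied within each category (80 training images per category) and that the eight augmented copies are the four rotations (0°, 90°, 180°, 270°) of each patch together with their reflections; both points are inferred from the text rather than stated explicitly.

```python
def window_positions(length: int, window: int, stride: int) -> int:
    """Number of sliding-window positions along one image dimension."""
    return (length - window) // stride + 1


WINDOW = 512
STRIDE = WINDOW // 2  # 50% overlap

# 2048 x 1536 images: 7 positions horizontally, 5 vertically.
patches_per_image = window_positions(2048, WINDOW, STRIDE) * window_positions(1536, WINDOW, STRIDE)

TRAINING_IMAGES_PER_CATEGORY = 80  # 80% of 100 images (assumed per-category split)
NUM_CATEGORIES = 4
AUGMENTATION_FACTOR = 8            # 4 rotations x {original, reflected} (assumed)

patches_per_category = patches_per_image * TRAINING_IMAGES_PER_CATEGORY
augmented_training_set = patches_per_category * NUM_CATEGORIES * AUGMENTATION_FACTOR
```

The result, 2800 patches per category and 89,600 augmented patches in total, agrees with the original BACH training set size reported in the DRAL evaluation.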

Cervical Carcinoma Grade dataset (CCG)
The CCG dataset contains 20 H&E-stained whole-slice ThinPrep Cytology Test (TCT) images, which can be classified in four grades: normal and cancer-level I (L. I), II (L. II), III (L. III). The five slices in each category are separated according to a 60:20:20 ratio for training, validation and testing. The resolution of the TCT slices is 16,473 × 21,163. Figure 6 presents a few examples of slices from the different categories. The CCG dataset is populated by pathologists collaborating on this project using a whole-slice scanning machine. We crop the patches from the gigapixel TCT images to generate the patch-level training set. For each normal slice, approximately 20,000 patches of 224 × 224 are randomly cropped. For the cancer slices (Fig. 6b-d), as they have large background areas, we first binarize the TCT slices to detect the region of interest (RoI). Then, the cropping window is passed over the RoI for patch generation. The slice-level label is assigned to the produced patches. Rotation is used to increase the size of the training dataset. Each patch is rotated by 90°, 180° and 270° to generate an augmented training set with 362,832 images. The patch-level validation set consists of 19,859 patches cropped from the validation slices. All of them have been verified by the pathologists. The detailed information of the patch-level CCG dataset is presented in Table 1.

UCSB Breast Cancer dataset
The UCSB dataset contains 58 pieces of 896 × 768 breast cancer slices, which can be classified as benign (Ben.) (32) or malignant (Mal.) (26). The dataset is divided into training and validation sets according to a 75:25 ratio. Examples of UCSB images are shown in Fig. 7. We slide a 112 × 112 window over the UCSB slices to crop patches for network training and employ the same approach used for BACH to perform data augmentation. As many studies have reported their 4-fold cross-validation results on the UCSB dataset, we also conduct the same experiment for a fair comparison.
Fig. 8 Illustrations of mislabeled patches. The first, second and third rows list the normal patches mislabeled as cancer from the BACH, CCG, and UCSB datasets, respectively. All the patches have been verified by pathologists.

Discussion of Preprocessing Approaches for Different Datasets
As previously mentioned, the settings for the preprocessing approaches (including the size of cropped patches and data augmentation) are different for each dataset. The reason is that the image size and quantity in each dataset are totally different. To generate more training patches, we select a smaller patch size (112 × 112) for the dataset with fewer, lower-resolution samples (UCSB) and a larger one (512 × 512) for the dataset with high-resolution images (BACH). We use the same data augmentation approach for the BACH and UCSB datasets. For the CCG dataset, the gigapixel TCT slices can yield more patches than the other two datasets. While horizontal and vertical flipping produce limited improvements in classification accuracy, they significantly increase the time cost of the network training. Hence, we only adopt three rotations to augment the training patches of the CCG dataset.

Evaluation Criterion
The overall correct classification rate (ACA) of all the testing images is adopted as the criterion for performance evaluation. In this section, we will first evaluate the performance of DRAL and ADN on the BACH, CCG, and UCSB validation sets. Next, the results from applying different frameworks to the separate testing sets will be presented. Note that the training and testing of the neural networks are performed three times in this study, and the average ACAs are reported as the results.

Evaluation of DRAL

Classification Accuracy during DRAL
The proposed DRAL adopts RefineNet (RN) to remove mislabeled patches from the training set. As presented in Table 2, the size of the training set decreases from 89,600 to 86,858 for BACH, from 362,832 to 360,563 for CCG, and from 68,640 to 64,200 for UCSB. Figure 8 shows some examples of mislabeled patches identified by the DRAL; most of them are normal patches labeled as breast or cervical cancer. The ACAs on the validation set during the patch filtering process are presented in Table 2. It can be observed that the proposed DRAL significantly increases the patch-level ACAs of RN: the improvements for BACH, CCG, and UCSB are 3.65%, 6.01%, and 17.84%, respectively.
Fig. 9 Examples of retained and discarded patches of BACH images. The patches marked with red and blue boxes are respectively recognized as "mislabeled" and "correctly annotated" by our RAL.
Fig. 10 The t-SNE figures of the last fully connected layer of RefineNet for different iterations K of the BACH training process. a-e are for K = 0, 1, 2, 3, 4, respectively.
To better analyze the difference between the patches retained and discarded by our DRAL, an example of a BACH image containing the retained and discarded patches is shown in Fig. 9. The patches with blue and red boxes are respectively marked as "correctly annotated" and "mislabeled" by our DRAL. It can be observed that patches in blue boxes contain parts of breast tumors, while those in the red boxes only contain normal tissues.
In Fig. 10, the t-SNE [25] is used to evaluate the RefineNet's capacity for feature representation during different iterations of the BACH training process. The points in purple, blue, green and yellow respectively represent the normal, benign, carcinoma in situ, and invasive carcinoma samples. It can be observed that the RefineNet's capacity for feature representation gradually improved (the different categories of samples are gradually separated during DRAL training). However, Fig. 10e shows that the RefineNet, after the fourth training iteration (K=4), leads to the misclassification of some carcinoma in situ (green) and normal samples (purple) as invasive carcinoma (yellow) and carcinoma in situ (green), respectively.

CNN Models trained with the Refined Dataset
The DRAL refines the training set by removing the mislabeled patches. Hence, the information contained in the refined training set is more accurate and discriminative, which is beneficial for the training of a CNN with a deeper architecture. To demonstrate the advantages of the proposed DRAL, several well-known deep learning networks such as AlexNet [1], VGG-16 [10], ResNet-50/101 [12], and DenseNet-121 [13] are used for the performance evaluation. These networks are trained on the original and refined training sets and evaluated on the same fully annotated validation set. The evaluation results are presented in Table 3 (patch-level ACA) and Table 4 (slice-level ACA).
As shown in Tables 3 and 4, for all three datasets, the classification accuracy of networks trained on the refined training set is better than that of networks trained on the original training set. The greatest improvements in patch-level ACA obtained with DRAL are 4.49% for AlexNet on BACH, 6.57% for both AlexNet and our ADN on CCG, and 18.91% for VGG on UCSB. For the slice-level ACA, the proposed DRAL improves the performance of our ADN from 88.57% to 97.50% on BACH, from 75% to 100% on CCG, and from 90% to 100% on UCSB.
The results show that mislabeled patches in the original training sets have negative influences on the training of deep learning networks and decrease classification accuracy. Furthermore, the refined training set produced by the proposed DRAL is useful for general deep learning networks such as shallow networks (AlexNet), wide networks (VGG-16), multibranch deep networks (ResNet-50) and ultradeep networks (ResNet-101 and DenseNet-121).
without the DRAL. This section presents a more comprehensive performance analysis of the proposed ADN.

ACA on the BACH Dataset
The patch-level ACA of different CNN models for each category of BACH is listed in Table 5. All the models are trained with the training set refined by DRAL. The average ACA (Ave. ACA) is the overall classification accuracy of the patch-level validation set. The Ave. ACA results are shown in Fig. 11. As shown in Table 5, the proposed ADN achieves the best classification accuracy for the normal (96.30%) and invasive carcinoma (94.23%) patches, while the ResNet-50 and DenseNet-121 yield the highest ACAs for benign (94.50%) and carcinoma in situ (95.73%) patches. The ACAs of our ADN for benign and carcinoma in situ are 92.36% and 93.50%, respectively, which are competitive compared to the performance of other state-of-the-art approaches. The average ACA of ADN is 94.10%, which outperforms the listed benchmarking networks.
To further evaluate the performance of the proposed ADN, its corresponding confusion map on the BACH validation set is presented in Fig. 12, which illustrates the excellent performance of the proposed ADN for classifying breast cancer patches.

ACA on the CCG Dataset
The performance evaluation is also conducted on CCG validation set, and Table 5 presents the experiment results.
For the patches cropped from normal and level III slices, the proposed ADN achieves the best classification accuracy (99.18% and 70.68%, respectively), which are 0.47% and 2.03% higher than the runner-up (VGG-16). The best ACAs for level I and II patches are achieved by ResNet-50 (99.10%) and ResNet-101 (99.88%), respectively. The proposed ADN generates competitive results (97.70% and 99.52%) for these two categories.
All the listed algorithms have low levels of accuracy for the patches from level III slices. To analyze the reasons for this low accuracy, the confusion map for the proposed ADN is presented in Fig. 13. It can be observed that some cancer level III patches are incorrectly classified as normal. A possible reason is that the tumor area in cancer level III is smaller than that of cancer levels I and II, so patches cropped from cancer level III slices usually contain normal areas. Therefore, the level III patches with large normal areas may be recognized as normal patches by ADN. We evaluated the other deep learning networks and again found that they incorrectly classify the level III patches as normal. To address the problem, a suitable approach that fuses the patch-level predictions with slice-level decisions needs to be developed.

ACA on the UCSB Dataset
Table 5 lists the patch-level ACAs of different deep learning frameworks on the UCSB validation set. It can be observed that our ADN achieves the best patch-level ACAs: 98.54% (benign) and 96.73% (malignant). The runner-up (VGG-16) achieves patch-level ACAs of 98.32% and 96.58%, which are 0.22% and 0.15% lower than the proposed ADN. The ResNet-50/101 and DenseNet yield similar performances (average ACAs are approximately 96%), while the AlexNet generates the lowest average ACA of 93.78%.

Statistical Validation
A t-test was conducted on the results from VGG-16 and our ADN. At the 5% significance level, the p-values are 1.07%, 2.52% and 13.08% for BACH, CCG, and UCSB, respectively. The results indicate that the accuracy improvement is statistically significant for BACH and CCG. As the number of images (58) in UCSB is quite small, the problem might not be challenging enough. Consequently, the deep learning networks, including VGG-16 and our ADN, yield similar classification accuracy levels on the UCSB dataset; that is, no statistically significant difference is observed between the results produced by different models.

Network Size
As previously mentioned, instead of building a deeper network, the proposed ADN adopts wider layers to increase its feature representation capacity, which is more suitable for small datasets. To further illustrate the excellent capacity of the proposed ADN, a comparison of network size between different network architectures is presented in Table 6. In the experiments, the wider networks, VGG-16 (16 layers) and ADN (28 layers), achieved better performances than the ultradeep networks, ResNet-50/101 (50/101 layers) and DenseNet (121 layers). Since the VGG-16 and ADN have a much smaller model size than the ultradeep networks, they require fewer network parameters and have a lower risk of overfitting to a small dataset.
Compared to the straightforward VGG-16, the proposed ADN uses multiple atrous convolutions to extract multiscale features. As shown in Fig. 11, the proposed ADN outperforms the VGG-16 and produces the best average ACAs for the BACH (94.10%), CCG (92.05%) and UCSB (97.63%) datasets. The experiment results also demonstrate that the proposed ADN can maintain the balance between network size and feature learning capacity, which is extremely effective for small pathological datasets.

Comparison with State-of-the-art approaches
In this section, we compare the performance of the proposed framework with other state-of-the-art approaches on the BACH, CCG, and UCSB testing sets. For the UCSB dataset, the public protocol of 4-fold cross-validation is used to make the results directly comparable. For better performance evaluation, we include the F-measure (Fmea.) as an additional evaluation metric for BACH and CCG, which can be defined as:
$$\text{F-measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}, \quad \text{Precision} = \frac{TP}{TP + FP}, \quad \text{Recall} = \frac{TP}{TP + FN} \quad (3)$$
where TP, FP and FN stand for true positive, false positive and false negative, respectively.
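As a quick worked example of the F-measure, the sketch below computes precision, recall and their harmonic mean from TP, FP and FN counts. The counts used are hypothetical and serve only to illustrate the arithmetic; returning 0.0 for an empty denominator is a convention assumed here.

```python
def precision(tp: int, fp: int) -> float:
    """TP / (TP + FP); 0.0 when there are no positive predictions (assumed convention)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """TP / (TP + FN); 0.0 when there are no positive samples (assumed convention)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f_measure(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Hypothetical counts for one class: 8 true positives, 2 false positives, 2 false negatives.
example_f = f_measure(tp=8, fp=2, fn=2)
```

For these counts, precision and recall are both 0.8, so the F-measure is also 0.8.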

Patch-level and Slice-level ACA on BACH
The extra 20 H&E stained breast histological images from a publicly available dataset (Bioimaging [24]) are employed as the testing set for the frameworks trained on BACH. As Bioimaging is a publicly available dataset, the public testing protocol is used and the state-of-the-art results [24] are directly used for comparison. The results on the testing set are listed in Table 7 (Precision (Pre.), Recall (Rec.)). As shown in Table 7, the proposed ADN achieves the best average patch-level classification performance (77.08% on the testing set), which is 0.83% higher than the runner-up (DenseNet-121). The ADN trained with the training set refined by DRAL leads to a further improvement of 5.42% for the final classification accuracy. Accordingly, the slice-level average classification accuracy (90%) of the proposed ADN + DRAL framework is the highest among the listed benchmarking algorithms.

Patch-level and Slice-level ACA on CCG
The results for the CCG testing set are presented in Table 8. The proposed ADN achieved the best patch-level ACA (80.28%) among the models trained with the original training set, which is 2.51% higher than the runner-up (VGG-16). Furthermore, we notice that most of the listed benchmark algorithms do not perform well for the cancer level I patches; the highest accuracy produced by the ultradeep ResNet-101 is only 67.34%. Our ADN achieves a patch-level ACA of 71.51% with a 28-layer architecture.
The proposed DRAL refines the training set by removing the mislabeled patches, which benefits the subsequent network training. As a result, the DRAL training strategy yields significant improvements for both average patch-level ACA (6.77%) and average slice-level ACA (25%) when using the proposed ADN framework.

Patch-level and Slice-level ACA on UCSB
The results of the 4-fold cross-validation conducted on the UCSB dataset are presented in Table 9. The baselines are obtained using Fisher Vector (FV) descriptors of different local features such as dense SIFT, patchwise DBN, and CNN features from the last convolutional layer (labeled as FV-SIFT, FV-DBN, and FV-CNN). The three FV descriptors are then combined into longer descriptors: S+D (combining FV-SIFT and FV-DBN), S+C (combining FV-SIFT and FV-CNN), D+C (combining FV-DBN and FV-CNN), and S+D+C (combining all three FV descriptors). The linear kernel SVM without dimensionality reduction and the SDR method proposed in [26] are used for classification. Table 9 shows that our ADN + DRAL achieves the best 4-fold cross-validation accuracy (100%), which outperforms the highest classification accuracy achieved by the benchmark approaches (98.3% yielded by SDR + SVM + FV-CNN).

Conclusions
Owing to the impressive performance of deep learning networks, researchers find them appealing for application to medical image analysis. However, pathological image analysis based on deep learning networks faces a number of major challenges. For example, most pathological images have high resolutions (gigapixels). It is difficult for a CNN to directly process gigapixel images because of the expensive computational costs. Cropping patches from whole-slice images is the common approach to address this problem. However, most of the pathological datasets only have slice-level labels. While the slice-level labels can be assigned to the cropped patches, the patch-level training sets usually contain mislabeled samples.
To address these challenges, we proposed a framework for pathological image classification. The framework consists of a training strategy, deep-reverse active learning (DRAL), and an advanced network architecture, atrous DenseNet (ADN). The proposed DRAL can remove the mislabeled patches in the training set. The refined training set can then be used to train widely used deep learning networks such as VGG-16 and the ResNets. The proposed ADN achieves multiscale feature extraction by combining atrous convolutions and dense blocks.
The proposed DRAL and ADN have been evaluated on three pathological datasets: BACH, CCG, and UCSB. The experiment results demonstrate the excellent performance of the proposed ADN + DRAL framework, achieving average patch-level ACAs of 94.10%, 92.05%, and 97.63% on BACH, CCG, and UCSB validation sets, respectively.

Appendix A: Architecture of RefineNet
To alleviate the overfitting problem, a simple CNN, namely RefineNet (RN), is adopted in the iterative Reverse Active Learning (RAL) process to remove mislabeled patches. The pipeline of RefineNet, which consists of convolutional (C), max pooling (MP), average pooling (AP) and fully connected (FC) layers, is presented in Table 10.