Convolutional neural network for automated mass segmentation in mammography

Background
Automatic segmentation and localization of lesions in mammogram (MG) images are challenging, even with advanced methods such as deep learning (DL). We developed a new model, based on the architecture of the semantic segmentation U-Net model, to precisely segment mass lesions in MG images. The proposed end-to-end convolutional neural network (CNN) based model extracts contextual information by combining low-level and high-level features. We trained the proposed model using large publicly available databases (CBIS-DDSM, BCDR-01, and INbreast) and a private database from the University of Connecticut Health Center (UCHC).

Results
We compared the performance of the proposed model with those of state-of-the-art DL models, including the fully convolutional network (FCN), SegNet, Dilated-Net, the original U-Net, and Faster R-CNN, as well as the conventional region growing (RG) method. The proposed Vanilla U-Net model significantly outperforms the Faster R-CNN model in terms of runtime and the Intersection over Union (IOU) metric. Trained on digitized film-based and fully digitized MG images, the proposed Vanilla U-Net model achieves a mean test accuracy of 92.6%. It also achieves a mean Dice coefficient index (DI) of 0.951 and a mean IOU of 0.909, which show how closely the output segments match the corresponding lesions in the ground truth maps. Data augmentation was very effective in our experiments, increasing the mean DI from 0.922 to 0.951 and the mean IOU from 0.856 to 0.909.

Conclusions
The proposed Vanilla U-Net based model can be used for precise segmentation of masses in MG images, because the segmentation process incorporates more multi-scale spatial context and captures more local and global context to predict a precise pixel-wise segmentation map of an input full MG image.
These detected maps can help radiologists differentiate benign from malignant lesions depending on the lesion shapes. We show that using transfer learning, introducing augmentation, and modifying the architecture of the original model yield better performance in terms of the mean accuracy, the mean DI, and the mean IOU in detecting mass lesions, compared to the other DL models and the conventional model.
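As a concrete reference for the two overlap metrics reported above, the following is a minimal NumPy sketch of the Dice index and the IOU for binary masks. This is our own illustration, not the paper's evaluation code; the function names are our choices.

```python
import numpy as np

def dice_index(pred, gt):
    """Dice coefficient DI = 2|A ∩ B| / (|A| + |B|) for two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    denom = pred.sum() + gt.sum()
    return 2.0 * inter / denom if denom else 1.0

def iou(pred, gt):
    """Intersection over Union = |A ∩ B| / |A ∪ B| for two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 1.0
```

Both metrics are 1.0 for a perfect segmentation and 0.0 when the predicted mask misses the lesion entirely.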


Supplementary materials
Pre-processing
The adaptive median filter (AMF) [1] is a nonlinear filter that removes impulse noise while preserving edges and corners to improve the image quality. The contrast-limited adaptive histogram equalization (CLAHE) filter increases the contrast between the masses and their surrounding tissues [2][3][4][5]. The CLAHE [1] filter operates on small regions of the image, called tiles, rather than on the entire image. It calculates the contrast for each tile individually, producing local histograms. Each tile's contrast is enhanced, and neighboring tiles are then combined using bilinear interpolation to eliminate artificially induced boundaries. The contrast in homogeneous regions can be limited using a clipLimit factor to avoid amplifying any noise that might be present in the image. We used tiles of 8×8 and a clipLimit factor of 0.005 with the CLAHE technique. Figure 1 shows a sample of the combined data-set we used in our experiments. Figure 2 shows images containing suspicious areas and their associated pixel-level ground truth maps (GTMs). All full MGs and GTMs are converted into PNG format and re-sized to 512×512. All pixels in the GTM are labeled as belonging to the background (0) or a breast lesion (255) (see Fig. 2).
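To make the pre-processing steps concrete, the sketch below implements a simplified fixed-window median denoiser, a stand-in for the true AMF (which adaptively grows its window per pixel), together with the GTM binarization into background (0) and lesion (255) labels. The function names and the 3×3 window are our own illustrative choices.

```python
import numpy as np

def median_denoise(img, k=3):
    """Simplified stand-in for the AMF: replaces each pixel with the median
    of its k-by-k neighborhood to suppress impulse (salt-and-pepper) noise.
    The true AMF adaptively enlarges the window where needed."""
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.empty_like(img)
    h, w = img.shape
    for i in range(h):
        for j in range(w):
            out[i, j] = np.median(padded[i:i + k, j:j + k])
    return out

def binarize_gtm(mask):
    """Label every GTM pixel as background (0) or breast lesion (255)."""
    return np.where(mask > 127, 255, 0).astype(np.uint8)
```

An isolated bright impulse pixel surrounded by tissue of uniform intensity is replaced by that intensity, while step edges survive because the median of a window straddling an edge stays on one side of it.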

Semantic segmentation using FCN
Semantic segmentation is an active research area for medical images, where deep CNNs are used to classify each pixel in the image individually. Semantic segmentation produces a map image that is segmented by class. The fully convolutional network (FCN) [6] is an encoder-decoder network. The encoder path uses a pre-trained VGG16 model [7] and transfers its learned representations to the segmentation task by fine-tuning. The decoder path uses up-sampling operations and replaces the final fully connected layer (FCL) with an N×1×1 convolution layer, which outputs probabilities for N classes. A skip architecture is proposed in [6], in which features from shallow, fine layers are combined with features from deep, coarse layers to produce accurate and detailed segmentations, since aggressive up-sampling alone can lead to coarse segmentation maps. There are three versions of the FCN (FCN-32s, FCN-16s, FCN-8s) based on the VGG16 network [6]. In this research, we adapt the FCN-8s VGG16-based network [6] to our segmentation task. FCN-8s up-samples the final feature map by a factor of 8 after fusing feature maps from the third and fourth max-pooling layers.
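The FCN-8s fusion described above can be sketched in plain NumPy, using nearest-neighbour up-sampling as a stand-in for the learned deconvolutions. We assume single-channel class-score maps at strides 32, 16, and 8; this is our own illustration of the fusion order, not the network's actual implementation.

```python
import numpy as np

def upsample(x, factor):
    """Nearest-neighbour up-sampling, standing in for a learned deconvolution."""
    return x.repeat(factor, axis=0).repeat(factor, axis=1)

def fcn8s_fuse(score32, score16, score8):
    """FCN-8s style fusion of class-score maps at strides 32, 16, and 8:
    up-sample the coarsest map by 2, add the stride-16 scores, up-sample by 2
    again, add the stride-8 scores, then up-sample by 8 to full resolution."""
    x = upsample(score32, 2) + score16   # now at stride 16
    x = upsample(x, 2) + score8          # now at stride 8
    return upsample(x, 8)                # full-resolution score map
```

The final factor-of-8 up-sampling is why this variant is called FCN-8s; FCN-32s would instead up-sample the coarsest map directly by 32 with no fusion.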

Semantic segmentation using SegNet
The SegNet architecture [8] adopts the VGG16 network [7] in an encoder-decoder framework and drops the FCLs of the network. SegNet shares a similar architecture with the encoder-decoder U-Net described in the previous subsection. However, in SegNet, the indices at each max-pooling layer in the encoder contracting path are stored and later used to up-sample the corresponding feature map in the decoder by unpooling with those stored indices (Fig. 3). Storing the indices from the contracting path helps keep the high-frequency information intact; however, it misses neighboring information when unpooling from low-resolution feature maps. Finally, a Softmax classifier produces the final segmentation maps at the same resolution as the original MG image. In this work, we used a SegNet pre-initialized with layers and weights from a pre-trained VGG16 model, with an encoder depth of 5.
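A toy NumPy sketch of the index-preserving pooling and unpooling described above, for 2×2 windows on a single channel (our own illustration, not SegNet's actual implementation):

```python
import numpy as np

def max_pool_with_indices(x):
    """2x2 max-pool that also records where the max sat in each window."""
    h, w = x.shape
    pooled = np.zeros((h // 2, w // 2))
    idx = np.zeros((h // 2, w // 2), dtype=int)
    for i in range(0, h, 2):
        for j in range(0, w, 2):
            win = x[i:i + 2, j:j + 2]
            k = win.argmax()                  # flat index 0..3 inside window
            pooled[i // 2, j // 2] = win.flat[k]
            idx[i // 2, j // 2] = k
    return pooled, idx

def unpool(pooled, idx):
    """SegNet-style unpooling: each value returns to its stored position;
    the other three positions in the window stay zero."""
    h, w = pooled.shape
    out = np.zeros((h * 2, w * 2))
    for i in range(h):
        for j in range(w):
            di, dj = divmod(idx[i, j], 2)
            out[2 * i + di, 2 * j + dj] = pooled[i, j]
    return out
```

The zeros left in the unpooled map are exactly the "missing neighboring information" noted above; subsequent convolutions must fill them in.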

Semantic segmentation using Dilated-Net
Recently, the Dilated-Net [9], built on dilated (also known as atrous) convolutions, has been used in different image segmentation tasks [10][11][12][13][14]. Dilated convolutions [9] allow us to explicitly control the resolution at which feature responses are computed and to incorporate larger context without increasing the number of parameters or the amount of computation. We adopt the dilated CNN in [9] with some modifications to the network. The implemented dilated CNN architecture consists of ten cascaded 3×3 convolutional layers with dilation factors of 1, 1, 2, 4, 8, 16, 32, 1, 1, and 1 (Fig. 4). Figure 4 illustrates 3×3 convolution kernels with dilation factors of 1, 2, and 3. The last three layers are FCLs of 1×1 convolutions, each followed by a dropout of 0.5 [15]. The first nine convolutional layers are each followed by a batch normalization (BN) layer [16] and a ReLU activation function [17]. To classify the pixels, the last convolutional layer has two 1×1 convolutions, followed by a Softmax classifier.
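The claim that dilation enlarges context without extra parameters can be checked with a short calculation: a stride-1 stack of k×k convolutions grows the receptive field by (k−1)·dilation per layer. A small sketch (our own illustration):

```python
def receptive_field(dilations, k=3):
    """Receptive field (in pixels, one dimension) of a stride-1 stack of
    k-by-k dilated convolutions: each layer widens it by (k - 1) * dilation."""
    rf = 1
    for d in dilations:
        rf += (k - 1) * d
    return rf
```

With the first seven layers above (dilations 1, 1, 2, 4, 8, 16, 32), the receptive field already reaches 129×129 pixels, while every layer still holds only a 3×3 kernel; the same depth of undilated 3×3 layers would cover just 15×15.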

Localization using Faster R-CNN
We adapt the Faster R-CNN method proposed in [18] to compare its detection accuracy and inference time with those of the proposed Vanilla U-Net model. Faster R-CNN is based on a VGG16 model [7] with additional components for detecting, localizing, and classifying lesions in MG images. Faster R-CNN outputs a bounding box (BB) for each detected lesion, together with a score reflecting the confidence in the class of the lesion. The Faster R-CNN method in [18] is trained with our pre-processed and augmented data-set. Further details about the implemented Faster R-CNN method can be found in the original article [18]. One limitation stated in [18] is that its training data comes from a small publicly available pixel-level annotated data-set. In our study, however, we use our large combined data-set to reproduce their work.
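Since Faster R-CNN outputs bounding boxes rather than pixel maps, its overlap with the ground truth is naturally measured at the box level. A minimal sketch of box-level IOU (the (x1, y1, x2, y2) corner convention is our assumption for illustration):

```python
def box_iou(a, b):
    """IOU of two axis-aligned boxes given as (x1, y1, x2, y2) corners."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])   # intersection corners
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    iw, ih = max(0, ix2 - ix1), max(0, iy2 - iy1)  # clamp to zero if disjoint
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0
```

A detection is typically counted as correct when its box IOU with a ground-truth lesion exceeds a fixed threshold.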

Semantic segmentation using region growing (RG)
We also implemented the region growing (RG) model proposed in [19] and applied it to our MG images. RG is a traditional image segmentation CAD model that starts by selecting an initial seed point and then groups pixels or sub-regions into larger regions according to a similarity criterion. As RG results are sensitive to the initial seeds, automated, accurate seed selection is critical for image segmentation. Further details about the implemented RG method can be found in the original article [19].
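A minimal sketch of the RG idea described above, using a fixed intensity tolerance around the seed value as the similarity criterion (the seed-selection strategy and similarity criterion of [19] are more elaborate than this illustration):

```python
import numpy as np

def region_grow(img, seed, tol=10):
    """Grow a region from `seed`: add 4-connected pixels whose intensity
    lies within `tol` of the seed's intensity."""
    h, w = img.shape
    mask = np.zeros((h, w), dtype=bool)
    ref = float(img[seed])
    stack = [seed]
    while stack:
        i, j = stack.pop()
        if not (0 <= i < h and 0 <= j < w) or mask[i, j]:
            continue
        if abs(float(img[i, j]) - ref) <= tol:
            mask[i, j] = True
            stack += [(i + 1, j), (i - 1, j), (i, j + 1), (i, j - 1)]
    return mask
```

Moving the seed into a brighter structure, or loosening `tol`, changes the grown region entirely, which is exactly the seed sensitivity noted above.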
Comparison between state-of-the-art DL methods
Table 1 lists information about the architecture, the databases, the number of images, the evaluation metrics (i.e., accuracy (ACC.), area under the curve (AUC), and Dice index (DI)), TPR@FPR, and the testing time per image, as provided in the literature.

List of tables and figures
Table 1: Comparison between the proposed segmentation method and the current state-of-the-art DL methods for segmentation or localization of lesions in MG images.
Fig. 1: The databases used in our experiments.
Fig. 2: MG images and their corresponding GTMs.
Fig. 3: In SegNet, the indices at each max-pooling layer in the encoder contracting path are stored and later used to up-sample the corresponding feature map in the decoder by unpooling with those stored indices.
Fig. 4: Architecture of the Dilated-Net, containing ten convolutional layers with dilation factors (indicated in red) increasing from 1 in the first layer to 32 in the seventh layer. The last 1×1 convolutional layer is followed by a Softmax classifier.