Classifying chest CT images as COVID-19 positive/negative using a convolutional neural network ensemble model and uniform experimental design method

Background To classify chest computed tomography (CT) images as positive or negative for coronavirus disease 2019 (COVID-19) quickly and accurately, researchers attempted to develop effective models by using medical images. Results A convolutional neural network (CNN) ensemble model was developed for classifying chest CT images as positive or negative for COVID-19. To classify chest CT images acquired from COVID-19 patients, the proposed COVID19-CNN ensemble model combines multiple trained CNN models with a majority voting strategy. The CNN models were trained to classify chest CT images by transfer learning from well-known pre-trained CNN models and by applying appropriate algorithm hyperparameters. The combination of algorithm hyperparameters for a pre-trained CNN model was determined by uniform experimental design. The chest CT images (405 from COVID-19 patients and 397 from healthy patients) used for training and performance testing of the COVID19-CNN ensemble model were obtained from an earlier study by Hu in 2020. Experiments showed that the COVID19-CNN ensemble model achieved 96.7% accuracy in classifying CT images as COVID-19 positive or negative, which was superior to the accuracies obtained by the individual trained CNN models. Other performance measures (i.e., precision, recall, specificity, and F1-score) obtained by the COVID19-CNN ensemble model were higher than those obtained by the individual trained CNN models. Conclusions The COVID19-CNN ensemble model had superior accuracy and excellent capability in classifying chest CT images as COVID-19 positive or negative.


Background
The rapid spread of coronavirus disease 2019 (COVID-19) since the beginning of 2020 has often exceeded the capacity of doctors and hospitals in many regions of the world. One effective tool for detecting COVID-19 is chest computed tomography (CT). Although a CT scan can be performed in several minutes, the time needed for a radiologist to review and classify the image is much longer. Therefore, tools for automatically detecting or diagnosing COVID-19 are extremely valuable and urgently needed.

Literature review
Gozes et al. [1] developed automated CT image analysis tools that used a Resnet-50 deep convolutional neural network (CNN) to detect coronavirus and to quantify the burden on healthcare systems. The study reported that deep-learning image analysis of thoracic CT images achieved 98.2% sensitivity, 92.2% specificity, and 0.996 area under the curve (AUC) in classifying results as positive or negative for coronavirus. Another COVID-19 diagnosis method, developed by Hu et al. [2], used a CNN with a ShuffleNet-v2 backbone to distinguish between CT images of patients with and without COVID-19 infection. Their experimental results indicated that the diagnostic model was accurate not only in identifying COVID-19, but also in distinguishing COVID-19 infections from other viral infections. Li et al. [3] developed a COVNet framework using Resnet-50 as the backbone that detected COVID-19 by using a neural network to extract visual features from volumetric chest CT exams. In independent testing, per-exam sensitivity in detecting COVID-19 was 90% (114 of 127), and per-exam specificity was 96% (294 of 307). Shan et al. [4] developed a modified 3-D convolutional neural network that combines V-Net with the bottleneck structure for automatically segmenting and quantifying infected regions in CT scans of COVID-19 patients. Quantitative evaluations indicated that the system was highly accurate in automatically delineating infected regions. Song et al. [5] developed a CT diagnosis system that used a detailed relation extraction neural network to identify COVID-19 patients. According to their experimental results, the model identified COVID-19 infection with recall (sensitivity) of 0.93. Wang et al. [6] proposed a transfer learning neural network based on the Inception network that used chest CT images to screen for COVID-19. Internal validation tests revealed that the model had an overall accuracy of 89.5% with specificity of 0.88 and sensitivity of 0.87.
In the external testing dataset, the model showed a total accuracy of 79.3% with specificity of 0.83 and sensitivity of 0.67. Xu et al. [7] established a Resnet-18 network with a location-attention mechanism that appeared promising for supplementary clinical use by frontline doctors in diagnosis and early screening of COVID-19 patients. Experiments performed using the benchmark dataset achieved 86.7% accuracy in screening CT images for COVID-19. In Yang et al. [8], a multi-task learning and self-supervised learning method of COVID-19 diagnosis based on CT images achieved an F1-score of 0.90, an AUC of 0.98, and an accuracy of 0.89. According to the senior radiologist in that study, the models performed well enough for clinical use. According to the above literature on COVID-19 screening, most researchers have used a single model to classify chest CT images. Compared to an ensemble model in which classification is based on the results of the majority, however, a single model is more likely to make classification errors. Moreover, no studies have discussed how algorithm hyperparameters affect classification accuracy in a pre-trained CNN model. Therefore, further research is needed to improve classification accuracy.

Motivation
The motivation of this study was to establish an ensemble model that uses a majority voting strategy to screen chest CT images for COVID-19. In a pre-trained CNN model, learning speed and quality are determined by algorithm hyperparameters that are set before the learning process begins. In subsequent training, different pre-trained CNN models may require different algorithm hyperparameters (e.g., optimizer, learning rate, and mini-batch size) to improve their classification accuracy [9]. The current study used uniform experimental design (UED) to generate the combinations of algorithm hyperparameters for a pre-trained CNN model. The experiments showed that the COVID19-CNN ensemble model had superior classification accuracy compared to a single model and excellent accuracy in classifying chest CT images as COVID-19 positive/negative.

Problem description
Chest CT and X-ray images are critical practical tools for diagnosis of COVID-19 because they can be used relatively quickly and easily to detect pneumonia-like symptoms of COVID-19. A recent study concluded that screening lung CT images is the best method of early-stage COVID-19 diagnosis and that CT should be the primary screening method [10]. The severe respiratory symptoms of COVID-19 result in relatively high ICU admission and mortality rates in these patients. Manifestations of COVID-19 in CT images differ from those of other viruses that cause pneumonia, e.g., influenza-A [7]. Therefore, CT images have untapped potential for use in COVID-19 diagnosis.
During a COVID-19 outbreak, overworked radiologists may have limited time to review CT scans. Additionally, radiologists in rural and/or under-developed areas may not be adequately trained to screen CT scans for an emerging disease such as COVID-19. The considered problem was how to screen large numbers of chest CT images for COVID-19 efficiently and accurately. Since a CT showing evidence of COVID-19 is difficult to distinguish from a normal CT, machine learning may be a useful tool for assisting radiologists in screening CT images for COVID-19.
The key slices of chest CT with suspected lesions were extracted from DICOM files by professional radiologists. All chest CT images used in the experiments in this study had been published previously [11]. The CT images were divided into two classes: COVID-19 and Normal. Figure 1 shows representative CT images in the two classes.

Results
The proposed COVID19-CNN ensemble model integrated multiple trained CNN models for classifying chest CT images as COVID-19 positive or negative. The pre-trained CNN models included VGG-19, Resnet-101, DenseNet-201, Inception-v3, and Inception-ResNet-v2. The chest CT images obtained from COVID-19 patients in Hu [11] were used for training and performance validation of pre-trained CNN models. The testing set of chest CT images from COVID-19 patients was used for performance evaluation of the COVID19-CNN ensemble model. The experimental environment was Matlab R2019 with its toolboxes developed by MathWorks, and GPU GTX-1080Ti-11G.
The experimental data for chest CT images from COVID-19 patients included a training set, a validation set, and a testing set. To maintain compatibility with the CNN-based architecture and the developed software, each CT image was processed as a 224 × 224 × 3 image for the VGG-19, Resnet-101, and DenseNet-201 models or as a 299 × 299 × 3 image for the Inception-v3 and Inception-ResNet-v2 models, where 3 is the number of color channels. Table 1 shows the training, validation, and testing sets of chest CT images from COVID-19 patients.
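The input-size requirement described above can be illustrated with a minimal Python sketch. It is not the preprocessing used in the study, which was carried out in Matlab and whose interpolation method is not stated. The sketch assumes a single-channel (grayscale) input, uses nearest-neighbor resizing as a stand-in, and replicates the channel three times; the names `INPUT_SIZE`, `resize_nearest`, `to_three_channels`, and `prepare` are hypothetical.

```python
# Input sizes stated in the text for each pre-trained CNN model.
INPUT_SIZE = {
    "VGG-19": (224, 224),
    "Resnet-101": (224, 224),
    "DenseNet-201": (224, 224),
    "Inception-v3": (299, 299),
    "Inception-ResNet-v2": (299, 299),
}


def resize_nearest(image, out_h, out_w):
    """Resize a 2-D list of pixel values by nearest-neighbor sampling.

    Assumption: nearest-neighbor is only a stand-in; the source does not
    say which interpolation method was used.
    """
    in_h, in_w = len(image), len(image[0])
    return [
        [image[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


def to_three_channels(gray):
    """Replicate a grayscale value into three identical color channels."""
    return [[(p, p, p) for p in row] for row in gray]


def prepare(image, model_name):
    """Return an H x W x 3 nested list sized for the named model."""
    height, width = INPUT_SIZE[model_name]
    return to_three_channels(resize_nearest(image, height, width))


# Hypothetical 4 x 4 grayscale image used only to exercise the functions.
toy_image = [[0, 64, 128, 255] for _ in range(4)]
prepared = prepare(toy_image, "Inception-v3")
print(len(prepared), len(prepared[0]), len(prepared[0][0]))
```

Keeping the size lookup in one table makes it harder to feed a 224 × 224 image to one of the Inception-based models by mistake.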
For training, different pre-trained CNN models require different algorithm hyperparameters, which are set before the learning process begins. The algorithm hyperparameters for pre-trained CNN models in this study were 'Optimizer', 'MiniBatchSize', 'MaxEpochs', and 'InitialLearnRate'. Optimizer was the training algorithm (solver) used to update the network weights. MiniBatchSize was the number of images in the mini-batch processed at each training iteration. MaxEpochs was the maximum number of training epochs. InitialLearnRate was the learning rate used at the start of training.
The UED table with the minimum number of experiments for four factors is $U_7$. Tables 2 and 3 show the seven-level uniform layout $U_7(7^6)$ and its selection table, respectively. Table 4 shows the $U_7(7^4)$ layout, which was obtained by selecting the columns for four factors according to Table 3 and was used to design the combinations of the four algorithm hyperparameters at seven levels. The levels for the 'Optimizer' hyperparameter were 'adam' (adaptive moment estimation) and 'sgdm' (stochastic gradient descent with momentum). The values for the 'MiniBatchSize' hyperparameter ranged from 10 to 40. The values for the 'MaxEpochs' hyperparameter ranged from 4 to 10. The values for the 'InitialLearnRate' hyperparameter were $10^{-4}$, $10^{-5}$, and $10^{-6}$. Table 5 shows the level values of the four algorithm hyperparameters for a pre-trained CNN model. Table 6 shows the seven combinations of the four algorithm hyperparameters, which were obtained by combining the layout in Table 4 with the level values in Table 5 and were used in a pre-trained CNN model for classifying chest CT images as COVID-19 positive or negative. According to the hyperparameter combination plan in Table 6, five independent experimental runs were performed for each hyperparameter combination.
Table 7 shows the average correct rates and standard deviations (SDs) obtained in five independent experimental runs when VGG-19 was used with each algorithm hyperparameter combination in Table 6 to classify chest CT images as COVID-19 positive or negative. In the accompanying training plot, the blue line shows the progressive improvement in accuracy for the training set, and the black line shows the progressive improvement in accuracy for the validation set.
Table 8 shows the average correct rates and SDs obtained in five independent experimental runs when Resnet-101 was used with each algorithm hyperparameter combination in Table 6 to classify chest CT images as COVID-19 positive/negative.
Table 9 shows the average correct rates and SDs obtained for the training and validation sets when each algorithm hyperparameter combination in Table 6 was used in five independent experimental runs of DenseNet-201 to classify chest CT images as COVID-19 positive/negative.
Table 10 shows the average correct rates and SDs obtained when each algorithm hyperparameter combination in Table 6 was used in five independent experimental runs of Inception-v3 to classify chest CT images as COVID-19 positive/negative in the training and validation sets. The Inception-v3#7 model had average correct rates of 98.89% and 86.67% in the training and validation sets, respectively. The Inception-v3#7 model also had small SDs of 0.00355 and 0.01317 in the training and validation sets, respectively. In the Inception-v3#7 model, the hyperparameter combination with the best performance in classifying chest CT images as COVID-19 positive/negative was Optimizer of 'adam', MiniBatchSize of 40, MaxEpochs of 10, and InitialLearnRate of $10^{-4}$. Figure 5 shows how [...]
Table 11 shows the average correct rates and SDs obtained when each algorithm hyperparameter combination in Table 6 was used in five independent experimental runs of Inception-ResNet-v2 to classify chest CT images as COVID-19 positive/negative in the training and validation sets. The Inception-ResNet-v2#3 model had average correct rates of 98.20% and 88.08% in the training and validation sets, respectively. The Inception-ResNet-v2#3 model also had small SDs of 0.00475 and 0.01807 in the training and validation sets, respectively.
In the Inception-ResNet-v2#3 model, the hyperparameter combination with high performance in classifying chest CT images as COVID-19 positive/negative was Optimizer of 'adam', MiniBatchSize of 35, MaxEpochs of 5, and InitialLearnRate of $10^{-4}$.
Table 12 shows the high classification accuracy obtained by the Resnet-101#3, Resnet-101#7, DenseNet-201#3, DenseNet-201#7, Inception-v3#7, Inception-ResNet-v2#3, and Inception-ResNet-v2#7 models. The SDs on the training set for the seven models were between 0.003 and 0.0065, indicating that their classification ability was quite stable. The seven models had average correct rates exceeding 0.86 on the validation set, although the average correct rate on the training set was approximately 10% higher than that on the validation set. Therefore, the seven models were selected for inclusion in the ensemble model for classifying chest CT images as COVID-19 positive/negative.
The COVID19-CNN ensemble model, which combined Resnet-101#3, Resnet-101#7, DenseNet-201#3, DenseNet-201#7, Inception-v3#7, Inception-ResNet-v2#3, and Inception-ResNet-v2#7, used a majority voting strategy to classify chest CT images as COVID-19 positive/negative. An image classified as COVID-19 positive by most models was considered a COVID-19 image, and an image classified as COVID-19 negative by most models was considered a Normal image. The COVID19-CNN ensemble model output the class selected by this majority vote.
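The majority voting rule described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the Matlab implementation used in the study; the function names and the per-model predictions are hypothetical. Because the ensemble has seven members and two classes, a tie cannot occur.

```python
from collections import Counter


def majority_vote(votes):
    """Return the class label chosen by most models for a single image."""
    label, _count = Counter(votes).most_common(1)[0]
    return label


def ensemble_predict(per_model_predictions):
    """Combine predictions from several models by majority vote.

    per_model_predictions holds one list per model; entry k of every list
    is that model's label for image k.
    """
    return [majority_vote(votes) for votes in zip(*per_model_predictions)]


# Hypothetical labels from seven trained models for three images.
predictions = [
    ["COVID-19", "Normal", "COVID-19"],
    ["COVID-19", "Normal", "Normal"],
    ["Normal", "Normal", "COVID-19"],
    ["COVID-19", "COVID-19", "Normal"],
    ["COVID-19", "Normal", "Normal"],
    ["Normal", "Normal", "COVID-19"],
    ["COVID-19", "COVID-19", "Normal"],
]

print(ensemble_predict(predictions))
```

In this toy example the third image receives three COVID-19 votes and four Normal votes, so the ensemble labels it Normal even though three individual models disagree.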

Discussion
This study found that setting an appropriate combination of algorithm hyperparameters for a pre-trained CNN model was very important for accurately classifying chest CT images as COVID-19 positive or negative. In the VGG-19#6 model, for example, the appropriate combination of the four algorithm hyperparameters for classifying CT images was Optimizer of 'sgdm', MiniBatchSize of 30, MaxEpochs of 7, and InitialLearnRate of $10^{-4}$. In the Resnet-101#7, DenseNet-201#7, Inception-v3#7, and Inception-ResNet-v2#7 models, the appropriate combination was Optimizer of 'adam', MiniBatchSize of 40, MaxEpochs of 10, and InitialLearnRate of $10^{-4}$. In Resnet-101#3, DenseNet-201#3, and Inception-ResNet-v2#3, the appropriate combination was Optimizer of 'adam', MiniBatchSize of 35, MaxEpochs of 5, and InitialLearnRate of $10^{-4}$. These results indicate that a pre-trained CNN model with a poor combination of algorithm hyperparameters cannot achieve high accuracy in classifying chest CT images as COVID-19 positive/negative. Although the contribution may be a relatively minor innovation from the perspective of novelty, the COVID19-CNN ensemble model provided increased accuracy by applying a majority voting strategy and an appropriate combination of algorithm hyperparameters.

Methods
The research procedure comprised the following steps: collecting data and processing chest CT images from COVID-19 patients; selecting multiple pre-trained CNN models for transfer learning; using UED to set algorithm hyperparameters for the pre-trained CNN models; using the pre-trained CNN models to screen chest CT images for COVID-19; comparing classification performance among the trained CNN models; selecting the highly accurate CNN models for further use in an ensemble model; and, finally, comparing classification performance among the trained CNN models. The detailed steps were as follows.

Collecting data and processing chest CT images from COVID-19 patients
The chest CT images from COVID-19 patients in Hu [11] were divided into a training set, a validation set, and a testing set.

Selecting multiple pre-trained CNN models for transfer learning
Transfer learning is a machine learning approach in which a model developed for one task is reused as the starting point for a model developed for another task. In transfer learning, a pre-trained CNN model is used to construct a predictive model. Thus, the first step is selecting a pre-trained CNN model from the available models. The second step is reusing the pre-trained CNN model. The third step is tuning the pre-trained CNN model for the new task. Depending on the input-output pair data available for the new task, the researcher may consider further modification or refinement of the pre-trained CNN model. Training a CNN model by transfer learning from a pre-trained model is typically much faster than training a CNN model without pre-training. The widely used commercial software program Matlab R2019 by MathWorks has been validated as effective for pre-training CNN models for deep learning. Most pre-trained CNN models were trained with a subset of the ImageNet database [12] used in the ImageNet Large-Scale Visual Recognition Challenge [13]. After training on more than 1 million images, the pre-trained CNN models could classify images into 1000 object categories, e.g., keyboard, coffee mug, pencil, and various animals. The most important characteristics of pre-trained CNN models are network accuracy, speed, and size. Choosing a pre-trained network is generally a tradeoff between these characteristics. The classification accuracy on the ImageNet validation set is the most common way to measure the accuracy of networks trained on ImageNet. Networks that are accurate on ImageNet are also often accurate when applied to other natural image data sets by transfer learning or feature extraction.
The VGG-19 [14], Resnet-101 [15], and DenseNet-201 [16] CNNs have 19 layers, 101 layers, and 201 layers, respectively, and have been trained on more than 1 million images from the ImageNet database. As a result, these CNNs have learned rich feature representations for a wide range of images and can classify images into 1000 object categories. The image input size for these CNNs is 224 × 224 × 3.
The 48-layer Inception-v3 [17] and the 164-layer Inception-ResNet-v2 [18] CNNs have been trained on more than 1 million images from the ImageNet database and can classify images into 1000 object categories. The image input size for these CNNs is 299 × 299 × 3.

Using UED to design algorithm hyperparameters for pre-trained CNN models
The UED method developed by Wang and Fang [19][20][21] uses space-filling designs to construct a set of experimental points uniformly scattered in a continuous design parameter space. Because UED considers only uniform dispersion and not comparable orderliness, it requires far fewer experiments than a design that tests every combination of factor levels.
Selecting appropriate algorithm hyperparameters for a pre-trained CNN model was essential for accurate screening of chest CT images for COVID-19. In this study, the algorithm hyperparameters for a pre-trained CNN model were Optimizer, MiniBatchSize, MaxEpochs, and InitialLearnRate. The combinations of algorithm hyperparameters obtained by UED were used in a pre-trained CNN model to classify chest CT images as COVID-19 positive/negative.
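A seven-level uniform layout of the kind used here can be generated with the good-lattice-point construction, in which entry (i, j) equals (i × h_j) mod n and a remainder of 0 is read as level n. The Python sketch below is illustrative only and makes two assumptions that the text does not confirm: that the four factors use columns 1, 2, 3, and 6 of $U_7(7^6)$, and that the MiniBatchSize and MaxEpochs levels are evenly spaced over the stated ranges (10 to 40 and 4 to 10). Under these assumptions, rows 3, 6, and 7 give the MiniBatchSize and MaxEpochs values reported in the Discussion for combinations #3, #6, and #7. The level assignment for Optimizer and InitialLearnRate is not given in the text, so those factors are left as level indices.

```python
from math import gcd


def uniform_layout(n):
    """Good-lattice-point layout: entry (i, j) = (i * h_j) mod n, with 0 -> n.

    The generators h_j are the integers below n that are coprime to n;
    for n = 7 these are 1..6, which gives the 7 x 6 layout U7(7^6).
    """
    generators = [h for h in range(1, n) if gcd(h, n) == 1]
    return [[(i * h) % n or n for h in generators] for i in range(1, n + 1)]


def select_columns(layout, columns):
    """Keep the given 1-based columns of a layout."""
    return [[row[c - 1] for c in columns] for row in layout]


def evenly_spaced(level, low, high, n_levels=7):
    """Map level 1..n_levels to an evenly spaced value in [low, high]."""
    step = (high - low) / (n_levels - 1)
    return low + round((level - 1) * step)


# ASSUMPTION: columns 1, 2, 3, and 6 are used for the four factors.
ASSUMED_COLUMNS = (1, 2, 3, 6)

layout = uniform_layout(7)
design = select_columns(layout, ASSUMED_COLUMNS)

for number, (opt_level, batch_level, epoch_level, lr_level) in enumerate(design, 1):
    # ASSUMPTION: evenly spaced levels over the ranges stated in the text.
    mini_batch_size = evenly_spaced(batch_level, 10, 40)
    max_epochs = evenly_spaced(epoch_level, 4, 10)
    print(number, opt_level, mini_batch_size, max_epochs, lr_level)
```

Each of the seven levels appears exactly once in every column, so seven experiments cover all levels of all four factors, whereas a full factorial design with four seven-level factors would require 7^4 = 2401 experiments.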

Screening chest CT images for COVID-19 by multiple pre-trained CNN models
Fine-tuning a pre-trained CNN model by transfer learning is often faster and easier than constructing and training a new CNN model for a new task. A pre-trained CNN model has already learned a rich set of image features, and it can be fine-tuned to learn features specific to a new dataset (i.e., chest CT images from COVID-19 patients in this study). Since a fine-tuned CNN model can learn to extract a different feature set, the final CNN model is often more accurate. The starting point for fine-tuning the deeper layers of the pre-trained CNN models used for transfer learning (i.e., VGG-19, Resnet-101, DenseNet-201, Inception-v3, and Inception-ResNet-v2) was training the networks with a new dataset of chest CT images from COVID-19 patients. Figure 7 is a flowchart of the transfer learning procedure used in the CNN model.

Comparing classification performance among different trained CNN models
In this study, five independent runs of VGG-19, Resnet-101, DenseNet-201, Inception-v3, and Inception-ResNet-v2 were performed to classify chest CT images as COVID-19 positive or negative by using each algorithm hyperparameter combination obtained by UED. The results recorded for the training set and the validation set included (1) accuracy in each run of the experiment, (2) average accuracy in five independent runs, and (3) standard deviation of accuracy in five independent runs. Accuracy was defined as the proportion of correct results (true positives and true negatives) among all classified cases. The highly accurate CNN models obtained after training VGG-19, Resnet-101, DenseNet-201, Inception-v3, and Inception-ResNet-v2 were selected for use in an ensemble model for classifying images in the testing set of chest CT images as COVID-19 positive or negative.
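The per-combination summary described above (average accuracy and standard deviation over five independent runs) can be computed as in the following Python sketch. The accuracy values are hypothetical, not results from the study, and the sketch assumes the sample standard deviation (n − 1 in the denominator) because the text does not state which form was used.

```python
from statistics import mean, stdev


def summarize_runs(accuracies):
    """Return (average accuracy, sample standard deviation) over the runs.

    ASSUMPTION: sample SD (n - 1 denominator); use statistics.pstdev
    instead if the population form is intended.
    """
    return mean(accuracies), stdev(accuracies)


# Hypothetical validation accuracies from five independent runs.
validation_accuracy = [0.85, 0.87, 0.86, 0.88, 0.84]

average, sd = summarize_runs(validation_accuracy)
print(f"average = {average:.4f}, SD = {sd:.5f}")
```

A small SD across runs indicates that a hyperparameter combination trains reproducibly, which is the stability criterion applied when the models were chosen for the ensemble.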
The classification performance of the different trained CNN models was compared in terms of accuracy, precision, recall (i.e., sensitivity), specificity, and F1-score values. Precision was assessed by positive predictive value (number of true positives over the number of true positives plus the number of false positives). Recall (sensitivity) was assessed by true positive rate (number of true positives over the number of true positives plus the number of false negatives). Specificity was measured by true negative rate (number of true negatives over the number of false positives plus the number of true negatives). The F1-score, a function of precision and recall, is a useful measure of prediction accuracy when classes are very imbalanced. In information retrieval, precision is a measure of the relevance of results, while recall is a measure of the number of truly relevant results returned. The formula for F1-score is 2 × (precision × recall)/(precision + recall).
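The five measures defined above can be computed directly from the four confusion-matrix counts, as in the following Python sketch. The function name and the counts are hypothetical and serve only to illustrate the formulas; COVID-19 is treated as the positive class.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the five performance measures from confusion-matrix counts.

    tp, fp, tn, fn: numbers of true positives, false positives,
    true negatives, and false negatives. No guard against zero
    denominators is included in this sketch.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "f1_score": 2 * (precision * recall) / (precision + recall),
    }


# Hypothetical counts for a test set of 100 images.
metrics = classification_metrics(tp=45, fp=5, tn=40, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```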