VAnet
Overview
VAnet is a new type of medical image fusion model consisting of three parts: an encoder, an AM fusion network and a decoder. In Fig. 2, the encoder consists of five coding blocks, which correspond to the five blocks of VGG-16. The five feature maps obtained from these blocks carry the contextual semantic information of the image. The feature maps are then fed into the AM fusion network for multi-scale deep feature fusion. The AM fusion network consists of an attention mechanism branch and a residual multi-scale detail fusion branch. The attention mechanism branch comprises a channel attention mechanism block and five convolution blocks; the channel attention mechanism block suppresses noise, which is especially prominent in functional images. The residual multi-scale detail fusion branch includes three convolution blocks and a multi-scale detail fusion block; the multi-scale detail fusion block compensates for the loss of detail caused by the pooling operations in the attention mechanism branch. Finally, the fused feature map is input to the decoder to reconstruct the fused image.
Encoder
Traditional encoders tend to ignore the contextual information of feature maps during feature extraction. In practice, the pathological characteristics of a tissue are reflected not only in an isolated region but also in its context. Therefore, we select the VGG-16 network, which can capture contextual information, as the encoder.
As shown in Fig. 3, VGG-16 contains five blocks, and its main strength is its ability to capture image context. The first two blocks each consist of two convolutional layers and one max-pooling layer; the last three blocks each consist of three convolutional layers and one max-pooling layer. Stacking these blocks forms a deeper network structure that captures more complete and deeper contextual information. All convolutional layers use 3 × 3 kernels and all max-pooling layers use 2 × 2 windows. The first four blocks have different numbers of output channels, while the fourth and fifth blocks share the same number of output channels.
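The following is a minimal PyTorch sketch of such a five-block, VGG-16 style encoder. The single-channel input, the channel counts (standard VGG-16 values) and the omission of pretrained weights are our assumptions; the paper only specifies the kernel sizes, pooling sizes and block layout.

```python
import torch.nn as nn

def vgg_block(in_ch, out_ch, n_convs):
    """One coding block: n_convs 3x3 conv + ReLU layers followed by 2x2 max pooling."""
    layers = []
    for i in range(n_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, kernel_size=3, padding=1),
                   nn.ReLU(inplace=True)]
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)

class VGG16Encoder(nn.Module):
    """Five-block VGG-16 style encoder; the output of every block is kept
    so that all five multi-scale feature maps can be passed to the fusion network."""
    def __init__(self, in_ch=1):  # single-channel input is an assumption
        super().__init__()
        self.blocks = nn.ModuleList([
            vgg_block(in_ch, 64, 2),   # block 1: two convs
            vgg_block(64, 128, 2),     # block 2: two convs
            vgg_block(128, 256, 3),    # block 3: three convs
            vgg_block(256, 512, 3),    # block 4: three convs
            vgg_block(512, 512, 3),    # block 5: same channel count as block 4
        ])

    def forward(self, x):
        feats = []
        for block in self.blocks:
            x = block(x)
            feats.append(x)
        return feats  # five feature maps at decreasing spatial resolutions
```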
AM fusion network
The AM fusion network is the core part of the VAnet model. The extraction of important features and their associated context, the suppression of noise, and the preservation of texture details all rely on this fusion network. As shown in Fig. 4, the AM fusion network consists of the attention mechanism branch and the residual multi-scale detail processing branch.
Attention mechanism branch The attention mechanism branch is composed of five convolutional blocks and a channel attention mechanism block. Each convolutional block consists of a convolutional layer, a batch normalization layer and a ReLU activation function; all convolutional layers use 3 × 3 kernels. In the first convolution block, a pooling layer is added after the activation function to reduce the feature dimension. In the fourth convolution block, an upsampling layer is added before the convolution layer to restore the feature dimension. A channel attention mechanism block is placed after the second convolution block; its working principle is shown in Fig. 5.
In Fig. 5, the input feature map F of size \(H \times W \times C\) is fed into a max-pooling layer and an average-pooling layer to obtain two \(1 \times 1 \times C\) feature vectors. The two vectors are then passed through a shared two-layer neural network for feature extraction. The first layer of the network has C/r neurons and uses the ReLU activation function; the second layer has C neurons. The two outputs of the shared network are combined element-wise, and the final channel attention feature Mc is generated after a sigmoid activation.
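A hedged PyTorch sketch of this channel attention block follows. The reduction ratio r = 16 and the final re-weighting of the input by Mc are assumptions borrowed from common channel-attention practice; the paper specifies only the pooling, shared MLP, element-wise combination and sigmoid.

```python
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention block: global max/avg pooling, a shared two-layer
    MLP (C -> C/r -> C), element-wise addition and a sigmoid producing Mc."""
    def __init__(self, channels, reduction=16):  # r = 16 is an assumption
        super().__init__()
        self.max_pool = nn.AdaptiveMaxPool2d(1)  # H x W x C -> 1 x 1 x C
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.shared_mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1, bias=False),
        )
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        mc = self.sigmoid(self.shared_mlp(self.max_pool(x)) +
                          self.shared_mlp(self.avg_pool(x)))
        return x * mc  # re-weight the input channels with the attention map (assumption)
```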
Residual multi-scale detail processing branch Detail information is lost as the image passes through the attention mechanism branch, which degrades the fusion result. To avoid this, the residual multi-scale detail fusion block is designed. The residual multi-scale detail processing branch includes a set of residual convolution blocks, a multi-scale detail fusion block and a convolution block. The residual convolution blocks are designed to prevent gradient explosion, and the convolution kernels of all convolution blocks are set to 3 × 3. In the multi-scale detail fusion block, we use three different convolution kernels, each fusing detail information at a different scale; the kernel choices are shown in Fig. 4. A 1 × 1 kernel processes the information of different channels at the same location, while the 3 × 3 and 5 × 5 kernels process the information surrounding that location. Larger kernels are not used because of computational complexity: a large convolution kernel adds substantial computation and seriously degrades the computational performance of the model.
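The sketch below illustrates one plausible form of the multi-scale detail fusion block. The paper names only the three kernel sizes; merging the parallel outputs by concatenation followed by a 1 × 1 convolution is our assumption.

```python
import torch
import torch.nn as nn

class MultiScaleDetailFusion(nn.Module):
    """Multi-scale detail fusion block: parallel 1x1, 3x3 and 5x5 convolutions
    capture detail at different scales; their outputs are merged with a 1x1 conv."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=1)              # cross-channel, same position
        self.conv3 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)   # small neighbourhood
        self.conv5 = nn.Conv2d(channels, channels, kernel_size=5, padding=2)   # larger neighbourhood
        self.merge = nn.Conv2d(3 * channels, channels, kernel_size=1)          # merge strategy is an assumption

    def forward(self, x):
        return self.merge(torch.cat([self.conv1(x), self.conv3(x), self.conv5(x)], dim=1))
```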
Decoder
The decoder is based on a nested connection architecture inspired by UNet++, whose structure we simplified. As shown in Fig. 2, the decoder consists of ten convolutional blocks, each composed of two convolutional layers with 3 × 3 kernels. Cross-layer links connect the multi-scale deep features in the decoder. The output of the decoder is the reconstructed image fused with multi-scale features.
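A minimal sketch of a single decoder convolution block is given below. The ReLU activations and the exact wiring of the ten blocks through nested (UNet++-style) skip links are assumptions, since the text only specifies the two 3 × 3 convolutional layers per block.

```python
import torch.nn as nn

class DecoderConvBlock(nn.Module):
    """One decoder block: two 3x3 convolutions (activations are an assumption)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)
```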
Loss function
To improve the fusion effect of the VAnet model, we combine the structural similarity (SSIM) loss, the mean squared error (MSE) loss and the total variation (TV) loss into a hybrid loss function, described as follows
$$\begin{aligned} {L_{total}} = \alpha {L_{SSIM}} + \beta {L_{MSE}} + {L_{TV}} \end{aligned}$$
(1)
where \(\alpha\) and \(\beta\) are the balance parameters. The SSIM loss function is used to measure the loss of texture details of the source image during the fusion process. The MSE loss function is used to predict the pixel-to-pixel loss between the fused image and source images. The introduction of TV loss function aims to maintain the smoothness of the image and suppress noise. The structural similarity loss function is described as
$$\begin{aligned} {L_{SSIM}} = \sum \limits _{i = 1}^N {\left( {1 - SSIM\left( {{I^{fused}},{I^{source}}} \right) } \right) } \end{aligned}$$
(2)
where \({I^{fused}}\) represents the fused image and \({I^{source}}\) represents the source images. N is the size of the batch. \(SSIM( \cdot )\) is used to calculate the structural similarity between images. The closer the SSIM value is to 1, the more detailed information of the source image is contained in the fused image. The MSE loss function is defined as follows
$$\begin{aligned} {L_{MSE}} = \frac{1}{{WH}}\sum \limits _{x = 1}^W {\sum \limits _{y = 1}^H {{{\left( {I_{x,y}^{source} - I_{x,y}^{fused}} \right) }^2}} } \end{aligned}$$
(3)
where W and H are the width and height of the image, respectively, and (x, y) is the pixel position. The total variation loss function is described as
$$\begin{aligned} {L_{TV}} = \sum \limits _{i,j} {\left( {{{\left( {I_{i,j - 1}^{fused} - I_{i,j}^{fused}} \right) }^2} + {{\left( {I_{i + 1,j}^{fused} - I_{i,j}^{fused}} \right) }^2}} \right) } \end{aligned}$$
(4)
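A hedged PyTorch sketch of the hybrid loss of Eqs. (1)-(4) is given below. The `pytorch_msssim` package is used here only as one convenient SSIM implementation (the paper does not name one), and the sketch takes a single source tensor per call, leaving open how the two source modalities are combined in the loss.

```python
import torch.nn.functional as F
from pytorch_msssim import ssim  # one possible SSIM implementation (assumption)

def tv_loss(fused):
    """Total variation loss of Eq. (4): squared differences between adjacent pixels."""
    dh = (fused[:, :, :, 1:] - fused[:, :, :, :-1]) ** 2  # horizontal neighbours
    dv = (fused[:, :, 1:, :] - fused[:, :, :-1, :]) ** 2  # vertical neighbours
    return dh.sum() + dv.sum()

def hybrid_loss(fused, source, alpha=0.005, beta=0.003):
    """Mixed loss of Eq. (1): L_total = alpha * L_SSIM + beta * L_MSE + L_TV.
    Default alpha/beta are the values selected later in the paper."""
    l_ssim = (1.0 - ssim(fused, source, data_range=1.0, size_average=False)).sum()  # Eq. (2), summed over the batch
    l_mse = F.mse_loss(fused, source)                                               # Eq. (3), averaged over all pixels
    return alpha * l_ssim + beta * l_mse + tv_loss(fused)
```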
Dataset and experimental environment
The experimental data in this article are selected from the AANLIB database. 100 pairs of cross-modally registered brain-abnormality medical images are downloaded and cropped into 11,960 patch pairs as the training set for the VAnet model; the size of each patch is 84 × 84. This operation not only ensures the diversity of the training data but also enhances the robustness of VAnet. For the test data, we randomly selected two sets of images from each of four diseases. The VAnet model is trained and tested on a machine equipped with a 2.4 GHz Intel Core i7-11800H CPU (32 GB RAM) and a GeForce RTX 3070 GPU.
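The following is a minimal sketch of how a registered image pair might be cropped into 84 × 84 training patches. The stride and the NumPy-based implementation are assumptions; the paper reports only the patch size and the total patch count.

```python
import numpy as np

def crop_patches(image, patch_size=84, stride=42):
    """Crop a 2-D image into overlapping patch_size x patch_size patches.
    The stride value is an assumption, not stated in the paper."""
    h, w = image.shape[:2]
    patches = []
    for y in range(0, h - patch_size + 1, stride):
        for x in range(0, w - patch_size + 1, stride):
            patches.append(image[y:y + patch_size, x:x + patch_size])
    return np.stack(patches)
```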
Comparison algorithm and metrics
In this section, eleven medical image fusion methods are selected for comparison with VAnet: GFF [19], NSCT [20], IGM [21], LPSR [22], WLS [23], CSR [24], LRD [25], TLAYER [26], CSMCA [27], LATLRR [28] and DTNP [29]. Among them, GFF, NSCT, IGM, LRD and TLAYER are traditional image fusion methods; WLS and CSMCA are deep learning fusion methods; LPSR is a fusion method based on sparse representation; CSR combines a neural network with sparse representation; LATLRR is a fusion method based on low-rank decomposition; and DTNP combines dynamic thresholding with the wavelet transform. The source codes of all comparison algorithms come from the Internet, and the parameters of each algorithm are set as recommended by the corresponding authors.
To evaluate the performance of VAnet, we selected eight evaluation metrics to analyze the fused images of all algorithms: Qw [30], Qe, SSIM [31], VIF [32], FMI [33], LABF [34], NABF [35] and NCIE [36]. Qw and Qe are derived from the Piella model. SSIM measures the structural similarity between the fused image and the source image. VIF measures the visual information fidelity of the fused image. LABF, NABF, FMI and NCIE are representative metrics for evaluating image fusion from the perspective of information theory.
Training details
Training the VAnet model involves many parameters, including the batch size, learning rate, number of epochs, and the balance parameters in the loss function. The settings of these parameters have a profound effect on the fusion result, so their analysis is important.
Batch_size
batch_size refers to the number of samples used in a single training step, and its value affects both the degree of optimization and the speed of the model. Since the data for training the VAnet model are relatively large, putting all the data into the network at once would exhaust memory, so batch_size is introduced to address this. However, batch_size cannot be too small either: if it is, learning becomes too random and the model does not converge. Considering the hardware environment and memory capacity of the experiment, and following Leslie's theory, we set batch_size to 64.
Epoch
Epoch is an important parameter that controls the number of weight-update iterations, which directly affects the fit and convergence of the model. In training the VAnet model, a single pass over all the data is not enough to bring the model to its best fit. Therefore, an appropriate epoch value must be set to improve the stability of the model and the quality of image fusion. VIF is a metric that evaluates image quality from the perspective of information transmission, based on the statistical properties of natural scenes. Since the accuracy of this metric depends on the image itself and on the distortion channel of the human visual system, it is well suited to assist in determining the epoch value. Figure 6 shows the trend of VIF as the epoch changes.
In Fig. 6, we give the average VIF value over 50 pairs of medical fused images. When the epoch is set to 40, the average VIF value of the corresponding images reaches its maximum and the fused images best match human visual perception. Therefore, we set the epoch to 40 for training the VAnet model.
Learning rate
The learning rate is an important parameter of the VAnet model that affects its convergence. If the learning rate is too large, the model oscillates and does not converge; if it is too small, the model converges slowly. Based on this, we chose an exponentially decaying learning rate, computed as follows
$$\begin{aligned} lr = l{r_{base}} \cdot l{r_{decay}}^{epoch} \end{aligned}$$
(5)
where \(l{r_{base}}\) is the initial learning rate and \(l{r_{decay}}\) is its decay rate. Based on prior knowledge, the initial learning rate is set to 0.1 and the decay rate is set to 0.99.
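A small sketch of this schedule, using the values above:

```python
def exponential_lr(epoch, lr_base=0.1, lr_decay=0.99):
    """Exponentially decayed learning rate of Eq. (5): lr = lr_base * lr_decay ** epoch."""
    return lr_base * lr_decay ** epoch
```

In a PyTorch training loop the same schedule can be obtained with `torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.99)`, with the optimizer's initial learning rate set to 0.1.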
Hyperparameters
The loss function of the VAnet model contains two hyperparameters, \(\alpha\) and \(\beta\), which weight the SSIM loss and the MSE loss, respectively. Following common practice for setting hyperparameters in deep learning, the values of \(\alpha\) and \(\beta\) are searched between 0 and 0.01. Given the role of the two loss terms in training, we chose VIF, an evaluation metric related to human visual perception, to assist in determining the values of \(\alpha\) and \(\beta\). Figure 7 shows the trend of VIF with \(\alpha\) and \(\beta\).
In Fig. 7, we give the average VIF value over 50 pairs of medical fused images. When \(\alpha\) is set to 0.005 and \(\beta\) is set to 0.003, the average VIF value of the corresponding images reaches its maximum, which best satisfies the requirements of VAnet model training.
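The sketch below illustrates the kind of grid search implied by this procedure. The helper functions `train_and_fuse` and `compute_vif` are hypothetical placeholders, and the candidate value grids are assumptions; the paper reports only the final selected values.

```python
import itertools
import numpy as np

def select_hyperparameters(train_and_fuse, eval_pairs, compute_vif,
                           alphas=(0.001, 0.003, 0.005, 0.007, 0.01),
                           betas=(0.001, 0.003, 0.005, 0.007, 0.01)):
    """Grid search over (alpha, beta) in (0, 0.01], keeping the pair with the
    highest average VIF over the evaluation image pairs.
    train_and_fuse(alpha, beta) -> list of fused images (hypothetical helper);
    compute_vif(fused, sources) -> scalar VIF score (hypothetical helper)."""
    best_params, best_vif = None, -np.inf
    for alpha, beta in itertools.product(alphas, betas):
        fused_images = train_and_fuse(alpha, beta)
        avg_vif = np.mean([compute_vif(f, s) for f, s in zip(fused_images, eval_pairs)])
        if avg_vif > best_vif:
            best_params, best_vif = (alpha, beta), avg_vif
    return best_params, best_vif
```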