
VANet: a medical image fusion model based on attention mechanism to assist disease diagnosis

Abstract

Background

Today’s biomedical imaging technology can present the morphological structure or functional metabolic information of organisms at different scale levels, such as organ, tissue, cell, molecule and gene. However, different imaging modes have different application scopes, advantages and disadvantages. To improve the role of medical images in disease diagnosis, the fusion of biomedical image information across imaging modes and scales has become an important research direction in medical imaging. Traditional medical image fusion methods are designed around activity level measurement and fusion rules. They do not mine the contextual features of images from different modes, which limits the quality of the fused images.

Method

In this paper, an attention-multiscale network medical image fusion model based on contextual features is proposed. The model selects five backbone modules in the VGG-16 network to build encoders to obtain the contextual features of medical images. It builds the attention mechanism branch to complete the fusion of global contextual features and designs the residual multiscale detail processing branch to complete the fusion of local contextual features. Finally, it completes the cascade reconstruction of features by the decoder to obtain the fused image.

Results

Ten sets of images related to five diseases are selected from the AANLIB database to validate the VANet model. Structural images are derived from high-resolution MR images, and functional images are derived from SPECT and PET images, which are good at describing organ blood flow levels and tissue metabolism. Fusion experiments are performed with twelve fusion algorithms, including the VANet model. Eight metrics covering different aspects are selected to build a fusion quality evaluation system for the fused images. Friedman’s test and the post-hoc Nemenyi test are used to assess statistically the superiority of the VANet model.

Conclusions

The VANet model completely captures and fuses the texture details and color information of the source images. In the fusion results, metabolic and structural information is well expressed and color information does not interfere with structure and texture; in terms of the objective evaluation system, the metric values of the VANet model are generally higher than those of other methods; in terms of efficiency, the time consumption of the model is acceptable; in terms of scalability, the model is not affected by the input order of source images and can be extended to tri-modal fusion.


Background

Medical images are an important auxiliary tool for medical diagnosis. With the development of sensor technology, the types of medical images are becoming more and more abundant [1, 2]. The information provided to doctors by different types of medical images is usually complementary, and how to aggregate this complementary information into one image has become the focus of current research [3,4,5,6,7].

Figure 1 presents two modal images of a patient with mild Alzheimer’s disease and their fusion results. Figure 1a is the MR-T2 image showing globally widened hemispheric sulci, which is more prominent in parietal lobes. Figure 1b is the PET image that captures signals of markedly abnormal metabolism in brain regions. Weak metabolism occurs in the anterior temporal and posterior parietal regions. The changes tend to be bilateral, but the right hemisphere is more affected than the left, with the posterior cingulate gyrus relatively unaffected. Figure 1c is the fusion result of Fig. 1a and b. Doctors can pay attention to the metabolism of abnormal parts while observing structural changes. It can be seen that medical image fusion is of great significance to clinical diagnosis.

Fig. 1 Multi-modal image of a brain metastasis of a bronchial cancer

Since the quality of the fused images directly affects the doctor’s judgment of the disease, improving the fusion quality of medical images has become an urgent problem. The quality of fused images depends on the acquisition of image features and the design of fusion rules. Traditional methods usually rely on manually designed feature extraction methods and fusion rules. Although such methods can effectively describe the detailed features of images, they cannot acquire features that are shared across images of different modalities. Human-designed image fusion rules focus on computing weight maps, which integrate pixel activity information from different source images. In traditional fusion methods, the weight map is computed in two steps: activity level measurement and weight assignment. Medical images are decomposed by pre-designed filters and their activity is measured by the absolute value of the decomposed coefficients. Then a “choose-max” or “weighted-average” fusion rule is applied to the measurements from the different sources to assign weights. However, this way of measuring activity and assigning weights is not very stable because of noise, misregistration and differences in pixel intensities. To further improve the performance of the fusion model, scholars have proposed many complex decomposition methods and carefully designed weight allocation strategies. Nevertheless, these methods are usually designed step by step, which breaks the link between activity level measurement and weight assignment.
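The two traditional fusion rules mentioned above can be illustrated with a minimal NumPy sketch. It assumes the decomposition has already produced two coefficient arrays of equal shape; the function names, the small `eps` guard and the choice of weights proportional to activity are illustrative assumptions, not taken from any of the cited methods.

```python
import numpy as np

def choose_max_fusion(coeff_a, coeff_b):
    """Keep, at each position, the coefficient with the larger absolute value."""
    activity_a = np.abs(coeff_a)          # activity level measurement
    activity_b = np.abs(coeff_b)
    mask = activity_a >= activity_b       # weight assignment with hard 0/1 weights
    return np.where(mask, coeff_a, coeff_b)

def weighted_average_fusion(coeff_a, coeff_b, eps=1e-12):
    """Blend coefficients with weights proportional to their activity (assumed rule)."""
    activity_a = np.abs(coeff_a)
    activity_b = np.abs(coeff_b)
    # eps avoids division by zero where both activities vanish (weights become 0.5)
    weight_a = (activity_a + eps) / (activity_a + activity_b + 2 * eps)
    return weight_a * coeff_a + (1.0 - weight_a) * coeff_b

# Toy coefficients standing in for one sub-band of two decomposed source images
coeff_a = np.array([[0.9, -0.1], [0.2, -0.8]])
coeff_b = np.array([[-0.3, 0.5], [0.2, 0.1]])
fused_max = choose_max_fusion(coeff_a, coeff_b)
fused_avg = weighted_average_fusion(coeff_a, coeff_b)
```

In both rules the weights are derived from the activity measure by a fixed, hand-written formula, which is the separation between measurement and assignment that learned fusion models try to avoid.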

Medical image fusion methods based on deep learning can consider the key issues of the fusion process jointly. Such methods map source images directly to weights by encoding the images, and complete activity level measurement and weight assignment in an “optimal” way by learning network parameters, which effectively strengthens the link between activity level measurement and weight assignment. Among deep learning algorithms, improved algorithms based on autoencoders (AE) [8,9,10], generative adversarial networks (GAN) [11, 12] and convolutional neural networks (CNN) [13,14,15] are popular in medical image fusion. Song et al. proposed MSDNet and applied it to the extraction of medical image features [16]. The multiplexing of features enhanced the expression of important information in the fused image. Kang et al. regarded the fusion of PET and MR images as a min-max optimization problem with respect to the generator and the discriminator [17]. They proposed the TAcGAN model to enhance the structural features of fused images through a game between generator and discriminator, while preserving part of the information of SPECT images. Zhang et al. proposed a general fusion framework based on convolutional neural networks called IFCNN [18]. IFCNN can obtain the salient features of medical images without being limited by the number of source images. The fused images preserve important features from the different source images better.

Although the above methods improve the fusion quality of medical images, their improvement is limited. This is because they focus only on image fusion itself and ignore the purpose of medical image fusion. Medical image fusion is concerned with the global and local effects of abnormal tissue on medical images, which are often reflected in the contextual information of the images. Therefore, obtaining image context information has become a priority of current research. To address this issue, we propose a new medical image fusion model based on deep learning, called VANet. The VANet model has two main parts, the encoder and the fusion network. The encoder consists of five convolutional pooling blocks of the VGG-16 network, which can sufficiently capture the contextual information of medical images. The fusion network combines residual multi-scale feature extraction with an attention mechanism to enhance salient features and preserve texture detail information.

Methods

VANet

Overview

VANet is a new type of medical image fusion model. It consists of three parts: the encoder, the AM fusion network and the decoder. In Fig. 2, the encoder consists of five coding blocks, which correspond to the five blocks of VGG-16. The five feature maps obtained from the five blocks carry the contextual semantic information of the image. The feature maps are then fed into the AM fusion network for multi-scale deep feature fusion. The AM fusion network consists of the attention mechanism branch and the residual multi-scale detail fusion branch. The attention mechanism branch consists of the channel attention mechanism block and five convolution blocks. The channel attention mechanism block can suppress noise, especially in functional images. The residual multi-scale detail fusion branch includes three convolution blocks and a multi-scale detail fusion block. The multi-scale detail fusion block compensates for the loss of detail caused by the pooling operation in the attention mechanism branch. Finally, the fused feature map is input to the decoder to reconstruct the fused image.

Fig. 2 Schematic diagram of the VANet model

Encoder

Traditional encoders tend to ignore the context information of feature maps during feature extraction. However, the pathological characteristics of tissues are reflected not only in an isolated region, but also in its contextual information. Therefore, we use the VGG-16 network, which can obtain context information, as the encoder.

As shown in Fig. 3, VGG-16 contains five blocks. Its main feature is that it can obtain information about the image context. Each of the first two blocks consists of two convolutional layers and one max-pooling layer. Each of the last three blocks consists of three convolutional layers and one max-pooling layer. Stacking convolutional and pooling layers in this way readily forms a deeper network structure that obtains more complete and deeper contextual information. The kernel size of all convolutional layers is 3 × 3 and the size of all max-pooling layers is 2 × 2. The first four blocks have different numbers of output channels; the fourth and fifth blocks have the same number of output channels.

Fig. 3 The structure of the encoder of the VANet model
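The VGG-16 block layout described above can be written down as a small table and used to trace feature-map sizes. The sketch below is illustrative only: the channel widths (64, 128, 256, 512, 512) are those of the standard VGG-16 and are consistent with the description above; it assumes "same"-padded 3 × 3 convolutions and stride-2 pooling, and it reports sizes after each pooling layer, although the text does not state whether VANet taps the features before or after pooling. The function name is hypothetical.

```python
# (number of 3x3 convolutional layers, output channels) for the five VGG-16 blocks
VGG16_BLOCKS = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]

def encoder_feature_shapes(height, width):
    """Return (channels, height, width) after the max pooling of each block."""
    shapes = []
    for _num_convs, channels in VGG16_BLOCKS:
        # Assumption: padded 3x3 convolutions keep the spatial size,
        # and 2x2 max pooling with stride 2 halves it (rounding down).
        height, width = height // 2, width // 2
        shapes.append((channels, height, width))
    return shapes

# Trace an 84 x 84 input, the training patch size used in this article
shapes = encoder_feature_shapes(84, 84)
for block_index, shape in enumerate(shapes, start=1):
    print(f"block {block_index}: channels={shape[0]}, size={shape[1]}x{shape[2]}")
```

The five blocks together contain 13 convolutional layers, and each block yields a feature map at a coarser scale, which is what allows the fusion network to operate on multi-scale features.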

AM fusion network

The AM fusion network is the core part of the VANet model. The extraction of important features and their associated features, the suppression of noise and the preservation of texture details all rely on the fusion network. As shown in Fig. 4, the AM fusion network consists of the attention mechanism branch and the residual multi-scale detail processing branch.

Fig. 4 The structure of the AM fusion network

Attention mechanism branch. The attention mechanism branch is composed of five convolutional blocks and a channel attention mechanism block. Each convolutional block is composed of a convolutional layer, a batch normalization layer and a ReLU activation function. The kernel of the convolutional layer in all convolutional blocks is 3 × 3. In the first convolutional block, a pooling layer is added after the activation function to reduce the feature dimension. In the fourth convolutional block, an upsampling layer is added before the convolutional layer to restore the feature dimension. A channel attention mechanism block is added after the second convolutional block and its working principle is shown in Fig. 5.

Fig. 5 The structure of the channel attention block

In Fig. 5, the size of the input feature map F is \(H \times W \times C\), which is put into the max pooling layer and the average pooling layer to obtain two \(1 \times 1 \times C\) feature maps. Then the two feature maps are fed into a two-layer shared neural network for feature extraction. The number of neurons in the first layer of the network is C/r and the ReLU function is selected as the activation function. The number of neurons in the second layer of the network is C. The element-wise operation is performed on the features obtained by the shared neural network and the final channel attention feature Mc is generated after the sigmoid activation operation.
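The channel attention computation described above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the weights are random, biases are omitted, the reduction ratio `r = 4` is a hypothetical value, the element-wise operation is assumed to be a summation (as in the common CBAM-style channel attention), and multiplying the input by the attention vector at the end is likewise an assumption about how Mc is applied.

```python
import numpy as np

def channel_attention(feature_map, w1, w2):
    """Compute a channel attention vector Mc of shape (C,).

    feature_map: array of shape (H, W, C)
    w1: first shared layer, shape (C, C // r)
    w2: second shared layer, shape (C // r, C)
    """
    max_pooled = feature_map.max(axis=(0, 1))    # 1 x 1 x C descriptor
    avg_pooled = feature_map.mean(axis=(0, 1))   # 1 x 1 x C descriptor

    def shared_mlp(descriptor):
        hidden = np.maximum(descriptor @ w1, 0.0)  # C/r neurons with ReLU
        return hidden @ w2                          # C neurons

    # Assumption: the element-wise operation is a summation of the two outputs
    combined = shared_mlp(max_pooled) + shared_mlp(avg_pooled)
    return 1.0 / (1.0 + np.exp(-combined))          # sigmoid activation

rng = np.random.default_rng(0)
H, W, C, r = 8, 8, 32, 4                            # hypothetical sizes
F = rng.standard_normal((H, W, C))
w1 = rng.standard_normal((C, C // r)) * 0.1
w2 = rng.standard_normal((C // r, C)) * 0.1

Mc = channel_attention(F, w1, w2)
refined = F * Mc   # assumed use: rescale each channel (broadcast over H and W)
```

Because every entry of Mc lies between 0 and 1, channels that the shared network scores low are attenuated, which is one way such a block can damp noisy channels.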

Residual multi-scale detail processing branch. After the image passes through the attention mechanism branch, detailed information is lost, which affects the fusion result. To avoid this, the residual multi-scale detail processing branch is designed. It includes a set of residual convolution blocks, a multi-scale detail fusion block and a convolution block. The residual convolution block is designed to prevent gradient explosion. The convolution kernels of all convolution blocks are set to 3 × 3. In the multi-scale detail fusion block, we use three different convolution kernels, since different kernels can fuse detailed information at different scales. The selection of the convolution kernels is shown in Fig. 4. A filter with a 1 × 1 convolution kernel is used to process the information of different channels at the same location. Filters with 3 × 3 and 5 × 5 convolution kernels are used to process the information in the neighborhood surrounding each location. Filters with larger kernels are not used to process the surrounding information because of computational complexity: a large convolution kernel adds considerable computation and seriously affects the computational performance of the model.
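The cost argument against larger kernels can be made concrete with a short calculation. The sketch below counts the weights and multiply-accumulate operations of a single convolutional layer; the channel count (64) and feature-map size (42 × 42) are hypothetical values chosen only for illustration, and biases are ignored.

```python
def conv_cost(kernel_size, in_channels, out_channels, height, width):
    """Return (weights, multiply-accumulates) of one padded convolutional layer."""
    weights = kernel_size * kernel_size * in_channels * out_channels
    macs = weights * height * width   # one kernel application per output position
    return weights, macs

baseline_weights, _ = conv_cost(1, 64, 64, 42, 42)
for k in (1, 3, 5, 7):
    weights, macs = conv_cost(k, 64, 64, 42, 42)
    ratio = weights // baseline_weights
    print(f"{k}x{k}: {weights} weights, {macs} MACs, {ratio}x the 1x1 cost")
```

Cost grows with the square of the kernel size, so a 7 × 7 filter would need 49/25, roughly twice, the computation of the 5 × 5 filter that the block already uses.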

Decoder

The decoder is based on a nested connection architecture. Inspired by UNet++, we simplified its structure. As shown in Fig. 2, the decoder consists of ten convolutional blocks. Each convolution block is composed of two convolution layers with convolution kernel of 3 × 3. The cross-layer link connects the multi-scale depth features in the decoder. The output of the decoder is a reconstructed image fused with multi-scale features.

Loss function

To improve the fusion effect of the VANet model, we combine the structural similarity (SSIM) loss function, the mean squared error (MSE) loss function and the total variation (TV) loss function into a hybrid loss function. The hybrid loss function is defined as follows

$$\begin{aligned} {L_{total}} = \alpha {L_{SSIM}} + \beta {L_{MSE}} + {L_{TV}} \end{aligned}$$
(1)

where \(\alpha\) and \(\beta\) are the balance parameters. The SSIM loss function is used to measure the loss of texture details of the source image during the fusion process. The MSE loss function is used to predict the pixel-to-pixel loss between the fused image and source images. The introduction of TV loss function aims to maintain the smoothness of the image and suppress noise. The structural similarity loss function is described as

$$\begin{aligned} {L_{SSIM}} = \sum \limits _{i = 1}^N {\left( {1 - SSIM\left( {{I^{fused}},{I^{source}}} \right) } \right) } \end{aligned}$$
(2)

where \({I^{fused}}\) represents the fused image and \({I^{source}}\) represents the source images. N is the size of the batch. \(SSIM( \cdot )\) is used to calculate the structural similarity between images. The closer the SSIM value is to 1, the more detailed information of the source image is contained in the fused image. The MSE loss function is defined as follows

$$\begin{aligned} {L_{MSE}} = \frac{1}{{WH}}\sum \limits _{x = 1}^W {\sum \limits _{y = 1}^H {{{\left( {I_{x,y}^{source} - I_{x,y}^{fused}} \right) }^2}} } \end{aligned}$$
(3)

where W and H are the width and height of the image, respectively, and (x, y) is the pixel position. The total variation loss function is described as

$$\begin{aligned} {L_{TV}} = \sum \limits _{x,y} {\left( {{{\left( {I_{x,y - 1}^{fused} - I_{x,y}^{fused}} \right) }^2} + {{\left( {I_{x + 1,y}^{fused} - I_{x,y}^{fused}} \right) }^2}} \right) } \end{aligned}$$
(4)
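The hybrid loss of Eqs. 1 to 4 can be sketched for a single grayscale image pair as follows. Several simplifications are assumed and should not be read as the authors' implementation: SSIM is computed once over the whole image rather than with a sliding window, only one source image is compared with the fused image, the batch sum of Eq. 2 is reduced to a single pair, and differences at the image border are simply skipped in the TV term. The default values of `alpha` and `beta` are the ones reported later in the training details.

```python
import numpy as np

def mse_loss(source, fused):
    """Eq. 3: mean of squared pixel differences."""
    return float(np.mean((source - fused) ** 2))

def tv_loss(fused):
    """Eq. 4: sum of squared differences between adjacent pixels."""
    d_rows = fused[1:, :] - fused[:-1, :]
    d_cols = fused[:, 1:] - fused[:, :-1]
    return float(np.sum(d_rows ** 2) + np.sum(d_cols ** 2))

def global_ssim(a, b, data_range=1.0):
    """Simplified SSIM using whole-image statistics (assumption: no local windows)."""
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    numerator = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    denominator = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return float(numerator / denominator)

def total_loss(source, fused, alpha=0.005, beta=0.003):
    """Eq. 1: alpha * L_SSIM + beta * L_MSE + L_TV for one image pair."""
    l_ssim = 1.0 - global_ssim(fused, source)
    return alpha * l_ssim + beta * mse_loss(source, fused) + tv_loss(fused)

rng = np.random.default_rng(0)
source = rng.random((84, 84))
fused = np.clip(source + 0.05 * rng.standard_normal((84, 84)), 0.0, 1.0)
loss_value = total_loss(source, fused)
```

When the fused image equals the source, the SSIM and MSE terms vanish and only the TV term remains, which shows that the TV term acts purely as a smoothness penalty on the output.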

Dataset and experimental environment

The experimental data in this article are selected from the AANLIB database. 100 pairs of cross-modally registered medical images of brain abnormalities are downloaded and cropped into 11960 patch pairs as the training set for the VANet model. The size of each patch is set to 84 × 84. This operation not only ensures the diversity of training data, but also enhances the robustness of VANet. As for the test data, we randomly selected two sets of images from each of the five diseases to test VANet. Training and testing of the VANet model are performed on a machine equipped with a 2.4 GHz Intel Core i7-11800H CPU (32 GB RAM) and a GeForce RTX 3070 GPU.
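Cropping registered image pairs into aligned patches can be sketched as below. The article does not state the stride or sampling scheme used to obtain the patch pairs, so the sliding window with `stride=14`, the 256 × 256 image size and the function name are hypothetical, and the sketch is not meant to reproduce the reported patch count.

```python
import numpy as np

def extract_patch_pairs(image_a, image_b, patch_size=84, stride=14):
    """Cut two registered images into aligned patch pairs with a sliding window."""
    if image_a.shape[:2] != image_b.shape[:2]:
        raise ValueError("registered images must share height and width")
    height, width = image_a.shape[:2]
    pairs = []
    for top in range(0, height - patch_size + 1, stride):
        for left in range(0, width - patch_size + 1, stride):
            window = (slice(top, top + patch_size), slice(left, left + patch_size))
            # The same window is applied to both images so the patches stay registered
            pairs.append((image_a[window], image_b[window]))
    return pairs

mr_image = np.zeros((256, 256))          # stand-in for a grayscale MR slice
spect_image = np.zeros((256, 256, 3))    # stand-in for a pseudo-color SPECT slice
pairs = extract_patch_pairs(mr_image, spect_image)
print(len(pairs), pairs[0][0].shape, pairs[0][1].shape)
```

Overlapping windows yield many training samples from few images, which is why patch-based training is common when only a limited number of registered pairs is available.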

Comparison algorithm and metrics

In this section, eleven medical image fusion methods are selected for comparison with VANet. These eleven algorithms are GFF [19], NSCT [20], IGM [21], LPSR [22], WLS [23], CSR [24], LRD [25], TLAYER [26], CSMCA [27], LATLRR [28] and DTNP [29]. Among them, GFF, NSCT, IGM, LRD and TLAYER are traditional image fusion methods. WLS and CSMCA are deep learning fusion methods. LPSR is a fusion method based on sparse representation. CSR is a fusion method combining neural networks and sparse representation. LATLRR is a fusion method based on low-rank decomposition. DTNP is a fusion method that combines dynamic thresholds and wavelet transform. The source code of all comparison algorithms was obtained from the Internet and the parameters of each algorithm are set as recommended by the corresponding authors.

To evaluate the performance of VANet, we selected eight evaluation metrics to analyze the fused images of all algorithms. The eight metrics are Qw [30], Qe, SSIM [31], VIF [32], FMI [33], LABF [34], NABF [35] and NCIE [36]. Among them, Qw and Qe are derived from the Piella model. SSIM measures the structural similarity between the fused image and the source image. VIF (visual information fidelity) evaluates the visual quality of fused images. LABF, NABF, FMI and NCIE are representative information-theoretic metrics for evaluating image fusion.

Training details

The training of the VANet model involves many parameters, including batch_size, learning rate, epoch, and the balance parameter in the loss function. The settings of these parameters can have a profound effect on the fusion effect. Therefore, the analysis of these parameters has important research significance.

Batch_size

batch_size refers to the number of samples used in one training step; its size affects the degree and speed of model optimization. Since the training data for the VANet model is relatively large, feeding all the data into the network at once would exhaust memory. Therefore, the data are processed in batches. However, the value of batch_size cannot be too small: if it is, learning becomes too random and the model may not converge. Considering the hardware environment and memory capacity of the experiment, and according to Leslie’s theory, we set the value of batch_size to 64.

Epoch

Epoch is an important parameter that controls the number of weight update iterations, which directly affects the fit and convergence of the model. In the training of the VANet model, a single pass over all the data is not enough to bring the model to its best fit. Therefore, an appropriate epoch value must be set to improve the stability of the model and the effect of image fusion. VIF is a metric that evaluates image quality from the perspective of information communication and sharing, based on the statistical properties of natural scenes. Since the evaluation accuracy of this metric is related to the image itself and the distortion channel of the human visual system, it is well suited to help determine the value of epoch. Figure 6 shows how VIF changes with the number of epochs.

Fig. 6 The changing trend of epoch

In Fig. 6, we give the average value of VIF for 50 pairs of medical fused images. When epoch is set to 40, the average VIF value of the corresponding images reaches its maximum and the fused images are most in line with human visual perception. Therefore, we set the value of epoch to 40 to complete the training of the VANet model.

Learning rate

The learning rate is an important parameter of the VANet model, which affects the convergence of the model. If the learning rate is too large, the model will oscillate and not converge. If the learning rate is too small, the model will converge slowly. Based on the actual situation, we chose the exponential decay learning rate. The formula is as follows

$$\begin{aligned} lr = l{r_{base}} * l{r_{decay}}^{epoch} \end{aligned}$$
(5)

where \(l{r_{base}}\) is the initial value of the learning rate and \(l{r_{decay}}\) is the decay rate of learning rate. According to prior knowledge, the initial value of the learning rate is set to 0.1, and the decay value of the learning rate is set to 0.99.
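Eq. 5 with the stated settings can be evaluated directly. The short sketch below assumes that the epoch index starts at 0, which the text does not specify; the function name is illustrative.

```python
def exponential_decay_lr(epoch, lr_base=0.1, lr_decay=0.99):
    """Eq. 5: learning rate after `epoch` epochs of exponential decay."""
    return lr_base * lr_decay ** epoch

# Learning rate at every epoch of a 40-epoch training run
schedule = [exponential_decay_lr(epoch) for epoch in range(41)]
print(f"epoch 0: {schedule[0]:.4f}, epoch 40: {schedule[40]:.4f}")
```

With a decay rate of 0.99 the learning rate falls by about one third over 40 epochs, so the schedule is gentle and the step size remains large throughout training.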

Hyperparameters

In the loss function of the VANet model, there are two hyperparameters \(\alpha\) and \(\beta\), which weight the SSIM loss function and the MSE loss function respectively. Following the hyperparameter settings used by other scholars in deep learning, the values of \(\alpha\) and \(\beta\) are set between 0 and 0.01. Given the role of the two loss functions in the training process, we chose the evaluation metric VIF, which is related to human visual perception, to help determine the values of the hyperparameters \(\alpha\) and \(\beta\). Figure 7 shows how VIF changes with \(\alpha\) and \(\beta\).

Fig. 7 The Hyperparameters change trend graph

In Fig. 7, we give the average value of VIF for 50 pairs of medical fused images. Obviously, when \(\alpha\) is set to 0.005 and \(\beta\) is set to 0.003, the average VIF value of the corresponding image reaches the maximum value, which best meets the requirements of VANet model training.
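The selection procedure described above amounts to a grid search that keeps the pair with the highest mean VIF. The sketch below shows only the search logic: the grid step of 0.001 is an assumption, and `toy_mean_vif` is a hypothetical stand-in scoring function, whereas in practice each candidate pair would require training the model and measuring VIF on the fused validation images.

```python
import itertools

def select_balance_parameters(candidates_alpha, candidates_beta, mean_vif):
    """Return the (alpha, beta) pair that maximizes the supplied mean-VIF function."""
    grid = itertools.product(candidates_alpha, candidates_beta)
    return max(grid, key=lambda pair: mean_vif(*pair))

def toy_mean_vif(alpha, beta):
    """Hypothetical score that peaks at (0.005, 0.003); not measured data."""
    return -((alpha - 0.005) ** 2 + (beta - 0.003) ** 2)

# Candidate values 0.001, 0.002, ..., 0.010 (assumed grid inside the stated range)
candidates = [round(0.001 * k, 3) for k in range(1, 11)]
best_alpha, best_beta = select_balance_parameters(candidates, candidates, toy_mean_vif)
print(best_alpha, best_beta)
```

Selecting on a single metric keeps the search simple, at the cost of tuning the loss weights towards VIF rather than towards the full set of eight evaluation metrics.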

Results

The test data are derived from the following five diseases, which are subacute stroke, hypertensive encephalopathy, cavernous hemangioma, metastatic bronchogenic carcinoma and mild Alzheimer’s disease. Two pairs of the source images are selected for each disease to prove the effectiveness and superiority of our fusion model.

Subacute stroke: loss of sensation

The two sets of source images in this section are from a 65-year-old patient with subacute stroke. He is right-handed, with mild left hemiplegia and atrial fibrillation. When he felt a tingling pain in his left arm, he went to the hospital, where it was found that he could not explore the left half of space. In his two sets of MR images, the cerebrospinal fluid left by liquefaction and necrosis of the old infarct shows hyperintensity and has replaced the frontal pole. Hyperperfusion appears on the corresponding SPECT images. Figures 8 and 9 show the fusion results of all algorithms on the two sets of subacute stroke images. The fused image based on the CSR model almost loses the ability to describe functional information. The fused images obtained by the LRD, IGM, TLayers and DTNP algorithms cannot completely describe the blood flow level. The fused images obtained by the GFF and LPSR algorithms have serious distortion. The fused images obtained by NSCT, WLS and CSMCA are dark, which hinders the description of the structural information of the image. The fused image obtained by the LATLRR algorithm has serious blurring. The fused image obtained by the VANet model clearly describes the blood flow of the tissue while retaining the key information in the MR images.

Fig. 8 The first set of fused MRI-SPECT images from 9 methods on subacute stroke

Fig. 9 The second set of fused MRI-SPECT images from 9 methods on subacute stroke

Tables 1 and 2 give the objective performance of different algorithms on the fusion of the above two sets of medical images, respectively. The VANet model achieves optimal values on all objective evaluation metrics. From both subjective and objective perspectives, the subacute stroke images fused by VANet model can provide doctors with complete information about the diseased tissue and help doctors complete the diagnosis as soon as possible.

Table 1 The objective evaluation scores about group 1 fused images
Table 2 The objective evaluation scores about group 2 fused images

Hypertensive encephalopathy

The two sets of source images in Figs. 10 and 11 are from a young woman with acute arterial hypertension. In her MR-T2 images, bilateral temporal and occipital lesions can be clearly seen. Early perfusion abnormalities are obvious at higher levels in her SPECT-Tl image. To observe the lesion tissue and its perfusion better, the two sets of images are fused with the 12 algorithms and the fusion results are shown in Figs. 10 and 11, respectively. The fused images obtained by the NSCT, WLS and CSMCA algorithms are dim and lose the energy information in the SPECT image. The fused images obtained by the GFF, LPSR and CSR algorithms have serious distortion. The fused image obtained by the TLayers algorithm is very blurry and cannot describe the texture information. The fused images obtained by the IGM, LRD and DTNP algorithms are too bright, which affects the expression of some detailed information. The fused image obtained by the LATLRR algorithm loses part of the color information, which affects the description of the blood flow information. The fused image obtained by the VANet model characterizes the diseased tissue and its blood flow better.

Fig. 10 The first set of fused MRI-SPECT images from 9 methods on hypertensive encephalopathy

Fig. 11 The second set of fused MRI-SPECT images from 9 methods on hypertensive encephalopathy

In Tables 3 and 4, it can be seen that the VANet model is outstanding on Qw, Qe, SSIM, LABF, NABF and NCIE. On VIF and FMI, the performance of VANet is lower than that of the LATLRR algorithm and the CSR model respectively, which may be related to the feature extraction method. However, the fused images obtained by the LATLRR algorithm and the CSR model each lack some color information, which makes them unable to provide reliable information for doctors. In contrast, the images fused by the VANet model retain more complete color information, which may be helpful for treating hypertensive encephalopathy.

Table 3 The objective evaluation scores about group 3 fused images
Table 4 The objective evaluation scores about group 4 fused images

Cavernous angioma

The experimental data are from a 26-year-old woman with a ten-year history of headaches. Recently, she received radiosurgery because of progressive weakness of the right arm and leg. Her MR images show obvious hemangiomas. Her SPECT image was acquired with technetium-labeled red blood cells. The lesions contain blood clots and scarred brain tissue, surrounded by crystalline old blood products. The lesions do not fill with the labeled red blood cells, indicating that they are not open to circulating blood. To assist the doctor in diagnosing and treating her disease, her two sets of registered images were chosen for fusion. Figures 12 and 13 show the fusion results of the two sets of images under different algorithms, respectively. The fused images obtained by the NSCT, WLS and CSMCA algorithms lack the low-frequency energy of the SPECT image, resulting in dim brightness. The fused image obtained by the LPSR algorithm is seriously distorted. The fused images obtained by the IGM, LRD and DTNP algorithms are too bright, which affects the description of the texture information. The fused images obtained by the GFF, CSR and LATLRR algorithms describe the blood circulation poorly. The fused image obtained by the TLayers algorithm is relatively blurry and cannot describe the nuclide information. The fused image obtained by the VANet model is superior to those of the other algorithms in terms of brightness, contrast and description of nuclide information.

Fig. 12 The first set of fused MRI-SPECT images from 9 methods on cavernous angioma

Fig. 13 The second set of fused MRI-SPECT images from 9 methods on cavernous angioma

Tables 5 and 6 show the objective performance of all algorithms on the above two sets of images, respectively. With the exception of VIF and Qe, VANet achieves the best values on all metrics. Although the images obtained by the IGM algorithm and the DTNP algorithm achieve the best values on the VIF and Qe metrics, respectively, their poor fusion results seriously affect the doctor’s observation of texture details. In summary, the fused images obtained by the VANet model can help doctors observe and diagnose cavernous angioma better.

Table 5 The objective evaluation scores about group 5 fused images
Table 6 The objective evaluation scores about group 6 fused images

Metastatic bronchogenic carcinoma

The experimental data come from a 42-year-old woman with a long history of smoking, who went to the hospital for a check-up after a sudden increase in headaches. The examination revealed a large number of lumps in her brain. The MR image shows the tumor as an area of high signal intensity on proton density (PD) and T2-weighted (T2) images in a large left temporal region. The perfusion SPECT image shows very low blood flow to the lesion. To combine tissue structure information and blood flow conditions and so accelerate the diagnostic process, two sets of registered medical images are selected for fusion. Figures 14 and 15 show the fusion results of the two sets of images under different algorithms, respectively. The fused image obtained by the TLayers algorithm is blurred in texture detail. The fused images obtained by the NSCT and CSMCA algorithms are dim and lose the low-frequency energy in the SPECT image. The fused images obtained by the LPSR and LATLRR algorithms show color distortion. The fused images obtained by the GFF and CSR algorithms lose the ability to describe the blood flow levels of tissues. The fused images obtained by the IGM, WLS, LRD and DTNP algorithms are too bright, which seriously affects the expression of image color information. The fused image obtained by the VANet model has an appropriate contrast and can help doctors judge the adhesion relationship between brain tissue and metastatic cancers.

Fig. 14 The first set of fused MRI-SPECT images from 9 methods on metastatic bronchogenic carcinoma

Fig. 15 The second set of fused MRI-SPECT images from 9 methods on metastatic bronchogenic carcinoma

Tables 7 and 8 show the objective representations of all fusion results of these two sets of images, respectively. The VANet model has a significant performance improvement over other algorithms, except the LPSR algorithm. Although the LPSR algorithm and the VANet model perform equally well on all metrics, the images obtained by LPSR algorithm describe color information very poorly. In summary, the VANet model is more suitable for processing image fusion of bronchial cancer metastatic disease, which can provide great help to doctors.

Table 7 The objective evaluation scores about group 7 fused images
Table 8 The objective evaluation scores about group 8 fused images

Mild Alzheimer’s disease

The experimental images are taken from a 70-year-old man with memory difficulties. His MR images show globally widened hemispheric sulci, which are more prominent in the parietal lobes. In his PET images, regional cerebral metabolism is markedly abnormal, with hypometabolism in the anterior temporal and posterior parietal regions. To further observe the metabolic status of the affected regions, his two sets of images are selected for fusion. Figures 16 and 17 show all the fusion results of the two sets of images, respectively. The fused images obtained by the NSCT and CSMCA algorithms are too dark and the energy information of the PET image is lost. The image obtained by the CSR algorithm loses almost all metabolic information. The fused image obtained by the GFF algorithm shows serious color distortion. The fused images obtained by the IGM, DTNP and WLS algorithms are too bright, resulting in loss of information. The fused images obtained by the LRD, LPSR and LATLRR algorithms have low contrast in the upper right corner and the outline is not obvious. The fused image obtained by the TLayers algorithm has severely blurred texture. The fused image obtained by the VANet model contains rich texture information and complete metabolic information.

Fig. 16

The first set of fused MRI-PET images from 12 methods on mild Alzheimer’s disease

Fig. 17

The second set of fused MRI-PET images from 12 methods on mild Alzheimer’s disease

Tables 9 and 10 show the objective performance of the different fusion algorithms on the two sets of images. Compared with the other algorithms, the VANet model achieves the second-best value on the FMI metric and the best value on the remaining metrics. Together with the visual fusion results, this indicates that the medical images fused by the VANet model can be of great help to doctors treating mild Alzheimer’s disease.

Table 9 The objective evaluation scores about group 9 fused images
Table 10 The objective evaluation scores about group 10 fused images

Ablation study

The core of the VANet model is the attention-multiscale fusion network. Within it, the attention mechanism branch fuses the global context of the medical images, and the residual multi-scale detail processing branch fuses their local context. To verify the influence of each branch on the fusion results, this section disables one branch at a time and performs fusion with the other. The experimental data are 60 groups of registered MRI images and their corresponding nuclear medicine images, from which we randomly select the fusion results of three groups and show them in Fig. 18.

Fig. 18

Influence of attention mechanism branch and residual multi-scale detail processing branch in VANet model on fusion results respectively

First, to verify the influence of global context on the fused images, the attention mechanism branch is disabled. The fusion results are shown in Fig. 18c, h and m. When global context fusion is blocked, the fused image suffers from severe color distortion, which causes a large deviation in the description of tissue metabolic information. Then, the residual multi-scale detail processing branch is disabled to verify the effect of local context on the fused images. The fusion results are shown in Fig. 18d, i and n. Some detailed texture information is clearly blurred, which hinders the doctor’s observation of key tissue information. Table 11 shows the objective metrics of the VANet model ablation experiment, with the optimal value for each metric shown in bold.

Table 11 The objective evaluation scores of the VANet model ablation experiment

In Table 11, it can be seen that the performance of the VANet model with the attention branch removed is significantly weaker on most metrics, especially in SSIM, FMI, and LABF. It shows that the global context plays an important role in the medical image fusion. The VANet model with the residual multiscale detail processing branch removed has the worst performance on the metric of NABF, which indicates that the local context affects the description of detail information in the fused image. Without this branch, the fused image would have more noisy information. In contrast, the complete VANet model considers the representation of image global information and local information, which improves the quality of fused images.

Time complexity analysis

The images obtained by the VANet model have been analyzed subjectively and evaluated objectively in the preceding sections. This section compares the VANet model with the other algorithms in terms of running time. The time cost of each algorithm on each set of experimental images is listed in Tables 1 to 10. Across all the tables, the LPSR algorithm takes the shortest time and the CSMCA algorithm takes the longest. The time consumption of the LRD method is second only to that of the CSMCA algorithm. The CSR and DTNP algorithms also take more than 10 seconds. The VANet model requires some time to train. Once the model is trained, the time it takes to fuse images is comparable to that of the WLS algorithm. However, the fusion quality of the VANet model is much better than that of the WLS and LPSR algorithms.

Statistical test

When comparing algorithms, it is often necessary to perform statistical tests on the experimental results. The Friedman test is a nonparametric test used to compare the performance of multiple algorithms on different datasets. However, the Friedman test can only detect whether the performance of the algorithms differs at all. Once a difference is detected, a post-hoc test is needed to find out which algorithms differ significantly from each other. The Nemenyi test is a commonly used post-hoc test. It uses Tukey’s distribution to calculate the critical difference (CD). If the difference between the average ranks of two methods is larger than the CD, there is a significant difference between the two methods. In Fig. 19, the values of the objective evaluation indicators in Tables 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 are used to calculate the rank of each fusion algorithm. Combining the two test results, we find that the VANet model has a clear performance advantage over the other fusion algorithms. On the selected objective indicators, the advantage of the VANet model is statistically significant.
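The ranking procedure described above can be illustrated with a minimal sketch. It is not the authors' code: the score matrix, the method names A, B and C, and the function names are hypothetical, and the critical value q_alpha = 2.343 is the commonly tabulated Nemenyi value for three methods at the 0.05 level. Each row of the matrix is one dataset (here, one metric on one image group) and each column is one method, with higher scores assumed to be better.

```python
import math

def average_ranks(scores, higher_is_better=True):
    """Rank the methods within each row (1 = best), sharing the mean rank
    among ties, then average the ranks of each method over all rows."""
    n, k = len(scores), len(scores[0])
    totals = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda j: row[j], reverse=higher_is_better)
        i = 0
        while i < k:
            j = i
            # Extend the group while the next method has the same score.
            while j + 1 < k and row[order[j + 1]] == row[order[i]]:
                j += 1
            shared = (i + j) / 2 + 1  # mean of ranks i+1 .. j+1
            for t in range(i, j + 1):
                totals[order[t]] += shared
            i = j + 1
    return [t / n for t in totals]

def friedman_statistic(avg_ranks, n):
    """Friedman chi-square statistic from average ranks over n datasets."""
    k = len(avg_ranks)
    return 12 * n / (k * (k + 1)) * (
        sum(r * r for r in avg_ranks) - k * (k + 1) ** 2 / 4
    )

def nemenyi_cd(k, n, q_alpha):
    """Critical difference of the Nemenyi post-hoc test."""
    return q_alpha * math.sqrt(k * (k + 1) / (6 * n))

# Hypothetical toy scores: 4 datasets (rows) x 3 methods (columns).
names = ["A", "B", "C"]
scores = [
    [0.90, 0.70, 0.50],
    [0.80, 0.60, 0.70],
    [0.95, 0.70, 0.60],
    [0.85, 0.75, 0.65],
]
n, k = len(scores), len(names)

avg = average_ranks(scores)
chi2 = friedman_statistic(avg, n)
cd = nemenyi_cd(k, n, q_alpha=2.343)  # assumed table value for k = 3, alpha = 0.05

# Pairs of methods whose average ranks differ by more than the CD.
significant = [
    (names[i], names[j])
    for i in range(k)
    for j in range(i + 1, k)
    if abs(avg[i] - avg[j]) > cd
]
```

For this toy matrix the average ranks are 1.0, 2.25 and 2.75, the Friedman statistic is 6.5, and the CD is about 1.66, so only methods A and C differ significantly. With twelve methods and the full set of metric values, the same steps apply, but q_alpha must be taken from the table entry for twelve methods.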

Fig. 19

The time complexity of different types of medical images

Conclusions

In this study, we propose a novel model for medical image fusion. To address the challenges faced by medical image fusion, the model first uses the five blocks of VGG-16 to build an encoder that produces feature maps containing image context information. Second, the model constructs an attention-multiscale (AM) fusion network with the attention mechanism at its core. The network builds blocks around the channel attention mechanism to enhance salient features and weaken redundant ones. To capture more texture detail, the network uses convolution kernels of different sizes to construct detail information patches and obtain multi-scale features of the image. Finally, all the acquired features are reconstructed by the decoder. The experimental results on the Harvard Medical School brain medical image dataset show that the fused images obtained by the VANet model are superior to those of current state-of-the-art fusion algorithms in terms of structural information and the expression of metabolic condition. Since the VANet model avoids the problem of image fusion order, it can be further extended to the fusion of three medical images.

Availability of data and materials

The datasets used and/or analyzed during the current study are available from https://www.med.harvard.edu/AANLIB/home.html
Experimental images in Fig. 1 are downloaded from https://www.med.harvard.edu/AANLIB/cases/caseNN1/mr1-dg1/015.html
Experimental images in Fig. 8 are downloaded from https://www.med.harvard.edu/AANLIB/cases/case15/mr1-tc1/012.html
Experimental images in Fig. 9 are downloaded from https://www.med.harvard.edu/AANLIB/cases/case15/mr1-tc1/015.html
Experimental images in Fig. 10 are downloaded from https://www.med.harvard.edu/AANLIB/cases/case21/mr1-tc1/009.html
Experimental images in Fig. 11 are downloaded from https://www.med.harvard.edu/AANLIB/cases/case21/mr1-tc1/016.html
Experimental images in Fig. 12 are downloaded from https://www.med.harvard.edu/AANLIB/cases/case12/mr2-tc2/007.html
Experimental images in Fig. 13 are downloaded from https://www.med.harvard.edu/AANLIB/cases/case12/mr2-tc2/023.html
Experimental images in Fig. 14 are downloaded from https://www.med.harvard.edu/AANLIB/cases/case28/mr1-tc1/006.html
Experimental images in Fig. 15 are downloaded from https://www.med.harvard.edu/AANLIB/cases/case28/mr1-tc1/013.html

Abbreviations

VGG-16:

Visual Geometry Group network with 16 layers

MR:

Magnetic resonance

PET:

Positron emission tomography

AE:

Autoencoders

CNN:

Convolutional neural network

GAN:

Generative Adversarial Network

PD:

Proton density

T2:

T2-weighted

SSIM:

Structural similarity

AG:

Average gradient

VIF:

Visual information fidelity

FMI:

Feature mutual information

MSE:

Mean square error

LPSR:

Laplacian pyramid sparse representation

LRD:

Laplacian re-decomposition

SPECT:

Single-Photon Emission Computed Tomography

MSDNet:

Multi-Scale Dense Convolutional Networks

TAcGAN:

Tissue-aware conditional generative adversarial network

TV:

Total variation

WLS:

Weighted least square optimization

CSR:

Convolutional sparse representation

TLayers:

Three-layer medical image fusion

CSMCA:

Medical image fusion via convolutional sparsity based morphological component analysis

LATLRR:

Latent low-rank representation

DTNP:

Dynamic threshold neural p systems medical image fusion

NCIE:

Nonlinear correlation information entropy.

References

  1. Fu J, Li W, Du J, Xu L. Dsagan: a generative adversarial network based on dual-stream attention mechanism for anatomical and functional image fusion. Inf Sci. 2021;576:484–506.

    Article  Google Scholar 

  2. Ganasala P, Prasad AD. Medical image fusion based on laws of texture energy measures in stationary wavelet transform domain. Int J Imaging Syst Technol. 2020;30(3):544–57.

    Article  Google Scholar 

  3. Singh S, Gupta D, Anand R, Kumar V. Nonsubsampled shearlet based ct and mr medical image fusion using biologically inspired spiking neural network. Biomed Signal Process Control. 2015;18:91–101.

    Article  Google Scholar 

  4. Shahdoosti HR, Mehrabi A. Multimodal image fusion using sparse representation classification in tetrolet domain. Digit Signal Process. 2018;79:9–22.

    Article  Google Scholar 

  5. Shahdoosti HR, Mehrabi A. Mri and pet image fusion using structure tensor and dual ripplet-ii transform. Multimed Tools Appl. 2018;77(17):22649–70.

    Article  Google Scholar 

  6. Li S, Kang X, Fang L, Hu J, Yin H. Pixel-level image fusion: a survey of the state of the art. Inf Fusion. 2017;33:100–112.

  7. Wang Q, Li S, Qin H, Hao A. Robust multi-modal medical image fusion via anisotropic heat diffusion guided low-rank structural analysis. Inf Fusion. 2015;26:103–21.

    Article  Google Scholar 

  8. Liu S, Liu S, Cai W, Che H, Pujol S, Kikinis R, Feng D, Fulham MJ, et al. Multimodal neuroimaging feature learning for multiclass diagnosis of alzheimer’s disease. IEEE Trans Biomed Eng. 2014;62(4):1132–40.

    Article  Google Scholar 

  9. Shi B, Chen Y, Zhang P, Smith CD, Liu J, Initiative ADN, et al. Nonlinear feature transformation and deep fusion for Alzheimer’s disease staging analysis. Pattern Recognit. 2017;63:487–98.

    Article  Google Scholar 

  10. Singh V, Verma NK, Ul Islam Z, Cui Y. Feature learning using stacked autoencoder for shared and multimodal fusion of medical images. In: Verma, G.A.K. Nishchal K. (ed.) Computational Intelligence: Theories, Applications and Future Directions-Volume I.

  11. Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y. Generative adversarial nets. Adv Neural Inf Process Syst. 2014.

  12. Tang W, Liu Y, Cheng J, Li C, Chen X. Green fluorescent protein and phase contrast image fusion via detail preserving cross network. IEEE Trans Comput Imaging. 2021;7:584–97.

    Article  Google Scholar 

  13. Liu Y, Chen X, Cheng J, Peng H. A medical image fusion method based on convolutional neural networks. In: 2017 20th International Conference on Information Fusion (Fusion). 2017:1–7.

  14. Hermessi H, Mourali O, Zagrouba E. Convolutional neural network-based multimodal image fusion via similarity learning in the shearlet domain. Neural Comput Appl. 2018;30(7):2029–45.

    Article  Google Scholar 

  15. Xia K-j, Yin H-s, Wang J-q. A novel improved deep convolutional neural network model for medical image fusion. Cluster Comput. 2019;22(1);1515–1527.

  16. Song X, Wu X-J, Li H. Msdnet for medical image fusion. In: International Conference on Image and Graphics. 2019:278–288.

  17. Kang J, Lu W, Zhang W. Fusion of brain pet and mri images using tissue-aware conditional generative adversarial network with joint loss. IEEE Access. 2020;8:6368–78.

    Article  Google Scholar 

  18. Zhang Y, Liu Y, Sun P, Yan H, Zhao X, Zhang L. Ifcnn: a general image fusion framework based on convolutional neural network. Inf Fusion. 2020;54:99–118.

    Article  Google Scholar 

  19. Li S, Kang X, Hu J. Image fusion with guided filtering. IEEE Trans Image Process. 2013;22(7):2864–75.

    Article  Google Scholar 

  20. Li T, Wang Y. Biological image fusion using a nsct based variable-weight method. Inf Fusion. 2011;12(2):85–92.

    Article  Google Scholar 

  21. Zhang X, Li X, Feng Y, Zhao H, Liu Z. Image fusion with internal generative mechanism. Expert Syst Appl. 2015;42(5):2382–91.

    Article  Google Scholar 

  22. Wang Z, Cui Z, Zhu Y. Multi-modal medical image fusion by laplacian pyramid and adaptive sparse representation. Comput Biol Med. 2020;123:103823.

    Article  Google Scholar 

  23. Ma J, Zhou Z, Wang B, Zong H. Infrared and visible image fusion based on visual saliency map and weighted least square optimization. Infrared Phys Technol. 2017;82:8–17.

    Article  CAS  Google Scholar 

  24. Liu Y, Chen X, Ward RK, Wang ZJ. Image fusion with convolutional sparse representation. IEEE Signal Process Lett. 2016;23(12):1882–6.

    Article  Google Scholar 

  25. Li X, Guo X, Han P, Wang X, Li H, Luo T. Laplacian redecomposition for multimodal medical image fusion. IEEE Trans Instrum Meas. 2020;69(9):6880–90.

    Article  Google Scholar 

  26. Du J, Li W, Tan H. Three-layer medical image fusion with tensor-based features. Inf Sci. 2020;525:93–108.

    Article  Google Scholar 

  27. Liu Y, Chen X, Ward RK, Wang ZJ. Medical image fusion via convolutional sparsity based morphological component analysis. IEEE Signal Process Lett. 2019;26(3):485–9.

    Article  Google Scholar 

  28. Li H, Wu X-J. Infrared and visible image fusion using latent low-rank representation. arXiv preprint arXiv:1804.08992. 2018.

  29. Li B, Peng H, Wang J. A novel fusion method based on dynamic threshold neural p systems and nonsubsampled contourlet transform for multi-modality medical images. Signal Process. 2021;178:107793.

    Article  Google Scholar 

  30. Piella G, Heijmans H. A new quality metric for image fusion. In: Proceedings 2003 International Conference on Image Processing (Cat. No. 03CH37429), 2003;3:173.

  31. Zhou W. Image quality assessment: from error measurement to structural similarity. IEEE Trans Image Process. 2004.

  32. Han Y, Cai Y, Cao Y, Xu X. A new image fusion performance metric based on visual information fidelity. Inf Fusion. 2013;14(2):127–35.

    Article  Google Scholar 

  33. Haghighat M, Razian MA. Fast-fmi: non-reference image fusion metric. In: 2014 IEEE 8th International Conference on Application of Information and Communication Technologies (AICT). 2014:1–3.

  34. Petrovic V, Xydeas C. Objective image fusion performance characterisation. In: Tenth IEEE International Conference on Computer Vision (ICCV’05). 2005;Volume 1, vol. 2:1866–1871.

  35. Shreyamsha Kumar B. Multifocus and multispectral image fusion based on pixel significance using discrete cosine harmonic wavelet transform. Signal Image Video Process. 2013;7(6):1125–43.

    Article  Google Scholar 

  36. Wang Q, Shen Y, Jin J. Performance evaluation of image fusion techniques. Image Fusion Algorithms Appl. 2008;19:469–92.

    Article  Google Scholar 

Download references

Acknowledgements

Not applicable.

Author information


Kai Guo received the B.S. degree in computer science and technology from Jilin University, in 2015, where he is currently pursuing the Ph.D. degree with the College of Computer Science and Technology. His research interests include machine learning, image processing, especially for image fusion and image segmentation.


Xiongfei Li received the B.S. degree in computer software from Nanjing University in 1985, the M.S. degree in computer software from the Chinese Academy of Sciences in 1988, and the Ph.D. degree in communication and information systems from Jilin University in 2002. Since 1988, he has been a member of the faculty of computer science and technology at Jilin University, Changchun, China. He is a professor of computer software and theory at Jilin University. He has authored more than 60 research papers. His research interests include data mining, intelligent networks, image processing and analysis.


Tiehu Fan received the B.S. degree in computer science and technology from Jilin University in 2003, and the Ph.D. degree in computer science and technology from Jilin University in 2010. He is currently an Associate Professor with the College of Instrumentation and Electrical Engineering, Jilin University. He has authored more than 10 research articles. His research interests include intelligent control and integrated computational intelligence.


Xiaohan Hu received her B.M. degree in Medical Tests (Clinically Oriented) from Southern Medical University (the former First Military Medical University) in 2008, her M.M. degree in Clinical Medicine from Jilin University in 2012, and her M.D. degree in Diagnostic and Interventional Radiology from Johann-Wolfgang-Goethe-Universität Frankfurt am Main (Goethe University Frankfurt) in 2015. She now works as a lecturer and a doctor at the Radiology Department, The First Hospital of Jilin University. Her research interests include medical image data analysis and disease diagnosis. She has authored more than 10 research papers.

Funding

This research was funded by the National Key Research and Development Project of China under Grant 2019YFC0409105, by the National Natural Science Foundation of China under Grant 61801190, by the Nature Science Foundation of Jilin Province under Grant 20180101055JC, by the Industrial Technology Research and Development Funds of Jilin Province under Grant 2019C054-3, and by the “Thirteenth Five-Year Plan” Scientific Research Planning Project of the Education Department of Jilin Province (JKH20200678KJ, JJKH20200997KJ).

Contributions

KG, XFL, THF and XHH conceived and designed the study. KG and XHH made a formal analysis of data. KG, XFL and THF verified the design. THF and XHH were responsible for collecting data. KG was responsible for software implementation. KG and XHH analyzed and interpreted the data and drafted the manuscript. KG, XFL and THF performed critical revision of the manuscript. KG and THF were responsible for the visualization of the model. THF supervised the study. THF, XFL and XHH provided funding support for this research. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Tiehu Fan.

Ethics declarations

Ethics approval and consent to participate

The experimental protocol was established according to the ethical guidelines of the Helsinki Declaration and was approved by the Human Ethics Committee of the First Hospital of Jilin University. Informed consent was obtained from all participants.

Consent for publication

Consent for publication was obtained from every individual whose data are included in this manuscript.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Guo, K., Li, X., Fan, T. et al. VANet: a medical image fusion model based on attention mechanism to assist disease diagnosis. BMC Bioinformatics 23, 548 (2022). https://doi.org/10.1186/s12859-022-05072-4

Keywords

  • Medical image
  • Medical image fusion
  • Attention mechanism
  • Contextual information
  • Multi scale feature extraction