
Hybrid generative adversarial network based on frequency and spatial domain for histopathological image synthesis

Abstract

Background

Due to the complexity and cost of preparing histopathological slides, deep learning-based methods have been developed to generate high-quality histological images. However, existing approaches primarily focus on spatial domain information, neglecting the periodic information in the frequency domain and the complementary relationship between the two domains. In this paper, we propose a generative adversarial network that employs a cross-attention mechanism to extract and fuse features across spatial and frequency domains. The method optimizes frequency domain features using spatial domain guidance and refines spatial features with frequency domain information, preserving key details while eliminating redundancy to generate high-quality histological images.

Results

Our model incorporates a variable-window mixed attention module to dynamically adjust attention window sizes, capturing both local details and global context. A spectral filtering module enhances the extraction of repetitive textures and periodic structures, while a cross-attention fusion module dynamically weights features from both domains, focusing on the most critical information to produce realistic and detailed images.

Conclusions

The proposed method achieves efficient spatial-frequency domain fusion, significantly improving image generation quality. Experiments on the Patch Camelyon dataset show superior performance over eight state-of-the-art models across five metrics. This approach advances automated histopathological image generation with potential for clinical applications.


Introduction

Histopathology is a specialized field in medicine that primarily studies microscopic structural changes in tissues under pathological conditions. By observing diseased tissues, it helps experts diagnose diseases, determine the nature of the lesions, and formulate treatment plans. Histopathologists usually use microscopes to examine tissue samples taken from patients. Through a series of complex preprocessing steps [1] such as fixation, dehydration, paraffin embedding, staining, and sectioning, thin tissue slides are ultimately prepared for microscopic observation. However, obtaining tissue slides is a complex process involving many factors. Firstly, the preparation of tissue slide samples is very intricate. Each step, including fixation, dehydration, paraffin embedding, and staining, requires precise control. Any operational errors can lead to tissue degradation, shrinkage, deformation, damage, and reduced cell structure recognizability due to improper staining [2]. Secondly, the availability of tissue samples for slide preparation is very limited, especially for precise biomarker studies and rare disease research [3], which restricts histopathologists’ ability to study rare diseases. Lastly, obtaining high-resolution histological images requires not only high-quality tissue slides but also advanced imaging equipment [4], which entails high costs.

In recent years, deep learning-based generative methods have effectively addressed these challenges. These models use data-driven methods to generate high-quality tissue slide images and are particularly valuable for expanding datasets of rare diseases. They not only increase the number of images available for research and education but also enhance pathologists' understanding of diagnosis and treatment strategies for these diseases.

Generative adversarial network (GAN) methods [5] have been widely used for histological slide image generation. Dolezal et al. [6] proposed a conditional GAN-based generative model to synthesize realistic, category-specific histological images to improve the interpretability of classifiers, enhancing the accuracy of pathologists in classifying rare tumor subtypes. Levine et al. [7] utilized GANs to generate high-resolution pathological images of 10 cancer tissue types, supplementing small training sets for cancer image classification, and validated the effectiveness of synthetic datasets for cancer classification diagnosis. Although these methods have enriched the diversity of tissue slide data, they still face limitations in the reliability and variety of generated images, such as mode collapse and the occurrence of blurriness and artifacts in the generated images.

Methods based on Variational Auto-Encoders (VAEs) [8] and diffusion models [9] have achieved remarkable progress. For example, Guleria et al. [10] proposed a histological image reconstruction method using variational auto-encoders and denoising variational auto-encoders, which significantly improved the classification accuracy of convolutional neural network-based classifiers. Moghadam et al. [11] introduced a diffusion probabilistic model incorporating prioritized morphology weighting and color normalization to synthesize high-quality neuropathological images of brain cancer. Shrivastava et al. [12] employed a conditional diffusion model to develop a method for synthesizing high-quality, realistic tissue samples with precise nuclear localization based on semantic instance masks of different nuclear types.

In this paper, unlike previous deep learning methods that focus solely on the spatial domain, we propose an unconditional generative adversarial network based on the fusion of spatial and frequency domain features. We designed a novel window attention mechanism [13] to capture the characteristics of images in the spatial domain and adopted a feature extraction module based on Fast Fourier transformation to capture the periodic information and intensity features in the frequency domain. However, simply concatenating spatial and frequency domain features fails to effectively capture critical information in the image. Therefore, we propose a novel cross-attention feature fusion module, which initially fuses features from both domains and then applies cross-attention mechanisms between the fused features, spatial domain features, and frequency domain features. This allows each domain to not only retain its own important information but also be guided by the attention from the other domain to optimize feature extraction. Through this cross-attention mechanism, the model can focus on the most relevant features, eliminate redundant information, and enhance the expressive power of the features. This approach significantly improves the clarity and detail of the generated images.

The proposed method achieves promising results; Fig. 1 provides an example. Our approach performs better in both detail and overall structure, overcoming the limitations of conventional techniques in generating high-quality images. The main contributions of this paper are as follows:

  • We propose a feature fusion method that combines spatial and frequency domain information. By simultaneously leveraging the advantages of both domains, it complements their roles in the feature extraction modules and is guided by a cross-attention mechanism, enabling the generation of high-quality pathological tissue slide images.

  • We propose a hybrid feature extraction module based on the local window attention mechanism and the multi-scale window attention mechanism. The local window attention mechanism captures subtle lesions or cellular structure details in tissue slide images, while the multi-scale window attention mechanism dynamically adjusts the focus area size according to the processed image content. This allows the network to expand its focus to capture the global context when necessary or narrow its view to capture fine features. By combining these two mechanisms, our model achieves a comprehensive observation from microscopic details to macroscopic structures, significantly enhancing the overall quality of the generated images.

  • We propose a frequency domain feature extraction module based on Fourier Transform [14]. This module converts data from the spatial domain to the frequency domain, effectively capturing and analyzing the periodic and structural features in the images. This process enhances the network's understanding of fine textures and global information in the images, significantly improving the quality and realism of the generated images.

  • We employ five evaluation metrics, covering both generated image quality and no-reference image quality assessment, to conduct a multi-angle, comprehensive evaluation of the generated histological slide images, ensuring the comprehensiveness and reliability of the evaluation results.

Fig. 1. Comparison of real and generated images: (a) real histological slide images; (b) images generated by our method. The generated tissue slide images are very similar to the real images in terms of color, detail, and cellular structure

The structure of this paper is as follows: the first section is the introduction, which elaborates on the current research status, research problems, and the bottlenecks and difficulties encountered in current research. The second section provides a detailed explanation of the latest methods for generating histological slide images. The third section explains in detail the various modules of the proposed method, including the overall architecture of the generator, the variable window mixing attention module, the spectral filtering module, the cross-attention fusion module, and the loss function. The fourth section describes the experimental settings, compares the proposed method against eight methods on three image generation quality metrics and two no-reference image quality assessment metrics, and presents ablation studies verifying the effectiveness of each module. The fifth section summarizes the paper based on the results, discusses the advantages and limitations of the proposed method, and outlines future research directions. Overall, this paper introduces an efficient approach for generating tissue slide images, further enhancing the quality, effectiveness, and diversity of the generated images and making them more suitable for the practical needs of medical research and diagnostic applications.

Related work

Tissue slide images typically require high resolution and detail fidelity to accurately reflect minute tissue structures and pathological features. Additionally, the variety of staining agents used in tissue slide images leads to significant differences in staining effects, and the diversity of tissue cell types and structures adds to the challenge of generating high-quality images. Currently, there are three categories of deep learning-based methods for generating tissue slide images: GAN-based methods, diffusion-based methods, and VAE-based methods; in addition, fusion strategies can be applied across these methods. GAN-based methods generate high-quality tissue slide images through adversarial training between a generator and a discriminator. Diffusion model-based methods iteratively denoise and reconstruct noisy images, gradually restoring high-quality, high-detail-fidelity tissue slide images over multiple iterative steps. VAE-based methods encode input tissue slide images into a latent space probability distribution, decode this distribution back into the original image, and then sample new latent variables from the latent space, which the decoder turns into new tissue slide images. Fusion strategies integrate features from different methods or domains, leveraging their complementary strengths to generate images with enhanced resolution, detail fidelity, and more accurate structures. In the following sections, we provide a comprehensive overview of these three generative methods and fusion strategies, summarize the recent literature, and analyze each method.

Methods based on generative adversarial networks

Methods based on GANs consist of a generator and a discriminator. The generator creates histological slide images from random noise, while the discriminator distinguishes between real histological slide images and generated ones. Through dynamic adversarial training, the generator continuously improves itself to produce high-quality images, and the discriminator enhances its ability to distinguish between real and generated images. These methods can generate histological slide images with rich details and natural textures, exhibiting strong adaptability and diversity. Recently, Xue et al. [15] proposed a conditional GAN utilizing class labels and designed a synthetic enhancement framework. By comparing the confidence levels of assigned labels to the feature similarity of actual labeled images, they selected synthetic image patches to generate synthetic images with high fidelity and diversity. To address the issue of class imbalance in the breast cancer histopathology image dataset, Saini et al. [16] proposed a novel network architecture using DCGAN [17] and VGG16 [18], using a transfer learning training strategy to enhance tissue slide datasets at different magnifications. Xue et al. [19] proposed a filtering mechanism to control the quality of selected synthetic image features and used a conditional GAN for data augmentation, improving the accuracy of classification models for cervical histopathology images. Since supervised deep learning methods are sensitive to domain shifts when dealing with tissue slides stained in different ways, their application is somewhat limited. Moreover, obtaining samples of images stained differently is both expensive and time-consuming. To address this, Vasiljević et al. [20] proposed a method for enhancing unsupervised image-to-image translation to generate datasets of the same tissue slides stained with different agents for image enhancement. Their analysis demonstrated the stain invariance of the generated images. Jiang et al. [21] introduced a multi-scale gradient GAN to synthesize rectal cancer tissue slide images. They utilized a pre-trained model to select fake images with high class probabilities to add to the training set, improving the performance of the classification model. However, GAN-based methods for generating medical tissue slides face a series of issues, including unstable training, mode collapse, and insufficient image details, limiting their widespread adoption and effectiveness in practical applications.

Methods based on diffusion models

Methods utilizing diffusion models for generating tissue slide images involve a step-by-step denoising and reconstruction process, gradually generating high-quality images from noise. These methods can produce tissue slide images with high resolution and detail fidelity, and they demonstrate strong robustness to variations in staining methods and tissue types. Additionally, these methods can capture complex structures and diversity within the images, enhancing the reality and natural appearance of the generated images. Recently, Harb et al. [22] proposed a diffusion model approach for constructing histological slides from low resolution to high resolution, achieving gigapixel-scale whole slide image (WSI) generation. The results demonstrated that the generated WSI images closely match the structural features of actual WSI images. To address the impact of data imbalance on the performance of histopathology image classification, Guan et al. [23] introduced a post-discriminator mechanism-based diffusion model method for generating histopathology images. This method ensures the quality of augmented images by filtering synthetic images, thereby preventing poor-quality synthetic images from degrading subsequent classification performance and providing quality assurance for data augmentation. Zeng et al. [24] introduced a label diffusion graph learning method to enhance the recognition of breast cancer histology images using semi-supervised learning on small, labeled datasets. Yang et al. [25] proposed a dual-semantic diffusion model for generating high-quality and semantically related dynamic cellular imaging (DCI) images. This method combines semantic masks and reference images to generate DCI images that closely resemble real images, achieving higher accuracy in downstream segmentation tasks. Aversa et al. [26] proposed a hierarchical diffusion model that synthesizes segmentation masks to serve as conditions for generating tissue slide images, allowing the generation of tissue slides at any desired size and evaluating the plausibility of the generated data in downstream segmentation and classification tasks. However, the computational complexity of using diffusion models to generate tissue slide images is very high. The training process is time-consuming and requires significant computational resources. Additionally, tuning and training diffusion models is complex, often necessitating numerous experiments and expert knowledge to achieve optimal generation results.

Methods based on variational auto-encoders

Methods based on VAEs generate tissue slide images by encoding the input tissue slide images into a latent space probability distribution and then decoding from that distribution back to the original images, thereby generating new tissue slide images. This approach can produce high-quality and diverse tissue slide images with high stability during training. Recently, Shwetha et al. [27] proposed a Gram-stained culture image generation network based on a vector quantized VAE combined with quality loss, achieving enhancement of the dataset. Lutnick et al. [28] introduced a VAE image generation method for augmenting mouse glomeruli tissue slide image datasets and used nonlinear dimensionality reduction to map these data to human glomeruli slide datasets, enhancing the human glomeruli slide training set. Tellez et al. [29] proposed a neural image compression method that converts gigapixel images into highly compact representations, improving convolutional neural network performance in label prediction for such images by transforming low-level pixel space information into higher-level latent space representations, thus enhancing model efficiency. To achieve color normalization for hematoxylin and eosin-stained tissue slide images, Zanjani et al. [30] proposed a VAE-based staining normalization method. This method performs color normalization on tissue slide images without requiring any data labels or assumptions, and the generated images outperform state-of-the-art methods in color constancy measurements. However, compared to other generative models, the quality of VAE-generated slide images is often lower, with less clear detail representation, which may result in blurriness and distortion. The continuous nature of the latent space can result in overly smooth generated images, lacking necessary variations and details, which is a significant limitation for medical images that require high precision.

Fusion strategies for multi-domain feature integration

In recent years, fusion strategies have made significant progress in the field of image generation. By integrating information from different sources or feature domains, fusion methods can leverage the strengths of each, thereby improving the quality and accuracy of image generation. For example, Li et al. [31] proposed a Laplacian redecomposition framework that effectively addresses issues such as color distortion, blurring, and noise in multimodal medical image fusion by fusing redundant and complementary information. Han et al. [32] proposed an edge-guided adversarial network framework, which optimizes image content and structural information through an edge fidelity constraint, significantly improving remote sensing image compression quality. Additionally, Han et al. [33] introduced a progressive feature interleaved fusion framework that combines convolutional neural networks and transformers to enhance the accuracy of salient object detection in complex backgrounds. Zhang et al. [34] proposed a transformer-based conditional generative adversarial network framework, which uses adversarial training and multi-scale fusion modules to address the issues of long-distance dependencies and prior knowledge integration, significantly improving image fusion results. These studies demonstrate the wide application and effectiveness of fusion methods in image generation.

However, despite the significant progress made by existing fusion methods in various fields, generating tissue slide images still faces several challenges. In tissue slide image generation, the images contain complex details and structures, particularly the subtle differences between different types of tissue and complex boundaries, making it difficult for existing fusion strategies to simultaneously maintain global consistency and local fine structures.

Proposed method

In this section, we provide a detailed explanation of the five key components of the proposed method: the generator architecture, the variable window mixing attention module, the spectral filtering module, the cross-attention fusion module, and the loss function. Each part is further elaborated, with emphasis on its specific function and the critical role it plays in generating high-quality images.

Generator architecture

We propose a generator architecture based on StyleGAN [35], as illustrated in Fig. 2. This architecture consists of two main modules: the style generation network and the synthesis network. In the style generation network module, the input latent vector is mapped to an intermediate latent space by the mapping network and then injected into various layers of the synthesis network through the style modulation network to control the style attributes of the generated image. The synthesis network module generates images layer by layer from low resolution to high resolution. The implementation of these two modules is described in detail below.

Fig. 2. The generator structure consists of two parts: the mapping network and the synthesis network. The mapping network is responsible for generating style vectors and injecting them into various layers of the synthesis network to control the style attributes of the generated image. The synthesis network is tasked with generating tissue slide images layer by layer from low resolution to high resolution

We first input a \(512 \times 1\) dimensional latent code into a mapping network composed of eight fully connected layers and obtain an output \(y\) of the same dimension as shown in Eq. (1), where \(f_{{{\text{map}}}} ( \cdot )\) represents the mapping network. The aim of this operation is to allow the generator to better disentangle features and enrich the diversity and details of the generated images. Through this mapping, the network can learn a more disentangled latent representation, thereby improving the quality of the generated images and allowing for finer control over different features of the image.

$$y = f_{{{\text{map}}}} (z).$$
(1)
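
As an illustration, a minimal PyTorch sketch of such an eight-layer mapping network is given below. The 512-dimensional width follows the latent code size stated above; the LeakyReLU activation and its slope are assumptions in the spirit of StyleGAN-style mapping networks rather than details specified in the text.

```python
import torch
import torch.nn as nn

class MappingNetwork(nn.Module):
    """Maps a 512-dimensional latent code z to an intermediate latent y of the
    same dimension through eight fully connected layers, as in Eq. (1).
    The LeakyReLU activation is an assumption, not a detail given in the text."""
    def __init__(self, dim: int = 512, num_layers: int = 8):
        super().__init__()
        layers = []
        for _ in range(num_layers):
            layers += [nn.Linear(dim, dim), nn.LeakyReLU(0.2)]
        self.net = nn.Sequential(*layers)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.net(z)  # y = f_map(z)

# usage: y = MappingNetwork()(torch.randn(4, 512))
```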

Subsequently, the style vector is injected into feature maps at different resolutions to modulate the style of each layer’s feature map. This operation is implemented using Adaptive Instance Normalization (AdaIN), as described below:

$${\text{AdaIN}}\;(X_{i} ,y_{i} ) = y_{s,i} \left( {\frac{{X_{i} - \mu (X_{i} )}}{{\sigma (X_{i} )}}} \right) + y_{b,i} ,$$
(2)

In this process, \(X_{i}\) represents the feature map of the \(i\)-th synthesis block, and \(y_{i}\) is the style vector for the \(i\)-th layer, which is derived from the mapped latent vector \(y\). The terms \(\mu(\cdot)\) and \(\sigma(\cdot)\) denote the mean and standard deviation of the feature map, respectively, while \(y_{s,i}\) and \(y_{b,i}\) represent the scaling factor and bias of the feature map at the \(i\)-th block, which are learned separately by two distinct fully connected networks. To ensure that noise only affects subtle variations in the image style, scaled noise is added to each channel before the \({\text{AdaIN}}\) module. This slight alteration in the visual expression of features at different resolution levels often results in generated images that are more realistic and diverse.
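
The following is a minimal PyTorch sketch of the AdaIN operation in Eq. (2). The two fully connected layers producing the scaling factor \(y_{s,i}\) and bias \(y_{b,i}\) follow the description above; the small epsilon added to the standard deviation is an implementation assumption for numerical stability.

```python
import torch
import torch.nn as nn

class AdaIN(nn.Module):
    """Adaptive Instance Normalization as in Eq. (2).

    The scale y_s and bias y_b are produced from the mapped latent vector y
    by two separate fully connected layers, as described in the text."""
    def __init__(self, latent_dim: int, num_channels: int):
        super().__init__()
        self.to_scale = nn.Linear(latent_dim, num_channels)
        self.to_bias = nn.Linear(latent_dim, num_channels)

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) feature map of the i-th synthesis block; y: (B, latent_dim)
        mu = x.mean(dim=(2, 3), keepdim=True)
        sigma = x.std(dim=(2, 3), keepdim=True) + 1e-8  # epsilon: assumed for stability
        y_s = self.to_scale(y).unsqueeze(-1).unsqueeze(-1)
        y_b = self.to_bias(y).unsqueeze(-1).unsqueeze(-1)
        return y_s * (x - mu) / sigma + y_b
```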

Variable window mixing attention module

At each resolution stage, we propose a feature extraction strategy using local window attention and varied-size window attention. In generating high-resolution images, different parts of the image often have significant correlations, and using a self-attention mechanism can help capture and maintain critical long-distance dependencies, thus producing more realistic and consistent images. For example, in tissue slides, the positions of nuclei and cytoplasm are closely related, and the transition areas between tumor cells and normal cells also represent important long-distance information. However, feature extraction modules based on self-attention mechanisms tend to have high computational complexity [13], namely \(O(n^{2})\), where \(n\) is the spatial dimension of the input feature map. Liu et al. [36] proposed a local window attention mechanism that effectively reduces computational complexity, but it has certain limitations, such as its inability to effectively capture long-distance dependencies in images. Therefore, we propose a hybrid attention module based on both local fixed window attention and variable window attention mechanisms, which reduces computational complexity and effectively captures long-distance dependencies in images.

First, we use \(X_{w}\) and \(X_{v}\) to represent image blocks divided under fixed and variable window partitions, respectively, with \(X_{w}, X_{v} \in R^{\frac{HW}{k_{i}^{2}} \times k_{i} \times k_{i} \times C}\). In this context, \(k_{i}\) represents the size of the window, while \(H\), \(W\), and \(C\) represent the height, width, and number of channels of the feature map, respectively. For \(X_{w}\), \(k_{i}\) is a fixed value; for \(X_{v}\), \(k_{i}\) is a variable value. Subsequently, we calculate the attention heads for each method using two different attention mechanisms, as shown in Eq. (3).

$$head_{i} = \begin{cases} {\text{LWA}}(X_{w} W_{i}^{Q}, X_{w} W_{i}^{K}, X_{w} W_{i}^{V}), & i \le \left\lfloor \frac{h}{2} \right\rfloor \\ {\text{VSA}}(X_{v} W_{i}^{Q}, X_{v} W_{i}^{K}, X_{v} W_{i}^{V}), & i > \left\lfloor \frac{h}{2} \right\rfloor \end{cases},$$
(3)

Here, \(W^{Q}\), \(W^{K}\), and \(W^{V}\) represent the weight matrices for queries, keys, and values in the self-attention mechanism, each belonging to \(R^{C \times (C/h)}\), where \(h\) is the number of attention heads. \({\text{LWA}}\;( \cdot )\) and \({\text{VSA}}\;( \cdot )\) stand for local window attention and varied-size window attention, respectively. Subsequently, we propose a hybrid attention mechanism that enables a single transformer module to capture contextual information from both local and varied-sized windows, thus generating the final output \(X_{1}\) as shown in Eq. (4):

$$X_{1} = {\text{Concat}}\;(head_{1} ,...,head_{h} )\;W^{O} .$$
(4)

Here, \(W^{O} \in R^{C \times C}\) represents the projection matrix used to obtain the output from the hybrid attention heads. The computation method for varied-size window self-attention differs from that of fixed window self-attention, with the specific implementation process illustrated in Fig. 3. For a given input, the image is first divided into multiple windows \(X_{\nu }\) according to a predefined window size \(k\), and the query features \(Q_{\nu }\) are obtained based on a linear transformation, as shown in the following equation:

$$Q_{\nu } = {\text{Linear}}\;(X_{\nu } ).$$
(5)
Fig. 3. Varied-size window attention mechanism. First, images are divided into default windows, then simplified through pooling and transformed into query vectors (\(Q_{v}\)), keys (\(K\)), and values (\(V\)). The varied-size window regression (VSR) module adjusts the size and position of the windows to create target windows; the keys and values from the target windows are then used to calculate the attention weights and generate the attention map

Subsequently, to estimate the size and location of the target window corresponding to each predefined window, we employ a varied-size window regression module. This module applies, in sequence, an average pooling layer based on the default window size and stride, a Leaky ReLU [37] activation layer, and a 1 × 1 convolutional layer with a stride of 1, as illustrated in Eq. (6):

$$S_{\nu } ,O_{\nu } = {\text{Conv}} \circ {\text{Leaky}}\;{\text{ReLU}} \circ {\text{Average}}\;{\text{Pool}}(X_{\nu } ),$$
(6)

Here, \(S_{\nu } ,O_{\nu } \in R^{2 \times h/2}\) represent the estimated scale ratios and offsets in the horizontal and vertical directions, respectively. Subsequently, we obtain the global key and value from the input feature map \(X\), as shown in Eq. (7):

$$K,V = {\text{Reshape}} \circ {\text{Linear}}\;(X),$$
(7)

Here, the key and value \(K\) and \(V\) belong to \(R^{H \times W \times C}\). Finally, the varied-size window attention module uniformly samples \(M\) features from each window of different sizes on \(K\) and \(V\), obtaining the window-specific keys and values, denoted as \(K_{v}\) and \(V_{v}\) respectively, which are used together to calculate the attention weights. The structure of the variable window mixing attention module is shown in Fig. 4.

Fig. 4. Structure of the variable window mixing attention module. The input feature map X is divided into two parts, each processed by half of the attention heads. The left part, Xv, uses a varied-size window attention mechanism to dynamically segment the image, while the right part, Xw, uses a local window attention mechanism to extract local features. Both branches compute attention adapted to object size and local details, and their combined outputs produce the final feature map

Spectral filtering module

In the process of generating histological images, certain structural patterns, such as cell textures, symmetrical structures, and multi-scale patterns, exhibit periodic or repetitive features. The spectral filtering module effectively captures these periodic features, enhancing the model's ability to understand the global structure of the image; the structure of the module is shown in Fig. 5. First, spatial domain features are transformed into the frequency domain through a Fast Fourier Transform [38]. Subsequently, a learnable filter is used to select frequency domain features, automatically identifying and emphasizing important information by assigning it higher weights, while assigning lower weights to, or appropriately suppressing, less important or redundant frequency domain information. Finally, the processed frequency domain data is converted back to the spatial domain through an inverse Fast Fourier Transform, ensuring that the final output maintains the integrity and practicality of the image content. The transform is given in Eq. (8):

$$X^{\prime} = {\mathcal{F}}[X] \in C^{H \times W \times C} ,$$
(8)
Fig. 5. Structure of the spectral filtering module. The input feature map X, after feature extraction via a convolutional layer, is transformed to the frequency domain through the Fast Fourier Transform (FFT), where it is element-wise multiplied with a learnable global filter. Then, the data is converted back to the spatial domain using the Inverse Fast Fourier Transform (IFFT). Finally, through layer normalization, the training process is stabilized, producing the final output

Here, \({\mathcal{F}}[ \cdot ]\) represents the Fast Fourier Transform, and \(X\) denotes the feature map input to this layer, with dimension \(R^{H \times W \times C}\). Subsequently, a global filtering operation is applied to the spectrum \(X^{\prime}\), resulting in the modulated spectrum \(\tilde{X}\), as shown in Eq. (9):

$$\tilde{X} = G \odot X^{\prime},$$
(9)

where \(\odot\) denotes element-wise multiplication and \(G\) is a learnable global filter. Finally, \(\tilde{X}\) is subjected to an inverse Fast Fourier Transform to revert it back to the spatial domain, as shown in Eq. (10):

$$X_{2} \leftarrow {\mathcal{F}}^{ - 1} [\tilde{X}],$$
(10)

Here, \({\mathcal{F}}^{ - 1} [ \cdot ]\) represents the inverse Fast Fourier Transform, and \(X_{2}\) denotes the output of the spectral filtering module.
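
A minimal PyTorch sketch of Eqs. (8)-(10) is shown below, assuming a fixed feature-map size so that the learnable global filter \(G\) can be stored as one complex weight per channel and frequency bin. The surrounding layer normalization follows the description of Fig. 5; the filter initialization scale is an assumption.

```python
import torch
import torch.nn as nn

class SpectralFilteringModule(nn.Module):
    """Spectral filtering as in Eqs. (8)-(10): FFT -> learnable global filter -> IFFT.

    The filter G is stored as a real/imaginary parameter pair; the fixed (H, W)
    and the initialization scale are assumptions."""
    def __init__(self, channels: int, height: int, width: int):
        super().__init__()
        # One complex weight per channel and frequency bin (real and imaginary parts)
        self.filter = nn.Parameter(torch.randn(channels, height, width, 2) * 0.02)
        self.norm = nn.LayerNorm([channels, height, width])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Eq. (8): X' = F[X], with x of shape (B, C, H, W)
        x_freq = torch.fft.fft2(x, dim=(-2, -1))
        g = torch.view_as_complex(self.filter)
        # Eq. (9): element-wise modulation with the learnable global filter
        x_freq = x_freq * g
        # Eq. (10): back to the spatial domain; keep the real part
        x_spatial = torch.fft.ifft2(x_freq, dim=(-2, -1)).real
        return self.norm(x_spatial)
```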

Cross-attention fusion module

When handling complex spatial structures, using the variable window mixing attention module alone may not achieve the direct efficiency of convolutional neural networks, and relying solely on the spectral filtering module might not effectively capture long-distance dependencies. Therefore, we propose a cross-attention fusion module that combines the advantages of attention mechanisms and frequency domain filtering. Unlike simple concatenation fusion methods, which often lead to information redundancy by directly connecting features from both domains and potentially retaining repetitive or irrelevant information, our proposed fusion method first performs an initial fusion of the dual-domain features, combining spatial and frequency domain features to ensure the preliminary integration of important information. Next, we apply a cross-attention mechanism to further compute and optimize the information from both domains and each single domain. Specifically, the cross-attention mechanism allows each domain to be guided by the information from the other domain, thereby enhancing feature extraction within the single domain and integrating key information from the other domain. This enables the model to focus more precisely on the important features in each domain, avoid introducing redundant information, and improve the effectiveness and accuracy of feature extraction. Finally, the information from the two single domains is fused to generate higher-quality images. This mechanism not only avoids the noise and redundant information that may be introduced by traditional concatenation methods but also fully leverages the complementary strengths of spatial and frequency domain features, leading to more precise feature fusion and significantly improving the overall model performance. As shown in Fig. 6, each layer's input first passes through the variable window mixing attention to obtain an intermediate output \(X_{1}\), and then through the spectral filtering module to obtain an intermediate output \(X_{2}\). Subsequently, we use cross-attention mechanisms and residual connections to guide and learn from these two types of features, thereby obtaining a comprehensive output that includes multi-domain information. Such a design not only enhances the depth and breadth of feature processing but also optimizes the overall performance of the model. The specific equations are as follows:

$$K_{j}^{\prime } = {\text{Reshape}}\;({\text{Conv}}_{K}^{j} (X_{j} )),$$
(11)
$$V_{j}^{\prime } = {\text{Reshape}}\;({\text{Conv}}_{V}^{j} (X_{j} )),$$
(12)
Fig. 6. This diagram depicts the synthesis block with the cross-attention fusion module. Input data is processed through a Variable Window Mixing Attention Block (VWMA) and a Convolutional Block (Conv), generating features that are refined by a Spectral Filtering Module (SFM) to create attention maps. These maps combine with values to form weighted features, which are merged and enhanced in a fusion layer to produce the final output, optimizing the efficiency of the model's attention mechanism

Here, \(j = 1,2\); \(K^{\prime}_{j} \in R^{H \times W \times C}\) represents the key, \(V^{\prime}_{j} \in R^{H \times W \times C}\) represents the value, \({\text{Conv}}( \cdot )\) represents a convolutional layer with a 3 × 3 kernel size, and \({\text{Reshape}}( \cdot )\) represents the reshape operation. Subsequently, by computing the query and obtaining the attention scores, combined with residual connections, the final output \(X_{fused}\) is produced, as shown in the following equations:

$$Q^{\prime} = {\text{Reshape}}\;({\text{Conv}}\;({\text{Concat}}\;(X_{1} ,X_{2} ))),$$
(13)
$$A_{j}^{\prime } = {\text{Softmax}}\;(Q^{\prime} K_{j}^{\prime T} ),$$
(14)
$$X_{fused} = {\text{Conv}}\;({\text{Concat}}\;(X_{2} \; \oplus \;{\text{Reshape}}\;(A_{1}^{\prime } V_{1}^{\prime } ),\,\;X_{1} \; \oplus \;{\text{Reshape}}\;(A_{2}^{\prime } V_{2}^{\prime } ))).$$
(15)

Here, \(Q^{\prime} \in R^{H \times W \times C}\) represents the domain-invariant query. Because both frequency domain and spatial domain features are integrated to produce this query, the model can fully exploit the complementary attributes across the two domains. \(A^{\prime}_{j} \in R^{HW \times HW}\) represents the attention map of each domain; \(\oplus\) indicates element-wise addition of matrices; and \(X_{fused}\) denotes the output at the resolution of that layer.
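
The sketch below illustrates Eqs. (11)-(15) in PyTorch under stated assumptions: the 3 × 3 convolutions for keys and values follow the text, while the convolution producing the domain-invariant query, the \(1/\sqrt{C}\) scaling of the attention logits, and the final fusion convolution are illustrative choices not fixed by the equations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossAttentionFusion(nn.Module):
    """Cross-attention fusion of spatial features X1 and frequency features X2,
    following Eqs. (11)-(15). Kernel sizes for keys/values come from the text;
    the query/fusion convolutions and logit scaling are assumptions."""
    def __init__(self, channels: int):
        super().__init__()
        self.to_q = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)
        self.to_k = nn.ModuleList([nn.Conv2d(channels, channels, 3, padding=1) for _ in range(2)])
        self.to_v = nn.ModuleList([nn.Conv2d(channels, channels, 3, padding=1) for _ in range(2)])
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)

    def forward(self, x1: torch.Tensor, x2: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x1.shape
        # Eq. (13): domain-invariant query from the concatenated features
        q = self.to_q(torch.cat([x1, x2], dim=1)).flatten(2).transpose(1, 2)  # (B, HW, C)
        outs = []
        for j, xj in enumerate([x1, x2]):
            k = self.to_k[j](xj).flatten(2)                   # (B, C, HW)  Eq. (11)
            v = self.to_v[j](xj).flatten(2).transpose(1, 2)   # (B, HW, C)  Eq. (12)
            attn = F.softmax(q @ k / c ** 0.5, dim=-1)        # (B, HW, HW) Eq. (14)
            outs.append((attn @ v).transpose(1, 2).reshape(b, c, h, w))
        # Eq. (15): residual addition with the opposite domain, then fusion
        return self.fuse(torch.cat([x2 + outs[0], x1 + outs[1]], dim=1))
```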

Loss function

The network is trained with a non-saturating logistic GAN loss, and an R1 gradient penalty term is applied to suppress mode collapse and encourage the generator to learn more realistic image features. The loss functions are expressed as follows:

$${\mathcal{L}}_{D} = - {\mathbb{E}}_{{x\sim P_{x} }} \left[ {log\;(D(x))} \right] - {\mathbb{E}}_{{z\sim P_{z} }} \left[ {log(1 - D\;(G(z)))} \right] + \gamma \cdot {\mathbb{E}}_{{x\sim P_{x} }} \left[ {\left\| {\nabla_{x} D\;(x)} \right\|_{2}^{2} } \right],$$
(16)
$${\mathcal{L}}_{G} = - {\mathbb{E}}_{{z\sim P_{z} }} [log\;(D\;(\;G(z)))].$$
(17)

For the discriminator loss \({\mathcal{L}}_{D}\), \({\mathbb{E}}_{{x\sim P_{x} }} [log\;(D\;(x))]\) represents the expected output of the discriminator for real data; the discriminator aims for \(D\;(x)\) to be as close to 1 as possible. \({\mathbb{E}}_{{z\sim P_{z} }} [log\;(1 - D\;(G(z)))]\) represents the expected output of the discriminator for generated data; the discriminator aims for \(D\;(G\;(z))\) to be as close to 0 as possible. \(\gamma \cdot {\mathbb{E}}_{{x\sim P_{x} }} [\left\| {\nabla_{x} D(x)} \right\|_{2}^{2} ]\) is the gradient penalty on the discriminator, used to prevent overfitting and increase the diversity of generated tissue slide images, where \(\gamma\) is a hyperparameter and \(\left\| {\nabla_{x} D(x)} \right\|_{2}^{2}\) is the squared norm of the discriminator's gradient with respect to real data \(x\). For the generator loss \({\mathcal{L}}_{G}\), \({\mathbb{E}}_{{z\sim P_{z} }} [log(D(G(z)))]\) represents the expected discriminator output for generated data; the generator aims for \(D(G(z))\) to be as close to 1 as possible.
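
A minimal sketch of Eqs. (16) and (17) is given below. The discriminator is assumed to output raw logits, so the logistic terms are computed with softplus for numerical stability, and the R1 weight gamma = 10 is an illustrative value rather than one stated in the text.

```python
import torch
import torch.nn.functional as F

def d_loss(discriminator, real_images, fake_images, gamma: float = 10.0):
    """Discriminator loss of Eq. (16): non-saturating logistic loss plus an R1
    gradient penalty on real images. gamma = 10 is illustrative, not from the text."""
    real_images = real_images.detach().requires_grad_(True)
    real_logits = discriminator(real_images)
    fake_logits = discriminator(fake_images.detach())
    # -log D(x) - log(1 - D(G(z))), with D producing logits
    adv = F.softplus(-real_logits).mean() + F.softplus(fake_logits).mean()
    # R1 penalty: squared norm of the gradient of D at real samples
    grad_real, = torch.autograd.grad(real_logits.sum(), real_images, create_graph=True)
    r1 = grad_real.pow(2).flatten(1).sum(1).mean()
    return adv + gamma * r1

def g_loss(discriminator, fake_images):
    """Generator loss of Eq. (17): -E[log D(G(z))]."""
    return F.softplus(-discriminator(fake_images)).mean()
```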

Network structure

The detailed configuration of the generator is given in Table 1. From 4 × 4 to 32 × 32 resolution, the model generates images using the cross-attention fusion module with 512 channels, window sizes of 4 × 4 and 8 × 8, and 16 attention heads, keeping the output at 512 channels. Upsampling is performed at every resolution stage while the channel count is kept constant. At the 64 × 64 stage, the cross-attention fusion module is halved to 256 channels, with a window size of 8 × 8 and 256 output channels; after upsampling, the channels are further reduced to 128. From the 128 × 128 to the 512 × 512 stage, the model uses only the variable window mixing attention module, with channel counts decreasing from 128 to 32 as the resolution grows, while the window size and the number of attention heads remain 8 × 8 and 4, respectively. The channels are reduced further with each successive upsampling step. We also conducted an efficiency analysis of the model components, as shown in Table 2.

Table 1 The detailed structure of generator
Table 2 The model efficiency analysis

Experiments

Experiment settings

In this section, we provide a detailed explanation of our experimental setup. Specifically, we comprehensively validated our method on the PCAM200 dataset using three image generation quality evaluation metrics and two no-reference image quality assessment metrics. Additionally, we compared the proposed method with eight other tissue slide image generation methods. Below, we provide a detailed introduction to the dataset used in this study, the comparison methods selected, and the evaluation metrics used to assess the quality of the generated images.

Dataset

The PCAM 200 dataset [39] is a public histological image dataset provided by Patch Camelyon, created in the same manner as the Camelyon2016 challenge [40] dataset. This dataset contains 327,680 color images with a resolution of 512 × 512 pixels. Each image is labeled with a binary label, indicating either tumor or normal. In our experiments, we used a specific subset of the dataset. Specifically, we randomly selected 2000 tissue slide images from the overall training data, including 1000 labeled as “tumor” and 1000 labeled as “normal”. This selection method is designed to ensure a balanced dataset in terms of categories, allowing the model to learn and distinguish between the two different types of tissue features more accurately. In addition, the random selection method helps to increase the representativeness and diversity of the data, making the training process more efficient and potentially enhancing the generalization ability of the model.

Compared methods

In comparative experiments, we used the PCAM200 dataset as a benchmark to comprehensively evaluate eight predominant unsupervised image generation methods. These methods include GAN [41], GANformer [42], StyleGAN2 [43], ProjectedGAN [44], LightweightGAN [45], Wavediff [46], LFM [47], and RDUOT [48]. The purpose of this comparison is to conduct an in-depth analysis and evaluation of the effectiveness and applicability of these advanced methods, as well as the proposed method, in generating tissue slide images.

Evaluation metrics

In evaluating the technology for generating histopathological tissue slide images, we used five metrics, three for generated image quality and two for no-reference image quality assessment, to comprehensively quantify the quality of the generated images. Fréchet Inception Distance (FID) [49] measures the difference between the generated images and real images in feature space by calculating the Wasserstein distance between features extracted by a pre-trained Inception v3 network; a lower FID value indicates greater similarity between the generated and real images. Inception Score (IS) [50] assesses the quality and diversity of generated images by classifying them with the Inception v3 network and calculating the confidence of the classifications and the entropy of the image class distribution; a higher IS value indicates higher image quality and diversity. Kernel Inception Distance (KID) [51] uses the Inception v3 pre-trained model and multiple-kernel maximum mean discrepancy to measure the difference between generated and real images; a lower KID value indicates greater similarity. The MA score [52] provides a comprehensive assessment of the generated images, with higher scores indicating higher visual quality and closer alignment with human perception standards. The Natural Image Quality Evaluator (NIQE) [53] is a no-reference image quality assessment metric based on a statistical model, evaluating quality by measuring the naturalness and statistical properties of the images; a lower NIQE value indicates better image quality. These evaluation metrics jointly consider the similarity in distribution between generated and real images, as well as characteristics such as clarity, contrast, and naturalness of the generated images, providing a thorough reference for assessing the quality of tissue slide images.
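
For reference, FID reduces to a closed-form Fréchet distance between two Gaussians fitted to Inception-v3 features. A minimal sketch of that computation is shown below, assuming the feature matrices have already been extracted with a pre-trained Inception v3 network.

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_real: np.ndarray, feats_fake: np.ndarray) -> float:
    """FID between two sets of Inception-v3 features, each of shape (N, D).

    Implements ||mu_r - mu_f||^2 + Tr(S_r + S_f - 2 (S_r S_f)^{1/2});
    feature extraction is assumed to have happened beforehand."""
    mu_r, mu_f = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    sigma_r = np.cov(feats_real, rowvar=False)
    sigma_f = np.cov(feats_fake, rowvar=False)
    covmean = linalg.sqrtm(sigma_r @ sigma_f)
    if np.iscomplexobj(covmean):  # numerical noise can yield tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(sigma_r + sigma_f - 2.0 * covmean))
```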

Qualitative and quantitative comparisons

Qualitative comparisons

Figure 7 shows the qualitative comparison between the proposed method and eight other methods. (a) represents real tissue slide images. (b) represents images generated by GAN; in the tissue slides generated by this method, the cell nuclei are overly stained, leading to unclear details such as the nuclear membrane and nucleoli, and there is an excess of hematoxylin in the cytoplasm, causing an imbalance between the nuclei and cytoplasm. (c) represents tissue slides generated by GANformer; these generated slides suffer from insufficient transparency, resulting in slightly blurred image details, and the cytoplasm color is relatively light, lacking the color saturation found in real slide images, making the overall visual effect less vivid and striking. (d) represents tissue slide images generated by StyleGAN2; in these generated images, the cell nuclei are too lightly stained, while the cytoplasm is overly stained, leading to insufficient differentiation. The overall contrast between the nuclei and cytoplasm is poor, making the boundary between them unclear, which affects the visual effect and diagnostic accuracy of the images. (e) represents tissue slide images generated by ProjectedGAN; in these images, the cell structures are not clear enough, some tissues are incomplete, the staining is uneven, and there are impurities after staining. (f) represents tissue slide images generated by LightweightGAN; these images exhibit obvious mode collapse, with the generated images lacking diversity, leading to highly repetitive cell structures and morphology that fail to reflect the complexity and diversity of real tissue slides. (g) represents tissue slide images generated by Wavediff; some areas exhibit blurring, and the nuclear structures are not clearly defined. (h) represents tissue slide images generated by LFM; there is uneven staining, and some tissues appear blurred. (i) represents tissue slides generated by RDUOT; they exhibit a mist-like artifact that significantly impairs the clarity of the cellular structure, making it extremely difficult to observe and analyze the details. Overall, these methods have certain issues and clear deficiencies compared to our proposed method. (j) represents images generated by the proposed method, which performs best in staining quality, with generated slides showing complete tissues under the microscope, good transparency and cytoplasmic contrast, and clear backgrounds without contamination or bubbles. Specifically, in these generated images, the staining contrast between the nuclei and cytoplasm is clear and distinct, the cell contours and structures are complete, the colors are bright, and the contrast is excellent. These characteristics ensure the clarity of the slide images under the microscope, aiding accurate judgment by the viewers.

Fig. 7. Comparison of histological image generation results between the proposed method and eight other deep learning-based techniques. (a) Real image. (b) GAN [41]. (c) GANformer [42]. (d) StyleGAN2 [43]. (e) ProjectedGAN [44]. (f) LightweightGAN [45]. (g) Wavediff [46]. (h) LFM [47]. (i) RDUOT [48]. (j) Proposed method

Quantitative comparisons

When comprehensively evaluating the quality of generated histological slide images, we conducted a systematic analysis using a series of quantitative metrics. Specifically, we compared the proposed method with eight other tissue slide generation methods on the PCAM200 dataset using three image generation evaluation metrics and two no-reference image quality assessment metrics. These evaluation metrics include FID, IS, and KID for generated image quality, as well as MA and NIQE for no-reference image quality. Through comprehensive analysis of these metrics, we can thoroughly evaluate the strengths and weaknesses of each generation method in terms of image quality, ensuring the objectivity and accuracy of the comparison results.

As shown in Table 3, in the comparative experiments, our proposed method performed the best on the FID metric, achieving the lowest FID value. This indicates that the generated images are most similar to the real images, with the smallest difference in feature space, and are thus of the highest quality. Regarding the Inception Score metric, our method achieved an IS value of 2.0171, second only to ProjectedGAN's 2.4620. This suggests that our generated images perform well in terms of quality and diversity, with high classification confidence and an even class distribution. For the KID metric, our method achieved a value of 0.0151, the best among all compared algorithms, demonstrating the highest similarity between our generated images and real images in feature space. This shows that the proposed method performs very well in capturing the details and feature distribution of the images. On the two no-reference image quality assessment indicators, the proposed method achieves an MA value of 7.2831, indicating that the generated images have the best subjective quality. The NIQE value of the proposed method is 4.9504, second only to StyleGAN2's 4.6916, indicating that the naturalness and statistical properties of the generated images are very close to those of real images. These comprehensive evaluation results show that the proposed method not only performs well on individual metrics but also shows overall superiority across multiple evaluation dimensions, which verifies the effectiveness and superiority of the proposed method in practical applications. These results fully demonstrate the great potential of the proposed method in generating high-quality histological tissue slide images, providing a solid foundation for future medical image generation and analysis.

Table 3 Comparison of generated images results between the proposed method and eight other methods, GAN [41], GANformer [42], StyleGAN2 [43], ProjectedGAN [44], LightweightGAN [45], Wavediff [46], LFM [47] and RDUOT [48]

Ablation study

In this section, we deeply analyzed the effectiveness of the proposed variable window mixing attention module and spectral filtering module on the PCAM200 dataset through qualitative and quantitative ablation experiments. By removing the variable window mixing attention module, its impact on the overall performance of the method is evaluated, which effectively proves the important role of the variable window mixing attention module in enhancing the quality of the generated tissue slide images. Similarly, by removing the spectral filtering module, the importance of the spectral filtering module in making full use of the image frequency domain information to improve the image quality is further analyzed. Finally, the variable window mixing attention module, and the spectral filtering module were removed to evaluate the overall impact of their simultaneous absence on the quality of the generated images and the performance of the algorithm. These experimental results clearly demonstrate the critical role of each module in our method, proving their contribution to improving the quality of the generated images and the overall performance of the algorithm.

Firstly, we conducted qualitative comparison experiments. In Fig. 8, (a) "-w/o VAM" represents the images generated by the method without the variable window mixing attention module; (b) "-w/o SFM" represents the images generated by the method without the spectral filtering module; (c) "-w/o VAM and SFM" represents the tissue slides generated by the method with both the VAM and the SFM removed; (d) "Full model" represents the images generated by the method containing all modules. The comparative analysis reveals that removing any key module leads to a significant decline in model performance. Specifically, the tissue slide images generated without the variable window mixing attention module exhibit a clumped distribution of nuclear material and uneven cytoplasmic color, with some regions that are overly light and overlap with the background, indicating that the VAM module plays a crucial role in image feature extraction and detail preservation. The tissue slide images generated without the spectral filtering module show cytoplasm that appears mottled red, and the cells appear blurry with low resolution under the microscope, suggesting that the SFM module is critical for capturing image frequency domain information and enhancing the naturalness of the images. The images generated with both the variable window mixing attention module and the spectral filtering module removed exhibit reddish nuclei that blend homogeneously with the eosinophilic cytoplasm. In contrast, the images generated by the full model are noticeably superior to those generated by any ablated model, demonstrating that the contrast between nucleus and cytoplasm is distinct, with clear cell quantity and structure.

Fig. 8. Qualitative ablation experiment visualization. (a) -w/o VAM denotes the images generated by the method without the variable window mixing attention module; (b) -w/o SFM represents the images generated by the method without the spectral filtering module; (c) -w/o VAM and SFM denotes the images generated by the method without VAM and SFM; (d) Full Model represents the images generated by the full method

Subsequently, we conducted quantitative comparison experiments. Table 4 presents the results of the ablation experiments of the proposed method on the PCAM 200 dataset. By removing key modules and using the five image quality evaluation metrics mentioned earlier, we can precisely reveal the contribution and effectiveness of each module to the overall algorithm. As shown in Table 4, the full model achieves the best performance across all metrics, indicating that the proposed method has a significant advantage in overall performance. The collaboration of all modules in the model enhances the overall quality of the generated images and the robustness of the algorithm.

Table 4 The results of the ablation experiments of the different modules

Conclusion

Histological slides are crucial tools in medical diagnostics. By observing tissue slides under a microscope, doctors can diagnose various diseases such as cancer, infections, and inflammation. Generating high-quality histopathological slides requires significant time and human resources and is susceptible to errors and variations during preparation. Utilizing generative technology to create high-quality histopathological slide images can improve the efficiency of medical resource utilization and address issues of insufficient sample quantities and excessive sample variation. To generate high-quality and diverse histological slide images, we propose a generative adversarial network that uses spatial domain and frequency domain information for feature extraction. This method adopts a varied-size window attention mechanism that adaptively adjusts the attention window size according to the image content, to better capture local and global information. The fast Fourier transform converts the image from the spatial domain to the frequency domain, so that the model can process and analyze the frequency components of the image more efficiently. This process can capture high- and low-frequency information that may be ignored by traditional spatial domain methods, thus enhancing the clarity and detail preservation of the generated images.

Since a single method cannot fully meet the requirements of generating high-quality images, a cross-attention fusion module is introduced in this paper. Our method is not a simple combination of different techniques but an in-depth fusion mechanism that effectively integrates spatial and frequency domain information, enabling mutual reference and complementarity between the two domains and improving the overall quality and detail of the generated images. Through the cross-attention mechanism, the generated images maintain consistency in the macroscopic structure and show higher fidelity and resolution in the microscopic details. Finally, the proposed method can produce histopathological slide images with high authenticity and diversity, significantly improving the quality and efficiency of medical image analysis. In future work, we plan to further reduce the complexity and parameter count of the model, conduct in-depth optimization of the model structure to meet the requirements for lightweight implementation, and extend this study to other image domains, such as medical imaging, remote sensing images, and natural images. Through these efforts, we hope to provide more efficient and accurate image processing and analysis tools for these fields, supporting the progress of scientific research and practical applications.

Availability of data and materials

The dataset used in this study is the PatchCamelyon (PCAM) dataset, a publicly available dataset for histopathology image classification. The PCAM dataset can be accessed at https://github.com/basveeling/pcam.

Acknowledgements

I would like to extend my deepest gratitude to all the mentors, colleagues, and family members who have supported and guided me throughout the research and writing process. Their invaluable encouragement and insights have been essential for the completion of this work.

Funding

Not applicable.

Author information

Contributions

QL conceptualized the study, developed the software, designed the methodology, drafted the original manuscript, and reviewed and edited the writing. TZ conducted validation, performed formal analysis, and provided resources. CC led the investigation, curated the data, and contributed to validation. JM participated in the investigation and data curation. MHT supervised the project and managed its administration.

Corresponding author

Correspondence to Marzia Hoque Tania.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Liu, Q., Zhou, T., Cheng, C. et al. Hybrid generative adversarial network based on frequency and spatial domain for histopathological image synthesis. BMC Bioinformatics 26, 29 (2025). https://doi.org/10.1186/s12859-025-06057-9

Keywords