
Benchmarking robustness of deep neural networks in semantic segmentation of fluorescence microscopy images

Abstract

Background

Fluorescence microscopy (FM) is an important and widely adopted biological imaging technique. Segmentation is often the first step in quantitative analysis of FM images. Deep neural networks (DNNs) have become the state-of-the-art tools for image segmentation. However, their performance on natural images may collapse under certain image corruptions or adversarial attacks. This poses real risks to their deployment in real-world applications. Although the robustness of DNN models in segmenting natural images has been studied extensively, their robustness in segmenting FM images remains poorly understood.

Results

To address this deficiency, we have developed an assay that benchmarks robustness of DNN segmentation models using datasets of realistic synthetic 2D FM images with precisely controlled corruptions or adversarial attacks. Using this assay, we have benchmarked robustness of ten representative models such as DeepLab and Vision Transformer. We find that models with good robustness on natural images may perform poorly on FM images. We also find new robustness properties of DNN models and new connections between their corruption robustness and adversarial robustness. To further assess the robustness of the selected models, we have also benchmarked them on real microscopy images of different modalities without using simulated degradation. The results are consistent with those obtained on the realistic synthetic images, confirming the fidelity and reliability of our image synthesis method as well as the effectiveness of our assay.

Conclusions

Based on comprehensive benchmarking experiments, we have found distinct robustness properties of deep neural networks in semantic segmentation of FM images. Based on the findings, we have made specific recommendations on selection and design of robust models for FM image segmentation.


Introduction

Fluorescence microscopy (FM) is an imaging technique with many important applications in biology and medicine. Acquired FM images often must be segmented for their quantitative analysis. Reliable and accurate segmentation is critical to downstream data analysis. Recently, deep neural networks (DNNs) have become the state-of-the-art tool for segmentation of natural images [1, 2]. They have also become the tool of choice for segmentation of FM images [3,4,5]. However, studies have shown that their performance may collapse on natural images under various corruptions [6,7,8,9] or adversarial attacks [8, 10]. This vulnerability of DNN models poses real risks to their deployment in real-world applications, especially those with stringent performance requirements, such as autonomous driving. To address this problem, extensive studies have been performed on the robustness of DNN models in classification [8] and, more recently, segmentation [9, 11, 12] of natural images. In particular, various assays have been developed to benchmark robustness of the models in semantic segmentation of natural images. So far, however, such assays remain lacking for FM images. Robustness of DNN models in semantic segmentation of FM images remains poorly understood.

FM images differ from natural images in several important aspects. First, FM images generally have much wider dynamic ranges than natural images. They also have different noise properties [13, 14]. Second, under the high numerical aperture required for high image resolution, objects in fluorescence microscopy images tend to have diffusive boundaries and are often blurred because of defocusing. Third, FM images are often simpler than natural images in composition and semantics. In each wavelength channel, FM images only have two semantic classes, foreground and background, and their background is composed primarily of noise. In contrast, natural images often have multiple semantic classes. Although DNN models have found great success in segmentation of FM images [15, 16], many of them were developed originally for natural images. The differences between natural images and FM images raise the question of whether conclusions on robustness of DNN models in segmentation of natural images are still valid on FM images. In particular, it is unclear whether segmentation models that perform well on natural images also perform well on FM images.

In this study, we address these open questions by developing an assay to benchmark robustness of DNN segmentation models on FM images. Using this assay, we examine the robustness of 10 representative models on FM images under different forms and levels of corruptions and adversarial attacks. We find that some models that show good robustness on natural images actually perform poorly on FM images. Drawing on the simple composition of FM images, we also find new robustness properties of DNN models. For example, consistent with findings of studies such as [17], we find that the morphology of image objects is a key factor that affects DNN model robustness. The simple composition of FM images also allows us to dissect the relations between corruption robustness and adversarial robustness. The main research contributions of this study are as follows:

  1.

    We have developed an assay that characterizes robustness of DNN models in semantic segmentation of FM images. It benchmarks both their corruption robustness and adversarial robustness. Corruption robustness of the models refers to their resistance in performance against different types and/or levels of image corruptions, such as noise and blurring, while adversarial robustness of the models refers to their resistance in performance against different types and/or levels of adversarial attacks. The assay is built on a method we have developed to synthesize realistic FM images with precisely controlled degradations. Robustness of DNN models has been benchmarked on three datasets of realistic synthetic FM images. Model robustness has also been benchmarked on eight datasets of real FM images of different modalities, with results consistent with those obtained on realistic synthetic images. The code and datasets used in this study are openly accessible (see Availability of data and materials).

  2.

    Our study reveals important differences in robustness of DNN segmentation models on natural images versus FM images. Some conclusions on robustness of DNN models on natural images actually fail on FM images. In particular, some models that show good robustness on natural images perform poorly on FM images. Based on comprehensive comparison of 10 representative models, convolutional neural network (CNN)-based models such as SegNet perform better in FM segmentation compared to Transformer-based and ResNet-based models. In addition to identifying models that provide good robustness, we also make specific recommendations on how to design robust models for segmentation of FM images.

  3.

    Our study reveals new and fundamental robustness properties of DNN models. By exploiting the simple composition of FM images, we find that segmentation robustness is highly dependent on morphology of image objects. DNN models generally show high accuracy and strong robustness on image objects with simple morphology. However, for image objects with complex morphology, optimization of DNN models for higher segmentation accuracy may come at the cost of lower robustness. DNN models optimized solely for segmentation accuracy may have poor robustness.

  4.

    Our study reveals new relations between corruption robustness and adversarial robustness. We find that influence of adversarial attacks on FM images tends to be more specific on the foreground, i.e., image objects, while influence of corruptions is global and nonspecific. In addition, we find that when adversarial attacks are turned into randomized corruptions, their effectiveness in degrading DNN model performance is substantially weakened, indicating that adversarial robustness is conceptually and practically more stringent than corruption robustness, even though adversarial attacks are less common in practice than image corruptions.

Related work

Studies on robustness of DNN models so far have focused more on image classification than image segmentation. Both corruption robustness and adversarial robustness have been examined.

Corruption robustness of DNN models in image classification. Weakened performance of DNNs on corrupted images was observed early on in, e.g., [6, 7], which found that DNNs are more vulnerable to noise and blurring than to, e.g., compression artifacts. To benchmark corruption robustness, several datasets were produced to simulate a wide variety of real image degradations [8, 18]. Benchmarking studies on several representative DNN models found that they generally lack robustness against corruptions such as noise, blurring, and contrast distortion [8]. To boost model robustness against these corruptions, several strategies have been proposed. A commonly adopted strategy is vanilla data augmentation, in which training data are augmented with the known types of corruptions such as noise [19], blurring [20], and compression artifacts [7]. However, DNN models trained on one type or level of corruptions were found to generalize poorly to unseen types or levels of corruptions [7, 21]. Interestingly, a recent study reported that training with data augmented with additive Gaussian and speckle noise boosts robustness against unseen corruptions [22]. Nevertheless, models trained on too many types or levels of corruptions at one time may suffer from underfitting [19], which may be alleviated by various mixture-of-experts ensemble training methods [23]. But these ensemble methods are computationally costly and may not transfer well to unseen corruptions. Strategies other than vanilla data augmentation have also been proposed. One strategy is to randomly select a combination of image corruptions and realize them in a transformation network [18]. A similar but simpler strategy named AutoAugment was also proposed [24]. Another strategy named Mixup trains DNNs on convex combinations of images and their labels [25]. A strategy combining AutoAugment and Mixup was also proposed [26].

Adversarial robustness of DNN models in image classification. Adversarial attacks are malicious inputs designed to cause failure of DNN models [10]. Attack methods are designed to produce adversarial samples that induce maximal model errors [27, 28], whereas defense methods are designed to defeat the adversarial attacks [27, 29]. Recently, adversarial training has received much attention. It enhances robustness of models by training them explicitly on adversarial samples. For example, FGSM-AT is an adversarial training method that uses samples generated by the fast gradient sign method (FGSM) [27]. PGD-AT [29] is another adversarial training method that uses samples generated by projected gradient descent (PGD), a multi-step iterative attack method. Several studies have also proposed to adjust adversarial training by incorporating momentum [30]. For example, TRADES [31] was proposed to balance between accuracy and adversarial robustness, whereas MART [32] was proposed to incorporate explicit differentiation of misclassified examples as a regularization factor for adversarial risks. Studies have also been conducted on transferability of adversarial robustness. For example, adversarial training on ImageNet was conducted to evaluate transferability between different types of adversarial samples [33]. Other studies have also tried to understand how well adversarial attacks transfer between different models [34, 35].

Relations between corruption robustness and adversarial robustness. Several studies have tried to elucidate the relations between the two types of robustness. A half-space model was proposed to argue that adversarial robustness and noise robustness are positively correlated so that adversarial training also enhances noise robustness [36]. Another study showed empirically that adversarial training also boosts robustness against different types of corruptions, including noise, blurring, digital and weather corruptions [37]. Yet another study reported that vanilla adversarial training enhances robustness against noise and blurring but not fog and contrast artifacts [36]. Adversarial training was also reported to boost robustness against elastic deformations of image objects and JPEG compression artifacts [38]. However, these observations are contradicted by a study reporting that adversarial training is ineffective for all types of corruptions and may even have an opposite effect [22] and another study reporting that increased robustness against adversarial attacks does not increase robustness against translation and rotation corruptions [39]. Overall, these conflicting reports show that our understanding of the relations between the two types of robustness remains limited.

Robustness of DNN models in image segmentation. Compared to the many studies on image classification, there have been fewer studies on robustness of DNN models in segmentation. Most related studies focus on proposing methods to enhance rather than evaluate segmentation robustness. Corruption robustness was benchmarked on four datasets constructed using images from Cityscapes, PASCAL VOC 2012, and ADE20K, with simulated blur, noise, digital and weather corruptions [11]. It was found that corruption robustness of semantic segmentation models depends strongly on the type of corruptions. Adversarial robustness of several semantic segmentation models was systematically evaluated in [12], which found empirically that architectures of the models have strong impact on their adversarial robustness and that observations on adversarial robustness of DNN models in classification may not hold in segmentation.

So far, few studies have examined the robustness of DNNs in segmentation of FM images. In [40] an assay was developed to synthesize images to benchmark robustness of FCN [41] and U-Net [42] against three types of corruptions, namely noise, spatial invariant blurring, and spatial variant blurring, in semantic segmentation of mitochondria. However, a key limitation of the study is that the synthesized mitochondria are unrealistic. Specifically, their sharp boundaries and randomized pixel patterns differ from those in real mitochondria. The study is also limited in the number of DNN models examined, its lack of diversity in image objects, and its lack of analysis on adversarial model robustness.

Methods

We propose an assay for benchmarking corruption robustness and adversarial robustness of DNN models in semantic segmentation of FM images. A critical part of the assay is a new method for synthesizing realistic synthetic FM images with precisely controlled corruptions or adversarial attacks. We evaluate robustness of 10 representative segmentation models on both realistic synthetic FM images and real microscopy images of different modalities.

Fig. 1. Overall workflow of image synthesis

Generation of realistic synthetic images for benchmarking robustness

We have developed three datasets, referred to as ER-C, Mito-C and Nucleus-C, respectively, for benchmarking robustness of DNN models against corruptions and adversarial attacks in semantic segmentation of FM images [43]. Detailed statistics of the datasets are summarized in Supplementary Table S1. Degraded images in these three datasets are synthesized from raw images along with their manually annotated segmentation labels from the ER, Mito, and Nucleus datasets [44, 45], respectively.

We use realistic synthetic FM images to benchmark robustness of DNNs for four reasons. First, the ground truth of each synthetic image is known a priori so that no additional manual annotation is required. Second, using synthetic images enables more direct, flexible, and precise control of corruptions and adversarial attacks than using real images. Such control is difficult to achieve in real images because it is difficult or even infeasible to control imaging conditions in the real world. Third, synthesis of realistic images requires much less time and labor than generation of real images with controlled conditions. Finally, previous studies such as [46] have shown experimentally that models trained on realistic synthetic images perform equally well on real FM images.

The overall workflow of image synthesis consists of three steps (Fig. 1). First, segmentation labels are used as binary masks to guide synthesis of images using a generative adversarial network (GAN) [47,48,49,50], which is trained to learn the mapping from the masks to their corresponding FM images. The masks are used as the ground truth for the final output images. The segmentation labels, generated originally by manual annotation, are taken directly from the three datasets [44, 45]. For data augmentation, some objects in existing segmentation annotations are also randomly selected and combined to generate new masks. Furthermore, morphological operations including dilation and erosion are used to increase the shape variability of masks. Second, denoising is performed on the synthesized images to remove their background noise using the method in [51]. This step is important because it enables precise control of signal-to-noise ratios (SNRs) in the next step. Third, different corruptions and adversarial attacks are applied to the denoised synthetic images to generate the final output images for benchmarking robustness of DNN models. A detailed description of each step of the workflow is given below.
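The morphological mask augmentation mentioned in the first step can be illustrated with a minimal NumPy sketch. The 3x3 structuring element, the number of repetitions, and the function names `dilate`, `erode` and `augment_mask` are assumptions made for illustration only; the text does not specify these details.

```python
import numpy as np

def dilate(mask):
    """Binary dilation with a 3x3 square structuring element (assumed size)."""
    padded = np.pad(mask.astype(bool), 1, mode="constant", constant_values=False)
    h, w = mask.shape
    out = np.zeros((h, w), dtype=bool)
    # OR together the nine shifted copies of the mask
    for dy in range(3):
        for dx in range(3):
            out |= padded[dy:dy + h, dx:dx + w]
    return out

def erode(mask):
    """Binary erosion via duality: complement of the dilated background.
    Pixels outside the image are treated as foreground, so objects are not
    eroded from the image border."""
    return ~dilate(~mask.astype(bool))

def augment_mask(mask, rng, max_steps=2):
    """Randomly dilate or erode a mask 1..max_steps times to vary object shape."""
    op = dilate if rng.random() < 0.5 else erode
    out = mask.astype(bool)
    for _ in range(int(rng.integers(1, max_steps + 1))):
        out = op(out)
    return out
```

Each augmented mask would then serve both as GAN input and as the ground truth of the synthesized image.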

Fig. 2. Comparison of real images versus images synthesized using two strategies. First row: an example from the Nucleus dataset. Second row: an example from the Mito dataset. Third row: an example from the ER dataset

Step 1-Initial image synthesis using a GAN

In a previous study [40], the foreground and background of FM images of mitochondria were modeled using a Gamma distribution and a Gaussian distribution, respectively, to synthesize images from binary masks. Pixels in foreground regions defined by the binary masks were filled with random samples from the Gamma distribution [40]. This method, referred to as Random Fill in this study, cannot capture spatial patterns of pixel intensities and diffusive boundaries of real image objects. Consequently, the synthesized images have low fidelity (Fig. 2). Moreover, because of the sharp boundaries of the synthetic images, DNN models trained on them tend to over-segment real images with diffusive boundaries [40].

Fig. 3. Representative synthetic images with different types and levels of corruptions. First row: an example from the Nucleus-C dataset; Second row: an example from the Mito-C dataset; Third row: an example from the ER-C dataset

To generate realistic synthetic FM images, we use a customized GAN model based on Pix2Pix [52], which we refer to as P2P-SN [51]. It is trained to learn the mapping from binary masks to real images. When given a binary mask, it fills the foreground with synthetic signal and background with synthetic noise. In this way, a large number of synthetic images can be generated from given masks. As can be observed qualitatively in Fig. 2, images synthesized by P2P-SN better reproduce the pixel intensity patterns and diffusive boundaries of real images. This observation can be quantified using fidelity metrics of the foreground signal, background noise, and blurring, respectively [46]. However, background noise in the synthetic images makes it difficult to precisely control their SNRs. To solve this problem, we remove background noise of the synthetic images using the method developed in [51].

Step 2-Denoising synthetic images

Background noise synthesized by P2P-SN hinders the precise control of signal-to-noise ratios (SNRs) of images and therefore is removed via denoising. Specifically, we use the two-stage denoising method named global noise modeling denoiser (GNMD) [51]. In the first stage, a series of independent and nearly all-background masks are fed into a trained P2P-SN to generate a series of synthetic global noise images denoted by N. Assuming that noise is additive, we then synthesize a noisy image \(\hat{I} = I + N\) from a synthetic noise-free image I. In the second stage, pairs of images \((\hat{I},I)\) generated in the first stage are used to train another Pix2Pix-based GAN model referred to as P2P-DN. When GNMD is applied on images synthesized by P2P-SN, their background noise is effectively removed [51]. Controlled degradations can now be applied to the noise-free synthetic images.

Step 3-Synthesis of degraded images

FM images generally have lower SNRs than natural images. We find empirically that FM images with an SNR of 8 are sufficiently clean visually. Therefore, we take synthesized images with an SNR of 8 as our reference clean images. Different forms and levels of corruptions and adversarial attacks are applied to the clean images to generate two types of samples: corrupted samples and adversarial samples.

Generation of corrupted samples. Noise is a common type of corruption for FM images. We simulate different levels of noise quantified by different SNRs (see Fig. 3). Blurring is another common type of corruption for FM images. As in [40], we simulate two types of blurring, namely space-invariant blurring (SIB) and space-variant blurring (SVB), at different levels.

Previous studies have shown that Poisson and Gaussian noise are dominant in FM images [13, 14, 53]. Specifically, the photon noise, or shot noise, is generated by the statistical fluctuations of the number of photons emitted at a given exposure level, which follows a Poisson distribution. Photon noise is inherent in all optical signals that result from photon emission. The readout noise is mainly generated by the signal amplification during the process of converting electrical charges into voltages. It follows a Gaussian distribution. In view of these physical mechanisms, we first add Poisson noise onto a noise-free synthetic image I generated by P2P-DN to simulate photon noise (see Eq. 1).

$$\begin{aligned} {\hat{I}}_{Poisson} = I + N_{Poisson} \end{aligned}$$
(1)

where \(N_{Poisson}\sim P\left( \lambda _{p} \right)\) is noise following a Poisson distribution and \(\lambda _{p}\) represents the average photon flux, which depends on signal strength. We then add pixel-wise independent Gaussian noise onto \({\hat{I}}_{Poisson}\) to achieve desired SNRs, as formulated by the following equation:

$$\begin{aligned} {\hat{I}}_{SNR} = {\hat{I}}_{Poisson} + N\left( \mu _{noise},\sigma _{noise} \right) \end{aligned}$$
(2)

where \(\mu _{noise}\) and \(\sigma _{noise}\) denote the mean and standard deviation of the added Gaussian noise, respectively. \({\hat{I}}_{SNR}\) is the simulated noisy image of a certain SNR, which is defined in this study as:

$$\begin{aligned} SNR = \left( \mu _{signal} - \mu _{noise} \right) /\sigma _{noise} \end{aligned}$$
(3)

where \(\mu _{signal}\) denotes the mean of the signal. Based on this definition, we simulate six SNR levels (SNR = 1, 2, 3, 4, 5, 8) and take \({\hat{I}}_{SNR=8}\) as our clean image. Specifically, \(\sigma _{noise}\) and \(\mu _{signal}\) are estimated from corresponding raw images. Then \(\mu _{noise}\) is calculated for a specific SNR based on its definition, i.e., \(\mu _{noise} = \mu _{signal} - SNR \cdot \sigma _{noise}\). To benchmark robustness against noise corruption of natural images, zero-mean Gaussian noise \(N\left( {0,\sigma } \right)\) is often adopted, with its level controlled by \(\sigma\) [8, 11, 37]. However, for FM images, we simulate different levels of noise corruption by adjusting the SNRs. This is because FM images have much wider dynamic ranges than natural images, and the mean of their background noise often is nonzero.
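The noise model of Eqs. 1 to 3 can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the authors' implementation: the text does not specify how \(\lambda _{p}\) is derived from the signal, so the sketch assumes it is proportional to the noise-free intensity through a hypothetical parameter `photon_scale`. The function names are also hypothetical.

```python
import numpy as np

def gaussian_mean_for_snr(mu_signal, sigma_noise, snr):
    """Solve Eq. 3, SNR = (mu_signal - mu_noise) / sigma_noise, for mu_noise."""
    return mu_signal - snr * sigma_noise

def add_noise(clean, mu_signal, sigma_noise, snr, rng, photon_scale=1.0):
    """Apply Eq. 1 (Poisson) and then Eq. 2 (Gaussian) to a noise-free image.

    ASSUMPTION: lambda_p = photon_scale * clean, i.e. the Poisson rate is
    proportional to the local signal; photon_scale is a hypothetical parameter.
    The image is expected to be non-negative.
    """
    clean = np.asarray(clean, dtype=np.float64)
    n_poisson = rng.poisson(lam=photon_scale * clean)   # noise term of Eq. 1
    i_poisson = clean + n_poisson
    mu_noise = gaussian_mean_for_snr(mu_signal, sigma_noise, snr)
    n_gauss = rng.normal(mu_noise, sigma_noise, size=clean.shape)  # Eq. 2
    return i_poisson + n_gauss

# Usage: the six SNR levels used in this study, on a toy noise-free image.
rng = np.random.default_rng(0)
clean = np.zeros((64, 64))
clean[16:48, 16:48] = 100.0
levels = {snr: add_noise(clean, 100.0, 10.0, snr, rng) for snr in (1, 2, 3, 4, 5, 8)}
```

In practice `mu_signal` and `sigma_noise` would be estimated from the corresponding raw images, as described above, rather than fixed by hand.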

For blurring, because of the limited depth of field under the high numerical aperture required for high-resolution imaging, FM images often are partially or completely out-of-focus and therefore blurred. Out-of-focus blur is often simulated through convolution with the point spread function (PSF) [54, 55]. Because a Gaussian kernel is often used to approximate PSF in practice, simulation of out-of-focus blur is implemented by convolution with a Gaussian kernel. To simulate space-invariant blurring (SIB), Gaussian filtering is performed on the entire synthetic FM images as described in [40]. Specifically, a fixed Gaussian kernel is applied on the whole image to simulate globally uniform blurring, with its level controlled by the standard deviation of the kernel \(\sigma\). In this study, six levels of SIB are simulated, with corresponding \(\sigma = 0,1,2,3,4,5\).

Fig. 4. Examples of images with different levels of IFGSM attacks. From left to right, \(\varepsilon =0,2,8,16,32\), respectively

Space-variant blurring (SVB) is designed to simulate spatially nonuniform blur. In this study, an image is empirically divided into 4 horizontal bands from top to bottom. Each band is filtered by a Gaussian kernel with \(\sigma\) randomly selected from 1, M/3, 2M/3, and M to simulate SVB, with the highest level of blurring controlled by M, which is set to be 1, 2, 3, 4 and 5. Representative samples of the three types of corruptions are shown in Fig. 3.
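The two blurring schemes described above can be sketched with NumPy alone. Several details are assumptions made for illustration and are not given in the text: the kernel is truncated at three standard deviations, image borders are handled by reflection, and each SVB band is obtained by blurring the whole image and keeping only the rows of that band, which avoids artifacts at band edges. All function names are hypothetical.

```python
import numpy as np

def gaussian_kernel_1d(sigma, truncate=3.0):
    """Normalized 1D Gaussian kernel, truncated at `truncate` sigmas (assumed)."""
    radius = max(1, int(truncate * sigma + 0.5))
    x = np.arange(-radius, radius + 1, dtype=np.float64)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def space_invariant_blur(image, sigma):
    """SIB: one fixed Gaussian kernel applied to the whole image."""
    image = np.asarray(image, dtype=np.float64)
    if sigma == 0:
        return image.copy()
    k = gaussian_kernel_1d(sigma)
    r = len(k) // 2
    padded = np.pad(image, r, mode="reflect")
    # Separable filtering: convolve along rows, then along columns
    rows = np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 0, rows)

def space_variant_blur(image, max_sigma, rng, n_bands=4):
    """SVB: each horizontal band gets a sigma drawn from {1, M/3, 2M/3, M}."""
    candidates = [1.0, max_sigma / 3.0, 2.0 * max_sigma / 3.0, float(max_sigma)]
    edges = np.linspace(0, image.shape[0], n_bands + 1).astype(int)
    out = np.empty(image.shape, dtype=np.float64)
    sigmas = []
    for top, bottom in zip(edges[:-1], edges[1:]):
        sigma = float(rng.choice(candidates))
        sigmas.append(sigma)
        out[top:bottom] = space_invariant_blur(image, sigma)[top:bottom]
    return out, sigmas
```

Calling `space_invariant_blur(img, s)` for `s` in 0 to 5 and `space_variant_blur(img, M, rng)` for `M` in 1 to 5 would reproduce the level grids described in the text.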

Generation of adversarial samples. The fast gradient sign method (FGSM) [27] and the iterative fast gradient sign method (IFGSM) [28] are used to generate adversarial attack samples. Let the original image be denoted as \(x \in R^{N \times N}\). FGSM is a one-step attack method based on the gradient of the DNN model loss function with respect to x. Because the loss function increases at the fastest rate when x steps along this gradient, an adversarial sample is generated according to the following equation:

$$\begin{aligned} x^{adv} = x + \varepsilon \cdot sign\left( {\nabla _{x}Loss\left( {f(x),gt} \right) } \right) \end{aligned}$$
(4)

where \(\varepsilon\) is the step size that controls the level (i.e., strength) of attack, f(x) is the output of DNN model f, \(sign(\cdot )\) is the sign function, and gt denotes the ground truth of x. Different from FGSM, IFGSM is an iterative attack method based on the gradient of the DNN model loss function, and its formulation is as follows:

$$\begin{aligned}&x_{0}^{adv} = x \\&x_{t + 1}^{adv} = x_{t}^{adv} + \alpha \cdot sign\left( {\nabla _{x}Loss\left( {f(x_{t}^{adv}),gt} \right) }\right) \\&x_{t + 1}^{adv} = clip\left( {x_{t + 1}^{adv},\;\varepsilon } \right) \end{aligned}$$
(5)

where \(\alpha\) is the step size for each iteration and the function \(clip(\cdot ,\varepsilon )\) ensures that each element of the adversarial sample stays within the range \([x_{i}-\varepsilon ,x_{i}+\varepsilon ]\) around the corresponding element \(x_{i}\) of the original image x. In our experiment, \(\alpha =1\), and we set the number of iterations as \(min(\varepsilon +4,1.25\varepsilon )\) [28], where \(\varepsilon\) is a variable that controls the level of attack. Figure 4 shows representative samples of IFGSM attacks.
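The two attacks in Eqs. 4 and 5 can be sketched as follows. To keep the example self-contained, a toy per-pixel logistic model with an analytic gradient stands in for the DNN model f; a real experiment would obtain the gradient from an automatic differentiation framework. The toy model, its parameters `w` and `b`, the function names, and the rounding up of the iteration count to an integer are all assumptions made for illustration.

```python
import math
import numpy as np

def loss_and_grad(x, gt, w=0.05, b=-5.0):
    """Toy stand-in for a DNN: per-pixel logistic model with mean BCE loss.
    Returns the loss and its gradient with respect to the input x."""
    p = 1.0 / (1.0 + np.exp(-(w * x + b)))
    tiny = 1e-12
    loss = -np.mean(gt * np.log(p + tiny) + (1 - gt) * np.log(1 - p + tiny))
    grad = (p - gt) * w / x.size
    return loss, grad

def fgsm(x, gt, epsilon):
    """Eq. 4: a single step of size epsilon along the sign of the gradient."""
    _, g = loss_and_grad(x, gt)
    return x + epsilon * np.sign(g)

def ifgsm(x, gt, epsilon, alpha=1.0):
    """Eq. 5: repeated steps of size alpha, clipped to within epsilon of x."""
    # ASSUMPTION: min(eps + 4, 1.25 * eps) is rounded up to an integer count
    n_iter = math.ceil(min(epsilon + 4, 1.25 * epsilon))
    x = np.asarray(x, dtype=np.float64)
    x_adv = x.copy()
    for _ in range(n_iter):
        _, g = loss_and_grad(x_adv, gt)   # gradient at the current iterate
        x_adv = x_adv + alpha * np.sign(g)
        x_adv = np.clip(x_adv, x - epsilon, x + epsilon)
    return x_adv
```

With this toy model, both attacks brighten background pixels and dim foreground pixels, which raises the loss while keeping every pixel within \(\varepsilon\) of its original value.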

Real microscopy image datasets for benchmarking robustness

In addition to datasets of realistic synthetic images, robustness of DNN segmentation models is also benchmarked on datasets of real microscopy images of different modalities, including fluorescence, brightfield, phase-contrast and differential interference contrast (DIC) microscopy. Representative images are illustrated in Fig. 13. For real fluorescence microscopy images, the datasets from [56] contain about 700 pairs of mitochondrial images, while the Nucleus datasets in [14] contain 1000 cell nucleus images acquired in three imaging modes: two-photon, confocal, and widefield. For these real fluorescence microscopy images, their segmentation annotations were made manually and controlled in quality by local experimental biologists. Because the real FM images of these datasets were collected under a pair of low and high SNRs, they can be used to benchmark corruption robustness. In addition, two phase-contrast microscopy datasets from [57] and [58] are selected. Specifically, the SH-SY5Y dataset, which is part of the LiveCell dataset [57], contains phase-contrast images of human neuroblastoma cells with long protrusions and dense populations. The Phc-Fib dataset, which comes from [58], contains phase-contrast images of overlapping fibroblasts. Two datasets of DIC images, named DIC_v1 and DIC_v2, are taken from [58]. Both contain images of normal elliptical cells, but DIC_v2 contains images of dense cell populations. Finally, two brightfield microscopy datasets are taken from [58]. The Bright_stain dataset contains images taken using brightfield microscopy on stained cells. The Bright dataset contains images of cells without staining. Cells of the two datasets are normally elliptical in shape and not clustered.
It should be noted that the brightfield, phase-contrast and DIC microscopy images are taken from datasets with instance cell segmentation, in which individual cells are differentiated and marked with different segmentation annotations. Because this study only considers semantic segmentation, the segmentation labels are simplified by setting the annotation to 1 for all objects. Real microscopy images are used for benchmarking model robustness for two reasons. First, degradations in real microscopy images are more representative of actual image conditions than in simulated images. Second, real microscopy images can be used to verify benchmarking results obtained on realistic synthetic images.
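The simplification of instance labels into semantic labels described above amounts to a single thresholding step. The sketch below assumes that background pixels are encoded as 0 and each cell as a distinct positive integer, which is a common convention but is not stated in the text; the function name is hypothetical.

```python
import numpy as np

def instance_to_semantic(instance_labels):
    """Map an instance label map (0 = background, k > 0 = cell k; assumed
    encoding) to a binary semantic mask with 1 for every object pixel."""
    return (np.asarray(instance_labels) > 0).astype(np.uint8)

# Usage: two separately labelled cells collapse into one foreground class.
labels = np.array([[0, 1, 1],
                   [0, 0, 2],
                   [3, 0, 2]])
mask = instance_to_semantic(labels)
```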

Segmentation models

To date, a large number of DNN models have been developed for semantic image segmentation. In this study, we examine 10 models, including FCN, SegNet, UNet, UNet_3, Sim_UNet, DeepLab, PSPNet, ICNet, ViT-B_16 and R50-ViT-B_16. Among them, FCN, SegNet, UNet, DeepLab, PSPNet and ICNet are representative semantic segmentation models that have been validated extensively in the literature. UNet_3 and Sim_UNet are two simplified variants of UNet. ViT-B_16 and R50-ViT-B_16 are Transformer-based models.

FCN [41], SegNet [59] and UNet [42] are classical convolutional neural network (CNN) models of the encoder-decoder architecture. An input image is fed into their multi-layer encoder to extract high-level features. Then, their decoder maps the high-level features back to the input domain and outputs dense segmentation results. FCN is one of the early models that successfully apply deep learning to semantic segmentation by replacing fully connected layers with convolutional layers to achieve end-to-end training. Its design of fully convolutional layers not only greatly reduces the size of input but also greatly reduces the number of parameters. It features a representative asymmetric encoder-decoder architecture, with simple up-sampling or deconvolution layers in its decoder. SegNet adds convolutional layers onto its decoder to make it symmetric with its encoder, forming a symmetric encoder-decoder architecture. Max-pooling indices are retained to provide high frequency information for up-sampling layers. UNet also takes a symmetric encoder-decoder structure. However, its skip connections concatenate features from its encoder to its decoder at different layers. Concatenated features largely preserve the information of each encoding layer, making segmentation more accurate. Recently, it has been shown that simplification of the UNet by reducing its level of down-sampling and up-sampling improves segmentation accuracy on FM images [16]. To check how such simplification may influence model robustness, we test two simplified variants of the UNet: UNet_3 retains the first three encoder-decoder layers of the original UNet and removes the last two layers, whereas Sim_UNet further reduces parameters of each layer of UNet_3 to obtain a more simplified architecture [16].

DeepLab [60], PSPNet [61], and ICNet [62] are all ResNet-based [63] models that utilize modules to handle image objects at multiple scales. DeepLab, specifically DeepLabv3, effectively captures multi-scale information at different rates using atrous spatial pyramid pooling (ASPP). PSPNet utilizes a pyramid pooling module (PPM) to extract global context information. ICNet is a cascaded lightweight network that is capable of achieving real-time semantic segmentation of natural images.

With its multi-head self-attention modules, the Transformer [64] has the capacity to handle both short-range and long-range information and has been widely used in natural language processing. Vision Transformer (ViT) [65] was first proposed to handle vision tasks based on the Transformer architecture: input images are divided into a sequence of non-overlapping patches, followed by positional and information embedding. ViT-B_16 and R50-ViT-B_16 [66] use the classical Transformer as their encoder. Because segmentation is a dense computer vision task, the decoder can be a CNN as usual. ViT-B_16 uses a pure Transformer as its encoder, whereas R50-ViT-B_16 uses a CNN-Transformer hybrid in which a CNN first acts as a feature extractor to generate a feature map for the input, and the feature map is then fed into the Transformer modules.
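The patch division described above can be illustrated with a minimal NumPy sketch. It is not taken from the authors' code: the function name `image_to_patches`, the 64×64 single-channel input, and the flattening order are assumptions made for illustration, and the subsequent linear projection and positional embedding are omitted.

```python
import numpy as np

def image_to_patches(image, patch_size=16):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("height and width must be multiples of patch_size")
    gh, gw = h // patch_size, w // patch_size
    # (gh, p, gw, p, c) -> (gh, gw, p, p, c) -> (gh * gw, p * p * c)
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)

# Hypothetical 64 x 64 single-channel image (size chosen for illustration only).
image = np.arange(64 * 64, dtype=np.float32).reshape(64, 64, 1)
tokens = image_to_patches(image)
print(tokens.shape)  # (16, 256): a 4 x 4 grid of patches, 16 * 16 pixels each
```

Each row of `tokens` corresponds to one patch in row-major grid order, which is the sequence a Transformer encoder would consume after embedding.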

We choose these models for several reasons. First, FCN, SegNet, UNet, DeepLab, PSPNet and ICNet are representative models that have been widely used and are known to perform well on natural images. Second, comparing robustness of UNet, UNet_3 and Sim_UNet allows us to examine how ablation of model architecture affects robustness. Third, we choose SegNet, FCN, DeepLab, PSPNet and ICNet because their robustness has been characterized on natural images in, e.g., [12]. This allows us to compare robustness of the same models on FM images versus natural images. Fourth, ViTs (Vision Transformers) [65] have achieved remarkable performance in a broad range of computer vision tasks. However, their performance in segmentation of FM images has not been examined.

Quantification of robustness

To quantify robustness, we largely follow the protocol used in [11, 12] so that we can compare model robustness on FM images versus natural images. Specifically, we use IoU (Intersection over Union) as our metric to characterize semantic segmentation performance of DNN models. For a specific model, we use its IoU on the reference clean images to characterize its reference accuracy. We define its robustness as the ratio between its IoU on degraded images and its IoU on clean images, namely:

$$\begin{aligned} R_{c,s}^{f} = \left( {IoU}_{c,s}^{f} \right) /\left( {IoU}_{clean}^{f} \right) \end{aligned}$$
(6)

where \({IoU}_{c,s}^{f}\) denotes the IoU of model f on degraded images. Subscript c denotes the type of corruption, which may be one of SNR (noise), SIB, and SVB. Subscript s denotes the level of degradation. For example, for a noise-corrupted image with an SNR of 4, c = ’SNR’, s = 4. \({IoU}_{clean}^{f}\) denotes the IoU of f when it is tested on clean images. When c denotes corruptions, \(R_{c,s}^{f}\) refers to corruption robustness. When c denotes adversarial attacks, \(R_{c,s}^{f}\) refers to adversarial robustness. In this study, this metric of robustness is first calculated on individual images then averaged over all images.
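As an illustration of Eq. (6), the following sketch computes the per-image ratio of IoU on a degraded image to IoU on the corresponding clean image and then averages over images, as described above. The function names, the convention that two empty masks have an IoU of 1, and the choice to skip images whose clean IoU is 0 are assumptions of this sketch and are not specified in the text.

```python
import numpy as np

def iou(pred, target):
    """IoU of two binary masks; taken as 1.0 when both masks are empty (assumption)."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        return 1.0
    return np.logical_and(pred, target).sum() / union

def robustness(preds_degraded, preds_clean, targets):
    """Mean over images of IoU(degraded) / IoU(clean), following Eq. (6)."""
    ratios = []
    for p_deg, p_clean, t in zip(preds_degraded, preds_clean, targets):
        iou_clean = iou(p_clean, t)
        if iou_clean == 0:
            continue  # ratio undefined; skipping such images is an assumption
        ratios.append(iou(p_deg, t) / iou_clean)
    if not ratios:
        raise ValueError("no image with a non-zero clean IoU")
    return float(np.mean(ratios))

# Two toy 2 x 2 images with hypothetical predictions.
targets = [np.array([[1, 1], [0, 0]]), np.array([[1, 1], [1, 1]])]
preds_clean = [np.array([[1, 1], [0, 0]]), np.array([[1, 1], [1, 0]])]
preds_degraded = [np.array([[1, 0], [0, 0]]), np.array([[1, 1], [0, 0]])]
r = robustness(preds_degraded, preds_clean, targets)
print(round(r, 4))
```

In this toy case the per-image ratios are 0.5/1.0 and 0.5/0.75, so the averaged robustness is about 0.583.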

Fig. 5

Segmentation accuracy (measured by mean IoU) of different models under different types and levels of corruptions

Experiments

First, we introduce the setup of our experiments using realistic synthetic images. Then, we present experimental results on corruption robustness and adversarial robustness, respectively, of the 10 selected models. Next, we present experimental results on the relations between noise corruption robustness and adversarial robustness. Next, we summarize robustness of the models in six different aspects. Finally, we present experimental results on real microscopy images of different modalities.

Experimental setup using realistic synthetic image datasets

Datasets. Three datasets of realistic synthetic images: ER-C, Mito-C, and Nucleus-C, are used for benchmarking model robustness [43]. They are synthesized with controlled corruptions and adversarial attacks on images of the endoplasmic reticulum, mitochondria, and the nucleus [44, 45], respectively, using the image synthesis protocol described in the Method section.

Models. Ten models are tested. PSPNet and ICNet take ResNet18 as their backbone. DeepLab takes ResNet50 as its backbone. UNet, FCNs, and SegNet use the standard 5-layer CNN-based architecture. UNet_3 retains the first three encoder-decoder layers of UNet, with the number of parameters for each layer unchanged. Sim_UNet retains a quarter of the convolutional kernels of UNet_3 at each layer. ViT-B_16 uses the ‘Base’ variant with 12 Transformer layers. Each input is divided into 16\(\times\)16 patches. The hidden size is set to 768. See [65] for details. R50-ViT-B_16 combines ResNet-50 [63] and ViT as its encoder. Each model is trained on clean images with an SNR of 8. Accuracy of each model refers to its mean IoU on clean images. Stochastic gradient descent (SGD) is used as the optimizer. Its learning rate is set initially at 0.01 and decreases by half every 100 epochs. The total number of epochs is 500. The batch size is 4. The datasets are openly accessible at https://ieee-dataport.org/documents/robustness-benchmark-datasets-semantic-segmentation-fluorescence-images-updated [43]. The code is openly accessible at https://github.com/cbmi-group/FMSegmentationRobustness.
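The learning-rate schedule described above (initial value 0.01, halved every 100 epochs, 500 epochs in total) can be written as a simple step-decay function. This is a sketch rather than the authors' training code; in particular, the assumption that epochs are counted from 0 and that the first halving takes effect at epoch 100 is ours.

```python
def learning_rate(epoch, base_lr=0.01, step=100, factor=0.5):
    """Learning rate at a 0-based epoch under step decay (halved every `step` epochs)."""
    if epoch < 0:
        raise ValueError("epoch must be non-negative")
    return base_lr * factor ** (epoch // step)

# Learning rate for each of the 500 training epochs.
schedule = [learning_rate(epoch) for epoch in range(500)]
print(schedule[0], schedule[100], schedule[499])
```

In PyTorch, a comparable schedule would typically be obtained with `torch.optim.lr_scheduler.StepLR(optimizer, step_size=100, gamma=0.5)` attached to an SGD optimizer; whether the released code uses this exact scheduler is not stated in the text.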

Table 1 Model accuracy averaged over three datasets

Corruption robustness

Segmentation models trained on clean images are tested on three types of corruptions: SNR (noise), SIB and SVB, at different levels. Representative results are shown in Supplementary Figs. S9–S11.

Performance under different levels of corruptions

Figure 5 summarizes accuracy of the models, measured by their mean IoU (mIoU), under different levels of corruptions. Several observations can be made. First, model accuracy consistently decreases under increased levels of corruptions. Second, different models show different rates in the degradation of their accuracy, suggesting substantial differences between their robustness. Third, the models show the most substantial degradation in accuracy under noise corruption, with decreases in mIoU ranging from less than 20% to as much as 80%.

Performance under different types of corruptions

For each model, to examine its accuracy over different corruptions, we average its mIoU under each type of corruption over all levels. Table 1 summarizes the results.

Accuracy on clean images: Sim_UNet achieves overall the best accuracy on clean images with the highest mIoU at 81.84%, consistent with the finding in [16] that the simplified model achieves higher accuracy on FM images than the original model. ICNet provides overall the worst accuracy at 62.42% on clean images. We also find that the CNN-Transformer hybrid encoder (R50-ViT-B_16) outperforms the pure Transformer encoder (ViT-B_16), suggesting that CNN modules may be beneficial to model accuracy on clean samples in semantic segmentation.

Accuracy under noise corruption: On FM images corrupted by noise, traditional CNN-based models such as FCN, UNet and SegNet show distinct advantages, with their mIoU being at least 6% higher in absolute value than the other models. UNet_3 has the worst performance, with its mIoU at 48.31%. The two Transformer models rank in the middle in performance, with R50-ViT-B_16 underperforming ViT-B_16.

Fig. 6

Corruption robustness versus accuracy of the selected models on three datasets. Results for ICNet are not shown because its very low accuracy skews the results of other models

Accuracy under SIB corruption: Under SIB corruption, traditional CNN-based models such as FCN, UNet and SegNet also show distinct advantages, with SegNet ranking the best. ICNet ranks the worst. Both Sim_UNet and UNet_3 perform poorly but better than ICNet. Again, the two Transformer models rank in the middle in performance, with R50-ViT-B_16 underperforming ViT-B_16.

Accuracy under SVB corruption: DNN models perform similarly under SVB as under SIB. Overall, however, the models handle SVB better than SIB. Similarly, the two Transformer models rank in the middle in performance, with R50-ViT-B_16 underperforming ViT-B_16.

Summary

We quantify corruption robustness of the 10 models using the definition in Eq. (6). Robustness against each type of corruption is illustrated in Supplementary Fig. S2. We also calculate the average robustness over all corruption types. The results are summarized in Table 2. The corruption robustness of each model against its accuracy on each dataset is plotted in Fig. 6. Overall, the models show the highest robustness on the Nucleus-C dataset, which has the lowest morphological complexity among the three datasets. The models show the lowest robustness on the ER-C dataset, which has the highest morphological complexity among the three datasets. Together, these results indicate that morphological complexity of image objects is a key factor that affects corruption robustness of DNN models. In general, SegNet, FCN and UNet show higher corruption robustness.

Table 2 Corruption robustness of models on three datasets, averaged over three types of corruptions

Taking the results together, several observations can be made on the corruption robustness of the models. First, when the morphological complexity of image objects is low, as in the Nucleus-C dataset, all models exhibit sufficient robustness against corruptions. However, when the morphological complexity of image objects is higher, as in the ER-C dataset, there seems to be a negative correlation between segmentation accuracy and corruption robustness. This indicates that models optimized for segmentation accuracy may have poor robustness, consistent with findings of previous studies on image classification [67,68,69]. Second, UNet_3 and Sim_UNet have slightly better accuracy than UNet, consistent with results reported in [16], but show poor robustness against noise or blurring. This indicates that models should not be optimized solely for accuracy. Third, conventional CNN-based models, such as UNet and FCN, have better robustness than ResNet-based models, such as DeepLab and PSPNet. This result contradicts the findings on natural images that ResNet-based models show higher robustness than conventional CNN-based models [11, 12, 60,61,62]. This result also indicates that models performing well on natural images may not perform well on FM images and that it is necessary to develop models specifically for FM images. Fourth, among the 10 models, ICNet shows the worst performance (Supplementary Figs. S9–S11), even though it shows good performance in segmentation of natural images [62]. Fifth, both Transformer-based models perform moderately on clean samples, with the CNN-Transformer hybrid model performing better. However, the pure Transformer-based model is excellent in corruption robustness. These findings suggest that conclusions drawn on corruption robustness of DNN models on natural images may fail on FM images. Finally, Transformer-based models have no clear advantage over CNNs in FM segmentation, and R50-ViT-B_16, with its CNN-Transformer hybrid encoder, performs worse than ViT-B_16, with its pure Transformer encoder, under corruptions.

Adversarial robustness

We characterize robustness of the 10 selected models against FGSM and IFGSM attacks. See Supplementary Figs. S12–S14 for representative segmentation results.

Performance under different levels of adversarial attacks

We control the level of FGSM attacks by adjusting the parameter \(\varepsilon\) in Eq. (4). Figure 7 shows model accuracy under different levels of FGSM attacks. Similar results are obtained on IFGSM attacks (see Supplementary Fig. S4). Several observations can be made. First, accuracy of DNNs decreases sharply under adversarial attacks on the ER-C dataset. In contrast, the models show stronger robustness against adversarial attacks on the Nucleus-C dataset. This indicates that images with high morphological complexity are more sensitive to adversarial attacks. Second, under increased levels of adversarial attacks, the accuracy of both UNet_3 and Sim_UNet decreases sharply.

Fig. 7

Model accuracy under different levels of FGSM attacks. eps: \(\varepsilon\) that controls the level of attack

Fig. 8

Adversarial robustness of models against FGSM attacks on three datasets. Results for ICNet are not shown because its very low accuracy skews the results of other models

Summary

Figure 8 summarizes the mean robustness against FGSM attacks versus accuracy of each model. Similar results for IFGSM attacks are shown in Supplementary Fig. S5. Overall, we observe similar trends in adversarial robustness as in corruption robustness. For images with high morphological complexity, models with higher accuracy may have lower adversarial robustness, as can be seen in UNet_3 and Sim_UNet. In addition, CNN-based models, such as FCN and UNet, show stronger adversarial robustness than ResNet-based models, such as PSPNet. This contradicts findings on natural images in [11, 12] that ResNet-based models have better robustness.

Table 3 Pearson correlation coefficients between adversarial robustness and SNR, SIB, SVB robustness on three datasets

Image corruptions differ from adversarial attacks in that the former come from real-world degradation while the latter are artificially constructed. Nevertheless, we find some similarities between the two types of robustness, which have also been compared on natural images [36, 37]. We further investigate relations between the two types of robustness in the next section.

Relations between corruption robustness and adversarial robustness

Fig. 9

Comparison of segmentation performance under noise corruption versus FGSM attacks. a First column: original sample. Second column: adversarial sample and its residual map. Third column: noise sample and its residual map. b Segmentation results on noise corruption samples versus adversarial samples using an FCN. First row: segmentation on original sample. Second row: segmentation on adversarial samples. Third row: segmentation on noise samples. Levels of degradation increase from left to right

Correlation between the two types of robustness

We compute, for the 10 selected models on the three datasets, the Pearson correlation coefficients [70] between their adversarial robustness under FGSM and IFGSM attacks and their corruption robustness under SNR (noise), SIB and SVB corruptions. The results are summarized in Table 3. For FGSM attacks, the correlation between adversarial and noise robustness is relatively high, especially on the Mito-C dataset. Considering that both FGSM and noise are single-step additive degradations, we examine relations between the two types of robustness next.
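For reference, the Pearson correlation coefficient between two sets of per-model robustness scores can be computed as in the sketch below. The function is a plain implementation of the standard formula; the two score lists are hypothetical placeholders chosen for illustration and are not values from Table 3.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)

# Hypothetical per-model robustness scores (placeholders, NOT results from the paper).
fgsm_robustness = [0.62, 0.35, 0.30, 0.70, 0.55]
noise_robustness = [0.80, 0.60, 0.55, 0.85, 0.75]
print(round(pearson(fgsm_robustness, noise_robustness), 3))
```

With one score per model, each coefficient is based on only 10 points, so individual values should be interpreted with some caution.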

Noise corruption samples vs adversarial attack samples

Assuming an original image X, its corresponding degraded version \(X_{d}\) is given by:

$$\begin{aligned} X_{d} = X + ~\delta \end{aligned}$$
(7)

where \(\delta\) is a degradation residual map that determines the level of degradation. When X is fixed, noise samples differ from FGSM samples in the residual map \(\delta\). Figure 9a shows examples of an FGSM sample, a noise sample and their residual maps (see Supplementary Fig. S6 for more examples). It is clear that the adversarial attack is structured and correlated with the foreground because it acts more specifically on foreground image objects. In contrast, noise corruption is unstructured, global, and nonspecific over the entire image.
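The contrast between the two residual maps in Eq. (7) can be reproduced on a toy problem. The sketch below assumes the standard FGSM form \(\delta = \varepsilon \cdot \mathrm{sign}(\nabla_X L)\) and replaces the segmentation network with a hypothetical per-pixel logistic model whose gradient is available in closed form; the image, the weights `w` and `b`, and the value of `eps` are all illustrative assumptions, not settings from this study.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm_residual(x, y, w, b, eps):
    """eps * sign(dL/dx) for a toy per-pixel logistic model with BCE loss."""
    p = sigmoid(w * x + b)      # predicted foreground probability
    grad = (p - y) * w          # analytic gradient of BCE with respect to x
    return eps * np.sign(grad)

def noise_residual(shape, sigma, rng):
    """Zero-mean Gaussian residual, independent of image content."""
    return rng.normal(0.0, sigma, size=shape)

# Toy image: bright square (foreground) on a dark background.
y = np.zeros((32, 32))
y[8:24, 8:24] = 1.0
x = 0.2 + 0.6 * y
w, b, eps = 4.0, -2.0, 0.05     # hypothetical model weights and attack level

delta_adv = fgsm_residual(x, y, w, b, eps)
delta_noise = noise_residual(x.shape, eps, rng)
x_adv, x_noise = x + delta_adv, x + delta_noise   # Eq. (7) for each residual

# Correlation of each residual map with the foreground mask.
corr_adv = np.corrcoef(delta_adv.ravel(), y.ravel())[0, 1]
corr_noise = np.corrcoef(delta_noise.ravel(), y.ravel())[0, 1]
print(corr_adv, corr_noise)
```

In this toy setting the adversarial residual darkens every foreground pixel and brightens every background pixel, so it is perfectly anti-correlated with the mask, whereas the Gaussian residual is nearly uncorrelated with it. A real network yields a less regular pattern, but the same qualitative difference in structure.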

Figure 9b shows representative segmentation results. When the original image is perturbed by light noise corruption or adversarial attack, there is little difference in the segmentation results. However, as the perturbation strengthens, segmentation results on images under adversarial attack and noise corruption show different characteristics. Adversarial attacks tend to perturb foreground, whereas noise corruption has greater impact on image background. Segmentation of background is increasingly affected by higher levels of noise. However, even under heavy noise, the foreground morphology remains visible. This is not the case under adversarial attacks. This suggests that adversarial attacks have a greater impact on foreground, whereas noise has a greater impact on background.

Table 4 Robustness before and after random pixel shuffling on adversarial residual maps. Tested on the ER-C dataset

Furthermore, when we alter the structure of adversarial residual maps by randomly shuffling their pixels, we obtain new samples similar to noise corruption samples. When we test models on these new samples, we find stronger robustness (see Table 4). Also see Supplementary Tables S2–S3 and Fig. S3 for additional results. These results indicate that because adversarial attacks perturb structures of image objects, they are more powerful than noise corruptions.
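The pixel-shuffling operation can be sketched as follows. The helper name `shuffle_residual` and the toy residual map are assumptions for illustration; the sketch only shows that shuffling preserves the set of residual values (and hence the perturbation magnitude) while destroying their spatial arrangement.

```python
import numpy as np

def shuffle_residual(delta, rng):
    """Randomly permute the pixel positions of a residual map."""
    return rng.permutation(delta.ravel()).reshape(delta.shape)

rng = np.random.default_rng(0)

# Toy structured residual: negative on a square "object", positive elsewhere.
delta = np.full((32, 32), 0.05)
delta[8:24, 8:24] = -0.05

delta_shuffled = shuffle_residual(delta, rng)

# Hypothetical clean image, used only to form the two degraded samples.
x = np.zeros((32, 32))
x_adv = x + delta                  # structured (adversarial-like) sample
x_shuffled = x + delta_shuffled    # same values, randomized positions
```

Because `delta_shuffled` contains exactly the same values as `delta`, any change in model performance between `x_adv` and `x_shuffled` can be attributed to the spatial structure of the perturbation rather than to its amplitude.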

Adversarial training vs noise augmentation training

We test two strategies to enhance model robustness: adversarial training [29] and Gaussian noise augmentation. We examine model accuracy, SNR robustness and FGSM robustness on the trained models. Figure 10 shows the robustness on the Mito-C dataset of three types of training: standard training (ST), Gaussian noise augmentation training (GT) and adversarial training (AT). Results on Nucleus-C and ER-C can be found in Supplementary Figs. S7 and S8.

Fig. 10

Comparison of robustness on Mito-C dataset with three types of training. ST: standard training, GT: Gaussian noise augmentation training, AT: adversarial training, Acc: mIoU on clean data

From Fig. 10, we can see that adversarial training also enhances noise robustness. In contrast, GT significantly improves SNR robustness but has limited effect on adversarial robustness. Because adversarial training is equivalent to augmentation training with adversarial samples, the result indicates that training with adversarial samples will lead to a more universally robust model. It also highlights the difference between adversarial samples and noise corruption samples.

In summary, adversarial attacks are more structured and are more specific in perturbing morphology of image objects. In contrast, noise corruption is unstructured, nonspecific, and has greater effect on background. Adversarial attacks are also more powerful than noise corruptions in degrading model performance. When adversarial attacks are turned into randomized corruptions, their attack ability weakens. In addition, training on adversarial samples will enhance robustness against noise, consistent with findings of previous studies [36, 37]. However, enhancing corruption robustness does not necessarily enhance adversarial robustness.

Fig. 11

Comparison of robustness of 10 models in six different aspects. First row from left to right: UNet, UNet_3, Sim_UNet, FCNs, ViT-B_16. Second row from left to right: SegNet, DeepLab, PSPNet, ICNet, R50-ViT-B_16

Comprehensive comparison of model robustness

We have thus far analyzed and compared adversarial robustness and corruption robustness of 10 selected models. For more comprehensive comparison of their robustness, we use radar charts (Fig. 11) to visualize their performance in the following six aspects: (1) accuracy on clean data, (2) SNR robustness, (3) SIB robustness, (4) SVB robustness, (5) FGSM robustness and (6) IFGSM robustness. Detailed scores are summarized in Supplementary Table S4.

Overall, several observations can be made. First, nearly all models perform the worst on the ER-C dataset (light brown polygons). This indicates that it is difficult to balance the overall performance on images with complex object morphology even for binary segmentation. Second, SegNet, FCN and DeepLab have more balanced performance on the three datasets, with no particular shortcomings and, therefore, are recommended. UNet is also a potential candidate for its excellent corruption robustness. But its IFGSM robustness is relatively weak. Third, simplified models such as UNet_3 and Sim_UNet achieve excellent accuracy on clean images but show poor corruption robustness and adversarial robustness. Therefore, we caution against selecting these two models. ICNet exhibits good robustness on natural images but performs poorly on FM images. Lastly, different attacks induce different levels of decline in model performance. For example, IFGSM attack induces more severe performance degradation than FGSM attack.

Table 5 Model performance in mIoU on synthetic and real FM images
Fig. 12

Comparison of mIoU on synthetic and real FM images. Ns_Sy: our synthetic nucleus dataset (Nucleus-C), Ns_Real: averaged over confocal, two-photon and widefield nucleus dataset. Mito_Sy: our synthetic mitochondria dataset (Mito-C). Mito_Real: real-world matrix mitochondria dataset

Benchmarking on real microscopy images

The results reported so far are derived from realistic synthetic FM images. To check the reliability and generalizability of the results, we also benchmark our models on real microscopy images of different modalities, including fluorescence, brightfield, phase-contrast and differential interference contrast (DIC) microscopy. All ten models are first trained on the realistic synthetic FM images and then fine-tuned on 5-10 real microscopy images with a small learning rate.

Fluorescence microscopy

As described previously in Sect. “Real microscopy image datasets for benchmarking robustness”, real FM images used in this section come from the matrix mitochondria dataset [56], with approximately 700 images, and the nucleus dataset [14] of three imaging modes (two-photon, confocal, and wide-field), with 1000 images under each mode. All the real images were acquired under a pair of low and high SNRs so that they can be used to benchmark corruption robustness. The results are summarized in Table 5 and Table 6. Representative segmentation results are shown in Supplementary Fig. S15. Figure 12 shows that the mean IoUs of the same models on synthetic images and on real images are generally similar, indicating that models trained on realistic synthetic images perform equally well on real FM images. Statistical comparison using two-sample Student's t-tests shows that there is no statistically significant difference between them in 80% of the cases (see Supplementary Table S6 for detailed p-values). This also shows the fidelity and reliability of the realistic synthetic images. Furthermore, the decline in model performance on real FM images of low SNRs is consistent with the decline observed on realistic synthetic FM images.
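For readers unfamiliar with the test, the sketch below computes the two-sample Student's t statistic with pooled variance. It is a generic illustration: which quantities were compared in each test (for example, per-image IoUs) is not specified here, and the two sample lists are hypothetical placeholders rather than data from this study.

```python
import math
from statistics import mean, variance

def student_t(a, b):
    """Two-sample Student's t statistic (pooled variance) and degrees of freedom."""
    na, nb = len(a), len(b)
    if na < 2 or nb < 2:
        raise ValueError("each sample needs at least two values")
    dof = na + nb - 2
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / dof
    if pooled == 0:
        raise ValueError("t statistic is undefined when both samples are constant")
    t = (mean(a) - mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))
    return t, dof

# Hypothetical IoU samples on synthetic and real images (placeholders only).
iou_synthetic = [0.80, 0.82, 0.79, 0.81]
iou_real = [0.78, 0.83, 0.80, 0.79]
t_stat, dof = student_t(iou_synthetic, iou_real)
print(t_stat, dof)
```

Converting the statistic into a p-value requires the t distribution with `dof` degrees of freedom; in practice, `scipy.stats.ttest_ind(a, b, equal_var=True)` returns both the statistic and the p-value.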

Table 6 Model performance in mIoU on synthetic and real FM images. Results on images with low SNRs are presented. Syn: synthetic data, Nuc: nucleus, Mito: mitochondria, CF: confocal, TP: two-photon, WF: widefield
Fig. 13

Examples of three types of microscopy images of whole cells and their semantic segmentation annotations

Phase-contrast microscopy

As described previously in Sect. “Real microscopy image datasets for benchmarking robustness”, we use phase-contrast images from two datasets. The SH-SY5Y dataset is taken from the LiveCell dataset [57]. It contains human neuroblastoma cells with long protrusions and dense populations. The Phc-Fib dataset is taken from [58] and contains some overlapping fibroblasts. Example images are illustrated in Fig. 13. Because the SH-SY5Y and Phc-Fib datasets were both initially created for instance segmentation, we set the annotation labels to 1 for all objects to obtain their semantic segmentation. Columns 2–3 of Table 7 present the mIoU of the 10 models on SH-SY5Y and Phc-Fib. Representative segmentation results are shown in Supplementary Fig. S16.
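The conversion from instance annotations to semantic annotations amounts to a single thresholding step. The sketch below assumes the common convention that background pixels are labeled 0 and each object carries a distinct positive integer label; the function name and the toy mask are illustrative.

```python
import numpy as np

def instance_to_semantic(instance_mask):
    """Map every labeled object (label > 0) to class 1; background stays 0."""
    return (np.asarray(instance_mask) > 0).astype(np.uint8)

# Toy instance mask with three objects labeled 1, 2 and 3.
instance_mask = np.array([
    [0, 1, 1, 0],
    [0, 0, 2, 2],
    [3, 0, 0, 2],
])
semantic_mask = instance_to_semantic(instance_mask)
print(semantic_mask)
```

Note that touching or overlapping objects merge into a single foreground region after this step, which is one reason densely populated images become harder for semantic segmentation.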

Overall, we observe results consistent with those derived from realistic synthetic images. Specifically, ICNet also performs poorly on phase-contrast microscopy images. In addition, R50-ViT-B_16 still underperforms ViT-B_16, consistent with the observations from Table 1. ResNet-based and transformer-based models that perform well on natural images show no clear advantage on phase-contrast microscopy images. For datasets with complex image object shapes or with complex background, CNN-based models like UNet still have clear performance advantage, also consistent with observations from Table 2.

Table 7 Model performance in mIoU on three types of microscopy images of whole cells

Differential interference contrast microscopy

As described previously in Sect. “Real microscopy image datasets for benchmarking robustness”, we take two DIC datasets, DIC_v1 and DIC_v2, from [58] for evaluation. Results are listed in columns 4–5 of Table 7. Representative segmentations are shown in Supplementary Fig. S16.

Consistent with the results in Fig. 6 derived from realistic synthetic images, models achieve high performance on cells of simple shapes. However, for densely populated cells, such as those in DIC_v2, segmentation accuracy decreases substantially, even for cells of simple elliptical shapes. This is because the cells adhere to each other, resulting in complex morphology and increasing the difficulty of segmentation. On DIC microscopy images, UNet shows the most competitive performance, which again confirms the superiority of CNN models. UNet_3 and Sim_UNet do not perform as well as UNet on DIC_v2 (complex dataset). The same observation can be made on synthetic FM images (see Table 1).

Brightfield microscopy

As described previously in Sect. “Real microscopy image datasets for benchmarking robustness”, two datasets of brightfield microscopy images from [58] are used. Bright_stain is imaged through brightfield microscopy on stained cells. The other dataset, Bright, is imaged without staining. Cells of the two datasets are normally ellipsoid and non-clustered. The results are listed in columns 6–7 of Table 7. Representative segmentation results are shown in Supplementary Fig. S16.

For brightfield microscopy images, segmentation performance on stained cells is low because of artifacts caused by staining. UNet and its simplified variants (UNet_3 and Sim_UNet) outperform other networks on simple and clear brightfield microscopy images. This again shows the superiority of the UNet architecture in biological image processing. Results on FM images in Table 1 and Fig. 11 also indicate the superiority of the UNet architecture. In addition, most models achieve high accuracy on the Bright dataset, which is free of complex shapes or background, confirming again the observations from Fig. 6 that DNN models generally exhibit good accuracy and robustness on images with objects of simple geometry. For all microscopy images, CNN-Transformer hybrid encoder (R50-ViT-B_16) always outperforms pure transformer encoder (ViT-B_16), which is also consistent with observations on FM images from Tables 1 and 2.

Conclusion

In this study, we have developed an assay for benchmarking robustness of DNNs in semantic segmentation of FM images. For real-world biomedical applications, it is essential to develop DNN segmentation models that are robust against commonly encountered degradation in FM images. To achieve this goal, methods and datasets to benchmark robustness of DNN models are required. The assay developed in this study is aimed at meeting this requirement. In particular, the method we have developed for generation of realistic synthetic images makes it possible to precisely control the degradation of image conditions, either by corruptions or adversarial attacks. In addition, we have used real-world microscopy images of different modalities, including fluorescence, brightfield, phase-contrast and DIC, without simulated degradation to benchmark robustness of DNN segmentation models. The results are consistent with those obtained using our realistic synthetic data, confirming the fidelity and reliability of our image synthesis method and the effectiveness of our assay.

In benchmarking robustness of DNN models in segmentation of FM images, we have found that some conclusions on robustness of DNN models in segmentation of natural images fail on FM images. For example, ResNet-based and Transformer-based models that perform well on natural images show no clear advantage on FM images, whereas CNN-based models have a clear performance advantage (see Supplementary Table S4). Furthermore, ICNet, a lightweight model that performs well on natural images, performs poorly on FM images. Based on extensive comparison, we find that SegNet, UNet, FCNs, and DeepLab are well-balanced models in terms of their accuracy and robustness for segmentation of FM images.

In benchmarking robustness of DNN models in segmentation of FM images, we have also discovered new and fundamental robustness properties. In particular, we find that morphology of image objects is a key factor that can significantly influence segmentation performance. DNN models generally exhibit good accuracy and robustness on image objects with simple morphology, such as cell nuclei. However, when morphology of image objects becomes more complex, models optimized for high segmentation accuracy often suffer from a sharp decline in their robustness, even though the task of binary segmentation is relatively simple. We recommend caution with simplifying models on such data for higher accuracy because simplified models are at risk of performance collapse. DNN models should not be assessed solely based on their segmentation accuracy. And they should not be optimized solely for high segmentation accuracy. Instead, segmentation accuracy should be balanced with model robustness.

Our study also provides new insights into the relations between adversarial robustness and corruption robustness. Previous studies on natural images have reported contradictory results on their relations [71, 72]. In this study, by exploiting the simplicity of FM images, we find that adversarial attacks generally cause more substantial decline in DNN model performance than image corruptions such as noise and blurring. We also find that, depending on the specific circumstances, adversarial robustness and corruption robustness may be positively correlated or uncorrelated.

Our study has its limitations. It only benchmarks robustness of deep learning models in semantic segmentation of 2D FM images. Identification of individual cells requires instance segmentation. Furthermore, 3D FM images are becoming widely used in practice. It is therefore important to benchmark model robustness in instance segmentation and on 3D FM images. The image synthesis method developed in this study is only applicable to FM images. Although microscopy images of other modalities have been used in this study to evaluate the ten deep learning models, they lack precisely controlled perturbations and therefore are limited in quantifying model robustness. Although our image synthesis method enables precise control of degradation of FM images, it requires three steps: initial synthesis, denoising, and degradation synthesis. Whether it is feasible to simplify this scheme remains an open question. In addition to the two types of corruptions considered in this study, namely noise and blur, other imaging artifacts such as optical aberrations, non-uniform illumination and dust are also encountered in real-world applications. These artifacts may also influence the robustness of deep learning segmentation models. These limitations and questions will be addressed in our follow-up studies. Overall, this study provides new insights into the robustness of deep learning models against image corruptions and adversarial attacks in semantic segmentation of fluorescence microscopy images.

Data availability

The datasets of realistic synthetic images used in this study are openly accessible through the IEEE Dataport at https://ieee-dataport.org/documents/robustness-benchmark-datasets-semantic-segmentation-fluorescence-images-updated. For real fluorescence microscopy images, the mitochondrial dataset is taken from https://github.com/AiviaCommunity/3D-RCAN and the Nucleus datasets are taken from http://tinyurl.com/y6mwqcjs. Real phase-contrast microscopy dataset SH-SY5Y is taken from LiveCell https://sartorius-research.github.io/LIVECell/. Other real microscopy images used in this study such as Phc-Fib, brightfield microscopy and differential interference contrast (DIC) microscopy datasets are all taken from https://neurips22-cellseg.grand-challenge.org/dataset/.

Code availability

The code used in this study is openly accessible at https://github.com/cbmi-group/FMSegmentationRobustness.

References

  1. Minaee S, Boykov Y, Porikli F, Plaza A, Kehtarnavaz N, Terzopoulos D. Image segmentation using deep learning: a survey. IEEE Trans Pattern Anal Mach Intell. 2021;44(7):3523–42.

    Google Scholar 

  2. Garcia-Garcia A, Orts-Escolano S, Oprea S, Villena-Martinez V, Martinez-Gonzalez P, Garcia-Rodriguez J. A survey on deep learning techniques for image and video semantic segmentation. Appl Soft Comput. 2018;70:41–65.

    Article  Google Scholar 

  3. Sadanandan SK, Ranefall P, Le Guyader S, Wählby C. Automated training of deep convolutional neural networks for cell segmentation. Sci Rep. 2017;7(1):1–7.

    Article  CAS  Google Scholar 

  4. Kraus OZ, Ba JL, Frey BJ. Classifying and segmenting microscopy images with deep multiple instance learning. Bioinformatics. 2016;32(12):52–9.

    Article  Google Scholar 

  5. Xing F, Xie Y, Su H, Liu F, Yang L. Deep learning in microscopy image analysis: a survey. IEEE Trans Neural Netw Learn Syst. 2017;29(10):4550–68.

    Article  Google Scholar 

  6. Dodge S, Karam L. Understanding how image quality affects deep neural networks. In: International Conference on Quality of Multimedia Experience (QoMEX), 2016:1–6

  7. Dodge S, Karam L. A study and comparison of human and deep learning recognition performance under visual distortions. In: International Conference on Computer Communication and Networks (ICCCN), 2017:1–7

  8. Hendrycks D, Dietterich T. Benchmarking neural network robustness to common corruptions and perturbations. In: International Conference on Learning Representations (ICLR) 2019

  9. Michaelis C, Mitzkus B, Geirhos R, Rusak E, Bringmann O, Ecker AS, Bethge M, Brendel W. Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484 2019

  10. Szegedy C, Zaremba W, Sutskever I, Bruna J, Erhan D, Goodfellow I, Fergus R. Intriguing properties of neural networks. In: International Conference on Learning Representations (ICLR) 2014

  11. Kamann C, Rother C. Benchmarking the robustness of semantic segmentation models with respect to common corruptions. Int J Comput Vis (IJCV). 2021;129(2):462–83.

    Article  Google Scholar 

  12. Arnab A, Miksik O, Torr P. On the robustness of semantic segmentation models to adversarial attacks. IEEE Trans Pattern Anal Mach Intell (TPAMI). 2020;42(12):3040–53.

    Article  Google Scholar 

  13. Meiniel W, Olivo-Marin J-C, Angelini ED. Denoising of microscopy images: a review of the state-of-the-art, and a new sparsity-based method. IEEE Trans Image Process (TIP). 2018;27(8):3842–56.

    Article  Google Scholar 

  14. Zhang Y, Zhu Y, Nichols E, Wang Q, Zhang S, Smith C, Howard S. A Poisson–Gaussian denoising dataset with real fluorescence microscopy images. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019. p.11710–11718

  15. Caicedo JC, Roth J, Goodman A, Becker T, Karhohs KW, Broisin M, Molnar C, McQuin C, Singh S, Theis FJ. Evaluation of deep learning strategies for nucleus segmentation in fluorescence images. Cytom A. 2019;95(9):952–65.

    Article  Google Scholar 

  16. Guo Y, Huang J, Zhou Y, Luo Y, Li W, Yang G. Segmentation of intracellular structures in fluorescence microscopy images by fusing low-level features. In: Chinese Conference on Pattern Recognition and Computer Vision (PRCV), 2021. p. 386–397

  17. Maška M, Ulman V, Delgado-Rodriguez P, Gómez-de-Mariscal E, Nečasová T, Guerrero Peña FA, Ren TI, Meyerowitz EM, Scherr T, Löffler K, Mikut R, Guo T, Wang Y, Allebach JP, Bao R, Al-Shakarji NM, Rahmon G, Toubal IE, Palaniappan K, Lux F, Matula P, Sugawara K, Magnusson KEG, Aho L, Cohen AR, Arbelle A, Ben-Haim T, Raviv TR, Isensee F, Jäger PF, Maier-Hein KH, Zhu Y, Ederra C, Urbiola A, Meijering E, Cunha A, Muñoz-Barrutia A, Kozubek M, Ortiz-de-Solórzano C. The cell tracking challenge: 10 years of objective benchmarking. Nat Methods. 2023;20:1010–20.

    Article  PubMed  PubMed Central  Google Scholar 

  18. Hendrycks D, Basart S, Mu N, Kadavath S, Wang F, Dorundo E, Desai R, Zhu T, Parajuli S, Guo M. The many faces of robustness: a critical analysis of out-of-distribution generalization. In: International Conference on Computer Vision (ICCV), 2021. p. 8340–8349.

  19. Zheng S, Song Y, Leung T, Goodfellow I. Improving the robustness of deep neural networks via stability training. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. p. 4480–4488.

  20. Vasiljevic I, Chakrabarti A, Shakhnarovich G. Examining the impact of blur on recognition by convolutional networks. arXiv preprint arXiv:1611.05760 2016

  21. Geirhos R, Temme CR, Rauber J, Schütt HH, Bethge M, Wichmann FA. Generalisation in humans and deep neural networks. In: Advances in Neural Information Processing Systems (NeurIPS) 2018.

  22. Rusak E, Schott L, Zimmermann RS, Bitterwolf J, Bringmann O, Bethge M, Brendel W. A simple way to make neural networks robust against diverse image corruptions. In: European Conference on Computer Vision (ECCV), 2020. p. 53–69.

  23. Dodge S, Karam L. Quality resilient deep neural networks. arXiv preprint arXiv:1703.08119 2017

  24. Cubuk ED, Zoph B, Mane D, Vasudevan V, Le QV. Autoaugment: Learning augmentation policies from data. arXiv preprint arXiv:1805.09501 2018.

  25. Zhang H, Cisse M, Dauphin YN, Lopez-Paz D. mixup: Beyond empirical risk minimization. In: International Conference on Learning Representations (ICLR) 2018.

  26. Hendrycks D, Mu N, Cubuk ED, Zoph B, Gilmer J, Lakshminarayanan B. Augmix: A simple data processing method to improve robustness and uncertainty. In: International Conference on Learning Representations (ICLR) 2020.

  27. Goodfellow IJ, Shlens J, Szegedy C. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572 2015

  28. Kurakin A, Goodfellow IJ, Bengio S. Adversarial examples in the physical world. In: Artificial Intelligence Safety and Security, 2018. p. 99–112.

  29. Madry A, Makelov A, Schmidt L, Tsipras D, Vladu A. Towards deep learning models resistant to adversarial attacks. In: International Conference on Learning Representations (ICLR) 2018.

  30. Dong Y, Liao F, Pang T, Su H, Zhu J, Hu X, Li J. Boosting adversarial attacks with momentum. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018. p. 9185–9193.

  31. Zhang H, Yu Y, Jiao J, Xing E, El Ghaoui L, Jordan M. Theoretically principled trade-off between robustness and accuracy. In: International Conference on Machine Learning (ICML), 2019. p. 7472–7482.

  32. Wang Y, Zou D, Yi J, Bailey J, Ma X, Gu Q. Improving adversarial robustness requires revisiting misclassified examples. In: International Conference on Learning Representations (ICLR) 2019.

  33. Kurakin A, Goodfellow I, Bengio S. Adversarial machine learning at scale. arXiv preprint arXiv:1611.01236 2016

  34. Rozsa A, Günther M, Boult TE. Are accuracy and robustness correlated. In: IEEE International Conference on Machine Learning and Applications (ICMLA), 2016. p. 227–232.

  35. Liu Y, Chen X, Liu C, Song D. Delving into transferable adversarial examples and black-box attacks. In: International Conference on Learning Representations (ICLR) 2017.

  36. Gilmer J, Ford N, Carlini N, Cubuk E. Adversarial examples are a natural consequence of test error in noise. In: International Conference on Machine Learning (ICML), 2019. p. 2280–2289.

  37. Kireev K, Andriushchenko M, Flammarion N. On the effectiveness of adversarial training against common corruptions. In: Uncertainty in Artificial Intelligence, 2022. p. 1012–1021.

  38. Kang D, Sun Y, Brown T, Hendrycks D, Steinhardt J. Transfer of adversarial robustness between perturbation types. arXiv preprint arXiv:1905.01034 2019

  39. Engstrom L, Tran B, Tsipras D, Schmidt L, Madry A. A rotation and a translation suffice: Fooling CNNS with simple transformations. In: International Conference on Learning Representations (ICLR) 2019.

  40. Chai X, Ba Q, Yang G. Characterizing robustness and sensitivity of convolutional neural networks for quantitative analysis of mitochondrial morphology. Quant Biol. 2018;6(4):344–58.

    Article  Google Scholar 

  41. Long J, Shelhamer E, Darrell T. Fully convolutional networks for semantic segmentation. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015. p. 3431–3440.

  42. Ronneberger O, Fischer P, Brox T. U-net: convolutional networks for biomedical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2015. p. 234–241.

  43. Zhong L, Li L, Yang G. Robustness benchmark datasets for semantic segmentation of fluorescence images updated. IEEE Dataport 2024 https://doi.org/10.21227/1jk9-nv64

  44. Luo Y, Guo Y, Li W, Liu G, Yang G. Fluorescence microscopy image datasets for deep learning segmentation of intracellular orgenelle networks. IEEE Dataport 2020 https://doi.org/10.21227/t2he-zn97

  45. Caicedo JC, Goodman A, Karhohs KW, Cimini BA, Ackerman J, Haghighi M, Heng C, Becker T, Doan M, McQuin C. Nucleus segmentation across imaging experiments: the 2018 data science bowl. Nat Methods. 2019;16(12):1247–53.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  46. Feng Y, Chai X, Ba Q, Yang G. Quality assessment of synthetic fluorescence microscopy images for image segmentation. In: IEEE International Conference on Image Processing (ICIP), 2019. p. 814–818.

  47. Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y. Generative adversarial networks. Commun ACM. 2020;63(11):139–44.

    Article  Google Scholar 

  48. Radford A, Metz L, Chintala S. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434 2015

  49. Arjovsky M, Chintala S, Bottou L. Wasserstein generative adversarial networks. In: International Conference on Machine Learning (ICML), 2017. p. 214–223.

  50. Zhu J-Y, Park T, Isola P, Efros AA. Unpaired image-to-image translation using cycle-consistent adversarial networks. In: IEEE International Conference on Computer Vision (ICCV), 2017. p. 2223–2232.

  51. Zhong L, Liu G, Yang G. Blind denoising of fluorescence microscopy images using gan-based global noise modeling. In: IEEE International Symposium on Biomedical Imaging (ISBI), 2021. p. 863–867.

  52. Isola P, Zhu J-Y, Zhou T, Efros AA. Image-to-image translation with conditional adversarial networks. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. p. 1125–1134.

  53. Morris PA, Aspden RS, Bell JE, Boyd RW, Padgett MJ. Imaging with a small number of photons. Nat Commun. 2015;6(1):5913.

    Article  CAS  PubMed  Google Scholar 

  54. KuKim S, KiPaik J. Out-of-focus blur estimation and restoration for digital auto-focusing system. Electron Lett. 1998;34(12):1217–9.

    Article  Google Scholar 

  55. Kim SK, Park SR, Paik JK. Simultaneous out-of-focus blur estimation and restoration for digital auto-focusing system. IEEE Trans Consum Electron. 1998;44(3):1071–5.

    Article  Google Scholar 

  56. Chen J, Sasaki H, Lai H, Su Y, Liu J, Wu Y, Zhovmer A, Combs CA, Rey-Suarez I, Chang H-Y. Three-dimensional residual channel attention networks denoise and sharpen fluorescence microscopy image volumes. Nat Methods. 2021;18(6):678–87.

    Article  CAS  PubMed  Google Scholar 

  57. Edlund C, Jackson TR, Khalid N, Bevan N, Dale T, Dengel A, Ahmed S, Trygg J, Sjögren R. Livecell-a large-scale dataset for label-free live cell segmentation. Nat Methods. 2021;18(9):1038–45.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  58. Ma J, Xie R, Ayyadhury S, Ge C, Gupta A, Gupta R, Gu S, Zhang Y, Lee G, Kim J, et al. The multimodality cell segmentation challenge: toward universal solutions. Nature Methods, 2024. p. 1–11.

  59. Badrinarayanan V, Kendall A, Cipolla R. Segnet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans Pattern Anal Mach Intell (TPAMI). 2017;39(12):2481–95.

    Article  Google Scholar 

  60. Chen L-C, Papandreou G, Schroff F, Adam H. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587 2017

  61. Zhao H, Shi J, Qi X, Wang X, Jia J. Pyramid scene parsing network. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. p. 2881–2890.

  62. Zhao H, Qi X, Shen X, Shi J, Jia J. Icnet for real-time semantic segmentation on high-resolution images. In: Proceedings of the European Conference on Computer Vision (ECCV), 2018:405–420

  63. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. p. 770–778.

  64. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I. Attention is all you need. In: Advances in Neural Information Processing Systems (NeurIPS) 2017.

  65. Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S, et al. An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 2020

  66. Chen J, Lu Y, Yu Q, Luo X, Adeli E, Wang Y, Lu L, Yuille AL, Zhou Y. Transunet: Transformers make strong encoders for medical image segmentation. arXiv preprint arXiv:2102.04306 2021.

  67. Tsipras D, Santurkar S, Engstrom L, Turner A, Madry A. Robustness may be at odds with accuracy. In: International Conference on Learning Representations (ICLR) 2018.

  68. Pinot R, Meunier L, Araujo A, Kashima H, Yger F, Gouy-Pailler C, Atif J. Theoretical evidence for adversarial robustness through randomization. In: Advances in Neural Information Processing Systems (NeurIPS) 2019.

  69. Ilyas A, Santurkar S, Tsipras D, Engstrom L, Tran B, Madry A. Adversarial examples are not bugs, they are features. In: Advances in Neural Information Processing Systems (NeurIPS) 2019.

  70. Cha S-H. Comprehensive survey on distance/similarity measures between probability density functions. Int J Math Models Methods Appl Sci. 2007;1(4):300–7.

    Google Scholar 

  71. Xie C, Tan M, Gong B, Wang J, Yuille AL, Le QV. Adversarial examples improve image recognition. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020. p. 819–828.

  72. Jordan M, Manoj N, Goel S, Dimakis AG. Quantifying perceptual distortion of adversarial examples. arXiv preprint arXiv:1902.08265 2019.

Download references

Acknowledgements

The authors thank colleagues in the Laboratory of Computational Biology and Machine Intelligence for their technical assistance.

Funding

This work was supported in part by the National Natural Science Foundation of China (Grants 92354307, 91954201, 31971289, 32101216), the Strategic Priority Research Program of the Chinese Academy of Sciences (Grant XDB37040402) and the Fundamental Research Funds for the Central Universities (Grant E3E45201X2).

Author information

Contributions

LZ and GY conceived the idea and designed the study. LZ developed the method and wrote the code. LZ and LL performed the experiments and interpreted the results. GY supervised the study and secured research funding. LZ and GY wrote the manuscript with feedback from all other authors. All the authors read and approved the final manuscript.

Corresponding author

Correspondence to Ge Yang.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Zhong, L., Li, L. & Yang, G. Benchmarking robustness of deep neural networks in semantic segmentation of fluorescence microscopy images. BMC Bioinformatics 25, 269 (2024). https://doi.org/10.1186/s12859-024-05894-4

Keywords