PyConvU-Net: a lightweight and multiscale network for biomedical image segmentation

Background: With the development of deep learning (DL), a growing number of DL-based methods have been proposed and achieve state-of-the-art performance in biomedical image segmentation. However, these methods are usually complex and require powerful computing resources, which are impractical to provide in many clinical settings. It is therefore important to develop accurate DL-based biomedical image segmentation methods that can run under resource-constrained computing.
Results: A lightweight and multiscale network called PyConvU-Net is proposed for low-resource computing. In strictly controlled experiments, PyConvU-Net performs well on three biomedical image segmentation tasks while having the fewest parameters among the compared models.
Conclusions: Our experimental results preliminarily demonstrate the potential of the proposed PyConvU-Net for biomedical image segmentation under resource-constrained computing.

because of its excellent performance. U-Net is now widely applied in biomedical image segmentation and has given rise to many variants, such as MultiResUNet [10], Attention U-Net [11], and UNet++ [12]. Each of these variants addresses problems that U-Net encounters in practice.
The U-Net is an encoder-decoder architecture [13] consisting of a contracting path and an expansive path. The former down-samples the input, which increases the receptive field [14] and captures more features. The latter restores the features extracted by the former and concatenates them with the corresponding feature maps from the contracting path. This concatenation, called a skip connection [15], is an important part of U-Net because it combines information from both paths. However, the way U-Net gathers context information does not extract enough fine detail to achieve better performance. To address this problem, we adopt a convolution called pyramidal convolution [16] to capture more information and improve the performance of our model.
The pyramidal convolution (PyConv), illustrated in Fig. 1, processes the input at multiple filter scales. It contains a pyramid with n levels of different types of kernels. The goal of PyConv is to process the input at different kernel scales without increasing the computational cost or the model complexity (in terms of parameters) compared to standard convolution. At each level of the PyConv, the kernel has a different spatial size, and the kernel size increases from the bottom of the pyramid (level 1) to the top (level n). As the spatial size increases, the depth of the kernel decreases from level 1 to level n. PyConv therefore involves filters of varying size and depth, so that it can capture different levels of detail in the scene. Moreover, it is flexible and extensible, providing a large space of potential network architectures for different applications.
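The trade-off between kernel size and kernel depth can be checked with a small parameter count. The sketch below is illustrative only: it assumes that each pyramid level is realized as a grouped convolution, so that a level with G groups sees only 1/G of the input channels, and the four-level configuration (kernel sizes 3, 5, 7, 9 with 1, 4, 8, 16 groups and 16 output maps each) is an assumed example, not necessarily the setting used in PyConvU-Net. The function names are hypothetical.

```python
def conv_params(kernel, in_ch, out_ch, groups=1):
    """Number of weights in a (grouped) 2D convolution, ignoring the bias."""
    if in_ch % groups or out_ch % groups:
        raise ValueError("channels must be divisible by groups")
    # Each output map only sees in_ch / groups input channels.
    return kernel * kernel * (in_ch // groups) * out_ch


def pyconv_params(in_ch, levels):
    """Total weights of a pyramid given (kernel, out_ch, groups) per level."""
    return sum(conv_params(k, in_ch, out, g) for k, out, g in levels)


# Assumed example: 64 input maps, 64 output maps split over four levels.
LEVELS = [(3, 16, 1), (5, 16, 4), (7, 16, 8), (9, 16, 16)]

standard = conv_params(3, 64, 64)    # single 3 x 3 convolution, 64 -> 64
pyramid = pyconv_params(64, LEVELS)  # four kernel sizes, 64 -> 64

print(standard)  # 36864
print(pyramid)   # 27072
```

Under these assumptions the pyramid covers kernel sizes up to 9 × 9 while using fewer weights than a single 3 × 3 convolution with the same number of input and output maps, because the larger kernels are given a smaller depth.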
In this paper, we develop a novel architecture called PyConvU-Net, an enhanced version of U-Net, which implements PyConv in a standard U-Net architecture and applies it to biomedical image segmentation. We also compare PyConvU-Net with several other models on different datasets. It achieves good performance with fewer parameters, which saves computing power.
Fig. 1 The structure of pyramidal convolution
U-Net consists of a contracting path to capture context and a symmetric expanding path that enables precise localization. The contracting path follows the typical architecture of a convolutional network. It consists of the repeated application of two 3 × 3 convolutions (unpadded convolutions), each followed by a rectified linear unit (ReLU) [17], and a 2 × 2 max pooling operation with stride 2 for down-sampling. Every step in the expansive path consists of an up-sampling of the feature map followed by a 2 × 2 convolution ("up-convolution") that halves the number of feature channels, a concatenation with the correspondingly cropped feature map from the contracting path, and two 3 × 3 convolutions, each followed by a ReLU. The cropping is necessary due to the loss of border pixels in every convolution. At the final layer, a 1 × 1 convolution is used to map each 64-component feature vector to the desired number of classes. In total the network has 23 convolutional layers.
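The loss of border pixels can be made concrete by tracking the spatial size of the feature map through the network. The following sketch assumes four down-sampling steps and a 572 × 572 input, which are the values commonly quoted for the original U-Net and are not stated in this text; the function name and its `padding` argument are hypothetical.

```python
def unet_output_size(size, depth=4, padding=0):
    """Spatial size of the U-Net output for a square input of side `size`.

    padding=0 models unpadded 3 x 3 convolutions (each removes 2 pixels),
    padding=1 models padded convolutions that preserve the size.
    """
    def double_conv(s):
        for _ in range(2):
            s = s + 2 * padding - 2  # 3 x 3 convolution, stride 1
        return s

    # Contracting path: two convolutions, then 2 x 2 max pooling.
    for _ in range(depth):
        size = double_conv(size)
        if size % 2:
            raise ValueError("feature map side must be even before pooling")
        size //= 2

    size = double_conv(size)  # bottom of the "U"

    # Expansive path: up-sampling by 2, then two convolutions.
    for _ in range(depth):
        size = double_conv(size * 2)
    return size


print(unet_output_size(572))             # 388
print(unet_output_size(512, padding=1))  # 512
```

With unpadded convolutions the output is smaller than the input, which is why the feature maps from the contracting path have to be cropped before concatenation. With one pixel of padding per convolution the size is preserved and no cropping is needed.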
The exploration of the U-Net architecture has been a part of biomedical image segmentation research since its introduction. Researchers have proposed many variants of U-Net and continuously improved the performance of the structure. For example, MultiResUNet [10] combines the MultiRes module with U-Net, where MultiRes is an extension of the residual connection [18]. In this module, the outputs of three 3 × 3 convolutions are concatenated into a combined feature map, which is then added to the input feature after the latter passes through a 1 × 1 convolution. Besides the MultiRes module, MultiResUNet has another significant component, the ResPath, which applies some additional convolution operations before the encoder features are concatenated with the corresponding features in the decoder. Another excellent network is Attention U-Net [11], which brings the attention mechanism into U-Net. Before the encoder features at each resolution are concatenated with the corresponding decoder features, an attention module generates a gating signal that controls the importance of features at different spatial locations and readjusts the output of the encoder. The attention module combines ReLU and Sigmoid activations with 1 × 1 × 1 convolutions to generate a weight map α, which is multiplied with the encoder features to rescale them. UNet++ [12] is also a good architecture. It starts with an encoder sub-network, or backbone, followed by a decoder sub-network. What distinguishes UNet++ from U-Net is the re-designed skip pathway that connects the two sub-networks and the use of deep supervision.
Besides the networks based on U-Net, there are many other segmentation networks for biomedical images. We choose a network called FCN [19], which is also a good network for semantic segmentation, to compare with ours. FCN takes its name from the fact that it converts the fully connected layers of a traditional CNN [20] into convolutional layers. It is a fully convolutional network without any fully connected layer and can therefore accept input of any size. In addition, it uses deconvolutional layers to up-sample the feature maps and produce a finer output. It also uses skip connections to integrate information from layers at different depths, which helps ensure robustness and accuracy.

Results
As shown in Table 1, we demonstrate the application of PyConvU-Net to three different segmentation tasks. The first task is the segmentation of the lung in CT images [21]. The dataset, called kaggleLung, is provided by the Finding and Measuring Lungs in CT Data challenge on Kaggle. It is a collection of 512 × 512 CT images, manually segmented lungs, and measurements in 2D and 3D, and it contains 267 2D images. We use only the 2D images and split the dataset into two parts: the training set accounts for 80% and the test set for 20%. Each image comes with a corresponding fully annotated ground truth segmentation map for the lung (white) and other parts (black). The second dataset is similar to the first, except that the organ is the liver. The liver dataset contains 400 images of size 512 × 512, more than kaggleLung. These two datasets share the same challenges: the images have unclear edges, and organs from different people differ slightly. These challenges affect the edge extraction and the localization of the organs we want to segment. The last dataset, ISBICell [22], is provided by the EM segmentation challenge that was started at ISBI 2012 and is still open for new contributions. The training data is a set of 30 images of size 512 × 512 from serial section transmission electron microscopy of the Drosophila first instar larva ventral nerve cord (VNC) [23]. ISBICell contains more detailed information (complex cell boundaries), which tests the model's ability to handle details. Because these datasets have few samples, we adopt some simple data augmentation methods to expand them: horizontal flip, vertical flip, 90° rotation, and 180° rotation.
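The four augmentation operations are simple array manipulations. The sketch below represents an image as a nested list of pixel rows so that it needs no external library; the helper names are hypothetical, the 90° rotation is assumed to be clockwise, and applying the same transform to an image and its mask is an assumption about how such augmentation is normally done rather than a detail given above.

```python
def hflip(img):
    """Horizontal flip: mirror every row."""
    return [row[::-1] for row in img]


def vflip(img):
    """Vertical flip: reverse the order of the rows."""
    return img[::-1]


def rot90(img):
    """Rotate by 90 degrees clockwise (direction is an assumption)."""
    return [list(col) for col in zip(*img[::-1])]


def rot180(img):
    """Rotate by 180 degrees."""
    return [row[::-1] for row in img[::-1]]


def augment(image, mask):
    """Return the original pair plus four transformed (image, mask) pairs."""
    pairs = [(image, mask)]
    for transform in (hflip, vflip, rot90, rot180):
        # The same transform is applied to both so that they stay aligned.
        pairs.append((transform(image), transform(mask)))
    return pairs


image = [[1, 2], [3, 4]]
mask = [[0, 1], [1, 0]]
print(len(augment(image, mask)))  # 5
print(rot90(image))               # [[3, 1], [4, 2]]
```

Each training sample then yields five image and mask pairs, which is one way such a small dataset could be expanded.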
For comparison, we use FCN [19], the original U-Net, and a series of U-Net variants including UNet++, Resnet34_UNet, and Attention U-Net. First, the training losses of the models are shown in Fig. 2. It is clear from Fig. 2 that the training losses of all models remain stable after the first 5 epochs of training, and that only the loss of UNet++ stays higher than that of the other models once it has stabilized. As shown in Table 2, we choose two metrics, MIoU [24] and Dice [25], to evaluate our model on the three segmentation tasks.
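MIoU is the mean, over all categories, of the ratio between the intersection and the union of the true value set and the predicted value set. The formula is as follows.
$$\mathrm{MIoU} = \frac{1}{k}\sum_{i=1}^{k}\frac{TP}{FN + FP + TP}$$
where TP, FN, and FP denote the true positives, false negatives, and false positives of category i. It is equivalent to the following formula.
$$\mathrm{MIoU} = \frac{1}{k}\sum_{i=1}^{k}\frac{p_{ii}}{\sum_{j=1}^{k}p_{ij} + \sum_{j=1}^{k}p_{ji} - p_{ii}}$$
where k is the number of categories, i represents the true value, j represents the predicted value, and p_{ij} is the number of pixels of category i predicted as category j. p_{ii} is therefore the number of correctly predicted pixels of category i.
The Dice coefficient measures the similarity of two sets and is one of the commonly used evaluation indicators in semantic segmentation. It is defined as twice the intersection of the two sets divided by the sum of their pixels, which is similar to IoU. Its calculation formula is as follows.
$$\mathrm{Dice} = \frac{2\,|X \cap Y|}{|X| + |Y|}$$
where X is the true value set and Y is the predicted value set. It is equivalent to the following formula.
$$\mathrm{Dice} = \frac{2TP}{FP + 2TP + FN}$$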
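As a rough illustration of how the two metrics relate to the counts TP, FN, and FP, the sketch below computes both from flat lists of integer class labels through a confusion matrix. The function names are hypothetical, the averaging is taken over the categories that actually occur, and the choice of class 1 as the foreground for Dice is an assumption; this is not the evaluation code used in the experiments.

```python
def confusion_matrix(y_true, y_pred, k):
    """p[i][j] = number of pixels of true class i predicted as class j."""
    p = [[0] * k for _ in range(k)]
    for t, q in zip(y_true, y_pred):
        p[t][q] += 1
    return p


def mean_iou(p):
    """Mean over classes of TP / (FN + FP + TP)."""
    k = len(p)
    ious = []
    for i in range(k):
        tp = p[i][i]
        fn = sum(p[i]) - tp                       # true i, predicted other
        fp = sum(p[j][i] for j in range(k)) - tp  # predicted i, true other
        union = tp + fn + fp
        if union:  # skip classes absent from both truth and prediction
            ious.append(tp / union)
    return sum(ious) / len(ious)


def dice(p, cls=1):
    """Dice coefficient 2TP / (FP + 2TP + FN) for one class."""
    tp = p[cls][cls]
    fn = sum(p[cls]) - tp
    fp = sum(row[cls] for row in p) - tp
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0


y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]
p = confusion_matrix(y_true, y_pred, k=2)
print(p)                      # [[1, 1], [0, 2]]
print(round(mean_iou(p), 4))  # 0.5833
print(dice(p))                # 0.8
```

In this toy case the background has an IoU of 1/2 and the foreground an IoU of 2/3, giving a mean of about 0.58, while the foreground Dice is 4/5. Dice is never smaller than IoU for the same class.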
Our proposed method achieves the best performance on the liver dataset, by a large margin over the second-placed method. On the kaggleLung dataset, our proposed method does not take first place but performs better than all other models except U-Net. On the last segmentation task, PyConvU-Net performs similarly to the other methods: it ranks first in Dice and second in MIoU. In the experiments, we also measured the parameter size and computational complexity of the different models, which are listed in Table 3. As shown in Fig. 3, the MIoU and Dice of our proposed method, FCN8s, and Resnet34_UNet stabilize after 3 epochs and remain at a high level, whereas the other methods are much less stable.
Our method has the fewest parameters, which means that our network does not need much computational power. This shows that, even if we lose some precision in some respects, we can keep the network lightweight without compromising the segmentation tasks performed by our proposed model.
The predictions of the different methods are shown in Fig. 4.
All experiments were carried out in the PyTorch framework [26], and the models were trained on Nvidia RTX 2080Ti GPUs. All networks are trained for a total of 50 epochs with a batch size of 5.

Discussion
Due to its excellent performance, U-Net has been the most widely used backbone architecture for biomedical image segmentation in recent years. However, in our studies, we observe that U-Net ignores detailed information when performing convolution operations [27]. We analyze this issue in detail and address it by proposing a lightweight and multiscale architecture, PyConvU-Net, which replaces the traditional convolution layers with pyramidal convolution layers. This network, which can extract multiple sequence feature information [28], not only achieves improvements in biomedical image segmentation tasks [29] but also reduces the number of parameters.
We evaluate the proposed method on three biomedical image segmentation tasks. Table 2 shows that the proposed method does not outperform the other methods on all datasets. PyConvU-Net takes first place on the liver dataset, by a large margin over the second-placed method. However, it does not perform as well as FCN8s on the kaggleLung dataset, where it is only second in MIoU and third in Dice. We think the reason is that the liver dataset has clear edges between different organs, whereas the boundaries in the kaggleLung dataset are fuzzy. The proposed method therefore has shortcomings in segmenting images with blurred boundaries. The same happens on the ISBICell dataset. The cell images have many complex edges that are entangled with each other. To some extent these boundaries are unclear, so PyConvU-Net does not perform very well on the ISBICell dataset. According to the experimental results in Table 2, although the proposed model does not achieve the best performance on all tasks, it is still in a leading position. From the beginning, our goal has been to minimize the number of model parameters and the computational complexity without losing segmentation accuracy, or while losing only a small part of it. We list the number of parameters and the computational complexity of the different models in Table 3. In terms of the number of parameters, U-Net has 7.77 MB parameters, and our proposed model has almost half of that. Computational complexity is measured in FLOPs, and our proposed model is far ahead in this regard.
Hence, our future work has three parts. The first is to improve the model's ability to segment images with blurred boundaries and to extract edges, in order to solve the problem of lost object edges. The second is to further reduce the number of parameters and the computational complexity so that the model can be deployed on mobile devices. The last is to achieve both high segmentation accuracy and a lightweight model, and thus obtain an accurate and efficient biomedical image segmentation model.

Conclusion
We propose a lightweight and multiscale network called PyConvU-Net, which is built from pyramidal convolutions on top of U-Net. The purpose of pyramidal convolution is to use filters of different sizes to capture detailed information that is typically missed by traditional convolution. Our experiments and analysis show that, although we use different kernel sizes, PyConvU-Net does not increase the number of parameters while maintaining good performance on different segmentation tasks.
For future work, it will be interesting to explore how to improve the performance of our proposed architecture on other segmentation datasets.
Figure 5 shows an overview of the suggested architecture. As seen, PyConvU-Net adopts an encoder-decoder framework like that of U-Net. What distinguishes PyConvU-Net from U-Net is the re-designed convolutional layers (shown as red arrows), which replace the traditional convolution with the pyramidal convolution. As shown in the legend at the bottom of Fig. 5, all convolution blocks are followed by a batch normalization layer [30] and a ReLU activation function.

Methods
Traditional convolution with a fixed kernel size has reached a bottleneck: it cannot gain more detailed information to improve the performance of the network. We therefore looked for another form of convolution that can extract as much information as possible from biomedical images without increasing the computational cost, and pyramidal convolution meets this need. We replace all conventional convolution layers in U-Net with pyramidal convolutions. We also change the padding scheme. U-Net uses valid padding, which reduces the size of the feature map after each convolution and can drop some fine information. To solve this problem, we change the valid padding to same padding, so that the feature map has the same size before and after convolution. At the final layer of the original U-Net, a 1 × 1 convolution is used to map each 64-component feature vector to the desired number of classes. In contrast, the final layer of our proposed model is a Sigmoid activation function, because our mask image is a binary image. Through the Sigmoid activation function, each output pixel of the network is a value between 0 and 1, which is convenient to compare with the binary mask.
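The role of the Sigmoid output layer can be illustrated with a few lines of code. The sketch below maps raw network outputs to values between 0 and 1 and then applies a threshold to obtain a binary mask; the threshold of 0.5 and the helper names are assumptions made for illustration and are not stated in the text.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def to_binary_mask(logits, threshold=0.5):
    """Turn a 2D grid of raw outputs into a 0/1 mask.

    The threshold value is an assumed default, not taken from the paper.
    """
    return [[1 if sigmoid(v) >= threshold else 0 for v in row]
            for row in logits]


logits = [[-2.0, 0.0], [3.0, -0.5]]
print(to_binary_mask(logits))  # [[0, 1], [1, 0]]
```

A mask obtained in this way has the same format as the ground truth (white for the organ, black elsewhere), so the two can be compared pixel by pixel.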
The number of parameters and FLOPs required for the standard convolution can be calculated by the following formulas: