Take UNet to the Next Level! Enhance UNet with Transformer (2024)

Segmentation

3 main points
✔️ Proposes TransUNet, a model that combines UNet and Transformer
✔️ Combining the locality of CNNs with the long-range dependency modeling of Transformers is important
✔️ Achieves segmentation accuracy beyond conventional methods on two medical image datasets

TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation
written by Jieneng Chen, Yongyi Lu, Qihang Yu, Xiangde Luo, Ehsan Adeli, Yan Wang, Le Lu, Alan L. Yuille, Yuyin Zhou
(Submitted on 8 Feb 2021)
Comments: Published on arXiv.

Subjects: Computer Vision and Pattern Recognition (cs.CV)

The images used in this article are from the paper, the introductory slides, or were created based on them.

Introduction

Medical image segmentation is actively studied because it is an important preprocessing step for medical applications. Recently, deep learning models have achieved high segmentation accuracy.

One of the most successful models for medical image segmentation is UNet, a CNN with a U-shaped architecture. However, UNet has a weakness: it is not good at capturing long-range dependencies in an image. The reason usually given is that the convolutions that make up UNet are good at capturing local features but limited in capturing relationships between distant regions.

The strength of the Transformer is that it captures long-range dependencies. It is therefore expected to compensate for this weakness of UNet and improve segmentation accuracy.
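To make the locality argument concrete, here is a minimal sketch (not taken from the paper) that computes the theoretical receptive field of a stack of convolution or pooling layers with the standard recurrence. The function name `receptive_field` and the two layer stacks are hypothetical examples chosen only for illustration.

```python
def receptive_field(layers):
    """Theoretical receptive field (in input pixels) of stacked conv/pool layers.

    `layers` is a list of (kernel_size, stride) pairs, ordered from input to output.
    """
    rf = 1    # one output unit initially covers a single input pixel
    jump = 1  # spacing, in input pixels, between adjacent units of the current layer
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf


# Hypothetical stacks, chosen only for illustration.
three_convs = [(3, 1)] * 3
convs_with_pooling = [(3, 1), (3, 1), (2, 2), (3, 1), (3, 1)]

print(receptive_field(three_convs))         # 7
print(receptive_field(convs_with_pooling))  # 14
```

Each output unit of these small stacks only covers a 7 or 14 pixel window, and the window grows gradually with depth. A self-attention layer, by contrast, lets every token attend to every other token within a single layer.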

The paper proposes TransUNet, a model that combines UNet and Transformer. By combining a CNN, which is good at capturing local features, with a Transformer, which is good at capturing long-range dependencies, TransUNet segments more accurately than conventional methods.

As a result, TransUNet achieves segmentation accuracy that exceeds that of conventional methods on two medical image datasets. In addition, the experiments show that combining a CNN and a Transformer gives more accurate segmentation than using either one alone.

In this article, we present an overview of TransUNet and the results of its experiments on medical image datasets.

TransUNet

The figure above shows the architecture of TransUNet. In brief, it is a UNet whose encoder incorporates a Transformer (ViT). The encoder and decoder of TransUNet are described below.

The TransUNet encoder first extracts features with a CNN to capture local features, and then passes them through a Transformer to capture long-range dependencies. ResNet50 and ViT, both pretrained on ImageNet, are used as the CNN and the Transformer, respectively.

The TransUNet decoder upsamples the features in the same way as UNet and finally outputs the segmentation result. In addition, the CNN layers of the encoder are connected to the corresponding decoder layers by skip connections.
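The data flow described above can be sketched as a simple trace of feature-map sizes. This is only an illustration, not the authors' implementation. The function name `trace_shapes` is hypothetical, and the default values (224 x 224 input, 16x CNN downsampling, 768-dimensional tokens, three skip connections) are assumptions that this article does not state.

```python
def trace_shapes(image_size=224, cnn_downsample=16, hidden_dim=768, num_skips=3):
    """Trace feature-map sizes through a TransUNet-style encoder and decoder.

    All default values are illustrative assumptions, not taken from this article.
    Returns a list of (stage, spatial_size, note) tuples.
    """
    trace = []
    skip_sizes = []

    # 1) CNN stage: halve the resolution step by step and remember the
    #    earliest feature maps so the decoder can reuse them as skips.
    size, factor = image_size, 1
    while factor < cnn_downsample:
        size //= 2
        factor *= 2
        keeps_skip = len(skip_sizes) < num_skips
        if keeps_skip:
            skip_sizes.append(size)
        trace.append(("cnn", size, "skip saved" if keeps_skip else "no skip"))

    # 2) Transformer stage: each position of the final CNN map becomes one token.
    num_tokens = size * size
    trace.append(("transformer", size, f"{num_tokens} tokens of dim {hidden_dim}"))

    # 3) Decoder stage: upsample by 2 until the input resolution is restored,
    #    merging the saved skip feature whenever the resolutions match.
    while size < image_size:
        size *= 2
        note = "merge skip" if size in skip_sizes else "no skip"
        trace.append(("decoder", size, note))
    return trace


for stage, size, note in trace_shapes():
    print(f"{stage:<12}{size:>4} x {size:<4} {note}")
```

Under these assumed settings the Transformer sees a 14 x 14 grid (196 tokens), and the decoder merges skip features at 28, 56, and 112 pixels before reaching the full 224 x 224 resolution.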

Experiments

Medical image datasets

The paper reports segmentation experiments on the following two medical image datasets.

  1. Synapse multi-organ segmentation dataset
    • Dataset of abdominal CT images
    • Segmentation of 8 organs
  2. Automated cardiac diagnosis challenge (ACDC)
    • MRI dataset of the heart
    • Segmentation of 3 structures

Evaluation

The Dice coefficient (DSC, in %) and the Hausdorff distance (HD, in mm) are used to evaluate the model. A larger DSC and a smaller HD both indicate higher segmentation accuracy.
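As a rough illustration of these two metrics, the sketch below computes them on tiny binary masks in pure Python. It is not the evaluation code used in the paper. The helper names, the toy masks, and the `spacing_mm` parameter are hypothetical, and the Hausdorff function is the plain maximum-based definition computed over all foreground pixels by brute force.

```python
import math


def mask_to_pixels(mask):
    """Convert a 2D list of 0/1 values into a set of (row, col) foreground coordinates."""
    return {(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v}


def dice_coefficient(pred, target):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) between two sets of pixels."""
    if not pred and not target:
        return 1.0  # convention assumed here: two empty masks agree perfectly
    return 2 * len(pred & target) / (len(pred) + len(target))


def hausdorff_distance(pred, target, spacing_mm=1.0):
    """Symmetric Hausdorff distance between two non-empty sets of pixels."""
    def directed(a, b):
        # Largest distance from a point in `a` to its nearest point in `b`.
        return max(min(math.dist(p, q) for q in b) for p in a)

    return spacing_mm * max(directed(pred, target), directed(target, pred))


# Toy masks, chosen only for illustration.
target = mask_to_pixels([[0, 1, 1, 0],
                         [0, 1, 1, 0],
                         [0, 0, 0, 0]])
pred = mask_to_pixels([[0, 1, 1, 1],
                       [0, 1, 0, 0],
                       [0, 0, 0, 0]])

print(f"DSC: {100 * dice_coefficient(pred, target):.2f} %")  # DSC: 75.00 %
print(f"HD:  {hausdorff_distance(pred, target):.2f} mm")     # HD:  1.00 mm
```

The prediction shares 3 of 4 pixels with the target, so the DSC is 75 %, and the worst-placed pixel is one pixel away from the other mask, so the HD is 1 (assuming 1 mm pixel spacing).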

The segmentation accuracy on the Synapse multi-organ segmentation dataset is as follows.

[Table: segmentation accuracy on the Synapse multi-organ segmentation dataset]

TransUNet (DSC: 77.48 %, HD: 31.69 mm) achieves better segmentation accuracy than the conventional methods (V-Net, DARR, U-Net, AttnUNet). TransUNet also outperforms the model whose encoder is a Transformer only (ViT encoder with a CUP, or cascaded upsampler, decoder), which shows that combining a CNN with the Transformer is important.

The segmentation accuracy on the ACDC dataset is as follows.

On the ACDC dataset as well, TransUNet achieves the highest segmentation accuracy (DSC: 89.71 %), ahead of the conventional methods (R50-U-Net, R50-AttnUNet) and the Transformer-only model.

Segmentation Visualization

[Figure: segmentation results on the Synapse multi-organ segmentation dataset]

The figure above shows actual segmentation results on the Synapse multi-organ segmentation dataset. TransUNet provides more accurate segmentation than the other models.

For example, compare the segmentation images in the second row: UNet confuses the left kidney (red) with the spleen (light blue), and AttnUNet confuses the spleen (light blue) with the liver (purple). TransUNet, on the other hand, correctly segments the spleen (light blue).

Summary

In this article, we introduced TransUNet, a model that combines UNet and Transformer for medical image segmentation. It successfully combines the advantages of CNNs and Transformers to achieve segmentation accuracy beyond that of conventional models.

TransUNet is a hybrid CNN+Transformer model, but CNN-free segmentation models have also been developed. It will be interesting to see whether hybrid models or CNN-free models come to dominate segmentation tasks in the future.

Categories related to this article

  • Medical
  • Segmentation
  • Transformer


Shumpei Takezaki


FAQs

What is TransUNet?

TransUNet uses a hybrid CNN-Transformer architecture as the encoder, together with a cascaded upsampler, to enable precise localization. A CNN is first used as a feature extractor to generate a feature map from the input.

What is U-Net in deep learning?

U-Net is an encoder-decoder convolutional neural network with extensive applications in medical imaging, autonomous driving, and satellite imaging. Understanding how U-Net performs segmentation is important, because many later architectures build on the same intuition.

Where to look for the pancreas?

An abdominal ultrasound is a common imaging test for evaluating the organs in your abdomen. To look at the pancreas, your healthcare provider will conduct an “upper right quadrant” abdominal ultrasound, which shows the pancreas, liver and gallbladder.

Is UNet++ better than U-Net?

In summary, UNet++ is an extension of the UNet architecture that improves segmentation performance by introducing nested skip pathways and better feature aggregation, making it a powerful tool for various computer vision tasks, especially semantic segmentation.

Why is U-Net good?

UNet is well-suited for multi-class image segmentation tasks, as it can handle a large number of classes and produce a pixel-level segmentation map for each class. However, it may be necessary to balance the training data or use probabilistic segmentation maps to handle class overlap or imbalanced class distributions.

How does U-Net work?

The U-Net architecture follows an "encoder-decoder" structure, where the contracting path is the encoder and the expanding path is the decoder. This design resembles encoding information into a compressed form and then decoding it to reconstruct an output at the original resolution.

What is the difference between CNN and U-Net?

A typical classification CNN reduces an image to a feature vector, which is then used to predict a class label. U-Net also compresses the image into a compact feature representation, but then expands that representation back into an image-sized output. This helps preserve the original structure of the image.

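Is U-Net an autoencoder?

There are different variations of autoencoders, such as sparse and variational autoencoders, and they all compress and then decompress the data. U-Net likewise compresses and then decompresses its input. Unlike a standard autoencoder, however, U-Net is typically trained to predict a segmentation map rather than to reconstruct its input, and its skip connections pass features around the compressed bottleneck.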

What is the difference between U-Net and feature pyramid network?

Like U-Net, the FPN has lateral connections between the bottom-up pyramid (left) and the top-down pyramid (right). But where U-Net simply copies the features and concatenates them, the FPN applies a 1x1 convolution layer before adding them.
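The difference between the two merge styles can be sketched at a single pixel, where a 1x1 convolution reduces to a matrix-vector product over channels. This is a minimal illustration, not code from either architecture. The function names, the feature values, and the 1x1 kernel weights are hypothetical.

```python
def unet_merge(skip, upsampled):
    """U-Net-style merge at one pixel: concatenate the two channel vectors."""
    return skip + upsampled  # list concatenation, so the channel counts add up


def fpn_merge(lateral, upsampled, weights):
    """FPN-style merge at one pixel: 1x1 convolution, then element-wise addition.

    `weights` has shape [out_channels][in_channels]; at a single pixel a 1x1
    convolution is just this matrix applied to the lateral channel vector.
    """
    projected = [sum(w * x for w, x in zip(row, lateral)) for row in weights]
    return [p + u for p, u in zip(projected, upsampled)]


# Hypothetical features at one pixel.
skip = [1.0, 2.0, 3.0]           # 3 channels from the encoder / bottom-up path
upsampled = [10.0, 20.0]         # 2 channels from the decoder / top-down path
weights = [[1.0, 0.0, 0.0],      # hypothetical 1x1 kernel mapping 3 -> 2 channels
           [0.0, 1.0, 1.0]]

print(unet_merge(skip, upsampled))          # [1.0, 2.0, 3.0, 10.0, 20.0]
print(fpn_merge(skip, upsampled, weights))  # [11.0, 25.0]
```

Concatenation keeps both feature sets side by side (5 channels here), while the FPN-style merge first projects the lateral feature to the matching channel count and then sums, so the output keeps 2 channels.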

Why do we use U-Net for image segmentation?

The combination of the two paths enables U-net to learn both global and local features and to achieve high accuracy in segmentation tasks. One of the strengths of U-net is its versatility in accepting different types of input data, such as grayscale, color, and multi-channel images.

Article information

Author: Carmelo Roob