U-Net

Introducing Symmetry in Segmentation

6 min read · Jan 23, 2019

Vision is one of the most important senses humans possess. But have you ever wondered how complex the task really is? Capturing reflected light rays and extracting meaning from them is an intricate process, yet we do it effortlessly, thanks to millions of years of evolution. So how can we give machines the same ability in a tiny fraction of that time? For computers, images are nothing but matrices, and understanding the nuances hidden in those matrices has been an obsession of mathematicians for years. With the emergence of artificial intelligence, and CNN architectures in particular, the research has progressed like never before. Many problems that were previously considered untouchable are now showing astounding results.

One such problem is image segmentation. In image segmentation, the machine has to partition the image into different segments, each representing a different entity.

[Figure: an image segmented into two regions, the cat and the background]

As you can see above, the image is split into two segments: one representing the cat and the other the background. Image segmentation is useful in many fields, from self-driving cars to satellite imagery, and perhaps most important of all is medical imaging. The subtleties in medical images are quite complex and sometimes challenging even for trained physicians. A machine that can understand these nuances and identify the relevant areas can make a profound impact on medical care.

Convolutional neural networks gave decent results on easier image segmentation problems but made little progress on complex ones. That's where UNet comes into the picture. UNet was first designed especially for medical image segmentation, and it showed such good results that it has since been used in many other fields. In this article, we'll talk about why and how UNet works. If you don't know the intuition behind CNNs, please read this first. You can check out UNet in action here.

The main idea behind a CNN is to learn the feature mapping of an image and exploit it to build ever more nuanced feature maps. This works well in classification problems, where the image is eventually converted into a vector that is used for classification. In image segmentation, however, we not only need to convert the feature map into a vector but also reconstruct an image from that vector. This is a mammoth task, because converting a vector into an image is a lot tougher than the reverse. The whole idea of UNet revolves around this problem.

While converting an image into a vector, we have already learned a feature mapping of the image, so why not use the same mapping to convert it back into an image? This is the recipe behind UNet: use the same feature maps that were learned during contraction to expand a vector into a segmented image. This preserves the structural integrity of the image and reduces distortion enormously. Let's look at the architecture in more detail.

How UNet Works

[Figure: the U-Net architecture]

The architecture looks like a 'U', which justifies its name. It consists of three sections: the contraction, the bottleneck, and the expansion section. The contraction section is made of several contraction blocks. Each block takes an input, applies two 3×3 convolution layers, and follows them with a 2×2 max pooling. The number of kernels, or feature maps, doubles after each block so that the architecture can learn complex structures effectively. The bottommost layer mediates between the contraction section and the expansion section: it applies two 3×3 convolution layers followed by a 2×2 up-convolution layer.
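To make the contraction path concrete, here is a minimal PyTorch sketch of a contraction block. It is only a sketch: the names contraction_block and pool, the unpadded convolutions, and the 572×572 input size are assumptions taken from the paper's figure rather than from the linked implementation.

import torch
import torch.nn as nn

def contraction_block(in_channels, out_channels):
    # Two unpadded 3x3 convolutions, each followed by a ReLU. The 2x2 max
    # pooling is kept separate so that the pre-pooled output can be reused
    # later as a skip connection.
    return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, kernel_size=3),
        nn.ReLU(inplace=True),
        nn.Conv2d(out_channels, out_channels, kernel_size=3),
        nn.ReLU(inplace=True),
    )

pool = nn.MaxPool2d(kernel_size=2, stride=2)

# The number of feature maps doubles after each block: 1 -> 64 -> 128 -> ...
block1 = contraction_block(1, 64)
block2 = contraction_block(64, 128)

x = torch.randn(1, 1, 572, 572)   # example input size from the paper
f1 = block1(x)                    # (1, 64, 568, 568), kept for the skip connection
f2 = block2(pool(f1))             # (1, 128, 280, 280)

Stacking more such blocks, with a final pair of 3×3 convolutions and a 2×2 up-convolution as the bottleneck, gives the left half of the 'U'.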

But the heart of this architecture lies in the expansion section. Like the contraction section, it consists of several expansion blocks. Each block passes its input through two 3×3 convolution layers followed by a 2×2 up-convolution (upsampling) layer, and after each block the number of feature maps used by the convolution layers is halved to maintain symmetry. Crucially, the input of each block is also concatenated with the feature maps of the corresponding contraction block. This ensures that the features learned while contracting the image are used to reconstruct it. The number of expansion blocks equals the number of contraction blocks. Finally, the resulting feature map passes through a 1×1 convolution layer that maps it to the number of segments desired.
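A matching sketch of one expansion step follows, again under the same assumptions. It uses the paper's ordering of a 2×2 up-convolution, then concatenation with the centre-cropped contraction feature map, then two 3×3 convolutions; crop_and_concat here is a hypothetical stand-in for the helper of the same name mentioned later, not the exact linked code.

import torch
import torch.nn as nn

def expansion_block(in_channels, out_channels):
    # Two unpadded 3x3 convolutions that halve the channel count after the
    # skip connection has doubled it.
    return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, kernel_size=3),
        nn.ReLU(inplace=True),
        nn.Conv2d(out_channels, out_channels, kernel_size=3),
        nn.ReLU(inplace=True),
    )

def crop_and_concat(upsampled, bypass):
    # Centre-crop the contraction feature map to the (smaller) spatial size
    # of the upsampled map, then concatenate along the channel axis.
    dh = (bypass.size(2) - upsampled.size(2)) // 2
    dw = (bypass.size(3) - upsampled.size(3)) // 2
    bypass = bypass[:, :, dh:dh + upsampled.size(2), dw:dw + upsampled.size(3)]
    return torch.cat([upsampled, bypass], dim=1)

up = nn.ConvTranspose2d(128, 64, kernel_size=2, stride=2)   # 2x2 up-convolution

bottleneck_out = torch.randn(1, 128, 28, 28)   # hypothetical bottleneck output
skip = torch.randn(1, 64, 64, 64)              # matching contraction feature map

x = up(bottleneck_out)            # (1, 64, 56, 56)
x = crop_and_concat(x, skip)      # (1, 128, 56, 56)
x = expansion_block(128, 64)(x)   # (1, 64, 52, 52)

The concatenation is what turns a plain encoder-decoder into a U-Net: the decoder sees both the coarse, upsampled features and the fine, high-resolution features from the matching contraction block.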

Loss calculation in UNet

What kind of loss would one use for such an intricate image segmentation task? Well, it is defined quite simply in the paper itself:

The energy function is computed by a pixel-wise soft-max over the final feature map combined with the cross-entropy loss function

UNet uses a rather novel loss weighting scheme, with a higher weight at the border of segmented objects. This weighting scheme helped the U-Net model segment cells in biomedical images so that individual cells may be easily identified within the binary segmentation map, even where they touch.

First, a pixel-wise softmax is applied to the final feature map, followed by the cross-entropy loss function. So we are classifying each pixel into one of the classes. The idea is that even in segmentation every pixel has to lie in some category, and we just need to make sure that it does. In other words, we have converted a segmentation problem into a multiclass classification problem, and it performs very well compared with traditional loss functions.
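Putting the two ideas together, pixel-wise cross-entropy plus a per-pixel weight map, a rough PyTorch sketch could look like the following. The weight_map tensor here is a hypothetical placeholder for the map the paper computes from the ground-truth masks; the function name and the shapes are assumptions for illustration.

import torch
import torch.nn.functional as F

def weighted_pixelwise_ce(logits, targets, weight_map):
    # logits: (N, C, H, W) raw network output, targets: (N, H, W) class indices,
    # weight_map: (N, H, W) precomputed per-pixel weights (higher near borders).
    per_pixel = F.cross_entropy(logits, targets, reduction='none')   # (N, H, W)
    return (weight_map * per_pixel).mean()

logits = torch.randn(2, 2, 388, 388)            # two classes, paper-sized output
targets = torch.randint(0, 2, (2, 388, 388))    # ground-truth class per pixel
weight_map = torch.ones(2, 388, 388)            # placeholder: uniform weights
loss = weighted_pixelwise_ce(logits, targets, weight_map)

With weight_map set to all ones this reduces to the plain pixel-wise cross-entropy used in the training snippet below.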

I implemented the UNet model using the PyTorch framework; you can check out the UNet module here. The images used are optical coherence tomography scans of patients with diabetic macular edema. You can also check out UNet in action here.

The UNet module in the linked code represents the whole UNet architecture. contraction_block and expansive_block are used to create the contraction and expansion sections respectively, and the function crop_and_concat appends the output of a contraction layer to the input of the corresponding expansion layer. The training step can be written as:

import torch

unet = Unet(in_channel=1, out_channel=2)
# out_channel represents the number of segments desired
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(unet.parameters(), lr=0.01, momentum=0.99)

optimizer.zero_grad()
outputs = unet(inputs)
# permute so that the number of desired segments is on the 4th dimension: (N, H, W, C)
outputs = outputs.permute(0, 2, 3, 1)
m = outputs.shape[0]
# flatten the outputs and labels to calculate the pixel-wise softmax loss
outputs = outputs.reshape(m * width_out * height_out, 2)
labels = labels.reshape(m * width_out * height_out)
loss = criterion(outputs, labels)
loss.backward()
optimizer.step()

Image segmentation is an important problem, and new research papers are published every day. UNet has contributed significantly to this research, and many new architectures have been inspired by it. But there is still so much to explore. There are many variants of this architecture in the industry, so it is necessary to understand the original in order to understand them better. If you have any doubts, please comment below or refer to the resources page.

This tutorial is the second article in my DeepResearch series. If you like this tutorial, please let me know in the comments, and if you don't, please let me know in the comments in more detail. If you have any doubts or criticism, just flood the comments with it; I'll reply as soon as I can. If you like this tutorial, please share it with your peers.


FAQs

Why does U-Net work so well?

The main idea is to supplement a usual contracting network by successive layers, where pooling operations are replaced by upsampling operators. Hence these layers increase the resolution of the output. A successive convolutional layer can then learn to assemble a precise output based on this information.

What are the cons of U-Net?

Disadvantages: A large number of parameters: UNet has many parameters due to the skip connections and the additional layers in the expanding path. This can make the model more prone to overfitting, especially when working with small datasets.

What is the basic explanation of U-Net?

U-Net is a popular deep-learning architecture for semantic segmentation. Originally developed for medical images, it had great success in this field. But, that was only the beginning! From satellite images to handwritten characters, the architecture has improved performance on a range of data types.

What is the difference between CNN and U-Net?

In a CNN, the image is converted into a vector, which is largely used in classification problems. In U-Net, an image is converted into a vector and then the same feature mapping is used to convert it back into an image. This reduces distortion by preserving the original structure of the image.

What is the U-Net bottleneck?

Bottlenecks in Neural Networks are a way to force the model to learn a compression of the input data. The idea is that this compressed view should only contain the “useful” information to be able to reconstruct the input (or segmentation map).

What is the difference between FCN and U-Net?

U-Net combines the strengths of traditional FCNs with additional features that make it more effective for image segmentation tasks. The key difference between the two models is the symmetry of the encoder and decoder portions of the network and the skip connections between them.

What is the weakness of UNet?

However, UNet has a weakness: it is not good at capturing long-range dependencies. The usual explanation is that the CNNs that make up UNet are good at capturing local features but limited in capturing long-range ones.

What are the cons of ResNets?

Disadvantages: Increased complexity: The presence of skip connections makes ResNets more complex than traditional deep neural networks, which can lead to higher computational demands and memory requirements.


What is U-Net in deep learning?

U-Net is a widely used deep learning architecture that was first introduced in the “U-Net: Convolutional Networks for Biomedical Image Segmentation” paper. The primary purpose of this architecture was to address the challenge of limited annotated data in the medical field.

Is U-Net a fully connected network?

No. One thing you might notice is that, unlike classification networks, this network doesn't have a fully connected/linear layer. U-Net is an example of a fully convolutional network (FCN).

Is U-Net++ better than U-Net?

UNet++ without deep supervision achieves a significant performance gain over both U-Net and wide U-Net, yielding an average improvement of 2.8 and 3.3 IoU points, respectively. UNet++ with deep supervision exhibits an average improvement of 0.6 points over UNet++ without deep supervision.

Is U-Net better than ResNet?

By contrast, the Inception ResNet UNet's training results are lower than the UNet's (98.17% accuracy with 0.97 Dice), while it performs better than the UNet on the unseen dataset, where the final accuracy and Dice are 97.95% and 0.96.

Why is U-Net good for image segmentation?

What makes U-Net so good at image segmentation is its skip connections and decoder network. Up to the contraction path, the network is similar to any CNN; the skip connections and decoder network are what separate U-Net from other CNNs. The decoder network is also called the expansive network.

How does U-Net work in stable diffusion?

The U-Net in stable diffusion takes encoded text (plain text processed into a format it can understand) and a noisy array of numbers as inputs. Over many iterations, it turns this noisy array into an array containing image information.

What is the difference between U-Net and U-Net++?

UNet++ starts with an encoder sub-network, or backbone, followed by a decoder sub-network. What distinguishes UNet++ from U-Net is the re-designed skip pathways that connect the two sub-networks and the use of deep supervision.

Is U-Net an encoder-decoder?

U-Net is an encoder-decoder segmentation network with skip connections. It has two defining qualities: an encoder-decoder network that extracts more general features the deeper it goes, and skip connections that pass high-resolution features from the encoder to the decoder.
