Computer Vision: U-Net

Convolutional Networks for Biomedical Image Segmentation

Rahul S


U-Net is one of the best-known image segmentation architectures. It was proposed in 2015 by Olaf Ronneberger, Philipp Fischer, and Thomas Brox (University of Freiburg, Germany). [1]

One should read the full paper to really appreciate the architecture. What follows is an outline, suitable for beginners.

U-Net is an end-to-end segmentation technique: it takes a raw image as input and outputs a segmentation map of that image.

The U-Net architecture is a U-shaped, roughly symmetric convolutional network with a down-sampling contraction path and an up-sampling expansion path. In the original paper the convolutions are unpadded, so the output segmentation map is smaller than the raw input image (388 x 388 for a 572 x 572 input). U-Net is fully convolutional: it has no fully connected layers.
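To see why the output is smaller than the input, the sizes can be traced through the whole network. The sketch below is only a size calculator, not a network. It assumes the layout of the original paper: two unpadded 3 x 3 convolutions per block, four pooling steps, a bottom block, four upsampling steps, and a final 1 x 1 convolution. The function name `unet_output_size` is made up for illustration.

```python
def unet_output_size(input_size: int, depth: int = 4) -> int:
    """Spatial size of the output map for a square input tile.

    Assumes unpadded 3x3 convolutions (each trims 2 pixels), 2x2 max
    pooling with stride 2 on the way down, and 2x upsampling on the way up.
    """
    size = input_size
    for _ in range(depth):              # contraction path
        size -= 4                       # two unpadded 3x3 convolutions
        if size % 2:
            raise ValueError("max pooling needs an even feature-map size")
        size //= 2                      # 2x2 max pooling
    size -= 4                           # bottom block: two more 3x3 convolutions
    for _ in range(depth):              # expansion path
        size *= 2                       # 2x2 up-convolution doubles the size
        size -= 4                       # two unpadded 3x3 convolutions
    return size                         # a final 1x1 convolution keeps the size


print(unet_output_size(572))  # 388
```

Not every input size works under these assumptions: each pooling step needs an even feature-map size, which is why the function raises an error for a tile such as 570 x 570.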

The input image is fed into the network and propagated through both paths, producing a segmentation map as output.

Source: Ronneberger O., Fischer P., Brox T. (2015) U-Net: Convolutional Networks for Biomedical Image Segmentation. In: Navab N., Hornegger J., Wells W., Frangi A. (eds) Medical Image Computing and Computer-Assisted Intervention — MI

Contraction/Down-sampling path (Encoder Path)

The encoder path captures the context of the image. It is just a stack of convolution and max pooling layers.

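The encoding path has 4 blocks. Each block consists of:
1) Two 3 x 3 convolution layers, each followed by a ReLU activation. (The original paper does not use batch normalization; many later implementations add it after each convolution.)
2) One 2 x 2 max pooling layer with stride 2.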
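The block structure above can be sketched in plain Python on a single-channel toy feature map. This is a minimal illustration, not the paper's implementation: real blocks work on many channels with learned kernels, and the helper names (`conv2d_valid`, `relu`, `max_pool_2x2`, `encoder_block`) and the two fixed kernels are assumptions chosen for the example.

```python
def conv2d_valid(fmap, kernel):
    """Unpadded ("valid") 2D cross-correlation on nested lists."""
    k = len(kernel)
    h, w = len(fmap), len(fmap[0])
    return [
        [sum(fmap[i + di][j + dj] * kernel[di][dj]
             for di in range(k) for dj in range(k))
         for j in range(w - k + 1)]
        for i in range(h - k + 1)
    ]


def relu(fmap):
    """Element-wise ReLU: negative values become zero."""
    return [[max(0.0, v) for v in row] for row in fmap]


def max_pool_2x2(fmap):
    """2x2 max pooling with stride 2 (expects even height and width)."""
    h, w = len(fmap), len(fmap[0])
    return [
        [max(fmap[i][j], fmap[i][j + 1], fmap[i + 1][j], fmap[i + 1][j + 1])
         for j in range(0, w, 2)]
        for i in range(0, h, 2)
    ]


def encoder_block(fmap, kernel_a, kernel_b):
    """conv -> ReLU -> conv -> ReLU, then pool. Returns (features, pooled)."""
    features = relu(conv2d_valid(fmap, kernel_a))
    features = relu(conv2d_valid(features, kernel_b))
    return features, max_pool_2x2(features)


# Toy 8x8 input and two hand-picked 3x3 kernels (illustrative values only).
image = [[float((i * j) % 5) for j in range(8)] for i in range(8)]
edge = [[-1.0, 0.0, 1.0] for _ in range(3)]
blur = [[1.0 / 9.0] * 3 for _ in range(3)]

features, pooled = encoder_block(image, edge, blur)
print(len(features), len(features[0]))  # 4 4  (8 -> 6 -> 4)
print(len(pooled), len(pooled[0]))      # 2 2
```

The block returns the pre-pooling features as well as the pooled map because the decoder later reuses the pre-pooling features.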

In the original paper, the size of the input image is 572 x 572 x 1 (a single-channel tile). Because the convolutions are unpadded, each 3 x 3 convolution trims one pixel from every border: 64 kernels of size 3 x 3 produce a feature map of size 570 x 570 x 64, and a second such convolution gives 568 x 568 x 64. A 2 x 2 max pooling layer then downsamples the feature map to 284 x 284 x 64.
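These sizes follow from the standard output-size formula for a sliding window, output = (input + 2 * padding - kernel) / stride + 1, rounded down. A small sketch that checks the numbers for the first block; the function names `conv_out` and `pool_out` are made up for this example.

```python
def conv_out(size, kernel=3, padding=0, stride=1):
    """Output size of a convolution along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, window=2, stride=2):
    """Output size of a pooling layer along one spatial dimension."""
    return (size - window) // stride + 1


size = 572
size = conv_out(size)   # first unpadded 3x3 convolution
print(size)             # 570
size = conv_out(size)   # second unpadded 3x3 convolution
print(size)             # 568
size = pool_out(size)   # 2x2 max pooling, stride 2
print(size)             # 284

# With a padding of 1, a 3x3 convolution would keep the size unchanged.
print(conv_out(572, padding=1))  # 572
```

The last line shows the alternative that padded re-implementations rely on: with a padding of 1 the feature map keeps its size, so the output map can match the input.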

Note that the number of feature channels doubles after each pooling step: 64 for the first block, 128 for the second, and so on.
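Putting the size reduction and the channel doubling together gives the shape of the feature map at the end of each encoder block, just before pooling. The sketch below assumes a 572 x 572 input, 64 starting channels, and unpadded convolutions as in the original paper; `encoder_shapes` is a hypothetical helper name.

```python
def encoder_shapes(size=572, base_channels=64, blocks=4):
    """Return (height, width, channels) before pooling, for each encoder block."""
    shapes = []
    channels = base_channels
    for _ in range(blocks):
        size -= 4                      # two unpadded 3x3 convolutions
        shapes.append((size, size, channels))
        size //= 2                     # 2x2 max pooling halves height and width
        channels *= 2                  # the next block doubles the channels
    return shapes


for block, shape in enumerate(encoder_shapes(), start=1):
    print(block, shape)
# 1 (568, 568, 64)
# 2 (280, 280, 128)
# 3 (136, 136, 256)
# 4 (64, 64, 512)
```

The spatial size shrinks while the channel count grows, so deeper blocks describe less "where" and more "what".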

Expansion/Up-sampling path (Decoder Path)

The decoder enables precise localization by using transposed convolutions (an upsampling technique).
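A transposed convolution can be pictured as each input pixel "stamping" a scaled copy of the kernel onto the output. With a 2 x 2 kernel and stride 2 the stamps do not overlap, and the height and width exactly double. The sketch below is a single-channel toy with a hand-picked kernel; in a real network the kernel is learned, and `transposed_conv2d` is a name made up for this example.

```python
def transposed_conv2d(fmap, kernel, stride=2):
    """Single-channel transposed convolution on nested lists."""
    k = len(kernel)
    h, w = len(fmap), len(fmap[0])
    out_h = (h - 1) * stride + k       # equals 2 * h for k = 2, stride = 2
    out_w = (w - 1) * stride + k
    out = [[0.0] * out_w for _ in range(out_h)]
    for i in range(h):
        for j in range(w):
            # Stamp a copy of the kernel, scaled by the input pixel value.
            for di in range(k):
                for dj in range(k):
                    out[i * stride + di][j * stride + dj] += fmap[i][j] * kernel[di][dj]
    return out


fmap = [[1.0, 2.0],
        [3.0, 4.0]]
kernel = [[1.0, 0.5],
          [0.5, 0.25]]   # illustrative values only

for row in transposed_conv2d(fmap, kernel):
    print(row)
# [1.0, 0.5, 2.0, 1.0]
# [0.5, 0.25, 1.0, 0.5]
# [3.0, 1.5, 4.0, 2.0]
# [1.5, 0.75, 2.0, 1.0]
```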

The expansion path has 4 blocks. Each block consists of:
1) A 2 x 2 transposed convolution (deconvolution) layer with stride 2, which halves the number of feature channels.
2) Concatenation with the corresponding cropped feature map from the contracting path. i.e. At every…