Computer Vision: U-Net
U-Net is one of the most famous image segmentation architectures. It was proposed in 2015 by Olaf Ronneberger, Philipp Fischer, and Thomas Brox (University of Freiburg, Germany). [1]
The full paper is worth reading to really appreciate the architecture. What follows is an outline, suitable for beginners.
U-Net is an end-to-end segmentation technique: it takes a raw image as input and outputs a segmentation map of that image.
The U-Net architecture is a U-shaped, roughly symmetric convolutional network with a down-sampling contraction path and an up-sampling expansion path. Because the original paper uses unpadded convolutions, the output segmentation map is smaller than the raw input image (388 x 388 for a 572 x 572 input). U-Net is fully convolutional: it has no fully connected (dense) layers.
The input image is fed into the network and propagated through both paths, producing a segmentation map as output.
Contraction/down sampling path (Encoder Path):
The encoder path captures the context of the image. It is just a stack of convolution and max pooling layers.
The encoder path has 4 blocks. Each block consists of:
1) Two 3 x 3 convolution layers, each followed by a ReLU activation function. (The original paper does not use batch normalization, although many later implementations add it.)
2) One 2 x 2 max pooling layer with stride 2.
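To make the block structure concrete, here is a minimal pure-Python sketch of one encoder block acting on a single-channel image. It is only an illustration: the function names and the averaging kernel are hypothetical, and a real U-Net layer has many input and output channels, learned weights, and bias terms, and would be built with a deep-learning library.

```python
def conv3x3_relu(image, kernel):
    """Unpadded 3 x 3 convolution (cross-correlation) followed by ReLU."""
    h, w = len(image), len(image[0])
    out = []
    for i in range(h - 2):
        row = []
        for j in range(w - 2):
            s = sum(image[i + di][j + dj] * kernel[di][dj]
                    for di in range(3) for dj in range(3))
            row.append(max(0.0, s))  # ReLU clamps negative responses to zero
        out.append(row)
    return out


def max_pool2x2(image):
    """2 x 2 max pooling with stride 2: keeps the largest value per window."""
    h, w = len(image), len(image[0])
    return [[max(image[i][j], image[i][j + 1],
                 image[i + 1][j], image[i + 1][j + 1])
             for j in range(0, w - 1, 2)]
            for i in range(0, h - 1, 2)]


def encoder_block(image, kernel1, kernel2):
    """Two conv + ReLU layers, then pooling. Returns (features, pooled)."""
    features = conv3x3_relu(conv3x3_relu(image, kernel1), kernel2)
    return features, max_pool2x2(features)


# Toy 8 x 8 input and a hypothetical averaging kernel (assumptions for the demo).
image = [[float(i + j) for j in range(8)] for i in range(8)]
mean_kernel = [[1.0 / 9.0] * 3 for _ in range(3)]

features, pooled = encoder_block(image, mean_kernel, mean_kernel)
print(len(features), "x", len(features[0]))  # spatial size before pooling
print(len(pooled), "x", len(pooled[0]))      # spatial size after pooling
```

The function returns the pre-pooling features as well as the pooled map because the decoder later reuses the pre-pooling feature maps.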
In the original paper, the input image is 572 x 572 x 1 (a single grayscale channel). A convolution with 64 kernels of size 3 x 3 and no padding produces a feature map of size 570 x 570 x 64. A second such convolution yields a feature map of size 568 x 568 x 64. A 2 x 2 max pooling layer then downsamples the feature map to 284 x 284 x 64.
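The size arithmetic above can be checked with a few lines of Python. This is a small sketch, assuming unpadded convolutions with stride 1 and non-overlapping pooling; the helper names are made up for this example.

```python
def valid_conv_size(size, kernel=3):
    """Spatial size after an unpadded convolution with stride 1."""
    return size - kernel + 1


def max_pool_size(size, window=2):
    """Spatial size after non-overlapping max pooling (stride == window)."""
    return size // window


size = 572
after_conv1 = valid_conv_size(size)          # first 3 x 3 convolution
after_conv2 = valid_conv_size(after_conv1)   # second 3 x 3 convolution
after_pool = max_pool_size(after_conv2)      # 2 x 2 max pooling

print(after_conv1, after_conv2, after_pool)  # prints: 570 568 284
```

Each unpadded 3 x 3 convolution trims one pixel from every border, which is why the width and height drop by 2 per convolution.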
Note that the number of feature channels doubles after each pooling step: 64 in the first block, 128 in the second, 256 in the third, and 512 in the fourth.
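Putting the size and channel rules together, the following sketch traces the feature-map shape through all four encoder blocks. It assumes the paper's 572 x 572 input, unpadded 3 x 3 convolutions, and 2 x 2 pooling; `trace_encoder` is a hypothetical helper written for this illustration.

```python
def trace_encoder(size=572, channels=64, blocks=4):
    """Return (block, conv_output_size, channels, pooled_size) per block."""
    rows = []
    for block in range(1, blocks + 1):
        conv_size = size - 4     # two unpadded 3 x 3 convs, each trims 2 pixels
        pooled = conv_size // 2  # 2 x 2 max pooling with stride 2
        rows.append((block, conv_size, channels, pooled))
        size, channels = pooled, channels * 2  # channels double for next block
    return rows


for block, conv_size, channels, pooled in trace_encoder():
    print(f"block {block}: {conv_size} x {conv_size} x {channels}"
          f" -> pooled to {pooled} x {pooled}")
```

The spatial resolution shrinks while the channel count grows, so the network trades fine spatial detail for richer context as it goes deeper.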
Expansion/Up sampling path (Decoder Path)
The decoder enables precise localization using transposed convolutions (an upsampling technique).
The expansion path has 4 blocks. Each block consists of:
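1) A 2 x 2 transposed convolution ("deconvolution") layer with stride 2, which doubles the spatial size and halves the number of feature channels.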
2) Concatenation with the corresponding cropped feature map from the contracting path. i.e. At every…