In semantic segmentation, we associate each pixel of an image with the class of whatever that pixel represents. It is therefore an inference task at the pixel level, also called a dense prediction task. In this article, we will introduce ourselves to this all-important topic from a non-mathematical, intuitive perspective.
Before that, let us distinguish multi-class classification from multi-label classification, which are different things. In multi-class classification, each image is assigned exactly one class out of several possible classes. In multi-label classification, the model can tell us about multiple things/objects that are in the image, assigning several labels to a single image.
For example, if there’s a picture of a monkey eating a banana, instead of classifying the image as either a monkey or a banana, a multi-label classifier can identify both: one label for the monkey and one for the banana. Note, however, that a multi-label classifier only tells us which classes are present, not how many instances there are; given a picture of three snakes, it still outputs a single snake label rather than one per snake. Counting or locating each snake separately is the job of object detection, which we come to next.
In short: multi-class means the network picks one class among many for an image, while multi-label means the network can report several classes that are simultaneously present in the same image.
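The distinction usually comes down to the output layer. A minimal sketch (with made-up scores for three hypothetical classes, not from any real network): multi-class uses a softmax, which forces a single winner, while multi-label uses independent sigmoids and keeps every class above a threshold.

```python
import numpy as np

def softmax(logits):
    # Probabilities that sum to 1 across classes: exactly one winner.
    e = np.exp(logits - logits.max())
    return e / e.sum()

def sigmoid(logits):
    # Independent per-class probabilities: any number can be "on".
    return 1.0 / (1.0 + np.exp(-logits))

# Hypothetical raw network scores for classes [monkey, banana, snake].
logits = np.array([2.0, 1.5, -1.0])

# Multi-class: argmax of the softmax picks the single best class.
multi_class_pred = int(np.argmax(softmax(logits)))

# Multi-label: every class whose sigmoid score clears 0.5 is predicted.
multi_label_pred = [i for i, p in enumerate(sigmoid(logits)) if p > 0.5]

print(multi_class_pred)   # 0  -> only "monkey"
print(multi_label_pred)   # [0, 1] -> both "monkey" and "banana"
```

With these scores the multi-class head reports only the monkey, while the multi-label head reports both the monkey and the banana, mirroring the example above.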
Now, moving beyond classification, we may be interested in knowing not just what is in the image but also where in the image it is. Identifying the location of an object within the image is called object localization. Combining localization with multi-class classification gives us object detection.
Object detection models classify all objects in the image, each with a confidence score and an associated bounding box; the bounding box encodes the object’s location and size. Popular algorithms for object detection are R-CNN (Region-based CNN), Faster R-CNN, YOLO (You Only Look Once) and SSD (Single Shot Detector).
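A bounding box is typically stored as corner coordinates, and the standard way to compare a predicted box against a ground-truth box is intersection-over-union (IoU). A minimal sketch, assuming `(x1, y1, x2, y2)` corner format (conventions vary between libraries):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle, if any.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

print(iou((0, 0, 2, 2), (0, 0, 2, 2)))  # 1.0 -> identical boxes
print(iou((0, 0, 1, 1), (2, 2, 3, 3)))  # 0.0 -> no overlap
```

Detectors like Faster R-CNN, YOLO and SSD all lean on IoU internally, e.g. to match predictions to ground truth during training and to suppress duplicate boxes at inference time.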
Moving beyond detection, and going a little deeper into the image, we come to segmentation, which is essentially pixel-level classification.
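"Pixel-level classification" can be made concrete: a segmentation network outputs a score for every class at every pixel, and taking the argmax over the class axis turns that volume into a label mask. A tiny sketch with made-up scores for a 2x2 image and three hypothetical classes:

```python
import numpy as np

# Hypothetical per-pixel class scores from a segmentation network,
# shape (num_classes, height, width): 3 classes on a 2x2 image.
scores = np.array([
    [[0.90, 0.10], [0.20, 0.30]],   # class 0 (e.g. background)
    [[0.05, 0.80], [0.10, 0.30]],   # class 1
    [[0.05, 0.10], [0.70, 0.40]],   # class 2
])

# Pixel-level classification: argmax over the class axis assigns
# exactly one class index to every pixel.
mask = scores.argmax(axis=0)
print(mask)
# [[0 1]
#  [2 2]]
```

The result is the dense prediction mentioned at the start: one class decision per pixel, with the same spatial shape as the input image.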
Image segmentation divides the image into small fragments based on parameters that are characteristic of the chosen region of the image, e.g., threshold, clusters, motion, contour, or the edges of the picture. This segmented image is trained to pick up cues…