Computer Vision with Neural Networks — an Overview
Computer vision algorithms analyze visual features in images and videos and apply learned interpretations of those features to predictive or decision-making tasks.
Image processing is not the same as computer vision. It is about modifying or enhancing images, for example optimizing brightness or contrast, increasing resolution, blurring sensitive information, or cropping. In image processing, we do not care about identifying an image's content.
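To make the distinction concrete, here is a minimal sketch of a pure image-processing operation: a linear brightness and contrast adjustment on a grayscale image stored as nested lists. The function name and parameters are illustrative assumptions, not taken from any particular library. Note that nothing in it tries to recognise what the image shows.

```python
def adjust_brightness_contrast(image, brightness=0, contrast=1.0):
    """Apply out = contrast * pixel + brightness, clipped to the 0..255 range.

    `image` is a grayscale image given as a list of rows of integers.
    """
    return [
        [max(0, min(255, round(contrast * pixel + brightness))) for pixel in row]
        for row in image
    ]


image = [[10, 100], [200, 250]]
brighter = adjust_brightness_contrast(image, brightness=20)
print(brighter)  # [[30, 120], [220, 255]]
```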
Convolutional neural networks (CNNs) focus on the most relevant features in an image by using filters and pooling layers to create feature maps of the image, and then feature maps of those feature maps.
When an image is processed by a CNN, it is evaluated and condensed into stacks of feature maps tied to various sections of the image. These tensors are created by passing the image through a series of convolutional and pooling layers, which extract the most relevant features from each image segment and condense them into a smaller, representative matrix. That matrix is then sent to an output layer to generate predictions.
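The two building blocks described above can be sketched in plain Python. This is a simplified illustration, not how a production framework implements them: it handles a single channel, uses a hand-written vertical-edge filter instead of a learned one, and the function names are assumptions made for this example.

```python
def conv2d(image, kernel):
    """Slide `kernel` over `image` without padding (stride 1).

    This is cross-correlation, which is what deep learning libraries
    usually call convolution.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling; incomplete windows at the edges are dropped."""
    return [
        [
            max(feature_map[i + a][j + b]
                for a in range(size) for b in range(size))
            for j in range(0, len(feature_map[0]) - size + 1, size)
        ]
        for i in range(0, len(feature_map) - size + 1, size)
    ]


# 6x6 image: dark on the left, bright on the right (a vertical edge).
image = [[0, 0, 0, 1, 1, 1] for _ in range(6)]
vertical_edge = [[-1, 0, 1] for _ in range(3)]

feature_map = conv2d(image, vertical_edge)  # 4x4, strongest where the edge is
pooled = max_pool2d(feature_map)            # 2x2 summary of the feature map
print(feature_map[0])  # [0, 3, 3, 0]
print(pooled)          # [[3, 3], [3, 3]]
```

The convolution responds only where the brightness changes, and pooling keeps the strongest response in each 2×2 window, which is the "condensing" step.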
The performance and efficiency of a CNN are determined largely by its architecture. This includes the structure of layers, how elements are designed, and which elements are present in each layer.
Many CNNs have been created, but the following are some of the most effective designs, which are implemented in various CV tasks via transfer learning.
AlexNet (2012): AlexNet includes five convolutional and three fully connected layers. It uses a dual pipeline structure to accommodate the use of two GPUs during training. It uses the ReLU activation function instead of the sigmoid or tanh functions used in the earlier LeNet.
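One commonly cited reason for preferring ReLU is that its gradient does not shrink for large positive inputs, whereas sigmoid and tanh saturate. The following sketch compares the three derivatives at a single input value; the helper names are illustrative.

```python
import math


def relu_grad(x):
    """Derivative of max(0, x)."""
    return 1.0 if x > 0 else 0.0


def sigmoid_grad(x):
    """Derivative of the logistic sigmoid: s(x) * (1 - s(x))."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)


def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2


x = 5.0
grads = {
    "relu": relu_grad(x),
    "sigmoid": sigmoid_grad(x),
    "tanh": tanh_grad(x),
}
for name, value in grads.items():
    print(f"{name}: {value:.6f}")
```

At x = 5 the ReLU gradient is still 1, while the sigmoid and tanh gradients are both well below 0.01, so much less error signal would flow back through those activations.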
GoogLeNet (2014): GoogLeNet (Inception V1) has 22 layers made of small groups of convolutions, called "inception modules". These modules use 1×1 convolutions to reduce the number of parameters; later Inception versions added batch normalization and were trained with RMSprop. RMSprop is an optimization algorithm that adapts the learning rate for each parameter.
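As a rough illustration of what "adaptive learning rate" means in RMSprop, the sketch below applies the standard update rule to a single scalar parameter. The hyperparameter values and the toy objective f(w) = w² are assumptions chosen for the example, not settings from any Inception paper.

```python
import math


def rmsprop_step(param, grad, sq_avg, lr=0.01, decay=0.9, eps=1e-8):
    """One RMSprop update for a scalar parameter.

    `sq_avg` is a running average of squared gradients; dividing by its
    square root scales the step so that it depends on the recent gradient
    history rather than on the raw gradient magnitude.
    """
    sq_avg = decay * sq_avg + (1.0 - decay) * grad * grad
    param = param - lr * grad / (math.sqrt(sq_avg) + eps)
    return param, sq_avg


# Toy problem: minimise f(w) = w^2, whose gradient is 2w.
w, sq_avg = 5.0, 0.0
for _ in range(1000):
    w, sq_avg = rmsprop_step(w, 2.0 * w, sq_avg)
print(abs(w))  # much closer to the minimum at 0 than the starting value 5.0
```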
VGGNet (2014): VGG-16 has 16 weight layers (13 convolutional and 3 fully connected). It alternates groups of convolutional layers with pooling layers: a few convolutional layers, a pooling layer, a few more convolutional layers, another pooling layer, and so on.
It is based on the notion of a much deeper network with smaller filters: it uses 3×3 convolutions throughout, each of which looks only at a pixel's immediate neighbours. Small filters need fewer parameters, which makes it possible to add more layers. A stack of three 3×3 convolutional layers has the same effective receptive field as a single 7×7 convolutional layer.
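The trade-off between small and large filters can be checked with simple arithmetic. The sketch below compares a stack of three 3×3 layers with one 7×7 layer, assuming stride 1, the same number of input and output channels in every layer, and no biases; the channel count of 64 is a hypothetical value.

```python
def conv_weights(kernel, channels, layers=1):
    """Weights in `layers` stacked kernel x kernel convolutions (biases ignored)."""
    return layers * kernel * kernel * channels * channels


def receptive_field(kernel, layers):
    """Receptive field of `layers` stacked stride-1 convolutions."""
    return 1 + layers * (kernel - 1)


channels = 64  # hypothetical channel count

stack_rf = receptive_field(3, layers=3)
single_rf = receptive_field(7, layers=1)
stack_weights = conv_weights(3, channels, layers=3)
single_weights = conv_weights(7, channels)

print(stack_rf, single_rf)            # 7 7
print(stack_weights, single_weights)  # 110592 200704
```

Both options see a 7×7 neighbourhood, but the stack needs 27·C² weights instead of 49·C², and it also applies a non-linearity after each of its three layers.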
ResNet (2015): ResNet, short for Residual Neural Network, is an architecture designed to have a large number of layers. Commonly used variants range from ResNet-18 (with 18 layers) to ResNet-152 (with 152 layers), and a 1202-layer version has been explored experimentally. The layers are organized into blocks with "skip connections", which add a block's input to its output so that information can pass directly to later convolutional layers. ResNet also employs batch normalization to improve the stability of the network.
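The idea of a skip connection can be sketched in a few lines: the block's output is its learned transformation plus its unchanged input. This is a deliberately simplified illustration on flat lists of numbers; the `transform` argument stands in for the convolution and batch normalization layers, and the activation that real residual blocks apply after the addition is omitted.

```python
def residual_block(x, transform):
    """Return transform(x) + x element-wise (the skip connection adds x back)."""
    fx = transform(x)
    return [f + xi for f, xi in zip(fx, x)]


def zero_transform(x):
    """A layer that has learned nothing: it outputs zeros."""
    return [0.0 for _ in x]


def half_transform(x):
    """A stand-in for a learned layer: it scales each value by 0.5."""
    return [0.5 * v for v in x]


x = [1.0, -2.0, 3.0]
identity_out = residual_block(x, zero_transform)
scaled_out = residual_block(x, half_transform)
print(identity_out)  # [1.0, -2.0, 3.0]
print(scaled_out)    # [1.5, -3.0, 4.5]
```

Even when the transformation contributes nothing, the input passes through unchanged, which is one intuition for why very deep stacks of such blocks remain trainable.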
Xception (2016): Xception replaces the inception modules (in InceptionNet) with depthwise separable convolutions, each consisting of a depthwise (per-channel spatial) convolution followed by a pointwise (1×1) convolution. It works by handling spatial correlations and cross-feature-map correlations in separate steps rather than jointly. This enables more efficient use of model parameters.
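The parameter savings of a depthwise separable convolution can be estimated by counting weights. The sketch below compares it with a standard convolution, ignoring biases; the kernel size and channel counts are hypothetical values chosen for illustration.

```python
def standard_conv_weights(kernel, c_in, c_out):
    """Every output channel has a kernel x kernel filter over every input channel."""
    return kernel * kernel * c_in * c_out


def separable_conv_weights(kernel, c_in, c_out):
    """Depthwise step (one kernel x kernel filter per input channel)
    plus pointwise step (a 1x1 convolution mixing channels)."""
    depthwise = kernel * kernel * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


standard = standard_conv_weights(3, 128, 256)
separable = separable_conv_weights(3, 128, 256)
print(standard)   # 294912
print(separable)  # 33920
```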
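ResNeXt-50 (2017): ResNeXt-50 is based on modules with 32 parallel paths. The number of paths is called the cardinality, and increasing it is used to decrease validation error. Because every path has the same structure, the design represents a simplification of the inception modules.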
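Parallel paths of identical shape can be expressed as a grouped convolution, in which the channels are split into groups that are convolved independently. The sketch below counts weights for a 3×3 layer with and without grouping; the channel count of 128 is an assumption for illustration, and biases are ignored.

```python
def grouped_conv_weights(kernel, c_in, c_out, groups):
    """Weights of a grouped convolution: each group maps
    c_in / groups input channels to c_out / groups output channels."""
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by groups")
    per_group = kernel * kernel * (c_in // groups) * (c_out // groups)
    return groups * per_group


dense = grouped_conv_weights(3, 128, 128, groups=1)
cardinality_32 = grouped_conv_weights(3, 128, 128, groups=32)
print(dense)           # 147456
print(cardinality_32)  # 4608
```

For a fixed channel count, splitting into 32 groups divides the weights of this layer by 32, which leaves room to make the layer wider at a similar parameter budget.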
Image localization determines where objects are located in an image. Once identified, objects are marked with a bounding box.
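A common way to judge how well a predicted bounding box matches a reference box is intersection over union (IoU). The sketch below assumes boxes are given as (x1, y1, x2, y2) corner coordinates; that convention and the function name are assumptions for this example.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap; zero if the boxes do not intersect.
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


predicted = (0, 0, 10, 10)
reference = (5, 5, 15, 15)
print(iou(predicted, reference))  # 25 / 175, roughly 0.143
print(iou(predicted, predicted))  # 1.0
```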
Object detection builds on this by also classifying the objects that are identified. It is based on CNN architectures such as AlexNet, Fast R-CNN, and Faster R-CNN.
Two-step object detection: first, candidate regions that may contain objects are proposed, either by a hierarchical grouping algorithm as in the original R-CNN or by a Region Proposal Network (RPN) as in Faster R-CNN. Then each region proposal is classified, with Fast R-CNN and Faster R-CNN using region of interest (ROI) pooling to extract features for each proposal. These approaches are quite accurate, but can be very slow.
One-step object detection: algorithms like YOLO, SSD, and RetinaNet combine the detection and classification steps into a single pass by regressing bounding box predictions directly.
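To illustrate what "regressing bounding boxes" in a single pass can look like, the sketch below decodes a grid of predictions into boxes. The output layout is a simplified, hypothetical one loosely inspired by grid-based detectors: each cell holds an objectness score, a box centre offset within the cell, a box size relative to the image, and per-class scores. It is not the exact format of YOLO, SSD, or RetinaNet.

```python
def decode_grid(predictions, threshold=0.5):
    """Turn per-cell predictions into (label, score, box) detections.

    predictions[row][col] = (objectness, dx, dy, width, height, class_scores)
    where dx, dy are offsets inside the cell (0..1) and width, height are
    fractions of the whole image. Boxes are returned as (x1, y1, x2, y2)
    in image-relative coordinates.
    """
    grid_size = len(predictions)
    detections = []
    for row in range(grid_size):
        for col in range(grid_size):
            objectness, dx, dy, width, height, class_scores = predictions[row][col]
            if objectness < threshold:
                continue  # this cell does not claim an object
            cx = (col + dx) / grid_size
            cy = (row + dy) / grid_size
            label = max(range(len(class_scores)), key=class_scores.__getitem__)
            box = (cx - width / 2, cy - height / 2, cx + width / 2, cy + height / 2)
            detections.append((label, objectness, box))
    return detections


empty = (0.0, 0.0, 0.0, 0.0, 0.0, [0.0, 0.0])
predictions = [
    [empty, (0.9, 0.5, 0.5, 0.5, 0.5, [0.1, 0.8])],
    [empty, empty],
]
detections = decode_grid(predictions)
print(detections)  # [(1, 0.9, (0.5, 0.0, 1.0, 0.5))]
```

Location and class come out of the same output tensor, so no separate region proposal stage is needed.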
Semantic segmentation works at the pixel level: every pixel is assigned a class label. The goal is to define objects by their boundaries rather than by bounding boxes. It typically uses fully convolutional networks (FCNs) or U-Nets.
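The final step of a segmentation network can be sketched as a per-pixel argmax over class scores. In the example below, `scores` stands in for the network output with one score map per class; the layout (class, row, column) and the toy values are assumptions for illustration.

```python
def segmentation_mask(scores):
    """Return mask[row][col] = index of the highest-scoring class at that pixel.

    scores[class][row][col] holds one score map per class.
    """
    n_classes = len(scores)
    height = len(scores[0])
    width = len(scores[0][0])
    return [
        [
            max(range(n_classes), key=lambda c: scores[c][row][col])
            for col in range(width)
        ]
        for row in range(height)
    ]


scores = [
    [[0.9, 0.8, 0.2], [0.7, 0.1, 0.3]],  # class 0: background
    [[0.1, 0.2, 0.8], [0.3, 0.9, 0.7]],  # class 1: object
]
mask = segmentation_mask(scores)
print(mask)  # [[0, 0, 1], [0, 1, 1]]
```

The resulting mask follows the object's outline pixel by pixel instead of enclosing it in a rectangle.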
Pose estimation: Pose estimation determines where the joints of a person or an object are in a picture and what the placement of those joints indicates. It can be used with both 2D and 3D images. A widely used architecture for pose estimation is PoseNet, which is based on CNNs.
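Many CNN-based pose estimators output one heatmap per joint, where each value indicates how likely the joint is at that position. The sketch below reads a joint location out of each heatmap by taking its maximum. The joint names and heatmap values are made up for illustration, and real models typically refine these coarse positions further.

```python
def keypoints_from_heatmaps(heatmaps):
    """Map each joint name to (row, col, score) of its heatmap maximum.

    `heatmaps` maps a joint name to a 2D list of scores.
    """
    keypoints = {}
    for name, heatmap in heatmaps.items():
        score, row, col = max(
            (value, r, c)
            for r, line in enumerate(heatmap)
            for c, value in enumerate(line)
        )
        keypoints[name] = (row, col, score)
    return keypoints


heatmaps = {
    "left_elbow": [[0.1, 0.2], [0.7, 0.1]],
    "right_knee": [[0.0, 0.9], [0.3, 0.2]],
}
keypoints = keypoints_from_heatmaps(heatmaps)
print(keypoints)  # {'left_elbow': (1, 0, 0.7), 'right_knee': (0, 1, 0.9)}
```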