Computer Vision: Convolutional Neural Networks (CNNs)

An Intuitive Introduction

Rahul S
8 min read · Nov 21, 2022


Neural networks take in numbers and output numbers. And to a computer, that is exactly what an image is: a grid of individual pixels, each described by one value (grayscale) or three values (RGB).

There is only one color channel in a grayscale image. So, a grayscale image is represented as (height, width, 1) or simply (height, width). We can ignore the third dimension because its size is one. Therefore, a grayscale image is often represented as a 2D array (tensor).

There are three color channels (Red, Green and Blue) in an RGB image. So, an RGB image is represented as (height, width, 3). The third dimension denotes the number of color channels in the image. An RGB image is represented as a 3D array (tensor).
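As a small illustration of these shapes, here is a sketch in NumPy. The 4 x 6 image size and the all-zero pixel values are made up for the example; only the shapes matter.

```python
import numpy as np

# Hypothetical 4x6 images filled with zeros, just to inspect the shapes.
height, width = 4, 6
gray = np.zeros((height, width), dtype=np.uint8)    # one channel, stored as 2D
rgb = np.zeros((height, width, 3), dtype=np.uint8)  # three channels, stored as 3D

print(gray.shape, gray.ndim)  # (4, 6) 2
print(rgb.shape, rgb.ndim)    # (4, 6, 3) 3

# Adding an explicit channel axis gives the (height, width, 1) form.
gray_with_channel = gray[..., np.newaxis]
print(gray_with_channel.shape)  # (4, 6, 1)
```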

How a file format physically lays out the bytes varies, but conceptually these three channels can be treated as three separate single-channel images: one represents the intensity of red, one green, and one blue.

In a typical 8-bit image, each pixel value is an integer from 0 to 255 that says how intense the color should be. The higher the value, the greater the brightness, and dark points are close to zero. So black = (0,0,0).

Images are represented as arrays of pixel values.

Each color channel is a 2D array (because images are planar) of integers with one number for each pixel in the image.

For an RGB image, there are three separate arrays like this: one for each color. And if we layer the three color channels on top of each other, we can think of an image as a 3D array that is 3 layers deep (one layer per channel).
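The layering idea can be sketched with NumPy. The three tiny 2 x 2 channel arrays below are invented for illustration, and stacking along the last axis is one common convention (height, width, channels), not the only one.

```python
import numpy as np

# Three hypothetical 2x2 channels with values in the 0..255 range.
red = np.array([[255, 0], [0, 0]], dtype=np.uint8)
green = np.array([[0, 255], [0, 0]], dtype=np.uint8)
blue = np.array([[0, 0], [255, 0]], dtype=np.uint8)

# Layer the channels on top of each other along a new last axis.
image = np.stack([red, green, blue], axis=-1)

print(image.shape)  # (2, 2, 3)
print(image[0, 0])  # [255   0   0], a pure red pixel
print(image[1, 1])  # [0 0 0], a black pixel
```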

Images in Neural Networks

A neural network is a system of interconnected artificial “neurons” that exchange messages with each other.

The connections have numeric weights that are tuned during the training process, so that a properly trained network will respond correctly when presented with an image or pattern to recognize.

Each neuron calculates the dot product of its inputs and weights, adds a bias, and applies a non-linear activation function (for example, a sigmoid).
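A single neuron of this kind can be sketched in a few lines. The input, weight, and bias values are arbitrary numbers chosen for illustration, and the sigmoid is just the example activation mentioned above.

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def neuron(inputs, weights, bias):
    """Dot product of inputs and weights, plus bias, then the non-linearity."""
    return sigmoid(np.dot(inputs, weights) + bias)

# Arbitrary illustrative values.
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.4, 0.3, -0.2])
b = 0.1

out = neuron(x, w, b)
print(out)  # a single number between 0 and 1
```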

The network consists of multiple layers of feature-detecting “neurons”. Each layer has many neurons that respond to different combinations of inputs from the previous layers. The layers are built so that the first layer detects a set of primitive patterns in the input, the second layer detects patterns of patterns, the third layer detects patterns of those patterns, and so on.

Now, the neural network we wish to feed this image to should have one input node (neuron) for every number in the 3D array. Every number standing for a pixel intensity is a ‘feature’ in the language of traditional machine learning, and the number of features equals the number of dimensions of the data. So, for an image of size 256 x 256 x 3, the dimension of the data would be 196,608.
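The arithmetic is easy to check: 256 x 256 x 3 works out to 196,608 values. A quick sketch, using an all-zero placeholder image:

```python
import numpy as np

# Placeholder 256x256 RGB image; only its shape matters here.
image = np.zeros((256, 256, 3), dtype=np.uint8)

# Flattening lines up every pixel value as one long feature vector.
features = image.reshape(-1)

print(features.shape)  # (196608,)
print(256 * 256 * 3)   # 196608
```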

It is a huge number. And the number of weights grows rapidly with each fully connected layer, since every node connects to every node in the previous layer. That’s why using MLP neural networks for image processing is so computationally intensive: processing an image means sending it through a network with millions of connections.
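To get a rough feel for the scale, here is a back-of-the-envelope count of the parameters in one fully connected layer. The hidden layer size of 1,000 units is an assumption picked for illustration; the `dense_params` helper is hypothetical.

```python
def dense_params(n_inputs, n_units):
    """One weight per (input, unit) pair, plus one bias per unit."""
    return n_inputs * n_units + n_units

n_inputs = 256 * 256 * 3  # flattened RGB image
n_hidden = 1000           # assumed size of the first hidden layer

first_layer = dense_params(n_inputs, n_hidden)
print(first_layer)  # 196609000
```

So a single dense layer on top of a modest image already needs close to 200 million parameters under these assumptions.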

How Neural Networks Learn to Recognize Images

We train a neural network by feeding many images into it and telling it what the correct answer for each one is. The correct answer is, say, the class the image belongs to, i.e., a typical classification setting.

The weights of the neural network are adjusted every time images are run through it during training. As training proceeds, the different layers learn to look for different patterns in the input data. The early layers (those nearer to the input layer) look for simple patterns like lines and sharp edges. The later layers use information from the earlier layers to look for more complex shapes and patterns.

And thus, with all the layers working together, the model is able to identify even very complex objects.

After a certain number of passes over the training images (epochs), the neural network settles on ‘probabilistically correct’ weights for each node/neuron, which make it possible for the network to predict the class of a new, unseen image.

Once the training process is judged to be over, we can pass in a new image, and the neural network will give us its best guess for the correct answer in the form of a probability value.
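One common way to turn a network's raw output scores into probabilities is the softmax function. The article does not say which function is used, so treat softmax here as an assumption; the scores and class names are also made up.

```python
import numpy as np

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

# Hypothetical raw outputs of a trained network for one image.
class_names = ["cat", "dog", "car"]
scores = np.array([2.0, 0.5, -1.0])

probs = softmax(scores)
best = class_names[int(np.argmax(probs))]

print(probs)  # three probabilities, one per class
print(best)   # cat
```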


The design of a CNN is motivated by the discovery of a visual mechanism, the visual cortex, in the brain. The visual cortex contains a lot of cells that are responsible for detecting light in small, overlapping sub-regions of the visual field, which are called receptive fields.

These cells act as local filters over the input space, and the more complex cells have larger receptive fields. The convolution layer in a CNN performs the function that is performed by the cells in the visual cortex.

In a typical CNN layer, each feature of a layer receives inputs from a set of features located in a small neighborhood in the previous layer called a local receptive field. With local receptive fields, features can extract elementary visual features, such as oriented edges, endpoints, corners, etc., which are then combined by the higher layers.

In the traditional model of pattern/image recognition, a hand-designed feature extractor gathers relevant information from the input and eliminates irrelevant variabilities. The extractor is followed by a trainable classifier, a standard neural network that classifies feature vectors into classes. In a CNN, convolution layers play the role of feature extractor. But they are not hand designed. Convolution filter kernel weights are decided on as part of the training process. Convolutional layers are able to extract the local features because they restrict the receptive fields of the hidden layers to be local.

The convolution operation extracts different features of the input. The first convolution layer extracts low-level features like edges, lines, and corners. Higher-level layers extract higher-level features.

Convolving an input with one kernel produces one output feature map, and convolving it with H kernels independently produces H feature maps.


Starting from the top-left corner of the input, each kernel is moved from left to right, one element at a time. Once the top-right corner is reached, the kernel is moved one element downward and returns to the left edge, and again it is moved from left to right, one element at a time. This process is repeated until the kernel reaches the bottom-right corner.
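The sliding procedure can be sketched with two plain loops. This is a minimal, unoptimized version that assumes a single-channel input, a step of one element, and no padding; like most deep learning libraries, it does not flip the kernel. The `conv2d_valid` name and the example values are hypothetical.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel left to right, top to bottom, one element at a time."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for row in range(out.shape[0]):        # move downward
        for col in range(out.shape[1]):    # move left to right
            patch = image[row:row + kh, col:col + kw]
            out[row, col] = np.sum(patch * kernel)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.array([[1.0, 0.0],
                   [0.0, -1.0]])

feature_map = conv2d_valid(image, kernel)
print(feature_map.shape)  # (3, 3)
print(feature_map)        # every entry is -5.0 for this particular input
```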

Why CNN?

Ruggedness to shifts and distortion in the image

A trained CNN tends to be rugged to distortions such as changes in shape caused by the camera lens, different lighting conditions, different poses, partial occlusions, horizontal and vertical shifts, etc. Much of this comes from weight sharing: because the same weight configuration is used across space, a CNN responds to a pattern in the same way wherever it appears, which makes it largely shift invariant.

Fewer memory requirements

In the convolutional layer, the same coefficients are used across different locations in the space, so the memory requirement is drastically reduced.
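A rough comparison makes the saving concrete. The layer sizes below (32 units or filters, a 3 x 3 kernel, a 256 x 256 RGB input) are assumptions chosen for illustration, and both helper functions are hypothetical.

```python
def dense_params(n_inputs, n_units):
    """Fully connected layer: one weight per (input, unit) pair, plus biases."""
    return n_inputs * n_units + n_units

def conv_params(kernel_h, kernel_w, in_channels, n_filters):
    """Convolutional layer: each filter reuses the same weights at every position."""
    return kernel_h * kernel_w * in_channels * n_filters + n_filters

dense = dense_params(256 * 256 * 3, 32)  # 32 dense units on a flattened image
conv = conv_params(3, 3, 3, 32)          # 32 filters of size 3x3 over 3 channels

print(dense)  # 6291488
print(conv)   # 896
```

The two layers do not compute the same thing, so this is only an order-of-magnitude comparison, but it shows why sharing coefficients across locations saves so much memory.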

Easier and better training

In a CNN, since the number of parameters is drastically reduced, there is far less to learn, and training time tends to drop accordingly.

Translational (In)Variance

Now, an image is 2D data in the spatial sense, and it has objects distributed across its ‘region’.

What I mean is: an image is usually an image of an object. Say a cat. And a cat can be located anywhere in the image region. There can also be many such ‘objects’ in a given image, spatially dispersed around its planar region. Think of an image of a traffic signal, for example, in which there are many objects of many kinds.

But neural networks of dense type (as in Multi-Layer Perceptron) take in features with no regard to their spatial position in the planar region of the image. The two-dimensional nature of the image gets lost when it is ‘FLATTENED’ before we feed the features (digit values) into the network.

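So there is a loss of information. An MLP that has learned to recognize an object at one position has to learn it all over again for every other position, because each position feeds different input nodes with different weights. In other words, MLPs are translationally variant.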

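Convolution pushes a network toward translational invariance: because the same filter is applied at every position, a pattern is detected wherever it appears.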
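A small experiment illustrates the point. In this sketch, a bright 2 x 2 blob is moved inside a 6 x 6 image; the filter's strongest response keeps the same value and simply moves with the blob. The image, the all-ones kernel, and the helper names are all made up for the example.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Plain sliding-window convolution (no padding, step of one)."""
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for row in range(out.shape[0]):
        for col in range(out.shape[1]):
            out[row, col] = np.sum(image[row:row + kh, col:col + kw] * kernel)
    return out

def peak_location(feature_map):
    """Row and column of the strongest response."""
    row, col = np.unravel_index(np.argmax(feature_map), feature_map.shape)
    return int(row), int(col)

kernel = np.ones((2, 2))          # responds most strongly to a bright 2x2 blob

image = np.zeros((6, 6))
image[1:3, 1:3] = 1.0             # blob near the top-left
shifted = np.roll(image, shift=(2, 3), axis=(0, 1))  # same blob, moved down 2 and right 3

original_map = conv2d_valid(image, kernel)
shifted_map = conv2d_valid(shifted, kernel)

print(peak_location(original_map), original_map.max())  # (1, 1) 4.0
print(peak_location(shifted_map), shifted_map.max())    # (3, 4) 4.0
```

The same weights find the blob in both places. Taking the maximum over the whole feature map gives a value that does not depend on position at all.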

On the Convolution Operation: The Big Question

The convolution operation improves the neural network in the sense that it is now able to recognize an object no matter where it is moved within the image.

Unlike a normal dense layer, where every node is connected to every node in the previous layer, a convolution layer breaks apart the image in a special way so that it can recognize the same object in different locations in the image-space.

There are a few ‘steps’ involved.

The first step is to break the image into small, overlapping tiles. A small window (kernel/filter) is passed over the image, translating across the entire image-space. Every time it lands somewhere, it outputs a new image-tile.

Now we have a set of tiles. Each tile is passed through the same neural network layer, so every tile is processed the same way (with the same weights), and we save one value per tile.

One can think of it this way: by this process, we’re turning the image into an array, where each entry represents whether the neural network thinks a certain pattern appears at that part of the image.
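These steps can be sketched with NumPy's `sliding_window_view` (available in NumPy 1.20 and later), which returns every overlapping tile at once. The 5 x 5 image and the 3 x 3 weights are invented for illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

image = np.arange(25).reshape(5, 5)

# Step 1: break the image into overlapping 3x3 tiles.
tiles = sliding_window_view(image, (3, 3))
print(tiles.shape)  # (3, 3, 3, 3): a 3x3 grid of tiles, each tile 3x3

# Step 2: process every tile with the same weights and save one value per tile.
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]])
responses = (tiles * kernel).sum(axis=(2, 3))

# Step 3: the result is an array with one entry per tile position.
print(responses.shape)  # (3, 3)
print(responses)        # every entry is -6 for this evenly increasing image
```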

This exact process is again repeated, but with a different set of weights, creating another feature map that tells us whether a certain pattern appears in the image. But because we’re using different weights, the network looks for a pattern different from the first time.

We can repeat this process several times, once per set of weights, until our new array has several layers. This turns our original image into a 3D array of stacked feature maps. Each element in the array indicates whether a certain pattern occurs at a certain location.
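Stacking the results of several filters can be sketched as follows. The three 2 x 2 kernels are hand-picked for illustration (in a real CNN they would be learned), and the random 8 x 8 image is a placeholder.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv2d_valid(image, kernel):
    """Apply one kernel to every overlapping tile of the image."""
    tiles = sliding_window_view(image, kernel.shape)
    return (tiles * kernel).sum(axis=(2, 3))

# Hypothetical hand-written kernels; each one looks for a different pattern.
kernels = [
    np.array([[1.0, -1.0], [1.0, -1.0]]),   # vertical edges
    np.array([[1.0, 1.0], [-1.0, -1.0]]),   # horizontal edges
    np.array([[1.0, 0.0], [0.0, -1.0]]),    # diagonal changes
]

image = np.random.default_rng(0).random((8, 8))

# One 2D feature map per kernel, stacked into a 3D array.
feature_maps = np.stack([conv2d_valid(image, k) for k in kernels], axis=-1)
print(feature_maps.shape)  # (7, 7, 3)
```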

Feature map

The feature map stores the outputs of different convolution operations between different image sections and the filter(s). This is the input for the next pooling layer.

The number of elements in the feature map is equal to the number of different image sections that we obtained by moving the filter(s) on the image. But because we are checking each tile of the original image, it doesn’t matter where in the image a pattern occurs. We can find it anywhere.
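The count of image sections follows directly from the sizes involved. The helper below is a hypothetical sketch that assumes no padding; the `stride` parameter (how many elements the filter moves per step) is an addition for generality and defaults to the one-element step described earlier.

```python
def feature_map_size(height, width, kernel_size, stride=1):
    """Number of filter positions along each axis when no padding is used."""
    out_h = (height - kernel_size) // stride + 1
    out_w = (width - kernel_size) // stride + 1
    return out_h, out_w

out_h, out_w = feature_map_size(256, 256, kernel_size=3)
print(out_h, out_w)   # 254 254
print(out_h * out_w)  # 64516 elements in the feature map
```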

The layers that follow use the information in the feature map to decide which of these patterns are most important in determining the final output.

Adding a convolutional layer makes it possible for our neural network to find the pattern, no matter where it appears in an image. And we can have several convolutional layers that repeat this process multiple times.

The idea is that we keep squishing down the image with each convolutional layer while still capturing the most important information from it, so that, by the time we reach the output layer, the neural network can identify whether or not the object appeared.