Deep Learning: Internal Covariate Shift & Batch Normalization

Rahul S
3 min read · Aug 23

Internal Covariate Shift refers to the phenomenon where the distribution of intermediate activations (the outputs of hidden layers) within a neural network changes during training.

During the training of deep neural networks, the model’s parameters (weights and biases) are updated using optimization algorithms like gradient descent. These updates are based on the gradients of the loss function with respect to the model’s parameters.
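
As a minimal illustration (plain NumPy, with made-up shapes and values), a single gradient descent step simply moves each parameter against its gradient:

```python
import numpy as np

# Hypothetical parameters and their gradients for one layer.
weights = np.random.randn(4, 3)
grad_weights = np.random.randn(4, 3)  # dLoss/dWeights from backpropagation

learning_rate = 0.01

# One gradient descent update: step in the direction that reduces the loss.
weights -= learning_rate * grad_weights
```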

Each layer applies a linear transformation followed by a non-linear activation function to its inputs. The outputs of these activation functions are referred to as activations (or, in convolutional layers, feature maps).

Internal Covariate Shift occurs because the distribution (mean and variance) of these activations can change as the model’s parameters are updated during training.

When the distribution of activations shifts significantly during training, it can make it harder for the model to converge.

Specifically, as the parameters of earlier layers change, the distribution of the inputs that later layers receive keeps shifting, forcing those later layers to continually re-adapt and making it harder for them to learn effectively.
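
To make this concrete, here is a small NumPy sketch (a hypothetical two-layer setup with arbitrary sizes) showing how a weight update in the first layer shifts the mean and variance of the activations that the next layer receives:

```python
import numpy as np

rng = np.random.default_rng(0)

# Fixed input batch and a first-layer weight matrix (hypothetical sizes).
x = rng.normal(size=(64, 10))    # mini-batch of 64 inputs
W1 = rng.normal(size=(10, 10))

def hidden_activations(W):
    """ReLU activations of the first hidden layer for the fixed batch."""
    return np.maximum(0.0, x @ W)

h_before = hidden_activations(W1)

# Simulate a training update to the first layer's weights.
W1_updated = W1 + 0.5 * rng.normal(size=W1.shape)
h_after = hidden_activations(W1_updated)

# The distribution seen by the next layer has shifted.
print("mean before/after:", h_before.mean(), h_after.mean())
print("std  before/after:", h_before.std(), h_after.std())
```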

Batch Normalization (BN) mitigates the effects of Internal Covariate Shift. Rather than only normalizing the raw inputs fed to the network, BN normalizes the activations within each mini-batch during training, ensuring that they have a consistent mean and variance. This helps stabilize training and can lead to faster convergence.
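
In practice, most frameworks provide this as a layer. A minimal PyTorch sketch (layer sizes here are purely illustrative) inserts batch normalization between the linear transformation and the activation:

```python
import torch
import torch.nn as nn

# A small fully connected network with BatchNorm after a linear layer
# (sizes are arbitrary, for illustration only).
model = nn.Sequential(
    nn.Linear(20, 64),
    nn.BatchNorm1d(64),   # normalizes each feature over the mini-batch
    nn.ReLU(),
    nn.Linear(64, 10),
)

x = torch.randn(32, 20)   # mini-batch of 32 examples
out = model(x)            # BatchNorm uses batch statistics in training mode
print(out.shape)          # torch.Size([32, 10])
```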

The concept of batch normalization can be understood by considering a mini-batch of activations, denoted as x. The goal is to compute a normalized version of x, referred to as x_hat, which is then scaled and shifted using learnable parameters.

The formula to compute x_hat is given as:

x_hat = (x − µ) / σ

Where:

  • x is the mini-batch of activations.
  • µ is the mean of the mini-batch.
  • σ is the standard deviation of the mini-batch.
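
The normalized activations are then scaled and shifted with the learnable parameters γ (scale) and β (shift), so the layer can still recover the original distribution if that turns out to be optimal:

y = γ * x_hat + β

Below is a from-scratch NumPy sketch of this training-time computation (a small epsilon is added to the variance for numerical stability, as in the original Batch Normalization paper):

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Training-time batch normalization for a (batch, features) array."""
    mu = x.mean(axis=0)                    # per-feature mean over the mini-batch
    var = x.var(axis=0)                    # per-feature variance over the mini-batch
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize: zero mean, unit variance
    return gamma * x_hat + beta            # scale and shift with learnable parameters

# Example: a mini-batch of 32 activations with 8 features each.
x = np.random.randn(32, 8) * 3.0 + 5.0     # deliberately shifted and scaled
gamma = np.ones(8)
beta = np.zeros(8)

y = batch_norm_forward(x, gamma, beta)
print(y.mean(axis=0).round(3))  # ~0 for every feature
print(y.std(axis=0).round(3))   # ~1 for every feature
```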

