Internal Covariate Shift refers to the phenomenon where the distribution of intermediate activations (the outputs of hidden layers) within a neural network changes during training.
During the training of deep neural networks, the model’s parameters (weights and biases) are updated using optimization algorithms like gradient descent. These updates are based on the gradients of the loss function with respect to the model’s parameters.
Each layer applies a linear transformation followed by an activation function to its inputs. The outputs of these activation functions are referred to as activations (or, in convolutional layers, feature maps).
Internal Covariate Shift occurs because the distribution (mean and variance) of these activations can change as the model’s parameters are updated during training.
When the distribution of activations shifts significantly during training, it can make it harder for the model to converge.
Specifically, each layer's inputs are the outputs of the layers before it. When earlier layers update their parameters, the distribution of inputs seen by later layers changes, so the later layers must keep re-adapting to it, which makes it difficult for them to learn effectively.
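To make the shift described above concrete, here is a minimal sketch in plain Python. It assumes a toy layer with a single unit and a ReLU activation; the input values and the "before" and "after" parameters are hypothetical numbers chosen for illustration, not the result of a real training step.

```python
import statistics

def relu(z):
    """ReLU activation: pass positive values through, clamp negatives to zero."""
    return max(0.0, z)

def layer_activations(inputs, weight, bias):
    """Apply a one-unit linear transformation followed by ReLU to each input."""
    return [relu(weight * x + bias) for x in inputs]

# A fixed mini-batch of inputs to the layer (hypothetical values).
inputs = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]

# The same inputs pass through the layer before and after a
# hypothetical parameter update.
before = layer_activations(inputs, weight=0.5, bias=0.0)
after = layer_activations(inputs, weight=1.5, bias=0.5)

# The mean and variance of the activations differ between the two settings,
# even though the inputs did not change.
for name, acts in (("before", before), ("after", after)):
    mean = statistics.fmean(acts)
    var = statistics.pvariance(acts)
    print(f"{name}: mean={mean:.3f}, variance={var:.3f}")
```

Any layer that consumes these activations would see a different input distribution after the update, even though the data itself is unchanged.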
Batch Normalization (BN) mitigates the effects of Internal Covariate Shift. As the name suggests, BN normalizes the activations within each mini-batch during training so that they have a consistent mean and variance. This helps stabilize training and can lead to faster convergence.
The concept of batch normalization can be understood by considering a mini-batch of activations, denoted as x. The goal is to compute a normalized version of x, referred to as x_hat, which is then scaled and shifted using learnable parameters.
The formula to compute x_hat is given as:
x_hat = (x - µ) / σ
Where:
- x is the mini-batch of activations.
- µ is the mean of the mini-batch.
- σ is the standard deviation of the mini-batch.
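The normalization step, together with the scale and shift mentioned earlier, can be sketched in a few lines of plain Python. This is an illustrative sketch rather than a reference implementation. It assumes a mini-batch of scalar activations, and the names gamma (scale), beta (shift), and eps are assumptions that do not appear in the formula above. In particular, eps is a small constant added to the variance so that the division stays safe when the mini-batch has (near) zero variance.

```python
import math

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a mini-batch of scalar activations, then scale and shift.

    gamma, beta, and eps are assumed names: gamma and beta stand in for the
    learnable scale and shift, and eps is a small stabilizing constant.
    """
    n = len(x)
    mu = sum(x) / n                              # mini-batch mean
    var = sum((v - mu) ** 2 for v in x) / n      # mini-batch (population) variance
    sigma = math.sqrt(var + eps)                 # standard deviation, stabilized by eps
    x_hat = [(v - mu) / sigma for v in x]        # x_hat = (x - mu) / sigma
    y = [gamma * v + beta for v in x_hat]        # scale and shift
    return x_hat, y

# Hypothetical mini-batch of activations.
batch = [0.0, 0.0, 0.5, 2.0, 3.5, 5.0]
x_hat, y = batch_norm(batch, gamma=2.0, beta=1.0)

print("x_hat:", [round(v, 3) for v in x_hat])
print("y:    ", [round(v, 3) for v in y])
```

After normalization, x_hat has a mean of approximately 0 and a variance of approximately 1, whatever the mean and variance of the original mini-batch were. The scale and shift then let the network choose a different mean and variance for the layer's output if that helps learning.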