Deep Learning: Exploding / Vanishing Gradients

Training deep neural networks requires understanding exploding and vanishing gradients. Exploding gradients grow very large during backpropagation and can make training diverge, while vanishing gradients shrink towards zero, so the earliest layers learn slowly or not at all. Both affect training time, convergence, and model performance. Techniques such as careful weight initialization, non-saturating activation functions (e.g., ReLU), batch normalization, gradient clipping, and residual connections mitigate these issues.

Rahul S
3 min read · Aug 17

To train deep neural networks well, it’s essential to understand gradients and their role in optimizing models.

Gradients represent the rate of change of the loss function with respect to each weight in the network. During training, the optimizer uses these gradients to update the weights, nudging the model towards a solution with lower loss.
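To make the update rule concrete, here is a minimal sketch in plain Python. The one-weight quadratic loss, the learning rate of 0.1, and all function names are illustrative assumptions, not part of any particular library.

```python
def loss(w):
    # Toy loss with a single weight; its minimum is at w = 3.0 (assumed for illustration).
    return (w - 3.0) ** 2


def grad(w):
    # Derivative of the toy loss with respect to the weight.
    return 2.0 * (w - 3.0)


def gradient_descent(w0, lr=0.1, steps=50):
    # Repeatedly step against the gradient, scaled by the learning rate.
    w = w0
    for _ in range(steps):
        w -= lr * grad(w)
    return w


w_final = gradient_descent(0.0)
print(w_final, loss(w_final))
```

Starting from 0.0, the weight ends up close to 3.0, where the toy loss is smallest. Real networks apply the same step to every weight at once, with the gradients supplied by backpropagation.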

1. Exploding Gradients:

Exploding gradients are a critical challenge that can disrupt the training process. They occur when the gradients calculated during backpropagation become exceptionally large as they travel backward through the layers of the network.

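In essence, the gradients snowball, growing with each layer they pass through. Backpropagation multiplies the gradient by a per-layer factor, and when those factors are consistently larger than 1, the product grows exponentially with depth. This ballooning effect leads to very large weight updates, often overshooting the optimal solution and causing the model to diverge.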

The analogy of a snowball rolling downhill, growing exponentially, vividly captures the essence of this issue.
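The snowball effect can be reproduced with a toy model. The sketch below assumes a chain of scalar linear layers that all share the same weight, so the gradient is simply multiplied by that weight at every layer on the way back. It also includes a simple norm-based gradient clipping helper, one of the common remedies. The function names, the weight of 1.5, and the depth of 50 are hypothetical choices for illustration.

```python
import math


def backward_gradient_magnitudes(weight, depth, upstream=1.0):
    # Gradient magnitude after passing backward through each layer of a
    # chain of scalar linear layers y = weight * x (a toy assumption).
    magnitudes = []
    g = upstream
    for _ in range(depth):
        g *= weight  # chain rule: d(weight * x)/dx = weight
        magnitudes.append(abs(g))
    return magnitudes


def clip_by_norm(grads, max_norm):
    # Rescale the gradient vector so its L2 norm is at most max_norm.
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm or norm == 0.0:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


mags = backward_gradient_magnitudes(weight=1.5, depth=50)
print(mags[0], mags[-1])  # the last value is many orders of magnitude larger

clipped = clip_by_norm([3.0, 4.0], max_norm=1.0)
print(clipped)  # same direction, norm capped at 1.0
```

Clipping keeps the direction of the update and only caps its size, which is why it is a popular guard against a single oversized step.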

2. Vanishing Gradients:

Conversely, vanishing gradients are an equally disruptive force during training. In this scenario, gradients shrink in magnitude as they flow backward through the layers, becoming exceedingly small. As a result, weight updates become minute…
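A quick way to see the shrinkage is a chain of sigmoid units. The derivative of the sigmoid never exceeds 0.25, so with weights of 1 every layer multiplies the gradient by at most 0.25. The sketch below is a toy under those assumptions; the function names, the input of 0.5, and the depths are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_chain_gradient(x, depth, weight=1.0):
    # Derivative of the final activation with respect to the input x for
    # `depth` stacked units a = sigmoid(weight * a_prev) (a toy assumption).
    a = x
    grad = 1.0
    for _ in range(depth):
        a = sigmoid(weight * a)
        grad *= weight * a * (1.0 - a)  # sigmoid'(z) = s(z) * (1 - s(z))
    return grad


shallow = sigmoid_chain_gradient(0.5, depth=2)
deep = sigmoid_chain_gradient(0.5, depth=20)
print(shallow, deep)  # the deep gradient is vanishingly small
```

With 20 layers the gradient reaching the input is below 0.25 to the power of 20, which is why the earliest layers of a deep sigmoid network barely move during training.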



Rahul S

LLM, NLP, Statistics, MLOps | Senior AI Consultant | IIT Roorkee