Deep Learning: Impact of Gradient Descent Optimization Algorithm during Training

Rahul S

During training, the objective is to find the weights and biases (parameters) of the neural network that minimize the loss function. Gradient descent and its variants are the most commonly used optimization algorithms for training neural networks.

The activation function affects gradient descent optimization by shaping the gradients that are propagated backwards through the network during backpropagation. These gradients are used to update the weights and biases in each layer so as to minimize the loss function.
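As a concrete illustration, here is a minimal sketch of one training step in PyTorch (the network, data, and learning rate are made up for the example): the backward pass computes the gradient of the loss with respect to every parameter, and the optimizer applies the plain gradient descent update.

```python
import torch
import torch.nn as nn

# Hypothetical two-layer network and random data, just to show one update step.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
x, y = torch.randn(16, 4), torch.randn(16, 1)

loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # plain gradient descent

loss = loss_fn(model(x), y)   # forward pass: compute the loss
optimizer.zero_grad()
loss.backward()               # backpropagation: dLoss/dParam for every weight and bias
optimizer.step()              # update: w <- w - lr * dLoss/dw
```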

In particular, the choice of activation function affects the magnitude of the gradients. If the gradients become too small or too large, training suffers from the "vanishing" or "exploding" gradient problem, respectively.

Activation functions such as ReLU and its variants help alleviate the vanishing gradient problem by keeping the gradients relatively large and preventing them from shrinking toward zero. In contrast, activation functions such as the sigmoid and hyperbolic tangent can exacerbate the vanishing gradient problem because they saturate at large positive and negative input values, producing very small gradients.
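To see this numerically, the toy sketch below (NumPy, with made-up pre-activation values) multiplies the local activation derivatives across ten layers, which is roughly what the chain rule does during backpropagation: the saturated sigmoid derivatives drive the product toward zero, while ReLU's derivative of 1 on positive inputs leaves it intact.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # at most 0.25, and close to 0 when |x| is large

def relu_grad(x):
    return (x > 0).astype(float)  # exactly 1 for positive inputs, 0 otherwise

# Hypothetical pre-activation values seen at 10 successive layers (all positive here).
pre_activations = np.array([3.0, 4.0, 2.5, 5.0, 3.5, 2.0, 4.5, 3.0, 2.5, 4.0])

# Backpropagation multiplies these local derivatives along the chain.
print("sigmoid:", np.prod(sigmoid_grad(pre_activations)))  # ~1e-15, the gradient has vanished
print("relu   :", np.prod(relu_grad(pre_activations)))     # 1.0, the gradient passes through
```

(ReLU has its own failure mode: units whose pre-activation is negative get a gradient of exactly 0, which is why variants such as Leaky ReLU exist.)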

