Deep Learning: Activation Functions — 10 Tricky questions

1. What are some common activation functions used in deep learning, and how do they differ from each other?

Rahul S
4 min read · Aug 17


Answer: Some common activation functions used in deep learning include the sigmoid, hyperbolic tangent (tanh), rectified linear unit (ReLU), and softmax functions.

  • The sigmoid (range (0, 1)) and tanh (range (−1, 1)) functions are both smooth and saturate as their input becomes very large or very small.
  • The ReLU function is non-saturating for positive inputs and has a simple derivative, making it computationally efficient.
  • The softmax function is used for multiclass classification problems and converts a vector of logits into a probability distribution.
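The four functions above can be sketched in a few lines of NumPy (an illustrative sketch, not any particular framework's implementation):

```python
import numpy as np

def sigmoid(x):
    # Squashes input to (0, 1); saturates for large |x|.
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Squashes input to (-1, 1); zero-centered but still saturating.
    return np.tanh(x)

def relu(x):
    # Passes positive inputs through unchanged, zeros out the rest.
    return np.maximum(0.0, x)

def softmax(x):
    # Converts a vector of logits into a probability distribution.
    e = np.exp(x - np.max(x))  # subtract max for numerical stability
    return e / e.sum()
```

Note the max-subtraction trick in softmax: it does not change the result but prevents overflow when logits are large.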

2. ReLU is non-saturating. What does it mean? How is it useful?

Answer: ReLU (Rectified Linear Unit) is considered “non-saturating” because its output does not level off as the input grows. For positive inputs the output equals the input, so it keeps growing linearly rather than flattening out at some maximum value.

The non-saturating property of ReLU helps to prevent the vanishing gradient problem, which can occur when gradients become too small (like in case of tanh and sigmoid) and lead to difficulties in training deep neural networks.
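A quick comparison makes the contrast concrete: sigmoid's output flattens out near 1 for large inputs, while ReLU's output stays proportional to the input (a minimal NumPy sketch):

```python
import numpy as np

x = np.array([1.0, 10.0, 100.0, 1000.0])

sigmoid = 1.0 / (1.0 + np.exp(-x))  # approaches 1 and stops changing
relu = np.maximum(0.0, x)           # stays proportional to the input

print(sigmoid)
print(relu)  # [1, 10, 100, 1000] -- no ceiling
```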

3. What is the vanishing gradient problem, and how does it relate to activation functions?
Answer: The vanishing gradient problem occurs when the gradients of the cost function with respect to the weights in deep neural networks become very small as they are propagated backwards through the network during backpropagation.

This can cause the network to learn slowly or not at all.

Activation functions such as the sigmoid and tanh are prone to this problem because they saturate at high and low values, resulting in small gradients. ReLU and its variants (e.g., Leaky ReLU) have been shown to mitigate the vanishing gradient problem.
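The mechanism is easy to see numerically: sigmoid's derivative is sigmoid(x) · (1 − sigmoid(x)), which peaks at 0.25, and backpropagation multiplies one such factor per layer, so the gradient shrinks exponentially with depth (an illustrative sketch):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # Derivative of sigmoid: s * (1 - s), maximized at x = 0.
    s = sigmoid(x)
    return s * (1.0 - s)

# Even at its best (x = 0), each sigmoid layer scales the gradient
# by at most 0.25; after 10 layers the signal is nearly gone:
per_layer = sigmoid_grad(0.0)  # 0.25
print(per_layer ** 10)         # ~9.5e-07

# ReLU's derivative is exactly 1 for positive inputs, so the
# gradient passes through active units undiminished:
print(1.0 ** 10)               # 1.0
```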

4. What is the derivative of the ReLU function, and why is it useful in deep learning?
Answer: The derivative of the ReLU function is 1 for input values greater than zero and 0 for input values less than zero (at exactly zero the derivative is undefined, and implementations conventionally assign it 0 or 1).

This piecewise-constant derivative is useful because it is trivially cheap to compute and, for active (positive) units, passes gradients through backpropagation at full strength, avoiding the shrinking gradients that saturating functions like sigmoid and tanh produce.



