Deep Learning: Activation Functions — 10 Tricky Questions
1. What are some common activation functions used in deep learning, and how do they differ from each other?
Answer: Some common activation functions used in deep learning include the sigmoid, hyperbolic tangent (tanh), rectified linear unit (ReLU), and softmax functions.
- The sigmoid and tanh functions are both smooth and saturate as their input gets very large or small.
- The ReLU function is non-saturating and has a simple derivative, making it computationally efficient.
- The softmax function is used for multiclass classification problems and converts a vector of inputs into a probability distribution.
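The four functions above can be sketched in a few lines of NumPy (a minimal illustration, not a production implementation):

```python
import numpy as np

def sigmoid(x):
    # Smooth; saturates toward 0 and 1 as |x| grows
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Smooth; saturates toward -1 and 1 as |x| grows
    return np.tanh(x)

def relu(x):
    # Non-saturating for positive inputs; zero for negative inputs
    return np.maximum(0.0, x)

def softmax(z):
    # Subtract the max before exponentiating for numerical stability
    e = np.exp(z - np.max(z))
    return e / e.sum()

x = np.array([-2.0, 0.0, 2.0])
print(relu(x))     # [0. 0. 2.]
print(softmax(x))  # non-negative values that sum to 1
```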
2. ReLU is non-saturating. What does it mean? How is it useful?
Answer: ReLU (Rectified Linear Unit) is called “non-saturating” because its output does not level off as the input grows. For any positive input the output equals the input, so the function keeps growing linearly rather than flattening toward a maximum value.
This non-saturating property helps prevent the vanishing gradient problem, which occurs when gradients become very small (as with tanh and sigmoid) and makes deep neural networks difficult to train.
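A quick numerical check makes the contrast concrete: sigmoid's derivative collapses for large inputs, while ReLU's derivative stays at 1 for any positive input (a minimal sketch using NumPy):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # Derivative of sigmoid: s * (1 - s); its maximum is 0.25 at x = 0
    s = sigmoid(x)
    return s * (1.0 - s)

def relu_grad(x):
    # Derivative of ReLU: 1 for positive inputs, 0 otherwise
    return np.where(x > 0, 1.0, 0.0)

for x in [1.0, 10.0, 100.0]:
    # sigmoid's gradient shrinks toward 0; ReLU's stays at 1
    print(x, sigmoid_grad(x), relu_grad(x))
```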
3. What is the vanishing gradient problem, and how does it relate to activation functions?
Answer: The vanishing gradient problem occurs when the gradients of the cost function with respect to the weights in deep neural networks become very small as they are propagated backwards through the network during backpropagation.
This can cause the network to learn slowly or not at all.
Activation functions such as the sigmoid and tanh are prone to this problem because they saturate at high and low values, resulting in small gradients. ReLU and its variants (e.g., Leaky ReLU) have been shown to mitigate the vanishing gradient problem.
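The chain rule makes the problem easy to quantify: backpropagation multiplies one activation derivative per layer, and sigmoid's derivative never exceeds 0.25. A back-of-the-envelope worst case (ignoring weight matrices for simplicity):

```python
# Chain-rule product of activation derivatives through 20 layers.
# Sigmoid's derivative is at most 0.25 (attained at x = 0), while
# ReLU's derivative is exactly 1 for active units.
depth = 20
sigmoid_chain = 0.25 ** depth  # roughly 9.1e-13 — effectively vanished
relu_chain = 1.0 ** depth      # 1.0 — the gradient passes through intact
print(sigmoid_chain, relu_chain)
```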
4. What is the derivative of the ReLU function, and why is it useful in deep learning?
Answer: The derivative of the ReLU function is 1 for input values greater than zero and 0 for input values less than zero. This piecewise-constant derivative is trivial to compute, and for active (positive) units it passes gradients through backpropagation unchanged, which makes training computationally efficient and avoids the vanishing gradient problem for those units.
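The derivative can be written in one line (a sketch; note that at exactly x = 0 the derivative is undefined, and frameworks conventionally assign it 0 or 1):

```python
import numpy as np

def relu_grad(x):
    # 1 where x > 0, 0 elsewhere (x = 0 is conventionally assigned 0 here)
    return (x > 0).astype(float)

print(relu_grad(np.array([-3.0, 0.0, 2.0])))  # [0. 0. 1.]
```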