Deep Learning: GELU (Gaussian Error Linear Unit) Activation Function

Rahul S
Aug 24

GELU stands for Gaussian Error Linear Unit. It was designed to address some of the limitations of ReLU, such as the “dying ReLU” problem and the fact that ReLU zeroes out all negative inputs, discarding that information entirely.

The GELU activation is defined as GELU(x) = x · Φ(x), where Φ is the cumulative distribution function of the standard normal distribution, i.e. GELU(x) = x · ½[1 + erf(x/√2)]. A common tanh-based approximation is GELU(x) ≈ 0.5x(1 + tanh(√(2/π)(x + 0.044715x³))).

Key properties of GELU:

  1. Smoothness: GELU is a smooth function, meaning it’s continuously differentiable, unlike ReLU, which is not differentiable at zero. This smoothness can help with gradient-based optimization during training.
  2. Approximation to Identity: For large positive values of x, GELU approaches the identity function, which allows it to preserve information much as ReLU does, while near zero it smoothly scales inputs rather than cutting them off.
  3. Non-Linearity: GELU introduces non-linearity to the network, which is essential for capturing complex relationships in data.
  4. Sigmoid and Tanh Approximations: The exact GELU uses the Gaussian CDF, but in practice it is often computed with a tanh-based (or sigmoid-based) approximation. Unlike ReLU, it passes small negative values through with a non-zero gradient.
  5. The √(2/π) Factor: The term sqrt(2/pi) in the tanh form comes from approximating the Gaussian CDF; it is part of the approximation itself rather than a separate normalization of the outputs.
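The definition and its tanh approximation can be sketched in a few lines of plain Python (a minimal illustration using only the standard library; the function names are mine, not from any framework):

```python
import math

def gelu(x):
    # Exact GELU: x * Phi(x), where Phi is the standard normal CDF
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # Tanh approximation from the GELU paper; sqrt(2/pi) arises from
    # approximating the Gaussian CDF, not from output normalization
    return 0.5 * x * (1.0 + math.tanh(
        math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

for x in [-3.0, -1.0, 0.0, 1.0, 3.0]:
    print(f"x={x:+.1f}  exact={gelu(x):+.6f}  tanh_approx={gelu_tanh(x):+.6f}")
```

Note how negative inputs produce small negative outputs with non-zero gradient, instead of the hard zero that ReLU would give.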

GELU has found applications in various deep learning architectures, especially in natural language processing (NLP) and computer vision tasks.

It is commonly used in transformer-based models, including the popular BERT (Bidirectional Encoder Representations from Transformers) model for NLP tasks.
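In BERT-style transformers, GELU sits between the two linear layers of the position-wise feed-forward sub-layer. A toy sketch of that pattern (dimensions and weights here are hypothetical toy values, not trained parameters; real models use sizes like 768 → 3072):

```python
import math

def gelu(x):
    # Exact GELU via the Gaussian CDF
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def feed_forward(x, W1, b1, W2, b2):
    # Position-wise feed-forward sub-layer: Linear -> GELU -> Linear
    hidden = [gelu(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(W2, b2)]

# Toy example: model dim 2, hidden dim 4 (illustrative values only)
x  = [0.5, -1.2]
W1 = [[0.1, -0.2], [0.3, 0.4], [-0.5, 0.6], [0.7, 0.8]]
b1 = [0.0, 0.1, -0.1, 0.0]
W2 = [[0.2, -0.1, 0.05, 0.3], [0.0, 0.25, -0.4, 0.1]]
b2 = [0.0, 0.0]
y = feed_forward(x, W1, b1, W2, b2)
```

The same structure appears in each transformer block; only the activation between the two projections changed from ReLU (original Transformer) to GELU (BERT and many successors).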

