Encoders and Decoders in Transformers

Rahul S
Nov 27, 2023

This is a crisp introduction to, and revision of, the two concepts essential for understanding transformers.

ENCODER

The encoder in a transformer converts a sequence of input tokens from a sentence into equivalent vectors, also called hidden states or the context. These vectors capture the semantics of, and relationships between, the tokens using techniques such as an embedding matrix, positional encoding, and attention.
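As a rough end-to-end sketch of that idea, assuming a PyTorch-style API and arbitrary, hypothetical sizes (positional encodings are omitted here and covered in the steps below):

```python
import torch
import torch.nn as nn

# Hypothetical sizes, chosen only for illustration.
vocab_size, d_model, seq_len = 10000, 512, 16

tokens = torch.randint(0, vocab_size, (1, seq_len))   # one sentence of token ids
embed = nn.Embedding(vocab_size, d_model)             # token id -> embedding vector

layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=6)

hidden_states = encoder(embed(tokens))                # the "context": shape (1, seq_len, d_model)
```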

The encoder has a complex architecture made of multiple building blocks. Let’s review the architecture now.

SINGLE ENCODER LAYER

We will start off with the architecture of a single encoder layer.

  1. The input tokens from the sentence are converted into their equivalent embeddings.
  2. The same input is also used to create positional encoding vectors.
  3. The positional encoding vectors are then added to the input embeddings, token by token.
  4. The result is sent to the first encoder layer.
  5. The encoder layer has a multi-head attention block. The output of this multi-head attention block is added back to its input (a residual connection), and the resulting vector is normalized using layer normalization.
  6. The normalized vector is then sent to a feed-forward network. This network is a classical neural network with its own layers, nodes, weights, and biases. The output of this feed-forward network is again combined with its input and normalized.
  7. This produces the output of a single encoder layer (see the sketch after this list).
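A minimal from-scratch sketch of these steps, assuming PyTorch; the sizes below are illustrative, not a prescribed configuration:

```python
import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    """Sinusoidal positional encodings, added to the token embeddings (steps 2-3)."""
    def __init__(self, d_model: int, max_len: int = 512):
        super().__init__()
        position = torch.arange(max_len).unsqueeze(1)                 # (max_len, 1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer("pe", pe)

    def forward(self, x):                        # x: (batch, seq_len, d_model)
        return x + self.pe[: x.size(1)]

class EncoderLayer(nn.Module):
    """One encoder layer: multi-head attention and a feed-forward network,
    each followed by a residual connection and layer normalization (steps 5-6)."""
    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                        # x: (batch, seq_len, d_model)
        attn_out, _ = self.attn(x, x, x)         # self-attention over the sequence
        x = self.norm1(x + attn_out)             # add & normalize
        x = self.norm2(x + self.ffn(x))          # feed-forward, add & normalize
        return x

# Hypothetical sizes, chosen only for the example.
vocab_size, d_model, seq_len = 10000, 512, 16
tokens = torch.randint(0, vocab_size, (1, seq_len))   # one sentence of token ids
embed = nn.Embedding(vocab_size, d_model)
pos_enc = PositionalEncoding(d_model)

x = pos_enc(embed(tokens))         # steps 1-3: embeddings plus positional encodings
out = EncoderLayer()(x)            # steps 4-7: one encoder layer
print(out.shape)                   # torch.Size([1, 16, 512])
```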

The transformer has an encoder stack, typically of six encoder layers; the number may vary based on the specific transformer architecture. The original input, consisting of the embedding matrix plus the positional encoding vectors, is fed into the first encoder layer. The output of the first layer is then sent as the input to the second layer. This flow continues until the last encoder layer, which produces the final output.
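A minimal sketch of that flow, assuming each layer is an off-the-shelf PyTorch encoder layer and the sizes are again hypothetical:

```python
import torch
import torch.nn as nn

d_model, n_layers = 512, 6            # hypothetical sizes
layers = nn.ModuleList(
    [nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
     for _ in range(n_layers)]
)

x = torch.randn(1, 16, d_model)       # embeddings + positional encodings for 16 tokens
for layer in layers:                  # the output of each layer feeds the next
    x = layer(x)                      # shape stays (1, 16, d_model) at every layer
```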

Each encoder layer has the same dimensions for its inputs and outputs. Each layer has its own multi-head attention mechanism and feed-forward network. These, in turn, have their own…
