Attention in Transformers

Rahul S
2 min read · Nov 27, 2023

Tokens in a sequence have semantic relationships that the model must represent. The attention mechanism models these relationships while still allowing the transformer to process the sequence in parallel.

For instance, let’s analyze the sentence, “I bought a new pant, and it was too big.” In this sentence, what does the word ‘it’ refer to? Does it refer to ‘pant’ or to something else in the sentence? In this context, ‘it’ refers to ‘pant,’ but how do we convey to the model that in this instance, ‘it’ refers to ‘pant’? We employ attention for this purpose.


Now, let’s briefly revisit the concept of self-attention and how the attention score is calculated.

  1. The process begins with an input word, which is converted into a token and represented as an embedding vector. This vector already incorporates positional encoding.
  2. Three weight matrices, analogous to the weight matrices of any deep-learning layer, hold numeric values. Multiplying the embedding vector by each of them yields three vectors: the query, the key, and the value.
  3. Conceptually, a query is used to look up keys and retrieve the matching values. In attention, all three are numeric vectors, and together they are used to compute the attention vector ‘Z’ for the token.
  4. This computation uses the keys of the other tokens in the sentence, which is how the relationships between tokens are modeled into the attention score.
  5. Self-attention is therefore computed for each token in the input sentence, producing an attention vector of numeric values, with the keys of the surrounding tokens influencing each token’s attention score.
  6. Crucially, because the original embedding includes positional encoding, the self-attention vector reflects both the token’s position and its relationships with other tokens.
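The steps above can be sketched in NumPy. The dimensions, random initialization, and variable names below are illustrative assumptions, not values from any real model:

```python
import numpy as np

rng = np.random.default_rng(0)

seq_len, d_model, d_k = 4, 8, 8   # 4 tokens, embedding size 8 (assumed toy sizes)

# Token embeddings; positional encoding is assumed to be already added (step 1).
X = rng.normal(size=(seq_len, d_model))

# Randomly initialized weight matrices Wq, Wk, Wv (step 2); in a real
# transformer these are refined during training.
Wq = rng.normal(size=(d_model, d_k))
Wk = rng.normal(size=(d_model, d_k))
Wv = rng.normal(size=(d_model, d_k))

Q, K, V = X @ Wq, X @ Wk, X @ Wv  # query, key, and value vectors per token

# Steps 3-5: each token's query is compared against every token's key,
# scaled, and softmax-normalized into attention weights.
scores = Q @ K.T / np.sqrt(d_k)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# The attention vector Z for each token is a weighted sum of all value vectors.
Z = weights @ V
print(Z.shape)  # (4, 8): one attention vector per token
```

Each row of `weights` sums to 1, so every token’s `Z` is a blend of the value vectors of all tokens in the sentence, weighted by how strongly its query matches their keys.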

The weights (Wq, Wk, and Wv) are initialized randomly and refined during the transformer’s training process, just like any other trainable parameters in a deep-learning model.


Moving on to multi-head attention, each set of Wq, Wk, and Wv matrices constitutes a single attention head. In multi-head attention, multiple such…
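As a rough sketch of the idea, here is multi-head attention with two heads in NumPy; the head count, toy dimensions, and the final output projection Wo are assumed choices for illustration:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention_head(X, Wq, Wk, Wv):
    # One head: its own (Wq, Wk, Wv) triple and its own attention pattern.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    return softmax(Q @ K.T / np.sqrt(Wk.shape[1])) @ V

rng = np.random.default_rng(1)
seq_len, d_model, n_heads = 4, 8, 2
d_head = d_model // n_heads  # each head attends in a smaller subspace

X = rng.normal(size=(seq_len, d_model))

# One randomly initialized (Wq, Wk, Wv) triple per head.
heads = [
    attention_head(
        X,
        rng.normal(size=(d_model, d_head)),
        rng.normal(size=(d_model, d_head)),
        rng.normal(size=(d_model, d_head)),
    )
    for _ in range(n_heads)
]

# Concatenate the per-head outputs and mix them with a final projection Wo.
Wo = rng.normal(size=(d_model, d_model))
out = np.concatenate(heads, axis=-1) @ Wo
print(out.shape)  # (4, 8): same shape as the input embeddings
```

Because each head has its own weight matrices, different heads can learn to model different kinds of relationships between tokens, and their outputs are combined into one vector per token.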