Attention in Transformers
Tokens in a sequence have semantic relationships with one another, and the model needs a way to represent them. The attention mechanism models these relationships while still allowing the transformer to process all tokens in parallel.
For instance, consider the sentence, “I bought a new pant on amazon.com, and it was too big.” What does the word ‘it’ refer to in this sentence: the word ‘pant’ or the phrase ‘amazon.com’? Here, ‘it’ refers to ‘pant,’ but how do we convey to the model that, in this instance, ‘it’ refers to ‘pant’? This is exactly what attention is for.
SELF-ATTENTION
Now, let’s briefly revisit the concept of self-attention and how the attention score is calculated.
- The process begins with an input word, which is converted into a token and represented as an embedding vector. Positional encoding has already been added to this vector.
- Three weight matrices, like the weight matrices in any deep learning model, contain numeric values learned during training. The embedding vector is multiplied by each of them, yielding three vectors: query, key, and value (see the sketch after this list).
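To make this concrete, here is a minimal NumPy sketch of the projection step and the attention-score calculation. The dimensions and the randomly initialized matrices `W_q`, `W_k`, and `W_v` are assumptions for illustration; in a real transformer these weights are learned during training and the embeddings come from the model’s embedding layer.

```python
import numpy as np

# Toy dimensions (illustrative assumptions): 4 tokens, embedding size 8.
seq_len, d_model, d_k = 4, 8, 8

rng = np.random.default_rng(0)
X = rng.normal(size=(seq_len, d_model))  # token embeddings, positional encoding already added

# Three weight matrices; randomly initialized here, learned in practice.
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

# Multiply each embedding by the weight matrices to get query, key, and value vectors.
Q = X @ W_q
K = X @ W_k
V = X @ W_v

# Scaled dot-product attention scores: one row per token, one column per token attended to.
scores = Q @ K.T / np.sqrt(d_k)

# Softmax over each row turns scores into attention weights that sum to 1.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

output = weights @ V  # each token's output is a weighted combination of all value vectors
print(weights.round(2))
```

Each row of `weights` tells us how strongly one token attends to every other token. In the example above, a well-trained model would assign a high weight from ‘it’ to ‘pant,’ which is how the model captures that relationship.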