ViT: Vision Transformers (An Introduction)

So-called generative AI is not just about large language models (LLMs). Processing and understanding other formats, like images and videos, is equally important.
In this article, I will try to give a non-technical overview of ViT (Vision Transformers), which have emerged as successors to CNNs in computer vision applications like image classification, object detection, and semantic image segmentation.
The ViT model architecture was introduced in “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale”. If you want, you can give it a try: [https://arxiv.org/abs/2010.11929]. Here, I only intend to touch upon it. Going deep is a researcher’s work, and I am currently not a researcher. My role is to understand and apply, always remembering that new technologies will come.
But let me begin at the beginning: the Transformer architecture.
The key idea behind the success of transformers is attention. It lets the model consider the context of words and focus on the key relationships between different tokens.
Attention is about comparing ‘token embeddings’ and calculating a sort of alignment score that encapsulates how similar two tokens are based on their contextual and semantic meaning.
In the layers preceding the attention layer, each word embedding is encoded into a “vector space”. In this vector space, similar tokens share a similar location (a key idea behind the success of embeddings). Mathematically, the dot product of two similar embeddings results in a higher alignment score than the dot product of embeddings that are not aligned.
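As a rough illustration (with made-up toy vectors, not real learned embeddings), the dot product rewards embeddings that point in a similar direction:

```python
# A minimal sketch with toy 4-dimensional embeddings for three tokens,
# chosen so that "cat" and "kitten" point in similar directions while "car" does not.
import numpy as np

embeddings = {
    "cat":    np.array([0.9, 0.1, 0.8, 0.2]),
    "kitten": np.array([0.8, 0.2, 0.9, 0.1]),
    "car":    np.array([0.1, 0.9, 0.1, 0.8]),
}

def alignment(a, b):
    # Dot product: larger when the two embeddings are aligned.
    return float(np.dot(embeddings[a], embeddings[b]))

print(alignment("cat", "kitten"))  # high score: similar meaning
print(alignment("cat", "car"))     # low score: unrelated meaning
```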
Before attention, our tokens’ initial positions are based purely on the “general meaning” of a particular word or sub-word token. But as we go through several encoder blocks (these include the attention mechanism), the position of these embeddings is updated to better reflect the meaning of a token with respect to its context, where the context is all of the other words within that specific sentence.
In other words, the tokens are pushed towards their context-based meaning through many attention encoder blocks.
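To make that concrete, here is a heavily simplified sketch of a single self-attention step (real transformers add learned query/key/value projections, multiple heads, and feed-forward layers on top of this), just to show how each embedding becomes a context-weighted mixture of the others:

```python
# Simplified self-attention: each token embedding is updated to a weighted
# average of all embeddings in the sentence, weighted by alignment scores.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X):
    # X: (num_tokens, dim) matrix of token embeddings for one sentence.
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)        # pairwise alignment scores
    weights = softmax(scores, axis=-1)   # how much each token attends to the others
    return weights @ X                   # context-aware embeddings, same shape as X

tokens = np.random.default_rng(0).normal(size=(5, 8))  # 5 tokens, 8-dim embeddings
updated = self_attention(tokens)
print(updated.shape)  # (5, 8): each row now reflects its surrounding context
```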