ViT: Vision Transformers (An Introduction)


Rahul S


So-called generative AI is not just about large language models (LLMs). Processing and understanding other formats, like images and videos, is equally important.

In this article, I will give a non-technical overview of ViTs (Vision Transformers), which have emerged as successors to CNNs in computer-vision applications such as image classification, object detection, and semantic image segmentation.

The ViT architecture was introduced in “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale”. If you want, you can give it a try: []. Here, I intend only to touch upon it. Going deep is a researcher’s work, and I am currently not a researcher; my role is to understand and apply, always remembering that new technologies will come.
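The “16x16 words” in the paper’s title refers to how ViT turns an image into a sequence: it is cut into non-overlapping 16x16 pixel patches, and each flattened patch is treated as one token. A minimal sketch of that patchification step, assuming a toy 224x224 RGB image (the shapes and variable names here are illustrative, not the paper’s code):

```python
import numpy as np

# Hypothetical toy input: a 224x224 RGB image filled with random values.
image = np.random.rand(224, 224, 3)

patch_size = 16  # the "16x16 words" of the paper's title
h, w, c = image.shape

# Split the image into non-overlapping 16x16 patches, then flatten each
# patch into a single vector -- each vector plays the role of a "token".
patches = (
    image.reshape(h // patch_size, patch_size, w // patch_size, patch_size, c)
         .transpose(0, 2, 1, 3, 4)
         .reshape(-1, patch_size * patch_size * c)
)

print(patches.shape)  # (196, 768): a sequence of 14x14 = 196 tokens
```

From here on, the transformer sees the image exactly as it would see a sentence: a sequence of 196 token vectors (plus, in the real model, a learned projection, a class token, and position embeddings).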

But let me begin at the beginning: the Transformer architecture.

The key idea behind the success of transformers is attention. It lets the model consider the context of each word and focus on the key relationships between different tokens.

Attention works by comparing token embeddings and calculating an alignment score that captures how similar two tokens are, based on their contextual and semantic meaning.
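The comparison described above can be sketched in a few lines. This is a simplified, self-attention-style score (embeddings compared against themselves, without the learned query/key/value projections a real transformer uses); the toy embeddings are made up for illustration:

```python
import numpy as np

def attention_weights(embeddings):
    """Scaled dot-product alignment scores between token embeddings,
    normalized with a softmax so each row is an attention distribution."""
    d = embeddings.shape[-1]
    scores = embeddings @ embeddings.T / np.sqrt(d)  # pairwise similarity
    # Softmax over each row (subtract the max for numerical stability).
    exp = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

# Three toy token embeddings: the first two point in similar directions,
# the third is nearly orthogonal to them.
tokens = np.array([[1.0, 0.0],
                   [1.0, 0.1],
                   [0.0, 1.0]])

w = attention_weights(tokens)
```

Each row of `w` sums to 1, and token 0 assigns more weight to the similar token 1 than to the dissimilar token 2 — that weighting is what lets the model “focus” on related tokens.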