Multi-Class Classification with a Vision Transformer from Scratch Using TensorFlow and Python
The Vision Transformer (ViT) architecture was introduced in the paper “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale”, published as a conference paper at ICLR 2021. URL: https://arxiv.org/abs/2010.11929
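The "words" in the paper's title are fixed-size image patches, which the model treats as a sequence of tokens. As a rough illustration of that idea, the sketch below computes how many patches an image yields and how many values each flattened patch holds. The helper name `patch_sequence_shape` and the 224x224 RGB input are assumptions chosen for this example, not values taken from this article.

```python
from typing import Tuple


def patch_sequence_shape(image_size: int, patch_size: int, channels: int = 3) -> Tuple[int, int]:
    """Return (number of patches, flattened values per patch) for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    num_patches = patches_per_side ** 2
    patch_dim = patch_size * patch_size * channels
    return num_patches, patch_dim


# Assumed illustrative input: a 224x224 RGB image split into 16x16 patches.
num_patches, patch_dim = patch_sequence_shape(224, 16)
print(num_patches, patch_dim)  # 196 768
```

Under these assumptions the image becomes a sequence of 196 tokens, each a vector of 768 raw pixel values before any learned projection.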