This repository contains an implementation of Vision Transformers (ViTs) from scratch. The notebook demonstrates the complete process of setting up, training, and evaluating a Vision Transformer model.
To get started, clone this repository and install the required dependencies:
git clone https://github.com/VarunBiyyala/ViT-from-Scratch.git
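The list of required dependencies is not spelled out above; assuming the notebook's stack of PyTorch, torchvision, and einops, something like the following should suffice:

pip install torch torchvision einops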
Vision Transformers (ViTs) are a type of neural network architecture that applies transformer models to image data. Unlike traditional convolutional neural networks (CNNs), ViTs divide images into fixed-size patches and process them as sequences of tokens, enabling the model to capture long-range dependencies in the data.
The initial step involves setting up the environment and installing the necessary libraries. The einops library is used for flexible tensor operations and transformations.
!pip install einops

For this experimental project, the Oxford-IIIT Pet dataset has been downloaded using:
from torchvision.datasets import OxfordIIITPet

Images are divided into fixed-size patches, which are then flattened and linearly embedded. This step is crucial, as it prepares the image data for processing by the transformer model.
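As an illustration of this step, here is a minimal sketch (not the notebook's exact code; the patch size, embedding dimension, and the `PatchEmbedding` name are assumptions) that uses einops to form the patch sequence and a linear layer to embed it:

```python
import torch
import torch.nn as nn
from einops.layers.torch import Rearrange

class PatchEmbedding(nn.Module):
    """Split an image into patches and project each patch to an embedding vector."""
    def __init__(self, in_channels=3, patch_size=16, emb_dim=128):
        super().__init__()
        self.to_patches = nn.Sequential(
            # (B, C, H, W) -> (B, num_patches, patch_size * patch_size * C)
            Rearrange('b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1=patch_size, p2=patch_size),
            nn.Linear(patch_size * patch_size * in_channels, emb_dim),
        )

    def forward(self, x):
        return self.to_patches(x)

# Example: a 224x224 image split into 16x16 patches gives 196 tokens.
tokens = PatchEmbedding()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 128])
```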
Positional encodings are added to the patch embeddings to retain spatial information. This step ensures that the model understands the relative positions of image patches.
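One common realization of this, sketched here with a learnable positional embedding and a prepended class token (the module name and dimensions are assumptions, and the notebook's variant may differ):

```python
import torch
import torch.nn as nn

class AddClsTokenAndPosition(nn.Module):
    """Prepend a learnable class token and add learnable positional embeddings."""
    def __init__(self, num_patches=196, emb_dim=128):
        super().__init__()
        self.cls_token = nn.Parameter(torch.randn(1, 1, emb_dim))
        self.pos_embedding = nn.Parameter(torch.randn(1, num_patches + 1, emb_dim))

    def forward(self, x):                       # x: (B, num_patches, emb_dim)
        b = x.shape[0]
        cls = self.cls_token.expand(b, -1, -1)  # one class token per image
        x = torch.cat([cls, x], dim=1)          # (B, num_patches + 1, emb_dim)
        return x + self.pos_embedding           # inject positional information
```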
The core component of the Vision Transformer is the Transformer encoder. It consists of multi-head self-attention mechanisms and feed-forward neural networks. This allows the model to process image patches in parallel and capture complex dependencies.
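A minimal encoder block along these lines might look as follows (the pre-norm arrangement and the hyperparameters are illustrative assumptions, not the notebook's exact choices):

```python
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One Transformer encoder block: multi-head self-attention + MLP, with residuals."""
    def __init__(self, emb_dim=128, num_heads=4, mlp_dim=256, dropout=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(emb_dim)
        self.attn = nn.MultiheadAttention(emb_dim, num_heads, dropout=dropout, batch_first=True)
        self.norm2 = nn.LayerNorm(emb_dim)
        self.mlp = nn.Sequential(
            nn.Linear(emb_dim, mlp_dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(mlp_dim, emb_dim),
        )

    def forward(self, x):
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)   # self-attention over the patch tokens
        x = x + attn_out                   # residual connection
        x = x + self.mlp(self.norm2(x))    # feed-forward with residual connection
        return x
```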
For classification tasks, a classification head is added to the transformer encoder. This involves taking the output corresponding to the class token (or applying global average pooling) and passing it through a fully connected layer to obtain class logits.
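Putting the pieces together, here is a hedged sketch of a small ViT with a class-token head; it reuses the sketch modules above, and the depth and embedding size are illustrative choices rather than the notebook's exact configuration (Oxford-IIIT Pet has 37 classes):

```python
import torch.nn as nn

class ViT(nn.Module):
    """A small Vision Transformer: patch embedding -> encoder stack -> class-token head."""
    def __init__(self, num_classes=37, emb_dim=128, depth=6):
        super().__init__()
        self.patch_embed = PatchEmbedding(emb_dim=emb_dim)          # sketched above
        self.add_tokens = AddClsTokenAndPosition(emb_dim=emb_dim)   # sketched above
        self.encoder = nn.Sequential(*[EncoderBlock(emb_dim=emb_dim) for _ in range(depth)])
        self.head = nn.Sequential(nn.LayerNorm(emb_dim), nn.Linear(emb_dim, num_classes))

    def forward(self, images):
        x = self.add_tokens(self.patch_embed(images))
        x = self.encoder(x)
        return self.head(x[:, 0])   # class logits from the class token
```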
The training loop includes defining the loss function and optimizer and running the training process for several epochs. The model's performance is monitored on validation data throughout training.
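A minimal sketch of such a loop (the AdamW optimizer, learning rate, epoch count, and the train_loader DataLoader are assumptions, not the notebook's exact settings):

```python
import torch
import torch.nn as nn

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = ViT().to(device)                                   # model sketched above
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

for epoch in range(10):
    model.train()
    running_loss = 0.0
    for images, labels in train_loader:                    # assumed DataLoader over the training split
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    print(f'epoch {epoch + 1}: mean training loss = {running_loss / len(train_loader):.4f}')
```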
The model's performance is evaluated on a test dataset by computing metrics such as accuracy, which assess how well the model has learned to classify images.
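Accuracy can be computed along these lines (a sketch assuming a test_loader DataLoader built from the test split, which is not shown here):

```python
import torch

model.eval()
correct, total = 0, 0
with torch.no_grad():
    for images, labels in test_loader:                 # assumed DataLoader over the test split
        images, labels = images.to(device), labels.to(device)
        preds = model(images).argmax(dim=1)            # predicted class per image
        correct += (preds == labels).sum().item()
        total += labels.size(0)
print(f'test accuracy: {correct / total:.2%}')
```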
Although the results documented in this notebook are not strong, I recommend continuing training for more epochs to obtain better results. My priority here is to understand the architecture of ViTs.
Special thanks to the developers of the libraries and tools used in this project, and to the open-source community for their invaluable contributions.