Image Classification with Vision Transformers: An Experimental Study

This project implements an image classification pipeline using the CIFAR-100 dataset by leveraging a Vision Transformer (ViT) model, as described in the paper An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (Dosovitskiy et al., 2021). The project includes scripts for preprocessing, training, and testing the model.


Table of Contents

  • Environment Setup
  • Dataset Preparation
  • Training the Model
  • Testing and Evaluation
  • Customization and Hyperparameters
  • Summary of Steps


Environment Setup

  1. Python Version:
    This project requires Python 3.10+.

  2. Dependencies: Create a new Conda (or Miniconda) environment and install the required Python packages by running:

conda env create -f environment.yml -n vit
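The repository's environment.yml is not reproduced here; as a rough sketch, based on the Key Libraries list below, it might look like the following (package names and version pins are assumptions, not the actual file contents):

```yaml
# Hypothetical environment.yml sketch; the repository's actual file may differ.
name: vit
channels:
  - pytorch
  - conda-forge
dependencies:
  - python>=3.10
  - pytorch
  - torchvision
  - opencv
  - scikit-learn
  - tqdm
  - numpy
  - pillow
```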

Key Libraries

  • PyTorch
  • torchvision
  • OpenCV
  • scikit-learn
  • tqdm
  • numpy
  • Pillow

Hardware Requirements

A CUDA-enabled GPU is recommended for training. The code automatically detects GPU availability.
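The GPU-detection step typically amounts to a single line; a minimal sketch of the standard PyTorch idiom:

```python
import torch

# Pick the GPU when one is visible to PyTorch, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Training on: {device}")
```

Models and batches are then moved to this device with `.to(device)`.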


Dataset Preparation

Preprocessing and Partitioning the Data

Before training, the raw images must be resized, normalized, and partitioned into the training, validation, and test datasets. Data augmentation (RandomHorizontalFlip) is applied to the training set to increase image diversity. The preprocessing module includes functions for resizing, normalizing, augmenting, and partitioning the data.

To run the preprocessing script, enter the following at the command line:

python preprocess.py

The preprocess.py script will download the CIFAR-100 dataset via torchvision and create the DataLoaders for the training, validation, and test datasets.
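The training overview below notes that the split uses stratified sampling, i.e. every class keeps the same train/validation ratio. A minimal pure-Python sketch of that idea (the actual script may instead rely on scikit-learn's `train_test_split` with `stratify=`):

```python
import random
from collections import defaultdict

def stratified_split(labels, val_frac=0.1, seed=0):
    """Split sample indices into train/val so each class keeps the same ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_val = max(1, int(len(indices) * val_frac))  # at least one val sample per class
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return train_idx, val_idx

# Toy labels: 2 classes with 10 samples each -> 1 val sample per class at 10%.
labels = [0] * 10 + [1] * 10
train_idx, val_idx = stratified_split(labels, val_frac=0.1)
print(len(train_idx), len(val_idx))  # 18 2
```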


Training the Model

Step 1: Run Training

Execute the training script from your terminal; the redirection saves the console output to a log file:

python run_train_test.py --mode train > train_log.txt

During training, the script will:

  • Load the CIFAR-100 dataset.
  • Split the dataset into training, validation, and test sets using stratified sampling.
  • Apply data augmentation techniques (resizing, random flips, normalization).
  • Create custom PyTorch Datasets and DataLoaders.
  • Initialize the ViT model using the ViT Base-16 architecture.
  • Set up the loss function, optimizer, and learning rate scheduler.
  • Run the training loop while tracking loss and accuracy, saving the best model weights.
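The last few steps above can be sketched in miniature. Everything in this snippet is a stand-in (a linear classifier on random data rather than ViT-B/16 on CIFAR-100, with illustrative hyperparameter values); only the loop structure of loss, optimizer, scheduler, and best-weights tracking mirrors the description:

```python
import copy
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(16, 100)                    # stand-in for the ViT-B/16 model
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.5)

x, y = torch.randn(64, 16), torch.randint(0, 100, (64,))  # fake batch of data
best_acc, best_state = 0.0, None
for epoch in range(5):                        # n_epochs
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    scheduler.step()

    model.eval()
    with torch.no_grad():
        acc = (model(x).argmax(dim=1) == y).float().mean().item()
    if best_state is None or acc > best_acc:  # keep the best weights seen so far
        best_acc, best_state = acc, copy.deepcopy(model.state_dict())
        # the real script would persist these, e.g. torch.save(best_state, ...)
print(f"best accuracy: {best_acc:.3f}")
```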

Testing and Evaluation

Step 2: Run Testing

Execute the testing script from your terminal:

python run_train_test.py --mode test > test_log.txt

Testing Script Overview

The testing script will:

  • Load the dataset splits (previously saved during training).
  • Create a DataLoader for the test set.
  • Load the trained model checkpoint.
  • Evaluate the model on the test data by computing overall accuracy, generating classification reports, and optionally producing confusion matrices.
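The core accuracy and confusion-matrix bookkeeping can be sketched in a few lines of pure Python (the actual script may use scikit-learn's `classification_report` and `confusion_matrix` instead; the predictions below are made up for illustration):

```python
from collections import Counter

def accuracy(preds, targets):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == t for p, t in zip(preds, targets))
    return correct / len(targets)

preds   = [0, 1, 1, 2, 2, 2]   # hypothetical model predictions
targets = [0, 1, 2, 2, 2, 1]   # hypothetical ground-truth labels
print(accuracy(preds, targets))          # 4 of 6 correct

# A confusion matrix can be accumulated as (true, predicted) pair counts:
confusion = Counter(zip(targets, preds))
print(confusion[(2, 2)])                 # class-2 samples predicted as class 2
```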

Customization and Hyperparameters

You can modify several parameters to experiment with different settings:

Model Parameters

  • --mode: Choose between train (default) or test.

Training Parameters

  • --batch_size, --learning_rate, --n_epochs, and --dropout control the training dynamics.

By tweaking these parameters, you can study their impact on model performance and experiment with different network configurations.
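The flags above suggest an argparse CLI along these lines (a sketch only: the default values shown here are assumptions, not the script's actual settings):

```python
import argparse

# Hypothetical CLI mirroring the documented flags; defaults are illustrative.
parser = argparse.ArgumentParser(description="ViT CIFAR-100 train/test driver")
parser.add_argument("--mode", choices=["train", "test"], default="train")
parser.add_argument("--batch_size", type=int, default=64)
parser.add_argument("--learning_rate", type=float, default=3e-4)
parser.add_argument("--n_epochs", type=int, default=10)
parser.add_argument("--dropout", type=float, default=0.1)

# Parse an example argument list instead of sys.argv, for demonstration.
args = parser.parse_args(["--mode", "test", "--batch_size", "128"])
print(args.mode, args.batch_size)  # test 128
```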


Summary of Steps

  • Step 0: Dataset Preparation
    Download and organize the CIFAR-100 dataset. The preprocess.py script will do this for you.

  • Step 1: Run Training
    Execute run_train_test.py after configuring the --mode and other hyperparameters to train the model.

  • Step 2: Run Testing
    Execute run_train_test.py after updating the --mode argument. The script will load the weights from the best model checkpoint to evaluate the model.
