This project implements an image classification pipeline using the CIFAR-100 dataset by leveraging a Vision Transformer (ViT) model, as described in the paper An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (Dosovitskiy et al., 2021). The project includes scripts for preprocessing, training, and testing the model.
- Environment Setup
- Dataset Preparation
- Preprocessing
- Training the Model
- Testing and Evaluation
- Project Structure
- Customization and Hyperparameters
- Python Version: This project requires Python 3.10+.
- Dependencies: Create a new Conda (or Miniconda) environment and install the required Python packages by running:

```
conda env create -f environment.yml -n vit
```

The environment includes:

- PyTorch
- torchvision
- OpenCV
- scikit-learn
- tqdm
- numpy
- Pillow
A CUDA-enabled GPU is recommended for training. The code automatically detects GPU availability.
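The automatic GPU detection mentioned above is typically a one-liner with the standard PyTorch API:

```python
import torch

# Select the GPU when CUDA is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Models and tensors are then moved onto the selected device, e.g.:
# model = model.to(device)
```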
Before training, the raw images must be resized, normalized, and then partitioned into the training, validation, and test datasets. The training dataset also has data augmentation applied to increase image diversity (RandomHorizontalFlip). The preprocessing module provides functions for each of these steps.
To run the preprocessing script, call the following in the command line:

```
python preprocess.py
```

The preprocess.py script will download the CIFAR-100 dataset via torchvision and create the DataLoaders for the training, validation, and test datasets.
Execute the training script from your terminal (the `Out-File` redirection assumes PowerShell):

```
python run_train_test.py --mode train | Out-File train_log.txt
```

The training script will:

- Load the image dataset.
- Split the dataset into training, validation, and test sets using stratified sampling.
- Apply data augmentation techniques (resizing, random flips, normalization).
- Create custom PyTorch Datasets and DataLoaders.
- Initialize the ViT model using the ViT Base-16 architecture.
- Set up the loss function, optimizer, and learning rate scheduler.
- Run the training loop while tracking loss and accuracy, saving the best model weights.
- Run the Testing Script:
Execute the testing script from your terminal:
```
python run_train_test.py --mode test | Out-File test_log.txt
```

The testing script will:
- Load the dataset splits (previously saved during training).
- Create a DataLoader for the test set.
- Load the trained model checkpoint.
- Evaluate the model on the test data by computing overall accuracy, generating classification reports, and optionally producing confusion matrices.
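The metrics listed above can be computed with scikit-learn once predictions are collected. A sketch, assuming `y_true`/`y_pred` label arrays gathered from the test DataLoader (the dummy labels below are illustrative only):

```python
import numpy as np
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

def evaluate_predictions(y_true, y_pred):
    """Return overall accuracy, a per-class report, and a confusion matrix."""
    acc = accuracy_score(y_true, y_pred)
    report = classification_report(y_true, y_pred, zero_division=0)
    cm = confusion_matrix(y_true, y_pred)
    return acc, report, cm

# Example with dummy labels; in the real script these come from the test set.
y_true = np.array([0, 1, 2, 2, 1])
y_pred = np.array([0, 1, 2, 1, 1])
acc, report, cm = evaluate_predictions(y_true, y_pred)
```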
You can modify several parameters to experiment with different settings:
- `--mode`: Choose between `train` (default) or `test`.
- `--batch_size`, `--learning_rate`, `--n_epochs`, and `--dropout` control the training dynamics.
By tweaking these parameters, you can study their impact on model performance and experiment with different network configurations.
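A plausible argument parser for the flags listed above; the defaults here are illustrative assumptions, not taken from the repository:

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(description="Train or test the ViT classifier.")
    parser.add_argument("--mode", choices=["train", "test"], default="train",
                        help="Run the training or the testing pipeline.")
    parser.add_argument("--batch_size", type=int, default=64)         # illustrative default
    parser.add_argument("--learning_rate", type=float, default=1e-3)  # illustrative default
    parser.add_argument("--n_epochs", type=int, default=20)           # illustrative default
    parser.add_argument("--dropout", type=float, default=0.1)         # illustrative default
    return parser

# Example: override two flags, leave the rest at their defaults.
args = build_parser().parse_args(["--mode", "test", "--batch_size", "32"])
```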
- Step 0: Dataset Preparation
  Download and organize the CIFAR-100 dataset. The `preprocess.py` script will do this for you.
- Step 1: Run Training
  Execute `run_train_test.py` after configuring the `--mode` and other hyperparameters to train the model.
- Step 2: Run Testing
  Execute `run_train_test.py` after updating the `--mode` argument. The script will load the weights from the best model checkpoint to evaluate the model.
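Checkpoint saving and loading in the steps above typically use PyTorch's `state_dict` mechanism. In this sketch, `best_model.pt` is a hypothetical filename and the tiny linear model is a stand-in; the actual script restores the ViT weights saved during training.

```python
import torch
import torch.nn as nn

# Tiny stand-in model; in the real script this is the ViT built for training.
model = nn.Linear(8, 4)

# During training: save the best-performing weights ...
torch.save(model.state_dict(), "best_model.pt")

# ... and during testing: restore them before evaluation.
state = torch.load("best_model.pt", map_location="cpu")
model.load_state_dict(state)
model.eval()  # disables dropout/batch-norm updates for deterministic evaluation
```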