| Name | Roll Number |
|---|---|
| Sai Deepak Reddy Mara | 22CS10066 |
| Byreddi Sri Chaitanya | 22CS10018 |
| Angalakuduru Tejasree Sai | 22CS10009 |
Key tasks include:
- Building and training a custom ViT-GPT2 model for image captioning.
- Benchmarking this custom model against a pre-trained SmolVLM.
- Evaluating the robustness of both models on occluded images.
- Creating a BERT-based classifier to identify which model generated a given caption.
- `part_a.ipynb`
- `part_b.ipynb`
- `part_c.ipynb`
- `report.pdf`
This part focuses on Automatic Image Captioning using:
- Zero-shot captioning with SmolVLM
- A custom encoder-decoder model (ViT encoder + transformer decoder)
- Evaluation using BLEU, ROUGE-L, and METEOR
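As a reminder of what these n-gram metrics capture, the core of BLEU-1 (clipped unigram precision) can be sketched with the standard library alone; the captions below are hypothetical examples, not from the dataset:

```python
from collections import Counter

def unigram_precision(candidate: str, reference: str) -> float:
    """Clipped unigram precision -- the building block of BLEU-1."""
    cand, ref = candidate.split(), reference.split()
    cand_counts, ref_counts = Counter(cand), Counter(ref)
    # Clip each candidate word's count by its count in the reference,
    # so repeating a word cannot inflate the score.
    overlap = sum(min(n, ref_counts[w]) for w, n in cand_counts.items())
    return overlap / len(cand)

score = unigram_precision("a dog runs across the grass",
                          "a dog is running on the grass")
print(f"{score:.3f}")  # → 0.667 (4 of 6 candidate words appear in the reference)
```

The `evaluate` library used in the notebooks adds higher-order n-grams, a brevity penalty (BLEU), longest-common-subsequence matching (ROUGE-L), and synonym/stem matching (METEOR) on top of this idea.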
- Training Note:
  - Implemented custom early stopping for the ViT-GPT2 model: training stops if the validation loss does not improve for 3 consecutive epochs, and the model with the best validation loss is saved.
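The early-stopping rule described above can be sketched as follows (a minimal illustration with made-up loss values, not the notebook's exact code):

```python
# Stop when validation loss fails to improve for `patience` consecutive
# epochs; remember which epoch had the best (lowest) validation loss.
patience = 3
best_loss = float("inf")
epochs_without_improvement = 0
best_state = None

# Hypothetical per-epoch validation losses for illustration.
val_losses = [0.91, 0.74, 0.70, 0.71, 0.72, 0.73]

for epoch, val_loss in enumerate(val_losses):
    if val_loss < best_loss:
        best_loss = val_loss
        epochs_without_improvement = 0
        best_state = epoch  # in practice: copy.deepcopy(model.state_dict())
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            print(f"Stopping at epoch {epoch}; best epoch was {best_state}")
            break
```

With these values, training halts after three non-improving epochs and the epoch-2 checkpoint is the one kept.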
- Execution Note:
- GPU Constraints: Running SmolVLM inference, custom model training, and custom model evaluation within a single session exceeded GPU memory limits.
- Training Workaround: The custom model was trained after restarting the session once the initial SmolVLM evaluation had run, to ensure sufficient GPU memory.
- METEOR Evaluation: Calculating the METEOR score on Kaggle caused errors. The final evaluation results, including METEOR, were obtained by running in Colab.
- Implemented random patch masking (10%, 50%, 80% occlusion).
- Evaluated both SmolVLM and the custom model on these perturbed images.
- Generated the dataset for Part C (`final_raw_results.csv`).
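A minimal sketch of the patch-masking step, assuming square non-overlapping patches and a NumPy image array (the notebook's exact masking code may differ):

```python
import numpy as np

def mask_random_patches(img: np.ndarray, patch: int = 16, ratio: float = 0.5,
                        seed: int = 0) -> np.ndarray:
    """Zero out a random `ratio` fraction of non-overlapping patch x patch squares."""
    rng = np.random.default_rng(seed)
    out = img.copy()
    h, w = img.shape[:2]
    rows, cols = h // patch, w // patch
    n_patches = rows * cols
    n_mask = int(round(n_patches * ratio))
    # Pick which patches to occlude, without replacement.
    for idx in rng.choice(n_patches, size=n_mask, replace=False):
        r, c = divmod(int(idx), cols)
        out[r * patch:(r + 1) * patch, c * patch:(c + 1) * patch] = 0
    return out

# Hypothetical 224x224 RGB image (all white) for illustration.
img = np.full((224, 224, 3), 255, dtype=np.uint8)
occluded = mask_random_patches(img, ratio=0.5)
```

Running this with `ratio=0.1`, `0.5`, and `0.8` produces the three occlusion levels; a 16-pixel patch size matches the ViT encoder's patch grid.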
- Built a dataset from Part B results.
- Trained a BERT-based classifier (`bert-base-uncased`) to distinguish between SmolVLM and custom model captions.
- Training Note:
- The classifier was trained for a fixed number of epochs.
- Plots comparing training and validation loss were analyzed after training; 5 epochs was found to be optimal.
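The epoch selection described above amounts to reading off where the validation curve bottoms out; a sketch with hypothetical loss values (the real curves come from the notebook's training run):

```python
# Hypothetical per-epoch losses for illustration.
train_loss = [0.62, 0.41, 0.28, 0.19, 0.13, 0.10, 0.08, 0.07]
val_loss   = [0.58, 0.44, 0.37, 0.33, 0.31, 0.32, 0.34, 0.37]

# The optimal epoch (1-indexed) is where validation loss is lowest;
# past it, training loss keeps falling while validation loss rises,
# which is the classic overfitting signature the plots reveal.
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1
print(best_epoch)  # → 5
```

In practice the same two lists are passed to `matplotlib.pyplot.plot` to produce the comparison figure.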
- PART A Dataset: Dataset Link
- Generated data from Part B for Part C: Dataset (csv) Link
- SmolVLM: (`HuggingFaceTB/SmolVLM-Instruct`)
- Custom Captioning: ViT Encoder (`WinKawaks/vit-small-patch16-224`) + GPT-2 Decoder (`gpt2`)
- Trained custom model for our dataset: Model Link
- Classifier: BERT-base-uncased (`google-bert/bert-base-uncased`)
Install required Python libraries. You can use the following command:

`pip install torch transformers evaluate rouge_score nltk pandas Pillow gdown matplotlib scikit-learn tqdm`

- Notebooks include `gdown` commands to download:
  - The dataset (zip format)
  - The custom model (`vit_gpt2_caption_model.pth`)
  - Evaluation results (`final_raw_results.csv`)
Note: If you have already downloaded these files and placed them correctly, you can skip or comment out the `gdown` cells.
Ensure the dataset is extracted into the `custom_captions_dataset/` directory.
Run the notebooks in the following order:
- Run all cells in sequence
- Restart the runtime between the following stages (if needed):
- SmolVLM evaluation
- Custom model training
- This step saves the model as `vit_gpt2_caption_model.pth`
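Saving and restoring a `.pth` checkpoint follows PyTorch's usual `state_dict` pattern; a tiny stand-in model keeps this sketch self-contained (the notebook saves the full ViT-GPT2 model the same way, under the real filename):

```python
import torch
from torch import nn

# Stand-in model for illustration; replace with the ViT-GPT2 captioner.
model = nn.Linear(4, 2)
torch.save(model.state_dict(), "demo_model.pth")

# Later (or in part_b.ipynb): rebuild the architecture, then load the weights.
restored = nn.Linear(4, 2)
restored.load_state_dict(torch.load("demo_model.pth"))
restored.eval()  # switch to inference mode before captioning
```

Saving only the `state_dict` (rather than the whole model object) keeps the checkpoint portable across notebook sessions, which matters given the restart workaround above.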
Notebook: part_b.ipynb
- Run after completing Part A
- Uses the model saved in Part A
- Saves intermediate results to `part_b_raw_results_csv/`
- Merges all results into `final_raw_results.csv`
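The merge step can be sketched with pandas; the file names and columns below are hypothetical stand-ins for the real intermediate CSVs:

```python
import pandas as pd
from pathlib import Path

# Hypothetical layout: one CSV per evaluation chunk in part_b_raw_results_csv/.
parts_dir = Path("part_b_raw_results_csv")
parts_dir.mkdir(exist_ok=True)
pd.DataFrame({"image": ["img1.jpg"], "model": ["smolvlm"],
              "caption": ["a dog on grass"]}).to_csv(parts_dir / "chunk_1.csv", index=False)
pd.DataFrame({"image": ["img1.jpg"], "model": ["custom"],
              "caption": ["dog running"]}).to_csv(parts_dir / "chunk_2.csv", index=False)

# Concatenate every intermediate file into one frame and write the merged CSV.
merged = pd.concat(
    [pd.read_csv(p) for p in sorted(parts_dir.glob("*.csv"))],
    ignore_index=True,
)
merged.to_csv("final_raw_results.csv", index=False)
```

Writing intermediate CSVs per chunk also means a session restart mid-evaluation loses only the current chunk, not everything.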
Notebook: part_c.ipynb
- Run after Part B is complete
- Requires `final_raw_results.csv` from Part B
- Trains and evaluates a caption classifier
- Saves the final classifier as `bert_caption_classifier_final.pth`
A GPU environment (like Google Colab T4 or Kaggle GPU) is required for efficient model training and evaluation.