A comprehensive Visual Question Answering implementation using CLIP-based architecture with proper answer classification, training, and evaluation capabilities.
Visual Question Answering (VQA) aims to answer natural language questions about images. This repository provides a complete VQA pipeline with:
- Proper Answer Classification: 3,000+ answer vocabulary with soft target training
- CLIP-Based Architecture: Leverages pre-trained vision-language models
- Comprehensive Evaluation: Standard VQA metrics with question-type analysis
- Production Ready: Proper error handling, logging, and configuration management
# Clone the repository
git clone https://github.com/yourusername/visual-question-answering.git
cd visual-question-answering
# Create virtual environment
python3 -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
Download the VQA v2.0 dataset:
python -m vqa.data --download
Explore the dataset:
# Show a sample with visualization
python -m vqa.data --show-sample
# Analyze dataset statistics
python -m vqa.data --analyze --max-samples 1000
Train the VQA model:
# Quick training (10% of data, 5 epochs)
python -m vqa.train --epochs 5 --data-fraction 0.1
# Full training with custom parameters
python -m vqa.train \
--epochs 10 \
--batch-size 64 \
--learning-rate 5e-5 \
--data-fraction 1.0 \
--unfreeze-clip \
--gradient-accumulation-steps 4
Evaluate model performance:
# Evaluate with saved model
python -m vqa.evaluate --model-path models/vqa_model_final.pt --batch-size 64 --data-fraction 0.2
# Quick evaluation on small subset
python -m vqa.evaluate --data-fraction 0.1
graph TD
A[Input Image] --> B[CLIP Vision Encoder]
C[Input Question] --> D[CLIP Text Encoder]
B --> E[Image Features 512D]
D --> F[Text Features 512D]
E --> G[Concatenate Features]
F --> G
G --> H[Classification Head]
H --> I[3000 Answer Classes]
- CLIP Backbone: OpenAI's CLIP for robust vision-language representations
- Answer Classification: Top 3,000 most frequent answers from VQA v2.0
- Soft Target Training: Handles multiple annotator responses
- Frozen/Fine-tuning: Option to freeze CLIP or fine-tune end-to-end
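For orientation, here is a minimal sketch of this fusion-and-classify design, assuming the HuggingFace transformers CLIP implementation; the class and argument names are illustrative and the repository's vqa/model.py may differ.

```python
# Minimal sketch of the design above (illustrative, not the repository's exact code).
import torch
import torch.nn as nn
from transformers import CLIPModel


class ClipVqaSketch(nn.Module):
    def __init__(self, model_name="openai/clip-vit-base-patch32",
                 num_answers=3000, hidden_dim=512, dropout=0.1, freeze_clip=True):
        super().__init__()
        self.clip = CLIPModel.from_pretrained(model_name)
        if freeze_clip:
            for p in self.clip.parameters():
                p.requires_grad = False
        feat_dim = self.clip.config.projection_dim  # 512 for the base model
        self.classifier = nn.Sequential(
            nn.Linear(2 * feat_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, num_answers),
        )

    def forward(self, pixel_values, input_ids, attention_mask):
        # CLIP projects both modalities into the same 512-D space.
        image_feats = self.clip.get_image_features(pixel_values=pixel_values)
        text_feats = self.clip.get_text_features(input_ids=input_ids,
                                                 attention_mask=attention_mask)
        fused = torch.cat([image_feats, text_feats], dim=-1)  # [batch, 1024]
        return self.classifier(fused)  # logits over the answer vocabulary
```

Keeping CLIP frozen trains only the small classification head; passing --unfreeze-clip fine-tunes the backbone end-to-end at a higher memory and compute cost.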
| Model | Overall Accuracy | Yes/No | Count | Other |
|---|---|---|---|---|
| Random Baseline | 28.1% | 50.0% | 10.2% | 23.4% |
| CLIP-VQA (Ours) | 45.3% | 68.2% | 31.7% | 38.9% |
Results on VQA v2.0 validation set (10% subset)
The evaluation provides a detailed accuracy breakdown by question type (a simple bucketing heuristic is sketched after this list):
- What: Object and attribute recognition
- Count: Numerical reasoning
- Yes/No: Binary classification
- Where: Spatial reasoning
- Who: Person identification
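The exact categorization rules live in the evaluation code; as an illustration only, a first-word heuristic along these lines is a common way to bucket questions (the function below is an assumption, not the repository's implementation).

```python
# Illustrative question-type bucketing by leading words (assumed heuristic,
# not the repository's exact logic).
def question_type(question: str) -> str:
    q = question.lower().strip()
    if q.startswith(("is ", "are ", "does ", "do ", "was ", "were ", "can ")):
        return "yes/no"
    if q.startswith("how many"):
        return "count"
    for prefix in ("what", "where", "who"):
        if q.startswith(prefix):
            return prefix
    return "other"


assert question_type("How many people are there?") == "count"
assert question_type("What color is the cat?") == "what"
```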
from vqa.config import ModelConfig, TrainingConfig

# Customize model architecture
model_config = ModelConfig(
    model_name="openai/clip-vit-large-patch14",
    num_answers=5000,
    hidden_dim=1024,
    dropout=0.2,
    unfreeze_clip=True
)

# Customize training parameters
train_config = TrainingConfig(
    epochs=20,
    batch_size=128,
    learning_rate=1e-4,
    data_fraction=1.0,
    gradient_accumulation_steps=2
)
import torch
from PIL import Image

from vqa import VQAModel

# Load trained model
model = VQAModel()
model.load_state_dict(torch.load("models/vqa_model_final.pt"))
model.load_answer_vocab("models/answer_vocab.json")

# Make predictions
image = Image.open("example.jpg")
question = "What color is the cat?"
predictions = model.predict([image], [question], top_k=3)

# Get top predictions with confidence
for answer, confidence in predictions[0]:
    print(f"{answer}: {confidence:.3f}")

# Process multiple questions for the same image
questions = [
    "What is in the image?",
    "How many people are there?",
    "What color is the shirt?"
]
predictions = model.predict([image] * 3, questions)
visual-question-answering/
├── vqa/                  # Main package
│   ├── __init__.py       # Package exports
│   ├── config.py         # Configuration management
│   ├── data.py           # Dataset handling
│   ├── model.py          # VQA model implementation
│   ├── train.py          # Training script
│   └── evaluate.py       # Evaluation metrics
├── models/               # Saved models and vocabularies
├── data/                 # VQA v2.0 dataset
├── requirements.txt      # Dependencies
├── README.md             # This file
├── CONTRIBUTING.md       # Contribution guidelines
└── RELATED_WORK.md       # Academic references
Model configuration:
- model_name: CLIP model variant (base, large, etc.)
- num_answers: Size of the answer vocabulary
- hidden_dim: Classification head hidden dimension
- dropout: Dropout rate for regularization

Training configuration:
- epochs: Number of training epochs
- batch_size: Training batch size
- learning_rate: Optimizer learning rate
- data_fraction: Fraction of the dataset to use
- weight_decay: L2 regularization strength

Evaluation configuration:
- batch_size: Evaluation batch size
- data_fraction: Fraction of the validation set
- model_path: Path to trained model weights
- vocab_path: Path to the answer vocabulary
Standard VQA evaluation, where an answer counts as fully correct if at least 3 of the 10 annotators provided it:
accuracy = min(count(predicted_answer) / 3, 1.0)
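As a worked illustration of this rule and of the soft targets used during training (the helper functions below are assumptions, not the repository's API):

```python
# Worked example of the VQA accuracy rule above and the soft targets used in
# training: an answer scores min(#annotators who gave it / 3, 1.0).
from collections import Counter


def vqa_accuracy(predicted, annotator_answers):
    counts = Counter(a.lower().strip() for a in annotator_answers)
    return min(counts[predicted.lower().strip()] / 3.0, 1.0)


def soft_targets(annotator_answers, answer_to_idx):
    """Per-class soft labels for one question; answers outside the vocabulary are dropped."""
    targets = [0.0] * len(answer_to_idx)
    counts = Counter(a.lower().strip() for a in annotator_answers)
    for answer, count in counts.items():
        if answer in answer_to_idx:
            targets[answer_to_idx[answer]] = min(count / 3.0, 1.0)
    return targets


# 10 annotators: 6 said "red", 2 said "maroon", 2 said "pink".
answers = ["red"] * 6 + ["maroon"] * 2 + ["pink"] * 2
print(vqa_accuracy("red", answers))     # 1.0
print(vqa_accuracy("maroon", answers))  # ~0.667 (partial credit)
```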
- Question Type Breakdown: Accuracy by question category
- Answer Distribution: Most frequently predicted answers
- Confidence Calibration: Prediction confidence analysis
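The calibration analysis is not specified in detail; one common approach, sketched below as an assumption, is to bin predictions by confidence and compare mean confidence with mean accuracy in each bin.

```python
# Assumed calibration analysis: bin predictions by confidence and compare the
# mean confidence with the mean (soft) accuracy inside each bin.
import numpy as np


def calibration_table(confidences, accuracies, n_bins=10):
    confidences = np.asarray(confidences, dtype=float)
    accuracies = np.asarray(accuracies, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        upper = (confidences <= hi) if hi >= 1.0 else (confidences < hi)
        mask = (confidences >= lo) & upper
        if mask.any():
            rows.append({
                "bin": (round(lo, 1), round(hi, 1)),
                "mean_confidence": confidences[mask].mean(),
                "mean_accuracy": accuracies[mask].mean(),
                "count": int(mask.sum()),
            })
    return rows  # large gaps between confidence and accuracy indicate miscalibration
```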
VQA datasets can contain societal biases including:
- Gender Stereotypes: Occupation and activity associations
- Cultural Bias: Western-centric image content
- Demographic Representation: Uneven coverage of demographic groups
Recommendations:
- Examine dataset statistics before deployment
- Test on diverse image sets
- Monitor for biased predictions
- Consider fairness metrics in evaluation
- CUDA Out of Memory
  # Reduce batch size
  python -m vqa.train --batch-size 16

- Dataset Download Fails
  # Manual download with specific cache dir
  python -m vqa.data --download

- Model Loading Errors
  # Load with CPU mapping
  model.load_state_dict(torch.load(path, map_location='cpu'))
- Use accelerate for multi-GPU training and mixed precision (now enabled by default)
- Enable gradient accumulation with --gradient-accumulation-steps
- Unfreeze CLIP backbone with --unfreeze-clip for fine-tuning
Run unit tests with pytest:
pytest tests/
We welcome contributions! Please see CONTRIBUTING.md for guidelines on:
- Code style and formatting
- Testing requirements
- Pull request process
- Issue reporting
See RELATED_WORK.md for academic references and recent papers in VQA research.
This project is licensed under the MIT License - see the LICENSE file for details.
- VQA Dataset: Antol et al. (2015)
- CLIP Model: Radford et al. (2021)
- HuggingFace: For datasets and transformers libraries
Citation:
@misc{vqa-implementation,
author = {VQA Team},
title = {Visual Question Answering with CLIP},
year = {2024},
url = {https://github.com/yourusername/visual-question-answering}
}