A learning project implementing Google's PaliGemma multimodal model from scratch using PyTorch.
This repository contains a from-scratch implementation of PaliGemma, Google's vision-language model that combines visual understanding with text generation capabilities. The implementation focuses on understanding the core components and architecture of modern multimodal transformers.
```
PaliGemma/
├── modeling_siglip.py       # SigLIP vision encoder implementation
├── processing_paligemma.py  # Image and text preprocessing pipeline
├── requirements.txt         # Project dependencies
└── README.md                # This file
```
- Vision Transformer Architecture: Implements the SigLIP vision encoder with patch-based image processing
- Multi-Head Attention: Scaled dot-product self-attention with multiple heads and dropout
- Image Tokenization: Converts images into sequences of patch embeddings
- Special Token Handling: Supports image tokens, location tokens, and segmentation tokens
- Preprocessing Pipeline: Complete image and text preprocessing for model inputs
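The multi-head attention listed above can be sketched as follows. This is a minimal illustration of scaled dot-product attention with dropout, not the repository's exact `modeling_siglip.py` code; class and parameter names are hypothetical.

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Minimal multi-head self-attention sketch (illustrative, not the
    repo's exact implementation)."""
    def __init__(self, embed_dim=64, num_heads=4, dropout=0.1):
        super().__init__()
        assert embed_dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.scale = self.head_dim ** -0.5  # 1/sqrt(d_k) scaling
        self.qkv = nn.Linear(embed_dim, 3 * embed_dim)
        self.out = nn.Linear(embed_dim, embed_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)           # (B, heads, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.scale  # scaled dot-product scores
        attn = self.dropout(attn.softmax(dim=-1))
        y = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.out(y)

x = torch.randn(2, 16, 64)      # (batch, tokens, embed_dim)
y = MultiHeadAttention()(x)     # output keeps the input shape
```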
- Images are divided into patches (default 16x16)
- Patches are linearly embedded and combined with positional embeddings
- Transformer encoder processes the sequence of patch embeddings
- Final layer normalization produces contextualized image representations
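The patch-embedding steps above can be sketched like this. It is a simplified stand-in for the SigLIP encoder's input stage, with assumed default sizes (224×224 images, 16×16 patches, a small embedding dimension); the repository's actual values may differ.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Sketch of SigLIP-style patch embedding with learned positional
    embeddings (hypothetical defaults, for illustration only)."""
    def __init__(self, image_size=224, patch_size=16, embed_dim=64, channels=3):
        super().__init__()
        self.num_patches = (image_size // patch_size) ** 2
        # A conv with kernel = stride = patch_size is equivalent to a
        # linear projection of each flattened patch.
        self.proj = nn.Conv2d(channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Embedding(self.num_patches, embed_dim)

    def forward(self, pixel_values):
        x = self.proj(pixel_values)          # (B, D, H/16, W/16)
        x = x.flatten(2).transpose(1, 2)     # (B, num_patches, D)
        positions = torch.arange(self.num_patches, device=x.device)
        return x + self.pos_embed(positions) # add learned positions

imgs = torch.randn(2, 3, 224, 224)
tokens = PatchEmbedding()(imgs)              # 196 patch tokens per image
```

The resulting token sequence is what the transformer encoder layers then process.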
- Text prompts are prefixed with image tokens
- Special tokens are added for object detection and segmentation tasks
- Standard tokenization with BOS tokens and newline formatting
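The prompt-assembly steps above can be sketched as a small helper. The token strings and the image-token count here are illustrative assumptions, not the exact output of `processing_paligemma.py`: image placeholder tokens first, then BOS, then the text prompt, terminated by a newline.

```python
def build_prompt(prefix: str,
                 bos_token: str = "<bos>",
                 image_token: str = "<image>",
                 num_image_tokens: int = 256) -> str:
    """Sketch of PaliGemma-style prompt assembly (token names and
    counts are assumptions for illustration)."""
    # Image tokens come first so the vision embeddings can be spliced
    # into those positions; the newline separates prefix from suffix.
    return image_token * num_image_tokens + bos_token + prefix + "\n"

prompt = build_prompt("caption en", num_image_tokens=4)
```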
This implementation serves as an educational project to understand:
- Vision Transformer (ViT) architecture
- Multi-head self-attention mechanisms
- Multimodal preprocessing pipelines
- Modern transformer design patterns
- PyTorch implementation best practices