This repository contains a from-scratch PyTorch implementation of a custom Vision-Language Model inspired by Google's PaliGemma architecture. It integrates a SigLIP-based vision encoder with a Gemma-based causal language decoder, creating a multimodal architecture capable of processing both images and text.
- Vision Encoder: Custom implementation of a SigLIP-style vision transformer.
- Language Decoder: Custom Gemma-style causal language model with rotary embeddings and KV caching.
- Multimodal Fusion: Merges image embeddings as tokens in the input sequence.
- Inference Script: Simple interface for running inference on image-text pairs.
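At a high level, the pieces connect roughly as follows. This is a minimal sketch, assuming a linear projector and a decoder that accepts `inputs_embeds`; the class and attribute names (`VisionLanguageModel`, `vision_tower`, `embed_tokens`, `image_token_index`) are illustrative, not necessarily the repo's actual identifiers.

```python
import torch
import torch.nn as nn

class VisionLanguageModel(nn.Module):
    """Illustrative wiring: SigLIP-style encoder -> projector -> Gemma-style decoder."""
    def __init__(self, vision_tower, language_model, vision_dim, text_dim, image_token_index):
        super().__init__()
        self.vision_tower = vision_tower          # produces [B, num_patches, vision_dim]
        self.projector = nn.Linear(vision_dim, text_dim)
        self.language_model = language_model      # causal decoder over [B, T, text_dim]
        self.image_token_index = image_token_index

    def forward(self, input_ids, pixel_values, attention_mask=None):
        image_feats = self.projector(self.vision_tower(pixel_values))  # [B, N, text_dim]
        inputs_embeds = self.language_model.embed_tokens(input_ids)    # [B, T, text_dim]
        # Overwrite the <image> placeholder positions with projected patch features.
        mask = (input_ids == self.image_token_index).unsqueeze(-1).expand_as(inputs_embeds)
        inputs_embeds = inputs_embeds.masked_scatter(mask, image_feats.to(inputs_embeds.dtype))
        return self.language_model(inputs_embeds=inputs_embeds, attention_mask=attention_mask)
```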
- Based on SigLIP (Sigmoid Loss for Language-Image Pre-training).
- A transformer-based image encoder that produces patch-level embeddings.
- Output shape: `[batch_size, num_patches, vision_hidden_dim]`
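A toy version of that encoder, using a stock `nn.TransformerEncoder` in place of a faithful SigLIP block; names and hyperparameters are illustrative.

```python
import torch
import torch.nn as nn

class PatchEncoder(nn.Module):
    """Toy SigLIP-style vision encoder: conv patchify + transformer layers."""
    def __init__(self, image_size=224, patch_size=16, hidden_dim=768, layers=4, heads=8):
        super().__init__()
        # Non-overlapping patches: stride == kernel_size.
        self.patch_embed = nn.Conv2d(3, hidden_dim, kernel_size=patch_size, stride=patch_size)
        num_patches = (image_size // patch_size) ** 2
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, hidden_dim))
        block = nn.TransformerEncoderLayer(hidden_dim, heads, 4 * hidden_dim,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(block, num_layers=layers)

    def forward(self, pixel_values):              # [B, 3, H, W]
        x = self.patch_embed(pixel_values)        # [B, D, H/ps, W/ps]
        x = x.flatten(2).transpose(1, 2)          # [B, num_patches, D]
        return self.encoder(x + self.pos_embed)   # [B, num_patches, vision_hidden_dim]

feats = PatchEncoder()(torch.randn(2, 3, 224, 224))
print(feats.shape)  # torch.Size([2, 196, 768])
```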
- Autoregressive transformer decoder.
- Components:
  - Rotary Positional Embeddings (RoPE)
  - RMSNorm
  - Grouped Query Attention (GQA)
  - Feedforward MLP block (GeGLU activation)
  - KV Cache for efficient inference
- Output shape: `[batch_size, seq_len, hidden_dim]`
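Two of those components in minimal form. The RMSNorm follows the Gemma convention of scaling by `1 + weight`, and the rotary helper uses the common split-half formulation; both are sketches under those assumptions, not the file's exact code, and `positions` is assumed to be a 1-D tensor of token positions.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square norm; Gemma-style scaling by (1 + weight)."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.zeros(dim))

    def forward(self, x):
        rms = torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return x * rms * (1.0 + self.weight)

def rotate_half(x):
    # Split the head dim in half and rotate: (x1, x2) -> (-x2, x1).
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rope(q, k, positions, head_dim, base=10000.0):
    """Rotary embeddings for q/k of shape [B, heads, T, head_dim]."""
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    freqs = positions.float()[:, None] * inv_freq[None, :]  # [T, head_dim/2]
    emb = torch.cat((freqs, freqs), dim=-1)                  # [T, head_dim]
    cos, sin = emb.cos(), emb.sin()
    return q * cos + rotate_half(q) * sin, k * cos + rotate_half(k) * sin
```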
- Image patch embeddings are projected to match text hidden dimension.
- These are injected into the token stream via special `<image>` tokens.
- The final embedding sequence combines text and image embeddings using attention masks.
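A sketch of that merge step, assuming the `<image>` placeholders are already present in `input_ids`; `image_token_index` and the helper name are illustrative.

```python
import torch

def merge_image_features(text_embeds, image_feats, input_ids, attention_mask,
                         image_token_index):
    """Scatter projected patch features into the <image> placeholder slots."""
    # text_embeds: [B, T, D], image_feats: [B, num_patches, D]
    mask = (input_ids == image_token_index).unsqueeze(-1).expand_as(text_embeds)
    merged = text_embeds.masked_scatter(mask, image_feats.to(text_embeds.dtype))
    # Position ids follow the attention mask so padding does not advance positions.
    position_ids = attention_mask.long().cumsum(-1) - 1
    position_ids = position_ids.masked_fill(attention_mask == 0, 0)
    return merged, position_ids
```

`masked_scatter` fills the `True` positions in order, so this assumes each sequence contains exactly `num_patches` placeholder tokens.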
The `PaliGemmaProcessor` prepares model inputs:
- Tokenize the input text prompt
- Resize and normalize the input image
- Replace `<image>` placeholder tokens with actual image features
- Image Processing
  - Resize image to `image_size x image_size`
  - Normalize using the vision model's mean/std
- Text Processing
  - Tokenize with the tokenizer
  - Inject the special image token (`<image>`, i.e. `image_token_index`)
- Fusion
  - Merge image tokens with text embeddings
  - Generate the correct attention mask and position IDs
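Condensed into code, the preprocessing looks roughly like this. PaliGemma's prompt layout puts the image tokens first, then BOS, then the text; `IMAGE_TOKEN`, `num_image_tokens`, and the normalization constants below are assumptions, not the repo's exact values.

```python
import numpy as np
import torch
from PIL import Image

IMAGE_TOKEN = "<image>"  # placeholder string; assumed to be in the tokenizer vocab

def preprocess(image_path, prompt, tokenizer, image_size=224, num_image_tokens=256,
               mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5)):
    # Image: resize, scale to [0, 1], normalize, HWC -> CHW, add batch dim.
    img = Image.open(image_path).convert("RGB").resize((image_size, image_size))
    arr = (np.asarray(img) / 255.0 - np.array(mean)) / np.array(std)
    pixel_values = torch.tensor(arr, dtype=torch.float32).permute(2, 0, 1).unsqueeze(0)

    # Text: prepend one <image> token per patch, then BOS, then the prompt.
    full_prompt = IMAGE_TOKEN * num_image_tokens + tokenizer.bos_token + prompt + "\n"
    enc = tokenizer(full_prompt, return_tensors="pt", add_special_tokens=False)
    return pixel_values, enc["input_ids"], enc["attention_mask"]
```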
- The inference script accepts: prompt, image path, tokenizer, model, and generation config
- Flow:
  - Preprocess input via `PaliGemmaProcessor`
  - Generate image + text embeddings
  - Inject into `GemmaForCausalLM`
  - Autoregressively generate tokens using top-p or greedy decoding
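A stripped-down version of that generation loop, shown without the KV cache for brevity. Greedy decoding takes the argmax; top-p (nucleus) sampling truncates the sorted distribution at cumulative probability `p`. The `model(...).logits` call and `eos_id` are placeholders for the repo's actual interfaces.

```python
import torch

def sample_top_p(logits, p=0.9, temperature=1.0):
    """Nucleus sampling: keep the smallest set of tokens with cumulative prob >= p."""
    probs = torch.softmax(logits / temperature, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True, dim=-1)
    cum = torch.cumsum(sorted_probs, dim=-1)
    sorted_probs[cum - sorted_probs > p] = 0.0  # drop tokens past the nucleus
    sorted_probs /= sorted_probs.sum(dim=-1, keepdim=True)
    next_sorted = torch.multinomial(sorted_probs, num_samples=1)
    return torch.gather(sorted_idx, -1, next_sorted)

@torch.no_grad()
def generate(model, input_ids, max_new_tokens=50, do_sample=True, top_p=0.9, eos_id=1):
    for _ in range(max_new_tokens):
        logits = model(input_ids).logits[:, -1, :]          # next-token logits [B, V]
        if do_sample:
            next_tok = sample_top_p(logits, p=top_p)
        else:
            next_tok = logits.argmax(dim=-1, keepdim=True)  # greedy
        input_ids = torch.cat([input_ids, next_tok], dim=-1)
        if (next_tok == eos_id).all():
            break
    return input_ids
```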
- Uses `KVCache` to store key/value states per layer.
- Only the new token is passed at each step (via `q_len = 1`).
- This greatly improves inference speed.
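A minimal cache along those lines; the real `KVCache` in `model_gemma.py` may differ in detail.

```python
import torch

class KVCache:
    """Per-layer key/value storage, appended along the sequence dimension."""
    def __init__(self):
        self.key_cache = []    # one tensor per layer: [B, num_kv_heads, seq_so_far, head_dim]
        self.value_cache = []

    def update(self, key_states, value_states, layer_idx):
        if len(self.key_cache) <= layer_idx:
            # First step: store the full prompt's keys/values for this layer.
            self.key_cache.append(key_states)
            self.value_cache.append(value_states)
        else:
            # Later steps: only the new token's K/V (q_len = 1) is appended.
            self.key_cache[layer_idx] = torch.cat(
                [self.key_cache[layer_idx], key_states], dim=2)
            self.value_cache[layer_idx] = torch.cat(
                [self.value_cache[layer_idx], value_states], dim=2)
        return self.key_cache[layer_idx], self.value_cache[layer_idx]
```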
```
├── model_gemma.py           # Core model components: Gemma decoder, attention, MLP
├── processing_paligemma.py  # Processor: image + text token prep
├── test_inference.py        # Inference runner with Fire CLI
├── utils.py                 # HF model loader and helper functions
├── launch_inference.sh      # Shell script for CLI testing
└── vlm_paligemma_model.png  # Architecture diagram
```
- Python 3.10+
- PyTorch
- torchvision
- transformers
- fire
- Pillow (PIL)
Install dependencies:
```bash
pip install torch torchvision transformers fire pillow
```

This project is heavily inspired by Google DeepMind's PaliGemma Vision-Language architecture.
This is a research and learning project. The implementation is completely custom and does not use pre-trained models.
This project is currently unlicensed. Feel free to fork and experiment for educational purposes. Contact the author if you plan to use this commercially.
Author: Dhruv Panchal (GitHub)
Feel free to star ⭐ this repo if you find it useful!
