A learning project implementing Google's PaliGemma multimodal model from scratch using PyTorch.
This repository contains a from-scratch implementation of PaliGemma, Google's vision-language model that combines visual understanding with text generation capabilities. The implementation focuses on understanding the core components and architecture of modern multimodal transformers.
```
PaliGemma/
├── modeling_siglip.py       # SigLIP vision encoder implementation
├── processing_paligemma.py  # Image and text preprocessing pipeline
├── requirements.txt         # Project dependencies
└── README.md                # This file
```
- Vision Transformer Architecture: Implements the SigLIP vision encoder with patch-based image processing
- Multi-Head Attention: Scaled dot-product self-attention with multiple heads and dropout
- Image Tokenization: Converts images into sequences of patch embeddings
- Special Token Handling: Supports image tokens, location tokens, and segmentation tokens
- Preprocessing Pipeline: Complete image and text preprocessing for model inputs
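The multi-head attention listed above can be sketched as follows. This is a minimal illustration of scaled dot-product attention with dropout, not the repository's exact `modeling_siglip.py` code; class and parameter names are hypothetical.

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Minimal multi-head self-attention sketch (illustrative, not the
    repo's exact implementation)."""
    def __init__(self, embed_dim=64, num_heads=4, dropout=0.1):
        super().__init__()
        assert embed_dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.scale = self.head_dim ** -0.5  # 1/sqrt(d_k) scaling
        self.qkv = nn.Linear(embed_dim, 3 * embed_dim)
        self.out = nn.Linear(embed_dim, embed_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)           # (B, heads, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.scale  # scaled dot-product scores
        attn = self.dropout(attn.softmax(dim=-1))
        y = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.out(y)

x = torch.randn(2, 16, 64)      # (batch, tokens, embed_dim)
y = MultiHeadAttention()(x)     # output keeps the input shape
```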
- Images are divided into patches (default 16x16)
- Patches are linearly embedded and combined with positional embeddings
- Transformer encoder processes the sequence of patch embeddings
- Final layer normalization produces contextualized image representations
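The patch-embedding steps above can be sketched like this. It is a simplified stand-in for the SigLIP encoder's input stage, with assumed default sizes (224×224 images, 16×16 patches, a small embedding dimension); the repository's actual values may differ.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Sketch of SigLIP-style patch embedding with learned positional
    embeddings (hypothetical defaults, for illustration only)."""
    def __init__(self, image_size=224, patch_size=16, embed_dim=64, channels=3):
        super().__init__()
        self.num_patches = (image_size // patch_size) ** 2
        # A conv with kernel = stride = patch_size is equivalent to a
        # linear projection of each flattened patch.
        self.proj = nn.Conv2d(channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Embedding(self.num_patches, embed_dim)

    def forward(self, pixel_values):
        x = self.proj(pixel_values)          # (B, D, H/16, W/16)
        x = x.flatten(2).transpose(1, 2)     # (B, num_patches, D)
        positions = torch.arange(self.num_patches, device=x.device)
        return x + self.pos_embed(positions) # add learned positions

imgs = torch.randn(2, 3, 224, 224)
tokens = PatchEmbedding()(imgs)              # 196 patch tokens per image
```

The resulting token sequence is what the transformer encoder layers then process.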
- Text prompts are prefixed with image tokens
- Special tokens are added for object detection and segmentation tasks
- Standard tokenization with BOS tokens and newline formatting
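The prompt-assembly steps above can be sketched as a small helper. The token strings and the image-token count here are illustrative assumptions, not the exact output of `processing_paligemma.py`: image placeholder tokens first, then BOS, then the text prompt, terminated by a newline.

```python
def build_prompt(prefix: str,
                 bos_token: str = "<bos>",
                 image_token: str = "<image>",
                 num_image_tokens: int = 256) -> str:
    """Sketch of PaliGemma-style prompt assembly (token names and
    counts are assumptions for illustration)."""
    # Image tokens come first so the vision embeddings can be spliced
    # into those positions; the newline separates prefix from suffix.
    return image_token * num_image_tokens + bos_token + prefix + "\n"

prompt = build_prompt("caption en", num_image_tokens=4)
```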
This implementation serves as an educational project to understand:
- Vision Transformer (ViT) architecture
- Multi-head self-attention mechanisms
- Multimodal preprocessing pipelines
- Modern transformer design patterns
- PyTorch implementation best practices