PaliGemma Implementation from Scratch

A learning project implementing Google's PaliGemma multimodal model from scratch using PyTorch.

Overview

This repository contains a from-scratch implementation of PaliGemma, Google's vision-language model that combines visual understanding with text generation capabilities. The implementation focuses on understanding the core components and architecture of modern multimodal transformers.

Project Structure

PaliGemma/
├── modeling_siglip.py      # SigLIP vision encoder implementation
├── processing_paligemma.py # Image and text preprocessing pipeline
├── requirements.txt        # Project dependencies
└── README.md              # This file

Key Features

  • Vision Transformer Architecture: Implements the SigLIP vision encoder with patch-based image processing
  • Multi-Head Attention: Self-attention mechanism with proper scaling and dropout
  • Image Tokenization: Converts images into sequences of patch embeddings
  • Special Token Handling: Supports image tokens, location tokens, and segmentation tokens
  • Preprocessing Pipeline: Complete image and text preprocessing for model inputs
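
The multi-head attention feature listed above can be sketched as follows. This is a minimal, self-contained version for illustration; the hyperparameters and class name are illustrative, not taken from this repository's code.

```python
import torch
import torch.nn as nn


class MultiHeadAttention(nn.Module):
    """Minimal multi-head self-attention with 1/sqrt(d_k) scaling and
    dropout, in the spirit of a ViT/SigLIP encoder block."""

    def __init__(self, embed_dim: int, num_heads: int, dropout: float = 0.0):
        super().__init__()
        assert embed_dim % num_heads == 0, "embed_dim must divide evenly"
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.scale = self.head_dim ** -0.5          # scaled dot-product
        self.qkv = nn.Linear(embed_dim, 3 * embed_dim)
        self.out_proj = nn.Linear(embed_dim, embed_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, C = x.shape
        # Project to queries, keys, values in one linear layer, then split.
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)        # each: (B, heads, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = self.dropout(attn.softmax(dim=-1))
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.out_proj(out)


x = torch.randn(2, 8, 64)                           # (batch, seq_len, embed_dim)
y = MultiHeadAttention(embed_dim=64, num_heads=4)(x)
print(y.shape)  # torch.Size([2, 8, 64])
```

The output keeps the input shape, so the block can be stacked inside a residual transformer layer.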

Architecture Details

Vision Processing

  1. Images are divided into patches (default 16x16)
  2. Patches are linearly embedded and combined with positional embeddings
  3. Transformer encoder processes the sequence of patch embeddings
  4. Final layer normalization produces contextualized image representations
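
Steps 1 and 2 above can be sketched with a strided convolution, a common way to implement patch embedding. The sizes used here (224x224 input, 768-dim embeddings) are illustrative defaults, not necessarily this repository's configuration.

```python
import torch
import torch.nn as nn


class PatchEmbedding(nn.Module):
    """Split an image into non-overlapping 16x16 patches, linearly embed
    each patch, and add learned positional embeddings."""

    def __init__(self, image_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # A Conv2d with kernel_size == stride == patch_size is equivalent to
        # cutting the image into patches and applying a shared linear projection.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))

    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
        x = self.proj(pixel_values)        # (B, embed_dim, H/16, W/16)
        x = x.flatten(2).transpose(1, 2)   # (B, num_patches, embed_dim)
        return x + self.pos_embed          # add positional information


img = torch.randn(1, 3, 224, 224)
tokens = PatchEmbedding()(img)
print(tokens.shape)  # torch.Size([1, 196, 768]) -- 14x14 = 196 patches
```

The resulting sequence of patch embeddings is what the transformer encoder in step 3 consumes.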

Text Processing

  1. Text prompts are prefixed with image tokens
  2. Special tokens are added for object detection and segmentation tasks
  3. Standard tokenization with BOS tokens and newline formatting
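
The prompt layout described above can be sketched as a simple string builder: image placeholder tokens first, then the BOS token and the text prefix, terminated by a newline. The token strings and the image-token count below are illustrative assumptions, not values read from this repository.

```python
def build_prompt(prefix: str,
                 bos_token: str = "<bos>",
                 image_token: str = "<image>",
                 num_image_tokens: int = 256) -> str:
    """Assemble a PaliGemma-style input prompt (illustrative token names):
    repeated image placeholders, BOS, the text prefix, and a trailing newline
    that separates the prefix from the model's generated suffix."""
    return image_token * num_image_tokens + bos_token + prefix + "\n"


prompt = build_prompt("caption en", num_image_tokens=4)
print(repr(prompt))  # '<image><image><image><image><bos>caption en\n'
```

In the real pipeline the placeholder positions are later overwritten with the vision encoder's patch embeddings, so the count must match the number of image patches.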

Learning Goals

This implementation serves as an educational project to understand:

  • Vision Transformer (ViT) architecture
  • Multi-head self-attention mechanisms
  • Multimodal preprocessing pipelines
  • Modern transformer design patterns
  • PyTorch implementation best practices
