This document explains the structure and functionality of the PaliGemmaProcessor class and its associated helper functions, as found in the processing_paligemma.py file. This processor is used in a Vision-Language Model (VLM) setup inspired by Google's PaliGemma model.
PaliGemmaProcessor prepares inputs for a vision-language model by:
- Preprocessing images (resize, normalize, etc.)
- Inserting visual tokens into text prompts
- Tokenizing the combined prompt
- Returning everything as PyTorch tensors
- Combines image and text preprocessing
- Handles special token logic (e.g., <image>, <locXXXX>, <segXXX>)
- Converts data into model-ready PyTorch tensors
- tokenizer: HuggingFace tokenizer
- num_image_tokens: Number of visual tokens to insert (e.g., 32)
- image_size: Target size to resize images to (e.g., 224)
- Adds special tokens to the tokenizer:
  - <image> (used to represent image tokens)
  - <locXXXX> (used for object detection)
  - <segXXX> (used for segmentation)
- Disables auto-BOS/EOS token addition
- Stores image token ID
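The initialization steps above can be sketched as follows. This is a hedged reconstruction, not the exact code from processing_paligemma.py: the token-formatting scheme (<loc0000>–<loc1023>, <seg000>–<seg127>) and the attribute names are assumptions based on PaliGemma's conventions.

```python
IMAGE_TOKEN = "<image>"

class PaliGemmaProcessor:
    def __init__(self, tokenizer, num_image_tokens: int, image_size: int):
        self.image_seq_length = num_image_tokens
        self.image_size = image_size

        # Register the placeholder image token as a special token.
        tokenizer.add_special_tokens({"additional_special_tokens": [IMAGE_TOKEN]})

        # 1024 object-detection tags and 128 segmentation tags.
        extra_tokens = [f"<loc{i:04d}>" for i in range(1024)]
        extra_tokens += [f"<seg{i:03d}>" for i in range(128)]
        tokenizer.add_tokens(extra_tokens)

        # Store the ID of the <image> token for later use.
        self.image_token_id = tokenizer.convert_tokens_to_ids(IMAGE_TOKEN)

        # BOS/EOS are added manually when the prompt string is built.
        tokenizer.add_bos_token = False
        tokenizer.add_eos_token = False
        self.tokenizer = tokenizer
```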
- text: List of 1 string prompt
- images: List of 1 PIL image
- Preprocesses the image via helper functions
- Adds <image> tokens and <s> (BOS) to the text prompt
- Tokenizes the updated prompt
- Converts everything to PyTorch tensors
A dictionary with:
{
"pixel_values": Tensor [1, 3, H, W],
"input_ids": Tensor [1, seq_len],
"attention_mask": Tensor [1, seq_len]
}

Resizes a PIL image to a fixed (width, height).
Scales pixel values from [0, 255] → [0, 1] using scale=1/255.0.
Normalizes image channels using ImageNet stats:
image = (image - mean) / std

Applies all of the above:
- Resize
- Rescale
- Normalize
- Transpose [H, W, C] → [C, H, W]
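The rescale/normalize/transpose steps above can be sketched in NumPy. The function names are hypothetical, and the ImageNet mean/std constants follow the text; resizing is assumed to happen on the PIL image before this point.

```python
import numpy as np

# ImageNet channel statistics, as described above (assumed values).
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])
IMAGENET_STD = np.array([0.229, 0.224, 0.225])

def rescale(image: np.ndarray, scale: float = 1 / 255.0) -> np.ndarray:
    # Map pixel values from [0, 255] to [0, 1].
    return image.astype(np.float32) * scale

def normalize(image: np.ndarray,
              mean: np.ndarray = IMAGENET_MEAN,
              std: np.ndarray = IMAGENET_STD) -> np.ndarray:
    # Standardize per channel: (x - mean) / std, broadcast over [H, W, C].
    return (image - mean) / std

def to_channel_first(image: np.ndarray) -> np.ndarray:
    # [H, W, C] -> [C, H, W], the layout PyTorch convolutions expect.
    return image.transpose(2, 0, 1)

def preprocess(image: np.ndarray) -> np.ndarray:
    # Resize is applied to the PIL image before this point (assumed split).
    return to_channel_first(normalize(rescale(image)))
```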
Returns a string like:
<image><image>...<image><s> Prompt text
These repeated tokens act as placeholders in the token stream and are later replaced by the actual image embeddings.
- <image>: Repeated before the prompt to reserve space for visual features.
- <locXXXX>: 1024 object detection tags
- <segXXX>: 128 segmentation labels
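The prompt-building step can be sketched as a one-line helper. The function name and default values are assumptions for illustration:

```python
def add_image_tokens_to_prompt(prefix_prompt: str,
                               bos_token: str = "<s>",
                               image_seq_len: int = 32,
                               image_token: str = "<image>") -> str:
    # One <image> placeholder per visual token, then BOS, then the prompt text.
    return f"{image_token * image_seq_len}{bos_token}{prefix_prompt}"
```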
Image (PIL) ──▶ Resize ──▶ Rescale ──▶ Normalize ──▶ Transpose ─▶ Torch Tensor
│
▼
Text Prompt ───────▶ Add <image> tokens ──▶ Tokenize ─▶ Torch Tensor
Final Output:
{
"pixel_values": [1, 3, 224, 224],
"input_ids": [1, seq_len],
"attention_mask": [1, seq_len]
}
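Putting the pieces together, the final assembly can be sketched as below. NumPy stands in for torch tensors here, and build_inputs is a hypothetical name; the real processor produces torch tensors directly.

```python
import numpy as np

def build_inputs(pixel_array: np.ndarray, token_ids: list[int]) -> dict:
    # A batch dimension of 1 mirrors the single (text, image) pair
    # the processor accepts.
    input_ids = np.array([token_ids])
    return {
        "pixel_values": pixel_array[None, ...],     # [1, 3, H, W]
        "input_ids": input_ids,                     # [1, seq_len]
        "attention_mask": np.ones_like(input_ids),  # [1, seq_len]
    }
```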