This document explains the structure and functionality of the PaliGemmaProcessor class and its associated helper functions, as found in the processing_paligemma.py file. This processor is used in a Vision-Language Model (VLM) setup inspired by Google's PaliGemma model.
PaliGemmaProcessor prepares inputs for a vision-language model by:
- Preprocessing images (resize, normalize, etc.)
- Inserting visual tokens into text prompts
- Tokenizing the combined prompt
- Returning everything as PyTorch tensors
- Combines image and text preprocessing
- Handles special token logic (e.g., <image>, <locXXXX>, <segXXX>)
- Converts data into model-ready PyTorch tensors
- tokenizer: HuggingFace tokenizer
- num_image_tokens: Number of visual tokens to insert (e.g., 32)
- image_size: Target size to resize images to (e.g., 224)
- Adds special tokens to the tokenizer:
  - <image> (used to represent image tokens)
  - <locXXXX> (used for object detection)
  - <segXXX> (used for segmentation)
- Disables auto-BOS/EOS token addition
- Stores image token ID
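The initialization steps above can be sketched as follows. This is a hedged reconstruction, not the exact code from processing_paligemma.py: the token-formatting scheme (<loc0000>–<loc1023>, <seg000>–<seg127>) and the attribute names are assumptions based on PaliGemma's conventions.

```python
IMAGE_TOKEN = "<image>"

class PaliGemmaProcessor:
    def __init__(self, tokenizer, num_image_tokens: int, image_size: int):
        self.image_seq_length = num_image_tokens
        self.image_size = image_size

        # Register the placeholder image token as a special token.
        tokenizer.add_special_tokens({"additional_special_tokens": [IMAGE_TOKEN]})

        # 1024 object-detection tags and 128 segmentation tags.
        extra_tokens = [f"<loc{i:04d}>" for i in range(1024)]
        extra_tokens += [f"<seg{i:03d}>" for i in range(128)]
        tokenizer.add_tokens(extra_tokens)

        # Store the ID of the <image> token for later use.
        self.image_token_id = tokenizer.convert_tokens_to_ids(IMAGE_TOKEN)

        # BOS/EOS are added manually when the prompt string is built.
        tokenizer.add_bos_token = False
        tokenizer.add_eos_token = False
        self.tokenizer = tokenizer
```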
- text: List of 1 string prompt
- images: List of 1 PIL image
- Preprocesses the image via helper functions
- Adds <image> tokens and <s> (BOS) to the text prompt
- Tokenizes the updated prompt
- Converts everything to PyTorch tensors
A dictionary with:
{
"pixel_values": Tensor [1, 3, H, W],
"input_ids": Tensor [1, seq_len],
"attention_mask": Tensor [1, seq_len]
}

Resizes a PIL image to a fixed (width, height).
Scales pixel values from [0, 255] → [0, 1] using scale=1/255.0.
Normalizes image channels using ImageNet stats:
image = (image - mean) / std

Applies all of the above:
- Resize
- Rescale
- Normalize
- Transpose [H, W, C] → [C, H, W]
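The rescale/normalize/transpose steps above can be sketched in NumPy. The function names are hypothetical, and the ImageNet mean/std constants follow the text; resizing is assumed to happen on the PIL image before this point.

```python
import numpy as np

# ImageNet channel statistics, as described above (assumed values).
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])
IMAGENET_STD = np.array([0.229, 0.224, 0.225])

def rescale(image: np.ndarray, scale: float = 1 / 255.0) -> np.ndarray:
    # Map pixel values from [0, 255] to [0, 1].
    return image.astype(np.float32) * scale

def normalize(image: np.ndarray,
              mean: np.ndarray = IMAGENET_MEAN,
              std: np.ndarray = IMAGENET_STD) -> np.ndarray:
    # Standardize per channel: (x - mean) / std, broadcast over [H, W, C].
    return (image - mean) / std

def to_channel_first(image: np.ndarray) -> np.ndarray:
    # [H, W, C] -> [C, H, W], the layout PyTorch convolutions expect.
    return image.transpose(2, 0, 1)

def preprocess(image: np.ndarray) -> np.ndarray:
    # Resize is applied to the PIL image before this point (assumed split).
    return to_channel_first(normalize(rescale(image)))
```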
Returns a string like:
<image><image>...<image><s> Prompt text
These repeated tokens act as placeholders in the token stream and are later replaced by the actual image embeddings.
- <image>: Repeated before the prompt to reserve space for visual features.
- <locXXXX>: 1024 object detection tags
- <segXXX>: 128 segmentation labels
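The prompt-building step can be sketched as a one-line helper. The function name and default values are assumptions for illustration:

```python
def add_image_tokens_to_prompt(prefix_prompt: str,
                               bos_token: str = "<s>",
                               image_seq_len: int = 32,
                               image_token: str = "<image>") -> str:
    # One <image> placeholder per visual token, then BOS, then the prompt text.
    return f"{image_token * image_seq_len}{bos_token}{prefix_prompt}"
```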
Image (PIL) ──▶ Resize ──▶ Rescale ──▶ Normalize ──▶ Transpose ─▶ Torch Tensor
│
▼
Text Prompt ───────▶ Add <image> tokens ──▶ Tokenize ─▶ Torch Tensor
Final Output:
{
"pixel_values": [1, 3, 224, 224],
"input_ids": [1, seq_len],
"attention_mask": [1, seq_len]
}
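Putting the pieces together, the final assembly can be sketched as below. NumPy stands in for torch tensors here, and build_inputs is a hypothetical name; the real processor produces torch tensors directly.

```python
import numpy as np

def build_inputs(pixel_array: np.ndarray, token_ids: list[int]) -> dict:
    # A batch dimension of 1 mirrors the single (text, image) pair
    # the processor accepts.
    input_ids = np.array([token_ids])
    return {
        "pixel_values": pixel_array[None, ...],     # [1, 3, H, W]
        "input_ids": input_ids,                     # [1, seq_len]
        "attention_mask": np.ones_like(input_ids),  # [1, seq_len]
    }
```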