Commit edbf767

Add support for PaliGemma (&PaliGemma2)
1 parent ead1f22 commit edbf767

6 files changed: +124 −5 lines

README.md

Lines changed: 1 addition & 0 deletions
```diff
@@ -375,6 +375,7 @@ You can refine your search by selecting the task you're interested in (e.g., [te
 1. **[OPT](https://huggingface.co/docs/transformers/master/model_doc/opt)** (from Meta AI) released with the paper [OPT: Open Pre-trained Transformer Language Models](https://arxiv.org/abs/2205.01068) by Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen et al.
 1. **[OWL-ViT](https://huggingface.co/docs/transformers/model_doc/owlvit)** (from Google AI) released with the paper [Simple Open-Vocabulary Object Detection with Vision Transformers](https://arxiv.org/abs/2205.06230) by Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, Xiao Wang, Xiaohua Zhai, Thomas Kipf, and Neil Houlsby.
 1. **[OWLv2](https://huggingface.co/docs/transformers/model_doc/owlv2)** (from Google AI) released with the paper [Scaling Open-Vocabulary Object Detection](https://arxiv.org/abs/2306.09683) by Matthias Minderer, Alexey Gritsenko, Neil Houlsby.
+1. **[PaliGemma](https://huggingface.co/docs/transformers/main/model_doc/paligemma)** (from Google) released with the papers [PaliGemma: A versatile 3B VLM for transfer](https://arxiv.org/abs/2407.07726) and [PaliGemma 2: A Family of Versatile VLMs for Transfer](https://arxiv.org/abs/2412.03555) by the PaliGemma Google team.
 1. **[PatchTSMixer](https://huggingface.co/docs/transformers/main/model_doc/patchtsmixer)** (from IBM) released with the paper [TSMixer: Lightweight MLP-Mixer Model for Multivariate Time Series Forecasting](https://arxiv.org/abs/2306.09364) by Vijay Ekambaram, Arindam Jati, Nam Nguyen, Phanwadee Sinthong, Jayant Kalagnanam.
 1. **[PatchTST](https://huggingface.co/docs/transformers/main/model_doc/patchtst)** (from Princeton University, IBM) released with the paper [A Time Series is Worth 64 Words: Long-term Forecasting with Transformers](https://arxiv.org/abs/2211.14730) by Yuqi Nie, Nam H. Nguyen, Phanwadee Sinthong, Jayant Kalagnanam.
 1. **[Phi](https://huggingface.co/docs/transformers/main/model_doc/phi)** (from Microsoft) released with the papers - [Textbooks Are All You Need](https://arxiv.org/abs/2306.11644) by Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Harkirat Singh Behl, Xin Wang, Sébastien Bubeck, Ronen Eldan, Adam Tauman Kalai, Yin Tat Lee and Yuanzhi Li, [Textbooks Are All You Need II: phi-1.5 technical report](https://arxiv.org/abs/2309.05463) by Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar and Yin Tat Lee.
```

docs/snippets/6_supported-models.snippet

Lines changed: 1 addition & 0 deletions
```diff
@@ -90,6 +90,7 @@
 1. **[OPT](https://huggingface.co/docs/transformers/master/model_doc/opt)** (from Meta AI) released with the paper [OPT: Open Pre-trained Transformer Language Models](https://arxiv.org/abs/2205.01068) by Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen et al.
 1. **[OWL-ViT](https://huggingface.co/docs/transformers/model_doc/owlvit)** (from Google AI) released with the paper [Simple Open-Vocabulary Object Detection with Vision Transformers](https://arxiv.org/abs/2205.06230) by Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, Xiao Wang, Xiaohua Zhai, Thomas Kipf, and Neil Houlsby.
 1. **[OWLv2](https://huggingface.co/docs/transformers/model_doc/owlv2)** (from Google AI) released with the paper [Scaling Open-Vocabulary Object Detection](https://arxiv.org/abs/2306.09683) by Matthias Minderer, Alexey Gritsenko, Neil Houlsby.
+1. **[PaliGemma](https://huggingface.co/docs/transformers/main/model_doc/paligemma)** (from Google) released with the papers [PaliGemma: A versatile 3B VLM for transfer](https://arxiv.org/abs/2407.07726) and [PaliGemma 2: A Family of Versatile VLMs for Transfer](https://arxiv.org/abs/2412.03555) by the PaliGemma Google team.
 1. **[PatchTSMixer](https://huggingface.co/docs/transformers/main/model_doc/patchtsmixer)** (from IBM) released with the paper [TSMixer: Lightweight MLP-Mixer Model for Multivariate Time Series Forecasting](https://arxiv.org/abs/2306.09364) by Vijay Ekambaram, Arindam Jati, Nam Nguyen, Phanwadee Sinthong, Jayant Kalagnanam.
 1. **[PatchTST](https://huggingface.co/docs/transformers/main/model_doc/patchtst)** (from Princeton University, IBM) released with the paper [A Time Series is Worth 64 Words: Long-term Forecasting with Transformers](https://arxiv.org/abs/2211.14730) by Yuqi Nie, Nam H. Nguyen, Phanwadee Sinthong, Jayant Kalagnanam.
 1. **[Phi](https://huggingface.co/docs/transformers/main/model_doc/phi)** (from Microsoft) released with the papers - [Textbooks Are All You Need](https://arxiv.org/abs/2306.11644) by Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Harkirat Singh Behl, Xin Wang, Sébastien Bubeck, Ronen Eldan, Adam Tauman Kalai, Yin Tat Lee and Yuanzhi Li, [Textbooks Are All You Need II: phi-1.5 technical report](https://arxiv.org/abs/2309.05463) by Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar and Yin Tat Lee.
```

src/models.js

Lines changed: 32 additions & 5 deletions
````diff
@@ -558,7 +558,9 @@ async function decoderForward(self, model_inputs, is_encoder_decoder = false) {
         new_model_inputs.use_cache_branch = boolTensor(!!past_key_values);
     }
     if (session.inputNames.includes('position_ids') && new_model_inputs.attention_mask && !new_model_inputs.position_ids) {
-        new_model_inputs.position_ids = createPositionIds(new_model_inputs, past_key_values);
+        // NOTE: Handle a special case for paligemma models, where positions are 1-indexed
+        const start_index = self.config.model_type === 'paligemma' ? 1 : 0;
+        new_model_inputs.position_ids = createPositionIds(new_model_inputs, past_key_values, start_index);
     }

     // Unpack the `past_key_values` object into model inputs
@@ -694,14 +696,14 @@ async function imageTextToTextForward(self, {
  * @param {Tensor} attention_mask
  * @returns {{data: BigInt64Array, dims: number[]}}
  */
-function cumsum_masked_fill(attention_mask) {
+function cumsum_masked_fill(attention_mask, start_index = 0) {
     const [bz, seq_len] = attention_mask.dims;
     const attn_mask_data = attention_mask.data;

     const data = new BigInt64Array(attn_mask_data.length);
     for (let i = 0; i < bz; ++i) {
         const start = i * seq_len;
-        let sum = BigInt(0);
+        let sum = BigInt(start_index);
         for (let j = 0; j < seq_len; ++j) {
             const index = start + j;
             if (attn_mask_data[index] === 0n) {
@@ -728,10 +730,10 @@ function cumsum_masked_fill(attention_mask) {
  * position_ids = position_ids[:, -input_ids.shape[1] :]
  * ```
  */
-function createPositionIds(model_inputs, past_key_values = null) {
+function createPositionIds(model_inputs, past_key_values = null, start_index = 0) {
     const { input_ids, inputs_embeds, attention_mask } = model_inputs;

-    const { data, dims } = cumsum_masked_fill(attention_mask);
+    const { data, dims } = cumsum_masked_fill(attention_mask, start_index);
     let position_ids = new Tensor('int64', data, dims);
     if (past_key_values) {
         const offset = -(input_ids ?? inputs_embeds).dims.at(1);
@@ -3548,6 +3550,30 @@ export class Florence2ForConditionalGeneration extends Florence2PreTrainedModel
     }
 }

+export class PaliGemmaPreTrainedModel extends PreTrainedModel {
+    forward_params = [
+        'input_ids',
+        // 'inputs_embeds',
+        'attention_mask',
+        'pixel_values',
+        'position_ids',
+        'past_key_values',
+    ];
+}
+
+export class PaliGemmaForConditionalGeneration extends PaliGemmaPreTrainedModel {
+    _merge_input_ids_with_image_features(kwargs) {
+        const vision_hidden_size = kwargs.image_features.dims.at(-1);
+        const reshaped_image_hidden_states = kwargs.image_features.view(-1, vision_hidden_size);
+
+        return default_merge_input_ids_with_image_features({
+            // @ts-ignore
+            image_token_id: this.config.image_token_index,
+            ...kwargs,
+            image_features: reshaped_image_hidden_states,
+        })
+    }
+}

 //////////////////////////////////////////////////
 // Idefics3 Models
@@ -7000,6 +7026,7 @@ const MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES = new Map([
     ['florence2', ['Florence2ForConditionalGeneration', Florence2ForConditionalGeneration]],
     ['qwen2-vl', ['Qwen2VLForConditionalGeneration', Qwen2VLForConditionalGeneration]],
     ['idefics3', ['Idefics3ForConditionalGeneration', Idefics3ForConditionalGeneration]],
+    ['paligemma', ['PaliGemmaForConditionalGeneration', PaliGemmaForConditionalGeneration]],
 ]);

 const MODEL_FOR_DOCUMENT_QUESTION_ANSWERING_MAPPING_NAMES = new Map([
````
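The effect of `start_index` on the generated position ids can be sketched in isolation. The helper below (hypothetical name `positionIdsFromMask`, plain arrays in place of `Tensor`/`BigInt64Array`) assumes, following the Python reference quoted in the `createPositionIds` docstring, that attended positions take the running cumulative sum and masked positions are filled with 1; the `else` branch of the loop is elided in the diff above, so this is an illustrative sketch rather than a verbatim copy:

```javascript
// Compute position ids from a 2D attention mask, offset by startIndex.
// PaliGemma passes startIndex = 1, so its positions are 1-indexed.
function positionIdsFromMask(mask, startIndex = 0) {
    return mask.map((row) => {
        let sum = startIndex;
        return row.map((m) => {
            if (m === 0) return 1; // padding: masked_fill(mask == 0, 1)
            const pos = sum;       // running count of attended tokens
            sum += 1;
            return pos;
        });
    });
}

const mask = [[1, 1, 1, 0]];
console.log(positionIdsFromMask(mask, 0)); // [[0, 1, 2, 1]]
console.log(positionIdsFromMask(mask, 1)); // [[1, 2, 3, 1]]
```

With the default `start_index = 0` the behavior of existing decoder models is unchanged; only `model_type === 'paligemma'` opts into the shifted indexing.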
src/models/paligemma/processing_paligemma.js

Lines changed: 83 additions & 0 deletions

```diff
@@ -0,0 +1,83 @@
+import { Processor } from "../../base/processing_utils.js";
+import { AutoImageProcessor } from "../auto/image_processing_auto.js";
+import { AutoTokenizer } from "../../tokenizers.js";
+
+const IMAGE_TOKEN = "<image>";
+
+function build_string_from_input(
+    prompt,
+    bos_token,
+    image_seq_len,
+    image_token,
+    num_images,
+) {
+    return `${image_token.repeat(image_seq_len * num_images)}${bos_token}${prompt}\n`
+}
+
+export class PaliGemmaProcessor extends Processor {
+    static tokenizer_class = AutoTokenizer
+    static image_processor_class = AutoImageProcessor
+    static uses_processor_config = false;
+
+    /**
+     * @typedef {import('../../utils/image.js').RawImage} RawImage
+     */
+
+    // `images` is required, `text` is optional
+    async _call(/** @type {RawImage|RawImage[]} */ images, text = null, kwargs = {}) {
+        if (!text) {
+            console.warn(
+                "You are using PaliGemma without a text prefix. It will perform as a picture-captioning model."
+            )
+            text = ""
+        }
+
+        if (!Array.isArray(images)) {
+            images = [images]
+        }
+
+        if (!Array.isArray(text)) {
+            text = [text]
+        }
+
+        const bos_token = this.tokenizer.bos_token;
+        const image_seq_length = this.image_processor.config.image_seq_length;
+        let input_strings;
+        if (text.some((t) => t.includes(IMAGE_TOKEN))) {
+            console.log('this.image_processor.config', this.image_processor.config)
+            input_strings = text.map(
+                sample => {
+                    const expanded_sample = sample.replaceAll(IMAGE_TOKEN, IMAGE_TOKEN.repeat(image_seq_length));
+                    const bos_rfind_index = expanded_sample.lastIndexOf(IMAGE_TOKEN);
+                    const bos_index = bos_rfind_index === -1 ? 0 : bos_rfind_index + IMAGE_TOKEN.length;
+                    return expanded_sample.slice(0, bos_index) + bos_token + expanded_sample.slice(bos_index) + "\n";
+                }
+            )
+        } else {
+            console.warn(
+                "You are passing both `text` and `images` to `PaliGemmaProcessor`. The processor expects special " +
+                "image tokens in the text, as many tokens as there are images per each text. It is recommended to " +
+                "add `<image>` tokens in the very beginning of your text. For this call, we will infer how many images " +
+                "each text has and add special tokens."
+            )
+
+            input_strings = text.map(
+                sample => build_string_from_input(
+                    sample,
+                    bos_token,
+                    image_seq_length,
+                    IMAGE_TOKEN,
+                    images.length,
+                )
+            )
+        }
+
+        const text_inputs = this.tokenizer(input_strings, kwargs);
+        const image_inputs = await this.image_processor(images, kwargs);
+
+        return {
+            ...image_inputs,
+            ...text_inputs,
+        }
+    }
+}
```
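The prompt layout produced by `build_string_from_input` can be checked standalone. The snippet below copies the function with toy values; real checkpoints read a much larger `image_seq_length` (resolution-dependent, e.g. 256) from the image processor config, and `<bos>` stands in for whatever `bos_token` the tokenizer defines:

```javascript
const IMAGE_TOKEN = "<image>";

// Copied from the processor above: all image placeholder tokens first,
// then the BOS token, then the text prompt, terminated by a newline.
function build_string_from_input(prompt, bos_token, image_seq_len, image_token, num_images) {
    return `${image_token.repeat(image_seq_len * num_images)}${bos_token}${prompt}\n`;
}

const s = build_string_from_input("caption en", "<bos>", 3, IMAGE_TOKEN, 1);
console.log(JSON.stringify(s)); // "<image><image><image><bos>caption en\n"
```

The same layout is what the other branch of `_call` reconstructs when the caller supplies `<image>` placeholders explicitly: each placeholder is expanded `image_seq_length` times and the BOS token is inserted immediately after the last image token.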

src/models/processors.js

Lines changed: 1 addition & 0 deletions
```diff
@@ -4,6 +4,7 @@ export * from './idefics3/processing_idefics3.js';
 export * from './janus/processing_janus.js';
 export * from './jina_clip/processing_jina_clip.js';
 export * from './owlvit/processing_owlvit.js';
+export * from './paligemma/processing_paligemma.js';
 export * from './pyannote/processing_pyannote.js';
 export * from './qwen2_vl/processing_qwen2_vl.js';
 export * from './sam/processing_sam.js';
```
src/tokenizers.js

Lines changed: 6 additions & 0 deletions
```diff
@@ -2605,6 +2605,12 @@ export class PreTrainedTokenizer extends Callable {
         this.unk_token = this.getToken('unk_token');
         this.unk_token_id = this.model.tokens_to_ids.get(this.unk_token);

+        this.bos_token = this.getToken('bos_token');
+        this.bos_token_id = this.model.tokens_to_ids.get(this.bos_token);
+
+        this.eos_token = this.getToken('eos_token');
+        this.eos_token_id = this.model.tokens_to_ids.get(this.eos_token);
+
         this.model_max_length = tokenizerConfig.model_max_length;

         /** @type {boolean} Whether or not to strip the text when tokenizing (removing excess spaces before and after the string). */
```
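This change exists so that `PaliGemmaProcessor` can read `this.tokenizer.bos_token` when building prompts. A minimal sketch (hypothetical vocab and config values) of the new lookups: the constructor resolves each special-token string from the tokenizer config, then maps it to an id through the model's vocabulary:

```javascript
// Hypothetical vocab fragment standing in for this.model.tokens_to_ids.
const tokens_to_ids = new Map([["<unk>", 3], ["<bos>", 2], ["<eos>", 1]]);

// Stand-in for this.getToken('bos_token') etc., which resolves the token
// string (possibly nested under a `content` field) from tokenizer_config.json.
const getToken = (name) => ({ bos_token: "<bos>", eos_token: "<eos>" }[name]);

const bos_token = getToken("bos_token");
const bos_token_id = tokens_to_ids.get(bos_token); // 2
const eos_token = getToken("eos_token");
const eos_token_id = tokens_to_ids.get(eos_token); // 1
```

Tokenizers whose config lacks `bos_token` or `eos_token` simply end up with `undefined` here, matching the existing behavior of the `unk_token` lookup directly above.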
