
Commit 11db949

Add support for idefics3 (SmolVLM) (#1059)
* [WIP] Add support for idefics3 (SmolVLM)
* Cleanup
* Update `DataTypeMap` with 4-bit data types
* Format the model inputs before logging to console
* Use QUInt8 when quantizing models produced by onnxruntime-genai
* `auto` dtype selection
* Export `load_image` helper function
* Add listed support for Idefics3
* Add support for batched 2d images in idefics3 processor
* Update unit tests
* Add another unit test to ensure correctness of pixel attention mask placement
* Move image tokens out of call function
* Formatting
* Improve auto selection logic
* Return correct pixel_attention_mask
* Update pixel_attention_mask unit test
* Formatting
* Add idefics3 unit tests
* Increase idefics processor unit test timeout
1 parent 9584263 commit 11db949

16 files changed (+807, -93 lines)
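Taken together, these changes let Idefics3 (SmolVLM) checkpoints run through the standard Transformers.js vision-to-sequence flow. Below is a rough usage sketch, not part of the diff: the model id, image URL, per-submodule dtype keys, and processor options are illustrative assumptions rather than values prescribed by this commit.

```js
import { AutoProcessor, AutoModelForVision2Seq, load_image } from "@huggingface/transformers";

// Model id, image URL, and dtype choices below are placeholders for illustration.
const model_id = "HuggingFaceTB/SmolVLM-Instruct";
const processor = await AutoProcessor.from_pretrained(model_id);
const model = await AutoModelForVision2Seq.from_pretrained(model_id, {
    // `dtype` may be a single value or a per-submodule record (see the src/configs.js change below).
    dtype: {
        embed_tokens: "fp16",
        vision_encoder: "q8",
        decoder_model_merged: "q4",
    },
});

// `load_image` is exported as a helper by this PR.
const image = await load_image("https://example.com/some-image.jpg");

// Build a chat-style prompt containing one image placeholder.
const messages = [{
    role: "user",
    content: [
        { type: "image" },
        { type: "text", text: "Can you describe this image?" },
    ],
}];
const text = processor.apply_chat_template(messages, { add_generation_prompt: true });
const inputs = await processor(text, [image]);

// Generate and decode only the newly produced tokens.
const generated_ids = await model.generate({ ...inputs, max_new_tokens: 128 });
const output = processor.batch_decode(
    generated_ids.slice(null, [inputs.input_ids.dims.at(-1), null]),
    { skip_special_tokens: true },
);
console.log(output[0]);
```

The per-submodule `dtype` record and the `auto` fallback used when entries are omitted correspond to the `src/configs.js` and `src/models.js` changes further down.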

README.md

Lines changed: 1 addition & 0 deletions
@@ -337,6 +337,7 @@ You can refine your search by selecting the task you're interested in (e.g., [te
 1. **[HerBERT](https://huggingface.co/docs/transformers/model_doc/herbert)** (from Allegro.pl, AGH University of Science and Technology) released with the paper [KLEJ: Comprehensive Benchmark for Polish Language Understanding](https://www.aclweb.org/anthology/2020.acl-main.111.pdf) by Piotr Rybak, Robert Mroczkowski, Janusz Tracz, Ireneusz Gawlik.
 1. **[Hiera](https://huggingface.co/docs/transformers/model_doc/hiera)** (from Meta) released with the paper [Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles](https://arxiv.org/pdf/2306.00989) by Chaitanya Ryali, Yuan-Ting Hu, Daniel Bolya, Chen Wei, Haoqi Fan, Po-Yao Huang, Vaibhav Aggarwal, Arkabandhu Chowdhury, Omid Poursaeed, Judy Hoffman, Jitendra Malik, Yanghao Li, Christoph Feichtenhofer.
 1. **[Hubert](https://huggingface.co/docs/transformers/model_doc/hubert)** (from Facebook) released with the paper [HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units](https://arxiv.org/abs/2106.07447) by Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed.
+1. **[Idefics3](https://huggingface.co/docs/transformers/model_doc/idefics3)** (from Hugging Face) released with the paper [Building and better understanding vision-language models: insights and future directions](https://arxiv.org/abs/2408.12637) by Hugo Laurençon, Andrés Marafioti, Victor Sanh, Léo Tronchon.
 1. **JAIS** (from Core42) released with the paper [Jais and Jais-chat: Arabic-Centric Foundation and Instruction-Tuned Open Generative Large Language Models](https://arxiv.org/pdf/2308.16149) by Neha Sengupta, Sunil Kumar Sahu, Bokang Jia, Satheesh Katipomu, Haonan Li, Fajri Koto, William Marshall, Gurpreet Gosal, Cynthia Liu, Zhiming Chen, Osama Mohammed Afzal, Samta Kamboj, Onkar Pandit, Rahul Pal, Lalit Pradhan, Zain Muhammad Mujahid, Massa Baali, Xudong Han, Sondos Mahmoud Bsharat, Alham Fikri Aji, Zhiqiang Shen, Zhengzhong Liu, Natalia Vassilieva, Joel Hestness, Andy Hock, Andrew Feldman, Jonathan Lee, Andrew Jackson, Hector Xuguang Ren, Preslav Nakov, Timothy Baldwin, Eric Xing.
 1. **Janus** (from DeepSeek) released with the paper [Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation](https://arxiv.org/abs/2410.13848) Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, Ping Luo.
 1. **JinaCLIP** (from Jina AI) released with the paper [Jina CLIP: Your CLIP Model Is Also Your Text Retriever](https://arxiv.org/abs/2405.20204) by Andreas Koukounas, Georgios Mastrapas, Michael Günther, Bo Wang, Scott Martens, Isabelle Mohr, Saba Sturua, Mohammad Kalim Akram, Joan Fontanals Martínez, Saahil Ognawala, Susana Guzman, Maximilian Werk, Nan Wang, Han Xiao.

docs/snippets/6_supported-models.snippet

Lines changed: 1 addition & 0 deletions
@@ -52,6 +52,7 @@
 1. **[HerBERT](https://huggingface.co/docs/transformers/model_doc/herbert)** (from Allegro.pl, AGH University of Science and Technology) released with the paper [KLEJ: Comprehensive Benchmark for Polish Language Understanding](https://www.aclweb.org/anthology/2020.acl-main.111.pdf) by Piotr Rybak, Robert Mroczkowski, Janusz Tracz, Ireneusz Gawlik.
 1. **[Hiera](https://huggingface.co/docs/transformers/model_doc/hiera)** (from Meta) released with the paper [Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles](https://arxiv.org/pdf/2306.00989) by Chaitanya Ryali, Yuan-Ting Hu, Daniel Bolya, Chen Wei, Haoqi Fan, Po-Yao Huang, Vaibhav Aggarwal, Arkabandhu Chowdhury, Omid Poursaeed, Judy Hoffman, Jitendra Malik, Yanghao Li, Christoph Feichtenhofer.
 1. **[Hubert](https://huggingface.co/docs/transformers/model_doc/hubert)** (from Facebook) released with the paper [HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units](https://arxiv.org/abs/2106.07447) by Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed.
+1. **[Idefics3](https://huggingface.co/docs/transformers/model_doc/idefics3)** (from Hugging Face) released with the paper [Building and better understanding vision-language models: insights and future directions](https://arxiv.org/abs/2408.12637) by Hugo Laurençon, Andrés Marafioti, Victor Sanh, Léo Tronchon.
 1. **JAIS** (from Core42) released with the paper [Jais and Jais-chat: Arabic-Centric Foundation and Instruction-Tuned Open Generative Large Language Models](https://arxiv.org/pdf/2308.16149) by Neha Sengupta, Sunil Kumar Sahu, Bokang Jia, Satheesh Katipomu, Haonan Li, Fajri Koto, William Marshall, Gurpreet Gosal, Cynthia Liu, Zhiming Chen, Osama Mohammed Afzal, Samta Kamboj, Onkar Pandit, Rahul Pal, Lalit Pradhan, Zain Muhammad Mujahid, Massa Baali, Xudong Han, Sondos Mahmoud Bsharat, Alham Fikri Aji, Zhiqiang Shen, Zhengzhong Liu, Natalia Vassilieva, Joel Hestness, Andy Hock, Andrew Feldman, Jonathan Lee, Andrew Jackson, Hector Xuguang Ren, Preslav Nakov, Timothy Baldwin, Eric Xing.
 1. **Janus** (from DeepSeek) released with the paper [Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation](https://arxiv.org/abs/2410.13848) Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, Ping Luo.
 1. **JinaCLIP** (from Jina AI) released with the paper [Jina CLIP: Your CLIP Model Is Also Your Text Retriever](https://arxiv.org/abs/2405.20204) by Andreas Koukounas, Georgios Mastrapas, Michael Günther, Bo Wang, Scott Martens, Isabelle Mohr, Saba Sturua, Mohammad Kalim Akram, Joan Fontanals Martínez, Saahil Ognawala, Susana Guzman, Maximilian Werk, Nan Wang, Han Xiao.

scripts/quantize.py

Lines changed: 24 additions & 12 deletions
@@ -36,6 +36,27 @@ class QuantMode(Enum):
 
 QUANTIZE_OPTIONS = tuple(x.value for x in QuantMode)
 
+# A list of operators that, when detected in a model, should select QUInt8 as the weight type for 8-bit quantization.
+QUINT8_OPS = (
+    # NOTE:
+    # As of 2024/11/29, the latest version of onnxruntime-web is 1.20.1, and does not support INT8 weights for Conv layers.
+    # If you attempt to run a model with INT8 weights for Conv layers, you will get an error like:
+    # `Can't create a session. ERROR_CODE: 9, ERROR_MESSAGE: Could not find an implementation for ConvInteger(10) node with name '/.../Conv_quant'`
+    #
+    # For this reason, we choose model weight types to ensure compatibility with onnxruntime-web.
+    #
+    # As per docs, signed weight type (QInt8) is faster on most CPUs, so, we use that unless the model contains a Conv layer.
+    # For more information, see:
+    # - https://github.com/microsoft/onnxruntime/issues/3130#issuecomment-1105200621
+    # - https://github.com/microsoft/onnxruntime/issues/2339
+    "Conv",
+
+    # Models produced by onnxruntime-genai contain optimized operators that perform better with QUInt8 weights.
+    "GroupQueryAttention",
+    "MultiHeadAttention",
+
+    # TODO: "SimplifiedLayerNormalization", "SkipSimplifiedLayerNormalization"
+)
 
 
 @dataclass
 class IOArguments:
@@ -326,20 +347,11 @@ def quantize(input_folder, output_folder, quantization_args: QuantizationArgumen
 
         elif mode in (QuantMode.Q8, QuantMode.QI8, QuantMode.QU8):
             if mode == QuantMode.Q8:
-                # NOTE:
-                # As of 2024/06/28, the current latest version of onnxruntime-web is 1.18.0, and does not support INT8 weights for Conv layers.
-                # If you attempt to run a model with INT8 weights for Conv layers, you will get an error like:
-                # `Can't create a session. ERROR_CODE: 9, ERROR_MESSAGE: Could not find an implementation for ConvInteger(10) node with name '/.../Conv_quant'`
-                #
-                # For this reason, we choose model weight types to ensure compatibility with onnxruntime-web.
-                #
-                # As per docs, signed weight type (QInt8) is faster on most CPUs, so, we use that unless the model contains a Conv layer.
-                # For more information, see:
-                # - https://github.com/microsoft/onnxruntime/issues/3130#issuecomment-1105200621
-                # - https://github.com/microsoft/onnxruntime/issues/2339
                 op_types = get_operators(model)
                 weight_type = (
-                    QuantType.QUInt8 if "Conv" in op_types else QuantType.QInt8
+                    QuantType.QUInt8
+                    if any(x in QUINT8_OPS for x in op_types)
+                    else QuantType.QInt8
                 )
 
             elif mode == QuantMode.QI8:

src/configs.js

Lines changed: 2 additions & 1 deletion
@@ -69,6 +69,7 @@ function getNormalizedConfig(config) {
         case 'paligemma':
         case 'florence2':
         case 'llava_onevision':
+        case 'idefics3':
             init_normalized_config = getNormalizedConfig(config.text_config);
             break;
         case 'moondream1':
@@ -382,6 +383,6 @@ export class AutoConfig {
  * See https://onnxruntime.ai/docs/tutorials/web/env-flags-and-session-options.html#freedimensionoverrides
  * for more information.
  * @property {import('./utils/devices.js').DeviceType} [device] The default device to use for the model.
- * @property {import('./utils/dtypes.js').DataType} [dtype] The default data type to use for the model.
+ * @property {import('./utils/dtypes.js').DataType|Record<string, import('./utils/dtypes.js').DataType>} [dtype] The default data type to use for the model.
  * @property {boolean|Record<string, boolean>} [use_external_data_format=false] Whether to load the model using the external data format (used for models >= 2GB in size).
  */

src/models.js

Lines changed: 107 additions & 32 deletions
@@ -182,6 +182,22 @@ async function getSession(pretrained_model_name_or_path, fileName, options) {
         }
     }
 
+    if (dtype === DATA_TYPES.auto) {
+        // Try to choose the auto dtype based on the custom config
+        let config_dtype = custom_config.dtype;
+        if (typeof config_dtype !== 'string') {
+            config_dtype = config_dtype[fileName];
+        }
+
+        if (config_dtype && config_dtype !== DATA_TYPES.auto && DATA_TYPES.hasOwnProperty(config_dtype)) {
+            // Defined by the custom config, and is not "auto"
+            dtype = config_dtype;
+        } else {
+            // Choose default dtype based on device, falling back to fp32
+            dtype = DEFAULT_DEVICE_DTYPE_MAPPING[selectedDevice] ?? DATA_TYPES.fp32;
+        }
+    }
+
     const selectedDtype = /** @type {import("./utils/dtypes.js").DataType} */(dtype);
 
     if (!DEFAULT_DTYPE_SUFFIX_MAPPING.hasOwnProperty(selectedDtype)) {
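For readers skimming the diff, the new `auto` branch resolves a dtype in three steps: a per-file entry in the custom config (when `dtype` is given as a record), then the device default, then `fp32`. A minimal standalone sketch of that order follows; the helper name and arguments are hypothetical, not library API.

```js
// Standalone sketch of the `auto` resolution order above; the helper name and
// arguments are hypothetical, not part of the library API.
function resolveAutoDtype(customDtype, fileName, device, deviceDtypeMapping, knownDtypes) {
    // 1. If the custom config gives `dtype` as a record, take the entry for this submodule file.
    const configDtype = typeof customDtype === "string" ? customDtype : customDtype?.[fileName];

    // 2. Use it when it is a concrete, known dtype (not "auto", not undefined).
    if (configDtype && configDtype !== "auto" && knownDtypes.has(configDtype)) {
        return configDtype;
    }

    // 3. Otherwise fall back to the device default, then to fp32.
    return deviceDtypeMapping[device] ?? "fp32";
}

const known = new Set(["fp32", "fp16", "q8", "q4"]);
console.log(resolveAutoDtype({ decoder_model_merged: "q4" }, "decoder_model_merged", "wasm", { wasm: "q8" }, known)); // "q4"
console.log(resolveAutoDtype("auto", "embed_tokens", "wasm", { wasm: "q8" }, known)); // "q8"
```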
@@ -387,9 +403,17 @@ async function sessionRun(session, inputs) {
         output = replaceTensors(output);
         return output;
     } catch (e) {
+        // Error messages can be long (nested) and uninformative. For this reason,
+        // we apply minor formatting to show the most important information
+        const formatted = Object.fromEntries(Object.entries(checkedInputs)
+            .map(([k, { type, dims, data }]) => [k, {
+                // Extract these properties from the underlying ORT tensor
+                type, dims, data,
+            }]));
+
         // This usually occurs when the inputs are of the wrong type.
         console.error(`An error occurred during model execution: "${e}".`);
-        console.error('Inputs given to model:', checkedInputs)
+        console.error('Inputs given to model:', formatted);
         throw e;
     }
 }
@@ -546,6 +570,39 @@ async function decoderForward(self, model_inputs, is_encoder_decoder = false) {
 }
 
 
+
+function default_merge_input_ids_with_image_features({
+    image_token_id,
+    inputs_embeds,
+    image_features,
+    input_ids,
+    attention_mask,
+}) {
+    const image_tokens = input_ids.tolist().map(ids =>
+        ids.reduce((acc, x, idx) => {
+            if (x == image_token_id) acc.push(idx);
+            return acc;
+        }, [])
+    );
+    const n_image_tokens = image_tokens.reduce((acc, x) => acc + x.length, 0);
+    const n_image_features = image_features.dims[0];
+    if (n_image_tokens !== n_image_features) {
+        throw new Error(`Image features and image tokens do not match: tokens: ${n_image_tokens}, features ${n_image_features}`);
+    }
+
+    // Equivalent to performing a masked_scatter
+    let img = 0;
+    for (let i = 0; i < image_tokens.length; ++i) {
+        const tokens = image_tokens[i];
+        const embeds = inputs_embeds[i];
+        for (let j = 0; j < tokens.length; ++j) {
+            embeds[tokens[j]].data.set(image_features[img++].data)
+        }
+    }
+    return { inputs_embeds, attention_mask }
+}
+
+
 /**
  * Forward pass of an image-text-to-text model.
  * @param {Object} self The image-text-to-text model model.
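The Tensor-based scatter in `default_merge_input_ids_with_image_features` may be easier to picture as a toy, plain-array sketch of the same idea; the token id and values below are made up for illustration.

```js
// Toy, plain-array version of the masked_scatter-style merge above: every position whose
// input id equals `image_token_id` receives the next image feature row, in order.
const image_token_id = 49190;                           // made-up id, for illustration only
const input_ids = [[1, 49190, 49190, 42]];              // batch of one sequence
const inputs_embeds = [[[0.1], [0.0], [0.0], [0.4]]];   // placeholder text embeddings
const image_features = [[9.1], [9.2]];                  // one row per expected image token

let img = 0;
for (let i = 0; i < input_ids.length; ++i) {
    for (let j = 0; j < input_ids[i].length; ++j) {
        if (input_ids[i][j] === image_token_id) {
            inputs_embeds[i][j] = image_features[img++];
        }
    }
}
console.log(inputs_embeds[0]); // [ [ 0.1 ], [ 9.1 ], [ 9.2 ], [ 0.4 ] ]
```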
@@ -3304,8 +3361,8 @@ export class VisionEncoderDecoderModel extends PreTrainedModel {
 export class LlavaPreTrainedModel extends PreTrainedModel {
     forward_params = [
         'input_ids',
-        'pixel_values',
         'attention_mask',
+        'pixel_values',
         'position_ids',
         'past_key_values',
     ];
@@ -3487,6 +3544,46 @@ export class Florence2ForConditionalGeneration extends Florence2PreTrainedModel
         return decoder_outputs;
     }
 }
+
+
+//////////////////////////////////////////////////
+// Idefics3 Models
+export class Idefics3PreTrainedModel extends PreTrainedModel {
+    forward_params = [
+        'input_ids',
+        'attention_mask',
+        'pixel_values',
+        'pixel_attention_mask',
+        'position_ids',
+        'past_key_values',
+    ];
+}
+
+/**
+ * The Idefics3 model which consists of a vision backbone and a language model.
+ */
+export class Idefics3ForConditionalGeneration extends Idefics3PreTrainedModel {
+
+    async encode_image({ pixel_values, pixel_attention_mask }) {
+        const features = (await sessionRun(this.sessions['vision_encoder'], { pixel_values, pixel_attention_mask })).image_features;
+        return features;
+    }
+
+    _merge_input_ids_with_image_features(kwargs) {
+        const vision_hidden_size = kwargs.image_features.dims.at(-1);
+        const reshaped_image_hidden_states = kwargs.image_features.view(-1, vision_hidden_size);
+
+        return default_merge_input_ids_with_image_features({
+            // @ts-ignore
+            image_token_id: this.config.image_token_id,
+            ...kwargs,
+            image_features: reshaped_image_hidden_states,
+        })
+    }
+}
+//////////////////////////////////////////////////
+
+//////////////////////////////////////////////////
 export class CLIPPreTrainedModel extends PreTrainedModel { }
 
 /**
@@ -4280,36 +4377,12 @@ export class Qwen2VLForConditionalGeneration extends Qwen2VLPreTrainedModel {
         return features;
     }
 
-    _merge_input_ids_with_image_features({
-        inputs_embeds,
-        image_features,
-        input_ids,
-        attention_mask,
-    }) {
-        // @ts-ignore
-        const { image_token_id } = this.config;
-        const image_tokens = input_ids.tolist().map(ids =>
-            ids.reduce((acc, x, idx) => {
-                if (x == image_token_id) acc.push(idx);
-                return acc;
-            }, [])
-        );
-        const n_image_tokens = image_tokens.reduce((acc, x) => acc + x.length, 0);
-        const n_image_features = image_features.dims[0];
-        if (n_image_tokens !== n_image_features) {
-            throw new Error(`Image features and image tokens do not match: tokens: ${n_image_tokens}, features ${n_image_features}`);
-        }
-
-        // Equivalent to performing a masked_scatter
-        let img = 0;
-        for (let i = 0; i < image_tokens.length; ++i) {
-            const tokens = image_tokens[i];
-            const embeds = inputs_embeds[i];
-            for (let j = 0; j < tokens.length; ++j) {
-                embeds[tokens[j]].data.set(image_features[img++].data)
-            }
-        }
-        return { inputs_embeds, attention_mask }
+    _merge_input_ids_with_image_features(kwargs) {
+        return default_merge_input_ids_with_image_features({
+            // @ts-ignore
+            image_token_id: this.config.image_token_id,
+            ...kwargs
+        })
     }
 
     prepare_inputs_for_generation(input_ids, model_inputs, generation_config) {
@@ -6914,6 +6987,7 @@ const MODEL_FOR_QUESTION_ANSWERING_MAPPING_NAMES = new Map([
 
 const MODEL_FOR_VISION_2_SEQ_MAPPING_NAMES = new Map([
     ['vision-encoder-decoder', ['VisionEncoderDecoderModel', VisionEncoderDecoderModel]],
+    ['idefics3', ['Idefics3ForConditionalGeneration', Idefics3ForConditionalGeneration]],
 ]);
 
 const MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES = new Map([
@@ -6922,6 +6996,7 @@ const MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES = new Map([
     ['moondream1', ['Moondream1ForConditionalGeneration', Moondream1ForConditionalGeneration]],
     ['florence2', ['Florence2ForConditionalGeneration', Florence2ForConditionalGeneration]],
     ['qwen2-vl', ['Qwen2VLForConditionalGeneration', Qwen2VLForConditionalGeneration]],
+    ['idefics3', ['Idefics3ForConditionalGeneration', Idefics3ForConditionalGeneration]],
 ]);
 
 const MODEL_FOR_DOCUMENT_QUESTION_ANSWERING_MAPPING_NAMES = new Map([
