diff --git a/README.md b/README.md index ae61e0e8b..7312afc46 100644 --- a/README.md +++ b/README.md @@ -337,6 +337,7 @@ You can refine your search by selecting the task you're interested in (e.g., [te 1. **[HerBERT](https://huggingface.co/docs/transformers/model_doc/herbert)** (from Allegro.pl, AGH University of Science and Technology) released with the paper [KLEJ: Comprehensive Benchmark for Polish Language Understanding](https://www.aclweb.org/anthology/2020.acl-main.111.pdf) by Piotr Rybak, Robert Mroczkowski, Janusz Tracz, Ireneusz Gawlik. 1. **[Hiera](https://huggingface.co/docs/transformers/model_doc/hiera)** (from Meta) released with the paper [Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles](https://arxiv.org/pdf/2306.00989) by Chaitanya Ryali, Yuan-Ting Hu, Daniel Bolya, Chen Wei, Haoqi Fan, Po-Yao Huang, Vaibhav Aggarwal, Arkabandhu Chowdhury, Omid Poursaeed, Judy Hoffman, Jitendra Malik, Yanghao Li, Christoph Feichtenhofer. 1. **[Hubert](https://huggingface.co/docs/transformers/model_doc/hubert)** (from Facebook) released with the paper [HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units](https://arxiv.org/abs/2106.07447) by Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed. +1. **[Idefics3](https://huggingface.co/docs/transformers/model_doc/idefics3)** (from Hugging Face) released with the paper [Building and better understanding vision-language models: insights and future directions](https://arxiv.org/abs/2408.12637) by Hugo Laurençon, Andrés Marafioti, Victor Sanh, Léo Tronchon. 1. **JAIS** (from Core42) released with the paper [Jais and Jais-chat: Arabic-Centric Foundation and Instruction-Tuned Open Generative Large Language Models](https://arxiv.org/pdf/2308.16149) by Neha Sengupta, Sunil Kumar Sahu, Bokang Jia, Satheesh Katipomu, Haonan Li, Fajri Koto, William Marshall, Gurpreet Gosal, Cynthia Liu, Zhiming Chen, Osama Mohammed Afzal, Samta Kamboj, Onkar Pandit, Rahul Pal, Lalit Pradhan, Zain Muhammad Mujahid, Massa Baali, Xudong Han, Sondos Mahmoud Bsharat, Alham Fikri Aji, Zhiqiang Shen, Zhengzhong Liu, Natalia Vassilieva, Joel Hestness, Andy Hock, Andrew Feldman, Jonathan Lee, Andrew Jackson, Hector Xuguang Ren, Preslav Nakov, Timothy Baldwin, Eric Xing. 1. **Janus** (from DeepSeek) released with the paper [Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation](https://arxiv.org/abs/2410.13848) Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, Ping Luo. 1. **JinaCLIP** (from Jina AI) released with the paper [Jina CLIP: Your CLIP Model Is Also Your Text Retriever](https://arxiv.org/abs/2405.20204) by Andreas Koukounas, Georgios Mastrapas, Michael Günther, Bo Wang, Scott Martens, Isabelle Mohr, Saba Sturua, Mohammad Kalim Akram, Joan Fontanals Martínez, Saahil Ognawala, Susana Guzman, Maximilian Werk, Nan Wang, Han Xiao. diff --git a/docs/snippets/6_supported-models.snippet b/docs/snippets/6_supported-models.snippet index 8dee3ac42..9fb16b687 100644 --- a/docs/snippets/6_supported-models.snippet +++ b/docs/snippets/6_supported-models.snippet @@ -52,6 +52,7 @@ 1. 
**[HerBERT](https://huggingface.co/docs/transformers/model_doc/herbert)** (from Allegro.pl, AGH University of Science and Technology) released with the paper [KLEJ: Comprehensive Benchmark for Polish Language Understanding](https://www.aclweb.org/anthology/2020.acl-main.111.pdf) by Piotr Rybak, Robert Mroczkowski, Janusz Tracz, Ireneusz Gawlik. 1. **[Hiera](https://huggingface.co/docs/transformers/model_doc/hiera)** (from Meta) released with the paper [Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles](https://arxiv.org/pdf/2306.00989) by Chaitanya Ryali, Yuan-Ting Hu, Daniel Bolya, Chen Wei, Haoqi Fan, Po-Yao Huang, Vaibhav Aggarwal, Arkabandhu Chowdhury, Omid Poursaeed, Judy Hoffman, Jitendra Malik, Yanghao Li, Christoph Feichtenhofer. 1. **[Hubert](https://huggingface.co/docs/transformers/model_doc/hubert)** (from Facebook) released with the paper [HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units](https://arxiv.org/abs/2106.07447) by Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed. +1. **[Idefics3](https://huggingface.co/docs/transformers/model_doc/idefics3)** (from Hugging Face) released with the paper [Building and better understanding vision-language models: insights and future directions](https://arxiv.org/abs/2408.12637) by Hugo Laurençon, Andrés Marafioti, Victor Sanh, Léo Tronchon. 1. **JAIS** (from Core42) released with the paper [Jais and Jais-chat: Arabic-Centric Foundation and Instruction-Tuned Open Generative Large Language Models](https://arxiv.org/pdf/2308.16149) by Neha Sengupta, Sunil Kumar Sahu, Bokang Jia, Satheesh Katipomu, Haonan Li, Fajri Koto, William Marshall, Gurpreet Gosal, Cynthia Liu, Zhiming Chen, Osama Mohammed Afzal, Samta Kamboj, Onkar Pandit, Rahul Pal, Lalit Pradhan, Zain Muhammad Mujahid, Massa Baali, Xudong Han, Sondos Mahmoud Bsharat, Alham Fikri Aji, Zhiqiang Shen, Zhengzhong Liu, Natalia Vassilieva, Joel Hestness, Andy Hock, Andrew Feldman, Jonathan Lee, Andrew Jackson, Hector Xuguang Ren, Preslav Nakov, Timothy Baldwin, Eric Xing. 1. **Janus** (from DeepSeek) released with the paper [Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation](https://arxiv.org/abs/2410.13848) Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, Ping Luo. 1. **JinaCLIP** (from Jina AI) released with the paper [Jina CLIP: Your CLIP Model Is Also Your Text Retriever](https://arxiv.org/abs/2405.20204) by Andreas Koukounas, Georgios Mastrapas, Michael Günther, Bo Wang, Scott Martens, Isabelle Mohr, Saba Sturua, Mohammad Kalim Akram, Joan Fontanals Martínez, Saahil Ognawala, Susana Guzman, Maximilian Werk, Nan Wang, Han Xiao. diff --git a/scripts/quantize.py b/scripts/quantize.py index 554a064d1..115207113 100644 --- a/scripts/quantize.py +++ b/scripts/quantize.py @@ -36,6 +36,27 @@ class QuantMode(Enum): QUANTIZE_OPTIONS = tuple(x.value for x in QuantMode) +# A list of operators that, when detected in a model, should select QUInt8 as the weight type for 8-bit quantization. +QUINT8_OPS = ( + # NOTE: + # As of 2024/11/29, the latest version of onnxruntime-web is 1.20.1, and does not support INT8 weights for Conv layers. + # If you attempt to run a model with INT8 weights for Conv layers, you will get an error like: + # `Can't create a session. 
ERROR_CODE: 9, ERROR_MESSAGE: Could not find an implementation for ConvInteger(10) node with name '/.../Conv_quant'` + # + # For this reason, we choose model weight types to ensure compatibility with onnxruntime-web. + # + # As per docs, the signed weight type (QInt8) is faster on most CPUs, so we use it unless the model contains one of the operators listed below. + # For more information, see: + # - https://github.com/microsoft/onnxruntime/issues/3130#issuecomment-1105200621 + # - https://github.com/microsoft/onnxruntime/issues/2339 + "Conv", + + # Models produced by onnxruntime-genai contain optimized operators that perform better with QUInt8 weights. + "GroupQueryAttention", + "MultiHeadAttention", + + # TODO: "SimplifiedLayerNormalization", "SkipSimplifiedLayerNormalization" +) @dataclass class IOArguments: @@ -326,20 +347,11 @@ def quantize(input_folder, output_folder, quantization_args: QuantizationArgumen elif mode in (QuantMode.Q8, QuantMode.QI8, QuantMode.QU8): if mode == QuantMode.Q8: - # NOTE: - # As of 2024/06/28, the current latest version of onnxruntime-web is 1.18.0, and does not support INT8 weights for Conv layers. - # If you attempt to run a model with INT8 weights for Conv layers, you will get an error like: - # `Can't create a session. ERROR_CODE: 9, ERROR_MESSAGE: Could not find an implementation for ConvInteger(10) node with name '/.../Conv_quant'` - # - # For this reason, we choose model weight types to ensure compatibility with onnxruntime-web. - # - # As per docs, signed weight type (QInt8) is faster on most CPUs, so, we use that unless the model contains a Conv layer. - # For more information, see: - # - https://github.com/microsoft/onnxruntime/issues/3130#issuecomment-1105200621 - # - https://github.com/microsoft/onnxruntime/issues/2339 op_types = get_operators(model) weight_type = ( - QuantType.QUInt8 if "Conv" in op_types else QuantType.QInt8 + QuantType.QUInt8 + if any(x in QUINT8_OPS for x in op_types) + else QuantType.QInt8 ) elif mode == QuantMode.QI8: diff --git a/src/configs.js b/src/configs.js index 2c277aeb1..97393faaa 100644 --- a/src/configs.js +++ b/src/configs.js @@ -69,6 +69,7 @@ function getNormalizedConfig(config) { case 'paligemma': case 'florence2': case 'llava_onevision': + case 'idefics3': init_normalized_config = getNormalizedConfig(config.text_config); break; case 'moondream1': @@ -382,6 +383,6 @@ export class AutoConfig { * See https://onnxruntime.ai/docs/tutorials/web/env-flags-and-session-options.html#freedimensionoverrides * for more information. * @property {import('./utils/devices.js').DeviceType} [device] The default device to use for the model. - * @property {import('./utils/dtypes.js').DataType} [dtype] The default data type to use for the model. + * @property {import('./utils/dtypes.js').DataType|Record<string, import('./utils/dtypes.js').DataType>} [dtype] The default data type to use for the model. * @property {boolean|Record<string, boolean>} [use_external_data_format=false] Whether to load the model using the external data format (used for models >= 2GB in size). 
*/ diff --git a/src/models.js b/src/models.js index 7133e64d5..3fed23ef9 100644 --- a/src/models.js +++ b/src/models.js @@ -182,6 +182,22 @@ async function getSession(pretrained_model_name_or_path, fileName, options) { } } + if (dtype === DATA_TYPES.auto) { + // Try to choose the auto dtype based on the custom config + let config_dtype = custom_config.dtype; + if (typeof config_dtype !== 'string') { + config_dtype = config_dtype[fileName]; + } + + if (config_dtype && config_dtype !== DATA_TYPES.auto && DATA_TYPES.hasOwnProperty(config_dtype)) { + // Defined by the custom config, and is not "auto" + dtype = config_dtype; + } else { + // Choose default dtype based on device, falling back to fp32 + dtype = DEFAULT_DEVICE_DTYPE_MAPPING[selectedDevice] ?? DATA_TYPES.fp32; + } + } + const selectedDtype = /** @type {import("./utils/dtypes.js").DataType} */(dtype); if (!DEFAULT_DTYPE_SUFFIX_MAPPING.hasOwnProperty(selectedDtype)) { @@ -387,9 +403,17 @@ async function sessionRun(session, inputs) { output = replaceTensors(output); return output; } catch (e) { + // Error messages can be long (nested) and uninformative. For this reason, + // we apply minor formatting to show the most important information + const formatted = Object.fromEntries(Object.entries(checkedInputs) + .map(([k, { type, dims, data }]) => [k, { + // Extract these properties from the underlying ORT tensor + type, dims, data, + }])); + // This usually occurs when the inputs are of the wrong type. console.error(`An error occurred during model execution: "${e}".`); - console.error('Inputs given to model:', checkedInputs) + console.error('Inputs given to model:', formatted); throw e; } } @@ -546,6 +570,39 @@ async function decoderForward(self, model_inputs, is_encoder_decoder = false) { } + +function default_merge_input_ids_with_image_features({ + image_token_id, + inputs_embeds, + image_features, + input_ids, + attention_mask, +}) { + const image_tokens = input_ids.tolist().map(ids => + ids.reduce((acc, x, idx) => { + if (x == image_token_id) acc.push(idx); + return acc; + }, []) + ); + const n_image_tokens = image_tokens.reduce((acc, x) => acc + x.length, 0); + const n_image_features = image_features.dims[0]; + if (n_image_tokens !== n_image_features) { + throw new Error(`Image features and image tokens do not match: tokens: ${n_image_tokens}, features ${n_image_features}`); + } + + // Equivalent to performing a masked_scatter + let img = 0; + for (let i = 0; i < image_tokens.length; ++i) { + const tokens = image_tokens[i]; + const embeds = inputs_embeds[i]; + for (let j = 0; j < tokens.length; ++j) { + embeds[tokens[j]].data.set(image_features[img++].data) + } + } + return { inputs_embeds, attention_mask } +} + + /** * Forward pass of an image-text-to-text model. * @param {Object} self The image-text-to-text model model. 
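// --- Illustrative sketch (not part of the patch) ---
// A minimal, framework-free illustration of what `default_merge_input_ids_with_image_features`
// (added in the hunk above) does: for every occurrence of the image placeholder token, the next
// row of `image_features` is scattered into the matching slot of `inputs_embeds`, equivalent to
// a masked_scatter. Plain nested arrays and the token id 7 are assumptions for this example only;
// the real implementation operates on transformers.js Tensor objects.
const IMAGE_TOKEN_ID = 7;                                  // hypothetical placeholder id
const input_ids = [[5, 7, 7, 9]];                          // one sequence containing two image slots
const inputs_embeds = [[[0, 0], [0, 0], [0, 0], [0, 0]]];  // [batch, seq_len, hidden_size]
const image_features = [[0.1, 0.2], [0.3, 0.4]];           // [num_image_tokens, hidden_size]

let img = 0;
for (let i = 0; i < input_ids.length; ++i) {
    for (let j = 0; j < input_ids[i].length; ++j) {
        if (input_ids[i][j] === IMAGE_TOKEN_ID) {
            // Overwrite the placeholder embedding with the next image feature row
            inputs_embeds[i][j] = image_features[img++];
        }
    }
}
console.log(inputs_embeds[0]); // [[0, 0], [0.1, 0.2], [0.3, 0.4], [0, 0]]
// --- End of illustrative sketch ---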
@@ -3304,8 +3361,8 @@ export class VisionEncoderDecoderModel extends PreTrainedModel { export class LlavaPreTrainedModel extends PreTrainedModel { forward_params = [ 'input_ids', - 'pixel_values', 'attention_mask', + 'pixel_values', 'position_ids', 'past_key_values', ]; @@ -3487,6 +3544,46 @@ export class Florence2ForConditionalGeneration extends Florence2PreTrainedModel return decoder_outputs; } } + + +////////////////////////////////////////////////// +// Idefics3 Models +export class Idefics3PreTrainedModel extends PreTrainedModel { + forward_params = [ + 'input_ids', + 'attention_mask', + 'pixel_values', + 'pixel_attention_mask', + 'position_ids', + 'past_key_values', + ]; +} + +/** + * The Idefics3 model, which consists of a vision backbone and a language model. + */ +export class Idefics3ForConditionalGeneration extends Idefics3PreTrainedModel { + + async encode_image({ pixel_values, pixel_attention_mask }) { + const features = (await sessionRun(this.sessions['vision_encoder'], { pixel_values, pixel_attention_mask })).image_features; + return features; + } + + _merge_input_ids_with_image_features(kwargs) { + const vision_hidden_size = kwargs.image_features.dims.at(-1); + const reshaped_image_hidden_states = kwargs.image_features.view(-1, vision_hidden_size); + + return default_merge_input_ids_with_image_features({ + // @ts-ignore + image_token_id: this.config.image_token_id, + ...kwargs, + image_features: reshaped_image_hidden_states, + }) + } +} +////////////////////////////////////////////////// + +////////////////////////////////////////////////// export class CLIPPreTrainedModel extends PreTrainedModel { } /** @@ -4280,36 +4377,12 @@ export class Qwen2VLForConditionalGeneration extends Qwen2VLPreTrainedModel { return features; } - _merge_input_ids_with_image_features({ - inputs_embeds, - image_features, - input_ids, - attention_mask, - }) { - // @ts-ignore - const { image_token_id } = this.config; - const image_tokens = input_ids.tolist().map(ids => - ids.reduce((acc, x, idx) => { - if (x == image_token_id) acc.push(idx); - return acc; - }, []) - ); - const n_image_tokens = image_tokens.reduce((acc, x) => acc + x.length, 0); - const n_image_features = image_features.dims[0]; - if (n_image_tokens !== n_image_features) { - throw new Error(`Image features and image tokens do not match: tokens: ${n_image_tokens}, features ${n_image_features}`); - } - - // Equivalent to performing a masked_scatter - let img = 0; - for (let i = 0; i < image_tokens.length; ++i) { - const tokens = image_tokens[i]; - const embeds = inputs_embeds[i]; - for (let j = 0; j < tokens.length; ++j) { - embeds[tokens[j]].data.set(image_features[img++].data) - } - } - return { inputs_embeds, attention_mask } + _merge_input_ids_with_image_features(kwargs) { + return default_merge_input_ids_with_image_features({ + // @ts-ignore + image_token_id: this.config.image_token_id, + ...kwargs + }) } prepare_inputs_for_generation(input_ids, model_inputs, generation_config) { @@ -6914,6 +6987,7 @@ const MODEL_FOR_QUESTION_ANSWERING_MAPPING_NAMES = new Map([ const MODEL_FOR_VISION_2_SEQ_MAPPING_NAMES = new Map([ ['vision-encoder-decoder', ['VisionEncoderDecoderModel', VisionEncoderDecoderModel]], + ['idefics3', ['Idefics3ForConditionalGeneration', Idefics3ForConditionalGeneration]], ]); const MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES = new Map([ @@ -6922,6 +6996,7 @@ const MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES = new Map([ ['moondream1', ['Moondream1ForConditionalGeneration', Moondream1ForConditionalGeneration]], ['florence2', 
['Florence2ForConditionalGeneration', Florence2ForConditionalGeneration]], ['qwen2-vl', ['Qwen2VLForConditionalGeneration', Qwen2VLForConditionalGeneration]], + ['idefics3', ['Idefics3ForConditionalGeneration', Idefics3ForConditionalGeneration]], ]); const MODEL_FOR_DOCUMENT_QUESTION_ANSWERING_MAPPING_NAMES = new Map([ diff --git a/src/models/idefics3/image_processing_idefics3.js b/src/models/idefics3/image_processing_idefics3.js new file mode 100644 index 000000000..0da6c2cc7 --- /dev/null +++ b/src/models/idefics3/image_processing_idefics3.js @@ -0,0 +1,219 @@ + + +import { + ImageProcessor, +} from "../../base/image_processors_utils.js"; +import { cat, full, interpolate_4d, stack } from "../../utils/tensor.js"; + +export class Idefics3ImageProcessor extends ImageProcessor { + constructor(config) { + super(config); + + this.do_image_splitting = config.do_image_splitting ?? true; + this.max_image_size = config.max_image_size; + } + + /** + * @typedef {import('../../utils/image.js').RawImage} RawImage + * @typedef {import('../../utils/tensor.js').Tensor} Tensor + */ + + /** + * Calculate size to resize images to, to be multiples of `vision_encoder_max_size` while preserving the aspect ratio. + * @param {Tensor} pixel_values Tensor of the image to resize. + * @param {number} vision_encoder_max_size Maximum size of the output image. If the image is larger than this size, + * it will be split into patches of this size, and the original image will be concatenated with the patches, resized to max_size. + */ + get_resize_for_vision_encoder(pixel_values, vision_encoder_max_size) { + let [height, width] = pixel_values.dims.slice(-2); + + const aspect_ratio = width / height; + if (width >= height) { + width = Math.ceil(width / vision_encoder_max_size) * vision_encoder_max_size; + height = Math.floor(width / aspect_ratio); + height = Math.ceil(height / vision_encoder_max_size) * vision_encoder_max_size; + } else { + height = Math.ceil(height / vision_encoder_max_size) * vision_encoder_max_size; + width = Math.floor(height * aspect_ratio); + width = Math.ceil(width / vision_encoder_max_size) * vision_encoder_max_size; + } + return { height, width }; + } + + /** @param {RawImage|RawImage[]|RawImage[][]} images */ + async _call(images, { + do_image_splitting = null, + return_row_col_info = false, + } = {}) { + + /** @type {RawImage[][]} */ + let batched_2d_images; + if (!Array.isArray(images)) { + batched_2d_images = [[images]]; + } else { + if (images.length === 0 || !images[0]) { + throw new Error("No images provided."); + } + if (!Array.isArray(images[0])) { + batched_2d_images = [/** @type {RawImage[]} */(images)]; + } else { + batched_2d_images = /** @type {RawImage[][]} */(images); + } + } + + // List of tensors, each with shape [patches, channels, height, width] + let all_pixel_values = []; + let images_list_rows = []; + let images_list_cols = []; + + const original_sizes = []; + const reshaped_input_sizes = []; + for (const image_batch of batched_2d_images) { + + let images_list = await Promise.all(image_batch.map(x => this.preprocess(x))); + + // Original sizes of images + original_sizes.push(...images_list.map(x => x.original_size)); + + // Reshaped sizes of images, before padding or cropping + reshaped_input_sizes.push(...images_list.map(x => x.reshaped_input_size)); + + // Convert images to 4D tensors for easier processing + images_list.forEach(x => x.pixel_values.unsqueeze_(0)); + + const { longest_edge } = this.max_image_size; + + /** @type {Tensor[]} */ + let images_tensor; + if 
(do_image_splitting ?? this.do_image_splitting) { + let image_rows = new Array(images_list.length); + let image_cols = new Array(images_list.length); + + // We first resize both height and width of each image to the nearest max_image_size multiple, disregarding the aspect ratio + images_tensor = await Promise.all(images_list.map(async (x, i) => { + const new_size = this.get_resize_for_vision_encoder(x.pixel_values, longest_edge); + + const resized = await interpolate_4d(x.pixel_values, { + size: [new_size.height, new_size.width], + }); + + const { frames, num_splits_h, num_splits_w } = await this.split_image(resized, this.max_image_size); + image_rows[i] = num_splits_h; + image_cols[i] = num_splits_w; + return cat(frames, 0); + })); + + images_list_rows.push(image_rows); + images_list_cols.push(image_cols); + + } else { + /** @type {[number, number]} */ + const size = [longest_edge, longest_edge]; + images_tensor = await Promise.all( + images_list.map(x => interpolate_4d(x.pixel_values, { size })) + ); + + images_list_rows.push(new Array(images_list.length).fill(0)); + images_list_cols.push(new Array(images_list.length).fill(0)); + } + + all_pixel_values.push(cat(images_tensor, 0)); + } + + const batch_size = all_pixel_values.length; + const [n, c, h, w] = all_pixel_values[0].dims; + + // Stack pixel values + let pixel_values; + let pixel_attention_mask; + if (batch_size === 1) { + pixel_values = all_pixel_values[0].unsqueeze_(0); + pixel_attention_mask = full([batch_size, n, h, w], true); + } else { + // Add padding (if necessary) to images with less patches than the maximum number of patches + const max_num_patches = Math.max(...all_pixel_values.map(x => x.dims.at(0))); + + pixel_attention_mask = full([batch_size, max_num_patches, h, w], true); + const pixel_attention_mask_data = pixel_attention_mask.data; + const pixel_attention_mask_stride = max_num_patches * h * w; + for (let i = 0; i < batch_size; ++i) { + const num_patches = all_pixel_values[i].dims[0]; + if (num_patches < max_num_patches) { + all_pixel_values[i] = cat([ + all_pixel_values[i], + full([max_num_patches - num_patches, c, h, w], 0), + ], 0); + + const start_offset = i * pixel_attention_mask_stride + num_patches * h * w; + const end_offset = (i + 1) * pixel_attention_mask_stride; + pixel_attention_mask_data.fill(false, start_offset, end_offset); + } + } + pixel_values = stack(all_pixel_values, 0); + } + + return { + pixel_values, + pixel_attention_mask, + + original_sizes, + reshaped_input_sizes, + ...( + return_row_col_info + ? 
{ rows: images_list_rows, cols: images_list_cols } + : {} + ), + } + } + + async split_image(pixel_values, { longest_edge }) { + const max_height = longest_edge; + const max_width = longest_edge; + + const frames = []; + + const [height, width] = pixel_values.dims.slice(-2); + + let num_splits_h = 0, num_splits_w = 0; + + if (height > max_height || width > max_width) { + // Calculate the number of splits + num_splits_h = Math.ceil(height / max_height); + num_splits_w = Math.ceil(width / max_width); + + // Calculate the optimal width and height for the sub-images + const optimal_height = Math.ceil(height / num_splits_h); + const optimal_width = Math.ceil(width / num_splits_w); + + // Iterate through each row and column + for (let r = 0; r < num_splits_h; r++) { + for (let c = 0; c < num_splits_w; c++) { + // Calculate the starting point of the crop + const start_x = c * optimal_width; + const start_y = r * optimal_height; + + // Calculate the ending point of the crop + const end_x = Math.min(start_x + optimal_width, width); + const end_y = Math.min(start_y + optimal_height, height); + + // Crop the image + frames.push(pixel_values.slice(null, null, [start_y, end_y], [start_x, end_x])); + } + } + + // Resize the global image to match max dimensions for memory efficiency + const global_image_height = max_height; + const global_image_width = max_width; + + if (height !== global_image_height || width !== global_image_width) { + pixel_values = await interpolate_4d(pixel_values, { + size: [global_image_height, global_image_width], + }) + } + } + + frames.push(pixel_values); + + return { frames, num_splits_h, num_splits_w }; + } +} diff --git a/src/models/idefics3/processing_idefics3.js b/src/models/idefics3/processing_idefics3.js new file mode 100644 index 000000000..1bec97b32 --- /dev/null +++ b/src/models/idefics3/processing_idefics3.js @@ -0,0 +1,136 @@ + +import { Processor } from "../../base/processing_utils.js"; +import { AutoImageProcessor } from "../auto/image_processing_auto.js"; +import { AutoTokenizer } from "../../tokenizers.js"; +import { RawImage } from "../../utils/image.js"; +import { count } from "../../utils/core.js"; + +/** + * Prompt with expanded image tokens for when the image is split into patches. + * @private + */ +function _prompt_split_image(image_seq_len, image_rows, image_cols, fake_token_around_image, image_token, global_img_token) { + let text_split_images = ""; + for (let n_h = 0; n_h < image_rows; ++n_h) { + for (let n_w = 0; n_w < image_cols; ++n_w) { + text_split_images += ( + fake_token_around_image + + `<row_${n_h + 1}_col_${n_w + 1}>` + + image_token.repeat(image_seq_len) + ); + } + text_split_images += "\n"; + } + + text_split_images += ( + `\n${fake_token_around_image}` + + `${global_img_token}` + + image_token.repeat(image_seq_len) + + `${fake_token_around_image}` + ); + return text_split_images; +} + +/** + * Prompt with expanded image tokens for a single image. 
+ * @private */ +function _prompt_single_image(image_seq_len, fake_token_around_image, image_token, global_img_token) { + return ( + `${fake_token_around_image}` + + `${global_img_token}` + + image_token.repeat(image_seq_len) + + `${fake_token_around_image}` + ); +} + +function get_image_prompt_string(image_rows, image_cols, image_seq_len, fake_token_around_image, image_token, global_img_token) { + if (image_rows === 0 && image_cols === 0) { + return _prompt_single_image( + image_seq_len, + fake_token_around_image, + image_token, + global_img_token + ); + } + return _prompt_split_image( + image_seq_len, image_rows, image_cols, fake_token_around_image, image_token, global_img_token + ); +} + + +export class Idefics3Processor extends Processor { + static image_processor_class = AutoImageProcessor + static tokenizer_class = AutoTokenizer + static uses_processor_config = true; + + fake_image_token = "<fake_token_around_image>"; + image_token = "<image>"; + global_img_token = "<global-img>"; + + /** + * + * @param {string|string[]} text + * @param {RawImage|RawImage[]|RawImage[][]} images + * @returns {Promise} + */ + async _call(text, images = null, options = {}) { + options.return_row_col_info ??= true; + + let image_inputs = {}; + + if (images) { + image_inputs = await this.image_processor(images, options); + } + + // NOTE: We assume text is present + if (!Array.isArray(text)) { + text = [text]; + } + + const image_rows = image_inputs.rows ?? [new Array(text.length).fill(0)]; + const image_cols = image_inputs.cols ?? [new Array(text.length).fill(0)]; + + const image_seq_len = this.config.image_seq_len; + const n_images_in_text = [] + const prompt_strings = []; + for (let i = 0; i < text.length; ++i) { + const sample = text[i]; + const sample_rows = image_rows[i]; + const sample_cols = image_cols[i]; + + n_images_in_text.push(count(sample, this.image_token)); + + // Replace the image token with fake tokens around the expanded image token sequence of length `image_seq_len` + const image_prompt_strings = sample_rows.map( + (n_rows, j) => get_image_prompt_string( + n_rows, + sample_cols[j], + image_seq_len, + this.fake_image_token, + this.image_token, + this.global_img_token, + ) + ); + + const split_sample = sample.split(this.image_token); + if (split_sample.length === 0) { + throw new Error("The image token should be present in the text."); + } + + // Splice the image prompt strings in where the image tokens were + let new_sample = split_sample[0]; + for (let j = 0; j < image_prompt_strings.length; ++j) { + new_sample += image_prompt_strings[j] + split_sample[j + 1]; + } + prompt_strings.push(new_sample); + } + + const text_inputs = this.tokenizer(prompt_strings); + + return { + ...text_inputs, + ...image_inputs, + } + } +} diff --git a/src/models/image_processors.js b/src/models/image_processors.js index 64c529742..02815771c 100644 --- a/src/models/image_processors.js +++ b/src/models/image_processors.js @@ -10,6 +10,7 @@ export * from './donut/image_processing_donut.js' export * from './dpt/image_processing_dpt.js' export * from './efficientnet/image_processing_efficientnet.js' export * from './glpn/image_processing_glpn.js' +export * from './idefics3/image_processing_idefics3.js' export * from './janus/image_processing_janus.js' export * from './jina_clip/image_processing_jina_clip.js' export * from './llava_onevision/image_processing_llava_onevision.js' diff --git a/src/models/processors.js b/src/models/processors.js index cc96cd7e9..165ebb345 100644 --- a/src/models/processors.js +++ b/src/models/processors.js @@ -1,5 +1,6 @@ export * 
from './florence2/processing_florence2.js'; export * from './mgp_str/processing_mgp_str.js'; +export * from './idefics3/processing_idefics3.js'; export * from './janus/processing_janus.js'; export * from './jina_clip/processing_jina_clip.js'; export * from './owlvit/processing_owlvit.js'; diff --git a/src/utils/core.js b/src/utils/core.js index 3cea50782..aa745af4b 100644 --- a/src/utils/core.js +++ b/src/utils/core.js @@ -187,3 +187,17 @@ export function len(s) { for (const c of s) ++length; return length; } + +/** + * Count the occurrences of a value in an array or string. + * This mimics the behavior of Python's `count` method. + * @param {any[]|string} arr The array or string to search. + * @param {any} value The value to count. + */ +export function count(arr, value) { + let count = 0; + for (const v of arr) { + if (v === value) ++count; + } + return count; +} diff --git a/src/utils/dtypes.js b/src/utils/dtypes.js index fa6d94be5..0d6e190e0 100644 --- a/src/utils/dtypes.js +++ b/src/utils/dtypes.js @@ -31,6 +31,7 @@ export const isWebGpuFp16Supported = (function () { })(); export const DATA_TYPES = Object.freeze({ + auto: 'auto', // Auto-detect based on environment fp32: 'fp32', fp16: 'fp16', q8: 'q8', @@ -47,7 +48,7 @@ export const DEFAULT_DEVICE_DTYPE_MAPPING = Object.freeze({ [DEVICE_TYPES.wasm]: DATA_TYPES.q8, }); -/** @type {Record<DataType, string>} */ +/** @type {Record<Exclude<DataType, "auto">, string>} */ export const DEFAULT_DTYPE_SUFFIX_MAPPING = Object.freeze({ [DATA_TYPES.fp32]: '', [DATA_TYPES.fp16]: '_fp16', diff --git a/src/utils/image.js b/src/utils/image.js index 04562592a..37c812561 100644 --- a/src/utils/image.js +++ b/src/utils/image.js @@ -794,3 +794,8 @@ export class RawImage { }); } } + +/** + * Helper function to load an image from a URL, path, etc. + */ +export const load_image = RawImage.read.bind(RawImage); diff --git a/src/utils/tensor.js b/src/utils/tensor.js index dec65d1d7..8b8133770 100644 --- a/src/utils/tensor.js +++ b/src/utils/tensor.js @@ -32,6 +32,8 @@ const DataTypeMap = Object.freeze({ int64: BigInt64Array, uint64: BigUint64Array, bool: Uint8Array, + uint4: Uint8Array, + int4: Int8Array, }); /** @@ -1353,7 +1355,7 @@ function fullHelper(size, fill_value, dtype, cls) { /** * Creates a tensor of size size filled with fill_value. The tensor's dtype is inferred from fill_value. * @param {number[]} size A sequence of integers defining the shape of the output tensor. - * @param {number|bigint} fill_value The value to fill the output tensor with. + * @param {number|bigint|boolean} fill_value The value to fill the output tensor with. * @returns {Tensor} The filled tensor. 
*/ export function full(size, fill_value) { @@ -1365,6 +1367,9 @@ export function full(size, fill_value) { } else if (typeof fill_value === 'bigint') { dtype = 'int64'; typedArrayCls = BigInt64Array; + } else if (typeof fill_value === 'boolean') { + dtype = 'bool'; + typedArrayCls = Uint8Array; } else { // TODO: support other dtypes throw new Error(`Unsupported data type: ${typeof fill_value}`); diff --git a/tests/init.js b/tests/init.js index 0783632c5..a52fe2cf2 100644 --- a/tests/init.js +++ b/tests/init.js @@ -58,7 +58,7 @@ export function init() { } export const MAX_MODEL_LOAD_TIME = 15_000; // 15 seconds -export const MAX_TEST_EXECUTION_TIME = 30_000; // 30 seconds +export const MAX_TEST_EXECUTION_TIME = 60_000; // 60 seconds export const MAX_MODEL_DISPOSE_TIME = 1_000; // 1 second export const MAX_TEST_TIME = MAX_MODEL_LOAD_TIME + MAX_TEST_EXECUTION_TIME + MAX_MODEL_DISPOSE_TIME; diff --git a/tests/processors.test.js b/tests/processors.test.js index cafcc9f2a..a17cd4fc3 100644 --- a/tests/processors.test.js +++ b/tests/processors.test.js @@ -1,5 +1,5 @@ import { env, AutoProcessor, AutoImageProcessor, RawImage } from "../src/transformers.js"; -import { init, MAX_TEST_EXECUTION_TIME } from "./init.js"; +import { init, MAX_TEST_TIME } from "./init.js"; import { compare } from "./test_utils.js"; // Initialise the testing environment @@ -47,27 +47,30 @@ const MODELS = { // efficientnet: 'Xenova/efficientnet-b0', florence2: "Xenova/tiny-random-Florence2ForConditionalGeneration", qwen2_vl: "hf-internal-testing/tiny-random-Qwen2VLForConditionalGeneration", + idefics3: "hf-internal-testing/tiny-random-Idefics3ForConditionalGeneration", }; +const BASE_URL = "https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/"; const TEST_IMAGES = { - white_image: "https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/white-image.png", - pattern_3x3: "https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/pattern_3x3.png", - pattern_3x5: "https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/pattern_3x5.png", - checkerboard_8x8: "https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/checkerboard_8x8.png", - checkerboard_64x32: "https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/checkerboard_64x32.png", - receipt: "https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/receipt.png", - tiger: "https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/tiger.jpg", - paper: "https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/nougat_paper.png", - cats: "https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/cats.jpg", + white_image: BASE_URL + "white-image.png", + pattern_3x3: BASE_URL + "pattern_3x3.png", + pattern_3x5: BASE_URL + "pattern_3x5.png", + checkerboard_8x8: BASE_URL + "checkerboard_8x8.png", + checkerboard_64x32: BASE_URL + "checkerboard_64x32.png", + gradient_1280x640: BASE_URL + "gradient_1280x640.png", + receipt: BASE_URL + "receipt.png", + tiger: BASE_URL + "tiger.jpg", + paper: BASE_URL + "nougat_paper.png", + cats: BASE_URL + "cats.jpg", // grayscale image skateboard: "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/ml-web-games/skateboard.png", - vitmatte_image: "https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/vitmatte_image.png", - vitmatte_trimap: "https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/vitmatte_trimap.png", + 
vitmatte_image: BASE_URL + "vitmatte_image.png", + vitmatte_trimap: BASE_URL + "vitmatte_trimap.png", - beetle: "https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/beetle.png", - book_cover: "https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/book-cover.png", + beetle: BASE_URL + "beetle.png", + book_cover: BASE_URL + "book-cover.png", }; describe("Processors", () => { @@ -96,7 +99,7 @@ describe("Processors", () => { compare(avg(pixel_values.data), 0.5); } }, - MAX_TEST_EXECUTION_TIME, + MAX_TEST_TIME, ); // SamProcessor/SamImageProcessor @@ -168,7 +171,7 @@ describe("Processors", () => { compare(input_boxes.tolist(), [[[0, 341.3333, 682.6667, 682.6667]]]); } }, - MAX_TEST_EXECUTION_TIME, + MAX_TEST_TIME, ); // DonutProcessor/DonutFeatureExtractor @@ -190,7 +193,7 @@ describe("Processors", () => { compare(reshaped_input_sizes, [[1280, 853]]); } }, - MAX_TEST_EXECUTION_TIME, + MAX_TEST_TIME, ); // ConvNextFeatureExtractor @@ -210,7 +213,7 @@ describe("Processors", () => { compare(reshaped_input_sizes, [[224, 224]]); } }, - MAX_TEST_EXECUTION_TIME, + MAX_TEST_TIME, ); // ViTFeatureExtractor @@ -230,7 +233,7 @@ describe("Processors", () => { compare(reshaped_input_sizes, [[224, 224]]); } }, - MAX_TEST_EXECUTION_TIME, + MAX_TEST_TIME, ); // MobileViTFeatureExtractor @@ -250,7 +253,7 @@ describe("Processors", () => { compare(reshaped_input_sizes, [[256, 256]]); } }, - MAX_TEST_EXECUTION_TIME, + MAX_TEST_TIME, ); // MobileViTFeatureExtractor @@ -272,7 +275,7 @@ describe("Processors", () => { compare(reshaped_input_sizes, [[28, 28]]); } }, - MAX_TEST_EXECUTION_TIME, + MAX_TEST_TIME, ); // MobileViTImageProcessor @@ -296,7 +299,7 @@ describe("Processors", () => { compare(pixel_values.data.slice(0, 3), [0.24313725531101227, 0.250980406999588, 0.364705890417099]); } }, - MAX_TEST_EXECUTION_TIME, + MAX_TEST_TIME, ); // DeiTFeatureExtractor @@ -316,7 +319,7 @@ describe("Processors", () => { compare(reshaped_input_sizes, [[224, 224]]); } }, - MAX_TEST_EXECUTION_TIME, + MAX_TEST_TIME, ); // BeitFeatureExtractor @@ -336,7 +339,7 @@ describe("Processors", () => { compare(reshaped_input_sizes, [[224, 224]]); } }, - MAX_TEST_EXECUTION_TIME, + MAX_TEST_TIME, ); // DetrFeatureExtractor @@ -359,7 +362,7 @@ describe("Processors", () => { compare(avg(pixel_mask.data), 1); } }, - MAX_TEST_EXECUTION_TIME, + MAX_TEST_TIME, ); // YolosFeatureExtractor @@ -379,7 +382,7 @@ describe("Processors", () => { compare(reshaped_input_sizes, [[888, 1333]]); } }, - MAX_TEST_EXECUTION_TIME, + MAX_TEST_TIME, ); // DPTFeatureExtractor @@ -400,7 +403,7 @@ describe("Processors", () => { compare(reshaped_input_sizes, [[384, 384]]); } }, - MAX_TEST_EXECUTION_TIME, + MAX_TEST_TIME, ); // GLPNForDepthEstimation @@ -432,7 +435,7 @@ describe("Processors", () => { compare(reshaped_input_sizes, [[384, 608]]); } }, - MAX_TEST_EXECUTION_TIME, + MAX_TEST_TIME, ); // NougatImageProcessor @@ -453,7 +456,7 @@ describe("Processors", () => { compare(reshaped_input_sizes, [[833, 672]]); } }, - MAX_TEST_EXECUTION_TIME, + MAX_TEST_TIME, ); // OwlViTFeatureExtractor @@ -489,7 +492,7 @@ describe("Processors", () => { compare(reshaped_input_sizes, [[224, 224]]); } }, - MAX_TEST_EXECUTION_TIME, + MAX_TEST_TIME, ); // JinaCLIPImageProcessor @@ -510,7 +513,7 @@ describe("Processors", () => { compare(reshaped_input_sizes, [[512, 512]]); } }, - MAX_TEST_EXECUTION_TIME, + MAX_TEST_TIME, ); // VitMatteImageProcessor @@ -561,7 +564,7 @@ describe("Processors", () => { compare(reshaped_input_sizes, [[5, 3]]); } }, - 
MAX_TEST_EXECUTION_TIME, + MAX_TEST_TIME, ); // BitImageProcessor @@ -581,7 +584,7 @@ describe("Processors", () => { compare(reshaped_input_sizes, [[224, 224]]); } }, - MAX_TEST_EXECUTION_TIME, + MAX_TEST_TIME, ); // DPTImageProcessor @@ -616,7 +619,7 @@ describe("Processors", () => { compare(reshaped_input_sizes, [[252, 518]]); } }, - MAX_TEST_EXECUTION_TIME, + MAX_TEST_TIME, ); // TODO: Add back @@ -635,9 +638,9 @@ describe("Processors", () => { // compare(original_sizes, [[480, 640]]); // compare(reshaped_input_sizes, [[224, 224]]); // } - // }, MAX_TEST_EXECUTION_TIME); + // }, MAX_TEST_TIME); - // Qwen2VLProcessor + // Qwen2VLImageProcessor // - custom image processing (min_pixels, max_pixels) it( MODELS.qwen2_vl, @@ -656,7 +659,103 @@ describe("Processors", () => { compare(reshaped_input_sizes, [[224, 224]]); } }, - MAX_TEST_EXECUTION_TIME, + MAX_TEST_TIME, + ); + + // Idefics3ImageProcessor + // - custom image processing (patching) + it( + MODELS.idefics3, + async () => { + const processor = await AutoImageProcessor.from_pretrained(MODELS.idefics3); + + const image = await load_image(TEST_IMAGES.gradient_1280x640); + const image_1 = await image.resize(1600, 1067); + const image_2 = await image.resize(224, 224); + + const white_image = await load_image(TEST_IMAGES.white_image); + const white_image_1 = await white_image.resize(1600, 1067); + const white_image_2 = await white_image.resize(224, 224); + + { + // test no image splitting + const { pixel_values, rows, cols } = await processor(image, { do_image_splitting: false, return_row_col_info: true }); + compare(pixel_values.dims, [1, 1, 3, 364, 364]); + compare( + pixel_values.mean().item(), + -0.001035306602716446, + 0.1, // threshold + ); + compare(rows, [[0]]); + compare(cols, [[0]]); + } + + { + // test batched no image splitting + const { pixel_values, pixel_attention_mask, rows, cols } = await processor([[white_image_1], [white_image_2], [white_image_1, white_image_2]], { do_image_splitting: false, return_row_col_info: true }); + compare(pixel_values.dims, [3, 2, 3, 364, 364]); + compare( + pixel_values.mean().item(), + 2 / 3, + 0.01, // threshold + ); + compare(pixel_attention_mask.dims, [3, 2, 364, 364]); + compare( + pixel_attention_mask.to("float32").mean().item(), + 2 / 3, + 0.001, // threshold + ); + compare(rows, [[0], [0], [0, 0]]); + compare(cols, [[0], [0], [0, 0]]); + + // Test that the order of the pixel attention mask matches the python implementation + compare( + pixel_attention_mask.data.reduce((a, b, i) => a + i * b, 0), + 228217205216, + ); + } + + { + // test correct patching + const { pixel_values, rows, cols } = await processor(image, { return_row_col_info: true }); + compare(pixel_values.dims, [1, 9, 3, 364, 364]); + compare( + pixel_values.flatten(2).mean(2).tolist(), + [[-0.7012196183204651, -0.30104631185531616, 0.09912905097007751, 0.49929487705230713, -0.5011996626853943, -0.10103467106819153, 0.2991456389427185, 0.6993265151977539, -0.0010353063698858023]], + 0.1, // threshold + ); + compare(rows, [[2]]); + compare(cols, [[4]]); + } + + { + // unbatched, single image + const { pixel_values, rows, cols } = await processor(image_1, { return_row_col_info: true }); + compare(pixel_values.dims, [1, 13, 3, 364, 364]); + + compare(rows, [[3]]); + compare(cols, [[4]]); + } + + { + // unbatched, multiple images + const { pixel_values, rows, cols } = await processor([image_1, image_2], { return_row_col_info: true }); + compare(pixel_values.dims, [1, 30, 3, 364, 364]); + + compare(rows, [[3, 4]]); + compare(cols, 
[[4, 4]]); + } + + { + // batched, multiple images + const { pixel_values, rows, cols } = await processor([[image_1], [image_1, image_2]], { return_row_col_info: true }); + compare(pixel_values.dims, [2, 30, 3, 364, 364]); + compare(rows, [[3], [3, 4]]); + compare(cols, [[4], [4, 4]]); + } + }, + // NOTE: We set a higher timeout for this test + 2 * MAX_TEST_TIME, ); }); @@ -681,7 +780,7 @@ describe("Processors", () => { expect(input_features.data[81]).toBeCloseTo(0.10727232694625854); expect(input_features.data[3001]).toBeCloseTo(0.2555035352706909); }, - MAX_TEST_EXECUTION_TIME, + MAX_TEST_TIME, ); it( @@ -716,7 +815,7 @@ describe("Processors", () => { expect(input_values.data[10000]).toBeCloseTo(0.46703237295150757); } }, - MAX_TEST_EXECUTION_TIME, + MAX_TEST_TIME, ); it( @@ -757,7 +856,7 @@ describe("Processors", () => { expect(sum(attention_mask.data)).toEqual(30); } }, - MAX_TEST_EXECUTION_TIME, + MAX_TEST_TIME, ); it( @@ -810,7 +909,7 @@ describe("Processors", () => { expect(input_features.data[64063]).toBeCloseTo(-100.0); } }, - MAX_TEST_EXECUTION_TIME, + MAX_TEST_TIME, ); it( @@ -849,7 +948,7 @@ describe("Processors", () => { expect(input_features.data.at(-1)).toBeCloseTo(-2.2504329681396484); } }, - MAX_TEST_EXECUTION_TIME, + MAX_TEST_TIME, ); }); @@ -1059,7 +1158,7 @@ describe("Processors", () => { } }); }, - MAX_TEST_EXECUTION_TIME, + MAX_TEST_TIME, ); describe( @@ -1095,7 +1194,7 @@ describe("Processors", () => { compare(image_grid_thw.dims, [1, 3]); }); }, - MAX_TEST_EXECUTION_TIME, + MAX_TEST_TIME, ); }); }); diff --git a/tests/tiny_random.test.js b/tests/tiny_random.test.js index bd2fe1c60..72f0e5c53 100644 --- a/tests/tiny_random.test.js +++ b/tests/tiny_random.test.js @@ -10,7 +10,6 @@ import { BertTokenizer, T5Tokenizer, WhisperTokenizer, - BartTokenizer, MarianTokenizer, PreTrainedTokenizer, AutoTokenizer, @@ -20,6 +19,7 @@ import { AutoProcessor, Processor, Florence2Processor, + Idefics3Processor, // Models LlamaForCausalLM, @@ -49,6 +49,7 @@ import { BertForQuestionAnswering, MusicgenForConditionalGeneration, LlavaForConditionalGeneration, + Idefics3ForConditionalGeneration, WhisperForConditionalGeneration, VisionEncoderDecoderModel, Florence2ForConditionalGeneration, @@ -756,6 +757,148 @@ describe("Tiny random models", () => { }); }); + describe("idefics3", () => { + const conversation = [ + { + role: "user", + content: [{ type: "image" }, { type: "text", text: "Can you describe this image?" 
}], + }, + ]; + + // Empty white and black images + const white_image_dims = [224, 224, 3]; + const white_image = new RawImage(new Uint8ClampedArray(white_image_dims[0] * white_image_dims[1] * white_image_dims[2]).fill(255), ...white_image_dims); + const black_image_dims = [720, 360, 3]; + const black_image = new RawImage(new Uint8ClampedArray(black_image_dims[0] * black_image_dims[1] * black_image_dims[2]).fill(0), ...black_image_dims); + + describe("Idefics3ForConditionalGeneration", () => { + const model_id = "hf-internal-testing/tiny-random-Idefics3ForConditionalGeneration"; + + /** @type {Idefics3ForConditionalGeneration} */ + let model; + /** @type {Idefics3Processor} */ + let processor; + /** @type {string} */ + let text; + beforeAll(async () => { + model = await Idefics3ForConditionalGeneration.from_pretrained(model_id, { + // TODO move to config + ...DEFAULT_MODEL_OPTIONS, + }); + processor = await AutoProcessor.from_pretrained(model_id); + + text = processor.apply_chat_template(conversation, { + add_generation_prompt: true, + }); + }, MAX_MODEL_LOAD_TIME); + + it( + "forward w/ image splitting (default)", + async () => { + const inputs = await processor(text, white_image, { + do_image_splitting: true, + }); + + const { logits } = await model(inputs); + expect(logits.dims).toEqual([1, 3041, 128259]); + expect(logits.mean().item()).toBeCloseTo(-0.0002692154666874558, 6); + }, + MAX_TEST_EXECUTION_TIME, + ); + + it( + "forward w/o image splitting", + async () => { + const inputs = await processor(text, white_image, { + do_image_splitting: false, + }); + + const { logits } = await model(inputs); + expect(logits.dims).toEqual([1, 189, 128259]); + expect(logits.mean().item()).toBeCloseTo(-0.00019743280427064747, 6); + }, + MAX_TEST_EXECUTION_TIME, + ); + + it( + "batch_size=1 w/ image splitting", + async () => { + const inputs = await processor(text, white_image, { + do_image_splitting: true, + }); + const generate_ids = await model.generate({ + ...inputs, + max_new_tokens: 10, + + // To obtain unique output tokens, deterministically + repetition_penalty: 2.0, + }); + expect(generate_ids.dims).toEqual([1, 3051]); + + const new_tokens = generate_ids.slice(null, [inputs.input_ids.dims.at(-1), null]); + expect(new_tokens.tolist()).toEqual([[64531n, 121777n, 70370n, 105334n, 12720n, 113356n, 47739n, 59240n, 102001n, 60344n]]); + }, + MAX_TEST_EXECUTION_TIME, + ); + + it( + "batch_size=1 w/o image splitting", + async () => { + const inputs = await processor(text, white_image, { + do_image_splitting: false, + }); + const generate_ids = await model.generate({ + ...inputs, + max_new_tokens: 10, + + // To obtain unique output tokens, deterministically + repetition_penalty: 2.0, + }); + expect(generate_ids.dims).toEqual([1, 199]); + + const new_tokens = generate_ids.slice(null, [inputs.input_ids.dims.at(-1), null]); + expect(new_tokens.tolist()).toEqual([[64531n, 121777n, 70370n, 105334n, 12720n, 113356n, 47739n, 59240n, 59697n, 65246n]]); + }, + MAX_TEST_EXECUTION_TIME, + ); + + it( + "batch_size=1 multi-image w/o image splitting", + async () => { + const multi_image_conversation = [ + { + role: "user", + content: [{ type: "image" }, { type: "image" }, { type: "text", text: "Can you describe these images?" 
}], + }, + ]; + + const multi_image_text = processor.apply_chat_template(multi_image_conversation, { + add_generation_prompt: true, + }); + const inputs = await processor(multi_image_text, [white_image, black_image], { + do_image_splitting: false, + }); + const generate_ids = await model.generate({ + ...inputs, + max_new_tokens: 10, + + // To obtain unique output tokens, deterministically + repetition_penalty: 2.0, + }); + expect(generate_ids.dims).toEqual([1, 374]); + + const new_tokens = generate_ids.slice(null, [inputs.input_ids.dims.at(-1), null]); + expect(new_tokens.tolist()).toEqual([[73189n, 99346n, 113252n, 51743n, 33499n, 66430n, 78739n, 89539n, 121023n, 14474n]]); + }, + MAX_TEST_EXECUTION_TIME, + ); + + afterAll(async () => { + await model?.dispose(); + }, MAX_MODEL_DISPOSE_TIME); + }); + }); + describe("florence2", () => { const texts = ["Describe with a paragraph what is shown in the image.", "Locate the objects with category name in the image."];