1 change: 1 addition & 0 deletions README.md
@@ -337,6 +337,7 @@ You can refine your search by selecting the task you're interested in (e.g., [te
1. **[HerBERT](https://huggingface.co/docs/transformers/model_doc/herbert)** (from Allegro.pl, AGH University of Science and Technology) released with the paper [KLEJ: Comprehensive Benchmark for Polish Language Understanding](https://www.aclweb.org/anthology/2020.acl-main.111.pdf) by Piotr Rybak, Robert Mroczkowski, Janusz Tracz, Ireneusz Gawlik.
1. **[Hiera](https://huggingface.co/docs/transformers/model_doc/hiera)** (from Meta) released with the paper [Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles](https://arxiv.org/pdf/2306.00989) by Chaitanya Ryali, Yuan-Ting Hu, Daniel Bolya, Chen Wei, Haoqi Fan, Po-Yao Huang, Vaibhav Aggarwal, Arkabandhu Chowdhury, Omid Poursaeed, Judy Hoffman, Jitendra Malik, Yanghao Li, Christoph Feichtenhofer.
1. **[Hubert](https://huggingface.co/docs/transformers/model_doc/hubert)** (from Facebook) released with the paper [HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units](https://arxiv.org/abs/2106.07447) by Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed.
+1. **[Idefics3](https://huggingface.co/docs/transformers/model_doc/idefics3)** (from Hugging Face) released with the paper [Building and better understanding vision-language models: insights and future directions](https://arxiv.org/abs/2408.12637) by Hugo Laurençon, Andrés Marafioti, Victor Sanh, Léo Tronchon.
1. **JAIS** (from Core42) released with the paper [Jais and Jais-chat: Arabic-Centric Foundation and Instruction-Tuned Open Generative Large Language Models](https://arxiv.org/pdf/2308.16149) by Neha Sengupta, Sunil Kumar Sahu, Bokang Jia, Satheesh Katipomu, Haonan Li, Fajri Koto, William Marshall, Gurpreet Gosal, Cynthia Liu, Zhiming Chen, Osama Mohammed Afzal, Samta Kamboj, Onkar Pandit, Rahul Pal, Lalit Pradhan, Zain Muhammad Mujahid, Massa Baali, Xudong Han, Sondos Mahmoud Bsharat, Alham Fikri Aji, Zhiqiang Shen, Zhengzhong Liu, Natalia Vassilieva, Joel Hestness, Andy Hock, Andrew Feldman, Jonathan Lee, Andrew Jackson, Hector Xuguang Ren, Preslav Nakov, Timothy Baldwin, Eric Xing.
1. **Janus** (from DeepSeek) released with the paper [Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation](https://arxiv.org/abs/2410.13848) Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, Ping Luo.
1. **JinaCLIP** (from Jina AI) released with the paper [Jina CLIP: Your CLIP Model Is Also Your Text Retriever](https://arxiv.org/abs/2405.20204) by Andreas Koukounas, Georgios Mastrapas, Michael Günther, Bo Wang, Scott Martens, Isabelle Mohr, Saba Sturua, Mohammad Kalim Akram, Joan Fontanals Martínez, Saahil Ognawala, Susana Guzman, Maximilian Werk, Nan Wang, Han Xiao.
1 change: 1 addition & 0 deletions docs/snippets/6_supported-models.snippet
@@ -52,6 +52,7 @@
1. **[HerBERT](https://huggingface.co/docs/transformers/model_doc/herbert)** (from Allegro.pl, AGH University of Science and Technology) released with the paper [KLEJ: Comprehensive Benchmark for Polish Language Understanding](https://www.aclweb.org/anthology/2020.acl-main.111.pdf) by Piotr Rybak, Robert Mroczkowski, Janusz Tracz, Ireneusz Gawlik.
1. **[Hiera](https://huggingface.co/docs/transformers/model_doc/hiera)** (from Meta) released with the paper [Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles](https://arxiv.org/pdf/2306.00989) by Chaitanya Ryali, Yuan-Ting Hu, Daniel Bolya, Chen Wei, Haoqi Fan, Po-Yao Huang, Vaibhav Aggarwal, Arkabandhu Chowdhury, Omid Poursaeed, Judy Hoffman, Jitendra Malik, Yanghao Li, Christoph Feichtenhofer.
1. **[Hubert](https://huggingface.co/docs/transformers/model_doc/hubert)** (from Facebook) released with the paper [HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units](https://arxiv.org/abs/2106.07447) by Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed.
+1. **[Idefics3](https://huggingface.co/docs/transformers/model_doc/idefics3)** (from Hugging Face) released with the paper [Building and better understanding vision-language models: insights and future directions](https://arxiv.org/abs/2408.12637) by Hugo Laurençon, Andrés Marafioti, Victor Sanh, Léo Tronchon.
1. **JAIS** (from Core42) released with the paper [Jais and Jais-chat: Arabic-Centric Foundation and Instruction-Tuned Open Generative Large Language Models](https://arxiv.org/pdf/2308.16149) by Neha Sengupta, Sunil Kumar Sahu, Bokang Jia, Satheesh Katipomu, Haonan Li, Fajri Koto, William Marshall, Gurpreet Gosal, Cynthia Liu, Zhiming Chen, Osama Mohammed Afzal, Samta Kamboj, Onkar Pandit, Rahul Pal, Lalit Pradhan, Zain Muhammad Mujahid, Massa Baali, Xudong Han, Sondos Mahmoud Bsharat, Alham Fikri Aji, Zhiqiang Shen, Zhengzhong Liu, Natalia Vassilieva, Joel Hestness, Andy Hock, Andrew Feldman, Jonathan Lee, Andrew Jackson, Hector Xuguang Ren, Preslav Nakov, Timothy Baldwin, Eric Xing.
1. **Janus** (from DeepSeek) released with the paper [Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation](https://arxiv.org/abs/2410.13848) Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, Ping Luo.
1. **JinaCLIP** (from Jina AI) released with the paper [Jina CLIP: Your CLIP Model Is Also Your Text Retriever](https://arxiv.org/abs/2405.20204) by Andreas Koukounas, Georgios Mastrapas, Michael Günther, Bo Wang, Scott Martens, Isabelle Mohr, Saba Sturua, Mohammad Kalim Akram, Joan Fontanals Martínez, Saahil Ognawala, Susana Guzman, Maximilian Werk, Nan Wang, Han Xiao.
36 changes: 24 additions & 12 deletions scripts/quantize.py
@@ -36,6 +36,27 @@ class QuantMode(Enum):

QUANTIZE_OPTIONS = tuple(x.value for x in QuantMode)

+# A list of operators that, when detected in a model, should select QUInt8 as the weight type for 8-bit quantization.
+QUINT8_OPS = (
+# NOTE:
+# As of 2024/11/29, the latest version of onnxruntime-web is 1.20.1, and does not support INT8 weights for Conv layers.
+# If you attempt to run a model with INT8 weights for Conv layers, you will get an error like:
+# `Can't create a session. ERROR_CODE: 9, ERROR_MESSAGE: Could not find an implementation for ConvInteger(10) node with name '/.../Conv_quant'`
+#
+# For this reason, we choose model weight types to ensure compatibility with onnxruntime-web.
+#
+# As per docs, signed weight type (QInt8) is faster on most CPUs, so we use that unless the model contains one of the operators in this list.
+# For more information, see:
+# - https://github.com/microsoft/onnxruntime/issues/3130#issuecomment-1105200621
+# - https://github.com/microsoft/onnxruntime/issues/2339
+"Conv",
+
+# Models produced by onnxruntime-genai contain optimized operators that perform better with QUInt8 weights.
+"GroupQueryAttention",
+"MultiHeadAttention",
+
+# TODO: "SimplifiedLayerNormalization", "SkipSimplifiedLayerNormalization"
+)

@dataclass
class IOArguments:
@@ -326,20 +347,11 @@ def quantize(input_folder, output_folder, quantization_args: QuantizationArgumen

elif mode in (QuantMode.Q8, QuantMode.QI8, QuantMode.QU8):
if mode == QuantMode.Q8:
-# NOTE:
-# As of 2024/06/28, the current latest version of onnxruntime-web is 1.18.0, and does not support INT8 weights for Conv layers.
-# If you attempt to run a model with INT8 weights for Conv layers, you will get an error like:
-# `Can't create a session. ERROR_CODE: 9, ERROR_MESSAGE: Could not find an implementation for ConvInteger(10) node with name '/.../Conv_quant'`
-#
-# For this reason, we choose model weight types to ensure compatibility with onnxruntime-web.
-#
-# As per docs, signed weight type (QInt8) is faster on most CPUs, so, we use that unless the model contains a Conv layer.
-# For more information, see:
-# - https://github.com/microsoft/onnxruntime/issues/3130#issuecomment-1105200621
-# - https://github.com/microsoft/onnxruntime/issues/2339
op_types = get_operators(model)
weight_type = (
-QuantType.QUInt8 if "Conv" in op_types else QuantType.QInt8
+QuantType.QUInt8
+if any(x in QUINT8_OPS for x in op_types)
+else QuantType.QInt8
)

elif mode == QuantMode.QI8:
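The comment block above pins down the rule the script now applies: prefer signed QInt8 weights for speed, but fall back to unsigned QUInt8 whenever the graph contains an operator that onnxruntime-web (or onnxruntime-genai's fused operators) handles better with unsigned weights. As a minimal illustration, here is the same selection rule rendered in JavaScript; the operator names are copied from `QUINT8_OPS` above, while the function and its inputs are purely hypothetical:

```js
// Operators that force unsigned (QUInt8) weights, mirroring QUINT8_OPS above.
const QUINT8_OPS = new Set(['Conv', 'GroupQueryAttention', 'MultiHeadAttention']);

// Given the operator types present in an ONNX graph, pick the 8-bit weight type.
// Signed QInt8 is generally faster on CPU, but unsigned QUInt8 is chosen whenever
// one of the operators above is present, for the compatibility reasons noted above.
function chooseWeightType(opTypes) {
    return opTypes.some((op) => QUINT8_OPS.has(op)) ? 'QUInt8' : 'QInt8';
}

console.log(chooseWeightType(['MatMul', 'Conv', 'Softmax'])); // 'QUInt8'
console.log(chooseWeightType(['MatMul', 'Softmax']));         // 'QInt8'
```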
3 changes: 2 additions & 1 deletion src/configs.js
@@ -69,6 +69,7 @@ function getNormalizedConfig(config) {
case 'paligemma':
case 'florence2':
case 'llava_onevision':
+case 'idefics3':
init_normalized_config = getNormalizedConfig(config.text_config);
break;
case 'moondream1':
@@ -382,6 +383,6 @@ export class AutoConfig {
* See https://onnxruntime.ai/docs/tutorials/web/env-flags-and-session-options.html#freedimensionoverrides
* for more information.
* @property {import('./utils/devices.js').DeviceType} [device] The default device to use for the model.
-* @property {import('./utils/dtypes.js').DataType} [dtype] The default data type to use for the model.
+* @property {import('./utils/dtypes.js').DataType|Record<string, import('./utils/dtypes.js').DataType>} [dtype] The default data type to use for the model.
* @property {boolean|Record<string, boolean>} [use_external_data_format=false] Whether to load the model using the external data format (used for models >= 2GB in size).
*/
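The widened JSDoc type above reflects that `dtype` may now be either a single value or a per-module record. A hedged usage sketch of both forms; the checkpoint id and the module keys (`embed_tokens`, `vision_encoder`, `decoder_model_merged`) are illustrative assumptions and depend on which ONNX files a particular export ships:

```js
import { AutoModelForVision2Seq } from '@huggingface/transformers';

// One dtype for every module:
const model_a = await AutoModelForVision2Seq.from_pretrained('HuggingFaceTB/SmolVLM-Instruct', {
    dtype: 'q8',
});

// Or, with this change, one dtype per module:
const model_b = await AutoModelForVision2Seq.from_pretrained('HuggingFaceTB/SmolVLM-Instruct', {
    dtype: {
        embed_tokens: 'fp16',
        vision_encoder: 'q8',
        decoder_model_merged: 'q4',
    },
});
```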
139 changes: 107 additions & 32 deletions src/models.js
@@ -182,6 +182,22 @@ async function getSession(pretrained_model_name_or_path, fileName, options) {
}
}

+if (dtype === DATA_TYPES.auto) {
+// Try to choose the auto dtype based on the custom config
+let config_dtype = custom_config.dtype;
+if (typeof config_dtype !== 'string') {
+config_dtype = config_dtype[fileName];
+}
+
+if (config_dtype && config_dtype !== DATA_TYPES.auto && DATA_TYPES.hasOwnProperty(config_dtype)) {
+// Defined by the custom config, and is not "auto"
+dtype = config_dtype;
+} else {
+// Choose default dtype based on device, falling back to fp32
+dtype = DEFAULT_DEVICE_DTYPE_MAPPING[selectedDevice] ?? DATA_TYPES.fp32;
+}
+}
+
const selectedDtype = /** @type {import("./utils/dtypes.js").DataType} */(dtype);

if (!DEFAULT_DTYPE_SUFFIX_MAPPING.hasOwnProperty(selectedDtype)) {
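The added block resolves `'auto'` in a fixed order: a per-file (or global) dtype from the custom config wins, then the device's default mapping, then `fp32`. A simplified, standalone sketch of that order (the check against the known `DATA_TYPES` keys is omitted, and none of these names are the library's actual exports):

```js
// requested: the dtype the caller asked for ('auto', 'fp32', 'q8', ...)
// configDtype: the `dtype` field of the model's custom config (a string or a per-file record)
// deviceDefaults: e.g. { webgpu: 'fp32', wasm: 'q8' } — illustrative values only
function resolveDtype(requested, configDtype, fileName, device, deviceDefaults) {
    if (requested !== 'auto') return requested;

    // 1. Per-file (or global) dtype from the custom config, unless it is itself 'auto'.
    const fromConfig = typeof configDtype === 'string' ? configDtype : configDtype?.[fileName];
    if (fromConfig && fromConfig !== 'auto') return fromConfig;

    // 2. Device-specific default, 3. final fallback to fp32.
    return deviceDefaults[device] ?? 'fp32';
}

console.log(resolveDtype('auto', { decoder_model_merged: 'q4' }, 'decoder_model_merged', 'wasm', { wasm: 'q8' })); // 'q4'
console.log(resolveDtype('auto', undefined, 'model', 'webgpu', { wasm: 'q8' }));                                   // 'fp32'
```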
@@ -387,9 +403,17 @@ async function sessionRun(session, inputs) {
output = replaceTensors(output);
return output;
} catch (e) {
+// Error messages can be long (nested) and uninformative. For this reason,
+// we apply minor formatting to show the most important information
+const formatted = Object.fromEntries(Object.entries(checkedInputs)
+.map(([k, { type, dims, data }]) => [k, {
+// Extract these properties from the underlying ORT tensor
+type, dims, data,
+}]));
+
// This usually occurs when the inputs are of the wrong type.
console.error(`An error occurred during model execution: "${e}".`);
-console.error('Inputs given to model:', checkedInputs)
+console.error('Inputs given to model:', formatted);
throw e;
}
}
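The intent of the formatting above is simply to drop ORT-internal fields so that only `type`, `dims`, and `data` are logged for each input. A toy illustration with a made-up tensor-like object (not the library's actual `Tensor` class):

```js
const checkedInputs = {
    input_ids: { type: 'int64', dims: [1, 4], data: new BigInt64Array(4), handle: Symbol('ort-internal') },
};

// Keep only the three informative properties of each input.
const formatted = Object.fromEntries(
    Object.entries(checkedInputs).map(([k, { type, dims, data }]) => [k, { type, dims, data }]),
);

console.log(formatted.input_ids); // { type: 'int64', dims: [ 1, 4 ], data: BigInt64Array(4) [ 0n, 0n, 0n, 0n ] }
```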
@@ -546,6 +570,39 @@ async function decoderForward(self, model_inputs, is_encoder_decoder = false) {
}



+function default_merge_input_ids_with_image_features({
+image_token_id,
+inputs_embeds,
+image_features,
+input_ids,
+attention_mask,
+}) {
+const image_tokens = input_ids.tolist().map(ids =>
+ids.reduce((acc, x, idx) => {
+if (x == image_token_id) acc.push(idx);
+return acc;
+}, [])
+);
+const n_image_tokens = image_tokens.reduce((acc, x) => acc + x.length, 0);
+const n_image_features = image_features.dims[0];
+if (n_image_tokens !== n_image_features) {
+throw new Error(`Image features and image tokens do not match: tokens: ${n_image_tokens}, features ${n_image_features}`);
+}
+
+// Equivalent to performing a masked_scatter
+let img = 0;
+for (let i = 0; i < image_tokens.length; ++i) {
+const tokens = image_tokens[i];
+const embeds = inputs_embeds[i];
+for (let j = 0; j < tokens.length; ++j) {
+embeds[tokens[j]].data.set(image_features[img++].data)
+}
+}
+return { inputs_embeds, attention_mask }
+}
+
+
/**
* Forward pass of an image-text-to-text model.
* @param {Object} self The image-text-to-text model model.
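To make the behaviour of `default_merge_input_ids_with_image_features` above concrete, here is a toy walk-through using plain arrays in place of tensors; the token id, shapes, and values are made up, and the real function additionally validates that the placeholder count matches the number of feature rows:

```js
const IMAGE_TOKEN_ID = 32001;                             // hypothetical placeholder id
const input_ids = [[101, 32001, 32001, 2023]];            // two image placeholders in one sequence
const inputs_embeds = [[[0, 0], [0, 0], [0, 0], [0, 0]]]; // [batch, seq_len, hidden]
const image_features = [[1, 1], [2, 2]];                  // one feature row per placeholder

let img = 0;
for (let b = 0; b < input_ids.length; ++b) {
    for (let t = 0; t < input_ids[b].length; ++t) {
        if (input_ids[b][t] === IMAGE_TOKEN_ID) {
            // Scatter the next image feature row into the placeholder position.
            inputs_embeds[b][t] = image_features[img++];
        }
    }
}

console.log(inputs_embeds[0]); // [[0, 0], [1, 1], [2, 2], [0, 0]]
```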
@@ -3304,8 +3361,8 @@ export class VisionEncoderDecoderModel extends PreTrainedModel {
export class LlavaPreTrainedModel extends PreTrainedModel {
forward_params = [
'input_ids',
-'pixel_values',
'attention_mask',
+'pixel_values',
'position_ids',
'past_key_values',
];
@@ -3487,6 +3544,46 @@ export class Florence2ForConditionalGeneration extends Florence2PreTrainedModel
return decoder_outputs;
}
}


+//////////////////////////////////////////////////
+// Idefics3 Models
+export class Idefics3PreTrainedModel extends PreTrainedModel {
+forward_params = [
+'input_ids',
+'attention_mask',
+'pixel_values',
+'pixel_attention_mask',
+'position_ids',
+'past_key_values',
+];
+}
+
+/**
+* The Idefics3 model, which consists of a vision backbone and a language model.
+*/
+export class Idefics3ForConditionalGeneration extends Idefics3PreTrainedModel {
+
+async encode_image({ pixel_values, pixel_attention_mask }) {
+const features = (await sessionRun(this.sessions['vision_encoder'], { pixel_values, pixel_attention_mask })).image_features;
+return features;
+}
+
+_merge_input_ids_with_image_features(kwargs) {
+const vision_hidden_size = kwargs.image_features.dims.at(-1);
+const reshaped_image_hidden_states = kwargs.image_features.view(-1, vision_hidden_size);
+
+return default_merge_input_ids_with_image_features({
+// @ts-ignore
+image_token_id: this.config.image_token_id,
+...kwargs,
+image_features: reshaped_image_hidden_states,
+})
+}
+}
+//////////////////////////////////////////////////
+
+//////////////////////////////////////////////////
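With `Idefics3ForConditionalGeneration` defined above and registered in the auto-model mappings further down, an Idefics3-style checkpoint can be driven end to end. A usage sketch under stated assumptions: the checkpoint id, the image URL, the per-module dtype keys, and the exact processor call signature follow similar transformers.js examples and are not taken from this diff:

```js
import { AutoProcessor, AutoModelForVision2Seq, RawImage } from '@huggingface/transformers';

const model_id = 'HuggingFaceTB/SmolVLM-Instruct'; // assumed ONNX export of an Idefics3-style model
const processor = await AutoProcessor.from_pretrained(model_id);
const model = await AutoModelForVision2Seq.from_pretrained(model_id, {
    dtype: { embed_tokens: 'fp16', vision_encoder: 'q8', decoder_model_merged: 'q4' }, // illustrative
});

// One image plus a text question, formatted with the model's chat template.
const image = await RawImage.fromURL('https://example.com/cat.jpg'); // placeholder URL
const messages = [{
    role: 'user',
    content: [{ type: 'image' }, { type: 'text', text: 'Describe this image.' }],
}];
const prompt = processor.apply_chat_template(messages, { add_generation_prompt: true });
const inputs = await processor(prompt, [image]);

// Generate, then decode only the newly generated tokens.
const generated_ids = await model.generate({ ...inputs, max_new_tokens: 128 });
const output = processor.batch_decode(
    generated_ids.slice(null, [inputs.input_ids.dims.at(-1), null]),
    { skip_special_tokens: true },
);
console.log(output[0]);
```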
export class CLIPPreTrainedModel extends PreTrainedModel { }

/**
@@ -4280,36 +4377,12 @@ export class Qwen2VLForConditionalGeneration extends Qwen2VLPreTrainedModel {
return features;
}

-_merge_input_ids_with_image_features({
-inputs_embeds,
-image_features,
-input_ids,
-attention_mask,
-}) {
-// @ts-ignore
-const { image_token_id } = this.config;
-const image_tokens = input_ids.tolist().map(ids =>
-ids.reduce((acc, x, idx) => {
-if (x == image_token_id) acc.push(idx);
-return acc;
-}, [])
-);
-const n_image_tokens = image_tokens.reduce((acc, x) => acc + x.length, 0);
-const n_image_features = image_features.dims[0];
-if (n_image_tokens !== n_image_features) {
-throw new Error(`Image features and image tokens do not match: tokens: ${n_image_tokens}, features ${n_image_features}`);
-}
-
-// Equivalent to performing a masked_scatter
-let img = 0;
-for (let i = 0; i < image_tokens.length; ++i) {
-const tokens = image_tokens[i];
-const embeds = inputs_embeds[i];
-for (let j = 0; j < tokens.length; ++j) {
-embeds[tokens[j]].data.set(image_features[img++].data)
-}
-}
-return { inputs_embeds, attention_mask }
+_merge_input_ids_with_image_features(kwargs) {
+return default_merge_input_ids_with_image_features({
+// @ts-ignore
+image_token_id: this.config.image_token_id,
+...kwargs
+})
}

prepare_inputs_for_generation(input_ids, model_inputs, generation_config) {
@@ -6914,6 +6987,7 @@ const MODEL_FOR_QUESTION_ANSWERING_MAPPING_NAMES = new Map([

const MODEL_FOR_VISION_2_SEQ_MAPPING_NAMES = new Map([
['vision-encoder-decoder', ['VisionEncoderDecoderModel', VisionEncoderDecoderModel]],
+['idefics3', ['Idefics3ForConditionalGeneration', Idefics3ForConditionalGeneration]],
]);

const MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES = new Map([
@@ -6922,6 +6996,7 @@ const MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES = new Map([
['moondream1', ['Moondream1ForConditionalGeneration', Moondream1ForConditionalGeneration]],
['florence2', ['Florence2ForConditionalGeneration', Florence2ForConditionalGeneration]],
['qwen2-vl', ['Qwen2VLForConditionalGeneration', Qwen2VLForConditionalGeneration]],
+['idefics3', ['Idefics3ForConditionalGeneration', Idefics3ForConditionalGeneration]],
]);

const MODEL_FOR_DOCUMENT_QUESTION_ANSWERING_MAPPING_NAMES = new Map([