
Commit e2d17b9

Add support for SigLIP models (#473)

* Add support for SigLIP models
* Skip siglip tokenizer tests
* Move SigLIP-specific zero-shot-image-classification logic to pipeline

1 parent 9b84d7b · commit e2d17b9

10 files changed: +218 −15 lines changed
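Before the per-file diffs, a quick usage sketch of what this commit enables. This is a minimal example, not part of the commit itself; it assumes the converted `Xenova/siglip-base-patch16-224` ONNX weights referenced in the JSDoc examples below are available on the Hub, and uses the existing `pipeline` API from `@xenova/transformers`:

```javascript
import { pipeline } from '@xenova/transformers';

// Zero-shot image classification with a SigLIP checkpoint
const classifier = await pipeline('zero-shot-image-classification', 'Xenova/siglip-base-patch16-224');

const url = 'http://images.cocodataset.org/val2017/000000039769.jpg';
const output = await classifier(url, ['2 cats', '2 dogs']);
// e.g. [ { score: ..., label: '2 cats' }, { score: ..., label: '2 dogs' } ]
// Unlike CLIP, SigLIP scores are independent sigmoid probabilities and need not sum to 1.
```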

README.md

Lines changed: 1 addition & 0 deletions

@@ -328,6 +328,7 @@ You can refine your search by selecting the task you're interested in (e.g., [te
 1. **[RoBERTa](https://huggingface.co/docs/transformers/model_doc/roberta)** (from Facebook), released together with the paper [RoBERTa: A Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692) by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov.
 1. **[RoFormer](https://huggingface.co/docs/transformers/model_doc/roformer)** (from ZhuiyiTechnology), released together with the paper [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/abs/2104.09864) by Jianlin Su and Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu.
 1. **[SegFormer](https://huggingface.co/docs/transformers/model_doc/segformer)** (from NVIDIA) released with the paper [SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers](https://arxiv.org/abs/2105.15203) by Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo.
+1. **[SigLIP](https://huggingface.co/docs/transformers/main/model_doc/siglip)** (from Google AI) released with the paper [Sigmoid Loss for Language Image Pre-Training](https://arxiv.org/abs/2303.15343) by Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, Lucas Beyer.
 1. **[SpeechT5](https://huggingface.co/docs/transformers/model_doc/speecht5)** (from Microsoft Research) released with the paper [SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing](https://arxiv.org/abs/2110.07205) by Junyi Ao, Rui Wang, Long Zhou, Chengyi Wang, Shuo Ren, Yu Wu, Shujie Liu, Tom Ko, Qing Li, Yu Zhang, Zhihua Wei, Yao Qian, Jinyu Li, Furu Wei.
 1. **[SqueezeBERT](https://huggingface.co/docs/transformers/model_doc/squeezebert)** (from Berkeley) released with the paper [SqueezeBERT: What can computer vision teach NLP about efficient neural networks?](https://arxiv.org/abs/2006.11316) by Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, and Kurt W. Keutzer.
 1. **[Swin Transformer](https://huggingface.co/docs/transformers/model_doc/swin)** (from Microsoft) released with the paper [Swin Transformer: Hierarchical Vision Transformer using Shifted Windows](https://arxiv.org/abs/2103.14030) by Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo.

docs/snippets/6_supported-models.snippet

Lines changed: 1 addition & 0 deletions

@@ -63,6 +63,7 @@
 1. **[RoBERTa](https://huggingface.co/docs/transformers/model_doc/roberta)** (from Facebook), released together with the paper [RoBERTa: A Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692) by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov.
 1. **[RoFormer](https://huggingface.co/docs/transformers/model_doc/roformer)** (from ZhuiyiTechnology), released together with the paper [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/abs/2104.09864) by Jianlin Su and Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu.
 1. **[SegFormer](https://huggingface.co/docs/transformers/model_doc/segformer)** (from NVIDIA) released with the paper [SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers](https://arxiv.org/abs/2105.15203) by Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo.
+1. **[SigLIP](https://huggingface.co/docs/transformers/main/model_doc/siglip)** (from Google AI) released with the paper [Sigmoid Loss for Language Image Pre-Training](https://arxiv.org/abs/2303.15343) by Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, Lucas Beyer.
 1. **[SpeechT5](https://huggingface.co/docs/transformers/model_doc/speecht5)** (from Microsoft Research) released with the paper [SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing](https://arxiv.org/abs/2110.07205) by Junyi Ao, Rui Wang, Long Zhou, Chengyi Wang, Shuo Ren, Yu Wu, Shujie Liu, Tom Ko, Qing Li, Yu Zhang, Zhihua Wei, Yao Qian, Jinyu Li, Furu Wei.
 1. **[SqueezeBERT](https://huggingface.co/docs/transformers/model_doc/squeezebert)** (from Berkeley) released with the paper [SqueezeBERT: What can computer vision teach NLP about efficient neural networks?](https://arxiv.org/abs/2006.11316) by Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, and Kurt W. Keutzer.
 1. **[Swin Transformer](https://huggingface.co/docs/transformers/model_doc/swin)** (from Microsoft) released with the paper [Swin Transformer: Hierarchical Vision Transformer using Shifted Windows](https://arxiv.org/abs/2103.14030) by Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo.

scripts/convert.py

Lines changed: 18 additions & 0 deletions

@@ -381,6 +381,24 @@ def main():
             device=conv_args.device,
         )
 
+    elif config.model_type == 'siglip' and conv_args.split_modalities:
+        # Handle special case for exporting text and vision models separately
+        from .extra.siglip import SiglipTextModelOnnxConfig, SiglipVisionModelOnnxConfig
+        from transformers.models.siglip import SiglipTextModel, SiglipVisionModel
+
+        text_model = SiglipTextModel.from_pretrained(model_id)
+        vision_model = SiglipVisionModel.from_pretrained(model_id)
+
+        export_models(
+            models_and_onnx_configs={
+                "text_model": (text_model, SiglipTextModelOnnxConfig(text_model.config)),
+                "vision_model": (vision_model, SiglipVisionModelOnnxConfig(vision_model.config)),
+            },
+            output_dir=output_model_folder,
+            opset=conv_args.opset,
+            device=conv_args.device,
+        )
+
     # TODO: Enable once https://github.com/huggingface/optimum/pull/1552 is merged
     # elif config.model_type == 'clap' and conv_args.split_modalities:
     #     # Handle special case for exporting text and audio models separately

scripts/extra/siglip.py

Lines changed: 33 additions & 0 deletions

@@ -0,0 +1,33 @@
+# Support exporting vision and text models separately:
+# Adapted from https://github.com/huggingface/optimum/issues/1186#issuecomment-1637641760
+
+from optimum.exporters.onnx.model_configs import SiglipTextOnnxConfig, ViTOnnxConfig
+from typing import Dict
+
+
+class SiglipVisionOnnxConfig(ViTOnnxConfig):
+    pass
+
+
+class SiglipTextModelOnnxConfig(SiglipTextOnnxConfig):
+    @property
+    def outputs(self) -> Dict[str, Dict[int, str]]:
+        return {
+            "last_hidden_state": {0: "batch_size", 1: "sequence_length"},
+            "pooler_output": {0: "batch_size"},
+        }
+
+    def generate_dummy_inputs(self, framework: str = "pt", **kwargs):
+        dummy_inputs = super().generate_dummy_inputs(framework=framework, **kwargs)
+        if framework == "pt":
+            import torch
+            dummy_inputs["input_ids"] = dummy_inputs["input_ids"].to(dtype=torch.int64)
+        return dummy_inputs
+
+class SiglipVisionModelOnnxConfig(SiglipVisionOnnxConfig):
+    @property
+    def outputs(self) -> Dict[str, Dict[int, str]]:
+        return {
+            "last_hidden_state": {0: "batch_size"},
+            "pooler_output": {0: "batch_size"},
+        }

scripts/supported_models.py

Lines changed: 8 additions & 1 deletion

@@ -778,7 +778,14 @@
         'nvidia/mit-b5',
     ],
 },
-
+'siglip': {
+    # Zero-shot image classification and feature extraction
+    # (with and without `--split_modalities`)
+    # NOTE: requires --opset 13
+    'zero-shot-image-classification': [
+        'nielsr/siglip-base-patch16-224',
+    ],
+},
 'speecht5': {
     # Text-to-audio/Text-to-speech
     'text-to-audio': [

src/models.js

Lines changed: 122 additions & 1 deletion

@@ -3159,6 +3159,125 @@ export class CLIPVisionModelWithProjection extends CLIPPreTrainedModel {
 //////////////////////////////////////////////////
 
 
+//////////////////////////////////////////////////
+// SigLIP models
+export class SiglipPreTrainedModel extends PreTrainedModel { }
+
+/**
+ * SigLIP Text and Vision Model with projection layers on top
+ *
+ * **Example:** Perform zero-shot image classification with a `SiglipModel`.
+ *
+ * ```javascript
+ * import { AutoTokenizer, AutoProcessor, SiglipModel, RawImage } from '@xenova/transformers';
+ *
+ * // Load tokenizer, processor, and model
+ * const tokenizer = await AutoTokenizer.from_pretrained('Xenova/siglip-base-patch16-224');
+ * const processor = await AutoProcessor.from_pretrained('Xenova/siglip-base-patch16-224');
+ * const model = await SiglipModel.from_pretrained('Xenova/siglip-base-patch16-224');
+ *
+ * // Run tokenization
+ * const texts = ['a photo of 2 cats', 'a photo of 2 dogs'];
+ * const text_inputs = tokenizer(texts, { padding: 'max_length', truncation: true });
+ *
+ * // Read image and run processor
+ * const image = await RawImage.read('http://images.cocodataset.org/val2017/000000039769.jpg');
+ * const image_inputs = await processor(image);
+ *
+ * // Run model with both text and pixel inputs
+ * const output = await model({ ...text_inputs, ...image_inputs });
+ * // {
+ * //   logits_per_image: Tensor {
+ * //     dims: [ 1, 2 ],
+ * //     data: Float32Array(2) [ -1.6019744873046875, -10.720091819763184 ],
+ * //   },
+ * //   logits_per_text: Tensor {
+ * //     dims: [ 2, 1 ],
+ * //     data: Float32Array(2) [ -1.6019744873046875, -10.720091819763184 ],
+ * //   },
+ * //   text_embeds: Tensor {
+ * //     dims: [ 2, 768 ],
+ * //     data: Float32Array(1536) [ ... ],
+ * //   },
+ * //   image_embeds: Tensor {
+ * //     dims: [ 1, 768 ],
+ * //     data: Float32Array(768) [ ... ],
+ * //   }
+ * // }
+ * ```
+ */
+export class SiglipModel extends SiglipPreTrainedModel { }
+
+/**
+ * The text model from SigLIP without any head or projection on top.
+ *
+ * **Example:** Compute text embeddings with `SiglipTextModel`.
+ *
+ * ```javascript
+ * import { AutoTokenizer, SiglipTextModel } from '@xenova/transformers';
+ *
+ * // Load tokenizer and text model
+ * const tokenizer = await AutoTokenizer.from_pretrained('Xenova/siglip-base-patch16-224');
+ * const text_model = await SiglipTextModel.from_pretrained('Xenova/siglip-base-patch16-224');
+ *
+ * // Run tokenization
+ * const texts = ['a photo of 2 cats', 'a photo of 2 dogs'];
+ * const text_inputs = tokenizer(texts, { padding: 'max_length', truncation: true });
+ *
+ * // Compute embeddings
+ * const { pooler_output } = await text_model(text_inputs);
+ * // Tensor {
+ * //   dims: [ 2, 768 ],
+ * //   type: 'float32',
+ * //   data: Float32Array(1536) [ ... ],
+ * //   size: 1536
+ * // }
+ * ```
+ */
+export class SiglipTextModel extends SiglipPreTrainedModel {
+
+    /** @type {PreTrainedModel.from_pretrained} */
+    static async from_pretrained(pretrained_model_name_or_path, options = {}) {
+        // Update default model file name if not provided
+        options.model_file_name ??= 'text_model';
+        return super.from_pretrained(pretrained_model_name_or_path, options);
+    }
+}
+
+/**
+ * The vision model from SigLIP without any head or projection on top.
+ *
+ * **Example:** Compute vision embeddings with `SiglipVisionModel`.
+ *
+ * ```javascript
+ * import { AutoProcessor, SiglipVisionModel, RawImage } from '@xenova/transformers';
+ *
+ * // Load processor and vision model
+ * const processor = await AutoProcessor.from_pretrained('Xenova/siglip-base-patch16-224');
+ * const vision_model = await SiglipVisionModel.from_pretrained('Xenova/siglip-base-patch16-224');
+ *
+ * // Read image and run processor
+ * const image = await RawImage.read('https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/football-match.jpg');
+ * const image_inputs = await processor(image);
+ *
+ * // Compute embeddings
+ * const { pooler_output } = await vision_model(image_inputs);
+ * // Tensor {
+ * //   dims: [ 1, 768 ],
+ * //   type: 'float32',
+ * //   data: Float32Array(768) [ ... ],
+ * //   size: 768
+ * // }
+ * ```
+ */
+export class SiglipVisionModel extends CLIPPreTrainedModel {
+    /** @type {PreTrainedModel.from_pretrained} */
+    static async from_pretrained(pretrained_model_name_or_path, options = {}) {
+        // Update default model file name if not provided
+        options.model_file_name ??= 'vision_model';
+        return super.from_pretrained(pretrained_model_name_or_path, options);
+    }
+}
 //////////////////////////////////////////////////
 // ChineseCLIP models
 export class ChineseCLIPPreTrainedModel extends PreTrainedModel { }

@@ -4902,6 +5021,7 @@ const MODEL_MAPPING_NAMES_ENCODER_ONLY = new Map([
     ['clip', ['CLIPModel', CLIPModel]],
     ['clipseg', ['CLIPSegModel', CLIPSegModel]],
     ['chinese_clip', ['ChineseCLIPModel', ChineseCLIPModel]],
+    ['siglip', ['SiglipModel', SiglipModel]],
     ['mobilebert', ['MobileBertModel', MobileBertModel]],
     ['squeezebert', ['SqueezeBertModel', SqueezeBertModel]],
     ['wav2vec2', ['Wav2Vec2Model', Wav2Vec2Model]],

@@ -5190,7 +5310,8 @@ for (const [mappings, type] of MODEL_CLASS_TYPE_MAPPING) {
 const CUSTOM_MAPPING = [
     ['CLIPTextModelWithProjection', CLIPTextModelWithProjection, MODEL_TYPES.EncoderOnly],
     ['CLIPVisionModelWithProjection', CLIPVisionModelWithProjection, MODEL_TYPES.EncoderOnly],
-
+    ['SiglipTextModel', SiglipTextModel, MODEL_TYPES.EncoderOnly],
+    ['SiglipVisionModel', SiglipVisionModel, MODEL_TYPES.EncoderOnly],
     ['ClapTextModelWithProjection', ClapTextModelWithProjection, MODEL_TYPES.EncoderOnly],
     ['ClapAudioModelWithProjection', ClapAudioModelWithProjection, MODEL_TYPES.EncoderOnly],
 ]
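The split `SiglipTextModel` / `SiglipVisionModel` exports registered above make the two towers usable independently. A minimal sketch of combining them (not from the commit; it assumes the same `Xenova/siglip-base-patch16-224` weights as the JSDoc examples, and that the library's existing `cos_sim` helper and `Tensor.tolist()` are available; plain cosine similarity is used as a stand-in for the full SigLIP score, which additionally applies a learned scale and bias before a sigmoid):

```javascript
import { AutoTokenizer, AutoProcessor, SiglipTextModel, SiglipVisionModel, RawImage, cos_sim } from '@xenova/transformers';

// Load the separately exported text and vision towers
const tokenizer = await AutoTokenizer.from_pretrained('Xenova/siglip-base-patch16-224');
const processor = await AutoProcessor.from_pretrained('Xenova/siglip-base-patch16-224');
const text_model = await SiglipTextModel.from_pretrained('Xenova/siglip-base-patch16-224');
const vision_model = await SiglipVisionModel.from_pretrained('Xenova/siglip-base-patch16-224');

// Embed the candidate texts (note the SigLIP-specific 'max_length' padding)
const texts = ['a photo of 2 cats', 'a photo of 2 dogs'];
const text_inputs = tokenizer(texts, { padding: 'max_length', truncation: true });
const { pooler_output: text_embeds } = await text_model(text_inputs);

// Embed the image
const image = await RawImage.read('http://images.cocodataset.org/val2017/000000039769.jpg');
const { pooler_output: image_embeds } = await vision_model(await processor(image));

// Rank texts by cosine similarity to the image
const image_vec = image_embeds.tolist()[0];   // [768]
const text_vecs = text_embeds.tolist();       // [2][768]
const ranked = texts
    .map((text, i) => ({ text, similarity: cos_sim(image_vec, text_vecs[i]) }))
    .sort((a, b) => b.similarity - a.similarity);
console.log(ranked);
```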

src/pipelines.js

Lines changed: 7 additions & 2 deletions

@@ -1791,7 +1791,7 @@ export class ZeroShotImageClassificationPipeline extends Pipeline {
 
         // Run tokenization
         const text_inputs = this.tokenizer(texts, {
-            padding: true,
+            padding: this.model.config.model_type === 'siglip' ? 'max_length' : true,
             truncation: true
         });
 

@@ -1801,11 +1801,16 @@ export class ZeroShotImageClassificationPipeline extends Pipeline {
         // Run model with both text and pixel inputs
         const output = await this.model({ ...text_inputs, pixel_values });
 
+        const function_to_apply =
+            this.model.config.model_type === 'siglip'
+                ? batch => batch.sigmoid().data
+                : batch => softmax(batch.data);
+
         // Compare each image with each candidate label
        const toReturn = [];
         for (const batch of output.logits_per_image) {
             // Compute softmax per image
-            const probs = softmax(batch.data);
+            const probs = function_to_apply(batch);
 
             const result = [...probs].map((x, i) => ({
                 score: x,
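The `function_to_apply` switch above is the user-visible difference between SigLIP and CLIP in this pipeline. A small self-contained sketch (plain JavaScript, no library calls) of why sigmoid is used, applied to the logits from the `SiglipModel` example in `src/models.js`:

```javascript
// logits_per_image for one image and two candidate texts (values from the SiglipModel example above)
const logits = [-1.6019744873046875, -10.720091819763184];

const sigmoid = x => 1 / (1 + Math.exp(-x));
const softmax = arr => {
    const exps = arr.map(x => Math.exp(x - Math.max(...arr)));
    const sum = exps.reduce((a, b) => a + b, 0);
    return exps.map(e => e / sum);
};

// SigLIP: each label gets an independent probability; the scores need not sum to 1.
console.log(logits.map(sigmoid));  // ≈ [ 0.168, 0.000022 ]

// A CLIP-style softmax would instead force the scores to sum to 1 across labels.
console.log(softmax(logits));      // ≈ [ 0.99989, 0.00011 ]
```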

src/processors.js

Lines changed: 16 additions & 5 deletions

@@ -211,8 +211,8 @@ export class ImageFeatureExtractor extends FeatureExtractor {
     constructor(config) {
         super(config);
 
-        this.image_mean = this.config.image_mean;
-        this.image_std = this.config.image_std;
+        this.image_mean = this.config.image_mean ?? this.config.mean;
+        this.image_std = this.config.image_std ?? this.config.std;
 
         this.resample = this.config.resample ?? 2; // 2 => bilinear
         this.do_rescale = this.config.do_rescale ?? true;

@@ -396,6 +396,17 @@ export class ImageFeatureExtractor extends FeatureExtractor {
         return [pixelData, imgDims];
     }
 
+    /**
+     * Rescale the image's pixel values by `this.rescale_factor`.
+     * @param {Float32Array} pixelData The pixel data to rescale.
+     * @returns {void}
+     */
+    rescale(pixelData) {
+        for (let i = 0; i < pixelData.length; ++i) {
+            pixelData[i] = this.rescale_factor * pixelData[i];
+        }
+    }
+
     /**
      * @typedef {object} PreprocessedImage
      * @property {HeightWidth} original_size The original size of the image.

@@ -532,9 +543,7 @@ export class ImageFeatureExtractor extends FeatureExtractor {
         let imgDims = [image.height, image.width, image.channels];
 
         if (this.do_rescale) {
-            for (let i = 0; i < pixelData.length; ++i) {
-                pixelData[i] = this.rescale_factor * pixelData[i];
-            }
+            this.rescale(pixelData);
         }
 
         if (do_normalize ?? this.do_normalize) {

@@ -679,6 +688,7 @@ export class DPTFeatureExtractor extends ImageFeatureExtractor { }
 export class GLPNFeatureExtractor extends ImageFeatureExtractor { }
 export class CLIPFeatureExtractor extends ImageFeatureExtractor { }
 export class ChineseCLIPFeatureExtractor extends ImageFeatureExtractor { }
+export class SiglipImageProcessor extends ImageFeatureExtractor { }
 export class ConvNextFeatureExtractor extends ImageFeatureExtractor { }
 export class ConvNextImageProcessor extends ConvNextFeatureExtractor { } // NOTE extends ConvNextFeatureExtractor
 export class ViTFeatureExtractor extends ImageFeatureExtractor { }

@@ -1764,6 +1774,7 @@ export class AutoProcessor {
         OwlViTFeatureExtractor,
         CLIPFeatureExtractor,
         ChineseCLIPFeatureExtractor,
+        SiglipImageProcessor,
         ConvNextFeatureExtractor,
         ConvNextImageProcessor,
         SegformerFeatureExtractor,
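For reference, a short sketch of what the new `SiglipImageProcessor` registration means in practice (not part of the commit; it assumes the same `Xenova/siglip-base-patch16-224` checkpoint as the JSDoc examples, whose 224×224 input resolution determines the output shape):

```javascript
import { AutoProcessor, RawImage } from '@xenova/transformers';

// AutoProcessor resolves 'SiglipImageProcessor' from the checkpoint's preprocessor config
const processor = await AutoProcessor.from_pretrained('Xenova/siglip-base-patch16-224');

const image = await RawImage.read('http://images.cocodataset.org/val2017/000000039769.jpg');
const { pixel_values } = await processor(image);
console.log(pixel_values.dims);  // expected: [ 1, 3, 224, 224 ]
```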

src/tokenizers.js

Lines changed: 9 additions & 6 deletions

@@ -2490,7 +2490,7 @@ export class PreTrainedTokenizer extends Callable {
     * @param {string|string[]} text The text to tokenize.
     * @param {Object} options An optional object containing the following properties:
     * @param {string|string[]} [options.text_pair=null] Optional second sequence to be encoded. If set, must be the same type as text.
-    * @param {boolean} [options.padding=false] Whether to pad the input sequences.
+    * @param {boolean|'max_length'} [options.padding=false] Whether to pad the input sequences.
     * @param {boolean} [options.add_special_tokens=true] Whether or not to add the special tokens associated with the corresponding model.
     * @param {boolean} [options.truncation=null] Whether to truncate the input sequences.
     * @param {number} [options.max_length=null] Maximum length of the returned list and optionally padding length.

@@ -2551,11 +2551,13 @@ export class PreTrainedTokenizer extends Callable {
         // At this point, tokens is batched: [batch_size, tokens]
         // However, array may be jagged. So, we pad to max_length
 
-        let maxLengthOfBatch = max(tokens.map(x => x.length))[0];
-
-        // If null, we calculate max length from sequences
         if (max_length === null) {
-            max_length = maxLengthOfBatch;
+            if (padding === 'max_length') {
+                max_length = this.model_max_length;
+            } else {
+                // Calculate max length from sequences
+                max_length = max(tokens.map(x => x.length))[0];
+            }
         }
 
         // Ensure it is less than model max length

@@ -4115,7 +4117,7 @@ export class WhisperTokenizer extends PreTrainedTokenizer {
 }
 export class CodeGenTokenizer extends PreTrainedTokenizer { }
 export class CLIPTokenizer extends PreTrainedTokenizer { }
-
+export class SiglipTokenizer extends PreTrainedTokenizer { }
 
 /**
  * @todo This model is not yet supported by Hugging Face's "fast" tokenizers library (https://github.com/huggingface/tokenizers).

@@ -4221,6 +4223,7 @@ export class AutoTokenizer {
         WhisperTokenizer,
         CodeGenTokenizer,
         CLIPTokenizer,
+        SiglipTokenizer,
         MarianTokenizer,
         BloomTokenizer,
         NllbTokenizer,
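The padding change above is what lets `padding: 'max_length'` be requested from JavaScript. A brief sketch of the difference (not part of the commit; it assumes the SigLIP checkpoint's `model_max_length` is 64, as for `Xenova/siglip-base-patch16-224`):

```javascript
import { AutoTokenizer } from '@xenova/transformers';

const tokenizer = await AutoTokenizer.from_pretrained('Xenova/siglip-base-patch16-224');
const texts = ['a photo of 2 cats', 'a photo of 2 dogs'];

// padding: true pads only to the longest sequence in the batch...
const dynamic = tokenizer(texts, { padding: true, truncation: true });

// ...while padding: 'max_length' pads every sequence to tokenizer.model_max_length,
// which is what the exported SigLIP text model expects.
const fixed = tokenizer(texts, { padding: 'max_length', truncation: true });
console.log(fixed.input_ids.dims);  // expected: [ 2, 64 ]
```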

tests/generate_tests.py

Lines changed: 3 additions & 0 deletions

@@ -40,6 +40,9 @@
     # TODO: remove when https://github.com/huggingface/transformers/issues/26547 is fixed
     'speecht5',
 
+    # TODO: remove when https://github.com/huggingface/transformers/pull/26522 is merged
+    'siglip',
+
     # TODO: remove when https://github.com/huggingface/transformers/issues/28164 is fixed
     'roformer',
 