
Commit 591a112

Add support for SmolVLM2 (#1196)
* Add support for SmolVLM
* Always flush text streamer after prompt
* [WIP] video.js
* Fix streamer unit tests
* Export video.js
* Video processing improvements
1 parent cfd3e55 commit 591a112

File tree: 12 files changed (+153 −5 lines)

README.md

Lines changed: 1 addition & 0 deletions

@@ -402,6 +402,7 @@ You can refine your search by selecting the task you're interested in (e.g., [te
 1. **[SegFormer](https://huggingface.co/docs/transformers/model_doc/segformer)** (from NVIDIA) released with the paper [SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers](https://arxiv.org/abs/2105.15203) by Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo.
 1. **[Segment Anything](https://huggingface.co/docs/transformers/model_doc/sam)** (from Meta AI) released with the paper [Segment Anything](https://arxiv.org/pdf/2304.02643v1.pdf) by Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alex Berg, Wan-Yen Lo, Piotr Dollar, Ross Girshick.
 1. **[SigLIP](https://huggingface.co/docs/transformers/main/model_doc/siglip)** (from Google AI) released with the paper [Sigmoid Loss for Language Image Pre-Training](https://arxiv.org/abs/2303.15343) by Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, Lucas Beyer.
+1. **[SmolVLM](https://huggingface.co/docs/transformers/main/model_doc/smolvlm)** (from Hugging Face) released with the blog posts [SmolVLM - small yet mighty Vision Language Model](https://huggingface.co/blog/smolvlm) and [SmolVLM Grows Smaller – Introducing the 250M & 500M Models!](https://huggingface.co/blog/smolervlm) by the Hugging Face TB Research team.
 1. **[SpeechT5](https://huggingface.co/docs/transformers/model_doc/speecht5)** (from Microsoft Research) released with the paper [SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing](https://arxiv.org/abs/2110.07205) by Junyi Ao, Rui Wang, Long Zhou, Chengyi Wang, Shuo Ren, Yu Wu, Shujie Liu, Tom Ko, Qing Li, Yu Zhang, Zhihua Wei, Yao Qian, Jinyu Li, Furu Wei.
 1. **[SqueezeBERT](https://huggingface.co/docs/transformers/model_doc/squeezebert)** (from Berkeley) released with the paper [SqueezeBERT: What can computer vision teach NLP about efficient neural networks?](https://arxiv.org/abs/2006.11316) by Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, and Kurt W. Keutzer.
 1. **[StableLm](https://huggingface.co/docs/transformers/model_doc/stablelm)** (from Stability AI) released with the paper [StableLM 3B 4E1T (Technical Report)](https://stability.wandb.io/stability-llm/stable-lm/reports/StableLM-3B-4E1T--VmlldzoyMjU4?accessToken=u3zujipenkx5g7rtcj9qojjgxpconyjktjkli2po09nffrffdhhchq045vp0wyfo) by Jonathan Tow, Marco Bellagente, Dakota Mahan, Carlos Riquelme Ruiz, Duy Phung, Maksym Zhuravinskyi, Nathan Cooper, Nikhil Pinnaparaju, Reshinth Adithyan, and James Baicoianu.

docs/snippets/6_supported-models.snippet

Lines changed: 1 addition & 0 deletions

@@ -117,6 +117,7 @@
 1. **[SegFormer](https://huggingface.co/docs/transformers/model_doc/segformer)** (from NVIDIA) released with the paper [SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers](https://arxiv.org/abs/2105.15203) by Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo.
 1. **[Segment Anything](https://huggingface.co/docs/transformers/model_doc/sam)** (from Meta AI) released with the paper [Segment Anything](https://arxiv.org/pdf/2304.02643v1.pdf) by Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alex Berg, Wan-Yen Lo, Piotr Dollar, Ross Girshick.
 1. **[SigLIP](https://huggingface.co/docs/transformers/main/model_doc/siglip)** (from Google AI) released with the paper [Sigmoid Loss for Language Image Pre-Training](https://arxiv.org/abs/2303.15343) by Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, Lucas Beyer.
+1. **[SmolVLM](https://huggingface.co/docs/transformers/main/model_doc/smolvlm)** (from Hugging Face) released with the blog posts [SmolVLM - small yet mighty Vision Language Model](https://huggingface.co/blog/smolvlm) and [SmolVLM Grows Smaller – Introducing the 250M & 500M Models!](https://huggingface.co/blog/smolervlm) by the Hugging Face TB Research team.
 1. **[SpeechT5](https://huggingface.co/docs/transformers/model_doc/speecht5)** (from Microsoft Research) released with the paper [SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing](https://arxiv.org/abs/2110.07205) by Junyi Ao, Rui Wang, Long Zhou, Chengyi Wang, Shuo Ren, Yu Wu, Shujie Liu, Tom Ko, Qing Li, Yu Zhang, Zhihua Wei, Yao Qian, Jinyu Li, Furu Wei.
 1. **[SqueezeBERT](https://huggingface.co/docs/transformers/model_doc/squeezebert)** (from Berkeley) released with the paper [SqueezeBERT: What can computer vision teach NLP about efficient neural networks?](https://arxiv.org/abs/2006.11316) by Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, and Kurt W. Keutzer.
 1. **[StableLm](https://huggingface.co/docs/transformers/model_doc/stablelm)** (from Stability AI) released with the paper [StableLM 3B 4E1T (Technical Report)](https://stability.wandb.io/stability-llm/stable-lm/reports/StableLM-3B-4E1T--VmlldzoyMjU4?accessToken=u3zujipenkx5g7rtcj9qojjgxpconyjktjkli2po09nffrffdhhchq045vp0wyfo) by Jonathan Tow, Marco Bellagente, Dakota Mahan, Carlos Riquelme Ruiz, Duy Phung, Maksym Zhuravinskyi, Nathan Cooper, Nikhil Pinnaparaju, Reshinth Adithyan, and James Baicoianu.

src/configs.js

Lines changed: 1 addition & 0 deletions

@@ -70,6 +70,7 @@ function getNormalizedConfig(config) {
         case 'florence2':
         case 'llava_onevision':
         case 'idefics3':
+        case 'smolvlm':
             // @ts-expect-error TS2339
             init_normalized_config = getNormalizedConfig(config.text_config);
             break;
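The pattern this hunk extends: multi-modal model types fall through to a shared case that recursively normalizes the nested `text_config`, since the generation-relevant fields live on the inner language model. A minimal sketch of that fall-through (the `llama` field mapping and the standalone function shape are assumptions for illustration, not the library's actual code):

```javascript
// Sketch: multi-modal wrappers delegate normalization to their nested
// language-model config; 'smolvlm' simply joins the existing fall-through.
function getNormalizedConfig(config) {
    switch (config.model_type) {
        case 'florence2':
        case 'llava_onevision':
        case 'idefics3':
        case 'smolvlm':
            // Recurse: the text backbone carries the fields we need.
            return getNormalizedConfig(config.text_config);
        case 'llama':
            // Hypothetical mapping of raw field names to normalized ones.
            return { num_layers: config.num_hidden_layers, hidden_size: config.hidden_size };
        default:
            return { ...config };
    }
}

const normalized = getNormalizedConfig({
    model_type: 'smolvlm',
    text_config: { model_type: 'llama', num_hidden_layers: 24, hidden_size: 2048 },
});
```

One new `case` label is the entire per-model cost of this design, which is why adding SmolVLM here is a one-line change.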

src/generation/streamers.js

Lines changed: 4 additions & 3 deletions

@@ -72,9 +72,10 @@ export class TextStreamer extends BaseStreamer {
             throw Error('TextStreamer only supports batch size of 1');
         }
 
-        if (this.skip_prompt && this.next_tokens_are_prompt) {
+        const is_prompt = this.next_tokens_are_prompt;
+        if (is_prompt) {
             this.next_tokens_are_prompt = false;
-            return;
+            if (this.skip_prompt) return;
         }
 
         const tokens = value[0];
@@ -85,7 +86,7 @@ export class TextStreamer extends BaseStreamer {
         const text = this.tokenizer.decode(this.token_cache, this.decode_kwargs);
 
         let printable_text;
-        if (text.endsWith('\n')) {
+        if (is_prompt || text.endsWith('\n')) {
             // After the symbol for a new line, we flush the cache.
             printable_text = text.slice(this.print_len);
             this.token_cache = [];
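This change implements the "always flush text streamer after prompt" item from the commit message: previously the prompt tokens sat in the cache until a newline arrived; now the cache is flushed as soon as the prompt has been consumed, whether or not the prompt is skipped. A simplified, self-contained sketch of the new behavior (this is an assumed reduction of `TextStreamer`, with `decode` stubbed out):

```javascript
// Sketch of the post-change flush logic, not the library's actual class.
class SketchStreamer {
    constructor({ skip_prompt = false } = {}) {
        this.skip_prompt = skip_prompt;
        this.next_tokens_are_prompt = true;
        this.cache = [];
        this.out = []; // collected flushes, for demonstration
    }
    decode(tokens) { return tokens.join(''); } // stub tokenizer
    put(tokens) {
        const is_prompt = this.next_tokens_are_prompt;
        if (is_prompt) {
            this.next_tokens_are_prompt = false;
            if (this.skip_prompt) return; // skip, but remember we saw the prompt
        }
        this.cache.push(...tokens);
        const text = this.decode(this.cache);
        // Flush on newline as before, and now ALSO right after the prompt,
        // so prompt text is emitted immediately instead of lingering.
        if (is_prompt || text.endsWith('\n')) {
            this.out.push(text);
            this.cache = [];
        }
    }
}

const s = new SketchStreamer();
s.put(['Hello ', 'world ']); // prompt: flushed immediately
s.put(['a']);                // generated: cached (no newline yet)
s.put(['b\n']);              // newline: flushed
```

After these three calls, `s.out` holds two flushes: the prompt text and the generated `'ab\n'`. Note the key refactor mirrored from the diff: capturing `is_prompt` before clearing the flag lets the later flush condition see it.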

src/models.js

Lines changed: 10 additions & 1 deletion

@@ -3692,7 +3692,7 @@ export class Idefics3PreTrainedModel extends PreTrainedModel {
 }
 
 /**
- * The LLAVA model which consists of a vision backbone and a language model.
+ * The Idefics3 model which consists of a vision backbone and a language model.
  */
 export class Idefics3ForConditionalGeneration extends Idefics3PreTrainedModel {
 
@@ -3715,6 +3715,13 @@ export class Idefics3ForConditionalGeneration extends Idefics3PreTrainedModel {
 }
 //////////////////////////////////////////////////
 
+/**
+ * The SmolVLM Model with a language modeling head.
+ * It is made up of a SigLIP vision encoder, with a language modeling head on top.
+ */
+export class SmolVLMForConditionalGeneration extends Idefics3ForConditionalGeneration { }
+
+//////////////////////////////////////////////////
 export class Phi3VPreTrainedModel extends PreTrainedModel {
     forward_params = [
         'input_ids',
@@ -7316,6 +7323,7 @@ const MODEL_FOR_QUESTION_ANSWERING_MAPPING_NAMES = new Map([
 const MODEL_FOR_VISION_2_SEQ_MAPPING_NAMES = new Map([
     ['vision-encoder-decoder', ['VisionEncoderDecoderModel', VisionEncoderDecoderModel]],
     ['idefics3', ['Idefics3ForConditionalGeneration', Idefics3ForConditionalGeneration]],
+    ['smolvlm', ['SmolVLMForConditionalGeneration', SmolVLMForConditionalGeneration]],
 ]);
 
 const MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES = new Map([
@@ -7325,6 +7333,7 @@ const MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES = new Map([
     ['florence2', ['Florence2ForConditionalGeneration', Florence2ForConditionalGeneration]],
     ['qwen2-vl', ['Qwen2VLForConditionalGeneration', Qwen2VLForConditionalGeneration]],
     ['idefics3', ['Idefics3ForConditionalGeneration', Idefics3ForConditionalGeneration]],
+    ['smolvlm', ['SmolVLMForConditionalGeneration', SmolVLMForConditionalGeneration]],
     ['paligemma', ['PaliGemmaForConditionalGeneration', PaliGemmaForConditionalGeneration]],
 ]);
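The two moving parts of this change are subclassing (SmolVLM reuses the Idefics3 implementation wholesale via an empty subclass) and registration (the `model_type` → class `Map` routes `'smolvlm'` to it). A stripped-down sketch of how those fit together (class bodies and the `modelClassFor` helper are illustrative stubs, not the library's API):

```javascript
// Sketch: an empty subclass inherits all Idefics3 behavior, and the
// mapping makes it reachable by model_type. Bodies are stubs.
class Idefics3ForConditionalGeneration {
    generate() { return 'idefics3-style generation'; }
}
class SmolVLMForConditionalGeneration extends Idefics3ForConditionalGeneration { }

const MODEL_MAPPING = new Map([
    ['idefics3', Idefics3ForConditionalGeneration],
    ['smolvlm', SmolVLMForConditionalGeneration],
]);

// Hypothetical lookup helper, mirroring how AutoModel-style resolution works.
function modelClassFor(model_type) {
    const cls = MODEL_MAPPING.get(model_type);
    if (!cls) throw new Error(`Unsupported model_type: ${model_type}`);
    return cls;
}

const model = new (modelClassFor('smolvlm'))();
```

The empty subclass keeps the two model types distinct for dispatch and error messages while sharing one implementation, which is why the models.js portion of this commit is only ten lines.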

src/models/image_processors.js

Lines changed: 1 addition & 0 deletions

@@ -32,6 +32,7 @@ export * from './rt_detr/image_processing_rt_detr.js'
 export * from './sam/image_processing_sam.js'
 export * from './segformer/image_processing_segformer.js'
 export * from './siglip/image_processing_siglip.js'
+export * from './smolvlm/image_processing_smolvlm.js'
 export * from './swin2sr/image_processing_swin2sr.js'
 export * from './vit/image_processing_vit.js'
 export * from './vitmatte/image_processing_vitmatte.js'

src/models/processors.js

Lines changed: 1 addition & 0 deletions

@@ -11,6 +11,7 @@ export * from './paligemma/processing_paligemma.js';
 export * from './pyannote/processing_pyannote.js';
 export * from './qwen2_vl/processing_qwen2_vl.js';
 export * from './sam/processing_sam.js';
+export * from './smolvlm/processing_smolvlm.js';
 export * from './speecht5/processing_speecht5.js';
 export * from './wav2vec2/processing_wav2vec2.js';
 export * from './wav2vec2_with_lm/processing_wav2vec2_with_lm.js';
src/models/smolvlm/image_processing_smolvlm.js (new file)

Lines changed: 2 additions & 0 deletions

@@ -0,0 +1,2 @@
+export { Idefics3ImageProcessor as SmolVLMImageProcessor } from "../idefics3/image_processing_idefics3.js";
+

src/models/smolvlm/processing_smolvlm.js (new file)

Lines changed: 2 additions & 0 deletions

@@ -0,0 +1,2 @@
+export { Idefics3Processor as SmolVLMProcessor } from "../idefics3/processing_idefics3.js";
+
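Both new files use the same re-export aliasing trick as the model class: the SmolVLM processor and image processor *are* the Idefics3 ones under a new name, so one implementation serves both model types. Simulated in a single file below (the real code uses ES-module `export { A as B }` across files; the class body here is a stub):

```javascript
// Single-file simulation of the re-export alias. In the real repo this is
// split across modules; aliasing a class binding has the same effect as
// `export { Idefics3ImageProcessor as SmolVLMImageProcessor }`.
class Idefics3ImageProcessor {
    preprocess(image) { return { pixel_values: image }; } // stub
}

// The alias: a second name for the very same class, not a copy.
const SmolVLMImageProcessor = Idefics3ImageProcessor;

const processor = new SmolVLMImageProcessor();
```

Because the alias is the same class object, `instanceof` checks and any future Idefics3 bug fixes apply to SmolVLM automatically.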

src/transformers.js

Lines changed: 1 addition & 0 deletions

@@ -20,6 +20,7 @@ export * from './configs.js';
 
 export * from './utils/audio.js';
 export * from './utils/image.js';
+export * from './utils/video.js';
 export * from './utils/tensor.js';
 export * from './utils/maths.js';