
Commit 591a112

Add support for SmolVLM2 (#1196)
* Add support for SmolVLM
* Always flush text streamer after prompt
* [WIP] video.js
* Fix streamer unit tests
* Export video.js
* Video processing improvements
1 parent cfd3e55 commit 591a112

File tree: 12 files changed (+153 −5 lines)

README.md

Lines changed: 1 addition & 0 deletions

@@ -402,6 +402,7 @@ You can refine your search by selecting the task you're interested in (e.g., [te
 1. **[SegFormer](https://huggingface.co/docs/transformers/model_doc/segformer)** (from NVIDIA) released with the paper [SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers](https://arxiv.org/abs/2105.15203) by Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo.
 1. **[Segment Anything](https://huggingface.co/docs/transformers/model_doc/sam)** (from Meta AI) released with the paper [Segment Anything](https://arxiv.org/pdf/2304.02643v1.pdf) by Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alex Berg, Wan-Yen Lo, Piotr Dollar, Ross Girshick.
 1. **[SigLIP](https://huggingface.co/docs/transformers/main/model_doc/siglip)** (from Google AI) released with the paper [Sigmoid Loss for Language Image Pre-Training](https://arxiv.org/abs/2303.15343) by Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, Lucas Beyer.
+1. **[SmolVLM](https://huggingface.co/docs/transformers/main/model_doc/smolvlm)** (from Hugging Face) released with the blog posts [SmolVLM - small yet mighty Vision Language Model](https://huggingface.co/blog/smolvlm) and [SmolVLM Grows Smaller – Introducing the 250M & 500M Models!](https://huggingface.co/blog/smolervlm) by the Hugging Face TB Research team.
 1. **[SpeechT5](https://huggingface.co/docs/transformers/model_doc/speecht5)** (from Microsoft Research) released with the paper [SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing](https://arxiv.org/abs/2110.07205) by Junyi Ao, Rui Wang, Long Zhou, Chengyi Wang, Shuo Ren, Yu Wu, Shujie Liu, Tom Ko, Qing Li, Yu Zhang, Zhihua Wei, Yao Qian, Jinyu Li, Furu Wei.
 1. **[SqueezeBERT](https://huggingface.co/docs/transformers/model_doc/squeezebert)** (from Berkeley) released with the paper [SqueezeBERT: What can computer vision teach NLP about efficient neural networks?](https://arxiv.org/abs/2006.11316) by Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, and Kurt W. Keutzer.
 1. **[StableLm](https://huggingface.co/docs/transformers/model_doc/stablelm)** (from Stability AI) released with the paper [StableLM 3B 4E1T (Technical Report)](https://stability.wandb.io/stability-llm/stable-lm/reports/StableLM-3B-4E1T--VmlldzoyMjU4?accessToken=u3zujipenkx5g7rtcj9qojjgxpconyjktjkli2po09nffrffdhhchq045vp0wyfo) by Jonathan Tow, Marco Bellagente, Dakota Mahan, Carlos Riquelme Ruiz, Duy Phung, Maksym Zhuravinskyi, Nathan Cooper, Nikhil Pinnaparaju, Reshinth Adithyan, and James Baicoianu.

docs/snippets/6_supported-models.snippet

Lines changed: 1 addition & 0 deletions

@@ -117,6 +117,7 @@
 1. **[SegFormer](https://huggingface.co/docs/transformers/model_doc/segformer)** (from NVIDIA) released with the paper [SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers](https://arxiv.org/abs/2105.15203) by Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo.
 1. **[Segment Anything](https://huggingface.co/docs/transformers/model_doc/sam)** (from Meta AI) released with the paper [Segment Anything](https://arxiv.org/pdf/2304.02643v1.pdf) by Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alex Berg, Wan-Yen Lo, Piotr Dollar, Ross Girshick.
 1. **[SigLIP](https://huggingface.co/docs/transformers/main/model_doc/siglip)** (from Google AI) released with the paper [Sigmoid Loss for Language Image Pre-Training](https://arxiv.org/abs/2303.15343) by Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, Lucas Beyer.
+1. **[SmolVLM](https://huggingface.co/docs/transformers/main/model_doc/smolvlm)** (from Hugging Face) released with the blog posts [SmolVLM - small yet mighty Vision Language Model](https://huggingface.co/blog/smolvlm) and [SmolVLM Grows Smaller – Introducing the 250M & 500M Models!](https://huggingface.co/blog/smolervlm) by the Hugging Face TB Research team.
 1. **[SpeechT5](https://huggingface.co/docs/transformers/model_doc/speecht5)** (from Microsoft Research) released with the paper [SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing](https://arxiv.org/abs/2110.07205) by Junyi Ao, Rui Wang, Long Zhou, Chengyi Wang, Shuo Ren, Yu Wu, Shujie Liu, Tom Ko, Qing Li, Yu Zhang, Zhihua Wei, Yao Qian, Jinyu Li, Furu Wei.
 1. **[SqueezeBERT](https://huggingface.co/docs/transformers/model_doc/squeezebert)** (from Berkeley) released with the paper [SqueezeBERT: What can computer vision teach NLP about efficient neural networks?](https://arxiv.org/abs/2006.11316) by Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, and Kurt W. Keutzer.
 1. **[StableLm](https://huggingface.co/docs/transformers/model_doc/stablelm)** (from Stability AI) released with the paper [StableLM 3B 4E1T (Technical Report)](https://stability.wandb.io/stability-llm/stable-lm/reports/StableLM-3B-4E1T--VmlldzoyMjU4?accessToken=u3zujipenkx5g7rtcj9qojjgxpconyjktjkli2po09nffrffdhhchq045vp0wyfo) by Jonathan Tow, Marco Bellagente, Dakota Mahan, Carlos Riquelme Ruiz, Duy Phung, Maksym Zhuravinskyi, Nathan Cooper, Nikhil Pinnaparaju, Reshinth Adithyan, and James Baicoianu.

src/configs.js

Lines changed: 1 addition & 0 deletions

@@ -70,6 +70,7 @@ function getNormalizedConfig(config) {
         case 'florence2':
         case 'llava_onevision':
         case 'idefics3':
+        case 'smolvlm':
             // @ts-expect-error TS2339
             init_normalized_config = getNormalizedConfig(config.text_config);
             break;
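The pattern this hunk extends: multi-modal model types fall through to a shared case that recursively normalizes the nested `text_config`, since the generation-relevant fields live on the inner language model. A minimal sketch of that fall-through (the `llama` field mapping and the standalone function shape are assumptions for illustration, not the library's actual code):

```javascript
// Sketch: multi-modal wrappers delegate normalization to their nested
// language-model config; 'smolvlm' simply joins the existing fall-through.
function getNormalizedConfig(config) {
    switch (config.model_type) {
        case 'florence2':
        case 'llava_onevision':
        case 'idefics3':
        case 'smolvlm':
            // Recurse: the text backbone carries the fields we need.
            return getNormalizedConfig(config.text_config);
        case 'llama':
            // Hypothetical mapping of raw field names to normalized ones.
            return { num_layers: config.num_hidden_layers, hidden_size: config.hidden_size };
        default:
            return { ...config };
    }
}

const normalized = getNormalizedConfig({
    model_type: 'smolvlm',
    text_config: { model_type: 'llama', num_hidden_layers: 24, hidden_size: 2048 },
});
```

One new `case` label is the entire per-model cost of this design, which is why adding SmolVLM here is a one-line change.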

src/generation/streamers.js

Lines changed: 4 additions & 3 deletions

@@ -72,9 +72,10 @@ export class TextStreamer extends BaseStreamer {
             throw Error('TextStreamer only supports batch size of 1');
         }
 
-        if (this.skip_prompt && this.next_tokens_are_prompt) {
+        const is_prompt = this.next_tokens_are_prompt;
+        if (is_prompt) {
             this.next_tokens_are_prompt = false;
-            return;
+            if (this.skip_prompt) return;
         }
 
         const tokens = value[0];
@@ -85,7 +86,7 @@ export class TextStreamer extends BaseStreamer {
         const text = this.tokenizer.decode(this.token_cache, this.decode_kwargs);
 
         let printable_text;
-        if (text.endsWith('\n')) {
+        if (is_prompt || text.endsWith('\n')) {
             // After the symbol for a new line, we flush the cache.
             printable_text = text.slice(this.print_len);
             this.token_cache = [];
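This change implements the "always flush text streamer after prompt" item from the commit message: previously the prompt tokens sat in the cache until a newline arrived; now the cache is flushed as soon as the prompt has been consumed, whether or not the prompt is skipped. A simplified, self-contained sketch of the new behavior (this is an assumed reduction of `TextStreamer`, with `decode` stubbed out):

```javascript
// Sketch of the post-change flush logic, not the library's actual class.
class SketchStreamer {
    constructor({ skip_prompt = false } = {}) {
        this.skip_prompt = skip_prompt;
        this.next_tokens_are_prompt = true;
        this.cache = [];
        this.out = []; // collected flushes, for demonstration
    }
    decode(tokens) { return tokens.join(''); } // stub tokenizer
    put(tokens) {
        const is_prompt = this.next_tokens_are_prompt;
        if (is_prompt) {
            this.next_tokens_are_prompt = false;
            if (this.skip_prompt) return; // skip, but remember we saw the prompt
        }
        this.cache.push(...tokens);
        const text = this.decode(this.cache);
        // Flush on newline as before, and now ALSO right after the prompt,
        // so prompt text is emitted immediately instead of lingering.
        if (is_prompt || text.endsWith('\n')) {
            this.out.push(text);
            this.cache = [];
        }
    }
}

const s = new SketchStreamer();
s.put(['Hello ', 'world ']); // prompt: flushed immediately
s.put(['a']);                // generated: cached (no newline yet)
s.put(['b\n']);              // newline: flushed
```

After these three calls, `s.out` holds two flushes: the prompt text and the generated `'ab\n'`. Note the key refactor mirrored from the diff: capturing `is_prompt` before clearing the flag lets the later flush condition see it.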

src/models.js

Lines changed: 10 additions & 1 deletion

@@ -3692,7 +3692,7 @@ export class Idefics3PreTrainedModel extends PreTrainedModel {
 }
 
 /**
- * The LLAVA model which consists of a vision backbone and a language model.
+ * The Idefics3 model which consists of a vision backbone and a language model.
  */
 export class Idefics3ForConditionalGeneration extends Idefics3PreTrainedModel {
 
@@ -3715,6 +3715,13 @@ export class Idefics3ForConditionalGeneration extends Idefics3PreTrainedModel {
 }
 //////////////////////////////////////////////////
 
+/**
+ * The SmolVLM Model with a language modeling head.
+ * It is made up of a SigLIP vision encoder, with a language modeling head on top.
+ */
+export class SmolVLMForConditionalGeneration extends Idefics3ForConditionalGeneration { }
+
+//////////////////////////////////////////////////
 export class Phi3VPreTrainedModel extends PreTrainedModel {
     forward_params = [
         'input_ids',
@@ -7316,6 +7323,7 @@ const MODEL_FOR_QUESTION_ANSWERING_MAPPING_NAMES = new Map([
 const MODEL_FOR_VISION_2_SEQ_MAPPING_NAMES = new Map([
     ['vision-encoder-decoder', ['VisionEncoderDecoderModel', VisionEncoderDecoderModel]],
     ['idefics3', ['Idefics3ForConditionalGeneration', Idefics3ForConditionalGeneration]],
+    ['smolvlm', ['SmolVLMForConditionalGeneration', SmolVLMForConditionalGeneration]],
 ]);
 
 const MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES = new Map([
@@ -7325,6 +7333,7 @@ const MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES = new Map([
     ['florence2', ['Florence2ForConditionalGeneration', Florence2ForConditionalGeneration]],
     ['qwen2-vl', ['Qwen2VLForConditionalGeneration', Qwen2VLForConditionalGeneration]],
     ['idefics3', ['Idefics3ForConditionalGeneration', Idefics3ForConditionalGeneration]],
+    ['smolvlm', ['SmolVLMForConditionalGeneration', SmolVLMForConditionalGeneration]],
     ['paligemma', ['PaliGemmaForConditionalGeneration', PaliGemmaForConditionalGeneration]],
 ]);
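The two moving parts of this change are subclassing (SmolVLM reuses the Idefics3 implementation wholesale via an empty subclass) and registration (the `model_type` → class `Map` routes `'smolvlm'` to it). A stripped-down sketch of how those fit together (class bodies and the `modelClassFor` helper are illustrative stubs, not the library's API):

```javascript
// Sketch: an empty subclass inherits all Idefics3 behavior, and the
// mapping makes it reachable by model_type. Bodies are stubs.
class Idefics3ForConditionalGeneration {
    generate() { return 'idefics3-style generation'; }
}
class SmolVLMForConditionalGeneration extends Idefics3ForConditionalGeneration { }

const MODEL_MAPPING = new Map([
    ['idefics3', Idefics3ForConditionalGeneration],
    ['smolvlm', SmolVLMForConditionalGeneration],
]);

// Hypothetical lookup helper, mirroring how AutoModel-style resolution works.
function modelClassFor(model_type) {
    const cls = MODEL_MAPPING.get(model_type);
    if (!cls) throw new Error(`Unsupported model_type: ${model_type}`);
    return cls;
}

const model = new (modelClassFor('smolvlm'))();
```

The empty subclass keeps the two model types distinct for dispatch and error messages while sharing one implementation, which is why the models.js portion of this commit is only ten lines.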

src/models/image_processors.js

Lines changed: 1 addition & 0 deletions

@@ -32,6 +32,7 @@ export * from './rt_detr/image_processing_rt_detr.js'
 export * from './sam/image_processing_sam.js'
 export * from './segformer/image_processing_segformer.js'
 export * from './siglip/image_processing_siglip.js'
+export * from './smolvlm/image_processing_smolvlm.js'
 export * from './swin2sr/image_processing_swin2sr.js'
 export * from './vit/image_processing_vit.js'
 export * from './vitmatte/image_processing_vitmatte.js'

src/models/processors.js

Lines changed: 1 addition & 0 deletions

@@ -11,6 +11,7 @@ export * from './paligemma/processing_paligemma.js';
 export * from './pyannote/processing_pyannote.js';
 export * from './qwen2_vl/processing_qwen2_vl.js';
 export * from './sam/processing_sam.js';
+export * from './smolvlm/processing_smolvlm.js';
 export * from './speecht5/processing_speecht5.js';
 export * from './wav2vec2/processing_wav2vec2.js';
 export * from './wav2vec2_with_lm/processing_wav2vec2_with_lm.js';
src/models/smolvlm/image_processing_smolvlm.js (new file)

Lines changed: 2 additions & 0 deletions

@@ -0,0 +1,2 @@
+export { Idefics3ImageProcessor as SmolVLMImageProcessor } from "../idefics3/image_processing_idefics3.js";
+

src/models/smolvlm/processing_smolvlm.js (new file)

Lines changed: 2 additions & 0 deletions

@@ -0,0 +1,2 @@
+export { Idefics3Processor as SmolVLMProcessor } from "../idefics3/processing_idefics3.js";
+
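Both new files use the same re-export aliasing trick as the model class: the SmolVLM processor and image processor *are* the Idefics3 ones under a new name, so one implementation serves both model types. Simulated in a single file below (the real code uses ES-module `export { A as B }` across files; the class body here is a stub):

```javascript
// Single-file simulation of the re-export alias. In the real repo this is
// split across modules; aliasing a class binding has the same effect as
// `export { Idefics3ImageProcessor as SmolVLMImageProcessor }`.
class Idefics3ImageProcessor {
    preprocess(image) { return { pixel_values: image }; } // stub
}

// The alias: a second name for the very same class, not a copy.
const SmolVLMImageProcessor = Idefics3ImageProcessor;

const processor = new SmolVLMImageProcessor();
```

Because the alias is the same class object, `instanceof` checks and any future Idefics3 bug fixes apply to SmolVLM automatically.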

src/transformers.js

Lines changed: 1 addition & 0 deletions

@@ -20,6 +20,7 @@ export * from './configs.js';
 
 export * from './utils/audio.js';
 export * from './utils/image.js';
+export * from './utils/video.js';
 export * from './utils/tensor.js';
 export * from './utils/maths.js';