
Commit 72b815c

Add support for Gemma3n (#1348)
* Align llava processor with python library
* Add support for llava_qwen2
* Update llava unit tests
* Fix test
* Update florence2 processor & tests
* Update florence2 unit tests
* Add support for gemma3n
* Pass input_features_mask to audio encoder
* Implement gemma3n feature extraction
* npm audit fix
* Add model to supported list
* Fix JSDoc
1 parent 1f49a13 commit 72b815c

19 files changed: +387 -67 lines

README.md

Lines changed: 1 addition & 0 deletions
@@ -332,6 +332,7 @@ You can refine your search by selecting the task you're interested in (e.g., [te
 1. **[Gemma](https://huggingface.co/docs/transformers/main/model_doc/gemma)** (from Google) released with the paper [Gemma: Open Models Based on Gemini Technology and Research](https://blog.google/technology/developers/gemma-open-models/) by the Gemma Google team.
 1. **[Gemma2](https://huggingface.co/docs/transformers/main/model_doc/gemma2)** (from Google) released with the paper [Gemma2: Open Models Based on Gemini Technology and Research](https://blog.google/technology/developers/google-gemma-2/) by the Gemma Google team.
 1. **[Gemma3](https://huggingface.co/docs/transformers/main/model_doc/gemma3)** (from Google) released with the paper [Introducing Gemma 3: The most capable model you can run on a single GPU or TPU](https://blog.google/technology/developers/gemma-3/) by the Gemma Google team.
+1. **[Gemma3n](https://huggingface.co/docs/transformers/main/model_doc/gemma3n)** (from Google) released with the paper [Announcing Gemma 3n preview: powerful, efficient, mobile-first AI](https://developers.googleblog.com/en/introducing-gemma-3n/) by the Gemma Google team.
 1. **[GLM](https://huggingface.co/docs/transformers/main/model_doc/glm)** (from the GLM Team, THUDM & ZhipuAI) released with the paper [ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools](https://huggingface.co/papers/2406.12793v2) by Team GLM: Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, Da Yin, Dan Zhang, Diego Rojas, Guanyu Feng, Hanlin Zhao, Hanyu Lai, Hao Yu, Hongning Wang, Jiadai Sun, Jiajie Zhang, Jiale Cheng, Jiayi Gui, Jie Tang, Jing Zhang, Jingyu Sun, Juanzi Li, Lei Zhao, Lindong Wu, Lucen Zhong, Mingdao Liu, Minlie Huang, Peng Zhang, Qinkai Zheng, Rui Lu, Shuaiqi Duan, Shudan Zhang, Shulin Cao, Shuxun Yang, Weng Lam Tam, Wenyi Zhao, Xiao Liu, Xiao Xia, Xiaohan Zhang, Xiaotao Gu, Xin Lv, Xinghan Liu, Xinyi Liu, Xinyue Yang, Xixuan Song, Xunkai Zhang, Yifan An, Yifan Xu, Yilin Niu, Yuantao Yang, Yueyan Li, Yushi Bai, Yuxiao Dong, Zehan Qi, Zhaoyu Wang, Zhen Yang, Zhengxiao Du, Zhenyu Hou, Zihan Wang.
 1. **[GLPN](https://huggingface.co/docs/transformers/model_doc/glpn)** (from KAIST) released with the paper [Global-Local Path Networks for Monocular Depth Estimation with Vertical CutDepth](https://huggingface.co/papers/2201.07436) by Doyeon Kim, Woonghyun Ga, Pyungwhan Ahn, Donggyu Joo, Sehwan Chun, Junmo Kim.
 1. **[GPT Neo](https://huggingface.co/docs/transformers/model_doc/gpt_neo)** (from EleutherAI) released in the repository [EleutherAI/gpt-neo](https://github.com/EleutherAI/gpt-neo) by Sid Black, Stella Biderman, Leo Gao, Phil Wang and Connor Leahy.

docs/snippets/6_supported-models.snippet

Lines changed: 1 addition & 0 deletions
@@ -46,6 +46,7 @@
 1. **[Gemma](https://huggingface.co/docs/transformers/main/model_doc/gemma)** (from Google) released with the paper [Gemma: Open Models Based on Gemini Technology and Research](https://blog.google/technology/developers/gemma-open-models/) by the Gemma Google team.
 1. **[Gemma2](https://huggingface.co/docs/transformers/main/model_doc/gemma2)** (from Google) released with the paper [Gemma2: Open Models Based on Gemini Technology and Research](https://blog.google/technology/developers/google-gemma-2/) by the Gemma Google team.
 1. **[Gemma3](https://huggingface.co/docs/transformers/main/model_doc/gemma3)** (from Google) released with the paper [Introducing Gemma 3: The most capable model you can run on a single GPU or TPU](https://blog.google/technology/developers/gemma-3/) by the Gemma Google team.
+1. **[Gemma3n](https://huggingface.co/docs/transformers/main/model_doc/gemma3n)** (from Google) released with the paper [Announcing Gemma 3n preview: powerful, efficient, mobile-first AI](https://developers.googleblog.com/en/introducing-gemma-3n/) by the Gemma Google team.
 1. **[GLM](https://huggingface.co/docs/transformers/main/model_doc/glm)** (from the GLM Team, THUDM & ZhipuAI) released with the paper [ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools](https://huggingface.co/papers/2406.12793v2) by Team GLM: Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, Da Yin, Dan Zhang, Diego Rojas, Guanyu Feng, Hanlin Zhao, Hanyu Lai, Hao Yu, Hongning Wang, Jiadai Sun, Jiajie Zhang, Jiale Cheng, Jiayi Gui, Jie Tang, Jing Zhang, Jingyu Sun, Juanzi Li, Lei Zhao, Lindong Wu, Lucen Zhong, Mingdao Liu, Minlie Huang, Peng Zhang, Qinkai Zheng, Rui Lu, Shuaiqi Duan, Shudan Zhang, Shulin Cao, Shuxun Yang, Weng Lam Tam, Wenyi Zhao, Xiao Liu, Xiao Xia, Xiaohan Zhang, Xiaotao Gu, Xin Lv, Xinghan Liu, Xinyi Liu, Xinyue Yang, Xixuan Song, Xunkai Zhang, Yifan An, Yifan Xu, Yilin Niu, Yuantao Yang, Yueyan Li, Yushi Bai, Yuxiao Dong, Zehan Qi, Zhaoyu Wang, Zhen Yang, Zhengxiao Du, Zhenyu Hou, Zihan Wang.
 1. **[GLPN](https://huggingface.co/docs/transformers/model_doc/glpn)** (from KAIST) released with the paper [Global-Local Path Networks for Monocular Depth Estimation with Vertical CutDepth](https://huggingface.co/papers/2201.07436) by Doyeon Kim, Woonghyun Ga, Pyungwhan Ahn, Donggyu Joo, Sehwan Chun, Junmo Kim.
 1. **[GPT Neo](https://huggingface.co/docs/transformers/model_doc/gpt_neo)** (from EleutherAI) released in the repository [EleutherAI/gpt-neo](https://github.com/EleutherAI/gpt-neo) by Sid Black, Stella Biderman, Leo Gao, Phil Wang and Connor Leahy.

package-lock.json

Lines changed: 23 additions & 38 deletions
Some generated files are not rendered by default.

src/base/processing_utils.js

Lines changed: 13 additions & 6 deletions
@@ -19,11 +19,11 @@
  *
  * @module processors
  */
-import { PROCESSOR_NAME } from '../utils/constants.js';
+import { PROCESSOR_NAME, CHAT_TEMPLATE_NAME } from '../utils/constants.js';
 import {
     Callable,
 } from '../utils/generic.js';
-import { getModelJSON } from '../utils/hub.js';
+import { getModelJSON, getModelText } from '../utils/hub.js';

 /**
  * @typedef {Object} ProcessorProperties Additional processor-specific properties.
@@ -42,16 +42,19 @@ export class Processor extends Callable {
         'feature_extractor_class',
     ]
     static uses_processor_config = false;
+    static uses_chat_template_file = false;

     /**
      * Creates a new Processor with the given components
      * @param {Object} config
      * @param {Record<string, Object>} components
+     * @param {string} chat_template
      */
-    constructor(config, components) {
+    constructor(config, components, chat_template) {
         super();
         this.config = config;
         this.components = components;
+        this.chat_template = chat_template;
     }

     /**
@@ -86,6 +89,7 @@ export class Processor extends Callable {
         }
         return this.tokenizer.apply_chat_template(messages, {
             tokenize: false, // default to false
+            chat_template: this.chat_template ?? undefined,
             ...options,
         });
     }
@@ -146,7 +150,7 @@
      */
     static async from_pretrained(pretrained_model_name_or_path, options) {

-        const [config, components] = await Promise.all([
+        const [config, components, chat_template] = await Promise.all([
             // TODO:
             this.uses_processor_config
                 ? getModelJSON(pretrained_model_name_or_path, PROCESSOR_NAME, true, options)
@@ -158,9 +162,12 @@
                 const component = await this[cls].from_pretrained(pretrained_model_name_or_path, options);
                 return [cls.replace(/_class$/, ''), component];
             })
-        ).then(Object.fromEntries)
+        ).then(Object.fromEntries),
+            this.uses_chat_template_file
+                ? getModelText(pretrained_model_name_or_path, CHAT_TEMPLATE_NAME, true, options)
+                : null,
         ]);

-        return new this(config, components);
+        return new this(config, components, chat_template);
     }
 }
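For context, a minimal sketch of how the new fallback behaves from the caller's side. The model id below is an assumption for illustration; any repo whose processor class sets `uses_chat_template_file = true` and ships a separate chat-template file works the same way:

```js
import { AutoProcessor } from '@huggingface/transformers';

// Hypothetical model id, for illustration only.
const processor = await AutoProcessor.from_pretrained('onnx-community/gemma-3n-E2B-it-ONNX');

// When `uses_chat_template_file` is set, the template is fetched as a separate
// text file at load time and stored on the instance; apply_chat_template then
// forwards it to the tokenizer unless the caller passes their own template.
const prompt = processor.apply_chat_template(
    [{ role: 'user', content: 'Hello!' }],
    { add_generation_prompt: true },
);
```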

src/configs.js

Lines changed: 2 additions & 0 deletions
@@ -73,6 +73,7 @@ function getNormalizedConfig(config) {
         case 'idefics3':
         case 'ultravox':
         case 'smolvlm':
+        case 'gemma3n':
             // @ts-expect-error TS2339
             init_normalized_config = getNormalizedConfig(config.text_config);
             break;
@@ -130,6 +131,7 @@ function getNormalizedConfig(config) {
         case 'gemma':
         case 'gemma2':
         case 'gemma3_text':
+        case 'gemma3n_text':
         case 'glm':
         case 'helium':
             mapping['num_heads'] = 'num_key_value_heads';
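To make the two new cases concrete: `gemma3n` is a composite config whose generation-relevant fields live under `text_config`, so normalization recurses into it, and the inner `gemma3n_text` config maps `num_key_value_heads` onto the normalized `num_heads` key. A hypothetical, abridged shape (field values are placeholders, not taken from a real checkpoint):

```js
// Abridged, hypothetical config.json for a gemma3n checkpoint.
const config = {
    model_type: 'gemma3n',          // composite config: normalization recurses into text_config
    text_config: {
        model_type: 'gemma3n_text', // text backbone: num_heads <- num_key_value_heads
        num_hidden_layers: 30,      // placeholder value
        num_key_value_heads: 2,     // placeholder value
        head_dim: 256,              // placeholder value
    },
};
```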

src/models.js

Lines changed: 126 additions & 1 deletion
@@ -136,6 +136,7 @@ const MODEL_TYPES = {
     Phi3V: 9,
     AudioTextToText: 10,
     AutoEncoder: 11,
+    ImageAudioTextToText: 12,
 }
 //////////////////////////////////////////////////
@@ -1057,6 +1058,7 @@ export class PreTrainedModel extends Callable {
                 this._prepare_inputs_for_generation = multimodal_text_to_text_prepare_inputs_for_generation;
                 break;
             case MODEL_TYPES.Phi3V:
+            case MODEL_TYPES.ImageAudioTextToText:
                 this.can_generate = true;
                 this._prepare_inputs_for_generation = multimodal_text_to_text_prepare_inputs_for_generation;
                 break;
@@ -1210,7 +1212,19 @@
                 generation_config: 'generation_config.json',
             }, options),
         ]);
-
+        } else if (modelType === MODEL_TYPES.ImageAudioTextToText) {
+            const sessions = {
+                embed_tokens: 'embed_tokens',
+                audio_encoder: 'audio_encoder',
+                vision_encoder: 'vision_encoder',
+                decoder_model_merged: 'decoder_model_merged',
+            }
+            info = await Promise.all([
+                constructSessions(pretrained_model_name_or_path, sessions, options),
+                getOptionalConfigs(pretrained_model_name_or_path, {
+                    generation_config: 'generation_config.json',
+                }, options),
+            ]);
         } else if (modelType === MODEL_TYPES.Musicgen) {
             info = await Promise.all([
                 constructSessions(pretrained_model_name_or_path, {
@@ -3795,6 +3809,114 @@ export class LlavaQwen2ForCausalLM extends LlavaPreTrainedModel {
     }
 }

+export class Gemma3nPreTrainedModel extends PreTrainedModel {
+    forward_params = [
+        'input_ids',
+        'attention_mask',
+        'inputs_embeds',
+        'per_layer_inputs',
+
+        'position_ids',
+        'pixel_values',
+        'input_features',
+        'input_features_mask',
+        'past_key_values',
+    ];
+}
+export class Gemma3nForConditionalGeneration extends Gemma3nPreTrainedModel {
+
+    async forward({
+        // Produced by the tokenizer/processor:
+        input_ids = null,
+        attention_mask = null,
+        pixel_values = null,
+        input_features = null,
+        input_features_mask = null,
+
+        // Used during generation:
+        position_ids = null,
+        inputs_embeds = null,
+        per_layer_inputs = null,
+        past_key_values = null,
+
+        // Generic generation parameters
+        generation_config = null,
+        logits_processor = null,
+
+        // TODO: needed?
+        ...kwargs
+    }) {
+        if (!inputs_embeds || !per_layer_inputs) {
+            // 1. Extract the text embeddings.
+            ({ inputs_embeds, per_layer_inputs } = await sessionRun(this.sessions['embed_tokens'], {
+                input_ids,
+            }));
+            if (input_ids.dims[1] !== 1) {
+                if (pixel_values) {
+                    // Encode the image
+                    const { image_features } = await sessionRun(this.sessions['vision_encoder'], {
+                        pixel_values,
+                    });
+                    ({ inputs_embeds, attention_mask } = this._merge_input_ids_with_image_features({
+                        image_features,
+                        inputs_embeds,
+                        input_ids,
+                        attention_mask,
+                    }));
+                }
+
+                if (input_features) {
+                    // Encode the audio
+                    const { audio_features } = await sessionRun(this.sessions['audio_encoder'], {
+                        input_features,
+                        input_features_mask,
+                    });
+                    ({ inputs_embeds, attention_mask } = this._merge_input_ids_with_audio_features({
+                        audio_features,
+                        inputs_embeds,
+                        input_ids,
+                        attention_mask,
+                    }));
+                }
+            }
+        }
+
+        const outputs = await decoderForward(this, {
+            inputs_embeds,
+            per_layer_inputs,
+            past_key_values,
+            attention_mask,
+            position_ids,
+            generation_config,
+            logits_processor,
+        }, true);
+        return outputs;
+    }
+
+    _merge_input_ids_with_image_features(kwargs) {
+        const vision_hidden_size = kwargs.image_features.dims.at(-1);
+        const reshaped_image_hidden_states = kwargs.image_features.view(-1, vision_hidden_size);
+        return default_merge_input_ids_with_image_features({
+            // @ts-ignore
+            image_token_id: this.config.image_token_id,
+            ...kwargs,
+            image_features: reshaped_image_hidden_states,
+        });
+    }
+    _merge_input_ids_with_audio_features(kwargs) {
+        const audio_hidden_size = kwargs.audio_features.dims.at(-1);
+        const reshaped_audio_features = kwargs.audio_features.view(-1, audio_hidden_size);
+
+        return default_merge_input_ids_with_audio_features({
+            // @ts-ignore
+            audio_token_id: this.config.audio_token_id,
+            ...kwargs,
+            audio_features: reshaped_audio_features,
+        })
+    }
+}
+
+
 //////////////////////////////////////////////////
 // Idefics3 Models
 export class Idefics3PreTrainedModel extends PreTrainedModel {
@@ -7799,6 +7921,7 @@ const MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES = new Map([
     ['smolvlm', ['SmolVLMForConditionalGeneration', SmolVLMForConditionalGeneration]],
     ['paligemma', ['PaliGemmaForConditionalGeneration', PaliGemmaForConditionalGeneration]],
     ['llava_qwen2', ['LlavaQwen2ForCausalLM', LlavaQwen2ForCausalLM]],
+    ['gemma3n', ['Gemma3nForConditionalGeneration', Gemma3nForConditionalGeneration]],
 ]);

 const MODEL_FOR_AUDIO_TEXT_TO_TEXT_MAPPING_NAMES = new Map([
@@ -8015,6 +8138,8 @@ const CUSTOM_MAPPING = [
     ['MimiDecoderModel', MimiDecoderModel, MODEL_TYPES.EncoderOnly],
     ['SnacEncoderModel', SnacEncoderModel, MODEL_TYPES.EncoderOnly],
     ['SnacDecoderModel', SnacDecoderModel, MODEL_TYPES.EncoderOnly],
+
+    ['Gemma3nForConditionalGeneration', Gemma3nForConditionalGeneration, MODEL_TYPES.ImageAudioTextToText],
 ]
 for (const [name, model, type] of CUSTOM_MAPPING) {
     MODEL_TYPE_MAPPING.set(name, type);
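A hedged end-to-end sketch of how the new class might be driven. The model id and media URLs are placeholders, and the exact processor call signature is an assumption (transformers.js processors differ in how they accept images and audio); treat this as a sketch, not the canonical usage:

```js
import {
    AutoProcessor,
    AutoModelForImageTextToText,
    RawImage,
    read_audio,
} from '@huggingface/transformers';

// Assumed ONNX export of Gemma 3n; substitute your own repo.
const model_id = 'onnx-community/gemma-3n-E2B-it-ONNX';
const processor = await AutoProcessor.from_pretrained(model_id);
const model = await AutoModelForImageTextToText.from_pretrained(model_id);

// The chat template is expected to expand the image/audio entries into the
// placeholder tokens that _merge_input_ids_with_image_features and
// _merge_input_ids_with_audio_features later scatter encoder outputs into.
const messages = [{
    role: 'user',
    content: [
        { type: 'image' },
        { type: 'audio' },
        { type: 'text', text: 'Describe the image and transcribe the audio.' },
    ],
}];
const prompt = processor.apply_chat_template(messages, { add_generation_prompt: true });

const image = await RawImage.fromURL('https://example.com/cat.png');      // placeholder URL
const audio = await read_audio('https://example.com/speech.wav', 16000);  // placeholder URL

// Assumed call order: text, then image, then audio.
const inputs = await processor(prompt, image, audio);
const output_ids = await model.generate({ ...inputs, max_new_tokens: 128 });
console.log(processor.batch_decode(output_ids, { skip_special_tokens: true })[0]);
```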
