Add Gemma 3 Text support #436
Conversation
Force-pushed from 1f4f847 to 13b3a33.
I generated the tiny models with:

```python
# Create tiny random Gemma 3 models for Bumblebee testing
# Run with: HF_TOKEN=hf_xxx python create_tiny_gemma3.py
import os
from huggingface_hub import login
# Login with token
token = os.environ.get("HF_TOKEN")
if token:
    login(token=token)
else:
    print("Warning: HF_TOKEN not set, using cached credentials")
from transformers import (
Gemma3TextConfig,
Gemma3TextModel,
Gemma3ForCausalLM, # No "Text" variant for CausalLM
Gemma3TextForSequenceClassification,
)
# Tiny config matching Gemma 3 text architecture
config = Gemma3TextConfig(
vocab_size=1024,
hidden_size=32,
num_hidden_layers=2,
num_attention_heads=4,
num_key_value_heads=2,
head_dim=8,
intermediate_size=64,
hidden_activation="gelu_pytorch_tanh",
max_position_embeddings=512,
rms_norm_eps=1e-6,
rope_theta=10000.0,
attention_bias=False,
attention_dropout=0.0,
sliding_window=128,
pad_token_id=0,
bos_token_id=2,
eos_token_id=1,
tie_word_embeddings=True,
initializer_range=0.02,
query_pre_attn_scalar=8,
)
# For sequence classification
config_seq_cls = Gemma3TextConfig(
vocab_size=1024,
hidden_size=32,
num_hidden_layers=2,
num_attention_heads=4,
num_key_value_heads=2,
head_dim=8,
intermediate_size=64,
hidden_activation="gelu_pytorch_tanh",
max_position_embeddings=512,
rms_norm_eps=1e-6,
rope_theta=10000.0,
attention_bias=False,
attention_dropout=0.0,
sliding_window=128,
pad_token_id=0,
bos_token_id=2,
eos_token_id=1,
tie_word_embeddings=True,
initializer_range=0.02,
query_pre_attn_scalar=8,
num_labels=2,
)
models = [
(Gemma3TextModel, config, "tiny-random-Gemma3Model"),
(Gemma3ForCausalLM, config, "tiny-random-Gemma3ForCausalLM"),
(Gemma3TextForSequenceClassification, config_seq_cls, "tiny-random-Gemma3ForSequenceClassification"),
]
print("Creating tiny random Gemma 3 models...")
for model_class, model_config, name in models:
    print(f"\nCreating {name}...")
    model = model_class(model_config)
    local_path = f"./{name}"
    model.save_pretrained(local_path)
    print(f"  Saved to {local_path}")
    repo_id = f"nmaroulis/{name}"
    model.push_to_hub(repo_id)
    print(f"  Pushed to https://huggingface.co/{repo_id}")
print("\nDone!")
```
Gemma 3 architecture includes several key differences from Gemma v1:
- QK-norm (RMS normalization on query/key after projection)
- Pre/post FFN layer norms (pre_feedforward_layernorm, post_feedforward_layernorm)
- Different residual connection order (after post_attention_layernorm)
- Alternating local/global attention (sliding window)
- RMS norm with shift=1.0 formula: output * (1.0 + weight) (see the sketch after this list)

Files added:
- lib/bumblebee/text/gemma3.ex: full Gemma 3 model implementation
- test/bumblebee/text/gemma3_test.exs: unit tests
- notebooks/function_calling.livemd: Livebook with FunctionGemma examples

Files modified:
- lib/bumblebee.ex: model and tokenizer registrations
- lib/bumblebee/layers/transformer.ex: per-layer attention_window_size support
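For context, a minimal Nx sketch of the shifted RMS norm mentioned above (illustrative only, not Bumblebee's internal implementation; module and function names are made up):

```elixir
defmodule Gemma3NormSketch do
  import Nx.Defn

  # Normalize by the root mean square, then scale by (1.0 + weight)
  # rather than the plain weight used in the original Gemma formulation.
  defn shifted_rms_norm(x, weight, opts \\ []) do
    opts = keyword!(opts, eps: 1.0e-6)
    variance = Nx.mean(x * x, axes: [-1], keep_axes: true)
    x * Nx.rsqrt(variance + opts[:eps]) * (1.0 + weight)
  end
end
```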
Force-pushed from 13b3a33 to 1fc7aaf.
…ests
- Refactor decoder to use shared Layers.Transformer.blocks infrastructure
- Use per-layer attention_window_size function for alternating local/global attention
- Use query_norm/key_norm options for QK-normalization
- Use custom block_type function for Gemma 3's unique normalization structure
- Add assert_all_close with reference values from Python transformers
- Fix bug in Layers.Transformer.blocks where attention_window_size was duplicated when using a function for per-layer configuration
- Update params_mapping to use query_norm/key_norm naming from shared infrastructure
Thanks for the review feedback! I've addressed both comments:
Added assert_all_close assertions with reference values obtained from Python transformers:

```python
import torch
input_ids = torch.tensor([[10, 20, 30, 40, 50, 60, 70, 80, 0, 0]])
```

- :base: model = AutoModel.from_pretrained("nmaroulis/tiny-random-Gemma3Model"), hidden_state[[.., 1..3, 1..3]]: [[-1.6458, 0.7249, -0.5747], [-1.9452, -0.1602, -0.2329], [-2.3408, -0.4665, -0.1177]]
- :for_causal_language_modeling: model = AutoModelForCausalLM.from_pretrained("nmaroulis/tiny-random-Gemma3ForCausalLM"), logits[[.., 1..3, 1..3]]: [[0.1472, 0.0633, 0.0922], [-0.1089, -0.0344, 0.0755], [0.0112, 0.1083, 0.1461]]
- :for_sequence_classification: model = AutoModelForSequenceClassification.from_pretrained("nmaroulis/tiny-random-Gemma3ForSequenceClassification"), logits: [[-0.0060, -0.0212]]

All tests now verify numerical equivalence with Python transformers.
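For reference, a minimal sketch of how such a check typically looks inside the project's ExUnit test module (illustrative layout only; assert_all_close is the test helper mentioned above, and the repository, inputs, and slice follow the values quoted):

```elixir
test ":base" do
  # Illustrative sketch, not the exact test from this PR.
  assert {:ok, %{model: model, params: params}} =
           Bumblebee.load_model({:hf, "nmaroulis/tiny-random-Gemma3Model"})

  inputs = %{
    "input_ids" => Nx.tensor([[10, 20, 30, 40, 50, 60, 70, 80, 0, 0]]),
    "attention_mask" => Nx.tensor([[1, 1, 1, 1, 1, 1, 1, 1, 0, 0]])
  }

  outputs = Axon.predict(model, params, inputs)

  assert_all_close(
    outputs.hidden_state[[.., 1..3, 1..3]],
    Nx.tensor([
      [[-1.6458, 0.7249, -0.5747], [-1.9452, -0.1602, -0.2329], [-2.3408, -0.4665, -0.1177]]
    ])
  )
end
```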
Replaced the custom block iteration with Layers.Transformer.blocks. The key change is the call to `Layers.Transformer.blocks(hidden_state, ...)`: the custom gemma3_block_impl/4 function handles Gemma 3's unique block structure while leveraging the shared attention infrastructure.
Btw. I uploaded the tiny-random models to bumblebee-testing, so you can switch the tests to use those :)
@nyo16 by the way, the gemma3 attention uses a configurable scalar here. I've just pushed a8caabd, which adds support for it, and I updated the tiny-random checkpoints I pushed accordingly.
- Rename Bumblebee.Text.Gemma3 to Bumblebee.Text.Gemma3Text to distinguish the text-only model from a future multimodal Gemma3
- Add attention_scale_base config option (from query_pre_attn_scalar)
- Compute attention scale as attention_scale_base ** -0.5 (see the snippet after this list)
- Update model mappings to use Gemma3Text* variants
- Update tests to use bumblebee-testing models with Python reference values
- Fix duplicate attention_window_size key in transformer.ex after merge
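For reference, a tiny sketch of that scale computation with the test config's query_pre_attn_scalar = 8 (illustrative; these names are not Bumblebee internals):

```elixir
# Attention scores are scaled by attention_scale_base ** -0.5
# rather than the usual head_dim ** -0.5.
attention_scale_base = 8
scale = :math.pow(attention_scale_base, -0.5)
# => approximately 0.3536
```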
Thanks so much @jonatanklosko! I've updated the branch based on your feedback.
Note: There are still some small numerical differences between Elixir and Python outputs (max ~0.15), so I used slightly higher tolerances in the tests. This might be worth investigating further if needed, but the model works correctly with the FunctionGemma examples: I tested with the code and got the correct answers. If I find more time, I will investigate further.
Co-authored-by: Jonatan Kłosko <[email protected]>
lib/bumblebee/text/gemma3_text.ex (Outdated)
```elixir
global_attention_layer_interval: [
  default: 6,
  doc: """
  the interval for global attention layers. In Gemma 3, every Nth layer uses global
  attention while others use local (sliding window) attention. A value of 6 means
  layers 5, 11, 17, 23... use global attention (5:1 local/global ratio)
  """
],
```
We never load this from the HuggingFace config, so it will cause discrepancies (could be the reason why tests failed).
In fact, the "interval" approach is deprecated, see this code.
I think we can go with option :layer_types, which is a list of either :sliding_attention or :full_attention. We have a similar config here:
bumblebee/lib/bumblebee/diffusion/controlnet.ex, lines 414 to 422 in a8caabd:

```elixir
up_block_types: {
  "up_block_types",
  list(
    mapping(%{
      "UpBlock2D" => :up_block,
      "CrossAttnUpBlock2D" => :cross_attention_up_block
    })
  )
},
```
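For illustration, a hedged sketch of what an analogous Gemma 3 option mapping could look like, following the same pattern (a fragment only; the exact option name and placement are up to the implementation):

```elixir
layer_types: {
  "layer_types",
  list(
    mapping(%{
      "sliding_attention" => :sliding_attention,
      "full_attention" => :full_attention
    })
  )
},
```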
We should also handle "sliding_window_pattern" if present. One way to do it would be to do something like this:

```elixir
# Support sliding_window_pattern for backward compatibility, see https://github.com/huggingface/transformers/blob/v5.0.0rc1/src/transformers/models/gemma3/configuration_gemma3.py#L188-L195
data =
  Map.put_new_lazy(data, "layer_types", fn ->
    pattern = data["sliding_window_pattern"] || 6
    # generate a list of Python-like layer types
  end)
```
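For illustration, the elided generation step could look something like this: a hedged sketch mirroring the transformers fallback, where every pattern-th layer uses full attention and the rest use sliding attention (assumes data is the decoded HF config map from the snippet above, with the standard "num_hidden_layers" key):

```elixir
data =
  Map.put_new_lazy(data, "layer_types", fn ->
    pattern = data["sliding_window_pattern"] || 6
    num_layers = data["num_hidden_layers"]

    # Every pattern-th layer (1-indexed) is full attention, others are sliding.
    for i <- 0..(num_layers - 1) do
      if rem(i + 1, pattern) == 0, do: "full_attention", else: "sliding_attention"
    end
  end)
```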
Btw. I pushed updated checkpoints to HF with:

```json
"layer_types": [
  "sliding_attention",
  "full_attention"
],
```
This way we can test both layer types.
You will need to update the reference values.
We should be within 4-digit precision, or in other words we should not need the higher tolerances.

FWIW, I generated the checkpoints using this config:

```python
from transformers import Gemma3TextConfig, Gemma3TextModel, Gemma3ForCausalLM, Gemma3TextForSequenceClassification
config = Gemma3TextConfig(
vocab_size=1024,
hidden_size=32,
num_hidden_layers=2,
num_attention_heads=4,
num_key_value_heads=2,
intermediate_size=37,
hidden_act="gelu_pytorch_tanh",
hidden_dropout_prob=0.1,
attention_probs_dropout_prob=0.1,
max_position_embeddings=512,
type_vocab_size=16,
is_decoder=False,
initializer_range=0.02,
pad_token_id=0,
bos_token_id=2,
eos_token_id=1,
sliding_window=128,
query_pre_attn_scalar=8,
sliding_window_pattern=2
# layer_types=[
# "sliding_attention",
# "full_attention"
# ]
)
for c in [Gemma3TextModel, Gemma3ForCausalLM, Gemma3TextForSequenceClassification]:
    name = c.__name__
    c(config).save_pretrained(f"bumblebee-testing/tiny-random-{name}", repo_id=f"bumblebee-testing/tiny-random-{name}", push_to_hub=True)
```
- Add layer_types config option for per-layer attention type (sliding vs full)
- Generate layer_types from sliding_window_pattern for backward compatibility
- Update test expected values with Python reference outputs
- Add tolerances to tests (base model needs larger tolerance, see TODO)
Gemma 3 uses different RoPE bases for local vs global attention:
- Local (sliding) attention: rope_local_base_freq = 10,000
- Global (full) attention: rope_theta = 1,000,000

This fixes numerical discrepancies where positions 2+ diverged by ~0.2-0.3 from Python reference values. Now achieves 4-digit precision (atol: 1.0e-4) across all positions.

Changes:
- Add rotary_embedding_base_local spec option
- Load rope_local_base_freq from HuggingFace config
- Use per-layer rotary embedding base based on layer type
- Tighten test tolerances from 0.35 to 1.0e-4
@jonatanklosko Thanks for the updated test models and pushing for better testing!

Issue: Gemma 3 uses different RoPE bases for local vs global attention layers (10,000 for local/sliding, 1,000,000 for global/full). Bumblebee was using the global base (1M) for all layers, which caused positions 2+ to diverge significantly (~0.2-0.3).

Fix: Added the rotary_embedding_base_local config option and updated the decoder to use per-layer rotary embedding bases based on layer_types. All tests now pass with atol: 1.0e-4 (4-digit precision), as you expected. 🎉
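For reference, a small sketch of the per-layer base selection (an illustrative helper, not Bumblebee's exact internals; the spec field names follow the params mapping in this PR):

```elixir
defmodule RotaryBaseSketch do
  # Choose the rotary embedding base for a decoder layer from its attention type.
  def for_layer(:sliding_attention, spec), do: spec.rotary_embedding_base_local
  def for_layer(:full_attention, spec), do: spec.rotary_embedding_base
end

# With the values above:
spec = %{rotary_embedding_base_local: 10_000, rotary_embedding_base: 1_000_000}
RotaryBaseSketch.for_layer(:sliding_attention, spec)
# => 10000
```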
Also, here is the Python code I used to generate the test numbers:

```python
#!/usr/bin/env python3
"""Generate reference values for Bumblebee Gemma3Text tests."""
import torch
from transformers import AutoModel, AutoModelForCausalLM, AutoModelForSequenceClassification
input_ids = torch.tensor([[10, 20, 30, 40, 50, 60, 70, 80, 0, 0]])
attention_mask = torch.tensor([[1, 1, 1, 1, 1, 1, 1, 1, 0, 0]])
# :base test
model = AutoModel.from_pretrained("bumblebee-testing/tiny-random-Gemma3TextModel")
model.eval()
with torch.no_grad():
    outputs = model(input_ids, attention_mask=attention_mask)
print(":base - hidden_state[0, 1:4, 1:4]:")
print(outputs.last_hidden_state[0, 1:4, 1:4])
# :for_causal_language_modeling test
model = AutoModelForCausalLM.from_pretrained("bumblebee-testing/tiny-random-Gemma3ForCausalLM")
model.eval()
with torch.no_grad():
    outputs = model(input_ids, attention_mask=attention_mask)
print("\n:for_causal_language_modeling - logits[0, 1:4, 1:4]:")
print(outputs.logits[0, 1:4, 1:4])
# :for_sequence_classification test
model = AutoModelForSequenceClassification.from_pretrained(
"bumblebee-testing/tiny-random-Gemma3TextForSequenceClassification"
)
model.eval()
with torch.no_grad():
    outputs = model(input_ids, attention_mask=attention_mask)
print("\n:for_sequence_classification - logits:")
print(outputs.logits)
```
```elixir
activation: {"hidden_activation", activation()},
use_attention_bias: {"attention_bias", boolean()},
rotary_embedding_base: {"rope_theta", number()},
rotary_embedding_base_local: {"rope_local_base_freq", number()},
```
rope_local_base_freq does not actually exist in the config. It looks like they now have a whole configuration struct for rope, in this case one per layer type. I will look into handling the new config shape.
Actually, rope_local_base_freq is still present in the config on Transformers v4, and v5 is still a release candidate. I pushed a small update, but I think we can merge this; I will make the rope config update separately.
Awesome! Thank you @jonatanklosko
Adds support for the Gemma 3 architecture, enabling FunctionGemma (google/functiongemma-270m-it) and other Gemma 3 models to run in Bumblebee.

Why FunctionGemma?

Gemma 3 Architecture Changes

Gemma 3 has several key differences from Gemma v1, including the RMS norm formula: Gemma v1 uses weight * normalized, while Gemma 3 uses (1 + weight) * normalized.

Files Changed

- lib/bumblebee/text/gemma3.ex - Full Gemma 3 model implementation with custom decoder supporting QK-norm and extra layer norms
- lib/bumblebee.ex - Model and tokenizer registrations for Gemma3Model, Gemma3ForCausalLM, etc.
- lib/bumblebee/layers/transformer.ex - Per-layer attention_window_size callback for alternating local/global attention
- test/bumblebee/text/gemma3_test.exs - Unit tests (require tiny-random models on HuggingFace)
- notebooks/function_calling.livemd - Comprehensive Livebook example with:
  - FunctionGemma.Schema - Build function declarations
  - FunctionGemma.Parser - Parse function call responses
  - FunctionGemma.Executor - Execute parsed calls
  - SmartHome - Mock functions demo (lights, thermostat, weather)

Example Usage
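A hedged sketch of typical Bumblebee usage with the model named in this PR (standard Bumblebee loading and generation APIs; the checkpoint may require a HuggingFace token, and the prompt is illustrative):

```elixir
repo = {:hf, "google/functiongemma-270m-it"}

{:ok, model_info} = Bumblebee.load_model(repo)
{:ok, tokenizer} = Bumblebee.load_tokenizer(repo)
{:ok, generation_config} = Bumblebee.load_generation_config(repo)

# Build a text-generation serving and run a single prompt.
serving = Bumblebee.Text.generation(model_info, tokenizer, generation_config)
Nx.Serving.run(serving, "Turn on the living room lights.")
```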
Test Plan