
mtmd: add Gemma 4 audio conformer encoder support #21421

Draft
stephencox-ict wants to merge 3 commits into ggml-org:master from stephencox-ict:gemma4-audio-pr

Conversation

@stephencox-ict

@stephencox-ict stephencox-ict commented Apr 4, 2026

Overview

Add audio processing support for Gemma 4 models via a USM-style Conformer encoder.

Architecture:

  • 12-layer Conformer: FFN -> Self-Attention -> Causal Conv1D -> FFN -> Norm
  • Subsampling Conv Projection: 2x Conv2D(stride=2) with LayerNorm
  • Chunked local attention with sinusoidal RPE (chunk_size=12, context_size=24)
  • Logit softcapping at 50.0, ClippableLinear with per-tensor clamping
  • Output projection -> RMSNorm -> multimodal embedder

Chunked local attention (matching PyTorch reference):

  • Q split into non-overlapping blocks of 12
  • K/V extracted as overlapping context windows of 24 via `ggml_view_4d` with stride 12
  • Per-block causal mask matching PyTorch's `dist < left_window_size` condition
  • Blocked relative position shift (Transformer-XL appendix B)
  • RPE: 13 sinusoidal position embeddings [12, 11, ..., 0]
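The masking rule above can be sketched as a standalone function. This is an illustrative reconstruction (not the PR's actual code): queries are split into blocks of `chunk_size = 12`, each attending to a `context_size = 24` window (previous chunk plus its own), with visibility requiring both causality and the `dist < left_window_size` lookback limit.

```cpp
#include <cassert>
#include <vector>

// Hypothetical sketch of the per-block causal mask for chunked local
// attention. A key at global position gk is visible to a query at global
// position gq iff:
//   gk <= gq             (causality)
//   (gq - gk) < max_past (limited lookback; PyTorch: dist < left_window_size)
std::vector<bool> chunk_mask(int block, int chunk_size, int context_size, int max_past) {
    std::vector<bool> visible(chunk_size * context_size, false);
    for (int q = 0; q < chunk_size; q++) {
        const int gq = block * chunk_size + q;           // global query position
        for (int k = 0; k < context_size; k++) {
            // the context window starts one chunk before this block
            const int gk = (block - 1) * chunk_size + k; // global key position
            if (gk < 0) continue;                        // left padding, stays masked
            const bool causal   = gk <= gq;
            const bool in_range = (gq - gk) < max_past;
            visible[q * context_size + k] = causal && in_range;
        }
    }
    return visible;
}
```

With `max_past = 12`, each query sees at most 12 keys (itself plus 11 past positions); the off-by-one described in fix 2 below corresponds to using `<=` instead of `<` here.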

Mel preprocessing (dedicated `mtmd_audio_preprocessor_gemma4a`):

  • HTK mel scale, 128 bins, magnitude STFT, mel_floor=1e-3
  • Standard periodic Hann window (320 samples), zero-padded to FFT size
  • 30-second chunking (splits long audio into 30s segments)
  • Mel cosine similarity vs PyTorch: 0.9998
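For reference, the HTK mel scale and the mel floor mentioned above look like this. A minimal sketch with illustrative names (not the preprocessor's actual API); the 128 bins are spaced uniformly in mel space between these two conversions.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// HTK mel scale: mel(f) = 2595 * log10(1 + f / 700), and its inverse.
double hz_to_mel_htk(double hz)  { return 2595.0 * std::log10(1.0 + hz / 700.0); }
double mel_to_hz_htk(double mel) { return 700.0 * (std::pow(10.0, mel / 2595.0) - 1.0); }

// Filterbank energies are floored before the log so silence does not
// produce -inf (mel_floor = 1e-3 as in the bullet above).
double apply_mel_floor(double energy, double mel_floor = 1e-3) {
    return std::log(std::max(energy, mel_floor));
}
```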

Fixes (beyond the initial encoder):

  1. Conv norm weight mapping (`tensor_mapping.py`): `A_ENC_CONV_NORM` and `A_ENC_NORM_CONV` had their Gemma4 entries inverted, swapping conv pre-norm and internal norm weights. Encoder cosine improved from 0.67 → 0.9999.

  2. Causal mask off-by-one (`clip.cpp`): Added `(gq - gk) < max_past` to match PyTorch's `dist < left_window_size` (was attending to 13 past tokens instead of 12).

  3. Mask invalid value (`clip.cpp`): Use `-1e9` instead of `-INFINITY` for masked positions to match PyTorch's `attention_invalid_logits_value`.

  4. Double-precision preprocessing (`mtmd-audio.cpp`, `clip.cpp`): Use double-precision trig for FFT twiddle factors, Hann window, and sinusoidal RPE computation.

  5. Attention softcapping (`llama-model.cpp`): Gemma4's text model does NOT use attention logit softcapping (unlike Gemma2). Was incorrectly hardcoded to `true` with default value 50.0.

  6. BF16 precision rounding (`gemma4-iswa.cpp`): Use BF16-rounded embedding scale constants to reduce divergence from PyTorch's native BF16 training precision (ref: PR Gemma 4: move some computations to BF16 #21451).
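The BF16 rounding in fix 6 can be sketched as follows. This is an illustrative helper, not the PR's code: BF16 keeps the top 16 bits of an FP32 value, and round-to-nearest-even is done with the common "add 0x7FFF plus the parity bit" trick.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Round an FP32 constant to the nearest BF16-representable value, so that
// scale constants match PyTorch's native BF16 precision.
float round_to_bf16(float x) {
    uint32_t bits;
    std::memcpy(&bits, &x, sizeof(bits));
    const uint32_t lsb = (bits >> 16) & 1; // parity of the kept 16 bits
    bits += 0x7FFFu + lsb;                 // round to nearest, ties to even
    bits &= 0xFFFF0000u;                   // drop the low 16 mantissa bits
    float y;
    std::memcpy(&y, &bits, sizeof(y));
    return y;
}
```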

Test results (E2B Q4_K_M):

Short audio (5.9s LibriSpeech) - works on CPU, Vulkan, and CUDA:
```
Ground truth: "MISTER QUILTER IS THE APOSTLE OF THE MIDDLE CLASSES AND WE ARE GLAD TO WELCOME HIS GOSPEL"
Output: "Mr. Quilter is the apostle of the middle classes and we are glad to welcome his gospel."
```

Known limitation: Longer audio (17s+) still produces repetitive output. The audio encoder output is correct (0.999 cosine vs PyTorch across all 12 layers + output projection) but the LM enters thinking mode and loops. This appears to be an upstream LM precision issue — PyTorch FP32 transcribes correctly with the same encoder output. See PR #21451 for the full BF16 computation fix needed on the LM side.

Generation parameters (from model's `generation_config.json`):
`--temp 1.0 --top-k 64 --top-p 0.95`

Additional information

Test plan:

  • `test-mtmd-c-api` passes
  • `test-llama-archs` passes
  • E2B Q4_K_M transcription on CPU, Vulkan, and CUDA (RTX 3060)
  • E4B Q4_K_M transcription on Vulkan, CUDA RTX 3060, CUDA Tesla T4
  • LibriSpeech samples with known ground truth
  • Mel values verified against PyTorch (cosine 0.9998)
  • Encoder output cosine vs PyTorch: 0.9999 (all 12 conformer layers)
  • Tower output cosine vs PyTorch: 0.999
  • CI ctest: 49/49 debug passed
  • CUDA ssm_conv kernel_size=5 tested on RTX 3060 and Tesla T4

Ref: #21325

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES - Claude Code was used in an assistive capacity for iterative debugging (tensor comparison, mel spectrogram verification, conformer layer tracing) and code review. All architecture decisions, algorithm implementations, and code were manually reviewed and verified against the PyTorch reference.

@github-actions github-actions bot added documentation Improvements or additions to documentation examples labels Apr 4, 2026
@ggml-gh-bot

ggml-gh-bot bot commented Apr 4, 2026

Hi @stephencox-ict, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

  • AI-generated content: This project does not accept PRs, descriptions or commit messages that are fully or predominantly AI-generated. If you have used AI to assist you in writing code, please make sure to disclose that explicitly.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

@ngxson
Contributor

ngxson commented Apr 4, 2026

Nice, seems to work but not 100% correct (using e4b, f16):

  1. Analyze the Request: The user wants me to transcribe the provided text.
  2. Analyze the Input Text: The text is a highly dramatic, repetitive, and emphatic piece of writing, likely intended as an urgent news headline or intro, with some slight repetition/redundancy in phrasing.
  3. Perform Transcription: I will transcribe the text exactly as it is written, preserving capitalization, punctuation, and structure.

Self-Correction/Verification: The input is slightly fragmented due to the rapid, breathless style, but the goal is faithful transcription.

  1. Final Output Generation.<channel|>"The Man on the Moon declared the New York Times from July 20th, 1969. This isn't just a news event. This is the moment when humanity's oldest dream became a reality. The Man on the Moon declared the New York Times, from July 20th, 1969. This isn't just a news event. This is the moment when humanity's oldest dream became a reality. The Man on the Moon declared the New York Times. This isn't just a news event. This is the moment when humanity's oldest dream became a reality. The Man on the Moon declared the New York Times."

However, the correct transcription should be:

The new york times from july 21 1969. this isn't just newsprint and ink this is the moment when humanities oldest dream became front page reality men walk on moon declares the bold headline across america's newspaper of record for over a century the new york times has documented our nation's most pivotal moments but rarely has any story matched the cosmic significance of this one.

@stephencox-ict
Author

> Nice, seems to work but not 100% correct (using e4b, f16): [...]

I haven't yet implemented chunked local self-attention. I'm focused on the testing side now and will come back to this.

@github-actions github-actions bot added the testing Everything test related label Apr 4, 2026
@stephencox-ict stephencox-ict force-pushed the gemma4-audio-pr branch 3 times, most recently from 83d1f37 to 13e9f5e Compare April 4, 2026 22:53
@stephencox-ict stephencox-ict force-pushed the gemma4-audio-pr branch 3 times, most recently from 29dd32e to 7435a59 Compare April 5, 2026 00:15
@stephencox-ict stephencox-ict marked this pull request as ready for review April 5, 2026 00:16
@stephencox-ict stephencox-ict requested review from a team and JohannesGaessler as code owners April 5, 2026 00:16
@stephencox-ict stephencox-ict requested a review from ngxson April 5, 2026 00:19
Contributor

@JohannesGaessler JohannesGaessler left a comment


The changes to test-llama-archs.cpp LGTM otherwise. For some of the other files I'm seeing though that you are adding code comments with EM dashes. Please stick to ASCII unless there is a good reason not to.

@stephencox-ict
Author

> Please stick to ASCII unless there is a good reason not to.

Fixed

@theo77186
Contributor

theo77186 commented Apr 5, 2026

I've tested transcription with the E4B model; it seems to struggle with longer prompts (I tested with a 20-second audio clip in French), only transcribing near the end of the audio. It also crashes with CUDA because of a missing kernel, but that's a one-line patch that should go in a separate PR.

```diff
--- a/ggml/src/ggml-cuda/ssm-conv.cu
+++ b/ggml/src/ggml-cuda/ssm-conv.cu
@@ -134,8 +134,9 @@ static void ssm_conv_f32_cuda(const float * src0, const float * src1, const int
     switch (nc) {
         case 3: launch_kernel(std::integral_constant<int, 3>{}); break;
         case 4: launch_kernel(std::integral_constant<int, 4>{}); break;
+        case 5: launch_kernel(std::integral_constant<int, 5>{}); break;
         case 9: launch_kernel(std::integral_constant<int, 9>{}); break;
-        default: GGML_ABORT("Only support kernel sizes 3, 4, 9 right now.");
+        default: GGML_ABORT("Only support kernel sizes 3, 4, 5, 9 right now.");
     }
 }
```

I suspect that the missing chunked attention might be the culprit.

@stephencox-ict stephencox-ict force-pushed the gemma4-audio-pr branch 2 times, most recently from 81b0202 to f3b827d Compare April 5, 2026 22:25
@stephencox-ict
Author

Looking into chunked encoding and the CUDA issue.

@stephencox-ict
Author

Are there any other problems with the implementation? I ran the test again, but it's still giving repeated output:

  1. Perform the Transcription (Line-by-Line/Segment-by-Segment):

    • (Initial segment - check punctuation and phrasing)
    • "The New York Times from July 20th, 1969."
    • "This isn't just a newsprint, this is the dawn of a new era."
    • "Men walk on the moon."
    • "The New York Times from July 20th, 1969." (The text repeats this phrase)
    • "This isn't just a newsprint, this is the moment when humanity's oldest dream became a reality."
    • "Men walk on the moon."
    • "The New York Times from July 20th, 1969." (Again, the repetition)
    • "This isn't just a newsprint, this is the moment when humanity's oldest dream became a reality." (The repetition continues)

There is some instability I'm looking into.

@stephencox-ict stephencox-ict requested a review from a team as a code owner April 6, 2026 00:27
@github-actions github-actions bot added Nvidia GPU Issues specific to Nvidia GPUs ggml changes relating to the ggml tensor library for machine learning labels Apr 6, 2026
@stephencox-ict
Author

All looks fixed now. The stability issues were caused by some unbounded limits. Tested both E2B and E4B on CUDA, and both worked well.

@stephencox-ict stephencox-ict requested a review from ngxson April 6, 2026 01:49
@stephencox-ict stephencox-ict force-pushed the gemma4-audio-pr branch 2 times, most recently from f0484a7 to 8a1494c Compare April 6, 2026 01:56
@ngxson
Contributor

ngxson commented Apr 6, 2026

Hmm, I still get repetitive text with E4B model:

<channel|>The New York Times from July 20th, 1969. Man on the Moon declares the New York Times from July 20th, 1969. This is the moment when humanity's oldest dream became a reality. Man on the Moon declares the New York Times from July 20th, 1969. This is the moment when humanity's oldest dream became a reality. Man on the Moon declares the New York Times from July 20th, 1969. This is the moment when humanity's oldest dream became a reality. Man on the Moon declares the New York Times from July 20th, 1969. This is the moment when humanity's oldest dream became a reality. Man on the Moon declares the New York Times from July 20th, 1969. This is the moment when humanity's oldest dream became a reality. Man on the Moon declares the New York Times from July 20th, 1969. This is the moment when humanity's oldest dream became a reality. Man on the Moon declares the New York Times from July 20th, 1969. This is the moment when humanity's oldest dream became a reality.

My command: `llama-mtmd-cli -m ../models/gemma-4-e4b-it/model.gguf -mm ../models/gemma-4-e4b-it/mmproj.gguf --image tools/mtmd/test-2.mp3 -p "transcribe this" --jinja -n 1000`

@ngxson
Contributor

ngxson commented Apr 6, 2026

Btw, can you stop force-pushing to this PR? Force-pushing makes it hard to keep track of line-level changes.

@stephencox-ict
Author

> Hmm, I still get repetitive text with E4B model: [...]

Thanks for that. Reproduced with that file. Looking into it.

Noted about commits

@angelsu

angelsu commented Apr 6, 2026

Bug Found: softplus_pds input tensors never populated

I have been working on getting this PR to produce correct embeddings and found the root cause of the quality issues.

The Bug

In gemma4a.cpp, the graph declares input tensors softplus_pds_{il} for each conformer layer:

```cpp
auto * pds = ggml_new_tensor_1d(ctx0, GGML_TYPE_F32, d_head);
ggml_set_name(pds, pds_name);
ggml_set_input(pds);
Qcur = ggml_mul(ctx0, Qcur, ggml_reshape_3d(ctx0, pds, d_head, 1, 1));
```

But in clip.cpp, the set_input section for PROJECTOR_TYPE_GEMMA4A only populates kq_mask and pos_emb; the softplus_pds_* tensors are never filled with data. They remain zero-initialized, so all Q vectors in the attention get multiplied by ~0.

The Fix

Add this after the pos_emb set_input block in clip.cpp:

```cpp
// Pre-compute softplus(per_dim_scale) for each conformer layer
{
    const int n_layer = ctx->model.hparams.n_layer;
    const int d_head  = ctx->model.hparams.n_embd / ctx->model.hparams.n_head;
    for (int il = 0; il < n_layer; il++) {
        const auto & layer = ctx->model.layers[il];
        if (!layer.per_dim_scale_w) continue;
        std::vector<float> pds_data(d_head);
        ggml_backend_tensor_get(layer.per_dim_scale_w, pds_data.data(), 0, d_head * sizeof(float));
        for (int i = 0; i < d_head; i++) {
            pds_data[i] = logf(1.0f + expf(pds_data[i]));
        }
        char pds_name[64];
        snprintf(pds_name, sizeof(pds_name), "softplus_pds_%d", il);
        set_input_f32(pds_name, pds_data);
    }
}
```

Secondary Fix: NaN from -INFINITY mask

The attention mask uses -INFINITY as the masked value. For padded positions (where all scores are masked), softmax(-inf, -inf, ...) = NaN, which propagates through the network. Changing to -1e9f produces the same ~0 softmax output without NaN:

```cpp
std::vector<float> mask(context_size * chunk_size * num_blocks, -1e9f);
```
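The NaN behavior is easy to reproduce in isolation. A minimal illustrative softmax (not llama.cpp's implementation) shows why: with every input at -inf, the max-subtraction produces `-inf - (-inf) = NaN`, which propagates, whereas -1e9 stays finite and yields a uniform distribution for the padded row.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Numerically-stabilized softmax: subtract the row max before exponentiating.
std::vector<float> softmax(const std::vector<float> & x) {
    float m = -INFINITY;
    for (float v : x) m = std::max(m, v);
    float sum = 0.0f;
    std::vector<float> y(x.size());
    for (size_t i = 0; i < x.size(); i++) {
        y[i] = std::exp(x[i] - m); // -inf - (-inf) = NaN if the whole row is masked with -inf
        sum += y[i];
    }
    for (float & v : y) v /= sum;
    return y;
}
```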

Results

With both fixes applied (tested with E2B Q4_K_M):

  • Conformer embeddings cosine similarity vs PyTorch reference: 0.87 → 0.9946
  • All 436 tokens now have cos > 0.90 (previously only 34%)
  • Model can correctly transcribe audio and answer questions about it

Verified by building a PyTorch reference conformer using GGUF weights and comparing layer-by-layer outputs.

stephencox and others added 2 commits April 7, 2026 12:19
Add audio processing for Gemma 4 E2B/E4B via a USM-style Conformer.

Architecture:
- 12-layer Conformer: FFN → Self-Attention → Causal Conv1D → FFN → Norm
- Subsampling Conv Projection: 2x Conv2D(stride=2) with LayerNorm
- Full self-attention with sinusoidal RPE and sliding window mask (24)
- Logit softcapping at 50.0, ClippableLinear clamping
- Output: 1024 → 1536 → RMSNorm → multimodal embedder

Mel preprocessing (dedicated mtmd_audio_preprocessor_gemma4a):
- HTK mel scale, 128 bins, magnitude STFT, mel_floor=1e-3
- Standard periodic Hann window (320 samples), zero-padded to FFT size
- Semicausal left-padding (frame_length/2 samples)
- Frame count matched to PyTorch (unfold formula)
- No pre-emphasis, no Whisper-style normalization
- Mel cosine similarity vs PyTorch: 0.9998

Key fixes:
- Tensor loading dedup: prevent get_tensor() from creating duplicate
  entries in ctx_data. Fixed with std::set guard.
- ClippableLinear clamp_info loading moved after per-layer tensors.
- Sliding window mask (24 positions) matching PyTorch context_size.
- Skip Whisper normalization for Gemma4 mel output.

Tested on E2B and E4B with CPU and Vulkan backends.
Transcribes: "Glad to see things are going well and business is starting
to pick up" (matching ground truth).

Ref: ggml-org#21325

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Audio encoder fixes:
- Fix swapped conv norm weight mapping in tensor_mapping.py
  (A_ENC_CONV_NORM and A_ENC_NORM_CONV had their gemma4 entries inverted,
  causing the conv pre-norm and internal norm weights to be swapped in GGUF.
  This produced 0.67 encoder cosine vs PyTorch; now 0.9999)
- Fix causal mask off-by-one: add (gq - gk) < max_past to match PyTorch's
  dist < left_window_size (was attending to 13 past tokens instead of 12)
- Use -1e9 instead of -INFINITY for masked positions to match PyTorch's
  attention_invalid_logits_value and avoid NaN in padded attention weights

LM fixes:
- Disable attention logit softcapping for Gemma4 (unlike Gemma2, Gemma4's
  text model does not use attn softcapping; was incorrectly hardcoded)
- Use BF16-rounded embedding scale constants to match PyTorch's native
  BF16 training precision (ref: PR ggml-org#21451). Fixes long-context coherence
  on CPU/Vulkan backends.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@stephencox-ict stephencox-ict requested a review from CISC as a code owner April 7, 2026 01:06
@github-actions github-actions bot added model Model specific python python script changes labels Apr 7, 2026
@stephencox-ict
Author

@ngxson #21451 is in the right direction and helped me fix the CPU/Vulkan case.

Use double-precision trig (sin/cos) instead of float (sinf/cosf) for
precomputed FFT twiddle factors, Hann window, and sinusoidal RPE to
match PyTorch's precision in the audio encoder preprocessing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Randy420Marsh added a commit to Randy420Marsh/llama.cpp that referenced this pull request Apr 7, 2026
@stephencox-ict
Author

> @ngxson #21451 is in the right direction and helped me fix the CPU/Vulkan case.

That comment is misleading in hindsight: the fix made the cosine similarity go up, but I still have an issue with repetition on the test-2 sample.

@stephencox-ict
Author

I'm still tracing the divergence for the test-2.mp3 file. Making progress, but it takes a bit of time.

@stephencox-ict stephencox-ict marked this pull request as draft April 7, 2026 09:56