
mtmd: add Gemma 4 audio conformer encoder support #21421

Draft
stephencox-ict wants to merge 3 commits into ggml-org:master from stephencox-ict:gemma4-audio-pr

Conversation

@stephencox-ict

@stephencox-ict stephencox-ict commented Apr 4, 2026

Overview

Add audio processing support for Gemma 4 models via a USM-style Conformer encoder.

Architecture:

  • 12-layer Conformer: FFN -> Self-Attention -> Causal Conv1D -> FFN -> Norm
  • Subsampling Conv Projection: 2x Conv2D(stride=2) with LayerNorm
  • Chunked local attention with sinusoidal RPE (chunk_size=12, context_size=24)
  • Logit softcapping at 50.0, ClippableLinear with per-tensor clamping
  • Output projection -> RMSNorm -> multimodal embedder

Chunked local attention (matching PyTorch reference):

  • Q split into non-overlapping blocks of 12
  • K/V extracted as overlapping context windows of 24 via `ggml_view_4d` with stride 12
  • Per-block causal mask matching PyTorch's `dist < left_window_size` condition
  • Blocked relative position shift (Transformer-XL appendix B)
  • RPE: 13 sinusoidal position embeddings [12, 11, ..., 0]
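The masking rule above can be sketched as a standalone function. This is an illustrative reconstruction (not the PR's actual code): queries are split into blocks of `chunk_size = 12`, each attending to a `context_size = 24` window (previous chunk plus its own), with visibility requiring both causality and the `dist < left_window_size` lookback limit.

```cpp
#include <cassert>
#include <vector>

// Hypothetical sketch of the per-block causal mask for chunked local
// attention. A key at global position gk is visible to a query at global
// position gq iff:
//   gk <= gq             (causality)
//   (gq - gk) < max_past (limited lookback; PyTorch: dist < left_window_size)
std::vector<bool> chunk_mask(int block, int chunk_size, int context_size, int max_past) {
    std::vector<bool> visible(chunk_size * context_size, false);
    for (int q = 0; q < chunk_size; q++) {
        const int gq = block * chunk_size + q;           // global query position
        for (int k = 0; k < context_size; k++) {
            // the context window starts one chunk before this block
            const int gk = (block - 1) * chunk_size + k; // global key position
            if (gk < 0) continue;                        // left padding, stays masked
            const bool causal   = gk <= gq;
            const bool in_range = (gq - gk) < max_past;
            visible[q * context_size + k] = causal && in_range;
        }
    }
    return visible;
}
```

With `max_past = 12`, each query sees at most 12 keys (itself plus 11 past positions); the off-by-one described in fix 2 below corresponds to using `<=` instead of `<` here.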

Mel preprocessing (dedicated `mtmd_audio_preprocessor_gemma4a`):

  • HTK mel scale, 128 bins, magnitude STFT, mel_floor=1e-3
  • Standard periodic Hann window (320 samples), zero-padded to FFT size
  • 30-second chunking (splits long audio into 30s segments)
  • Mel cosine similarity vs PyTorch: 0.9998
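For reference, the HTK mel scale and the mel floor mentioned above look like this. A minimal sketch with illustrative names (not the preprocessor's actual API); the 128 bins are spaced uniformly in mel space between these two conversions.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// HTK mel scale: mel(f) = 2595 * log10(1 + f / 700), and its inverse.
double hz_to_mel_htk(double hz)  { return 2595.0 * std::log10(1.0 + hz / 700.0); }
double mel_to_hz_htk(double mel) { return 700.0 * (std::pow(10.0, mel / 2595.0) - 1.0); }

// Filterbank energies are floored before the log so silence does not
// produce -inf (mel_floor = 1e-3 as in the bullet above).
double apply_mel_floor(double energy, double mel_floor = 1e-3) {
    return std::log(std::max(energy, mel_floor));
}
```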

Fixes (beyond the initial encoder):

  1. Conv norm weight mapping (`tensor_mapping.py`): `A_ENC_CONV_NORM` and `A_ENC_NORM_CONV` had their Gemma4 entries inverted, swapping conv pre-norm and internal norm weights. Encoder cosine improved from 0.67 → 0.9999.

  2. Causal mask off-by-one (`clip.cpp`): Added `(gq - gk) < max_past` to match PyTorch's `dist < left_window_size` (was attending to 13 past tokens instead of 12).

  3. Mask invalid value (`clip.cpp`): Use `-1e9` instead of `-INFINITY` for masked positions to match PyTorch's `attention_invalid_logits_value`.

  4. Double-precision preprocessing (`mtmd-audio.cpp`, `clip.cpp`): Use double-precision trig for FFT twiddle factors, Hann window, and sinusoidal RPE computation.

  5. Attention softcapping (`llama-model.cpp`): Gemma4's text model does NOT use attention logit softcapping (unlike Gemma2). Was incorrectly hardcoded to `true` with default value 50.0.

  6. BF16 precision rounding (`gemma4-iswa.cpp`): Use BF16-rounded embedding scale constants to reduce divergence from PyTorch's native BF16 training precision (ref: PR Gemma 4: move some computations to BF16 #21451).
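The BF16 rounding in fix 6 can be sketched as follows. This is an illustrative helper, not the PR's code: BF16 keeps the top 16 bits of an FP32 value, and round-to-nearest-even is done with the common "add 0x7FFF plus the parity bit" trick.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Round an FP32 constant to the nearest BF16-representable value, so that
// scale constants match PyTorch's native BF16 precision.
float round_to_bf16(float x) {
    uint32_t bits;
    std::memcpy(&bits, &x, sizeof(bits));
    const uint32_t lsb = (bits >> 16) & 1; // parity of the kept 16 bits
    bits += 0x7FFFu + lsb;                 // round to nearest, ties to even
    bits &= 0xFFFF0000u;                   // drop the low 16 mantissa bits
    float y;
    std::memcpy(&y, &bits, sizeof(y));
    return y;
}
```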

Test results (E2B Q4_K_M):

Short audio (5.9s LibriSpeech) - works on CPU, Vulkan, and CUDA:
```
Ground truth: "MISTER QUILTER IS THE APOSTLE OF THE MIDDLE CLASSES AND WE ARE GLAD TO WELCOME HIS GOSPEL"
Output: "Mr. Quilter is the apostle of the middle classes and we are glad to welcome his gospel."
```

Known limitation: Longer audio (17s+) still produces repetitive output. The audio encoder output is correct (0.999 cosine vs PyTorch across all 12 layers + output projection) but the LM enters thinking mode and loops. This appears to be an upstream LM precision issue — PyTorch FP32 transcribes correctly with the same encoder output. See PR #21451 for the full BF16 computation fix needed on the LM side.

Generation parameters (from model's `generation_config.json`):
`--temp 1.0 --top-k 64 --top-p 0.95`

Additional information

Test plan:

  • `test-mtmd-c-api` passes
  • `test-llama-archs` passes
  • E2B Q4_K_M transcription on CPU, Vulkan, and CUDA (RTX 3060)
  • E4B Q4_K_M transcription on Vulkan, CUDA RTX 3060, CUDA Tesla T4
  • LibriSpeech samples with known ground truth
  • Mel values verified against PyTorch (cosine 0.9998)
  • Encoder output cosine vs PyTorch: 0.9999 (all 12 conformer layers)
  • Tower output cosine vs PyTorch: 0.999
  • CI ctest: 49/49 debug passed
  • CUDA ssm_conv kernel_size=5 tested on RTX 3060 and Tesla T4

Ref: #21325

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES - Claude Code was used in an assistive capacity for iterative debugging (tensor comparison, mel spectrogram verification, conformer layer tracing) and code review. All architecture decisions, algorithm implementations, and code were manually reviewed and verified against the PyTorch reference.

@github-actions github-actions bot added documentation Improvements or additions to documentation examples labels Apr 4, 2026
@ggml-gh-bot

ggml-gh-bot bot commented Apr 4, 2026

Hi @stephencox-ict, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

  • AI-generated content: This project does not accept PRs, descriptions or commit messages that are fully or predominantly AI-generated. If you have used AI to assist you in writing code, please make sure to disclose that explicitly.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

@ngxson
Contributor

ngxson commented Apr 4, 2026

Nice, seems to work but not 100% correct (using e4b, f16):

  1. Analyze the Request: The user wants me to transcribe the provided text.
  2. Analyze the Input Text: The text is a highly dramatic, repetitive, and emphatic piece of writing, likely intended as an urgent news headline or intro, with some slight repetition/redundancy in phrasing.
  3. Perform Transcription: I will transcribe the text exactly as it is written, preserving capitalization, punctuation, and structure.

Self-Correction/Verification: The input is slightly fragmented due to the rapid, breathless style, but the goal is faithful transcription.

  1. Final Output Generation.<channel|>"The Man on the Moon declared the New York Times from July 20th, 1969. This isn't just a news event. This is the moment when humanity's oldest dream became a reality. The Man on the Moon declared the New York Times, from July 20th, 1969. This isn't just a news event. This is the moment when humanity's oldest dream became a reality. The Man on the Moon declared the New York Times. This isn't just a news event. This is the moment when humanity's oldest dream became a reality. The Man on the Moon declared the New York Times."

However, the correct transcription should be:

The new york times from july 21 1969. this isn't just newsprint and ink this is the moment when humanities oldest dream became front page reality men walk on moon declares the bold headline across america's newspaper of record for over a century the new york times has documented our nation's most pivotal moments but rarely has any story matched the cosmic significance of this one.

@stephencox-ict
Author

> Nice, seems to work but not 100% correct (using e4b, f16): [...]

I haven't yet implemented chunked local self-attention. I'm focused on the testing side now and will come back to this.

@github-actions github-actions bot added the testing Everything test related label Apr 4, 2026
@stephencox-ict stephencox-ict force-pushed the gemma4-audio-pr branch 3 times, most recently from 83d1f37 to 13e9f5e Compare April 4, 2026 22:53
@stephencox-ict stephencox-ict force-pushed the gemma4-audio-pr branch 3 times, most recently from 29dd32e to 7435a59 Compare April 5, 2026 00:15
@stephencox-ict stephencox-ict marked this pull request as ready for review April 5, 2026 00:16
@stephencox-ict stephencox-ict requested review from a team and JohannesGaessler as code owners April 5, 2026 00:16
@stephencox-ict stephencox-ict requested a review from ngxson April 5, 2026 00:19
Contributor

@JohannesGaessler JohannesGaessler left a comment


The changes to test-llama-archs.cpp LGTM otherwise. For some of the other files I'm seeing though that you are adding code comments with EM dashes. Please stick to ASCII unless there is a good reason not to.

@stephencox-ict
Author

> Please stick to ASCII unless there is a good reason not to.

Fixed

@theo77186
Contributor

theo77186 commented Apr 5, 2026

I've tested transcription with the E4B model; it seems to struggle with longer prompts (I tested with a 20-second audio clip in French), only transcribing near the end of the audio. It also crashes with CUDA because of a missing kernel, but that's a one-line patch that should go in a separate PR.

```diff
--- a/ggml/src/ggml-cuda/ssm-conv.cu
+++ b/ggml/src/ggml-cuda/ssm-conv.cu
@@ -134,8 +134,9 @@ static void ssm_conv_f32_cuda(const float * src0, const float * src1, const int
     switch (nc) {
         case 3: launch_kernel(std::integral_constant<int, 3>{}); break;
         case 4: launch_kernel(std::integral_constant<int, 4>{}); break;
+        case 5: launch_kernel(std::integral_constant<int, 5>{}); break;
         case 9: launch_kernel(std::integral_constant<int, 9>{}); break;
-        default: GGML_ABORT("Only support kernel sizes 3, 4, 9 right now.");
+        default: GGML_ABORT("Only support kernel sizes 3, 4, 5, 9 right now.");
     }
 }
```

I suspect that the missing chunked attention might be the culprit.

@stephencox-ict stephencox-ict force-pushed the gemma4-audio-pr branch 2 times, most recently from 81b0202 to f3b827d Compare April 5, 2026 22:25
@stephencox-ict
Author

Looking into chunked encoding and the CUDA issue.

@stephencox-ict
Author

Are there any other problems with the implementation? I ran the test again, but it's still giving repeated output:

  1. Perform the Transcription (Line-by-Line/Segment-by-Segment):

    • (Initial segment - check punctuation and phrasing)
    • "The New York Times from July 20th, 1969."
    • "This isn't just a newsprint, this is the dawn of a new era."
    • "Men walk on the moon."
    • "The New York Times from July 20th, 1969." (The text repeats this phrase)
    • "This isn't just a newsprint, this is the moment when humanity's oldest dream became a reality."
    • "Men walk on the moon."
    • "The New York Times from July 20th, 1969." (Again, the repetition)
    • "This isn't just a newsprint, this is the moment when humanity's oldest dream became a reality." (The repetition continues)

There is some instability I'm looking into.

@stephencox-ict stephencox-ict requested a review from a team as a code owner April 6, 2026 00:27
@github-actions github-actions bot added Nvidia GPU Issues specific to Nvidia GPUs ggml changes relating to the ggml tensor library for machine learning labels Apr 6, 2026
@stephencox-ict
Author

All looks fixed now. The stability issues were caused by some unbounded limits. Tested both E2B and E4B on CUDA, and both worked well.

@stephencox-ict stephencox-ict requested a review from ngxson April 6, 2026 01:49
@stephencox-ict stephencox-ict force-pushed the gemma4-audio-pr branch 2 times, most recently from f0484a7 to 8a1494c Compare April 6, 2026 01:56
@ngxson
Contributor

ngxson commented Apr 6, 2026

Hmm, I still get repetitive text with E4B model:

<channel|>The New York Times from July 20th, 1969. Man on the Moon declares the New York Times from July 20th, 1969. This is the moment when humanity's oldest dream became a reality. Man on the Moon declares the New York Times from July 20th, 1969. This is the moment when humanity's oldest dream became a reality. Man on the Moon declares the New York Times from July 20th, 1969. This is the moment when humanity's oldest dream became a reality. Man on the Moon declares the New York Times from July 20th, 1969. This is the moment when humanity's oldest dream became a reality. Man on the Moon declares the New York Times from July 20th, 1969. This is the moment when humanity's oldest dream became a reality. Man on the Moon declares the New York Times from July 20th, 1969. This is the moment when humanity's oldest dream became a reality. Man on the Moon declares the New York Times from July 20th, 1969. This is the moment when humanity's oldest dream became a reality.

My command: `llama-mtmd-cli -m ../models/gemma-4-e4b-it/model.gguf -mm ../models/gemma-4-e4b-it/mmproj.gguf --image tools/mtmd/test-2.mp3 -p "transcribe this" --jinja -n 1000`

@ngxson
Contributor

ngxson commented Apr 6, 2026

Btw, can you stop force-pushing to this PR? Force-pushing makes it hard to keep track of line-level changes.

@stephencox-ict
Author

> Hmm, I still get repetitive text with E4B model: [...]

Thanks for that. Reproduced with that file. Looking into it.

Noted about commits

@angelsu

angelsu commented Apr 6, 2026

Bug Found: softplus_pds input tensors never populated

I have been working on getting this PR to produce correct embeddings and found the root cause of the quality issues.

The Bug

In gemma4a.cpp, the graph declares input tensors softplus_pds_{il} for each conformer layer:

```cpp
auto * pds = ggml_new_tensor_1d(ctx0, GGML_TYPE_F32, d_head);
ggml_set_name(pds, pds_name);
ggml_set_input(pds);
Qcur = ggml_mul(ctx0, Qcur, ggml_reshape_3d(ctx0, pds, d_head, 1, 1));
```

But in clip.cpp, the set_input section for PROJECTOR_TYPE_GEMMA4A only populates kq_mask and pos_emb; the softplus_pds_* tensors are never filled with data. They remain zero-initialized, so all Q vectors in the attention get multiplied by ~0.

The Fix

Add this after the pos_emb set_input block in clip.cpp:

```cpp
// Pre-compute softplus(per_dim_scale) for each conformer layer
{
    const int n_layer = ctx->model.hparams.n_layer;
    const int d_head  = ctx->model.hparams.n_embd / ctx->model.hparams.n_head;
    for (int il = 0; il < n_layer; il++) {
        const auto & layer = ctx->model.layers[il];
        if (!layer.per_dim_scale_w) continue;
        std::vector<float> pds_data(d_head);
        ggml_backend_tensor_get(layer.per_dim_scale_w, pds_data.data(), 0, d_head * sizeof(float));
        for (int i = 0; i < d_head; i++) {
            pds_data[i] = logf(1.0f + expf(pds_data[i]));
        }
        char pds_name[64];
        snprintf(pds_name, sizeof(pds_name), "softplus_pds_%d", il);
        set_input_f32(pds_name, pds_data);
    }
}
```

Secondary Fix: NaN from -INFINITY mask

The attention mask uses -INFINITY as the masked value. For padded positions (where all scores are masked), softmax(-inf, -inf, ...) = NaN, which propagates through the network. Changing to -1e9f produces the same ~0 softmax output without NaN:

```cpp
std::vector<float> mask(context_size * chunk_size * num_blocks, -1e9f);
```
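The NaN behavior is easy to reproduce in isolation. A minimal illustrative softmax (not llama.cpp's implementation) shows why: with every input at -inf, the max-subtraction produces `-inf - (-inf) = NaN`, which propagates, whereas -1e9 stays finite and yields a uniform distribution for the padded row.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Numerically-stabilized softmax: subtract the row max before exponentiating.
std::vector<float> softmax(const std::vector<float> & x) {
    float m = -INFINITY;
    for (float v : x) m = std::max(m, v);
    float sum = 0.0f;
    std::vector<float> y(x.size());
    for (size_t i = 0; i < x.size(); i++) {
        y[i] = std::exp(x[i] - m); // -inf - (-inf) = NaN if the whole row is masked with -inf
        sum += y[i];
    }
    for (float & v : y) v /= sum;
    return y;
}
```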

Results

With both fixes applied (tested with E2B Q4_K_M):

  • Conformer embeddings cosine similarity vs PyTorch reference: 0.87 → 0.9946
  • All 436 tokens now have cos > 0.90 (previously only 34%)
  • Model can correctly transcribe audio and answer questions about it

Verified by building a PyTorch reference conformer using GGUF weights and comparing layer-by-layer outputs.

stephencox and others added 2 commits April 7, 2026 12:19
Add audio processing for Gemma 4 E2B/E4B via a USM-style Conformer.

Architecture:
- 12-layer Conformer: FFN → Self-Attention → Causal Conv1D → FFN → Norm
- Subsampling Conv Projection: 2x Conv2D(stride=2) with LayerNorm
- Full self-attention with sinusoidal RPE and sliding window mask (24)
- Logit softcapping at 50.0, ClippableLinear clamping
- Output: 1024 → 1536 → RMSNorm → multimodal embedder

Mel preprocessing (dedicated mtmd_audio_preprocessor_gemma4a):
- HTK mel scale, 128 bins, magnitude STFT, mel_floor=1e-3
- Standard periodic Hann window (320 samples), zero-padded to FFT size
- Semicausal left-padding (frame_length/2 samples)
- Frame count matched to PyTorch (unfold formula)
- No pre-emphasis, no Whisper-style normalization
- Mel cosine similarity vs PyTorch: 0.9998

Key fixes:
- Tensor loading dedup: prevent get_tensor() from creating duplicate
  entries in ctx_data. Fixed with std::set guard.
- ClippableLinear clamp_info loading moved after per-layer tensors.
- Sliding window mask (24 positions) matching PyTorch context_size.
- Skip Whisper normalization for Gemma4 mel output.

Tested on E2B and E4B with CPU and Vulkan backends.
Transcribes: "Glad to see things are going well and business is starting
to pick up" (matching ground truth).

Ref: ggml-org#21325

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Audio encoder fixes:
- Fix swapped conv norm weight mapping in tensor_mapping.py
  (A_ENC_CONV_NORM and A_ENC_NORM_CONV had their gemma4 entries inverted,
  causing the conv pre-norm and internal norm weights to be swapped in GGUF.
  This produced 0.67 encoder cosine vs PyTorch; now 0.9999)
- Fix causal mask off-by-one: add (gq - gk) < max_past to match PyTorch's
  dist < left_window_size (was attending to 13 past tokens instead of 12)
- Use -1e9 instead of -INFINITY for masked positions to match PyTorch's
  attention_invalid_logits_value and avoid NaN in padded attention weights

LM fixes:
- Disable attention logit softcapping for Gemma4 (unlike Gemma2, Gemma4's
  text model does not use attn softcapping; was incorrectly hardcoded)
- Use BF16-rounded embedding scale constants to match PyTorch's native
  BF16 training precision (ref: PR ggml-org#21451). Fixes long-context coherence
  on CPU/Vulkan backends.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@stephencox-ict stephencox-ict requested a review from CISC as a code owner April 7, 2026 01:06
@github-actions github-actions bot added model Model specific python python script changes labels Apr 7, 2026
@stephencox-ict
Author

@ngxson #21451 is in the right direction and helped me fix the CPU/Vulkan case.

Use double-precision trig (sin/cos) instead of float (sinf/cosf) for
precomputed FFT twiddle factors, Hann window, and sinusoidal RPE to
match PyTorch's precision in the audio encoder preprocessing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Randy420Marsh added a commit to Randy420Marsh/llama.cpp that referenced this pull request Apr 7, 2026
@stephencox-ict
Author

> @ngxson #21451 is in the right direction and helped me fix the CPU/Vulkan case.

That comment is misleading in hindsight: the fix made the cosine similarity go up, but I still have an issue with repetition on the test-2 sample.

@stephencox-ict
Author

I'm still tracing the divergence for the test-2.mp3 file. Making progress, but it takes a bit of time.

@stephencox-ict stephencox-ict marked this pull request as draft April 7, 2026 09:56