Conversation

@lucianommartins (Contributor) commented Nov 21, 2025

Summary

This PR restores the custom attention mask generation for Gemma3 GGUF multimodal models that was partially reverted in #28995. The implementation adds GGUF-only file-format guards so that the feature applies exclusively to GGUF models and does not affect HuggingFace models.

Resolves: #28995 (HF model regression)
Restores functionality from: #27772

Background

PR #27772 initially added Gemma3 GGUF multimodal support, enabling users to run quantized Gemma3 multimodal models with both text-only and image+text prompts. However, it was partially reverted in #28995 because the custom attention mask logic incorrectly triggered for HuggingFace models, causing test failures.

Root cause of the #28995 revert: the original implementation lacked file-format guards, so the custom attention mask generation activated for both GGUF and HF models.

Solution

This PR addresses the regression by implementing a 3-layer defense-in-depth guard mechanism:

Layer 1: Model Format Check (Primary Guard)

from transformers import PretrainedConfig

from vllm.transformers_utils.utils import check_gguf_file


def uses_custom_attention_masks(config: PretrainedConfig, model_path: str) -> bool:
    """Only return True for GGUF Gemma3 multimodal models."""
    architectures = getattr(config, "architectures", [])
    is_gemma3 = "Gemma3ForConditionalGeneration" in architectures
    is_gguf = check_gguf_file(model_path)  # ← Critical GGUF guard
    return is_gemma3 and is_gguf
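For context, check_gguf_file() is vLLM's existing file-format probe. Below is a minimal sketch of the kind of check it performs; the helper name looks_like_gguf and the exact logic are illustrative assumptions, though the 4-byte "GGUF" magic header is part of the GGUF specification:

from pathlib import Path

GGUF_MAGIC = b"GGUF"  # every GGUF file begins with this 4-byte magic

def looks_like_gguf(model_path: str) -> bool:
    """Illustrative probe: True only for an on-disk file starting with the GGUF magic."""
    path = Path(model_path)
    if not path.is_file():
        return False  # HF model IDs and directories fail here
    with path.open("rb") as f:
        return f.read(4) == GGUF_MAGIC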

Layer 2: Multimodal Feature Check

has_mm_features = any(
    req_state.mm_features for req_state in self.requests.values()
)

Layer 3: Method Existence Check

hasattr(self.model, "generate_attention_masks")

Result: HF models never have uses_custom_attention_masks = True, preventing the issue that caused #28995.
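Put together, the runtime decision might look like the sketch below inside GPUModelRunner. The attribute and method names come from this PR's description; the helper name and exact call site are assumptions:

def _should_generate_custom_masks(self) -> bool:
    # Layer 1: decided once at init from ModelConfig (Gemma3 + GGUF file check)
    if not self.uses_custom_attention_masks:
        return False
    # Layer 2: only act when the current batch actually carries image features
    has_mm_features = any(
        req_state.mm_features for req_state in self.requests.values()
    )
    if not has_mm_features:
        return False
    # Layer 3: the loaded model must expose the mask generator
    return hasattr(self.model, "generate_attention_masks")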

Changes

Files Modified (4)

  1. vllm/transformers_utils/config.py

    • Add uses_custom_attention_masks() utility function
    • Implements GGUF file format check using check_gguf_file()
  2. vllm/config/model.py

    • Add uses_custom_attention_masks property to ModelConfig
    • Delegates to utility function with model path for GGUF detection
  3. vllm/v1/worker/gpu_model_runner.py

    • Initialize uses_custom_attention_masks attribute in GPUModelRunner
    • Apply 3-layer guard before calling custom attention mask generation
  4. vllm/model_executor/models/gemma3_mm.py

    • Restore generate_attention_masks() method
    • Generates custom masks enabling bidirectional attention between image tokens (sketched below, after this list)
    • Handles sliding window attention for GGUF compatibility
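To illustrate item 4 above: the core idea is a causal mask that is opened up so tokens within the same contiguous image span attend to each other bidirectionally. The sketch below is illustrative only: the image token id and helper name are hypothetical, and it omits the sliding-window handling the real method also performs.

import torch

IMAGE_TOKEN_ID = 9999  # hypothetical placeholder for Gemma3's image token id

def bidirectional_image_mask(token_ids: torch.Tensor) -> torch.Tensor:
    """Causal mask, except tokens inside the same contiguous image span
    may attend to each other in both directions."""
    n = token_ids.shape[0]
    mask = torch.tril(torch.ones(n, n, dtype=torch.bool))  # standard causal base
    is_image = token_ids == IMAGE_TOKEN_ID
    # Detect the start and end of each contiguous run of image tokens.
    prev = torch.cat([torch.tensor([False]), is_image[:-1]])
    nxt = torch.cat([is_image[1:], torch.tensor([False])])
    starts = (is_image & ~prev).nonzero().flatten()
    ends = (is_image & ~nxt).nonzero().flatten()
    for s, e in zip(starts.tolist(), ends.tolist()):
        mask[s : e + 1, s : e + 1] = True  # open full attention within the span
    return mask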

Test Plan

GGUF Model Validation

Tested with multiple quantized Gemma3 GGUF models to ensure functionality across different model sizes:

Text-Only Inference:

  • Gemma3 1B GGUF (Q4_0 quantization)
  • Gemma3 4B GGUF (Q4_0 quantization)
  • Gemma3 270M GGUF (Q4_0 quantization)

Multimodal Inference:

  • Gemma3 4B GGUF with mmproj.gguf vision tower
  • Input: Multi-image prompts (2 images: kitten photo + autumn forest scene)
  • Expected: Accurate image descriptions with proper bidirectional attention between image tokens

HuggingFace Model Regression Testing

Executed the Gemma3 selection of the vLLM multimodal generation test suite to verify zero impact on HF models:

pytest -s -v tests/models/multimodal/generation/test_common.py -k "gemma3-test"

This ensures the GGUF guards prevent any unintended activation of custom attention mask logic for HuggingFace models.

Test Results

GGUF Model Results (All Pass)

| Model | Quantization | Test Type | Status | Details |
|---|---|---|---|---|
| Gemma3-1B | Q4_0 | Text-only | PASSED | Generates coherent haiku about coding |
| Gemma3-4B | Q4_0 | Text-only | PASSED | Generates coherent haiku about coding |
| Gemma3-270M | Q4_0 | Text-only | PASSED | Generates coherent haiku about coding |
| Gemma3-4B | Q4_0 | Multimodal | PASSED | Correctly describes both images with accurate details |

Multimodal Output Example:

Image 1: A close-up shot of a tabby kitten with striking blue eyes. The kitten 
is lying on a green surface, likely a rug or carpet. Its fur is a mix of brown 
and black stripes, and it has a slightly melancholic expression...

Image 2: A breathtaking landscape photograph of a dense pine forest bathed in 
the warm, golden light of a late autumn or early winter day...

HuggingFace Model Regression Test (All Pass)

pytest -s -v tests/models/multimodal/generation/test_common.py -k gemma3-test
# Result: 8 passed, 335 deselected, 23 warnings in 915.69s (15m 15s)

Test Coverage:

  • Single-image inference (3 test cases)
  • Multi-image inference (3 test cases)
  • Various prompt templates (2 test cases)
  • All 8 test cases pass - confirms zero regression for HF models

Verification of Fix for #28995

The failing test from #28995 (the pytest gemma3-test selection) now passes completely.

Why it works now:

  • uses_custom_attention_masks returns False for HF models (no .gguf file detected)
  • Custom attention mask generation never executes for HF models
  • HF models use their native attention mechanism without interference

Isolation & Safety Guarantees

How HF Models Are Protected:

  1. File Format Check:

    is_gguf = check_gguf_file(model_path)  # Returns False for HF models
  2. Short-Circuit Logic:

    return is_gemma3 and is_gguf  # Requires BOTH conditions
  3. Runtime Guard:

    if self.uses_custom_attention_masks:  # Always False for HF
        # Custom mask generation (never executed for HF)

What Changed from #27772:

| Aspect | Original PR #27772 | This PR (Fixed) |
|---|---|---|
| Guard Mechanism | Architecture check only | Architecture + GGUF file format check |
| HF Model Impact | Incorrectly triggered | Never triggers |
| GGUF Multimodal | Works | Works |

Code Quality

  • All linting checks pass (ruff check, ruff format, mypy)
  • All pre-commit hooks pass
  • Follows Google Python style guide
  • Comprehensive docstrings with clear GGUF-only notes
  • Defense-in-depth guard pattern

Backward Compatibility

  • Zero impact on existing HuggingFace models (verified via pytest)
  • Zero impact on other model architectures (GGUF check prevents any non-Gemma3 activation)
  • Restores functionality from [Model] Add Gemma3 GGUF multimodal support #27772 for GGUF users
  • No API changes, no breaking changes

Documentation

No user-facing documentation changes are required. The feature is transparent to users: GGUF Gemma3 multimodal models work automatically, without extra configuration.
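For example, under this PR a GGUF Gemma3 model should load like any other model. The paths below are placeholders, and the HF-tokenizer pairing follows vLLM's usual GGUF workflow:

from vllm import LLM

# Placeholder paths; no extra flags are needed for the GGUF attention masks.
llm = LLM(
    model="/models/gemma-3-4b-it-Q4_0.gguf",
    tokenizer="google/gemma-3-4b-it",  # GGUF runs typically pair an HF tokenizer
)
outputs = llm.generate("Write a haiku about coding.")
print(outputs[0].outputs[0].text)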

Release Notes

This fix should be included in release notes as:

[Model] Restored Gemma3 GGUF multimodal support - Re-enables custom attention mask generation for Gemma3 GGUF multimodal models with robust GGUF-only guards, fixing the regression introduced in #28995 while maintaining zero impact on HuggingFace models.



@gemini-code-assist bot commented:

Code Review

This pull request effectively restores multimodal support for Gemma3 GGUF models by introducing robust file-format-based guards. The approach is sound and the defense-in-depth mechanism is a good practice to prevent regressions on HuggingFace models. My review focuses on performance optimizations within the newly restored generate_attention_masks method, suggesting more idiomatic and efficient PyTorch constructs to improve performance on this critical path.
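The bot's concrete suggestions live in inline review comments not shown here, but as an illustration of the kind of construct such a review typically proposes, the per-span Python loop from the earlier mask sketch could be replaced by a vectorized span-id comparison. Variable names carry over from that sketch and are equally hypothetical:

# Vectorized alternative to the per-span loop: label each image span with a
# running id, then open attention between positions sharing the same span.
span_id = torch.cumsum((is_image & ~prev).long(), dim=0) * is_image.long()
same_span = (
    (span_id.unsqueeze(0) == span_id.unsqueeze(1))
    & is_image.unsqueeze(0)
    & is_image.unsqueeze(1)
)
mask |= same_span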

@chatgpt-codex-connector bot commented:

💡 Codex Review

Here are some automated review suggestions for this pull request.


Commit message:

Restores custom attention mask generation for Gemma3 GGUF multimodal models
that was partially reverted in vllm-project#28995. Implements robust GGUF-only guards to
ensure the feature only applies to GGUF models and does not affect HF models.

Changes:
- Add uses_custom_attention_masks() utility with GGUF file format check
- Add uses_custom_attention_masks property to ModelConfig
- Initialize uses_custom_attention_masks in GPUModelRunner
- Restore generate_attention_masks() method to Gemma3ForConditionalGeneration
- Implement 3-layer defense-in-depth guard mechanism

The implementation uses check_gguf_file() to guarantee that custom attention
mask logic only triggers for GGUF files, preventing the issue that caused
the original revert where HF models incorrectly triggered the custom logic.

Tested with GGUF models (1B, 4B, 270M) for both text-only and multimodal
inference. HF model compatibility verified via pytest multimodal test suite.

Signed-off-by: Luciano Martins <[email protected]>
@lucianommartins (author) commented:

Hi @Isotr0py / @DarkLight1337,

It is a quick one: it reintroduces #27772, now with guardrails against the problems that led to the revert in #28995.

It is pretty much all reviewed (as not much changed since #27772) and ready to go :)
