
[#9587][fix] AutoDeploy: Support Gemma3 VLM#10096

Draft
bmarimuthu-nv wants to merge 24 commits intoNVIDIA:mainfrom
nv-auto-deploy:bala/fix-gemma-3-27b-it

Conversation


@bmarimuthu-nv bmarimuthu-nv commented Dec 17, 2025

Summary by CodeRabbit

  • New Features
    • Added Vision Language Model (VLM) attention mask generation with support for causal and sliding window constraints
    • Enhanced Gemma3 multimodal model support with improved mask handling
    • Expanded FlashInfer backend integration for custom attention masks

Summary by Author

Background:

  • Gemma3 VLM Module hierarchy is:

Gemma3ForConditionalGeneration -> Gemma3Model -> Gemma3TextModel
We export only Gemma3TextModel.

  • Gemma3 alternates full attention and sliding-window attention based on layer_idx (one full-attention layer for every five sliding-window layers)
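The alternating pattern can be sketched as follows. This is an illustrative helper (not code from this PR), assuming the 5:1 sliding-to-full ratio described above:

```python
# Hypothetical sketch of Gemma3's alternating attention pattern: assuming
# five sliding-window layers followed by one full-attention layer, repeating.
def is_sliding_layer(layer_idx: int, pattern: int = 6) -> bool:
    """Return True if this layer uses sliding-window attention."""
    # Every `pattern`-th layer is a full-attention layer; the rest slide.
    return (layer_idx + 1) % pattern != 0

# Layers 0..4 slide, layer 5 uses full attention, and so on.
layer_types = ["sliding" if is_sliding_layer(i) else "full" for i in range(12)]
```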

Challenges:

  1. Custom mask preparation (based on the input text & images) happens in Gemma3Model: the mask tensor for a given layer idx/type (sliding window or full attention) is prepared there and passed directly to Gemma3TextModel.
  2. (full-graph mode) The classifier lm_head lives in Gemma3ForConditionalGeneration, while the input embedding lives in Gemma3TextModel, with weight tying enabled between the two. We export only Gemma3TextModel and load its weights. Unfortunately, in the checkpoint the lm_head weight and the embedding weight are not identical (for some reason), so generation is bad unless we enforce the weight tying after loading the weights.
Gemma3ForConditionalGeneration
├── model (Gemma3Model)
│   └── language_model ← EXPORTED to GraphModule
│       └── embed_tokens.weight ← src (canonical)
└── lm_head.weight ← dst (tied to src) - NOT exported

Solution V2

For supporting Gemma3 VLM, the following changes are made in this branch:

  1. Allow a custom attention mask to be supplied.
    1.a We achieved this by:
    1.a.1 adding a placeholder op in attention_interface during tracing
    1.a.2 creating a per-model, per-backend custom mask generation hook that can replace the placeholder maskgen op during the KVCache transformation pass

Additional infra support for custom attention masking:

  1. enable passing in additional inputs that are not part of the original graph module
  2. enable passing inputs directly from the top module to the GraphModule (the exported graph module can be nested). This means bypassing any kwargs capture in the in-between torch nn modules (see the "_ad_" prefix on "token_type_ids")

Summary
Export time: Capture metadata in markers
Transform time: Replace markers with backend-specific computation
Runtime: Execute the mask op with dynamic inputs
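The export-time marker plus transform-time replacement flow above can be sketched in plain Python. All names here are illustrative stand-ins for the PR's CustomMaskGeneratorRegistry, not the actual implementation:

```python
# Minimal sketch of the marker-and-replace pattern: a placeholder maskgen
# op is recorded at export time; at transform time a per-(model, backend)
# hook looked up from a registry replaces it with the real mask op.
class CustomMaskGeneratorRegistry:
    """Maps (model_type, backend) -> mask-generation callable."""

    _generators = {}

    @classmethod
    def register(cls, model_type, backend):
        def deco(fn):
            cls._generators[(model_type, backend)] = fn
            return fn
        return deco

    @classmethod
    def get(cls, model_type, backend):
        return cls._generators.get((model_type, backend))


@CustomMaskGeneratorRegistry.register("gemma3", "flashinfer")
def gemma3_flashinfer_maskgen(token_type_ids):
    # A real implementation would build a boolean attention mask here.
    return ["mask-for", token_type_ids]


# During the KVCache transform, the placeholder node is swapped for this hook:
hook = CustomMaskGeneratorRegistry.get("gemma3", "flashinfer")
```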

Infra updates:
AdditionalGraphInput - generic way to add inputs post-export
CustomMaskGeneratorRegistry - extensible for new models/backends
_ad_ prefix convention - generic bypass mechanism

  2. detect and handle weight tying between the exported GraphModule and a non-exported layer. This is added as a post-load hook.

Solutions

Problem 1: Custom Mask preparation and supplying

* Sub problem 1: Custom mask preparation
    AutoDeploy VLM flow exports and compiles only the TextModel (e.g., Gemma3TextModel), and AutoDeploy graph export also doesn't take a custom attention mask today.
* Sub problem 2: Layer Index based attention mask (full vs sliding window)

Solution:

  • Support the FlashInfer attention backend only for now
  • The FlashInfer backend optionally takes a boolean attention_mask.

During export:

We create a custom mask generation op that creates the boolean mask and provides it to the flashinfer backend.

  • Patch model export to always supply token_type_ids as a graph input
    • Gemma3Model.forward has token_type_ids as an explicit parameter, so it gets consumed there and doesn't flow through **lm_kwargs to language_model.
    • We patch Gemma3TextModel.__call__ to inject token_type_ids BEFORE any pre-hooks run, specifically the args-capturing pre-hook in export_to_gm in AutoDeploy. The flow: the patched __call__ runs → injects token_type_ids into kwargs
  • In the KVCache transform: a flashinfer_gemma3_mask_gen op is inserted into the graph, taking token_type_ids as input
    • Generate ONE mask with just the bidirectional image-token logic (no sliding window baked in)
    • Let FlashInfer apply the sliding window via the window_left parameter for sliding-attention layers
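The mask semantics described above (causal by default, bidirectional among image tokens, no sliding window baked in) can be sketched like this. This is a hedged, pure-Python illustration, not the actual flashinfer_gemma3_mask_gen op; assuming token_type_id == 1 marks image tokens:

```python
# Sketch of the single VLM prefill mask: causal everywhere, plus
# bidirectional attention among image tokens. Sliding-window clipping is
# deliberately NOT applied here; the backend handles it via window_left.
def build_vlm_prefill_mask(token_type_ids):
    n = len(token_type_ids)
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            causal = k <= q
            both_image = token_type_ids[q] == 1 and token_type_ids[k] == 1
            mask[q][k] = causal or both_image
    return mask


# Sequence: text, image, image, text
mask = build_vlm_prefill_mask([0, 1, 1, 0])
```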

During inference:

  • At inference time, we again need to update the inputs passed to the GraphModule. This is because Gemma3Model.forward has token_type_ids as an explicit parameter, so it gets consumed there and doesn't flow through **lm_kwargs to language_model.
  • So we add a register_forward_pre_hook on the GraphModule. The hook reads the current token_type_ids from engine.cache_seq_interface.info._extra_args (which gets populated during _prepare_inputs())
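The pre-hook mechanism can be sketched as follows. This is an illustrative stand-in (FakeSeqInfo and the hook factory are hypothetical names), mimicking the kwargs-aware signature of torch's register_forward_pre_hook(..., with_kwargs=True):

```python
# Sketch of the runtime pre-hook: before each GraphModule call, read the
# current token_type_ids from the sequence info's _extra_args (populated
# during _prepare_inputs()) and inject it into the call kwargs.
class FakeSeqInfo:
    def __init__(self):
        self._extra_args = {}


def make_vlm_prehook(seq_info):
    def pre_hook(module, args, kwargs):
        ttids = seq_info._extra_args.get("token_type_ids")
        if ttids is not None:
            kwargs["token_type_ids"] = ttids
        return args, kwargs
    return pre_hook


seq_info = FakeSeqInfo()
seq_info._extra_args["token_type_ids"] = [0, 1, 1, 0]  # set in _prepare_inputs()
hook = make_vlm_prehook(seq_info)
_, kwargs = hook(None, (), {})
```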

Problem 2: Respecting weight tying between exported graph layer and eager layer (in parent module/outside the graph)

Before export:
  embed_tokens.weight ─┬─ tensor T (shared)
  lm_head.weight ──────┘

After export (tying broken):
  embed_tokens.weight ─── tensor T (in GraphModule, loaded correctly from checkpoint)
  lm_head.weight ──────── tensor X (separate, NEVER loaded, random values)

During inference:
  embed_tokens → correct embeddings
  lm_head → GARBAGE output → bad generation!

Solution (OLD):

Added a new transform sync_tied_weights that runs after weights are loaded (stage: post_load_fusion) that:

  • Detects cross-boundary tied weights by:
    * Reading _tied_weights_keys from the model
    * Using get_input_embeddings() / get_output_embeddings() to find the actual pair
    * Checking which weights are inside GraphModules (exported) vs outside

  • Syncs the weights by making the non-exported weight (lm_head.weight) point to the exported weight's tensor

  • Tests

    • Comprehensive test coverage for VLM mask generation and attention operations
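The sync step boils down to re-pointing the non-exported parameter at the exported tensor. A simplified, self-contained sketch (Param is a stand-in for torch.nn.Parameter; names are illustrative):

```python
# Sketch of the tied-weight sync: after checkpoint load, make the
# non-exported lm_head weight share the exported embedding tensor so the
# tie broken by export is restored.
class Param:
    def __init__(self, data):
        self.data = data


def sync_tied_weights(src_param, dst_param):
    # Re-point dst at src's tensor; the canonical weight lives inside
    # the exported GraphModule (embed_tokens).
    dst_param.data = src_param.data
    return dst_param


embed_tokens_w = Param([1.0, 2.0])  # inside GraphModule, loaded from checkpoint
lm_head_w = Param([9.9, 9.9])       # outside the graph, never loaded
sync_tied_weights(embed_tokens_w, lm_head_w)
```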


Fixes #9587

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

@bmarimuthu-nv bmarimuthu-nv force-pushed the bala/fix-gemma-3-27b-it branch 4 times, most recently from 8282fc4 to 7abd5bc Compare December 19, 2025 01:27
@bmarimuthu-nv
Collaborator Author

/bot run

@bmarimuthu-nv bmarimuthu-nv changed the title [Draft] Bala/fix gemma 3 27b it [9587][fix] AutoDeploy: Support Gemma3 VLM Dec 19, 2025
@bmarimuthu-nv bmarimuthu-nv changed the title [9587][fix] AutoDeploy: Support Gemma3 VLM [#9587][fix] AutoDeploy: Support Gemma3 VLM Dec 19, 2025
@bmarimuthu-nv
Collaborator Author

@coderabbitai summary

@coderabbitai
Contributor

coderabbitai bot commented Dec 19, 2025

✅ Actions performed

Summary regeneration triggered.

@coderabbitai
Contributor

coderabbitai bot commented Dec 19, 2025

📝 Walkthrough

Walkthrough

This PR introduces comprehensive Vision-Language Model (VLM) support to AutoDeploy, enabling Gemma3 and similar multimodal models to export and run efficiently. Changes include custom attention mask generation operators, FlashInfer integration for VLM masking, export-time transformations for mask metadata tagging, runtime VLM mask preparation, and extensive supporting infrastructure including utilities, patches, and test coverage.

Changes

Cohort / File(s) Summary
Custom Attention Mask Operators
tensorrt_llm/_torch/auto_deploy/custom_ops/custom_attn_mask_gen.py
tensorrt_llm/_torch/auto_deploy/custom_ops/flashinfer_gemma3_mask.py
New custom PyTorch ops for generating attention masks with causal, bidirectional image-to-image, and sliding-window constraints; supports both full and sliding attention modes for VLM prefill phases.
FlashInfer Integration & Planning
tensorrt_llm/_torch/auto_deploy/custom_ops/flashinfer_attention.py
tensorrt_llm/_torch/auto_deploy/transform/library/kvcache.py
Extended PlanParams with VLM fields (has_custom_mask, window_left, logits_soft_cap); added VLM custom mask input initialization and argument passing for cached attention; updated planning logic to handle custom masks.
Export Configuration & Metadata
tensorrt_llm/_torch/auto_deploy/config/default.yaml
tensorrt_llm/_torch/auto_deploy/export/library/unified_attn.py
tensorrt_llm/_torch/auto_deploy/models/hf.py
Added new transforms sync_tied_weights and tag_vlm_mask_kind to config; added mask-kind inference helpers; extended dynamic shape lookups to include cache_position and token_type_ids.
Export-Time VLM Support
tensorrt_llm/_torch/auto_deploy/transform/library/export_to_gm.py
tensorrt_llm/_torch/auto_deploy/transform/library/kvcache_transformers.py
Added kwargs sanitization for HF-specific parameters; implemented VLM metadata tagging (ad_is_vlm, ad_mask_kind_by_module) on exported submodules; integrated mask generation during cached attention path.
VLM-Specific Transforms
tensorrt_llm/_torch/auto_deploy/transform/library/sync_tied_weights.py
tensorrt_llm/_torch/auto_deploy/transform/library/tag_vlm_mask_kind.py
tensorrt_llm/_torch/auto_deploy/models/patches/gemma3.py
New post-export transform to sync tied weights across boundaries; new transform to annotate mask-kind metadata on attention nodes; Gemma3-specific patch disabling vmap-based mask creation for export compatibility.
Runtime VLM Handling
tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
tensorrt_llm/_torch/auto_deploy/utils/vlm_utils.py
Added VLM mask preparation logic in engine constructor; new utilities for detecting image tokens and extracting image-token masks by model type; conditional mask injection into model call kwargs.
Diagnostic & Utility Infrastructure
tensorrt_llm/_torch/auto_deploy/transform/optimizer.py
tensorrt_llm/_torch/auto_deploy/utils/_graph.py
Added diagnostic dump calls pre/post compile and post tag_vlm_mask_kind; new graph dumping utility to export GraphModule code and graph representations to files.
Ignore Patterns
.cursorignore
Extended ignore list to exclude tensorrt_llm/_torch/models/ from indexing.
Unit Tests
tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/test_flashinfer_attention_op.py
tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/test_flashinfer_attention_vlm.py
tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/test_flashinfer_gemma3_mask.py
tests/unittest/_torch/auto_deploy/unit/singlegpu/shim/test_vlm_constraints.py
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_kvcache_vlm_detection.py
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_tag_vlm_mask_kind.py
tests/unittest/_torch/auto_deploy/unit/singlegpu/utils/test_vlm_utils.py
Comprehensive test coverage: updated FlashInfer attention tests with new mask parameters; new tests for attention mask selection, PlanParams hashing, and constants extraction; new tests for Gemma3 mask generation and edge cases; new VLM constraint validation tests; new VLM auto-detection and mask mapping tests; new tag_vlm_mask_kind integration tests; new VLM utility tests covering image-token detection across model types.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60–90 minutes

Areas requiring extra attention:

  • custom_attn_mask_gen.py — Complex masking logic with causal constraints, bidirectional image-token handling, and sliding-window interactions; validate correctness of mask construction and broadcast safety.
  • flashinfer_attention.py — Significant changes to planning, PlanParams dataclass, and FlashInfer wrapper integration; verify proper propagation of window_left and logits_soft_cap through planning paths and correct hash/equality semantics with new fields.
  • kvcache.py — New VLM mask input initialization and argument appending; verify that lazy initialization occurs correctly, error handling when masks required but missing, and that non-FlashInfer paths remain unchanged.
  • ad_executor.py — New VLM mask preparation and conditional injection logic; ensure mask computation is correct, VLM-specific kwargs are properly stripped before model invocation, and configuration hook is called at the right stage.
  • export_to_gm.py — VLM metadata tagging and kwargs sanitization; verify that ad_is_vlm and ad_mask_kind_by_module are correctly populated, that HF-only kwargs are properly filtered at all stages, and that non-VLM models are unaffected.
  • kvcache_transformers.py — VLM mask generation integration during forward; verify correct detection of mask requirements, proper integration with fake profiling, and consistent behavior across module paths.
  • Integration across sync_tied_weights.py and tag_vlm_mask_kind.py — Verify transform ordering and interaction; ensure tied-weights sync doesn't interfere with mask-kind tagging, and both transforms handle edge cases (missing metadata, multiple export boundaries).

📜 Recent review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bd5b3c2 and 7abd5bc.

📒 Files selected for processing (25)
  • .cursorignore (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/config/default.yaml (2 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/custom_attn_mask_gen.py (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/flashinfer_attention.py (10 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/flashinfer_gemma3_mask.py (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/export/library/unified_attn.py (3 hunks)
  • tensorrt_llm/_torch/auto_deploy/models/hf.py (2 hunks)
  • tensorrt_llm/_torch/auto_deploy/models/patches/gemma3.py (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py (3 hunks)
  • tensorrt_llm/_torch/auto_deploy/transform/library/cleanup_input_constraints.py (2 hunks)
  • tensorrt_llm/_torch/auto_deploy/transform/library/export_to_gm.py (4 hunks)
  • tensorrt_llm/_torch/auto_deploy/transform/library/kvcache.py (3 hunks)
  • tensorrt_llm/_torch/auto_deploy/transform/library/kvcache_transformers.py (5 hunks)
  • tensorrt_llm/_torch/auto_deploy/transform/library/sync_tied_weights.py (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/transform/library/tag_vlm_mask_kind.py (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/transform/optimizer.py (2 hunks)
  • tensorrt_llm/_torch/auto_deploy/utils/_graph.py (2 hunks)
  • tensorrt_llm/_torch/auto_deploy/utils/vlm_utils.py (1 hunks)
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/test_flashinfer_attention_op.py (8 hunks)
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/test_flashinfer_attention_vlm.py (1 hunks)
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/test_flashinfer_gemma3_mask.py (1 hunks)
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/shim/test_vlm_constraints.py (1 hunks)
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_kvcache_vlm_detection.py (1 hunks)
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_tag_vlm_mask_kind.py (1 hunks)
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/utils/test_vlm_utils.py (1 hunks)


@bmarimuthu-nv
Collaborator Author

/bot run

Comment on lines +258 to +260
@TransformRegistry.register("sync_tied_weights")
class SyncTiedWeights(BaseTransform):
"""Sync tied weights that cross the export boundary.
Member


we should also check if we can just do this with load hooks during/before export.

Collaborator Author

@bmarimuthu-nv bmarimuthu-nv Dec 23, 2025


I looked into the load hooks and that feels like a better fit for this than a separate transform. So I reverted the transform and made the export add a post-load hook to sync tied weights - 7d1c5c8

# return a list of tensors
return self.cache_seq_interface.info.unnest_sequences(logits)

def _prepare_vlm_kwargs(self, kwargs: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]:
Member


I think we need to find a way to make this part of the export pipeline without requiring eager hacks. Here is my broad suggestion which relies on model patching or re-writing the model definition:

  1. We have a reference op torch.ops.auto_deploy.create_attention_mask that takes a set of arguments, is called inside the model (either by rewriting or patching the forward function or some masking utils), and can return the correct mask. It can be called multiple times to create different masks. The different masks are then given as arguments to torch.ops.auto_deploy.torch_attention.
  2. When you export the model you now have multiple instances of these mask ops available. Since it is part of the export graph, each layer and invocation of torch_attention already has the right mask (sliding, full, etc...)
  3. During swapping from torch_attention to the backend attention (e.g. flashinfer_attention) you swap the generic torch.ops.auto_deploy.create_attention_mask for the backend-specific mask creation op. Or you keep the generic one if the generic mask creation is compatible with the attention backend.

I think this design of shifting more into the export + model patching stage should avoid a lot of the hard-coded heuristics like here throughout the code base.

Let me know what you think and happy to discuss this further
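[Editorial sketch] The suggestion above amounts to a graph rewrite that swaps a generic op for a backend-specific one. A rough illustration with hypothetical names (the real ops would be torch custom ops traced into an FX graph):

```python
# Sketch of the suggested design: a generic create_attention_mask op
# appears once per layer in the exported graph (already carrying the
# right kind, sliding vs full), and a later pass swaps it for the
# backend-specific variant.
def generic_create_attention_mask(kind, seq_len):
    return ("generic", kind, seq_len)


def flashinfer_create_attention_mask(kind, seq_len):
    return ("flashinfer", kind, seq_len)


# Stand-in for exported graph nodes: one mask-op invocation per layer.
graph_nodes = [("mask", "sliding"), ("mask", "full"), ("mask", "sliding")]


def swap_mask_ops(nodes, backend_fn):
    # Replace the generic op with the backend-specific one, keeping the
    # per-layer mask kind that export already baked in.
    return [(backend_fn, kind) for (_, kind) in nodes]


swapped = swap_mask_ops(graph_nodes, flashinfer_create_attention_mask)
```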

Collaborator

@govind-ramnarayan govind-ramnarayan left a comment


Thanks! Mostly looks good to me. Leaving some comments. Mostly would be good to revisit some of these tests for usefulness.

Also it would be good to clarify (and document where it makes sense) why we have a custom mask during prefill and not generation, because with speculative decoding there may be multi-token sequences in the future that are not prefill, e.g. here: https://github.com/NVIDIA/TensorRT-LLM/pull/10096/changes#diff-3ea5c563f6bdbaf80e42e4281b753c2a69e59c6dc9b43595b263b352c3f6cca3R280

@@ -252,16 +276,23 @@ def flashinfer_mha_with_cache(
n_heads = q.shape[1]
n_kv_heads = k.shape[1]

is_generate = s == 1
# Custom mask only applies during prefill, not generation
Collaborator


Q: Just curious, is this a fundamental aspect of prefill, or is it directly applicable to any request with s > 1. The reason I ask is because of speculative decoding - where decode sequences can have multiple tokens. I don't think you need to investigate this, but clarifying the reason that the custom mask only applies during prefill (maybe in your own notes or a Google doc if it is too long for a comment) might be useful in the future so we don't need to ask this when doing (speculative decoding) x VLMs.

Collaborator Author


Sure, the primary reason we apply a custom mask is that in VLMs the presence of image (or other modal) tokens may need a non-causal mask (like bidirectional masking). This custom mask is needed only for context generation/prefill. Once prefill is done, during generation we are doing causal token generation, so we fall back to the attention backend's causal mask generation (for ImageTextToText-type models).

As for supporting spec decoding (s > 1), maybe we can look into having an explicit param to denote the phase (prefill / decode) instead of relying on s == 1. As long as we are running text-generation models, a decode phase with s > 1 should still use the causal mask from the attention backend instead of the custom mask. So we should be fine with an explicit param when doing (speculative decoding) x VLMs.
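[Editorial sketch] The explicit-phase idea could look like this; the function and parameter names are hypothetical, not code from this PR:

```python
# Sketch: gate the custom VLM mask on an explicit phase flag rather than
# on s == 1, so speculative-decode steps (s > 1 during decode) still use
# the backend's causal mask.
def needs_custom_mask(phase: str, has_vlm_mask: bool) -> bool:
    # The custom (bidirectional image) mask applies only to prefill.
    return phase == "prefill" and has_vlm_mask


prefill_uses_custom = needs_custom_mask("prefill", True)
# A multi-token speculative-decode step still takes the causal path:
spec_decode_uses_custom = needs_custom_mask("decode", True)
```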

modeling_gemma3.Gemma3Model.forward
)

def _revert_patch(self):
Collaborator


Nit: Maybe assert that this function covers all the keys in self.original_values. I can see a potential issue where we add more stuff to _apply_patch() in the future but do not properly update _revert_patch().

I think maybe if this type of pattern becomes common for other models, we could try to turn this into a utility that just iterates over self.original_values and figures out the values to update from the keys; but this seems overly complicated for now.
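[Editorial sketch] The suggested guard could be sketched like this; ExportPatch and its methods are illustrative names, not the PR's actual patch class:

```python
# Sketch: a patch helper that records originals in apply() and, in
# revert(), iterates over exactly those recorded keys and asserts each is
# restored, so apply/revert cannot silently drift apart.
class ExportPatch:
    def __init__(self, target):
        self.target = target
        self.original_values = {}

    def apply(self, **patches):
        for name, value in patches.items():
            self.original_values[name] = getattr(self.target, name)
            setattr(self.target, name, value)

    def revert(self):
        for name, original in self.original_values.items():
            setattr(self.target, name, original)
        # Guard: everything we recorded must now be restored.
        assert all(
            getattr(self.target, n) is v
            for n, v in self.original_values.items()
        )


class Dummy:
    forward = "orig_forward"


patch = ExportPatch(Dummy)
patch.apply(forward="patched_forward")
patch.revert()
```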

Collaborator Author


Makes sense! That's a good idea and seems valid for all export-only patches at least 👍

Signed-off-by: Balamurugan Marimuthu <246387390+bmarimuthu-nv@users.noreply.github.com>
@bmarimuthu-nv bmarimuthu-nv force-pushed the bala/fix-gemma-3-27b-it branch from df120f0 to a1d5ee0 Compare December 23, 2025 23:28


Development

Successfully merging this pull request may close these issues.

[Enhancement]: Support Gemma3 VLM in Autodeploy

3 participants