[#9587][fix] AutoDeploy: Support Gemma3 VLM (#10096)
bmarimuthu-nv wants to merge 24 commits into NVIDIA:main
Conversation
Force-pushed 8282fc4 to 7abd5bc

/bot run
@coderabbitai summary

✅ Actions performed: Summary regeneration triggered.
📝 Walkthrough

This PR introduces comprehensive Vision-Language Model (VLM) support to AutoDeploy, enabling Gemma3 and similar multimodal models to export and run efficiently. Changes include custom attention mask generation operators, FlashInfer integration for VLM masking, export-time transformations for mask metadata tagging, runtime VLM mask preparation, and extensive supporting infrastructure including utilities, patches, and test coverage.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60–90 minutes
/bot run
tensorrt_llm/_torch/auto_deploy/transform/library/sync_tied_weights.py
@TransformRegistry.register("sync_tied_weights")
class SyncTiedWeights(BaseTransform):
    """Sync tied weights that cross the export boundary.
we should also check if we can just do this with load hooks during/before export.
I looked into the load hooks, and they feel like a better fit for this than a separate transform. So I reverted the transform and made the export add a post-load hook to sync tied weights - 7d1c5c8
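A minimal sketch of that load-hook approach (the toy module `TinyLM` and the hook wiring here are illustrative, not the PR's actual code): a post-load hook re-ties `lm_head` to the embedding after every `load_state_dict`, even when the checkpoint's copies have diverged.

```python
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embedding = nn.Embedding(8, 4)
        self.lm_head = nn.Linear(4, 8, bias=False)
        # re-tie after every state_dict load, even if checkpoint weights diverge
        self.register_load_state_dict_post_hook(self._tie)

    @staticmethod
    def _tie(module, incompatible_keys):
        # point lm_head at the embedding's tensor so both share storage
        module.lm_head.weight = module.embedding.weight

m = TinyLM()
sd = {k: v.clone() for k, v in m.state_dict().items()}
sd["lm_head.weight"] = torch.randn(8, 4)  # simulate a non-identical checkpoint
m.load_state_dict(sd)
```

After loading, the hook has run, so the two weights share the same storage regardless of what the checkpoint contained.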
tensorrt_llm/_torch/auto_deploy/custom_ops/flashinfer_attention.py
    # return a list of tensors
    return self.cache_seq_interface.info.unnest_sequences(logits)

def _prepare_vlm_kwargs(self, kwargs: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]:
I think we need to find a way to make this part of the export pipeline without requiring eager hacks. Here is my broad suggestion which relies on model patching or re-writing the model definition:
- We have a reference op `torch.ops.auto_deploy.create_attention_mask` that takes a set of arguments, is called inside the model (either by rewriting or patching the forward function or some masking utils), and can return the correct mask. It can be called multiple times to create different masks. The different masks are then given as arguments to `torch.ops.auto_deploy.torch_attention`.
- When you export the model you now have multiple instances of these mask ops available. Since it is part of the export graph, each layer and invocation of `torch_attention` already has the right mask (sliding, full, etc...)
- During swapping from `torch_attention` to the backend attention (e.g. `flashinfer_attention`) you swap the generic `torch.ops.auto_deploy.create_attention_mask` for the backend-specific mask creation tensor. Or you keep the generic one if the generic mask creation is compatible with the attention backend.
I think this design of shifting more into the export + model patching stage should avoid a lot of the hard-coded heuristics like here throughout the code base.
Let me know what you think and happy to discuss this further
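This suggestion could be sketched roughly as follows (illustrative only, with simplified signatures; the real op would live under `torch.ops.auto_deploy` as a registered custom op and carry more arguments). A generic mask function is traced into the graph, and a transform pass later swaps its call sites for a backend-specific function:

```python
import torch
import torch.fx as fx

def create_attention_mask(seq_len: int, sliding_window: int) -> torch.Tensor:
    """Generic boolean mask: causal, optionally limited to a sliding window."""
    i = torch.arange(seq_len).unsqueeze(1)  # query positions
    j = torch.arange(seq_len).unsqueeze(0)  # key positions
    mask = j <= i  # causal
    if sliding_window > 0:
        mask = mask & ((i - j) < sliding_window)
    return mask

def swap_mask_op(gm: fx.GraphModule, backend_mask_fn) -> fx.GraphModule:
    """Replace every generic mask call in an exported graph with a backend one."""
    for node in gm.graph.nodes:
        if node.op == "call_function" and node.target is create_attention_mask:
            node.target = backend_mask_fn
    gm.recompile()
    return gm
```

Because the mask op is part of the exported graph, the swap pass sees one call site per layer with its own arguments (sliding vs. full), so no per-layer heuristics are needed at transform time.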
govind-ramnarayan left a comment:

Thanks! Mostly looks good to me. Leaving some comments; mainly it would be good to revisit some of these tests for usefulness.
Also, it would be good to clarify (and document where it makes sense) why we have a custom mask during prefill and not generation, because in the future there may be multi-token sequences that are not prefill, e.g. with speculative decoding: https://github.com/NVIDIA/TensorRT-LLM/pull/10096/changes#diff-3ea5c563f6bdbaf80e42e4281b753c2a69e59c6dc9b43595b263b352c3f6cca3R280
@@ -252,16 +276,23 @@ def flashinfer_mha_with_cache(
    n_heads = q.shape[1]
    n_kv_heads = k.shape[1]

    is_generate = s == 1
    # Custom mask only applies during prefill, not generation
Q: Just curious, is this a fundamental aspect of prefill, or is it directly applicable to any request with s > 1. The reason I ask is because of speculative decoding - where decode sequences can have multiple tokens. I don't think you need to investigate this, but clarifying the reason that the custom mask only applies during prefill (maybe in your own notes or a Google doc if it is too long for a comment) might be useful in the future so we don't need to ask this when doing (speculative decoding) x VLMs.
Sure, the primary reason we apply a custom mask is that in VLMs the presence of image or other modal tokens might need a non-causal mask (like bidirectional masking). This custom mask is needed only for context/prefill. Once prefill is done, during generation we are doing causal token generation, and hence we fall back to the attention backend's causal mask generation (for ImageTextToText-type models).
As for supporting spec decoding (s > 1), maybe we can look into having an explicit param to denote the phase (prefill/decode) instead of relying on s == 1. But as long as we are running text generation models, the decode phase with s > 1 should still use the causal mask from the attention backend instead of the custom mask. So we should be fine with an explicit param when doing (speculative decoding) x VLMs.
    modeling_gemma3.Gemma3Model.forward
)

def _revert_patch(self):
Nit: Maybe assert that this function covers all the keys in self.original_values. I can see a potential issue where we add more stuff to _apply_patch() in the future but do not properly update _revert_patch().
I think maybe if this type of pattern becomes common for other models, we could try to turn this into a utility that just iterates over self.original_values and figures out the values to update from the keys; but this seems overly complicated for now.
Makes sense! That's a good idea and seems valid for all export-only patches at least 👍
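The reviewer's suggested utility could be sketched like this (hypothetical; `AttrPatcher` is illustrative, not the TRT-LLM patch API): every patched attribute is recorded in one dict, and revert simply iterates it, so the revert can never drift out of sync with the apply.

```python
class AttrPatcher:
    """Record originals on patch; revert by iterating the record."""

    def __init__(self):
        self.original_values = {}  # (owner, attr_name) -> original value

    def patch(self, owner, attr: str, new_value) -> None:
        # keep only the first-seen original if the same attr is patched twice
        if (owner, attr) not in self.original_values:
            self.original_values[(owner, attr)] = getattr(owner, attr)
        setattr(owner, attr, new_value)

    def revert(self) -> None:
        for (owner, attr), value in self.original_values.items():
            setattr(owner, attr, value)
        self.original_values.clear()
```

With this shape, `_apply_patch()` growing a new entry automatically grows `_revert_patch()` too, addressing the drift concern above.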
tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/test_flashinfer_attention_vlm.py
Signed-off-by: Balamurugan Marimuthu <246387390+bmarimuthu-nv@users.noreply.github.com>
Force-pushed df120f0 to a1d5ee0
Summary by Author
Background:
The Gemma3 model hierarchy is `Gemma3ForConditionalGeneration` -> `Gemma3Model` -> `Gemma3TextModel`. We export only `Gemma3TextModel`.

Challenges:

- Mask preparation happens in `Gemma3Model`: the mask tensor for a given layer idx/type (sliding window or full attention) is prepared there and passed directly to `Gemma3TextModel`.
- `lm_head` is in `Gemma3ForCausalGeneration`, while the input `embedding` is in `Gemma3TextModel`. But weight tying is enabled between `lm_head` and the embedding, and we export only `Gemma3TextModel` and load weights. Sadly, in the checkpoint, the `lm_head` weight and `embedding` weight are not identical (for some reason), so generation is bad if we don't enforce the weight tying after loading the weights.

Solution V2
For supporting Gemma3 VLM, the following changes are made in this branch:

1.a We achieved this by:
1.a.1 adding a placeholder op in attention_interface during tracing
1.a.2 creating a per-model, per-backend custom mask generation hook that can replace the placeholder mask-gen op during the KVCache transformation pass
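The per-model, per-backend hook lookup in 1.a.2 could be sketched as a small registry (hypothetical names; the actual class is called `CustomMaskGeneratorRegistry` in this PR, but its exact API is not shown here):

```python
from typing import Callable, Dict, Tuple

class MaskGenRegistry:
    """Maps (model, backend) to a mask-generation hook."""
    _gens: Dict[Tuple[str, str], Callable] = {}

    @classmethod
    def register(cls, model: str, backend: str) -> Callable:
        def deco(fn: Callable) -> Callable:
            cls._gens[(model, backend)] = fn
            return fn
        return deco

    @classmethod
    def get(cls, model: str, backend: str) -> Callable:
        return cls._gens[(model, backend)]

@MaskGenRegistry.register("gemma3", "flashinfer")
def gemma3_flashinfer_mask_gen(token_type_ids):
    # placeholder body: a real hook would emit the backend's mask tensor
    return [t == 1 for t in token_type_ids]
```

New model/backend pairs then only need a decorated function; the transformation pass looks up the hook and swaps it in for the placeholder op.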
Additional infra to support custom attention masking:
Summary:
- Export time: capture metadata in markers
- Transform time: replace markers with backend-specific computation
- Runtime: execute the mask op with dynamic inputs

Infra updates:
- `AdditionalGraphInput` - a generic way to add inputs post-export
- `CustomMaskGeneratorRegistry` - extensible for new models/backends
- `_ad_` prefix convention - a generic bypass mechanism

Solutions
Problem 1: Custom Mask preparation and supplying
Solution:
During export:
We create a custom mask generation op that creates the boolean mask and provides it to the flashinfer backend.
- `token_type_ids` is added as a graph input.
- `Gemma3Model.forward` has `token_type_ids` as an explicit parameter, so it gets consumed there and doesn't flow through `**lm_kwargs` to `language_model`.
- We patch `Gemma3TextModel.__call__` to inject `token_type_ids` BEFORE any pre-hooks run, specifically the args-capturing pre-hook in `export_to_gm` in AutoDeploy. The flow: the patched `__call__` runs → injects `token_type_ids` into `kwargs`.
- A `flashinfer_gemma3_mask_gen` op is inserted into the graph, taking `token_type_ids` as input, with a `window_left` parameter for sliding attention layers.

During inference:
- `Gemma3Model.forward`'s signature has `token_type_ids` as an explicit parameter, so it gets consumed there and doesn't flow through `**lm_kwargs` to `language_model`.
- We `register_forward_pre_hook` for the GraphModule. The hook reads the current `token_type_ids` from `engine.cache_seq_interface.info._extra_args` (which gets populated during `_prepare_inputs()`).

Problem 2: Respecting weight tying between the exported graph layer and the eager layer (in parent module/outside the graph)
Solution (OLD):
Added a new transform `sync_tied_weights` that runs after weights are loaded (stage: post_load_fusion) and:

Detects cross-boundary tied weights by:
* Reading _tied_weights_keys from the model
* Using get_input_embeddings() / get_output_embeddings() to find the actual pair
* Checking which weights are inside GraphModules (exported) vs outside
Syncs the weights by making the non-exported weight (lm_head.weight) point to the exported weight's tensor
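The old transform's detect-and-sync step could look roughly like this (a hedged sketch under simplifying assumptions; `ToyWrapper` stands in for the real model, with the embedding "inside" the exported graph and `lm_head` "outside" it):

```python
import torch.nn as nn

class ToyWrapper(nn.Module):
    """Illustrative stand-in exposing HF-style tied-weight accessors."""
    _tied_weights_keys = ["lm_head.weight"]

    def __init__(self):
        super().__init__()
        self.embedding = nn.Embedding(8, 4)          # inside the exported graph
        self.lm_head = nn.Linear(4, 8, bias=False)   # outside, eager

    def get_input_embeddings(self):
        return self.embedding

    def get_output_embeddings(self):
        return self.lm_head

def sync_tied_weights(model) -> None:
    emb = model.get_input_embeddings()
    head = model.get_output_embeddings()
    # make the non-exported weight point at the exported tensor (no copy)
    head.weight = emb.weight

m = ToyWrapper()
sync_tied_weights(m)
```

Re-pointing (rather than copying) means any later in-place update to the exported tensor is automatically visible to `lm_head` as well.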
Tests
Fixes #9587
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
/bot [-h] {run, kill, skip, reuse-pipeline} ...

Provide a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message. See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

- --reuse-test (optional)pipeline-id (OPTIONAL): Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option is always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
- --disable-reuse-test (OPTIONAL): Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests run regardless of previous successes.
- --disable-fail-fast (OPTIONAL): Disable fail-fast on build/test/infra failures.
- --skip-test (OPTIONAL): Skip all test stages, but still run build stages, package stages and sanity check stages. Note: does NOT update GitHub check status.
- --stage-list "A10-PyTorch-1, xxx" (OPTIONAL): Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: does NOT update GitHub check status.
- --gpu-type "A30, H100_PCIe" (OPTIONAL): Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: does NOT update GitHub check status.
- --test-backend "pytorch, cpp" (OPTIONAL): Skip test stages which don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: does NOT update GitHub pipeline status.
- --only-multi-gpu-test (OPTIONAL): Only run the multi-GPU tests. Note: does NOT update GitHub check status.
- --disable-multi-gpu-test (OPTIONAL): Disable the multi-GPU tests. Note: does NOT update GitHub check status.
- --add-multi-gpu-test (OPTIONAL): Force run the multi-GPU tests in addition to running the L0 pre-merge pipeline.
- --post-merge (OPTIONAL): Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL): Run the ordinary L0 pre-merge pipeline plus the specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
- --detailed-log (OPTIONAL): Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
- --debug (OPTIONAL): Experimental feature. Enable access to the CI container for debugging purposes. Note: specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.

kill

kill: Kill all running builds associated with the pull request.

skip

skip --comment COMMENT: Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline: Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.