
Conversation

@yechank-nvidia
Collaborator

@yechank-nvidia commented Jan 6, 2026

With this fix, Qwen3-VL-Dense/MoE can run Attention DP on the LLM side. Note that other multimodal models have not been tested.

Summary by CodeRabbit

  • New Features

    • Added support for the Qwen3-VL-8B-Instruct multimodal model with a 55.11% MMMU accuracy reference
    • Enhanced tensor parallelism configuration for optimized vision model inference
  • Tests

    • Expanded multimodal modeling test coverage
    • Added test scenarios for improved model compatibility


@yechank-nvidia
Collaborator Author

/bot run

@yechank-nvidia self-assigned this on Jan 6, 2026
@yechank-nvidia added the Multimodal label on Jan 6, 2026
@tensorrt-cicd
Collaborator

PR_Github #30687 [ run ] triggered by Bot. Commit: 54121ed

@coderabbitai
Contributor

coderabbitai bot commented Jan 6, 2026

📝 Walkthrough

These changes introduce dynamic tensor parallel configuration for vision models by computing mapping-aware parameters during initialization. Key modifications include updating Qwen2.5-VL attention initialization, refactoring Qwen3-VL vision components to compute and propagate custom mappings, updating forward signatures, and adding test coverage for the Qwen3-VL-8B-Instruct model.

Changes

  • Model Architecture Updates (tensorrt_llm/_torch/models/modeling_qwen2vl.py, tensorrt_llm/_torch/models/modeling_qwen3vl.py, tensorrt_llm/_torch/modules/mlp.py):
    Qwen2VL removes the reduce_output parameter from Qwen2_5_VLVisionAttention.__init__. Qwen3VL introduces conditional mapping computation when attention data parallelism is enabled, updates Qwen3VLVisionBlock.forward to accept pixel_values instead of hidden_states, stores model_config on the instance for dynamic mapping decisions, and propagates the computed mappings to the Linear layers in VisionAttention, VisionMLP, and VisionPatchMerger. MLP adds an optional overridden_tp_size parameter to conditionally compute and use a derived mapping for layer initialization (see the sketch after this list).
  • Test Data and Configuration (tests/integration/defs/accuracy/references/mmmu.yaml, tests/integration/test_lists/test-db/l0_l40s.yml):
    Adds a reference accuracy entry for Qwen3-VL-8B-Instruct (55.11) to the MMMU benchmarks and appends multimodal modeling test entries, including test_modeling_qwen3vl.py::TestQwen3VL::test_all, to the L0/L40S test suite.
  • Test Implementations (tests/integration/defs/accuracy/test_llm_api_pytorch_multimodal.py, tests/unittest/_torch/modeling/test_modeling_qwen3vl.py):
    Adds a TestQwen3VL test class with an MMMU evaluation harness for Qwen3-VL-8B-Instruct, and introduces test scenarios covering Qwen3VL with disable_fuse_rope=True under non-CUDA-graph execution paths.
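For context on the MLP change above, here is a minimal sketch of how an optional overridden_tp_size could be folded into a derived Mapping at initialization time. This is an illustration, not the PR's code: the helper name is invented, the Mapping constructor arguments are assumed from tensorrt_llm/mapping.py, and cp_size is assumed to be 1.

    from typing import Optional

    from tensorrt_llm.mapping import Mapping


    def derive_mapping(mapping: Mapping, overridden_tp_size: Optional[int]) -> Mapping:
        """Shrink TP groups to overridden_tp_size while keeping world_size fixed.

        The leftover ranks are folded into pp_size (the "misused" PP dimension),
        so all-reduce then runs inside the smaller TP groups.
        """
        if overridden_tp_size is None or overridden_tp_size == mapping.tp_size:
            return mapping
        assert mapping.tp_size % overridden_tp_size == 0
        return Mapping(
            world_size=mapping.world_size,
            rank=mapping.rank,
            gpus_per_node=mapping.gpus_per_node,
            tp_size=overridden_tp_size,
            pp_size=mapping.world_size // overridden_tp_size,  # assumes cp_size == 1
        )

A layer would then build its Linear shards from derive_mapping(config.mapping, overridden_tp_size) rather than config.mapping, which is roughly what the summary describes for mlp.py and the Qwen3-VL vision modules.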

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
  • Description check ⚠️ Warning: The PR description is minimal and lacks detail about the implementation approach, specific changes, and test coverage required by the template sections. Resolution: expand the description to cover the specific implementation changes (Mapping-aware tensor parallel config, parameter additions), how the test coverage validates the changes, and confirmation that the PR checklist items were reviewed.

✅ Passed checks (1 passed)

  • Title check ✅ Passed: The title clearly identifies the main change, enabling AttentionDP on Qwen3-VL and fixing tests, which directly aligns with the core modifications in the changeset.


@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/models/modeling_qwen2vl.py (1)

505-524: Fix class name reference and add required NVIDIA copyright header.

The class name should be Qwen2_5_VLVisionAttention (not QWen2VLVisionAttention). The parameter propagation to the base Attention class is correct—the base class accepts reduce_output: bool = True and the subclass properly passes it via super().__init__().

However, the file is missing the required NVIDIA copyright header with the year of latest meaningful modification, as mandated by coding guidelines for all TensorRT-LLM source files.

🤖 Fix all issues with AI Agents
In tensorrt_llm/_torch/models/modeling_qwen3vl.py:
  • Line 654: The forward method's return type annotation is wrong: it returns (hidden_states, deepstack_feature_lists) but is annotated as torch.Tensor. Update the signature to forward(self, pixel_values: torch.Tensor, grid_thw: torch.Tensor, **kwargs) -> Tuple[torch.Tensor, List[torch.Tensor]] (or Sequence[torch.Tensor] if preferred), add the corresponding typing imports (from typing import Tuple, List) at the top of the module, and leave **kwargs unused for API consistency.

In tensorrt_llm/_torch/modules/mlp.py:
  • Around lines 51-53: The up_lora LoraLayer is built using intermediate_size // config.mapping.tp_size, which mismatches up_proj when an overridden_tp_size is used. Update the up_lora construction to use the effective tp_size used by up_proj (overridden_tp_size when provided, otherwise config.mapping.tp_size) so the LoraLayer shard size matches the Linear layer shard size; derive a single tp_size variable and reuse it for both up_proj and up_lora (see the sketch below).
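As a rough illustration of the two fixes above (a sketch with assumed names, not the code in this PR), the corrected signature and a shared effective tp_size could look like this:

    from typing import List, Optional, Tuple

    import torch


    class VisionModelSketch:
        """Placeholder class; only the forward signature matters here."""

        def forward(self, pixel_values: torch.Tensor, grid_thw: torch.Tensor,
                    **kwargs) -> Tuple[torch.Tensor, List[torch.Tensor]]:
            # The annotation now matches the (hidden_states, deepstack_feature_lists)
            # tuple the real method returns; **kwargs stays unused for API parity.
            raise NotImplementedError


    def effective_tp_size(mapping_tp_size: int,
                          overridden_tp_size: Optional[int]) -> int:
        """Return the single tp_size that both up_proj and up_lora should share."""
        return overridden_tp_size if overridden_tp_size is not None else mapping_tp_size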
🧹 Nitpick comments (3)
tensorrt_llm/_torch/modules/mlp.py (1)

35-49: Consider refactoring the mapping override logic.

The current approach of "misusing" pp_size to create smaller TP groups is a clever workaround but may be fragile. Consider whether a more explicit mechanism for Attention DP would be clearer, such as:

  • Adding explicit support in the Mapping class for attention data parallel groups
  • Using a dedicated attn_tp_size parameter that's already mentioned in learnings

This would make the code more maintainable and less prone to confusion.

tensorrt_llm/_torch/models/modeling_qwen3vl.py (2)

390-390: Consider removing unused attribute.

self.model_config is stored but not referenced elsewhere in Qwen3VLVisionBlock. If this is intentional for debugging or future use, consider adding a brief comment; otherwise, it can be removed.


442-476: LGTM with suggestion for clarity.

The mapping override logic correctly configures linear layers for attention DP by treating the TP dimension as PP for all-reduce purposes. The implementation is sound.

Consider rephrasing the comment on line 448 for clarity:

-            # "Misuse" pp_size here to perform all-reduce within smaller groups
+            # Use pp_size to route all-reduce to the original TP group when TP is overridden to 1
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 810249c and 54121ed.

📒 Files selected for processing (7)
  • tensorrt_llm/_torch/models/modeling_qwen2vl.py
  • tensorrt_llm/_torch/models/modeling_qwen3vl.py
  • tensorrt_llm/_torch/modules/mlp.py
  • tests/integration/defs/accuracy/references/mmmu.yaml
  • tests/integration/defs/accuracy/test_llm_api_pytorch_multimodal.py
  • tests/integration/test_lists/test-db/l0_l40s.yml
  • tests/unittest/_torch/modeling/test_modeling_qwen3vl.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: The code developed for TensorRT-LLM should conform to Python 3.8+
Indent Python code with 4 spaces. Do not use tabs
Always maintain the namespace when importing Python modules, even if only one class or function from a module is used
Python filenames should use snake_case (e.g., some_file.py)
Python classes should use PascalCase (e.g., class SomeClass)
Python functions and methods should use snake_case (e.g., def my_awesome_function():)
Python local variables should use snake_case, with prefix k for variable names that start with a number (e.g., k_99th_percentile)
Python global variables should use upper snake_case with prefix G (e.g., G_MY_GLOBAL)
Python constants should use upper snake_case (e.g., MY_CONSTANT)
Avoid shadowing variables declared in an outer scope in Python
Initialize all externally visible members of a Python class in the constructor
For Python interfaces that may be used outside a file, prefer docstrings over comments
Use comments in Python for code within a function, or interfaces that are local to a file
Use Google-style docstrings for Python classes and functions, which can be parsed by Sphinx
Python attributes and variables can be documented inline with the format """<type>: Description"""
Avoid using reflection in Python when functionality can be easily achieved without reflection
When using try-except blocks in Python, limit the except clause to the smallest set of errors possible
When using try-except blocks in Python to handle multiple possible variable types (duck-typing), keep the body of the try as small as possible and use the else block for the main logic

Files:

  • tensorrt_llm/_torch/modules/mlp.py
  • tensorrt_llm/_torch/models/modeling_qwen2vl.py
  • tests/integration/defs/accuracy/test_llm_api_pytorch_multimodal.py
  • tests/unittest/_torch/modeling/test_modeling_qwen3vl.py
  • tensorrt_llm/_torch/models/modeling_qwen3vl.py
**/*.{cpp,cc,cxx,h,hpp,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

All TensorRT-LLM source files (.cpp, .h, .cu, .py, and other source files) should contain an NVIDIA copyright header with the year of latest meaningful modification

Files:

  • tensorrt_llm/_torch/modules/mlp.py
  • tensorrt_llm/_torch/models/modeling_qwen2vl.py
  • tests/integration/defs/accuracy/test_llm_api_pytorch_multimodal.py
  • tests/unittest/_torch/modeling/test_modeling_qwen3vl.py
  • tensorrt_llm/_torch/models/modeling_qwen3vl.py
🧠 Learnings (15)
📓 Common learnings
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 8063
File: tensorrt_llm/lora_manager.py:1080-1112
Timestamp: 2025-09-29T15:14:28.503Z
Learning: In tensorrt_llm/lora_manager.py, when calculating part_sizes for attn_qkv fused LoRA modules, the sizes are correctly multiplied by tp_size because model_config.num_heads and model_config.num_kv_heads are already divided by tp_size (per-TP-rank values), so multiplication is needed to get the original full concatenated dimension size. The interleave_fused_lora_weights_for_tp function provides proper validation.
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 8063
File: tensorrt_llm/lora_manager.py:1080-1112
Timestamp: 2025-09-29T15:14:28.503Z
Learning: In tensorrt_llm/lora_manager.py, when calculating part_sizes for attn_qkv fused LoRA modules, the sizes are correctly multiplied by tp_size because model_config.num_heads and model_config.num_kv_heads are already divided by tp_size (per-TP-rank values), so multiplication is needed to get the original full concatenated dimension size. The interleave_fused_lora_weights_for_tp function provides proper validation with asserts for total size and TP divisibility.
Learnt from: MatthiasKohl
Repo: NVIDIA/TensorRT-LLM PR: 6904
File: tensorrt_llm/_torch/attention_backend/trtllm.py:259-262
Timestamp: 2025-08-14T15:43:23.107Z
Learning: In TensorRT-LLM's attention backend, tensor parameters in the plan() method are assigned directly without validation (dtype, device, contiguity checks). This maintains consistency across all tensor inputs and follows the pattern of trusting callers to provide correctly formatted tensors.
Learnt from: thorjohnsen
Repo: NVIDIA/TensorRT-LLM PR: 6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
📚 Learning: 2025-09-09T09:40:45.658Z
Learnt from: fredricz-20070104
Repo: NVIDIA/TensorRT-LLM PR: 7645
File: tests/integration/test_lists/qa/llm_function_core.txt:648-648
Timestamp: 2025-09-09T09:40:45.658Z
Learning: In TensorRT-LLM test lists, it's common and intentional for the same test to appear in multiple test list files when they serve different purposes (e.g., llm_function_core.txt for comprehensive core functionality testing and llm_function_core_sanity.txt for quick sanity checks). This duplication allows tests to be run in different testing contexts.

Applied to files:

  • tests/integration/test_lists/test-db/l0_l40s.yml
  • tests/integration/defs/accuracy/test_llm_api_pytorch_multimodal.py
📚 Learning: 2025-09-17T02:48:52.732Z
Learnt from: tongyuantongyu
Repo: NVIDIA/TensorRT-LLM PR: 7781
File: tests/integration/test_lists/waives.txt:313-313
Timestamp: 2025-09-17T02:48:52.732Z
Learning: In TensorRT-LLM, `tests/integration/test_lists/waives.txt` is specifically for waiving/skipping tests, while other test list files like those in `test-db/` and `qa/` directories are for different test execution contexts (pre-merge, post-merge, QA tests). The same test appearing in both waives.txt and execution list files is intentional - the test is part of test suites but will be skipped due to the waiver.

Applied to files:

  • tests/integration/test_lists/test-db/l0_l40s.yml
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
Repo: NVIDIA/TensorRT-LLM PR: 6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tests/integration/test_lists/test-db/l0_l40s.yml
  • tests/integration/defs/accuracy/test_llm_api_pytorch_multimodal.py
📚 Learning: 2025-08-14T06:36:40.701Z
Learnt from: timlee0212
Repo: NVIDIA/TensorRT-LLM PR: 6886
File: tensorrt_llm/_torch/models/modeling_deepseekv3.py:0-0
Timestamp: 2025-08-14T06:36:40.701Z
Learning: In DeepSeek V3 model (tensorrt_llm/_torch/models/modeling_deepseekv3.py), the disagreement between AllReduce.__init__ guard and _compute_mlp_tp_size logic for MNNVL usage is expected by design. The AllReduce component and MLP TP-size computation intentionally use different criteria for MNNVL availability decisions.

Applied to files:

  • tensorrt_llm/_torch/modules/mlp.py
  • tensorrt_llm/_torch/models/modeling_qwen2vl.py
  • tensorrt_llm/_torch/models/modeling_qwen3vl.py
📚 Learning: 2025-08-26T06:07:02.166Z
Learnt from: shaharmor98
Repo: NVIDIA/TensorRT-LLM PR: 7231
File: tensorrt_llm/_torch/pyexecutor/_util.py:504-509
Timestamp: 2025-08-26T06:07:02.166Z
Learning: In tensorrt_llm/_torch/pyexecutor/_util.py, when calling model_engine.set_lora_model_config(), pass model_binding_config.mlp_hidden_size directly without multiplying by mapping.tp_size, as the mlp_hidden_size from get_bindings_model_config() is already the per-TP rank value needed for LoRA weight packaging.

Applied to files:

  • tensorrt_llm/_torch/modules/mlp.py
  • tensorrt_llm/_torch/models/modeling_qwen3vl.py
📚 Learning: 2025-09-29T15:14:28.503Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 8063
File: tensorrt_llm/lora_manager.py:1080-1112
Timestamp: 2025-09-29T15:14:28.503Z
Learning: In tensorrt_llm/lora_manager.py, when calculating part_sizes for attn_qkv fused LoRA modules, the sizes are correctly multiplied by tp_size because model_config.num_heads and model_config.num_kv_heads are already divided by tp_size (per-TP-rank values), so multiplication is needed to get the original full concatenated dimension size. The interleave_fused_lora_weights_for_tp function provides proper validation with asserts for total size and TP divisibility.

Applied to files:

  • tensorrt_llm/_torch/modules/mlp.py
  • tensorrt_llm/_torch/models/modeling_qwen3vl.py
📚 Learning: 2025-08-26T09:37:10.463Z
Learnt from: jiaganc
Repo: NVIDIA/TensorRT-LLM PR: 7031
File: tensorrt_llm/bench/dataclasses/configuration.py:90-104
Timestamp: 2025-08-26T09:37:10.463Z
Learning: In TensorRT-LLM, the `get_pytorch_perf_config()` method returns `self.pytorch_config` which can contain default `cuda_graph_config` values, so `llm_args` may already have this config before the extra options processing.

Applied to files:

  • tensorrt_llm/_torch/modules/mlp.py
📚 Learning: 2025-08-26T09:37:10.463Z
Learnt from: jiaganc
Repo: NVIDIA/TensorRT-LLM PR: 7031
File: tensorrt_llm/bench/dataclasses/configuration.py:90-104
Timestamp: 2025-08-26T09:37:10.463Z
Learning: In TensorRT-LLM's bench configuration, the `get_pytorch_perf_config()` method returns `self.pytorch_config` which is a Dict[str, Any] that can contain default values including `cuda_graph_config`, making the fallback `llm_args["cuda_graph_config"]` safe to use.

Applied to files:

  • tensorrt_llm/_torch/modules/mlp.py
📚 Learning: 2025-08-14T15:43:23.107Z
Learnt from: MatthiasKohl
Repo: NVIDIA/TensorRT-LLM PR: 6904
File: tensorrt_llm/_torch/attention_backend/trtllm.py:259-262
Timestamp: 2025-08-14T15:43:23.107Z
Learning: In TensorRT-LLM's attention backend, tensor parameters in the plan() method are assigned directly without validation (dtype, device, contiguity checks). This maintains consistency across all tensor inputs and follows the pattern of trusting callers to provide correctly formatted tensors.

Applied to files:

  • tensorrt_llm/_torch/models/modeling_qwen3vl.py
📚 Learning: 2025-08-15T06:46:53.813Z
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:53.813Z
Learning: In the TensorRT-LLM KV cache manager, SWA (Sliding Window Attention) combined with beam search is currently in a broken/non-functional state and is planned for future rework. During preparatory refactoring phases, code related to SWA+beam search may intentionally remain in a non-working state until the broader rework is completed.

Applied to files:

  • tensorrt_llm/_torch/models/modeling_qwen3vl.py
📚 Learning: 2025-12-19T06:31:54.973Z
Learnt from: nvyocox
Repo: NVIDIA/TensorRT-LLM PR: 10117
File: tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_attention.py:336-339
Timestamp: 2025-12-19T06:31:54.973Z
Learning: In tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_attention.py, the cast to torch.float16 for qkv_node before creating the AttentionPlugin is intentional and required because DriveOS LLM expects float16 dtype specifically. This should not be changed to preserve original dtype or made configurable for bfloat16 models in the DriveOS LLM ONNX export path.

Applied to files:

  • tensorrt_llm/_torch/models/modeling_qwen3vl.py
📚 Learning: 2025-08-14T21:04:50.248Z
Learnt from: thorjohnsen
Repo: NVIDIA/TensorRT-LLM PR: 6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.

Applied to files:

  • tensorrt_llm/_torch/models/modeling_qwen3vl.py
📚 Learning: 2025-08-19T12:45:11.997Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 7033
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:0-0
Timestamp: 2025-08-19T12:45:11.997Z
Learning: In tensorrt_llm/_torch/pyexecutor/model_engine.py, DoRA (Delta Orthogonal Rank Adaptation) functionality was removed from the PyTorch flow to eliminate issues with inverted DoRA detection logic. The original is_dora condition was checking if scaling_vec_pointer == 0, which was potentially incorrect.

Applied to files:

  • tensorrt_llm/_torch/models/modeling_qwen3vl.py
🧬 Code graph analysis (3)
tensorrt_llm/_torch/modules/mlp.py (2)
tensorrt_llm/mapping.py (1)
  • Mapping (361-540)
tensorrt_llm/_torch/distributed/communicator.py (2)
  • tp_size (64-65)
  • pp_size (60-61)
tests/integration/defs/accuracy/test_llm_api_pytorch_multimodal.py (3)
tests/integration/defs/accuracy/accuracy_core.py (1)
  • LlmapiAccuracyTestHarness (898-909)
tensorrt_llm/sampling_params.py (1)
  • SamplingParams (113-552)
tensorrt_llm/evaluate/lm_eval.py (3)
  • MMMU (696-749)
  • evaluate (428-463)
  • evaluate (795-849)
tensorrt_llm/_torch/models/modeling_qwen3vl.py (3)
tensorrt_llm/mapping.py (1)
  • Mapping (361-540)
tensorrt_llm/_torch/modules/linear.py (1)
  • Linear (2045-2297)
tensorrt_llm/_torch/models/modeling_qwen3.py (2)
  • forward (117-164)
  • forward (193-226)
🪛 Ruff (0.14.10)
tensorrt_llm/_torch/models/modeling_qwen3vl.py

654-654: Unused method argument: kwargs

(ARG002)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (8)
tests/integration/test_lists/test-db/l0_l40s.yml (1)

17-26: LGTM!

The addition of multimodal modeling tests for Qwen3VL is consistent with the existing test structure and aligns with the PR objectives.

tests/integration/defs/accuracy/references/mmmu.yaml (1)

29-30: LGTM!

The accuracy reference for Qwen3-VL-8B-Instruct is properly formatted and consistent with other model entries.

tests/unittest/_torch/modeling/test_modeling_qwen3vl.py (1)

256-263: LGTM!

The test scenario for disable_fuse_rope=True provides appropriate coverage for the new functionality. The scenario integrates well with the existing test framework.

tensorrt_llm/_torch/modules/mlp.py (1)

36-47: The rank mapping is correct and intentional—this design enables all-reduce within smaller TP groups as intended for Attention DP.

When overridden_tp_size is used, the global rank remains unchanged, but the TP and PP group memberships are recalculated based on the new tp_size and pp_size values. The Mapping class correctly computes tp_rank = rank % (tp_size * cp_size) // cp_size and pp_rank = rank // (tp_size * cp_size), so all-reduce operations automatically target the correct smaller TP groups. The world_size is mathematically preserved (new_tp_size * new_pp_size = old_tp_size * old_pp_size), and the comment "Misuse pp_size here to perform all-reduce within smaller groups" reflects the deliberate reuse of the PP dimension for grouping. This pattern is used consistently across multiple models (Qwen3, DeepSeekV3, Llama, GLM) for enable_attention_dp scenarios and works correctly without issues.
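A quick numeric check of those formulas (a standalone sketch, not code from the PR): with world_size = 4, overriding tp_size from 4 to 1 and pp_size from 1 to 4 turns one 4-rank TP group into four single-rank TP groups, with the former TP peers now distinguished only by pp_rank.

    def tp_pp_rank(rank: int, tp_size: int, cp_size: int = 1):
        # Same formulas as quoted above from the Mapping class.
        tp_rank = rank % (tp_size * cp_size) // cp_size
        pp_rank = rank // (tp_size * cp_size)
        return tp_rank, pp_rank


    # Original mapping (tp_size=4, pp_size=1): ranks 0..3 share one TP group.
    print([tp_pp_rank(r, tp_size=4) for r in range(4)])  # [(0, 0), (1, 0), (2, 0), (3, 0)]

    # Overridden mapping (tp_size=1, pp_size=4): each rank is its own TP group.
    print([tp_pp_rank(r, tp_size=1) for r in range(4)])  # [(0, 0), (0, 1), (0, 2), (0, 3)]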

Likely an incorrect or invalid review comment.

tests/integration/defs/accuracy/test_llm_api_pytorch_multimodal.py (1)

321-336: LGTM!

The new TestQwen3VL test class follows the established patterns in this file, consistent with TestQwen3VL_MOE. The test configuration is appropriate for validating Qwen3-VL-8B-Instruct model accuracy.

tensorrt_llm/_torch/models/modeling_qwen3vl.py (3)

18-18: LGTM!

The Mapping import is correctly added to support creating custom mappings for attention DP configuration.


363-369: LGTM!

The reduce_output logic correctly disables all-reduce in vision attention when attention DP is enabled, allowing the LLM side to handle parallelism appropriately.


383-383: LGTM!

Correctly overrides tp_size to 1 for vision MLP when attention DP is enabled, consistent with the attention changes.

@yechank-nvidia
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #30689 [ run ] triggered by Bot. Commit: 890b6a3

@jaedeok-nvidia (Collaborator) left a comment

Thanks @yechank-nvidia for your quick update on the attention dp feature. Overall looks good to me, although I've left several comments. Thx!

@tensorrt-cicd
Collaborator

PR_Github #30689 [ run ] completed with state SUCCESS. Commit: 890b6a3
/LLM/main/L0_MergeRequest_PR pipeline #23679 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@yechank-nvidia force-pushed the multimodal_attention_dp branch from 1d4f849 to 890b6a3 on January 7, 2026 02:15
@yechank-nvidia changed the title from "[None][fix] Support AttentionDP on Qwen3-VL and fix test" to "[None][fix] Enable AttentionDP on Qwen3-VL and fix test" on Jan 7, 2026
@yechank-nvidia
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #30837 [ run ] triggered by Bot. Commit: e5e3184

@yechank-nvidia
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #30839 [ run ] triggered by Bot. Commit: 0137afe

@tensorrt-cicd
Collaborator

PR_Github #30837 [ run ] completed with state ABORTED. Commit: e5e3184

@tensorrt-cicd
Collaborator

PR_Github #30839 [ run ] completed with state SUCCESS. Commit: 0137afe
/LLM/main/L0_MergeRequest_PR pipeline #23816 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@yechank-nvidia
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #30949 [ run ] triggered by Bot. Commit: 0137afe

@tensorrt-cicd
Collaborator

PR_Github #30949 [ run ] completed with state SUCCESS. Commit: 0137afe
/LLM/main/L0_MergeRequest_PR pipeline #23910 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@yechank-nvidia force-pushed the multimodal_attention_dp branch from 0137afe to 1440e3b on January 8, 2026 05:30
@yechank-nvidia
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #31005 [ run ] triggered by Bot. Commit: 1440e3b

@tensorrt-cicd
Collaborator

PR_Github #31005 [ run ] completed with state SUCCESS. Commit: 1440e3b
/LLM/main/L0_MergeRequest_PR pipeline #23957 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@yechank-nvidia force-pushed the multimodal_attention_dp branch from 1440e3b to a56eba6 on January 8, 2026 12:36
@yechank-nvidia
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #31064 [ run ] triggered by Bot. Commit: a56eba6

@mikeiovine (Collaborator) left a comment

Approved changes in modules/mlp.py

@tensorrt-cicd
Collaborator

PR_Github #31064 [ run ] completed with state SUCCESS. Commit: a56eba6
/LLM/main/L0_MergeRequest_PR pipeline #24002 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@yechank-nvidia
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #31139 [ run ] triggered by Bot. Commit: a56eba6

@tensorrt-cicd
Collaborator

PR_Github #31139 [ run ] completed with state SUCCESS. Commit: a56eba6
/LLM/main/L0_MergeRequest_PR pipeline #24055 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@yechank-nvidia
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #31186 [ run ] triggered by Bot. Commit: a56eba6

@tensorrt-cicd
Collaborator

PR_Github #31186 [ run ] completed with state SUCCESS. Commit: a56eba6
/LLM/main/L0_MergeRequest_PR pipeline #24098 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@yechank-nvidia
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #31221 [ run ] triggered by Bot. Commit: a56eba6

@tensorrt-cicd
Collaborator

PR_Github #31221 [ run ] completed with state SUCCESS. Commit: a56eba6
/LLM/main/L0_MergeRequest_PR pipeline #24129 completed with status: 'SUCCESS'

@yechank-nvidia merged commit 7295af6 into NVIDIA:main on Jan 9, 2026
5 checks passed
videodanchik pushed a commit to videodanchik/TensorRT-LLM that referenced this pull request Jan 14, 2026