
Conversation

@yechank-nvidia
Collaborator

@yechank-nvidia commented Jan 6, 2026

With this fix, Qwen3-VL-Dense/MoE can run Attention DP on the LLM side. Note that other multimodal models have not been tested.

Summary by CodeRabbit

  • New Features

    • Added support for the Qwen3-VL-8B-Instruct multimodal model with a 55.11% MMMU accuracy reference
    • Enhanced tensor parallelism configuration for optimized vision model inference
  • Tests

    • Expanded multimodal modeling test coverage
    • Added test scenarios for improved model compatibility


@yechank-nvidia
Collaborator Author

/bot run

@yechank-nvidia self-assigned this on Jan 6, 2026
@yechank-nvidia added the Multimodal label on Jan 6, 2026
@tensorrt-cicd
Collaborator

PR_Github #30687 [ run ] triggered by Bot. Commit: 54121ed

@coderabbitai
Contributor

coderabbitai bot commented Jan 6, 2026

📝 Walkthrough

These changes introduce dynamic tensor parallel configuration for vision models by computing mapping-aware parameters during initialization. Key modifications include updating Qwen2.5-VL attention initialization, refactoring Qwen3-VL vision components to compute and propagate custom mappings, updating forward signatures, and adding test coverage for the Qwen3-VL-8B-Instruct model.

Changes

  • Model Architecture Updates (tensorrt_llm/_torch/models/modeling_qwen2vl.py, tensorrt_llm/_torch/models/modeling_qwen3vl.py, tensorrt_llm/_torch/modules/mlp.py):
    Qwen2VL removes the reduce_output parameter from Qwen2_5_VLVisionAttention.__init__. Qwen3VL introduces conditional mapping computation when attention data parallelism is enabled, updates Qwen3VLVisionBlock.forward to accept pixel_values instead of hidden_states, stores model_config on the instance for dynamic mapping decisions, and propagates the computed mappings to the Linear layers in VisionAttention, VisionMLP, and VisionPatchMerger. MLP adds an optional overridden_tp_size parameter to conditionally compute and use a derived mapping for layer initialization (see the sketch after this list).
  • Test Data and Configuration (tests/integration/defs/accuracy/references/mmmu.yaml, tests/integration/test_lists/test-db/l0_l40s.yml):
    Adds a reference accuracy entry for Qwen3-VL-8B-Instruct (55.11) to the MMMU benchmarks and appends multimodal modeling test entries, including test_modeling_qwen3vl.py::TestQwen3VL::test_all, to the L0/L40S test suite.
  • Test Implementations (tests/integration/defs/accuracy/test_llm_api_pytorch_multimodal.py, tests/unittest/_torch/modeling/test_modeling_qwen3vl.py):
    Adds a TestQwen3VL test class with an MMMU evaluation harness for Qwen3-VL-8B-Instruct, and introduces test scenarios covering Qwen3VL with disable_fuse_rope=True under non-CUDA-graph execution paths.
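For context on the MLP change above, here is a minimal sketch of how an optional overridden_tp_size could be folded into a derived Mapping at initialization time. This is an illustration, not the PR's code: the helper name is invented, the Mapping constructor arguments are assumed from tensorrt_llm/mapping.py, and cp_size is assumed to be 1.

    from typing import Optional

    from tensorrt_llm.mapping import Mapping


    def derive_mapping(mapping: Mapping, overridden_tp_size: Optional[int]) -> Mapping:
        """Shrink TP groups to overridden_tp_size while keeping world_size fixed.

        The leftover ranks are folded into pp_size (the "misused" PP dimension),
        so all-reduce then runs inside the smaller TP groups.
        """
        if overridden_tp_size is None or overridden_tp_size == mapping.tp_size:
            return mapping
        assert mapping.tp_size % overridden_tp_size == 0
        return Mapping(
            world_size=mapping.world_size,
            rank=mapping.rank,
            gpus_per_node=mapping.gpus_per_node,
            tp_size=overridden_tp_size,
            pp_size=mapping.world_size // overridden_tp_size,  # assumes cp_size == 1
        )

A layer would then build its Linear shards from derive_mapping(config.mapping, overridden_tp_size) rather than config.mapping, which is roughly what the summary describes for mlp.py and the Qwen3-VL vision modules.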

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
  • Description check ⚠️ Warning: The PR description is minimal and lacks detail about the implementation approach, specific changes, and test coverage required by the template sections. Resolution: expand the description to cover the specific implementation changes (Mapping-aware tensor parallel config, parameter additions), how the test coverage validates the changes, and confirmation that the PR checklist items were reviewed.

✅ Passed checks (1 passed)

  • Title check ✅ Passed: The title clearly identifies the main change, enabling AttentionDP on Qwen3-VL and fixing tests, which directly aligns with the core modifications in the changeset.


@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/models/modeling_qwen2vl.py (1)

505-524: Fix class name reference and add required NVIDIA copyright header.

The class name should be Qwen2_5_VLVisionAttention (not QWen2VLVisionAttention). The parameter propagation to the base Attention class is correct—the base class accepts reduce_output: bool = True and the subclass properly passes it via super().__init__().

However, the file is missing the required NVIDIA copyright header with the year of latest meaningful modification, as mandated by coding guidelines for all TensorRT-LLM source files.

🤖 Fix all issues with AI Agents
In tensorrt_llm/_torch/models/modeling_qwen3vl.py:
  • Line 654: The forward method's return type annotation is wrong: it returns (hidden_states, deepstack_feature_lists) but is annotated as torch.Tensor. Update the signature to forward(self, pixel_values: torch.Tensor, grid_thw: torch.Tensor, **kwargs) -> Tuple[torch.Tensor, List[torch.Tensor]] (or Sequence[torch.Tensor] if preferred), add the corresponding typing imports (from typing import Tuple, List) at the top of the module, and leave **kwargs unused for API consistency.

In tensorrt_llm/_torch/modules/mlp.py:
  • Around lines 51-53: The up_lora LoraLayer is built using intermediate_size // config.mapping.tp_size, which mismatches up_proj when an overridden_tp_size is used. Update the up_lora construction to use the effective tp_size used by up_proj (overridden_tp_size when provided, otherwise config.mapping.tp_size) so the LoraLayer shard size matches the Linear layer shard size; derive a single tp_size variable and reuse it for both up_proj and up_lora (see the sketch below).
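As a rough illustration of the two fixes above (a sketch with assumed names, not the code in this PR), the corrected signature and a shared effective tp_size could look like this:

    from typing import List, Optional, Tuple

    import torch


    class VisionModelSketch:
        """Placeholder class; only the forward signature matters here."""

        def forward(self, pixel_values: torch.Tensor, grid_thw: torch.Tensor,
                    **kwargs) -> Tuple[torch.Tensor, List[torch.Tensor]]:
            # The annotation now matches the (hidden_states, deepstack_feature_lists)
            # tuple the real method returns; **kwargs stays unused for API parity.
            raise NotImplementedError


    def effective_tp_size(mapping_tp_size: int,
                          overridden_tp_size: Optional[int]) -> int:
        """Return the single tp_size that both up_proj and up_lora should share."""
        return overridden_tp_size if overridden_tp_size is not None else mapping_tp_size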
🧹 Nitpick comments (3)
tensorrt_llm/_torch/modules/mlp.py (1)

35-49: Consider refactoring the mapping override logic.

The current approach of "misusing" pp_size to create smaller TP groups is a clever workaround but may be fragile. Consider whether a more explicit mechanism for Attention DP would be clearer, such as:

  • Adding explicit support in the Mapping class for attention data parallel groups
  • Using a dedicated attn_tp_size parameter that's already mentioned in learnings

This would make the code more maintainable and less prone to confusion.

tensorrt_llm/_torch/models/modeling_qwen3vl.py (2)

390-390: Consider removing unused attribute.

self.model_config is stored but not referenced elsewhere in Qwen3VLVisionBlock. If this is intentional for debugging or future use, consider adding a brief comment; otherwise, it can be removed.


442-476: LGTM with suggestion for clarity.

The mapping override logic correctly configures linear layers for attention DP by treating the TP dimension as PP for all-reduce purposes. The implementation is sound.

Consider rephrasing the comment on line 448 for clarity:

-            # "Misuse" pp_size here to perform all-reduce within smaller groups
+            # Use pp_size to route all-reduce to the original TP group when TP is overridden to 1
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 810249c and 54121ed.

📒 Files selected for processing (7)
  • tensorrt_llm/_torch/models/modeling_qwen2vl.py
  • tensorrt_llm/_torch/models/modeling_qwen3vl.py
  • tensorrt_llm/_torch/modules/mlp.py
  • tests/integration/defs/accuracy/references/mmmu.yaml
  • tests/integration/defs/accuracy/test_llm_api_pytorch_multimodal.py
  • tests/integration/test_lists/test-db/l0_l40s.yml
  • tests/unittest/_torch/modeling/test_modeling_qwen3vl.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: The code developed for TensorRT-LLM should conform to Python 3.8+
Indent Python code with 4 spaces. Do not use tabs
Always maintain the namespace when importing Python modules, even if only one class or function from a module is used
Python filenames should use snake_case (e.g., some_file.py)
Python classes should use PascalCase (e.g., class SomeClass)
Python functions and methods should use snake_case (e.g., def my_awesome_function():)
Python local variables should use snake_case, with prefix k for variable names that start with a number (e.g., k_99th_percentile)
Python global variables should use upper snake_case with prefix G (e.g., G_MY_GLOBAL)
Python constants should use upper snake_case (e.g., MY_CONSTANT)
Avoid shadowing variables declared in an outer scope in Python
Initialize all externally visible members of a Python class in the constructor
For Python interfaces that may be used outside a file, prefer docstrings over comments
Use comments in Python for code within a function, or interfaces that are local to a file
Use Google-style docstrings for Python classes and functions, which can be parsed by Sphinx
Python attributes and variables can be documented inline with the format """<type>: Description"""
Avoid using reflection in Python when functionality can be easily achieved without reflection
When using try-except blocks in Python, limit the except clause to the smallest set of errors possible
When using try-except blocks in Python to handle multiple possible variable types (duck-typing), keep the body of the try as small as possible and use the else block for the main logic

Files:

  • tensorrt_llm/_torch/modules/mlp.py
  • tensorrt_llm/_torch/models/modeling_qwen2vl.py
  • tests/integration/defs/accuracy/test_llm_api_pytorch_multimodal.py
  • tests/unittest/_torch/modeling/test_modeling_qwen3vl.py
  • tensorrt_llm/_torch/models/modeling_qwen3vl.py
**/*.{cpp,cc,cxx,h,hpp,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

All TensorRT-LLM source files (.cpp, .h, .cu, .py, and other source files) should contain an NVIDIA copyright header with the year of latest meaningful modification

Files:

  • tensorrt_llm/_torch/modules/mlp.py
  • tensorrt_llm/_torch/models/modeling_qwen2vl.py
  • tests/integration/defs/accuracy/test_llm_api_pytorch_multimodal.py
  • tests/unittest/_torch/modeling/test_modeling_qwen3vl.py
  • tensorrt_llm/_torch/models/modeling_qwen3vl.py
🧠 Learnings (15)
📓 Common learnings
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 8063
File: tensorrt_llm/lora_manager.py:1080-1112
Timestamp: 2025-09-29T15:14:28.503Z
Learning: In tensorrt_llm/lora_manager.py, when calculating part_sizes for attn_qkv fused LoRA modules, the sizes are correctly multiplied by tp_size because model_config.num_heads and model_config.num_kv_heads are already divided by tp_size (per-TP-rank values), so multiplication is needed to get the original full concatenated dimension size. The interleave_fused_lora_weights_for_tp function provides proper validation.
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 8063
File: tensorrt_llm/lora_manager.py:1080-1112
Timestamp: 2025-09-29T15:14:28.503Z
Learning: In tensorrt_llm/lora_manager.py, when calculating part_sizes for attn_qkv fused LoRA modules, the sizes are correctly multiplied by tp_size because model_config.num_heads and model_config.num_kv_heads are already divided by tp_size (per-TP-rank values), so multiplication is needed to get the original full concatenated dimension size. The interleave_fused_lora_weights_for_tp function provides proper validation with asserts for total size and TP divisibility.
Learnt from: MatthiasKohl
Repo: NVIDIA/TensorRT-LLM PR: 6904
File: tensorrt_llm/_torch/attention_backend/trtllm.py:259-262
Timestamp: 2025-08-14T15:43:23.107Z
Learning: In TensorRT-LLM's attention backend, tensor parameters in the plan() method are assigned directly without validation (dtype, device, contiguity checks). This maintains consistency across all tensor inputs and follows the pattern of trusting callers to provide correctly formatted tensors.
Learnt from: thorjohnsen
Repo: NVIDIA/TensorRT-LLM PR: 6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
📚 Learning: 2025-09-09T09:40:45.658Z
Learnt from: fredricz-20070104
Repo: NVIDIA/TensorRT-LLM PR: 7645
File: tests/integration/test_lists/qa/llm_function_core.txt:648-648
Timestamp: 2025-09-09T09:40:45.658Z
Learning: In TensorRT-LLM test lists, it's common and intentional for the same test to appear in multiple test list files when they serve different purposes (e.g., llm_function_core.txt for comprehensive core functionality testing and llm_function_core_sanity.txt for quick sanity checks). This duplication allows tests to be run in different testing contexts.

Applied to files:

  • tests/integration/test_lists/test-db/l0_l40s.yml
  • tests/integration/defs/accuracy/test_llm_api_pytorch_multimodal.py
📚 Learning: 2025-09-17T02:48:52.732Z
Learnt from: tongyuantongyu
Repo: NVIDIA/TensorRT-LLM PR: 7781
File: tests/integration/test_lists/waives.txt:313-313
Timestamp: 2025-09-17T02:48:52.732Z
Learning: In TensorRT-LLM, `tests/integration/test_lists/waives.txt` is specifically for waiving/skipping tests, while other test list files like those in `test-db/` and `qa/` directories are for different test execution contexts (pre-merge, post-merge, QA tests). The same test appearing in both waives.txt and execution list files is intentional - the test is part of test suites but will be skipped due to the waiver.

Applied to files:

  • tests/integration/test_lists/test-db/l0_l40s.yml
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
Repo: NVIDIA/TensorRT-LLM PR: 6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tests/integration/test_lists/test-db/l0_l40s.yml
  • tests/integration/defs/accuracy/test_llm_api_pytorch_multimodal.py
📚 Learning: 2025-08-14T06:36:40.701Z
Learnt from: timlee0212
Repo: NVIDIA/TensorRT-LLM PR: 6886
File: tensorrt_llm/_torch/models/modeling_deepseekv3.py:0-0
Timestamp: 2025-08-14T06:36:40.701Z
Learning: In DeepSeek V3 model (tensorrt_llm/_torch/models/modeling_deepseekv3.py), the disagreement between AllReduce.__init__ guard and _compute_mlp_tp_size logic for MNNVL usage is expected by design. The AllReduce component and MLP TP-size computation intentionally use different criteria for MNNVL availability decisions.

Applied to files:

  • tensorrt_llm/_torch/modules/mlp.py
  • tensorrt_llm/_torch/models/modeling_qwen2vl.py
  • tensorrt_llm/_torch/models/modeling_qwen3vl.py
📚 Learning: 2025-08-26T06:07:02.166Z
Learnt from: shaharmor98
Repo: NVIDIA/TensorRT-LLM PR: 7231
File: tensorrt_llm/_torch/pyexecutor/_util.py:504-509
Timestamp: 2025-08-26T06:07:02.166Z
Learning: In tensorrt_llm/_torch/pyexecutor/_util.py, when calling model_engine.set_lora_model_config(), pass model_binding_config.mlp_hidden_size directly without multiplying by mapping.tp_size, as the mlp_hidden_size from get_bindings_model_config() is already the per-TP rank value needed for LoRA weight packaging.

Applied to files:

  • tensorrt_llm/_torch/modules/mlp.py
  • tensorrt_llm/_torch/models/modeling_qwen3vl.py
📚 Learning: 2025-09-29T15:14:28.503Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 8063
File: tensorrt_llm/lora_manager.py:1080-1112
Timestamp: 2025-09-29T15:14:28.503Z
Learning: In tensorrt_llm/lora_manager.py, when calculating part_sizes for attn_qkv fused LoRA modules, the sizes are correctly multiplied by tp_size because model_config.num_heads and model_config.num_kv_heads are already divided by tp_size (per-TP-rank values), so multiplication is needed to get the original full concatenated dimension size. The interleave_fused_lora_weights_for_tp function provides proper validation with asserts for total size and TP divisibility.

Applied to files:

  • tensorrt_llm/_torch/modules/mlp.py
  • tensorrt_llm/_torch/models/modeling_qwen3vl.py
📚 Learning: 2025-08-26T09:37:10.463Z
Learnt from: jiaganc
Repo: NVIDIA/TensorRT-LLM PR: 7031
File: tensorrt_llm/bench/dataclasses/configuration.py:90-104
Timestamp: 2025-08-26T09:37:10.463Z
Learning: In TensorRT-LLM, the `get_pytorch_perf_config()` method returns `self.pytorch_config` which can contain default `cuda_graph_config` values, so `llm_args` may already have this config before the extra options processing.

Applied to files:

  • tensorrt_llm/_torch/modules/mlp.py
📚 Learning: 2025-08-26T09:37:10.463Z
Learnt from: jiaganc
Repo: NVIDIA/TensorRT-LLM PR: 7031
File: tensorrt_llm/bench/dataclasses/configuration.py:90-104
Timestamp: 2025-08-26T09:37:10.463Z
Learning: In TensorRT-LLM's bench configuration, the `get_pytorch_perf_config()` method returns `self.pytorch_config` which is a Dict[str, Any] that can contain default values including `cuda_graph_config`, making the fallback `llm_args["cuda_graph_config"]` safe to use.

Applied to files:

  • tensorrt_llm/_torch/modules/mlp.py
📚 Learning: 2025-08-14T15:43:23.107Z
Learnt from: MatthiasKohl
Repo: NVIDIA/TensorRT-LLM PR: 6904
File: tensorrt_llm/_torch/attention_backend/trtllm.py:259-262
Timestamp: 2025-08-14T15:43:23.107Z
Learning: In TensorRT-LLM's attention backend, tensor parameters in the plan() method are assigned directly without validation (dtype, device, contiguity checks). This maintains consistency across all tensor inputs and follows the pattern of trusting callers to provide correctly formatted tensors.

Applied to files:

  • tensorrt_llm/_torch/models/modeling_qwen3vl.py
📚 Learning: 2025-08-15T06:46:53.813Z
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:53.813Z
Learning: In the TensorRT-LLM KV cache manager, SWA (Sliding Window Attention) combined with beam search is currently in a broken/non-functional state and is planned for future rework. During preparatory refactoring phases, code related to SWA+beam search may intentionally remain in a non-working state until the broader rework is completed.

Applied to files:

  • tensorrt_llm/_torch/models/modeling_qwen3vl.py
📚 Learning: 2025-12-19T06:31:54.973Z
Learnt from: nvyocox
Repo: NVIDIA/TensorRT-LLM PR: 10117
File: tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_attention.py:336-339
Timestamp: 2025-12-19T06:31:54.973Z
Learning: In tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_attention.py, the cast to torch.float16 for qkv_node before creating the AttentionPlugin is intentional and required because DriveOS LLM expects float16 dtype specifically. This should not be changed to preserve original dtype or made configurable for bfloat16 models in the DriveOS LLM ONNX export path.

Applied to files:

  • tensorrt_llm/_torch/models/modeling_qwen3vl.py
📚 Learning: 2025-08-14T21:04:50.248Z
Learnt from: thorjohnsen
Repo: NVIDIA/TensorRT-LLM PR: 6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.

Applied to files:

  • tensorrt_llm/_torch/models/modeling_qwen3vl.py
📚 Learning: 2025-08-19T12:45:11.997Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 7033
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:0-0
Timestamp: 2025-08-19T12:45:11.997Z
Learning: In tensorrt_llm/_torch/pyexecutor/model_engine.py, DoRA (Delta Orthogonal Rank Adaptation) functionality was removed from the PyTorch flow to eliminate issues with inverted DoRA detection logic. The original is_dora condition was checking if scaling_vec_pointer == 0, which was potentially incorrect.

Applied to files:

  • tensorrt_llm/_torch/models/modeling_qwen3vl.py
🧬 Code graph analysis (3)
tensorrt_llm/_torch/modules/mlp.py (2)
tensorrt_llm/mapping.py (1)
  • Mapping (361-540)
tensorrt_llm/_torch/distributed/communicator.py (2)
  • tp_size (64-65)
  • pp_size (60-61)
tests/integration/defs/accuracy/test_llm_api_pytorch_multimodal.py (3)
tests/integration/defs/accuracy/accuracy_core.py (1)
  • LlmapiAccuracyTestHarness (898-909)
tensorrt_llm/sampling_params.py (1)
  • SamplingParams (113-552)
tensorrt_llm/evaluate/lm_eval.py (3)
  • MMMU (696-749)
  • evaluate (428-463)
  • evaluate (795-849)
tensorrt_llm/_torch/models/modeling_qwen3vl.py (3)
tensorrt_llm/mapping.py (1)
  • Mapping (361-540)
tensorrt_llm/_torch/modules/linear.py (1)
  • Linear (2045-2297)
tensorrt_llm/_torch/models/modeling_qwen3.py (2)
  • forward (117-164)
  • forward (193-226)
🪛 Ruff (0.14.10)
tensorrt_llm/_torch/models/modeling_qwen3vl.py

654-654: Unused method argument: kwargs

(ARG002)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (8)
tests/integration/test_lists/test-db/l0_l40s.yml (1)

17-26: LGTM!

The addition of multimodal modeling tests for Qwen3VL is consistent with the existing test structure and aligns with the PR objectives.

tests/integration/defs/accuracy/references/mmmu.yaml (1)

29-30: LGTM!

The accuracy reference for Qwen3-VL-8B-Instruct is properly formatted and consistent with other model entries.

tests/unittest/_torch/modeling/test_modeling_qwen3vl.py (1)

256-263: LGTM!

The test scenario for disable_fuse_rope=True provides appropriate coverage for the new functionality. The scenario integrates well with the existing test framework.

tensorrt_llm/_torch/modules/mlp.py (1)

36-47: The rank mapping is correct and intentional—this design enables all-reduce within smaller TP groups as intended for Attention DP.

When overridden_tp_size is used, the global rank remains unchanged, but the TP and PP group memberships are recalculated based on the new tp_size and pp_size values. The Mapping class correctly computes tp_rank = rank % (tp_size * cp_size) // cp_size and pp_rank = rank // (tp_size * cp_size), so all-reduce operations automatically target the correct smaller TP groups. The world_size is mathematically preserved (new_tp_size * new_pp_size = old_tp_size * old_pp_size), and the comment "Misuse pp_size here to perform all-reduce within smaller groups" reflects the deliberate reuse of the PP dimension for grouping. This pattern is used consistently across multiple models (Qwen3, DeepSeekV3, Llama, GLM) for enable_attention_dp scenarios and works correctly without issues.
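A quick numeric check of those formulas (a standalone sketch, not code from the PR): with world_size = 4, overriding tp_size from 4 to 1 and pp_size from 1 to 4 turns one 4-rank TP group into four single-rank TP groups, with the former TP peers now distinguished only by pp_rank.

    def tp_pp_rank(rank: int, tp_size: int, cp_size: int = 1):
        # Same formulas as quoted above from the Mapping class.
        tp_rank = rank % (tp_size * cp_size) // cp_size
        pp_rank = rank // (tp_size * cp_size)
        return tp_rank, pp_rank


    # Original mapping (tp_size=4, pp_size=1): ranks 0..3 share one TP group.
    print([tp_pp_rank(r, tp_size=4) for r in range(4)])  # [(0, 0), (1, 0), (2, 0), (3, 0)]

    # Overridden mapping (tp_size=1, pp_size=4): each rank is its own TP group.
    print([tp_pp_rank(r, tp_size=1) for r in range(4)])  # [(0, 0), (0, 1), (0, 2), (0, 3)]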

Likely an incorrect or invalid review comment.

tests/integration/defs/accuracy/test_llm_api_pytorch_multimodal.py (1)

321-336: LGTM!

The new TestQwen3VL test class follows the established patterns in this file, consistent with TestQwen3VL_MOE. The test configuration is appropriate for validating Qwen3-VL-8B-Instruct model accuracy.

tensorrt_llm/_torch/models/modeling_qwen3vl.py (3)

18-18: LGTM!

The Mapping import is correctly added to support creating custom mappings for attention DP configuration.


363-369: LGTM!

The reduce_output logic correctly disables all-reduce in vision attention when attention DP is enabled, allowing the LLM side to handle parallelism appropriately.


383-383: LGTM!

Correctly overrides tp_size to 1 for vision MLP when attention DP is enabled, consistent with the attention changes.

@yechank-nvidia
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #30689 [ run ] triggered by Bot. Commit: 890b6a3

@jaedeok-nvidia (Collaborator) left a comment

Thanks @yechank-nvidia for your quick update on the attention dp feature. Overall looks good to me, although I've left several comments. Thx!

@tensorrt-cicd
Collaborator

PR_Github #30689 [ run ] completed with state SUCCESS. Commit: 890b6a3
/LLM/main/L0_MergeRequest_PR pipeline #23679 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@yechank-nvidia force-pushed the multimodal_attention_dp branch from 1d4f849 to 890b6a3 on January 7, 2026 02:15
@yechank-nvidia changed the title from "[None][fix] Support AttentionDP on Qwen3-VL and fix test" to "[None][fix] Enable AttentionDP on Qwen3-VL and fix test" on Jan 7, 2026
@yechank-nvidia
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #30837 [ run ] triggered by Bot. Commit: e5e3184

@yechank-nvidia
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #30839 [ run ] triggered by Bot. Commit: 0137afe

@tensorrt-cicd
Collaborator

PR_Github #30837 [ run ] completed with state ABORTED. Commit: e5e3184

@tensorrt-cicd
Collaborator

PR_Github #30839 [ run ] completed with state SUCCESS. Commit: 0137afe
/LLM/main/L0_MergeRequest_PR pipeline #23816 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@yechank-nvidia
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #30949 [ run ] triggered by Bot. Commit: 0137afe

@tensorrt-cicd
Collaborator

PR_Github #30949 [ run ] completed with state SUCCESS. Commit: 0137afe
/LLM/main/L0_MergeRequest_PR pipeline #23910 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@yechank-nvidia force-pushed the multimodal_attention_dp branch from 0137afe to 1440e3b on January 8, 2026 05:30
@yechank-nvidia
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #31005 [ run ] triggered by Bot. Commit: 1440e3b

@tensorrt-cicd
Collaborator

PR_Github #31005 [ run ] completed with state SUCCESS. Commit: 1440e3b
/LLM/main/L0_MergeRequest_PR pipeline #23957 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@yechank-nvidia force-pushed the multimodal_attention_dp branch from 1440e3b to a56eba6 on January 8, 2026 12:36
@yechank-nvidia
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #31064 [ run ] triggered by Bot. Commit: a56eba6

@mikeiovine (Collaborator) left a comment

Approved changes in modules/mlp.py

@tensorrt-cicd
Collaborator

PR_Github #31064 [ run ] completed with state SUCCESS. Commit: a56eba6
/LLM/main/L0_MergeRequest_PR pipeline #24002 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@yechank-nvidia
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #31139 [ run ] triggered by Bot. Commit: a56eba6

@tensorrt-cicd
Collaborator

PR_Github #31139 [ run ] completed with state SUCCESS. Commit: a56eba6
/LLM/main/L0_MergeRequest_PR pipeline #24055 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@yechank-nvidia
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #31186 [ run ] triggered by Bot. Commit: a56eba6

@tensorrt-cicd
Collaborator

PR_Github #31186 [ run ] completed with state SUCCESS. Commit: a56eba6
/LLM/main/L0_MergeRequest_PR pipeline #24098 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@yechank-nvidia
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #31221 [ run ] triggered by Bot. Commit: a56eba6

@tensorrt-cicd
Collaborator

PR_Github #31221 [ run ] completed with state SUCCESS. Commit: a56eba6
/LLM/main/L0_MergeRequest_PR pipeline #24129 completed with status: 'SUCCESS'

@yechank-nvidia merged commit 7295af6 into NVIDIA:main on Jan 9, 2026
5 checks passed
videodanchik pushed a commit to videodanchik/TensorRT-LLM that referenced this pull request Jan 14, 2026