[None][fix] Enable AttentionDP on Qwen3-VL and fix test #10435
Conversation
/bot run
PR_Github #30687 [ run ] triggered by Bot. Commit:
📝 Walkthrough
These changes introduce dynamic tensor parallel configuration for vision models by computing mapping-aware parameters during initialization. Key modifications include updating Qwen2.5-VL attention initialization, refactoring Qwen3-VL vision components to compute and propagate custom mappings, updating forward signatures, and adding test coverage for the Qwen3-VL-8B-Instruct model.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Pre-merge checks and finishing touches
❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
✨ Finishing touches
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/models/modeling_qwen2vl.py (1)
505-524: Fix class name reference and add required NVIDIA copyright header.
The class name should be Qwen2_5_VLVisionAttention (not QWen2VLVisionAttention). The parameter propagation to the base Attention class is correct: the base class accepts reduce_output: bool = True and the subclass properly passes it via super().__init__(). However, the file is missing the required NVIDIA copyright header with the year of latest meaningful modification, as mandated by the coding guidelines for all TensorRT-LLM source files.
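A minimal sketch of the propagation pattern described above, using simplified stand-in classes rather than the real TensorRT-LLM Attention base class:

```python
# Illustrative sketch only: mimics the reduce_output propagation pattern.
# BaseAttention/VisionAttention are simplified stand-ins, not the real classes.
class BaseAttention:

    def __init__(self, hidden_size: int, num_heads: int, reduce_output: bool = True):
        self.hidden_size = hidden_size
        self.num_heads = num_heads
        # When False, the output projection skips its TP all-reduce and the
        # caller becomes responsible for any cross-rank reduction.
        self.reduce_output = reduce_output


class VisionAttention(BaseAttention):

    def __init__(self, hidden_size: int, num_heads: int, reduce_output: bool = True):
        # Forward reduce_output unchanged so attention DP can disable the all-reduce.
        super().__init__(hidden_size, num_heads, reduce_output=reduce_output)


attn = VisionAttention(hidden_size=1280, num_heads=16, reduce_output=False)
print(attn.reduce_output)  # False
```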
🤖 Fix all issues with AI Agents
In @tensorrt_llm/_torch/models/modeling_qwen3vl.py:
- Line 654: The forward method's return type is wrong: it returns
(hidden_states, deepstack_feature_lists) but is annotated as torch.Tensor;
update the signature of forward(self, pixel_values: torch.Tensor, grid_thw:
torch.Tensor, **kwargs) -> Tuple[torch.Tensor, List[torch.Tensor]] (or
Sequence[torch.Tensor] if preferred) and add the corresponding typing imports
(from typing import Tuple, List) at the top of the module; leave **kwargs unused
for API consistency.
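A hedged sketch of what the suggested signature change could look like; the class and body below are placeholders, and only the annotation plus the typing imports reflect the instruction above:

```python
from typing import List, Tuple

import torch


class VisionModelSketch:
    """Placeholder stand-in for the Qwen3-VL vision model wrapper."""

    def forward(self, pixel_values: torch.Tensor, grid_thw: torch.Tensor,
                **kwargs) -> Tuple[torch.Tensor, List[torch.Tensor]]:
        # **kwargs is intentionally unused; it is kept for API consistency.
        hidden_states = pixel_values  # placeholder for the real vision tower output
        deepstack_feature_lists = [hidden_states]  # placeholder deepstack features
        return hidden_states, deepstack_feature_lists
```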
In @tensorrt_llm/_torch/modules/mlp.py:
- Around line 51-53: The up_lora LoraLayer is built using intermediate_size //
config.mapping.tp_size which mismatches up_proj when an overridden_tp_size is
used; update the up_lora construction (LoraLayer) to use the effective tp_size
used by up_proj (i.e., use overridden_tp_size when provided, otherwise fall back
to config.mapping.tp_size) so the LoraLayer shard size matches the Linear layer
shard size (adjust the code paths around up_lora and up_proj to derive and reuse
a single tp_size variable).
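A sketch of the sizing logic this comment asks for, assuming the effective TP size falls back to config.mapping.tp_size when no override is given. The helper and its names are hypothetical and only illustrate keeping the up_proj and up_lora shard sizes in sync:

```python
from typing import Optional, Tuple


def mlp_shard_sizes(intermediate_size: int,
                    mapping_tp_size: int,
                    overridden_tp_size: Optional[int] = None) -> Tuple[int, int]:
    # Derive one effective TP size and reuse it everywhere so the Linear shard
    # and the LoraLayer shard can never disagree.
    effective_tp_size = overridden_tp_size if overridden_tp_size is not None else mapping_tp_size
    up_proj_out_features = intermediate_size // effective_tp_size
    up_lora_out_features = intermediate_size // effective_tp_size  # must match up_proj
    return up_proj_out_features, up_lora_out_features


# Example: TP=4 overridden to 1 for attention DP keeps both shards at full width.
print(mlp_shard_sizes(11008, mapping_tp_size=4, overridden_tp_size=1))  # (11008, 11008)
print(mlp_shard_sizes(11008, mapping_tp_size=4))                        # (2752, 2752)
```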
🧹 Nitpick comments (3)
tensorrt_llm/_torch/modules/mlp.py (1)
35-49: Consider refactoring the mapping override logic.
The current approach of "misusing" pp_size to create smaller TP groups is a clever workaround but may be fragile. Consider whether a more explicit mechanism for Attention DP would be clearer, such as:
- Adding explicit support in the Mapping class for attention data parallel groups
- Using a dedicated attn_tp_size parameter that's already mentioned in learnings
This would make the code more maintainable and less prone to confusion.
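For reference, a minimal sketch of the workaround being discussed. It only shows the size arithmetic (tp_size folded down, pp_size absorbing the remaining ranks) and does not construct a real Mapping object:

```python
from typing import Tuple


def overridden_mapping_sizes(world_size: int, overridden_tp_size: int) -> Tuple[int, int]:
    # "Misuse" pp_size: shrink the TP dimension and fold the removed ranks into
    # the PP dimension so tp_size * pp_size still equals the original world size.
    assert world_size % overridden_tp_size == 0
    new_tp_size = overridden_tp_size
    new_pp_size = world_size // new_tp_size
    return new_tp_size, new_pp_size


# Example: 8 GPUs with TP overridden to 1 for attention DP -> 8 "PP" groups of one
# rank each, so the all-reduce happens within each single-rank TP group.
print(overridden_mapping_sizes(world_size=8, overridden_tp_size=1))  # (1, 8)
```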
tensorrt_llm/_torch/models/modeling_qwen3vl.py (2)
390-390: Consider removing unused attribute.
self.model_config is stored but not referenced elsewhere in Qwen3VLVisionBlock. If this is intentional for debugging or future use, consider adding a brief comment; otherwise, it can be removed.
442-476: LGTM with suggestion for clarity.
The mapping override logic correctly configures linear layers for attention DP by treating the TP dimension as PP for all-reduce purposes. The implementation is sound.
Consider rephrasing the comment on line 448 for clarity:
- # "Misuse" pp_size here to perform all-reduce within smaller groups
+ # Use pp_size to route all-reduce to the original TP group when TP is overridden to 1
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
- tensorrt_llm/_torch/models/modeling_qwen2vl.py
- tensorrt_llm/_torch/models/modeling_qwen3vl.py
- tensorrt_llm/_torch/modules/mlp.py
- tests/integration/defs/accuracy/references/mmmu.yaml
- tests/integration/defs/accuracy/test_llm_api_pytorch_multimodal.py
- tests/integration/test_lists/test-db/l0_l40s.yml
- tests/unittest/_torch/modeling/test_modeling_qwen3vl.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: The code developed for TensorRT-LLM should conform to Python 3.8+
Indent Python code with 4 spaces. Do not use tabs
Always maintain the namespace when importing Python modules, even if only one class or function from a module is used
Python filenames should use snake_case (e.g., some_file.py)
Python classes should use PascalCase (e.g., class SomeClass)
Python functions and methods should use snake_case (e.g., def my_awesome_function():)
Python local variables should use snake_case, with prefix k for variable names that start with a number (e.g., k_99th_percentile)
Python global variables should use upper snake_case with prefix G (e.g., G_MY_GLOBAL)
Python constants should use upper snake_case (e.g., MY_CONSTANT)
Avoid shadowing variables declared in an outer scope in Python
Initialize all externally visible members of a Python class in the constructor
For Python interfaces that may be used outside a file, prefer docstrings over comments
Use comments in Python for code within a function, or interfaces that are local to a file
Use Google-style docstrings for Python classes and functions, which can be parsed by Sphinx
Python attributes and variables can be documented inline with the format """<type>: Description"""
Avoid using reflection in Python when functionality can be easily achieved without reflection
When using try-except blocks in Python, limit the except clause to the smallest set of errors possible
When using try-except blocks in Python to handle multiple possible variable types (duck-typing), keep the body of the try as small as possible and use the else block for the main logic
Files:
tensorrt_llm/_torch/modules/mlp.py, tensorrt_llm/_torch/models/modeling_qwen2vl.py, tests/integration/defs/accuracy/test_llm_api_pytorch_multimodal.py, tests/unittest/_torch/modeling/test_modeling_qwen3vl.py, tensorrt_llm/_torch/models/modeling_qwen3vl.py
**/*.{cpp,cc,cxx,h,hpp,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
All TensorRT-LLM source files (.cpp, .h, .cu, .py, and other source files) should contain an NVIDIA copyright header with the year of latest meaningful modification
Files:
tensorrt_llm/_torch/modules/mlp.py, tensorrt_llm/_torch/models/modeling_qwen2vl.py, tests/integration/defs/accuracy/test_llm_api_pytorch_multimodal.py, tests/unittest/_torch/modeling/test_modeling_qwen3vl.py, tensorrt_llm/_torch/models/modeling_qwen3vl.py
🧠 Learnings (15)
📓 Common learnings
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 8063
File: tensorrt_llm/lora_manager.py:1080-1112
Timestamp: 2025-09-29T15:14:28.503Z
Learning: In tensorrt_llm/lora_manager.py, when calculating part_sizes for attn_qkv fused LoRA modules, the sizes are correctly multiplied by tp_size because model_config.num_heads and model_config.num_kv_heads are already divided by tp_size (per-TP-rank values), so multiplication is needed to get the original full concatenated dimension size. The interleave_fused_lora_weights_for_tp function provides proper validation.
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 8063
File: tensorrt_llm/lora_manager.py:1080-1112
Timestamp: 2025-09-29T15:14:28.503Z
Learning: In tensorrt_llm/lora_manager.py, when calculating part_sizes for attn_qkv fused LoRA modules, the sizes are correctly multiplied by tp_size because model_config.num_heads and model_config.num_kv_heads are already divided by tp_size (per-TP-rank values), so multiplication is needed to get the original full concatenated dimension size. The interleave_fused_lora_weights_for_tp function provides proper validation with asserts for total size and TP divisibility.
Learnt from: MatthiasKohl
Repo: NVIDIA/TensorRT-LLM PR: 6904
File: tensorrt_llm/_torch/attention_backend/trtllm.py:259-262
Timestamp: 2025-08-14T15:43:23.107Z
Learning: In TensorRT-LLM's attention backend, tensor parameters in the plan() method are assigned directly without validation (dtype, device, contiguity checks). This maintains consistency across all tensor inputs and follows the pattern of trusting callers to provide correctly formatted tensors.
Learnt from: thorjohnsen
Repo: NVIDIA/TensorRT-LLM PR: 6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
📚 Learning: 2025-09-09T09:40:45.658Z
Learnt from: fredricz-20070104
Repo: NVIDIA/TensorRT-LLM PR: 7645
File: tests/integration/test_lists/qa/llm_function_core.txt:648-648
Timestamp: 2025-09-09T09:40:45.658Z
Learning: In TensorRT-LLM test lists, it's common and intentional for the same test to appear in multiple test list files when they serve different purposes (e.g., llm_function_core.txt for comprehensive core functionality testing and llm_function_core_sanity.txt for quick sanity checks). This duplication allows tests to be run in different testing contexts.
Applied to files:
tests/integration/test_lists/test-db/l0_l40s.yml, tests/integration/defs/accuracy/test_llm_api_pytorch_multimodal.py
📚 Learning: 2025-09-17T02:48:52.732Z
Learnt from: tongyuantongyu
Repo: NVIDIA/TensorRT-LLM PR: 7781
File: tests/integration/test_lists/waives.txt:313-313
Timestamp: 2025-09-17T02:48:52.732Z
Learning: In TensorRT-LLM, `tests/integration/test_lists/waives.txt` is specifically for waiving/skipping tests, while other test list files like those in `test-db/` and `qa/` directories are for different test execution contexts (pre-merge, post-merge, QA tests). The same test appearing in both waives.txt and execution list files is intentional - the test is part of test suites but will be skipped due to the waiver.
Applied to files:
tests/integration/test_lists/test-db/l0_l40s.yml
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
Repo: NVIDIA/TensorRT-LLM PR: 6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tests/integration/test_lists/test-db/l0_l40s.yml, tests/integration/defs/accuracy/test_llm_api_pytorch_multimodal.py
📚 Learning: 2025-08-14T06:36:40.701Z
Learnt from: timlee0212
Repo: NVIDIA/TensorRT-LLM PR: 6886
File: tensorrt_llm/_torch/models/modeling_deepseekv3.py:0-0
Timestamp: 2025-08-14T06:36:40.701Z
Learning: In DeepSeek V3 model (tensorrt_llm/_torch/models/modeling_deepseekv3.py), the disagreement between AllReduce.__init__ guard and _compute_mlp_tp_size logic for MNNVL usage is expected by design. The AllReduce component and MLP TP-size computation intentionally use different criteria for MNNVL availability decisions.
Applied to files:
tensorrt_llm/_torch/modules/mlp.py, tensorrt_llm/_torch/models/modeling_qwen2vl.py, tensorrt_llm/_torch/models/modeling_qwen3vl.py
📚 Learning: 2025-08-26T06:07:02.166Z
Learnt from: shaharmor98
Repo: NVIDIA/TensorRT-LLM PR: 7231
File: tensorrt_llm/_torch/pyexecutor/_util.py:504-509
Timestamp: 2025-08-26T06:07:02.166Z
Learning: In tensorrt_llm/_torch/pyexecutor/_util.py, when calling model_engine.set_lora_model_config(), pass model_binding_config.mlp_hidden_size directly without multiplying by mapping.tp_size, as the mlp_hidden_size from get_bindings_model_config() is already the per-TP rank value needed for LoRA weight packaging.
Applied to files:
tensorrt_llm/_torch/modules/mlp.py, tensorrt_llm/_torch/models/modeling_qwen3vl.py
📚 Learning: 2025-09-29T15:14:28.503Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 8063
File: tensorrt_llm/lora_manager.py:1080-1112
Timestamp: 2025-09-29T15:14:28.503Z
Learning: In tensorrt_llm/lora_manager.py, when calculating part_sizes for attn_qkv fused LoRA modules, the sizes are correctly multiplied by tp_size because model_config.num_heads and model_config.num_kv_heads are already divided by tp_size (per-TP-rank values), so multiplication is needed to get the original full concatenated dimension size. The interleave_fused_lora_weights_for_tp function provides proper validation with asserts for total size and TP divisibility.
Applied to files:
tensorrt_llm/_torch/modules/mlp.py, tensorrt_llm/_torch/models/modeling_qwen3vl.py
📚 Learning: 2025-09-29T15:14:28.503Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 8063
File: tensorrt_llm/lora_manager.py:1080-1112
Timestamp: 2025-09-29T15:14:28.503Z
Learning: In tensorrt_llm/lora_manager.py, when calculating part_sizes for attn_qkv fused LoRA modules, the sizes are correctly multiplied by tp_size because model_config.num_heads and model_config.num_kv_heads are already divided by tp_size (per-TP-rank values), so multiplication is needed to get the original full concatenated dimension size. The interleave_fused_lora_weights_for_tp function provides proper validation.
Applied to files:
tensorrt_llm/_torch/modules/mlp.py, tensorrt_llm/_torch/models/modeling_qwen3vl.py
📚 Learning: 2025-08-26T09:37:10.463Z
Learnt from: jiaganc
Repo: NVIDIA/TensorRT-LLM PR: 7031
File: tensorrt_llm/bench/dataclasses/configuration.py:90-104
Timestamp: 2025-08-26T09:37:10.463Z
Learning: In TensorRT-LLM, the `get_pytorch_perf_config()` method returns `self.pytorch_config` which can contain default `cuda_graph_config` values, so `llm_args` may already have this config before the extra options processing.
Applied to files:
tensorrt_llm/_torch/modules/mlp.py
📚 Learning: 2025-08-26T09:37:10.463Z
Learnt from: jiaganc
Repo: NVIDIA/TensorRT-LLM PR: 7031
File: tensorrt_llm/bench/dataclasses/configuration.py:90-104
Timestamp: 2025-08-26T09:37:10.463Z
Learning: In TensorRT-LLM's bench configuration, the `get_pytorch_perf_config()` method returns `self.pytorch_config` which is a Dict[str, Any] that can contain default values including `cuda_graph_config`, making the fallback `llm_args["cuda_graph_config"]` safe to use.
Applied to files:
tensorrt_llm/_torch/modules/mlp.py
📚 Learning: 2025-08-14T15:43:23.107Z
Learnt from: MatthiasKohl
Repo: NVIDIA/TensorRT-LLM PR: 6904
File: tensorrt_llm/_torch/attention_backend/trtllm.py:259-262
Timestamp: 2025-08-14T15:43:23.107Z
Learning: In TensorRT-LLM's attention backend, tensor parameters in the plan() method are assigned directly without validation (dtype, device, contiguity checks). This maintains consistency across all tensor inputs and follows the pattern of trusting callers to provide correctly formatted tensors.
Applied to files:
tensorrt_llm/_torch/models/modeling_qwen3vl.py
📚 Learning: 2025-08-15T06:46:53.813Z
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:53.813Z
Learning: In the TensorRT-LLM KV cache manager, SWA (Sliding Window Attention) combined with beam search is currently in a broken/non-functional state and is planned for future rework. During preparatory refactoring phases, code related to SWA+beam search may intentionally remain in a non-working state until the broader rework is completed.
Applied to files:
tensorrt_llm/_torch/models/modeling_qwen3vl.py
📚 Learning: 2025-12-19T06:31:54.973Z
Learnt from: nvyocox
Repo: NVIDIA/TensorRT-LLM PR: 10117
File: tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_attention.py:336-339
Timestamp: 2025-12-19T06:31:54.973Z
Learning: In tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_attention.py, the cast to torch.float16 for qkv_node before creating the AttentionPlugin is intentional and required because DriveOS LLM expects float16 dtype specifically. This should not be changed to preserve original dtype or made configurable for bfloat16 models in the DriveOS LLM ONNX export path.
Applied to files:
tensorrt_llm/_torch/models/modeling_qwen3vl.py
📚 Learning: 2025-08-14T21:04:50.248Z
Learnt from: thorjohnsen
Repo: NVIDIA/TensorRT-LLM PR: 6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
Applied to files:
tensorrt_llm/_torch/models/modeling_qwen3vl.py
📚 Learning: 2025-08-19T12:45:11.997Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 7033
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:0-0
Timestamp: 2025-08-19T12:45:11.997Z
Learning: In tensorrt_llm/_torch/pyexecutor/model_engine.py, DoRA (Delta Orthogonal Rank Adaptation) functionality was removed from the PyTorch flow to eliminate issues with inverted DoRA detection logic. The original is_dora condition was checking if scaling_vec_pointer == 0, which was potentially incorrect.
Applied to files:
tensorrt_llm/_torch/models/modeling_qwen3vl.py
🧬 Code graph analysis (3)
tensorrt_llm/_torch/modules/mlp.py (2)
- tensorrt_llm/mapping.py (1): Mapping (361-540)
- tensorrt_llm/_torch/distributed/communicator.py (2): tp_size (64-65), pp_size (60-61)
tests/integration/defs/accuracy/test_llm_api_pytorch_multimodal.py (3)
- tests/integration/defs/accuracy/accuracy_core.py (1): LlmapiAccuracyTestHarness (898-909)
- tensorrt_llm/sampling_params.py (1): SamplingParams (113-552)
- tensorrt_llm/evaluate/lm_eval.py (3): MMMU (696-749), evaluate (428-463), evaluate (795-849)
tensorrt_llm/_torch/models/modeling_qwen3vl.py (3)
- tensorrt_llm/mapping.py (1): Mapping (361-540)
- tensorrt_llm/_torch/modules/linear.py (1): Linear (2045-2297)
- tensorrt_llm/_torch/models/modeling_qwen3.py (2): forward (117-164), forward (193-226)
🪛 Ruff (0.14.10)
tensorrt_llm/_torch/models/modeling_qwen3vl.py
654-654: Unused method argument: kwargs
(ARG002)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (8)
tests/integration/test_lists/test-db/l0_l40s.yml (1)
17-26: LGTM! The addition of multimodal modeling tests for Qwen3VL is consistent with the existing test structure and aligns with the PR objectives.
tests/integration/defs/accuracy/references/mmmu.yaml (1)
29-30: LGTM! The accuracy reference for Qwen3-VL-8B-Instruct is properly formatted and consistent with other model entries.
tests/unittest/_torch/modeling/test_modeling_qwen3vl.py (1)
256-263: LGTM! The test scenario for disable_fuse_rope=True provides appropriate coverage for the new functionality. The scenario integrates well with the existing test framework.
tensorrt_llm/_torch/modules/mlp.py (1)
36-47: The rank mapping is correct and intentional: this design enables all-reduce within smaller TP groups as intended for Attention DP.
When overridden_tp_size is used, the global rank remains unchanged, but the TP and PP group memberships are recalculated based on the new tp_size and pp_size values. The Mapping class correctly computes tp_rank = rank % (tp_size * cp_size) // cp_size and pp_rank = rank // (tp_size * cp_size), so all-reduce operations automatically target the correct smaller TP groups. The world_size is mathematically preserved (new_tp_size * new_pp_size = old_tp_size * old_pp_size), and the comment "Misuse pp_size here to perform all-reduce within smaller groups" reflects the deliberate reuse of the PP dimension for grouping. This pattern is used consistently across multiple models (Qwen3, DeepSeekV3, Llama, GLM) for enable_attention_dp scenarios and works correctly without issues.
Likely an incorrect or invalid review comment.
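A small worked example of the rank formulas quoted above, assuming cp_size == 1:

```python
def tp_pp_rank(rank: int, tp_size: int, cp_size: int = 1):
    # Formulas quoted from the Mapping class in the comment above.
    tp_rank = rank % (tp_size * cp_size) // cp_size
    pp_rank = rank // (tp_size * cp_size)
    return tp_rank, pp_rank


# Original mapping: 4 GPUs, tp_size=4 -> every rank shares pp_rank 0.
print([tp_pp_rank(r, tp_size=4) for r in range(4)])  # [(0, 0), (1, 0), (2, 0), (3, 0)]

# Overridden mapping: tp_size=1, pp_size=4 -> each rank is alone in its TP group,
# so the all-reduce collapses to a no-op on that rank.
print([tp_pp_rank(r, tp_size=1) for r in range(4)])  # [(0, 0), (0, 1), (0, 2), (0, 3)]
```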
tests/integration/defs/accuracy/test_llm_api_pytorch_multimodal.py (1)
321-336: LGTM! The new TestQwen3VL test class follows the established patterns in this file, consistent with TestQwen3VL_MOE. The test configuration is appropriate for validating Qwen3-VL-8B-Instruct model accuracy.
tensorrt_llm/_torch/models/modeling_qwen3vl.py (3)
18-18: LGTM! The Mapping import is correctly added to support creating custom mappings for attention DP configuration.
363-369: LGTM! The reduce_output logic correctly disables all-reduce in vision attention when attention DP is enabled, allowing the LLM side to handle parallelism appropriately.
383-383: LGTM! Correctly overrides tp_size to 1 for vision MLP when attention DP is enabled, consistent with the attention changes.
/bot run
PR_Github #30689 [ run ] triggered by Bot. Commit:
jaedeok-nvidia left a comment
Thanks @yechank-nvidia for your quick update on the attention dp feature. Overall looks good to me, although I've left several comments. Thx!
PR_Github #30689 [ run ] completed with state
1d4f849 to 890b6a3 (Compare)
/bot run
PR_Github #30837 [ run ] triggered by Bot. Commit:
/bot run
PR_Github #30839 [ run ] triggered by Bot. Commit:
PR_Github #30837 [ run ] completed with state
PR_Github #30839 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #30949 [ run ] triggered by Bot. Commit:
PR_Github #30949 [ run ] completed with state
0137afe to 1440e3b (Compare)
/bot run --disable-fail-fast
PR_Github #31005 [ run ] triggered by Bot. Commit:
PR_Github #31005 [ run ] completed with state
Signed-off-by: yechank <[email protected]>
Signed-off-by: yechank <[email protected]>
Signed-off-by: yechank <[email protected]>
Signed-off-by: yechank <[email protected]>
1440e3b to a56eba6 (Compare)
/bot run
PR_Github #31064 [ run ] triggered by Bot. Commit:
mikeiovine left a comment
Approved changes in modules/mlp.py
PR_Github #31064 [ run ] completed with state
/bot run
PR_Github #31139 [ run ] triggered by Bot. Commit:
PR_Github #31139 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #31186 [ run ] triggered by Bot. Commit:
PR_Github #31186 [ run ] completed with state
/bot run
PR_Github #31221 [ run ] triggered by Bot. Commit:
PR_Github #31221 [ run ] completed with state
Signed-off-by: yechank <[email protected]> Signed-off-by: Daniil Kulko <[email protected]>
With this fix, Qwen3-VL-Dense/MoE can run Attention DP on the LLM side. Note that other multimodal models are not tested.
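For context, a hedged sketch of how attention DP is typically switched on through the PyTorch LLM API. The flag names (tensor_parallel_size, enable_attention_dp) and the model path are assumptions based on the existing attention-DP flow and should be checked against the installed TensorRT-LLM version:

```python
from tensorrt_llm import LLM

# Sketch: serve Qwen3-VL across 4 GPUs with tensor parallelism while the
# attention layers run in data-parallel mode on the LLM side.
# Argument names and model path are assumptions, not verified against this PR.
llm = LLM(
    model="Qwen/Qwen3-VL-8B-Instruct",  # illustrative model path
    tensor_parallel_size=4,
    enable_attention_dp=True,
)

outputs = llm.generate(["Describe what attention data parallelism does."])
print(outputs[0].outputs[0].text)
```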
Summary by CodeRabbit
New Features
Tests