Adding CUDA Graph Support for Vision Encoder #2274
tomlifu wants to merge 12 commits into NVIDIA-NeMo:main from
Conversation
* qwen3-vl migration [wip] Signed-off-by: jinliangl <jinliangl@nvidia.com>
* support qwen3vl bshd training Signed-off-by: jinliangl <jinliangl@nvidia.com>
* support bshd training, thd training, thd training with cp; loss convergence has not been tested yet
* adjust how to get packed seq params
* support qwen3vl dense
* support converting hf ckpt for qwenvl vision module, can convert deepstack's params, need to check it further
* fix qwen3vl shard_state_dict
* align megatron-bridge and mbridge fwd bitwisely
* align bshd and thd training, cp training remains to do
* fix full recompute of qwen3vl bug with credit to xuwen Signed-off-by: jinliangl <jinliangl@nvidia.com>
* support hf vision model and megatron vision model; use model.use_hf_vision_model=true to enable it Signed-off-by: jinliangl <jinliangl@nvidia.com>
* align with pr 1997, thd and bshd loss curve is verified Signed-off-by: jinliangl <jinliangl@nvidia.com>
* init hf_config from scratch to avoid deepcopy parallel_state Signed-off-by: jinliangl <jinliangl@nvidia.com>
* use pg_collection instead of mpu Signed-off-by: jinliangl <jinliangl@nvidia.com>
* add qwen3vl_step new file
---------
Signed-off-by: jinliangl <jinliangl@nvidia.com>
Signed-off-by: Shifang Xu <shifangx@nvidia.com>
Co-authored-by: jinliangl <jinliangl@nvidia.com>
…guage_transformer_config), vision_transformer_config)
…t support apply_rope_fusion currently.
Signed-off-by: Lifu Zhang <lifuz@login-lyris02.lyris.clusters.nvidia.com>
📝 Walkthrough

This pull request extends Megatron-Bridge with comprehensive vision model support for Qwen3VL, including CUDA graph optimization for vision encoders, process group collection integration, and packed sequence handling. It introduces new vision architecture components (encoder, attention, RoPE), updates the training pipeline for multi-modal support, adds example configurations, and upgrades the submodule pointer.

Changes
Sequence Diagram(s)

sequenceDiagram
participant DataLoader as Data Iterator
participant TrainStep as qwen3vl_step
participant PackSeq as pack_or_pad_batch_sequences
participant Model as Qwen3VLModel
participant VisionModel as Qwen3VLVisionModel
participant Attention as Qwen3VLSelfAttention
DataLoader->>TrainStep: get_batch_from_iterator()
TrainStep->>PackSeq: tokens, labels, attention_mask
PackSeq->>PackSeq: Compute cu_seqlens, split by CP rank
PackSeq-->>TrainStep: packed tokens, PackedSeqParams
TrainStep->>Model: forward(input_ids, pixel_values, packed_seq_params)
Model->>VisionModel: forward(pixel_values, grid_thw)
VisionModel->>VisionModel: patch_embedding()
VisionModel->>VisionModel: rot_pos_emb() + fast_pos_embed_interpolate()
VisionModel->>VisionModel: transformer_block() with packed_seq_params
VisionModel-->>Model: vision_embeddings
Model->>Model: Organize vision embeddings with vision_mask
Model->>Model: Prepare language + vision combined embeddings
Model->>Model: language_model.forward() with packed_seq_params
Model->>Attention: forward() with packed sequences
Attention->>Attention: apply_rotary_pos_emb_absolute() for RoPE
Attention->>Attention: flash_decode or static batching kernel
Attention-->>Model: attention_output
Model-->>TrainStep: loss
TrainStep-->>DataLoader: gradients
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 2 | ❌ 2

❌ Failed checks (2 warnings)
✅ Passed checks (2 passed)
Actionable comments posted: 12
Note
Due to the large number of review comments, Critical severity comments were prioritized as inline comments.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
src/megatron/bridge/training/vlm_step.py (1)
145-192: ⚠️ Potential issue | 🟡 Minor

if True: renders the else branch dead code — remove it or gate on a real condition.

The else branch (lines 159–192) is unreachable. Additionally, line 190 (if attn is not None:) is dedented outside the if tokens_or_input is not None: block, so target_len would be an unbound NameError if tokens_or_input were None — though this can never execute today. If the non-PP rounding path is no longer needed, delete it. If it may be re-enabled, gate on a proper flag (e.g., cfg.model.use_fixed_shape_padding) and fix the indentation of line 190.

Suggested minimal cleanup

- if True:
+ # Always pad to model seq_length for fixed-shape CUDA graph compatibility
+ if True:  # TODO: replace with a config flag if variable-shape padding is needed later
      seq_len = cfg.model.seq_length
      ...
- else:
-     # No PP: pad sequence length to nearest multiple of 128 for efficiency (capped at model seq_length)
-     ...
(remove entire else block)

src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/text_model.py (1)
60-92: ⚠️ Potential issue | 🔴 Critical

Potential AttributeError when pg_collection is None.

pg_collection defaults to None (line 60), but line 91 unconditionally accesses pg_collection.cp. Any caller that omits pg_collection will crash with AttributeError: 'NoneType' object has no attribute 'cp'. Either make pg_collection required (remove the default) or add a guard:

Option A: Make required

- pg_collection: ProcessGroupCollection = None,
+ pg_collection: ProcessGroupCollection,

Option B: Guard the access

  self.rotary_pos_emb = Qwen3VLMultimodalRotaryEmbedding(
      kv_channels=self.config.kv_channels,
      rotary_percent=rotary_percent,
      rotary_interleaved=self.config.rotary_interleaved,
      seq_len_interpolation_factor=seq_len_interpolation_factor,
      rotary_base=rotary_base,
-     cp_group=pg_collection.cp,
+     cp_group=pg_collection.cp if pg_collection is not None else None,
  )

tests/unit_tests/models/qwen_vl/modelling_qwen3_vl/test_model.py (1)
314-323: ⚠️ Potential issue | 🔴 Critical

pg_collection is missing from model_no_decoder — this will cause an AttributeError.

The Qwen3VLModel constructor accepts pg_collection with a default of None, but immediately accesses its attributes without null checks (lines 103–105 in model.py: pg_collection.cp, pg_collection.tp, pg_collection.pp, and line 112: pg_collection.embd). Instantiating model_no_decoder without passing pg_collection will raise an AttributeError on line 103, causing the test to fail at runtime. Either pass pg_collection=pg_collection to model_no_decoder, or refactor the constructor to handle None gracefully.
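A minimal sketch of the first option, assuming the test already constructs a pg_collection for its other fixtures; every keyword argument other than pg_collection below is a placeholder, not the real fixture value.

# Sketch: reuse the pg_collection the test already builds for the full model.
# All other keyword arguments are placeholders standing in for the existing fixture values.
model_no_decoder = Qwen3VLModel(
    language_transformer_config=language_config,   # placeholder
    vision_transformer_config=vision_config,       # placeholder
    pre_process=True,
    post_process=False,
    pg_collection=pg_collection,  # previously omitted, so __init__ dereferenced None and raised AttributeError
)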
src/megatron/bridge/models/qwen_vl/qwen3_vl_bridge.py (1)

155-211: ⚠️ Potential issue | 🟠 Major

MoE bridge uses incorrect blanket vision mapping instead of matching dense bridge's granular vision mappings.

The Qwen3VLMoEBridge.mapping_registry() applies a blanket ReplicatedMapping(megatron_param="vision_model.**", hf_param="model.visual.**") that bypasses the specialized vision weight handling in the dense bridge. The dense Qwen3VLBridge explicitly maps vision QKV weights using ConcatenatedQKVMapping and provides granular entries for other vision components (decoder layers, deepstack merger, etc.), but the MoE bridge's catch-all vision_model.** pattern will match all vision parameters with simple 1:1 replication, skipping the QKV interleaving transformation.

Since the vision encoder architecture is identical between the dense and MoE variants (per the docstring claiming "same as dense model"), the MoE bridge must use the same granular vision mappings as the dense bridge, including ConcatenatedQKVMapping for QKV weights.
🤖 Fix all issues with AI agents
In `@examples/recipes/decentralized_pg/pretrain_qwen3_vl_simple.py`:
- Around line 68-71: Remove the manual distributed cleanup block that calls
torch.distributed.is_initialized(), torch.distributed.barrier(), and
torch.distributed.destroy_process_group() in the example; pretrain() already
performs cleanup via _maybe_destroy_process_group(), so delete the
barrier/destroy_process_group block (the lines around the manual cleanup) to
avoid double-destroy/runtime errors when pretrain() has already torn down the
process group.
In `@scripts/performance/utils/overrides.py`:
- Around line 130-167: The _set_vision_cuda_graph_overrides function is never
invoked and also lacks resetting behavior for the "none" case; to fix, call
_set_vision_cuda_graph_overrides from the same two places where
_set_cuda_graph_overrides is called (inside set_workload_base_configs and
set_user_overrides) so vision overrides are applied, and add explicit handling
inside _set_vision_cuda_graph_overrides for vision_cuda_graph_impl == "none" to
disable TE RNG tracking (set recipe.rng.te_rng_tracker =
recipe.model.use_te_rng_tracker = False) mirroring the reset logic in
_set_cuda_graph_overrides; ensure the function still validates
vision_cuda_graph_scope against valid_te_scopes and sets
recipe.model.vision_cuda_graph_scope when provided.
In `@src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/model.py`:
- Around line 178-206: Remove the duplicated CUDA graph compatibility blocks and
keep a single instance that checks getattr(self.language_model.config,
"cuda_graph_impl", "none") != "none" and, if true, assigns
self.position_embedding_type = self.language_model.position_embedding_type,
self.rotary_pos_emb = self.language_model.rotary_pos_emb, and self.decoder =
self.language_model.decoder; locate the repeated blocks around the class that
references self.language_model and consolidate them into one occurrence to
eliminate the triplication.
In `@src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/rope.py`:
- Around line 120-126: The code calls self.cp_group.size() without ensuring
cp_group is set; change the conditional in the method that manipulates emb to
first check that self.cp_group is not None (e.g., self.cp_group is not None and
self.cp_group.size() > 1 and not self.is_thd_format) so you only call size()
when cp_group exists, and then call get_pos_emb_on_this_cp_rank(emb, 0,
self.cp_group) as before; reference symbols: self.cp_group, is_thd_format, emb,
get_pos_emb_on_this_cp_rank.
- Around line 265-279: The function apply_rotary_pos_emb_thd_absolute currently
ignores the cu_seqlens argument causing incorrect RoPE for packed THD inputs;
update it to compute per-sequence absolute token positions from cu_seqlens and
apply rotary embeddings per segment (e.g., iterate or vectorize over batches
using cu_seqlens to slice t and select matching freqs offsets) before
concatenating results, using _apply_rotary_pos_emb_bshd on each sequence chunk
(or a vectorized equivalent) so positions reset per sequence; alternatively if
you intend only single-sequence use remove the cu_seqlens parameter and rename
the function to reflect single-sequence behavior.
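An illustrative sketch of the per-segment approach described above, not the PR's implementation: it assumes t is laid out as [total_tokens, heads, dim], freqs is indexed by position, torch is already imported, and _apply_rotary_pos_emb_bshd is the existing helper in rope.py.

# Sketch only: apply RoPE per packed sequence so positions restart at every cu_seqlens boundary.
def apply_rotary_pos_emb_thd_absolute(t, cu_seqlens, freqs, rotary_interleaved=False):
    chunks = []
    for start, end in zip(cu_seqlens[:-1].tolist(), cu_seqlens[1:].tolist()):
        segment = t[start:end].unsqueeze(1)        # add a batch dim for the bshd helper
        segment_freqs = freqs[: end - start]       # positions reset to 0 for each sequence
        rotated = _apply_rotary_pos_emb_bshd(segment, segment_freqs, rotary_interleaved=rotary_interleaved)
        chunks.append(rotated.squeeze(1))
    return torch.cat(chunks, dim=0)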
In `@src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/transformer_block.py`:
- Around line 455-459: The code calls an undefined function
sharded_state_dict_default inside Qwen3VLVisionTransformerBlock when iterating
self.named_children(), which will raise NameError; import
sharded_state_dict_default (e.g. from megatron.core.dist_checkpointing.utils
import sharded_state_dict_default) at the top with other imports and replace the
identity check "not module is self.layers" and "not module is
self.deepstack_merger_list" with the Pythonic "module is not self.layers" and
"module is not self.deepstack_merger_list" so the loop correctly calls
sharded_state_dict_default for child modules like self.final_layernorm.
In `@src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/utils.py`:
- Around line 365-384: The None guard in split_data_cp_rank is currently after
accessing val.shape and thus ineffective; move the "if val is None: return val"
to the top of the function (before any assertions or accesses to val.shape),
then perform the existing assertions and processing (assert cp_size > 1, assert
0 == val.shape[seq_dim] % (2 * cp_size), cp_rank handling,
view/index_select/rescale steps) so the function early-returns for None inputs
and avoids AttributeError when val is None.
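A short sketch of the reordering, with the slicing body of split_data_cp_rank elided; parameter names follow the prompt above.

# Sketch: bail out before any attribute access so a None input never reaches val.shape.
def split_data_cp_rank(val, cp_size, cp_rank, seq_dim=0):
    if val is None:
        return val
    assert cp_size > 1
    assert val.shape[seq_dim] % (2 * cp_size) == 0
    ...  # existing view / index_select / reshape logic, unchanged
    return val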
In `@src/megatron/bridge/models/qwen_vl/qwen3_vl_provider.py`:
- Around line 155-191: The local vision_transformer_config is modified (CUDA
graph settings, scope, max_vision_cuda_graph_seq_length) but the Qwen3VLModel
call uses the original hf_vision_config; update the Qwen3VLModel instantiation
to pass the configured vision_transformer_config (i.e., change
vision_transformer_config=hf_vision_config to
vision_transformer_config=vision_transformer_config) so the applied settings are
used, and make the same fix in the MoE provider where the analogous local
vision_transformer_config is created and then incorrectly replaced with
hf_vision_config in that model's constructor.
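A sketch of the intended wiring in the provider; the exact attribute names set on the config are assumptions based on the settings listed above.

# Sketch: configure the local vision config once and pass that same object to the model.
vision_transformer_config = get_vision_model_config(copy(language_transformer_config), hf_vision_config)
vision_transformer_config.vision_cuda_graph_impl = self.vision_cuda_graph_impl          # assumed provider fields
vision_transformer_config.vision_cuda_graph_scope = self.vision_cuda_graph_scope
vision_transformer_config.max_vision_cuda_graph_seq_length = self.max_vision_cuda_graph_seq_length

model = Qwen3VLModel(
    language_transformer_config=language_transformer_config,
    vision_transformer_config=vision_transformer_config,  # was: hf_vision_config
    # ... remaining constructor arguments unchanged
)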
In `@src/megatron/bridge/training/qwen3vl_step.py`:
- Around line 191-206: The loop that repacks sequences into
packed_tokens/packed_labels/packed_loss_mask/packed_position_ids is slicing the
batch dim instead of the sequence dim (packed_tokens[start_idx:...]) causing
empty assignments for start_idx>0; update the assignments in qwen3vl_step.py
(the for loop using seqlens_in_batch_cpu and cu_seqlens_padded_cpu) to index
dimension 1 explicitly (e.g., select all rows then slice columns like [:,
start_idx:start_idx+seqlen] or use .narrow on dim=1) so each tensor receives the
sequence slice into the sequence axis rather than the batch axis.
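A small self-contained illustration of the indexing difference (not the actual repacking loop): slicing dimension 0 runs off the batch axis, while slicing dimension 1 targets the sequence axis.

import torch

packed_tokens = torch.zeros(2, 16, dtype=torch.long)   # [batch, padded_seq_len]
tokens = torch.arange(2 * 16).view(2, 16)
start_idx, seqlen = 4, 6

wrong = packed_tokens[start_idx : start_idx + seqlen]        # slices the batch axis: shape [0, 16] here
right = packed_tokens[:, start_idx : start_idx + seqlen]     # slices the sequence axis: shape [2, 6]
packed_tokens[:, start_idx : start_idx + seqlen] = tokens[:, :seqlen]  # writes into the sequence axis
print(wrong.shape, right.shape)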
- Around line 312-320: The call to get_thd_batch_on_this_cp_rank is using
non-existent PackedSeqParams attributes (cu_seqlens, cu_seqlens_padded,
max_seqlen); update the call in qwen3vl_step.py to use the correct
PackedSeqParams fields (choose the appropriate query or key/value variants) such
as cu_seqlens_q or cu_seqlens_kv, cu_seqlens_q_padded or cu_seqlens_kv_padded,
and max_seqlen_q or max_seqlen_kv so they match the PackedSeqParams constructor
and usages; ensure the chosen _q vs _kv set aligns with how
get_thd_batch_on_this_cp_rank expects sequence lengths.
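A sketch of the corrected call using the query-side fields; the argument order of get_thd_batch_on_this_cp_rank is illustrative and should be checked against its real signature.

# PackedSeqParams only exposes *_q / *_kv variants, not the bare cu_seqlens / max_seqlen names.
batch = get_thd_batch_on_this_cp_rank(
    batch,
    packed_seq_params.cu_seqlens_q,         # was: packed_seq_params.cu_seqlens
    packed_seq_params.cu_seqlens_q_padded,  # was: packed_seq_params.cu_seqlens_padded
    packed_seq_params.max_seqlen_q,         # was: packed_seq_params.max_seqlen
)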
In `@src/megatron/bridge/training/train.py`:
- Around line 50-53: The imports from
megatron.core.transformer.vision_cuda_graphs are invalid; update the import
statements that reference VisionTECudaGraphHelper and
get_vision_cuda_graph_seq_length to import them from
megatron.core.transformer.cuda_graphs instead (or remove these imports entirely
if those symbols are unused in train.py); ensure any downstream references to
VisionTECudaGraphHelper and get_vision_cuda_graph_seq_length still resolve after
changing the import path and run tests to confirm no runtime import errors.
In `@src/megatron/bridge/training/utils/packed_seq_utils.py`:
- Around line 86-92: The packing loop in packed_seq_utils.py indexes tensors
with attention_mask (e.g., input_ids[i, attention_mask[i]]) which crashes when
attention_mask is None; update the loop to guard indexing by computing
per-example views: for each i use mask_i = attention_mask[i] if attention_mask
is not None else slice(0, seqlens_in_batch[i]) and then use input_ids[i, mask_i]
(and do the same for token_type_ids, position_ids, etc.), or simply slice with
:seqlens_in_batch[i] when attention_mask is None so the code works regardless of
whether attention_mask is provided.
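A sketch of the guarded loop body; batch_size, seqlens_in_batch, and the packed_* accumulators are assumed to exist in the surrounding function, as in the prompt above.

for i in range(batch_size):
    # Fall back to a plain length slice when no attention_mask is provided.
    mask_i = attention_mask[i] if attention_mask is not None else slice(0, int(seqlens_in_batch[i]))
    packed_input_ids.append(input_ids[i, mask_i])
    packed_position_ids.append(position_ids[i, mask_i])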
🟠 Major comments (10)
src/megatron/bridge/training/utils/packed_seq_utils.py-73-74 (1)
73-74: 🛠️ Refactor suggestion | 🟠 Major

Mid-file import violates import ordering guidelines.

from megatron.core import mpu should be moved to the top of the file alongside the other imports (lines 17–19). Having it after the first function definition breaks the expected import order.

Move import to top

 import torch
 import math
 from megatron.core.packed_seq_params import PackedSeqParams
+from megatron.core import mpu

 def get_packed_seq_params(batch: dict[str, torch.Tensor]) -> PackedSeqParams:
     ...
-
-# Copied from verl/verl/models/mcore/util.py
-from megatron.core import mpu

As per coding guidelines, imports should be organized at the top: future imports, standard library, third-party, first-party, separated by blank lines.
src/megatron/bridge/training/train.py-252-280 (1)
252-280: ⚠️ Potential issue | 🟠 Major

Overly broad except Exception silently swallows initialization errors.

If vision CUDA graph initialization fails due to a genuine bug (e.g., wrong attribute, shape mismatch, misconfiguration), the catch-all silently degrades to no vision CUDA graphs with only a print_rank_0 warning that's easy to miss. This makes debugging very difficult.

Narrow the exception to the specific, expected failure modes (e.g., AttributeError for missing vision_model, or a guard check before the try), and let unexpected errors propagate.

Suggested approach: guard instead of broad try/except

-    vision_config = getattr(config.model, 'vision_cuda_graph_impl', None)
-    if vision_config == "transformer_engine":
-        # Try to get vision config from the model
-        try:
-            for model_chunk in model:
-                unwrapped = get_attr_wrapped_model(
-                    model_chunk, 'vision_model', allow_none=True, return_model_obj=True
-                )
-                if unwrapped is not None and hasattr(unwrapped, 'vision_model') and unwrapped.vision_model is not None:
-                    vision_model_config = unwrapped.vision_model.config
-                    if vision_model_config.cuda_graph_impl == "transformer_engine":
-                        vision_seq_length = get_vision_cuda_graph_seq_length(vision_model_config)
-                        vision_cuda_graph_helper = VisionTECudaGraphHelper(
-                            model=model,
-                            vision_config=vision_model_config,
-                            vision_seq_length=vision_seq_length,
-                            micro_batch_size=config.train.micro_batch_size,
-                            num_microbatches=get_num_microbatches(),
-                        )
-                        print_rank_0(
-                            f"Vision encoder CUDA graph enabled with seq_length={vision_seq_length}"
-                        )
-                        break
-        except Exception as e:
-            print_rank_0(f"Warning: Failed to initialize vision CUDA graph helper: {e}")
-            vision_cuda_graph_helper = None
+    vision_config = getattr(config.model, 'vision_cuda_graph_impl', None)
+    if vision_config == "transformer_engine":
+        for model_chunk in model:
+            unwrapped = get_attr_wrapped_model(
+                model_chunk, 'vision_model', allow_none=True, return_model_obj=True
+            )
+            if unwrapped is not None and hasattr(unwrapped, 'vision_model') and unwrapped.vision_model is not None:
+                vision_model_config = unwrapped.vision_model.config
+                if vision_model_config.cuda_graph_impl == "transformer_engine":
+                    vision_seq_length = get_vision_cuda_graph_seq_length(vision_model_config)
+                    vision_cuda_graph_helper = VisionTECudaGraphHelper(
+                        model=model,
+                        vision_config=vision_model_config,
+                        vision_seq_length=vision_seq_length,
+                        micro_batch_size=config.train.micro_batch_size,
+                        num_microbatches=get_num_microbatches(),
+                    )
+                    print_rank_0(
+                        f"Vision encoder CUDA graph enabled with seq_length={vision_seq_length}"
+                    )
+                    break
+        if vision_cuda_graph_helper is None:
+            print_rank_0("Warning: vision_cuda_graph_impl is 'transformer_engine' but no vision model with matching config found.")

src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/attention.py-16-16 (1)
16-16: 🛠️ Refactor suggestion | 🟠 Major

Replace wildcard import with explicit imports.

from megatron.core.transformer.attention import * makes every symbol in the file potentially undefined (dozens of F405 warnings from Ruff). Explicitly import the symbols you need (SelfAttention, Tensor, Optional, Union, Tuple, PackedSeqParams, BaseInferenceContext, deprecate_inference_params, HAVE_FA3, is_fa_min_version, nvtx_range_push, nvtx_range_pop, rearrange).

This improves readability, catches missing dependencies at import time, and satisfies the linter.
src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/vision_model.py-42-50 (1)
42-50: ⚠️ Potential issue | 🟠 Major

pg_collection defaults to None but is immediately dereferenced without a guard.

Line 42 declares pg_collection: ProcessGroupCollection = None, and line 50 unconditionally accesses self.pg_collection.tp. If pg_collection is None, this raises AttributeError. Either remove the None default (making it required) or add a null check.

Also, per PEP 484 / coding guidelines, use ProcessGroupCollection | None = None if you intend to keep the default.

Proposed fix (make it required)

- pg_collection: ProcessGroupCollection = None,
+ pg_collection: ProcessGroupCollection,

src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/transformer_config.py-63-64 (1)
63-64: 🛠️ Refactor suggestion | 🟠 Major

Add a docstring, type hints, and fix the blank line.

get_vision_model_config is a public function but lacks a docstring, a type annotation for hf_config, and a return type hint. Line 64 is an unnecessary blank line. As per coding guidelines, use Google-style docstrings and type hints for function arguments and return types.

Proposed fix (signature only)

-def get_vision_model_config(config: Qwen3VLTransformerConfig, hf_config):
-
+def get_vision_model_config(
+    config: Qwen3VLTransformerConfig, hf_config: "Qwen3VLVisionConfig"
+) -> Qwen3VLTransformerConfig:
+    """Build a vision-encoder TransformerConfig by overriding fields from the HF vision config.
+
+    Args:
+        config: Base transformer config (will be mutated in-place).
+        hf_config: HuggingFace vision config providing encoder hyperparameters.
+
+    Returns:
+        The mutated config with vision-encoder settings applied.
+    """

src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/model.py-513-513 (1)
513-513: ⚠️ Potential issue | 🟠 Major

Mutating shared state on rotary_pos_emb during forward pass.

self.language_model.rotary_pos_emb.is_thd_format = True is set conditionally during forward but never reset to False. If the same model instance processes a mix of packed and non-packed batches, the flag will remain True from a previous packed call, potentially corrupting rotary embeddings for subsequent non-packed forward passes.
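One way to keep the flag scoped to a single call, sketched under the assumption that it is only read inside the language model's forward:

# Sketch: make the THD flag call-scoped instead of sticky across batches.
rotary = self.language_model.rotary_pos_emb
previous_flag = getattr(rotary, "is_thd_format", False)
rotary.is_thd_format = packed_seq_params is not None
try:
    output = self.language_model(...)  # existing call, arguments unchanged
finally:
    rotary.is_thd_format = previous_flag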
src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/model.py-177-177 (1)

177-177: ⚠️ Potential issue | 🟠 Major

self.pg_collection is silently overwritten.

self.pg_collection is first assigned from the constructor argument at line 101, then overwritten here by get_pg_collection(self). If the intent is to always use the result of get_pg_collection, the earlier assignment (and the constructor parameter) is misleading. If the constructor value should be authoritative, this line is a bug. Please clarify the intent and remove the redundant assignment.

src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/model.py-77-77 (1)
77-77: ⚠️ Potential issue | 🟠 Major

pg_collection defaults to None but is accessed unconditionally — will raise AttributeError.

Line 77 declares pg_collection: ProcessGroupCollection = None, but lines 102–104 immediately do pg_collection.cp, .tp, .pp without a null check. Any caller that omits this argument (or passes None) will crash. Either remove the default (making the argument required) or add a guard.

Proposed fix (make required)

- pg_collection: ProcessGroupCollection = None,
+ pg_collection: ProcessGroupCollection,

Also applies to: 100-112
src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/model.py-136-136 (1)
136-136: 🛠️ Refactor suggestion | 🟠 Major

Use print_rank_0 instead of print.

As per coding guidelines for src/megatron/bridge/models/**/*.py, use print_rank_0 for logging in model bridge code.

Proposed fix

- print(f"rank {torch.distributed.get_rank()} use megatron vision model")
+ print_rank_0("use megatron vision model")

- print(f"rank {torch.distributed.get_rank()} use hf vision model")
+ print_rank_0("use hf vision model")

Import at the top:

from megatron.bridge.utils.common_utils import print_rank_0

As per coding guidelines: "Use 'print_rank_0' for logging in model bridge to avoid duplicate output across ranks."
Also applies to: 149-149
src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/model.py-381-394 (1)
381-394: ⚠️ Potential issue | 🟠 Major

Hardcoded torch.bfloat16 for zero-sized vision placeholders.

If the model is running in fp16 or fp32, these tensors will have a dtype mismatch with the rest of the computation. Use vision_data.dtype (or derive from config) instead.

Proposed fix

- device=vision_data.device,
- dtype=torch.bfloat16,
+ device=vision_data.device,
+ dtype=vision_data.dtype,

(apply to both occurrences on lines 384 and 392)
🟡 Minor comments (13)
src/megatron/bridge/training/qwen3vl_step.py-224-224 (1)
224-224: ⚠️ Potential issue | 🟡 Minor

Unused variable max_seqlen_in_batch.

max_seqlen_in_batch is assigned but never used (confirmed by both Ruff and Flake8). Remove it.

Fix

- max_seqlen_in_batch = seqlens_in_batch.max().item()

src/megatron/bridge/training/qwen3vl_step.py-136-141 (1)
136-141: ⚠️ Potential issue | 🟡 Minor

Potential AttributeError if visual_inputs is None.

Line 139 calls .normalized_for_model() on the result of batch.get("visual_inputs"). If the key exists but the value is None, this raises AttributeError. The "visual_inputs" in batch check (line 136) only verifies key presence, not that the value is non-None.

Suggested fix

- if "visual_inputs" in batch:
+ if batch.get("visual_inputs") is not None:
      # convert visual_inputs to multi_modal_inputs which is a dict contains "pixel_values" and "image_grid_thw"
      # TODO(jinliangl): add video support
      multi_modal_inputs = batch.get("visual_inputs").normalized_for_model()

src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/utils.py-242-242 (1)
242-242: ⚠️ Potential issue | 🟡 Minor

Typo in assertion message: "flaaten" → "flatten".

- assert input_ids.dim() == 1, "input_ids should be flaaten"
+ assert input_ids.dim() == 1, "input_ids should be flattened"

examples/recipes/qwen_vl/finetune_qwen_vl.py-106-108 (1)
106-108: ⚠️ Potential issue | 🟡 Minor

Unused import and incorrect import ordering.

functools.partial (line 108) is imported but never used (confirmed by Flake8 F401). Additionally, as a stdlib import, functools should be placed before third-party and first-party imports per the project's import ordering guidelines.

Remove the unused import:

Proposed fix

 from megatron.bridge.training.qwen3vl_step import forward_step
 from megatron.bridge.utils.common_utils import get_rank_safe
-from functools import partial

As per coding guidelines, imports should be organized: future imports, standard library, third-party, first-party, local folder imports, separated by blank lines.
examples/recipes/decentralized_pg/pretrain_qwen3_vl_simple.py-37-37 (1)
37-37: ⚠️ Potential issue | 🟡 Minor

Use qwen3vl_step.forward_step instead of vlm_step.forward_step.

The finetune_qwen_vl.py recipe was updated to import forward_step from megatron.bridge.training.qwen3vl_step, which includes a Qwen3-VL specific assertion and model-specific optimizations (custom sequence packing, position ID calculation, etc.). Since this file also targets Qwen3-VL (based on the config qwen3_vl_30b_a3b_pretrain_config), it should use the same qwen3vl_step.forward_step for consistency and to benefit from these optimizations.

src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/attention.py-127-127 (1)
127-127: ⚠️ Potential issue | 🟡 Minor

Remove extraneous f prefix from string literal (no placeholders).

Flagged by both Ruff (F541) and Flake8 (F541).

- raise ValueError(f"CUDA graphs must use flash decode with static batching!")
+ raise ValueError("CUDA graphs must use flash decode with static batching!")

src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/attention.py-57-57 (1)
57-57: ⚠️ Potential issue | 🟡 Minor

Typo in docstring: PackedSeqparams → PackedSeqParams.

- packed_seq_params (Optional[PackedSeqparams]): Parameters used for THD format.
+ packed_seq_params (Optional[PackedSeqParams]): Parameters used for THD format.

src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/transformer_config.py-20-21 (1)
20-21: ⚠️ Potential issue | 🟡 Minor

Remove unused torch import.

torch (line 20) is imported but unused, as flagged by Flake8 (F401). torch.nn.functional as F is used only at line 92 and is fine to keep.

Proposed fix

-import torch
-import torch.nn.functional as F
+import torch.nn.functional as F

src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/vision_model.py-1-24 (1)
1-24: ⚠️ Potential issue | 🟡 Minor

Missing NVIDIA copyright header and unused imports.

This new file lacks the required NVIDIA copyright header. As per coding guidelines, all Python files must have the NVIDIA copyright header at the top.

Additionally, Flake8 flags several unused imports: dataclass, Union, MultimodalProjector, MegatronModule, build_module, get_tensor_model_parallel_group_if_none. Remove them to keep the import section clean.

src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/transformer_block.py-56-95 (1)
56-95: ⚠️ Potential issue | 🟡 Minor

Remove unused group attributes cp_group and tp_group from __init__.

Lines 81-82 assign self.cp_group and self.tp_group, but neither is used anywhere in this class. Only self.pp_group (line 83) is referenced, on line 424 for self.pp_group.rank(). Remove the unused assignments to reduce clutter.

src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/vision_model.py-248-264 (1)
248-264: ⚠️ Potential issue | 🟡 Minor

Validate max_vision_cuda_graph_seq_length alignment for TransformerEngine compatibility.

The overflow check at lines 250-254 is good. However, the code does not validate that max_vision_cuda_graph_seq_length itself meets TransformerEngine's alignment constraints. TransformerEngine's FusedAttention backend requires sequence length to be divisible by 64 (and ideally by 128 for optimal kernel dispatch), but there is no validation that the configured or calculated max_seq_len satisfies this requirement. If a user sets max_vision_cuda_graph_seq_length to a misaligned value, the downstream transformer layers may fail or experience severe performance degradation.

Add alignment validation in _get_max_vision_seq_length() or in the config initialization to ensure max_vision_cuda_graph_seq_length % 64 == 0 (or round up to the nearest multiple of 128, consistent with other parts of the codebase). Alternatively, document this constraint clearly in the config field docstring.
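An illustrative helper for the suggested validation; the function name and the rounding-versus-raising choice are assumptions, not part of the PR.

def _align_vision_seq_length(max_seq_len: int, multiple: int = 64) -> int:
    """Round the configured vision CUDA graph sequence length up to a TE-friendly multiple."""
    if max_seq_len % multiple == 0:
        return max_seq_len
    return ((max_seq_len // multiple) + 1) * multiple

assert _align_vision_seq_length(1000) == 1024
assert _align_vision_seq_length(1024) == 1024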
src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/model.py-17-17 (1)

17-17: ⚠️ Potential issue | 🟡 Minor

Unused import mpu.

Static analysis confirms mpu is imported but never used. Remove it.

Proposed fix

-from megatron.core import InferenceParams, mpu, tensor_parallel
+from megatron.core import InferenceParams, tensor_parallel

src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/model.py-305-306 (1)
305-306: ⚠️ Potential issue | 🟡 Minor

Use explicit | None for nullable type hints.

Static analysis (RUF013) flags these. Per coding guidelines, use T | None instead of implicit Optional.

Proposed fix

- cp_img_num: list[int] = None,
- images_padded: list[bool] = None,
+ cp_img_num: list[int] | None = None,
+ images_padded: list[bool] | None = None,

As per coding guidelines: "Use 'T | None' for nullable types instead of 'Optional[T]'."
🧹 Nitpick comments (25)
src/megatron/bridge/training/vlm_step.py (1)
124-125: Hardcoded is_first = True / is_last = True bypasses pipeline-parallel stage detection.

is_pp_first_stage / is_pp_last_stage are imported but no longer called. With both flags forced to True, every PP stage now loads labels and loss_mask from the iterator and always pads to cfg.model.seq_length. If this is intentional for CUDA Graph fixed-shape requirements, please add a brief comment explaining why, and consider removing the now-unused is_pp_first_stage / is_pp_last_stage imports from line 21 to avoid confusion.

src/megatron/bridge/training/train.py (2)
1363-1366: Type hint for cuda_graph_helper should be TECudaGraphHelper | None.

The parameter can be None (initialized as None on line 241 and passed directly on line 592). The current annotation TECudaGraphHelper doesn't reflect this, though the body does handle None at line 1387.

Fix

 def _delete_cuda_graphs(
-    cuda_graph_helper: TECudaGraphHelper,
+    cuda_graph_helper: TECudaGraphHelper | None,
     vision_cuda_graph_helper: Optional[VisionTECudaGraphHelper] = None,
 ):

Also, per coding guidelines, prefer VisionTECudaGraphHelper | None over Optional[VisionTECudaGraphHelper] for consistency with the project's Python 3.10+ target.
366-374: Vision CUDA graph warmup uses language-model cuda_graph_warmup_steps.

The vision graph capture condition (line 370) reuses model_config.cuda_graph_warmup_steps, which is the language model's warmup setting. If the vision encoder requires a different warmup period (or if the LM has CUDA graphs disabled but vision has them enabled), this shared condition may not trigger correctly. Consider whether a separate vision_cuda_graph_warmup_steps config is needed.

src/megatron/bridge/training/utils/packed_seq_utils.py (2)
77-171: Significant code duplication with src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/utils.py.

Both preprocess_packed_seqs and postprocess_packed_seqs are near-identical copies of the implementations in the qwen_vl module (see relevant snippets). The only meaningful difference is the use_fp8_padding parameter and the FP8 padding logic (lines 97–115). Consider extracting a shared implementation to avoid drift between the two copies.

Also, the postprocess_packed_seqs docstring (line 184) states "This function is not used for now" — if unused, consider deferring its addition until it's needed (YAGNI).

78-78: Missing type hint for use_fp8_padding parameter.

-def preprocess_packed_seqs(
-    input_ids: torch.Tensor, attention_mask: torch.Tensor, pre_process: bool = True, use_fp8_padding=False
-) -> tuple[torch.Tensor, PackedSeqParams]:
+def preprocess_packed_seqs(
+    input_ids: torch.Tensor, attention_mask: torch.Tensor, pre_process: bool = True, use_fp8_padding: bool = False
+) -> tuple[torch.Tensor, PackedSeqParams]:

As per coding guidelines: "Use type hints for function arguments and return types."
src/megatron/bridge/training/qwen3vl_step.py (3)
266-267: assert is_qwen3vl makes subsequent is_qwen3vl checks redundant.

Line 267 asserts the model is Qwen3-VL, guaranteeing is_qwen3vl is always True past this point. The later checks at lines 289 and 326 are therefore dead conditions. Either remove the assert and keep the conditional logic (to support future models), or remove the redundant checks.

154-163: Missing type hint for this_pg_collection parameter.

 def pack_or_pad_batch_sequences(
     tokens: torch.Tensor,
     labels: torch.Tensor,
     loss_mask: torch.Tensor,
     attention_mask: torch.Tensor,
     position_ids: torch.Tensor,
-    this_pg_collection,
+    this_pg_collection: ProcessGroupCollection,
     use_fp8_padding: bool = False,
     data_format: str = "bshd",
 ) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, PackedSeqParams]:

You'll need to add the ProcessGroupCollection import from megatron.core.process_groups_config. As per coding guidelines: "Use type hints for function arguments and return types."

14-35: Import ordering: megatron.core.packed_seq_params import is misplaced between Bridge imports.

Line 28 (from megatron.core.packed_seq_params import PackedSeqParams) is between Bridge imports (lines 24-27 and lines 29-35). It should be grouped with the other megatron.core imports (lines 20-22).

Reordered imports

 import torch

 from megatron.core.models.gpt import GPTModel
 from megatron.core.pipeline_parallel.utils import is_pp_first_stage, is_pp_last_stage
 from megatron.core.utils import get_batch_on_this_cp_rank, get_model_config, get_thd_batch_on_this_cp_rank
+from megatron.core.packed_seq_params import PackedSeqParams

 from megatron.bridge.training.config import ConfigContainer
 from megatron.bridge.training.losses import (
     create_masked_next_token_loss_function as _create_loss_function,
 )
-from megatron.core.packed_seq_params import PackedSeqParams
 from megatron.bridge.training.state import GlobalState
348-349: Useraise AssertionErrorinstead ofassert False.
assert Falseis stripped bypython -O. Since this is a reachable error path (unexpected token_id), use an explicit raise.- assert False, f"should not have {token_id=}" + raise AssertionError(f"Unexpected token_id: {token_id}")
16-17: Use modern union syntax per coding guidelines.The guidelines require
T | Noneinstead ofOptional[T]andX | Yinstead ofUnion[X, Y].Proposed fix
-from dataclasses import dataclass -from typing import Optional, Union +from __future__ import annotations + +from dataclasses import dataclassThen update usages (e.g., line 92-94, line 103, line 365) from
Optional[...]/Union[...]to... | None/X | Y.As per coding guidelines: "Use 'T | None' for nullable types instead of 'Optional[T]'" and "Use 'X | Y' for union types instead of 'Union[X, Y]'".
493-493: Change@torch.no_gradto@torch.no_grad().The documented and correct form in PyTorch is
@torch.no_grad()with parentheses. This creates and applies theno_gradcontext manager as a decorator, which is the official API convention shown in the PyTorch documentation.src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/transformer_config.py (2)
115-132:hasattrchecks are fragile on a dataclass — prefergetattrwith defaults.
configis aQwen3VLTransformerConfigdataclass, so usinghasattris unnecessary if the fields exist on the class, or fragile if they're dynamically set upstream. Consider usinggetattr(config, 'vision_cuda_graph_impl', 'none')instead, and similarly forvision_cuda_graph_scope. This also avoids potentialAttributeErrorif the field is absent while keeping the same fallback behavior.Also, per coding guidelines, use double quotes for strings (e.g.,
"none"instead of'none','vision_cuda_graph_impl'→"vision_cuda_graph_impl").Proposed refactor
- if hasattr(config, 'vision_cuda_graph_impl') and config.vision_cuda_graph_impl != "none": - config.cuda_graph_impl = config.vision_cuda_graph_impl - if hasattr(config, 'vision_cuda_graph_scope') and config.vision_cuda_graph_scope: - # Convert string scope list to CudaGraphScope enums if needed - from megatron.core.transformer.cuda_graphs import CudaGraphScope - scope_list = config.vision_cuda_graph_scope - if scope_list and isinstance(scope_list[0], str): - config.cuda_graph_scope = [CudaGraphScope[scope] for scope in scope_list] - else: - config.cuda_graph_scope = scope_list - else: - config.cuda_graph_scope = [] - else: - config.cuda_graph_impl = "none" - config.cuda_graph_scope = [] + from megatron.core.transformer.cuda_graphs import CudaGraphScope + + vision_impl = getattr(config, "vision_cuda_graph_impl", "none") + if vision_impl != "none": + config.cuda_graph_impl = vision_impl + scope_list = getattr(config, "vision_cuda_graph_scope", []) + if scope_list and isinstance(scope_list[0], str): + config.cuda_graph_scope = [CudaGraphScope[scope] for scope in scope_list] + else: + config.cuda_graph_scope = scope_list if scope_list else [] + else: + config.cuda_graph_impl = "none" + config.cuda_graph_scope = []
15-18: UseT | Noneinstead ofOptional[T].Per coding guidelines, prefer
T | NoneoverOptional[T]andX | YoverUnion[X, Y](Python 3.10+). TheOptionalimport on line 18 and its usage on line 60 should be updated.Proposed fix
-from typing import List, Optional +from typing import ListAnd at line 60:
- max_vision_cuda_graph_seq_length: Optional[int] = None + max_vision_cuda_graph_seq_length: int | None = Nonesrc/megatron/bridge/models/qwen_vl/qwen3_vl_bridge.py (1)
168-168: Double hash-comment prefix (# #) looks like an editing artifact.Line 168 uses
# #before "vision module deepstack merger". If this is intentional (e.g., a section sub-header), a single#with the comment text is clearer.- # # vision module deepstack merger + # vision module deepstack mergersrc/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/rope.py (2)
177-178: Variableinput_idsis shadowed inside the loop.The outer
input_idsparameter (line 135) is stored astotal_input_idson line 165, but theforloop at line 177 reusesinput_idsas the loop variable. Then at line 178,input_idsis further reassigned to a masked slice. This means at line 241,input_ids.devicerefers to the last loop iteration's masked tensor, not the originalinput_idsargument — which happens to work but is fragile and confusing.Rename the loop variable to avoid shadowing (e.g.,
sample_ids).
54-54: Use explicit| Noneannotation instead of implicitOptional.Per coding guidelines: use
T | Nonefor nullable types. Also, PEP 484 prohibitsparam: SomeType = Nonewithout an explicit optional annotation.- cp_group: torch.distributed.ProcessGroup = None, + cp_group: torch.distributed.ProcessGroup | None = None,src/megatron/bridge/models/qwen_vl/qwen3_vl_provider.py (2)
45-45: Import ordering: from copy import deepcopy should be in the standard library group.

Per coding guidelines, organize imports as: future → standard library → third-party → first-party → local. copy is a standard library module but is placed after first-party imports.
146-151: Mutating self (the provider) when converting cuda_graph_scope — side effect on repeated calls.

language_transformer_config is self (line 144/339). Converting cuda_graph_scope from strings to enums mutates the provider's own field. If provide() is called a second time, isinstance(...[0], str) will be False and the conversion is skipped, which is harmless. However, mutating the provider's state is a subtle side effect. Consider operating on a copy or doing this conversion in finalize() instead.

src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/vision_model.py (2)
314-318: torch.split + torch.cat along the same dimension is a no-op.

Lines 315-317 split hidden_states by per-grid sizes and immediately concatenate them back along dim=0. This produces the same tensor as the input. If the intent is to validate that sizes sum correctly, an assert would be clearer. If it's prep for future per-grid processing, a comment explaining the intent would help.
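If the validation reading is correct, a sketch of the assert-based replacement; how per-grid token counts are derived from grid_thw is an assumption about the surrounding code.

per_grid_sizes = (grid_thw[:, 0] * grid_thw[:, 1] * grid_thw[:, 2]).tolist()  # assumed size source
assert sum(per_grid_sizes) == hidden_states.shape[0], "per-grid sizes must cover all vision tokens"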
134-171: Loop variable t unused at line 140; zip() calls lack strict= parameter.

Ruff flags t as unused (B007) at line 140. If the temporal dimension isn't needed in the interpolation loop body, use _ instead:

- for t, h, w in zip(grid_ts, grid_hs, grid_ws):
+ for _, h, w in zip(grid_ts, grid_hs, grid_ws, strict=True):

Adding strict=True to zip() calls (lines 140, 182, 186) catches mismatched lengths early.

src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/transformer_block.py (2)
124-131: CUDA graph detection logic is duplicated between checkpointed and non-checkpointed paths.

The layer_uses_te_cudagraph check (lines 124-130 and 339-345) is identical in both the _checkpointed_forward and the non-checkpointed forward paths. Extract this into a small helper method to reduce duplication and ensure both paths stay in sync.

Proposed helper

@staticmethod
def _layer_uses_te_cudagraph(layer) -> bool:
    """Check if a transformer layer will use TE CUDA graph replay."""
    return (
        hasattr(layer, "cuda_graphs")
        and layer.cuda_graphs
        and layer.training
        and hasattr(layer, "config")
        and getattr(layer.config, "cuda_graph_impl", "none") == "transformer_engine"
    )

Also applies to: 339-346
69-69: Duplicate assertion: assert post_process and pre_process exists here and in Qwen3VLVisionModel.__init__ (vision_model.py line 44).

The same guard is present in both the vision model and its inner transformer block. Having it in both places is defensive but redundant — the vision model already validates this before constructing the block. Consider removing the inner one or adding a comment explaining why it's repeated.
src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/model.py (3)
124-124: Line exceeds 119-character limit.

This line is well over the configured max line length.

Proposed fix

- megatron_vision_transformer_config = get_vision_model_config(copy(language_transformer_config), vision_transformer_config)
+ megatron_vision_transformer_config = get_vision_model_config(
+     copy(language_transformer_config), vision_transformer_config
+ )

As per coding guidelines: "Maximum line length is 119 characters (matching ruff configuration)."
24-39: Import ordering: third-party transformers import is interleaved with first-party imports.

Per coding guidelines, imports should be organized: standard library → third-party (including megatron.core, torch, transformers) → first-party (megatron.bridge.*) → local, separated by blank lines. The transformers import on line 25 is sandwiched between megatron.bridge imports.

Proposed fix (reorder)

 from megatron.bridge.models.qwen_vl.modelling_qwen3_vl.text_model import Qwen3VLGPTModel
-from transformers.models.qwen3_vl.configuration_qwen3_vl import Qwen3VLConfig as Qwen3VLConfigHF
 from megatron.bridge.models.qwen_vl.modelling_qwen3_vl.transformer_config import Qwen3VLTransformerConfig

Move the transformers import up to the third-party block (after the megatron.core imports):

from transformers.models.qwen3_vl.configuration_qwen3_vl import Qwen3VLConfig as Qwen3VLConfigHF
310-315: Informal workaround notes in the public docstring.

These lines read like internal developer notes (e.g., "there is a workaround", "cooperate with verl's models/mcore/model_forward.py") rather than proper API documentation. Either move them to inline comments within the method body or rewrite as a proper docstring note.
# Cleanup
if torch.distributed.is_initialized():
    torch.distributed.barrier()
    torch.distributed.destroy_process_group()
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# First, find the pretrain function to understand its cleanup behavior
fd '.*pretrain.*\.py' | head -20

Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 1262
🏁 Script executed:
#!/bin/bash
# Check for destroy_process_group in pretrain-related files
rg -n "destroy_process_group|destroy_model_parallel" --type=py -C3 -g '*pretrain*'Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 8630
🏁 Script executed:
#!/bin/bash
# Also check the specific file mentioned in the review
cat -n examples/recipes/decentralized_pg/pretrain_qwen3_vl_simple.py | head -100

Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 3428
Remove manual process group cleanup; pretrain() handles this internally.
The pretrain() function already manages distributed cleanup via _maybe_destroy_process_group(), which destroys the process group only if it was created by pretrain() itself (lines 111, 191 in src/megatron/bridge/training/pretrain.py). The manual barrier() + destroy_process_group() call will cause a runtime error in the typical scenario where the process group is not initialized before pretrain() is called, since pretrain() will have already destroyed it upon return.
Delete lines 68-71 (or the equivalent cleanup block in similar example files).
🤖 Prompt for AI Agents
In `@examples/recipes/decentralized_pg/pretrain_qwen3_vl_simple.py` around lines
68 - 71, Remove the manual distributed cleanup block that calls
torch.distributed.is_initialized(), torch.distributed.barrier(), and
torch.distributed.destroy_process_group() in the example; pretrain() already
performs cleanup via _maybe_destroy_process_group(), so delete the
barrier/destroy_process_group block (the lines around the manual cleanup) to
avoid double-destroy/runtime errors when pretrain() has already torn down the
process group.
def _set_vision_cuda_graph_overrides(
    recipe: ConfigContainer,
    vision_cuda_graph_impl: Optional[str] = None,
    vision_cuda_graph_scope: Optional[str | List[str]] = None,
) -> ConfigContainer:
    """Set the vision encoder CUDA graph overrides.

    This enables TE CUDA graph for the vision encoder separately from the language model.

    Args:
        recipe: The config container
        vision_cuda_graph_impl: Vision encoder CUDA graph implementation ("none" or "transformer_engine")
        vision_cuda_graph_scope: Vision encoder CUDA graph scope (e.g., ["attn"])

    Returns:
        Updated config container
    """
    if isinstance(vision_cuda_graph_scope, str):
        vision_cuda_graph_scope = [vision_cuda_graph_scope]

    if vision_cuda_graph_impl is not None:
        recipe.model.vision_cuda_graph_impl = vision_cuda_graph_impl

        if vision_cuda_graph_impl == "transformer_engine":
            # Ensure TE RNG tracker is enabled for CUDA graph compatibility
            recipe.rng.te_rng_tracker = recipe.model.use_te_rng_tracker = True

    valid_te_scopes = ["attn", "mlp"]  # Vision encoder typically only has attn and mlp
    if vision_cuda_graph_scope:
        assert all(scope in valid_te_scopes for scope in vision_cuda_graph_scope), (
            f"Invalid vision cuda graph scope: {vision_cuda_graph_scope}. "
            f"Valid options for vision encoder are: {valid_te_scopes}"
        )

    if vision_cuda_graph_scope is not None:
        recipe.model.vision_cuda_graph_scope = vision_cuda_graph_scope

    return recipe
🧩 Analysis chain
🏁 Script executed:
# Check if _set_vision_cuda_graph_overrides is called anywhere in the codebase
rg -n '_set_vision_cuda_graph_overrides' --type=py

Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 151
🏁 Script executed:
# Also check the file to see the full context of both functions and their calls
cat -n scripts/performance/utils/overrides.py | head -270

Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 13292
_set_vision_cuda_graph_overrides is defined but never called and lacks critical handling for the "none" case.
The function is not integrated into set_workload_base_configs (line 202) or set_user_overrides (line 252), so vision CUDA graph overrides will not be applied. Compare with _set_cuda_graph_overrides (called at lines 215 and 255).
Additionally, the function is missing the reset logic for vision_cuda_graph_impl == "none". In _set_cuda_graph_overrides (lines 115–116), when the implementation is set to "none", the RNG tracker is explicitly disabled. Without this in the vision variant, stale RNG tracker state could persist if the vision graph is disabled after being enabled.
🤖 Prompt for AI Agents
In `@scripts/performance/utils/overrides.py` around lines 130 - 167, The
_set_vision_cuda_graph_overrides function is never invoked and also lacks
resetting behavior for the "none" case; to fix, call
_set_vision_cuda_graph_overrides from the same two places where
_set_cuda_graph_overrides is called (inside set_workload_base_configs and
set_user_overrides) so vision overrides are applied, and add explicit handling
inside _set_vision_cuda_graph_overrides for vision_cuda_graph_impl == "none" to
disable TE RNG tracking (set recipe.rng.te_rng_tracker =
recipe.model.use_te_rng_tracker = False) mirroring the reset logic in
_set_cuda_graph_overrides; ensure the function still validates
vision_cuda_graph_scope against valid_te_scopes and sets
recipe.model.vision_cuda_graph_scope when provided.
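A sketch of the missing call site and the "none" reset, mirroring the existing _set_cuda_graph_overrides wiring; the kwarg names below are assumptions about the user-override plumbing.

# Inside set_workload_base_configs / set_user_overrides, next to the existing language-model call:
recipe = _set_vision_cuda_graph_overrides(
    recipe,
    vision_cuda_graph_impl=kwargs.get("vision_cuda_graph_impl"),
    vision_cuda_graph_scope=kwargs.get("vision_cuda_graph_scope"),
)

# And inside _set_vision_cuda_graph_overrides, add the explicit reset path:
#     elif vision_cuda_graph_impl == "none":
#         recipe.rng.te_rng_tracker = recipe.model.use_te_rng_tracker = False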
# Expose position_embedding_type, rotary_pos_emb, and decoder for CUDA graph helper compatibility
# The CUDA graph helper expects model.position_embedding_type, model.rotary_pos_emb, and model.decoder,
# but in Qwen3VL these are nested under language_model. This provides direct access.
# Expose these attributes for CUDA graph helper compatibility only when CUDA graph is enabled
cuda_graph_enabled = getattr(self.language_model.config, "cuda_graph_impl", "none") != "none"
if cuda_graph_enabled:
    self.position_embedding_type = self.language_model.position_embedding_type
    self.rotary_pos_emb = self.language_model.rotary_pos_emb
    self.decoder = self.language_model.decoder

# Expose position_embedding_type, rotary_pos_emb, and decoder for CUDA graph helper compatibility
# The CUDA graph helper expects model.position_embedding_type, model.rotary_pos_emb, and model.decoder,
# but in Qwen3VL these are nested under language_model. This provides direct access.
# Expose these attributes for CUDA graph helper compatibility only when CUDA graph is enabled
cuda_graph_enabled = getattr(self.language_model.config, "cuda_graph_impl", "none") != "none"
if cuda_graph_enabled:
    self.position_embedding_type = self.language_model.position_embedding_type
    self.rotary_pos_emb = self.language_model.rotary_pos_emb
    self.decoder = self.language_model.decoder

# Expose position_embedding_type, rotary_pos_emb, and decoder for CUDA graph helper compatibility
# The CUDA graph helper expects model.position_embedding_type, model.rotary_pos_emb, and model.decoder,
# but in Qwen3VL these are nested under language_model. This provides direct access.
# Expose these attributes for CUDA graph helper compatibility only when CUDA graph is enabled
cuda_graph_enabled = getattr(self.language_model.config, "cuda_graph_impl", "none") != "none"
if cuda_graph_enabled:
    self.position_embedding_type = self.language_model.position_embedding_type
    self.rotary_pos_emb = self.language_model.rotary_pos_emb
    self.decoder = self.language_model.decoder
Copy-paste bug: the CUDA graph compatibility block is duplicated three times.
Lines 178–186, 188–196, and 198–206 are identical. This is clearly an accidental triplication. Keep only one copy.
Proposed fix
self.pg_collection = get_pg_collection(self)
# Expose position_embedding_type, rotary_pos_emb, and decoder for CUDA graph helper compatibility
# The CUDA graph helper expects model.position_embedding_type, model.rotary_pos_emb, and model.decoder,
# but in Qwen3VL these are nested under language_model. This provides direct access.
# Expose these attributes for CUDA graph helper compatibility only when CUDA graph is enabled
cuda_graph_enabled = getattr(self.language_model.config, "cuda_graph_impl", "none") != "none"
if cuda_graph_enabled:
self.position_embedding_type = self.language_model.position_embedding_type
self.rotary_pos_emb = self.language_model.rotary_pos_emb
self.decoder = self.language_model.decoder
- # Expose position_embedding_type, rotary_pos_emb, and decoder for CUDA graph helper compatibility
- # The CUDA graph helper expects model.position_embedding_type, model.rotary_pos_emb, and model.decoder,
- # but in Qwen3VL these are nested under language_model. This provides direct access.
- # Expose these attributes for CUDA graph helper compatibility only when CUDA graph is enabled
- cuda_graph_enabled = getattr(self.language_model.config, "cuda_graph_impl", "none") != "none"
- if cuda_graph_enabled:
- self.position_embedding_type = self.language_model.position_embedding_type
- self.rotary_pos_emb = self.language_model.rotary_pos_emb
- self.decoder = self.language_model.decoder
-
- # Expose position_embedding_type, rotary_pos_emb, and decoder for CUDA graph helper compatibility
- # The CUDA graph helper expects model.position_embedding_type, model.rotary_pos_emb, and model.decoder,
- # but in Qwen3VL these are nested under language_model. This provides direct access.
- # Expose these attributes for CUDA graph helper compatibility only when CUDA graph is enabled
- cuda_graph_enabled = getattr(self.language_model.config, "cuda_graph_impl", "none") != "none"
- if cuda_graph_enabled:
- self.position_embedding_type = self.language_model.position_embedding_type
- self.rotary_pos_emb = self.language_model.rotary_pos_emb
- self.decoder = self.language_model.decoder

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
self.pg_collection = get_pg_collection(self)

# Expose position_embedding_type, rotary_pos_emb, and decoder for CUDA graph helper compatibility
# The CUDA graph helper expects model.position_embedding_type, model.rotary_pos_emb, and model.decoder,
# but in Qwen3VL these are nested under language_model. This provides direct access.
# Expose these attributes for CUDA graph helper compatibility only when CUDA graph is enabled
cuda_graph_enabled = getattr(self.language_model.config, "cuda_graph_impl", "none") != "none"
if cuda_graph_enabled:
    self.position_embedding_type = self.language_model.position_embedding_type
    self.rotary_pos_emb = self.language_model.rotary_pos_emb
    self.decoder = self.language_model.decoder
🤖 Prompt for AI Agents
In `@src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/model.py` around lines
178 - 206, Remove the duplicated CUDA graph compatibility blocks and keep a
single instance that checks getattr(self.language_model.config,
"cuda_graph_impl", "none") != "none" and, if true, assigns
self.position_embedding_type = self.language_model.position_embedding_type,
self.rotary_pos_emb = self.language_model.rotary_pos_emb, and self.decoder =
self.language_model.decoder; locate the repeated blocks around the class that
references self.language_model and consolidate them into one occurrence to
eliminate the triplication.
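If the consolidation is applied, one way to keep the block from reappearing is to factor the exposure into a small helper that the constructor calls exactly once after self.language_model is built. This is a sketch, not code from the PR; the helper name _expose_cuda_graph_attrs is made up, while the attribute names follow the snippet above.

def _expose_cuda_graph_attrs(self) -> None:
    """Mirror language_model attributes onto the wrapper for the CUDA graph helper.

    The helper reads model.position_embedding_type, model.rotary_pos_emb, and
    model.decoder; in Qwen3VL these live on self.language_model, so copy the
    references once, and only when CUDA graphs are enabled.
    """
    if getattr(self.language_model.config, "cuda_graph_impl", "none") == "none":
        return
    self.position_embedding_type = self.language_model.position_embedding_type
    self.rotary_pos_emb = self.language_model.rotary_pos_emb
    self.decoder = self.language_model.decoder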
    # shape (seq_length, bs, 1, 2 * dim)
    emb = emb[..., None, :].transpose(0, 1).contiguous()
    _ = packed_seq_params  # packed sequences not supported yet
    if self.cp_group.size() > 1 and not self.is_thd_format:
        # slice rotary_pos_emb along sequence dimension and select the parition of the current
        # CP rank
        emb = get_pos_emb_on_this_cp_rank(emb, 0, self.cp_group)
    return emb
self.cp_group.size() will raise AttributeError when cp_group is None.
The constructor default for cp_group is None (line 54), but line 122 unconditionally calls self.cp_group.size(). If the caller doesn't supply a process group, this will crash at runtime.
Proposed fix
- if self.cp_group.size() > 1 and not self.is_thd_format:
+ if self.cp_group is not None and self.cp_group.size() > 1 and not self.is_thd_format:

🤖 Prompt for AI Agents
In `@src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/rope.py` around lines
120 - 126, The code calls self.cp_group.size() without ensuring cp_group is set;
change the conditional in the method that manipulates emb to first check that
self.cp_group is not None (e.g., self.cp_group is not None and
self.cp_group.size() > 1 and not self.is_thd_format) so you only call size()
when cp_group exists, and then call get_pos_emb_on_this_cp_rank(emb, 0,
self.cp_group) as before; reference symbols: self.cp_group, is_thd_format, emb,
get_pos_emb_on_this_cp_rank.
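A minimal sketch of the guarded path, assuming the attributes shown in the snippet above (cp_group, is_thd_format, get_pos_emb_on_this_cp_rank). Treating a missing group as context-parallel size 1 is an assumption, not behavior confirmed by the PR.

def _cp_size(cp_group) -> int:
    # A missing context-parallel group is treated as a single-rank group (assumption).
    return cp_group.size() if cp_group is not None else 1

# inside forward(), replacing the unguarded check
if _cp_size(self.cp_group) > 1 and not self.is_thd_format:
    # slice rotary_pos_emb along the sequence dimension for this CP rank
    emb = get_pos_emb_on_this_cp_rank(emb, 0, self.cp_group)
return emb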
def apply_rotary_pos_emb_thd_absolute(
    t: Tensor, cu_seqlens: Tensor, freqs: Tensor, rotary_interleaved: bool = False
) -> Tensor:
    """A baseline implementation of applying RoPE for `thd` format.

    Args:
        t (Tensor): Input tensor T is of shape [t, h, d]
        cu_seqlens(Tensor): Cumulative sum of sequence lengths in a batch for `t`,
            with shape [b + 1] and dtype torch.int32.
        freqs (Tensor): Rotary Positional embedding tensor freq is of shape [max_s, 1, 1, d]

    Returns:
        Tensor: Shape [t, h, d]. The input tensor after applying RoPE.
    """
    return _apply_rotary_pos_emb_bshd(t[:, None], freqs, rotary_interleaved=rotary_interleaved).squeeze(1)
🧩 Analysis chain (scripts executed against NVIDIA-NeMo/Megatron-Bridge)
- Read src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/rope.py: the apply_rotary_pos_emb_thd_absolute body, the surrounding apply_rotary_pos_emb_absolute routing logic, the file's imports, and its THD/cu_seqlens references.
- Searched the codebase for _apply_rotary_pos_emb_bshd, other apply_rotary_pos_emb* variants, and any THD-aware RoPE implementation in megatron.core.
- Reviewed the call sites in modelling_qwen3_vl/attention.py that pass packed_seq_params/cu_seqlens, and checked rope.py for TODO/FIXME markers.
cu_seqlens parameter is accepted but never used — THD RoPE will be incorrect for packed sequences.
apply_rotary_pos_emb_thd_absolute accepts cu_seqlens but completely ignores it, simply calling _apply_rotary_pos_emb_bshd(t[:, None], freqs, rotary_interleaved=rotary_interleaved).squeeze(1). The function description labels this a "baseline implementation," indicating incomplete work.
For THD (packed sequence) format, RoPE must apply positional embeddings relative to each sequence's position within its boundaries, not globally across all tokens. The current implementation treats the entire packed input as a single contiguous sequence, violating correctness for multi-sequence batches. Ruff flags cu_seqlens as unused (ARG001).
Either implement proper THD-aware RoPE using cu_seqlens to compute per-sequence position offsets, or remove the parameter and rename the function to clarify it only handles single-sequence batches.
🧰 Tools
🪛 Ruff (0.14.14)
[warning] 266-266: Unused function argument: cu_seqlens
(ARG001)
🤖 Prompt for AI Agents
In `@src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/rope.py` around lines
265 - 279, The function apply_rotary_pos_emb_thd_absolute currently ignores the
cu_seqlens argument causing incorrect RoPE for packed THD inputs; update it to
compute per-sequence absolute token positions from cu_seqlens and apply rotary
embeddings per segment (e.g., iterate or vectorize over batches using cu_seqlens
to slice t and select matching freqs offsets) before concatenating results,
using _apply_rotary_pos_emb_bshd on each sequence chunk (or a vectorized
equivalent) so positions reset per sequence; alternatively if you intend only
single-sequence use remove the cu_seqlens parameter and rename the function to
reflect single-sequence behavior.
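A minimal sketch of a cu_seqlens-aware variant, assuming (as the finding above argues) that positions should restart at zero for each packed sequence. It reuses the same _apply_rotary_pos_emb_bshd helper that the existing baseline in rope.py already calls, applying it per segment; the Python loop and the .tolist() sync are simplifications, not a tuned implementation.

import torch
from torch import Tensor


def apply_rotary_pos_emb_thd_absolute(
    t: Tensor, cu_seqlens: Tensor, freqs: Tensor, rotary_interleaved: bool = False
) -> Tensor:
    """Apply RoPE to a packed [t, h, d] buffer, restarting positions per sequence.

    Assumes _apply_rotary_pos_emb_bshd is the module-level helper already used by
    the baseline implementation in rope.py.
    """
    out = torch.empty_like(t)
    boundaries = cu_seqlens.tolist()  # [b + 1] cumulative lengths; forces a GPU->CPU sync
    for start, end in zip(boundaries[:-1], boundaries[1:]):
        seg = t[start:end, None]              # view one sequence as [s, 1, h, d] (seq-first, batch of 1)
        seg_freqs = freqs[: end - start]      # rotary frequencies for positions 0 .. s-1
        out[start:end] = _apply_rotary_pos_emb_bshd(
            seg, seg_freqs, rotary_interleaved=rotary_interleaved
        ).squeeze(1)
    return out

A vectorized alternative would compute each token's position as its index minus the cu_seqlens entry of its segment and gather from freqs once, avoiding the per-sequence Python loop.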
vision_transformer_config = get_vision_model_config(deepcopy(language_transformer_config), hf_vision_config)
vision_transformer_config.pipeline_model_parallel_size = 1
vision_transformer_config.first_pipeline_num_layers = None

# Apply vision encoder CUDA graph settings
vision_transformer_config.cuda_graph_impl = self.vision_cuda_graph_impl
# Convert string scope list to CudaGraphScope enums
vision_transformer_config.cuda_graph_scope = _convert_cuda_graph_scope_to_enum(
    self.vision_cuda_graph_scope
) if self.vision_cuda_graph_scope else []
# Set max sequence length for vision CUDA graphs
vision_transformer_config.max_vision_cuda_graph_seq_length = self.max_vision_cuda_graph_seq_length

# Spec for the Qwen3VLTransformerLayer
language_transformer_layer_spec = get_gpt_layer_with_transformer_engine_spec(
    num_experts=None,
    moe_grouped_gemm=False,
    qk_layernorm=self.qk_layernorm,
    fp8=False,
)
vision_transformer_layer_spec = get_vit_layer_with_transformer_engine_spec()
vision_patch_merger_spec = PatchMergerSubmodules(
    patch_norm=TENorm,
    linear_fc1=TEColumnParallelLinear,
    linear_fc2=TERowParallelLinear,
)

model = Qwen3VLModel(
    language_transformer_config=language_transformer_config,
    language_transformer_layer_spec=language_transformer_layer_spec,
    vision_transformer_config=hf_vision_config,
    vision_transformer_layer_spec=vision_transformer_layer_spec,
    vision_patch_merger_spec=vision_patch_merger_spec,
    pre_process=pre_process,
    post_process=post_process,
    pg_collection=self._pg_collection,
)
🧩 Analysis chain (scripts executed against NVIDIA-NeMo/Megatron-Bridge)
- Read src/megatron/bridge/models/qwen_vl/qwen3_vl_provider.py around lines 150-195 (dense provider) and 340-390 (MoE provider), including both Qwen3VLModel instantiations.
- Located the Qwen3VLModel class and its __init__ signature, and checked how vision_transformer_config attributes are used in modelling_qwen3_vl/model.py.
- Reviewed get_vision_model_config and its return type (Qwen3VLTransformerConfig) in modelling_qwen3_vl/transformer_config.py, along with the Qwen3VLConfigHF references.
Critical: vision_transformer_config is configured with CUDA graph settings but never passed to Qwen3VLModel.
At line 155, vision_transformer_config is created via get_vision_model_config(deepcopy(language_transformer_config), hf_vision_config), and lines 159-166 apply CUDA graph settings to it. However, at line 185, Qwen3VLModel is instantiated with vision_transformer_config=hf_vision_config (the raw HuggingFace config), not the configured vision_transformer_config. The CUDA graph settings applied to the local variable are never used.
The same issue exists in the MoE provider at lines 352-363 / 383.
Proposed fix (dense provider)
 model = Qwen3VLModel(
     language_transformer_config=language_transformer_config,
     language_transformer_layer_spec=language_transformer_layer_spec,
-    vision_transformer_config=hf_vision_config,
+    vision_transformer_config=vision_transformer_config,
     vision_transformer_layer_spec=vision_transformer_layer_spec,
     vision_patch_merger_spec=vision_patch_merger_spec,
     pre_process=pre_process,
     post_process=post_process,
     pg_collection=self._pg_collection,
 )

🤖 Prompt for AI Agents
In `@src/megatron/bridge/models/qwen_vl/qwen3_vl_provider.py` around lines 155 -
191, The local vision_transformer_config is modified (CUDA graph settings,
scope, max_vision_cuda_graph_seq_length) but the Qwen3VLModel call uses the
original hf_vision_config; update the Qwen3VLModel instantiation to pass the
configured vision_transformer_config (i.e., change
vision_transformer_config=hf_vision_config to
vision_transformer_config=vision_transformer_config) so the applied settings are
used, and make the same fix in the MoE provider where the analogous local
vision_transformer_config is created and then incorrectly replaced with
hf_vision_config in that model's constructor.
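A hedged regression check for this class of wiring bug; the test scaffolding, the provider.provide(...) call, and the model.vision_model.config attribute path are assumptions about the provider and model APIs, not details confirmed by the PR.

def test_vision_cuda_graph_settings_reach_model(provider):
    # provider: a configured dense Qwen3VL provider fixture (hypothetical).
    provider.vision_cuda_graph_impl = "local"
    provider.max_vision_cuda_graph_seq_length = 4096
    model = provider.provide(pre_process=True, post_process=True)
    cfg = model.vision_model.config  # assumption: the vision encoder keeps its TransformerConfig here
    assert cfg.cuda_graph_impl == "local"
    assert cfg.max_vision_cuda_graph_seq_length == 4096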
packed_tokens = torch.zeros(1, total_len, dtype=tokens.dtype, device=device)
packed_labels = torch.zeros(1, total_len, dtype=labels.dtype, device=device)
packed_loss_mask = torch.zeros(1, total_len, dtype=loss_mask.dtype, device=device)
packed_position_ids = torch.zeros(1, total_len, dtype=position_ids.dtype, device=device)

for i, seqlen in enumerate(seqlens_in_batch_cpu):
    start_idx = cu_seqlens_padded_cpu[i]
    packed_tokens[start_idx : start_idx + seqlen] = tokens[i, :seqlen]
    packed_labels[start_idx : start_idx + seqlen] = labels[i, :seqlen]
    packed_loss_mask[start_idx : start_idx + seqlen] = loss_mask[i, :seqlen]
    packed_position_ids[start_idx : start_idx + seqlen] = position_ids[i, :seqlen]

tokens = packed_tokens
labels = packed_labels
loss_mask = packed_loss_mask
position_ids = packed_position_ids
Critical indexing bug: packing loop slices along dim 0 instead of dim 1.
packed_tokens has shape (1, total_len). The assignment packed_tokens[start_idx : start_idx + seqlen] slices along dimension 0 (batch), not dimension 1 (sequence): for start_idx > 0 the view is empty, so the data cannot be placed where intended (in practice the assignment either fails with a shape mismatch or drops the tokens). The same bug affects packed_labels, packed_loss_mask, and packed_position_ids.
Fix: index dim 1 explicitly
 for i, seqlen in enumerate(seqlens_in_batch_cpu):
     start_idx = cu_seqlens_padded_cpu[i]
-    packed_tokens[start_idx : start_idx + seqlen] = tokens[i, :seqlen]
-    packed_labels[start_idx : start_idx + seqlen] = labels[i, :seqlen]
-    packed_loss_mask[start_idx : start_idx + seqlen] = loss_mask[i, :seqlen]
-    packed_position_ids[start_idx : start_idx + seqlen] = position_ids[i, :seqlen]
+    packed_tokens[0, start_idx : start_idx + seqlen] = tokens[i, :seqlen]
+    packed_labels[0, start_idx : start_idx + seqlen] = labels[i, :seqlen]
+    packed_loss_mask[0, start_idx : start_idx + seqlen] = loss_mask[i, :seqlen]
+    packed_position_ids[0, start_idx : start_idx + seqlen] = position_ids[i, :seqlen]
🤖 Prompt for AI Agents
In `@src/megatron/bridge/training/qwen3vl_step.py` around lines 191 - 206, The
loop that repacks sequences into
packed_tokens/packed_labels/packed_loss_mask/packed_position_ids is slicing the
batch dim instead of the sequence dim (packed_tokens[start_idx:...]) causing
empty assignments for start_idx>0; update the assignments in qwen3vl_step.py
(the for loop using seqlens_in_batch_cpu and cu_seqlens_padded_cpu) to index
dimension 1 explicitly (e.g., select all rows then slice columns like [:,
start_idx:start_idx+seqlen] or use .narrow on dim=1) so each tensor receives the
sequence slice into the sequence axis rather than the batch axis.
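A tiny standalone illustration of the indexing difference, independent of the training code (sizes are made up):

import torch

packed = torch.zeros(1, 10, dtype=torch.long)   # shape (1, total_len)
src = torch.arange(1, 5)                        # one 4-token sequence
start_idx, seqlen = 3, 4

# The buggy slice addresses dim 0 (batch); for start_idx > 0 it is an empty view.
print(packed[start_idx : start_idx + seqlen].shape)   # torch.Size([0, 10])

# Correct: address dim 1 (sequence) explicitly.
packed[0, start_idx : start_idx + seqlen] = src
print(packed)   # tensor([[0, 0, 0, 1, 2, 3, 4, 0, 0, 0]])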
if data_format == "thd":
    forward_args, packed_seq_params = get_thd_batch_on_this_cp_rank(
        forward_args,
        packed_seq_params.cu_seqlens,
        packed_seq_params.cu_seqlens_padded,
        packed_seq_params.max_seqlen,
        cp_size=this_pg_collection.cp.size(),
        cp_rank=this_pg_collection.cp.rank(),
    )
🧩 Analysis chain (scripts executed against NVIDIA-NeMo/Megatron-Bridge)
- Searched the repository for the PackedSeqParams class definition and all of its usages.
- Read src/megatron/bridge/training/qwen3vl_step.py: the imports, the PackedSeqParams construction around lines 220-240, and the call at lines 312-320.
- Located get_thd_batch_on_this_cp_rank and its other call sites, and checked how packed_seq_params attributes (cu_seqlens_*, max_seqlen_*) are accessed elsewhere.
- Reviewed how PackedSeqParams is constructed in src/megatron/bridge/training/utils/packed_seq_utils.py.
Fix AttributeError: PackedSeqParams does not have cu_seqlens, cu_seqlens_padded, or max_seqlen attributes.
Lines 315–317 attempt to access non-existent attributes on the PackedSeqParams object. Based on the constructor at lines 229–237 and usages throughout the codebase, PackedSeqParams only provides cu_seqlens_q, cu_seqlens_kv, cu_seqlens_q_padded, cu_seqlens_kv_padded, max_seqlen_q, and max_seqlen_kv. Replace the generic attribute names with the appropriate _q or _kv suffixed variants.
🤖 Prompt for AI Agents
In `@src/megatron/bridge/training/qwen3vl_step.py` around lines 312 - 320, The
call to get_thd_batch_on_this_cp_rank is using non-existent PackedSeqParams
attributes (cu_seqlens, cu_seqlens_padded, max_seqlen); update the call in
qwen3vl_step.py to use the correct PackedSeqParams fields (choose the
appropriate query or key/value variants) such as cu_seqlens_q or cu_seqlens_kv,
cu_seqlens_q_padded or cu_seqlens_kv_padded, and max_seqlen_q or max_seqlen_kv
so they match the PackedSeqParams constructor and usages; ensure the chosen _q
vs _kv set aligns with how get_thd_batch_on_this_cp_rank expects sequence
lengths.
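A minimal sketch of the corrected call, assuming the query-side fields are the right choice here (self-attention over one packed stream, where the q and kv variants should match); which variant get_thd_batch_on_this_cp_rank actually expects should be verified against its signature.

if data_format == "thd":
    forward_args, packed_seq_params = get_thd_batch_on_this_cp_rank(
        forward_args,
        packed_seq_params.cu_seqlens_q,          # was: cu_seqlens (attribute does not exist)
        packed_seq_params.cu_seqlens_q_padded,   # was: cu_seqlens_padded
        packed_seq_params.max_seqlen_q,          # was: max_seqlen
        cp_size=this_pg_collection.cp.size(),
        cp_rank=this_pg_collection.cp.rank(),
    )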
from megatron.core.transformer.vision_cuda_graphs import (
    VisionTECudaGraphHelper,
    get_vision_cuda_graph_seq_length,
)
🧩 Analysis chain (scripts executed against NVIDIA-NeMo/Megatron-Bridge)
- Searched the repository for a vision_cuda_graphs module and for any references to it, and read the import block around lines 40-65 of src/megatron/bridge/training/train.py.
- Checked the megatron-core dependency constraint in pyproject.toml.
- Inspected the 3rdparty/Megatron-LM submodule layout for vision-related files and attempted to import megatron.core.transformer.vision_cuda_graphs in Python.
🌐 Web query:
megatron-core vision_cuda_graphs module 0.15.0
💡 Result:
In megatron-core v0.15.0 there is no Python module named vision_cuda_graphs (it does not appear in the v0.15.0 module index / API docs), so an import like from megatron.core... import vision_cuda_graphs will fail. The CUDA Graphs implementation documented in v0.15.0 lives under core.transformer.cuda_graphs (and related CUDA-graph hooks on transformer modules), not under core.models.vision. [1], [2]
If you meant "CUDA Graphs for vision models (ViT/CLIP)", the vision package in the docs only lists submodules such as vit_layer_specs, clip_vit_model, radio, and multimodal_projector; there is no vision_cuda_graphs. [3]
What to do
- Replace any vision_cuda_graphs import with core.transformer.cuda_graphs (or use the CUDA-graph entrypoints exposed on transformer layers/modules). [1], [2]
- If you saw vision_cuda_graphs in some fork/branch, it is not part of the official 0.15.0 surface area; check the exact repo/commit you copied from (or upgrade/downgrade to the version that contains it). The official Megatron Core 0.15.0 release tag exists on GitHub. [2]
Sources
[1] Megatron Core 0.15.0 Python module index (shows core.transformer.cuda_graphs, no vision_cuda_graphs)
[2] NVIDIA/Megatron-LM GitHub releases (Core v0.15.0 tag/release)
[3] Megatron Core docs: core.models.vision submodules list (no vision_cuda_graphs)
Remove or update imports from non-existent megatron.core.transformer.vision_cuda_graphs module.
The module megatron.core.transformer.vision_cuda_graphs does not exist in megatron-core v0.15.0 (the pinned version constraint is >=0.15.0a0,<0.17.0). The imports for VisionTECudaGraphHelper and get_vision_cuda_graph_seq_length will fail at runtime. The correct CUDA graphs module in megatron-core is megatron.core.transformer.cuda_graphs. Either update the imports to use the correct module path, verify that the megatron-core version constraint allows a version containing this module, or remove the imports if they are not yet supported.
🤖 Prompt for AI Agents
In `@src/megatron/bridge/training/train.py` around lines 50 - 53, The imports from
megatron.core.transformer.vision_cuda_graphs are invalid; update the import
statements that reference VisionTECudaGraphHelper and
get_vision_cuda_graph_seq_length to import them from
megatron.core.transformer.cuda_graphs instead (or remove these imports entirely
if those symbols are unused in train.py); ensure any downstream references to
VisionTECudaGraphHelper and get_vision_cuda_graph_seq_length still resolve after
changing the import path and run tests to confirm no runtime import errors.
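Since the new module only exists alongside the companion Megatron-LM changes (NVIDIA/Megatron-LM#3293, NVIDIA/Megatron-LM#3294), one defensive option, sketched here as an assumption rather than the PR's approach, is to import the helpers lazily and fail loudly only when the feature is actually requested:

try:
    from megatron.core.transformer.vision_cuda_graphs import (
        VisionTECudaGraphHelper,
        get_vision_cuda_graph_seq_length,
    )
    HAVE_VISION_CUDA_GRAPHS = True
except ImportError:
    # Installed megatron-core predates the vision CUDA graph helpers.
    VisionTECudaGraphHelper = None
    get_vision_cuda_graph_seq_length = None
    HAVE_VISION_CUDA_GRAPHS = False

# Later, before building the vision-encoder CUDA graph helper:
# if vision_config.cuda_graph_impl != "none" and not HAVE_VISION_CUDA_GRAPHS:
#     raise RuntimeError(
#         "Vision CUDA graphs were requested but the installed megatron-core does not "
#         "provide megatron.core.transformer.vision_cuda_graphs."
#     )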
batch_size = input_ids.shape[0]

seqlens_in_batch = (
    attention_mask.sum(dim=-1, dtype=torch.int32)
    if attention_mask is not None
    else torch.ones((batch_size), dtype=torch.int32, device=input_ids.device) * input_ids.shape[1]
)
attention_mask=None causes crash when used as boolean index in the packing loop.
Lines 88–92 handle a None attention mask for computing seqlens_in_batch, but the packing loop at lines 137 and 145 unconditionally indexes input_ids[i, attention_mask[i]], which will raise a TypeError when attention_mask is None.
Suggested fix: guard the indexing
 for i in range(batch_size):
     # Use Python int, so no GPU→CPU sync in the loop
     if cp_size <= 1:
         seqlen = seqlens_in_batch_cpu[i]
         start_idx = cu_seqlens_padded_cpu[i]
-        input_ids_rmpad[start_idx : start_idx + seqlen] = input_ids[i, attention_mask[i]]
+        if attention_mask is not None:
+            input_ids_rmpad[start_idx : start_idx + seqlen] = input_ids[i, attention_mask[i]]
+        else:
+            input_ids_rmpad[start_idx : start_idx + seqlen] = input_ids[i, :seqlen]
         continue
     seqlen_padded_i = seqlens_in_batch_padded_cpu[i]
     seqlen = seqlen_padded_i // cp_size
     half_seqlen = seqlen // 2
     start_idx = cu_seqlens_padded_cpu[i] // cp_size
     # split to 2 chunks
-    d = input_ids[i, attention_mask[i]]
+    if attention_mask is not None:
+        d = input_ids[i, attention_mask[i]]
+    else:
+        d = input_ids[i, :seqlens_in_batch_cpu[i]]

Also applies to: 130-145
🤖 Prompt for AI Agents
In `@src/megatron/bridge/training/utils/packed_seq_utils.py` around lines 86 - 92,
The packing loop in packed_seq_utils.py indexes tensors with attention_mask
(e.g., input_ids[i, attention_mask[i]]) which crashes when attention_mask is
None; update the loop to guard indexing by computing per-example views: for each
i use mask_i = attention_mask[i] if attention_mask is not None else slice(0,
seqlens_in_batch[i]) and then use input_ids[i, mask_i] (and do the same for
token_type_ids, position_ids, etc.), or simply slice with :seqlens_in_batch[i]
when attention_mask is None so the code works regardless of whether
attention_mask is provided.
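A compact sketch of the guard, reusing the variable names from the snippet above; the _valid_selection helper is made up for illustration:

def _valid_selection(attention_mask, seqlens_in_batch_cpu, i):
    # Boolean mask row when a mask is provided, otherwise the leading valid tokens.
    if attention_mask is not None:
        return attention_mask[i].bool()
    return slice(0, int(seqlens_in_batch_cpu[i]))

for i in range(batch_size):
    sel = _valid_selection(attention_mask, seqlens_in_batch_cpu, i)
    valid_ids = input_ids[i, sel]          # the same selection works for both cases
    if cp_size <= 1:
        start_idx = cu_seqlens_padded_cpu[i]
        input_ids_rmpad[start_idx : start_idx + valid_ids.numel()] = valid_ids
        continue
    # ... the context-parallel split of valid_ids continues as in the existing loop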
What does this PR do?
This PR adds CUDA Graph support for vision encoder.
Related Mcore PRs: NVIDIA/Megatron-LM#3293, NVIDIA/Megatron-LM#3294
Related TE PR: NVIDIA/TransformerEngine#2657
Example run command:
Changelog
GitHub Actions CI
See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.
Before your PR is "Ready for review"
Pre checks:
If you haven't finished some of the above items you can still open "Draft" PR.
Additional Information
Summary by CodeRabbit
Release Notes
New Features
Chores