Adding CUDA Graph Support for Vision Encoder #2274

Open
tomlifu wants to merge 12 commits into NVIDIA-NeMo:main from shifangx:shifang/qwen3_vl_perf_m4_mcore_encoder

Conversation


@tomlifu tomlifu commented Feb 8, 2026

What does this PR do ?

This PR adds CUDA Graph support for vision encoder.

Related Mcore PRs: NVIDIA/Megatron-LM#3293, NVIDIA/Megatron-LM#3294
Related TE PR: NVIDIA/TransformerEngine#2657

Example run command:

python ${MEGATRON_BRIDGE_PATH}/scripts/performance/run_script.py \
--account my-account \
--partition my-partition \
--container_image my-container-image \
--gpu gb200 \
--domain vlm \
--model_family_name qwen_vl \
--model_recipe_name qwen3_vl_235b_a22b \
--gpus_per_node 4 \
--num_gpus 64 \
--hf_token <your-hf-token> \
--log_dir ${WORKSPACE}/../logs \
dataset.seq_length=4096 \
dataset.image_size=[392,392] \
model.seq_length=4096 \
model.freeze_language_model=False \
model.freeze_vision_model=False \
model.freeze_vision_projection=False \
model.pipeline_model_parallel_size=8 \
model.expert_model_parallel_size=8 \
model.account_for_embedding_in_pipeline_split=False \
model.account_for_loss_in_pipeline_split=False \
model.num_layers_in_first_pipeline_stage=4 \
model.num_layers_in_last_pipeline_stage=12 \
train.micro_batch_size=1 \
train.global_batch_size=1024 \
train.train_iters=20 \
train.eval_iters=0 \
train.manual_gc=true \
train.manual_gc_interval=100 \
logger.log_interval=1 \
model.cuda_graph_impl=transformer_engine \
model.cuda_graph_scope=[attn,moe_router,moe_preprocess] \
model.vision_cuda_graph_impl=transformer_engine \
model.vision_cuda_graph_scope=[attn,mlp] \
model.max_vision_cuda_graph_seq_length=784

Changelog

  • Add specific line by line info of high level changes in this PR.

GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

If you haven't finished some of the above items, you can still open a "Draft" PR.

Additional Information

  • Related to # (issue)

Summary by CodeRabbit

Release Notes

  • New Features

    • Added Qwen3 Vision-Language model support with optimized vision encoder integration.
    • Introduced decentralized process group support for distributed training workflows.
    • Added vision encoder CUDA graph optimization capabilities.
    • Enhanced sequence packing and padding support for efficient batch processing.
  • Chores

    • Updated Megatron-LM submodule to latest version.

shifangx and others added 12 commits January 28, 2026 23:19
* qwen3-vl migration [wip]

Signed-off-by: jinliangl <jinliangl@nvidia.com>

* support qwen3vl bshd training

Signed-off-by: jinliangl <jinliangl@nvidia.com>

* support bshd training, thd training, and thd training with cp; loss convergence has not been tested yet.

* adjust how to get packed seq params

* support qwen3vl dense

* support convert hf ckpt for qwenvl vision module, can convert deepstack's param, need to check it further.

* fix qwen3vl shard_state_dict

* align megatron-bridge and mbridge fwd bitwise

* align bshd and thd training, cp training remains to do

* fix full recompute of qwen3vl bug with credit to xuwen

Signed-off-by: jinliangl <jinliangl@nvidia.com>

* support hf vision model and megatron vision model; use model.use_hf_vision_model=true to enable it

Signed-off-by: jinliangl <jinliangl@nvidia.com>

* align with pr 1997, thd and bshd loss curve is verified

Signed-off-by: jinliangl <jinliangl@nvidia.com>

* init hf_config from scratch to avoid deepcopy parallel_state

Signed-off-by: jinliangl <jinliangl@nvidia.com>

* use pg_collection instead of mpu

Signed-off-by: jinliangl <jinliangl@nvidia.com>

* add qwen3vl_step new file

---------

Signed-off-by: jinliangl <jinliangl@nvidia.com>
Signed-off-by: Shifang Xu <shifangx@nvidia.com>
Co-authored-by: jinliangl <jinliangl@nvidia.com>
…guage_transformer_config), vision_transformer_config)
Signed-off-by: Lifu Zhang <lifuz@login-lyris02.lyris.clusters.nvidia.com>
@tomlifu tomlifu requested review from a team, erhoo82 and malay-nagda as code owners February 8, 2026 22:04

copy-pr-bot bot commented Feb 8, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


coderabbitai bot commented Feb 8, 2026

📝 Walkthrough

Walkthrough

This pull request extends Megatron-Bridge with comprehensive vision model support for Qwen3VL, including CUDA graph optimization for vision encoders, process group collection integration, and packed sequence handling. It introduces new vision architecture components (encoder, attention, RoPE), updates the training pipeline for multi-modal support, adds example configurations, and upgrades the submodule pointer.

Changes

Cohort / File(s) | Summary
Submodule & Core Examples
3rdparty/Megatron-LM, examples/recipes/decentralized_pg/pretrain_qwen3_vl_simple.py, examples/recipes/qwen_vl/conf/qwen3_vl_pretrain_override_example.yaml
Updates Megatron-LM submodule pointer; adds new Qwen3VL pretraining example with decentralized process groups and mock data; extends YAML configuration with parallelization, profiling, and optimization settings.
Performance Scripts & Domain Support
scripts/performance/argument_parser.py, scripts/performance/run_script.py, scripts/performance/utils/overrides.py
Adds "qwen3vl" domain support to performance testing pipeline; imports qwen3vl_forward_step; introduces _set_vision_cuda_graph_overrides helper for configuring vision encoder CUDA graphs with TE RNG tracking and scope validation.
Data & Configuration
src/megatron/bridge/data/vlm_datasets/hf_provider.py, src/megatron/bridge/models/conversion/param_mapping.py, examples/recipes/qwen_vl/finetune_qwen_vl.py
Adds pack_sequences_in_batch field to HFDatasetConversationProvider for batch-level sequence packing; updates param mapping documentation; changes forward_step import path to qwen3vl_step module.
Vision Model Core Architecture
src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/attention.py, src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/rope.py, src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/vision_model.py
Introduces Qwen3VLSelfAttention with RoPE handling for inference contexts and packed sequences; replaces MoE RoPE with Qwen3VLMultimodalRotaryEmbedding supporting interleaving and CP-rank partitioning; adds new Qwen3VLVisionModel class with patch embedding, positional embeddings, and CUDA graph padding.
Vision Utilities & Helpers
src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/utils.py
Introduces comprehensive vision-related utilities: Qwen3VLVisionPatchEmbed, Qwen3VLVisionRotaryEmbedding, Qwen3VLVisionPatchMerger; adds data partitioning functions (split_part_by_cp_tp, split_deepstack_embs, reorganize_inputs); implements vision embedding gathering (AllGatherVisionEmbeddings) and packed sequence handling (preprocess_packed_seqs, postprocess_packed_seqs).
Model Architecture & Configuration
src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/model.py, src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/text_model.py, src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/transformer_block.py, src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/transformer_config.py
Updates Qwen3VLModel to support ProcessGroupCollection, vision model path selection (HF vs. Megatron), and packed sequences with CP/SP handling; extends Qwen3VLGPTModel to use ProcessGroupCollection and unified multimodal RoPE; introduces Qwen3VLVisionTransformerBlock with deepstack visual processing; adds get_vision_model_config function with CUDA graph propagation and configuration overrides.
Bridge & Provider
src/megatron/bridge/models/qwen_vl/qwen3_vl_bridge.py, src/megatron/bridge/models/qwen_vl/qwen3_vl_provider.py
Extends param mapping with ConcatenatedQKVMapping for vision QKV weights and ReplicatedMapping for vision components; adds vision CUDA graph fields (vision_cuda_graph_impl, vision_cuda_graph_scope, max_vision_cuda_graph_seq_length), use_hf_vision_model flag, and bias_activation_fusion to both standard and MoE providers; integrates vision specs and pg_collection propagation.
Training Infrastructure
src/megatron/bridge/training/qwen3vl_step.py, src/megatron/bridge/training/train.py, src/megatron/bridge/training/utils/packed_seq_utils.py, src/megatron/bridge/training/utils/train_utils.py, src/megatron/bridge/training/vlm_step.py
Introduces new qwen3vl_step.py module with multi-modal batch handling, pack_or_pad_batch_sequences, and forward_step orchestration; extends train.py with VisionTECudaGraphHelper for vision encoder CUDA graphs; updates packed_seq_utils with preprocess_packed_seqs and postprocess_packed_seqs; forces PP-related padding in vlm_step.py; adds pg_collection to metrics tracking.
Testing
tests/unit_tests/models/qwen_vl/modelling_qwen3_vl/test_model.py
Updates unit tests to use ProcessGroupCollection.use_mpu_process_groups() and passes pg_collection to Qwen3VLModel constructor.

Sequence Diagram(s)

sequenceDiagram
    participant DataLoader as Data Iterator
    participant TrainStep as qwen3vl_step
    participant PackSeq as pack_or_pad_batch_sequences
    participant Model as Qwen3VLModel
    participant VisionModel as Qwen3VLVisionModel
    participant Attention as Qwen3VLSelfAttention
    
    DataLoader->>TrainStep: get_batch_from_iterator()
    TrainStep->>PackSeq: tokens, labels, attention_mask
    PackSeq->>PackSeq: Compute cu_seqlens, split by CP rank
    PackSeq-->>TrainStep: packed tokens, PackedSeqParams
    
    TrainStep->>Model: forward(input_ids, pixel_values, packed_seq_params)
    Model->>VisionModel: forward(pixel_values, grid_thw)
    VisionModel->>VisionModel: patch_embedding()
    VisionModel->>VisionModel: rot_pos_emb() + fast_pos_embed_interpolate()
    VisionModel->>VisionModel: transformer_block() with packed_seq_params
    VisionModel-->>Model: vision_embeddings
    
    Model->>Model: Organize vision embeddings with vision_mask
    Model->>Model: Prepare language + vision combined embeddings
    Model->>Model: language_model.forward() with packed_seq_params
    Model->>Attention: forward() with packed sequences
    Attention->>Attention: apply_rotary_pos_emb_absolute() for RoPE
    Attention->>Attention: flash_decode or static batching kernel
    Attention-->>Model: attention_output
    
    Model-->>TrainStep: loss
    TrainStep-->>DataLoader: gradients

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

Suggested labels

Run CICD

Suggested reviewers

  • yaoyu-33
🚥 Pre-merge checks | ✅ 2 | ❌ 2
❌ Failed checks (2 warnings)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 60.49%, below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them.
  • Test Results For Major Changes (⚠️ Warning): The PR implements a major CUDA Graph feature with significant architectural changes but lacks test results, performance benchmarks, and convergence verification, and has critical bugs. Resolution: add comprehensive test results, performance benchmarks showing the speedup, and convergence verification, fix the critical bugs, and expand unit test coverage before resubmission.
✅ Passed checks (2 passed)
  • Description Check (✅ Passed): Check skipped - CodeRabbit's high-level summary is enabled.
  • Title Check (✅ Passed): The title clearly identifies the primary change: adding CUDA Graph support for the vision encoder, which is reflected across multiple files including training, configuration, and model implementations.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 12

Note

Due to the large number of review comments, Critical severity comments were prioritized as inline comments.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
src/megatron/bridge/training/vlm_step.py (1)

145-192: ⚠️ Potential issue | 🟡 Minor

if True: renders the else branch dead code — remove it or gate on a real condition.

The else branch (lines 159–192) is unreachable. Additionally, line 190 (if attn is not None:) is dedented outside the if tokens_or_input is not None: block, so target_len would be an unbound NameError if tokens_or_input were None — though this can never execute today.

If the non-PP rounding path is no longer needed, delete it. If it may be re-enabled, gate on a proper flag (e.g., cfg.model.use_fixed_shape_padding) and fix the indentation of line 190.

Suggested minimal cleanup
-    if True:
+    # Always pad to model seq_length for fixed-shape CUDA graph compatibility
+    if True:  # TODO: replace with a config flag if variable-shape padding is needed later
         seq_len = cfg.model.seq_length
         ...
-    else:
-        # No PP: pad sequence length to nearest multiple of 128 for efficiency (capped at model seq_length)
-        ...  (remove entire else block)
src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/text_model.py (1)

60-92: ⚠️ Potential issue | 🔴 Critical

Potential AttributeError when pg_collection is None.

pg_collection defaults to None (line 60), but line 91 unconditionally accesses pg_collection.cp. Any caller that omits pg_collection will crash with AttributeError: 'NoneType' object has no attribute 'cp'.

Either make pg_collection required (remove the default) or add a guard:

Option A: Make required
-        pg_collection: ProcessGroupCollection = None,
+        pg_collection: ProcessGroupCollection,
Option B: Guard the access
         self.rotary_pos_emb = Qwen3VLMultimodalRotaryEmbedding(
             kv_channels=self.config.kv_channels,
             rotary_percent=rotary_percent,
             rotary_interleaved=self.config.rotary_interleaved,
             seq_len_interpolation_factor=seq_len_interpolation_factor,
             rotary_base=rotary_base,
-            cp_group=pg_collection.cp,
+            cp_group=pg_collection.cp if pg_collection is not None else None,
         )
tests/unit_tests/models/qwen_vl/modelling_qwen3_vl/test_model.py (1)

314-323: ⚠️ Potential issue | 🔴 Critical

pg_collection is missing from model_no_decoder — this will cause an AttributeError.

The Qwen3VLModel constructor accepts pg_collection with a default of None, but immediately accesses its attributes without null checks (lines 103–105 in model.py: pg_collection.cp, pg_collection.tp, pg_collection.pp, and line 112: pg_collection.embd). Instantiating model_no_decoder without passing pg_collection will raise an AttributeError on line 103, causing the test to fail at runtime. Either pass pg_collection=pg_collection to model_no_decoder, or refactor the constructor to handle None gracefully.
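
A minimal sketch of the first option (passing the collection through); the other constructor keyword arguments are elided, and ProcessGroupCollection.use_mpu_process_groups() is the helper the updated tests already use:

pg_collection = ProcessGroupCollection.use_mpu_process_groups()
model_no_decoder = Qwen3VLModel(
    # ... existing keyword arguments unchanged ...
    pg_collection=pg_collection,  # previously omitted; avoids the AttributeError in __init__
)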

src/megatron/bridge/models/qwen_vl/qwen3_vl_bridge.py (1)

155-211: ⚠️ Potential issue | 🟠 Major

MoE bridge uses incorrect blanket vision mapping instead of matching dense bridge's granular vision mappings.

The Qwen3VLMoEBridge.mapping_registry() applies a blanket ReplicatedMapping(megatron_param="vision_model.**", hf_param="model.visual.**") that bypasses the specialized vision weight handling in the dense bridge. The dense Qwen3VLBridge explicitly maps vision QKV weights using ConcatenatedQKVMapping and provides granular entries for other vision components (decoder layers, deepstack merger, etc.), but the MoE bridge's catch-all vision_model.** pattern will match all vision parameters with simple 1:1 replication, skipping the QKV interleaving transformation.

Since the vision encoder architecture is identical between the dense and MoE variants (per the docstring claiming "same as dense model"), the MoE bridge must use the same granular vision mappings as the dense bridge, including ConcatenatedQKVMapping for QKV weights.

🤖 Fix all issues with AI agents
In `@examples/recipes/decentralized_pg/pretrain_qwen3_vl_simple.py`:
- Around line 68-71: Remove the manual distributed cleanup block that calls
torch.distributed.is_initialized(), torch.distributed.barrier(), and
torch.distributed.destroy_process_group() in the example; pretrain() already
performs cleanup via _maybe_destroy_process_group(), so delete the
barrier/destroy_process_group block (the lines around the manual cleanup) to
avoid double-destroy/runtime errors when pretrain() has already torn down the
process group.

In `@scripts/performance/utils/overrides.py`:
- Around line 130-167: The _set_vision_cuda_graph_overrides function is never
invoked and also lacks resetting behavior for the "none" case; to fix, call
_set_vision_cuda_graph_overrides from the same two places where
_set_cuda_graph_overrides is called (inside set_workload_base_configs and
set_user_overrides) so vision overrides are applied, and add explicit handling
inside _set_vision_cuda_graph_overrides for vision_cuda_graph_impl == "none" to
disable TE RNG tracking (set recipe.rng.te_rng_tracker =
recipe.model.use_te_rng_tracker = False) mirroring the reset logic in
_set_cuda_graph_overrides; ensure the function still validates
vision_cuda_graph_scope against valid_te_scopes and sets
recipe.model.vision_cuda_graph_scope when provided.

In `@src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/model.py`:
- Around line 178-206: Remove the duplicated CUDA graph compatibility blocks and
keep a single instance that checks getattr(self.language_model.config,
"cuda_graph_impl", "none") != "none" and, if true, assigns
self.position_embedding_type = self.language_model.position_embedding_type,
self.rotary_pos_emb = self.language_model.rotary_pos_emb, and self.decoder =
self.language_model.decoder; locate the repeated blocks around the class that
references self.language_model and consolidate them into one occurrence to
eliminate the triplication.

In `@src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/rope.py`:
- Around line 120-126: The code calls self.cp_group.size() without ensuring
cp_group is set; change the conditional in the method that manipulates emb to
first check that self.cp_group is not None (e.g., self.cp_group is not None and
self.cp_group.size() > 1 and not self.is_thd_format) so you only call size()
when cp_group exists, and then call get_pos_emb_on_this_cp_rank(emb, 0,
self.cp_group) as before; reference symbols: self.cp_group, is_thd_format, emb,
get_pos_emb_on_this_cp_rank.
- Around line 265-279: The function apply_rotary_pos_emb_thd_absolute currently
ignores the cu_seqlens argument causing incorrect RoPE for packed THD inputs;
update it to compute per-sequence absolute token positions from cu_seqlens and
apply rotary embeddings per segment (e.g., iterate or vectorize over batches
using cu_seqlens to slice t and select matching freqs offsets) before
concatenating results, using _apply_rotary_pos_emb_bshd on each sequence chunk
(or a vectorized equivalent) so positions reset per sequence; alternatively if
you intend only single-sequence use remove the cu_seqlens parameter and rename
the function to reflect single-sequence behavior.

In `@src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/transformer_block.py`:
- Around line 455-459: The code calls an undefined function
sharded_state_dict_default inside Qwen3VLVisionTransformerBlock when iterating
self.named_children(), which will raise NameError; import
sharded_state_dict_default (e.g. from megatron.core.dist_checkpointing.utils
import sharded_state_dict_default) at the top with other imports and replace the
identity check "not module is self.layers" and "not module is
self.deepstack_merger_list" with the Pythonic "module is not self.layers" and
"module is not self.deepstack_merger_list" so the loop correctly calls
sharded_state_dict_default for child modules like self.final_layernorm.

In `@src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/utils.py`:
- Around line 365-384: The None guard in split_data_cp_rank is currently after
accessing val.shape and thus ineffective; move the "if val is None: return val"
to the top of the function (before any assertions or accesses to val.shape),
then perform the existing assertions and processing (assert cp_size > 1, assert
0 == val.shape[seq_dim] % (2 * cp_size), cp_rank handling,
view/index_select/rescale steps) so the function early-returns for None inputs
and avoids AttributeError when val is None.

In `@src/megatron/bridge/models/qwen_vl/qwen3_vl_provider.py`:
- Around line 155-191: The local vision_transformer_config is modified (CUDA
graph settings, scope, max_vision_cuda_graph_seq_length) but the Qwen3VLModel
call uses the original hf_vision_config; update the Qwen3VLModel instantiation
to pass the configured vision_transformer_config (i.e., change
vision_transformer_config=hf_vision_config to
vision_transformer_config=vision_transformer_config) so the applied settings are
used, and make the same fix in the MoE provider where the analogous local
vision_transformer_config is created and then incorrectly replaced with
hf_vision_config in that model's constructor.

In `@src/megatron/bridge/training/qwen3vl_step.py`:
- Around line 191-206: The loop that repacks sequences into
packed_tokens/packed_labels/packed_loss_mask/packed_position_ids is slicing the
batch dim instead of the sequence dim (packed_tokens[start_idx:...]) causing
empty assignments for start_idx>0; update the assignments in qwen3vl_step.py
(the for loop using seqlens_in_batch_cpu and cu_seqlens_padded_cpu) to index
dimension 1 explicitly (e.g., select all rows then slice columns like [:,
start_idx:start_idx+seqlen] or use .narrow on dim=1) so each tensor receives the
sequence slice into the sequence axis rather than the batch axis.
- Around line 312-320: The call to get_thd_batch_on_this_cp_rank is using
non-existent PackedSeqParams attributes (cu_seqlens, cu_seqlens_padded,
max_seqlen); update the call in qwen3vl_step.py to use the correct
PackedSeqParams fields (choose the appropriate query or key/value variants) such
as cu_seqlens_q or cu_seqlens_kv, cu_seqlens_q_padded or cu_seqlens_kv_padded,
and max_seqlen_q or max_seqlen_kv so they match the PackedSeqParams constructor
and usages; ensure the chosen _q vs _kv set aligns with how
get_thd_batch_on_this_cp_rank expects sequence lengths.

In `@src/megatron/bridge/training/train.py`:
- Around line 50-53: The imports from
megatron.core.transformer.vision_cuda_graphs are invalid; update the import
statements that reference VisionTECudaGraphHelper and
get_vision_cuda_graph_seq_length to import them from
megatron.core.transformer.cuda_graphs instead (or remove these imports entirely
if those symbols are unused in train.py); ensure any downstream references to
VisionTECudaGraphHelper and get_vision_cuda_graph_seq_length still resolve after
changing the import path and run tests to confirm no runtime import errors.

In `@src/megatron/bridge/training/utils/packed_seq_utils.py`:
- Around line 86-92: The packing loop in packed_seq_utils.py indexes tensors
with attention_mask (e.g., input_ids[i, attention_mask[i]]) which crashes when
attention_mask is None; update the loop to guard indexing by computing
per-example views: for each i use mask_i = attention_mask[i] if attention_mask
is not None else slice(0, seqlens_in_batch[i]) and then use input_ids[i, mask_i]
(and do the same for token_type_ids, position_ids, etc.), or simply slice with
:seqlens_in_batch[i] when attention_mask is None so the code works regardless of
whether attention_mask is provided.
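
A minimal sketch of the guard described in the last item above; the surrounding packing loop and variable names are assumed from the review text rather than copied from the file:

for i in range(batch_size):
    if attention_mask is not None:
        valid_ids = input_ids[i, attention_mask[i].bool()]
    else:
        # Fall back to the per-example sequence length when no mask is provided.
        valid_ids = input_ids[i, : seqlens_in_batch[i]]
    # ... append valid_ids to the flattened THD buffer exactly as before ...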
🟠 Major comments (10)
src/megatron/bridge/training/utils/packed_seq_utils.py-73-74 (1)

73-74: 🛠️ Refactor suggestion | 🟠 Major

Mid-file import violates import ordering guidelines.

from megatron.core import mpu should be moved to the top of the file alongside the other imports (lines 17–19). Having it after the first function definition breaks the expected import order.

Move import to top
 import torch
 import math
 from megatron.core.packed_seq_params import PackedSeqParams
+from megatron.core import mpu
 
 
 def get_packed_seq_params(batch: dict[str, torch.Tensor]) -> PackedSeqParams:
     ...
-
-# Copied from verl/verl/models/mcore/util.py
-from megatron.core import mpu

As per coding guidelines, imports should be organized at the top: future imports, standard library, third-party, first-party, separated by blank lines.

src/megatron/bridge/training/train.py-252-280 (1)

252-280: ⚠️ Potential issue | 🟠 Major

Overly broad except Exception silently swallows initialization errors.

If vision CUDA graph initialization fails due to a genuine bug (e.g., wrong attribute, shape mismatch, misconfiguration), the catch-all silently degrades to no vision CUDA graphs with only a print_rank_0 warning that's easy to miss. This makes debugging very difficult.

Narrow the exception to the specific, expected failure modes (e.g., AttributeError for missing vision_model, or a guard check before the try), and let unexpected errors propagate.

Suggested approach: guard instead of broad try/except
-    vision_config = getattr(config.model, 'vision_cuda_graph_impl', None)
-    if vision_config == "transformer_engine":
-        # Try to get vision config from the model
-        try:
-            for model_chunk in model:
-                unwrapped = get_attr_wrapped_model(
-                    model_chunk, 'vision_model', allow_none=True, return_model_obj=True
-                )
-                if unwrapped is not None and hasattr(unwrapped, 'vision_model') and unwrapped.vision_model is not None:
-                    vision_model_config = unwrapped.vision_model.config
-                    if vision_model_config.cuda_graph_impl == "transformer_engine":
-                        vision_seq_length = get_vision_cuda_graph_seq_length(vision_model_config)
-                        vision_cuda_graph_helper = VisionTECudaGraphHelper(
-                            model=model,
-                            vision_config=vision_model_config,
-                            vision_seq_length=vision_seq_length,
-                            micro_batch_size=config.train.micro_batch_size,
-                            num_microbatches=get_num_microbatches(),
-                        )
-                        print_rank_0(
-                            f"Vision encoder CUDA graph enabled with seq_length={vision_seq_length}"
-                        )
-                    break
-        except Exception as e:
-            print_rank_0(f"Warning: Failed to initialize vision CUDA graph helper: {e}")
-            vision_cuda_graph_helper = None
+    vision_config = getattr(config.model, 'vision_cuda_graph_impl', None)
+    if vision_config == "transformer_engine":
+        for model_chunk in model:
+            unwrapped = get_attr_wrapped_model(
+                model_chunk, 'vision_model', allow_none=True, return_model_obj=True
+            )
+            if unwrapped is not None and hasattr(unwrapped, 'vision_model') and unwrapped.vision_model is not None:
+                vision_model_config = unwrapped.vision_model.config
+                if vision_model_config.cuda_graph_impl == "transformer_engine":
+                    vision_seq_length = get_vision_cuda_graph_seq_length(vision_model_config)
+                    vision_cuda_graph_helper = VisionTECudaGraphHelper(
+                        model=model,
+                        vision_config=vision_model_config,
+                        vision_seq_length=vision_seq_length,
+                        micro_batch_size=config.train.micro_batch_size,
+                        num_microbatches=get_num_microbatches(),
+                    )
+                    print_rank_0(
+                        f"Vision encoder CUDA graph enabled with seq_length={vision_seq_length}"
+                    )
+                break
+        if vision_cuda_graph_helper is None:
+            print_rank_0("Warning: vision_cuda_graph_impl is 'transformer_engine' but no vision model with matching config found.")
src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/attention.py-16-16 (1)

16-16: 🛠️ Refactor suggestion | 🟠 Major

Replace wildcard import with explicit imports.

from megatron.core.transformer.attention import * makes every symbol in the file potentially undefined (dozens of F405 warnings from Ruff). Explicitly import the symbols you need (SelfAttention, Tensor, Optional, Union, Tuple, PackedSeqParams, BaseInferenceContext, deprecate_inference_params, HAVE_FA3, is_fa_min_version, nvtx_range_push, nvtx_range_pop, rearrange).

This improves readability, catches missing dependencies at import time, and satisfies the linter.
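
A partial sketch of the explicit form; only the first few symbols are shown, and the source modules for the remaining helpers (BaseInferenceContext, deprecate_inference_params, HAVE_FA3, is_fa_min_version, nvtx_range_push/pop) should be taken from wherever the wildcard currently resolves them:

from typing import Optional, Tuple, Union

from einops import rearrange
from torch import Tensor

from megatron.core.packed_seq_params import PackedSeqParams
from megatron.core.transformer.attention import SelfAttention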

src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/vision_model.py-42-50 (1)

42-50: ⚠️ Potential issue | 🟠 Major

pg_collection defaults to None but is immediately dereferenced without a guard.

Line 42 declares pg_collection: ProcessGroupCollection = None, and line 50 unconditionally accesses self.pg_collection.tp. If pg_collection is None, this raises AttributeError. Either remove the None default (making it required) or add a null check.

Also, per PEP 484 / coding guidelines, use ProcessGroupCollection | None = None if you intend to keep the default.

Proposed fix (make it required)
-        pg_collection: ProcessGroupCollection = None,
+        pg_collection: ProcessGroupCollection,
src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/transformer_config.py-63-64 (1)

63-64: 🛠️ Refactor suggestion | 🟠 Major

Add a docstring, type hints, and fix the blank line.

get_vision_model_config is a public function but lacks a docstring, a type annotation for hf_config, and a return type hint. Line 64 is an unnecessary blank line. As per coding guidelines, use Google-style docstrings and type hints for function arguments and return types.

Proposed fix (signature only)
-def get_vision_model_config(config: Qwen3VLTransformerConfig, hf_config):
-    
+def get_vision_model_config(
+    config: Qwen3VLTransformerConfig, hf_config: "Qwen3VLVisionConfig"
+) -> Qwen3VLTransformerConfig:
+    """Build a vision-encoder TransformerConfig by overriding fields from the HF vision config.
+
+    Args:
+        config: Base transformer config (will be mutated in-place).
+        hf_config: HuggingFace vision config providing encoder hyperparameters.
+
+    Returns:
+        The mutated config with vision-encoder settings applied.
+    """
src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/model.py-513-513 (1)

513-513: ⚠️ Potential issue | 🟠 Major

Mutating shared state on rotary_pos_emb during forward pass.

self.language_model.rotary_pos_emb.is_thd_format = True is set conditionally during forward but never reset to False. If the same model instance processes a mix of packed and non-packed batches, the flag will remain True from a previous packed call, potentially corrupting rotary embeddings for subsequent non-packed forward passes.
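
One way to avoid the sticky flag is to derive it from the current batch on every forward call instead of only ever setting it to True; a minimal sketch (attribute names follow this comment, and qkv_format is the standard PackedSeqParams field):

self.language_model.rotary_pos_emb.is_thd_format = (
    packed_seq_params is not None and packed_seq_params.qkv_format == "thd"
)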

src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/model.py-177-177 (1)

177-177: ⚠️ Potential issue | 🟠 Major

self.pg_collection is silently overwritten.

self.pg_collection is first assigned from the constructor argument at line 101, then overwritten here by get_pg_collection(self). If the intent is to always use the result of get_pg_collection, the earlier assignment (and the constructor parameter) is misleading. If the constructor value should be authoritative, this line is a bug. Please clarify the intent and remove the redundant assignment.

src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/model.py-77-77 (1)

77-77: ⚠️ Potential issue | 🟠 Major

pg_collection defaults to None but is accessed unconditionally — will raise AttributeError.

Line 77 declares pg_collection: ProcessGroupCollection = None, but lines 102–104 immediately do pg_collection.cp, .tp, .pp without a null check. Any caller that omits this argument (or passes None) will crash.

Either remove the default (making the argument required) or add a guard.

Proposed fix (make required)
-        pg_collection: ProcessGroupCollection = None,
+        pg_collection: ProcessGroupCollection,

Also applies to: 100-112

src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/model.py-136-136 (1)

136-136: 🛠️ Refactor suggestion | 🟠 Major

Use print_rank_0 instead of print to avoid duplicate output across ranks.

As per coding guidelines for src/megatron/bridge/models/**/*.py, use print_rank_0 for logging in model bridge code.

Proposed fix
-                print(f"rank {torch.distributed.get_rank()} use megatron vision model")
+                print_rank_0("use megatron vision model")
-                print(f"rank {torch.distributed.get_rank()} use hf vision model")
+                print_rank_0("use hf vision model")

Import at the top:

from megatron.bridge.utils.common_utils import print_rank_0

As per coding guidelines: "Use 'print_rank_0' for logging in model bridge to avoid duplicate output across ranks."

Also applies to: 149-149

src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/model.py-381-394 (1)

381-394: ⚠️ Potential issue | 🟠 Major

Hardcoded torch.bfloat16 for zero-sized vision placeholders.

If the model is running in fp16 or fp32, these tensors will have a dtype mismatch with the rest of the computation. Use vision_data.dtype (or derive from config) instead.

Proposed fix
-                        device=vision_data.device,
-                        dtype=torch.bfloat16,
+                        device=vision_data.device,
+                        dtype=vision_data.dtype,

(apply to both occurrences on lines 384 and 392)

🟡 Minor comments (13)
src/megatron/bridge/training/qwen3vl_step.py-224-224 (1)

224-224: ⚠️ Potential issue | 🟡 Minor

Unused variable max_seqlen_in_batch.

max_seqlen_in_batch is assigned but never used (confirmed by both Ruff and Flake8). Remove it.

Fix
-        max_seqlen_in_batch = seqlens_in_batch.max().item()
src/megatron/bridge/training/qwen3vl_step.py-136-141 (1)

136-141: ⚠️ Potential issue | 🟡 Minor

Potential AttributeError if visual_inputs is None.

Line 139 calls .normalized_for_model() on the result of batch.get("visual_inputs"). If the key exists but the value is None, this raises AttributeError. The "visual_inputs" in batch check (line 136) only verifies key presence, not that the value is non-None.

Suggested fix
-    if "visual_inputs" in batch:
+    if batch.get("visual_inputs") is not None:
         # convert visual_inputs to multi_modal_inputs which is a dict contains "pixel_values" and "image_grid_thw"
         # TODO(jinliangl): add video support
         multi_modal_inputs = batch.get("visual_inputs").normalized_for_model()
src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/utils.py-242-242 (1)

242-242: ⚠️ Potential issue | 🟡 Minor

Typo in assertion message: "flaaten" → "flatten".

-    assert input_ids.dim() == 1, "input_ids should be flaaten"
+    assert input_ids.dim() == 1, "input_ids should be flattened"
examples/recipes/qwen_vl/finetune_qwen_vl.py-106-108 (1)

106-108: ⚠️ Potential issue | 🟡 Minor

Unused import and incorrect import ordering.

functools.partial (line 108) is imported but never used (confirmed by Flake8 F401). Additionally, as a stdlib import, functools should be placed before third-party and first-party imports per the project's import ordering guidelines.

Remove the unused import:

Proposed fix
 from megatron.bridge.training.qwen3vl_step import forward_step
 from megatron.bridge.utils.common_utils import get_rank_safe
-from functools import partial

As per coding guidelines, imports should be organized: future imports, standard library, third-party, first-party, local folder imports, separated by blank lines.

examples/recipes/decentralized_pg/pretrain_qwen3_vl_simple.py-37-37 (1)

37-37: ⚠️ Potential issue | 🟡 Minor

Use qwen3vl_step.forward_step instead of vlm_step.forward_step.

The finetune_qwen_vl.py recipe was updated to import forward_step from megatron.bridge.training.qwen3vl_step, which includes a Qwen3-VL specific assertion and model-specific optimizations (custom sequence packing, position ID calculation, etc.). Since this file also targets Qwen3-VL (based on the config qwen3_vl_30b_a3b_pretrain_config), it should use the same qwen3vl_step.forward_step for consistency and to benefit from these optimizations.
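
The corresponding one-line change in the example, mirroring finetune_qwen_vl.py:

from megatron.bridge.training.qwen3vl_step import forward_step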

src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/attention.py-127-127 (1)

127-127: ⚠️ Potential issue | 🟡 Minor

Remove extraneous f prefix from string literal (no placeholders).

Flagged by both Ruff (F541) and Flake8 (F541).

-            raise ValueError(f"CUDA graphs must use flash decode with static batching!")
+            raise ValueError("CUDA graphs must use flash decode with static batching!")
src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/attention.py-57-57 (1)

57-57: ⚠️ Potential issue | 🟡 Minor

Typo in docstring: PackedSeqparamsPackedSeqParams.

-            packed_seq_params (Optional[PackedSeqparams]): Parameters used for THD format.
+            packed_seq_params (Optional[PackedSeqParams]): Parameters used for THD format.
src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/transformer_config.py-20-21 (1)

20-21: ⚠️ Potential issue | 🟡 Minor

Remove unused torch import.

torch (line 20) is imported but unused, as flagged by Flake8 (F401). torch.nn.functional as F is used only at line 92 and is fine to keep.

Proposed fix
-import torch
-import torch.nn.functional as F
+import torch.nn.functional as F
src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/vision_model.py-1-24 (1)

1-24: ⚠️ Potential issue | 🟡 Minor

Missing NVIDIA copyright header and unused imports.

This new file lacks the required NVIDIA copyright header. As per coding guidelines, all Python files must have the NVIDIA copyright header at the top.

Additionally, Flake8 flags several unused imports: dataclass, Union, MultimodalProjector, MegatronModule, build_module, get_tensor_model_parallel_group_if_none. Remove them to keep the import section clean.

src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/transformer_block.py-56-95 (1)

56-95: ⚠️ Potential issue | 🟡 Minor

Remove unused group attributes cp_group and tp_group from __init__.

Lines 81-82 assign self.cp_group and self.tp_group, but neither is used anywhere in this class. Only self.pp_group (line 83) is referenced, on line 424 for self.pp_group.rank(). Remove the unused assignments to reduce clutter.

src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/vision_model.py-248-264 (1)

248-264: ⚠️ Potential issue | 🟡 Minor

Validate max_vision_cuda_graph_seq_length alignment for TransformerEngine compatibility.

The overflow check at lines 250-254 is good. However, the code does not validate that max_vision_cuda_graph_seq_length itself meets TransformerEngine's alignment constraints. TransformerEngine's FusedAttention backend requires sequence length to be divisible by 64 (and ideally by 128 for optimal kernel dispatch), but there is no validation that the configured or calculated max_seq_len satisfies this requirement. If a user sets max_vision_cuda_graph_seq_length to a misaligned value, the downstream transformer layers may fail or experience severe performance degradation.

Add alignment validation in _get_max_vision_seq_length() or in the config initialization to ensure max_vision_cuda_graph_seq_length % 64 == 0 (or round up to the nearest multiple of 128, consistent with other parts of the codebase). Alternatively, document this constraint clearly in the config field docstring.
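
A minimal sketch of the suggested validation, assuming it lives next to _get_max_vision_seq_length(); the 64/128 alignment values come from this comment:

def _validate_vision_cuda_graph_seq_length(max_seq_len: int) -> int:
    """Round a misaligned CUDA graph sequence length up to the next multiple of 128."""
    if max_seq_len % 64 != 0:
        max_seq_len = ((max_seq_len + 127) // 128) * 128
    return max_seq_len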

src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/model.py-17-17 (1)

17-17: ⚠️ Potential issue | 🟡 Minor

Unused import mpu.

Static analysis confirms mpu is imported but never used. Remove it.

Proposed fix
-from megatron.core import InferenceParams, mpu, tensor_parallel
+from megatron.core import InferenceParams, tensor_parallel
src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/model.py-305-306 (1)

305-306: ⚠️ Potential issue | 🟡 Minor

Use explicit | None for nullable type hints.

Static analysis (RUF013) flags these. Per coding guidelines, use T | None instead of implicit Optional.

Proposed fix
-        cp_img_num: list[int] = None,
-        images_padded: list[bool] = None,
+        cp_img_num: list[int] | None = None,
+        images_padded: list[bool] | None = None,

As per coding guidelines: "Use 'T | None' for nullable types instead of 'Optional[T]'."

🧹 Nitpick comments (25)
src/megatron/bridge/training/vlm_step.py (1)

124-125: Hardcoded is_first = True / is_last = True bypasses pipeline-parallel stage detection.

is_pp_first_stage / is_pp_last_stage are imported but no longer called. With both flags forced to True, every PP stage now loads labels and loss_mask from the iterator and always pads to cfg.model.seq_length. If this is intentional for CUDA Graph fixed-shape requirements, please add a brief comment explaining why, and consider removing the now-unused is_pp_first_stage / is_pp_last_stage imports from line 21 to avoid confusion.

src/megatron/bridge/training/train.py (2)

1363-1366: Type hint for cuda_graph_helper should be TECudaGraphHelper | None.

The parameter can be None (initialized as None on line 241 and passed directly on line 592). The current annotation TECudaGraphHelper doesn't reflect this, though the body does handle None at line 1387.

Fix
 def _delete_cuda_graphs(
-    cuda_graph_helper: TECudaGraphHelper,
+    cuda_graph_helper: TECudaGraphHelper | None,
     vision_cuda_graph_helper: Optional[VisionTECudaGraphHelper] = None,
 ):

Also, per coding guidelines, prefer VisionTECudaGraphHelper | None over Optional[VisionTECudaGraphHelper] for consistency with the project's Python 3.10+ target.


366-374: Vision CUDA graph warmup uses language-model cuda_graph_warmup_steps.

The vision graph capture condition (line 370) reuses model_config.cuda_graph_warmup_steps, which is the language model's warmup setting. If the vision encoder requires a different warmup period (or if the LM has CUDA graphs disabled but vision has them enabled), this shared condition may not trigger correctly. Consider whether a separate vision_cuda_graph_warmup_steps config is needed.
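
A hypothetical sketch of that fallback (vision_cuda_graph_warmup_steps is not a field in this PR; it is named only for illustration):

def vision_warmup_steps(model_config) -> int:
    # Prefer a vision-specific setting if one were added; otherwise reuse the LM value.
    steps = getattr(model_config, "vision_cuda_graph_warmup_steps", None)
    return steps if steps is not None else model_config.cuda_graph_warmup_steps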

src/megatron/bridge/training/utils/packed_seq_utils.py (2)

77-171: Significant code duplication with src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/utils.py.

Both preprocess_packed_seqs and postprocess_packed_seqs are near-identical copies of the implementations in the qwen_vl module (see relevant snippets). The only meaningful difference is the use_fp8_padding parameter and the FP8 padding logic (lines 97–115). Consider extracting a shared implementation to avoid drift between the two copies.

Also, postprocess_packed_seqs docstring (line 184) states "This function is not used for now" — if unused, consider deferring its addition until it's needed (YAGNI).
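
One possible direction for the de-duplication: keep a single implementation in the training utils module and import it from the model-side utils (both module paths and function names appear in the walkthrough above; whether models should depend on training utils is a design call for the authors):

from megatron.bridge.training.utils.packed_seq_utils import (
    postprocess_packed_seqs,
    preprocess_packed_seqs,
)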


78-78: Missing type hint for use_fp8_padding parameter.

-def preprocess_packed_seqs(
-    input_ids: torch.Tensor, attention_mask: torch.Tensor, pre_process: bool = True, use_fp8_padding=False
-) -> tuple[torch.Tensor, PackedSeqParams]:
+def preprocess_packed_seqs(
+    input_ids: torch.Tensor, attention_mask: torch.Tensor, pre_process: bool = True, use_fp8_padding: bool = False
+) -> tuple[torch.Tensor, PackedSeqParams]:

As per coding guidelines: "Use type hints for function arguments and return types."

src/megatron/bridge/training/qwen3vl_step.py (3)

266-267: assert is_qwen3vl makes subsequent is_qwen3vl checks redundant.

Line 267 asserts the model is Qwen3-VL, guaranteeing is_qwen3vl is always True past this point. The later checks at lines 289 and 326 are therefore dead conditions. Either remove the assert and keep the conditional logic (to support future models), or remove the redundant checks.


154-163: Missing type hint for this_pg_collection parameter.

 def pack_or_pad_batch_sequences(
     tokens: torch.Tensor,
     labels: torch.Tensor,
     loss_mask: torch.Tensor,
     attention_mask: torch.Tensor,
     position_ids: torch.Tensor,
-    this_pg_collection,
+    this_pg_collection: ProcessGroupCollection,
     use_fp8_padding: bool = False,
     data_format: str = "bshd",
 ) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, PackedSeqParams]:

You'll need to add the ProcessGroupCollection import from megatron.core.process_groups_config. As per coding guidelines: "Use type hints for function arguments and return types."


14-35: Import ordering: megatron.core.packed_seq_params import is misplaced between Bridge imports.

Line 28 (from megatron.core.packed_seq_params import PackedSeqParams) is between Bridge imports (lines 24-27 and lines 29-35). It should be grouped with the other megatron.core imports (lines 20-22).

Reordered imports
 import torch
 from megatron.core.models.gpt import GPTModel
 from megatron.core.pipeline_parallel.utils import is_pp_first_stage, is_pp_last_stage
 from megatron.core.utils import get_batch_on_this_cp_rank, get_model_config, get_thd_batch_on_this_cp_rank
+from megatron.core.packed_seq_params import PackedSeqParams
 
 from megatron.bridge.training.config import ConfigContainer
 from megatron.bridge.training.losses import (
     create_masked_next_token_loss_function as _create_loss_function,
 )
-from megatron.core.packed_seq_params import PackedSeqParams
 from megatron.bridge.training.state import GlobalState
src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/utils.py (3)

348-349: Use raise AssertionError instead of assert False.

assert False is stripped by python -O. Since this is a reachable error path (unexpected token_id), use an explicit raise.

-            assert False, f"should not have {token_id=}"
+            raise AssertionError(f"Unexpected token_id: {token_id}")

16-17: Use modern union syntax per coding guidelines.

The guidelines require T | None instead of Optional[T] and X | Y instead of Union[X, Y].

Proposed fix
-from dataclasses import dataclass
-from typing import Optional, Union
+from __future__ import annotations
+
+from dataclasses import dataclass

Then update usages (e.g., line 92-94, line 103, line 365) from Optional[...]/Union[...] to ... | None/X | Y.

As per coding guidelines: "Use 'T | None' for nullable types instead of 'Optional[T]'" and "Use 'X | Y' for union types instead of 'Union[X, Y]'".


493-493: Change @torch.no_grad to @torch.no_grad().

The documented and correct form in PyTorch is @torch.no_grad() with parentheses. This creates and applies the no_grad context manager as a decorator, which is the official API convention shown in the PyTorch documentation.

src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/transformer_config.py (2)

115-132: hasattr checks are fragile on a dataclass — prefer getattr with defaults.

config is a Qwen3VLTransformerConfig dataclass, so using hasattr is unnecessary if the fields exist on the class, or fragile if they're dynamically set upstream. Consider using getattr(config, 'vision_cuda_graph_impl', 'none') instead, and similarly for vision_cuda_graph_scope. This also avoids potential AttributeError if the field is absent while keeping the same fallback behavior.

Also, per coding guidelines, use double quotes for strings (e.g., "none" instead of 'none', 'vision_cuda_graph_impl'"vision_cuda_graph_impl").

Proposed refactor
-    if hasattr(config, 'vision_cuda_graph_impl') and config.vision_cuda_graph_impl != "none":
-        config.cuda_graph_impl = config.vision_cuda_graph_impl
-        if hasattr(config, 'vision_cuda_graph_scope') and config.vision_cuda_graph_scope:
-            # Convert string scope list to CudaGraphScope enums if needed
-            from megatron.core.transformer.cuda_graphs import CudaGraphScope
-            scope_list = config.vision_cuda_graph_scope
-            if scope_list and isinstance(scope_list[0], str):
-                config.cuda_graph_scope = [CudaGraphScope[scope] for scope in scope_list]
-            else:
-                config.cuda_graph_scope = scope_list
-        else:
-            config.cuda_graph_scope = []
-    else:
-        config.cuda_graph_impl = "none"
-        config.cuda_graph_scope = []
+    from megatron.core.transformer.cuda_graphs import CudaGraphScope
+
+    vision_impl = getattr(config, "vision_cuda_graph_impl", "none")
+    if vision_impl != "none":
+        config.cuda_graph_impl = vision_impl
+        scope_list = getattr(config, "vision_cuda_graph_scope", [])
+        if scope_list and isinstance(scope_list[0], str):
+            config.cuda_graph_scope = [CudaGraphScope[scope] for scope in scope_list]
+        else:
+            config.cuda_graph_scope = scope_list if scope_list else []
+    else:
+        config.cuda_graph_impl = "none"
+        config.cuda_graph_scope = []

15-18: Use T | None instead of Optional[T].

Per coding guidelines, prefer T | None over Optional[T] and X | Y over Union[X, Y] (Python 3.10+). The Optional import on line 18 and its usage on line 60 should be updated.

Proposed fix
-from typing import List, Optional
+from typing import List

And at line 60:

-    max_vision_cuda_graph_seq_length: Optional[int] = None
+    max_vision_cuda_graph_seq_length: int | None = None
src/megatron/bridge/models/qwen_vl/qwen3_vl_bridge.py (1)

168-168: Double hash-comment prefix (# #) looks like an editing artifact.

Line 168 uses # # before "vision module deepstack merger". If this is intentional (e.g., a section sub-header), a single # with the comment text is clearer.

-            # # vision module deepstack merger
+            # vision module deepstack merger
src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/rope.py (2)

177-178: Variable input_ids is shadowed inside the loop.

The outer input_ids parameter (line 135) is stored as total_input_ids on line 165, but the for loop at line 177 reuses input_ids as the loop variable. Then at line 178, input_ids is further reassigned to a masked slice. This means at line 241, input_ids.device refers to the last loop iteration's masked tensor, not the original input_ids argument — which happens to work but is fragile and confusing.

Rename the loop variable to avoid shadowing (e.g., sample_ids).


54-54: Use explicit | None annotation instead of implicit Optional.

Per coding guidelines: use T | None for nullable types. Also, PEP 484 prohibits param: SomeType = None without an explicit optional annotation.

-        cp_group: torch.distributed.ProcessGroup = None,
+        cp_group: torch.distributed.ProcessGroup | None = None,
src/megatron/bridge/models/qwen_vl/qwen3_vl_provider.py (2)

45-45: Import ordering: from copy import deepcopy should be in the standard library group.

Per coding guidelines, organize imports as: future → standard library → third-party → first-party → local. copy is a standard library module but is placed after first-party imports.


146-151: Mutating self (the provider) when converting cuda_graph_scope — side effect on repeated calls.

language_transformer_config is self (line 144/339). Converting cuda_graph_scope from strings to enums mutates the provider's own field. If provide() is called a second time, isinstance(...[0], str) will be False and the conversion is skipped, which is harmless. However, mutating the provider's state is a subtle side effect. Consider operating on a copy or doing this conversion in finalize() instead.
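
A minimal sketch of the copy-based variant described here (field and import names follow this comment; the surrounding provide() logic is elided):

from copy import copy

from megatron.core.transformer.cuda_graphs import CudaGraphScope

language_transformer_config = copy(self)  # avoid mutating the provider's own fields
scope = language_transformer_config.cuda_graph_scope
if scope and isinstance(scope[0], str):
    language_transformer_config.cuda_graph_scope = [CudaGraphScope[s] for s in scope]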

src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/vision_model.py (2)

314-318: torch.split + torch.cat along the same dimension is a no-op.

Lines 315-317 split hidden_states by per-grid sizes and immediately concatenate them back along dim=0. This produces the same tensor as the input. If the intent is to validate that sizes sum correctly, an assert would be clearer. If it's prep for future per-grid processing, a comment explaining the intent would help.
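
If the intent is validation only, a one-line check makes that explicit; a sketch (per_grid_sizes is a placeholder for the per-grid sizes currently passed to torch.split):

assert sum(per_grid_sizes) == hidden_states.shape[0], (
    "per-grid token counts must cover the packed hidden_states exactly"
)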


134-171: Loop variable t unused at line 140; zip() calls lack strict= parameter.

Ruff flags t as unused (B007) at line 140. If the temporal dimension isn't needed in the interpolation loop body, use _ instead:

-        for t, h, w in zip(grid_ts, grid_hs, grid_ws):
+        for _, h, w in zip(grid_ts, grid_hs, grid_ws, strict=True):

Adding strict=True to zip() calls (lines 140, 182, 186) catches mismatched lengths early.

src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/transformer_block.py (2)

124-131: CUDA graph detection logic is duplicated between checkpointed and non-checkpointed paths.

The layer_uses_te_cudagraph check (lines 124-130 and 339-345) is identical in both the _checkpointed_forward and the non-checkpointed forward paths. Extract this into a small helper method to reduce duplication and ensure both paths stay in sync.

Proposed helper
@staticmethod
def _layer_uses_te_cudagraph(layer) -> bool:
    """Check if a transformer layer will use TE CUDA graph replay."""
    return (
        hasattr(layer, "cuda_graphs")
        and layer.cuda_graphs
        and layer.training
        and hasattr(layer, "config")
        and getattr(layer.config, "cuda_graph_impl", "none") == "transformer_engine"
    )

Also applies to: 339-346


69-69: Duplicate assertion: assert post_process and pre_process exists here and in Qwen3VLVisionModel.__init__ (vision_model.py line 44).

The same guard is present in both the vision model and its inner transformer block. Having it in both places is defensive but redundant — the vision model already validates this before constructing the block. Consider removing the inner one or adding a comment explaining why it's repeated.

src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/model.py (3)

124-124: Line exceeds 119-character limit.

This line is well over the configured max line length.

Proposed fix
-                megatron_vision_transformer_config = get_vision_model_config(copy(language_transformer_config), vision_transformer_config)
+                megatron_vision_transformer_config = get_vision_model_config(
+                    copy(language_transformer_config), vision_transformer_config
+                )

As per coding guidelines: "Maximum line length is 119 characters (matching ruff configuration)."


24-39: Import ordering: third-party transformers import is interleaved with first-party imports.

Per coding guidelines, imports should be organized: standard library → third-party (including megatron.core, torch, transformers) → first-party (megatron.bridge.*) → local, separated by blank lines. The transformers import on line 25 is sandwiched between megatron.bridge imports.

Proposed fix (reorder)
 from megatron.bridge.models.qwen_vl.modelling_qwen3_vl.text_model import Qwen3VLGPTModel
-from transformers.models.qwen3_vl.configuration_qwen3_vl import Qwen3VLConfig as Qwen3VLConfigHF
 from megatron.bridge.models.qwen_vl.modelling_qwen3_vl.transformer_config import Qwen3VLTransformerConfig

Move the transformers import up to the third-party block (after the megatron.core imports):

from transformers.models.qwen3_vl.configuration_qwen3_vl import Qwen3VLConfig as Qwen3VLConfigHF

310-315: Informal workaround notes in the public docstring.

These lines read like internal developer notes (e.g., "there is a workaround", "cooperate with verl's models/mcore/model_forward.py") rather than proper API documentation. Either move them to inline comments within the method body or rewrite as a proper docstring note.

Comment on lines +68 to +71
# Cleanup
if torch.distributed.is_initialized():
    torch.distributed.barrier()
    torch.distributed.destroy_process_group()

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# First, find the pretrain function to understand its cleanup behavior
fd '.*pretrain.*\.py' | head -20

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 1262


🏁 Script executed:

#!/bin/bash
# Check for destroy_process_group in pretrain-related files
rg -n "destroy_process_group|destroy_model_parallel" --type=py -C3 -g '*pretrain*'

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 8630


🏁 Script executed:

#!/bin/bash
# Also check the specific file mentioned in the review
cat -n examples/recipes/decentralized_pg/pretrain_qwen3_vl_simple.py | head -100

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 3428


Remove manual process group cleanup; pretrain() handles this internally.

The pretrain() function already manages distributed cleanup via _maybe_destroy_process_group(), which destroys the process group only if it was created by pretrain() itself (lines 111, 191 in src/megatron/bridge/training/pretrain.py). The manual barrier() + destroy_process_group() call will cause a runtime error in the typical scenario where the process group is not initialized before pretrain() is called, since pretrain() will have already destroyed it upon return.

Delete lines 68-71 (or the equivalent cleanup block in similar example files).

🤖 Prompt for AI Agents
In `@examples/recipes/decentralized_pg/pretrain_qwen3_vl_simple.py` around lines
68 - 71, Remove the manual distributed cleanup block that calls
torch.distributed.is_initialized(), torch.distributed.barrier(), and
torch.distributed.destroy_process_group() in the example; pretrain() already
performs cleanup via _maybe_destroy_process_group(), so delete the
barrier/destroy_process_group block (the lines around the manual cleanup) to
avoid double-destroy/runtime errors when pretrain() has already torn down the
process group.
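
For reference, the ownership-guarded pattern that pretrain() applies internally looks roughly like the sketch below; the function and flag names are invented for illustration and this is not the actual _maybe_destroy_process_group() code:

import torch

def _cleanup_if_owned(created_process_group: bool) -> None:
    # Only tear down a process group that this code path created itself;
    # callers that initialized the group before pretrain() keep ownership of it.
    if created_process_group and torch.distributed.is_initialized():
        torch.distributed.barrier()
        torch.distributed.destroy_process_group()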

Comment on lines +130 to +167
def _set_vision_cuda_graph_overrides(
    recipe: ConfigContainer,
    vision_cuda_graph_impl: Optional[str] = None,
    vision_cuda_graph_scope: Optional[str | List[str]] = None,
) -> ConfigContainer:
    """Set the vision encoder CUDA graph overrides.

    This enables TE CUDA graph for the vision encoder separately from the language model.

    Args:
        recipe: The config container
        vision_cuda_graph_impl: Vision encoder CUDA graph implementation ("none" or "transformer_engine")
        vision_cuda_graph_scope: Vision encoder CUDA graph scope (e.g., ["attn"])

    Returns:
        Updated config container
    """
    if isinstance(vision_cuda_graph_scope, str):
        vision_cuda_graph_scope = [vision_cuda_graph_scope]

    if vision_cuda_graph_impl is not None:
        recipe.model.vision_cuda_graph_impl = vision_cuda_graph_impl

        if vision_cuda_graph_impl == "transformer_engine":
            # Ensure TE RNG tracker is enabled for CUDA graph compatibility
            recipe.rng.te_rng_tracker = recipe.model.use_te_rng_tracker = True

    valid_te_scopes = ["attn", "mlp"]  # Vision encoder typically only has attn and mlp
    if vision_cuda_graph_scope:
        assert all(scope in valid_te_scopes for scope in vision_cuda_graph_scope), (
            f"Invalid vision cuda graph scope: {vision_cuda_graph_scope}. "
            f"Valid options for vision encoder are: {valid_te_scopes}"
        )

    if vision_cuda_graph_scope is not None:
        recipe.model.vision_cuda_graph_scope = vision_cuda_graph_scope

    return recipe

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# Check if _set_vision_cuda_graph_overrides is called anywhere in the codebase
rg -n '_set_vision_cuda_graph_overrides' --type=py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 151


🏁 Script executed:

# Also check the file to see the full context of both functions and their calls
cat -n scripts/performance/utils/overrides.py | head -270

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 13292


_set_vision_cuda_graph_overrides is defined but never called and lacks critical handling for the "none" case.

The function is not integrated into set_workload_base_configs (line 202) or set_user_overrides (line 252), so vision CUDA graph overrides will not be applied. Compare with _set_cuda_graph_overrides (called at lines 215 and 255).

Additionally, the function is missing the reset logic for vision_cuda_graph_impl == "none". In _set_cuda_graph_overrides (lines 115–116), when the implementation is set to "none", the RNG tracker is explicitly disabled. Without this in the vision variant, stale RNG tracker state could persist if the vision graph is disabled after being enabled.

🤖 Prompt for AI Agents
In `@scripts/performance/utils/overrides.py` around lines 130 - 167, The
_set_vision_cuda_graph_overrides function is never invoked and also lacks
resetting behavior for the "none" case; to fix, call
_set_vision_cuda_graph_overrides from the same two places where
_set_cuda_graph_overrides is called (inside set_workload_base_configs and
set_user_overrides) so vision overrides are applied, and add explicit handling
inside _set_vision_cuda_graph_overrides for vision_cuda_graph_impl == "none" to
disable TE RNG tracking (set recipe.rng.te_rng_tracker =
recipe.model.use_te_rng_tracker = False) mirroring the reset logic in
_set_cuda_graph_overrides; ensure the function still validates
vision_cuda_graph_scope against valid_te_scopes and sets
recipe.model.vision_cuda_graph_scope when provided.
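
A rough sketch of both missing pieces, assuming the vision arguments are plumbed through the same way as the language-model CUDA graph arguments (the overrides.get(...) plumbing below is a placeholder, not existing code):

# Inside _set_vision_cuda_graph_overrides, mirror the reset logic of the language-model helper:
if vision_cuda_graph_impl == "none":
    # Vision graphs explicitly disabled: do not leave the TE RNG tracker enabled
    recipe.rng.te_rng_tracker = recipe.model.use_te_rng_tracker = False

# In set_workload_base_configs and set_user_overrides, next to the existing
# _set_cuda_graph_overrides calls, actually apply the vision overrides:
recipe = _set_vision_cuda_graph_overrides(
    recipe,
    vision_cuda_graph_impl=overrides.get("vision_cuda_graph_impl"),  # placeholder plumbing
    vision_cuda_graph_scope=overrides.get("vision_cuda_graph_scope"),
)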

Comment on lines +178 to +206
        # Expose position_embedding_type, rotary_pos_emb, and decoder for CUDA graph helper compatibility
        # The CUDA graph helper expects model.position_embedding_type, model.rotary_pos_emb, and model.decoder,
        # but in Qwen3VL these are nested under language_model. This provides direct access.
        # Expose these attributes for CUDA graph helper compatibility only when CUDA graph is enabled
        cuda_graph_enabled = getattr(self.language_model.config, "cuda_graph_impl", "none") != "none"
        if cuda_graph_enabled:
            self.position_embedding_type = self.language_model.position_embedding_type
            self.rotary_pos_emb = self.language_model.rotary_pos_emb
            self.decoder = self.language_model.decoder

        # Expose position_embedding_type, rotary_pos_emb, and decoder for CUDA graph helper compatibility
        # The CUDA graph helper expects model.position_embedding_type, model.rotary_pos_emb, and model.decoder,
        # but in Qwen3VL these are nested under language_model. This provides direct access.
        # Expose these attributes for CUDA graph helper compatibility only when CUDA graph is enabled
        cuda_graph_enabled = getattr(self.language_model.config, "cuda_graph_impl", "none") != "none"
        if cuda_graph_enabled:
            self.position_embedding_type = self.language_model.position_embedding_type
            self.rotary_pos_emb = self.language_model.rotary_pos_emb
            self.decoder = self.language_model.decoder

        # Expose position_embedding_type, rotary_pos_emb, and decoder for CUDA graph helper compatibility
        # The CUDA graph helper expects model.position_embedding_type, model.rotary_pos_emb, and model.decoder,
        # but in Qwen3VL these are nested under language_model. This provides direct access.
        # Expose these attributes for CUDA graph helper compatibility only when CUDA graph is enabled
        cuda_graph_enabled = getattr(self.language_model.config, "cuda_graph_impl", "none") != "none"
        if cuda_graph_enabled:
            self.position_embedding_type = self.language_model.position_embedding_type
            self.rotary_pos_emb = self.language_model.rotary_pos_emb
            self.decoder = self.language_model.decoder

⚠️ Potential issue | 🔴 Critical

Copy-paste bug: the CUDA graph compatibility block is duplicated three times.

Lines 178–186, 188–196, and 198–206 are identical. This is clearly an accidental triplication. Keep only one copy.

Proposed fix
         self.pg_collection = get_pg_collection(self)
         # Expose position_embedding_type, rotary_pos_emb, and decoder for CUDA graph helper compatibility
         # The CUDA graph helper expects model.position_embedding_type, model.rotary_pos_emb, and model.decoder,
         # but in Qwen3VL these are nested under language_model. This provides direct access.
         # Expose these attributes for CUDA graph helper compatibility only when CUDA graph is enabled
         cuda_graph_enabled = getattr(self.language_model.config, "cuda_graph_impl", "none") != "none"
         if cuda_graph_enabled:
             self.position_embedding_type = self.language_model.position_embedding_type
             self.rotary_pos_emb = self.language_model.rotary_pos_emb
             self.decoder = self.language_model.decoder
 
-        # Expose position_embedding_type, rotary_pos_emb, and decoder for CUDA graph helper compatibility
-        # The CUDA graph helper expects model.position_embedding_type, model.rotary_pos_emb, and model.decoder,
-        # but in Qwen3VL these are nested under language_model. This provides direct access.
-        # Expose these attributes for CUDA graph helper compatibility only when CUDA graph is enabled
-        cuda_graph_enabled = getattr(self.language_model.config, "cuda_graph_impl", "none") != "none"
-        if cuda_graph_enabled:
-            self.position_embedding_type = self.language_model.position_embedding_type
-            self.rotary_pos_emb = self.language_model.rotary_pos_emb
-            self.decoder = self.language_model.decoder
-
-        # Expose position_embedding_type, rotary_pos_emb, and decoder for CUDA graph helper compatibility
-        # The CUDA graph helper expects model.position_embedding_type, model.rotary_pos_emb, and model.decoder,
-        # but in Qwen3VL these are nested under language_model. This provides direct access.
-        # Expose these attributes for CUDA graph helper compatibility only when CUDA graph is enabled
-        cuda_graph_enabled = getattr(self.language_model.config, "cuda_graph_impl", "none") != "none"
-        if cuda_graph_enabled:
-            self.position_embedding_type = self.language_model.position_embedding_type
-            self.rotary_pos_emb = self.language_model.rotary_pos_emb
-            self.decoder = self.language_model.decoder
🤖 Prompt for AI Agents
In `@src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/model.py` around lines
178 - 206, Remove the duplicated CUDA graph compatibility blocks and keep a
single instance that checks getattr(self.language_model.config,
"cuda_graph_impl", "none") != "none" and, if true, assigns
self.position_embedding_type = self.language_model.position_embedding_type,
self.rotary_pos_emb = self.language_model.rotary_pos_emb, and self.decoder =
self.language_model.decoder; locate the repeated blocks around the class that
references self.language_model and consolidate them into one occurrence to
eliminate the triplication.

Comment on lines +120 to 126
        # shape (seq_length, bs, 1, 2 * dim)
        emb = emb[..., None, :].transpose(0, 1).contiguous()
        _ = packed_seq_params  # packed sequences not supported yet
        if self.cp_group.size() > 1 and not self.is_thd_format:
            # slice rotary_pos_emb along sequence dimension and select the parition of the current
            # CP rank
            emb = get_pos_emb_on_this_cp_rank(emb, 0, self.cp_group)
        return emb

⚠️ Potential issue | 🔴 Critical

self.cp_group.size() will raise AttributeError when cp_group is None.

The constructor default for cp_group is None (line 54), but line 122 unconditionally calls self.cp_group.size(). If the caller doesn't supply a process group, this will crash at runtime.

Proposed fix
-        if self.cp_group.size() > 1 and not self.is_thd_format:
+        if self.cp_group is not None and self.cp_group.size() > 1 and not self.is_thd_format:
🤖 Prompt for AI Agents
In `@src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/rope.py` around lines
120 - 126, The code calls self.cp_group.size() without ensuring cp_group is set;
change the conditional in the method that manipulates emb to first check that
self.cp_group is not None (e.g., self.cp_group is not None and
self.cp_group.size() > 1 and not self.is_thd_format) so you only call size()
when cp_group exists, and then call get_pos_emb_on_this_cp_rank(emb, 0,
self.cp_group) as before; reference symbols: self.cp_group, is_thd_format, emb,
get_pos_emb_on_this_cp_rank.

Comment on lines +265 to +279
def apply_rotary_pos_emb_thd_absolute(
    t: Tensor, cu_seqlens: Tensor, freqs: Tensor, rotary_interleaved: bool = False
) -> Tensor:
    """A baseline implementation of applying RoPE for `thd` format.

    Args:
        t (Tensor): Input tensor T is of shape [t, h, d]
        cu_seqlens(Tensor): Cumulative sum of sequence lengths in a batch for `t`,
            with shape [b + 1] and dtype torch.int32.
        freqs (Tensor): Rotary Positional embedding tensor freq is of shape [max_s, 1, 1, d]

    Returns:
        Tensor: Shape [t, h, d]. The input tensor after applying RoPE.
    """
    return _apply_rotary_pos_emb_bshd(t[:, None], freqs, rotary_interleaved=rotary_interleaved).squeeze(1)

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# First, let's look at the rope.py file to understand the full context
wc -l src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/rope.py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 136


🏁 Script executed:

# Get the function in question and surrounding context
sed -n '260,285p' src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/rope.py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 945


🏁 Script executed:

# Search for _apply_rotary_pos_emb_bshd function definition
rg -n "_apply_rotary_pos_emb_bshd" src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/rope.py -A 20

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 2139


🏁 Script executed:

# Check for other thd/cu_seqlens related functions in the file
rg -n "thd|cu_seqlens" src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/rope.py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 1071


🏁 Script executed:

# Find where apply_rotary_pos_emb_thd_absolute is called
rg -n "apply_rotary_pos_emb_thd_absolute" --type py -B 2 -A 2

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 1144


🏁 Script executed:

# Check if there's a THD-aware implementation of _apply_rotary_pos_emb_bshd elsewhere
rg -n "def _apply_rotary_pos_emb_bshd" --type py -A 25

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 53


🏁 Script executed:

# Check the imports to see where _apply_rotary_pos_emb_bshd comes from
head -30 src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/rope.py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 1128


🏁 Script executed:

# Search for _apply_rotary_pos_emb_bshd definition in the entire codebase
rg -n "def _apply_rotary_pos_emb_bshd" --type py -A 30

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 53


🏁 Script executed:

# Search for any other _apply_rotary_pos_emb functions or THD-specific implementations
rg -n "apply_rotary_pos_emb" --type py -i | head -50

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 2922


🏁 Script executed:

# Look for any references to how THD format should be handled with RoPE
rg -n "thd.*format\|packed.*seq" --type py -i -C 3 src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/rope.py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 53


🏁 Script executed:

# Check if there's a test or usage example that clarifies the intent
rg -n "apply_rotary_pos_emb_thd_absolute\|is_thd_format" --type py -C 5

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 53


🏁 Script executed:

# Let's look at the attention.py file to see how apply_rotary_pos_emb_absolute is called
sed -n '160,200p' src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/attention.py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 1684


🏁 Script executed:

# Get more context on how packed_seq_params is passed
grep -n "packed_seq_params\|cu_seqlens" src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/attention.py | head -20

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 1178


🏁 Script executed:

# Check if _apply_rotary_pos_emb_bshd from megatron.core has THD support or different signature
rg -n "_apply_rotary_pos_emb_bshd\|apply_rotary_pos_emb_thd" --type py | grep -v "qwen3_vl"

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 53


🏁 Script executed:

# Look for any other THD-aware RoPE implementations in megatron.core
find . -name "*.py" -path "*/megatron/core/*" -type f | xargs grep -l "thd.*rope\|rope.*thd" 2>/dev/null | head -5

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 53


🏁 Script executed:

# Let's examine the full apply_rotary_pos_emb_thd_absolute function and context
sed -n '265,280p' src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/rope.py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 769


🏁 Script executed:

# Check the full apply_rotary_pos_emb_absolute function to understand the routing logic
sed -n '282,307p' src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/rope.py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 964


🏁 Script executed:

# Search for comments or TODOs that might indicate this is incomplete work
grep -n "TODO\|FIXME\|XXX\|HACK" src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/rope.py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 53


cu_seqlens parameter is accepted but never used — THD RoPE will be incorrect for packed sequences.

apply_rotary_pos_emb_thd_absolute accepts cu_seqlens but completely ignores it, simply calling _apply_rotary_pos_emb_bshd(t[:, None], freqs, rotary_interleaved=rotary_interleaved).squeeze(1). The function description labels this a "baseline implementation," indicating incomplete work.

For THD (packed sequence) format, RoPE must apply positional embeddings relative to each sequence's position within its boundaries, not globally across all tokens. The current implementation treats the entire packed input as a single contiguous sequence, violating correctness for multi-sequence batches. Ruff flags cu_seqlens as unused (ARG001).

Either implement proper THD-aware RoPE using cu_seqlens to compute per-sequence position offsets, or remove the parameter and rename the function to clarify it only handles single-sequence batches.

🧰 Tools
🪛 Ruff (0.14.14)

[warning] 266-266: Unused function argument: cu_seqlens

(ARG001)

🤖 Prompt for AI Agents
In `@src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/rope.py` around lines
265 - 279, The function apply_rotary_pos_emb_thd_absolute currently ignores the
cu_seqlens argument causing incorrect RoPE for packed THD inputs; update it to
compute per-sequence absolute token positions from cu_seqlens and apply rotary
embeddings per segment (e.g., iterate or vectorize over batches using cu_seqlens
to slice t and select matching freqs offsets) before concatenating results,
using _apply_rotary_pos_emb_bshd on each sequence chunk (or a vectorized
equivalent) so positions reset per sequence; alternatively if you intend only
single-sequence use remove the cu_seqlens parameter and rename the function to
reflect single-sequence behavior.
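
One way the first option could look, sketched only: it assumes freqs is indexed from position 0 for every packed sequence, reuses the module's existing torch / Tensor / _apply_rotary_pos_emb_bshd imports, and has not been validated against the Qwen3-VL rotary tables:

def apply_rotary_pos_emb_thd_absolute(
    t: Tensor, cu_seqlens: Tensor, freqs: Tensor, rotary_interleaved: bool = False
) -> Tensor:
    """Apply RoPE per packed sequence so positions restart at each boundary."""
    boundaries = cu_seqlens.tolist()
    chunks = []
    for start, end in zip(boundaries[:-1], boundaries[1:]):
        seq = t[start:end]  # [s_i, h, d] tokens belonging to one sequence
        # Use only the first s_i rows of the rotary table for this sequence.
        chunks.append(
            _apply_rotary_pos_emb_bshd(
                seq[:, None], freqs[: end - start], rotary_interleaved=rotary_interleaved
            ).squeeze(1)
        )
    return torch.cat(chunks, dim=0)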

Comment on lines +155 to 191
        vision_transformer_config = get_vision_model_config(deepcopy(language_transformer_config), hf_vision_config)
        vision_transformer_config.pipeline_model_parallel_size = 1
        vision_transformer_config.first_pipeline_num_layers = None

        # Apply vision encoder CUDA graph settings
        vision_transformer_config.cuda_graph_impl = self.vision_cuda_graph_impl
        # Convert string scope list to CudaGraphScope enums
        vision_transformer_config.cuda_graph_scope = _convert_cuda_graph_scope_to_enum(
            self.vision_cuda_graph_scope
        ) if self.vision_cuda_graph_scope else []
        # Set max sequence length for vision CUDA graphs
        vision_transformer_config.max_vision_cuda_graph_seq_length = self.max_vision_cuda_graph_seq_length

        # Spec for the Qwen3VLTransformerLayer
        language_transformer_layer_spec = get_gpt_layer_with_transformer_engine_spec(
            num_experts=None,
            moe_grouped_gemm=False,
            qk_layernorm=self.qk_layernorm,
            fp8=False,
        )
        vision_transformer_layer_spec = get_vit_layer_with_transformer_engine_spec()
        vision_patch_merger_spec = PatchMergerSubmodules(
            patch_norm=TENorm,
            linear_fc1=TEColumnParallelLinear,
            linear_fc2=TERowParallelLinear,
        )

        model = Qwen3VLModel(
            language_transformer_config=language_transformer_config,
            language_transformer_layer_spec=language_transformer_layer_spec,
            vision_transformer_config=hf_vision_config,
            vision_transformer_layer_spec=vision_transformer_layer_spec,
            vision_patch_merger_spec=vision_patch_merger_spec,
            pre_process=pre_process,
            post_process=post_process,
            pg_collection=self._pg_collection,
        )

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# First, check the file exists and get overall structure
wc -l src/megatron/bridge/models/qwen_vl/qwen3_vl_provider.py

# Check lines around 155 and 185 in qwen3_vl_provider.py
sed -n '150,195p' src/megatron/bridge/models/qwen_vl/qwen3_vl_provider.py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 2294


🏁 Script executed:

#!/bin/bash
# Find the Qwen3VLModel class definition
fd qwen3_vl -t f --extension py | grep -E "model|qwen3_vl"

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 607


🏁 Script executed:

#!/bin/bash
# Search for Qwen3VLModel class definition
rg -n "class Qwen3VLModel" src/megatron/bridge/models/qwen_vl/ -A 20

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 4969


🏁 Script executed:

#!/bin/bash
# Check the MoE provider for the same issue at lines 352-363 and 383
sed -n '340,390p' src/megatron/bridge/models/qwen_vl/qwen3_vl_provider.py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 2616


🏁 Script executed:

#!/bin/bash
# Get the Qwen3VLModel __init__ signature to see what types it expects
sed -n '42,120p' src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/model.py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 4223


🏁 Script executed:

#!/bin/bash
# Check what get_vision_model_config returns
rg -n "def get_vision_model_config" src/megatron/bridge/models/qwen_vl/ -A 15

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 1968


🏁 Script executed:

#!/bin/bash
# Check the type hints and what Qwen3VLConfigHF is
rg -n "Qwen3VLConfigHF" src/megatron/bridge/models/qwen_vl/ -B 2 -A 2

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 1361


🏁 Script executed:

#!/bin/bash
# Verify the type of get_vision_model_config's return value by checking full function
sed -n '63,110p' src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/transformer_config.py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 2301


🏁 Script executed:

#!/bin/bash
# Check what attributes vision_transformer_config is expected to have in Qwen3VLModel
rg -n "vision_transformer_config\." src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/model.py | head -20

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 529


🏁 Script executed:

#!/bin/bash
# Check the return statement of get_vision_model_config
sed -n '63,130p' src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/transformer_config.py | tail -20

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 1171


🏁 Script executed:

#!/bin/bash
# Get the full end of get_vision_model_config function to see return statement
sed -n '63,135p' src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/transformer_config.py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 3498


🏁 Script executed:

#!/bin/bash
# Verify Qwen3VLTransformerConfig is the return type - check inheritance and what it is
rg -n "class Qwen3VLTransformerConfig" src/megatron/bridge/models/qwen_vl/ -A 3

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 552


🏁 Script executed:

#!/bin/bash
# Double-check the actual line where model is instantiated in dense provider
sed -n '180,195p' src/megatron/bridge/models/qwen_vl/qwen3_vl_provider.py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 775


Critical: vision_transformer_config is configured with CUDA graph settings but never passed to Qwen3VLModel.

At line 155, vision_transformer_config is created via get_vision_model_config(deepcopy(language_transformer_config), hf_vision_config), and lines 159-166 apply CUDA graph settings to it. However, at line 185, Qwen3VLModel is instantiated with vision_transformer_config=hf_vision_config (the raw HuggingFace config), not the configured vision_transformer_config. The CUDA graph settings applied to the local variable are never used.

The same issue exists in the MoE provider at lines 352-363 / 383.

Proposed fix (dense provider)
         model = Qwen3VLModel(
             language_transformer_config=language_transformer_config,
             language_transformer_layer_spec=language_transformer_layer_spec,
-            vision_transformer_config=hf_vision_config,
+            vision_transformer_config=vision_transformer_config,
             vision_transformer_layer_spec=vision_transformer_layer_spec,
             vision_patch_merger_spec=vision_patch_merger_spec,
             pre_process=pre_process,
             post_process=post_process,
             pg_collection=self._pg_collection,
         )
🤖 Prompt for AI Agents
In `@src/megatron/bridge/models/qwen_vl/qwen3_vl_provider.py` around lines 155 -
191, The local vision_transformer_config is modified (CUDA graph settings,
scope, max_vision_cuda_graph_seq_length) but the Qwen3VLModel call uses the
original hf_vision_config; update the Qwen3VLModel instantiation to pass the
configured vision_transformer_config (i.e., change
vision_transformer_config=hf_vision_config to
vision_transformer_config=vision_transformer_config) so the applied settings are
used, and make the same fix in the MoE provider where the analogous local
vision_transformer_config is created and then incorrectly replaced with
hf_vision_config in that model's constructor.

Comment on lines +191 to +206
        packed_tokens = torch.zeros(1, total_len, dtype=tokens.dtype, device=device)
        packed_labels = torch.zeros(1, total_len, dtype=labels.dtype, device=device)
        packed_loss_mask = torch.zeros(1, total_len, dtype=loss_mask.dtype, device=device)
        packed_position_ids = torch.zeros(1, total_len, dtype=position_ids.dtype, device=device)

        for i, seqlen in enumerate(seqlens_in_batch_cpu):
            start_idx = cu_seqlens_padded_cpu[i]
            packed_tokens[start_idx : start_idx + seqlen] = tokens[i, :seqlen]
            packed_labels[start_idx : start_idx + seqlen] = labels[i, :seqlen]
            packed_loss_mask[start_idx : start_idx + seqlen] = loss_mask[i, :seqlen]
            packed_position_ids[start_idx : start_idx + seqlen] = position_ids[i, :seqlen]

        tokens = packed_tokens
        labels = packed_labels
        loss_mask = packed_loss_mask
        position_ids = packed_position_ids

⚠️ Potential issue | 🔴 Critical

Critical indexing bug: packing loop slices along dim 0 instead of dim 1.

packed_tokens has shape (1, total_len). The assignment packed_tokens[start_idx : start_idx + seqlen] slices along dimension 0 (batch), not dimension 1 (sequence). For any start_idx > 0, this produces an empty slice and silently loses all data. The same bug affects packed_labels, packed_loss_mask, and packed_position_ids.

Fix: index dim 1 explicitly
         for i, seqlen in enumerate(seqlens_in_batch_cpu):
             start_idx = cu_seqlens_padded_cpu[i]
-            packed_tokens[start_idx : start_idx + seqlen] = tokens[i, :seqlen]
-            packed_labels[start_idx : start_idx + seqlen] = labels[i, :seqlen]
-            packed_loss_mask[start_idx : start_idx + seqlen] = loss_mask[i, :seqlen]
-            packed_position_ids[start_idx : start_idx + seqlen] = position_ids[i, :seqlen]
+            packed_tokens[0, start_idx : start_idx + seqlen] = tokens[i, :seqlen]
+            packed_labels[0, start_idx : start_idx + seqlen] = labels[i, :seqlen]
+            packed_loss_mask[0, start_idx : start_idx + seqlen] = loss_mask[i, :seqlen]
+            packed_position_ids[0, start_idx : start_idx + seqlen] = position_ids[i, :seqlen]
🤖 Prompt for AI Agents
In `@src/megatron/bridge/training/qwen3vl_step.py` around lines 191 - 206, The
loop that repacks sequences into
packed_tokens/packed_labels/packed_loss_mask/packed_position_ids is slicing the
batch dim instead of the sequence dim (packed_tokens[start_idx:...]) causing
empty assignments for start_idx>0; update the assignments in qwen3vl_step.py
(the for loop using seqlens_in_batch_cpu and cu_seqlens_padded_cpu) to index
dimension 1 explicitly (e.g., select all rows then slice columns like [:,
start_idx:start_idx+seqlen] or use .narrow on dim=1) so each tensor receives the
sequence slice into the sequence axis rather than the batch axis.

Comment on lines +312 to +320
    if data_format == "thd":
        forward_args, packed_seq_params = get_thd_batch_on_this_cp_rank(
            forward_args,
            packed_seq_params.cu_seqlens,
            packed_seq_params.cu_seqlens_padded,
            packed_seq_params.max_seqlen,
            cp_size=this_pg_collection.cp.size(),
            cp_rank=this_pg_collection.cp.rank(),
        )

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# Search for PackedSeqParams class definition
rg -n 'class PackedSeqParams' --type=py -A 30

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 53


🏁 Script executed:

# Also search for any dataclass or NamedTuple definition
rg -n '@dataclass|@dataclasses' --type=py -B 2 -A 20 | grep -A 20 'PackedSeqParams'

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 53


🏁 Script executed:

# Search for where PackedSeqParams is defined across the codebase
fd --type=py -x rg -l 'class PackedSeqParams'

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 240


🏁 Script executed:

# Read the file to see imports and context
cat -n src/megatron/bridge/training/qwen3vl_step.py | head -50

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 2210


🏁 Script executed:

# Look at lines 229-237 and 312-320 specifically
cat -n src/megatron/bridge/training/qwen3vl_step.py | sed -n '220,240p'

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 1189


🏁 Script executed:

# Look at lines 312-320
cat -n src/megatron/bridge/training/qwen3vl_step.py | sed -n '312,320p'

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 504


🏁 Script executed:

# Search for PackedSeqParams more broadly without file type restriction
rg -n 'PackedSeqParams' --type-add='py:*.py'

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 6738


🏁 Script executed:

# Find the PackedSeqParams class definition
find . -name "packed_seq_params.py" -type f

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 53


🏁 Script executed:

# Search for the class definition
rg -n 'class PackedSeqParams' -A 50

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 53


🏁 Script executed:

# Find the get_thd_batch_on_this_cp_rank function
rg -n 'def get_thd_batch_on_this_cp_rank' -A 20

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 53


🏁 Script executed:

# Check how PackedSeqParams is used in other parts of this file and related files
rg -n 'packed_seq_params\.' src/megatron/bridge/training/qwen3vl_step.py -B 2 -A 2

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 430


🏁 Script executed:

# Check the signature of get_thd_batch_on_this_cp_rank in other files that use it
rg -n 'get_thd_batch_on_this_cp_rank' -A 15

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 3008


🏁 Script executed:

# Look for how packed_seq_params attributes are accessed in other files
rg -n 'packed_seq_params\.(cu_seqlens|max_seqlen)' -B 2 -A 2

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 8692


🏁 Script executed:

# Check the packed_seq_utils.py to see how PackedSeqParams is constructed and used
cat -n src/megatron/bridge/training/utils/packed_seq_utils.py | head -100

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 4911


🏁 Script executed:

# Look at lines around the construction in packed_seq_utils.py
sed -n '155,180p' src/megatron/bridge/training/utils/packed_seq_utils.py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 906


Fix AttributeError: PackedSeqParams does not have cu_seqlens, cu_seqlens_padded, or max_seqlen attributes.

Lines 315–317 attempt to access non-existent attributes on the PackedSeqParams object. Based on the constructor at lines 229–237 and usages throughout the codebase, PackedSeqParams only provides cu_seqlens_q, cu_seqlens_kv, cu_seqlens_q_padded, cu_seqlens_kv_padded, max_seqlen_q, and max_seqlen_kv. Replace the generic attribute names with the appropriate _q or _kv suffixed variants.

🤖 Prompt for AI Agents
In `@src/megatron/bridge/training/qwen3vl_step.py` around lines 312 - 320, The
call to get_thd_batch_on_this_cp_rank is using non-existent PackedSeqParams
attributes (cu_seqlens, cu_seqlens_padded, max_seqlen); update the call in
qwen3vl_step.py to use the correct PackedSeqParams fields (choose the
appropriate query or key/value variants) such as cu_seqlens_q or cu_seqlens_kv,
cu_seqlens_q_padded or cu_seqlens_kv_padded, and max_seqlen_q or max_seqlen_kv
so they match the PackedSeqParams constructor and usages; ensure the chosen _q
vs _kv set aligns with how get_thd_batch_on_this_cp_rank expects sequence
lengths.
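
A minimal sketch of the corrected call, assuming the query-side lengths are what get_thd_batch_on_this_cp_rank expects (swap in the _kv variants if that is what its signature actually takes):

if data_format == "thd":
    forward_args, packed_seq_params = get_thd_batch_on_this_cp_rank(
        forward_args,
        packed_seq_params.cu_seqlens_q,
        packed_seq_params.cu_seqlens_q_padded,
        packed_seq_params.max_seqlen_q,
        cp_size=this_pg_collection.cp.size(),
        cp_rank=this_pg_collection.cp.rank(),
    )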

Comment on lines +50 to +53
from megatron.core.transformer.vision_cuda_graphs import (
    VisionTECudaGraphHelper,
    get_vision_cuda_graph_seq_length,
)

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check if vision_cuda_graphs module exists in the megatron.core package
fd 'vision_cuda_graphs' --type f

# Also check the megatron-core version requirement
rg 'megatron.core|megatron-core' -A2 -B2 --type py --type toml --type txt | head -100

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 7770


🏁 Script executed:

#!/bin/bash
# Read the actual file to see the imports
echo "=== train.py imports around lines 50-53 ==="
sed -n '40,65p' src/megatron/bridge/training/train.py

# Search for any references to vision_cuda_graphs in the entire codebase
echo -e "\n=== Search for vision_cuda_graphs references ==="
rg 'vision_cuda_graphs' -B2 -A2

# Check pyproject.toml for megatron-core version
echo -e "\n=== megatron-core version requirement ==="
rg 'megatron' pyproject.toml requirements.txt 2>/dev/null || echo "Checking pyproject.toml..."
cat pyproject.toml | grep -A20 dependencies

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 4325


🏁 Script executed:

#!/bin/bash
# Check if 3rdparty/Megatron-LM exists and look for vision_cuda_graphs
echo "=== Check 3rdparty/Megatron-LM structure ==="
ls -la 3rdparty/ 2>/dev/null || echo "3rdparty directory not found or empty"

# Look for vision_cuda_graphs in the megatron-core path
echo -e "\n=== Search for vision_cuda_graphs in expected location ==="
find 3rdparty -name "*vision_cuda*" -type f 2>/dev/null || echo "Not found in 3rdparty"

# Check if megatron-core is available as installed package by checking Python imports
echo -e "\n=== Try to check via Python if module exists ==="
python3 << 'PYEOF'
try:
    # Try importing from megatron.core
    from megatron.core.transformer import vision_cuda_graphs
    print("✓ megatron.core.transformer.vision_cuda_graphs module exists")
    print(f"Module path: {vision_cuda_graphs.__file__}")
    print(f"Available items: {dir(vision_cuda_graphs)}")
except ImportError as e:
    print(f"✗ Import failed: {e}")
except Exception as e:
    print(f"✗ Error: {e}")
PYEOF

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 441


🏁 Script executed:

#!/bin/bash
# Check the actual Megatron-LM directory structure
echo "=== Check Megatron-LM directory ==="
ls -la 3rdparty/Megatron-LM/Megatron-LM/ 2>/dev/null || echo "Directory empty or structure different"

# Check if there's a megatron directory
echo -e "\n=== Look for megatron core structure ==="
find 3rdparty/Megatron-LM -type d -name 'core' 2>/dev/null | head -5

# Look for any vision-related files in the megatron-core transformer directory
echo -e "\n=== Search all vision-related files in 3rdparty ==="
find 3rdparty -name "*vision*" -type f 2>/dev/null

# Check git status to see if submodule is initialized
echo -e "\n=== Check if Megatron-LM is a git submodule ==="
git config --file .gitmodules --get-regexp path | grep -i megatron 2>/dev/null || echo "Not a git submodule or empty"

# List all .py files in megatron/core/transformer if it exists
echo -e "\n=== List Python files in megatron/core/transformer ==="
find 3rdparty -path "*/megatron/core/transformer/*.py" -type f 2>/dev/null | head -20

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 402


🌐 Web query:

megatron-core vision_cuda_graphs module 0.15.0

💡 Result:

In megatron-core v0.15.0, there is no Python module named vision_cuda_graphs (it does not appear in the v0.15.0 module index / API docs), so an import like from megatron.core... import vision_cuda_graphs will fail. The CUDA-Graphs implementation that is documented in v0.15.0 lives under core.transformer.cuda_graphs (and related CUDA-graph hooks on transformer modules), not under core.models.vision. [1], [2]

If you meant “CUDA Graphs for vision models (ViT/CLIP)”, the vision package in the docs only lists submodules like vit_layer_specs, clip_vit_model, radio, and multimodal_projector—no vision_cuda_graphs. [3]

What to do

  • Replace any vision_cuda_graphs import with core.transformer.cuda_graphs (or use the CUDA-graph entrypoints exposed on transformer layers/modules). [1], [2]
  • If you saw vision_cuda_graphs in some fork/branch, it’s not part of the official 0.15.0 surface area; check the exact repo/commit you copied from (or upgrade/downgrade to the version that contains it). The official Megatron Core 0.15.0 release tag exists on GitHub. [2]

Sources
[1] Megatron Core 0.15.0 Python module index (shows core.transformer.cuda_graphs, no vision_cuda_graphs)
[2] NVIDIA/Megatron-LM GitHub releases (Core v0.15.0 tag/release)
[3] Megatron Core docs: core.models.vision submodules list (no vision_cuda_graphs)


Remove or update imports from non-existent megatron.core.transformer.vision_cuda_graphs module.

The module megatron.core.transformer.vision_cuda_graphs does not exist in megatron-core v0.15.0 (the pinned version constraint is >=0.15.0a0,<0.17.0). The imports for VisionTECudaGraphHelper and get_vision_cuda_graph_seq_length will fail at runtime. The correct CUDA graphs module in megatron-core is megatron.core.transformer.cuda_graphs. Either update the imports to use the correct module path, verify that the megatron-core version constraint allows a version containing this module, or remove the imports if they are not yet supported.

🤖 Prompt for AI Agents
In `@src/megatron/bridge/training/train.py` around lines 50 - 53, The imports from
megatron.core.transformer.vision_cuda_graphs are invalid; update the import
statements that reference VisionTECudaGraphHelper and
get_vision_cuda_graph_seq_length to import them from
megatron.core.transformer.cuda_graphs instead (or remove these imports entirely
if those symbols are unused in train.py); ensure any downstream references to
VisionTECudaGraphHelper and get_vision_cuda_graph_seq_length still resolve after
changing the import path and run tests to confirm no runtime import errors.
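
If the module only exists once the companion Megatron-LM PRs land, one option is to guard the import so train.py stays importable on older megatron-core releases; the HAVE_VISION_CUDA_GRAPHS flag below is an assumption for illustration, not existing code:

try:
    from megatron.core.transformer.vision_cuda_graphs import (
        VisionTECudaGraphHelper,
        get_vision_cuda_graph_seq_length,
    )

    HAVE_VISION_CUDA_GRAPHS = True
except ImportError:
    # megatron-core builds without the companion PRs (e.g. 0.15.0) do not ship this module
    VisionTECudaGraphHelper = None
    get_vision_cuda_graph_seq_length = None
    HAVE_VISION_CUDA_GRAPHS = False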

Comment on lines +86 to +92
    batch_size = input_ids.shape[0]

    seqlens_in_batch = (
        attention_mask.sum(dim=-1, dtype=torch.int32)
        if attention_mask is not None
        else torch.ones((batch_size), dtype=torch.int32, device=input_ids.device) * input_ids.shape[1]
    )

⚠️ Potential issue | 🔴 Critical

attention_mask=None causes crash when used as boolean index in the packing loop.

Lines 88–92 handle a None attention mask for computing seqlens_in_batch, but the packing loop at lines 137 and 145 unconditionally indexes input_ids[i, attention_mask[i]], which will raise a TypeError when attention_mask is None.

Suggested fix: guard the indexing
         for i in range(batch_size):
             # Use Python int, so no GPU→CPU sync in the loop
             if cp_size <= 1:
                 seqlen = seqlens_in_batch_cpu[i]
                 start_idx = cu_seqlens_padded_cpu[i]
-                input_ids_rmpad[start_idx : start_idx + seqlen] = input_ids[i, attention_mask[i]]
+                if attention_mask is not None:
+                    input_ids_rmpad[start_idx : start_idx + seqlen] = input_ids[i, attention_mask[i]]
+                else:
+                    input_ids_rmpad[start_idx : start_idx + seqlen] = input_ids[i, :seqlen]
                 continue
 
             seqlen_padded_i = seqlens_in_batch_padded_cpu[i]
             seqlen = seqlen_padded_i // cp_size
             half_seqlen = seqlen // 2
             start_idx = cu_seqlens_padded_cpu[i] // cp_size
             # split to 2 chunks
-            d = input_ids[i, attention_mask[i]]
+            if attention_mask is not None:
+                d = input_ids[i, attention_mask[i]]
+            else:
+                d = input_ids[i, :seqlens_in_batch_cpu[i]]

Also applies to: 130-145

🤖 Prompt for AI Agents
In `@src/megatron/bridge/training/utils/packed_seq_utils.py` around lines 86 - 92,
The packing loop in packed_seq_utils.py indexes tensors with attention_mask
(e.g., input_ids[i, attention_mask[i]]) which crashes when attention_mask is
None; update the loop to guard indexing by computing per-example views: for each
i use mask_i = attention_mask[i] if attention_mask is not None else slice(0,
seqlens_in_batch[i]) and then use input_ids[i, mask_i] (and do the same for
token_type_ids, position_ids, etc.), or simply slice with :seqlens_in_batch[i]
when attention_mask is None so the code works regardless of whether
attention_mask is provided.
