[fix] Add cu_seqlens_argmin to vlm packed sequence #2246

Open

cuichenx wants to merge 6 commits into main from chcui/fix_vlm_packed_sequence

Conversation

cuichenx (Contributor) commented Feb 5, 2026

What does this PR do?

#1997 added support for in-batch sequence packing for VLMs but introduced a performance degradation.
#2180 resolved the performance issue but introduced a bug for in-batch sequence packing.
This PR fixes the bug by passing cu_seqlens_argmin in vlm_step.py, so there is no performance degradation.
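
Roughly, the fix amounts to computing the argmin index on the host at packing time and shipping it along with the rest of the packed-sequence metadata. Below is a minimal sketch of that idea; the helper name and the surrounding metadata dictionary are illustrative assumptions rather than the exact code in vlm_step.py (only the cu_seqlens_argmin line mirrors the change quoted in the review comment further down):

```python
import torch


def build_packed_seq_metadata(cu_seqlens: torch.Tensor) -> dict:
    """Hypothetical helper mirroring the in-batch packing path."""
    # In-batch packing builds cu_seqlens without -1 padding, so the boundary of
    # the "real" entries is simply the tensor's length. Supplying it up front
    # lets downstream code skip a torch.argmin on a device-resident tensor.
    cu_seqlens_argmin = torch.tensor(len(cu_seqlens))  # no padding in cu_seqlens since packing is done in-batch
    return {
        "cu_seqlens": cu_seqlens,
        "cu_seqlens_argmin": cu_seqlens_argmin,
    }
```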

Changelog

  • Document in src/megatron/bridge/training/utils/packed_seq_utils.py that calling torch.argmin on cu_seqlens incurs a device-to-host synchronization when the argmin is not pre-computed.
  • Pass a pre-computed cu_seqlens_argmin in src/megatron/bridge/training/vlm_step.py so in-batch packed sequences avoid that torch.argmin call.

GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (e.g., Numba, Pynini, Apex, etc.)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

If you haven't finished some of the above items, you can still open a "Draft" PR.

Additional Information

  • Related to # (issue)

Summary by CodeRabbit

  • Documentation

    • Added a note to the packed-sequence utilities documenting the performance cost (a device-to-host synchronization) of torch.argmin-based padding detection during training.
  • Improvements

    • Packed-sequence metadata for VLM training now carries a pre-computed cu_seqlens_argmin, avoiding a device-to-host synchronization during sequence processing.

cuichenx added the r0.3.0 label (Cherry-pick label for r0.3.0 release branch) on Feb 5, 2026

coderabbitai bot (Contributor) commented Feb 5, 2026

📝 Walkthrough

The pull request adds documentation about potential device-to-host synchronization costs from torch.argmin calls in packed sequence utilities and introduces a new cu_seqlens_argmin scalar tensor parameter to the packed sequence metadata handling in VLM training.

Changes

Cohort: Packed Sequence Support Enhancement
File(s): src/megatron/bridge/training/utils/packed_seq_utils.py, src/megatron/bridge/training/vlm_step.py
Summary: Added a documentation note warning about torch.argmin device-to-host synchronization overhead and introduced cu_seqlens_argmin as a new parameter in packed sequence metadata to enable pre-computed argmin values.
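
For context, here is a hedged sketch of the consumer-side pattern the new documentation note warns about; the function and parameter names are illustrative, not the actual code in packed_seq_utils.py. Converting the result of torch.argmin on a device-resident cu_seqlens into a Python index forces a device-to-host copy and stalls the host until the GPU catches up, which a pre-computed cu_seqlens_argmin avoids:

```python
from typing import Optional

import torch


def trim_cu_seqlens(cu_seqlens: torch.Tensor, cu_seqlens_argmin: Optional[torch.Tensor] = None) -> torch.Tensor:
    """Drop trailing -1 pad entries from a fixed-length cu_seqlens tensor."""
    if cu_seqlens_argmin is not None:
        # Pre-computed on the host (typically a CPU scalar tensor): .item() is cheap.
        end = cu_seqlens_argmin.item()
    else:
        # torch.argmin runs on cu_seqlens's device; converting its result to a
        # Python int for slicing triggers a device-to-host synchronization.
        end = torch.argmin(cu_seqlens).item()
    return cu_seqlens[:end]
```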

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Possibly related PRs

  • Revert packed seq extra checks #2180 — Directly related as it removes argmin-based padding checks and uses provided argmin values, complementing this PR's introduction of the cu_seqlens_argmin parameter.

Suggested reviewers

  • yaoyu-33
  • ko3n1g

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

Test Results For Major Changes: ⚠️ Warning
Explanation: The PR introduces a performance optimization (cu_seqlens_argmin) to avoid device-to-host synchronization. Commits and code explicitly reference the performance impact, yet the PR description contains only template placeholders with no test results, performance metrics, or validation data.
Resolution: Add a PR description with performance comparison data, test results confirming correctness, test configurations, and confirmation that numerics/convergence are unaffected by this change.

✅ Passed checks (3 passed)

Title check: ✅ Passed. The PR title accurately summarizes the main change: adding cu_seqlens_argmin to VLM packed sequence support, which is directly reflected in the file changes.
Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, which is sufficient; the required threshold is 80.00%.
Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.


coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@src/megatron/bridge/training/utils/packed_seq_utils.py`:
- Around line 46-47: Remove the trailing space at the end of the comment line
that starts with "# note: if argmin is not pre-computed in the dataloader,
torch.argmin here will incur a" in
src/megatron/bridge/training/utils/packed_seq_utils.py; edit that comment to end
without any trailing whitespace (and optionally run the repo's pre-commit hooks
or a trim-whitespace action to ensure no other trailing spaces remain).
🧹 Nitpick comments (1)
src/megatron/bridge/training/vlm_step.py (1)

405-412: LGTM! The implementation correctly avoids device-to-host sync.

The logic is correct: since pack_batch_sequences creates cu_seqlens without -1 padding, the argmin index should be the full length of the tensor. Providing this pre-computed value avoids the torch.argmin call and the associated device-to-host synchronization mentioned in the documentation note.

Optional: Consider specifying dtype=torch.int32 for consistency with max_seqlen and other scalar metadata tensors created in pack_batch_sequences.

♻️ Optional consistency improvement
-        cu_seqlens_argmin = torch.tensor(len(cu_seqlens))  # no padding in cu_seqlens since packing is done in-batch
+        cu_seqlens_argmin = torch.tensor(len(cu_seqlens), dtype=torch.int32)  # no padding in cu_seqlens since packing is done in-batch
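
To make the reasoning concrete, here is a toy illustration (values are made up) of why the full tensor length is the right pre-computed index when cu_seqlens carries no -1 padding:

```python
import torch

padded = torch.tensor([0, 5, 9, 12, -1, -1])  # dataloader-side packing pads cu_seqlens to a fixed length
in_batch = torch.tensor([0, 5, 9, 12])         # in-batch packing adds no padding

torch.argmin(padded)    # tensor(4): index of the first -1, so padded[:4] trims the pad entries
torch.argmin(in_batch)  # tensor(0): the leading zero is the minimum, not a useful boundary
len(in_batch)           # 4: slicing with this keeps the whole, already-unpadded tensor
```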

Add validation for micro_batch_size when packing sequences

Labels

r0.3.0 (Cherry-pick label for r0.3.0 release branch)
