[fix] Add cu_seqlens_argmin to vlm packed sequence #2246
Conversation
Signed-off-by: Chen Cui <chcui@nvidia.com>
📝 Walkthrough
The pull request adds documentation about the potential device-to-host synchronization cost of torch.argmin when cu_seqlens_argmin is not pre-computed in the dataloader.
Changes
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~8 minutes
🚥 Pre-merge checks: ✅ 3 passed | ❌ 1 failed (warning)
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@src/megatron/bridge/training/utils/packed_seq_utils.py`:
- Around lines 46-47: Remove the trailing space at the end of the comment line that starts with "# note: if argmin is not pre-computed in the dataloader, torch.argmin here will incur a" in src/megatron/bridge/training/utils/packed_seq_utils.py; edit that comment to end without any trailing whitespace (and optionally run the repo's pre-commit hooks or a trim-whitespace action to ensure no other trailing spaces remain).
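For context, a minimal sketch of the pattern that comment describes; the helper name and structure here are assumptions for illustration, not the actual contents of packed_seq_utils.py:

```python
from typing import Optional

import torch


def trim_cu_seqlens(
    cu_seqlens: torch.Tensor, cu_seqlens_argmin: Optional[torch.Tensor] = None
) -> torch.Tensor:
    """Hypothetical helper: drop the -1 padding tail from a cu_seqlens tensor.

    When the dataloader supplies the argmin (index of the first padding entry)
    as a CPU tensor, the slice bound is already on the host and no GPU
    synchronization is needed.
    """
    if cu_seqlens_argmin is not None:
        # Pre-computed on the host by the dataloader: plain slice, no sync.
        return cu_seqlens[: cu_seqlens_argmin.item()]
    # note: if argmin is not pre-computed in the dataloader, torch.argmin here
    # will incur a device-to-host sync when its result is used as a slice bound.
    return cu_seqlens[: torch.argmin(cu_seqlens)]
```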
🧹 Nitpick comments (1)
src/megatron/bridge/training/vlm_step.py (1)
405-412: LGTM! The implementation correctly avoids device-to-host sync. The logic is correct: since `pack_batch_sequences` creates `cu_seqlens` without -1 padding, the argmin index should be the full length of the tensor. Providing this pre-computed value avoids the `torch.argmin` call and the associated device-to-host synchronization mentioned in the documentation note. Optional: Consider specifying `dtype=torch.int32` for consistency with `max_seqlen` and other scalar metadata tensors created in `pack_batch_sequences`.
dtype=torch.int32for consistency withmax_seqlenand other scalar metadata tensors created inpack_batch_sequences.♻️ Optional consistency improvement
- cu_seqlens_argmin = torch.tensor(len(cu_seqlens)) # no padding in cu_seqlens since packing is done in-batch
+ cu_seqlens_argmin = torch.tensor(len(cu_seqlens), dtype=torch.int32) # no padding in cu_seqlens since packing is done in-batch
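A short usage sketch of the suggested variant (the `cu_seqlens` values below are made up for illustration):

```python
import torch

# cu_seqlens produced by in-batch packing: strictly increasing, no -1 padding.
cu_seqlens = torch.tensor([0, 5, 12, 20], dtype=torch.int32)

# With no padding, the first "padding index" is simply the tensor's full
# length, so it can be materialized on the host without touching the device.
cu_seqlens_argmin = torch.tensor(len(cu_seqlens), dtype=torch.int32)
```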
Add validation for micro_batch_size when packing sequences
Signed-off-by: Chen Cui <chcui@nvidia.com>
What does this PR do?
#1997 added support for in-batch sequence packing for VLMs but introduced a performance degradation.
#2180 resolved the performance issue but introduced a bug in in-batch sequence packing.
This PR fixes the bug by passing cu_seqlens_argmin in vlm_step.py, so that there is no performance degradation.
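Roughly, the intent is the pattern sketched below; the helper and dict keys are illustrative, not the actual vlm_step.py code:

```python
import torch


def build_packed_seq_metadata(cu_seqlens: torch.Tensor) -> dict:
    """Illustrative only: bundle packed-sequence metadata for the model step.

    Supplying cu_seqlens_argmin alongside cu_seqlens lets downstream utilities
    trim the tensor with a plain slice instead of calling torch.argmin on the
    GPU, which would stall the host while waiting for the result.
    """
    return {
        "cu_seqlens": cu_seqlens,
        # In-batch packing never pads cu_seqlens with -1, so the "argmin" is
        # just the full length, known on the host for free.
        "cu_seqlens_argmin": torch.tensor(len(cu_seqlens)),
        # Longest packed segment, kept on-device to avoid another sync.
        "max_seqlen": (cu_seqlens[1:] - cu_seqlens[:-1]).max(),
    }
```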
Changelog
GitHub Actions CI
See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.
Before your PR is "Ready for review"
Pre checks:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Additional Information
Summary by CodeRabbit
Documentation
- Added a note about the device-to-host synchronization cost of torch.argmin when cu_seqlens_argmin is not pre-computed in the dataloader.
Improvements
- Pass a pre-computed cu_seqlens_argmin for in-batch packed VLM sequences to avoid that synchronization.