THD training in VLMs #1997

cuichenx · 2026-01-20T17:18:42Z

What does this PR do ?

Support THD (packed-sequence) training in VLMs.
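For readers unfamiliar with the layout, here is a minimal sketch of what THD packing means, in plain Python rather than the tensors this PR operates on. The helper names `pack_thd` and `unpack_thd` are hypothetical and are not taken from this PR. The idea is that variable-length sequences are concatenated along a single token dimension, and a cumulative-length list (commonly called `cu_seqlens`) records where each sequence starts and ends.

```python
from itertools import accumulate


def pack_thd(sequences):
    """Concatenate variable-length token lists into one packed list.

    Hypothetical helper for illustration only (not this PR's API).
    Returns (packed, cu_seqlens, max_seqlen), where cu_seqlens[i] is the
    start offset of sequence i and cu_seqlens[-1] is the total token count.
    """
    lengths = [len(seq) for seq in sequences]
    cu_seqlens = [0] + list(accumulate(lengths))
    packed = [tok for seq in sequences for tok in seq]
    max_seqlen = max(lengths) if lengths else 0
    return packed, cu_seqlens, max_seqlen


def unpack_thd(packed, cu_seqlens):
    """Recover the original sequences from the packed list and its offsets."""
    return [packed[s:e] for s, e in zip(cu_seqlens[:-1], cu_seqlens[1:])]


batch = [[11, 12, 13], [21, 22], [31, 32, 33, 34]]
packed, cu_seqlens, max_seqlen = pack_thd(batch)

# A padded BSHD batch would hold len(batch) * max_seqlen token slots,
# while the packed THD layout holds only the real tokens.
padded_slots = len(batch) * max_seqlen
packed_slots = cu_seqlens[-1]
```

In this toy batch the padded layout needs 12 token slots and the packed layout needs 9, which is the saving that motivates packing.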

TODOs

Test CP + THD
Test more models
- Gemma 3 VL
- Ministral 3
- ~~GLM 4.5 v~~ (after this PR)
- ~~Qwen 2.5 VL~~ (after this PR)
- ~~Qwen 3 VL~~ (WIP in #1943, "[WIP] Support qwen3-vl for THD format and CP")
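As background for the "Test CP + THD" item, the sketch below shows one common way packed sequences are sharded across context-parallel (CP) ranks under causal attention: each sequence is cut into `2 * cp_size` equal chunks, and rank `r` keeps chunks `r` and `2 * cp_size - 1 - r`, so every rank gets a similar amount of attention work. Whether this PR uses exactly this scheme is an assumption. The function names `pad_to_multiple` and `cp_token_indices` are hypothetical, and the code assumes each sequence has already been padded to a multiple of `2 * cp_size`.

```python
def pad_to_multiple(length, multiple):
    """Round length up to the next multiple (ceiling division)."""
    return -(-length // multiple) * multiple


def cp_token_indices(cu_seqlens, cp_size, cp_rank):
    """Return indices into the packed token dimension owned by cp_rank.

    Hypothetical illustration of load-balanced CP sharding; assumes every
    sequence length is divisible by 2 * cp_size.
    """
    n_chunks = 2 * cp_size
    owned = []
    for start, end in zip(cu_seqlens[:-1], cu_seqlens[1:]):
        length = end - start
        if length % n_chunks != 0:
            raise ValueError(
                f"sequence length {length} is not divisible by {n_chunks}"
            )
        chunk = length // n_chunks
        # Pair an early chunk with its mirrored late chunk to balance work.
        for c in (cp_rank, n_chunks - 1 - cp_rank):
            owned.extend(range(start + c * chunk, start + (c + 1) * chunk))
    return owned


# Two packed sequences of lengths 8 and 4, sharded over cp_size=2.
cu_seqlens_cp = [0, 8, 12]
rank0 = cp_token_indices(cu_seqlens_cp, cp_size=2, cp_rank=0)
rank1 = cp_token_indices(cu_seqlens_cp, cp_size=2, cp_rank=1)
```

Each rank ends up with half of every sequence, and the two ranks together cover every packed token exactly once.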

Changelog

Add specific line by line info of high level changes in this PR.

GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you add or update any necessary documentation?
Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
- Reviewer: Does the PR have correct import guards for all optional libraries?

If you haven't finished some of the above items you can still open "Draft" PR.

Additional Information

Gemma 3 VL: Loss and grad norm are both matching closely between cp1 and cp2, for both BSHD and THD.

Ministral 3: Loss matches very closely between cp1 and cp2, for both BSHD and THD. Grad norm differs between cp1 and cp2 for reasons not yet understood; it might be a display issue, and I will look into it further.
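One way to make "matching closely" concrete is to compare the logged curves step by step under a relative tolerance. The sketch below is only an illustration: the function names, the tolerance, and the numbers are placeholders and are not measurements from these runs.

```python
def max_relative_diff(reference, candidate):
    """Largest per-step relative difference between two metric curves."""
    if len(reference) != len(candidate):
        raise ValueError("curves must have the same number of steps")
    return max(
        (abs(r - c) / max(abs(r), 1e-12) for r, c in zip(reference, candidate)),
        default=0.0,
    )


def curves_match(reference, candidate, rel_tol=1e-2):
    """True if every step agrees within rel_tol (an assumed threshold)."""
    return max_relative_diff(reference, candidate) <= rel_tol


# Placeholder values for illustration only, not results from this PR.
loss_cp1 = [2.0, 1.5, 1.0]
loss_cp2 = [2.01, 1.5, 0.995]
loss_ok = curves_match(loss_cp1, loss_cp2)

grad_norm_cp1 = [2.0, 1.5, 1.0]
grad_norm_cp2 = [2.0, 1.5, 1.2]
grad_norm_ok = curves_match(grad_norm_cp1, grad_norm_cp2)
```

A check like this would also help distinguish a genuine grad norm mismatch from a logging or display difference, since it runs on the raw logged values rather than on plots.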

More plots to come

Related to # (issue)

Summary by CodeRabbit

New Features
- Added batch-level sequence packing support for optimized dataset processing
- Introduced context-parallel distributed training support for Gemma3, Ministral3, GLM, and Qwen vision-language models
Refactor
- Updated model forward signatures to support packed sequence parameters across vision-language models
Tests
- Added comprehensive test coverage for sequence packing utilities, distributed training configurations, and attention scaling algorithms


Signed-off-by: Chen Cui <chcui@nvidia.com>

thd initial commit

125f30d

Signed-off-by: Chen Cui <chcui@nvidia.com>

copy-pr-bot bot temporarily deployed to nemo-ci January 20, 2026 17:19 Inactive

cuichenx requested a review from yaoyu-33 January 20, 2026 17:19

copy-pr-bot bot temporarily deployed to test January 20, 2026 17:19 Inactive

copy-pr-bot bot temporarily deployed to nemo-ci January 20, 2026 17:34 Inactive

copy-pr-bot bot had a problem deploying to nemo-ci January 20, 2026 17:41 Failure

Revert gpt_oss.py changes (moved to chcui/gpt-oss-thd branch)

e441cc1

copy-pr-bot bot temporarily deployed to nemo-ci January 21, 2026 00:11 Inactive

copy-pr-bot bot temporarily deployed to test January 21, 2026 00:11 Inactive

copy-pr-bot bot temporarily deployed to nemo-ci January 21, 2026 01:09 Inactive

copy-pr-bot bot had a problem deploying to nemo-ci January 21, 2026 01:16 Failure

Merge branch 'main' into chcui/vlm_thd

2ffb50f

Signed-off-by: Chen Cui <chcui@nvidia.com>

copy-pr-bot bot temporarily deployed to nemo-ci January 22, 2026 21:13 Inactive

merge conflict fix

f48d403

Signed-off-by: Chen Cui <chcui@nvidia.com>

copy-pr-bot bot had a problem deploying to nemo-ci January 22, 2026 23:20 Error

gemma3 fix

412deb0

Signed-off-by: Chen Cui <chcui@nvidia.com>

copy-pr-bot bot temporarily deployed to nemo-ci January 22, 2026 23:21 Inactive

copy-pr-bot bot temporarily deployed to test January 22, 2026 23:21 Inactive

copy-pr-bot bot temporarily deployed to nemo-ci January 22, 2026 23:32 Inactive

copy-pr-bot bot had a problem deploying to nemo-ci January 22, 2026 23:39 Failure

enable THD + CP (wip, loss incorrect)

fac2cae

Signed-off-by: Chen Cui <chcui@nvidia.com>

copy-pr-bot bot temporarily deployed to nemo-ci January 28, 2026 00:35 Inactive

copy-pr-bot bot temporarily deployed to test January 28, 2026 00:35 Inactive

copy-pr-bot bot temporarily deployed to nemo-ci January 28, 2026 01:57 Inactive

copy-pr-bot bot had a problem deploying to nemo-ci January 28, 2026 02:04 Failure

fix thd cp convergence issue

6019f49

Signed-off-by: Chen Cui <chcui@nvidia.com>

copy-pr-bot bot temporarily deployed to nemo-ci January 29, 2026 00:49 Inactive

copy-pr-bot bot had a problem deploying to test January 29, 2026 00:50 Error

add ministral and pull out common code

de67930

Signed-off-by: Chen Cui <chcui@nvidia.com>

copy-pr-bot bot had a problem deploying to nemo-ci January 29, 2026 01:15 Error

copy-pr-bot bot temporarily deployed to nemo-ci January 29, 2026 07:08 Inactive

copy-pr-bot bot temporarily deployed to nemo-ci January 29, 2026 07:15 Inactive

copy-pr-bot bot temporarily deployed to nemo-ci January 29, 2026 07:24 Inactive

copy-pr-bot bot had a problem deploying to nemo-ci January 29, 2026 07:24 Failure

copy-pr-bot bot temporarily deployed to nemo-ci January 29, 2026 07:24 Inactive

copy-pr-bot bot had a problem deploying to nemo-ci January 29, 2026 07:24 Failure

copy-pr-bot bot temporarily deployed to nemo-ci January 29, 2026 07:24 Inactive

copy-pr-bot bot had a problem deploying to nemo-ci January 29, 2026 07:24 Failure

copy-pr-bot bot temporarily deployed to nemo-ci January 29, 2026 07:24 Inactive
