Add Nemotron 3 to tests via tiny model #5278

Open
sergiopaniego wants to merge 9 commits into main from nemotron3-tiny-tests

Conversation

@sergiopaniego sergiopaniego commented Mar 12, 2026

What does this PR do?

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

Who can review?

@qgallouedec @albertvillanova


Note

Low Risk
Low risk: changes are limited to test fixtures and unit tests, gated on transformers>=5.3.0, with a CPU-only fallback to avoid known NemotronH kernel/gradient-checkpointing incompatibilities.

Overview
Adds a new tiny NemotronH (hybrid Mamba-Attention) causal LM to the generate_tiny_models.py script so it can be published under trl-internal-testing for CI.

Extends SFTTrainer and DPOTrainer parametrized training tests to include tiny-NemotronHForCausalLM (skipped on older transformers), and conditionally disables gradient checkpointing + forces CPU for this model to avoid Mamba kernel stride constraints with tiny dimensions.

Written by Cursor Bugbot for commit 3c8f9d4.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@albertvillanova albertvillanova left a comment

Thanks!

The CI is red:

  FAILED tests/test_dpo_trainer.py::TestDPOTrainer::test_train[trl-internal-testing/tiny-NemotronHForCausalLM] - RuntimeError: causal_conv1d with channel last layout requires strides (x.stride(0) and x.stride(2)) to be multiples of 8
  FAILED tests/test_sft_trainer.py::TestSFTTrainer::test_train[trl-internal-testing/tiny-NemotronHForCausalLM] - RuntimeError: causal_conv1d with channel last layout requires strides (x.stride(0) and x.stride(2)) to be multiples of 8

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

@qgallouedec qgallouedec left a comment

thanks!! just a few comments

use_mamba_kernels=False, # CPU-friendly for testing
)
model = NemotronHForCausalLM(config).to(dtype=torch.bfloat16)
init_weights_tiny_model(model)

Can you cast backbone.layers.[N].mixer.D and backbone.layers.[N].mixer.A_log to fp32? It seems these two parameters are in fp32 in the reference model, and we want to stay as close to it as possible.

See how we do this for Qwen3.5: https://github.com/huggingface/trl/pull/5278/changes#diff-dd3349f840a26de373fc88378e6fcded0b75423da8a34f7cfa6ac573b7398b8bL404
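A minimal sketch of what such a cast could look like, assuming the parameter names end in `mixer.D` and `mixer.A_log` as described in this comment. The helper `cast_params_to_fp32` and the toy modules are hypothetical and only stand in for the real model; this is not necessarily how the linked Qwen3.5 code does it.

```python
import torch
from torch import nn

# Suffixes taken from the review comment; everything else here is illustrative.
FP32_SUFFIXES = ("mixer.D", "mixer.A_log")


def cast_params_to_fp32(model: nn.Module, suffixes=FP32_SUFFIXES) -> list[str]:
    """Upcast, in place, every parameter whose name ends with one of `suffixes`."""
    cast_names = []
    for name, param in model.named_parameters():
        if name.endswith(suffixes):
            param.data = param.data.to(torch.float32)
            cast_names.append(name)
    return cast_names


class ToyMixer(nn.Module):
    """Hypothetical stand-in for a Mamba-style mixer with D and A_log parameters."""

    def __init__(self):
        super().__init__()
        self.D = nn.Parameter(torch.ones(4))
        self.A_log = nn.Parameter(torch.zeros(4))
        self.proj = nn.Linear(4, 4)


class ToyLayer(nn.Module):
    def __init__(self):
        super().__init__()
        self.mixer = ToyMixer()


class ToyModel(nn.Module):
    def __init__(self, num_layers: int = 2):
        super().__init__()
        self.backbone = nn.Module()
        self.backbone.layers = nn.ModuleList([ToyLayer() for _ in range(num_layers)])


# Mirror the order in the diff: cast the whole model to bf16, then upcast the two parameters.
model = ToyModel().to(dtype=torch.bfloat16)
cast = cast_params_to_fp32(model)
```

After the call, only the selected parameters are in fp32 and the rest of the toy model stays in bf16.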

kwargs = {}
if "NemotronH" in model_id:
    kwargs["gradient_checkpointing"] = False
    kwargs["use_cpu"] = True

Really not sure about this. We don't train on CPU, so why test on it? Also, we wouldn't know if a GPU-specific issue is introduced.

@qgallouedec

Thanks!

The CI is red:

  FAILED tests/test_dpo_trainer.py::TestDPOTrainer::test_train[trl-internal-testing/tiny-NemotronHForCausalLM] - RuntimeError: causal_conv1d with channel last layout requires strides (x.stride(0) and x.stride(2)) to be multiples of 8
  FAILED tests/test_sft_trainer.py::TestSFTTrainer::test_train[trl-internal-testing/tiny-NemotronHForCausalLM] - RuntimeError: causal_conv1d with channel last layout requires strides (x.stride(0) and x.stride(2)) to be multiples of 8

Is it possible that this error originates from the parameters used to build the model?
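One way to reason about this question is through the stride arithmetic named in the error message. The sketch below assumes the tensor reaching causal_conv1d is a (batch, channels, seqlen) view obtained by transposing a contiguous (batch, seqlen, row_width) buffer. The helper names and `row_width` are hypothetical, and which config values determine that width for NemotronH is not established in this thread.

```python
def channel_last_strides(seqlen: int, row_width: int) -> tuple[int, int, int]:
    """Strides of a (batch, channels, seqlen) view over a contiguous
    (batch, seqlen, row_width) buffer, i.e. a channel-last layout."""
    # stride(0): elements per batch item; stride(1): adjacent channels;
    # stride(2): elements between consecutive sequence positions.
    return (seqlen * row_width, 1, row_width)


def meets_stride_constraint(seqlen: int, row_width: int, multiple: int = 8) -> bool:
    """True if stride(0) and stride(2) are both multiples of `multiple`."""
    stride0, _, stride2 = channel_last_strides(seqlen, row_width)
    return stride0 % multiple == 0 and stride2 % multiple == 0


# Hypothetical widths, not the values used in this PR.
print(meets_stride_constraint(seqlen=16, row_width=36))  # False: 36 % 8 == 4
print(meets_stride_constraint(seqlen=16, row_width=40))  # True
```

Under that assumption, a width that is a multiple of 8 satisfies both conditions for any sequence length, so very small or odd model dimensions are a plausible trigger for this error.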
