Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Thanks!
The CI is red:

```
FAILED tests/test_dpo_trainer.py::TestDPOTrainer::test_train[trl-internal-testing/tiny-NemotronHForCausalLM] - RuntimeError: causal_conv1d with channel last layout requires strides (x.stride(0) and x.stride(2)) to be multiples of 8
FAILED tests/test_sft_trainer.py::TestSFTTrainer::test_train[trl-internal-testing/tiny-NemotronHForCausalLM] - RuntimeError: causal_conv1d with channel last layout requires strides (x.stride(0) and x.stride(2)) to be multiples of 8
```
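The stride constraint in that RuntimeError can be sketched in plain Python. This is an illustration under assumed shapes, not the `causal_conv1d` source: for a tensor of shape `(batch, channels, seqlen)` stored channel-last, `stride(2)` equals the channel count, so a tiny test model with few conv channels trips the multiple-of-8 requirement.

```python
# Illustration only (simplified): element strides of a (batch, channels,
# seqlen) tensor whose memory order is (batch, seqlen, channels).
def channel_last_strides(batch: int, channels: int, seqlen: int):
    # stride(0) = seqlen * channels, stride(1) = 1, stride(2) = channels
    return (seqlen * channels, 1, channels)

# Hypothetical tiny-model shape: 4 conv channels, 16 positions, batch 2.
s = channel_last_strides(2, 4, 16)
print(s[0] % 8 == 0 and s[2] % 8 == 0)  # prints False: stride(2)=4 is not a multiple of 8
```

This is consistent with the failures appearing only for the tiny NemotronH fixture: full-size models have channel counts that are already multiples of 8.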
Cursor Bugbot has reviewed your changes and found 1 potential issue.
qgallouedec
left a comment
thanks!! just a few comments
```python
    use_mamba_kernels=False,  # CPU-friendly for testing
)
model = NemotronHForCausalLM(config).to(dtype=torch.bfloat16)
init_weights_tiny_model(model)
```
Can you cast `backbone.layers.[N].mixer.D` and `backbone.layers.[N].mixer.A_log` to fp32? It seems these two parameters are in fp32 in the reference model, and we want to stay as close as possible to it.
Check how we do it for Qwen3.5: https://github.com/huggingface/trl/pull/5278/changes#diff-dd3349f840a26de373fc88378e6fcded0b75423da8a34f7cfa6ac573b7398b8bL404
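A minimal sketch of the requested cast, assuming the parameter names from the comment above (`backbone.layers[N].mixer.D` and `.A_log`); the stand-in model below is built with `torch.nn` purely for demonstration, and in the real fixture the helper would run right after the global `.to(dtype=torch.bfloat16)` cast:

```python
import torch
from torch import nn

def keep_mixer_params_fp32(model):
    """Re-cast mixer.D and mixer.A_log to fp32 after a global bf16 cast."""
    for layer in model.backbone.layers:
        mixer = getattr(layer, "mixer", None)
        for name in ("D", "A_log"):
            param = getattr(mixer, name, None) if mixer is not None else None
            if param is not None:
                param.data = param.data.float()

# Stand-in module tree (not the real NemotronH model) to demonstrate:
mixer = nn.Module()
mixer.D = nn.Parameter(torch.zeros(4, dtype=torch.bfloat16))
mixer.A_log = nn.Parameter(torch.zeros(4, dtype=torch.bfloat16))
layer = nn.Module(); layer.mixer = mixer
model = nn.Module(); model.backbone = nn.Module(); model.backbone.layers = [layer]

keep_mixer_params_fp32(model)
print(model.backbone.layers[0].mixer.D.dtype)  # torch.float32
```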
```python
kwargs = {}
if "NemotronH" in model_id:
    kwargs["gradient_checkpointing"] = False
    kwargs["use_cpu"] = True
```
I'm really not sure about this. We don't train on CPU, so why test it? Plus, we wouldn't know if a GPU-specific issue is introduced.
Is it possible that this error originates from the params used to build the model?
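One possible direction for this concern, sketched here as an assumption rather than the final fix: keep the gradient-checkpointing workaround but drop the `use_cpu` override, so the test still exercises the GPU path (the function name is hypothetical; the kwargs mirror the diff above):

```python
def make_training_kwargs(model_id: str) -> dict:
    """Model-specific training-arg overrides for the parametrized tests."""
    kwargs = {}
    if "NemotronH" in model_id:
        # Gradient checkpointing is incompatible with this hybrid model's
        # Mamba path, so disable it for the tiny fixture.
        kwargs["gradient_checkpointing"] = False
        # Intentionally no kwargs["use_cpu"] = True, per the review comment:
        # forcing CPU would hide GPU-specific kernel regressions.
    return kwargs

print(make_training_kwargs("trl-internal-testing/tiny-NemotronHForCausalLM"))
# {'gradient_checkpointing': False}
```

If the stride error still reproduces on GPU, the remaining fix would live in the tiny config's dimensions rather than in the test harness.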
What does this PR do?
Who can review?
@qgallouedec @albertvillanova
Note
Low risk: changes are limited to test fixtures and unit tests, gated on `transformers>=5.3.0`, with a CPU-only fallback to avoid known NemotronH kernel/gradient-checkpointing incompatibilities.

Overview

Adds a new tiny NemotronH (hybrid Mamba-Attention) causal LM to the `generate_tiny_models.py` script so it can be published under `trl-internal-testing` for CI.

Extends `SFTTrainer` and `DPOTrainer` parametrized training tests to include `tiny-NemotronHForCausalLM` (skipped on older `transformers`), and conditionally disables gradient checkpointing and forces CPU for this model to avoid Mamba kernel stride constraints with tiny dimensions.

Written by Cursor Bugbot for commit 3c8f9d4. This will update automatically on new commits.