Replies: 2 comments
Could you make a GitHub issue instead so it receives more visibility? Thank you!
Nemotron 3 Nano can be tricky; it's a smaller model with some specific requirements. Common issues:

1. **Tokenizer mismatch**

   ```python
   from unsloth import FastLanguageModel

   model, tokenizer = FastLanguageModel.from_pretrained(
       "nvidia/Nemotron-3-8B-Base-4k",  # Check exact model name
       load_in_4bit=True,
   )
   ```

2. **Context length issues**

   ```python
   max_seq_length = 2048  # Check model's native limit
   ```

3. **LoRA target modules**

   ```python
   target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"]
   # May differ from the default; check the model architecture
   ```

4. **Memory settings**

   ```python
   max_memory = {0: "20GB", "cpu": "30GB"}
   ```

**Debugging tip:** Start with a tiny subset (10 samples) to verify the pipeline works before scaling up.

We've fine-tuned various model sizes at RevolutionAI. Happy to help debug. What specific error are you seeing?
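The tiny-subset tip above can be sketched with plain Python slicing; the sample records below are illustrative stand-ins, not from any real dataset (in practice they would come from something like `datasets.load_dataset`):

```python
# Illustrative stand-in for a fine-tuning dataset: a list of
# prompt/response records.
samples = [{"prompt": f"Q{i}", "response": f"A{i}"} for i in range(1000)]

# Smoke-test on the first 10 records: if tokenization, collation, or the
# post-training merge is going to fail, it fails in minutes, not hours.
tiny = samples[:10]
print(len(tiny))  # → 10
```

The same idea applies to a real `datasets.Dataset` via its `select` method before handing it to the trainer.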
Hey everyone! I'm sure I'm doing something wrong, but I can't seem to get Nemotron 3 Nano to fine-tune successfully. I'm trying to use an H200 on vast.ai and also on runpod.ai.
I have tried all sorts of CUDA versions. After some vanilla installs of unsloth failed, I looked at the Google Colab notebook and copied some of the installation steps from it:
I downloaded the unsloth Nemotron 3 Nano model locally, and I am setting up my Python script to do a single step to save testing time...
Everything seems to work and the steps go through (I originally did a several-hour run which appeared to be training), but when the BF16 model is being merged I always get an error merging the layers/model:
I am installing the unsloth and mamba libraries when each container starts, so they should be the latest, but I have definitely tried:
My last attempt also used an older CUDA 12.4 container (I believe the documentation says unsloth only supports up to 12.4?), and I manually ran:
to try to force the same versions as the Google Colab.
I am not sure what else to try! Any suggestions?
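As an aside, the single-step smoke test described above could be configured like this, assuming a Hugging Face `TrainingArguments`-based trainer (all values are illustrative, not taken from the original script):

```python
from transformers import TrainingArguments

# Run exactly one optimizer step to exercise the whole pipeline,
# including the post-training BF16 merge, before a multi-hour run.
args = TrainingArguments(
    output_dir="outputs",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,
    max_steps=1,          # stop after a single training step
    save_strategy="no",   # skip checkpointing for the smoke test
    bf16=True,            # H200 supports bfloat16
)
```

With `max_steps=1` set, the trainer ignores `num_train_epochs`, so the run reaches the merge step in minutes.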